kiwitobes.com

Author, Software Developer, and Data Magnate


My latest two books now available!

The first of these is Programming the Semantic Web. I wrote this with two of my coworkers, Jamie Taylor and Colin Evans. We were attempting to make the first ever practical guide to why regular programmers should pay attention to semantic technologies. After writing this book, I was so convinced myself that I’ve moved all my projects away from traditional relational databases to graph databases.

That animal on the cover is a Red Panda, also known as a Firefox. Many thanks to our editor Mary Treseler for being so awesome through the process of writing this.

The second is Beautiful Data, which is an essay collection that I co-edited with Jeff Hammerbacher and to which I also contributed. We found a group of people who we thought were doing awesome stuff with data and convinced them to write essays.

We have an awesome list of contributors: Peter Norvig, Nathan Yau, Jonathan Follett, Matt Holm, J.M. Hughes, Raghu Ramakrishnan, Brian Cooper, Utkarsh Srivastava, Jason Dykes, Jo Wood, Jeff Jonas, Lisa Sokol, Jud Valeski, Alon Halevy, Jayant Madhavan, Aaron Koblin, Valdean Klump, Michal Migurski, Jeff Heer, Coco Krumme, Matt Wood, Ben Blackburne, Jean-Claude Bradley, Rajarshi Guha, Andrew Lang, Pierre Lindenbaum, Cameron Neylon, Antony Williams, Egon Willighagen, Lukas Biewald, Brendan O’Connor, Hadley Wickham, Deborah Swayne, David Poole, Andrew Gelman, Jonathan P. Kastellec, Yair Ghitza, and Jeff and myself.

All royalties for Beautiful Data are split between the Sunlight Foundation and Creative Commons.

Julie Steele, the O’Reilly editor on this book, was so awesome at making sure everyone got their essays in and they were reviewed properly.

They’re both great books, and I’m really proud of how they turned out. I’m not sure I’ll be writing again for a while, though!

Quick updates: Wedding, Hack Day, Books

  • I’m about to head off to my wedding in Mexico! I’m afraid while I’m gone I won’t be checking email very much, so please don’t be sad if I don’t write back to you immediately, I’ll try to get through everything when I get back!
  • If you like my posts on semantics and free data, and you live in San Francisco, you should check out Freebase Hack Day to meet a bunch of like-minded people.
  • Finally, I believe both Programming the Semantic Web and Beautiful Data will be out in July. I’m not sure of the exact date, but we’re hoping that everything is out in time for OSCON.

Why Semantics?

In February I gave a tutorial and a talk at the most awesome conference ever (go Tash!) called Webstock, in Wellington, New Zealand. The talk was called Why Semantics, and was essentially about the ideas behind the semantic web and why they’re interesting to normal working developers. After I gave the talk, I had several famous (at least to me) developers tell me that they finally got it, had made many of the data-modeling mistakes that I outlined, and no longer thought the Semantic Web was all hype.

The video was just uploaded to Vimeo by the Webstock team:

(if the embed isn’t showing, you can find the video here)

And here’s the abstract:

Ever since there was a web, people have been talking about the “semantic web”, which is always just around the corner. Even though this hasn’t exactly gone to plan, people working on the ideas behind semantic data modeling have actually come up with a lot of cool stuff.

Modern web development is very concerned with rapid iteration, which has led to the increasing popularity of lightweight frameworks built on dynamic languages such as Rails, Pylons and Django. However, most of us are still stuck using traditional data-modeling methods like relational databases which aren’t designed for constant schema changes. Further, because people don’t think about “standard” ways to share data, there are thousands of different web APIs, all of which have to be dealt with separately.

In this talk Toby will explain what “semantic data” is, how entities and data can be modeled using graphs, and show examples of modeling, integrating, and extending data models for large datasets. You’ll learn how semantic models support rapid and iterative application development, and easy integration of existing databases. Toby will introduce fast scalable back-ends for storing and querying semantic data and show examples of semantic data already available on the web.

He’ll also briefly discuss how these approaches lead into the standards-based Semantic Web, and how attendees can find short-term value in adopting some of the Semantic Web standards and platforms.

Enjoy! Let me know what you think.

Update: You can find a PDF of the slides here.

My latest project: Freerisk

For the past few months, between writing books and my day job, I’ve been working on a project with my friend Jesper called Freerisk.

A few months ago, after we first heard Tim O’Reilly’s “Work on stuff that matters” speech, we started talking about which issues, besides the environmental concerns mentioned in his speech, were important to us and that we actually had the skills to work on. We came to the idea of how hackers could help the financial system, particularly when it came to evaluating default-risk of companies or looking for fraudulent behavior.

The financial system itself has always been very closed. The government republishes filings by the SEC in a variety of messy formats, but those who want clean data need to pay subscription fees and have very limited republication rights. So our plan is to make Freerisk a huge open data store of financial data taken primarily from company filings. It’s all going to be available to download or query using standards like SPARQL.

On top of that, there will be APIs for building risk models and submitting your results. We hope to show that “financial hackers” can come up with more interesting and accurate calculators that can model a wider variety of risk scenarios.

If you’re interested in this, several people have written about the project:

We’ve also given several presentations. The O’Reilly Emerging Technology Conference was kind enough to make and post a video of our talk there (this was our first one, so it’s a little rough, but it should give you a good idea!).

We are looking for people who are interested in getting involved in this project. We have started a discussion group called Open Finance Hackers (just started, nothing there yet). If you’re interested in this at all, please email me and join the group.

A crazy few months

Apologies for the lack of recent posts (I think you’ll forgive me in just a moment). I’ve had a crazy few months, but here’s what I’ve been up to, with links for stuff that you can pre-order and download!

  • Finished the draft of my second book, with my coworkers Jamie Taylor and Colin Evans. It’s called “Programming the Semantic Web” and it’s already listed in Amazon (the description there right now will be changed, trust me)
  • Working on collecting and editing essays for what will be a great collection, called Beautiful Data. I’m not sure if I’m allowed to tell you who the contributors are yet, but I will say they’re fantastic and we were very lucky to get them.
  • I gave a 3-hour workshop and a 40-minute session talk at Webstock which was held in Wellington, New Zealand a couple of weeks ago. It was an amazing experience and warrants a whole post on its own. For now, the slides for both sessions are available as PDFs at http://kiwitobes.com/webstock/

And coming up, there’s still more stuff going on:

  • I’m giving a talk at ETech on March 10th. It’s about the failure of risk rating agencies and ideas for how the tech community can help
  • I’ll also be at Web 2.0 Expo giving another talk on Sources for Data Geeks on April 2nd
  • And I’m getting married on July 4th!

(because a lot of people ask me, the answer is: no, conference speaking is not even slightly lucrative. I do it for fun)

Personal data integration (part 1)

I’ve been toying with the idea of attempting “semantic integration” of a lot of personal data in my life. I’ll be sure to share more later, but so far I’ve managed to pull together my September phone records, my email history, my contacts, my calendar and my Facebook friends (via the API, not something sketchy!) into a single triple-store.

Using this data, I was able to create this chart, which shows my friend network (I have removed myself and Brooke, since we’re connected to everyone and it ruins the layout). The people who I emailed, texted or called in September are shown in green.

[Image: friend-network graph (social_graph_2gml-yed.jpg)]

You can see tight clusters of my friend groups. The tightest is the big hairball near the bottom that makes up much of Brooke’s Stanford GSB class, but also clear are groupings for my friends from MIT, Chapel Hill, Boston (post-MIT return), my San Francisco tech friends and my family. My family is the only group that is isolated from the rest of the graph — everyone else is connected, which is partly because I’ve introduced some of these groups to each other, and partly just because it’s a small world.

Also good to see is that almost every cluster has at least one green node (my family notably doesn’t, but that’s because my parents aren’t on Facebook), so I’ve generally done a good job of keeping in touch with at least a few people from different phases of my life.

There’s a lot of talk about breaking the silos in the enterprise and, in the semantic-web community, data integration across the entire web. But right now, people don’t even have decent integration across their own personal information. The current proliferation of single-feature applications encourages you to store different aspects of your life in different places — the advantage of course, is that something highly specialized is much more pleasant to use, but the disadvantage is that there’s no way to query across these aspects. I’m interested in experimenting with ways that help people “break the silos” with their own information, in the hope that this will both yield useful applications and help us get a better grip on the bigger problems.

I now have code to keep my triple-store synced with my friend network, my contacts, my phone records, my email and my calendar. I can construct queries across all of this (who did I forget to call on their birthday? Who have I seen recently who went to Stanford?). I’ll be sharing this code at some point, but I want to see how far I can take this. I’m also interested in hearing from anyone who has tried similar experiments and wants to collaborate.
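To make the idea of querying across sources a little more concrete, here is a minimal sketch of the birthday question against an in-memory list of triples. It is only an illustration, not the code described above: the names, dates, predicates ("birthday", "to", "date") and the helper functions are all hypothetical, and a birthday is simplified to a single date rather than a recurring one.

```python
from datetime import date

# Hypothetical (subject, predicate, object) triples merged from contacts
# and phone records. Each phone call is its own node with "to" and "date".
TRIPLES = [
    ("alice", "birthday", date(2008, 9, 12)),
    ("bob", "birthday", date(2008, 9, 20)),
    ("call1", "to", "alice"),
    ("call1", "date", date(2008, 9, 12)),
    ("call2", "to", "bob"),
    ("call2", "date", date(2008, 9, 3)),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


def forgotten_birthdays(triples):
    """List people who have a birthday but no call dated on that day."""
    forgotten = []
    for person, _, birthday in match(triples, predicate="birthday"):
        # Find every call node that points at this person.
        calls = [s for s, _, _ in match(triples, predicate="to", obj=person)]
        # Check whether any of those calls happened on the birthday.
        called_on_day = any(
            match(triples, subject=call, predicate="date", obj=birthday)
            for call in calls)
        if not called_on_day:
            forgotten.append(person)
    return sorted(forgotten)
```

With the sample data, `forgotten_birthdays(TRIPLES)` returns only "bob", since the one call to him falls on a different day. A real triple-store would answer the same pattern with a query language rather than Python loops.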

So, anyone have any thoughts on other sources of personal data or questions you might want to ask once it’s integrated?

Web 2.0 NYC, Freebase UG meeting, and Taleb

A few quick updates:

  • I’ll be speaking at Web 2.0 in New York City this Thursday at 3pm. If you’re at the conference, find me and say hi!
  • While I’m gone, Freebase is having a user group meeting. Here is the info. Great speakers, you’ll seriously love the GeoSearch API
  • A new article by my favorite non-fiction author, Nassim Taleb, is at Edge. Highly recommended

I’m working on a lot of new projects right now, I’ll have more to share soon.

O’Reilly interview at OSCON

While I was at OSCON earlier this year, I did a 20-minute video interview with O’Reilly. I think the idea is to take a lot of interviews and edit them down to shorter segments for some kind of video supplement, but they’ve also posted the entire thing on YouTube.

I talk a little bit about my biotech experience, my book, working at Freebase and the importance of open data to new applications. The whole 20-minute segment is embedded below.

Let me know what you think!

A San Francisco Restaurant Health-map (with code)

If you’re just interested in a heat-map of restaurant sketchiness in San Francisco, here it is! If you want to learn about how it was done, keep reading below.
San Francisco Map
Click to see the interactive map

(In addition to showing the “sketchy areas”, it’s also great just for seeing where the clusters of restaurants are, which is really what defines the neighborhood centers)

Thanks to the ever resourceful Adrian Holovaty, I figured out that one could actually get license-free restaurant listings by scanning city-government records (those of you not looking to republish the data could just use the Yelp API or something). I got about 3000 restaurants into Freebase, along with their health department scores.

Addresses in Freebase are geocoded automatically by the geobot, so it was pretty easy for me to make this map of San Francisco, along with all the restaurants colored by their score.

If you’d like to do something similar yourself, I’ve created a few templates and scripts that you can work through. The Google Maps API and Freebase API are well-documented, but sometimes it’s nice to have a really basic tutorial to get you started.

  1. You’ll need an API key from Google
  2. Download this Base map template
  3. If you want custom icons, you’ll need to draw or generate them. In this case, I wrote a python script to make a set of colored dots going from red to green. You can (right-click) download the script make_icons.py (requires PIL).
  4. I uploaded the icons to a directory called “icons”, and then created the icons in the page with this script:
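    // Create your icons: one GIcon per index, 2 through 10
    var r = [];
    for (var i = 2; i < 11; i++) {
      r[i] = new GIcon();
      r[i].image = "icons/rated" + i + ".png";
      r[i].shadow = null;                  // no shadow image
      r[i].iconSize = new GSize(8, 8);     // icon is 8x8 pixels
      r[i].shadowSize = new GSize(0, 0);
      r[i].iconAnchor = new GPoint(3, 3);  // point of the icon placed at the location
    }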

    Understanding how to do custom icons was a little tricky; a lot of attributes need to be set before they work properly.
  5. If you don’t have the Freebase API installed, you’ll need to run “easy_install simplejson” and “easy_install freebase” to get it
  6. Finally, you can generate all the overlays and paste them into your code with a script like make_overlays.txt. This script just prints a bunch of javascript, which you can capture and paste into your map HTML file.
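For readers who cannot get hold of make_icons.py, the icon step can be approximated with a short sketch like the one below. This is not the original script, only a guess at its general shape: the function names, the linear red-to-green blend and the 8-pixel size are assumptions, chosen to match the "icons/rated2.png" through "icons/rated10.png" files that the JavaScript above loads. It needs PIL (or Pillow) only when `make_icons` is actually called.

```python
import os


def dot_color(index, lowest=2, highest=10):
    """Blend linearly from red (lowest index) to green (highest index)."""
    fraction = (index - lowest) / float(highest - lowest)
    red = int(round(255 * (1.0 - fraction)))
    green = int(round(255 * fraction))
    return (red, green, 0)


def make_icons(directory="icons", size=8):
    """Write rated2.png .. rated10.png as filled dots on transparent squares."""
    # Imported here so dot_color() can be used without PIL installed.
    from PIL import Image, ImageDraw

    if not os.path.isdir(directory):
        os.makedirs(directory)
    for index in range(2, 11):
        image = Image.new("RGBA", (size, size), (0, 0, 0, 0))
        draw = ImageDraw.Draw(image)
        # Opaque circle filling the whole square.
        draw.ellipse((0, 0, size - 1, size - 1),
                     fill=dot_color(index) + (255,))
        image.save(os.path.join(directory, "rated%d.png" % index))
```

Calling `make_icons()` writes nine PNG files into an "icons" directory next to the script. A pure red-to-green ramp is hard to read for color-blind viewers, so a different pair of end colors may be a better choice.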

Easy, wasn’t it? I suggest you take a look at make_overlays.py and see what it’s doing. Essentially, there’s a query at the top to pull out the scores and addresses of all the topics that have my personal type “health_department_rated_business”. The addresses have geolocations attached by the geobot.
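The printing half of that idea can be sketched roughly as follows. This is not the contents of make_overlays.py: the sample rows stand in for real query results, the row keys and function names are hypothetical, the score-to-icon bucketing (score divided by ten, clamped to 2..10) is an assumption, and the generated JavaScript assumes the map object in the template is called `map`.

```python
import json

# Hypothetical rows standing in for the results of the Freebase query.
SAMPLE_RESULTS = [
    {"name": "Example Cafe", "score": 96,
     "latitude": 37.7793, "longitude": -122.4192},
    {"name": "Sample Diner", "score": 58,
     "latitude": 37.7599, "longitude": -122.4148},
]


def icon_index(score):
    """Map a 0-100 score onto icon indexes 2..10 (assumed bucketing)."""
    return max(2, min(10, int(score) // 10))


def overlay_line(row):
    """Build one JavaScript statement that adds a marker for this row."""
    return ("map.addOverlay(new GMarker(new GLatLng(%.6f,%.6f),"
            "{icon:r[%d],title:%s}));" % (
                row["latitude"], row["longitude"],
                icon_index(row["score"]),
                json.dumps(row["name"])))  # json.dumps quotes the name safely


def make_overlays(rows):
    """Join the marker statements, one per line, ready to paste into the page."""
    return "\n".join(overlay_line(row) for row in rows)
```

Printing `make_overlays(SAMPLE_RESULTS)` gives two `map.addOverlay(...)` lines that can be pasted into the map HTML after the icon setup code.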

If you wanted to use this recipe to make a map of something else, you can change the query. For example, I could just pull out businesses that start with the letter ‘M’.

query = {
    'type': '/business/business_location/address',
    'name~=': '^M',
    'name': None,
    '/business/business_location/address': {
        'citytown': 'San Francisco',
        '/location/location/geolocation': {'latitude': None, 'longitude': None}
    }
}

(If you try this, remember to get rid of the references to “score” in the loop below)

If you make a map using this recipe, let me know and I’ll link to it!

Update: Fixed the python code links. Sorry about the red-green problem, I’ll fix it as soon as I get a chance.

The “excluded middle” of technical books

(ok, I know that “excluded middle” has a specific meaning in philosophy and I’m using it incorrectly here, but I like the way it sounds)

A couple of months ago I read a book called A Semantic Web Primer, published by the MIT Press. It was recommended to me by my coworker Jamie, who said it was about the only thing worth reading on the subject.

I will say that I’m glad I read it, because now I understand the terminology and the way that the Semantic Web community talks about knowledge and ontology. What I found intriguing about the book, however, was the nature of the content:

  • Begin with hyperbolic vision of the future where software agents are negotiating my doctor’s appointment
  • Explain a little about RDF concepts, then spend almost half the book describing the XML serialization
  • Explain a little about ontology and then do a deep-dive on OWL, which is mostly a way of describing which properties are valid for certain classes
  • This is where it gets weird — The “applications” section. Suddenly we’re talking about how Elsevier and Audi are using “The Semantic Web” to solve all manner of problems by having shared ontologies. Or maybe they’re planning to. The descriptions of what they’re doing are no more than well-padded paragraphs with no detail

It struck me that a massive chunk of the book was missing: the part bridging the technical details of the RDF spec and how a car company might design and implement a shared ontology using the specs just described. This gap was so apparent that it got me thinking about the different kinds of computer books out there, which generally come in three flavors:

  • High level books about technology concepts, principles or “the future”
  • Learning a specific technology at a pretty deep level
  • Algorithms and computer science books that are math-heavy and at best use pseudocode

The first category tends to sell the most, because it’s accessible to the largest group. The second has been struggling for a while because the internet makes it so easy to learn and reference information about specific technologies, and the final group will probably always have a place amongst people who really want to deep-dive into the algorithms.

When I wrote Programming Collective Intelligence, I was hoping to find a middle ground, which would introduce readers gently to the algorithms, show them working code and then try it out on real data that they could find on the web. The book was criticized by different people for not fitting into the aforementioned categories: “Not big-picture enough”, “This isn’t real production code”, “Not deep enough on algorithms”, “Why did he use Python instead of pseudocode?” or “Why would I want to learn 3 things at once?”.

The overall response, however, was overwhelmingly positive — most people loved that they could learn something new, actually try it, and then have an idea about how they could work it into their project. Tim O’Reilly called it the “start of a new category rather than one more entry into an existing one”.

Anyway, I guess what I’m getting at here, is that I’d like to see more books that fill out that middle ground — show me concepts, implementation and applications all at once. People can read any number of online tutorials to get a deeper understanding of how to do something with a particular technology. Once they understand the basics of algorithms, there are plenty of textbooks and journals to teach them more. As for big picture, throw out a few ideas and people’s creativity will fill in far more than can be covered by any book.