Reproducibility in the age of Mechanical Turk: We’re not there yet

There’s been increasing interest in the computer science research community in exploring the reproducibility of our research findings. One such project recently received quite a bit of attention for exploring the reproducibility of 613 papers in ACM conferences. The effort hit close to home: hundreds of authors were named and shamed, including those of us behind the VLDB 2012 paper Human-powered Sorts and Joins, because we did not provide instructions to reproduce the experiments in our paper. I’m grateful to Collberg et al. for their work, as it started quite a bit of discussion, and in our particular scenario, resulted in us posting the code and instructions for our VLDB 2012 and 2013 papers on GitHub.

In cleaning up the code and writing up the instructions, I had some time to think through what reproducibility means for crowd computing. Can we, as Collberg et al. suggest, hold crowd research to the following standard:

Can a CS student build the software within 30 minutes...without bothering the authors?

My current thinking is a strong no: not only can crowd researchers not hold their work to this standard of reproducibility, but it would be irresponsible for our community to reach that goal. In fact, even if we opt for a different interpretation of reproducibility that requires an independent reconstruction of the research, making crowdsourcing research reproducible requires care.

Reproducibility is a laudable goal for all sciences. For computer science systems research, it makes a lot of sense. Systems builders are in the business of designing abstractions, automating processes, and proving properties of the systems they build. In general, these skills should lend themselves nicely to building standalone reproductions-in-a-box that make it easy to rerun the work of other researchers.

So why does throwing a crowd into the mix make reproducibility harder? It’s the humans. Crowd research draws on human-oriented social sciences like psychology and economics as much as it does on computer science, and as a result we have to draw on approaches and expectations that those communities set for themselves. The good thing is that in figuring out an appropriate standard for reproducibility, we can borrow lessons from these more established communities, so the solutions do not need to be novel.

How does crowd research challenge the laudable goal of reproducibility?

From here, I’ll spell out what makes crowd research reproducibility hard. It won’t be a complete list, and I won’t pose many solutions. As a community, I hope we can have a larger discussion around these points to define our own standards for reproducibility.

  1. Humans don’t fit nicely into virtual machines. You pay crowd workers to do work for you when you want to add a human touch: there’s some creativity or decision-making that you couldn’t automate, but a human could do quite nicely on your behalf. Whereas you might package a system/experiment with a complex environment into a virtual machine for reproducibility’s sake, you can’t quite do the same with human creativity.

  2. Cost. Crowdsourcing is not unique in costing money to reproduce. Some research requires specialized hardware that is notoriously costly to acquire, install, and administer. Even when the hardware isn’t proprietary, costs can be prohibitive: some labs have horror stories of researchers that accidentally left too many Amazon EC2 machines running for several days, incurring bills in the tens of thousands of dollars. Still, compared to responsibly spinning up a few tens of machines on EC2 for a few hours, crowdsourced workflows can bankrupt you faster.

    Each of our VLDB papers cost around $1000 to run: each paper saw about 1000 Turker IDs complete around 65,000 HIT assignments at 1.5 cents apiece. This expense included errors we made along the way, but our errors were nowhere near what they could have been. For example, accidentally creating too many pairwise comparison tasks could have easily increased our costs by an order of magnitude in just a few hours. Reproducing crowdsourced systems research requires an upfront cost in the platform you’re using, but it also requires a nontrivial budget.

    This expense is not insurmountable in the way that replicating the recruiting strategy of a psychology experiment or the wetlab setup of a biology experiment might be, but it should at least make you wary of getting up and running in a half hour. Whereas providing researchers with a single script to reproduce all of your experiments would be great for most systems, providing a single executable that spends a thousand dollars in a few hours might be irresponsible.

  3. Investing in the IRB process. One of the warnings we put in our reproducibility instructions was that you’d be risking future government agency funding if you ran our experiments without seeking Institutional Review Board approval to work with human subjects. Getting human subjects training and experiment approval takes on the order of a month for good reason: asking humans to sort some images seems harmless, but researchers have a history of poor judgement when it comes to running experiments on other people. Working with your IRB is a great idea, but it’s another cost of reproducibility.

  4. Data sharing limitations. We could save researchers interested in reproducing and improving on our work a lot of money if we released our worker traces, allowing other scientists to inspect the responses workers gave us when we sent tasks their way. Other researchers could vet our analyses without having to incur the cost of crowdsourcing for themselves.

    There are many benefits to such data sharing, but in releasing worker traces, you risk compromising worker anonymity. Turker IDs, while opaque, are not anonymous. As the AOL search log fiasco shows, even if we obscured identifiers further, it’s still possible to identify seemingly anonymous users from usage logs. IRBs are pretty serious about protecting personally identifiable information, and our IRB application does not cover sharing our data for these reasons. This limitation, like the others, is not unsolvable, but it will require the community to come together to figure out best practices for keeping worker identities safe.

  5. Tiny details matter. Crowdsourced workflows have people at their core. Providing workers with slightly different instructions can result in drastically different results. When a worker is confused, they might reach out to you and ask for clarification. How do you control for variance in experimenter responses or worker confusion? What if, instead of requiring informed consent on only the first page a worker sees, as our IRB requested, your IRB asks you to display the agreement on every page? These little differences matter with humans in the loop. Separating the effects of these differences in experimental execution is important to understanding whether an experiment reproduced another lab’s results.

  6. Crowds change over time. When we ran our experiments for our VLDB 2012 paper, we followed the reasonably rigorous CrowdDB protocol for vetting our results. We ran each experiment multiple times during the east coast business hours of different weekdays, trusting only experiments that we could reproduce ourselves. This process helped eliminate some irreproducible results. Several months later, Eugene and I re-ran all of our experiments before the paper’s camera ready deadline. No dice: some of the results had changed, and we had to remove some findings we were no longer confident in. As Mechanical Turk sees changing demographic patterns, you can expect your results to change as the underlying crowd does. These changes will compound the noise that you will already see across different workers. This is no excuse for avoiding reproducibility: every experimental field has to account for diverse sources of variance, but it makes me wary of the one-script-to-reproduce-them-all philosophy that you might expect of other areas of systems research.

  7. Platforms change over time. Even after all of the work we put into documenting our experiments for future generations, they won’t run out of the box. Between the time that we ran our experiments and the time we released the reproducibility code, Amazon added an SSL requirement for servers hosting external HITs. This is a wonderful improvement as far as security goes, but underscores the fact that relying on an external marketplace for your experiments is one more factor to compound the traditional bit rot that software projects see.

  8. Industry-specific challenges. Our VLDB 2012 and 2013 research was performed solely in academia. I’ve since moved to do crowdsourcing research and development in industry. This new environment poses new challenges to reproducibility. Not only is most of the code powering machine learning and workflow design for crowd work in industry proprietary, so is the crowd. For our work on the Locu team, we’ve got a few hundred workers with whom we’ve established long-term relationships, many of them lasting over two years. Open sourcing the code behind our tools is one thing, but imagining other researchers bootstrapping the relationships and workflows we’ve developed for the purposes of reproducibility is near impossible. Still, I believe industry has a lot to contribute to discussions around crowd-powered systems: the mechanism design, incentives, models, and interfaces we develop are of value to the larger community. If industry is going to contribute to the discussion, we’ll have to work through some tradeoffs, including less-than-randomized evaluations, difficult-to-independently-reproduce conclusions, and as a result, more contributions to engineering than to science.
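
To put rough numbers on the cost point in item 2, here is a back-of-the-envelope sketch in Python. The 65,000 assignments and the 1.5 cent price come from the figures above. The item counts, the five workers per comparison, and the function names are hypothetical, chosen only to show how quickly an all-pairs design grows, and platform fees are ignored.

```python
from math import comb

PRICE_PER_ASSIGNMENT = 0.015  # dollars; the 1.5 cents quoted above


def experiment_cost(num_assignments, price=PRICE_PER_ASSIGNMENT):
    """Total worker payments in dollars, ignoring platform fees."""
    return num_assignments * price


def pairwise_assignments(num_items, workers_per_pair):
    """Assignments needed to compare every pair of items (hypothetical design)."""
    return comb(num_items, 2) * workers_per_pair


# Roughly what each paper cost: about 975 dollars.
print(round(experiment_cost(65_000), 2))

# Hypothetical slip: comparing all pairs among 200 items instead of 40,
# with 5 workers per comparison in both cases.
intended = experiment_cost(pairwise_assignments(40, 5))
accidental = experiment_cost(pairwise_assignments(200, 5))
print(round(intended, 2), round(accidental, 2), round(accidental / intended, 1))
```

Because the number of pairs grows quadratically, a fivefold mistake in the number of items is enough to push the bill up by more than an order of magnitude, which is why a dry run that only prints the projected cost is worth having before anything gets posted.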

As crowd research matures, it will be important for us to ask what reproducibility means to our community. The answer will look pretty different from that of other areas of computer science. What are your thoughts?

Thank you to Peter Bailis and Michael Bernstein for providing feedback on drafts of this piece, and to my coauthors for helping get our work to a reproducible state.

Web Scraping Tools for Non-developers

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public data sources are available in structured form, some sources are hidden in what we data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

  1. Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.

  2. Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.

  3. Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes as it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.
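
For the developers in the audience, the three steps can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not how any of the tools below works: `FAKE_SITE` is a made-up stand-in for a real website, `solve_captcha` is a hypothetical hook where a human would answer the challenge, and extraction uses only the standard library’s `html.parser`.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a website: page key -> (HTML, key of the next page).
FAKE_SITE = {
    "page1": ("<table><tr><td>Acme LLP</td><td>120</td></tr></table>", "page2"),
    "page2": ("<table><tr><td>Smith and Co</td><td>75</td></tr></table>", None),
}


class RowExtractor(HTMLParser):
    """Step 3: collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def solve_captcha(challenge):
    """Step 1: a human would answer here; this stub simply echoes the answer."""
    return challenge  # hypothetical: swap in input() or a crowd task


def scrape(site, start, ask_human=solve_captcha):
    # Step 1: authenticate, falling back to a human for the challenge.
    if ask_human("example-challenge") != "example-challenge":
        raise PermissionError("authentication failed")
    rows, page = [], start
    # Step 2: navigate by following "next" pointers until they run out.
    while page is not None:
        html, page = site[page]
        parser = RowExtractor()
        parser.feed(html)
        parser.close()
        rows.extend(parser.rows)  # Step 3: extract rows from this page.
    return rows


print(scrape(FAKE_SITE, "page1"))
```

Even in this toy version, the extraction class is the most mechanical part, while authentication and navigation are where the site-specific surprises live.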

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of them passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

  • Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the YouTube video that comes with it helps get around the slightly confusing user interface.

  • looks like a clean webpage-to-api story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw’s connectors for a loop, and ultimately made it impossible to tell how to navigate the site. Despite the complications, seems like one of the more polished website-to-data efforts on this list.

  • Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hackernews front page, but I’d imagine that failures in the automated approach would make me wish I had access to’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.

  • Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.

  • Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.

  • iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.

  • A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.

  • Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Locu has a new home

On Monday, we announced that Locu has been acquired by GoDaddy. As a friend, technologist, or researcher, the acquisition might initially surprise you.  Rather than repeat myself a thousand times, I figured I’d share some thoughts on the topic.  Standard caveat: these words represent my thoughts, not my employer’s.
  • I’m personally excited about the acquisition.  We’ve been working with the folks from GoDaddy for several months now, and the team is sharp and energized about helping hundreds of millions of local merchants find their home on the web.
  • Locu remains Locu as a team, a set of offices, a product, and a mission.  For the most part, Locu will be bringing new technology and design to the table, and GoDaddy will be bringing a level of scale that would take years to build up on our own.  Locu offers a healthy dose of data structuring and crowdsourcing technology alongside the design chops to make previously complicated things simple.  GoDaddy is the largest privately held company in the world that focuses on helping small businesses with their web presence, and brings years of sales and marketing experience to Locu’s products.  GoDaddy also has a deep understanding of scale both in terms of the tens of millions of people they work with, and the billions of dollars of revenue they bring in.
  • Aside from the business side of things, we’re still very excited to be releasing open source projects and publishing more about our approach to structured data extraction and crowd work.  The open source and research communities have been so fundamental to what we do, and I’m excited we can continue to repay that debt.
  • As a human being, I care a lot about the values of the company I work for.  It would be naive to ignore the fact that previous incarnations of GoDaddy have been responsible for sexist Super Bowl commercials, and have supported web-endangering efforts like SOPA.  We’ve been assured that the people who were behind these efforts are no longer working at GoDaddy.  In fact, an entirely new leadership team (including CEO, COO, CTO, Chief Architect, etc.) has been put in place since these controversies, and I count myself as one of the folks that expects a lot of them in the coming years.
From everything I’ve heard, I know that acquisitions are hard to execute well.  If we pull this off, we’ll be improving the lives of local merchants and crowd workers alike, and putting new force behind structured data.  I’m excited to give it a shot!
Many thanks to Rene Reinsberg for giving me feedback on many things in life, including this post.

My N=1 Guide to Grad School

A little delayed, but I put together a guide of advice I’ve given other students in grad school.  Send feedback, or write your own!


Life update: I’ve defended my thesis and I’m now the Director of Data at Locu.  This doesn’t change much on the blog, as I’ll still periodically update it with random thoughts.  I’m also doing a bit of blogging on the Locu blog on topics like our technology workflow, designing for crowds, and the human side of crowdsourcing.

It’s an exciting and very different next step for me.  I’m still very excited about introducing new students to data and computer science, and will keep that up as well.

What Should be Included in a Data Science Curriculum?

(I recently wrote an answer to What Should be Included in a Data Science Curriculum? on Quora.  Here’s a subset of that answer)

Eugene Wu and I recently taught a 6-day (3 hours per day) course on data literacy basics targeted at computer science undergraduates.  Our initial motivation was selfish: as database researchers, we didn’t have a lot of experience with an end-to-end raw data->data product pipeline.  After a few trial runs of our own, we realized certain data processing patterns kept showing up, and saw that we had a small course worth of content on our hands.  The important thing here is that even with undergraduate- and graduate-level machine learning, statistics, and database courses under our belts, we still had a lot to learn about working with honest-to-goodness dirty data.

Each module of our course could have had an entire semester dedicated to it, and so we favored basic skills with lots of hands-on experience over intellectual depth and rigor.  We kept lectures to 20-30 minutes, giving students the remaining 2.5 hours to go through the labs we set up while we walked around answering questions.  Lectures allowed students to know what they were in for at a high level, and the lab portion allowed them to cement those concepts with real datasets, code, and diagrams.  All of the course content is available on github, and as an example, here is a direct link to day 1’s lab.

The syllabus we covered was:

  • Day 1: an end-to-end experience in downloading campaign contribution data from the federal election commission, cleaning it up, and programmatically displaying it using basic charts.
  • Day 2: visualization/charting skills using election and county health data.
  • Day 3: statistics to take the hunches they got on day 2 and quantify them, learning about t-tests and linear regression along the way.
  • Day 4: text processing/summarization using the Enron email corpus.
  • Day 5: MapReduce to scale up Day 4’s analysis using Elastic MapReduce on Amazon Web Services.  This felt a bit forced, but the students were clamoring for distributed data processing experience.
  • Day 6: the students teach us something they learned on their own datasets using techniques we’ve taught them.

While we set out to give computer science students with familiarity in python programming a dive into data, we ended up with folks from the physical sciences, doctors, and a few social scientists who had their own datasets to answer questions about.  The last day allowed them to experiment with their new skills on their own data.  Attendance on this day was lower than the previous days: the majority of the folks in attendance on day 6 were on the more experienced end, and I suspect that the undergrads, who were not yet exposed to data problems of their own, didn’t find it as engaging.  It would be interesting to see how to develop course content that allows self-directed data science for students who still need a bit more inspiration.

I should also say that our attempt is not the first one to bring data to the classroom. Jeff Hammerbacher and Mike Franklin at Berkeley have a wonderful semester-length course on data science.  The high-level outline of the course seems similar, but they get farther into data product design, and jump into each topic in more depth.  Their resources page has a nice set of links to other educational efforts worth checking out.

One Gray Lady

I consume content through many aggregators, but The New York Times (The Gray Lady) is the single source of content I go to directly at least daily to know what’s happening in the world. While it’s good for news, what sets The Times apart from other content sources is its depth of reporting. There’s one problem, though: by default, longer NYT articles do not appear in Single Page mode. This has caused me problems in the past, ranging in severity from slightly annoying (having to click Next Page) to pretty frustrating (loading articles for offline reading only to realize I only had the first page).

So I created One Gray Lady, a Greasemonkey plugin that loads all NYT content in single page mode.

To install it in Google Chrome or Firefox with the Greasemonkey plugin, click here.

I have only tested the code in Chrome, and while I did a bit of testing on various URLs, I’m sure I missed something. Feel free to send updates or suggestions!

John Glaser on Healthcare Information Technology

I recently sat in on a lecture for Professor Peter Szolovits’s Biomedical Computing course. The lecture was open to a greater audience, given the prominence of the speaker. As a non-expert, I found it to be a useful look into the current state of healthcare IT and the coming legislative and technical challenges facing the industry. My notes are below.

John Glaser, Ph.D.
Formerly CIO of Partners/Brigham And Women’s Hospital
Currently CEO of Siemens Health Services

Free advice: get a healthcare proxy and power of attorney set up. Easier to do now than have someone else guess later how you want to live/die.

Why does Health IT suck?

  • Not for lack of money put into the system
  • Not for lack of smart people working on the problem

Current model

  • Insurance companies/patients pay per volume (per birth, per surgery, etc.) almost regardless of quality
  • Boards of directors are very conservative. Don’t want to be the board that made an IT decision that made a huge hospital fail.

U.S. Numbers to give context

  • 60% of hospitals are <= 100 beds
  • Of 500K physicians, majority work in 2-3-doctor practice (not IT-savvy, or modestly interested in IT at best)
  • 2/3 of medical decisions are heuristic/not scientific, and many have a difficult-to-verify outcome
  • volatile knowledge domain: 700k academic articles have come out in the last (decade?)
  • 20% of doctors are a decade away from retirement, so perhaps newer doctors will bring IT mentality with them?
  • PricewaterhouseCoopers survey: 58% of (independent?) doctors considering quitting, selling practice, or joining a larger practice
  • various societies are discussing requirements: to become board (re-)certified (oncology, etc.), you have to show facility in technology.

Health IT Services

  • huge fragmentation: the 3rd largest health IT services company has 7% of market. if they win every open engagement from now until (?), they will have 11% of the market.
  • lots of players: 300 electronic health record providers in US, 25% exit and 25% enter per year
  • engagements are long: bringing up a new hospital IT system takes 2-4 years. from the moment you decide to change IT systems, you will continue to use your old one for the next 4-5 years as you transition.

Affordable Care Act (ACA)

  • costs are projected to go up 26% in the next decade. ACA stipulates that govt. will compensate 12% more in the next decade: providers have to make up the difference.
  • to incentivize quality care, govt. will hold on to 10% of payments until you prove treatment was effective (hard to define).
  • currently, for a single procedure (e.g., total hip replacement) you might get 12 different bills (e.g., surgeon, materials, anesthesia). new system: govt. pays a single provider one bill, with a fixed amount. incentivizes a holistic view.
  • risk: hospitals go out of business. potential future doctors don’t enter medicine. doctors “fire” bad patients to make their numbers look good.


  • doctors in small practices joining larger networks to avoid managing the ACA requirements.
  • single payment requirement will cause groups of doctors to more tightly collaborate (contractually).

Transition challenges

  • ACA is rolling out over the course of a decade.
  • need to be careful, since some patients will be handled by old rules, and some by new rules. so do you not apply decision support-based treatment to patients on old rules, or just do fee-for-service? lots of mental overhead for doctors.

Fixed fee challenges

  • paying a fixed amount per treatment doesn’t work for everything. Diabetes is sort of predictable, but a trauma might range from a broken toe to severe burns on 90% of body.
  • (Adam’s note) perhaps large pools of insured patients will smooth over the individual spikes in cost of care.

Information Technology needs

  • systems must span inpatient, outpatient, emergency care, rehab
  • need revenue cycle + contract management system that handles continuum of care. this is complex: medicare + blue cross might pay diff amounts for “good” diabetes treatment, and “good” might be defined differently.
  • systems should manage individuals and populations: how did all 100 people w/ respiratory problems do last month? which patients strayed from predicted path? what should have happened? why/why not?
  • sophisticated business intelligence + analysis: predict who will get worse, etc.
  • interoperability w/ different providers
  • rules+workflow engines to ensure followups/next steps/help primary care doctors coordinate care, manage exceptions, follow up properly. also allow this in collaborative care environment w/ lots of specialists checking in and out.
  • high availability + low total cost of ownership
  • engage patients

New challenges for primary care physicians (PCPs)

  • At the moment, PCP moves from one patient to the next every 15 minutes, sees 100s of lab results per day
  • Only 25% of data from specialists comes back to a PCP within a month
  • In future, PCPs will be responsible for closing the loop on specialists, tests, etc., with more accountability, but still be given just as much or more information, with similar delays. Workflow management systems are key here!

Interesting technical challenges

  • filtering patient care notes: 10s of pages of patient care history. No doctor can read them all before seeing patient. how to help doctors find relevant notes across different doctors, annotations, etc.
  • supporting collaboration between multiple providers
  • parsing notes to remind providers. e.g., “Ask about patient’s daughter next time.”
  • cleaning up conflicting medical record data: was it type 1 or type 2 diabetes? was it a heart attack, or just a test for one?

Human-powered Sorts and Joins

(Cross-posted on the Crowd Research Blog)

There has been a lot of excitement in the database community about crowdsourced databases.  At first blush, it sounds like databases are yet another application area for crowdsourcing: if you have data in a database, a crowd can help you process it in ways that machines cannot.  This view of crowd-powered databases misses the point.  The real benefit of thinking of human computation as a database problem is that it helps you manage complex crowdsourced workflows.

Many crowd-powered tasks require complicated workflows in order to be effective, as we see in algorithms like Soylent’s Find-Fix-Verify.  These custom workflows require thousands of lines of code (1000-2000 in the case of Find-Fix-Verify!) to shuttle data between services like MTurk and business logic in several languages.  If we provide workflow developers with a set of common operators, like filters and sorts, and a declarative interface to combine those operators, such as SQL or PigLatin, we can reduce the painful crowdsourced plumbing code while focusing on a set of operators to improve as a community.

This is not an academic argument: Find-Fix-Verify can be implemented with a FOREACH-FOREACH-SORT in PigLatin, or a SELECT-SELECT-ORDERBY in SQL, resulting in several tens of lines of code.  All told, we can get a two-orders-of-magnitude reduction in workflow code.  The task at hand is thus to make the best-of-breed reusable operators for crowd-powered workflows.  In our VLDB 2012 paper, we look at two such operators: Sorts and Joins.


Human-powered sorts are everywhere.  When you submit a product review with a 5-star rating, you’re implicitly contributing a datapoint to a large product ranking algorithm.  In addition to rating-based sorts, there are also comparison-based ones, where a user is asked to compare two or more items along some axis.  For a particularly cute example of comparison-based sorting, see The Cutest, a site that identifies the cutest animals in the world by getting pairwise comparisons from heartwarmed visitors.

The two sort-input methods can be found in the image below.  On the left, users compare five squares by size.  On the right, users rate each square on a scale from one to seven by size after seeing 10 random examples.

Comparison- and Rating-based Sort

In our paper, we show that comparisons provide accurate rankings, but are expensive: they require a number of comparisons quadratic in the number of items being compared.  Rating is quite accurate, and cheaper than comparing: it’s linear in the number of items rated.  We also propose a hybrid of the two that balances cost and accuracy, where we first rate all items, and then compare items with similar ratings.
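
To make the cost trade-off concrete, here is a minimal Python sketch of the three approaches.  It is not the algorithm from the paper: the function names, the `tolerance` parameter, the grouping rule, and the simulated `rate` and `compare` callbacks (stand-ins for questions posed to Turkers) are all assumptions made for illustration.  It only shows why rating first and then comparing within groups of similarly rated items asks fewer questions than comparing every pair.

```python
from itertools import combinations


def comparison_sort(items, compare):
    """Rank items by pairwise wins; asks one question per pair (quadratic)."""
    wins = {item: 0 for item in items}
    questions = 0
    for a, b in combinations(items, 2):
        questions += 1
        # compare(a, b) stands in for a crowd answer: True if a ranks above b.
        winner = a if compare(a, b) else b
        wins[winner] += 1
    return sorted(items, key=lambda item: wins[item]), questions


def rating_sort(items, rate):
    """Rank items by a crowd rating; asks one question per item (linear)."""
    ratings = {item: rate(item) for item in items}
    return sorted(items, key=lambda item: ratings[item]), len(items)


def hybrid_sort(items, rate, compare, tolerance=0.5):
    """Rate everything, then compare only items whose ratings are close.

    Assumes items are unique and hashable. `tolerance` is a hypothetical
    knob: neighbors whose ratings differ by more than this start a new group.
    """
    ratings = {item: rate(item) for item in items}
    questions = len(items)  # one rating question per item
    by_rating = sorted(items, key=lambda item: ratings[item])

    result, group = [], []
    for item in by_rating:
        if group and ratings[item] - ratings[group[-1]] > tolerance:
            ordered, asked = comparison_sort(group, compare)
            result.extend(ordered)
            questions += asked
            group = []
        group.append(item)
    if group:
        ordered, asked = comparison_sort(group, compare)
        result.extend(ordered)
        questions += asked
    return result, questions


# Simulated crowd: items are square sizes, ratings are coarse, comparisons exact.
sizes = [35, 12, 31, 18, 33, 11, 57]
coarse_rating = lambda size: size // 10
is_larger = lambda a, b: a > b

print(comparison_sort(sizes, is_larger))             # 21 questions for 7 items
print(hybrid_sort(sizes, coarse_rating, is_larger))  # 13 questions for 7 items
```

In this toy setup the coarse ratings alone cannot separate 31, 33, and 35, so the hybrid spends its comparisons only inside that group and the other group of ties.  Real crowd answers are noisy, which this sketch ignores.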

These techniques can reduce the cost of sorting a list of items by 2-10x.  Human-powered sorts are valuable for a variety of tasks.  Want to know which animals are most dangerous?  From least to most dangerous, a crowd of Turkers said:

flower, ant, grasshopper, rock, bee, turkey, dolphin, parrot, baboon,
rat, Tasmanian devil, lemur, camel, octopus, dog, eagle,
elephant seal, skunk, hippo, hyena, great white shark, moose,
komodo dragon, wolf, tiger, whale, panther

The different sort implementations highlight another benefit of declaratively defined workflows.  A system like Qurk can take user constraints into account (linear costs? quadratic costs? something in between?) and identify a comparison-, rating-, or hybrid-based sort implementation to meet their needs.


Human-powered Joins are equally pervasive.  The area of Entity Resolution has captured the attention of researchers and practitioners for decades.  In the space of finance, is IBM the same as International Business Machines?  Intelligence analysis runs into a combinatorial explosion in the number of ways to say Muammar Muhammad Abu Minyar al-Gaddafi's name.  And most importantly, how can I tell if Justin Timberlake is the person in the image I'm looking at?

We explored three interfaces for solving the celebrity matching problem (and more broadly, the human-powered entity resolution problem).  The first is a simple join interface, asking users if the same celebrity is displayed in two images.  The second employs batching, asking Turkers to match several pairs of celebrity images.  The third interface employs more complex batching by asking Turkers to match celebrities arrayed in two columns.

Simple Joins, Naive Batching, and Smart Batching

As we batch more pairs to match per task, cost goes down, but so does Turker accuracy.  Still, we found that we can achieve around a 10x cost reduction without a significant loss in result quality.  We can achieve even more savings by having workers identify features of the celebrities, so that we don’t, for example, try to match up males with females.
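
As a rough illustration of where the savings come from, the Python sketch below counts tasks under each interface.  The formulas are back-of-the-envelope assumptions rather than figures from the paper: the function names, batch sizes, and feature labels are hypothetical, and it treats every task as costing the same, which ignores the accuracy loss described above.

```python
from collections import Counter
from math import ceil


def simple_join_tasks(m, n):
    """One task per candidate pair of images."""
    return m * n


def naive_batch_tasks(m, n, pairs_per_task):
    """Several independent pairs shown in each task."""
    return ceil(m * n / pairs_per_task)


def smart_batch_tasks(m, n, rows, cols):
    """Two columns per task: `rows` images from one set, `cols` from the other."""
    return ceil(m / rows) * ceil(n / cols)


def filtered_pairs(left_features, right_features):
    """Candidate pairs left after only pairing images with the same feature value."""
    left = Counter(left_features)
    right = Counter(right_features)
    return sum(left[value] * right[value] for value in left)


# Hypothetical workload: match 30 celebrity images against 30 others.
print(simple_join_tasks(30, 30))        # 900 tasks
print(naive_batch_tasks(30, 30, 10))    # 90 tasks
print(smart_batch_tasks(30, 30, 3, 3))  # 100 tasks

# Feature filtering: if workers first label each image with a feature,
# pairs whose labels differ never need to be shown.
labels = ["male"] * 15 + ["female"] * 15
print(filtered_pairs(labels, labels))   # 450 candidate pairs instead of 900
```

Filtering and batching compose: the pairs that survive the filter can then be batched, so the two savings multiply.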

We’re Not Done Yet

We now have insight into how to effectively design two important human-powered operators, sorts and joins.  There are two directions to go from here: bring in learning models, and design more reusable operators.

Our paper shows how to reduce join and sort costs by more than an order of magnitude, but this is often not enough.  To further reduce costs while maintaining accuracy, we’re looking at training machine learning classifiers to perform simple join and sort tasks, like determining that Cambridge Brewing Co. is likely the same as Cambridge Brewing Company.  We’ll still need humans to handle the really tricky work, like figuring out which of the phone numbers for the brewing company is the right one.

Sorts and joins aren’t the only reusable operators we can implement.  Next up: human-powered aggregates.  In groups, humans are surprisingly accurate at estimating quantities (jelly beans in a jar, anyone?).  We’re building an operator that takes advantage of this ability to count with a crowd.

For more, see our full paper, Human-powered Sorts and Joins.
This is joint work with Eugene Wu, David Karger, Sam Madden, and Rob Miller.

I’m a (STEM) Graduate Student: Please Tax Me

Over the past month, a petition has been circulating asking the Obama administration to bring graduate student stipends back to their pre-1986 tax-exempt status. I urge you to not sign this petition, as it is misguided and damaging to our image. If you believe graduate student researchers are more valuable than their compensation, then demand more compensation, not a tax loophole.

First, the caveat: I can only speak for the STEM fields. In these fields, a combination of government, corporate, and university grants support research-track students in the lab and classroom. This compensation usually comes in the form of full tuition coverage and a stipend in the range of $1500-$2500 per month, and sometimes includes health coverage.

Our stipends put our income at $18,000-$30,000 per year. Compare this to a poverty threshold of $18,530 for a family of three, or $29,990 for a family of six. In computer science, you can double your income with a summer internship, placing you above the median 2009 household income. At first glance, it seems like we are reasonably compensated even before we take into account the education, advising, networking, and travel opportunities our life decision has earned us.

Of course, the argument in the petition is more nuanced than one of unreasonable taxation. The petition speaks to the value of our “innovative, cutting-edge thinking” relative to “bankers, lobbyists, or hedge-fund managers.” The comparison is certainly timely, but sweeps under the rug other valuable fields, like nursing or carpentry. Workers in both of these fields earn more than the median graduate student in STEM, but optimistically, we are in a position of higher upward mobility once we graduate.

Perhaps a better comparison is what we could earn if we had not chosen graduate studies. With a B.S. in Computer Science, my undergraduate colleagues at large technology firms and startups are earning 3-5x what I earn through my stipend. Am I more valuable as a researcher than I would be in their shoes? This seems like a good conversation to have.

This is a discussion of relative value. In the absolute sense, graduate students in STEM are not poor, and should pay taxes in whatever tax bracket we fall. Perhaps we’re not compensated enough for what we provide to society. I would like to believe that STEM’s contribution to social and economic development is significant. If we’re seeing a dearth of STEM researchers and our value to society is high, the government should correct the market failure. It should do so not in the form of yet another tax break, but as an increase in the number of stipends or the amount of compensation distributed per researcher.

STEM is under attack. We should elevate its image by discussing how valuable our work is, not by asking for pity. Demand what you are worth, but remember how lucky you are.