Reproducibility in the age of Mechanical Turk: We’re not there yet

There’s been increasing interest in the computer science research community in exploring the reproducibility of our research findings. One such project recently received quite a bit of attention for exploring the reproducibility of 613 papers in ACM conferences. The effort hit close to home: hundreds of authors were named and shamed, including those of us behind the VLDB 2012 paper Human-powered Sorts and Joins, because we did not provide instructions to reproduce the experiments in our paper. I’m grateful to Collberg et al. for their work, as it started quite a bit of discussion, and in our particular scenario, resulted in us posting the code and instructions for our VLDB 2012 and 2013 papers on github.

In cleaning up the code and writing up the instructions, I had some time to think through what reproducibility means for crowd computing. Can we, as Collberg et al. suggest, hold crowd research to the following standard:

Can a CS student build the software within 30 minutes...without bothering the authors?

My current thinking is a strong no: not only can crowd researchers not hold their work to this standard of reproducibility, but it would be irresponsible for our community to reach that goal. In fact, even if we opt for a different interpretation of reproducibility that requires an independent reconstruction of the research, making crowdsourcing research reproducible requires care.

Reproducibility is a laudable goal for all sciences. For computer science systems research, it makes a lot of sense. Systems builders are in the business of designing abstractions, automating processes, and proving properties of the systems they build. In general, these skills should lend themselves nicely to building standalone reproductions-in-a-box that make it easy to rerun the work of other researchers.

So why does throwing a crowd into the mix make reproducibility harder? It’s the humans. Crowd research draws on human-oriented social sciences like psychology and economics as much as it does on computer science, and as a result we have to draw on approaches and expectations that those communities set for themselves. The good thing is that in figuring out an appropriate standard for reproducibility, we can borrow lessons from these more established communities, so the solutions do not need to be novel.

How does crowd research challenge the laudable goal of reproducibility?

From here, I’ll spell out what makes crowd research reproducibility hard. It won’t be a complete list, and I won’t pose many solutions. As a community, I hope we can have a larger discussion around these points to define our own standards for reproducibility.

Humans don’t fit nicely into virtual machines. You pay crowd workers to do work for you when you want to add a human touch: there’s some creativity or decisionmaking that you couldn’t automate, but a human could do quite nicely on your behalf. Whereas you might package a system/experiment with a complex environment into a virtual machine for reproducibility’s sake, you can’t quite do the same with human creativity.
Cost. Crowdsourcing is not unique in costing money to reproduce. Some research requires specialized hardware that is notoriously costly to acquire, install, and administer. Even when the hardware isn’t proprietary, costs can be prohibitive: some labs have horror stories of researchers that accidentally left too many Amazon EC2 machines running for several days, incurring bills in the tens of thousands of dollars. Still, compared to responsibly spinning up a few tens of machines on EC2 for a few hours, crowdsourced workflows can bankrupt you faster.
Investing in the IRB process. One of the warnings we put in our reproducibility instructions was that you’d be risking future government agency funding if you ran our experiments without seeking Institute Review Board approval to work with human subjects. Getting human subjects training and experiment approval takes on the order of a month for good reason: asking humans to sort some images seems harmless, but researchers have a history of poor judgement when it comes to running experiments on other people. Working with your IRB is a great idea, but it’s another cost of reproducibility.
Data sharing limitations. We could save researchers interested in reproducing and improving on our work a lot of money if we released our worker traces, allowing other scientists to inspect the responses workers gave us when we sent tasks their way. Other researchers could vet our analyses without having to incur the cost of crowdsourcing for themselves.
Tiny details matter. Crowdsourced workflows have people at their core. Providing workers with slightly different instructions can result in drastically different results. When a worker is confused, they might reach out to you and ask for clarification. How do you control for variance in experimenter responses or worker confusion? What if, instead of requiring informed consent on only the first page a worker sees, as our IRB requested, your IRB asks you to display the agreement on every page? These little differences matter with humans in the loop. Separating the effects of these differences in experimental execution is important to understanding whether an experiment reproduced another lab’s results.
Crowds change over time. When we ran our experiments for our VLDB 2012 paper, we followed the reasonably rigorous CrowdDB protocol for vetting our results. We ran each experiment multiple times during the east coast business hours of different weekdays, trusting only experiments that we could reproduce ourselves. This process helped eliminate some irreproducible results. Several months later, Eugene and I re-ran all of our experiments before the paper’s camera ready deadline. No dice: some of the results had changed, and we had to remove some findings we were no longer confident in. As Mechanical Turk sees changing demographic patterns, you can expect your results to change as the underlying crowd does. These changes will compound the noise that you will already see across different workers. This is no excuse for avoiding reproducibility: every experimental field has to account for diverse sources of variance, but it makes me wary of the one-script-to-reproduce-them-all philosophy that you might expect of other areas of systems research.
Platforms change over time. Even after all of the work we put into documenting our experiments for future generations, they won’t run out of the box. Between the time that we ran our experiments and the time we released the reproducibility code, Amazon added an SSL requirement for servers hosting external HITs. This is a wonderful improvement as far as security goes, but underscores the fact that relying on an external marketplace for your experiments is one more factor to compound the traditional bit rot that software projects see.
Industry-specific challenges. Our VLDB 2012 and 2013 research was performed solely in academia. I’ve since moved to do crowdsourcing research and development in industry. This new environment poses new challenges to reproducibility. While most of the code powering machine learning and workflow design for crowd work in industry is proprietary, so is the crowd. For our work on the Locu team, we’ve got a few hundred workers that we’ve established long-term relationships with. We’ve had relationships with many crowd workers for over two years. Open sourcing the code behind our tools is one thing, but imagining other researchers bootstrapping the relationships and workflows we’ve developed for the purposes of reproducibility is near impossible. Still, I believe industry has a lot to contribute to discussions around crowd-powered systems: the mechanism design, incentives, models, and interfaces we develop are of value to the larger community. If industry is going to contribute to the discussion, we’ll have to work through some tradeoffs, including less-than-randomized evaluations, difficult-to-independently-reproduce conclusions, and as a result, more contributions to engineering than to science.

As crowd research matures, it will be important for us to ask what reproducibility means to our community. The answer will look pretty different from that of other areas of computer science. What are your thoughts?

Thank you to Peter Bailis and Michael Bernstein for providing feedback on drafts of this piece, and to my coauthors for helping get our work to a reproducible state.