Argonaut: Processing Complex Work with the Crowd
One of my favorite times of year at a company is when interns join for the summer. Internships are great avenues for those fun projects you’ve had in the back of your mind but haven’t had time to test out. Two summers ago at Locu, I had the great fortune to work with Daniel Haas, a grad student at Berkeley’s AMP Lab. His three months of work laid the foundation for a paper on a framework called Argonaut that he and Lydia Gu, my Locu/GoDaddy colleague who joined the project, are presenting today at VLDB.
While we primarily used Argonaut for structured data extraction, its concepts can be applied to other areas of complex crowd work. It's amazing to think that what we learned comes from over half a million hours of worker contributions. Without those workers, none of this would have been possible.
As a sneak peek, I'll highlight a few fun learnings we report on in more detail in the paper. I'll also editorialize a bit on what I think the findings mean for crowd work, especially as people do more interesting and complex things with it. Here's the scoop:
Complex work. Traditionally in crowd work, we're told to design microtasks: simple, well-defined yes/no or multiple-choice questions. You can imagine this is pretty dehumanizing and uninspiring for workers. With the Argonaut model, we send large, meaty tasks to workers. Tasks might take upward of an hour to complete, and they're generally easier to design since there's no microtask decomposition to think about. They're closer to what you'd imagine knowledge work to be: we trust humans to do what they're good at on challenging tasks.
Review, don’t repeat. To avoid workers making mistakes in traditional microtask work, we send multiple workers each task, and use voting-based schemes (like majority vote or expectation maximization) to identify the correct answer. With Argonaut, we do something different: only one worker completes each complex task. Entry-level workers are sometimes reviewed by trusted ones, which allows us to catch mistakes, and also allows us to send tasks back to workers so that they can correct them and learn by example. In the paper we show that review works: a large majority of tasks that are reviewed end up of higher quality, and workers get to see how to improve their own work, unlike the opaque voting-based schemes of the microtask world.
Spot-check with help from models. The technical heart of the Argonaut paper is the TaskGrader model. We built a regression on a few hundred features of each task, like the worker's previous work history, the length of the task, the time of day in the worker's timezone, and so on. From these features, the regression predicts how much a review is likely to change or improve a worker's work. Given a fixed review budget or a fixed number of reviewers, we can then identify which tasks reviewers should look at for the biggest improvement in task quality. In the paper, we find that for a practical review budget, you can catch around 50% more errors with the same amount of review, just by pointing reviewers at the tasks that will benefit most from their attention.
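Here's a minimal sketch of that idea, with placeholder feature matrices and an off-the-shelf regressor standing in for the real TaskGrader (the paper's exact features and model choice aren't reproduced here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: one row per completed task, with features like worker history,
# task length, time of day, etc. y: observed quality improvement after
# review (e.g., how much the reviewer changed the submission).
# Random placeholders stand in for real historical data.
X_train = np.random.rand(1000, 20)
y_train = np.random.rand(1000)

grader = GradientBoostingRegressor()
grader.fit(X_train, y_train)

def select_tasks_for_review(X_pending, budget):
    """Return indices of the `budget` pending tasks predicted to benefit
    most from a reviewer's attention."""
    predicted_gain = grader.predict(X_pending)
    return np.argsort(predicted_gain)[::-1][:budget]

X_pending = np.random.rand(200, 20)   # placeholder pending tasks
to_review = select_tasks_for_review(X_pending, budget=20)
```

In practice you'd train on historical pairs of task features and observed post-review improvement rather than the random placeholders above, then rank incoming work as it arrives.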
Optimizing for longevity and upward mobility changes everything. One hypothesis that doesn't receive enough attention in the paper is that long-term relationships with crowd workers change everything. Half of the crowd workers contributed to our system for more than 2.5 years. The ones who performed best ended up being promoted to reviewer status, and were selected to do more interesting work when it came up. This had pretty drastic effects on our worker and task models. Hidden in Figure 7 of the paper is a neat finding: almost all of the TaskGrader's predictive power comes from task-specific features. Worker-specific features on their own aren't very predictive: by the time you've established long-term relationships with workers, what distinguishes a high-quality task from a low-quality one is not which trusted person you're working with, but how difficult the task is. This is in stark contrast to traditional microtask crowd work, where the most celebrated work-quality algorithms identify the trusted workers and weigh their responses more heavily.
While this paper only brings us to the tip of the iceberg of complex work and hierarchical, machine-mediated review, there are a ton of questions we have yet to answer. Most important to me are questions about just how complex the work we support with these models can get. Can we support high-quality creative and analytical tasks beyond structured data extraction? How well does the TaskGrader generalize to other tasks? Finally, what does it mean for crowd work if longevity and upward mobility matter as much as they do in traditional employment scenarios?