Today’s databases are simultaneously ubiquitous and frustratingly inaccessible to most people. Your own data likely lives in thousands of databases at various companies and organizations, but if you wanted to create a database of your own in which to store and share data, you’d need the skills of both a system administrator and a software engineer. By virtue of the complexity of database management systems (and market forces), your data either lives in other people’s databases or in hard-to-share unstructured files on your computer. If it were easier to create and share databases, more people and teams would use them to manage and own their data.
To make this concrete, imagine a journalist or student who’s looking to create a database around a new dataset and build a visualization they host on a website. Using existing database tools, they have to secure a machine, install Postgres/MySQL, update various configuration settings to allow incoming connections, and create and host a web application to gatekeep the database credentials. Instead, what if our user could: 1) Register for an account on a service provided by their newsroom or school, 2) Issue a create database command and make their database publicly accessible in read-only mode, and 3) Issue SQL from the command line or over HTTP?
Toward that vision, I’ve been building ayb, which is a multi-tenant database management system with easy-to-host instances that enable you to quickly register an account, create databases, share them with collaborators, and query them from a web application or the command line. With ayb, all your (data)base can finally belong to you. ayb is open source (Apache 2.0-licensed) and requires a single command to start a server. ayb features:
- ayb databases are just SQLite files, and ayb relies on SQLite for query processing. SQLite is the most widely used database on the planet, and if you one day decide ayb isn’t for you, you can walk away with your data in a single file. (We’ll support other formats like DuckDB’s after its storage format stabilizes.)
- An ayb instance is like GitHub for your databases: once you create an account on an ayb instance, you can create databases quickly and easily. Next on the roadmap are features like authentication and permissions so that you can easily share your databases with particular people or make them publicly accessible.
- ayb is easy to get up and running; you shouldn’t have to be a system administrator to get started. Each ayb instance can serve multiple users’ databases, so that on a team, in a classroom, or in a newsroom, one person can get it running and everyone else can utilize the same instance. Once ayb has authentication and permissions, I plan on running a public ayb instance so people can spin up databases without having to run their own. Clustering/distribution is on the roadmap so that eventually, if your instance ever needed to, it could run on multiple nodes.
- ayb exposes databases over an HTTP API. Other wire protocols (e.g., the PostgreSQL wire protocol) are on the roadmap for broader compatibility with existing applications.

As of June 2023, ayb is neither feature complete nor production-ready. Functionality like authentication, permissions, collaboration, isolation, high availability, and transaction support is on the Roadmap but not available today. If you want to collaborate, reach out!
In the rest of this article, you can:

- See ayb in action in an end-to-end example
- Learn who ayb is for: students, sharers, and sovereigns

ayb is written in Rust, and is available as the ayb crate. Assuming you have Rust installed on your machine, installing ayb takes a single command:
cargo install ayb
An ayb server stores its metadata in SQLite or PostgreSQL, and stores the databases it’s hosting on a local disk. To configure the server, create an ayb.toml configuration file to tell the server what host/port to listen on for connections, how to connect to the metadata database, and the data path for the hosted databases:
$ cat ayb.toml
host = "0.0.0.0"
port = 5433
database_url = "sqlite://ayb_data/ayb.sqlite"
# Or, for Postgres:
# database_url = "postgresql://postgres_user:test@localhost:5432/test_db"
data_path = "./ayb_data"
Running the server then requires one command:
$ ayb server
Once the server is running, you can set its URL as an environment variable called AYB_SERVER_URL, register a user (in this case, marcua), create a database marcua/test.sqlite, and issue SQL as you like. Here’s how to do that at the command line:
$ export AYB_SERVER_URL=http://127.0.0.1:5433
$ ayb client register marcua
Successfully registered marcua
$ ayb client create_database marcua/test.sqlite
Successfully created marcua/test.sqlite
$ ayb client query marcua/test.sqlite "CREATE TABLE favorite_databases(name varchar, score integer);"
Rows: 0
$ ayb client query marcua/test.sqlite "INSERT INTO favorite_databases (name, score) VALUES (\"PostgreSQL\", 10);"
Rows: 0
$ ayb client query marcua/test.sqlite "INSERT INTO favorite_databases (name, score) VALUES (\"SQLite\", 9);"
Rows: 0
$ ayb client query marcua/test.sqlite "INSERT INTO favorite_databases (name, score) VALUES (\"DuckDB\", 9);"
Rows: 0
$ ayb client query marcua/test.sqlite "SELECT * FROM favorite_databases;"
name | score
------------+-------
PostgreSQL | 10
SQLite | 9
DuckDB | 9
Rows: 3
The command line invocations above are a thin wrapper around ayb’s HTTP API. Here are the same commands as above, but with curl:
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua -H "entity-type: user"
{"entity":"marcua","entity_type":"user"}
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua/test.sqlite -H "db-type: sqlite"
{"entity":"marcua","database":"test.sqlite","database_type":"sqlite"}
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua/test.sqlite/query -d 'CREATE TABLE favorite_databases(name varchar, score integer);'
{"fields":[],"rows":[]}
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua/test.sqlite/query -d "INSERT INTO favorite_databases (name, score) VALUES (\"PostgreSQL\", 10);"
{"fields":[],"rows":[]}
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua/test.sqlite/query -d "INSERT INTO favorite_databases (name, score) VALUES (\"SQLite\", 9);"
{"fields":[],"rows":[]}
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua/test.sqlite/query -d "INSERT INTO favorite_databases (name, score) VALUES (\"DuckDB\", 9);"
{"fields":[],"rows":[]}
$ curl -w "\n" -X POST http://127.0.0.1:5433/v1/marcua/test.sqlite/query -d "SELECT * FROM favorite_databases;"
{"fields":["name","score"],"rows":[["PostgreSQL","10"],["SQLite","9"],["DuckDB","9"]]}
If ayb is successful, it will become easier to create a database, interact with it, and share it with relevant people/organizations. There are three groups that would benefit most from such a tool, and by studying the problems they face, we can make ayb more useful for them.
Students. The barrier to learning how to work with data is too high, and much of it is operational overhead (how do I set up a database? how do I connect to it? how do I share what I’ve learned?). Ideally, aside from registering for an account and creating a database, there should be no operational overhead to writing your first SQL query. It should be easy to fork a data set and start asking questions, and it should be easy to start inserting rows into your own small data set. If you get stuck, it should be easy to give a mentor or teacher access to your database and get some help.
Sharers. Scientists, journalists, and other people who want to share a data set have largely ad hoc means of sharing that data, and their collaborators’ and readers’ experience is limited by those ad hoc sharing decisions. You’ve encountered this if you’ve ever tried to do something with the CSV file someone shared over email, or if you’ve wanted to visualize the data presented in an article in a slightly different way. Sharers should be able to create a database, add collaborators, and eventually open it up to the public to fork/query in a read-only way with as little overhead for themselves or the recipient as possible. While this design pushes computation onto the shared instances and away from capable laptops, it enables consistency in data and allows collaborators to benefit from future updates.
Sovereigns. When you use most hosted applications, you’re not in control of your own data. Today’s application stack places ownership of the database and data with the organization that wrote the application. While this model has several benefits, it also means that your data isn’t yours, which has privacy, security, economic, and extensibility implications. The company that hosts the application has sovereignty over the database that hosts the data, and at best they allow you to export portions of your data in sometimes unhelpful formats. The most speculative use case for ayb is that it grants end-users sovereignty over their data. Imagine a world where, before signing up for an application, you spin up an ayb database and authorize the application to use your new database. As long as you’re still getting value from an app, it can provide functionality on top of your data. If you ever change your mind about the app, the data is yours by default, and you can change who has access to your data.
Thank you for reading this far. From here, you can learn what’s next for ayb in the Roadmap.

Thank you to Sofía Aritz, Meredith Blumenstock, Daniel Haas, Meelap Shah, and Eugene Wu for reading and suggesting improvements to early drafts of this blog post. Shout out to Meelap Shah and Eugene Wu for convincing me not to call this project stacks, to Andrew Lange-Abramowitz for making the connection to the storied meme, and to Meredith Blumenstock for listening to me fret over it all.
In the data world, most reporting starts by asking how much?: “how many new customers purchase each week?” or “what is the monthly cost of medical care for this group?”
Inevitably, the initial reports result in questions about why?: “why did we see fewer purchases last week?” and “why are the medical costs for this group increasing?”
The academic community has an answer to such why? questions: explanation algorithms. An explanation algorithm looks at columns/properties of your dataset and identifies high-likelihood explanations (called “predicates” in database-speak). For example, the algorithms might find that you got fewer customers in the segment of people who saw a new marketing campaign, or that the medical costs for the group you’re studying can largely be attributed to costly treatments in a subgroup.
The academic interest is founded in real pain. When a journalist, researcher, or organization asks why?, the resulting data analysis largely goes into issuing ad hoc GROUP BY queries or unscientifically creating pivot tables to try to slice and dice datasets to explain some change over time. Companies like Sisu (founded by Peter Bailis, one of the authors of the DIFF paper discussed below) are built on the premise that data consumers are increasingly asking why?
You can rephrase lots of different questions in the form of an explanation question. This is an area I’ve been interested in for a while, especially as it might help people like journalists and social scientists better identify interesting trends. In A data differ to help journalists (2015), I said:
It would be nice to have a utility that, given two datasets (e.g., two csv files) that are schema-aligned, returns a report of how they differ from one-another in various ways. The utility could take hints of interesting grouping or aggregate columns, or just randomly explore the pairwise combinations of (grouping, aggregate) and sort them by various measures like largest deviation from their own group/across groups.
At the time of that post, I hadn’t yet connected the dots between the desire for such a system and the active work going on in the research world. Thanks to database researchers, that connection now exists! In this post, I’ll first cover two approaches to explanation algorithms, and then introduce an open source implementation of one of them in my datools library.
In 2013, Eugene Wu and Sam Madden introduced Scorpion, a system that explains why an aggregate (e.g., the customer count last week) is higher or lower than other example data. Figure 1 in their paper explains the problem quite nicely. They imagine a user looking at a chart, in this case of aggregate temperatures from a collection of sensors, and highlighting some outliers to ask “compared to the other points on this chart, why are these points so high?”
A figure that shows how a user might highlight outliers on a chart (source: Scorpion paper)
Scorpion has two nice properties. First, it operates on aggregates: it’s not until you look at some weekly or monthly statistics that you notice that something is off and search for an explanation. Second, it’s performant on a pretty wide variety of aggregates, with optimizations for the most common ones (e.g., sums, averages, counts, standard deviations). I believe that of all the explanation algorithms, Scorpion pairs the most intuitive phrasing of the question (“why so high/low?”) with the most intuitive experience (highlighting questionable results on a visualization).
The challenge in implementing Scorpion is that, as presented, it does its processing outside of the database that stores the data. Specifically, the way Scorpion partitions and merges subsets of the data to identify an explanation requires decision trees and clustering algorithms that traditionally execute outside of the database1. It is also specific to aggregates, which are commonly the source of why questions, but aren’t the only places that question arises.
This is where DIFF comes in2. In 2019, Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jeff Naughton, Peter Bailis, and Matei Zaharia introduced an explanation algorithm in the form of a database operator called DIFF that can be expressed in SQL. If you’re so inclined, here’s the syntax for the DIFF operator:
The syntax for the DIFF operator (source: DIFF paper)
An example with SQL might help in understanding how it works:
A simple example of the DIFF operator in action (source: DIFF paper)
In this example, the DIFF operator compares the crash logs of an application from this week to those of last week, considering columns like application version, device, and operating system for an explanation. The most likely explanation happened 20x more this week than last week (risk_ratio = 20.0), and explains 75% of this week’s crashes (support = 75%).
DIFF requires that we do some mental gymnastics to transform “why was X so high?” into “how are these two groups different?”. It also requires the user to wrap their head around statistics like risk ratios and support. In exchange for that mental overhead, DIFF is exciting for its practicality. As the example shows, DIFF’s authors envision it being expressed in SQL, which means it could be implemented on top of most relational databases. While a contribution of the paper is a specialized and efficient implementation of DIFF that databases don’t have today, it can also be implemented entirely in the database as a series of SQL GROUP BY/JOIN/WHERE operators.
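To make that concrete, here’s a hedged sketch of the single-column, set-valued case in plain SQL (composed from Python with SQLAlchemy): group each relation by a candidate column, join the counts, and compute support and one common formulation of the risk ratio. The database, table, and column names are hypothetical, and a real implementation would handle details like zero counts more carefully.

from sqlalchemy import create_engine, text

# Hypothetical database with this week's and last week's crash logs.
engine = create_engine('sqlite:///crashes.sqlite')
test = "SELECT app_version FROM crashes_this_week"      # the "test" relation
control = "SELECT app_version FROM crashes_last_week"   # the "control" relation

query = f"""
WITH t AS (SELECT app_version, COUNT(*) AS n FROM ({test}) AS test_rel GROUP BY app_version),
     c AS (SELECT app_version, COUNT(*) AS n FROM ({control}) AS control_rel GROUP BY app_version),
     totals AS (SELECT (SELECT SUM(n) FROM t) AS t_total,
                       (SELECT SUM(n) FROM c) AS c_total)
SELECT t.app_version,
       1.0 * t.n / totals.t_total AS support,
       -- risk ratio: P(in test set | predicate) / P(in test set | not predicate)
       (1.0 * t.n / (t.n + COALESCE(c.n, 0))) /
       (1.0 * (totals.t_total - t.n) /
        ((totals.t_total - t.n) + (totals.c_total - COALESCE(c.n, 0)))) AS risk_ratio
FROM t LEFT JOIN c ON t.app_version = c.app_version CROSS JOIN totals
ORDER BY risk_ratio DESC
"""

with engine.connect() as connection:
    for row in connection.execute(text(query)):
        print(row)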
If you have a relational database, love SQL, and want to run an explanation algorithm, DIFF is exciting because those three things are all you need. Luckily for you, dear reader, I had a relational database, loved SQL, and wanted to run an explanation algorithm.
Over the past few months, I’ve been implementing DIFF as a thin Python wrapper that generates the SQL necessary to compute the difference between two schema-aligned queries. The core of the implementation to do this, including comments, requires a little under 300 lines of code. To see a full example of the tool in action, you can check out this Jupyter Notebook, but I’ll show snippets below to give you a sense of how it works.
First, we need a dataset. For that, I took inspiration from the Scorpion paper’s experiments, one of which relied on sensor data from Intel collected by my grad school advisor Sam Madden (and a few collaborators). Using Simon Willison’s excellent sqlite-utils library, I load the data into SQLite and inspect it:
# Retrieve and slightly transform the data
wget http://db.csail.mit.edu/labdata/data.txt.gz
gunzip data.txt.gz
sed -i '1s/^/day time_of_day epoch moteid temperature humidity light voltage\n/' data.txt
head data.txt
# Get it in SQLite
pip install sqlite-utils
sqlite-utils insert intel-sensor.sqlite readings data.txt --csv --sniff --detect-types
sqlite-utils schema intel-sensor.sqlite
That last sqlite-utils schema shows us what the newly generated readings table looks like:
CREATE TABLE "readings" (
[day] TEXT,
[time_of_day] TEXT,
[epoch] INTEGER,
[moteid] INTEGER,
[temperature] FLOAT,
[humidity] FLOAT,
[light] FLOAT,
[voltage] FLOAT
);
OK! So we have a row for each sensor reading, with the day and time_of_day it happened, an epoch to time-align readings from different sensors, a moteid (the ID of the sensor, otherwise known as a mote), and then the types of things that sensors tend to sense: temperature, humidity, light, and voltage.
In the Scorpion paper (Sections 8.1 and 8.4), a user notices that various sensors placed throughout a lab detect too-high temperature values (reading the experiment code, this happens in the days between 2004-03-01 and 2004-03-10). A natural question is why this happened. The Scorpion algorithm discovers that moteid = 15 (a sensor with ID 15) was having a bad few days.
Can we replicate this result with DIFF? Let’s see! The DIFF implementation is part of a library I’ve been building called datools, which is a collection of tools I use for various data analyses. Let’s install datools:
pip install datools
Now let’s use it!
from sqlalchemy import create_engine
from datools.explanations import diff
from datools.models import Column
engine = create_engine('sqlite:///intel-sensor.sqlite')
candidates = diff(
engine=engine,
test_relation='SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature > 100 AND day > "2004-03-01" and day < "2004-03-10"',
control_relation='SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature <= 100 AND day > "2004-03-01" and day < "2004-03-10"',
on_column_values={Column('moteid'),},
on_column_ranges={},
min_support=0.05,
min_risk_ratio=2.0,
max_order=1)
for candidate in candidates:
print(candidate)
What’s diff have to say?
Explanation(predicates=(Predicate(moteid = 15),), risk_ratio=404.8320855614973)
Explanation(predicates=(Predicate(moteid = 18),), risk_ratio=200.5765335449176)
Wow! moteid = 15 is the top predicate that datools.diff identified as being the difference between the test_relation and control_relation! With a risk_ratio = 404.83, we learn that sensor 15 is about 400 times more likely to appear in the set of records with high temperature readings than in the set of records with low temperature readings. Hooray for replicating the Scorpion result! Poor sensor 15!
Let’s break that call to diff down a bit so we understand what’s going on:

- engine: a SQLAlchemy engine that’s connected to some database, in this case the SQLite database.
- test_relation: the “test set,” which is a query with records that show a particular condition. In our case, it’s the higher-temperature records during the period of interest. This could alternatively be a SQL query for “patients with high medical costs” or “customers who purchased.”
- control_relation: the “control set,” which is a query with records that don’t show that particular condition. In our case, it’s the lower-temperature records during the period of interest. This could alternatively be a SQL query for “patients who don’t have high medical costs” or “leads who haven’t purchased.”
- on_column_values: the set-valued columns you want to consider as explanations. In our case, we’re considering the moteid column, so we can identify a specific sensor that’s misbehaving.
- on_column_ranges: the range-valued columns you want to consider as explanations. diff will bucket these columns into 15 equi-sized buckets, which works well for continuous variables like {Column('humidity'), Column('light'), Column('voltage'),}. In this example, we don’t provide any (more on why later; a sketch of the broader call appears after this walkthrough), but in the Jupyter Notebook, you can see this in action.
- min_support: the smallest fraction ([0, 1]) of the test set that the explanation should explain. For example, min_support=0.05 says that if an explanation doesn’t include at least 5% of the test set, we don’t want to know about it.
- min_risk_ratio: the smallest risk ratio that the explanation should cover. For example, min_risk_ratio=2.0 says that if an explanation isn’t at least 2 times as likely to appear in the test set as in the control set, we don’t want to know about it.
- max_order: how many columns to consider for a joint explanation. For example, in the Scorpion paper, the authors find that not just sensor 15 (a one-column explanation), but sensor 15 under certain light and voltage conditions (a three-column explanation), is the best explanation for outlier readings. To analyze three-column explanations, you’d set max_order=3. Sadly and hopefully temporarily, while max_order is the most fun, interesting, and challenging-to-implement parameter of the DIFF paper, datools.diff only supports max_order=1 for now.

An astute reader will note that I coaxed the results in my example a bit by asking DIFF to consider only moteid explanations (on_column_values={Column('moteid'),}). The Scorpion paper considers the other columns as well and still gets the strongest signal from moteid. In the Jupyter Notebook, we dive into this more deeply and run into an issue replicating the Scorpion results with diff. I offer some hypotheses for this in the notebook, but to have a more informed opinion, we’ll have to wait until datools.diff supports max_order > 1.
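For reference, here’s a hedged sketch of what that broader call might look like, passing the continuous columns to on_column_ranges as described above (the Jupyter Notebook walks through the actual results):

from sqlalchemy import create_engine
from datools.explanations import diff
from datools.models import Column

engine = create_engine('sqlite:///intel-sensor.sqlite')
candidates = diff(
    engine=engine,
    test_relation='SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature > 100 AND day > "2004-03-01" and day < "2004-03-10"',
    control_relation='SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature <= 100 AND day > "2004-03-01" and day < "2004-03-10"',
    on_column_values={Column('moteid'),},
    on_column_ranges={Column('humidity'), Column('light'), Column('voltage'),},
    min_support=0.05,
    min_risk_ratio=2.0,
    max_order=1)
for candidate in candidates:
    print(candidate)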
Before we go off and celebrate the replication of the Scorpion paper’s findings with the DIFF paper’s algorithm, you should know that it’s not all roses. Luckily, I’m just as excited about improving datools.diff as I was when I first wrote it, so consider the list below to be both limitations of the current version and a roadmap for the library. If you’re curious, this project board tracks the things I’m working on most actively.

- Make diff work on more than just SQLite. diff generates SQL, and I’d love for that SQL to run on any database. This is largely a matter of improving the test harness to provision other databases and fixing whatever breaks. The next few databases I’m targeting are DuckDB, Postgres, and Redshift, but if you’re interested in collaborating on something else, I’d love to help.
- Support max_order > 1. One of the DIFF paper’s contributions is in how to spar with the combinatorial explosion you encounter in looking for multi-column explanations. I’d love to support at least 2- or 3-column explanations.
- Run diff on more datasets. If you’ve got a dataset (especially a public one) you’re hoping to try this on, let me know!
- Rerun diff on Scorpion’s analysis after implementing higher-order explanations. The full Jupyter Notebook shows that diff can’t yet replicate Scorpion’s results when we ask it to consider more columns than moteid. The notebook offers explanations ranging from “DIFF and Scorpion are different algorithms and have different tradeoffs” to “Why are we considering an output measure as an explanation?” I think it’s worth revisiting this after implementing max_order > 1, so that we can see how datools.diff handles more complex explanations.
- datools. diff is part of the datools package, but I haven’t told you much about datools. Countless words have been spilled about how SQL, despite being here to stay, also has its rough edges. datools smooths some of these rough edges out3.

Eugene Wu not only introduced me to the concept of explanation algorithms, but also patiently guided me through starts and stops as I tried to implement various papers. Peter Bailis not only showed that the need for explanation algorithms is felt broadly, but also supportively contextualized DIFF relative to even more recent state-of-the-art solutions. I’m grateful to both of them for their feedback.
Strictly speaking, more complex analytics and machine learning algorithms don’t have to run outside the database. MADlib speaks to this nicely, although in practice the approach hasn’t taken off as widely as I wish it had. ↩
As an example, not every database (I’m looking at you, SQLite and Redshift) supports things like grouping sets and data cubes, but these operators are critical for making tools like DIFF-in-SQL work effectively. datools offers wrappers that, if a database supports grouping sets, will use the native functionality, but if the database doesn’t, will do the next best thing. ↩
When we moved to NYC a few years ago, I wanted to keep up with what was going on in the neighborhood, but also wanted to avoid relying on Facebook or other closed ecosystems to stay connected. Friends recommended an open forum administered and moderated by volunteers called Jackson Heights Life (and its sibling forum, Astorians). Each forum offered an RSS feed with its latest posts, which allowed me to track updates on neighborhood happenings from the comfort of my feed reader. This all worked well until the volunteers for the two forums announced that for various reasons, they were shutting the forums down with the new year.
Responses to the announcements varied, but some commenters wondered whether the forums’ contents would be preserved, as they were nearing two decades of history. My immediate thought was to check The Internet Archive, but the results weren’t promising: there were very few crawls of the website in 2021, and the ones that existed were largely of images, not forum posts. I reached out to forum moderators, and they said that they had been advised about digital preservation services, but that the costs were prohibitive (>= $1,000 per forum). Nostalgia struck, and I reasoned that my trying to capture the websites in the final few days of 2021 was better than nothing.
What follows is a hopefully reproducible set of steps involving command line tools like wget, git, grep, and sed, and tools like GitHub Pages to host the archived website for free. It sadly requires some technical skills. After a few caveats, I explain the process and tools.
Before we begin, here are a few things to consider:

- The --adjust-extension and --restrict-file-names=ascii,windows flags to wget that I share below will also change filenames on you, so preserving old URLs is virtually impossible. This is a guide to creating an archive of the content, not a facsimile of the old website and its behavior.
- For websites that render their content dynamically (rather than serving mostly static HTML, as these forums did), these wget tricks would not work.

This last point is worth considering more broadly: today’s website architectures are perhaps optimized for latency or responsiveness, but not for preservation. For every walled garden or web application we encounter, how might we help its contents outlast the decades?
Here’s what you need to do to archive a website, with examples.
I do all of my development on a remote server, and have a $5/month virtual server for all of my side projects. That helps with things like leaving long-running commands to run while you sleep, but it’s not a requirement: you can run all of these commands from most laptops with access to a shell.
The first step is to crawl the website using a tool like wget in recursive mode. The following command (with a more aggressive wait as the deadline approached) is the one I used:
wget -P . --recursive --page-requisites --adjust-extension --convert-links --wait=1 --random-wait --restrict-file-names=ascii,windows https://www.example.com
The wget documentation is rich with examples and flags. I didn’t explore a bunch that might have simplified things for me, like --mirror or --backup-converted. The forums I crawled had hundreds of thousands of pages, and all in all it took approximately a full night’s sleep to crawl each site. If you’re running this on a laptop, you’ll want to turn off automatic sleep mode (leaving yourself a reminder to turn it back on when you’re done). On a remote server, don’t forget to use tmux or screen so your shell persists. Here are a few notes on the flags I used:
- --recursive: Keep requesting linked pages.
- --page-requisites: …and their static assets (e.g., .js, .css).
- --adjust-extension: Add extensions like .html to file names that are missing them.
- --convert-links: Rewrite references to files so that they work locally.
- --wait=1: How many seconds to wait between requests so you don’t harm the server. You can adjust this depending on your deadline.
- --random-wait: Introduce randomness to the wait time on the previous line, in case something on the server might block crawlers.
- --restrict-file-names=ascii,windows: (Probably necessary, but will change file names, breaking URLs.) I used this to convert the query string into something that wouldn’t confuse browsers/servers. For example, if you’re crawling index.php?some=args and hoping to convert that to something a web server can serve without running PHP, this flag will rewrite the path.

Before mucking with the downloaded source and potentially making a mistake, put the content into source control. I used GitHub because it offers free web hosting through GitHub Pages, but you do you. Each of the forums’ hundreds of thousands of pages worked out to about ~5 gigabytes of space. Intuition and previous things people told me made me worry about storing that much stuff in Git, but I’ve seen way larger repositories in professional contexts, so I wasn’t going to worry about it if it worked. Things largely worked, but some git operations took tens of minutes, and I had to set ServerAliveInterval 60 and ServerAliveCountMax 30 in ~/.ssh/config to avoid timing out the original git push.
Given how slow some of the git operations end up being, you might want to make sure the crawl worked at all before your initial git add/git commit. To do that, jump ahead to test the website locally to make sure things look reasonable (but not perfect, yet).
Let’s host this archive for free! There are many options for doing this, from putting it up on S3/CloudFront to using Netlify or Vercel to host the static assets. I used GitHub Pages because it’s free, but I’m not here to sell you anything. Many solutions would have sufficed. Here are some notes on what I learned:
- Ignore the Jekyll stuff. GitHub Pages has really nice integration with Jekyll so you can set up templates, etc., but you’ve just crawled a bunch of HTML that you hope to never edit again. You can disregard this part of their documentation.
- By default you’ll get https://yourname.github.io/projectname as the URL, but you can also set up a custom CNAME. The owners of the original domains were kind enough to point archive.astorians.com and archive.jacksonheightslife.com my way, which I think is a bit more fitting for an archive. They had other plans for the www CNAMEs, so to the residents of Queens I say: get excited!

After the crawl completes (and not while it’s running, since wget’s --convert-links runs after the crawl is done), you can run a local web server with:
timeout 2h \
python3 -m http.server \
--bind 0.0.0.0 8001
With 0.0.0.0, you will be able to access the server remotely, and you should pick a port (8001 in the example) that’s open on the server. The timeout 2h kills the server after 2 hours, since I don’t like forgetting about services I’ve left running on my remote server. In my case, I visited http://my.address.tld:8001 after the crawl and immediately ran into issues, which I addressed in the next step.
This is the most labor-intensive part of the process.
- wget can’t anticipate, download, or rewrite every path (in my case, some URLs were constructed in JavaScript, and static references in CSS backgrounds weren’t rewritten).
- You’ll have to track down the files the crawl missed, download them yourself (e.g., with one-off calls to wget), and place them in the appropriate directory.
- Despite wget’s --convert-links, I had ~150,000 pages for each forum with absolute URLs (or JavaScript that constructed those URLs) that I had to rewrite.

Here are some handy command line tricks to help you on your journey in rewriting hundreds of thousands of files:
- If there’s a pattern of files that wget didn’t collect, and you can describe that pattern with a regular expression, use something like grep -rho '"http://www.astorians.com/community/Theme[^\"]*"' . | sort | uniq | xargs wget -x - to download all of the missing files. Replace the ...astorians... path string with your regular expression.
- To rewrite a string across every file, use git grep -lz 'OLD_STRING_TO_REPLACE' | xargs -0 sed -i '' -e 's/OLD_STRING_TO_REPLACE/NEW_STRING_TO_USE/g', replacing OLD_STRING_TO_REPLACE and NEW_STRING_TO_USE with regular expressions for the old and new string.

Having tested the websites locally and manually fixed issues, you’d think life is good. When you look at the GitHub Pages-hosted website for the first time, expect more things to break. For me, GitHub Pages used HTTPS/TLS (yay!), which prevented the browser from loading insecure http:// static assets. Play some music and go back to the manually… step. With 10 minutes between each deploy, you’ll be here a while.
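To cut down on those slow deploy-and-check cycles, it can help to scan the crawled tree locally for leftover absolute http:// references before pushing. Here’s a small, hedged helper script (not part of wget or any tool mentioned above) that does just that:

import os
import re
from collections import Counter

PATTERN = re.compile(r"http://[^\"'\s)]+")
counts = Counter()

for root, _, files in os.walk("."):
    for name in files:
        if not name.endswith((".html", ".css", ".js")):
            continue
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="ignore") as handle:
            counts.update(PATTERN.findall(handle.read()))

# Print the most common leftover absolute URLs so you can prioritize rewrites.
for url, count in counts.most_common(20):
    print(count, url)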
At some point, you’ll have iteratively refined your way to success. The first website refresh that looks half-decent will give you quite a thrill. You’ve preserved a bit of internet history. Go you.
Thank you to the owners, administrators, moderators, and commenters of Astorians and Jackson Heights Life. Your role was a lot more involved than mine. Thank you also to Meredith Blumenstock for reading a draft of this writeup.
Much of the work a data scientist or engineer performs today is rote and error-prone. Data practitioners have to perform tens of steps in order to believe their own analyses and models. The process for each step involves modifications to hundreds/thousands of lines of copy/pasted code, making it easy to forget to tweak a parameter or default. Worse yet, because of the many dependent steps involved in a data workflow, errors compound. It’s no surprise that even after checking off every item of a good data practices checklist, the data practitioner doesn’t fully trust their own work.
Luckily, the data community has been making a lot of common operations less arcane and more repeatable. The community has been automating common procedures including data loading, exploratory data analysis, feature engineering, and model-building. This new world of autodata tools takes some agency away from practitioners in exchange for repeatability and a reduction in repetitive error-prone work. Autodata tools, when used responsibly, can standardize data workflows, improve the quality of models and reports, and save practitioners time.
Autodata doesn’t replace critical thinking: it just means that in fewer lines of code, a data practitioner can follow best practices. Fully realized, an autodata workflow will break a high-level goal like “I want to predict X” or “I want to know why Y is so high” into a set of declarative steps (e.g., “Summarize the data,” “Build the model”) that require little or no custom code to run, but still allow for introspection and iteration.
In this post, I’ll first list some open source projects in the space of autodata, and then take a stab at what the future of autodata could look like. There’s no reason to trust the second part, but it might be fun to read nonetheless.
Here are a few trailblazing open source projects in the world of autodata, categorized by stage in the data analysis pipeline. I’m sure I’ve missed many projects, as well as entire categories in the space. The survey reflects the bias in my own default data stack, which combines the command line, Python, and SQL. This area deserves a deeper survey: I’d love to collaborate with anyone that’s compiling one.
One deliberate element of this survey is that I largely focus on tools that facilitate data tinkering rather than on how to create enterprise data pipelines. In my experience, even enterprise pipelines start with one data practitioner tinkering in an ad-hoc way before more deeply reporting and modeling, and autodata projects will likely narrow the gap between tinkering and production.
You can’t summarize or analyze your data in its raw form: you have to turn it into a data frame or SQL-queriable database. When presented with a new CSV file or collection of JSON blobs, my first reaction is to load the data into some structured data store. Most datasets are small, and many analyses start locally, so I try loading the data into a SQLite or DuckDB embedded database. This is almost always harder than it should be: the CSV file will have inconsistent string/numeric types and null values, and the JSON documents will pose additional problems around missing fields and nesting that prevent their loading into a relational database. The problem of loading a new dataset is the problem of describing and fitting it to a schema.
I’ve been intrigued by sqlite-utils, which offers CSV and JSON importers into SQLite tables. DuckDB has similar support for loading CSV files. If your data is well-structured, these tools will allow you to load your data into a SQLite/DuckDB database. Unfortunately, if your data is nested, missing fields, or otherwise irregular, these automatic loaders tend to choke.
There’s room for an open source project that takes reasonably structured data and suggests a workable schema from it1. In addition to detecting types, it should handle the occasional null value in a CSV or missing field in JSON, and should flatten nested data to better fit the relational model. Projects like genson handle schema detection but not flattening/relational transformation. Projects like visions lay a nice foundation for better detecting column types. I’m excited for projects that better tie together schema detection, flattening, transformation, and loading so that less manual processing is required.
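As a hedged illustration of what schema detection alone gives you today, here’s genson inferring a JSON Schema from a few made-up records; it detects types and nesting, but flattening that nested structure into relational tables is still up to you:

from genson import SchemaBuilder

records = [
    {"name": "PostgreSQL", "score": 10, "tags": {"license": "PostgreSQL"}},
    {"name": "SQLite", "score": 9, "tags": {"license": "Public Domain"}},
    {"name": "DuckDB", "score": None},  # missing and null fields are common in practice
]

builder = SchemaBuilder()
for record in records:
    builder.add_object(record)

# Emits a JSON Schema with detected types, including the nested "tags" object.
print(builder.to_schema())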
So far, this section has assumed reasonably clean/structured data that just requires type/schema inference. Academia and industry each have quite a bit to say about data cleaning2, and there are also a few open source offerings. The OpenRefine project has been around for a while and shows promise. The dataprep project is building an API to standardize the early stages of working with new datasets, including cleaning and exploratory data analysis. Understandably, these tools rely quite heavily on a human in the loop, and I’m curious if/how open source implementations of auto-data cleaning will pop up.
When presented with a new dataset, it’s important to interrogate the data to get familiar with empty values, outliers, duplicates, variable interactions, and other limitations. Much of this work involves standard summary statistics and charts, and much of it can be automated. Looking at the data you’ve loaded before trying to use it is important, but wasting your time looping over variables and futzing with plotting libraries is not.
The pandas-profiling library will take a pandas data frame and automatically summarize it. Specifically, it generates an interactive set of per-column summary statistics and plots, raises warnings on missing/duplicate values, and identifies useful interaction/correlation analyses (see an example to understand what it can do). Whereas pandas-profiling is geared toward helping you get a high-level sense of your data, the dabl project has more of a bias toward analysis that will help you build a model. It will automatically provide plots to identify the impact of various variables, show you how those variables interact, and give you a sense of how separable the data is.
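As a hedged sketch (the library’s API may change, so check its docs), automatically profiling a data frame takes just a couple of lines; the input file here is hypothetical:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("sensor_readings.csv")  # hypothetical input

# One call produces per-column summaries and plots, missing/duplicate-value
# warnings, and interaction/correlation analyses.
profile = ProfileReport(df, title="Exploratory report")
profile.to_file("report.html")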
To build predictive models over your data, you have to engineer features for those models. For example, for your model to identify Saturdays as a predictor of poor sales, someone has to extract a day_of_the_week feature from the purchase_datetime column. In my experience, a ton of data engineering time goes into feature engineering, and most of that work could be aided by machines. Data engineers spend lots of time one-hot encoding their categorical variables, extracting features from datetime fields, vectorizing text blobs, and rolling up statistics on related entities. Feature engineering is further complicated by the fact that you can take it too far: because of the curse of dimensionality, you should derive as many details as possible from the dataset, but not create so many features that they rival the size of your dataset. Often, engineers have to whittle their hard-earned features down once they realize they’ve created too many.
I’m heartened to see automatic feature engineering tools like featuretools for relational data and tsfresh for time series data. To the extent that engineers can use these libraries to automatically generate the traditional set of features from their base dataset, we’ll save days to weeks of work building each model. There’s room for more work here: much of the focus of existing open source libraries has been about automatically creating new features (increasing dimensionality) and not enough has been on identifying how many features to create (preserving model simplicity).
A project like scikit-learn offers so many models, parameters, and pipelines to tune when building a classification or regression model. In practice, every use I’ve seen of scikit-learn has wrapped those primitives in a grid/random search over a large number of models and a large number of parameters. Data practitioners have their go-to copy-pastable templates for running cross-validated grid search across the eye-numbing number of variables that parameterize your favorite boosted or bagged collection of trees. Running the search is pretty mindless and not always informed by some deep understanding of the underlying data or search space. I’ve seen engineers spend weeks running model searches to eke out a not-so-meaningful improvement to an F-score, and would have gladly opted for a tool to help us arrive at a reasonable model faster.

Luckily, AutoML projects like auto-sklearn aim to abstract away model search: given a feature-engineered dataset, a desired outcome variable, and a time budget, auto-sklearn will emit a reasonable ensemble in ~10 lines of code. The dabl project also offers up the notion of a small amount of code for a reasonable baseline model. Whereas auto-sklearn asks the question “How much compute time are you willing to exchange for accuracy?”, dabl asks “How quickly can you understand what a reasonable model can accomplish?”
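As a hedged sketch of that tradeoff (the dataset, outcome column, and time budget here are all hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import autosklearn.classification

df = pd.read_csv("features.csv")  # hypothetical feature-engineered dataset
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exchange a fixed compute budget (10 minutes here) for a reasonable ensemble.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))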
The sections above present data problems as one-time problems. In practice, much of the work described above is repeated as new data and new questions arise. If you transformed your data once to ingest or feature engineer it, how can you repeat that transformation each time you get a new data dump? If you felt certain in the limitations of the data the first time you analyzed it, how can you remain certain as new records arrive? When you revisit a report or model to update it with new data or test a new hypothesis, how can you remember the process you used to arrive at the report or model last time?
There are solutions to many of these problems of longevity. dbt helps you create repeatable transformations so that the data loading workflow you created on your original dataset can be applied as new records and updates arrive. great_expectations helps you assert facts about your data (e.g., unique columns, maximum values) that should be enforced across updates, and offers experimental functionality to automatically profile and propose such assertions.
Whereas the open source world has good answers to repeatable data transformation and data testing, I haven’t been able to find open source tools to track and make repeatable all of the conditions that led to a trained model. There are a few companies in the space3, and I hope that open source offerings arise.
Autodata is in its infancy: some of the projects listed above aren’t yet at 1.0 versions. What could the future of autodata look like? While I have no track record of predicting the future, here are a few phases we might encounter.
At the moment, autodata projects exist, but aren’t data practitioners’ go-to tools. The tools that do exist focus on primitives: today’s autodata projects look at a single part of the data pipeline like schema inference or hyperparameter selection and show that it can be automated with little loss of performance/accuracy. For the foreseeable future, practitioners will still rely on their existing pipelines, but plug a promising project into their data pipeline here or there to save time.
As the automatable primitives are ironed out, more of the projects will be strung together to form pipelines that rely on multiple autodata components. For example, if sqlite-utils used a state-of-the-art schema detection library, “define the schema and load my data” might simply turn into “load my data.” Similarly, if AutoML projects relied on best-of-class automatic feature engineering libraries, feature engineering as an explicit step might be eliminated in some cases.
As higher-level autodata abstractions mature, data pipelines will become accessible to a wider audience. This is a double-edged sword: despite the fact that working with data today requires somewhat arcane knowledge, practitioners still misuse models and misunderstand analyses. As autodata expands the number of people who can create their own data pipelines, communicating the misappropriation of autodata will be critical.
Sociotechnical researchers in areas like Ethical AI are already sounding the alarm on the hidden costs of unwavering faith in algorithms. A big research focus in the next phase of autodata will revolve around how to communicate these exceptions and limitations in software. If a pipeline had to omit part of a dataset in order to load the rest, the desire for auto (“the data was loaded! forget the details!”) will be at odds with the desire for data (“the 1% of data you didn’t load introduced a systemic bias in the model you built!”). If an autodata system selects a more complex model because it improves precision by 5%, how can that same system later warn you that the model has not continued to perform in the face of new data? A few specific areas of research will be critical here:
As autodata pipelines and abstractions mature, their interfaces can become more declarative. This will allow us to ask higher-level questions. For example, work like Scorpion and Sisu help produce hypotheses to questions like “what might have caused this variable to change?”
When declarative autodata is fully realized, you will be able to start with semi-structured data (e.g., CSVs of coded medical procedure and cost information, or customer fact and event tables), and ask a question of that data (e.g., “Why might bills be getting more expensive?” or “What is this customer’s likelihood to churn?”). Aside from how you ask the question and receive the answer, you might largely leave the system to take care of the messy details. If you’re lucky, the system will even tell you whether you can trust those answers today, and whether a consumer can trust those answers a few years down the road.
Thank you to Peter Bailis, Lydia Gu, Daniel Haas, and Eugene Wu for their suggestions on improving a draft of this post. The first version they read was an unstructured mess of ideas, and they added structure, clarity, and a few missing references. I’m particularly grateful for the level of detail of their feedback: I wasn’t expecting so much care from such busy people!
In terms of papers, Sato offers some thoughts on how to detect types, and Section 4.3 of the Snowflake paper speaks nicely to a gradual method for determining the structure of a blob. ↩
As a taste of the work in this space, companies like Trifacta and research on projects like Wrangler have shown us what’s possible. ↩
See CometML, Determined AI, and Weights & Biases. ↩
I did a mix of reading and listening to books, largely listening to one while I ran and did chores, and reading one at night and on weekends. Libby/your library probably give you access to a bunch of e-books and audiobooks, so give both a shot!
Here are the books I read or listened to in 2019 in the order that I consumed them. After a few warmup/fun books, I started reading books from the Pulitzer prize list for previous years in a bunch of the book subcategories, as well as recommendations from friends.
Here are the books I read or listened to in 2020 in the order that I consumed them:
This was my first use of Jupyter notebooks to write a data analysis piece. It was a lot of fun, and I hope all of you do it too!
Some fun findings: 1) two 115-year-old active voters, and 2) an example of how a campaign can create a call/mailing list of active voters to ask them for additional help.
Thanks to Meredith Blumenstock and Derek Willis for reading an early draft of this!
The book, which is freely available as a PDF, has two parts. In the first half, we review the state of academic research in crowdsourcing, with a special eye for data processing. The first half was a natural follow-on to our research in grad school. The second half of the book features summaries of 13 interviews with industry users of crowd work and 4 operators of crowdsourcing marketplaces. This half is filled with summary statistics and rich quotes from folks at companies like Google, Facebook, and Microsoft on how they manage large crowd workforces, what their use cases are, which aspects of the research literature they benefit from, and where they could use a little more help from researchers.
I really enjoyed two aspects of working on this book. First, it was wonderful to work with Aditya, who I never got to collaborate with in grad school. Second, the experience opened Aditya and me up to just how much you can learn from qualitative work like the interviews and surveys in the second part of the book. Both of us felt that this second lesson would have a lasting impact on how we approach learning a new topic, and how to keep industry and academia in sync on the most important problems in a field.
My only regret with the book is that, due to the formatting guidelines of our publisher, the Acknowledgements section is at the end of the book. One of my not-so-secret delights is reading the acknowledgements that people put in their Ph.D. theses, and I like it when they can be front and center. Nonetheless, it’s there, and I’m grateful!
To make the book more accessible, we’ll be putting together summaries of our favorite sections as blog posts. You can read the first one on Aditya’s blog.
While we primarily used Argonaut for structured data extraction, Argonaut’s concepts can be applied to other areas of complex crowd work. It’s amazing to think our learnings come from over half a million hours of worker contributions. Without them, none of these learnings would be possible.
As a sneak peek, I’ll highlight a few fun learnings we report on in more detail in the paper. I’ll also editorialize a bit on what I think the findings mean for crowd work, especially as people do more interesting and complex things with it. Here’s the scoop:
Complex work. Traditionally in crowd work, we’re told to design microtasks: simple yes/no or multiple choice questions that are well defined. You can imagine this is pretty dehumanizing and not inspiring to workers. With the Argonaut model, we send large, meaty tasks to workers. Tasks might take upward of an hour to complete, and are generally easier to design since there’s no microtask decomposition to think about. They are closer to what you’d imagine knowledge work being like: we trust humans to do what they’re good at on challenging tasks.
Review, don’t repeat. To avoid workers making mistakes in traditional microtask work, we send multiple workers each task, and use voting-based schemes (like majority vote or expectation maximization) to identify the correct answer. With Argonaut, we do something different: only one worker completes each complex task. Entry-level workers are sometimes reviewed by trusted ones, which allows us to catch mistakes, and also allows us to send tasks back to workers so that they can correct them and learn by example. In the paper we show that review works: a large majority of tasks that are reviewed end up being of higher quality, and workers get to see how to improve their own work, unlike the opaque voting-based schemes of the microtask world.
Spotcheck with help from models. The technical heart of the Argonaut paper is a TaskGrader model. We built a regression on a few hundred features of each task, like the worker’s previous work history, the length of the task, the time of day in the worker’s timezone, etc. The regression predicted, based on these features, how much a review might change/improve a worker’s work. Given a fixed budget for review or a fixed number of reviewers, we can now identify which tasks the reviewers should look at for maximal task quality improvement. In the paper, we find that for a practical review budget, you can catch around 50% more errors with the same amount of review, just by pointing reviewers at the tasks that will benefit most from their attention.
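As a generic, hedged sketch of that spotchecking idea (this is not the paper’s actual TaskGrader model or feature set), the routing logic amounts to a regression plus a top-k selection:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-task features (e.g., worker history, task length, time of
# day) and the observed improvement from tasks that were previously reviewed.
rng = np.random.default_rng(0)
task_features = rng.random((1000, 3))
observed_improvement = rng.random(1000)

grader = GradientBoostingRegressor().fit(task_features, observed_improvement)

# Score incoming tasks and route the top 20% (a fixed review budget) to reviewers.
incoming = rng.random((200, 3))
predicted_improvement = grader.predict(incoming)
budget = int(0.2 * len(incoming))
to_review = np.argsort(-predicted_improvement)[:budget]
print(to_review)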
Optimizing for longevity and upward mobility changes everything. One topic/hypothesis that doesn’t receive enough attention in the paper is that having long-term relationships with crowd workers changes everything. Half of the crowd workers contributed to our system for more than 2.5 years. The ones that performed the best ended up being promoted to reviewer status, and were selected to do more interesting work when it came up. This had pretty drastic effects on our worker and task models. Hidden in Figure 7 of the paper is a neat finding: almost the entirety of the predictive power of the TaskGrader comes from task-specific features. Worker-specific features on their own don’t appear to be too predictive: by the time you establish long-term relationships with workers, the discerning properties of a task’s quality are not the trusted people you’re working with, but the difficulty of the task they are completing. This is in stark contrast to traditional microtask crowd work, where the most celebrated work quality algorithms identify the trusted workers and weigh their responses more heavily.
While this paper brings us to the tip of the iceberg of complex work and hierarchical machine-mediated review, there are a ton of questions we have yet to answer. Most important to me are questions around just how complex of work we can do with these models. Can we support high-quality creative and analytical tasks beyond structured data extraction? How generalizable is the TaskGrader to other tasks? Finally, what does it mean for crowd work if longevity and upward mobility matter as much as they do in traditional employment scenarios?
It would be nice to have a utility that, given two datasets (e.g., two csv files) that are schema-aligned, returns a report of how they differ from one-another in various ways. The utility could take hints of interesting grouping or aggregate columns, or just randomly explore the pairwise combinations of (grouping, aggregate) and sort them by various measures like largest deviation from their own group/across groups.
There are a few challenges with the “just show me interesting combinations” version of this:
update with related work: