Autodata: Automating common data operations

Introduction

Much of the work a data scientist or engineer performs today is rote and error-prone. Data practitioners have to perform tens of steps in order to believe their own analyses and models. The process for each step involves modifications to hundreds/thousands of lines of copy/pasted code, making it easy to forget to tweak a parameter or default. Worse yet, because of the many dependent steps involved in a data workflow, errors compound. It’s no surprise that even after checking off every item of a good data practices checklist, the data practitioner doesn’t fully trust their own work.

Luckily, the data community has been making a lot of common operations less arcane and more repeatable. The community has been automating common procedures including data loading, exploratory data analysis, feature engineering, and model-building. This new world of autodata tools takes some agency away from practitioners in exchange for repeatability and a reduction in repetitive error-prone work. Autodata tools, when used responsibly, can standardize data workflows, improve the quality of models and reports, and save practitioners time.

Autodata doesn’t replace critical thinking: it just means that in fewer lines of code, a data practitioner can follow best practices. Fully realized, an autodata workflow will break a high-level goal like “I want to predict X” or “I want to know why Y is so high” into a set of declarative steps (e.g., “Summarize the data,” “Build the model”) that require little or no custom code to run, but still allow for introspection and iteration.

In this post, I’ll first list some open source projects in the space of autodata, and then take a stab at what the future of autodata could look like. There’s no reason to trust the second part, but it might be fun to read nonetheless.

Problems and projects

Here are a few trailblazing open source projects in the world of autodata, categorized by stage in the data analysis pipeline. I’m sure I’ve missed many projects, as well as entire categories in the space. The survey reflects the bias in my own default data stack, which combines the command line, Python, and SQL. This area deserves a deeper survey: I’d love to collaborate with anyone who’s compiling one.

One deliberate element of this survey is that I largely focus on tools that facilitate data tinkering rather than on how to create enterprise data pipelines. In my experience, even enterprise pipelines start with one data practitioner tinkering in an ad-hoc way before moving on to deeper reporting and modeling, and autodata projects will likely narrow the gap between tinkering and production.

Data ingestion

You can’t summarize or analyze your data in its raw form: you have to turn it into a data frame or SQL-queryable database. When presented with a new CSV file or collection of JSON blobs, my first reaction is to load the data into some structured data store. Most datasets are small, and many analyses start locally, so I try loading the data into a SQLite or DuckDB embedded database. This is almost always harder than it should be: the CSV file will have inconsistent string/numeric types and null values, and the JSON documents will pose additional problems around missing fields and nesting that prevent them from loading into a relational database. The problem of loading a new dataset is the problem of describing and fitting it to a schema.

I’ve been intrigued by sqlite-utils, which offers CSV and JSON importers into SQLite tables. DuckDB has similar support for loading CSV files. If your data is well-structured, these tools will allow you to load your data into a SQLite/DuckDB database. Unfortunately, if your data is nested, missing fields, or otherwise irregular, these automatic loaders tend to choke.
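To make the type-inference part of loading concrete, here is a minimal sketch that uses only Python’s standard library. It is not how sqlite-utils or DuckDB work internally: the sample data, the `infer_type` and `load_csv` helpers, and the table name are all hypothetical, and the inference rule (try integer, then float, then fall back to text) is a deliberately naive assumption.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: one empty cell, and a text column that
# happens to contain a number-like value.
RAW_CSV = """id,price,note
1,9.99,ok
2,,missing price
3,12,7
"""

def infer_type(values):
    """Pick the narrowest SQLite type that fits every non-empty value."""
    present = [v for v in values if v != ""]
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in present:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"

def load_csv(conn, table, text):
    """Infer a schema from the CSV text, create the table, and insert the rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    schema = {c: infer_type([r[c] for r in rows]) for c in columns}
    col_defs = ", ".join(f'"{c}" {t}' for c, t in schema.items())
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        # Empty strings become NULL rather than zero-length text.
        [[r[c] if r[c] != "" else None for c in columns] for r in rows],
    )
    return schema

conn = sqlite3.connect(":memory:")
schema = load_csv(conn, "purchases", RAW_CSV)
```

Even this toy version has to make judgment calls: an all-empty column is typed as INTEGER, and a single stray string would turn a numeric column into TEXT. Those are exactly the decisions a real loader has to surface or automate.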

There’s room for an open source project that takes reasonably structured data and suggests a workable schema from it¹. In addition to detecting types, it should handle the occasional null value in a CSV or missing field in JSON, and should flatten nested data to better fit the relational model. Projects like genson handle schema detection but not flattening/relational transformation. Projects like visions lay a nice foundation for better detecting column types. I’m excited for projects that better tie together schema detection, flattening, transformation, and loading so that less manual processing is required.
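As a rough illustration of the flattening step, the sketch below turns nested JSON records into flat rows and pads missing fields with nulls. The records, the `flatten` and `to_rows` helpers, and the underscore naming convention are all assumptions made for this example. It is not what genson or visions do, and it ignores arrays, which would need their own child tables in a relational model.

```python
import json

# Hypothetical records: the second one is missing "address" entirely.
RAW_JSON = """[
  {"id": 1, "name": "Ada", "address": {"city": "London", "geo": {"lat": 51.5}}},
  {"id": 2, "name": "Grace"}
]"""

def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with underscore-joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

def to_rows(records):
    """Flatten every record and pad missing fields with None."""
    flat_records = [flatten(r) for r in records]
    columns = []
    for flat in flat_records:
        for key in flat:
            if key not in columns:
                columns.append(key)
    rows = [[flat.get(c) for c in columns] for flat in flat_records]
    return columns, rows

columns, rows = to_rows(json.loads(RAW_JSON))
```

Here `columns` comes out as `id`, `name`, `address_city`, and `address_geo_lat`, and the second row carries `None` for the two address fields it never had.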

So far, this section has assumed reasonably clean/structured data that just requires type/schema inference. Academia and industry each have quite a bit to say about data cleaning², and there are also a few open source offerings. The OpenRefine project has been around for a while and shows promise. The dataprep project is building an API to standardize the early stages of working with new datasets, including cleaning and exploratory data analysis. Understandably, these tools rely quite heavily on a human in the loop, and I’m curious if/how open source implementations of auto-data cleaning will pop up.

Exploratory data analysis

When presented with a new dataset, it’s important to interrogate the data to get familiar with empty values, outliers, duplicates, variable interactions, and other limitations. Much of this work involves standard summary statistics and charts, and much of it can be automated. Looking at the data you’ve loaded before trying to use it is important, but wasting your time looping over variables and futzing with plotting libraries is not.

The pandas-profiling library will take a pandas data frame and automatically summarize it. Specifically, it generates an interactive set of per-column summary statistics and plots, raises warnings on missing/duplicate values, and identifies useful interaction/correlation analyses (see an example to understand what it can do). Whereas pandas-profiling is geared toward helping you get a high-level sense of your data, the dabl project has more of a bias toward analysis that will help you build a model. It will automatically provide plots to identify the impact of various variables, show you how those variables interact, and give you a sense of how separable the data is.
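To give a sense of the variable-by-variable looping these libraries save you from, here is a small standard-library sketch of a per-column summary. It is only a stand-in for the idea: the rows and the `summarize` and `duplicate_rows` helpers are hypothetical, and the real libraries produce far richer output (plots, correlations, warnings) through their own APIs.

```python
from collections import Counter

# Hypothetical rows, with None marking a missing value.
rows = [
    {"age": 34, "city": "Boston"},
    {"age": None, "city": "Boston"},
    {"age": 51, "city": "Austin"},
    {"age": 34, "city": "Boston"},
]

def summarize(rows):
    """Return per-column missing/distinct counts, plus ranges for numeric columns."""
    summary = {}
    for column in rows[0]:
        values = [r[column] for r in rows]
        present = [v for v in values if v is not None]
        stats = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1)[0][0] if present else None,
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            stats["min"], stats["max"] = min(present), max(present)
        summary[column] = stats
    return summary

def duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    return sum(count - 1 for count in seen.values())

summary = summarize(rows)
duplicates = duplicate_rows(rows)
```

On this sample, the summary flags one missing `age` and `duplicate_rows` finds one repeated row, which are the kinds of warnings an automated profiler raises without being asked.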

Feature engineering

To build predictive models over your data, you have to engineer features for those models. For example, for your model to identify Saturdays as a predictor of poor sales, someone has to extract a day_of_the_week feature from the purchase_datetime column. In my experience, a ton of data engineering time goes into feature engineering, and most of that work could be aided by machines. Data engineers spend lots of time one-hot encoding their categorical variables, extracting features from datetime fields, vectorizing text blobs, and rolling up statistics on related entities. Feature engineering is further complicated by the fact that you can take it too far: you want to derive as many useful details as possible from the dataset, but because of the curse of dimensionality, you shouldn’t create so many features that they rival the size of your dataset. Often, engineers have to whittle their hard-earned features down once they realize they’ve created too many.
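The day_of_the_week example can be sketched in a few lines of standard-library Python. The sample purchases, the `amount` column, the `add_day_features` helper, and the `is_<day>` column names are assumptions for illustration; the datetime format is also assumed.

```python
from datetime import datetime

# Hypothetical purchases; purchase_datetime follows the example in the text.
purchases = [
    {"purchase_datetime": "2021-03-06 14:30:00", "amount": 20.0},  # a Saturday
    {"purchase_datetime": "2021-03-08 09:15:00", "amount": 35.5},  # a Monday
]

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def add_day_features(row):
    """Derive day_of_the_week and one-hot encode it into seven 0/1 columns."""
    parsed = datetime.strptime(row["purchase_datetime"], "%Y-%m-%d %H:%M:%S")
    day = DAYS[parsed.weekday()]  # weekday() returns 0 for Monday
    features = dict(row, day_of_the_week=day)
    for name in DAYS:
        features[f"is_{name.lower()}"] = int(name == day)
    return features

featured = [add_day_features(p) for p in purchases]
```

Note that one datetime column just became eight new columns. Repeat that for every categorical and datetime field and it is easy to see how the feature count can start to rival the row count.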

I’m heartened to see automatic feature engineering tools like featuretools for relational data and tsfresh for time series data. To the extent that engineers can use these libraries to automatically generate the traditional set of features from their base dataset, we’ll save days to weeks of work building each model. There’s room for more work here: much of the focus of existing open source libraries has been about automatically creating new features (increasing dimensionality) and not enough has been on identifying how many features to create (preserving model simplicity).

Model-building

A project like scikit-learn offers a huge number of models, parameters, and pipelines to tune when building a classification or regression model. In practice, every use I’ve seen of scikit-learn has wrapped those primitives in a grid/random search of a large number of models and a large number of parameters. Data practitioners have their go-to copy-pastable templates for running cross-validated grid search across the eye-numbing number of variables that parameterize their favorite boosted or bagged collection of trees. Running the search is pretty mindless and not always informed by some deep understanding of the underlying data or search space. I’ve seen engineers spend weeks running model searches to eke out a not-so-meaningful improvement to an F-score, and I would have gladly opted for a tool to help us arrive at a reasonable model faster.
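For readers who haven’t seen one of these templates, here is the shape of a cross-validated grid search, reduced to a pure-Python sketch so it runs without dependencies. In practice you would use scikit-learn’s own search utilities; the toy dataset, the tiny k-nearest-neighbors classifier, and every function name below are hypothetical stand-ins.

```python
from itertools import product

# Hypothetical toy dataset: ((x, y), label) pairs.
DATA = [
    ((1.0, 1.2), 0), ((0.8, 0.9), 0), ((1.5, 0.7), 0), ((0.9, 1.6), 0),
    ((3.0, 3.1), 1), ((2.7, 3.4), 1), ((3.3, 2.6), 1), ((2.9, 2.2), 1),
]

def knn_predict(train, point, k, p):
    """Majority vote among the k nearest training points under the L_p distance."""
    def dist(a, b):
        return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > len(nearest))

def cross_val_accuracy(data, folds, **params):
    """Mean accuracy over `folds` interleaved train/test splits."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [item for j, item in enumerate(data) if j % folds != i]
        correct = sum(knn_predict(train, x, **params) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / folds

def grid_search(data, grid, folds=4):
    """Score every parameter combination and return the best one."""
    names = list(grid)
    results = []
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        results.append((cross_val_accuracy(data, folds, **params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results

best_params, best_score, results = grid_search(DATA, {"k": [1, 3, 5], "p": [1, 2]})
```

Two parameters with a handful of values each already means six model fits per fold. Real searches multiply that by many more parameters and models, which is where the weeks go.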

Luckily, AutoML projects like auto-sklearn aim to abstract away model search: given a feature-engineered dataset, a desired outcome variable, and a time budget, auto-sklearn will emit a reasonable ensemble in ~10 lines of code. The dabl project also offers up the notion of a small amount of code for a reasonable baseline model. Whereas auto-sklearn asks the question “How much compute time are you willing to exchange for accuracy?” dabl asks “How quickly can you understand what a reasonable model can accomplish?”

Repeatable pipelines

The sections above present data problems as one-time problems. In practice, much of the work described above is repeated as new data and new questions arise. If you transformed your data once to ingest or feature engineer it, how can you repeat that transformation each time you get a new data dump? If you felt certain in the limitations of the data the first time you analyzed it, how can you remain certain as new records arrive? When you revisit a report or model to update it with new data or test a new hypothesis, how can you remember the process you used to arrive at the report or model last time?

There are solutions to many of these problems of longevity. dbt helps you create repeatable transformations so that the data loading workflow you created on your original dataset can be applied as new records and updates arrive. great_expectations helps you assert facts about your data (e.g., unique columns, maximum values) that should be enforced across updates, and offers experimental functionality to automatically profile and propose such assertions.
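The idea of asserting facts about your data can be sketched without any library. To be clear, this is not the great_expectations API: the helper functions, the expectation list, and the order data are all hypothetical, and the sketch only shows the pattern of declaring checks once and rerunning them on every new batch.

```python
def expect_unique(rows, column):
    """True if no value in `column` appears more than once."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def expect_max(rows, column, limit):
    """True if every value in `column` is at most `limit`."""
    return all(r[column] <= limit for r in rows)

# Hypothetical expectation suite, declared once and reused for every batch.
EXPECTATIONS = [
    ("order_id is unique", lambda rows: expect_unique(rows, "order_id")),
    ("quantity is at most 100", lambda rows: expect_max(rows, "quantity", 100)),
]

def validate(rows):
    """Return the descriptions of every expectation the batch violates."""
    return [name for name, check in EXPECTATIONS if not check(rows)]

original_batch = [{"order_id": 1, "quantity": 2}, {"order_id": 2, "quantity": 5}]
new_batch = [{"order_id": 3, "quantity": 2}, {"order_id": 3, "quantity": 500}]

original_failures = validate(original_batch)
new_failures = validate(new_batch)
```

The original batch passes both checks, while the new batch fails both: the certainty you had about the first data dump is re-tested rather than assumed.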

Whereas the open source world has good answers to repeatable data transformation and data testing, I haven’t been able to find open source tools to track and make repeatable all of the conditions that led to a trained model. There are a few companies in the space³, and I hope that open source offerings arise.

The future of autodata

Autodata is in its infancy: some of the projects listed above aren’t yet at 1.0 versions. What could the future of autodata look like? While I have no track record of predicting the future, here are a few phases we might encounter.

Composition of primitives

At the moment, autodata projects exist, but aren’t data practitioners’ go-to tools. The tools that do exist focus on primitives: today’s autodata projects look at a single part of the data pipeline like schema inference or hyperparameter selection and show that it can be automated with little loss of performance/accuracy. For the foreseeable future, practitioners will still rely on their existing pipelines, but plug a promising project into their data pipeline here or there to save time.

As the automatable primitives are ironed out, more of the projects will be strung together to form pipelines that rely on multiple autodata components. For example, if sqlite-utils used a state-of-the-art schema detection library, “define the schema and load my data” might simply turn into “load my data.” Similarly, if AutoML projects relied on best-of-class automatic feature engineering libraries, feature engineering as an explicit step might be eliminated in some cases.

Limitations and introspection

As higher-level autodata abstractions mature, data pipelines will become accessible to a wider audience. This is a double-edged sword: despite the fact that working with data today requires somewhat arcane knowledge, practitioners still misuse models and misunderstand analyses. As autodata expands the number of people who can create their own data pipelines, communicating the risks of misusing autodata will be critical.

Sociotechnical researchers in areas like Ethical AI are already sounding the alarm on the hidden costs of unwavering faith in algorithms. A big research focus in the next phase of autodata will revolve around how to communicate these exceptions and limitations in software. If a pipeline had to omit part of a dataset in order to load the rest, the desire for auto (“the data was loaded! forget the details!”) will be at odds with the desire for data (“the 1% of data you didn’t load introduced a systemic bias in the model you built!”). If an autodata system selects a more complex model because it improves precision by 5%, how can that same system later warn you that the model has not continued to perform in the face of new data? A few specific areas of research will be critical here:

Human-computer interaction researchers often invoke the concept of mixed-initiative interaction to describe how humans can take turns refining the output of machines. How might we add friction to a pure autodata pipeline so that the operator is aware of the limitations of the “optimal” pipeline? How can the machine take feedback from the operator so the model avoids the operator’s (or society’s) biggest concerns?
Researchers and practitioners are starting to apply the concepts of observability and monitoring to deployed models, but there’s more work to be done. What is the right metadata to attach to the output of an autodata pipeline so that downstream use cases in the current pipeline can raise exceptions when a report/model’s assumptions are broken? What interfaces and modalities will alert the user, who might be the end-user of an application years after a model was created or might be a journalist investigating bias, that it’s no longer sensible to trust the pipeline’s output?
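One small way to picture the tension between “auto” and “data” is a loader that refuses to forget what it skipped. The sketch below is purely illustrative: the `LoadReport` and `TooManySkippedRows` classes, the `load_with_report` function, the column names, and the 5% threshold are all assumptions, not features of any existing tool.

```python
from dataclasses import dataclass, field

@dataclass
class LoadReport:
    """Metadata that travels with the loaded data."""
    loaded: list = field(default_factory=list)
    skipped: list = field(default_factory=list)  # (row, reason) pairs

    @property
    def skipped_fraction(self):
        total = len(self.loaded) + len(self.skipped)
        return len(self.skipped) / total if total else 0.0

class TooManySkippedRows(Exception):
    """Raised when too much data was dropped; carries the report for inspection."""
    def __init__(self, report):
        super().__init__(f"skipped {report.skipped_fraction:.0%} of rows")
        self.report = report

def load_with_report(raw_rows, max_skipped_fraction=0.05):
    """Parse what we can, record what we cannot, and refuse to stay silent."""
    report = LoadReport()
    for row in raw_rows:
        try:
            report.loaded.append({"id": int(row["id"]), "cost": float(row["cost"])})
        except (KeyError, TypeError, ValueError) as error:
            report.skipped.append((row, repr(error)))
    if report.skipped_fraction > max_skipped_fraction:
        raise TooManySkippedRows(report)
    return report

# Hypothetical input: the second row has an unparseable cost.
raw = [{"id": "1", "cost": "10.5"}, {"id": "2", "cost": "n/a"}]
lenient = load_with_report(raw, max_skipped_fraction=0.6)
```

With the lenient threshold the load succeeds but the report still records the skipped row and why; with the default threshold the same input raises an exception. Where to set that friction, and how to show it to a non-expert, is the open question.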

Declarative autodata

As autodata pipelines and abstractions mature, their interfaces can become more declarative. This will allow us to ask higher-level questions. For example, work like Scorpion and Sisu helps produce hypotheses for questions like “what might have caused this variable to change?”

When declarative autodata is fully realized, you will be able to start with semi-structured data (e.g., CSVs of coded medical procedure and cost information, or customer fact and event tables), and ask a question of that data (e.g., “Why might bills be getting more expensive?” or “What is this customer’s likelihood to churn?”). Aside from how you ask the question and receive the answer, you might largely leave the system to take care of the messy details. If you’re lucky, the system will even tell you whether you can trust those answers today, and whether a consumer can trust those answers a few years down the road.
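To make a question like “Why might bills be getting more expensive?” slightly more concrete, here is a deliberately naive sketch that ranks procedure codes by how much their total cost changed between two periods. This is not the Scorpion or Sisu algorithm, both of which search a much larger space of explanations; the billing records, period labels, and `rank_contributors` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical billing records: (period, procedure_code, cost).
BILLS = [
    ("2020", "A100", 100.0), ("2020", "A100", 110.0), ("2020", "B200", 300.0),
    ("2021", "A100", 105.0), ("2021", "A100", 115.0), ("2021", "B200", 450.0),
]

def rank_contributors(bills, before, after):
    """Rank procedure codes by how much their total cost changed between periods."""
    totals = defaultdict(lambda: {before: 0.0, after: 0.0})
    for period, code, cost in bills:
        if period in (before, after):
            totals[code][period] += cost
    changes = {code: t[after] - t[before] for code, t in totals.items()}
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)

ranking = rank_contributors(BILLS, "2020", "2021")
```

On this sample, `B200` accounts for 150.0 of the increase and `A100` for 10.0, so the first hypothesis to investigate is whatever changed about `B200`. A declarative system would hide even this much code behind the question itself.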

Thank you to Peter Bailis, Lydia Gu, Daniel Haas, and Eugene Wu for their suggestions on improving a draft of this post. The first version they read was an unstructured mess of ideas, and they added structure, clarity, and a few missing references. I’m particularly grateful for the level of detail of their feedback: I wasn’t expecting so much care from such busy people!

Footnotes

¹ In terms of papers, Sato offers some thoughts on how to detect types, and Section 4.3 of the Snowflake paper speaks nicely to a gradual method for determining the structure of a blob.
² As a taste of the work in this space, companies like Trifacta and research on projects like Wrangler have shown us what’s possible.
³ See CometML, Determined AI, and Weights & Biases.