Life update: I’ve defended my thesis and I’m now the Director of Data at Locu. This doesn’t change much on the blog, as I’ll still periodically update it with random thoughts. I’m also doing a bit of blogging on the Locu blog on topics like our technology workflow, designing for crowds, and the human side of crowdsourcing.
It’s an exciting and very different next step for me. I still love introducing new students to data and computer science, and will keep that up as well.
We recently taught a course on data literacy basics targeted at computer science undergraduates. Our initial motivation was selfish: as databases researchers, we didn’t have much experience with an end-to-end raw-data-to-data-product pipeline. After a few trial runs of our own, we realized certain data processing patterns kept showing up, and saw that we had a small course’s worth of content on our hands. The important thing here is that even with undergraduate- and graduate-level machine learning, statistics, and database courses under our belts, we still had a lot to learn about working with honest-to-goodness dirty data.
Each module of our course could have had an entire semester dedicated to it, so we favored basic skills and lots of hands-on experience over intellectual depth and rigor. We kept lectures to 20-30 minutes, giving students the remaining 2.5 hours to work through the labs we set up while we walked around answering questions. Lectures let students know at a high level what they were in for, and the lab portion allowed them to cement those concepts with real datasets, code, and diagrams. All of the course content is available on GitHub, and as an example, here is a direct link to Day 1’s lab.
The syllabus we covered was:
- Day 1: an end-to-end experience in downloading campaign contribution data from the Federal Election Commission, cleaning it up, and programmatically displaying it using basic charts.
- Day 2: visualization/charting skills using election and county health data.
- Day 3: statistics to take the hunches students developed on Day 2 and quantify them, learning about t-tests and linear regression along the way (see the sketch after this list).
- Day 4: text processing/summarization using the Enron email corpus.
- Day 5: MapReduce to scale up Day 4’s analysis using Elastic MapReduce on Amazon Web Services. This felt a bit forced, but the students were clamoring for distributed data processing experience.
- Day 6: the students teach us something they learned from their own datasets, using the techniques we taught them.
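To give a flavor of the Day 1 and Day 3 material, here is a minimal sketch of the clean-then-quantify pattern. The file name, column names, and candidate labels are hypothetical stand-ins; the actual labs walk through real FEC data dumps.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical input: contributions.csv with "candidate" and "amount"
# columns, standing in for the real FEC dumps used in the labs.
df = pd.read_csv("contributions.csv")

# Day 1-style cleaning: strip stray whitespace, coerce bad amounts to NaN.
df["candidate"] = df["candidate"].str.strip().str.title()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

# Day 1-style charting: total contributions per candidate.
df.groupby("candidate")["amount"].sum().plot(kind="bar")
plt.show()

# Day 3-style statistics: quantify a hunch with a two-sample t-test.
a = df[df["candidate"] == "Candidate A"]["amount"]
b = df[df["candidate"] == "Candidate B"]["amount"]
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"t={t:.2f}, p={p:.3f}")
```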
While we set out to give computer science students familiar with Python a dive into data, we ended up with folks from the physical sciences, doctors, and a few social scientists who had their own datasets and questions to answer. The last day let them experiment with their new skills on their own data. Attendance on this day was lower than on previous days: most of the folks in attendance on Day 6 were on the more experienced end, and I suspect that the undergrads, who had not yet encountered data problems of their own, didn’t find it as engaging. It would be interesting to develop course content that enables self-directed data science for students who still need a bit more inspiration.
I should also say that ours is not the first attempt to bring data to the classroom. Jeff Hammerbacher and Mike Franklin at Berkeley have a wonderful semester-length course on data science. The high-level outline of their course seems similar, but they get further into data product design and jump into each topic in more depth. Their resources page has a nice set of links to other educational efforts worth checking out.
I consume content through many aggregators, but The New York Times (The Gray Lady) is the single source of content I go to directly at least daily to know what’s happening in the world. While it’s good for news, what sets The Times apart from other content sources is its depth of reporting. There’s one problem, though: by default, longer NYT articles do not appear in single-page mode. This has caused me problems ranging in severity from slightly annoying (having to click Next Page) to pretty frustrating (loading articles for offline reading only to realize I had nothing but the first page).
So I created One Gray Lady, a Greasemonkey script that loads all NYT content in single-page mode.
To install it in Google Chrome or Firefox with the Greasemonkey plugin, click here.
I have only tested the code in Chrome, and while I did a bit of testing on various URLs, I’m sure I missed something. Feel free to send updates or suggestions!
I recently sat in on a lecture for Professor Peter Szolovits’s Biomedical Computing course. The lecture was open to a greater audience, given the prominence of the speaker. As a non-expert, I found it to be a useful look into the current state of healthcare IT and the coming legislative and technical challenges facing the industry. My notes are below.
John Glaser, Ph.D.
Formerly CIO of Partners HealthCare/Brigham and Women’s Hospital
Currently CEO of Siemens Health Services
Free advice: get a healthcare proxy and power of attorney set up. Easier to do now than have someone else guess later how you want to live/die.
Why does Health IT suck?
- Not for lack of money put into the system
- Not for lack of smart people working on the problem
- Insurance companies/patients pay per volume (per birth, per surgery, etc.) almost regardless of quality
- Boards of directors are very conservative. Don’t want to be the board that made an IT decision that made a huge hospital fail.
U.S. Numbers to give context
- 60% of hospitals have 100 or fewer beds
- Of 500K physicians, the majority work in 2-3-doctor practices (not IT-savvy, or modestly interested in IT at best)
- 2/3 of medical decisions are heuristic/not scientific, and many have a difficult-to-verify outcome
- volatile knowledge domain: 700k academic articles have come out in the last (decade?)
- 20% of doctors are a decade away from retirement, so perhaps newer doctors will bring IT mentality with them?
- PricewaterhouseCoopers survey: 58% of (independent?) doctors considering quitting, selling practice, or joining a larger practice
- various societies are discussing requirements: to become board (re-)certified (oncology, etc.), you have to show facility in technology.
Health IT Services
- huge fragmentation: the 3rd largest health IT services company has 7% of market. if they win every open engagement from now until (?), they will have 11% of the market.
- lots of players: 300 electronic health record providers in US, 25% exit and 25% enter per year
- engagements are long: bringing up a new hospital IT system takes 2-4 years. from the moment you decide to change IT systems, you will continue to use your old one for the next 4-5 years as you transition.
Affordable Care Act (ACA)
- costs are projected to go up 26% in the next decade. ACA stipulates that govt. will compensate 12% more in the next decade: providers have to make up the difference.
- to incentivize quality care, govt. will hold on to 10% of payments until you prove treatment was effective (hard to define).
- currently, for a single procedure (e.g., total hip replacement) you might get 12 different bills (e.g., surgeon, materials, anesthesia). new system: govt. pays a single provider one bill, with a fixed amount. incentivizes a holistic view.
- risk: hospitals go out of business. potential future doctors don’t enter medicine. doctors “fire” bad patients to make their numbers look good.
- doctors in small practices joining larger networks to avoid managing the ACA requirements.
- single payment requirement will cause groups of doctors to more tightly collaborate (contractually).
- ACA is rolling out over the course of a decade.
- need to be careful, since some patients will be handled by old rules, and some by new rules. so do you not apply decision support-based treatment to patients on old rules, or just do fee-for-service? lots of mental overhead for doctors.
Fixed fee challenges
- paying a fixed amount per treatment doesn’t work for everything. Diabetes is sort of predictable, but a trauma might range from a broken toe to severe burns over 90% of the body.
- (Adam’s note) perhaps large pools of insured patients will smooth over the individual spikes in cost of care.
Information Technology needs
- systems must span inpatient, outpatient, emergency care, rehab
- need revenue cycle + contract management system that handles continuum of care. this is complex: medicare + blue cross might pay diff amounts for “good” diabetes treatment, and “good” might be defined differently.
- systems should manage individuals and populations: how did all 100 people w/ respiratory problems do last month? which patients strayed from predicted path? what should have happened? why/why not?
- sophisticated business intelligence + analysis: predict who will get worse, etc.
- interoperability w/ different providers
- rules+workflow engines to ensure followups/next steps/help primary care doctors coordinate care, manage exceptions, follow up properly. also allow this in collaborative care environment w/ lots of specialists checking in and out.
- high availability + low total cost of ownership
- engage patients
New challenges for primary care physicians (PCPs)
- At the moment, PCP moves from one patient to the next every 15 minutes, sees 100s of lab results per day
- Only 25% of data from specialists comes back to a PCP within a month
- In future, PCPs will be responsible for closing the loop on specialists, tests, etc., with more accountability, but still be given just as much or more information, with similar delays. Workflow management systems are key here!
Interesting technical challenges
- filtering patient care notes: 10s of pages of patient care history. No doctor can read them all before seeing patient. how to help doctors find relevant notes across different doctors, annotations, etc.
- supporting collaboration between multiple providers
- parsing notes to remind providers. e.g., “Ask about patient’s daughter next time.”
- cleaning up conflicting medical record data: was it type 1 or type 2 diabetes? was it a heart attack, or just a test for one?
(Cross-posted on the Crowd Research Blog)
There has been a lot of excitement in the database community about crowdsourced databases. At first blush, it sounds like databases are yet another application area for crowdsourcing: if you have data in a database, a crowd can help you process it in ways that machines cannot. This view of crowd-powered databases misses the point. The real benefit of thinking of human computation as a databases problem is that it helps you manage complex crowdsourced workflows.
Many crowd-powered tasks require complicated workflows in order to be effective, as we see in algorithms like Soylent’s Find-Fix-Verify. These custom workflows require thousands of lines of code to ferry data between services like MTurk and business logic in several languages (1000-2000 lines in the case of Find-Fix-Verify!). If we provide workflow developers with a set of common operators, like filters and sorts, and a declarative interface to combine those operators, such as SQL or PigLatin, we can eliminate the painful crowdsourced plumbing code while improving a shared set of operators as a community.
This is not an academic argument: Find-Fix-Verify can be implemented with a FOREACH-FOREACH-SORT in PigLatin, or a SELECT-SELECT-ORDERBY in SQL, resulting in several tens of lines of code. All told, we can get a two-order-of-magnitude reduction in workflow code. The task at hand is thus to build best-of-breed reusable operators for crowd-powered workflows. In our VLDB 2012 paper, we look at two such operators: sorts and joins.
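To make the code-size argument concrete, here is a toy sketch of what a declarative crowd workflow might look like. The operator names and the simulated crowd are my own hypothetical illustrations, not Qurk’s actual interface; a real system would post tasks to MTurk inside each operator.

```python
import random

def ask_crowd(question, options, votes=3):
    """Stand-in for posting a HIT to MTurk and majority-voting the answers."""
    answers = [random.choice(options) for _ in range(votes)]  # simulated workers
    return max(set(answers), key=answers.count)

def crowd_filter(items, question):
    """FIND: a crowd-powered SELECT."""
    return [x for x in items
            if ask_crowd(question.format(x), ["yes", "no"]) == "yes"]

def crowd_map(items, question):
    """FIX: a crowd-powered FOREACH that asks for a rewrite of each item."""
    return [ask_crowd(question.format(x), [x, x.replace("very ", "")])
            for x in items]

def crowd_sort(items, question):
    """VERIFY: a crowd-powered ORDER BY, here via one rating task per item."""
    return sorted(items,
                  key=lambda x: int(ask_crowd(question.format(x), ["1", "2", "3"])))

# Find-Fix-Verify in three composed operators instead of thousands of lines.
wordy = crowd_filter(["a very very long phrase", "a crisp phrase"],
                     "Is '{}' wordy?")
fixes = crowd_map(wordy, "Shorten '{}' without changing its meaning.")
best = crowd_sort(fixes, "Rate the clarity of '{}' from 1 (bad) to 3 (good).")
```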
Human-powered sorts are everywhere. When you submit a product review with a 5-star rating, you’re implicitly contributing a datapoint to a large product ranking algorithm. In addition to rating-based sorts, there are also comparison-based ones, where a user is asked to compare two or more items along some axis. For a particularly cute example of comparison-based sorting, see The Cutest, a site that identifies the cutest animals in the world by getting pairwise comparisons from heartwarmed visitors.
The two sort-input methods can be found in the image below. On the left, users compare five squares by size. On the right, users rate each square on a scale from one to seven by size after seeing 10 random examples.
In our paper, we show that comparisons provide accurate rankings, but are expensive: they require a number of comparisons quadratic in the number of items being compared. Rating is quite accurate and cheaper than comparing: it’s linear in the number of items rated. We also propose a hybrid of the two that balances cost and accuracy: we first rate all items, and then compare items with similar ratings.
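Here is a minimal sketch of the hybrid idea, assuming hypothetical `crowd_rate` and `crowd_compare` task functions; the paper’s actual algorithm is more careful about budgeting and window selection.

```python
def hybrid_sort(items, crowd_rate, crowd_compare, window=1):
    """Rate everything (linear cost), then spend comparison tasks only
    where ratings are too close to trust. The single refinement pass is
    a simplification of the paper's algorithm."""
    ratings = {x: crowd_rate(x) for x in items}  # one rating task per item
    ranked = sorted(items, key=lambda x: ratings[x])
    for i in range(len(ranked) - 1):
        a, b = ranked[i], ranked[i + 1]
        # Near-ties get a (more expensive) pairwise comparison task.
        if abs(ratings[a] - ratings[b]) <= window and crowd_compare(a, b) > 0:
            ranked[i], ranked[i + 1] = b, a
    return ranked

# Simulated tasks: rate sizes on a 1-7 scale; compare(a, b) > 0 means a
# should rank after b.
sizes = [3, 9, 4, 1, 7]
print(hybrid_sort(sizes, crowd_rate=lambda x: min(7, x),
                  crowd_compare=lambda a, b: a - b))  # -> [1, 3, 4, 7, 9]
```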
These techniques can reduce the cost of sorting a list of items by 2-10x. Human-powered sorts are valuable for a variety of tasks. Want to know which animals are most dangerous? From least to most dangerous, a crowd of Turkers said:
flower, ant, grasshopper, rock, bee, turkey, dolphin, parrot, baboon, rat, Tasmanian devil, lemur, camel, octopus, dog, eagle, elephant seal, skunk, hippo, hyena, great white shark, moose, Komodo dragon, wolf, tiger, whale, panther
The different sort implementations highlight another benefit of declaratively defined workflows. A system like Qurk can take user constraints into account (linear costs? quadratic costs? something in between?) and identify a comparison-, rating-, or hybrid-based sort implementation to meet their needs.
Human-powered joins are equally pervasive. Entity resolution has captured the attention of researchers and practitioners for decades. In finance, is IBM the same as International Business Machines? Intelligence analysis runs into a combinatorial explosion in the number of ways to spell Muammar Muhammad Abu Minyar al-Gaddafi’s name. And most importantly, how can I tell whether Justin Timberlake is the person in the image I’m looking at?
We explored three interfaces for solving the celebrity matching problem (and more broadly, the human-powered entity resolution problem). The first is a simple join interface, asking users if the same celebrity is displayed in two images. The second employs batching, asking Turkers to match several pairs of celebrity images. The third interface employs more complex batching by asking Turkers to match celebrities arrayed in two columns.
As we batch more pairs per task, cost goes down, but so does Turker accuracy. Still, we found that we can achieve around a 10x cost reduction without significantly sacrificing result quality. We can achieve even more savings by first having workers identify features of the celebrities, so that we don’t, for example, try to match up males with females.
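A toy sketch of the two ideas, with hypothetical `crowd_feature` and `crowd_match_batch` task functions standing in for real HITs:

```python
from itertools import product

def crowd_join(left, right, crowd_feature, crowd_match_batch, batch_size=5):
    # Step 1: one cheap feature task per image (e.g., "male or female?")
    # prunes pairs no human should ever need to look at.
    feature = {img: crowd_feature(img) for img in left + right}
    candidates = [(a, b) for a, b in product(left, right)
                  if feature[a] == feature[b]]
    # Step 2: batch the surviving pairs so one task resolves several of them.
    matches = []
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i + batch_size]
        answers = crowd_match_batch(batch)  # one HIT, batch_size yes/no answers
        matches.extend(pair for pair, same in zip(batch, answers) if same)
    return matches
```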
We’re Not Done Yet
We now have insight into how to effectively design two important human-powered operators, sorts and joins. There are two directions to go from here: bring in learning models, and design more reusable operators.
Our paper shows how to achieve more than order-of-magnitude reductions in join and sort costs, but this is often not enough. To further reduce costs while maintaining accuracy, we’re looking at training machine learning classifiers to perform simple join and sort tasks, like determining that Cambridge Brewing Co. is likely the same as Cambridge Brewing Company. We’ll still need humans to handle the really tricky work, like figuring out which of the phone numbers for the brewing company is the right one.
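As a rough sketch of that machine/human routing, the snippet below uses Python’s difflib as a stand-in for a trained classifier; the 0.9/0.5 thresholds are illustrative, not from the paper.

```python
from difflib import SequenceMatcher

def route_pair(a, b, hi=0.9, lo=0.5):
    """Let the machine decide easy pairs; send ambiguous ones to the crowd."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score >= hi:
        return "match"          # machine-confident: no crowd task needed
    if score <= lo:
        return "no match"       # machine-confident rejection
    return "ask the crowd"      # the genuinely tricky middle ground

print(route_pair("Cambridge Brewing Co.", "Cambridge Brewing Company"))
```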
Sorts and joins aren’t the only reusable operators we can implement. Next up: human-powered aggregates. In groups, humans are surprisingly accurate at estimating quantities (jelly beans in a jar, anyone?). We’re building an operator that takes advantage of this ability to count with a crowd.
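A crowd-powered COUNT might be as simple as aggregating many noisy guesses; a median (rather than a mean) keeps one wild guess from dominating. A toy sketch:

```python
import statistics

def crowd_count(guesses):
    """Estimate a quantity from independent worker guesses."""
    return statistics.median(guesses)

# One wildly-off worker barely moves the estimate.
print(crowd_count([450, 520, 480, 3000, 410, 505]))  # -> 492.5
```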
Over the past month, a petition has been circulating asking the Obama administration to bring graduate student stipends back to their pre-1986 tax-exempt status. I urge you to not sign this petition, as it is misguided and damaging to our image. If you believe graduate student researchers are more valuable than their compensation, then demand more compensation, not a tax loophole.
First, the caveat: I can only speak for the STEM fields. In these fields, a combination of government, corporate, and university grants support research-track students in the lab and classroom. This compensation usually comes in the form of full tuition coverage and a stipend in the range of $1500-$2500 per month, and sometimes includes health coverage.
Our stipends put our yearly income at $18,000-$30,000. Compare this to a poverty threshold of $18,530 for a family of three, or $29,990 for a family of six. In computer science, you can double your income with a summer internship, placing you above the median 2009 household income. At first glance, it seems like we are reasonably compensated, even before taking into account the education, advising, networking, and travel opportunities our life decision has earned us.
Of course, the argument in the petition is more nuanced than one of unreasonable taxation. The petition speaks to the value of our “innovative, cutting-edge thinking” relative to “bankers, lobbyists, or hedge-fund managers.” The comparison is certainly timely, but it sweeps under the rug other valuable fields, like nursing or carpentry. Both of these fields pay more than the median graduate student in STEM earns, but, optimistically, we are in a position of higher upward mobility once we graduate.
Perhaps a better comparison is what we could earn if we had not chosen graduate studies. With a B.S. in Computer Science, my undergraduate colleagues at large technology firms and startups are earning 3-5x what I earn through my stipend. Am I more valuable as a researcher than I would be in their shoes? This seems like a good conversation to have.
This discussion is one of relative value. In the absolute sense, graduate students in STEM are not poor, and should pay taxes in whatever tax bracket we fall into. Perhaps we’re not compensated enough for what we provide to society. I would like to believe that STEM’s contribution to social and economic development is significant. If we’re seeing a dearth of STEM researchers and our value to society is high, the market failure should be remedied by the government: not with yet another tax break, but with an increase in the number of stipends or the amount of compensation distributed per researcher.
STEM is under attack. We should elevate its image by discussing how valuable our work is, not by asking for pity. Demand what you are worth, but remember how lucky you are.
There is little I like more than a fine cheese and fresh-baked bread. Still, to fill the rest of my day without expanding my waistline, I go for a mix of databases and human-computer interaction. That’s why I was excited to see several database-oriented papers presented at CHI. While many papers contained some amount of data, I’ll stick to the three that are unquestionably of interest to the databases community.
The first paper was for the social scientist in all of us. Amy Voida, Ellie Harmon, and Ban Al-Ani presented Homebrew Databases: Complexities of Everyday Information Management in Nonprofit Organizations. Nonprofits are arguably some of the most difficult database users to design for. They have minimal resources, rarely employ full-time technical staff, and solve non-core problems as they show up. This practice leads to homebrew, just-functional-enough solutions to many data management problems. The authors provide an interesting qualitative study of how nonprofits manage volunteer demographic and contact information, describing the homebrewed, often fractured collections of data stored in several locations. Reading this paper, I couldn’t help but think of how perfectly these homebrewed databases resemble Franklin, Halevy, and Maier’s dataspaces.
Sean Kandel presented Wrangler, a project he’s been working on with Andreas Paepcke, Joe Hellerstein, and Jeff Heer. Wrangler lets users specify transformations on datasets by example. Each time a user shows Wrangler how to modify a record (or line of unstructured text), Wrangler updates its rank-ordered list of potential transformations that could have led to this modification. Wrangler borrows concepts such as interactive transformation languages from Vijayshankar Raman and Joe Hellerstein’s Potter’s Wheel. Its interface has a taste of David Huynh and Stefano Mazzocchi’s Refine as well as Huynh’s Potluck. Wrangler’s novelty comes in combining the interfaces and transformation languages with an inference and ranking engine. Since Wrangler is hosted, it is also capable of learning which transformations users prefer and improving its rankings over time!
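As a rough illustration of that inference loop (my toy, not Wrangler’s actual algorithm or transformation language), one can generate candidate transforms and keep those consistent with a user’s example edit:

```python
# Toy candidate transforms; Wrangler's real language (following Potter's
# Wheel) is far richer, and its ranking learns from user behavior.
CANDIDATES = {
    "strip whitespace": str.strip,
    "lowercase": str.lower,
    "title-case": str.title,
    "strip + title-case": lambda s: s.strip().title(),
    "collapse spaces": lambda s: " ".join(s.split()),
}

def consistent_transforms(before, after):
    """Keep the candidate transforms that explain the user's example edit."""
    return [name for name, fn in CANDIDATES.items() if fn(before) == after]

# The user fixes one record; the system infers which transforms to suggest.
print(consistent_transforms("  adam marcus  ", "Adam Marcus"))
# -> ['strip + title-case']
```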
The last slot goes to our own Eirik Bakke, who presented Related Worksheets along with David Karger and Rob Miller. Related Worksheets makes foreign-key references first-class citizens in the world of spreadsheets. Just as spreadsheets secretly made every office worker capable of maintaining a single-user, single-table relational database, Eirik has secretly enabled those workers to make references between spreadsheets without having to program. While adding foreign-key references to a spreadsheet requires only a simple user interface modification, the implications for how to display multi-valued cells in the spreadsheet are significant. Read the paper to see Eirik’s hierarchical solution to this problem!
Keep it up, data nerds! Soon we’ll be able to start a data community at CHI!
I often find a link through a feed reader or Twitter and want to know if there is an HN thread discussing it. This happens more often now that I follow @newsyc20 on Twitter rather than visiting the HN website directly: I batch up a bunch of stories to read at once, and lose track of which HN thread pointed to each page.
The WWHNS bookmarklet, when clicked, looks the current page up in Ronnie Roller’s wonderful HN API, and adds a link at the top right of the current page to any existing HN comment threads.
I tested it in Chrome and Firefox. Let me know if it works in other browsers.
Caveat: This bookmarklet will work for links you followed by way of HN or another source which replicates it. It may not work if you arrived at a page from a source outside of HN, since that link might be slightly different from the one posted to HN.
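For the curious, here is roughly what the lookup does, sketched in Python rather than the bookmarklet’s JavaScript. The endpoint below is the present-day Algolia HN Search API standing in for Ronnie Roller’s API, and the exact-match filter reflects the caveat above.

```python
import json
import urllib.parse
import urllib.request

def hn_threads(page_url):
    """Return HN comment-thread links for stories submitted with this exact URL."""
    query = urllib.parse.urlencode({"query": page_url, "tags": "story"})
    with urllib.request.urlopen(
            "https://hn.algolia.com/api/v1/search?" + query) as resp:
        hits = json.load(resp)["hits"]
    # A slightly different submitted URL (tracking params, http vs https)
    # will be missed, per the caveat above.
    return ["https://news.ycombinator.com/item?id=" + h["objectID"]
            for h in hits if h.get("url") == page_url]

print(hn_threads("http://www.paulgraham.com/avg.html"))
```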
To use WWHNS
- Drag this WWHNS bookmarklet to your bookmark toolbar.
- For any page, click on the WWHNS button in your bookmark toolbar.
Or, to install it from the source:
- Check out the WWHNS git repository.
- Open wwhns.html in a browser.
- Copy the WWHNS link to your bookmark toolbar.
- For any page, click on the WWHNS button in your bookmark toolbar.
To edit the bookmarklet
- Fork this git repository.
- Open wwhns.html in a browser.
- Copy the WWHNS link to your bookmark toolbar.
- For any page, click on the WWHNS button in your bookmark toolbar.
- Push the changes back to me. I’d love to see what you do with it!
When articles were published in hard-copy newspapers, reader response was left to the ultimate in asynchronous communication: letters to the editor for differences of opinion, and corrections when a mistake was discovered. As brick-and-mortar newspapers moved into the digital realm, the static publishing model initially stuck, albeit with an easier method for correcting mistakes.
When we digest a story published by a large newspaper, be it in digital or dead-tree form, we assign the strongest signal to the content of the article. In exchange for giving the journalist our full attention, we expect that the news organization has put significant effort into researching, writing, and editing the story. Newspapers rarely put uncurated content front and center, partly because they trust their own vetted content more, and partly to justify the expense that went into producing it.
Along the path from single-source hard copies of stories to the everyone-gets-a-voice world of microblogging, we got comments. Blogs frequently display discussion threads following each entry, and sites such as Digg, Reddit, and Hacker News provide us with another forum to chat with the community about articles we find interesting.
Many blogging outfits, including those run by organizations as large as The New York Times, now employ comment systems beyond their original purpose as a meta-article discussion medium. One often finds blog entries that end with prompts such as “What has your experience been? Let us know in the comments!” or “If you know more about this late-breaking story, leave a comment below!” In the same way that live-blogging has taken blog entries from static documents to up-to-the-minute ones, comments sometimes become a necessary part of the stories they adjoin. Slashdot sometimes takes this one step further: when a topic of wide interest appears, the editors open an essentially content-free story with the express purpose of leaving a place for comments.
If comments can sometimes be the content of a story, then why are they always relegated to the bottom of the story? What is the user interface for displaying articles where readers are assigned a reporter’s role? How do we assign prominence to the most informative fragments of story and user-generated content? Flickr and Facebook have figured this out to some extent—you can annotate photos and witness the result in situ. Youtube lets users embed annotations in videos. How do we apply this concept to text media? What tools already do this, and what ideas do you have for improvement? Leave your comment below!
This past week at WWW 2010 served up quite the spread of Twitter papers. Topics included systems, novel uses, and studies of tweets and users. I’ve made an attempt to provide a taste of each paper/presentation I experienced. Feel free to comment if I missed anything!
At the Web Science conference on Monday, we saw two presentations on Twitter. Devin Gaffney presented a paper entitled #iranElection: quantifying online activism. Devin collected around 766,000 tweets from nearly 74,000 users around the time of the #iranElection. He first showed that there was a spike in signups around the time that #iranElection became a trending topic, with the seeming purpose of adding #iranElection updates to the tweet stream. A retweet analysis showed that as more users became interested in the #iranElection, users with influence (as measured by follower count or retweet count) lost influence relative to the entirety of relevant users.
Panagiotis Metaxas presented the other paper at the Web Science workshop, entitled From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search. In this work, the authors studied the recent Massachusetts special election between Scott Brown and Martha Coakley through the lens of Twitter. Metaxas presented the notion of twitterbombing, where, similar to googlebombing, sneaky twitter users abuse various mechanisms to appear in the relatively prominent real-time search results that search engines have recently added. 32% of tweets were repeated several times by the same account, presumably in an attempt to increase the ranking of their tweets’ content by naive real-time ranking algorithms. The authors described how they identified Republicans and Democrats through follower and retweet analysis, and showed an example where twitterbombing was used to lead searchers to a page designed to dissuade voters from voting for Coakley.
Next, at the Linked Data Workshop, Joshua Shinavier presented Real-time #SemanticWeb in <=140 characters. Joshua’s goal is to extract structured data from tweets using his TwitLogic system. Instead of extracting data from all tweets, Joshua’s system looks for tweets that follow a format called nanotations, identified by hashtags. It is unclear what sort of adoption this format will see, but the value of such annotations (as well as those in the up-and-coming Twitter annotations system) is that, with precise structure, the extracted data can be a far richer source for the linked data web.
Moving into the main WWW 2010 presentation tracks, Yi Chang and his colleagues at Yahoo! presented Time is of the essence: Improving Recency Ranking using Twitter Data, which studied how to turn relevant and popular tweets into search results. Crawling for real-time content is typically resource-intensive for search engines, which have to frequently revisit many sources of such content, and it burdens the servers of content providers that are recrawled too frequently. The authors of this paper studied how to use streaming Twitter results to discover URLs and avoid having to actively recrawl for new content. In a 5-hour sampling of tweets, Chang and team found 1M URLs; after cleaning these results to remove spam, adult content, and self-promoting tweets, approximately 5.9% of the URLs remained. From here, the authors describe how various features, including tweet content, retweets, and social network topology, can be used to rank the discovered URLs. Finally, the authors found that they can use the tweet text describing a URL to index discovered URLs, in much the same way that search engines traditionally use the anchor text linking to a webpage.
Next, Haewoon Kwak presented What is Twitter: Social Network or News Media? One impressive contribution of this work is the large dataset that the authors collected, featuring 41.7M user profiles, 1.47B following relations, 4262 trending topics, and 106M tweets mentioning these trending topics. The authors presented some interesting network structure statistics. Twitter has an asymmetric following model, and only 22% of user pairings are symmetric, compared to a symmetric follower rate of around 70-85% on other asymmetric social networks. This should not suggest that Twitter is more a news medium than a social network. For example, Twitter may be a different medium to different users, and the high rate of updates might discourage users from following everyone that follows them. Other interesting factoids presented by the authors included that 96% of retweet trees are of height 1, 35% of retweets occur within 10 minutes of the original tweet, and 55% occur within 1 hour.
Finally, Takeshi Sakaki presented Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors, which described how to build an earthquake detection and location system with the tweet stream as its input. The authors passed all tweets containing the term 'earthquake' or 'shaking' to a classifier, and showed which features of tweets helped classify positive and negative instances of tweets relating to an earthquake. They then built a temporal model to identify when earthquake-positive tweets strayed from the norm. Finally, they compared several spatial methods for using geotagged tweets to determine the epicenter of an earthquake. The authors point out one weakness in their location logic: their algorithms have a hard time identifying an accurate location for earthquakes with an epicenter in the ocean.
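To give a flavor of the temporal side of such a pipeline (my simplification, not the authors’ actual model), one might keyword-filter tweets, bucket them per minute, and flag a burst when a bucket strays far above the recent baseline:

```python
import statistics

def is_quake_burst(counts_per_minute, threshold_sigmas=3):
    """Flag a burst when the latest minute far exceeds the recent baseline."""
    *history, latest = counts_per_minute
    mean = statistics.mean(history)
    std = statistics.stdev(history) or 1.0
    return latest > mean + threshold_sigmas * std

minute_counts = [2, 3, 1, 2, 4, 2, 55]  # 'earthquake'/'shaking' tweets per minute
print(is_quake_burst(minute_counts))    # -> True
```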
A summary of the Twitter analysis papers would not be complete without a hat tip to danah boyd, who gave a wonderful keynote touching on the intersection of big data analysis and privacy. boyd pushed researchers who handle content outside of the context in which it was created, such as a message sent to a friend or a tweet directed at a tight social network, to treat that data ethically. Doing her talk justice would take a blog post of its own, so I will just mention one point that danah made toward the beginning of her talk. When confronted with a large dataset, big data hackers sometimes equate aggregate statistics with facts that need not be backed or understood by social models, and sometimes fail to think about the limitations of their population samples. Little things matter: sampling 5% of all tweets biases toward users that tweet more frequently. Similarly, sampling 5% of Twitter accounts does not properly account for people with multiple accounts/identities or for lurkers with no accounts. Social streams are a wonderful data source for data scientists, but we should ford the streams responsibly.