This past week at WWW 2010 produced quite a spread of Twitter papers.  Topics included systems, novel uses, and studies of tweets and users.  I’ve made an attempt to provide a taste of each paper/presentation I experienced.  Feel free to comment if I missed anything!

At the Web Science conference on Monday, we saw two presentations on Twitter.  Devin Gaffney presented a paper entitled #iranElection: quantifying online activism.  Devin collected around 766,000 tweets from nearly 74,000 users around the time of the #iranElection.  He first showed that there was a spike in signups around the time that #iranElection became a trending topic, seemingly for the purpose of adding #iranElection updates to the tweet stream.  A retweet analysis showed that as more users became interested in the #iranElection, users with influence (as measured by follower count or retweet count) lost influence relative to the entirety of relevant users.

Panagiotis Metaxas presented the other paper at the Web Science workshop, entitled From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search.  In this work, the authors studied the recent Massachusetts special election between Scott Brown and Martha Coakley through the lens of Twitter.  Metaxas presented the notion of twitterbombing, where, similar to googlebombing, sneaky Twitter users abuse various mechanisms to appear in the relatively prominent real-time search results that search engines have recently added.  32% of tweets were repeated several times by the same account, presumably in an attempt to boost the ranking of their content in naive real-time ranking algorithms.  The authors described how they identified Republicans and Democrats through follower and retweet analysis, and showed an example where twitterbombing was used to lead searchers to a page designed to dissuade voters from voting for Coakley.
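To make the repeated-tweet statistic concrete, here is a minimal sketch of how one might measure it.  This is my own illustration, not the authors' code: the `(account, text)` input format, the whitespace and case normalization, and the definition of "repeated" (the same account posting the same normalized text more than once) are all assumptions.

```python
from collections import Counter


def repeated_tweet_fraction(tweets):
    """Fraction of tweets whose (account, normalized text) pair occurs more than once.

    `tweets` is a list of (account, text) tuples; this format is hypothetical.
    """
    # Lowercase and collapse whitespace so trivially altered copies still match.
    keys = [(account, " ".join(text.lower().split())) for account, text in tweets]
    counts = Counter(keys)
    repeated = sum(1 for key in keys if counts[key] > 1)
    return repeated / len(keys) if keys else 0.0


sample = [
    ("alice", "Vote tomorrow!"),
    ("bob", "Check this out http://example.com"),
    ("bob", "check this out  http://example.com"),
    ("bob", "Check this out http://example.com"),
    ("carol", "Vote tomorrow!"),
]
print(repeated_tweet_fraction(sample))  # 0.6: only bob's three tweets count
```

Note that alice and carol posting the same text does not count here, since the statistic above is about repetition by the same account.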

Next, at the Linked Data Workshop, Joshua Shinavier presented Real-time #SemanticWeb in <=140 characters.  Joshua’s goal is to extract structured data from tweets using his TwitLogic system.  Instead of extracting data from all tweets, Joshua’s system looks for tweets that follow a format called nanotations and are identified by hashtags.  It is unclear what sort of adoption this format will see, but the value in such annotations (as well as those in the up-and-coming Twitter annotation system) is that with precise structure, the extracted data can be a far richer data source for the linked data web.

Moving into the main WWW2010 presentation tracks, Yi Chang and his colleagues at Yahoo! presented Time is of the essence: Improving Recency Ranking using Twitter Data, which studied how to turn relevant and popular tweets into search results.  Crawling for real-time content is typically resource-intensive for search engines, which have to frequently revisit many sources of such content, and it burdens the content providers’ servers if they are recrawled too frequently.  The authors of this paper studied how to use streaming Twitter results to discover URLs and avoid having to actively recrawl for new content.  In a 5-hour sample of tweets, Chang and team found 1M URLs, and after cleaning these results to avoid spam, adult content, or self-promoting tweets, approximately 5.9% of the URLs remained.  From here, the authors describe how various features including tweet content, retweets, and social network topology can be used to rank the discovered URLs.  Finally, the authors found that they can use the tweet text describing a URL in much the same way that search engines traditionally use the contents of anchor text linking to a webpage to index discovered URLs.
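The idea of treating tweet text like anchor text can be sketched in a few lines.  This is only an illustration under my own assumptions, not the Yahoo! pipeline: the regular expression, the `build_url_index` helper, and the trailing-punctuation cleanup are all hypothetical, and the spam filtering and ranking features described above are left out entirely.

```python
import re
from collections import defaultdict

# Deliberately simple URL matcher; real tweets would also need short-link expansion.
URL_PATTERN = re.compile(r"https?://\S+")


def build_url_index(tweets):
    """Map each discovered URL to the tweet texts (minus URLs) that mention it."""
    index = defaultdict(list)
    for text in tweets:
        urls = URL_PATTERN.findall(text)
        # Whatever remains after removing the URLs plays the role of anchor text.
        description = " ".join(URL_PATTERN.sub(" ", text).split())
        for url in urls:
            # Drop punctuation that commonly trails a URL at the end of a sentence.
            index[url.rstrip(".,!?)")].append(description)
    return dict(index)


tweets = [
    "Huge quake news http://example.com/a",
    "RT quake coverage here: http://example.com/a.",
    "no link here",
]
index = build_url_index(tweets)
print(index)
```

Each URL ends up with a small bag of human-written descriptions, which is what makes it indexable before a crawler ever fetches the page.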

Next, Haewoon Kwak presented What is Twitter: Social Network or News Media? One impressive contribution of this work is the large dataset that the authors collected, featuring 41.7M user profiles, 1.47B following relations, 4,262 trending topics, and 106M tweets mentioning these trending topics.  The authors presented some interesting network structure statistics.  Twitter has an asymmetric following model, and only 22% of user pairings are symmetric, compared to a symmetric follower rate of around 70-85% on other asymmetric social networks.  This should not be taken to suggest that Twitter is more a news medium than a social network.  For example, Twitter may be a different medium to different users, and the high rate of updates might discourage users from following everyone that follows them.  Other interesting factoids presented by the authors included that 96% of retweet trees are of height 1, 35% of retweets occur within 10 minutes of the original tweet, and 55% occur within 1 hour.
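For readers who want to compute a symmetry rate like the 22% figure on their own data, here is a minimal sketch.  It assumes one particular reading of "user pairings": among all pairs of users connected by at least one follow edge, the fraction where both follow each other.  The `reciprocity` function and the edge-list input format are my own illustration, not the authors' code.

```python
def reciprocity(follows):
    """Fraction of connected user pairs in which the follow goes both ways.

    `follows` is an iterable of (follower, followee) tuples (hypothetical format).
    """
    # Deduplicate edges and ignore self-follows.
    edges = {(a, b) for a, b in follows if a != b}
    # Every unordered pair with at least one edge between its members.
    pairs = {frozenset(edge) for edge in edges}
    # Unordered pairs where the reverse edge also exists.
    mutual = {frozenset((a, b)) for a, b in edges if (b, a) in edges}
    return len(mutual) / len(pairs) if pairs else 0.0


follows = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a"), ("c", "d")]
print(reciprocity(follows))  # 0.25: only the a/b pair is mutual out of 4 pairs
```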

Finally, Takeshi Sakaki presented Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors, which described how to build an earthquake detection and location system with the tweetstream as its input.  The authors passed all tweets with the term ‘earthquake’ or ‘shaking’ to a classifier, and showed which features of tweets helped classify positive and negative instances of tweets relating to an earthquake.  They then built a temporal model to identify when the earthquake-positive tweets strayed from the norm.  Finally, they compared several spatial methods for using geotagged tweets to determine the epicenter of an earthquake.  The authors point out one weakness in their location logic: their algorithms have a hard time identifying an accurate location of earthquakes which have an epicenter in the ocean.
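The three stages described above (filter, detect a deviation from the norm, locate) can be sketched with much simpler stand-ins than the ones in the paper.  To be clear about the assumptions: the keyword check below replaces the authors' trained classifier, the mean-plus-standard-deviations rule replaces their temporal model, and the coordinate-wise median replaces the spatial methods they compared.  All function names and the threshold are hypothetical.

```python
from statistics import mean, median, pstdev

KEYWORDS = ("earthquake", "shaking")


def is_candidate(text):
    """Stage 1 stand-in: keyword filter in place of a trained classifier."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def is_burst(baseline_counts, current_count, threshold=3.0):
    """Stage 2 stand-in: flag a window whose count is far above the baseline.

    The floor of 1.0 on the spread avoids firing on tiny fluctuations
    when the baseline is almost constant.
    """
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts)
    return current_count > mu + threshold * max(sigma, 1.0)


def estimate_location(coords):
    """Stage 3 stand-in: coordinate-wise median of (lat, lon) geotags.

    Ignores the antimeridian and Earth's curvature; fine only for a small region.
    """
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return median(lats), median(lons)


print(is_candidate("Whoa, EARTHQUAKE!"))
print(is_burst([2, 3, 1, 2, 2], 40))
print(estimate_location([(35.0, 139.0), (36.0, 140.0), (34.0, 141.0)]))
```

An estimator like this is pulled toward wherever the tweeting users are, which is one plausible reason offshore epicenters are hard to pin down from tweets alone.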

A summary of the Twitter analysis papers would not be complete without a hat tip to danah boyd, who gave a wonderful keynote which touched on the intersection of big data analysis and privacy.  boyd pushed researchers with access to content outside of the context in which it was created, such as a message sent to a friend or a tweet directed at a tight social network, to be ethical in their handling of that data.  Doing her talk justice would take a blog post of its own, so I will just mention one point that danah made toward the beginning of her talk.  When confronted with a large dataset, big data hackers sometimes equate aggregate statistics with facts that need not be backed or understood by social models, and sometimes fail to think about the limitations of their population samples.  Little things matter: sampling 5% of all tweets biases toward users that tweet more frequently.  Similarly, sampling 5% of Twitter accounts does not properly account for people with multiple accounts/identities or lurkers with no accounts.  Social streams are a wonderful data source for data scientists, but we should ford the streams responsibly.
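The tweet-sampling bias is easy to see with a little arithmetic.  The sketch below is my own illustration, not something from the keynote, and it assumes each tweet is kept independently with probability 5%.  Under that assumption, a user with n tweets shows up in the sample with probability 1 - 0.95^n.

```python
def inclusion_probability(tweet_count, sample_rate=0.05):
    """Chance that at least one of a user's tweets lands in the sample.

    Assumes each tweet is sampled independently at `sample_rate`.
    """
    return 1.0 - (1.0 - sample_rate) ** tweet_count


for n in (1, 10, 100):
    print(n, round(inclusion_probability(n), 3))
```

Someone who tweeted once appears 5% of the time, while someone who tweeted 100 times is almost certain to appear, so any per-user statistic computed from the sample leans toward the heavy tweeters.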