Comments as content: The medium hinders the message

When articles were published in hard-copy newspapers, reader response was left to the ultimate in asynchronous communication: letters to the editor for differences of opinion, and corrections when a mistake was discovered. As brick-and-mortar newspapers moved into the digital realm, the static publishing model initially stuck, albeit with an easier method for correcting mistakes.

When we digest a story published by a large newspaper, be it in digital or dead tree form, we assign the strongest signal to the content of the article. In exchange for giving the journalist our full attention, we expect that the news organization has put significant effort into researching, writing, and editing the story. Newspapers rarely put uncurated content front-and-center, in part because they trust their own vetted content more, and in part to justify the expense that went into their refined content.

Along the path from single-source hard copies of stories to the everyone-gets-a-voice world of microblogging, we got comments. Blogs frequently display discussion threads following each entry, and sites such as Digg, Reddit, and Hacker News provide us with another forum to chat with the community about articles we find interesting.

Many blogging outfits, including those run by organizations as large as the New York Times, now employ comment systems beyond their purpose as a meta-article discussion medium. One often finds blog entries that end with prompts such as “What has your experience been? Let us know in the comments!” or “If you know more about this late-breaking story, leave a comment below!” In the same way that live-blogging has taken blog entries from static entities to up-to-the-minute documents, comments sometimes become a necessary part of the stories they adjoin. Slashdot sometimes takes this one step further: when a topic of wide interest appears, the editors open an essentially content-free story with the express purpose of leaving a place for comments.

If comments can sometimes be the content of a story, then why are they always relegated to the bottom of the story? What is the user interface for displaying articles where readers are assigned a reporter’s role? How do we assign prominence to the most informative fragments of story and user-generated content? Flickr and Facebook have figured this out to some extent: you can annotate photos and witness the result in situ. YouTube lets users embed annotations in videos. How do we apply this concept to text media? What tools already do this, and what ideas do you have for improvement? Leave your comment below!

Twitter Papers at the WWW 2010 Conference

This past week at WWW 2010 has resulted in quite the spread of Twitter papers.  Topics included systems, novel uses, and studies of tweets and users.  I’ve made an attempt to provide a taste of each paper/presentation I experienced.  Feel free to comment if I missed anything!

At the web science conference on Monday, we saw two presentations on Twitter.  Devin Gaffney presented a paper entitled #iranElection: quantifying online activism.  Devin collected around 766,000 tweets from nearly 74,000 users around the time of the #iranElection.  He first showed that there was a spike in signups around the time that #iranElection became a trending topic, with the seeming purpose of adding #iranElection updates to the tweet stream.  A retweet analysis showed that as more users became interested in the #iranElection, users with influence (as measured by follower count or retweet count) lost influence relative to the entirety of relevant users.

Panagiotis Metaxas presented the other paper at the Web Science workshop, entitled From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search.  In this work, the authors studied the recent Massachusetts special election between Scott Brown and Martha Coakley through the lens of Twitter.  Metaxas presented the notion of twitterbombing, where, similar to googlebombing, sneaky twitter users abuse various mechanisms to appear in the relatively prominent real-time search results that search engines have recently added.  32% of tweets were repeated several times by the same account, presumably in an attempt to increase the ranking of their tweets’ content by naive real-time ranking algorithms.  The authors described how they identified Republicans and Democrats through follower and retweet analysis, and showed an example where twitterbombing was used to lead searchers to a page designed to dissuade voters from voting for Coakley.

Next, at the Linked Data Workshop, Joshua Shinavier presented Real-time #SemanticWeb in <=140 characters.  Joshua’s goal is to extract structured data from tweets using his TwitLogic system.  Instead of extracting data from all tweets, Joshua’s system looks for tweets that follow a format called nanotations and are identified by hashtags.  It is unclear what sort of adoption this format will see, but the value in such annotations (as well as those in the up-and-coming Twitter annotation system) is that with precise structure, the extracted data can be a far richer data source for the linked data web.

Moving into the main WWW2010 presentation tracks, Yi Chang and his colleagues at Yahoo! presented Time is of the essence: Improving Recency Ranking using Twitter Data, which studied how to turn relevant and popular tweets into search results.  Crawling for real-time content is typically resource-intensive for search engines, which have to revisit many sources of such content frequently, and it burdens the servers of the content providers if they are recrawled too often.  The authors of this paper studied how to use streaming Twitter results to discover URLs and avoid having to actively recrawl for new content.  In a 5-hour sampling of tweets, Chang and team found 1M URLs, and after cleaning these results to avoid spam, adult content, or self-promoting tweets, approximately 5.9% of the URLs remained.  From here, the authors describe how various features including tweet content, retweets, and social network topology can be used to rank the discovered URLs.  Finally, the authors found that they can use the tweet text describing a URL in much the same way that search engines traditionally use the contents of anchor text linking to a webpage to index discovered URLs.

Next, Haewoon Kwak presented What is Twitter: Social Network or News Media? One impressive contribution of this work is the large dataset that the authors collected, featuring 41.7M user profiles, 1.47B following relations, 4262 trending topics, and 106M tweets mentioning these trending topics.  The authors presented some interesting network structure statistics.  Twitter has an asymmetric following model, and only 22% of user pairings are symmetric, compared to a symmetric follower rate of around 70-85% on other asymmetric social networks.  This should not suggest that Twitter is more a news medium than a social network.  For example, Twitter may be a different medium to different users, and the high rate of updates might discourage users from following everyone that follows them.  Other interesting factoids presented by the authors included that 96% of retweet trees are of height 1, 35% of retweets occur within 10 minutes of the original tweet, and 55% occur within 1 hour.
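
As a rough illustration of the symmetry statistic above, here is a small sketch that computes the share of connected user pairs who follow each other.  The edge list and the function name are made up for illustration; this is not the authors’ code, and the paper’s exact definition of a user pairing may differ.

```python
def reciprocity(follows):
    """Fraction of connected user pairs that follow each other.

    `follows` is an iterable of (follower, followee) tuples.
    """
    # Ignore self-follows and duplicate edges.
    edges = {(a, b) for a, b in follows if a != b}
    # Collapse each directed edge into an unordered pair of users.
    pairs = {frozenset(edge) for edge in edges}
    mutual = 0
    for pair in pairs:
        a, b = tuple(pair)
        if (a, b) in edges and (b, a) in edges:
            mutual += 1
    return mutual / len(pairs) if pairs else 0.0


# Hypothetical toy graph: only ann and bob follow each other.
example = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("dan", "ann")]
print(round(reciprocity(example), 3))
```

On the toy graph, one of the three connected pairs is mutual, so the function returns one third.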

Finally, Takeshi Sakaki presented Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors, which described how to build an earthquake detection and location system with the tweetstream as its input.  The authors passed all tweets with the term ‘earthquake’ or ‘shaking’ to a classifier, and showed which features of tweets helped classify positive and negative instances of tweets relating to an earthquake.  They then built a temporal model to identify when the earthquake-positive tweets strayed from the norm.  Finally, they compared several spatial methods for using geotagged tweets to determine the epicenter of an earthquake.  The authors point out one weakness in their location logic: their algorithms have a hard time identifying an accurate location of earthquakes which have an epicenter in the ocean.
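
The idea of spotting when earthquake-positive tweets stray from the norm can be sketched with a much cruder rule than the authors’ temporal model: flag any time window whose tweet count exceeds the baseline mean by a few standard deviations.  The counts, the window layout, and the threshold `k` below are all assumptions for illustration, and this simple threshold is a stand-in, not the method from the paper.

```python
from statistics import mean, stdev


def detect_bursts(counts, baseline_len, k=3.0):
    """Return indices of windows whose count exceeds baseline mean + k * stdev.

    `counts` holds the number of earthquake-positive tweets per time window;
    the first `baseline_len` windows are treated as normal traffic.
    """
    baseline = counts[:baseline_len]
    threshold = mean(baseline) + k * stdev(baseline)
    return [i for i in range(baseline_len, len(counts)) if counts[i] > threshold]


# Hypothetical per-minute counts: quiet traffic, then a sudden spike.
counts = [2, 3, 1, 2, 4, 3, 2, 3, 40, 55, 4]
print(detect_bursts(counts, baseline_len=8))
```

With these made-up counts, only the two spike windows (indices 8 and 9) are flagged.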

A summary of the Twitter analysis papers would not be complete without a hat tip to danah boyd, who gave a wonderful keynote which touched on the intersection of big data analysis and privacy.  boyd pushed researchers with access to content outside of the context in which it was created, such as a message sent to a friend or a tweet directed at a tight social network, to be ethical with their handling of that data.  Doing her talk justice would take a blog post of its own, so I will just mention one point that danah made toward the beginning of her talk.  When confronted with a large dataset, big data hackers sometimes equate aggregate statistics to facts that need not be backed or understood by social models, and sometimes fail to think about the limitations of their population samples.  Little things matter: sampling 5% of all tweets biases toward users that tweet more frequently.  Similarly, sampling 5% of Twitter accounts does not properly account for people with multiple accounts/identities or lurkers with no accounts.  Social streams are a wonderful data source for data scientists, but we should ford the streams responsibly.
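
To make the tweet-sampling bias concrete, here is a tiny sketch.  It assumes each tweet is kept independently with probability 5% (my simplification, not something boyd specified), in which case a user with n tweets shows up in the sample with probability 1 - 0.95^n.

```python
def inclusion_probability(num_tweets, rate=0.05):
    """Chance that at least one of a user's tweets lands in the sample,
    assuming each tweet is kept independently with probability `rate`."""
    return 1.0 - (1.0 - rate) ** num_tweets


for n in (1, 10, 100):
    print(n, round(inclusion_probability(n), 3))
```

Under that assumption, someone who tweeted once has a 5% chance of appearing in the sample, while someone who tweeted 100 times is almost certain to appear, so per-user statistics computed from the sample lean heavily toward prolific users.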

256 colors in your xterm!

Have you ever used emacs or vim from the command line in GNU/Linux and been offended by the horrible color scheme you saw? I’m embarrassed to admit that I’ve been through tons of vim color schemes and have never been able to understand why the colors did not show up as desired.

Yang’s blog post has changed my life. See here for more notes on which color schemes work well for vim. I’ve been enjoying wombat256.

On Ubuntu on my laptop, I added “export TERM=xterm-256color” to the end of my “~/.bashrc”. You will have to open a new terminal to see the results after saving your bashrc, or type “source ~/.bashrc” in your current terminal if you’re too antsy.
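
If you want to check that the change took effect, a quick sketch like the one below prints the 256-color palette using the standard “48;5;N” background escape sequence.  The function names are my own; if your terminal only supports 8 or 16 colors, most cells will look wrong or identical.

```python
def color_cell(n):
    """Two spaces drawn with the background set to xterm palette index n."""
    return f"\x1b[48;5;{n}m  \x1b[0m"


def palette(width=16):
    """Build the full 256-color palette as rows of `width` cells."""
    rows = []
    for start in range(0, 256, width):
        rows.append("".join(color_cell(n) for n in range(start, start + width)))
    return "\n".join(rows)


if __name__ == "__main__":
    print(palette())
```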

Notes from NoSQL Live Boston 2010

I was excited to sit in on NoSQL Live Boston today. Thanks to 10gen for hosting and all of the speakers for putting the time in!

The NoSQL community is an interesting one. I was pleased to see Dwight Merriman suggest that the community look past its awkward and misleading name when figuring out how to define itself, and instead find other commonalities: removing the emphasis on joins, focusing on horizontal scalability, and building out non-relational data models. There was no consistent theme to the community, which is the point—if the era of one-size-fits-all solutions is over, you will be hard-pressed to easily define the movement.

There are some special treats in here: numbers from deployments at LinkedIn, StumbleUpon, and Twitter. Take a look at the “Scaling w/ NoSQL” panel for that.

Without further ado, here are my notes. I’ve found that these are often filled with typographical errors, so anything you offer up as a fix would be greatly appreciated.

Dwight Merriman (CEO at 10gen)

  • What is NoSQL? Look beyond the name, we’re stuck with it
    • No joins in-app + light transactional semantics => horizontal scalability
  • Questions to ask of different offerings
    • What is your data model?
    • What is your consistency model?
    • What are the functional differences in operations, querying, etc.?

Tim Anglade (CTO at GemKitty)

General idea: what’s the future of NoSQL, how to get more adoption

European nosql conference—nosqleu.com

We’re currently at the stage where makers took prototypes from academia and turned them into hobby projects. Startups adopted them as side-projects. Now VC-backed developers work on nosql dbs full-time.

How to see adoption+support going forward?

  • more development
  • marketing
  • education—currently it’s easy to only learn about relational model, sql. Need that model for nosql ecosystem.
  • certification—because RDBMSs are more standardized, certifications are easier, so it’s easier to hire junior developers and engage lots of vendors.
  • branding—“SQL” currently gets more searches than “mysql,” “oracle,” or “sql server.” For “NoSQL” it’s the opposite—fewer searches than for the nosql products (mongo, redis, couchdb, etc.)
  • references—need a nosql book of reference. What is a document-oriented store, or Key/value (K/V) store?
  • industry group that interfaces w/ industry, academia, and education. Runs conferences.

Panel: Scaling w/ NoSQL

Speakers

  • Mark Atwood—Gear6 (memcached support)
  • Alex Feinberg—Voldemort developer at LinkedIn—simple get/put/delete K/V store.
  • Doug Judd—Hypertable (bigtable implementation in c++, on top of HDFS)
  • Ryan King—Twitter, which is replacing MySQL w/ Cassandra
  • Ryan Rawson—HBase developer at Stumbleupon

How does each system scale?

  • memcached—completely shared-nothing. Facebook has several TBs of memory pooled in memcached.
  • Voldemort—based on dynamo’s consistency model, so completely symmetric. Largest LinkedIn cluster does 7k req/sec on the client, which results in 14k req/sec on each server in the pool (read quorum = 2).
  • Cassandra—also symmetric based on dynamo’s consistency model (eventual consistency) but uses bigtable data model. Twitter currently stores all data in mysql, but cassandra is repeating all writes and they are currently testing reads live but not displaying the read results to users. Biggest benefit of scale—memcache helps scale reads, but cassandra, due to eventual consistency, scales writes nicely.
  • Hypertable is based on HDFS, which is replicated, highly scalable.
  • HBase is also based on HDFS. ZooKeeper helps master nodes run elections and lets new nodes take over tablets easily.

What’s life like for operations folks?

  • Voldemort—easy to deploy, no single point of failure, and backups are built in through replication. Workload is predictable—no long-running queries, unlike SQL. Thus, little babysitting.
  • Cassandra—currently, the engineering team are the operations folks. Numerous failure cases don’t require waking someone up at night. Cluster-managed membership/routing. Upgrade==rolling restart. mysql/memcache is harder to add capacity (data consistency issues)/change configs.
  • Hypertable is easy to deploy, but hadoop’s HDFS is harder to get right.
  • Rawson points out that HBase is easy, and handles drastically varying row sizes. Config changes require rsyncing configs to all machines, which doesn’t scale well. King points out that some combination of capistrano and ‘murder,’ a Twitter open source project, helps deploy config changes.
  • Feinberg points out that configuration is always more of a dark art once data on disk > data in memory.

Use cases/deployment in the wild

  • memcache—lots of use cases, but most popular are sessions and prebaked HTML
  • voldemort—scalable writes, UI settings, e-mail system, rate-limiting, shopping cart (original dynamo paper use case).
  • cassandra—King points out Twitter’s use is simple. Some stats: 45 nodes, 9-10B rows. Avg tweets/sec: 600-700 (50M daily) with highly skewed spikes. When deployed, reads will need to be 100k/sec against the cluster.
  • Hypertable—only listed analytics workloads: virus sightings (500M events/day), spam classification, site access statistics. No online/live query access stories.
  • HBase—at stumbleupon, they have several uses. Numbers: 12K requests/sec in production cluster of 15 nodes. Reqs/sec are uneven—some nodes have 100’s reqs/sec, others have up to 2.4K reqs/sec. Separate cluster to handle analytics: 20 machines handle 7M rows/sec in mapreduce. If they double to 40 machines, they see ~15M rows/sec in mapreduce, so linear scaleup in mapreduce. Bulkloads on this cluster result in ~1M rows/sec insert speeds, and add up to 700GB compressed on disk.
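
The linear-scaleup claim in the StumbleUpon numbers is easy to sanity-check with a few lines of arithmetic.  This only uses the figures quoted above (keeping in mind that the 15M figure was approximate); the helper name is mine.

```python
def per_machine(rows_per_sec, machines):
    """Average mapreduce throughput contributed by each machine."""
    return rows_per_sec / machines


small = per_machine(7_000_000, 20)   # 20-machine analytics cluster
large = per_machine(15_000_000, 40)  # after doubling to 40 machines
efficiency = large / small           # 1.0 would be perfectly linear

print(small, large, round(efficiency, 2))
```

Per-machine throughput stays in the same ballpark (350K vs. roughly 375K rows/sec), which is what linear scaleup looks like.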

Random discussion

  • HDFS not designed for lots of random reads (Yahoo! experiment). But HBase does aggressive caching to avoid hitting disk, so in practice the HBase/Hypertable folks don’t think it’s a big issue.
  • Hypertable vs. HBase: Judd says c++ makes for more efficient memory and cpu footprint. Rawson points out that as an Apache Software Foundation project, HBase benefits from lots of contrib projects, such as Hive/Pig query languages.
  • Voldemort is persistent key-value store, whereas memcache is not persistent.
  • CAP theorem mini-argument (yay!). For the uninitiated: (C)onsistency, (A)vailability, (P)artition tolerance. Brewer’s theorem (proved later by Lynch et al.) is that you can only have two of these in your system. In any real networked system w/ packet loss, Partitions are a given, so tradeoff is between Consistency (will you be able to read the value you just wrote) and Availability (will parts of the system become unavailable/see latency spikes if a node dies). Voldemort/Cassandra==eventual consistency in exchange for high availability. Bigtable copies (HBase/Hypertable) give up on availability guarantee in exchange for straightforward consistency. King points out that in a real system with caching layers and dropped messages, you have to handle read repairs and inconsistency anyway, so embrace it in favor of high availability! Feinberg points out that Voldemort (+ Cassandra) let you demand strong consistency by forcing reads to come from consensus group anyway, so you get what you want.
  • BigTable folks point out that range scans suck in all other systems. Automatic partitioning (at least in Cassandra) needs some love as well. memcache has no good notion of dynamic scalability—add more nodes and you might get some inconsistency.
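
Feinberg’s point about demanding strong consistency is usually explained with the quorum rule from the Dynamo paper: with N replicas, a read quorum of R, and a write quorum of W, every read overlaps the latest write when R + W > N.  The sketch below is my own illustration of that rule, not code from Voldemort or Cassandra; it brute-forces every possible pair of quorums to confirm that the inequality is exactly the overlap condition.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """The rule of thumb: reads see the latest write when R + W > N."""
    return r + w > n


def always_intersect(n, r, w):
    """Brute force: does every read quorum share a node with every write quorum?"""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


# N=3, R=2, W=2 always overlaps; N=3, R=1, W=1 can miss the latest write.
print(quorums_overlap(3, 2, 2), always_intersect(3, 2, 2))
print(quorums_overlap(3, 1, 1), always_intersect(3, 1, 1))
```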

Panel: NoSQL in the Cloud

Participants

  • Benjamin Day—consultant speaking on behalf of MSFT Azure platform
  • Jonathan Ellis—works for rackspace, is lead of Cassandra development for apache project.
  • Adam Kocoloski—cloudant, works on CouchDB cloud hosting offering.
  • Daniel Rinehart—Allurent—startup which is using AWS for a lot, specifically SimpleDB.

Offerings

  • Azure—offers SQL in the cloud (hosted sqlserver). Also offers blob/queue/KV cloud store.
  • Rackspace offers cloud sites (like appengine for php)—handles multitenancy in mysql (host multiple users on a mysql install). Also offers cloud files (like Amazon S3) and cloud servers (like Amazon AWS but with dedicated physical hard drives per cloud server).
  • Cloudant—CouchDB cloud hosting. Have developed their own sharding layer on top of CouchDB.
  • SimpleDB—nice since Amazon handles scale for you. Recently added consistent reads, conditional puts (had previously relied on eventual consistency).

Why do cloud + nosql relate?

  • Ellis was contrarian here—cloud is nice, obviously. But for databases, cloud is good if you are storing something really small (and want to provision fraction of a machine), or to handle spiky traffic. But for data, you usually don’t see spikes like you see web traffic—if you have 20TB today, you will only have more than that tomorrow. So provisioning data storage in the cloud is silly. For things you’re sure you will have to store, provision real hardware that’s optimal for your setup, and keep adding hardware as you grow. Use cloud for more stateless, spiky things.

Blah blah blah—argument about whether there should be a standard “nosql storage” API to protect developers storing their stuff in proprietary services in the cloud. Probably unrealistic. To protect yourself, use an open software offering, and self-host or go with hosting solution that uses open offering.

Interesting discussion on disaster recovery. Since you’ve outsourced operations to the cloud, should you just trust the provider w/ disaster recovery? People kept talking about buses driving through datacenters or fires happening. What about the simpler problem: a developer drops your entire DB. Need to protect w/ backups no matter where you host.

Lightning Talks

Alan Hoffman—CEO of Cloudant: Queries + Views in CouchDB

  • Each JSON doc in CouchDB has a pkey. View engine lets you build indices.
  • Indices are defined by map/reduce functions that emit the key/value pairs for indexing.
  • Common pitfalls: don’t use tempviews—those are just for prototypes. Don’t do filtering or reordering in reduce tasks—just aggregate here.
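
To show the shape of a map/reduce-defined index, here is a plain-Python sketch.  It is not CouchDB’s API (CouchDB views are normally written in JavaScript); the helper names and documents are invented.  It follows the advice above: the map function emits key/value pairs and the reduce step only aggregates.

```python
from collections import defaultdict


def build_view(docs, map_fn):
    """Run map_fn over every document and return emitted rows sorted by key."""
    rows = []
    for doc in docs:
        rows.extend(map_fn(doc))
    return sorted(rows, key=lambda row: row[0])


def reduce_view(rows, reduce_fn):
    """Group rows by key and aggregate each group's values."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


def by_author(doc):
    # Map function: emit one (key, value) pair per document that has an author.
    if "author" in doc:
        yield doc["author"], 1


docs = [
    {"_id": "1", "author": "ann"},
    {"_id": "2", "author": "bob"},
    {"_id": "3", "author": "ann"},
    {"_id": "4"},
]
rows = build_view(docs, by_author)
counts = reduce_view(rows, sum)
print(rows, counts)
```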

Les Hill—Hashrocket: MongoDoc

  • Built Object-Document Mapper for MongoDB in Ruby. Like ORM (object relational mapper), but for document stores like MongoDB.
  • Not activerecord, but similar
  • Current MongoDB driver for Ruby looks like JSON, whereas MongoDoc (his ODM) looks like more traditional ORMs.

Flinn Mueller—Tokyo Cabinet

  • Cares about speed more than scale. TC mmaps disk for speed.
  • TC has several backing stores
    • Hash store for simple Key/Value
    • B+Tree for range scans/duplicate keys
    • Fixed-length DB for fast access
    • Table store—stores tuples/documents. Supports queries w/ conditions, orders, limits, union/intersect/diff.
  • Says he uses TC like memcache++, and as a queue, atomic counter, and tag cloud. Still uses relational DBs to store data—nosql is more of a utility.

Jim Wilson—Vistaprint: Full-Stack Javascript

  • Impedance mismatch between business logic (usually object-oriented)/data model (usually relational), and business logic (usually php)/client-side (javascript).
  • Wants to live in a world where Javascript runs on DB (JSON document stores), server (V8, node.js, etc), and client (the way it is now)

James Williams—BT/Ribbit: MongoDB on Groovy

  • NoSQL is post-relational, schemaless. Groovy is post-java, allows metaprogramming.
  • Makes Mongo + Groovy a good match in philosophy.

Panel: Schema design and document-oriented DBs

(I missed most of this)

Panelists

  • Paul Davis—would store patient history in a document store, but would still trust RDBMSs for mission-critical medical applications where strong consistency is required. Represented CouchDB.
  • Eliot Horowitz—10gen (MongoDB)—advocates doing joins in-app, since Mongo doesn’t have foreign key constraints anyway
  • Bryan Fink—Basho (Riak)—similar lack of foreign key constraints, also no indices.

Indexing

  • Riak has no indexes. Use SOLR/Lucene to do full-text index of documents (wtf?)
  • MongoDB—indices similar to mysql indices. Even have geospatial indexing.
  • CouchDB does indices by way of mapreduce, as described above.

Foreign Keys for relations

  • Riak supports links (references) but doesn’t enforce them and doesn’t clean up links to deleted items.
  • MongoDB—DB references exist to refer to other documents. No constant validation, and deleted objects result in broken links (avoids multisite transactions).

How to lock down schemas/do migrations

  • Riak—keep version number in the document. Modify schema on read. i.e. handle it in the application.
  • MongoDB—similar process, but indices break when schema changes. Will add rename functionality soon.
  • CouchDB—like everything in couchdb, use mapreduce.
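
The “keep a version number in the document and modify the schema on read” approach might look something like this sketch.  The field names, the `schema_version` key, and the v1-to-v2 change are all hypothetical, and none of this is tied to a particular database’s API.

```python
CURRENT_VERSION = 2


def upgrade_v1_to_v2(doc):
    # Hypothetical change: v1 stored a single "name"; v2 splits it in two.
    first, _, last = doc["name"].partition(" ")
    upgraded = {k: v for k, v in doc.items() if k != "name"}
    upgraded.update(first_name=first, last_name=last, schema_version=2)
    return upgraded


# Maps a document's version to the function that upgrades it by one version.
UPGRADES = {1: upgrade_v1_to_v2}


def read_document(doc):
    """Upgrade a stored document to the current schema as it is read."""
    doc = dict(doc)  # never mutate the stored copy
    while doc.get("schema_version", 1) < CURRENT_VERSION:
        doc = UPGRADES[doc.get("schema_version", 1)](doc)
    return doc


print(read_document({"name": "Ada Lovelace", "schema_version": 1}))
```

Old documents stay untouched on disk until the application chooses to write the upgraded version back.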

Horizontal partitioning

  • Riak—add machines. consistent hashing + read repair on failure. mapreduces run locally, so adding machines adds cpu power for mapreduce tasks.
  • MongoDB—shard on range. currently has master-slave replication, but soon replica sets.
  • CouchDB—no support—build your own partitioning/hashing scheme in front of couchdb installs.
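
For anyone who has not seen consistent hashing before, here is a minimal ring.  It is a generic textbook sketch, not Riak’s implementation; the class name, the use of SHA-256, and the number of virtual nodes are my own choices.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each machine owns several points on the ring to even out load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        """Return the node owning the first ring position at or after the key."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))
```

The useful property is that adding a machine only moves keys onto the new machine; keys never shuffle between the existing ones.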

Consistency

  • Riak—eventual consistency using vector clocks. In some modes, can get back multiple versions which had conflicts to be solved by application. Like in dynamo paper, claims this is actually easy to solve in most cases.
  • MongoDB—single master for any shard, so 100% consistent.
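
Vector clocks are easier to grasp with a small example.  The sketch below is a generic version of the idea from the Dynamo paper, not Riak’s code: each clock is a dict of node name to counter, and two versions conflict when neither clock dominates the other.

```python
def compare(a, b):
    """Compare two vector clocks (dicts of node -> counter).

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Clock for a value that reconciles both versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


print(compare({"x": 1}, {"x": 2}))
print(compare({"x": 2, "y": 1}, {"x": 1, "y": 2}))
print(merge({"x": 2, "y": 1}, {"x": 1, "y": 2}))
```

The “concurrent” case is the one where the store hands both versions back and the application has to decide how to combine them.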

Panel: Evolution of a Graph Data structure from research to production

Panelists

  • Boris Iordanov—HypergraphDB (stores hypergraphs)
  • Peter Neubauer—Neo4J (stores graphs w/ directed edges and typed nodes that have properties).
  • Sandro Hawke—Represented W3C RDF model. Some think of it as a directed graph w/ URIs for source nodes and edges, and URIs or literal values for destination nodes.

How do you do schemas

  • HypergraphDB has schema support at low level and package-level
  • Neo4J doesn’t—leaves it to higher-level packages
  • RDF—datatypes borrowed from XML, and RDFS or OWL for schemas

Implementation details

  • HyperGraphDB offers ACID guarantees and may soon offer MVCC.
  • Neo4J gives ACID guarantees. Constant-time traversals result in 1000-2000 traversals/msec (I think this is dubious on a DFS of a graph—each traversal would be a disk seek—what benchmark gave this?) Update: this was for in-memory or cached graphs.
  • RDF is a standard, but in general query languages such as SPARQL are less about node traversal and more about graph pattern matching.

Query Model

  • HypergraphDB—supports BFS/DFS or “more complicated” traversals. Query language for graph pattern finding as well. Supports SPARQL via a Sail, but no XPath since it’s not expressive enough for hypergraphs.
  • Neo4J—traversals by way of objects that are represented as Java objects. Also supports SPARQL, XPath.
  • RDF—lots of libraries in each language for raw graph access. Also, if you prefer, use SPARQL for declarative queries.
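
The difference between traversal-style and pattern-matching-style queries can be sketched over a toy list of (subject, predicate, object) triples.  The data and function names are invented, and this is plain Python rather than any of the panelists’ APIs or SPARQL itself.

```python
from collections import deque

TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("alice", "worksAt", "acme"),
]


def match(triples, s=None, p=None, o=None):
    """Pattern matching: None acts as a wildcard for that position."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def bfs(triples, start, predicate):
    """Traversal: follow `predicate` edges breadth-first from `start`."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _, _, nxt in match(triples, s=node, p=predicate):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


print(match(TRIPLES, p="worksAt"))
print(bfs(TRIPLES, "alice", "knows"))
```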

Who uses it

  • HypergraphDB—released 2 months ago. Used for search in Miami-Dade County. Knowledgebase for NLP/information extraction project.
  • Neo4J—opensourced in 2007, lots of interest in social networking, recommendation engines, GIS/spatial indexing, activity streams, intelligence community.
  • RDF—defense/intelligence, then health/life sciences picked it up, and now govt. data (data.gov.uk is represented by a bunch of sparql endpoints). govt data demands standards!

Sharding—graphs are hard to slice.

Building a Social Data Commons

(cross-posted on the Haystack blog)

Inspired by Ted’s vision of what he’d like to see happen to data.gov, I decided to have a try at my hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where such steps were taken: a world in which government data was centralized, versioned, searchable, and accessible.

Now what? Given the large and growing pile of data we will optimistically uncover, we will run into new frustrations. People will claim that the published data formats are not the ones that their analysis tool requires. People will be overwhelmed by dataset size, not knowing where to start. People will unknowingly recreate someone else’s data-munging workflows on the way to repeating analyses of the same data. People will become the next bottleneck if data ever ceases to be one.

There’s no one answer to the concerns listed above because everyone has a different goal for the data. To handle these issues, we will need more than a place to find up-to-date datasets: we will also need a place where it is easy for people to share ideas and strategies for tackling data. We will need a social data commons.

Whereas blogs and wikis help report findings, steps, and missteps, a social data commons can be the place to go to “talk shop” about the available data. Even if people post their solutions using decentralized means, there will be benefit to pooling all of these resources in one place on the web. Here are some tools that will help the data-tinkerers get things done:

  • Data-munging war stories. The first stage in data analysis is often long and frustrating. One must digest the dataset in the form they received it, and transform, clean, and filter out the subset that they wish to analyze, visualize, or otherwise present. The workflow differs for each dataset and application, but to the extent that people can share tools and instructions for processing each dataset, these should be written up in the form of recipes for baking the data.

  • Crowdsourced analysis. Datasets can be overwhelming. While many exploration tasks are easily automated, it is often easiest to leave certain tasks (e.g., “Find the interesting pictures”) to humans. Mechanical Turk gives us a hint at what this might look like, and the Guardian provides a wonderful example of crowdsourced public data analysis in action.

  • Current uses showcases. To spark competition, avoid duplicating work, and inspire follow-on projects, visitors should see a showcase of the current uses of each dataset. Aside from links to sites built around a dataset, the list can include embedded visualizations of finished work.

  • Analysis wishlists. Given that data released by a government reaches more than just programmers, there will be more people with ideas than people who can implement the ideas. People with ideas should be given an outlet, and passers-by should be asked to vote on these ideas to help data geeks with some free cycles discover the most interesting unimplemented projects.

  • Data wishlists. If an agency were to dedicate resources to releasing another dataset, which one is in highest demand? As Ted mentioned, governments should let demand drive delivery.

  • Forums. No set of tools will encompass all use cases for social data analysis. A discussion forum can lead to the formation of interest groups while serving as a catch-all for needs not served by the list above.

The US government might hit a few bumps trying to implement some of these social features. For example, a conflict of interest might arise if the showcase of uses of a dataset includes a site critical of the current administration. Having the executive branch ban spam or abusive comments on a forum draws concern over limitations of free speech. These details are not roadblocks, but they do signal that we can’t expect a social overlay to spring out of data.gov per se. If we want these features, we may have to build and manage them on a third-party site.

I’m sure there’s more to the social data commons than I listed here. What did I miss, and where can we seek further inspiration?

Thanks to Ted for reading the first version of this entry.

FeedMe Data on People’s Sharing Habits!

Michael and I have a blog post over on the Haystack blog describing some work we’ve been doing in studying how people share news and blog posts with each other through e-mail and other social networking tools.

So far, we’re posting findings from surveys, but soon we’ll give you a double-whammy: we plan to announce what we found through a user study of FeedMe, a tool designed to help with sharing in Google Reader, and to release FeedMe for everyone to use.

Stay tuned. Until then, take a look at the technical report for the juicy details!

Startup Bootcamp at MIT

I spent the day at startup bootcamp, which was an excellent opportunity to hear from startup founders and VCs about startups. As someone without a lot of experience in this field, it was wonderful to hear a bunch of different (at times conflicting) viewpoints on the startup world. I took notes on all but two of the talks, and put them here to read if you are interested.

Adam Smith: Xobni

How to execute well

  • Hire good people who are self-sufficient
  • External deadlines are useful and force progress
  • Keep onus on action
  • Expect/hope 1/4 of products to fail
  • Focus on the user, no other diversions
  • Run experiments, but use them helpfully—don’t scratch an itch just for kicks
  • Hire outside contractors for specific skills—designers, etc.
  • Look for partners/employees that are /unlike/ you, to round out weaknesses

East vs. West

  • More enterprise software on east coast
  • More resources, relationships for startups on west coast

Recommended Reading

  • Book: Founders at Work
  • Book: High Stakes No Prisoners
  • Paul Graham’s Essays

Alexis Ohanian: Reddit

“Only your mom will use your website”

Look for organic traffic, no need for advertising or PR firms.

  • Seek evangelists willing to tell others to use it
  • Before being bought by Conde Nast, budget for advertising was $200 for stickers
  • Wufoo sends handwritten notes to longtime customers to add a personal touch
  • Reddit gives free t-shirts to visitors to HQ and gives them a story to go along with the shirt
  • Zappos delivers in 1 day even though they promise 5-7 day delivery: promise low, deliver high

“Facilitate Serendipity”

Being good is insurance for when you f*** up

Cofounders: pick someone you trust w/ everything

  • non-programmer founder’s job is to make sure programming cofounder has nothing to think about except programming
  • once you launch, non-programmer deals w/ all user/business issues

Ken Zolot

“Who Cares?” Make sure your message answers who cares about your technology, not just how cool your technology is

Don’t wait for something to come your way

  • Do something
  • Sense what reaction is
  • Reflect
  • Change what you do

Dan Theobald: Vecna

People matter, so businesses should be ethical, socially responsible, etc.

Vecna pays employees to do community service for 10% of their work time

Don’t hire people unless you are ready to become a manager

Don’t do bargain-hunting in hiring

  • Hire smartest people you can find, and take good care of them
  • Good employee = 10 × medium employee = 100 × bad employee

Avoid “other people’s money”—outside investors are only good if you absolutely need them to scale

Read “On the Folly of Rewarding for A, while Hoping for B” by Stephen Kerr

Vecna does profit-sharing based on share of points awarded by colleagues each quarter. This makes each employee feel invested in the company

Vecna bootstrapped from IT consulting

  • Retained intellectual property in contracts
  • Have customers fit into your roadmap for a more generically applicable product

Kyle Vogt: Justin.tv

Startup Productivity Hacks

  • Buy catered lunch—more efficient employees if they don’t spend an hour walking around looking for a place to eat. For 10 employees, catering lunch saves you hiring an 11th employee. (didn’t mention how much it costs)

  • Use Google Apps for stuff you don’t want to administer yourself

  • Use data-driven (A/B) development

  • Use hiring screeners (applicants send you code solving some problem) to test skills programmers must have—cheap way to ensure they can write code.

  • Keep on-site job interviews short, and push more discovery into hiring screeners

  • Don’t hire a PR firm: they just guide you to do the work on your own anyway. If you need the skills, hire in-house marketing staff

  • Put one person on funding, pitches, finding investors

  • Work from home (if you live with cofounders)

  • Use hosted servers (don’t build technology yourself unless you’re a hardware company or want to dedicate MANY resources to it)

  • Listen to your users /the right way/. Users are good at loudly identifying what doesn’t work. Can learn a lot from watching users use your tool.

  • Buy a .com domain name

  • Be transparent w/ employees about data regarding company. They will trust you more, and even suggest ideas.

  • Don’t outsource core technology

  • Hire specialists when necessary

  • Hire people smarter than you, don’t settle

  • Have a plan for actually making $ (eventually)

Angus Davis

Youngest employee at Netscape

Started Tellme, which was bought by MSFT

Rest of talk was decided by a poll: presented ~16 ideas, and had users text msg./tweet the topics to expand on!

When you hit dicey situations as a founder, remove emotional attachment by thinking of yourself as a large shareholder and making the decision that way.

Hemant Taneja: Venture Capitalist at General Catalyst Partners

GCP is in Harvard square, $1.7B, 75 companies (consumer, I.T., Energy)

Touched on social media/mobile applications, which are some of their focuses

Interesting idea on energy: there are two problems

  • clean energy
  • cheap energy as demand grows

See 1500 entrepreneurs/year, invest in 2 or 3. Startups they love

  • Brilliant founders
  • Solve hard problems
  • Address very large markets
  • Are ahead of the curve
  • Are capital efficient

Should you raise VC? Only if you have to

  • Webapps are easy to bootstrap. Still helpful to get funding, but only once you have data on what your growth needs will be
  • But energy experiments require serious capital to even try an idea

Choosing a VC?

  • Smart, good listener that’s not going to be swayed by biases
  • Easy-going
  • Has bandwidth to help—not on too many boards
  • Transparent
  • Have a relevant network
  • VC firm has stable funding

What is a good termsheet?

  • Enough capital for significant milestone you are funding
  • legally simple—doesn’t restrict you too much in the future
  • Balance ownership for VC, founder, and management—don’t worry about specific values, just everyone’s vision in the coming years
  • VC is not focused heavily on the downside/insuring themself
  • Board of directors has domain experience

Angels are good for funding the consumer internet market, but not necessarily for bigger investments such as the energy market.

Dharmesh Shah: Hubspot, OnStartups.Com

Likes bootstrapping, fast liquidity.

First-time startup founders: maximize odds of seeing a modest return

Your idea stinks, so don’t make it better in a void—launch early so you can get the better idea

Outbound messaging is about pushing messages out to millions and hoping they will stick with someone. No longer successful, as people are good at filtering out extraneous messages

Inbound messaging (Hubspot) is about pulling people in. It’s cheaper and more effective for startups.

Tools for marketing

  • Google—focus on SEO, not adwords.

    • WebsiteGrader.com
    • Adwords is good for experimenting with popular terms, but dangerous because it hooks you on a cost model that may change immediately when someone else outbids you.
    • Pick keywords people will use, but not ones that are so popular that you’ll never make it into the top 10
    • Can rank high by high pagerank (authority—this is hard) or by providing relevant and accurate content.
    • Good content: make sure your title is meaningful. “ItsSoSoft” < “ItsSoSoft | iPhone blackjack application” < “iPhone blackjack application | ItsSoSoft”
  • Write a blog, not a business plan (no one reads business plans)

    • Build a following before building the product—buy domain name, start writing before you even start coding
    • Don’t be afraid to polarize
    • Blog from a personal perspective—what you’re learning by starting the company, etc. Don’t pretend to be a big organization if you’re not.
  • Facebook page for buying ads gives you the best demographic tool on the market: for a bunch of different demographic information, will dynamically tell you how many people on facebook fit those criteria.
  • Twitter is mainstream now, so if you prefer short form to long form, use it! TwitterGrader.com

Robin Chase: Zipcar

Everyone you come into contact with is a free consultant. Make sure you learn from people’s questions.

Intellectual Honesty—If you know your business model is broken, don’t blindly keep selling it, since you’re just hiding something that will grow into a bigger problem. Learn how to fix, change it, and suffer consequences earlier than later!

Start light—become a learning organization. Mistakes are OK, but not learning is not OK.

As a small organization, you are the stories you tell.

Goal #1: become sustainable/profitable. Get there with intellectual honesty, good team, good plan.

Don’t forget Luck: you get to where you are when preparation meets opportunity, but you won’t get lucky or unlucky all the time.

View on big problems—climate change is the biggest. See problems as business opportunities.

Partnerships with large car companies didn’t help, but since she was already in talks with them, she used it as a marketing opportunity with others: “I’ll talk to you after my meeting w/ Volkswagen!”

Aaron Swartz

Started infogami, merged w/ reddit

2 kinds of organisms:

  • r: make tons of children, and DNA will transfer by the numbers (cockroaches). e.g. delicious
  • K: have small # children, raise them carefully (humans). e.g. github

Hollywood launch: teaser, blow up the idea in people’s minds, and have a humongous launch

Swartz is skeptical of this approach: doesn’t mimic software

  • Kills you with traffic jump
  • Killer bug destroys you immediately
  • Hollywood judges movies by their initial launch…software is iterative!

So what’s the alternative? GMail launch

  • Have users from day 0: What’s the smallest usable piece?
  • while (users not happy) {fix/add features;}
  • As people are happy, let more in (using invite codes, or just opening up once enough people are happy)
  • Then, once everyone is happy, do a marketing launch (maybe)
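That loop can be sketched in Python. This is a toy illustration, not anything Gmail actually ran: `staged_launch`, `is_happy`, `fix`, and the per-user complaint counts are all made up for the example, and it assumes every round of fixes eventually makes people happier (otherwise the loop never ends).

```python
def staged_launch(waitlist, is_happy, fix, batch_size=2):
    """Let users in a few at a time, growing only while current users are happy.

    is_happy and fix are hypothetical callbacks supplied by the caller.
    Returns the admitted users and how many rounds of fixes were needed.
    """
    queue = list(waitlist)  # copy so the caller's list is untouched
    admitted = []
    fix_rounds = 0
    while True:
        unhappy = [user for user in admitted if not is_happy(user)]
        if unhappy:
            # while (users not happy) {fix/add features;}
            fix(unhappy)
            fix_rounds += 1
        elif queue:
            # Everyone inside is happy, so let the next batch in.
            admitted.extend(queue[:batch_size])
            del queue[:batch_size]
        else:
            # Nobody left to admit and nobody unhappy: ready for the
            # (optional) marketing launch.
            return admitted, fix_rounds


# Toy data: number of open complaints per user (made up).
complaints = {"ann": 1, "bob": 0, "cy": 2}


def is_happy(user):
    return complaints[user] == 0


def fix(unhappy_users):
    # Pretend each round of work resolves one complaint per unhappy user.
    for user in unhappy_users:
        complaints[user] -= 1


admitted, fix_rounds = staged_launch(["ann", "bob", "cy"], is_happy, fix)
print(admitted, fix_rounds)
```

The point of the structure is that new users only get in through the `elif` branch, so growth is gated on the happiness of the people already inside.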

Some stories

  • reddit: when they worked hard on features, no new users. when programmers went away or slacked off, traffic rose
  • recent project: he quit when traffic was flat. a few months later, looked at google analytics, saw spike in traffic

What to draw from this? Don’t assume that because you can measure something, you can control it. Sometimes, you need to give things time to spread, and hopefully if you strike it big, you can help the people following in your footsteps.

Announced boldprogressives.org—progressive change campaign committee

Avoid Subversion-Maintained Website Vulnerability

(Note: This vulnerability also exists for cvs and git-based repositories. Change these instructions appropriately.)

The news of this vulnerability came out a while back, but I spent the afternoon securing a few scarily exploitable sites, so I figured I’d reiterate.

If you store your website in subversion, you leave behind an “.svn” directory in each directory in version control. This directory contains copies of the files in version control, stored with extensions that may not protect them from being downloaded (e.g., site.com/file.php becomes site.com/.svn/text-base/file.php.svn-base). Since the copy no longer ends in .php, the server may hand over the raw source instead of executing it.
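If you want to see which URLs to try against your own site, the mapping in that example can be sketched in a few lines of Python. The function name `svn_base_url` is made up for this post, and it only covers the `.svn/text-base` layout described above; it builds the URL and leaves the actual request to you.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def svn_base_url(url: str) -> str:
    """Return the URL where Subversion's pristine copy of a page would sit.

    Assumes the per-directory .svn/text-base layout described above.
    """
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    exposed_path = posixpath.join(
        directory, ".svn", "text-base", filename + ".svn-base"
    )
    # Drop any query string or fragment; only the path matters here.
    return urlunsplit((parts.scheme, parts.netloc, exposed_path, "", ""))


print(svn_base_url("http://site.com/file.php"))
print(svn_base_url("http://site.com/admin/login.php"))
```

If fetching one of those URLs returns your source code, the site needs the fix.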

To fix this, put the following in your root .htaccess file (or something similar in httpd.conf) for Apache:

<Files ~ "\.svn">
    Order deny,allow
    Deny from all
</Files>

For nginx:

location ~ /\.svn {
    deny all;
}

Earn $30 for reading blogs!

Michael Bernstein and I are working on a project that we will release to the public soon. First, we want to make sure it does what we think it will, so we’re running a user study.

We’re looking for Google Reader users to try out a new extension that helps you share interesting items with people you know. E-mail the FeedMe team at feedme@csail.mit.edu to participate!

If you are: a Firefox user who uses Google Reader regularly (addicts are welcome!)
You can get: a $30 gift card for using our Google Reader extension at least every other day, for two weeks
We are: an MIT computer science research team
Dates of the study: preferably Tuesday, August 18 to Tuesday, September 1

E-mail feedme@csail.mit.edu if you’re interested. Follow your feeds while you help science!

NYT_Transformer and Data.gov: Your chance for a weekend hacking project!

The awesome developers at The New York Times’ Open blog have just posted about NYT_Transformer, a tool for converting between various data formats (XML, comma-separated files) and data storage mediums (flat files, databases).

This isn’t the first time such a conversion utility has been written (Babel comes to mind). NYT_Transformer has a few things going for it, however: it seems to be used in heavy production at The New York Times, and it allows you to convert between databases and flat files (nice touch!).

Written in PHP, the tool is geared toward web applications. An immediate thought, aside from batch jobs for various internal projects, would be to use the tool for a greater purpose: a data converter for data.gov.

While data.gov is a nice start at centralizing the directory of the U.S. government’s raw data feeds, it lacks a utility for converting between various data formats. Given that the Sunlight Foundation is currently running a competition to build tools on top of data.gov, this is the perfect opportunity to go meta and build a data.gov browser with automatic format conversion. That would help standardize the site a bit, and would be a nice signal for the folks at data.gov as to the file formats that people actually want for various datasets. If you’re interested in working on this project, let me know!
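To give a flavor of the kind of conversion involved, here is a minimal sketch of turning comma-separated data into XML. It uses Python's standard library rather than NYT_Transformer itself, and the function name, tag names, and sample data are all invented for illustration. It also assumes the CSV headers are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str, root_tag: str = "dataset",
               row_tag: str = "record") -> str:
    """Convert CSV text with a header row into a simple XML document.

    Assumes every column header is usable as an XML element name.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # One child element per column, named after the header.
            ET.SubElement(record, column).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample data, not a real data.gov feed.
sample = "state,population\nMA,6\nVT,1\n"
xml_text = csv_to_xml(sample)
print(xml_text)
```

A real converter would need to handle messy headers, encodings, and much larger files, but the core of each format pair is about this small.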