Building a Social Data Commons

(cross-posted on the Haystack blog)

Inspired by Ted’s vision of what he’d like to see happen to data.gov, I decided to write up my own hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where such steps were taken: a world in which government data was centralized, versioned, searchable, and accessible.

Now what? Given the large and growing pile of data we will optimistically uncover, we will run into new frustrations. People will claim that the published data formats are not the ones that their analysis tool requires. People will be overwhelmed by dataset size, not knowing where to start. People will unknowingly recreate someone else’s data-munging workflows on the way to repeating analyses of the same data. People will become the next bottleneck if data ever ceases to be one.

There’s no one answer to the concerns listed above because everyone has a different goal for the data. To handle these issues, we will need more than a place to find up-to-date datasets. We will also need a place where it is easy for people to share ideas and strategies for tackling data. We will need a social data commons.

Whereas blogs and wikis help report findings, steps, and missteps, a social data commons can be the place to go to “talk shop” about the available data. Even if people post their solutions using decentralized means, there will be benefit to pooling all of these resources in one place on the web. Here are some tools that will help the data-tinkerers get things done:

  • Data-munging war stories. The first stage in data analysis is often long and frustrating. One must digest the dataset in the form they received it, and transform, clean, and filter out the subset that they wish to analyze, visualize, or otherwise present. The workflow differs for each dataset and application, but to the extent that people can share tools and instructions for processing each dataset, these should be written up in the form of recipes for baking the data.

  • Crowdsourced analysis. Datasets can be overwhelming. While many exploration tasks are easily automated, it is often easiest to leave certain tasks (e.g., “Find the interesting pictures”) to humans. Mechanical Turk gives us a hint at what this might look like, and the Guardian provides a wonderful example of crowdsourced public data analysis in action.

  • Current uses showcases. To spark competition, avoid duplicating work, and inspire follow-on projects, visitors should see a showcase of the current uses of each dataset. Aside from links to sites built around a dataset, the list can include embedded visualizations of finished work.

  • Analysis wishlists. Given that data released by a government reaches more than just programmers, there will be more people with ideas than people who can implement the ideas. People with ideas should be given an outlet, and passers-by should be asked to vote on these ideas to help data geeks with some free cycles discover the most interesting unimplemented projects.

  • Data wishlists. If an agency were to dedicate resources to releasing another dataset, which one is in highest demand? As Ted mentioned, governments should let demand drive delivery.

  • Forums. No set of tools will encompass all use cases for social data analysis. A discussion forum can lead to the formation of interest groups while serving as a catch-all for needs not served by the list above.

The US government might hit a few bumps trying to implement some of these social features. For example, a conflict of interest might arise if the showcase of uses of a dataset includes a site critical of the current administration. Having the executive branch ban spam or abusive comments on a forum raises concerns about limits on free speech. These details are not roadblocks, but they do signal that we can’t expect a social overlay to spring out of data.gov per se. If we want these features, we may have to build and manage them on a third-party site.

I’m sure there’s more to the social data commons than I listed here. What did I miss, and where can we seek further inspiration?

Thanks to Ted for reading the first version of this entry.

FeedMe Data on People's Sharing Habits!

Michael and I have a blog post over on the Haystack Blog describing some work we’ve been doing in studying how people share news and blog posts with each other through e-mail and other social networking tools.

So far, we’re posting findings from surveys, but soon we’ll give you a double-whammy: we plan on announcing what we found through a user study of a tool called FeedMe designed to help with sharing in Google Reader, and release FeedMe for everyone to use.

Stay tuned. Until then, take a look at the technical report for the juicy details!

Startup Bootcamp at MIT

I spent the day at startup bootcamp, which was an excellent opportunity to hear from startup founders and VCs about startups. As someone without a lot of experience in this field, it was wonderful to hear a bunch of different (at times conflicting) viewpoints on the startup world. I took notes on all but two of the talks, and put them here to read if you are interested.

Adam Smith: Xobni

How to execute well

  • Hire good people who are self-sufficient
  • External deadlines are useful and force progress
  • Keep onus on action
  • Expect/hope 1/4 of products to fail
  • Focus on the user, no other diversions
  • Run experiments, but use them helpfully—don’t scratch an itch just for kicks
  • Hire outside contractors for specific skills—designers, etc.
  • Look for partners/employees that are /unlike/ you, to round out weaknesses

East vs. West

  • More enterprise software on east coast
  • More resources, relationships for startups on west coast

Recommended Reading

  • Book: Founders at Work
  • Book: High Stakes No Prisoners
  • Paul Graham’s Essays

Alexis Ohanian: Reddit

“Only your mom will use your website”

Look for organic traffic, no need for advertising or PR firms.

  • Seek evangelists willing to tell others to use it
  • Before being bought by Conde Nast, budget for advertising was $200 for stickers
  • Wufoo sends handwritten notes to longtime customers to add a personal touch
  • Reddit gives free t-shirts to visitors to HQ and gives them a story to go along with the shirt
  • Zappos delivers in 1 day even though they promise 5-7 day delivery: promise low, deliver high

“Facilitate Serendipity”

Being good is insurance for when you f*** up

Cofounders: pick someone you trust w/ everything

  • non-programmer founder’s job is to make sure programming cofounder has nothing to think about except programming
  • once you launch, non-programmer deals w/ all user/business issues

Ken Zolot

“Who Cares?” Make sure your message answers who cares about your technology, not just how cool your technology is

Don’t wait for something to come your way

  • Do something
  • Sense what reaction is
  • Reflect
  • Change what you do

Dan Theobald: Vecna

People matter, so businesses should be ethical, socially responsible, etc.

Vecna pays employees to do community service for 10% of their work time

Don’t hire people unless you are ready to become a manager

Don’t do bargain-hunting in hiring

  • Hire smartest people you can find, and take good care of them
  • Good employee = 10 × medium employee = 100 × bad employee

Avoid “other people’s money”—outside investors are only good if you absolutely need them to scale

Read “On the Folly of Rewarding A, While Hoping for B” by Steven Kerr

Vecna does profit-sharing based on share of points awarded by colleagues each quarter. This makes each employee feel invested in the company

Vecna bootstrapped from IT consulting

  • Retained intellectual property in contracts
  • Have customers fit into your roadmap for a more generically applicable product

Kyle Vogt: Justin.tv

Startup Productivity Hacks

  • Buy catered lunch—more efficient employees if they don’t spend an hour walking around looking for a place to eat. For 10 employees, catering lunch saves you hiring an 11th employee. (didn’t mention how much it costs)

  • Use Google Apps for stuff you don’t want to administer yourself

  • Use data-driven (A/B) development

  • Use hiring screeners (applicants send you code solving some problem) to test skills programmers must have—cheap way to ensure they can write code.

  • Keep on-site job interviews short, and push more discovery into hiring screeners

  • Don’t hire a PR firm—they just guide you to do work on your own anyway. If you need skills, hire in-house marketing staff if necessary

  • Put one person on funding, pitches, finding investors

  • Work from home (if you live with cofounders)

  • Use hosted servers (don’t build technology yourself unless you’re a hardware company or want to dedicate MANY resources to it)

  • Listen to your users /the right way/. Users are good at loudly identifying what doesn’t work. Can learn a lot from watching users use your tool.

  • Buy a .com domain name

  • Be transparent w/ employees about data regarding company. They will trust you more, and even suggest ideas.

  • Don’t outsource core technology

  • Hire specialists when necessary

  • Hire people smarter than you, don’t settle

  • Have a plan for actually making $ (eventually)

Angus Davis

Youngest employee at Netscape

Started Tellme, which was bought by MSFT

Rest of talk was decided by a poll: presented ~16 ideas, and had users text msg./tweet the topics to expand on!

When you hit dicey situations as a founder, remove emotional attachment by thinking of yourself as a large shareholder and making the decision that way.

Hemant Taneja: Venture Capitalist at General Catalyst Partners

GCP is in Harvard square, $1.7B, 75 companies (consumer, I.T., Energy)

Touched on social media/mobile applications, which are some of their focuses

Interesting idea on energy: there are two problems

  • clean energy
  • cheap energy as demand grows

See 1500 entrepreneurs/year, invest in 2 or 3. Startups they love

  • Brilliant founders
  • Solve hard problems
  • Address very large markets
  • Are ahead of the curve
  • Are capital efficient

Should you raise VC? Only if you have to

  • Webapps are easy to bootstrap. Still helpful to get funding, but only once you have data on what your growth needs will be
  • But energy experiments require serious capital to even try an idea

Choosing a VC?

  • Smart, good listener that’s not going to be swayed by biases
  • Easy-going
  • Has bandwidth to help—not on too many boards
  • Transparent
  • Have a relevant network
  • VC firm has stable funding

What is a good termsheet?

  • Enough capital for significant milestone you are funding
  • legally simple—doesn’t restrict you too much in the future
  • Balance ownership for VC, founder, and management—don’t worry about specific values, just everyone’s vision in the coming years
  • VC is not focused heavily on the downside/insuring themselves
  • Board of directors has domain experience

Angels are good for funding the consumer internet market, but not necessarily for bigger investments such as the energy market.

Dharmesh Shah: Hubspot, OnStartups.Com

Likes bootstrapping, fast liquidity.

First time startup founders: maximize odds on seeing a modest return

Your idea stinks, so don’t make it better in a void—launch early so you can get the better idea

Outbound messaging is about pushing messages out to millions and hoping they will stick with someone. No longer successful, as people are good at filtering out extraneous messages

Inbound messaging (Hubspot) is about pulling people in. It’s cheaper and more effective for startups.

Tools for marketing

  • Google—focus on SEO, not adwords.

    • WebsiteGrader.com
    • Adwords is good for experimenting with popular terms, but dangerous because it hooks you on a cost model that may change immediately when someone else outbids you.
    • Pick keywords people will use, but not ones that are so popular that you’ll never make it into the top 10
    • Can rank high by high pagerank (authority, which is hard) or by providing relevant and accurate content.
    • Good content: make sure your title is meaningful. “ItsSoSoft” < “ItsSoSoft | iPhone blackjack application” < “iPhone blackjack application | ItsSoSoft”
  • Write a blog, not a business plan (no one reads business plans)

    • Build a following before building the product—buy domain name, start writing before you even start coding
    • Don’t be afraid to polarize
    • Blog from a personal perspective—what you’re learning by starting the company, etc. Don’t pretend to be a big organization if you’re not.
  • Facebook page for buying ads gives you the best demographic tool on the market: for a bunch of different demographic information, will dynamically tell you how many people on facebook fit those criteria.
  • Twitter is mainstream now, so if you prefer short form to long form, use it! TwitterGrader.com

Robin Chase: Zipcar

Everyone you come into contact with is a free consultant. Make sure you learn from people’s questions.

Intellectual Honesty—If you know your business model is broken, don’t blindly keep selling it, since you’re just hiding something that will grow into a bigger problem. Learn how to fix, change it, and suffer consequences earlier than later!

Start light—become a learning organization. Mistakes are OK, but not learning is not OK.

As a small organization, you are the stories you tell.

Goal #1: become sustainable/profitable. Get there with intellectual honesty, good team, good plan.

Don’t forget Luck: you get to where you are when preparation meets opportunity, but you won’t get lucky or unlucky all the time.

View on big problems—climate change is the biggest. See problems as business opportunities.

Partnerships with large car companies didn’t help, but since she was already in talks with them, she used it as a marketing opportunity with others: “I’ll talk to you after my meeting w/ Volkswagen!”

Aaron Swartz

Started infogami, merged w/ reddit

2 kinds of organisms:

  • r: make tons of children, and DNA will transfer by the numbers (cockroaches). e.g. delicious
  • K: have small # children, raise them carefully (humans). e.g. github

Hollywood launch: teaser, blow up the idea in people’s minds, and have a humongous launch

Swartz is skeptical of this approach: doesn’t mimic software

  • Kills you with traffic jump
  • Killer bug destroys you immediately
  • Hollywood judges movies on their initial launch…software is iterative!

So what’s the alternative? GMail launch

  • Have users from day 0: What’s the smallest usable piece?
  • while (users not happy) {fix/add features;}
  • As people are happy, let more in (using invite codes, or just opening up once enough people are happy)
  • Then, once everyone is happy, do a marketing launch (maybe)

Some stories

  • reddit: when they worked hard on features, no new users. when programmers went away or slacked off, traffic rose
  • recent project: he quit when traffic was flat. a few months later, looked at google analytics, saw spike in traffic

What to draw from this? Don’t assume that because you can measure something, you can control it. Sometimes, you need to give things time to spread, and hopefully if you strike it big, you can help the people following in your footsteps.

Announced boldprogressives.org—progressive change campaign committee

Avoid Subversion-Maintained Website Vulnerability

(Note: This vulnerability also exists for cvs and git-based repositories. Change these instructions appropriately.)

The news of this vulnerability came out a while back, but I spent the afternoon securing a few scarily exploitable sites, so I figured I’d reiterate.

If you store your website in subversion, you leave behind an “.svn” directory in each directory in version control. This directory contains the files in version control with extensions which may not protect them from being downloaded (e.g., site.com/file.php becomes site.com/.svn/text-base/file.php.svn-base).
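To check whether a deployed checkout is affected, a small script can walk the document root and list any version-control metadata directories it finds. What follows is a minimal sketch in Python rather than anything from the original advisory; the function name find_vcs_dirs and the set of directory names (.svn, .git, CVS) are assumptions made for illustration.

```python
import os
import sys

# Assumption: the metadata directory names used by svn, git, and cvs checkouts.
VCS_DIRS = {".svn", ".git", "CVS"}


def find_vcs_dirs(docroot):
    """Return sorted paths, relative to docroot, of version-control metadata directories."""
    found = []
    for current, dirnames, _filenames in os.walk(docroot):
        # Iterate over a copy because we prune dirnames while looping.
        for name in list(dirnames):
            if name in VCS_DIRS:
                full_path = os.path.join(current, name)
                found.append(os.path.relpath(full_path, docroot))
                # Do not descend into the metadata directory itself.
                dirnames.remove(name)
    return sorted(found)


if __name__ == "__main__":
    # Usage: python find_vcs_dirs.py /path/to/docroot
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_vcs_dirs(root):
        print(path)
```

Anything the script prints is a directory that a web server would serve unless it is told otherwise.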

To fix this, put the following in your root .htaccess file (or something similar in httpd.conf) for Apache:

<Files ~ "\.svn">
    Order deny,allow
    Deny from all
</Files>

For nginx:

location ~ /\.svn {
    # escape the dot so the pattern matches a literal ".svn"
    deny all;
}

Earn $30 for reading blogs!

Michael Bernstein and I are working on a project that we will release to the public soon. First, we want to make sure it does what we think it will, so we’re running a user study.

We’re looking for Google Reader users to try out a new extension that helps you share interesting items with people you know. E-mail the FeedMe team at feedme@csail.mit.edu to participate!

If you are: a Firefox user who uses Google Reader regularly (addicts are welcome!)
You can get: a $30 gift card for using our Google Reader extension at least every other day, for two weeks
We are: an MIT computer science research team
Dates of the study: preferably Tuesday, August 18 to Tuesday, September 1

E-mail feedme@csail.mit.edu if you’re interested. Follow your feeds while you help science!

NYT_Transformer and Data.gov: Your chance for a weekend hacking project!

The awesome developers at The New York Times’ Open blog have just posted about NYT_Transformer, a tool for converting between various data formats (XML, comma-separated files) and data storage mediums (flat files, databases).

This isn’t the first time such a conversion utility has been written—Babel comes to mind. The NYT_Transformer has a few perks in its favor, however, namely that it seems to be used in heavy production at The New York Times, and that it allows you to convert between databases and flat files (nice touch!).

Written in php, the tool is geared toward web applications. An immediate thought, aside from batch jobs for various internal projects, would be to use the tool for a greater purpose: a data converter for data.gov.

While data.gov is a nice start at centralizing the directory of the U.S. government’s raw data feeds, it lacks a utility for converting between various data formats. Given that the Sunlight Foundation is currently running a competition to build tools on top of data.gov, this is the perfect opportunity to go meta and build a data.gov browser with automatic format conversion. That would help standardize the site a bit, and would be a nice signal for the folks at data.gov as to the file formats that people actually want for various datasets. If you’re interested in working on this project, let me know!

MIT Database Systems (6.830) TA Course Notes

In Fall 2008, I had the pleasure of TAing Database Systems with Sam Madden, Mike Stonebraker, and Evan Jones. I figured that I could take notes to help students follow the lectures while clarifying any confusing points that were raised during discussion. It would also help me avoid the embarrassment of forgetting something mentioned during a lecture and having students explain it to me during office hours :).

I decided to take notes in plain text, mostly out of laziness. This turned out to be a challenge for drawing things like query plans, but forced me to distill explanations into a conversational tone that provided an alternative to traditional diagrams.

Some students in the class told me that they benefited from and enjoyed the notes, and so I decided to open them up for reuse by the rest of the web community. The topics are as follows:

Feel free to take a read through the notes, and shoot me an email with requests, corrections, or just to let me know you read something.

For completeness, this work by Adam Marcus is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.

Enjoy!

Local Communities, Information Gathering

By way of danah boyd I found the Knight Commission on the Information Needs of Communities, which published an introductory whitepaper that asks how local communities find, digest, and react to information.

Outside of designing software, my biggest hobby is reading the news, so you would imagine that I would feel well equipped to answer how I keep up-to-date with, and respond to, local events. Reading the questions the commission is tackling gave me pause, however, since I realized that all of my sources tell me how things are over there, and very little about what’s going on around here. I keep up-to-date with the people who live next door, but only by way of the abstracted and aggregated views of the national news media.

I have tried to sign up for blogs about my area, but they are either too commercialized, or have a low signal-to-noise ratio. And even then, if Hurricane Katrina strikes the northeast, I’ll know as little today about what to do next as I assume most people did the first time disaster struck.

So the national news tells us about there. Blogs tell us about what we know we want. Twitter tells us about now. Where do we find out about down the street, at the moment, and what everyone is doing about it? It seems that we’ve built a digital infrastructure to maximize information gathering at the expense of a local infrastructure for dealing with the more basic things in life.

Making the case for Raw Data

Tim Berners-Lee’s recent TED talk on Linked Data has inspired quite a few people to ask what exactly linked data is, how it differs from data on the semantic web, and how realistic it is to assume universal and unique addressability of data items. A world with linked data would be a world with richer, more explorable data, and that notion on its own makes Tim’s talk worth viewing. The most inspiring part of his talk, in my opinion, was the one in which he got the entire crowd to loudly demand RAW DATA NOW. Given the push for more open datasets in government, and given that more websites are becoming API-providing data platforms, it is important to demand raw data where possible.

The magic behind raw data

The best thing about raw data is that almost everyone knows how it works. This means that as far as the data (re)user is concerned, the datasets are text files (or perhaps a close variant) that they can download, open in some default application, and get some immediate use out of.

If the US Federal budget dataset is released as a comma-separated file, a middle-schooler can download the file, open it in a spreadsheet application, and sum the columns to see how much we’re spending on the Department of Education this year. A more skilled high-schooler can upload the file to Many Eyes, make a pie chart out of it, and post it to their blog. A first-year college student can write a php script to allow people to comment on various parts of that pie chart, allowing you to drill in to various slices to get a finer granularity.
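The middle-schooler’s spreadsheet sum translates directly into a few lines of code. Here is a minimal sketch in Python, assuming a comma-separated file with hypothetical column names (agency, program, amount) and made-up figures; no real budget dataset is implied to have this layout.

```python
import csv
import io
from collections import defaultdict


def total_by_agency(csv_file, agency_col="agency", amount_col="amount"):
    """Sum the amount column per agency from an open CSV file object."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row[agency_col]] += float(row[amount_col])
    return dict(totals)


# Hypothetical sample data standing in for a downloaded budget file.
SAMPLE = """agency,program,amount
Department of Education,Program A,100.5
Department of Education,Program B,200
Department of Energy,Program C,50
"""

if __name__ == "__main__":
    for agency, total in sorted(total_by_agency(io.StringIO(SAMPLE)).items()):
        print(agency, total)
```

To run it against a real download, pass open("budget.csv", newline="") in place of the io.StringIO sample and adjust the column names to match the file.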

With raw data, you’ve opened more people to more visualization, exploration, and discussion than was available through the original web application that acted as a firewall to your database.

Hugging the data to death

During his talk, Tim spoke about “Database Huggers,” or people who, for various reasons, hide their data away in databases. Once the data sits in a database, the publisher might provide a specific and constrained view of the data by way of a website, or they might hide it even more, simply calculating some aggregate statistic over the data and claiming, without verification, that the data has certain properties.

There are several legitimate reasons for database hugging. Some data was meant to be private—academic, medical, and financial information are all datapoints we’d prefer to keep private. We’d hope our service providers will keep it out of the hands of others. Similarly, a company might have competitive reasons for keeping information private, especially when it would be equally valuable to their competitors and not too valuable to the public—lists of customers and transaction histories come to mind. Keeping this information far from the publicly accessible web is responsible and wise.

There are other cases, however, where the data should legitimately stay open and publicly accessible. Open government initiatives will result in many datasets published by organizations that will or should exist in the public domain. Many Long Tail websites, maintained by small groups of hobbyists, probably would not mind if the datasets they generate are published in their full glory. For these types of applications, raw data is ideal.

Even in the case of datasets that should be open to the public, database huggers will sometimes disable direct access to the data, instead opting to place it in a database that sits behind an html-generating web application. Thinking that you’ve hidden your data behind HTML, thus making it safe from reuse, is an unwise assumption. In about an hour, a decent programmer can write a perl script to crawl your site and tease the data apart from the obfuscated HTML that surrounds it, reverse-engineering your database without asking for permission. In fact, there are tools that make this process easier than writing a one-off perl script. And if you think you can block the person from accessing every page on your site in a short period of time, then they will just collaborate with everyone else who wants the data, write a Greasemonkey script to collect parts of the site that they browse, and eventually collect your entire presented dataset.
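To give a sense of how little effort this takes, here is a minimal sketch of such a scraper. It is written in Python with the standard library’s html.parser rather than perl, and the page markup, the TableScraper class, and the scrape_rows helper are all hypothetical. It pulls table rows back out of the HTML wrapped around them.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; nested tags such as <b> are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page: the data is buried in presentation markup.
PAGE = """<html><body><div class="x"><table>
<tr><th>Title</th><th>Author</th></tr>
<tr><td><b>Book One</b></td><td> A. Writer </td></tr>
</table></div></body></html>"""

if __name__ == "__main__":
    for row in scrape_rows(PAGE):
        print(row)
```

Point the same parser at each page of a site and the rows accumulate into something very close to the original table.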

Databases are not inherently evil. They provide an excellent way to store, index, and query data, but they also have a way of separating the average user from that data. Most websites, for example, do not publish a read-only username and password to their database, for fear of arbitrary queries that could easily take down their machines, or at least keep the machines busy for a long time. We should design tools to maintain the excellent services that databases have been built to provide over the last four decades, without limiting the access to the raw data when such access would be most valuable.

Are APIs the future of raw data?

There is a middle ground between the highly private datasets and the obviously open ones. Most forward-thinking organizations have realized this. They have also realized that if they have something to sell, be it in meatspace or screenspace, it’s better to release the data about their offerings to anyone that wants to use it, so that people eventually end up at their site. They do this by providing a web API to make their dataset queryable, essentially telling other software developers which questions they can answer about the dataset (query for books by author, query for restaurants by cuisine). Amazon has some APIs, as does Yelp, and you’d have to be a pretty self-loathing web 2.0 company to not provide an API over some portion of your data. So are APIs the solution? Not always.

APIs are a step in the right direction—open data is better than obfuscated data. APIs help both third-party developers and dataset publishers get more out of a dataset. They have a few drawbacks as well:

  • The API is an HTTP interface to your database. This means that if someone else makes a third-party application that is immensely popular, it’s your database that pays for the brunt of its popularity. You weren’t expecting a huge ramp-up in server load? Too bad.
  • As kind as the dataset publisher is, they can’t predict every use of the data—if they could, they already would have implemented the best use cases. If they can’t predict how the consumer/developer will use the data, they might not publish a good hook into the dataset. This would either prevent or make awkward the interaction between the third-party application and the publisher.
  • Building an API for a dataset makes the people who are nice enough to share their data do more work on top of designing their application. Following common REST or CRUD conventions makes this easier, but still puts the onus on the developer. As a corollary, APIs don’t change with the data on their own. They have to be revised by hand, meaning that a change in your data requires constant upkeep of your API.

One might argue that some of the criticisms of APIs are unfair:

  • Saying that raw data will reduce the load on your database implies that the third party has some cache of the data, which is thus slightly out-of-date. You could imagine some sort of Comet-updated raw dataset system, but it’s unlikely for now that dataset publishers will be willing to stream live updates to third parties.
  • Perhaps the limited API functionality is for good reason. Amazon might never want you to be able to download their entire dataset—they don’t want to waste the bandwidth and they don’t want competitors to know exactly how many items they have on hand.
  • Publishing any sort of raw data will require extra work on behalf of the dataset publisher. Perhaps API-writing is the least invasive of their time?

An ideal data management tool would allow raw data publishing when possible, and make it easier to build APIs when some limited access is desirable. We should not pretend to know the point at which raw data is superior to APIs, but the point exists somewhere. It’s important to understand the benefits that raw data provides on top of web APIs, so that you can think about when it would be valuable to use.

After all this time, the answer was text files?

You’ve probably become skeptical of these suggestions. Are we really supposed to throw away decades of database research in how to properly store, index, and query reasonably sized datasets so that a middle-schooler can look at the data in a different way? Of course not. The interesting research question becomes whether we can give the user the illusion of raw data while still benefiting from database technology where possible.

That’s one research direction we’re taking within the Haystack group. With the constraint that the raw data, in human-readable text files, should always be available, we’d like to blur the boundaries between databases and data-aware webservers.

Specifically, what we plan on designing is an apache web server module that recognizes when it is serving a dataset, perhaps by taking note that it is serving a .csv, .rdf, or .json file. In such cases, the server would cook the data into a database behind the scenes. Data-aware clients (in javascript for the time being, but in the browser one day) can then query the web server about the data directly. Updates become difficult, but we can make consistency guarantees about the original raw data text files to ensure that someone can download them and see up-to-date information.

If you prefer programmatic access to the files, the module turns into a REST(, SQL, SPARQL, your favorite path language)-capable endpoint. If you prefer to get down and dirty with the data, you’ve got the text files.
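The “cook the data into a database behind the scenes” step can be illustrated outside of a web server. The following is only a sketch of the idea in Python, not the planned Apache module: it loads a CSV file into an in-memory SQLite table and answers a query over it. The load_csv helper, the restaurants table, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote a table or column name for use in SQLite DDL."""
    return '"%s"' % name.replace('"', '""')


def load_csv(conn, table, csv_file):
    """Create a table named after the dataset and fill it from a CSV file object."""
    reader = csv.reader(csv_file)
    header = next(reader)  # the first line names the columns
    columns = ", ".join(quote_ident(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute("CREATE TABLE %s (%s)" % (quote_ident(table), columns))
    conn.executemany(
        "INSERT INTO %s VALUES (%s)" % (quote_ident(table), placeholders),
        (row for row in reader if row),  # skip blank lines
    )
    conn.commit()


# Hypothetical raw dataset, as it might sit on disk as restaurants.csv.
SAMPLE = """name,cuisine
Thai Palace,thai
Slice House,pizza
Basil Leaf,thai
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_csv(conn, "restaurants", io.StringIO(SAMPLE))
    query = "SELECT name FROM restaurants WHERE cuisine = ? ORDER BY name"
    for (name,) in conn.execute(query, ("thai",)):
        print(name)
```

The text file stays the source of truth; the database is a derived index that can be rebuilt whenever the file changes.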

We certainly don’t want to stand in the way of a world with Linked Data, so if you’d like, the tool will eventually return data with URIs. We can’t guarantee the URIs will resolve to anything useful, but that just might require a human’s touch. We’re not sure how that fits into the picture for the average data publisher, since the marginal benefit to the individual of universally addressing your own data is small, whereas the benefit to everyone else of adding another linked dataset grows with the number of datasets it is linked to.

And now, for some questions

We’re early in the development of our tools, so we’re open to your ideas and suggestions. Keeping text files up-to-date with the database that’s proxying them is nontrivial. Thinking of the ideal client/server mode of operation will also take time. We probably haven’t thought of the most important must-have feature yet, so any suggestions are welcome.

Thanks to Ted Benson, Sam Madden, and David Karger for their thoughts on this post.

(Cross-posted on the Haystack Blog)

Rethinking the Newsroom

I’ve had a lot of discussions with people about what news will look like in a digital age.  Clay Shirky has just written this piece about that topic.  He’s managed to say a few major things:

  • Big newsrooms were a means to an end.  Unfortunately for them, what we actually want is information via journalism, and until recently, not everyone could afford the printing press that was a prerequisite.
    “Society doesn’t need newspapers. What we need is journalism. For a century, the imperatives to strengthen journalism and to strengthen newspapers have been so tightly wound as to be indistinguishable. That’s been a fine accident to have, but when that accident stops, as it is stopping before our eyes, we’re going to need lots of other ways to strengthen journalism instead.”
  • It turns out that a side benefit of being a market with little competition was that advertisers could pay for your more extravagant expenses.
    “For a long time, longer than anyone in the newspaper business has been alive in fact, print journalism has been intertwined with these economics. The expense of printing created an environment where Wal-Mart was willing to subsidize the Baghdad bureau. This wasn’t because of any deep link between advertising and reporting, nor was it about any real desire on the part of Wal-Mart to have their marketing budget go to international correspondents. It was just an accident. Advertisers had little choice other than to have their money used that way, since they didn’t really have any other vehicle for display ads.”
  • Putting information into bits on computers has made it easy to extract the bits and redisplay them without DRM or paywalls.  You won’t get around that, and so the model has to change.

“And so it is today. When someone demands to know how we are going to replace newspapers, they are really demanding to be told that we are not living through a revolution. They are demanding to be told that old systems won’t break before new systems are in place. They are demanding to be told that ancient social bargains aren’t in peril, that core institutions will be spared, that new methods of spreading information will improve previous practice rather than upending it. They are demanding to be lied to.”

While well thought out, it leaves us wondering what comes next.  Competition is good in that it lowers the price of entry.  It’s not good in that no one entity has enough money to run the Baghdad Bureau.  We’ve also heard that local newsrooms are shutting down, since Craigslist has killed their income sources.  So if we’re losing local journalism and the high-expense war reporting, where does that leave us?

One question I’d love to have the answer to is where the big expenses lie.  Is it the Baghdad Bureau, or the printing press?  If it is true that the NYT spends so much on printing and delivery that it can afford to send all of its paying subscribers a free Kindle, does having a web-only presence allow it to continue paying for the high-quality journalism without paying for print media?  A read-through of their quarterly report might help with that. Let me know if you know about this!

[update: put quotes around the parts I quoted from the Shirky piece]