The Poetics of Metadata and the Potential of Paradata (Revised)

I Have Learned So Much

[This is the text, more or less, of the talk I delivered at the 2011 biennial meeting of the Society for Textual Scholarship, which took place March 16-18 at Penn State University. I originally planned on talking about the role of metadata in two digital media projects—a topic that would have fit nicely with STS's official mandate of investigating print and digital textual culture. But at the last minute (i.e., the night before), I changed the focus of my talk, turning it into a thinly veiled call for digital textual scholarship (primarily the creation of digital editions of print works) to rethink everything it does. (Okay, that's an exaggeration. But I do argue that there's a lot the creators of digital editions of texts should learn from born-digital creative projects.)

Also, it was the day after St. Patrick's Day. And the fire alarm went off several times during my talk.

None of these events are related.]

The Poetics of Metadata and the Potential of Paradata
in We Feel Fine and The Whale Hunt

I once made fun of the tendency of academics to begin their papers by apologizing in advance for the very same papers they were about to begin. I’m not exactly going to apologize for this paper. But I do want to begin by saying that this is not the paper I came to give. I had that paper, it was written, and it was a good paper. It was the kind of paper I wouldn’t have to apologize for.

But, last night, I trashed it.

I trashed that paper. Call it the Danny Boy effect, I don’t know. But it wasn’t the paper I felt I needed to deliver, here, today.

Throughout the past two days I’ve detected a low level background hum in the conference rooms, a kind of anxiety about digital texts and how we interact with them. And I wanted to acknowledge that anxiety, and perhaps even gesture toward a way forward in my paper. So, I rewrote it. Last night, in my hotel room. And, well, it’s not exactly finished. So I want to apologize in advance, not for what I say in the paper, but for all the things I don’t say.

My original talk had positioned two online works by the new media artist Jonathan Harris as two complementary expressions of metadata. I had a nice title for that paper. I even coined a new word in my title.

Flashing Talk Title

But this title doesn’t work anymore.

I have a new title. It’s a bit more ambitious.

New Title for the Poetics of Metadata

But at least I’ve still got that word I coined.

Paradata.

It’s a lovely word. And truth be told, just between you and me, I didn’t coin it. In the social sciences, paradata refers to data about the data collection process itself—say the date or time of a survey, or other information about how a survey was conducted. But there are other senses of the prefix “para” I’m trying to evoke. In textual studies, of course, para-, as in paratext, is what Genette calls the threshold of the text. I’m guessing I don’t have to say anything more about paratext to this audience.

But there’s a third notion of “para” that I want to play with. It comes from the idea of paracinema, which Jeffrey Sconce first described in 1995. Paracinema is a kind of “reading protocol” that valorizes what most audiences would otherwise consider to be cinematic trash. The paracinematic aesthetic redeems films that are so bad that they actually become worth watching—worth enjoying—and it does so in a confrontational way that seeks to establish a counter-cinema.

Following Sconce’s work, the videogame theorist Jesper Juul has wondered if there can be such a thing as paragames—illogical, improbable, and unreasonably bad games. Such games, Juul suggests, might teach us about our tastes and playing habits, and what the limits of those tastes are. And even more, such paragames might actually revel in their badness, becoming fun to play in the process.

Trying to tap into these three different senses of “para,” I’ve been thinking about paradata. And I’ve got to tell you, so far, it’s a mess. (And this part of my paper was actually a mess in the original version as well.) My concept of paradata is a big mess and it may not mean anything at all.

This is what I have so far: paradata is metadata at a threshold, or paraphrasing Genette, data that exists in a zone between metadata and not metadata. At the same time, in many cases it’s data that’s so flawed, so imperfect that it actually tells us more than compliant, well-structured metadata does.

So let me turn now to We Feel Fine, a massive, ongoing digital databased storytelling project rich with metadata—and possibly, paradata.

We Feel Fine Logo

We Feel Fine is an astonishing collection of tens of thousands of sentences extracted from tens of thousands of blog posts, all containing the phrase “I feel” or “I am feeling.” It was designed by new media artist Jonathan Harris and the computer scientist Sep Kamvar and launched in May 2006.

The project is essentially an automated script that visits thousands of blogs every minute, and whenever the script detects the words “I feel” or “I am feeling,” it captures that sentence and sends it to a database. As of early this year, the project has harvested 14 million expressions of emotions from 2.5 million people. And the site has done this at a rate of 10,000 to 15,000 “feelings” a day.

Let me repeat that: every day approximately 10,000 new entries are added to We Feel Fine.

The heart of the project appears to be the multifaceted interface that has six so-called “movements”—six ways of visualizing the data collected by We Feel Fine’s crawler.

The default movement is Madness, a swarm of fifteen-hundred colored circles and squares, each one representing a single sentence from a blog post, a single “feeling.” The circles contain text only, while the squares include images associated with the respective blog post.

Madness Movement of We Feel Fine

The colors of the particles signify emotional valence, with shades of yellow representing more positive emotions, red signaling anger. Blue is associated with sad feelings, and so on. This graphic, by the way, comes from the book version of We Feel Fine.

Pie Chart from the We Feel Fine Book

The book came out in 2009. In it, Harris and Kamvar curate hundreds of the most compelling additions to We Feel Fine, and analyze the millions of blog posts they’ve collected with extensive data visualizations—graphs, and charts, and diagrams.

Opening Montage to the We Feel Fine book

Data Visualizations from the We Feel Fine book

The book is an amazing project in and of itself and deserves its own separate talk. It raises important questions about archives, authorship, editorial practices, the material differences between a dynamic online project and a static printed work, and so on. I’ll leave aside these questions right now; instead, I want to turn to the site itself. Let’s look at the Madness movement in action.

(And here I went online and interacted with the site. Why don’t you do that too, and come back later?)

(Also, right about here a fire alarm went off. Which, semantically, makes no sense. The alarm turned on, but I said it went off.)

(I can’t reproduce the sound of that particular fire alarm going off. I bet you have some sort of alarm on your phone or something you could make go off, right?)

(No? You don’t? Or you’re just as confused about on and off as I am? Then enjoy this short video intermission, which interrupts my talk, which I’m writing and which you’re reading, about as intrusively as the alarms interrupted my panel.)

(Okay. Back to my talk, which I’m writing, and which you’re reading.)

In the Madness movement you can click on any single circle, and the “feeling” will appear at the top of the screen. Another click on that feeling will drill down to the original blog post in its original context. So what’s important here is that a single click transitions from the general to the particular, from the crowd to the individual. You can also click on the squares to show “feelings” that have an image associated with them. And you have the option to “save” these images, which sends them to a gallery, just about the only way you can be sure to ever find any given image in We Feel Fine again.

Slide10 - The Madness Movement in We Feel Fine

At the top of the screen are six filters you can use to narrow down what appears in the Madness movement. Working right to left, you can search by date, by location, the weather at that location at the time of the original blog post, the age of the blogger, the gender of the blogger, and finally, the feeling itself that is named in the blog post. While every item in the We Feel Fine database will have the feeling and date information attached to it, the age, gender, location, and weather fields are populated only for those items in which that information is publicly available—say a LiveJournal or Blogger profile that lists that information, or a Flickr photo that’s been geotagged.

What I want to call your attention to before I run through the other five movements of We Feel Fine is that these filters depend upon metadata. By metadata, I mean the descriptive information the database associates with the original blog post. This metadata not only makes We Feel Fine browsable, it makes it possible. The metadata is the data. The story—if there is one to be found in We Feel Fine—emerges only through the metadata.

You can manipulate the other five movements using these filters. At first, for example, the Murmurs movement displays a reverse chronological streaming, like movie credits, of the most recent emotions. The text appears letter-by-letter, as if it were being typed. This visual trick heightens the voyeuristic sensibility of We Feel Fine and makes it seem less like a database and more like a narrative, or even more to the point, like a confessional.

Murmurs Mode of We Feel Fine

The Montage movement, meanwhile, organizes the emotions into browsable photo galleries:

Montage Movement of We Feel Fine

By clicking on a photo and selecting save, you can add photos to a permanent “gallery.” Because the database grows so incredibly fast, this is the only way to ensure that you’ll be able to find any given photograph again in the future. There’s a strong ethos of ephemerality in We Feel Fine. To use one of Marie-Laure Ryan’s metaphors for a certain kind of new media, We Feel Fine is a kaleidoscope, an assemblage of fragments always in motion, never the same reading or viewing experience twice. We have little control over the experience. It’s only through manipulating the filters that we can hope to bring even a little coherency to what we read.

The next of the five movements is the Mobs movement. Mobs provides five separate data visualizations of the most recent fifteen-hundred feelings. One of the most interesting aspects of the Mobs movement is that it highlights those moments when the filters don’t work, or at least not very well, because of missing metadata.

The Mobs Movement of We Feel Fine

For instance, clicking the Age visualization tells us that 1,223 (of the most recent 1,500) feelings have no age information attached to them. Similarly, the Location visualization draws attention to the large number of blog posts that lack any metadata regarding their location.

Unlike many other massive datamining projects, say, Google’s Ngram Viewer, We Feel Fine turns its missing metadata into a new source of information. In a kind of playful return of the repressed, the missing metadata is colorfully highlighted—it becomes paradata. The null set finds representation in We Feel Fine.
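One way to picture this treatment of the null set is the following minimal sketch. It assumes each feeling is a plain dictionary with an optional "age" key; the record shape, the decade buckets, and the function name are all hypothetical, not taken from We Feel Fine's code. The point is only that "unknown" is counted as a bucket of its own rather than silently discarded.

```python
from collections import Counter


def age_histogram(records):
    """Count feelings per age bucket, keeping 'unknown' as a visible bucket."""
    counts = Counter()
    for record in records:
        age = record.get("age")  # None when the age was never published
        if age is None:
            counts["unknown"] += 1
        else:
            decade = (age // 10) * 10
            counts[f"{decade}s"] += 1
    return counts


# Invented sample: two of the five records carry no age at all.
sample = [{"age": 24}, {"age": None}, {"age": 31}, {}, {"age": 27}]
histogram = age_histogram(sample)
```

In the Age example above, 1,223 of 1,500 feelings would land in the "unknown" bucket, which is roughly four out of every five.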

The Metrics movement is the fourth movement. And it shows what Kamvar and Harris call the “most salient” feelings, by which they mean “the ways in which a given population differs from the global average.”

The Metrics Movement of We Feel Fine

Right now, for example, we see that “crazy” is trending 3.8 times more than normal, while people are feeling “alive” 3.1 times more than usual. (Good for them!) Here again we see an ability to map the local against the global. It addresses what I see as one of the problems of large-scale data visualization projects, like the ones that Lev Manovich calls “cultural analytics.”
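A figure like “3.8 times more than normal” can be read as a ratio between a feeling's share in one population and its share overall. The sketch below works under that assumption only; it is not the formula Kamvar and Harris describe, and the function name and counts are invented for illustration.

```python
def salience(local_counts, global_counts):
    """Return, per feeling, its local share divided by its global share."""
    local_total = sum(local_counts.values())
    global_total = sum(global_counts.values())
    ratios = {}
    for feeling, count in local_counts.items():
        global_count = global_counts.get(feeling, 0)
        if global_count == 0:
            continue  # no baseline to compare against
        local_share = count / local_total
        global_share = global_count / global_total
        ratios[feeling] = local_share / global_share
    return ratios


# Invented counts, chosen only to make the arithmetic easy to follow.
global_counts = {"crazy": 100, "alive": 100, "fine": 800}
local_counts = {"crazy": 30, "alive": 10, "fine": 60}
ratios = salience(local_counts, global_counts)
```

With these invented numbers, “crazy” makes up 30 percent of the local feelings against 10 percent globally, so its ratio comes out at about 3; a ratio near 1 means the local population feels the way everyone else does.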

Ngram and the like are not forms of distant reading. There’s distant reading, and then there’s simply distance, which is all they offer. We Feel Fine mediates that distance, both visually, and practically.

(And here I was going to also say the following, but I was already in hot water at the conference for my provocations, so I didn’t say it, but I’ll write it here: Cultural analytics echo a totalitarian impulse for precise vision and control over broad swaths of populations.)

And finally, the Mounds movement, which simply shows big piles of emotion, beginning with whatever feeling is the most common at the moment, and moving on down the line towards less common emotions. The Mounds movement is at once the least useful visualization but also the most playful, with its globs that jiggle as you move your cursor over them.

The Mounds Movement in We Feel Fine

(Obviously you can’t see it above, in the static image but…) The mounds convey what game designers call “juiciness.” As Jesper Juul characterizes juiciness, it’s “excessive positive feedback in response to the player’s actions.” Or, as one game designer puts it, a juicy game “will bounce and wiggle and squirt…it feels alive and responds to everything that you do.”

Harris’s work abounds with juicy, playful elements, and they’re not just eye candy. They are part of the interface, part of the design, and they make We Feel Fine welcoming, inviting. You want to spend time with it. Those aren’t characteristics you’d normally associate with a database. And make no mistake about it. We Feel Fine is a database. All of these movements are simply its frontend—a GUI Java applet written in Processing that obscures a very deliberate and structured data flow.

The true heart of We Feel Fine is not the responsive interface, but the 26,000 lines of code running on 5 different servers, and the MySQL database that stores the 10,000 new feelings collected each and every day. In their book, Kamvar and Harris provide an overview of the dozen or so main components that make up We Feel Fine’s backend.

It begins with a URL server that maintains the list of URLs to be crawled and the crawler itself, which runs on a single dedicated server.

Pages retrieved by the crawler are sent to the “Feeling Indexer,” which locates the words “feel” or “feeling” in the blog post. The adjective following “feel” or “feeling” is matched against the “emotional lexicon”—a list of 2,178 feelings that are indexed by We Feel Fine. If the emotion is not in the lexicon, it won’t be saved. That emotion is dead to We Feel Fine. But if the emotion does match the index, the script extracts the sentence with that feeling and any other information available (this is where the gender, location, and date data are parsed).
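The indexing step just described can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: the five-word lexicon stands in for the real list of 2,178 feelings, the regular expression is an assumption about how the trigger phrases might be detected, and the sketch takes whatever single word follows “feel” or “feeling” instead of identifying a true adjective.

```python
import re

# A tiny stand-in for the emotional lexicon (hypothetical contents).
LEXICON = {"happy", "sad", "alive", "crazy", "lost"}

# Matches "I feel <word>" or "I am feeling <word>" (assumed trigger pattern).
TRIGGER = re.compile(r"\bi\s+(?:feel|am\s+feeling)\s+(\w+)", re.IGNORECASE)


def index_feelings(post_text):
    """Return (sentence, feeling) pairs whose feeling is in the lexicon."""
    results = []
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", post_text.strip()):
        match = TRIGGER.search(sentence)
        if match is None:
            continue
        feeling = match.group(1).lower()
        if feeling in LEXICON:
            results.append((sentence, feeling))
        # A feeling outside the lexicon is dropped without a trace.
    return results


post = "Today was long. I feel lost in this city! I am feeling zonked. I feel Happy."
found = index_feelings(post)
```

Here “zonked” never reaches the results because it is not in the lexicon. A sentence such as “I feel so happy” would also be dropped by this sketch, since the word right after “feel” is “so”, which hints at how much depends on small decisions inside the indexer.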

Next there’s the actual MySQL database, which stores the following fields for each data item: the extracted sentence, the feeling, the date, time, post URL, weather, and gender, age, and location information.
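The fields listed above, and the way the interface's filters lean on them, might be modeled like this. It is a sketch under stated assumptions: the class name, field names, and types are hypothetical (the real store is a MySQL table whose schema is not given here), and the filter function only illustrates the consequence of optional metadata, namely that an item with a missing field can never match a filter on that field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feeling:
    # Always present, according to the talk.
    sentence: str
    feeling: str
    date: str
    time: str
    post_url: str
    # Present only when the information was publicly available.
    weather: Optional[str] = None
    gender: Optional[str] = None
    age: Optional[int] = None
    location: Optional[str] = None


def filter_feelings(items, **criteria):
    """Keep items whose fields equal every given criterion.

    An item whose field is None never matches a filter on that field.
    """
    return [
        item
        for item in items
        if all(getattr(item, name) == wanted for name, wanted in criteria.items())
    ]


items = [
    Feeling("i feel fine", "fine", "2011-03-17", "09:00",
            "http://example.com/a", gender="female", age=29),
    Feeling("i feel sad", "sad", "2011-03-17", "09:05",
            "http://example.com/b"),
]
women = filter_feelings(items, gender="female")
same_day = filter_feelings(items, date="2011-03-17")
```

Filtering by date returns both items, while filtering by gender returns only the first: the second item is not excluded for being the wrong gender, but for having no gender recorded at all.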

Then there’s an open API server and several other client applications. And finally, we reach the front end.

Now, why have I just taken this detour into the backend of We Feel Fine?

Because, if we pay attention to the hardware and software of We Feel Fine, we’ll notice important details that might otherwise escape us. For example, I don’t know if you noticed from the examples I showed earlier, but all of the sentences in We Feel Fine are stripped of their formatting. This is because the Perl code in the backend converts all of the text to lowercase, removes any HTML tags, and eliminates any non-alphanumeric characters.
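The original Perl is not reproduced here, but the three steps can be approximated in a few lines of Python. This is a sketch, not the project's code: the function name is hypothetical, and it assumes that whitespace survives the cleanup and that only ASCII letters and digits count as alphanumeric.

```python
import re


def normalize(raw_html):
    """Strip HTML tags, lowercase the text, and drop non-alphanumeric characters."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # remove anything that looks like a tag
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)    # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


cleaned = normalize("<p>I feel <em>SO</em> alive, today!!</p>")
```

The emphasis tags, the capitals, the comma, and the exclamation marks all vanish, leaving “i feel so alive today”. Everything the blogger did to shade the sentence is gone before it ever reaches the database.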

The algorithm tampers with the data. The code mediates the raw information. In doing so, We Feel Fine makes both an editorial and aesthetic statement.

In fact, once we understand some of the procedural logic of We Feel Fine, we can discover all sorts of ways that the database proves itself to be unreliable.

I’ve already mentioned that if you express a feeling that is not among the 2,178 emotions tabulated, then your feeling doesn’t count. But there’s also the tricky language misdirection the algorithm pulls off, in which the same word is interpreted by the machine as the same “feeling,” no matter how it is used in the sentence. In this way, the machine exhibits the same kind of “naïve empiricism” (using Johanna Drucker’s dismissive phrase) that some humanists do when interpreting quantitative data.

And finally, consider many of the images in the Montage movement. When there are multiple images on a blog page, the crawler only grabs the biggest one—and not biggest in dimensions, but biggest in file size, because that’s easier for the algorithm to detect—and this image often ends up being the header image for the blog, rather than connected to the actual feeling itself, as in this example.

Montage Image from We Feel Fine

The star pattern happens to be a sidebar image, rather than anything associated with the actual blog post that states the feeling:

The Stars in Context
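The selection rule behind this mismatch is simple enough to sketch. The following assumes each image on a page is described by a dictionary with "url" and "bytes" keys; those names, like the sample page, are invented for illustration and are not drawn from the crawler itself.

```python
def pick_image(images):
    """Return the URL of the image with the largest file size, or None."""
    if not images:
        return None
    return max(images, key=lambda image: image["bytes"])["url"]


# Invented page: the banner is heavy on disk, the post's own photo is not.
page_images = [
    {"url": "header.png", "bytes": 480_000, "width": 900, "height": 150},
    {"url": "post-photo.jpg", "bytes": 95_000, "width": 1024, "height": 768},
]
chosen = pick_image(page_images)
```

The rule picks header.png even though the post's photo has larger dimensions, because only the byte count is compared. Nothing in the rule asks whether the image has anything to do with the feeling.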

So We Feel Fine forces associations. In experimental poetry or electronic literature communities, these kinds of random associations are celebrated. The procedural creation of art, literature, or music has a long tradition.

But in a database that seeks to be a representative “almanac of human emotions”? We’re in new territory there.

But in fact, it is representative, in the sense that human emotions are fungible, ephemeral, disjunctive, and, let’s face it, sometimes random.

Let me bring this full circle, by returning to the revised title of my talk. I mentioned at the beginning that I felt this low-grade but pervasive concern about digital work these past few days at STS. I’ve heard questions like Are we doing everything we can to make digital editions accessible, legible, readable, and teachable? Where are we failing, some people have wondered. Why are we failing? Or at least, Why have we not yet reached the level of success that many of the very same people at this conference were predicting ten or fifteen or, dare I say it, twenty years ago?

Maybe because we’re doing it wrong.

I want to propose that we can learn a lot from We Feel Fine as we exit out the far end of what some media scholars have called the Gutenberg Parenthesis.

What can we learn from We Feel Fine?

Four things:

  1. It’s inviting
  2. It’s paradata
  3. It’s open
  4. It’s juicy

Imagine if textual scholars built their digital editions and archives using these four principles.

Think about We Feel Fine and what makes it work. Most importantly, We Feel Fine is a compelling reading experience. It’s not daunting. There’s a playful balance between interactivity and narrative coherence.

Secondly, and this goes back to my idea of paradata, Harris and Kamvar are not afraid to corrupt the source data, or to create metadata that blurs the line between metadata and not-metadata. They are not afraid to play with their sources, and for the most part, they are up front about how they’re playing with them.

This relates to the third feature of We Feel Fine that we should learn from. It’s open. Some of the source code is available. The list of emotions is available. There’s an open API, which anyone can use to build their own application on top of We Feel Fine, or more generally extract data from.

And finally, it’s juicy. I admit, this is probably not a term many textual scholars use in their research, but it’s essential for the success of We Feel Fine. The text responds to you. It’s alive in your hands, and I don’t think there’s much more we could ever ask from a text.

Bibliography

  • Drucker, Johanna. 2010. “Humanistic Approaches to the Graphical Expression of Interpretation” presented at the Hyperstudio: Digital Humanities at MIT, May 20, Cambridge, MA. http://mitworld.mit.edu/video/796.
  • Genette, Gerard. 1997. Paratexts: Thresholds of Interpretation. Cambridge: Cambridge University Press.
  • Harris, Jonathan and Sep Kamvar. 2006. Methodology. We Feel Fine. http://wefeelfine.org/methodology.html.
  • Juul, Jesper. 2009. Paragaming: Good Fun with Bad Games. The Ludologist. September 24. http://www.jesperjuul.net/ludologist/?p=732.
  • Juul, Jesper. 2010. A Casual Revolution: Reinventing Video Games and Their Players. Cambridge, MA: MIT Press.
  • Ryan, Marie-Laure. 2001. Narrative as Virtual Reality: Immersion and Interactivity in Literature and Electronic Media. Baltimore: Johns Hopkins University Press.
  • Sconce, Jeffrey. 1995. “‘Trashing’ the academy: taste, excess, and an emerging politics of cinematic style.” Screen 36 (4) (December 1): 371-393.
  • Shodhan, Shalin, Matt Kucic, Kyle Gray, and Kyle Gabler. 2005. How to Prototype a Game in Under 7 Days. Gamasutra. October 26. http://www.gamasutra.com/view/feature/2438/how_to_prototype_a_game_in_under_7_.php.

5 thoughts on “The Poetics of Metadata and the Potential of Paradata (Revised)”

  1. Actually, this concept goes back even further to Martha Nell Smith’s ideas (2007 and previous) about letting users manipulate the data to form their own unique editions:

    The “Introduction” to the Rotunda edition of ED’s correspondence has been available since early 2008 and discusses exactly this idea of user-generated editions (by Martha Nell Smith — no password or subscription via the press necessary to read the introduction). Martha also made a similar argument in the lead article of the Textual Cultures, “The Human Touch Software of the Highest Order: Revisiting Editing as Interpretation,” (2:1, Spring 2007, 1-15):

    Abstract: A rigid set of orthodoxies, a “right” way of doing editorial business, need not inform our practices in order for them to be principled, rigorous, and reliably according to standard. Instead the rule that should continue to inform all of what we do is the lesbian rule, lesbian not in the sense of Showtime’s The L Word series but in a seventeenth-century architectural sense. That lesbian rule was/is the principle invoked for difficult challenges in construction (such as arches, irregular corners, and the like) and thus is a principle that is pliant and accommodating in its faithful adherence to standards. In order for editorial praxes to obtain the rigor and sharp discipline required of principled methodologies, our pliant and accommodating standards need also to be more interdisciplinary and take into account the “messy” facts of authorship, production, and reception: race, class, gender, and sexuality. This essay extends some of my previous observations about technologies and texts to argue that embracing messy humanity in all its diversities, even as we embrace new technologies, is no longer a luxury for our community, it is a necessity.

  2. I reread your talk yesterday, and even played the fire alarm. So paradata is *any* kind of metadata on the threshold:
    (a) data (as in not data about data, but the data itself — the data that makes wefeelfine work, as you said)
    (b) the null data set. This, I think, is also just data as in (a). It’s unique in that it just comes from lack of information in certain categories.
    (c) bad data (e.g. wrong images)

    Paradata is what enables the text to be juicy and playful and interactive. Got it.

    If paradata is metadata on a threshold, I’m tempted to also apply it to things like genre categorizations in the metadata that don’t adequately describe a text — where the work rests between 2 or 3 genres in a liminal space that defies its data-ness and categorization. Jacqueline Wernimont suggested in her talk to use this moment as a place to actually build into a GUI and make the user confront the issue of “bad data” or inadequate genre categorizations — rather than just run with the bad data, as wefeelfine.org seems to do. Either way, the paradata creates play.

    Why limit paradata to a kind of metadata? It might also be data that verges on metadata, or wrong data as well. For example, if I tagged all the noises in The Italian in order to study Radcliffe’s sounds (for whatever reason), and my tagging and search didn’t catch all of them, that data that I missed might be paradata … it becomes metadata that describes what I missed in the search and also what I found in the search (it’s the null set). It indicates the places where the search failed, and becomes really interesting at the same time.

    I guess the limit of paradata is that, for me, data and metadata are always squishy, especially when dealing with literature. Turning pieces of a text into data always changes those pieces when you pull them apart, date them, categorize them, reshuffle them. Paradata should be reserved for a moment of data as NOT data, or metadata as NOT metadata, or either as bad data/metadata. Eek – hope this makes sense.

  3. Interesting. I really enjoyed the read, so thanks for sharing. I felt bad reading it on a Kindle (Instapaper’ed), since it is essentially an argument against plain text.

    That being said I completely agree with you. We need more juicy texts.

    I’ve been thinking about SEO in interactive narrative lately and when I read this I wondered how exactly it might fit in. Metadata is the primary key to SEO, but not necessarily the whole of it. The process of linking, social bookmarking etc… seems to me to fall into paradata (since it is metadata built on top of other metadata and manipulating it for consumption). Would the data fed to the webmaster through tools like Google Analytics or Webmaster Tools be paradata? I think we certainly make up stories about what we see people searching for on our website, or odd incoming links.

    Do you think a narrative could be built from Google Analytics reports on the consumption/usefulness of our metadata? (People have already been having fun looking at some popular Google searches.) Or incoming links? I’m thinking back to my old new media class projects, where people built stories out of blog posts, multiple MySpace accounts, or Twitter. Would they be more interesting at a greater level of distance, assuming they could be mediated in a juicy way?

    I think, from my understanding of your post, that the common searches that did not result in correct, or any, content would count as paradata. This sort of information is easy to find (or create, assuming a small, low traffic site) via tools like the ones I mentioned or the more purpose-targeted Lijit. I bet some fun could be had building a narrative or two there.

    Do you think that the manipulation of search engines with metadata that is essential to the success of ARGs adds a paradata (since we are seeing the interpretation of metadata in search results) narrative to the larger story?
