Thursday, December 30, 2010

Assisted Reading vs. Data Mining

I've started thinking that there's a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I'd call them:
  1. Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
  2. Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.
Humanists are far more comfortable with the first than the second. (That's partly why they keep calling the second type of work 'text mining', even though I think the field has moved on from that label--it sounds sinister.) Basic search, which everyone uses on J-stor or Google Books, is far more algorithmically sophisticated than a text-mining star like Ngrams. But since it promises merely to enable reading, it has casually slipped into research practices without much thought.

The distinction is important because the way we use texts is tied to humanists' reactions to new work in digital humanities. Ted Underwood started an interesting blog to look at ngrams results from an English lit perspective: he makes a good point in his first post:

Monday, December 27, 2010

Call numbers

I finally got some call numbers. Not for everything, but for a larger portion than I expected: about 7,600 records, or c. 30% of my books.

The HathiTrust Bibliographic API is great. What a resource. There are a few odd tricks I had to put in to account for their integrating various catalogs together (Michigan call numbers are filed under MARC 050 (Library of Congress catalog), while California ones are filed under MARC 090 (local catalog), for instance, although they both seem to be basically an LCC scheme). But the openness is fantastic--you just plug an OCLC or LCCN identifier into a URL string to get an XML record. It's possible to get a lot of OCLCs, in particular, by scraping Internet Archive pages. I haven't yet found a good way to go the opposite direction, though: from a large number of specially chosen Hathi catalogue items to IA books.
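For the curious, the lookup is simple enough to sketch in a few lines of R. This is a rough sketch, not the exact code I'm running, and the endpoint pattern here is an assumption--check the API documentation before leaning on it:

# Sketch of a HathiTrust Bibliographic API lookup. The URL pattern is
# reconstructed from memory; treat it as an assumption.
hathi.record <- function(id, type = c("oclc", "lccn")) {
  type <- match.arg(type)
  api.url <- paste("http://catalog.hathitrust.org/api/volumes/full/",
                   type, "/", id, ".json", sep = "")
  # returns the raw response; the full record embeds the MARC data
  paste(readLines(api.url, warn = FALSE), collapse = "\n")
}
# e.g. record <- hathi.record("424023", "oclc")   # OCLC number is just an example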

This lets me get a slightly better grasp on what I have. First, a list of how many books I have for each headline LC letter:
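Producing that tally takes only a line or two once the call numbers are in hand; a minimal sketch, with a hypothetical column name:

# Sketch: tally books by the first (headline) letter of their LC call number.
# 'books$callnumber' is a hypothetical column holding strings like "PS1305 .A1 1884".
lc.letter <- toupper(substr(books$callnumber, 1, 1))
lc.letter <- lc.letter[lc.letter %in% LETTERS]   # drop empty or malformed entries
sort(table(lc.letter), decreasing = TRUE)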

Sunday, December 26, 2010

Finding keywords

Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and did a couple examples using months and some abstract nouns. Two of the problems I've had with getting useful data out of this approach are:

  1. What words to use? I have 200,000, and processing those would take at least 10 times more RAM than I have (2GB, for the record). 
  2. What books to use? I can—and will—apply them across the whole corpus, but I think it's more useful to use the data to draw distinctions between types of books we know to be interesting.
I've got tentative solutions to both those questions. For (2), I finally figured out how to get a substantial number of LCC call numbers into my database (for about 30% of the books). More on that later; I'm obviously excited about it. But I finally did some reading to get a better answer for (1), too. This is all still notes and groundwork-laying, so if you're reading for historical analysis or DH commentary, this is the second of several skippable posts. But I like this stuff because it gives us glimpses at the connections between semantics, genre, and word-use patterns.

Basically, I'm going to start off using tf-idf weight. A while ago, I talked about finding "lumpy" words. Any word appears in x books, and y times overall. We can plot that. (I'm using the data from the ngrams 1-set here instead of mine, because it has a more complete set of words. There are lots of uses for that data, for sure, although I keep finding funny little mistakes in it that aren't really worth blogging—they seem to have messed up their processing of contractions, for instance, and their handling of capital letters forces some guess-work into the analysis I'm doing here). Each blue dot in this graph is a word: the red ones are the 1000 or so ones that appear a lot but in fewer books than you'd think. Those words should be more interesting for analysis. 
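For concreteness, here's roughly how a chart like that gets built--a sketch with hypothetical column names, not my actual plotting code:

# Sketch of the "lumpiness" plot. 'words' is a hypothetical data frame with one
# row per word: 'count' is total occurrences (y), 'bookcount' is the number of
# books the word appears in (x).
plot(words$bookcount, words$count, log = "xy", pch = ".", col = "blue",
     xlab = "books containing the word", ylab = "total occurrences")

# Highlight words that occur often but in fewer books than the overall trend
# predicts -- a rough stand-in for the red points.
fit <- lm(log(count) ~ log(bookcount), data = words)
lumpy <- order(residuals(fit), decreasing = TRUE)[1:1000]
points(words$bookcount[lumpy], words$count[lumpy], pch = ".", col = "red")

# And the idf half of tf-idf, which rewards exactly that kind of concentration:
n.books <- 30000                                # assumed corpus size
words$idf <- log(n.books / words$bookcount)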

Thursday, December 23, 2010

What good are the 5-grams?

Dan Cohen gives the comprehensive Digital Humanities treatment of Ngrams, and he mostly gets it right. There's just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there's no reason to use the ngrams data rather than just downloading the original books, because:

  1. Ngrams are not complete; and
  2. Were they complete, they wouldn't offer significant computing benefits over reading the whole corpus.
Edit: let me intervene after the fact and change this from a rhetorical to a real question. Am I missing some really important research applications of the 5-grams in what follows? Another way of putting it: has the dump that Google did for the non-historical ngrams in 2006 been useful in serious research? I don't know, but I suspect it might have been.

Second Principals

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I'm going to try again. This post is largely a test of whether I can explain principal components analysis to people who don't know about it, so: correct me if you already understand PCA, and let me know what's unclear if you don't. (Or, it goes without saying, skip it.)

Start with an example. Let's say I'm interested in social theory. I can take two words—"social" and "political"—and count how frequent each of them is--something like two or three out of every thousand words is one of them. I can even make a chart, where every point is a book, with one axis the percentage of words in that book that are "social" and the other the percentage that are "political." I put a few books on it just to show what it looks like:
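A chart like that takes only a few lines to make; a minimal sketch, assuming a hypothetical matrix of per-book word shares:

# Sketch of the two-word chart. 'freqs' is a hypothetical matrix with one row per
# book and one column per word; entries are each word's share of the book's words.
x <- freqs[, "social"]    * 1000    # express as words per thousand
y <- freqs[, "political"] * 1000
plot(x, y, xlab = '"social" per 1,000 words', ylab = '"political" per 1,000 words')
text(x, y, labels = rownames(freqs), pos = 3, cex = 0.6)   # label points with titles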



Sunday, December 19, 2010

Not included in ngrams: Tom Sawyer

I wrote yesterday about how well the filters used to exclude some books from ngrams work to improve the quality of year information and OCR compared to Google Books as a whole.

But we have no idea what books are in there. There's no way to connect the data back to the texts.

I'm particularly interested in how they deal with subsequent editions of books. Their methodology (pdf) talks about multiple editions of Tom Sawyer. I think it says that they eliminate multiple copies of the same edition but keep different years.

I thought I'd check this. There are about 5 occasions in Tom Sawyer where the phrase "Huck said" appears with separating quotes, and 11 for "said Huck." Both are phrases that basically appear only in Tom Sawyer in the 19th century (the latter also has a tiny life in legal contracts involving huckaback, and a few other places), so we can use them as a fair proxy for different editions. The first edition of Tom Sawyer was 1876: there are loads of later ones, obviously. Here's what you get from ngrams:



Three big spikes around 1900, and nothing before. Until about 1940, the ratio is somewhat consistent with the internal usage in the book, 11 to 5, although "said Huck" is a little overrepresented, as we might expect. Note:
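A side note on mechanics: anyone who wants to check the same counts against the downloadable 2-gram files, rather than the web viewer, can do it with something like the sketch below. The tab-separated layout (ngram, year, match count, page count, volume count) is my recollection of the version-1 release, and the filename is made up, so verify both.

# Sketch: filter the (very large) Google 2-gram file for the two phrases.
# Column layout and filename are assumptions.
cols <- c("ngram", "year", "matches", "pages", "volumes")
con <- file("googlebooks-eng-all-2gram-8.csv", "r")   # hypothetical filename
huck <- data.frame()
while (length(chunk <- readLines(con, n = 500000)) > 0) {
  hits <- grep("^(Huck said|said Huck)\t", chunk, value = TRUE)
  if (length(hits) > 0) {
    parsed <- read.delim(textConnection(hits), header = FALSE, col.names = cols)
    huck <- rbind(huck, parsed)
  }
}
close(con)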

Saturday, December 18, 2010

State of the Art/Science

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you're pretty much guaranteed an explosion of theories and methods.

Some of the theories are deeply interesting. I really like the censorship stuff. That really does deal with books specifically, not 'culture,' so it makes a lot of sense to do with this dataset. The stuff about half-lives for celebrity fame and particularly for years is cool, although without strict genre controls and a little more context I'm not sure what it actually says--it might be something as elegiac as the article's "We are forgetting our past faster with each passing year," but there are certainly more prosaic explanations. (Say: 1) footnotes are getting more and more common, and 2) footnotes generally cite more recent years than does main text. I think that might cover all the bases, too.) Yes, the big ideas, at least the ones I feel qualified to assess, are a little fuzzier—it's hard to tell what to do with the concluding description of wordcounts as "a great cache of bones from which to reconstruct the skeleton of a new science," aside from marveling at the Brooksian-Freedmanian tangle of metaphor. (Sciences once roamed the earth?) But although a lot of the language of a new world order (have you seen the "days since first light" counter on their web page?) will rankle humanists, that fuzziness about the goals is probably good. This isn't quite sociobiology redux, intent on forcing a particular understanding of humanity on the humanities. It's just a collection of data and tools that they find interesting uses for, and we can too.


But it's the methods that should be more exciting for people following this. Google remains ahead of the curve in terms of both metadata and OCR, which are the stuff of which digital humanities is made. What does the Science team get?

Friday, December 17, 2010

Missing humanists

(First in a series on yesterday's Google/Harvard paper in Science and its reception.)

So there are four things I'm immediately interested in from yesterday's Google/Harvard paper.

  1. A team of linguists, computer scientists and other non-humanists published that paper in Science about using Google data for word counts to outline the new science of 'culturomics';
  2. They described the methodology they used to get word counts out of the raw metadata and scans, which presumably represents the best Google could do in 2008-09;
  3. Google released a web site letting you chart the shifts in words and phrases over time;
  4. Google released the core data powering that site, with counts of word, book, and page occurrences for various combinations of words.

Twitter seems largely focused on #3 as a fascinating tool/diversion, the researchers seem to hope that #1 will create a burst of serious research using #4, and anyone doing research in the field should be eagerly scanning #2 for clues about what the state of the art is—how far you can get with full cooperation from Google, with money to hire programmers, etc., and with unlimited computing infrastructure.


Each of these is worth thinking about in turn. Cut through all of it, though, and I think the core takeaway should be this:

Humanists need to be more involved in how these massive stores of data are used.

Thursday, December 16, 2010

Culturomics

Days from when I said "Google Trends for historical terms might be worse than nothing" to the release of "Google ngrams": 12. So: we'll get to see!


Also, I take back everything I said about 'digital humanities' having unfortunate implications. "Culturomics"—like 'culturenomics', but fluffier?—takes the cake.


Anyway, I should have some more thoughts on this later. I have them now, I suppose, but let me digest. For now, just dwell on the total lack of any humanists in that article promising to revolutionize the humanities.

Tuesday, December 14, 2010

How Bad is Internet Archive OCR?

We all know that the OCR on our digital resources is pretty bad. I've often wondered if part of the reason Google doesn't share its OCR is simply that it would show so much ugliness. (A common misreading, 'tlie' for 'the', gets about 4.6m results in Google Books.) So how bad is the Internet Archive OCR, which I'm using? I've started rebuilding my database, and I put in a few checks to get a better idea. Allen already asked some questions in the comments about this, so I thought I'd dump it onto the internet, since there doesn't seem to be that much out there.

First: here's a chart of the percentage of "words" that lie outside my list of the top 200,000 or so words. (See an earlier post for the method). The recognized words hover at about 91-93 percent for the period. (That it's lowest in the middle is pretty good evidence the gap isn't a product of words entering or leaving the language).
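The check itself is nothing fancy--something like this, with a hypothetical vector of one book's tokens:

# Sketch of the recognized-word check. 'wordlist' is the top-200,000 word list;
# 'tokens' is a hypothetical character vector of one book's OCR'd words.
recognized <- tolower(tokens) %in% wordlist
sum(recognized) / length(recognized)    # fraction inside the list; ~0.91-0.93 here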

Now, that has flaws in both directions. Here are some considerations that would tend to push the OCR error rate on a word basis lower than 8%:

Avoidance tactics

Can historical events suppress use of words? Usage of the word 'panic' seems to spike down around the bank panics of 1873 and 1893, and maybe 1837 too. I'm pretty confident this is just an artifact of my plugging a lot of words in to test out how fast the new database is and finding some random noise. There are too many reasons to list: 1857 and 1907 don't have the pattern, the rebound in 1894 is too fast, etc. It's only 1873 that really looks abnormal. What do you think:
But it would be really interesting if true--in my database of mostly non-newsy texts, do authors maybe shy away from using words that have too specific a meaning at the present moment? Lack of use might be interesting in all sorts of other ways, even if this one is probably just a random artifact.

Sunday, December 12, 2010

Capitalist lackeys

I'm interested in the ways different words are tied together. That's sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for "scientific method," but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I'm going to think this through staying with "capitalist" as the word of the day. Fair warning: this post is a rambler.

Earlier I looked at some sentences to conclude that language about capitalism has always had critics in the American press (more, Dan said in the comments, than some of the historiography might suggest). Can we find this by looking at numbers, rather than just random samples of text? Let's start with a log-scale chart of what words get used in the same sentence as "capitalist" or "capitalists" between 1918 and 1922. (I'm going to just say capitalist, but my numbers include the plural too.)
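The mechanics behind a chart like that are roughly as follows; this is a sketch with naive sentence-splitting and made-up object names, not my actual pipeline:

# Sketch: tally words that share a sentence with "capitalist"/"capitalists".
# 'texts' is a hypothetical character vector of full book texts, 1918-1922.
sentences <- unlist(strsplit(texts, "[.!?]+"))              # crude sentence split
hits <- grep("\\bcapitalists?\\b", sentences, value = TRUE, ignore.case = TRUE)
words <- unlist(strsplit(tolower(hits), "[^a-z]+"))
words <- words[words != "" & !words %in% c("capitalist", "capitalists")]
counts <- sort(table(words), decreasing = TRUE)
barplot(log10(head(counts, 40)), las = 2, cex.names = 0.6,
        ylab = "log10 occurrences in 'capitalist' sentences")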


Thursday, December 9, 2010

Metadata for OCR books

A commenter asked why I don't improve the metadata instead of doing this clustering stuff, which seems merely to reproduce, poorly, the work of generations of librarians in classifying books. I'd like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I'm going to think through what I know, but I'd love any advice on this because it's really outside my expertise.

Wednesday, December 8, 2010

First Principals

Let me get ahead of myself a little.

For reasons related to my metadata, I had my computer assemble some data on the frequencies of the most common words (I explain why at the end of the post.) But it raises some exciting possibilities for using forms of clustering and principal components analysis (PCA); I can't resist speculating a little bit about what else it can do to help explore the ways different languages intersect. With some charts at the bottom.
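To make that concrete, the core of it is just a couple of calls to R's prcomp--a sketch, run on a hypothetical book-by-word frequency matrix:

# Sketch: PCA on a book-by-word frequency matrix. 'freqs' is hypothetical: rows
# are books, columns are the few hundred most common words, entries are each
# word's share of the book's total words.
pca <- prcomp(freqs, scale. = TRUE)
summary(pca)                        # how much variance the first few components carry
plot(pca$x[, 1], pca$x[, 2],
     xlab = "first principal component", ylab = "second principal component")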

Monday, December 6, 2010

Back to the Future

Maybe this is just Patricia Cohen's take, but it's interesting to note that she casts both of the text mining projects she's put on the Times site this week (Victorian books and the Stanford Literature Lab) as attempts to use modern tools to address questions similar to those posed by the vast, comprehensive tomes written in the 1950s. There are good reasons for this. Those books are some of the classics that informed the next generation of scholarship in their field; they offer an appealing opportunity to find people who should have read more than they did; and, more than some recent scholarship, they contribute immediately to questions that are of interest outside narrow disciplinary communities. (I think I've seen the phrase 'public intellectuals' more times in the four days I've been on Twitter than in the month before.) One of the things that the Times articles highlight is how this work can re-engage a lot of the general public with current humanities scholarship.

But some part of my ABD self is a little uncomfortable with reaching so far back. As important as it is to get the general public on board with digital humanities, we also need to persuade less tech-interested, but theory-savvy, scholars that this can create cutting edge research, not just technology. The lede for P. Cohen's first article—that the Theory Wars can be replaced by technology—isn't going to convince many inside the academy. Everybody's got a theory. It's better if you can say what it is.

The Age of Capital–

Dan asks for some numbers on "capitalism" and "capitalist" similar to the ones on "Darwinism" and "Darwinist" I ran for Hank earlier. That seems like a nice big question I can use to warm up the new database I set up this week and to get some basic functionality written into it.

I'm going to go step-by-step here at some length to show just how cyclical a process this is--the computer is bad at semantic analysis, and it requires some actual knowledge of the history involved to get anything very useful out of the raw data on counts. A lot of comments on semantic analysis make it sound like it's asking computers to think for us, so I think it's worth showing that most of the R functions I'm using generally operate at a pretty low level--doing some counting, some index work, but nothing too mysterious.
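To give a flavor of how low-level that is, the heart of a per-year count is little more than this (object names hypothetical):

# Sketch: yearly totals for one word, done the low-level way. 'counts' is a
# hypothetical numeric vector of per-book occurrences of "capitalist" (plus the
# plural), and 'catalog$year' gives each book's publication year.
capitalist.by.year <- tapply(counts, catalog$year, sum)
plot(as.numeric(names(capitalist.by.year)), capitalist.by.year, type = "l",
     xlab = "year", ylab = 'occurrences of "capitalist(s)"')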

Saturday, December 4, 2010

Full-text American versions of the Times charts

This verges on unreflective data-dumping, but because it's easy and I think people might find it interesting, I'm going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers on the same terms for which the Times published Cohen's charts of title word counts. I've tossed in a couple of extra words where it seems interesting—including some alternate word-forms that tell a story, using a Perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren't many American books from before, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends--thank you, Sonny Bono.) In some cases (that 1874 peak for science, for instance), the American and British trends are surprisingly close. Sometimes, they aren't.

This is pretty close to Cohen's chart, and I don't have much to add. In looking at various words that end in -ism, I got some sense earlier of how individual religious discussions--probably largely in history—peak at substantially different times. But I don't quite have the expertise in American religious history to fully interpret that data, so I won't try to plug any of it in.

Today's Times Article

Patricia Cohen's new article about the digital humanities doesn't come with the rafts of crotchety comments the first one did, so unlike last time I'm not in a defensive crouch. To the contrary: I'm thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I'll post my versions of the charts the Times published.

Now with actual text!

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher's control. I've noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.


This is all by way of showing off the latest thing it lets me do--get examples of actual usage so we can do semantic processing ourselves, rather than trying to have a computer do it poorly. It might be good to put some tests like this into the code by default, as a check on interpretive hubris. I need to put the years and titles in here too, but if we just take a random set of samples of the language of natural selection, I think it's already clear that we get an interesting new form of text to interpret; it's sort of like reading the usage examples in the OED, except that we can create much more interesting search constraints on where our passages come from.


> get.usage.example("natural selection",sample(books,1))
[1] "we might extend the parallel and get some good illustrations of natural selection from the history of architecture and the origin of the different styles under different climates and conditions"

Friday, December 3, 2010

Quick, extremely relevant outlinks

Dan Cohen, the hub of all things digital history, in the news and on his blog.

What's worth knowing?

I have my database finally running in a way that lets me quickly select data about books. So now I can start to ask questions that are more interesting than just how overall vocabulary shifted in American publishers. The question is, what sort of questions? I'll probably start to shift to some of my dissertation stuff, about shifts in verbs modifying "attention", but there are all sorts of things we can do now. I'm open to suggestions, but here are some random examples:

1. How does the vocabulary used around slavery change between the 1850s and the 1890s, or the 1890s and the 1920s? Probably the discursive space widens--but in what kind of ways, and what sorts of authors use rhetoric of slavery most freely?

2. How do various social and political words cluster by book in the progressive era? Maybe these are words that appear disproportionately often in a sentence with "reform." Can we identify the closeness of ties between various social movements (suffragism, temperance, segregation, municipal government) based on some sort of clustering of co-mentions in books, as I did for the isms?

Questions don't have to be historical, either: they can plug in to other American Studies areas:

3. What different sorts of words are used to modify 'city' or 'crowd' in the novels of (say) Howells, James, and Dreiser? How does it change over time within some of them?

4. What sorts of books discuss the plays of Shakespeare between 1850 and 1922--can we identify a shift in a) the sorts of books writing about him that could confirm some Highbrow/Lowbrow stuff, or b) the particular plays that get mention or praise?

Centennials, part II

So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting--how does the publishing industry focus in on certain figures to create news or resurgences of interest in them?  I love the way we get excited about the civil war sesquicentennial now, or the Darwin/Lincoln year last year.

I was asking if this spike in mentions of Thoreau in 1917 is extraordinary or merely high.
Emerson (1903) doesn't seem to have much of a spike--he's up in 1904 with everyone, although Hawthorne, whose centenary is 1904, isn't up very much.

Can we look at the centennial spikes for a lot of authors? Yes. The best way would be to use a biographical dictionary or Wikipedia or something, but I can also just use the years built into some of my author metadata to get a rough list of authors born between 1730 and 1822, so they can have a centenary during my sample. A little grepping gets us down to a thousand or so authors. Here are the ten with the most books, to check for reliability:
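In code, that filtering is only a few lines; a sketch, with a made-up regular expression for pulling birth years out of the author field:

# Sketch: find authors born 1730-1822 and list the ten with the most books.
# 'catalog$author' is assumed to look something like "Emerson, Ralph Waldo, 1803-1882".
birth <- suppressWarnings(as.numeric(
  sub(".*?(1[678][0-9]{2})-.*", "\\1", catalog$author, perl = TRUE)))
eligible <- catalog[!is.na(birth) & birth >= 1730 & birth <= 1822, ]
head(sort(table(eligible$author), decreasing = TRUE), 10)   # ten most prolific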

Centennials, part I.

I was starting to write about the implicit model of historical change behind loess curves, which I'll probably post soon, when I started to think some more about a great counterexample to the gradual change I'm looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.

I've always been interested in tracking changes in historical memory, and this is a good place to do it. I talked about the Gettysburg sesquicentennial earlier, and I think all the stuff about the civil war sesquicentennial (a word that doesn't show up in my top 200,000, by the way) prompted me to wonder whether the commemorations a hundred years ago helped push forward practices in the publishing industry of more actively reflecting on anniversaries. Are there patterns in the celebration of anniversaries? For once my graphs will be looking at the spikes, not the general trends. With two exceptions to start: the words themselves:
So that's a start: the word centennial was hardly an American word at all before 1876, and it didn't peak until 1879. The loess trend puts the peak around 1887. So it seems like not only did the American centennial put the word into circulation, it either remained a topic of discussion or spurred a continuing interest in centennials of Founding-era events for over a decade.
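For the record, the smoothing is just R's loess function; a sketch, run on a hypothetical data frame of yearly rates:

# Sketch: fit a loess trend to yearly counts of "centennial". 'centennial' is a
# hypothetical data frame with columns 'year' and 'rate' (uses per million words).
fit <- loess(rate ~ year, data = centennial, span = 0.3)    # span is a guess
plot(centennial$year, centennial$rate, type = "h",
     xlab = "year", ylab = 'uses of "centennial" per million words')
lines(centennial$year, predict(fit), col = "red")
centennial$year[which.max(predict(fit))]                    # year of the smoothed peak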

Thursday, December 2, 2010

Do it yourself

Jamie's been asking for some thoughts on what it takes to do this--statistics backgrounds, etc. I should say that I'm doing this, for the most part, the hard way, because 1) My database is too large to start out using most tools I know of, including I think the R text-mining package, and 2) I want to understand how it works better. I don't think I'm going to do the software review thing here, but there are what look like a lot of promising leads at an American Studies blog.

As for whether the courses exist, I think they do from place to place: Stephen Ramsay says he's taught one at Nebraska for years.

It's easy to follow a few of these links and quickly end up drinking from a firehose of information. I get two initial impressions: 1) English is ahead of history on this; 2) there are a lot of highly developed applications for doing similar things with text analysis. The advantage is that it's leading me to think more carefully about how my applications are different than other people's.

Wednesday, December 1, 2010

Digital Humanities and Humanities Computing

I've had "digital humanities" in the blog's subtitle for a while, but it's a terribly offputting term. I guess it's supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn't appeal to most humanists of the tweed– and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining it.

It's too easy to think Digital Humanities is about teaching people to think like computers, when it really should be about making computers think like humanists.* What we want isn't digital humanities; it's humanities computing. To some degree, we all know this is possible—we all think word processors are better than pen and paper, or JSTOR better than buried stacks of journals (musty musings about serendipity aside). But we can go farther than that. Manfred Kuehn's blog is an interesting project in exploring how notetaking software can reflect and organize our thinking in ways that create serendipity within one person's own notes. I'm trying to figure out ways of doing that on a larger body of texts, but we could think of those as notes themselves.

Programming and other Languages

Jamie asked about assignments for students using digital sources. It's a difficult question.

A couple of weeks ago someone referred an undergraduate to me who was interested in using some sort of digital maps for a project on a Cuban emigre writer, like the ones I made of Los Angeles German emigres a few years ago. Like most history undergraduates, she didn't have any programming background, and she didn't have a really substantial pile of data to work with from the start. For her to do digital history, she'd have to type hundreds of addresses and dates off of letters from the archives, and then learn some sort of GIS software or the Google Maps API, without any clear payoff. No one would get much out of forcing her to spend three days playing with databases when she's really looking at the contents of letters.