Google, aka the “Big Data Dredge in the Sky,” in addition to sucking up all your personal information in order to make you want to buy stuff, has also scanned 5.2 million books. Ten corpora consisting of about 4% of every word and phrase published are now available for public download. Also included is a nifty little free tool that allows those of us with too much time on our hands to do word frequency searches in books published within the last 500 years. Results are output to a graph mapping the year-by-year frequency of each term in English, French, German, Spanish, Russian and Simplified Chinese.
I have completed initial testing and can report that this is going to be a real time waster. In anticipation of my readers’ interests, here are some initial findings: George Carlin’s Seven Dirty Words, which may not be appropriate for TV viewers, are showing steady growth, and despite a dip during the Reagan years, have returned to pre-1810 levels of frequency, with the exception of one four-letter word beginning with F, which still has a long way to go.
In yet another half-hearted attempt to keep this blog out of the gutter, I also did a search of the names of a few well-known painters (Michelangelo, Rembrandt, Ingres, Van Gogh and Picasso) to see if I could discern any patterns, which I could not. That is, until I checked my spelling and capitalization, upon which I discovered that Van Gogh can’t hold a paintbrush to Picasso when it comes to spilled ink in books.
“‘The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,’ said Erez Lieberman Aiden, a junior fellow at Harvard’s Society of Fellows. Mr. Lieberman Aiden and Jean-Baptiste Michel, a postdoctoral fellow at Harvard, assembled the data set with Google and spearheaded a research project to demonstrate how vast digital databases can transform our understanding of language, culture and the flow of ideas.” This from Patricia Cohen in the NYT earlier this month.
Surely there is great potential in this tool, and scholars will no doubt outdo the achievements of us 8-year-olds and bloggers with ADD in search of dirty words. (“Attention Deficit Disorder” beat out “short attention span” in 1983. “Clueless” had to wait until 1995.)
But it’s a little sad too, in that these kinds of searches require no citizenship in the Republic of Letters. (To track the dimming of the Enlightenment, click here.) Thanks to this database and Google in general, information is instantly parsed and analyzed without the scholarly drudgery of ages past, and without the kind of deep understanding that comes with full immersion. Obviously this tool is not intended to replace scholarly passions, but is intended as just another quill in the inkstand of the humanities. But surely, in the name of convenience alone, this kind of approach means that old understandings must be lost, distancing us from full appreciation of the giants whose shoulders we stand on.
In the same sense, the Web has removed the need for study of any kind, allowing us to pick up any data point convenient to a particular need of the moment in a few keystrokes. We need only skate across the pixelated surface of the deep reservoirs of humanity’s accumulated knowledge, moving fast, without reflection.
Regrettable indeed, but on the other hand, I’ve been at this for a while, so time to move on to Maru the Cat videos on YouTube, where I invite you to join me.