Thursday, October 11, 2007

Shakespeare, oft-repeated tropes, long Ss, Google Books and digitization.

Volokh Conspiracy has a good post called 'How much did Shakespeare embiggen* the English vocabulary?'

Basically, there is an old trope that the great English moshel William Shakespeare introduced a huge number of neologisms (words which he himself coined) into the language, many of which are still in use. While there is some truth to this, it seems that the number is vastly overstated. Sources which Volokh cites claim that fully 1700 words were created by Shkspr, words like "majestic," "pious" and "obscene."

Nowadays digitization is all the rage in information dissemination. There are all sorts of companies working on digitizing an amazing range of literature, and it's no secret that my blogging is highly influenced by the opportunity this affords. Anyway, using Chadwyck-Healey's Early English Books Online [EEBO], Volokh can show that many of these words were not coined by Shakespeare, but are found in published works that precede him.

The source for this trope (apart from its grain of truth) is the majestic** Oxford English Dictionary, which often cites Shakespeare as the first usage of a word in literature. Now, this is not the OED's fault. First of all, it doesn't claim that the earliest usage its editors were able to find is the earliest usage in all of literature. Secondly, the earliest usage in print or handwriting is evidence of usage, not of coinage.

So it's very cool that non-professionals (and professionals) now have tools at their fingertips to do research that normally would require trips to great libraries and jumping through hoops for access to 500 year old books that you may look at (but not touch), so long as you can explain to someone else why exactly you need to look at them.

However, it's crucial not to think of digitization as a panacea. For one thing, things exist even if they haven't been scanned. For another, OCR (optical character recognition) is still in its infancy. One small example: in Hebrew, teaching a computer to tell apart a ג and a נ or a ד and a ר is hard enough. But the problem isn't limited to non-Latin alphabets.

Go to Google Books and try to find all the references to the word "masorah" that occur in the books published between 1700 and 1900 that they've digitized. 504 results. But don't forget to search for "maforah" as well.

You try telling a computer that there is no difference between an "s" and a long S, or rather that an ſ (long ess) isn't an f (eff).

This will be fixed,*** but that will take time.
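
Until then, the workaround is to search for the misread spellings too. Here is a minimal sketch in Python of how one might generate them. The function name `long_s_variants` is made up for this post, and the rule it applies is a simplification I'm assuming for illustration: any lowercase "s" that isn't the last letter of the word might have been printed as a long s and then read by the OCR as an "f". Real typesetting conventions varied by printer and period, so treat the output as candidate search terms and nothing more.

```python
from itertools import product

def long_s_variants(word):
    """Return the spellings an OCR engine might produce for `word`
    if it misreads a long s as 'f'.

    Assumption (a simplification): every lowercase 's' except a
    word-final one may have been printed as a long s.
    """
    last = len(word) - 1
    choices = []
    for i, ch in enumerate(word):
        if ch == "s" and i != last:
            # This 's' could have survived as 's' or been misread as 'f'.
            choices.append(("s", "f"))
        else:
            # Every other letter stays as it is.
            choices.append((ch,))
    # Build every combination of the per-letter choices.
    return ["".join(combo) for combo in product(*choices)]

print(long_s_variants("masorah"))  # ['masorah', 'maforah']
```

Each spelling in the list would then be run as its own search and the results combined. Note that a word with several candidate letters produces many variants (three of them means eight spellings), most of which will turn up nothing.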

* If you get the Simpsons reference, you get it. NDY LKWMEV
** That's the leitwort of this post.
*** In fact, EEBO's technology pretty nicely distinguishes between long s and f; however, Google's doesn't.
