Friday, September 02, 2011

Searching online in Hebrew with imperfect OCR.

This is just a short reminder that when searching, it is not enough to spell terms correctly; you also have to be aware of which letters OCR confuses with one another. These are not only the obvious pairs, like dalet and resh. There are many others, such as ayin and mem.

For example, the search / site:hebrewbooks.org ימקב / returns no fewer than 11,400 results. To be clear, the word we would like to see in the results is "יעקב," not "ימקב." But presumably many, if not all, of these 11,400 results would simply not show up if you searched for the properly spelled word. A search for ימקב on Otzar HaHochma returns 674 books (and Otzar HaHochma actually has some pretty sophisticated advanced search features which take some of this into account).

Chet and mem are commonly confused as well. Hebrewbooks.org returns 167 results for "תלמיד מכם" and Google Books returns 74. I was pleasantly surprised to see that Google actually "asked" if I meant to search for "תלמיד חכם."

Most of the time these don't matter, but it would matter if that one result you need doesn't turn up, wouldn't it? Same thing in English and other languages. Google Books seems to confuse u and n, for example.

It would actually be a good idea to compile some kind of list of letters which search engines commonly confuse, for the purpose of OCR-dependent searching.
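As a rough illustration of how such a list could be put to use, here is a minimal Python sketch that takes a correctly spelled term and generates the misspellings an OCR engine might have produced, so that each can be searched as well. The confusion pairs are only the ones mentioned in this post, not a measured or complete list, and the names `CONFUSION_PAIRS`, `build_table` and `ocr_variants` are made up for this example.

```python
from itertools import product

# Illustrative confusion pairs taken from this post only.
# This is an assumption-laden starter list, not a complete or measured one.
CONFUSION_PAIRS = [
    ("ד", "ר"),  # dalet / resh
    ("ע", "מ"),  # ayin / mem
    ("ח", "מ"),  # chet / mem
    ("u", "n"),  # Latin u / n
]


def build_table(pairs):
    """Map each letter to the set of letters it may be confused with."""
    table = {}
    for a, b in pairs:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table


def ocr_variants(term, pairs=CONFUSION_PAIRS, max_errors=1):
    """Return spellings of `term` with 1..max_errors letters swapped
    for a look-alike. The original spelling is not included."""
    table = build_table(pairs)
    # For every position: the original letter first, then its look-alikes.
    options = [[ch] + sorted(table.get(ch, ())) for ch in term]
    variants = []
    for combo in product(*options):
        errors = sum(1 for old, new in zip(term, combo) if old != new)
        if 0 < errors <= max_errors:
            variants.append("".join(combo))
    return variants


if __name__ == "__main__":
    for variant in ocr_variants("יעקב"):
        # Print a query string that could be pasted into a search engine.
        print(f"site:hebrewbooks.org {variant}")
```

With the pairs above, `ocr_variants("יעקב")` yields only ימקב, and the variants of "תלמיד חכם" include "תלמיד מכם". Capping the number of swapped letters with `max_errors` keeps the list of queries short; raising it makes the number of variants grow quickly for longer terms.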

9 comments:

  1. Thanks for posting this. I realize that for this reason I probably miss a lot of helpful results; however, the truth is I'm too lazy to start working out possible spelling errors with OCR searches. If there were a list, though, perhaps it would encourage me a bit more.

  2. Computer Enthusiast, 4:36 PM, September 03, 2011

    How dare you say such a thing? OCR is 100% perfect. Computers are the way of the future. I have already put all my seforim into genizah, now that I have a computer.

  3. I don't know when or if the day will come that people will sit across from one another in a beis medresh, each looking at their own computer to learn from and not from seforim.

    (However I'm sure it has been done somewhere by now!)

  4. Um, Yehoshua? I was doing that with a chevruso back in 2005.

  5. Computer Enthusiast, while you have a point to make, what does it have to do with the subject here? If we're talking about a method of searching then you probably wouldn't have come across the thing in the first place. First, because most of us don't have tens of thousands of seforim, secondly because other than the obvious sources and references you probably just won't stumble across too many obscure, yet relevant things on your own.

    Yehoshua, this isn't for every day, normal searching. I would say that most words are correctly OCRed. But if you're really trying to dig deep then you have to keep this in mind.

  6. Mar Gavriel - I live in Eretz Yisrael in charedi neighborhoods. Computers only became common/accepted within the last few years bclal in a beis medresh...

    S. - Now that I have your attention reply back to my emails ;)

  7. http://somehowfrum.blogspot.com/2009/10/internet-and-future-of-learning.html

    Yehoshua, I posted this almost 2 years ago. Enjoy.

  8. I had trouble once with Hehs; they came out as Samechs or Mem sofits, I don't remember.

  9. The OCR on the JTS seforim like Liberman and Ginzberg regularly mixes up samech and mem sofit, also nun and gimmel. It is wrong as often as it is right.
