Humans helping out book scanners
May 25th, 2007

Led by Luis von Ahn, a group of Carnegie Mellon University programmers has created a method that uses humans' superior pattern recognition to help computers with words they have trouble identifying. It builds on the Captcha technique: people type words taken from scanned books, as illustrated above and further explained by ReCaptcha here. Reporting on this new way of harnessing human recognition prowess, ZDNet says in part:

It’s a new example of how the Internet can harness the collective energies of large numbers of people. Other examples include news sites such as Digg and Slashdot, which give prominence to content that users rate highly, and stock photography seller iStockphoto, which is beta testing an Image Fight site to rate photo quality.

ReCaptcha has the potential to digitize vast quantities of words. Von Ahn estimates that people perform 60 million Captcha (Completely Automated Public Turing test to tell Computers and Humans Apart) tests daily.

The service presents users with two words, one from a conventional Captcha test and the other an unknown word that a computerized optical character recognition couldn’t figure out. If the user correctly identifies the known word, he or she is presumed to have decoded the unknown one. Currently, ReCaptcha requires three separate people to digitize the word the same before it’s determined to be correct, von Ahn said.

Von Ahn was a member of the Carnegie Mellon team that developed Captcha in response to a Yahoo request for technology to keep computers from registering for bogus e-mail accounts, according to Carnegie Mellon. He’s a recipient of a MacArthur Foundation “genius” grant, which funded some ReCaptcha work.

Digital libraries

The ReCaptcha project is digitizing books in the Internet Archive, a project building a digital library of cultural materials and that operates the Wayback Machine of historical Web site snapshots.

Among the first books being digitized is Psychology by philosopher John Dewey, von Ahn said. The project is considering other book archives, too, he added.

The ReCaptcha service is available now through an application programming interface (API) for people to integrate into their Web sites. Software plug-ins to use the API are open-source software packages hosted at Google Code.
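
The two-word check described in the excerpt can be sketched in a few lines of Python. This is only an illustration of the idea as ZDNet reports it, not ReCaptcha's actual code or API: the class name `WordVerifier`, the `submit` method, the word and user identifiers, and the case-insensitive `normalize` comparison are all assumptions made for the example. The one detail taken from the article is that three separate people must type an unknown word the same way before it is accepted.

```python
from collections import Counter, defaultdict

REQUIRED_AGREEMENT = 3  # from the article: three separate people must agree


def normalize(text):
    """Assumed comparison rule: ignore surrounding spaces and letter case."""
    return text.strip().casefold()


class WordVerifier:
    """Hypothetical sketch of the known-word / unknown-word check."""

    def __init__(self, required=REQUIRED_AGREEMENT):
        self.required = required
        self._votes = defaultdict(dict)  # unknown word id -> {user id: answer}
        self.accepted = {}               # unknown word id -> agreed text

    def submit(self, user_id, control_answer, control_truth,
               unknown_id, unknown_answer):
        """Return True if the user passed the known (control) word."""
        if normalize(control_answer) != normalize(control_truth):
            # Failed the conventional Captcha: discard the other answer too.
            return False
        if unknown_id not in self.accepted:
            votes = self._votes[unknown_id]
            # One vote per person, so a repeat visitor cannot count twice.
            votes[user_id] = normalize(unknown_answer)
            text, count = Counter(votes.values()).most_common(1)[0]
            if count >= self.required:
                self.accepted[unknown_id] = text
        return True


if __name__ == "__main__":
    verifier = WordVerifier()
    for user in ("ann", "bob", "cy"):
        verifier.submit(user, "morning", "morning", "scan-17", "upon")
    print(verifier.accepted)
```

Keying votes by user is one way to read "three separate people"; a real service would also need to decide what to do when readers never agree, which the article does not cover.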

