Featured Research

from universities, journals, and other organizations

Carnegie Mellon Project Boosts Book Digitization Efforts

Date: May 25, 2007
Source: Carnegie Mellon University
Summary: A Carnegie Mellon University computer scientist is enlisting the unwitting help of thousands, if not millions, of Web users each day to eliminate a technical bottleneck that has slowed efforts to transform books, newspapers and other printed materials into digitized text that is computer searchable.

CAPTCHAs, an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart, distinguish between legitimate human users and malevolent computer programs designed by spammers to harvest thousands of free email accounts.
Credit: Image courtesy of Carnegie Mellon University

A Carnegie Mellon University computer scientist is enlisting the unwitting help of thousands, if not millions, of Web users each day to eliminate a technical bottleneck that has slowed efforts to transform books, newspapers and other printed materials into digitized text that is computer searchable. Luis von Ahn, an assistant professor of computer science and recipient of a MacArthur Foundation "genius grant," says the project will also improve Web security systems used to reduce spam and make it possible for individuals to safeguard their own email addresses from spammers.

Key to the new project is assigning a new, dual use to existing technology: CAPTCHAs, the distorted-letter tests found at the bottom of registration forms on Yahoo, Hotmail, PayPal, Wikipedia and hundreds of other sites worldwide. CAPTCHAs, an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart, distinguish between legitimate human users and malevolent computer programs designed by spammers to harvest thousands of free email accounts. The tests require users to type the distorted letters they see inside a box -- a task that is difficult for computers, but easy for humans.

Working with a team that includes computer science professor Manuel Blum, undergraduate student Ben Maurer and research programmer Mike Crawford, von Ahn invented a new version of the tests, called reCAPTCHAs, that will help convert printed text into computer-readable letters on behalf of the Internet Archive. The San Francisco-based non-profit group administers the Open Content Alliance and is one of several large initiatives working to digitize books and other printed materials under open principles, making the text searchable by computer and capable of being reformatted for new uses.

Optical character recognition (OCR) systems that automatically perform this conversion are often stumped by underlined text, scribbles and fuzzy or otherwise poorly printed letters. ReCAPTCHAs will use words from these troublesome passages to replace the artificially distorted letters and numbers typically used in CAPTCHAs.

The new tests continue to distinguish between humans and machines because they use text that OCR systems have already failed to read. And because people must decipher these words to pass the reCAPTCHA test, they will help complete the expensive digitization process.
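As a rough illustration of how troublesome words might be set aside for use in the tests, the sketch below assumes an OCR engine that reports a confidence score for each word it reads. The `OcrWord` record, the 0.0 to 1.0 score scale, the 0.6 cutoff and the sample data are all hypothetical and are not taken from the reCAPTCHA project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One scanned word image plus the OCR engine's guess (hypothetical record)."""
    image_id: str
    guess: str
    confidence: float  # assumed scale: 0.0 (unreadable) to 1.0 (certain)


def pick_challenge_words(words, threshold=0.6):
    """Return the image ids of words the OCR engine could not read reliably."""
    return [w.image_id for w in words if w.confidence < threshold]


# Made-up sample data for illustration only.
scanned = [
    OcrWord("img-001", "morning", 0.98),
    OcrWord("img-002", "ov3rlooks", 0.31),
    OcrWord("img-003", "the", 0.99),
    OcrWord("img-004", "inqu1ry", 0.42),
]
candidates = pick_challenge_words(scanned)
print(candidates)
```

Only the two poorly read images end up as challenge candidates in this example; the words the engine read confidently never need a human.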

"I think it's a brilliant idea -- using the Internet to correct OCR mistakes," said Brewster Kahle, director of the Internet Archive. ReCAPTCHAs will speed the digitization process while also helping to improve OCR methods and perhaps extend them to additional languages, he said. "This is an example of why having open collections in the public domain is important," he added. "People are working together to build a good, open system."

Von Ahn hopes to substitute his reCAPTCHAs for as many conventional CAPTCHAs as possible. "It is estimated that 60 million or more CAPTCHAs are solved each day, with each test taking about 10 seconds," he said. "That's more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with reCAPTCHAs."
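The arithmetic behind that estimate can be checked directly. The short calculation below uses only the two figures quoted by von Ahn (60 million tests a day, about 10 seconds each); the variable names are illustrative.

```python
CAPTCHAS_PER_DAY = 60_000_000  # estimate quoted by von Ahn
SECONDS_PER_CAPTCHA = 10       # estimate quoted by von Ahn
SECONDS_PER_HOUR = 3600

hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / SECONDS_PER_HOUR
print(f"{hours_per_day:,.0f} hours per day")  # prints "166,667 hours per day"
```

The result, roughly 167,000 hours, is consistent with the "more than 150,000" hours cited in the quote.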

With support from Intel Corp., von Ahn's team has devised a free, Web-based service that allows individual webmasters to install reCAPTCHAs to protect their sites. Individuals can also use the service to protect their own email addresses, or lists of addresses they post on personal Web pages. In the case of some commercial Web sites with heavy traffic, reCAPTCHA may charge a fee to pay for additional bandwidth.

To make certain that people are correctly deciphering the printed text, the reCAPTCHA system will require Web site visitors to type two words, one of which the system already knows. Each unknown word will be submitted to multiple visitors. If the visitor types the known word correctly, the system has greater confidence that the unknown word is being typed correctly. If several visitors type the same answer for the unknown word, that answer will be assumed to be correct.
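The checking scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual implementation: the `WordPairVerifier` class, the case-insensitive matching and the choice of three matching answers as "several" are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers count as "several".
AGREEMENT_NEEDED = 3


class WordPairVerifier:
    """Tracks visitor answers for unknown words, gated by a known control word."""

    def __init__(self, agreement_needed=AGREEMENT_NEEDED):
        self.agreement_needed = agreement_needed
        self.votes = defaultdict(Counter)  # unknown word id -> answer counts
        self.resolved = {}                 # unknown word id -> accepted answer

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the visitor passes; record their vote if they do."""
        if typed_known.strip().lower() != known_answer.lower():
            return False  # control word wrong: fail the test, ignore the vote
        guess = typed_unknown.strip().lower()
        if guess and unknown_id not in self.resolved:
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.agreement_needed:
                self.resolved[unknown_id] = guess
        return True


verifier = WordPairVerifier()
verifier.submit("morning", "morning", "img-002", "overlooks")
verifier.submit("morning", "mourning", "img-002", "overbooks")  # fails control word
verifier.submit("morning", "Morning", "img-002", "overlooks")
verifier.submit("morning", "morning", "img-002", "Overlooks ")
print(verifier.resolved)
```

In this run the second visitor mistypes the known word, so that visitor's answer for the unknown word is discarded. The other three visitors agree, and their shared answer is accepted for the unknown word.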

An audio version of reCAPTCHA, which will transcribe portions of radio programs that have defied speech recognition programs, will also be available for blind Web users.


Story Source:

The above story is based on materials provided by Carnegie Mellon University. Note: Materials may be edited for content and length.


Cite This Page:

Carnegie Mellon University. "Carnegie Mellon Project Boosts Book Digitization Efforts." ScienceDaily. ScienceDaily, 25 May 2007. <www.sciencedaily.com/releases/2007/05/070524164318.htm>.
Carnegie Mellon University. (2007, May 25). Carnegie Mellon Project Boosts Book Digitization Efforts. ScienceDaily. Retrieved August 20, 2014 from www.sciencedaily.com/releases/2007/05/070524164318.htm
Carnegie Mellon University. "Carnegie Mellon Project Boosts Book Digitization Efforts." ScienceDaily. www.sciencedaily.com/releases/2007/05/070524164318.htm (accessed August 20, 2014).
