What happens when soldiers in hostile areas capture valuable documents written in a language they don’t speak and in a non-Roman script they can’t read? Are the documents exploited for their intelligence value, or are they stored unread and unevaluated?
All too often, it is the latter—a case of too many documents and too few translators.
American soldiers captured more than 2 million documents during the first few years of U.S. military involvement in Iraq and Afghanistan. Of these, fewer than 3 percent have been evaluated, mostly by human eyes with little help from computers.
Henry Baird and Daniel Lopresti, professors of computer science and engineering, have resolved to change that. Early in the U.S. anti-terrorism campaign they concluded that new research was needed to develop high-performance document analysis techniques to help military intelligence officials. They recommended that DARPA (the Defense Advanced Research Projects Agency) launch a program to support the development of a faster, more computerized capability to understand and translate documents.
The result of these efforts was DARPA funding for the MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) project, in which Baird and Lopresti are participating. Its purpose is to use computers to convert foreign-language text images into English transcripts. Its early focus was on Arabic documents and it has made impressive progress in recognizing Arabic handwriting.
Building on these successes, the Lehigh team made the case for a second, more far-reaching undertaking, the Document Analysis and Exploitation (DAE) project. This proposal was shepherded through the government approval process by Bill Michalerya, associate vice president for government relations and economic development at Lehigh, and was recently approved by Congress. Together, MADCAT and DAE represent the largest investment of federal research grants in document image analysis in two decades.
Automating the analysis of documents
In the new project, Baird and Lopresti and Hank Korth, also a professor of computer science and engineering, are leading efforts to advance the automation of document analysis and to provide a national resource to enable shared research and access. The three faculty members and their students are collaborating with BBN Technologies of Massachusetts.
Baird brings to the project more than 20 years of experience in computer vision and pattern recognition. He has authored and edited prominent books on the subject and published numerous articles. Lopresti, the co-director of Lehigh’s Pattern Recognition Research Lab, has published widely in the areas of document analysis, handwriting, computer security and bioinformatics. Korth, an expert in databases, has published more than 100 technical publications, including the widely used text Database System Concepts, which is now in its sixth edition.
Applications as diverse as digital libraries, medical records, and the discovery process for lawsuits will benefit from automated document analysis, say the three researchers. But the immediate need—national security—adds urgency to the effort.
The goal of the DAE project goes well beyond the computerized “reading” that’s been around for decades.
Simple document scanning, long used in copiers and fax machines, produces a photo-like representation of a document, not a file that people can search and edit. OCR (optical character recognition) software can recognize some machine-printed text, which allows documents to be searched and edited. But it is limited mostly to Western languages on very clean documents.
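The core idea behind early OCR can be sketched in a few lines: compare each scanned character against a set of reference templates and pick the closest match. The sketch below uses invented 3x3 bitmaps as glyphs; real OCR engines use far richer features and classifiers, but even this toy version shows why a single flipped pixel on a noisy scan can cause a misread.

```python
# Hypothetical reference templates: perfectly clean 3x3 glyphs,
# flattened into tuples of 9 pixels (1 = ink, 0 = blank).
TEMPLATES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}

def recognize(bitmap):
    """Return the template label with the fewest mismatched pixels."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(TEMPLATES, key=lambda label: distance(bitmap, TEMPLATES[label]))

# A clean "T" is recognized correctly...
clean_t = (1, 1, 1,
           0, 1, 0,
           0, 1, 0)
print(recognize(clean_t))  # -> T

# ...but flipping one corner pixel (as on a degraded fax) makes the
# bitmap equally close to "I" and "T", and the tie goes the wrong way.
noisy = (1, 1, 0,
         0, 1, 0,
         0, 1, 0)
print(recognize(noisy))  # -> I (a misread)
```

This fragility on degraded input is exactly the limitation Baird describes next.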
“You can buy decent OCR machines these days for printed text, but not for handwriting,” Baird says. “But if you look for Arabic, Ethiopian languages and many other non-Western languages, for example, you won’t find usable technology.”
In addition, says Baird, OCR is not able to recognize unusual fonts, poor-quality faxes, bleed-throughs, or even yellowing paper in old books.
Examining more than the printed word
And those challenges don’t even begin to cover the demanding needs of the military intelligence community.
“People in the field might get a pile of papers in handwritten Arabic scribbles,” Baird says. “In addition to text, the documents can contain maps, tables of information, drawings and photographs. We’d like to automate the analysis of these as well.”
“We also want to go beyond the actual written text to meta-data,” Lopresti says. “For example, if we can tell that documents appear to be written by the same author, that’s incredibly valuable information.”
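One simple way to act on that kind of meta-data is to reduce each document to a vector of handwriting features and flag pairs whose vectors are nearly parallel as possibly sharing an author. The features below (slant, stroke width, letter spacing) and the threshold are hypothetical stand-ins, not the project's actual method; the sketch only illustrates the idea of similarity-based writer matching.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def same_writer(u, v, threshold=0.99):
    """Flag two documents as plausibly same-author if their styles align closely."""
    return cosine_similarity(u, v) >= threshold

# Hypothetical handwriting features: [avg slant in degrees, stroke width px, spacing px]
doc_a = [12.0, 2.1, 4.8]
doc_b = [11.5, 2.2, 5.0]   # very similar style to doc_a
doc_c = [40.0, 0.9, 9.5]   # markedly different style

print(same_writer(doc_a, doc_b))  # True
print(same_writer(doc_a, doc_c))  # False
```

Real writer identification uses far richer descriptors and learned models, but the comparison step reduces to something like this.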
Automating document analysis involves three steps: scanning the document; converting images to computer-readable characters; and translating and evaluating the content.
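The three steps above can be sketched as a pipeline of stages, each one a function whose output feeds the next. The recognizer and translator here are toy stubs (a byte decode and a word-for-word lookup table) standing in for real OCR and machine-translation engines; only the staged structure is the point.

```python
def scan(document):
    """Stage 1: scanning. In reality this produces a page image;
    here we simply pass the raw bytes through."""
    return document

def recognize_text(image):
    """Stage 2: convert image data into computer-readable characters
    (stubbed as a UTF-8 decode)."""
    return image.decode("utf-8")

def translate(text, dictionary):
    """Stage 3: translate and evaluate the content
    (stubbed as word-for-word dictionary lookup)."""
    return " ".join(dictionary.get(word, word) for word in text.split())

# Hypothetical mini-dictionary standing in for a translation engine.
DICTIONARY = {"hola": "hello", "mundo": "world"}

raw = "hola mundo".encode("utf-8")
print(translate(recognize_text(scan(raw)), DICTIONARY))  # hello world
```

In practice each stage is a hard research problem in its own right, and errors in an early stage compound in the later ones.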
“There’s been much agreement over the past 20 years on how to describe ordinary language text for computers,” Baird says, “but not on mathematics, chemical diagrams, maps, or tabular data. That’s what this project is all about. We’re developing the technology to automate people’s ability to read and understand the contents of all kinds of documents. And that turns out to be pretty hard.”
To do that, the researchers will exploit synergies among document layout analysis, character recognition, language modeling and parsing, link analysis, and semantic modeling. The result will be script- and language-independent, making it easily transferable to new applications.
“This project has two main thrusts,” says Lopresti. “One is to develop new techniques for extracting information from documents. The other is to make available to the research community these techniques plus data sets for producing new algorithms and techniques.”
Pattern recognition isn’t the only challenge.
“The database challenges,” Korth states, “relate to the type of information that must be stored and the ways in which we anticipate the data being used. For example, we’ve had to consider collaborative document analysis. What are the problems when lots of teams analyze lots of documents using lots of different tools? We need to design a database that can store the information efficiently so that it can be shared by all the potential users.”
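The kind of schema Korth describes might look like the sketch below: documents, tools, and per-team analysis results stored once and shared, with a composite key preventing duplicate entries. The table and column names are illustrative assumptions, not the project's actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, source TEXT);
    CREATE TABLE tools     (tool_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE analyses  (
        doc_id  INTEGER REFERENCES documents(doc_id),
        tool_id INTEGER REFERENCES tools(tool_id),
        team    TEXT,
        result  TEXT,
        PRIMARY KEY (doc_id, tool_id, team)  -- one result per doc/tool/team
    );
""")

# Two teams analyze the same document with different tools.
conn.execute("INSERT INTO documents VALUES (1, 'captured_page_017')")
conn.executemany("INSERT INTO tools VALUES (?, ?)",
                 [(1, "layout-analyzer"), (2, "handwriting-ocr")])
conn.executemany("INSERT INTO analyses VALUES (?, ?, ?, ?)",
                 [(1, 1, "team-a", "two columns, one map"),
                  (1, 2, "team-b", "Arabic, cursive")])

# Any team can now query every result recorded for a document.
rows = conn.execute("""
    SELECT t.name, a.team, a.result
    FROM analyses a JOIN tools t ON a.tool_id = t.tool_id
    WHERE a.doc_id = 1 ORDER BY t.name
""").fetchall()
for name, team, result in rows:
    print(name, team, result)
```

Storing results once in a shared store, rather than per-team copies, is what makes the "lots of teams, lots of tools" scenario tractable.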
The DAE project, says Lopresti, represents “a small slice of the research trajectory.” The Lehigh team is already looking ahead to the next challenge.
“Right now the emphasis is on Arabic,” Baird states. “Six or 12 months from now, though, who knows where the interest will be? We want to have techniques that are robust, well-developed, debugged, proven and ready to go, so that we can quickly adapt to any situation of interest.”