WEST LAFAYETTE, Ind. – Using watermarks to preserve the integrity of printed documents dates back 2,000 years. Research by a pair of Purdue University professors could bring that time-tested method into the electronic age.
Mikhail Atallah, professor of computer science, and Victor Raskin, professor of English, have developed a way to embed a watermark in "natural language" text documents as well as sensitive electronic documents. Natural language includes all the spoken languages but not languages created for special purposes, such as computer languages.
The concept of placing watermarks on electronic documents is not new. Edward Delp, Purdue professor of electrical and computer engineering, was part of a research team that developed the technology in 1998 to watermark images placed on the Web.
What makes natural language watermarking unique is that it embeds the watermark in the syntax, or grammatical structure, of the language. A future version of the prototype will embed the watermark in the meaning of the language, as well. This process has never been done before electronically.
"Watermarking text is very, very difficult," said Atallah. "It's much more difficult than watermarking images."
One factor making text so difficult to watermark is that, compared to a photographic image, a text document has very few places in which to hide watermarks.
"Every pixel in a full-screen image contains information," said Raskin. "There is a lot of redundancy in the image."
That redundancy is what makes it possible to embed a watermark. One could, for example, switch a few blue pixels to red. If a field of blue surrounds the red pixels, the image itself is still seen as blue.
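As a rough illustration of how image redundancy can carry hidden data, the Python sketch below uses least-significant-bit embedding, a common textbook technique and not necessarily the method used in the Purdue image work. The function names and the tiny all-blue "image" are assumptions made up for this example.

```python
def embed_bits(pixels, bits):
    """Return a copy of pixels with one bit hidden in the blue channel
    of each of the first len(bits) pixels."""
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels to hold the message")
    out = list(pixels)
    for i, bit in enumerate(bits):
        r, g, b = out[i]
        # Clear the lowest bit of blue, then set it to the message bit.
        out[i] = (r, g, (b & ~1) | bit)
    return out


def extract_bits(pixels, count):
    """Read the hidden bits back from the blue channel."""
    return [b & 1 for (_, _, b) in pixels[:count]]


# A hypothetical 16-pixel field of blue, as (red, green, blue) tuples.
image = [(10, 20, 200)] * 16
message = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_bits(image, message)
recovered = extract_bits(marked, len(message))

# Largest change made to any blue value; too small for the eye to notice.
max_change = max(abs(a[2] - b[2]) for a, b in zip(image, marked))
```

Each marked pixel differs from the original by at most one step in a single color channel, which is why the picture still looks the same to a viewer.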
Text documents are another story, Raskin says. "In natural language, there is no redundancy. That is, every word means something. If you change it, you change the meaning of the sentence. That's the difficulty."
To get around this problem, Atallah and Raskin have developed a way to embed a watermark using the structure of language itself.
Natural language watermarking, unlike that used in images, does not embed something physical in the text. Language watermarks instead introduce very slight changes in the grammatical makeup of selected sentences throughout a document, while keeping the meaning intact.
"What we embed is not something you can see," said Raskin. "It's in the invisible syntactic structure."
A watermark is introduced throughout a document using an encryption algorithm – or computer instructions – based on a very large prime number. This large number is the "key" one needs to retrieve a watermark. The algorithm selects certain sentences in a document and subtly changes their syntactic structure.
For example, a sentence in a document may read "Ships in the vicinity may provide some additional assistance." After the document has been watermarked, the sentence will read, "Some additional assistance may be provided by the ships in the vicinity."
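The general idea of a key choosing which sentences to alter can be sketched in a few lines of Python. This is a toy under stated assumptions, not the researchers' algorithm: the key seeds Python's ordinary (non-cryptographic) random generator, each sentence comes with a hand-written passive variant instead of being transformed by a parser, and a bit is read back with a crude check for " by the ". All names here are hypothetical.

```python
import random

# Hypothetical hand-written variants; a real system would derive them with a parser.
SENTENCES = [
    {"active": "Ships in the vicinity may provide some additional assistance.",
     "passive": "Some additional assistance may be provided by the ships in the vicinity."},
    {"active": "The crew will inspect the cargo.",
     "passive": "The cargo will be inspected by the crew."},
    {"active": "The captain signed the log.",
     "passive": "The log was signed by the captain."},
    {"active": "The harbor office issues the permits.",
     "passive": "The permits are issued by the harbor office."},
]

KEY = 2**61 - 1  # a Mersenne prime standing in for the secret key


def pick_positions(key, total, count):
    """Use the key to choose which sentence positions carry a bit."""
    return sorted(random.Random(key).sample(range(total), count))


def embed(key, sentences, bits):
    """Write each bit as a voice choice: active for 0, passive for 1."""
    positions = pick_positions(key, len(sentences), len(bits))
    text = [s["active"] for s in sentences]
    for pos, bit in zip(positions, bits):
        if bit:
            text[pos] = sentences[pos]["passive"]
    return text


def is_passive(sentence):
    """Toy voice detector; only adequate for the sample sentences above."""
    return " by the " in sentence


def extract(key, text, count):
    """Re-derive the positions from the key and read the bits back."""
    positions = pick_positions(key, len(text), count)
    return [1 if is_passive(text[pos]) else 0 for pos in positions]


bits = [1, 0, 1]
marked = embed(KEY, SENTENCES, bits)
recovered = extract(KEY, marked, len(bits))
```

Without the key, a reader sees only ordinary sentences and cannot tell which ones carry a bit. Because this toy picks sentences by position, it would not survive inserted or deleted sentences, which the researchers say their scheme does.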
Another factor making this technique difficult to implement is that it must be resistant to change, Atallah says.
"We wanted to make our scheme resilient to simple changes in the text that are easy to make by automated processes, such as synonym substitutions," he says. "If you change one word for another throughout the whole document, we would expect the watermark to still be there. It turns out that it's resilient to a lot more than that. It's also resistant to insertions and deletions of sentences."
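A small Python sketch suggests why a mark carried by sentence structure can survive synonym substitution. It assumes, purely for illustration, that a bit is carried by active versus passive voice and that voice can be spotted with a crude string check; the synonym table and function names are hypothetical, and none of this is the researchers' code.

```python
import re

# Hypothetical synonym table an automated attacker might apply.
SYNONYMS = {"assistance": "help", "additional": "extra", "vicinity": "area"}


def substitute_synonyms(sentence):
    """Replace every word that has an entry in SYNONYMS."""
    return re.sub(r"\w+", lambda m: SYNONYMS.get(m.group(0), m.group(0)), sentence)


def voice_bit(sentence):
    """Toy reader: 1 for passive voice, 0 for active (string check only)."""
    return 1 if " by the " in sentence else 0


marked = "Some additional assistance may be provided by the ships in the vicinity."
attacked = substitute_synonyms(marked)

unmarked = "Ships in the vicinity may provide some additional assistance."
attacked_unmarked = substitute_synonyms(unmarked)
```

The words change, but the grammatical shape of each sentence does not, so the bit read from the altered text matches the bit read from the original.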
Purdue has filed for a patent on the new technology. Possible applications include maintaining the integrity of sensitive documents and detecting whether a document has been tampered with or altered.
"My belief is that the large corporations and governments of the world will want to protect their documentation, especially when they see that it can be done cheaply," says Raskin.
Atallah and Raskin both are affiliated with Purdue's Center for Education and Research in Information Assurance and Security, or CERIAS. They presented a paper on their work this month at the fourth International Information Hiding Workshop, in Pittsburgh. The new approach also will be highlighted this month in a talk during the second Annual CERIAS Research Symposium at Purdue April 26-27.
Both credit the project's success to their interdisciplinary approach. "What we have been able to do is to combine two areas of research – computer science and English – that nobody had thought of combining," says Raskin.