A new “semantic data” service: “Document Discovery”?

March 5, 2011 Sorin Adam Matei

The New York Times reports on a new type of application, "document discovery," which combines a number of data-sieving and semantic sorting services. The aim of these services is the Holy Grail of automatic content analysis: deriving concepts from context by clustering synonyms. Cataphora seems to be a leader. The New York Times identifies several contenders in the field of document discovery:

Now, thanks to advances in artificial intelligence, “e-discovery” software can analyze documents in a fraction of the time for a fraction of the cost. In January, for example, Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.

…

More advanced programs filter documents through a large web of word and phrase definitions. A user who types “dog” will also find documents that mention “man’s best friend” and even the notion of a “walk.”
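As an aside to the excerpt, the "web of word and phrase definitions" idea can be illustrated with a toy query-expansion search. This is only a minimal sketch: the `CONCEPTS` map, the function names, and the sample documents are all hypothetical, and nothing here is claimed to reflect how any vendor's software actually works.

```python
# Hypothetical concept map: each concept lists phrases treated as related.
CONCEPTS = {
    "dog": ["dog", "man's best friend", "walk"],
}

def expand_query(term):
    """Return the phrases to search for, falling back to the term itself."""
    return CONCEPTS.get(term.lower(), [term.lower()])

def search(term, documents):
    """Return the documents that mention any phrase related to the term."""
    phrases = expand_query(term)
    return [doc for doc in documents if any(p in doc.lower() for p in phrases)]

# Made-up sample documents for illustration only.
documents = [
    "Took man's best friend to the park.",
    "Quarterly earnings rose sharply.",
    "Going for a walk after lunch.",
]
matches = search("dog", documents)
```

Neither matching document contains the word "dog," yet both are returned because they mention a related phrase. A real system would presumably need a far larger concept map and something better than plain substring matching.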

The sociological approach adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events. It identifies discussions that might have taken place across e-mail, instant messages and telephone calls.

Then the computer pounces, so to speak, capturing “digital anomalies” that white-collar criminals often create in trying to hide their activities.
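The "who talks to whom" analysis described in the excerpt can be pictured, very roughly, as counting interactions between pairs of people across channels. The sketch below assumes a made-up record format of (sender, recipient, channel); the names and data are hypothetical, and this is not Cataphora's method, just one simple way to tabulate such interactions.

```python
from collections import Counter

# Hypothetical message records: (sender, recipient, channel).
messages = [
    ("alice", "bob", "email"),
    ("alice", "bob", "im"),
    ("bob", "carol", "phone"),
    ("alice", "bob", "email"),
]

def interaction_counts(records):
    """Count messages per (sender, recipient) pair across all channels."""
    return Counter((sender, recipient) for sender, recipient, _ in records)

def channels_used(records):
    """Map each (sender, recipient) pair to the set of channels they used."""
    used = {}
    for sender, recipient, channel in records:
        used.setdefault((sender, recipient), set()).add(channel)
    return used

counts = interaction_counts(messages)
channels = channels_used(messages)
```

A pair that suddenly switches channels, or whose message volume changes sharply, is the sort of pattern an analyst might then inspect by hand; deciding what counts as an anomaly is left out of this sketch.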

…

Another e-discovery company in Silicon Valley, Clearwell, has developed software that analyzes documents to find concepts rather than specific keywords, shortening the time required to locate relevant material in litigation.

Last year, Clearwell software was used by the law firm DLA Piper to search through a half-million documents under a court-imposed deadline of one week. Clearwell’s software analyzed and sorted 570,000 documents (each document can be many pages) in two days. The law firm used just one more day to identify 3,070 documents that were relevant to the court-ordered discovery motion.
