Solutions

Unleashing the Power of Words

Deloitte’s WordsWorth Text-Mining Solution

Deloitte’s WordsWorth is a state-of-the-art, cloud-capable text-mining offering that spans scanning, named-entity recognition, document mark-up, semantic search, text translation and summarization, and table and invoice extraction.

The Need

Today’s digital world has resulted in a Cambrian explosion of documents, ever easier to produce and to transmit. While offices worldwide have indeed become increasingly “paperless”, paper reports have only partially been superseded by electronic data exchanges of structured, tabular data. Instead, unstructured narratives have migrated from paper to PDF, or equivalents. The volume of textual documents has soared, lifted by improved editor tools, automated text generation, and a dramatic increase in the options an author has to disseminate a message: emails, chats, blogs, social media, cloud drives, collaborative document-sharing suites, to name a few. Furthermore, multiple providers in each of these formats compete to best serve the need for humans to tell a story, to instruct or explain, or simply to express views.

The proliferation of documents, types and formats poses a significant challenge to the reader. It is increasingly difficult to discern useful signals from spurious noise, or even facts from opinion. The digitization of processes and businesses and the “always-on” reachability through mobile devices have raised expectations for quick results. There is simply not enough time to wade through the flood of documents to discern which are important and which are reliable. Fortunately, machines armed with Natural Language Processing (NLP) algorithms can help. NLP promises efficiency, quality and exhaustive coverage in working with unstructured, textual documents.
 

Our Solution

Pre-trained on universally applicable language models and enhanced with case-specific vocabulary, the text-mining solution WordsWorth excels in accurate interpretation of text documents. It achieves this by combining the most advanced underlying methods from multiple cloud providers (AWS, Azure, GCP) with the flexibility of multiple, dedicated open-source algorithms. It offers users two means to interact with the functionality, either through the intuitive graphical user interface (GUI) or through dedicated Python libraries, which may be invoked via the command line or embedded within custom applications.

WordsWorth performs a wide spectrum of text-mining services:

  • OCR – converts scanned text into machine-readable flowing text
  • language detection & translation – covers all European languages, plus Chinese and Japanese
  • topic modeling – discerns whether a document is relevant to the reader’s subject of interest
  • named entity recognition – identifies and extracts names, places, etc., classifying them as personally identifiable information (or not)
  • table recognition – identifies tables within the document, exportable into standard formats (.xlsx, .csv)
  • invoice extraction – finds all relevant invoice elements and maps them to their text content and coordinates in the original document
  • document mark-up – color-codes identified words, or blacks out / anonymizes sensitive passages
  • version comparison – highlights differences between multiple versions of a document
  • semantic search – finds relevant passages associated with the meaning of input search words, beyond exact matches of key-word search
  • summarization – paraphrases documents to condense them as far as possible while preserving the most relevant topics
  • export – converts into a variety of popular editable text files (format selected depending on content)
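
To make the document mark-up service more concrete, the following is a minimal Python sketch of redacting sensitive passages. It is not WordsWorth's implementation or API: the `PATTERNS` table and the `redact` function are hypothetical names introduced for illustration, and simple regular expressions stand in for the named entity recognition a real solution would use.

```python
import re

# Hypothetical patterns for illustration only; a production system would
# rely on named entity recognition rather than regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace every match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +49 30 1234567."
print(redact(sample))
```

Replacing matches with a label rather than deleting them keeps the sentence readable and shows the reader what kind of information was removed.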
     

Advantages/Benefits

  • Speed: quickly determine whether entire documents are relevant, quickly scan for passages of interest
  • Quality: find the best available cloud API or open-source Python library, rivaling human performance
  • Cost: process everything, avoiding rework or errors common to sample-selection or fatigue
  • Integration: work with documents as you would with other files
     

Example Use Cases

  • Tagging documents (metadata) according to contents – in order to route them to the right recipient
  • Providing advanced concept search (or document similarity) to document management systems
  • Redacting (blacking out) of sensitive information from confidential / legal documents
  • Generally structuring concepts (& associated quantities) from narratives into tables / databases
  • Populating systems with data automatically read from invoices / forms
  • Providing abstracts / summaries of large documents
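
As an illustration of the first use case, tagging documents in order to route them, here is a small Python sketch. It assumes a hand-written keyword table (`ROUTING_KEYWORDS`, a hypothetical name with made-up departments) and simple word counts; it is a stand-in for topic modeling, not a description of how WordsWorth assigns tags.

```python
import re
from collections import Counter

# Hypothetical keyword-to-recipient table, for illustration only.
ROUTING_KEYWORDS = {
    "accounts-payable": {"invoice", "payment", "vat"},
    "legal": {"contract", "liability", "clause"},
}


def tag_document(text: str) -> dict:
    """Score each recipient by keyword hits and return routing metadata."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        recipient: sum(words[keyword] for keyword in keywords)
        for recipient, keywords in ROUTING_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Documents without any keyword hit stay unassigned.
    recipient = best if scores[best] > 0 else "unassigned"
    return {"recipient": recipient, "scores": scores}


print(tag_document("Invoice 42: payment due, VAT included"))
```

The returned dictionary can be stored as document metadata, so a downstream system can route on `recipient` and still inspect the underlying scores.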
     

David Thogmartin

aiStudio | AI & Data Analytics

David Thogmartin leads the aiStudio internationally and the “AI & Data Analytics” practice for Risk Advisory in Germany. He has 20 years of professional experience in Analytics and Digitization, large...