Artificial intelligence and machine learning in e-discovery and beyond

Driving efficiencies in e-discovery using AI

“AI” and “machine learning” may seem to be the most overused 2019 buzzwords, referencing something that is on the horizon but only just coming into use. In fact, this is far from the reality. Machine learning has been part of e-discovery for a long time now. Electronic discovery or e-discovery is the process of identifying, collecting and reviewing electronically stored information (ESI) to support legal proceedings. The ESI can include all data stored on devices such as laptops, desktops, smartphones, corporate email systems, file shares, backup systems, SharePoint, and accounting systems, and also data held on social media platforms such as Facebook and LinkedIn.

“We’ve been working with AI and machine learning applications for more than ten years now,” says Arjan Hulsbos, Senior Manager at Deloitte Forensic. “We use AI tools primarily for early case assessment to provide quick insights. When we start an investigation, we often need to cast our net wide and collect a lot of data. It is practically impossible to review all the data manually, since investigations regularly involve more than ten terabytes of data – that’s the equivalent of 80 sea containers filled with paperwork.” It may seem surprising that such a huge amount of data can accumulate, but according to Hulsbos it is a reality. “Just think about all the information on your own laptop and smartphone. We’re not just talking about documents and PowerPoint presentations, but also about emails, text messages, WhatsApp communications: you name it.”

Conceptual clustering

At the start of an investigation, using the allegations made as a basis, a machine learning tool such as Brainspace can be set up to read the documents, search for relevant words, and cluster the documents into groups based on their contents. This is known as ‘conceptual searching’ or ‘conceptual clustering’.

“For example, if you ask the machine learning tool to identify any information about tennis and football in the files and documents, the algorithm will also cluster documents containing information about all kinds of sports,” says Hulsbos. “Of course, in practice we search for terms like ‘little brown envelope’ or ‘grease’, rather than ‘tennis’, and the algorithm will then cluster information about anything relating to ‘bribery’.”

Most e-discovery tools present the data visually, for example using an interactive pie chart-like overview, which presents all the documents in clusters based on their conceptual similarity. This cluster wheel gives investigators an easy-to-digest overview of what is important and of the areas in which they should start searching for ‘hot’ documents.
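To make the idea of grouping documents by content more concrete, here is a minimal sketch in Python. It is not how Brainspace or any other e-discovery product works internally: it assumes a simple bag-of-words cosine similarity and a greedy single-pass grouping, and the function names, stop-word list, sample texts and 0.3 threshold are all invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"the", "a", "with", "at", "was", "for", "are", "and", "to"}


def vectorize(text):
    """Turn a document into a bag-of-words count vector, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: a document joins the first cluster whose
    seed document is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [document indices])
    for index, doc in enumerate(docs):
        vec = vectorize(doc)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vec, [index]))
    return [members for _, members in clusters]


docs = [
    "please leave the brown envelope with the cash at reception",
    "the envelope with the cash was collected at reception",
    "quarterly sales figures for the northern region are attached",
    "attached are the sales figures for the southern region",
]
groups = cluster(docs)
print(groups)  # [[0, 1], [2, 3]]
```

Real conceptual clustering goes further than shared words, since it is meant to group ‘tennis’ with other sports. The sketch only shows the basic step of turning text into vectors and grouping by similarity.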

Review assistant

Although AI is a very useful assistant in helping to identify relevant data, it clearly does not run an investigation. “It’s definitely not a matter of giving a large amount of data to a machine and asking it to give you the evidence. There needs to be interaction between the algorithm and the investigator,” stresses Hulsbos. In practice, this means that when the algorithm starts to identify possibly relevant data, experts review it to determine whether the documents are relevant or not, and to identify those that are of the greatest importance. “We use their reviews as input to the algorithm so that the machine can learn what it needs to find,” says Hulsbos. “After we’ve performed a first and second review, we hand over the documents to the lawyers to carry out their legal analysis.”

The investigations by Deloitte often fall into one of two categories. Bob Dillen, Partner at Deloitte Forensic, explains: “The first kind is the investigation where we want to find the facts: who did what, when, why and how? In these cases, the first-stage identification of ‘hot’ documents containing evidence can be sufficient.” The second kind of investigation is when a regulator is involved. “In these cases, we need to find as much as possible. Not just the hot documents, but everything relevant to the case.” In these regulatory-driven investigations, the e-discovery team instructs the machine learning algorithm to go through all the documents and continuously trains it to identify which documents might be relevant and which might not. “It’s a fine line: you don’t want to give non-relevant information to the regulator, but you also don’t want to miss anything,” says Dillen.

In one project where supervised machine learning was applied, Deloitte investigators began by tagging a number of documents as either relevant or non-relevant. An algorithm was then used to tag additional documents and give them a relevancy score between one and 100 per cent. Documents that received a high score but had not previously been tagged as relevant came back for another round of review. This reduced the risk of missing relevant documents that could have been critical for the investigation.
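The loop described above (tag a sample, score the rest, send high-scoring documents that are not tagged as relevant back for review) can be sketched as follows. This is an illustration only, not the algorithm used in the project: it assumes a small naive Bayes text classifier written from scratch, and the sample documents, the `REVIEW_THRESHOLD` value and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """labelled: list of (text, is_relevant). Assumes both classes are present."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labelled:
        word_counts[label].update(tokens(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def relevancy_score(model, text):
    """Naive Bayes estimate of P(relevant | text), as a percentage."""
    word_counts, doc_counts = model
    vocab_size = len(set(word_counts[True]) | set(word_counts[False]))
    log_p = {}
    for label in (True, False):
        total = sum(word_counts[label].values())
        lp = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in tokens(text):
            # Laplace smoothing; the extra 1 leaves room for unseen words.
            lp += math.log((word_counts[label][word] + 1) / (total + vocab_size + 1))
        log_p[label] = lp
    top = max(log_p.values())  # subtract the maximum for numerical stability
    p_rel = math.exp(log_p[True] - top)
    p_non = math.exp(log_p[False] - top)
    return 100 * p_rel / (p_rel + p_non)


# Step 1: investigators tag a sample (invented examples).
tagged = [
    ("payment in a brown envelope as agreed", True),
    ("cash envelope delivered to the official", True),
    ("agenda for the monday team meeting", False),
    ("canteen menu for next week", False),
]
model = train(tagged)

# Step 2: score further documents. The second element is the reviewers'
# current tag: True, False, or None if not yet reviewed.
pool = {
    "doc-1": ("another envelope with cash for the official", None),
    "doc-2": ("updated agenda for the team meeting", False),
    "doc-3": ("the payment was delivered in cash", False),
}
scores = {doc_id: relevancy_score(model, text) for doc_id, (text, _) in pool.items()}

# Step 3: high scorers not already tagged as relevant go back for review.
REVIEW_THRESHOLD = 70  # hypothetical cut-off, in per cent
review_queue = [
    doc_id
    for doc_id, (_, tag) in pool.items()
    if scores[doc_id] >= REVIEW_THRESHOLD and tag is not True
]
print(review_queue)  # ['doc-1', 'doc-3']
```

Here `doc-3` was coded as non-relevant by a reviewer but scores highly, so it returns for a second look. That is the safeguard against missed documents that the paragraph describes.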

Adding expertise through AI

Using AI and machine learning in e-discovery is a significant cost-saver. It is estimated that simply presenting the documents in conceptual clusters can give a 15 to 20 per cent increase in review speed. However, even bigger savings in time and cost come from using machine learning to identify which documents are relevant and which are not. The software learns by building patterns that capture the subject of the investigation, and this can save an enormous amount of investigators’ time. Dillen explains: “After we’ve trained the algorithm for a while regarding what is relevant and what is irrelevant, we get to a point where we are sufficiently comfortable to allow the algorithm to cut out everything irrelevant on its own. Depending on the constellation of the data set and the variety of topics in your investigation, machine learning can potentially reduce the total number of hours required for a review by up to 40 per cent. However, if you have, for example, a very targeted search in an email archive based on x-number of terms, it is possible that you end up with a very high responsiveness rate. In other words, if you search for all emails containing a very specific term, you might still end up with 80 per cent responsive documents. This means that the algorithm will never be able to cut out a very large volume of non-relevant documents. In addition, if the scope of your investigation is very broad and you’re trying to cover a lot of different topics at the same time, the algorithm might not be able to cut out as many documents from a review as anticipated.”

In summary, applying algorithms may deliver less than expected where the data sources are by nature highly relevant, where the density of relevant documents is extremely high, or where the investigation covers a wide variety of topics.
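The point about responsiveness rates comes down to simple arithmetic: an algorithm that only removes non-relevant documents can never remove more than the non-relevant share of the set. The short sketch below shows that upper bound. The function name is made up for this example, and the 5 and 40 per cent rates are illustrative; only the 80 per cent figure comes from the quote above.

```python
def max_cut_fraction(responsiveness_rate):
    """Upper bound on the share of documents that could be excluded from review,
    assuming the algorithm only ever removes non-relevant documents."""
    if not 0.0 <= responsiveness_rate <= 1.0:
        raise ValueError("responsiveness_rate must be between 0 and 1")
    return 1.0 - responsiveness_rate


for rate in (0.05, 0.40, 0.80):
    print(f"{rate:.0%} responsive: at most {max_cut_fraction(rate):.0%} can be cut")
```

With 80 per cent responsive documents, no more than 20 per cent of the set can be cut, however well the algorithm is trained. In practice the achievable reduction will sit below this bound.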

Hulsbos continues: “During one of our projects we also applied machine learning to improve the quality of our review. The algorithm was trained using a sample of review documents and then classified a large number of previously reviewed documents as relevant or non-relevant, each with a confidence level. Afterwards, mismatches between the machine learning model and the reviewer coding were resolved by an experienced reviewer as an additional QC step. These included, for instance, documents coded by the algorithm as relevant with a high confidence level but coded by a reviewer as non-relevant, and vice versa.”
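The QC step Hulsbos describes amounts to comparing two sets of coding and escalating the confident disagreements. A minimal sketch of that comparison follows. The `CodedDocument` structure, its field names and the 0.9 confidence cut-off are assumptions made for illustration, not details from the project.

```python
from dataclasses import dataclass


@dataclass
class CodedDocument:
    doc_id: str
    reviewer_relevant: bool   # the human reviewer's coding
    model_relevant: bool      # the class predicted by the model
    model_confidence: float   # the model's confidence in its prediction, 0 to 1


def qc_mismatches(documents, min_confidence=0.9):
    """Return documents where a confident model prediction contradicts the reviewer.

    min_confidence is a hypothetical cut-off; a real project would tune it.
    """
    return [
        d for d in documents
        if d.model_relevant != d.reviewer_relevant
        and d.model_confidence >= min_confidence
    ]


sample = [
    CodedDocument("doc-1", reviewer_relevant=False, model_relevant=True, model_confidence=0.97),
    CodedDocument("doc-2", reviewer_relevant=True, model_relevant=True, model_confidence=0.99),
    CodedDocument("doc-3", reviewer_relevant=True, model_relevant=False, model_confidence=0.95),
    CodedDocument("doc-4", reviewer_relevant=False, model_relevant=True, model_confidence=0.55),
]

flagged = [d.doc_id for d in qc_mismatches(sample)]
print(flagged)  # ['doc-1', 'doc-3']
```

Mismatches in both directions are flagged (`doc-1` and `doc-3`), while the low-confidence disagreement on `doc-4` is left alone. The flagged documents would then go to the experienced reviewer for a final decision.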

Deloitte is not alone when it comes to e-discovery teams using machine learning. Dillen acknowledges that available algorithms are in common use among investigation teams. However, the difference between Deloitte and smaller practitioners is depth of expertise. “Within our network we have a lot of people who are truly specialised in machine learning, and that makes all the difference,” says Dillen. “Because if you have to go through 30 million emails, you want to be absolutely sure you’re using the right algorithm. If it’s just a little bit off, you might miss 10,000 emails. We don’t let that happen.” Deloitte’s experts are also able to work with non-standard tools to recalculate certain results, and they can take algorithms that have failed, tweak them, and make them work. Dillen adds: “Furthermore, they can be put in front of a judge and can use their knowledge to explain their findings and results. This deep understanding of how machine learning tools work and how to adapt them is what makes Deloitte a frontrunner in e-discovery.”

Redacting personal information

Machine learning can also be extremely useful whenever sensitive information has to be redacted, for example to check whether all documents containing attorney-client privileged material have been identified and redacted in the same way. Dillen comments: “In Switzerland, the law prohibits disclosure to foreign regulators of personal information such as passport, driving license or credit card numbers. In these instances, machine learning is used to find and redact the personal data before supplying any information. And now with GDPR, this application of machine learning has become even more relevant.”
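For a sense of what ‘find and redact’ involves at its simplest, the sketch below redacts credit card numbers. It is a rule-based stand-in, not the machine learning approach Dillen describes: it assumes a regular expression for 13 to 19 digit sequences plus the standard Luhn checksum to reduce false positives. The pattern, function names, placeholder text and sample sentence are illustrative choices.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Luhn checksum, the check-digit scheme used by payment card numbers."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text, placeholder="[REDACTED]"):
    """Replace digit sequences that look like card numbers and pass the Luhn check."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()

    return CARD_PATTERN.sub(replace, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
sample = (
    "Card 4111 1111 1111 1111 was charged; "
    "reference 1234 5678 9012 3456 is an order number."
)
print(redact_card_numbers(sample))
# Card [REDACTED] was charged; reference 1234 5678 9012 3456 is an order number.
```

Rules like this work for identifiers with a fixed format. Names, addresses and free-text personal details have no such pattern, which is where a trained model becomes useful.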


The use of AI in e-discovery offers numerous advantages, both in finding relevant documents much more quickly, thereby increasing efficiency and reducing costs, and also in detecting a bigger volume of information to broaden and deepen the understanding of the subject matter itself.

Clearly, there will be many more advances in the uses of AI and machine learning. These will be driven by a growing awareness of the possibilities, and also by the growth of technical expertise.
