Posted: 11 Jun. 2024

Planning your AI journey?

Start by using AI to clean up your data

Data quality and the case for artificial intelligence (AI)

“I was promised an AI revolution, but all I have is this crummy data.” This fear is top of mind for many managers and business leaders as they analyze and prepare for significant investments in new AI technology. The fact is, amid all the recent excitement around AI, many companies still struggle with foundational data quality, including in core areas such as their product catalog. This challenge is particularly vexing for retailers, fashion brands, and other companies with a large and fast-changing portfolio of products. However, the fundamental problem goes beyond one industry, business model, or type of data. Business leaders worry: What happens if their new AI initiatives and analytics are built on a foundation of bad data?

We have good news. Data quality has emerged as one of the promising use cases for the large language models (LLMs) driving Generative AI, making it one of the first stops we might suggest in an AI journey. In this blog entry, we analyze real examples in an online retail setting to show three ways you can use AI to improve your data quality. Beyond these immediate improvements, we then discuss how to build a full suite of capabilities to transform your long-term data quality.

Three ways you can use LLMs to improve your data quality

The approaches shared can be implemented quickly, and in the sections below, we include screenshots of our team running them on a sample of publicly available product data from a large retailer hosted by the University of California San Diego (see endnotes 1 and 2). These techniques, along with others, open a new era for item data quality in which companies may finally be able to unlock transformational improvements through AI—fast.

Example 1: Outlier detection

Open-entry text fields (where a user can input any value), such as the product name or description, have traditionally been one of the greatest challenges for data quality. When users can enter any information, they can (and often do) make mistakes. Automated data syncs and transfers reduce manual entry, but also reduce user oversight on the data being uploaded, which can lead to egregious errors slipping through. For example, in real product data, our team found a fruit product mistitled “Pipe Hawk Axe,” numerous items with the title showing some form of “Error N/A,” and many other mix-ups.

Rendering of real items found by our team with data quality issues in the title, potentially caused by automated uploads

While these sorts of discrepancies are often obvious to the human eye, in the past, it has been tricky to automate rules to catch them. Now, using some of the latest tools, we can flag these types of issues in three easy steps. First, we convert our product titles to LLM embeddings. Next, we fit an outlier detection model to a product category (in our case, using 200 items from a T-shirts category). Finally, we run this model on test items in the same category to separate outliers.
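
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the setup used to produce the screenshots: `toy_embed` is a hypothetical stand-in for a real LLM embedding call (it only counts words), the outlier model is a simple distance-from-centroid rule with a cutoff of three standard deviations, and the six training titles are invented sample data standing in for the 200 category items.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]
Embedder = Callable[[str], Vector]


def toy_embed(title: str) -> Vector:
    """Step 1 (stand-in): L2-normalized word counts instead of a real LLM embedding."""
    counts = Counter(title.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit(titles: List[str], embed: Embedder, k: float = 3.0) -> Tuple[Vector, float]:
    """Step 2: learn a category centroid and a similarity cutoff from known-good titles."""
    vectors = [embed(t) for t in titles]
    centroid: Vector = {}
    for vec in vectors:
        for tok, w in vec.items():
            centroid[tok] = centroid.get(tok, 0.0) + w / len(vectors)
    sims = [cosine(vec, centroid) for vec in vectors]
    # Assumed rule: anything more than k standard deviations below the mean is an outlier.
    return centroid, mean(sims) - k * pstdev(sims)


def find_outliers(titles: List[str], centroid: Vector, threshold: float,
                  embed: Embedder) -> List[str]:
    """Step 3: flag titles that sit too far from the category centroid."""
    return [t for t in titles if cosine(embed(t), centroid) < threshold]


# Invented sample data for a T-shirts category.
train_titles = [
    "Mens Cotton Crew Neck T-Shirt Blue",
    "Womens Cotton V-Neck T-Shirt White",
    "Kids Graphic T-Shirt Red Cotton",
    "Mens Long Sleeve T-Shirt Black",
    "Womens Crew Neck T-Shirt Grey",
    "Mens Graphic Cotton T-Shirt Green",
]
test_titles = [
    "Mens Cotton T-Shirt Black",
    "Pipe Hawk Axe",
    "Womens Cotton T-Shirt Pink",
    "Error N/A",
]

centroid, threshold = fit(train_titles, toy_embed)
outliers = find_outliers(test_titles, centroid, threshold, toy_embed)
print(outliers)
```

Because the embedder is passed in as a function, swapping `toy_embed` for a call to a real embedding model should not require changes to `fit` or `find_outliers`, provided the vector helpers are adapted to that model's output format.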

Example 2: Attribute extraction

Another common data quality issue is missing data. This problem is often amplified when new fields are added (for compliance or other business reasons), which results in a need to backfill legacy items. LLMs can help to fill in missing fields by extracting information from other fields, such as the product title or description. For situations where limited information is available in the product title/description, we can also use the product image as an additional prompt with a multimodal model. In one example, we were able to extract the unit of measure from item titles using an LLM agent and then validate the extraction using a second LLM agent.
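
A rough sketch of the extract-then-validate pattern is shown below. Everything model-related here is an assumption: the `LLM` type is a hypothetical "prompt in, text out" interface, the prompt wording is illustrative, and `fake_extractor` and `fake_validator` are regex stubs that stand in for real model calls so the example runs offline.

```python
import re
from typing import Callable, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

EXTRACT_PROMPT = (
    "Extract the unit of measure from this product title. "
    "Answer with the unit only, or NONE if there is none.\nTitle: {title}"
)
VALIDATE_PROMPT = (
    "Does the unit of measure '{unit}' appear in this product title? "
    "Answer YES or NO.\nTitle: {title}"
)


def extract_unit(title: str, extractor: LLM, validator: LLM) -> Optional[str]:
    """Ask one agent for the unit, then keep it only if a second agent confirms it."""
    answer = extractor(EXTRACT_PROMPT.format(title=title)).strip().lower()
    if not answer or answer == "none":
        return None
    verdict = validator(VALIDATE_PROMPT.format(unit=answer, title=title))
    return answer if verdict.strip().upper().startswith("YES") else None


# --- Offline stubs standing in for real LLM calls (assumptions, not a real API) ---
UNIT_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*(oz|lb|ml|kg|g|l|ct)\b", re.IGNORECASE)


def fake_extractor(prompt: str) -> str:
    title = prompt.split("Title: ", 1)[1]
    match = UNIT_PATTERN.search(title)
    return match.group(1) if match else "NONE"


def fake_validator(prompt: str) -> str:
    unit = re.search(r"'([^']*)'", prompt).group(1)
    title = prompt.split("Title: ", 1)[1]
    found = re.search(rf"\d\s*{re.escape(unit)}\b", title, re.IGNORECASE)
    return "YES" if found else "NO"


print(extract_unit("Organic Gala Apples 3 lb Bag", fake_extractor, fake_validator))
print(extract_unit("Mens Cotton T-Shirt Blue", fake_extractor, fake_validator))
```

Keeping the validator separate from the extractor means a wrong or invented answer from the first agent is dropped rather than written into the catalog. For example, an extractor that always answers "kg" would be rejected for the apples title above.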

Example 3: Attribute retrieval

In some instances, existing product descriptions and images will not have the information needed to fill our target attributes. For these more challenging situations, we can leverage retrieval-augmented generation (RAG) and allow the LLM to search the web (or any trusted source) to augment our data. In one simple example, we have an item with a description and brand name but no information on the parent company. We can set up an LLM agent that searches the web for the brand’s parent company and then populates the missing field with the results.

This approach can be extrapolated to any situation where the desired information is located on the internet or in another database (including the retrieval of similar products from your internal database), and we want to automate the process of searching, extracting information, and formatting it to fill the attribute.
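
The search, extract, and fill loop described above might look like the following sketch. It rests on several assumptions: `Search` and `LLM` are hypothetical interfaces, the prompt text is illustrative, the brands and companies are made up, and `fake_search` and `fake_llm` are offline stubs standing in for a real search tool and a real model.

```python
import re
from typing import Callable, Dict, List

Item = Dict[str, object]
Search = Callable[[str], List[str]]  # hypothetical: query in, text snippets out
LLM = Callable[[str], str]           # hypothetical: prompt in, text out

PROMPT = (
    "Using only the sources below, name the parent company of the brand "
    "'{brand}'. Answer with the company name only, or UNKNOWN.\n"
    "Sources:\n{sources}"
)


def fill_parent_company(item: Item, search: Search, llm: LLM) -> Item:
    """Retrieve snippets for the brand, ask the model, and fill the missing field."""
    if item.get("parent_company"):
        return item  # nothing to fill
    snippets = search(f"{item['brand']} parent company")
    if not snippets:
        return item  # no evidence retrieved, so leave the gap rather than guess
    sources = "\n".join(f"- {s}" for s in snippets)
    answer = llm(PROMPT.format(brand=item["brand"], sources=sources)).strip()
    if not answer or answer.upper() == "UNKNOWN":
        return item
    # Keep the supporting snippets so a reviewer can audit the filled value.
    return {**item, "parent_company": answer, "parent_company_sources": snippets}


# --- Offline stubs with made-up companies (assumptions, not real data) ---
FAKE_WEB = [
    "Acme Threads is a clothing brand owned by Globex Holdings.",
    "Initech Outdoors makes camping gear.",
]


def fake_search(query: str) -> List[str]:
    brand = query.replace(" parent company", "").lower()
    return [doc for doc in FAKE_WEB if brand in doc.lower()]


def fake_llm(prompt: str) -> str:
    match = re.search(r"owned by ([A-Z][\w ]+?)\.", prompt)
    return match.group(1) if match else "UNKNOWN"


item = {"title": "Crew Neck Tee", "brand": "Acme Threads", "parent_company": None}
filled = fill_parent_company(item, fake_search, fake_llm)
print(filled["parent_company"])
```

Pointing `search` at an internal product database instead of the web would cover the similar-products case with the same structure.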

Building your data quality capability

While LLMs can help to rapidly accelerate data quality efforts, building a long-term capability requires a coordinated, multipronged approach. Companies interested in LLMs for data quality might explore their application as part of a larger program. To get started, consider the following steps:

  1. Run a proof of concept to better understand the improvement potential for data quality using LLMs along with other machine learning, statistical, and rules-based checks.
  2. Scale the capability by assessing and cleaning a few priority attributes in their entirety. This scaling stage can also be used to further develop capabilities such as data governance, root cause analysis and remediation, and more.
  3. Transform your data quality by establishing an always-on data monitoring and management platform that flags issues as they arise, triages them for proper handling, and creates feedback loops to help the organization proactively mitigate and prevent upstream data quality issues.
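
To make the flag-and-triage idea in step 3 concrete, here is a small sketch of a monitoring pass that runs rules-based checks and routes issues into queues by severity. The check names, the two severity levels, and the item fields (`id`, `title`, `unit_of_measure`) are all hypothetical. A production platform would add LLM-based and statistical checks alongside rules like these.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Item = Dict[str, Optional[str]]
Check = Callable[[Item], Optional[str]]  # returns a severity, or None if the item passes


@dataclass
class Issue:
    item_id: str
    check: str
    severity: str  # assumed levels: "high" goes to manual review, "low" to a backfill queue


def missing_title(item: Item) -> Optional[str]:
    return "high" if not (item.get("title") or "").strip() else None


def placeholder_title(item: Item) -> Optional[str]:
    title = (item.get("title") or "").lower()
    return "high" if "error" in title or "n/a" in title else None


def missing_unit(item: Item) -> Optional[str]:
    return "low" if not item.get("unit_of_measure") else None


CHECKS: Dict[str, Check] = {
    "missing_title": missing_title,
    "placeholder_title": placeholder_title,
    "missing_unit": missing_unit,
}


def monitor(items: List[Item]) -> Dict[str, List[Issue]]:
    """Run every check on every item and triage the resulting issues by severity."""
    queues: Dict[str, List[Issue]] = {"high": [], "low": []}
    for item in items:
        for name, check in CHECKS.items():
            severity = check(item)
            if severity:
                queues[severity].append(Issue(str(item["id"]), name, severity))
    return queues


# Invented sample items.
catalog = [
    {"id": "A1", "title": "Mens Cotton T-Shirt Blue", "unit_of_measure": "each"},
    {"id": "A2", "title": "Error N/A", "unit_of_measure": "each"},
    {"id": "A3", "title": "Sparkling Water 12 oz Can", "unit_of_measure": None},
]
queues = monitor(catalog)
print(queues)
```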

These steps, in combination with the latest technology, can help companies make rapid progress in improving data quality and build a strong foundation for future AI initiatives.



Spencer Young
Deloitte Consulting

Phill Domschke
Senior Manager
Deloitte Consulting

Thank you to our contributors: Rajesh Vegi, Lori Stevens, and Jesse Miller.


1 Jianmo Ni, “Amazon review data (2018),” University of California San Diego, 2018.
2 Jianmo Ni, Jiacheng Li, and Julian McAuley, “Justifying recommendations using distantly-labeled reviews and fine-grained aspects,” Conference on Empirical Methods in Natural Language Processing (2019).
