Data Sources and Ingestion Layer: the place where the data comes from
Data sources come in all shapes and sizes: structured, unstructured, semi-structured, streaming, etc., and each type requires its own method of ingestion. All the ins and outs of the Data Sources and Ingestion Layers, complete with best practices and examples, are covered in this Part 2 of the series on Layered Architecture for Data Platforms.
By- Martijn Blom & Ingrid Lanting
In Part 1 of the Layered Architecture for Data Platforms blog series, we described the common layers and the underlying components that make up a data platform, as shown in Figure 1. In Part 2, we dive deeper into the Data Sources and Ingestion Layers and explain the purpose and components of these two layers in more detail.
Figure 1 – Layers of a Data Platform
The first step for a data platform is to acquire the actual data. As a data platform does not generate data itself, this is done through sources that generate data and consequently feed the data into the data platform. The Data Sources Layer describes the different types of data sources of a data platform.
There are four types of data that can be loaded onto the platform: structured, semi-structured, unstructured, and streaming data. The most common type of data source for a data platform are information systems such as ERP or CRM systems. The data here is already stored in databases and is known as structured data, because the tables have a strict structure of columns and datatypes that describe the data.
Other important sources are text files that are structured in a certain format like XML, JSON and CSV. These types of data sources are called semi-structured, because the file type contains a partial structure but is more flexible than the structured data found in tables. In semi-structured data, the columns can differ per record and some structures support nested data. Examples of semi-structured data formats are JSON and XML, which are often the formats delivered by APIs provided by cloud applications or third-party data providers.
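To make this concrete, the snippet below shows two records from the same (hypothetical) semi-structured JSON source: the second record carries an extra field and a nested object, which a fixed table schema could not hold without changes, but which JSON handles naturally.

```python
import json

# Two records from one semi-structured source: the second has an extra
# field and a nested address object, something a strict table schema
# would not allow without altering the schema.
records = json.loads("""
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob", "email": "bob@example.com",
   "address": {"city": "Amsterdam", "zip": "1011 AB"}}
]
""")

for rec in records:
    # Fields that are absent in a record are read with a default value.
    city = rec.get("address", {}).get("city", "unknown")
    print(rec["id"], rec["name"], city)
```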
It can also be possible that certain data without any structure needs to be stored in the data platform. These can be regular text files, for example log files, that do not contain a pre-set data model or schema, or other file types such as images, video, audio or documents. For this reason it is called unstructured data. The nature of unstructured data is that it is not possible to store it in a structured way in the data platform. Only after performing (advanced) analytics one can extract some form of structured data from these source files. You could also consider whether the unstructured data truly needs to be analyzed or if you can take advantage of the information in the metadata. Simply put, metadata is the data about your data. For example, a phone call consists of audio, but it also has properties such as the length of the call, the date and time the phone call was made, and which telephone number has called which other telephone number. For some use cases, this metadata gives enough information and it is therefore not needed to analyze the content of the audio file itself.
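As a minimal sketch of the metadata idea: instead of analyzing the content of an unstructured file, you can often read its file-system metadata directly. The file name and contents below are dummies created just for the demonstration.

```python
import os
import datetime

# Create a small dummy file standing in for an unstructured source file,
# e.g. a call recording (hypothetical name and content).
path = "call_recording.wav"
with open(path, "wb") as f:
    f.write(b"\x00" * 1024)

# The metadata (size, modification time) may already answer the business
# question without analyzing the audio itself.
stat = os.stat(path)
size_kb = stat.st_size / 1024
modified = datetime.datetime.fromtimestamp(stat.st_mtime)
print(f"{path}: {size_kb:.0f} KB, last modified {modified:%Y-%m-%d}")

os.remove(path)  # clean up the dummy file
```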
The fourth type of source is known as streaming data. Think about sensors, machines, IoT devices, audio/video broadcasts, etc., which can all generate streaming data. The challenge with streaming data is that it needs to be captured and processed or stored at the exact moment it is received, which puts strict time constraints on the ingestion process.
As we saw in the examples above, a data platform can use data from most types of data sources (structured, semi-structured, unstructured and streaming), but the type of data source has a huge impact on the choices of the other components in the data platform because it determines how the data can be ingested, processed, stored, analyzed and visualized.
Best practices for the Data Sources Layer:

- Data sources can be replaced once in a while, so don't make the data platform too dependent on the characteristics of a specific data source.
- Use a data extraction method that does not put a high load on the source system.
- If the data source does not contain historical data, make sure that the history is stored somewhere so that it can be rebuilt completely. For example, this can be done by archiving all data extracts.
- Use checks like row counts and summaries to verify that every record is ingested correctly.
- Track which data element comes from which data source and when it was ingested. This can be done with data lineage software, but also by simply storing the data source and a timestamp with each record.
- Consider whether personal data really has to be extracted, given privacy concerns. If the personal data is not needed, don't extract it.
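Two of these practices can be sketched together: stamping every ingested record with its source system and an ingestion timestamp, and a basic row-count completeness check. The source name and CSV contents are illustrative.

```python
import csv
import io
from datetime import datetime, timezone

# A tiny CSV extract standing in for a real source file (illustrative).
raw = "id,amount\n1,100\n2,250\n"
source_name = "crm_prod"  # assumed identifier of the source system
ingested_at = datetime.now(timezone.utc).isoformat()

# Stamp each record with its source and ingestion time, so lineage
# questions ("where did this value come from, and when?") stay answerable.
rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    rec["_source"] = source_name
    rec["_ingested_at"] = ingested_at
    rows.append(rec)

# Basic completeness check: compare the extract's row count with what
# was actually loaded.
expected_rows = raw.count("\n") - 1  # header line excluded
assert len(rows) == expected_rows, "row count mismatch"
print(len(rows), "records ingested from", source_name)
```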
The purpose of the Ingestion Layer is to get the data from the data sources and send it to the processing or storage layer.
An important task of the Ingestion Layer is to connect to the data sources. This connection is often provided by data connectors, which are software programs such as drivers or libraries that can provide a connection to a specific data source. These data sources can be a database or some type of ERP or CRM system.
Another task of the Ingestion Layer is to extract the data from the data source. Data extraction can be done in several ways:
- Full extraction: All data from the table, object or application is extracted at once. This is the simplest method to use, as one does not need to know which data has been altered. Full extraction is ideal for small data sources, but less suitable for large data volumes.
- Incremental extraction: Only the changed records are extracted. For this method, the source system needs to know which records have been modified so only those that have been changed are extracted. Note that not all source systems have a mechanism in place to track the changes. Another challenge that comes from incremental extraction is keeping track of deleted records.
- Change notification: The source system sends a notification that contains which changes have been made, or directs you to the altered records in the database. Change notification is a continuous process and therefore requires immediate capture and processing of the messages, unless you buffer them so that they can be processed in batches. This type of data extraction is ideal for streaming data sources such as sensors or IoT devices. A specific type of change notification is the Change Data Capture (CDC) mechanism. This mechanism works by reading data from database logfiles to capture the changes in the source application. CDC works on the database layer and is supported by most database vendors. The benefit of CDC is that the performance of the source application or source database does not suffer from the data extraction, because no query or logic is run on the data source.
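A common way to implement incremental extraction is with a "high watermark": the platform stores the timestamp of the last successful extract and only pulls records modified after it. The sketch below uses an in-memory list as a stand-in for a source table; column and table contents are illustrative.

```python
from datetime import datetime

# Stand-in for a source table with a modification timestamp per record
# (illustrative data).
source_table = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 1, 5)},
    {"id": 3, "modified_at": datetime(2024, 1, 9)},
]

# Watermark stored by the platform after the previous run.
last_watermark = datetime(2024, 1, 3)

# Only records changed since the last run are extracted.
changed = [r for r in source_table if r["modified_at"] > last_watermark]

# The new watermark is persisted for the next run.
new_watermark = max(r["modified_at"] for r in changed)

print([r["id"] for r in changed])  # only ids 2 and 3 are re-extracted
```

Note that this approach only sees inserts and updates; deleted records never appear after the watermark, which is exactly the deletion-tracking challenge mentioned above.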
For on-premise applications, data is often extracted by reading the data from the database tables, however, with Software-as-a-Service (SaaS) applications it is often not possible to enable access to these database tables. Therefore, to be able to extract data from SaaS applications, these applications often send the changes via change notifications or offer APIs to extract certain data objects.
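Extracting from a SaaS API typically means walking through paged results. The sketch below is hedged: the `fetch_page` function is a hypothetical stand-in for a real HTTP call (e.g. with authentication and a real endpoint), and the "empty page ends the result set" convention varies per API.

```python
# Hypothetical stand-in for an HTTP call such as
# requests.get(f"{base_url}/orders", params={"page": page}).json().
# Real SaaS APIs differ in endpoint, auth and pagination scheme.
def fetch_page(page):
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return data.get(page, [])

all_records = []
page = 1
while True:
    batch = fetch_page(page)
    if not batch:  # an empty page signals the end of the result set here
        break
    all_records.extend(batch)
    page += 1

print(len(all_records), "records extracted")
```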
Another task of the Ingestion Layer is to check the data quality. This can range from basic checks to confirm that all data is extracted, to more sophisticated checks with rules that define what the data should look like. If the data quality is subpar, it can trigger one or multiple events: the data can be rejected, an alert can be raised for manual intervention, or automatic data quality improvement processes can be executed. For example, the street and city can be automatically corrected or filled in based on the postal code and house number.
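A minimal sketch of such rule-based quality checks, with made-up records and rules: each record is evaluated against all rules, and records that violate any rule are set aside for rejection or manual intervention.

```python
# Illustrative records and rules; real rule sets would be configurable.
records = [
    {"id": 1, "postal_code": "1011AB", "amount": 100},
    {"id": 2, "postal_code": "",       "amount": 250},
    {"id": 3, "postal_code": "2511CV", "amount": -5},
]

rules = [
    ("postal_code present", lambda r: bool(r["postal_code"])),
    ("amount non-negative", lambda r: r["amount"] >= 0),
]

accepted, rejected = [], []
for rec in records:
    # Collect the names of all rules this record violates.
    failures = [name for name, check in rules if not check(rec)]
    (rejected if failures else accepted).append((rec["id"], failures))

print("accepted:", [rec_id for rec_id, _ in accepted])
print("rejected:", rejected)
```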
In this blog we have discussed the Data Sources and Ingestion Layers, and highlighted best practices to consider when setting up these layers in your data platform. Deloitte can help you with connecting data sources to a data platform and defining the best data extraction method. Our next blog will be about the Processing Layer. If you want to know more about how the ingested data can be processed in a data platform, read the next blog in our series about the Layered Architecture.
Deloitte's Data Modernization & Analytics team helps clients with modernizing their data-infrastructure to accelerate analytics delivery, such as self-service BI and AI-powered solutions. This is done by combining best practices and proven solutions with innovative, next-generation technologies, such as cloud-enabled platforms and big data architectures.
To make these data extraction methods more concrete, we have listed two examples from practice.
In the first situation, the source application was not able to use the incremental extraction method; however, because of the amount of data, incremental extraction was the preferred option. To solve this, a third-party software program was used that offered this functionality and could perform incremental extracts from the source system.
In the second situation, it was very difficult for the client to extract meaningful data from the database because of the extensive logic implemented in the application layer: the source application had to process the data first before it became meaningful. We opted for an approach where the source application sent all changes as events to the Ingestion Layer, which was then handled as a (near) real-time data source.
Would you like to know more about layered architecture for data platforms, data sources and the Ingestion Layer? Please contact Martijn Blom via +31 88 2880720 or Ingrid Lanting on +31 88 288 04 98.