Layered architecture for data platforms: the place where data transformations take place
Different types of data, data sources, and ingestion methods influence how the data can be transformed. In this third part of our series on Layered Architecture, we describe the different kinds of data transformations, complete with best practices.
By Martijn Blom & Ingrid Lanting
In the previous two blogs on Layered Architecture for Data Platforms, we introduced the complete layered architecture for data platforms, as seen in Figure 1, and dove deeper into the Data Sources and Ingestion Layer.
Figure 1 – Layers of a Data Platform
In this blog, we look into the Processing Layer, whose purpose is to transform the source data into a data model that is ready for analysis and reporting. We will discuss the different types of processing, whether the processing is done by an ETL or an ELT process, and the location of the processing.
Data is ingested from the data sources into the Storage layer of the platform. It is the task of the Processing Layer to orchestrate this procedure and to transform the data in such a way that it can then be stored in the Storage layer.
In the Storage layer, structured data is stored in a certain data model, which can be the same data model as the data source. In that case, no transformations are required on the data. Often, however, the data model in the Storage layer is changed to better suit the purpose of the data platform, and the Processing Layer then transforms the data from the source data model to the data model in the Storage layer.
Real-time or Batch processing
An important decision is whether to apply batch processing or real-time processing. Real-time processing is required when it is always necessary to look at the most current data: it is then not acceptable for the data in the data platform to lag behind the data source. In contrast, batch processing is suitable when a delay in the data is acceptable. Because of the time constraint of needing current data, real-time processing is much more complex, so we advise using it only when absolutely necessary. To reduce the risks of real-time processing, some kind of batch process should also be in place, so there is a way to catch up or reload the data when a technical issue occurs or the data itself needs to be corrected. Real-time processing also needs to be supported by the data source and Ingestion layer, and it requires a constant stream of data changes that can be processed: the Ingestion layer should send events or messages that can be processed in real-time.
An architecture that uses both real-time and batch processing is the Lambda architecture. The Lambda architecture uses a speed layer, which applies real-time processing to make the data immediately available, and a batch layer that provides a more comprehensive and accurate view of the data. The results of the speed layer and the batch layer are combined in a serving layer that makes the data available. The benefit of the Lambda architecture is that it makes data available in real-time while also offering the accuracy of batch processing, which can contain more comprehensive checks than are possible with real-time processing.
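The serving-layer idea can be sketched in a few lines. This is a minimal in-memory illustration, not a specific framework's API: the batch view is the authoritative result that is periodically rebuilt, the speed view holds only what arrived after the last batch run, and the merge rule (speed wins) is our own simplifying assumption.

```python
# Minimal sketch of a Lambda-style serving layer. All names are illustrative.

def serve(batch_view: dict, speed_view: dict) -> dict:
    """Merge batch and speed views; speed-layer entries override batch
    values for keys that changed since the last batch run."""
    merged = dict(batch_view)
    merged.update(speed_view)  # real-time results win over stale batch values
    return merged

batch_view = {"order-1": 100, "order-2": 200}   # rebuilt nightly
speed_view = {"order-2": 250, "order-3": 50}    # streamed since last batch

view = serve(batch_view, speed_view)
print(view)  # order-2 reflects the real-time update, order-3 is brand new
```

When the next batch run completes, its view absorbs the streamed changes and the speed view is emptied, which keeps the real-time path small and simple.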
Real-time or near real-time
When we talk about real-time, we also mean near real-time. The difference is that near real-time allows a small delay (from a few seconds to minutes). How much delay is acceptable depends on the use case; for most use cases, near real-time is sufficient. Real-time processing means that a delay of at most one second is allowed. For example, a solution that monitors the heart rate of ICU patients should have a minimal delay that does not exceed one second. This time requirement makes real-time processing more costly (because more processing power is needed) and/or limits the complexity of the transformations that can be performed within the available time.
Data enrichment
In some cases the source data alone does not completely suit the needs of the data users; data enrichment can then come into play. Data enrichment is the process of appending to or enhancing the collected data. This can be done by combining data from different sources, which can be internal sources such as other applications, or third-party sources such as weather data, social media, and purchased data sets from external companies. Enrichment processes can also automatically append missing data by applying algorithms to the existing data, thereby improving data quality. Some ETL tools have built-in checks and data sets to append missing data or fix existing data, for example the postcode and address check that is often applied. Data enrichment is often used in customer analytics by adding demographic and geographic data to the customer data.
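As a small illustration of enrichment from a purchased data set, the sketch below appends demographic attributes to customer records via a postcode lookup. All records, values, and names are made up for the example.

```python
# Illustrative enrichment step: demographic attributes from a (hypothetical)
# purchased reference data set, keyed on postcode, are appended to customers.

customers = [
    {"id": 1, "name": "Jansen", "postcode": "1011AB"},
    {"id": 2, "name": "de Vries", "postcode": "9999ZZ"},  # no reference match
]

postcode_demographics = {
    "1011AB": {"city": "Amsterdam", "households": 540},
}

def enrich(customer: dict) -> dict:
    extra = postcode_demographics.get(customer["postcode"], {})
    return {**customer, **extra}  # source fields kept, demographics appended

enriched = [enrich(c) for c in customers]
```

Note that enrichment is best kept non-destructive: the source fields survive unchanged, and records without a match simply pass through without the extra attributes.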
ETL or ELT?
The real-time or batch processes that are executed are also called Extract, Transform and Load (ETL) processes. In an ETL process, the data is extracted from the data source, transformed to the data model of the storage layer, and then stored in the storage layer.
Nowadays, as increasing amounts of data need to be stored, the Extract, Load and Transform (ELT) approach is becoming popular: data is first loaded into the storage layer and transformed afterwards. This method is often combined with a data lake, whereby the data is stored as-is in the lake. From there it can be transformed and stored in a data model in another storage technology, or in the data lake again. It can also be the case that the data is never transformed and is only analyzed when needed. The ELT method is often used for semi-structured or unstructured data that is stored as-is in a data lake.
Analyzing data that is stored in a pre-defined structure is much easier, so ETL is the best approach for data that needs to be accessed frequently. For data that is only needed once in a while (or for which it is unsure whether it will ever be needed), it is most efficient to store it as-is and process it only when needed, using ELT.
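The difference between the two approaches is only the moment the "T" happens. The sketch below uses in-memory lists as stand-ins for the warehouse and the data lake (all names are illustrative): ETL casts the records to the target model before loading, while ELT loads the raw records and defers the transformation to read time.

```python
# Contrast of ETL vs ELT with in-memory stand-ins for the storage layer.
import json

raw_source = ['{"amount": "10"}', '{"amount": "20"}']  # raw extract

def etl(source, warehouse):
    # Transform first (cast to the target data model), then load.
    for line in source:
        record = json.loads(line)
        record["amount"] = int(record["amount"])
        warehouse.append(record)

def elt(source, lake):
    # Load as-is; no transformation on the way in.
    lake.extend(source)

def transform_on_read(lake):
    # The deferred "T" of ELT, applied only when the data is actually used.
    return [{**json.loads(line), "amount": int(json.loads(line)["amount"])}
            for line in lake]

warehouse, lake = [], []
etl(raw_source, warehouse)
elt(raw_source, lake)
assert transform_on_read(lake) == warehouse  # same result, different moment
```

Both paths end up with the same modeled records; ELT simply pays the transformation cost only for the data that is actually read.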
Whichever approach you choose, a number of best practices apply:
- Test if every record is processed. This can be done by executing row counts and summaries on the source data and the processed data. Matching results are no guarantee that everything was processed correctly, but a mismatch proves that something has gone wrong.
- Implement logic only once in the code and call the same routine when the same transformation is needed for different data sets. This makes maintenance easier.
- Document the logic that is applied to the source data and perform audits to check that the documentation matches the actual code.
- Implement data lineage to trace how the source data is transformed.
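The row-count-and-summary check can be made concrete in a few lines. This is a sketch with hypothetical names, comparing a record count and one column total between the source extract and the processed set.

```python
# Reconciliation check: compare row counts and a column summary (here a sum)
# between the source extract and the processed data. Names are illustrative.

def reconcile(source_rows, target_rows, key="amount"):
    return {
        "row_count": len(source_rows) == len(target_rows),
        "sum_" + key: sum(r[key] for r in source_rows)
                      == sum(r[key] for r in target_rows),
    }

source = [{"amount": 10}, {"amount": 20}, {"amount": 5}]
processed = [{"amount": 10}, {"amount": 20}]  # one record lost in processing

checks = reconcile(source, processed)
print(checks)  # both checks fail: the mismatch proves something went wrong
```

As noted above, passing checks do not prove correctness; they only rule out the failure modes they measure, which is why a failed check should block the load.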
Location of processing
When processing huge amounts of data, it is best to do this as close as possible to where the data is stored, to minimize data transfer. This can be done by implementing the logic in the database: most ETL tools can push much of the processing logic down to the database layer instead of executing it on a separate server.
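As a small illustration of pushing logic to the database, the sketch below aggregates inside the engine so that only one summarized row per group crosses the wire, instead of every detail row. Python's built-in sqlite3 stands in for the actual warehouse engine; the table and data are made up.

```python
# Pushdown sketch: aggregate inside the database engine rather than pulling
# every row to the ETL server. sqlite3 stands in for the real warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10), ("north", 20), ("south", 5)])

# Pushed-down version: one aggregated row per region is transferred.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 30), ('south', 5)]
```

The alternative, `SELECT * FROM sales` followed by summing in the ETL tool, produces the same numbers but moves every row over the network, which is exactly the transfer cost pushdown avoids.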
Moving away from close proximity, a current trend is serverless computing, where the ETL or ELT processes are executed by cloud services. No dedicated server is necessary and you only pay for the time the processes take to execute. This can be very cost-effective, especially when the processes only run for a short period of time.
Data mining
A specific type of processing is data mining. Data mining is about extracting and discovering patterns and meaningful information in large data sets, and consists of large batch processes that read huge amounts of data. So it is not about processing the data itself, but about detecting and extracting meaningful information from the data. Data mining often consists of one of the following types of processing:
- Anomaly detection: identifying unusual data records
- Association rule learning: finding relationships between data occurrences
- Clustering: finding similarities between data records to group them together
- Classification: categorizing data records on the basis of certain data elements
- Regression: finding the relation between different data elements
- Summarization: providing a short representation of the data sets
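To make one of these concrete, the first technique in the list, anomaly detection, can be sketched with a simple z-score rule: records that lie far from the mean are flagged as unusual. The threshold of two standard deviations and the sample data are our own illustrative choices; real systems use far more sophisticated algorithms.

```python
# Naive anomaly detection: flag values more than `threshold` standard
# deviations away from the mean. Purely illustrative.
from statistics import mean, stdev

def anomalies(values, threshold=2.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 is the unusual record
print(anomalies(readings))  # [95]
```

A z-score rule only works for roughly normally distributed numeric data; the other techniques in the list (clustering, classification, regression) typically require the advanced analytics algorithms mentioned below.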
Some of the above types of data mining can only be done using advanced analytics algorithms. In one of the follow-up blogs we discuss the (advanced) analytics methods in more detail. Also, be very careful when using personal data in combination with data mining, because it can result in privacy and legal issues when not enough precautions are taken.
In this blog we described the different processing types, the difference between ETL and ELT processes and the location of the processing. There is no one-size-fits-all solution. Each situation and use case is different and that will result in different choices about the Processing layer.
Deloitte's Data Modernization & Analytics team helps clients modernize their data infrastructure to accelerate analytics delivery, such as self-service BI and AI-powered solutions. This is done by combining best practices and proven solutions with innovative, next-generation technologies, such as cloud-enabled platforms and big data architectures. We can help you with designing and developing the ETL (or ELT) processes.
Our next blog will be about the Storage Layer. If you want to know more about the different ways data can be stored in the data platform, please read the next blog in our series about the Layered Architecture.