Layered Architecture for Data Platforms: the place where data is stored
Data can be stored in many different technologies such as databases, distributed file systems or cloud storage. In Part 4 of the series on Layered Architecture for Data Platforms we describe which technologies can be used to store which type of data and for which purpose.
In the previous blogs about the Layered Architecture for Data Platforms we introduced the layered architecture for data platforms, dove deeper into the Data sources and Ingestion layer, and discussed the Processing layer. In this blog we look into the storage layer. In general, the storage layer is used to store the data in a data model so that the data can be analyzed or reported on.
Figure 1: Layers of a Data Platform
After the data is ingested from the data sources and transformed in the processing layer, it is stored in the storage layer. The purpose of the storage layer is to protect the data from disasters, malfunctions or user errors, to make the data available to developers, data scientists and end-users, and to archive data that needs to be kept for a long period of time.
There are many different technologies that can be used to store the data, each with its own advantages and drawbacks, depending on the type of storage you need. The most common storage technologies are:
- Relational Database Management Systems (RDBMS): These are relational databases that are often used for Online Transaction Processing (OLTP) systems like the ERP or CRM systems that many companies use. Most applications use a relational database to store the data that is entered in the system. Relational databases can also be a good fit for Online Analytical Processing (OLAP), although other storage technologies are often used for these types of systems.
- Massively Parallel Processing (MPP) database: This is a type of relational database, with the main difference being that the data is split over several computers (with attached storage) that each process a part of the data. These kinds of databases are only used for OLAP solutions and are not suitable for OLTP, because they are designed to process queries that need huge amounts of data and are not good at processing many small requests. MPP databases are mostly used when a regular relational database is not fast enough.
- NoSQL databases: NoSQL (Not Only SQL) databases were introduced to overcome some of the drawbacks of relational databases related to scalability and the flexibility of the data model. NoSQL databases are non-tabular databases and store data differently than relational databases. Further on in this blog we discuss the different types of NoSQL databases.
- Hadoop Distributed File System: This storage technology is a distributed file system that distributes (large) files over several computers with attached storage to make it faster to read and write. The idea of this storage solution is that low-cost servers can be used to limit the cost of storing huge amounts of data.
- In-memory databases: With in-memory databases the data is kept in the computer's main memory. The data on disk is only used as a backup in case of power loss or malfunction, which makes in-memory databases very fast, but also quite expensive. Therefore, in-memory databases should only be used when the amount of data is small, performance is of high importance and the data is frequently accessed.
- Cloud storage: Cloud storage is comparable with the Hadoop Distributed File System. It is a storage solution that can store any kind of data such as files or tables. It is possible to choose between different protocols on how the data can be accessed, how fast it should be, the security options and the reliability (e.g. number of copies stored in different data centers or regions).
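The idea behind an MPP database, where each computer processes only its own part of the data, can be sketched in a few lines of Python. This is a minimal illustration and not how any particular product works: the rows, the number of nodes and the function names are all assumptions, and threads stand in for separate machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (customer_id, amount).
ROWS = [(1, 10.0), (2, 5.0), (3, 7.5), (1, 2.5), (4, 1.0), (2, 4.0)]
NODE_COUNT = 3  # assumed number of nodes in the cluster


def distribute(rows, node_count):
    """Assign each row to a node based on its distribution key (customer_id)."""
    nodes = [[] for _ in range(node_count)]
    for customer_id, amount in rows:
        nodes[customer_id % node_count].append((customer_id, amount))
    return nodes


def local_sum(node_rows):
    """Work done on one node: aggregate only the rows stored there."""
    totals = defaultdict(float)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


def total_per_customer(rows, node_count=NODE_COUNT):
    """Send the query to all nodes, then gather and merge the partial results."""
    nodes = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = {}
    for partial in partials:
        # Keys never overlap, because one customer always lives on one node.
        merged.update(partial)
    return merged
```

Calling `total_per_customer(ROWS)` computes the total amount per customer. A query that scans all rows benefits from this split, while a single small lookup still pays the cost of coordinating the nodes, which is one way to see why this design suits OLAP better than OLTP.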
Figure 2: Categories of NoSQL databases
Most of the products from different vendors in the above categories are much alike, so for most use cases it does not make much of a difference which product and vendor is chosen. This is not the case for the NoSQL databases. There are currently a few hundred NoSQL databases and they each have different properties that make them useful for certain use cases. In general, these NoSQL databases can be categorized in four different groups (see Figure 2):
- Document databases: store data in a semi-structured file format like JSON.
- Key-value stores: store keys and values. Data can only be retrieved by the key.
- Column families: store data in tables, rows and dynamic columns.
- Graph databases: store data in nodes (people, places, things) and edges (relationships).
But also within each group there are many differences between the databases.
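To give a feel for the four groups, the sketch below models the same small piece of information in each style using plain Python structures. The customer, the order and all identifiers are made up for illustration, and real products differ considerably in how they store and query this data.

```python
# Hypothetical example data: one customer who placed one order.

# Document database: one self-contained, nested (JSON-like) document.
document = {
    "_id": "customer:42",
    "name": "Ada",
    "orders": [{"order_id": "A-1", "total": 99.5}],
}

# Key-value store: an opaque value that can only be fetched by its key.
key_value = {"customer:42": b'{"name": "Ada"}'}

# Column family: rows keyed by id; each row may have its own set of columns.
column_family = {
    "customer:42": {"profile:name": "Ada", "orders:A-1": 99.5},
    "customer:43": {"profile:name": "Bob"},  # no order columns on this row
}

# Graph database: nodes plus edges that describe relationships.
nodes = {"customer:42": {"label": "Customer"}, "order:A-1": {"label": "Order"}}
edges = [("customer:42", "PLACED", "order:A-1")]


def orders_placed_by(customer_id):
    """Follow PLACED edges from a customer node to its order nodes."""
    return [dst for src, rel, dst in edges if src == customer_id and rel == "PLACED"]
```

The shapes hint at the typical access pattern of each group: the document is read as a whole, the key-value entry is looked up by key only, the column family allows a different set of columns per row, and the graph is queried by following relationships, as `orders_placed_by` does.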
A data platform can use multiple different NoSQL databases or multiple storage technologies for different purposes or different types of data. However, before deciding on the storage technology you want to use, you need to consider the purpose of the storage layer. This can be:
- Landing Area: The landing area is the place where the ingested data is stored first before it is processed further.
- Staging Area: The staging area is an intermediate storage area for the ETL process.
- Data Warehouse: In the data warehouse, the data is integrated and stored in a data model so that it can be used for analysis or reporting.
- Data Mart: This is a special layer on top of the data warehouse that stores only a part of the data for a specific use case.
- Operational Data Store: A storage solution for the data that is needed for reporting in support of current operations.
- Data Lake: This is a central place that stores all available data.
- Archive: This is where the incoming data is archived.
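How data moves through the first few of these purposes can be illustrated with a small sketch. It assumes an in-memory SQLite database as a stand-in for the platform's storage; the table names, the sample records and the cleaning rules are hypothetical.

```python
import json
import sqlite3

# Landing area: raw records exactly as ingested (hypothetical JSON lines).
landing = [
    '{"id": 1, "country": "nl", "amount": "10.50"}',
    '{"id": 2, "country": "NL", "amount": "4.50"}',
    '{"id": 3, "country": "be", "amount": "7.00"}',
]

db = sqlite3.connect(":memory:")  # stands in for the platform's database

# Staging area: parsed but not yet cleaned; intermediate storage for the ETL process.
db.execute("CREATE TABLE staging_sales (id INTEGER, country TEXT, amount TEXT)")
db.executemany(
    "INSERT INTO staging_sales VALUES (:id, :country, :amount)",
    [json.loads(line) for line in landing],
)

# Data warehouse: integrated, typed and cleaned data in a data model.
db.execute(
    """CREATE TABLE dwh_sales AS
       SELECT id, UPPER(country) AS country, CAST(amount AS REAL) AS amount
       FROM staging_sales"""
)

# Data mart: only the part of the warehouse needed for one use case.
db.execute(
    """CREATE VIEW mart_sales_nl AS
       SELECT id, amount FROM dwh_sales WHERE country = 'NL'"""
)

nl_total = db.execute("SELECT SUM(amount) FROM mart_sales_nl").fetchone()[0]
```

In this sketch everything lives in one database for brevity. In a real platform each purpose may well use a different storage technology.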
For each of these seven possible purposes, there is a specific storage technology that works best (as seen in Table 1).
Table 1: Link storage technology to purpose
A data platform often serves multiple purposes, which means that it can use a different storage technology for each purpose. As you can see in Table 1, cloud storage is good for most purposes except when the data is frequently changed; relational databases are good for most workloads, except when unstructured data is involved or when high performance is needed. Hadoop is a good choice when you need to store unstructured data or when you need a low-cost long-term storage solution. In-memory databases are especially good when performance is very important and the amount of (structured) data is small. MPP databases are a good choice when you need to store and process huge amounts of structured data. NoSQL databases are not good for most purposes, but they can be a very good choice for certain specific use cases.
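The rules of thumb in this paragraph can be written down as a small helper. This is only a rough sketch of the prose above, not a reproduction of Table 1: the function name, its parameters and the `'small'`/`'huge'` volume labels are assumptions, and NoSQL databases are left out because their fit depends on the specific use case.

```python
def suggest_storage(structured, volume, needs_high_performance=False,
                    frequently_changed=False, low_cost_long_term=False):
    """Return candidate technologies; volume is 'small' or 'huge' (hypothetical labels)."""
    candidates = []
    # Hadoop: unstructured data or low-cost long-term storage.
    if not structured or low_cost_long_term:
        candidates.append("Hadoop")
    # MPP: huge amounts of structured data.
    if structured and volume == "huge":
        candidates.append("MPP database")
    # In-memory: performance matters and the structured data set is small.
    if structured and volume == "small" and needs_high_performance:
        candidates.append("In-memory database")
    # Relational: most workloads, except unstructured data or high performance.
    if structured and not needs_high_performance:
        candidates.append("Relational database")
    # Cloud storage: most purposes, except frequently changed data.
    if not frequently_changed:
        candidates.append("Cloud storage")
    return candidates
```

For example, `suggest_storage(False, "huge", low_cost_long_term=True)` suggests Hadoop and cloud storage. A helper like this is a starting point for discussion and does not replace an assessment of your actual requirements.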
When considering which storage technologies to use, you can ask yourself the following questions:
Which type(s) of data do I need to store? Is it structured, semi-structured or unstructured data? Look not only at the current situation but also at future requirements.
What are the performance requirements of the storage technology? Should the storage technology be fast enough to store the incoming data or should it also be fast enough to serve the data to the end-users?
What are the scalability requirements? Do you have a very stable, predictable workload or will the workload vary greatly?
Do you want to store the data in the Cloud? This question can be related to where your data is generated and where you want to consume your data. It can also be a regulatory question whether it is allowed to store the data in the cloud.
Do you want to prevent vendor lock-in? Vendor lock-in can be reduced by using industry standards and open source technologies. Also, some vendors provide storage technologies that work on multiple cloud providers.
The storage layer is one of the places where security plays an important role, which will be covered in more depth in a follow-up blog about the security layer. One of the most important decisions to make is whether security is enforced at the storage layer or at other layers such as the analytics layer and/or the visualization layer. If security is enforced at the storage layer, only people with the right permissions can gain access to the data, independent of the tool being used. If security is enforced in the analytics layer and/or the visualization layer, it should be ensured that it is not possible to bypass those layers to access the data in the storage layer. There are advantages and drawbacks to each option, which will be discussed in the blog about the security layer. The storage layer can also play another important role in security: many storage technologies support auditing of all activities. An audit log records which users have accessed which data, and it can be used to raise alerts when certain users perform specific activities.
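What enforcement and auditing at the storage layer mean in practice can be sketched as follows. The `AuditedStore` class, the users and the datasets are all hypothetical; real storage technologies provide their own access control and audit features, which work differently from this toy example.

```python
from datetime import datetime, timezone


class AuditedStore:
    """Hypothetical store that checks permissions and logs every read attempt."""

    def __init__(self, data, permissions):
        self._data = data                # dataset name -> rows
        self._permissions = permissions  # user -> set of readable dataset names
        self.audit_log = []              # one entry per access attempt

    def read(self, user, dataset):
        allowed = dataset in self._permissions.get(user, set())
        # Log the attempt before deciding, so denied requests are audited too.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not read {dataset}")
        return self._data[dataset]


store = AuditedStore(
    data={"salaries": [("ada", 5000)], "products": [("bike", 300)]},
    permissions={"hr_analyst": {"salaries", "products"}, "intern": {"products"}},
)
```

Because the check sits in the store itself, it applies no matter which tool calls `read`. Entries with `allowed` set to `False` are the kind of event an alert could be based on.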
In this blog we described the different storage technologies and the purposes for which storage is needed in a data platform. Some storage technologies are a better fit for certain purposes than others, as shown in Table 1. We have also introduced some standard questions that can help you decide which storage technologies to use.
Deloitte can help you with choosing which storage technologies to use for the data platform and we can also help you with implementing them. Our next blog will be about the Analytics Layer. If you want to know more about how the data can be analyzed, please read our next blog in our series about the Layered Architecture.
Deloitte's Data Modernization & Analytics team helps clients modernize their data infrastructure to accelerate analytics delivery, such as self-service BI and AI-powered solutions. This is done by combining best practices and proven solutions with innovative, next-generation technologies, such as cloud-enabled platforms and big data architectures.
Design a backup and restore strategy.
Think about data retention. How long should the data be kept? How long are you legally required, and legally allowed, to keep the data?
Make a choice between schema-on-read and schema-on-write. Schema-on-write means that all incoming data is modelled before it is stored, while schema-on-read means storing the data as-is and modelling it only when it is used. Often a data warehouse uses schema-on-write while a data lake uses schema-on-read.
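The difference between the two approaches can be shown with a minimal sketch. The schema, the sample record and the function names are assumptions made for illustration, and plain Python lists stand in for the warehouse and the lake.

```python
import json

SCHEMA = {"id": int, "amount": float}  # hypothetical target model


def apply_schema(record):
    """Model a raw record: keep the known fields and cast them to their types."""
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}


# Schema-on-write: model every record before it is stored; bad records fail early.
def write_modelled(store, raw_line):
    store.append(apply_schema(json.loads(raw_line)))


# Schema-on-read: store the raw line as-is and model it only when it is used.
def write_raw(store, raw_line):
    store.append(raw_line)


def read_modelled(store):
    return [apply_schema(json.loads(line)) for line in store]


warehouse, lake = [], []
line = '{"id": "7", "amount": "12.5", "note": "not in the model"}'
write_modelled(warehouse, line)
write_raw(lake, line)
```

After these calls the warehouse holds only the modelled fields, while the lake still holds the full original line, including the `note` field that the model ignores. With schema-on-read, a later change to the model can therefore be applied to old data as well, at the cost of doing the modelling work on every read.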