The Future of ETL Tools


The Future of Extract, Transform & Load Tools | Part II

A view on future ETL tools and applications

In the previous article in this series, we discussed the current role of ETL tools in the data warehouse environment and the new requirements placed on ETL by the influx of big data. In this article, we look at how ETL tools are coping with this change and discuss several future ETL applications and approaches.

ETL for IoT

The Internet of Things is creating a fundamental shift in ETL requirements: the volume, variety, and complexity of IoT data push current ETL tools to their technical limits. Simply getting the data onto your platform so that "the real work" can begin is a major challenge.

Apache NiFi is a single ingestion platform that gives you out-of-the-box tools to ingest several data sources in a secure and governed manner. NiFi enables easy collection, curation, analysis, and action on any data anywhere (edge, cloud, data center) with built-in end-to-end security and provenance. Apache MiNiFi, a subproject of Apache NiFi, is a lightweight agent that focuses on data collection at the edge. MiNiFi integrates easily with NiFi to build an end-to-end flow management solution that is scalable, secure, and provides a full chain of custody for information.
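To make the edge-to-platform pattern concrete, here is a minimal Python sketch of an edge agent pushing sensor readings into a NiFi flow fronted by a ListenHTTP processor. The host, port, and payload shape are assumptions for illustration; in a production deployment, MiNiFi itself would handle collection and forwarding through its flow configuration.

```python
import json
import random
import time

import requests  # assumed to be available on the edge device

# Hypothetical NiFi ListenHTTP endpoint; host and port are assumptions.
NIFI_INGEST_URL = "http://nifi-host:8081/contentListener"

def read_sensor():
    """Stand-in for a real sensor read on the edge device."""
    return {
        "device_id": "edge-001",
        "temperature_c": round(random.uniform(18.0, 30.0), 2),
        "ts": time.time(),
    }

def main():
    while True:
        reading = read_sensor()
        # Each POST becomes a FlowFile in the NiFi flow, where it can be
        # routed, enriched, and delivered with provenance tracking.
        resp = requests.post(
            NIFI_INGEST_URL,
            data=json.dumps(reading),
            headers={"Content-Type": "application/json"},
            timeout=5,
        )
        resp.raise_for_status()
        time.sleep(10)

if __name__ == "__main__":
    main()
```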


Data Management Framework (NoETL)

As data volumes become too large to move into a central repository, NoETL eliminates the need to load data in batches. Instead, data is read and transformed to fit the target schema on demand, at the moment an application requests it.
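To make the idea concrete, here is a plain-Python sketch (all names and data hypothetical) in which records are read from the operational source and reshaped only at query time, with no batch load into a central repository:

```python
from datetime import datetime
from typing import Iterator

# Hypothetical operational source: rows stay where they are and are only
# read and reshaped when an application actually asks for them.
SOURCE_ROWS = [
    {"id": 1, "amount_cents": 1999, "created": "2023-01-05T10:00:00"},
    {"id": 2, "amount_cents": 450,  "created": "2023-01-06T12:30:00"},
]

def query_orders(since: datetime) -> Iterator[dict]:
    """Read and transform on demand; no batch load, no staging tables."""
    for row in SOURCE_ROWS:
        created = datetime.fromisoformat(row["created"])
        if created >= since:
            # Transform to the shape the consuming application expects.
            yield {
                "order_id": row["id"],
                "amount_eur": row["amount_cents"] / 100,
                "created": created,
            }

for order in query_orders(since=datetime(2023, 1, 6)):
    print(order)
```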


Self-Service Data Preparation

With the emergence of data scientists, the need for agile, business-user-driven data preparation and integration has increased. Instead of just being the client, the business wants to be involved in the data loading and processing stages. Self-service data preparation means business users can perform tasks that were previously available only to developers.


Machine Learning meets Data Integration

Data solution vendors like SnapLogic and Informatica are already developing smart data integration assistants based on machine learning and artificial intelligence (AI). These assistants can recommend the next-best action or suggest datasets, transforms, and rules to a data engineer.
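These commercial assistants are proprietary, but the core idea can be illustrated with a toy rule-based recommender in Python. This is a deliberately simplified stand-in, not how CLAIRE or SnapLogic actually work: it inspects column metadata and proposes likely transforms for a data engineer to accept or reject.

```python
def suggest_transforms(columns):
    """Toy rule-based stand-in for an ML-driven integration assistant.

    columns: list of (name, dtype) pairs describing an incoming dataset.
    Returns (column, suggested transform) pairs, mimicking a
    next-best-action recommendation.
    """
    suggestions = []
    for name, dtype in columns:
        lowered = name.lower()
        if dtype == "string" and ("date" in lowered or lowered.endswith("_at")):
            suggestions.append((name, "parse string to timestamp"))
        if dtype == "string" and "email" in lowered:
            suggestions.append((name, "validate and lowercase email"))
        if dtype == "float" and ("amount" in lowered or "price" in lowered):
            suggestions.append((name, "round to 2 decimals, tag as currency"))
    return suggestions

print(suggest_transforms([
    ("created_at", "string"),
    ("customer_email", "string"),
    ("order_amount", "float"),
]))
```

A real assistant would learn these rules from the history of pipelines built across many customers rather than hard-coding them.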

Informatica’s Cloud-scale AI-powered Real-time Engine (CLAIRE) technology for metadata-driven AI is aimed at delivering an effective metadata management and data governance approach for data in cloud, on-premises and big data environments.

Alation is a solution that provides collaborative data cataloging for the enterprise. It combines the power of machine learning with human insight to automatically capture information about what the data describes, where it comes from, who is using it, and how it is used. Connect Alation to your data sources and the system crawls and indexes data assets stored across different physical repositories, including databases, Hadoop files, and data visualization tools.
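The crawl-and-index step can be approximated for a single relational source with SQLAlchemy's inspector. This is a minimal sketch with an assumed connection string; a real catalog like Alation layers usage statistics, lineage, and human annotations on top of this kind of technical metadata.

```python
from sqlalchemy import create_engine, inspect

# Hypothetical connection string; point it at any SQLAlchemy-supported DB.
engine = create_engine("postgresql://user:pass@localhost/warehouse")
inspector = inspect(engine)

catalog = {}
for schema in inspector.get_schema_names():
    for table in inspector.get_table_names(schema=schema):
        columns = inspector.get_columns(table, schema=schema)
        # Record basic technical metadata; a real catalog would enrich
        # this with usage, lineage, and descriptions.
        catalog[f"{schema}.{table}"] = [
            (col["name"], str(col["type"])) for col in columns
        ]

for asset, cols in catalog.items():
    print(asset, cols)
```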


Unified Data Management architecture

Organizations require a single platform for enterprise data integration, quality, and governance that supports the entire enterprise. A unified data management (UDM) system combines OLAP and OLTP requirements without expensive and error-prone ETL. It can support workloads from multiple sources, whether running on-premises or in a private/public cloud, and those workloads are backed by a massively parallel architecture to reduce computing time.

Databricks Delta is a prime example of this class. Delta implements the unified data management layer using an underlying optimized Spark table that stores data as Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table. Databricks Delta allows multiple writers to modify a dataset simultaneously without interfering with jobs that are reading it, which continue to see consistent views.
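In PySpark, the read/write pattern looks roughly like this (a minimal sketch; the table path and schema are assumptions, and the Delta format is available out of the box on Databricks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "/delta/events"  # hypothetical DBFS path

# Writer: append new events; each commit is recorded in the
# transaction log (_delta_log) as an atomic operation.
new_events = spark.createDataFrame(
    [(1, "click"), (2, "purchase")], ["user_id", "event"]
)
new_events.write.format("delta").mode("append").save(path)

# Reader: sees a consistent snapshot of the table even while other
# jobs are appending; the transaction log determines which Parquet
# files belong to the version being read.
events = spark.read.format("delta").load(path)
events.groupBy("event").count().show()
```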

Stay up to date

The fact remains, however, that in-memory technologies like Spark will soon make traditional ETL largely unnecessary by performing all the steps of a data warehouse architecture on a single platform. It is therefore important for ETL professionals to build a diverse portfolio of data integration tools: keep their current knowledge of traditional ETL tools while scaling up their skills to process 'big' datasets in ways that are scalable, high-performance, and real-time.
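As an example of that single-platform approach, the extract, transform, and load steps that once spanned separate tools can collapse into one Spark job (a simplified sketch; the paths and column names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-in-spark").getOrCreate()

# Extract: read raw CSV from a data lake path (assumed).
raw = spark.read.option("header", True).csv("/data/raw/orders.csv")

# Transform: clean, type, and aggregate in the same engine, in memory.
orders = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
       .groupBy("customer_id")
       .agg(F.sum("amount").alias("lifetime_value"))
)

# Load: write the serving-ready table as Parquet (or Delta) directly.
orders.write.mode("overwrite").parquet("/data/marts/customer_ltv")
```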

So, what is your plan to deal with the evolving ETL needs?

Return to part I on The Future of ETL
