“Data lakes” give business users direct access to raw data for analysis, potentially speeding the decision-making process. But CDOs need to put the right security, support, and governance measures in place for this model to work effectively.
Organizations are constantly looking for better ways to turn data into insights, which is why many government agencies are now exploring the concept of data lakes.
Data lakes combine distributed storage with rapid access to data, which can allow for faster analysis than more traditional methods such as enterprise data warehouses. Data lakes are special, in part, because they provide business users with direct access to raw data without significant IT involvement. This “self-service” access lets users quickly analyze data for insights. Because they store the full spectrum of an enterprise’s data, data lakes can break down the challenge of data silos that often bedevil data users. Implemented correctly, data lakes provide insight at the point of action, and give users the ability to draw on any data at any time to inform decision-making.
Data lakes store information in its raw and unfiltered form—whether it is structured, semi-structured, or unstructured. A data lake performs little automated data cleansing or transformation. Instead, data lakes shift the responsibility of data preparation to the business.
Providing broad access to raw data presents both a challenge and an opportunity for CDOs. By enabling easy access to enterprise data, data lakes allow subject matter experts to perform data analytics without going through an IT “middleman.” At the same time, however, these data lakes must provide users with enough context for the data to be usable—and useful.
CDOs can play a major role in the development of a data lake, providing a strategic vision that encourages usability, security, and operational impact.
A federal manufacturing facility’s CIO wanted faster access to large volumes of data in its native format to scale and adapt to the changing needs of the business. To accomplish this, the facility implemented a data lake, which uses distributed servers to efficiently process and store nonrelational data. This platform complements the organization’s existing data warehouse to support self-service and open-ended data discovery. Users now have on-demand access to business-created data sets built from raw data, reducing the time to access data from 16 weeks to three.
A poorly executed data lake is known as a data swamp: a place where data goes in, but does not come out. To ensure that a data lake provides value to an organization, a CDO should take some important steps.
Imagine being turned loose in a library without a catalog or the Dewey Decimal System, and where the books are untitled. All the information in the books is there, but good luck turning it into useful insight. The same goes for data lakes: To reap the data’s value, users need a metadata “map” to locate, make sense of, and draw relationships among the raw data stored within. This metadata layer provides additional context for data that flows through to the data lake, tagging information for ease of use later on.
Too often, raw data is stored with insufficient metadata to give the user enough context to make gainful use of it. CDOs can help combat this situation by acting as a metadata champion. In this capacity, the CDO should make certain that the metadata in the data lakes he or she oversees is well understood and documented, and that the appropriate business users are aware of how to use it.
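To make the idea concrete, a metadata “map” can be as simple as a searchable index of descriptive records over raw files. The sketch below is illustrative only; the catalog structure, field names, and tags are assumptions for the example, not a specific product’s API.

```python
# Minimal sketch of a metadata catalog for raw files landing in a data lake.
# The record fields and tags are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """One catalog record describing a raw file in the lake."""
    path: str                                 # location of the raw data
    owner: str                                # business steward to contact
    description: str                          # plain-language meaning of the data
    tags: list = field(default_factory=list)  # searchable labels

class MetadataCatalog:
    """In-memory catalog; a real deployment would use a persistent metadata store."""
    def __init__(self):
        self._entries = []

    def register(self, entry: DatasetEntry) -> None:
        self._entries.append(entry)

    def search(self, tag: str) -> list:
        """Locate datasets by tag so users need not scan raw storage blindly."""
        return [e for e in self._entries if tag in e.tags]

catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    path="/lake/raw/claims_2023.csv",
    owner="claims-team",
    description="Unprocessed benefit claims, one row per submission",
    tags=["claims", "raw", "pii"],
))

print(catalog.search("claims")[0].path)  # /lake/raw/claims_2023.csv
```

Even a lightweight registry like this gives business users the context the raw files themselves lack: where the data lives, who stewards it, and what it means.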
By putting appropriate security and controls in place, CDOs will be better positioned to meet increasingly stringent compliance requirements. Given the vast amount of information data lakes typically contain, CDOs need to control which users have access to which parts of the data.
Role-based access control (RBAC) is a control mechanism defined around roles and privileges through security groups. The components of RBAC—such as role permissions, user roles, and role-to-role relationships—make it simple to grant individuals specific access and use rights, minimizing the risk of noncleared users accessing sensitive data. Within most data lake environments, security typically can be controlled with great precision, at a file, table, column, row, or search level.
Besides improving security, role-based access simplifies the user experience because it provides users with only the data they need. It also enhances consistency, which can build users’ trust in the accessed data; this, in turn, can increase user adoption.
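At its core, RBAC is two mappings, roles to permissions and users to roles, plus an access check that consults both. The minimal sketch below illustrates the mechanism; the role names, permission strings, and users are invented for the example.

```python
# Minimal RBAC sketch: roles grant permissions; users hold roles.
# All role and permission names here are hypothetical.

ROLE_PERMISSIONS = {
    "analyst": {"read:sales", "read:inventory"},
    "auditor": {"read:sales", "read:hr_salaries"},
}

USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"analyst", "auditor"},
}

def can_access(user: str, permission: str) -> bool:
    """Grant access if any of the user's roles carries the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)

print(can_access("alice", "read:sales"))        # True
print(can_access("alice", "read:hr_salaries"))  # False
```

Because access is granted to roles rather than individuals, onboarding a new analyst is a one-line change to the user-role mapping, and revoking a sensitive permission touches a single role definition.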
Preparing a new data set can be an extremely time-consuming activity that can stymie data analysis before it begins. To obtain a reliable analytic output, it’s usually necessary to cleanse, consolidate, and standardize the data going in—and with a data lake, the responsibility of preparing the data falls largely into the hands of the business users. This means the CDO must work with business users to give them tools for data prep.
Thankfully, software is emerging to help with the work of data preparation. The IT organization should work collaboratively with the data lake’s business users to create tools and processes that allow them to prepare and customize data sets without needing to know technical code—and without the IT department’s assistance. Equipped with the right tools and know-how, business data users can prepare data efficiently, allowing them to focus the bulk of their efforts on data analysis.
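A typical self-service preparation step cleanses, consolidates, and standardizes raw records before analysis. The sketch below shows what such a step might look like in plain Python; the field names, formats, and sample rows are hypothetical.

```python
# Illustrative data-prep step: cleanse, consolidate, and standardize raw rows.
# Field names and date formats are assumptions for the example.

from datetime import datetime

raw_rows = [
    {"id": "001", "amount": " 42.50 ", "date": "01/15/2023"},
    {"id": "001", "amount": " 42.50 ", "date": "01/15/2023"},  # duplicate entry
    {"id": "002", "amount": "7.00",    "date": "2023-02-03"},
]

def standardize_date(value: str) -> str:
    """Normalize mixed date formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def prepare(rows: list) -> list:
    """Strip whitespace, coerce types, standardize dates, and drop duplicates."""
    seen, clean = set(), []
    for row in rows:
        record = (row["id"], float(row["amount"].strip()), standardize_date(row["date"]))
        if record not in seen:  # consolidate duplicate submissions
            seen.add(record)
            clean.append({"id": record[0], "amount": record[1], "date": record[2]})
    return clean

print(prepare(raw_rows))
```

Commercial data-prep tools wrap exactly these kinds of operations behind a point-and-click interface, which is what lets business users do this work without writing code.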
Self-service data analysis will go more smoothly if users can use familiar tools rather than having to learn new technologies. CDOs should strive to ensure that the business’s data lake(s) will be compatible with the tools the business currently uses. This will greatly enhance the data lake platform’s effectiveness and adoption. Fortunately, data lakes support many varieties of third-party software that leverage SQL-like commands, as well as open source languages such as Python and R.
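For instance, a user who already knows SQL can summarize lake data through a SQL-like engine without learning anything new. The sketch below uses Python’s built-in SQLite as a stand-in for such an engine; the table and figures are invented for illustration.

```python
# Familiar-tools sketch: a SQL-savvy user queries data with standard SQL.
# SQLite stands in here for the SQL-like engines data lakes commonly expose.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (region TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("east", 120), ("west", 80), ("east", 30)],
)

# An everyday aggregation, written exactly as the user already knows how.
totals = conn.execute(
    "SELECT region, SUM(count) FROM visits GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('east', 150), ('west', 80)]
```

The same query skills transfer directly to lake-native engines, which is precisely why compatibility with existing tools drives adoption.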
Once users have access to data, they will use it; that is the whole point of self-service. But what if a user makes an error in data extraction, leading to an inaccurate result? Self-service is fine for exploring data, but for mission-critical decisions or widespread dissemination, analytical outcomes must be governed in a way that ensures trust.
One approach to maintaining appropriate governance controls is to use “zones” for data access and sharing, with different zones allowing for different levels of review and scrutiny (figure 1). This allows users to explore data for inquiry without exhaustive review, while requiring that data to be broadly shared or used in critical decisions be appropriately vetted. With such controls in place, a data lake’s ecosystem can perform nimbly while limiting the impact of mistakes in extraction or interpretation.
Figure 1 illustrates one possible governance structure for a data lake ecosystem in which different zones offer appropriate governance controls:
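One way to picture such zones is as an ordered ladder that data climbs, with a review gate before the broadly shared levels. The sketch below is a simplified illustration; the zone names and the single-reviewer rule are assumptions for the example, not a prescribed design.

```python
# Sketch of zone-based governance: data moves to broader zones only after review.
# Zone names and the review rule are hypothetical.

ZONES = ["raw", "exploratory", "curated", "published"]

class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.zone = "raw"      # every data set starts in the raw zone
        self.reviews = set()   # record of who vetted each promotion

def promote(ds: Dataset, reviewer: str = None) -> None:
    """Exploration needs no sign-off; broader sharing requires a reviewer."""
    nxt = ZONES[ZONES.index(ds.zone) + 1]
    if nxt in ("curated", "published"):
        if reviewer is None:
            raise PermissionError(f"promotion to {nxt!r} requires a reviewer")
        ds.reviews.add(reviewer)
    ds.zone = nxt

ds = Dataset("claims_summary")
promote(ds)                       # raw -> exploratory: free-form inquiry
promote(ds, reviewer="steward")   # exploratory -> curated: vetted for sharing
print(ds.zone)  # curated
```

The key design point is asymmetry: the cost of review is paid only when data crosses into a zone where mistakes would propagate widely.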
Implementing a data lake is more than a technical endeavor. Ideally, the establishment of a data lake will be accompanied by a culture shift that embeds data-driven thinking across the enterprise, fostering collaboration and openness among various stakeholders. The CDO’s leadership through this transition is critical in order to give employees the resources and knowledge needed to turn data into action.
A federal agency CIO team built and deployed analytics tools to support operations and foster an insight-driven organization. The goal was to create an environment in which stakeholders consistently incorporated analysis, data, and reasoning into the decision-making process across the organization, including decisions about enhancing data infrastructure. To help users apply these tools to their full potential, the agency launched an “Analytics University.” The program was well received: More than 20,000 field employees completed level 1 courses, with 90 percent saying they would recommend them to a colleague. Upper management’s support for investing in the use and understanding of data analytics across the organization encouraged a data-driven culture, and this culture shift continues to enable business adoption of big data technologies.
CDOs are responsible for more than just the data in the data lake; they are also responsible for helping to equip the workforce with the data skills needed to use the data lake effectively. One way to help achieve this is for CDOs to advocate for and invest in employees who have the necessary skills, attitude, and enthusiasm. Specialized trainings, town halls, data boot camps—a variety of approaches may be needed to foster not only the technical skills, but the courage to change outdated approaches that trap data in impenetrable silos. The best CDOs will create an organization of data leaders.
CDOs may need to work with senior business leaders and HR in the drive for change. They should strive to overcome barriers, highlight data champions throughout the organization, and lead by example.
Governance over data lakes must walk a fine line: providing effective gatekeeping for the data lake without impeding users’ speed or success in using it. Traditionally, governance bodies for data defined terms, established calculations, and presented a single voice for data. While this remains necessary, governance bodies for a data lake should also establish best practices for working with the data, such as reviewing data outputs with business users and prioritizing ingestion into the data lake environment. Organizations should establish thorough policy-based governance to control who loads which data into the data lake and when or how it is loaded.
Technology is never static; it will always evolve, improve, and disrupt at a dizzying speed. The technology surrounding data lakes is no exception. Thus, CDOs must continue to make strategic investments in their data lake platforms to update them with new technologies.
To do this effectively, CDOs must educate themselves about current opportunities for improving the data lake and about new technologies that will reduce users’ burden. Doing so will enable more users to apply data in their everyday work and decisions. Keeping up to date is straightforward: Read journals and trade publications, attend conferences and meetups, talk to users, and be critical of easy-sounding solutions. This will empower a CDO to sift through the vaporware, buzzwords, and flash to identify tactical, practical, and necessary improvements.
The challenges associated with traditional data storage platforms have led today’s business leaders to look for modern, forward-looking, flexible solutions. Data lakes are one such solution that can help government agencies utilize information in ways vastly different than was previously possible.
It would be easy to say that there is a one-size-fits-all approach and that every organization should have a data lake, but this is not true. A data lake is not a silver bullet, and it is important for CDOs to evaluate their organization’s specific needs before making that investment. By planning properly, understanding user needs, educating themselves on the potential pitfalls, and fostering collaboration, a CDO can gain a solid foundation for making the decision.