Let AI deliver value by using MLOps



6 Success Factors

Deloitte has identified six factors that help AI deliver value through MLOps, leading to impressive outcomes.

Among organizations that use artificial intelligence, or plan to use it, only 54% of AI-based projects reportedly make it from pilot to production. That success rate hasn’t daunted everyone – from chatbots to image recognition, AI is already being deployed with the hope of boosting efficiency, reducing costs, and creating new business opportunities.

The world may be smitten with AI, but it’s not always easily integrated into production to help solve real-world business challenges. This is where Machine Learning Operations (MLOps) comes in. If you’re new to MLOps, think of it as DevOps for AI: a set of practices to develop, deploy, and maintain machine learning models in production, reliably and efficiently.

Because simply using MLOps tools isn’t a guaranteed way to productionize AI, Deloitte has identified six key factors that lead to positive outcomes. Whether you’re just getting started with MLOps or looking to improve your existing platform, these factors will steer you past basic implementation, to actually seeing impressive AI results.

What’s Missing from Most MLOps?

MLOps aims to help organizations deliver and apply high-quality models faster and with fewer errors. Its practice can be summarized in the seven core principles shown in Figure 1.

Figure 1: Seven core principles of MLOps

You don’t need to fulfil all principles from the start, and there are several services and concepts that facilitate these principles, including:

  1. An ML experimentation environment with access to raw and curated data
  2. A feature store to accommodate a generic set of features, including lineage
  3. Experiment tracking to store results of hyper-parameter optimization
  4. A model registry to store models’ artifacts and model lineage
  5. ML pipelines to operationalize all elements of model training and inference
  6. Monitoring tools to monitor data and model quality
  7. CI/CD pipelines for code, data and models to deploy to several environments
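To make two of these concepts concrete, the experiment-tracking and model-registry items (3 and 4) can be reduced to a minimal in-memory sketch. This is a hypothetical illustration only – real platforms use dedicated tools such as MLflow or a cloud-managed registry – but it shows the core idea: models are stored as versioned artifacts together with their lineage.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRegistry:
    """Minimal sketch of a model registry: versioned artifacts with lineage.

    Hypothetical illustration only; production setups use tools such as
    MLflow or a cloud-managed registry.
    """
    _models: dict = field(default_factory=dict)

    def register(self, name: str, version: int, artifact: bytes, lineage: dict):
        # Lineage records which experiment and data produced this model.
        self._models[(name, version)] = {"artifact": artifact, "lineage": lineage}

    def latest(self, name: str):
        versions = [v for (n, v) in self._models if n == name]
        return max(versions) if versions else None

registry = ModelRegistry()
registry.register("doc-search", 1, b"weights", {"experiment": "exp-42", "data": "curated/2024-01"})
registry.register("doc-search", 2, b"weights", {"experiment": "exp-43", "data": "curated/2024-02"})
print(registry.latest("doc-search"))  # highest registered version
```

The lineage dictionary is what later makes model lineage (principle 4 above) possible: for any deployed version, you can trace back which experiment run and which data snapshot produced it.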

MLOps attracts as much rapt attention as AI itself. New MLOps tools are released almost daily, and expectations are high that these tools will be magic bullets for running models in production. But MLOps needs the right technology and people performing the right processes. Building MLOps platforms can be challenging and may not yield the desired results.

Merely adding a flavor of MLOps to your project is no guarantee of success.

Back to Basics: Shaping Your MLOps Platform

An MLOps platform is the foundation of your AI application, providing the infrastructure necessary for data processing and AI, such as data storage and compute. All services and concepts mentioned above run on the platform. In some cases, an application that consumes the results also runs on the platform, such as a web app. In addition to providing the tooling, the platform facilitates scaling and security for your AI application. These days, platforms are often built in the cloud. Infrastructure definitions and configurations are stored as code (infrastructure as code, IAC), and developed and deployed using CI/CD principles.

Before starting a project to develop an MLOps platform, you should first define:

  1. The business problem: This comes from the organization and is the reason for starting the project. An example can be: “It’s hard to find the correct information in our document management system.”
  2. The analytical problem: This translates the business problem into something that we can solve using an AI product. For example: “Can we provide the correct information from our document management system through smart interaction with the user and a smart document search using NLP?”

The analytical problem is the starting point for your project, and the final AI solution should be able to address the business problem. But, as mentioned, success actually relies heavily on six factors, described below. Skipping any of them doesn’t stack the odds in your favor.

Success Factor 1. Gather Requirements at the Start

What kind of AI solution do you want to build? To define your requirements, consider the business and analytical problems, as well as the organization, data availability and existing infrastructure. Then ask questions about the solution, such as:

  • Current situation: What can you reuse from previous projects, and what infrastructure is already available?
  • End-users and future developers: Who will maintain and use the solution in the end state, and what kind of frameworks are they familiar with?
  • Data: What is the quantity of the incoming data? Is it structured or unstructured? How often does new data come in? Can developers and data scientists use production data, or is a synthetic data set required?
  • Location: What can happen in the cloud and at the client’s site, and do you need to use edge computing?
  • Pre-processing: Do you need to pre-process the data in batch mode or streaming? How do you clean the data? Are there requirements for data lineage? Do you need a feature store? (Most likely, you don’t.)
  • AI: What models do you need (e.g., rule-based, ML, deep learning, supervised or unsupervised)? How many models do you need to train and use in parallel? Can you directly connect training and inference, or are trained models applied multiple times to new data? Are there any model lineage requirements?
  • Post-processing: Do you need a feedback loop to update data based on model results and user feedback? Do you need to monitor model performance and do you need to do anything with the results?
  • Consumption: How do you consume the data (e.g., API, web app)? How many use cases do you build on the platform, and do they end up in the same consumption layer?
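One lightweight way to make the answers to these questions actionable is to record them in a structured form that the later design (success factor 4) can reference. A minimal sketch, with hypothetical field names mirroring some of the questions above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SolutionRequirements:
    # Hypothetical fields mirroring the requirement questions; extend as needed.
    data_structured: bool      # structured vs. unstructured incoming data
    data_arrival: str          # e.g. "daily batch", "streaming"
    processing_mode: str       # "batch" or "streaming" pre-processing
    needs_feature_store: bool  # most likely False
    model_types: tuple         # e.g. ("rule-based", "NLP", "deep learning")
    consumption: str           # e.g. "API", "web app"

reqs = SolutionRequirements(
    data_structured=False,
    data_arrival="daily batch",
    processing_mode="batch",
    needs_feature_store=False,
    model_types=("NLP",),
    consumption="web app",
)
print(reqs.processing_mode)
```

A frozen dataclass like this doubles as documentation: once requirements are agreed, they cannot be silently mutated during development, and any change forces an explicit conversation with stakeholders.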

One source that might be helpful is GitLab’s Jobs To Be Done, a list of objectives to accomplish with MLOps.

Along with the requirements, identify critical success factors, like delivering on time, staying within budget limits and meeting quality standards. Also, identify the technical risks of your project, addressing them in a proof of concept (POC; see success factor 3). Other types of risks, like expected changes in requirements, should be addressed by other risk-management techniques.

At this point, a first version of a design can be drawn. Illustrate how data and models move through the applications, and how MLOps enables this process with its services and tools. The design helps you understand the problem and the solution, gather requirements, have conversations with stakeholders and define the scope of a POC.

Success Factor 2. Verify the Need for a Platform, Manage Expectations

Not all use cases call for an MLOps platform; running and maintaining a platform is expensive, and resources are scarce, so confirm the need. For some projects, a model trained once and combined with a dashboard can address the business problem; the two can be packaged together, shared with end users, and run on local machines, without the need for an MLOps platform.

But MLOps platforms are essential when AI is used in production apps. The following (not exhaustive) list describes the preconditions of such apps that are fulfilled by MLOps:

  • AI is used in a production environment, separate from the development environment.
  • Development and training are automated and executed in continuous iterations.
  • Data lineage and model lineage are necessary.
  • A clear view of data quality and model performance is necessary.
  • Certain security requirements are met.

If you decide to develop and implement an MLOps platform, make stakeholders (e.g., product owners and the executive team) aware of the components involved. Usually, stakeholders are interested only in the ML aspect, as it directly solves the business issue. But it’s essential to explain that investments in infrastructure and development are needed to build an MLOps platform. That way, stakeholders understand it’s not a simple task and expectations are managed from the start.

When you run models in production, it’s essential to continuously monitor, update, improve and deploy them, which makes the MLOps platform essential but, potentially, costly. Manage stakeholders’ expectations about these investments from the start.
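Continuous monitoring need not start with heavyweight tooling. As a hedged sketch, a simple statistical check can already flag when production inputs drift away from the training distribution. This is a plain mean-and-spread comparison under illustrative thresholds; production monitoring typically uses richer tests and dedicated services.

```python
import statistics

def drift_alert(train_values, live_values, threshold=3.0):
    """Flag drift when the live mean lies more than `threshold` training
    standard deviations from the training mean. Simplified sketch; real
    monitoring uses richer statistics (e.g. distribution tests)."""
    mu = statistics.fmean(train_values)
    sigma = statistics.stdev(train_values)
    live_mu = statistics.fmean(live_values)
    return abs(live_mu - mu) > threshold * sigma

train = [10.0, 11.0, 9.5, 10.5, 10.2]
print(drift_alert(train, [10.1, 10.4, 9.9]))   # similar data: no alert
print(drift_alert(train, [25.0, 26.0, 24.5]))  # shifted data: alert
```

Even a crude check like this, run on a schedule against fresh inference inputs, gives stakeholders an early signal that retraining may be needed – which is exactly the ongoing cost to manage expectations about.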

Success Factor 3. Create a Proof of Concept

Before starting the real project (building a platform and developing the AI solution), it’s wise to first focus on the technical risks you identified earlier, addressing them with a POC. (When no technical risks or uncertainties are identified, you can skip the POC and immediately start developing.) The POC is a small project in itself, intending to prove to stakeholders that the solution in mind can be built and answers the business question. In practice, the POC can consist of code in Jupyter notebooks, with the results presented in a simple app, dashboard or presentation.

The POC might focus on the AI, which can be the most challenging aspect of the project – not only for data scientists or ML engineers, but also key stakeholders who need to be convinced that AI can effectively solve the business problem. Additionally, the POC can be used to test other aspects of the solution, such as performance and new services that may need to be evaluated before full implementation.

Define what the objective of the POC is, and when it will qualify as successful. In other words: What is the concept that needs proof? The focus is on experimenting, hacking and prototyping. If the subject of the POC is the AI, go through the normal data-science project steps:

  1. Identify data sources and get access to them.
  2. Perform exploratory data analysis.
  3. Choose the right methodology, such as clustering, object detection, etc.
  4. Build, validate and apply the models.
  5. Expose the results in a simple application, dashboard or presentation and verify the results with the stakeholders.

If the subject of the POC is (parts of) the MLOps platform, it can be used to demonstrate the value of MLOps.

Fast iterations are key; by getting feedback early, you can explore different approaches. Based on the POC, a go/no-go decision on the project is possible. Failing to take this step can be costly: If it turns out that the solution doesn’t address the business question after the full platform has been implemented, a great deal of time and resources are wasted. Also, in the POC phase, it’s easier to change direction than later in the project.

First, prove that your solution solves your problem – before building a platform.

Success Factor 4. Create a Design

When you defined your requirements (see the first success factor), you already made a first version of a design. Now it’s time to take this to the next level. The output of this exercise should be architecture diagrams with annotations. Diagrams.net and the markdown format for annotations work well for this. The design can be divided into several levels, using the C4 model of software architecture. That way, high-level building blocks (e.g., offline training, web app) and details and workflows (e.g., experiments run on Jupyter notebooks stored in an experiment tracker) are both captured.

There’s no need to start from scratch. Cloud providers, such as AWS or Databricks, provide reference architectures and design patterns for various solutions. The design should cover the implementation of:

  • Platform account structure, such as a root account, monitoring account and one or more accounts per use case
  • Data storage: database, data warehouse, data lake(house) and feature store
  • Data processing: batch or streaming
  • Model training process: offline or online
  • Model inference process: batch, embedded in stream application, request-response or deployed to an edge device
  • Continuous integration: git pattern and code quality control
  • Deployments: deployment of code, pipelines and model artifacts; the storage location of artifacts and the deployment pattern (blue/green, canary, etc.)
  • Data and model pipelines: including the steps (jobs) captured in these pipelines
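The last item, data and model pipelines, can be made concrete with a minimal sketch: a pipeline is an ordered list of jobs that pass data to each other. The step names and the toy "model" below are hypothetical; a real implementation would use an orchestrator or cloud-native pipeline service rather than a hand-rolled runner.

```python
def ingest(ctx):
    # Job 1: load raw data (hard-coded here for illustration).
    ctx["raw"] = [1, 2, 3, None, 5]

def clean(ctx):
    # Job 2: curate the data by dropping missing values.
    ctx["curated"] = [x for x in ctx["raw"] if x is not None]

def train(ctx):
    # Job 3: a toy "model" -- simply the mean of the curated data.
    data = ctx["curated"]
    ctx["model"] = sum(data) / len(data)

def run_pipeline(steps):
    """Run jobs in order, passing a shared context between them."""
    ctx = {}
    for step in steps:
        step(ctx)
    return ctx

result = run_pipeline([ingest, clean, train])
print(result["model"])  # 2.75
```

Capturing each step as an explicit job is what makes the pipeline deployable through CI/CD: the same ordered definition can run in development, test and production environments.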

Figure 2 shows a diagram focusing on the AI part of a solution, including CI/CD, compute, experiment tracking and model registry, and data storage. The solution uses offline training and batch inference, with data stored in a data lake. It also shows the CI/CD process. In this example, only code is deployed, and models are retrained in the test and production environment.

Figure 2: Components design (Level 3 of the C4 model) of the AI training and inference part of the solution, showing the implementation of components using managed cloud services

Success Factor 5. Split the Work into Iterations

Make a Roadmap
The design from the previous step describes an end state. Not all features are must-haves from the start, so prioritize and map them. Prioritization is key: not all requirements and features are equally important, and neither are your stakeholders. It’s almost impossible to please everyone with a design.

Split your map into iterations, or phases, with clear milestones (like epics). To define milestones, try the three MLOps maturity levels as defined by Google. The focus of the first phases must be clear; for later phases, less detailed planning will suffice. Most likely, the design will change as requirements change, and important details affecting the design will only pop up during development. Try iterations or sprints of two weeks. For example:

  • Start-up phase: Let platform engineers launch a first version of the platform that others can build upon (three sprints)
  • Phase 1: First model results shown in client-facing dashboard (eight sprints)
  • Phase 2: Data and ML processes are automated using pipelines (eight sprints)
  • Phase 3: …(and so on)

Develop Minimum Viable Products
Think of the first phases as steps to create minimum viable products (MVPs): You develop a product that has few features but still produces results for end users. The users can validate the results and check whether they answer the business and analytical problems. Their feedback will be input for the next iterations and phases.

It is important to deploy the MVP to the production team as soon as possible, so you can spot any issues early on. Also, focus on developing all components of the solution, not just the MLOps and platform foundations. Figure 3 compares a focus on all aspects of a solution versus only the functional foundations. Focusing on some degree of all aspects ensures that the full team can work simultaneously; teams downstream do not rely on upstream teams completing tasks.

Figure 3: Minimum viable product, shown at right: Build a slice across, instead of one layer at a time (as shown at left). Source: Jussi Pasanen with acknowledgments to Aarron Walter, Ben Rowe, Lexi Thorn, and Senthil Kugalur

Data scientists can start modeling with dummy data, rather than waiting for data engineers to deliver the first cleaned production data. For instance, the team working on the front end of a web app can already begin development using a dummy data set. The first models and results can then be deployed to the production team manually, while deployment pipelines are still being developed.
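The dummy-data approach is easy to set up: generate records with the same schema as the expected production data, so downstream code can be developed now and swapped over to real data later. A minimal sketch with an illustrative document schema (the field names are assumptions, not a real production schema):

```python
import json
import random

random.seed(42)

def dummy_documents(n):
    """Generate placeholder records matching a hypothetical production
    schema, so front-end and model code can be developed before real
    cleaned data is available."""
    topics = ["invoice", "contract", "report"]
    return [
        {"doc_id": i, "topic": random.choice(topics),
         "body": f"placeholder text for document {i}"}
        for i in range(n)
    ]

sample = dummy_documents(3)
print(json.dumps(sample[0]))
```

Because the dummy records share the production schema, switching the front end or the model code to real data later is a matter of changing the data source, not rewriting the consumers.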

Developing MVPs enables rapid iterations. Because feedback arrives frequently and quickly, be prepared to discard features that are no longer necessary, or to modify them, and keep wait times short when data engineers need to deliver rework. Proving business value with the application is important, but it’s also essential to dedicate time to hardening the application by focusing on non-functional requirements. This can be achieved through a few sprints periodically, or a complete project phase, to work on tests, monitoring, security, automation, etc. As the importance of these topics may not be clear to all stakeholders, it’s critical to include them in sprint demos.

Success Factor 6. Keep the Project Team Motivated

So far, we have discussed project processes and technology. Let’s not forget about the team, which usually has the following roles.

  • Platform engineers: developing and deploying the core infrastructure and security
  • MLOps engineers: based on the work of the platform team, responsible for hosting MLOps tools and developing ML pipelines and CI/CD for data and models
  • Data engineers: responsible for data processing, data cleaning and operationalizing features by building data pipelines
  • Data scientists: exploring data and creating models
  • Front-end developers: when in scope, UX designers and web app developers building the front end
  • Back-end developers: micro-service and API development (if in scope)

Teams can be based on the workload for each role. It can be beneficial to create a platform team, and a data-and-AI team. The latter includes MLOps engineers, data engineers and data scientists. Additionally, a team can be dedicated to the web app, containing both back-end and front-end developers. In larger projects, each role may have its own team.

Clearly define team roles and responsibilities to avoid disappointment.

It’s not always obvious who should take care of which tasks. Consider these two common examples:

  1. Who is responsible for productionizing AI models? Data scientists enjoy exploring data and new models, but they don’t usually have a passion for improving code structure. On the other hand, cleaning up Jupyter notebooks all day is not the most efficient use of MLOps engineers’ efforts. Ideally, MLOps engineers would provide tooling, templates and a clear structure that allows data scientists to deploy their models with ease. However, this level of maturity is not always present from the start of a project.
  2. Do all teams have the option to install and use the tools and services (on development, of course) that they need to accomplish their goals, or is that restricted to the platform team? Platform engineers get bored when their backlog is filled with requests from other teams for new services, and MLOps engineers are blocked when they can’t upgrade the model registry to the latest version themselves. Ideally, the platform offers a self-service approach for new services, but sometimes security concerns prevent this.

In both cases, clearly defined roles and well-managed expectations are key to keeping the team motivated.

Human Attention for MLOps Success

Despite the world’s enthusiasm for AI, introducing it into production is no instant solution to real-world business challenges. Careful attention must be paid to why and how MLOps is used to bring AI into production. In other words, human intelligence must accompany any artificial-intelligence project; the insights and guidance in the six factors identified above will push your MLOps platform toward success.

Do you want to know more about AI in MLOps? Please contact Pepijn van der Laan.
