Scaling AI using Databricks MLOps Stacks

Written by Kaycee Low, Soheil Hosseini, Nicholas Anile, Ben Durance, Sahil Bahl.

Introduction

As artificial intelligence (AI) and machine learning (ML) become increasingly important in driving and informing business decisions, machine learning operations (MLOps) has emerged as a crucial discipline for building reliable, production-ready ML. By streamlining processes and fostering collaboration between data scientists, machine learning engineers, and DevOps engineers, MLOps can transform a project from the proof-of-concept stage into a reproducible, scalable asset that drives value across the entire business. In this blog, we’ll highlight how Mantel Group has used Databricks Asset Bundles (DABs) and the MLOps Stacks template to transform data science projects into production-ready ML systems, encompassing both infrastructure and deployment.

What are the challenges faced by businesses when scaling AI?

Here are some of the most common challenges that we’ve observed across industries when it comes to scaling AI:

  1. Lack of Scalable Infrastructure

Without adequate compute power, storage, and orchestration foundations, ML models cannot be scaled. When significant overhead is associated with use case development and productionisation, such as competing for shared resources, manual support requests, and waiting for new infrastructure, advanced analytics teams spend most of their time addressing these issues instead of creating new models or improving existing ones to solve problems and drive value.

  2. Lack of Governance

Strong data governance is integral to smooth operations. Without it, data science teams can easily end up with multiple sources of truth for the same data and no clear owner to ask for the correct one. All stakeholders should be involved from the beginning of every ML project, including security and infrastructure teams during the design phase, so that roadblocks are surfaced as early as possible.

  3. Lack of Standardisation

Without defined standard processes, data science teams spend excessive time and resources setting up new use cases. Team members face barriers such as limited data visibility, access permission issues, or the need to rewrite code from scratch because existing work sits in isolated, inaccessible repositories. This siloed environment and the lack of repeatable code and deployment patterns lead to increased overhead and delays in deployment and productionisation.

How Databricks Enables Scalable AI

Fortunately, each of the issues outlined above can be addressed using Databricks features to unify ML workflows and standardise the development process from start to finish.

  1. Databricks compute for scalable infrastructure

Databricks offers a range of compute types and configurations that can be tailored to the task, including serverless compute, classic clusters, and SQL warehouses. By defining set compute sizes (e.g., small, medium, large) for your team to use across the AI development lifecycle, data science teams can focus on development rather than on deciding which compute type to use. Serverless compute further accelerates ML workflows through rapid start-up and scaling times, enhancing productivity and cost efficiency.
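
For example, a team might codify its compute sizes once and reuse them across job definitions. Below is a minimal sketch of a hypothetical “medium” cluster spec; the runtime version, node type, and autoscaling bounds are illustrative rather than a recommendation:

    # Hypothetical "medium" job cluster spec, agreed as a team standard.
    # All values are illustrative; tune them to your workloads.
    new_cluster:
      spark_version: 15.4.x-cpu-ml-scala2.12   # Databricks ML runtime
      node_type_id: Standard_DS3_v2            # Azure VM type
      autoscale:
        min_workers: 2
        max_workers: 8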

  2. Data governance with Unity Catalog

Databricks provides centralised governance for data and AI assets, including models, via Unity Catalog (UC). UC provides access control, auditing, data lineage, and discoverability through its integration with the MLflow Model Registry. Data assets are governed at the account level with fine-grained control, adjustable at the project level, providing a powerful yet flexible tool for managing the entire ML lifecycle.

  3. Standardisation using Asset Bundles and MLOps Stacks

Databricks Asset Bundles offer a straightforward way to standardise these processes. To understand how to enforce standardisation between development and production environments for MLOps, we’ll first compare two model deployment patterns: the traditional approach of deploying models and our recommended approach of deploying code.

Model Deployment vs Code Deployment

The traditional approach to model development has focused heavily on training a model within a development environment to produce a model artifact that is elevated through platform environments into production. This model deployment approach is tried and tested, and has advantages for teams in certain circumstances. However, it also has a few disadvantages that make it harder to scale for large data science teams looking to maintain many models in parallel. These include:

  • Data must be kept in sync between environments; otherwise your training data will not be representative of real-life production. This introduces a number of challenging data engineering dependencies.

  • Feature engineering and training code dependency must be kept in sync between environments.

  • Slower elevation of newly retrained model versions, model upgrades, and CI/CD integration.

The alternative to the model deployment approach is code deployment, where you develop your training code in the development environment using production data and elevate your code (rather than a model) through each environment. This approach carries a few distinct advantages:

  • Removes data sync dependencies: this is facilitated by Databricks’ ability to share catalogs between different workspaces such as dev, staging, and production.

  • Consistency between training and inference code: this results in faster model elevation between each environment as the model trained in development is representative of a model in staging or production environments.

  • Better model performance: faster elevation results in better performance overall, as models generally perform best closest to their point of training.

  • Full control and flexibility over the entire model lifecycle: as all model development is defined in code, the supporting code follows the same pattern as the training code and both go through the same integration tests in the staging environment.

The code deployment paradigm does come with implementation challenges, in particular related to increased MLOps complexity. Read on to learn more about how this complexity can be managed in a relatively simple way using Databricks Asset Bundles.

What are Databricks Asset Bundles?

Databricks Asset Bundles (DABs) became generally available in April 2024 and provide a way to apply software engineering best practices through infrastructure as code. The purpose of a DAB is to codify an entire project - from infrastructure and source code to testing and deployment - into a single “bundle”, managed via YAML template files.

Under the hood, DABs are a wrapper around Terraform. This means that DABs capture the benefits of Terraform, such as modularity and reusability, while expressing what can quickly become rather complex infrastructure as simple YAML declarations. Deploying a bundle to a target environment takes a single Databricks CLI command, and errors in the YAML files can be caught before deployment because DABs allow users to validate bundle files against the target environment during CI. Destroying deployed infrastructure is equally straightforward via the CLI - useful, for example, for cleaning up resources in the staging environment after CI tests have completed.
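
As a minimal sketch, a databricks.yml for a bundle might look like the following; the bundle name, workspace hosts, and notebook path are hypothetical:

    # databricks.yml - an illustrative bundle definition, not a definitive layout
    bundle:
      name: churn_model

    targets:
      dev:
        default: true
        workspace:
          host: https://adb-1111111111111111.11.azuredatabricks.net
      staging:
        workspace:
          host: https://adb-2222222222222222.22.azuredatabricks.net

    resources:
      jobs:
        model_training:
          name: churn-model-training
          job_clusters:
            - job_cluster_key: train_cluster
              new_cluster:
                spark_version: 15.4.x-cpu-ml-scala2.12
                node_type_id: Standard_DS3_v2
                num_workers: 2
          tasks:
            - task_key: train
              job_cluster_key: train_cluster
              notebook_task:
                notebook_path: ./notebooks/train

Validating, deploying, and tearing down the bundle then map to single CLI commands:

    databricks bundle validate
    databricks bundle deploy -t staging
    databricks bundle destroy -t staging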

What is Databricks MLOps Stacks?

Databricks MLOps Stacks is a DAB template, available on GitHub, that extends the idea of infrastructure as code to machine learning projects. It sets up the elements required for end-to-end operation of an ML project, including:

  • Machine learning pipelines including model training, validation, deployment, and batch inference

  • Release pipelines for production including scheduling

  • CI/CD for testing and deploying ML pipelines using Azure DevOps or GitHub Actions

The following components should be set up before using MLOps Stacks (e.g., via Terraform):

  • Development, staging, and production workspaces

  • User group definitions and access controls. Permissions for models and experiments can be controlled via MLOps Stacks.
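
With these prerequisites in place, a new project can be scaffolded from the template with a single CLI command, which prompts for settings such as the project name and CI/CD platform, and generates the ML pipeline, resource, and CI/CD skeletons:

    databricks bundle init mlops-stacks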

(Image source: Databricks)

By following production best practices from the beginning of a project and codifying existing ML pipeline resources (e.g., model training, model validation, batch inference), MLOps Stacks can significantly increase the productivity of machine learning teams and reduce the time spent productionising models. Using the same template across multiple projects ensures consistency between teams, making it easier to scale ML initiatives and reducing the complexity - and therefore the time - associated with different parts of the business using varying tech stacks. The modular components of MLOps Stacks allow different teams to work independently while still maintaining best practices through automated testing and CI/CD. With less time spent on infrastructure, automation, and other ML pipeline plumbing, data science teams are free to deploy to new environments and iterate quickly on solving problems with ML.

Case Studies from Mantel Group

Pharmaceutical Sector:

We used DABs and MLOps Stacks at a global pharmaceutical organisation to build a modern ML platform focused on speeding up time to deployment and on reusability across regions. The methods previously used to build and deploy ML models did not follow a standardised approach, and serving multiple regions for deployment and productionisation required significant manual effort.

Mantel Group refactored and re-templated the existing code with reusability in mind, allowing for region-agnostic use. This enabled new regions to deploy their own ML pipelines efficiently, with automated CI/CD and workflows across development, staging, and production environments. By optimising the code to run efficiently using PySpark, Databricks compute, and autoscaling, we achieved a 50% reduction in end-to-end pipeline inference time. Automated daily inference and monthly model retraining were scheduled, giving business users access to the most up-to-date information.

Online Wagering Organisation"

One of Australia’s largest wagering organisations is currently undergoing a significant Data & AI transformation, migrating legacy data solutions to a new enterprise Databricks platform. As part of this transformation, Mantel Group’s Advanced Analytics team played a crucial role in shaping the development of the ML platform and AI capabilities. Our team prioritised DABs as the cornerstone of a repeatable development pathway for multiple data science teams, each of which had previously developed in its own way.

The MLOps capability was developed by migrating existing legacy solutions to PySpark and our template. This included creating pathways for batch inference and real-time serving solutions that utilise Databricks’ feature store and Unity Catalog capabilities. Additionally, automated champion-challenger retraining cycles and extensive post-deployment monitoring were implemented.

Australian Energy Provider: 

Australia’s leading energy generation company partnered with Mantel Group to tackle the challenge of migrating their mature machine learning and dependent ETL workloads from Azure ML to Azure Databricks. They aimed to enhance efficiency and speed to value by establishing responsible, repeatable practices and templates for ML development in the new Databricks environment. Additionally, the existing setup lacked a proper feature store, presenting an opportunity to develop one alongside the migration and the uplift of ML operational practices and governance.

Our team implemented a Feature Store on Databricks with over 100 features, significantly reducing the development time for future use cases. A Feature Store is valuable in ensuring consistent feature definitions across multiple ML models, improving collaboration among data scientists, and accelerating the deployment of new use cases. It provides automatic feature retrieval during model training and inference, enabling seamless batch scoring and real-time serving. The Feature Store also enhances data lineage and governance, ensuring traceability and compliance across the ML lifecycle.

Along with the Feature Store, an ML use case was successfully migrated to Databricks, with the model now live in production using DABs. Best practices for Databricks MLOps, including reusable patterns, templates, and components, were designed and applied, leveraging the Databricks MLOps Stack as a starting point.

Key Takeaways and Recommendations

Through our experience using DABs in the above use cases, some essential insights have emerged that enable the most effective use of MLOps Stacks in an organisation:

Consider Unity Catalog Design from the Beginning

It is essential to define the number of workspaces and catalogs needed for your project from the beginning, since this decision will influence every aspect of your DAB design and implementation. Databricks recommends three workspaces corresponding to the deployment targets of development, staging, and production. Ultimately, though, the decision also depends on your business’s objectives, conventions, and constraints.
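
As a sketch of how this decision surfaces in a bundle, each deployment target can bind its own workspace and catalog through bundle variables; all names below are illustrative:

    # Illustrative per-target catalog mapping via bundle variables
    variables:
      catalog:
        description: Unity Catalog catalog used by this target

    targets:
      dev:
        variables:
          catalog: ml_dev
      staging:
        variables:
          catalog: ml_staging
      prod:
        variables:
          catalog: ml_prod

Elsewhere in the bundle, resources can then refer to ${var.catalog} rather than hard-coding an environment-specific name.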

Have Clearly Defined User Groups and Access Permissions

This practice ties in tightly with the workspace and catalog design mentioned previously. As a general rule, we recommend that only service principals hold manage permissions in non-development workspaces. Beyond this, a clearly defined access pattern for different user groups significantly strengthens the integrity and security of the ML platform and the decision-making process.
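
As a sketch, manage rights can be reserved for a service principal in a given target while user groups receive read-only access; the principal and group names below are hypothetical:

    # Illustrative target-level permissions
    targets:
      prod:
        permissions:
          - service_principal_name: 11111111-2222-3333-4444-555555555555  # hypothetical SP application ID
            level: CAN_MANAGE
          - group_name: data-scientists
            level: CAN_VIEW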

Challenges

As with any tool or technology in its infancy, DABs come with their own set of challenges. The MLOps Stacks repository is being developed at a fast pace, so we are confident that many of the challenges listed below will be addressed in the future. Nonetheless, at the time of writing, we have come across the following challenges when using DABs in our projects:

  • Sparse documentation: while new DAB features are emerging rapidly, documentation and best-practice guidance for those features lag behind. This is also apparent in the documentation for Databricks CLI commands. We expect this to be remedied soon, given DABs’ recent general availability.

  • YAML files are non-dynamic: since static YAML files define infrastructure and ML workflows in DABs, asset definitions involve some redundancy. For example, if a workflow should use two different compute clusters in two deployment targets, such as dev and staging, all tasks for that workflow must be specified under both targets, with the cluster repeated under each task (see the sketch after this list).

  • Access permissions may interfere with deployment across target environments: as of writing, DABs allow access-level definitions for workflows, experiments, models, and Delta Live Tables. In some scenarios, bundle deployment may fail when permission levels for experiments and models are defined via DAB. For example, if you have defined permission levels on two models in the current deployment and then want to remove one model in the next deployment, DAB will attempt to delete that model from its corresponding catalog before redeployment. If model versions are registered under that model asset in UC, deployment will fail because a non-empty model asset cannot be deleted. The workaround is to manually delete all model versions under the model asset, which can be rather time-consuming. You may therefore consider not using DAB to manage model and experiment permissions if you expect the number of models in your project to change frequently.
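
To make the YAML redundancy point concrete, the sketch below shows the kind of repetition involved when a job must pin a different cluster in each target; all values are illustrative:

    # Illustrative repetition across deployment targets
    targets:
      dev:
        resources:
          jobs:
            batch_inference:
              job_clusters:
                - job_cluster_key: main
                  new_cluster:
                    spark_version: 15.4.x-cpu-ml-scala2.12
                    node_type_id: Standard_DS3_v2
                    num_workers: 2
              tasks:
                - task_key: score
                  job_cluster_key: main
                  notebook_task:
                    notebook_path: ./notebooks/score
      staging:
        resources:
          jobs:
            batch_inference:
              job_clusters:
                - job_cluster_key: main
                  new_cluster:
                    spark_version: 15.4.x-cpu-ml-scala2.12
                    node_type_id: Standard_DS5_v2   # larger cluster in staging
                    num_workers: 8
              tasks:
                - task_key: score
                  job_cluster_key: main
                  notebook_task:
                    notebook_path: ./notebooks/score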

Conclusion

Scaling AI in a business requires standardisation across people, processes, and technology. In this blog we have focused mainly on the technology aspect, showcasing how Databricks Asset Bundles can be used to eliminate duplicated effort and speed up time to model deployment. While not without its challenges, Databricks MLOps Stacks is continually improving, with new features released frequently, and is likely to become a major contender among out-of-the-box solutions for automated ML workflows.

With 60% of organisations in Australia and New Zealand aiming to use AI to reach their growth objectives, tools such as DABs are vital in enabling AI systems that can meet the current and future demands of the business. By using these field-tested MLOps best practices to reduce time to deployment, businesses can maintain ML systems efficiently at scale while translating model outputs into real value.

09/17/2024