Building an Enterprise Deep Learning Platform
There is an incredibly common story playing out in countless enterprises in 2020. It sounds something like this:
- Enterprise realizes that deep learning presents huge opportunities: new revenue potential, product ideas, cost reductions, etc.
- Enterprise starts a small team (3–5) to research deep learning applications
- Small team succeeds with an initial use case and demonstrates $1 million in potential business value
- Enterprise tries to scale ML team 5–10x hoping for $5–10 million in business value
- Scaled team faces all sorts of scaling issues, particularly around managing data, collaboration, sharing compute resources, and deploying models
- Scaled team struggles to put any models into production, resulting in delays, large expenses, and returns that don’t match the promise of the initial proof of concept
At this point, enterprises realize something is broken and attempt to improve their team’s ability to scale. For a surprising number of organizations, this goes something like:
- Read Uber’s Michelangelo Blog Post
- Decide to build a Machine Learning Platform
Building a machine learning platform is no easy endeavor, though — the vast majority of companies that want to make money with deep learning aren’t software companies that have the time, money, or expertise to build their own platform from scratch. As such, most look to off-the-shelf tools to put together their platform.
What is slowing down your team?
To understand what pieces are needed for an effective deep learning platform, you need to understand what was slowing down the big ML team you just scaled up. Some really common patterns emerge:
- Deep learning infrastructure is complicated! Between storing massive datasets, needing GPUs for computation, and provisioning hardware for deployment, managing the infrastructure needed for deep learning can be really difficult. We need to keep machine learning scientists from getting bogged down in infrastructure minutiae.
- Datasets used for deep learning are unwieldy. Whether your dataset is millions of labeled images or billions of sentence pairs, storing, sharing, and versioning this data can be difficult. You need good tools so people can work together on datasets.
- Training models can be very time-consuming! As model complexity grows, your ML team will spend way more time figuring out how to distribute training and tune hyperparameters, and way less time churning out new models. We need tools to automate these low-level tasks for your team.
- Deploying models to production can be really tricky. Teams in a wide range of enterprises report this taking up to six months per model! Tools that automate model deployment and monitoring will hugely accelerate your team.
- We need a new set of tools to collaborate on machine learning. The experiment-driven nature of data science is fundamentally different from software engineering; as such, the tools and methods out there to manage software engineering projects (Git, testing frameworks, Agile, …) are not sufficient for collaborating on model building.
With these goals in mind, an enterprise can begin identifying tools to fill these gaps and accelerate their deep learning team. Unfortunately, even using software other companies built can prove difficult. There are dozens of companies in the ML software space and figuring out what all of their offerings do, let alone which companies do it best, can be really hard. How can you tell the difference between ten different products that all claim to be ML platforms?
Choosing the Right Tools
There is a sea of companies out there selling software to help solve these problems. Let’s dive into some of the best tools to meet all of your team’s needs.
Deep Learning Infrastructure
Storage: The datasets used to train deep learning models can grow massive and out of control quickly. You’ll want a cost-effective, scalable storage backend to wrangle all that data. We’ve talked to a lot of teams and found most data ops teams have standardized on object storage, like Amazon S3. If you’re on-prem first, a shared file system can also do the job.
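If your team standardizes on object storage, getting data in and out from training code stays simple. Here’s a minimal sketch assuming boto3 on AWS; the bucket name and paths are hypothetical placeholders, not part of any specific platform:

```python
# Minimal sketch: moving a dataset shard in and out of S3 with boto3.
# The bucket name and object keys below are hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload a locally prepared dataset shard to object storage
s3.upload_file(
    "data/train_shard_000.tfrecord",          # local file
    "my-dl-datasets",                          # hypothetical bucket
    "imagenet/train/shard_000.tfrecord",       # object key
)

# Later, pull it back down on a training node
s3.download_file(
    "my-dl-datasets",
    "imagenet/train/shard_000.tfrecord",
    "/tmp/train_shard_000.tfrecord",
)
```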
Training Infrastructure: You’ll need GPUs to train your models, and probably lots of them to parallelize training.
- On Prem: If you’re dedicated to saving money and working on-premise, you’ll need to set up a GPU cluster with enough GPUs to meet the demands of your ML team. Managing a GPU cluster, along with the tools your data scientists need to do their jobs, is hard — you’ll want a software layer that simplifies using that hardware for your ML team.
- Cloud: Cloud GPUs can get expensive, but they’re flexible and capable of scaling to whatever size your group needs. Expecting your ML scientists to master cloud infrastructure is a big ask, and you’ll face some serious stumbling blocks. You’ll need a way to abstract the hardware so that ML scientists can focus more on building models and less on infrastructure management.
Deployment Infrastructure: Deployment varies from one ML model to the next, but the most common pattern exposes models as microservices. Your smoothest path to managing the cluster serving those models is Kubernetes, which gained traction because it has a decade of Google’s container-management experience baked into it.
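To make the microservice pattern concrete, here’s a bare-bones sketch of the kind of inference service you’d containerize and run on Kubernetes. Flask, the TorchScript artifact, and the request format are illustrative choices, not a prescribed stack:

```python
# A minimal model-serving microservice: load a trained model once, then expose
# a /predict endpoint. This is the sort of container Kubernetes would scale,
# roll out, and restart for you.
from flask import Flask, jsonify, request
import torch

app = Flask(__name__)
model = torch.jit.load("model.pt")   # hypothetical TorchScript artifact
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"instances": [[...], [...]]}
    features = request.get_json()["instances"]
    with torch.no_grad():
        outputs = model(torch.tensor(features, dtype=torch.float32))
    return jsonify({"predictions": outputs.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Writing and maintaining this boilerplate for every model is exactly the kind of work the deployment tools below aim to take off your team’s plate.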
Data Management, Versioning, Sharing
The key data problems you’ll need to solve quickly are sharing, preprocessing pipelines, and versioning. Data changes over time: new data gets collected, labels get improved, and the code that transforms and preps that data evolves alongside it. When your code, your models, and your data are all changing at once, how do you keep track of it all? You’ll want a strong data versioning platform to keep those changes in check; without the right tools to manage this changing data, collaboration becomes next to impossible. The best-in-class open source data versioning and management tools I’ve seen are Pachyderm and DVC.
Pachyderm runs on Kubernetes and delivers a copy-on-write filesystem on top of your object store, giving you Git-like data versioning and scalable data pipelines. If you’re already using object storage and Kubernetes, it’s easy to set up. Pachyderm uses Kubernetes to scale the pipelines that transform and prepare data for machine learning. Datasets live in repositories, which version every change you make to that data and are simple to share with your team.
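As a rough sketch of what a Pachyderm pipeline looks like, here’s a minimal spec expressed as a Python dict; the repo name, container image, and script are hypothetical. You’d write it out as JSON and apply it with `pachctl create pipeline -f preprocess.json`:

```python
# Illustrative Pachyderm pipeline spec, built as a dict and written to JSON.
import json

pipeline_spec = {
    "pipeline": {"name": "preprocess"},
    # Read every top-level file in the versioned "raw-images" repo...
    "input": {"pfs": {"repo": "raw-images", "glob": "/*"}},
    # ...and run a containerized preprocessing script over it. Inputs are
    # mounted at /pfs/raw-images, and anything written to /pfs/out becomes
    # the next versioned commit in the pipeline's output repo.
    "transform": {
        "image": "mycompany/preprocess:latest",   # hypothetical image
        "cmd": ["python3", "/code/preprocess.py"],
    },
}

with open("preprocess.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)
```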
DVC is significantly more lightweight than Pachyderm, running locally and adding versioning on top of your local storage solution. DVC simply integrates into existing Git repositories to track the version of data that was used to run experiments. ML teams can also define and execute transformation pipelines with DVC; however, the biggest drawback of DVC is that those transformations run locally and are not automatically scaled to a cluster. Notably, DVC does not handle the storage of data, simply the versioning.
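Here’s a small, hypothetical example of how that versioning shows up in code: with `dvc.api` you can pin a read to the exact Git revision of the data an experiment used. The repo URL, file path, and tag below are placeholders:

```python
# Reading a DVC-tracked file at a specific Git revision.
import dvc.api

# Resolve where the versioned artifact actually lives in remote storage
url = dvc.api.get_url(
    "data/train.csv",
    repo="https://github.com/mycompany/ml-project",  # hypothetical repo
    rev="v1.2",                                      # Git tag or commit
)
print(url)

# Or stream the file contents directly at that revision
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/mycompany/ml-project",
    rev="v1.2",
) as f:
    header = f.readline()
```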
Determined: A Deep Learning Experimentation Platform
The process of building deep learning models is slow and laborious, even for an experienced team, and it’s especially challenging for a small team trying to do more with less. Deep learning models can take weeks to train and often need dozens of training runs to find effective hyperparameters. If the team is only working with GPUs on a single machine, things are pretty straightforward; but as soon as you reach cluster scale, your ML scientists will spend far more time writing code to work with the cluster than building models.
Determined is an open source platform built to accelerate deep learning experimentation for small to large machine learning teams. It fills a number of key roles in your AI/ML stack:
- Hardware abstraction that frees model developers from having to write systems glue code
- Tools for accelerating experiments, e.g., easy-to-use distributed training and hyperparameter tuning
- A collaboration platform to allow teams to share their ML research and work together more effectively
- A cost reduction for cloud-first companies by training models on GPU instances that autoscale with demand, even on preemptible instances
Determined is where your machine learning scientists will build, train, and optimize their deep learning models. By removing the need for your team to write glue code for every new model they build, they’ll be able to spend more time creating models that differentiate you from your competitors.
Determined is open source, and you can install it on bare metal or in the cloud, so it integrates well with most existing infrastructure.
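To give a feel for what “no glue code” means in practice, here’s a rough sketch of a Determined experiment configuration, written as a Python dict for illustration. In practice it’s a YAML file submitted with `det experiment create`; the exact schema depends on your Determined version, and the values, entrypoint, and hyperparameter names here are hypothetical:

```python
# Sketch of a Determined experiment config: the same few lines request an
# adaptive hyperparameter search and distributed training across 8 GPUs.
experiment_config = {
    "name": "resnet50-adaptive-search",
    "entrypoint": "model_def:ImageClassificationTrial",  # your trial class
    "hyperparameters": {
        "learning_rate": {"type": "log", "base": 10, "minval": -5, "maxval": -1},
        "global_batch_size": 256,
    },
    # Adaptive search: Determined schedules and early-stops trials for you,
    # instead of your team scripting a grid search by hand.
    "searcher": {
        "name": "adaptive_asha",
        "metric": "validation_loss",
        "smaller_is_better": True,
        "max_trials": 64,
        "max_length": {"epochs": 20},
    },
    # Distributed training: ask for 8 GPUs per trial and let the platform
    # handle the systems work (no Horovod or torch.distributed glue code).
    "resources": {"slots_per_trial": 8},
}
```

Everything below the `name` field is declarative: the scheduling, fault tolerance, and experiment tracking come from the platform rather than from code your scientists write and maintain.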
Deployment Platform
Without the right tools, most ML teams spend a lot of time creating and hosting REST endpoints, which contributes to deployment taking months instead of days. There are great tools out there to make this process easier for your data scientists, such as Seldon Core and offerings from cloud providers.
Seldon Core is an open source project that significantly simplifies the process of deploying a model to an existing Kubernetes cluster. If your team is already comfortable with Kubernetes, Seldon makes deployment much easier. If not, Seldon offers an enterprise product that will help, or you may want to build an interface on top of Seldon to help bridge the gap between ML developers and deployment.
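For a sense of how little serving code that leaves your team to write, here’s a hedged sketch of a Seldon Python model wrapper; the class name and model artifact are hypothetical:

```python
# Sketch of a Seldon Core Python model wrapper: a class with a predict()
# method that Seldon's wrapper exposes as a REST/gRPC microservice, so there
# is no endpoint code to maintain yourself.
import joblib

class MyModel:
    def __init__(self):
        # Load the trained artifact once, when the serving container starts
        self.model = joblib.load("model.joblib")   # hypothetical artifact

    def predict(self, X, features_names=None):
        # X arrives as an array-like payload; return predictions the same way
        return self.model.predict(X)
```

Packaged into a container and run under Seldon’s Python wrapper (for example via the `seldon-core-microservice` entrypoint), this class becomes an endpoint that a SeldonDeployment resource can manage, scale, and monitor on your Kubernetes cluster.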
AWS, GCP, and Azure all have deployment tools that help users deploy models on their cloud infrastructure. These tools largely abstract the infrastructure from the user, but they require a thorough understanding of the respective cloud platform. If you’re already committed to a cloud provider, these tools are almost certainly the fastest way to enable deployment; but if you ever want to leave that platform, you’re locked in tight.
Even worse, the cloud platforms won’t always support the frameworks you need, because cutting-edge frameworks don’t get the same support as well-known ones. If you’re using PyTorch or TensorFlow, you’re in good shape, but as soon as you need something fresh out of a research lab, you may have to wait for the cloud platforms to catch up. Better to have an agnostic, flexible open source framework that lets you bring whatever tools you want to the job.
Putting it All Together
So what does a deep learning platform look like in practice?
To see these tools working together, check out this Jupyter notebook example, which walks through:
- Creating data repositories with Pachyderm and using Pachyderm data pipelines to prepare the data for training a model
- Building a model within the Determined platform, and using its Adaptive hyperparameter search to quickly find a high-performing model
- Building and testing a Seldon Core endpoint using the trained model produced by Determined.
This simple collection of tools (configurable in less than a day, and entirely open source) will greatly ease your ML team’s scaling challenges. They’ll be able to use this platform to generate more, higher-performing models and to quickly scale them to production to generate business value for your company.
If you have any questions about how you can use Determined as a part of your machine learning platform, join our community Slack where we’d be happy to help out.
Written by David Hershey, a Solutions Engineer at Determined AI. David has spent the last two years building enterprise ML platforms, previously as a lead on Ford’s ML platform project. Feel free to reach out to him directly.