Machine Learning Pipeline Architecture
When it comes to model development, Data Scientists are laser-focused on cleaning and preprocessing their data, statistical methods, machine learning modeling techniques, and so on. Consequently, once a model is ready for production use, they usually hit a wall: they have been working in a sandboxed environment that is not set up for launching prototypes into production. What is needed is a continuous process that seamlessly supports writing code, releasing it into production, performing data extractions, training models, and tuning those models. This process is known as a machine learning pipeline. Architecting an ML pipeline requires purpose and planning before execution.
Creating an ML pipeline can be broken down into eight steps:
STEP 1 — Defining the problem: This is a basic step where you state the business problem that needs an answer.
STEP 2 — Data ingestion: Data is the first requirement of any ML effort. Ingestion happens at two layers:
Offline: Data is fed from one or more sources, flows into an ingestion service, and is stored in a raw data store. When this data is sent to the ML platform, it is assigned a unique batch ID, which allows the dataset to be queried efficiently and traced back to its source. Each dataset has a dedicated pipeline, and the pipelines are processed simultaneously and independently. The data within each pipeline is partitioned across multiple processors, cores, and other resources to reduce the overall time to complete a task.
Online: Data is fed from a source into a streaming engine, then to an online ingestion service, which saves the data into the same raw data store as the offline layer. The online layer also connects to another streaming engine that provides further near-real-time processing. A minimal sketch of the ingestion flow follows.
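To make the offline path concrete, here is a minimal, standard-library-only sketch assuming a file-based raw data store and UUID batch IDs. The names `RawDataStore` and `ingest` are illustrative assumptions, not part of any particular platform.

```python
import json
import time
import uuid
from pathlib import Path

class RawDataStore:
    """Toy raw data store: one JSON file per ingested batch (illustrative only)."""

    def __init__(self, root: str = "raw_store"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def ingest(self, source: str, records: list[dict]) -> str:
        # Assign a unique batch ID so the dataset stays queryable and traceable.
        batch_id = uuid.uuid4().hex
        payload = {
            "batch_id": batch_id,
            "source": source,
            "ingested_at": time.time(),
            "records": records,
        }
        (self.root / f"{batch_id}.json").write_text(json.dumps(payload))
        return batch_id

    def load(self, batch_id: str) -> dict:
        # Look up a batch by its ID -- the traceability property noted above.
        return json.loads((self.root / f"{batch_id}.json").read_text())

store = RawDataStore()
bid = store.ingest("clickstream", [{"user": 1, "event": "view"}])
print(store.load(bid)["source"])  # -> clickstream
```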
STEP 3 — Data prep: This is a heavy step that takes raw, unstructured data and turns it into data the models can use. During this step, the pipeline looks for differences in formatting, incorrect or missing data points, outliers, anomalies, and the like. This stage also includes the feature engineering process, which can be manual or automated.
Offline: Once the ingestion service finishes, the data prep service is triggered. From here, the feature engineering logic processes the data and saves all generated features into a feature data store. As each data prep pipeline completes, its output features are also replicated to the online feature data store for easy querying and immediate prediction.
Online: The streaming engine provides data to the online data prep service in memory, while also persisting these features in the offline feature data store for future training. A rough data prep sketch follows.
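As a rough illustration of the prep stage, the sketch below cleans a batch of raw records (dropping missing values, clipping outliers) and emits engineered features. The field names, thresholds, and z-score feature are assumptions made for the example, not prescribed by the pipeline.

```python
from statistics import mean, stdev

def prepare(records: list[dict]) -> list[dict]:
    """Clean a raw batch and derive features (illustrative logic only)."""
    # Drop rows with missing values -- the 'incorrect or missing data' check.
    rows = [r for r in records if r.get("amount") is not None]
    amounts = [r["amount"] for r in rows]
    mu = mean(amounts)
    sigma = stdev(amounts) if len(amounts) > 1 else 0.0
    features = []
    for r in rows:
        # Clip outliers at three standard deviations from the mean.
        clipped = min(max(r["amount"], mu - 3 * sigma), mu + 3 * sigma)
        features.append({
            "user": r["user"],
            "amount_clipped": clipped,
            # A simple engineered feature: the standardized amount.
            "amount_zscore": (clipped - mu) / sigma if sigma else 0.0,
        })
    return features

batch = [{"user": 1, "amount": 10.0},
         {"user": 2, "amount": None},
         {"user": 3, "amount": 1000.0}]
print(prepare(batch))
```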
STEP 4 — Data segregation: In this stage, we divide the data into training, test, and validation sets to measure how the model performs against data it has not seen. This stage contains two pipelines, model training and evaluation, both of which must be able to call an API or service to reach the required datasets. This API or service must also be able to return labeled and/or unlabeled data.
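A minimal sketch of segregation, assuming an in-memory dataset: the 70/15/15 ratios are illustrative, and the hypothetical `fetch_split` function stands in for the API or service the training and evaluation pipelines would call.

```python
import random

def segregate(dataset: list, seed: int = 42) -> dict[str, list]:
    """Split a dataset into train/validation/test (70/15/15, illustrative ratios)."""
    rows = dataset[:]
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n = len(rows)
    train_end, val_end = int(0.70 * n), int(0.85 * n)
    return {"train": rows[:train_end],
            "validation": rows[train_end:val_end],
            "test": rows[val_end:]}

SPLITS = segregate(list(range(100)))

def fetch_split(name: str, labeled: bool = True) -> list:
    # Stand-in for the API/service that returns labeled and/or unlabeled data.
    return SPLITS[name]

print({k: len(v) for k, v in SPLITS.items()})  # -> {'train': 70, 'validation': 15, 'test': 15}
```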
STEP 5 — Model training: This pipeline is always offline. It contains the library of training algorithms the Data Scientist has developed, which can be used continuously and interchangeably as needed. The workflow starts with the model training service, which gets the training configuration parameters from the config service and requests the required training dataset from the API (or service) built during the data segregation stage. Once the model, configurations, learned parameters, timings, etc. are ready, everything is saved into a model candidate data store, ready for evaluation and later use in the full pipeline.
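The sketch below walks through that workflow under stated assumptions: a dictionary stands in for the config service, a stub for the segregation API, and a trivial mean-baseline model stands in for a real algorithm. All component names are hypothetical.

```python
import pickle
import time
from pathlib import Path
from statistics import mean

CONFIG_SERVICE = {"model": "mean_baseline", "smoothing": 0.0}  # stand-in config service
CANDIDATE_STORE = Path("model_candidates")
CANDIDATE_STORE.mkdir(exist_ok=True)

def get_training_data() -> list[float]:
    # Stand-in for the data segregation API built in STEP 4.
    return [1.0, 2.0, 3.0, 4.0]

def train() -> Path:
    config = dict(CONFIG_SERVICE)          # fetch training configuration parameters
    start = time.time()
    y = get_training_data()
    # Trivial stand-in for 'learned parameters': predict the training mean.
    model = {"prediction": mean(y) + config["smoothing"]}
    candidate = {
        "model": model,
        "config": config,
        "trained_in_s": time.time() - start,  # timings, kept for later auditing
    }
    path = CANDIDATE_STORE / f"candidate_{int(start)}.pkl"
    path.write_bytes(pickle.dumps(candidate))  # save to the model candidate data store
    return path

print(train())
```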
STEP 6 — Candidate model eval: This stage of the pipeline is also always offline. It assesses the performance of the stored models against the test and validation subsets until one model sufficiently answers the problem defined at the start. Once a model is ready for deployment, a notification is broadcast through the notification service.
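One way this evaluation loop can look, assuming candidates shaped like those produced by the previous sketch: `notify` stands in for the notification service, and the error threshold is an assumption for the example.

```python
def evaluate(candidate: dict, test_set: list[tuple[float, float]]) -> float:
    # Mean absolute error of the trivial model against held-out (x, y) pairs.
    pred = candidate["model"]["prediction"]
    return sum(abs(pred - y) for _, y in test_set) / len(test_set)

def notify(message: str) -> None:
    print(f"[notification service] {message}")  # stand-in for a real broadcast

def select_model(candidates: list[dict], test_set, max_error: float = 1.5):
    # Keep evaluating stored candidates until one sufficiently answers the problem.
    for candidate in candidates:
        error = evaluate(candidate, test_set)
        if error <= max_error:
            notify(f"candidate ready for deployment (error={error:.2f})")
            return candidate
    return None

test_set = [(0, 2.0), (1, 3.0)]
select_model([{"model": {"prediction": 2.5}}], test_set)
```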
STEP 7 — Model deployment: This is where time and resources are invested in moving the selected model out of the sandbox: packaging it and exposing it as a service that applications can call for predictions.
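As one hedged illustration of what deployment can look like, the sketch below wraps a loaded model behind a tiny HTTP prediction endpoint using only the standard library; a real deployment would add model loading from the candidate store, versioning, scaling, and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = {"prediction": 2.5}  # stand-in for a model loaded from the candidate store

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body (features are ignored by this trivial model).
        _ = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = json.dumps({"prediction": MODEL["prediction"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```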
STEP 8 — Performance and monitoring: The model should be continuously and iteratively monitored, and its behavior audited, so that it can be incrementally improved.
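The sketch below illustrates one simple monitoring tactic: tracking a rolling error over recent predictions and raising an alert when it degrades past a threshold. The window size and threshold are assumptions for the example, not recommendations.

```python
from collections import deque

class ModelMonitor:
    """Rolling-error monitor (illustrative thresholds, not prescriptive)."""

    def __init__(self, window: int = 100, max_error: float = 1.0):
        self.errors = deque(maxlen=window)  # only the most recent predictions count
        self.max_error = max_error

    def record(self, prediction: float, actual: float) -> None:
        self.errors.append(abs(prediction - actual))
        avg = sum(self.errors) / len(self.errors)
        if avg > self.max_error:
            # In a real system this would page someone or trigger retraining.
            print(f"ALERT: rolling error {avg:.2f} exceeds {self.max_error}")

monitor = ModelMonitor(window=3, max_error=1.0)
for pred, actual in [(2.0, 2.1), (2.0, 4.0), (2.0, 4.5)]:
    monitor.record(pred, actual)
```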
Following these eight stages to build your ML pipeline will improve your chances of future success in model development. These steps, however, are high-level guides rather than rigid requirements; each may need more work depending on your problem statement and needs. Other items that need to be addressed include outlining which notifications are required, the timing and scheduling of each pipeline's active states, logging, auditing, and so on. Once all these measures have been fully vetted and built, you will have a well-rounded ML system.
About Wallaroo. Wallaroo enables data scientists and ML engineers to deploy enterprise-level AI into production more simply, quickly, and efficiently. Our platform provides powerful self-service tools, a purpose-built ultrafast engine for ML workflows, observability, and an experimentation framework. Wallaroo runs in cloud, on-prem, and edge environments while reducing infrastructure costs by 80 percent.
Wallaroo’s unique approach to production AI gives any organization fast time-to-market, audited visibility, scalability, and ultimately measurable business value from their AI-driven initiatives, and it allows data scientists to focus on value creation, not low-level “plumbing.”