An Introduction to Wallaroo Performance
Wallaroo: Production ML
The Wallaroo ML platform includes an efficient, low-footprint event-by-event machine learning model execution engine, an experimentation framework for A/B testing, anomaly detection, and a web dashboard — all driven from a familiar Notebook/Python SDK interface that data scientists love. For most use cases, Wallaroo can deliver faster time-to-market — typically 3X faster — and a much lower infrastructure footprint — typically 80% lower.
Wallaroo is ideal for high throughput production use cases involving AI algorithms and ML models inferring across either large amounts of data, fast data, or complex ML pipelines. Wallaroo provides better per-worker throughput, dramatically lower latencies and infrastructure cost, and a much simpler operational experience than existing tools or homegrown solutions. Wallaroo was designed as a streaming system, but with batch, microbatch, and request/response support layered on top — this means it works with any usage scenario.
Our ultrafast model execution engine means less hardware provisioning and significantly lower ongoing infrastructure costs for a given workload. We’ve integrated common machine learning tools like PyTorch, RoBERTa, TensorFlow, XGBoost, scikit-learn, and others with the Wallaroo Engine. This lets developers write simple Python code and get a distributed application with performance similar to low-level compiled C code.
We wrap this all up into push-button production deployment right from a notebook. Wallaroo provides a high level Python SDK and lower level raw APIs to give you the widest range of integration options for your model deployment strategy, all from the convenience of your familiar tools and workflows.
This whitepaper goes into detail on how we achieve high performance and low infrastructure costs while delivering a superb and well designed user experience. We will:
- Learn about Wallaroo by exploring an example use case and application.
- Develop an understanding of the technical underpinnings of Wallaroo’s best-in-class performance.
Wallaroo by Example: Malware Prediction with ALOHA
Example: The Aloha Model
The Aloha Model (ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation) is a complex open-source model that attempts to classify a given file as either “malware” or “benign” — a notoriously hard task. It is complex enough that many now use Aloha for benchmarking and performance testing, to get an understanding of how well their ML systems are executing.
We built a benchmarking proof-of-concept for a large financial company that was looking for ways to modernize its ML infrastructure. We took this Aloha Model and put it to the test in comparison with two competitors in the market:
- AWS SageMaker
- Google Vertex
How did Wallaroo compare?
Core Wallaroo abstractions
Wallaroo applications are usually orchestrated from our Python SDK within a Jupyter notebook or other model development environment (e.g., SageMaker or Databricks). Four core abstractions are used:
- Models
- Pipelines
- Sources
- Sinks
The most important component is the Model. It encapsulates the fundamental calculation to be run on each unit of input data. We support most common ML model formats, such as ONNX and TensorFlow, and are adding new ones all the time. Model definitions are loaded into our ultrafast custom Rust execution engine for C-like execution times and correspondingly low infrastructure cost.
You can combine Models together into a data Pipeline. A pipeline allows you to perform model chaining: to declare that the output from Model A will be processed as the input to Model B. Model chaining allows different conceptual areas of an ML project to be reasoned about and iterated independently. For example, the first Model might identify objects in an image. Its output can be fed into another Model that classifies the objects as “alive” or “not alive.” Utility steps can also be added that perform arbitrary transformations to the data flowing through the pipeline, such as altering its shape into the format needed by the next model.
A Pipeline begins with a Source stage, which is responsible for receiving and decoding incoming messages from outside systems that require an ML inference. The Source might be a live system streaming real-time data about ongoing events, such as an ecommerce website streaming questions about live customers’ shopping carts, or an industrial control system streaming temperature and pressure readings from manufacturing sensors; or it might be a data warehouse where historic business data is stored for later analysis, such as a Snowflake or Google BigQuery database.
Likewise, the pipeline ends in a Sink stage, which sends the final ML result out to an external receiver. For example, an ecommerce Source might request that upsell recommendations be sent to its Sink. The industrial control system might request that a maintenance alerting Sink be notified every time an ML model predicts maintenance will be required soon. If needed, the Source and Sink can even point at the same system: a business analyst might load historic sales data from a BigQuery database and send the resulting price forecasts back to another table in that database.
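To make these abstractions concrete, here is a minimal sketch of what declaring such a pipeline from the Python SDK could look like. The method names (upload_model, build_pipeline, add_model_step, deploy, infer), the model files, and the input tensor are illustrative assumptions for this whitepaper, not a verbatim transcript of the SDK.

```python
import wallaroo

# Connect from a notebook; authentication and workspace setup are assumed.
wl = wallaroo.Client()

# Upload two ONNX models: one that detects objects in an image,
# and one that classifies each detected object as "alive" or "not alive".
detector = wl.upload_model("object-detector", "./detector.onnx")
classifier = wl.upload_model("alive-classifier", "./classifier.onnx")

# Chain the models: the detector's output is processed as the classifier's input.
pipeline = (
    wl.build_pipeline("image-triage")
      .add_model_step(detector)
      .add_model_step(classifier)
)

# Deploying pushes the pipeline into the Rust execution engine.
pipeline.deploy()

# A request/response Source and Sink in miniature: send one input, get one result.
# A production pipeline might instead read from a stream or a BigQuery table
# and write its results to an alerting Sink or another table.
result = pipeline.infer({"tensor": [[0.1, 0.2, 0.3]]})

pipeline.undeploy()
```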
What’s the secret sauce?
People often ask us, “what’s the secret sauce?” What they mean is, “how does Wallaroo get those great performance numbers?”
At a high level, Wallaroo’s performance comes from a combination of design choices and constant vigilance. We built the system in the ultrafast Rust language, which provides a high level of robustness and safety while executing at C-like speeds. The Wallaroo engine uses an actor-model approach that encapsulates data, minimizes coordination, and brings the state close to the computation. We automatically test how every new feature affects system performance, and don’t accept any degradations: we reimplement the feature when we find performance lacking. We’re constantly developing new techniques to make the system go even faster.
High Performance ML Execution Engine
Let’s start with Wallaroo’s implementation language. While it is possible to write programs that perform poorly in any programming language, some languages make it harder than others to write efficient ones. We chose the Rust programming language for its many benefits:
- Robust ecosystem of libraries
- C-like execution performance
- Concurrent runtime
- Type-safe
- Borrow checker
- High maturity and adoption rate
Wallaroo’s runtime provides a programming model that makes it easy to write lockless, high-performance code. The runtime comes with an efficient work-stealing scheduler that schedules actors in a CPU-friendly fashion. By default, a program starts one scheduler thread per CPU. When combined with CPU pinning on operating systems such as Linux, this allows for CPU-cache-friendly programs: rather than a large number of threads stepping on one another to get access to a given CPU, each CPU is dedicated to a single thread.
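As a small illustration of the CPU-pinning idea (plain, Linux-only Python, not Wallaroo's scheduler), one worker process can be pinned to each core so that no other worker evicts that core's cache:

```python
import os
import multiprocessing as mp

def worker(cpu_id: int) -> None:
    # Pin this worker to a single CPU (Linux-only), mirroring the
    # "one scheduler thread per CPU" idea described above.
    os.sched_setaffinity(0, {cpu_id})
    # ... run this core's share of the work here ...

if __name__ == "__main__":
    workers = [mp.Process(target=worker, args=(cpu,))
               for cpu in range(os.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```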
One of our top objectives for Wallaroo has been consistent, stable performance. Our implementation helps us achieve this by eliminating any “stop-the-world” garbage collection phase, unlike the Java Virtual Machine. The JVM has a single large heap that requires all threads to be stopped to collect garbage. In our runtime, there is never a point in time when a Wallaroo application is doing nothing but collecting garbage. The result is predictable performance.
In a clustered environment like Wallaroo’s, this consistency becomes a huge source of performance gains. Imagine two processes featuring “stop-the-world” garbage collection that are working together. Process A feeds data into Process B. Any time Process A experiences garbage collection, no other processing is done and Process B ends up starved for work. The “stop-the-world” pause on Process A ends up acting as a “stop-the-world” pause for the whole cluster. The same interaction can happen when Process B experiences a “stop-the-world” event. When B is no longer able to process work, A becomes backlogged and must either 1) exert backpressure to slow all producers down, or 2) queue large amounts of work that it needs to send to B, thereby increasing the likelihood that it will soon experience a garbage collection event of its own. Wallaroo never suffers from such cross-worker pauses because there is never a “stop-the-world” garbage collection event to halt processing.
The problems that “stop-the-world” garbage collection can cause in a clustered environment are covered in depth in “Trash Day: Coordinating Garbage Collection in Distributed Systems”. Wallaroo avoids all of these issues.
Spin Freely: Avoid Coordination
Coordination is a performance killer. Coordination happens whenever two or more components must agree on something before further progress can be made, and any time we introduce it into a design, we introduce a potential bottleneck.
Coordination can take many forms. Locks are one example: when multiple threads need to update shared data while keeping it consistent, they coordinate by introducing a lock. Consensus is another form of coordination. We see “consensus as coordination” in our daily office lives: to make a major decision, we need to get three people together to discuss the topic, so we must find a time everyone is available and then wait.
We have designed Wallaroo to avoid coordination wherever possible, so that individual components can proceed using only local knowledge. How large an impact can coordination have on performance? Let’s take a look at one of our initial design mistakes. In an early version of Wallaroo, we had “global” routing actors. Every message processed had to pass through one of these routers. Changing message routing was very easy; high performance was not. We have since removed the global router and replaced it with many local routers. This one change in design resulted in an order-of-magnitude improvement in performance.
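The toy sketch below (plain Python, not Wallaroo code, and simplified further by Python's global interpreter lock) shows the structural difference: the first version funnels every update through one lock, much like our old global router, while the second gives each worker purely local state and merges results once at the end.

```python
import threading

# Coordinated: every update must acquire the same lock before proceeding.
class GlobalCounter:
    def __init__(self) -> None:
        self.value = 0
        self.lock = threading.Lock()

    def add(self, n: int) -> None:
        with self.lock:  # all threads must agree on who goes next
            self.value += n

# Coordination-free: each worker updates only its own local total.
def local_worker(out: list, idx: int, items: range) -> None:
    total = 0
    for n in items:
        total += n  # local knowledge only, nothing shared
    out[idx] = total

out = [0] * 4
threads = [
    threading.Thread(target=local_worker, args=(out, i, range(1_000_000)))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(out))  # one merge at the end instead of millions of lock acquisitions
```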
The performance and scalability impact of coordination can be huge. In “Silence is Golden”, Peter Bailis discusses the topic at length. If you are interested in learning more about how your systems can benefit from a coordination-avoiding design, we suggest you check it out.
In-process Coordination-Free State
Want to make a streaming data application go slow? Add a bunch of calls to a remote data store. Those calls are going to end up being a bottleneck. To maximize performance, you need to keep your data and computation close to each other.
Imagine an application that tracks the price activity of stocks. We want to be able to update the state of each stock as fast as possible. We have at our disposal three 16-core computers. We want to put all 48 cores to work updating price information. To achieve this, we need to be able to update each stock independently.
Wallaroo’s state object API provides independent, parallelizable individual state “entities.” In our application, each state object would be the data for a given stock symbol. In “Life Beyond Distributed Transactions”, Pat Helland presents a means of avoiding data coordination. Wallaroo’s “state objects” closely resemble the independent “entities” that Pat discusses as being key to scaling distributed data systems. The state object API makes it easy to partition state to avoid coordination among partitions.
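Here is a minimal sketch of the partitioning idea in plain Python (the names and structure are illustrative, not the Wallaroo state object API itself): every update for a given stock symbol is routed to the same partition, so each partition's state can be updated with no locks and no remote calls.

```python
from collections import defaultdict

N_PARTITIONS = 48  # e.g. every core across three 16-core machines

def partition_for(symbol: str) -> int:
    # The same symbol always lands on the same partition, so partitions
    # never touch each other's data and never need to coordinate.
    return hash(symbol) % N_PARTITIONS

# One independent "state object" per partition; in Wallaroo each lives
# in-process, right next to the computation that updates it.
partitions = [defaultdict(dict) for _ in range(N_PARTITIONS)]

def update_price(symbol: str, price: float) -> None:
    state = partitions[partition_for(symbol)][symbol]
    state["last"] = price
    state["high"] = max(state.get("high", price), price)

update_price("WLLR", 101.5)  # hypothetical ticker
update_price("WLLR", 103.2)
print(partitions[partition_for("WLLR")]["WLLR"])  # {'last': 103.2, 'high': 103.2}
```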
Performance Measurements Taken on Every Feature
In the end, Wallaroo performance comes down to careful measurement. The performance of computer applications is often surprising. Who among us has not made an innocent-looking change and suffered a massive performance degradation? Recognizing this reality, we have adopted a simple practice: as we add features or otherwise make changes to Wallaroo, we test the impact those changes have on performance and make sure no performance regressions slip in. Ideally, things should go faster each release.
We seek to keep the latency overhead of any feature as small as possible. Why? Lowering per-feature latency is key to increasing throughput. Increased throughput means we can do the same amount of work with fewer resources. Not clear? Don’t worry, let’s take a look.
Imagine for a moment a simple Wallaroo application. It takes some input, does a computation or two, and creates some output. We’re going to measure performance in “units of work.” Each unit of work takes the same amount of time: if two different computations take 1 unit of work each, they take the same amount of time. Transforming one input into an output costs a certain number of units of work: let’s say 7 in total, of which 4 are for user computations and 3 are Wallaroo overhead. Let’s further imagine that our computer can do 30 units of work at any one time. Given that processing 1 message requires 7 units of work, we can process at most 4 messages at a time.
30 / 7 ≈ 4
Now, let’s say that we can lower our Wallaroo overhead from 3 units to 1. If we do that, it will only take 5 units of work to process a message. And with that change, we can handle 6 messages at a time. That’s a 50% improvement over what we were doing before!
30 / 5 = 6
That’s a simple, contrived example but the basic logic holds in the real world. The less time it takes to complete a given task, the more times we can complete that task. We take this approach every day when building Wallaroo. Watch the overhead; take fewer resources to do the work; save money.
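The same back-of-the-envelope arithmetic, written out with the numbers from the example above:

```python
CAPACITY = 30   # units of work the machine can perform at once
USER_WORK = 4   # units per message spent in user computations

def messages_at_a_time(overhead: int) -> int:
    # How many messages fit into the machine's capacity at once?
    return CAPACITY // (USER_WORK + overhead)

print(messages_at_a_time(3))  # 30 // 7 -> 4 messages at a time
print(messages_at_a_time(1))  # 30 // 5 -> 6 messages: a 50% improvement
```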
Why Wallaroo?
The Aloha model is one of many examples we at Wallaroo have used to push our core system: an efficient, low-footprint event-by-event machine learning model execution engine, paired with an experimentation framework for A/B testing and anomaly detection. We welcome any models and frameworks that our customers are using, or need to use, to expand our ever-growing capabilities. The Wallaroo platform makes it simple, fast, and very low cost to get AI algorithms live against production data.