Replace Apache Spark with Wallaroo for Lower Infrastructure Costs & Faster Deployments

Jan 27, 2022

Apache Spark, whether open source or a managed service like AWS EMR and Databricks Runtime, is often one of the first production runtimes considered by engineering teams for deploying and running ML models in production. Many times that’s because an organization already has a production installation that’s being used for big data, analytics, or ETL workflows. If there are only a few simple models that don’t need regular retraining and redeployment, and production inferencing is a batch processing affair, then Spark can provide a “good enough” solution.

However, under most usage patterns, Spark requires far more time from Data Scientists and ML Engineers to get ML models running in production, consumes far more infrastructure, and scales poorly compared to Wallaroo. When we talk to heads of Data Science, we often hear a familiar refrain: “I can’t wait to free up my Data Scientists’ time to focus on model development, not messing around with Spark.”

In this blog, we explore why an organization would use Wallaroo instead of a Spark/Databricks deployment, and what benefits it might expect. To be clear, Wallaroo is not a rip-and-replace of the entire Databricks ecosystem. It is focused on facilitating and accelerating the last mile of your machine learning journey: taking your model from the development environment into production to generate business results (you can get an introduction to Wallaroo here). If you already depend on Spark for big data analytics or model development, you can keep doing that, and Wallaroo will fit easily into your ecosystem.

Scalability and performance roadblocks with Apache Spark

Spark creates increasing deployment friction and infrastructure cost as the number of models grows, the frequency of redeployment increases, or the models get more complex (e.g. NLP or other neural network models). In addition, most enterprises are introducing more real-time use cases and hope to eventually move many batch workloads to streaming.

Spark is not good enough for these use cases because these increasingly common usage patterns were rare when Spark was first introduced. Spark was designed over 10 years ago for processing big data and runs on the Java Virtual Machine (JVM), whereas modern ML model libraries are written in C or Python. Crossing that boundary costs performance and requires additional resources. For example, the JVM’s garbage collector consumes additional application resources and introduces application pause times. Additionally, Spark is not true streaming: it processes data in small batches (micro-batches), so it is not designed for low-latency operations. Moreover, since Spark does not scale down well to small clusters and limited infrastructure, it does not run well at the edge.
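To make the micro-batch point concrete, here is a minimal sketch of the waiting time that batching alone adds before any processing starts. It is a deliberately simplified model, not Spark's actual scheduler: the function names, the 1-second and 100-millisecond batch intervals, and the evenly spaced arrivals are all assumptions chosen for illustration.

```python
def micro_batch_wait_ms(arrival_ms: int, batch_interval_ms: int) -> int:
    """Milliseconds an event waits for the next batch boundary (toy model)."""
    # Round the arrival time up to the next multiple of the batch interval.
    next_boundary_ms = -(-arrival_ms // batch_interval_ms) * batch_interval_ms
    return next_boundary_ms - arrival_ms


def average_wait_ms(arrivals_ms, batch_interval_ms: int) -> float:
    """Average batching delay over a collection of arrival times."""
    arrivals = list(arrivals_ms)
    total = sum(micro_batch_wait_ms(t, batch_interval_ms) for t in arrivals)
    return total / len(arrivals)


# Hypothetical workload: one event every 10 ms for 10 seconds.
arrivals = range(0, 10_000, 10)

print(average_wait_ms(arrivals, 1_000))  # 495.0 ms with 1 s batches
print(average_wait_ms(arrivals, 100))    # 45.0 ms with 100 ms batches
```

In this toy model the average added delay is roughly half the batch interval, so shrinking the interval reduces latency but never removes the wait entirely, whereas an engine that handles each event on arrival has no batching delay at all.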

Spark vs. Wallaroo customer bakeoff

Last year, a top-10 global bank needed a better way to deploy, run, and update the 100+ ML models it used to detect malicious traffic across more than 2 billion events per day. Its existing Spark-based infrastructure was compute heavy, and more importantly, updating its models could take several days, exposing the business to significant risk as bad actors adjusted their tactics.

When they came to Wallaroo, they wanted to understand what it would take to deploy ML models into production and process 2 billion events per day in managed Spark versus Wallaroo. They were stunned at the results: Wallaroo was over 100x faster to deploy and 12.5x faster at inference, with more than 92% lower cloud infrastructure costs than Spark. See the full benchmark comparison below. NOTE: This was for a single model in production. Their full savings would be multiplied by the number of models they ultimately chose to run via Wallaroo.
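For a sense of scale, the headline numbers can be turned into back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (2 billion events per day, 12.5x, 92%); the baseline cost of 100 units and baseline inference time of 25 units are hypothetical placeholders, not benchmark data, and the helper names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: int) -> float:
    """Average sustained rate implied by a daily event volume."""
    return events_per_day / SECONDS_PER_DAY


def time_after_speedup(baseline_time: float, speedup: float) -> float:
    """Time remaining once a workload runs `speedup` times faster."""
    return baseline_time / speedup


def cost_after_reduction(baseline_cost: float, reduction_pct: float) -> float:
    """Cost remaining after a percentage reduction."""
    return baseline_cost * (1 - reduction_pct / 100)


# 2 billion events per day averages out to roughly 23,148 events per second.
print(round(average_events_per_second(2_000_000_000)))

# Hypothetical baselines (assumptions): 25 time units and 100 cost units.
print(time_after_speedup(25.0, 12.5))                 # 2.0 time units
print(round(cost_after_reduction(100.0, 92.0), 2))    # 8.0 cost units
```

Because the article reports the cost reduction as "more than 92%", the 8 units remaining from a baseline of 100 should be read as an upper bound per model under these assumptions.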

You can see a more detailed benchmark comparing Wallaroo to Google Vertex, Amazon SageMaker, and Databricks for deploying and running models here. For this benchmark we used publicly sourced code, “Aloha-CNN-LSTM”, found here. The code is a CNN model, with a domain generation algorithm, that predicts the legitimacy of domains and is trained on Alexa’s top 1 million domains. The model was originally written in Python 2.7. The VMs used were Standard F16s v2 instances with 32.0 GB memory and 16 cores, running two worker nodes. The upload data were JSON files containing 1, 1k, 25k, 50k, 100k, 200k, 400k, 800k, and 1600k data points.

How to increase the ROI of your AI investments with Wallaroo

Without minimizing the importance of the upstream or midstream, we often find the last mile of ML — actually getting it live and into production — is an afterthought even though live production is where the ROI from AI investments comes from. According to Gartner, only 53% of projects make it from prototype to production. From this same Gartner report: “CIOs and IT leaders find it hard to scale AI projects because they lack the tools to create and manage a production-grade AI pipeline.”

That is why we are hyper focused on the “last mile” of machine learning.

*Fig 2: Wallaroo focuses on the last mile of ML — deploying, running, observing and optimizing ML models in production*

If you are having trouble deploying, running, and monitoring your ML models in production because you:

Have hundreds or thousands of models in production and struggle to manage and update them,
Face big data challenges in running models efficiently in production, or
Have complex models, like neural networks, that require better monitoring,

reach out to us at datascience@wallaroo.ai to see how we stack up against your current model deployment solution. We can help automate your deployment process while lowering compute costs and delivering faster inferences.

About Wallaroo

Wallaroo enables Data Scientists and ML Engineers to deploy enterprise-level AI into production more simply, faster, and with incredible efficiency. Our platform provides powerful self-service tools, a purpose-built ultrafast engine for ML workflows, and an observability and experimentation framework. Wallaroo runs in cloud, on-prem, and edge environments while reducing infrastructure costs by 80 percent.

Wallaroo’s unique approach to production AI gives any organization the fast time to market, audited visibility, scalability, and ultimately measurable business value it wants from its AI-driven initiatives, and allows Data Scientists to focus on value creation, not low-level “plumbing.”

Written by Wallaroo.AI