How to Accelerate Computer Vision Model Inference
If you are having trouble deploying compute-intensive models for real-time or edge use cases, reach out to us at deployML@wallaroo.ai for a free consultation.
Although computer vision has been an active research field for some forty years, it took the advent of deep neural networks to make it practical for more sophisticated tasks. Today, computer vision models are used for a variety of business applications across many different verticals.
In retail, computer vision is used for tasks like automated checkout, shopper movement, and inventory tracking. Heavy industry applications include parts inspection and robotic assembly. Computer vision is also used in healthcare to aid medical image analysis, and in agriculture for monitoring crop development.
While deep learning models are highly effective for computer vision problems, they present a number of challenges for both development and deployment. Deep nets work so well in rich, unstructured task domains like vision because they can find and express complicated relationships and patterns in the data. To accomplish this, the model structure is often quite complex, with millions (and in some problem domains, billions) of parameters that capture concepts learned from high-volume data sets.
For example, the well-known ResNet50 image classification model has approximately 26 million parameters and was trained on the ImageNet data set of 14 million images. VGG16 has about 138 million parameters and was trained on the same dataset.
Techniques like transfer learning (fine-tuning pre-trained models) reduce the data volume and compute needed to train a deep net from scratch. But these large, complex computer vision models often have fairly heavy computational requirements at inference time as well. This can make them hard to deploy in real-time, edge, or other production environments with either time or resource constraints.
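As a quick illustration of the fine-tuning idea, here is what a transfer learning setup often looks like. This is a minimal sketch using PyTorch and torchvision (an assumed toolchain; the article doesn't prescribe any particular framework): only a new classification head is trained on top of a frozen, pretrained ResNet50.

```python
import torch
import torchvision

# Start from a ResNet50 pretrained on ImageNet (requires torchvision >= 0.13
# for the `weights` argument).
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# Freeze the pretrained backbone so only the new classifier is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 10-class task.
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... then train as usual on the (much smaller) task-specific dataset.
```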
In this article, we’ll look at some techniques for making computer vision models lean and efficient in production.
Quantization and Pruning
Quantization and pruning are two common ways to “trim some fat” off an existing model.
Int8 quantization reduces the size of a model by shrinking the numeric representation of its weights. These weights are typically stored as 32-bit floating-point numbers; converting them to 8-bit (or even smaller) integers not only saves space but also turns floating-point arithmetic into integer arithmetic. The result is a smaller, faster model, though sometimes at a cost in accuracy.
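To make this concrete, here is a minimal post-training quantization sketch using PyTorch and torchvision (assumed tooling, not something the article specifies). It applies dynamic int8 quantization, which is the simplest variant to demonstrate; for convolution-heavy vision models, static quantization with a calibration pass typically yields larger gains.

```python
import torch
import torchvision

# Load a pretrained ResNet50 as the FP32 baseline.
model_fp32 = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model_fp32.eval()

# Post-training dynamic quantization: store Linear-layer weights as int8.
# (Static quantization, with a calibration dataset, is the more common
# choice for conv-heavy vision models, but needs more setup.)
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement at inference time.
dummy_input = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model_int8(dummy_input)
print(out.shape)  # torch.Size([1, 1000])
```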
In pruning, a search procedure looks for the subset of the original network that contributes the most to the decisions the model makes, and then removes the parts that are not necessary for the task at hand. For example, a model trained on ImageNet learned to distinguish a thousand categories, including many different breeds of dogs. If the part of the network that knows about dogs is not important to your use case, it may be possible to prune it out and save on space and computation. However, this comes at the cost of increased effort: you must find the relevant parts of the network and then restructure it to account for the pruned parts.
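The task-driven, structural pruning described above usually requires specialized tooling, but a simpler variant, magnitude-based pruning, is built into PyTorch and gives a feel for the idea. The sketch below (again an assumed toolchain, not the article's prescription) zeroes out the 30% of weights with the smallest magnitude in each convolutional layer.

```python
import torch
import torchvision
import torch.nn.utils.prune as prune

model = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# Apply L1-magnitude unstructured pruning to every conv layer:
# zero out the 30% of weights with the smallest absolute value.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Make the pruning permanent (bakes the mask into the weight tensor).
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.remove(module, "weight")

# Check the overall sparsity of the conv weights.
zeros, total = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        zeros += (module.weight == 0).sum().item()
        total += module.weight.nelement()
print(f"Conv weight sparsity: {zeros / total:.1%}")
```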
Knowledge Distillation
Knowledge distillation, or knowledge transfer, is an interesting technique in which a small model is trained to reproduce the output of a larger one. It is an approach used to build small, high-performing models for limited-resource environments, like mobile phones and embedded devices.
Knowledge distillation starts with a larger, “heavyweight” network, the teacher, that has been trained on the original task. A second, smaller network, the student, is then trained to learn the same concepts from the teacher. The student often has a network structure similar to the teacher's, but with fewer parameters. It can be trained on the same training data as the teacher, or on a different data set.
In response-based distillation, the most straightforward variation of knowledge transfer, the student model is trained to reproduce the output of the teacher model's final layer. Other variations also try to reproduce intermediate layers, or even the relationships between adjacent layers.
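A typical response-based distillation loss blends a “soft” term, matching the teacher's softened output distribution, with the usual hard-label loss. Here is a minimal sketch in PyTorch (an assumed framework; the temperature and weighting values are illustrative, not prescribed by the article).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Response-based distillation: combine a soft-target term (match the
    teacher's softened output distribution) with hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # KL divergence between softened distributions, scaled by T^2 as in
    # Hinton et al.'s original formulation.
    soft_loss = F.kl_div(soft_student, soft_targets,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Inside a training loop (sketch):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# student_logits = student(images)
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward()
```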
Knowledge distillation can produce smaller models that match, and sometimes exceed, the accuracy of their larger teacher models. However, this approach does require the overhead of an additional training round, and possibly some experimentation to find the best architecture for the student model. For a recent survey of knowledge distillation techniques, see [Gou et al., 2021].
Use a High-Performance Compute Engine
Rather than (or in addition to) downsizing the original model, you can speed up a model by running it on a high-performance compute engine, like Wallaroo. Wallaroo specializes in the last mile of the ML process: deployment. The Wallaroo platform is built around a high-performance, scalable Rust engine that is specialized for fast, high-volume computational tasks. The platform is designed to integrate smoothly into your data ecosystem and can be run on-prem, in the cloud, or at the edge.
With one customer who deploys computer vision models in a low-resource environment, Wallaroo was able to more than double the inference throughput and reduce latency by almost half! In other cases, we have seen up to a 12.5X improvement in inference speed and an 80% reduction in the cost of computational resources.
With Wallaroo, businesses can easily deploy complex models and achieve the accuracy they need, without sacrificing speed or performance. There's less need for additional, time-consuming model optimization steps.
You can read more about the Wallaroo platform at our blog. If you are interested in finding out how Wallaroo can improve your model deployment process, reach out to us at deployML@wallaroo.ai to learn more.