Why Data Scientists Should Be Excited About Wallaroo Community Edition
As a data scientist, here are the parts of my job that I enjoy the most (in no particular order):
- Learning about new domains where I can apply my problem-solving skills, and gaining knowledge and inspiration from the experts in these fields who work with me.
- Learning about exciting new techniques and approaches in machine learning and statistics, and applying them to the real-world problems that I’m working on.
- Getting my hands dirty in the data, exploring how it relates to the information that I need and the knowledge that I want to extract.
- Creating and executing elegant and effective solutions to the problems that I’m tackling.
- Seeing my models and solutions bring real value to the business, solving problems and making decisions.
What’s the part of the job I do not enjoy?
To be honest (and I’m not proud of it): the act of making my models bring value to the business.
I think it boils down to this: to me, data science (and data analysis in general) is a process of abstraction: finding the patterns, the “truth” that lies within the data, and turning those patterns into automated and actionable decision processes (models) that perform a valuable function within the business. That means my job entails tasks like exploration, experimentation, and communication of my findings to interested parties.
But putting these models into production feels, to me, like the opposite. To operationalize a model, the focus shifts to tasks like automation, resource optimization, hardening, security, logging, and developing fallback strategies in case things go wrong. It means explicitly worrying about all the complexities and details needed to make the production pipeline robust, reliable, and fail-safe. And I'm not a details person; at least, not those kinds of details.
Not all data scientists feel the way I do, of course; some are quite comfortable with, and even actively interested in, the process of shepherding their models into the real world. But the population of data scientists who share my aversion seems large enough that a whole new profession has arisen to fill the gap: Machine Learning Ops (MLOps) engineers.
MLOps engineers have precisely the skill set to make the models that data scientists create active and actionable in the real world, and to monitor a model's operational performance while it is live. But while many data scientists may be happy to hand their models over to the MLOps team for a production rollout, this process, as currently practiced, may not be terribly efficient. Because data scientists and MLOps engineers don't speak the same language, and don't work or think the same way, there can often be time-consuming bottlenecks as one team tries to articulate a requirement and the other tries to satisfy it.
In addition, if the model does start misbehaving in production, diagnosing the issue is often a team effort: is it an error in the production stack, or is something wrong with the model? This can lead to the same communication and coordination bottlenecks as deployment, as data scientists struggle to gain visibility into their models within the production stack.
Can we make the interaction between the data scientist’s world and the MLOps world easier?
When I'm learning how to operate within an unfamiliar system, it helps me to do so in an environment I'm comfortable in, preferably an interactive one. As a data scientist, I find a notebook-style environment helpful, because I can simultaneously figure out the process and document it for future reference. Some people prefer graphical UIs and dashboards, and those are certainly useful for information that's best absorbed visually, like summaries of model health, or quick glances at what might need my attention most.
I want it to be easy to communicate important information about the model to deployment teams: for instance, data validation constraints that the model expects to be met, or any preprocessing that incoming data must undergo before it's fed to the model.
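To make that concrete, here's a minimal sketch of the kind of input contract I'd want to hand over alongside a model. Everything in it is illustrative: the column names, valid ranges, and the log-transform preprocessing step are assumptions invented for this example, not any particular platform's API.

```python
import numpy as np
import pandas as pd

# Illustrative input contract for a hypothetical model. The column
# names, ranges, and preprocessing below are invented for this example.
EXPECTED_COLUMNS = ["age", "income", "account_tenure_months"]
VALID_RANGES = {
    "age": (18, 120),
    "income": (0.0, 1e7),
    "account_tenure_months": (0, 600),
}

def validate(batch: pd.DataFrame) -> pd.DataFrame:
    """Reject batches that violate the model's input assumptions."""
    missing = set(EXPECTED_COLUMNS) - set(batch.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    for col, (lo, hi) in VALID_RANGES.items():
        bad = ~batch[col].between(lo, hi)
        if bad.any():
            raise ValueError(f"{int(bad.sum())} rows of {col!r} fall outside [{lo}, {hi}]")
    return batch

def preprocess(batch: pd.DataFrame) -> pd.DataFrame:
    """Apply the same transformation the model saw at training time."""
    batch = batch.copy()
    batch["log_income"] = np.log1p(batch["income"])  # assumed training-time transform
    return batch
```

The point isn't the code itself; it's that these expectations live next to the model, in a form that both I and the deployment team can read and run.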
If and when something goes awry with one of my models in production, I want to know about it sooner rather than later. I want to easily diagnose the problem; automated diagnosis is ideal, but for hairy problems I want to be able to quickly pull down the information I need for more in-depth analysis. This includes inference logs, input history, and model health information.
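As one platform-agnostic example of what I'd do with those logs once I've pulled them down: compare the distribution of a logged input feature against its training-time baseline. The population stability index (PSI) below is a standard rule-of-thumb drift score; the data samples and the ~0.2 threshold mentioned in the docstring are assumptions for illustration, not a product feature.

```python
import numpy as np
import pandas as pd

def population_stability_index(baseline: pd.Series, recent: pd.Series,
                               n_bins: int = 10) -> float:
    """Score distribution shift between a training baseline and recent inputs.

    A common rule of thumb: PSI below 0.1 is stable; above roughly 0.2,
    the input distribution has likely drifted enough to investigate.
    """
    # Bin edges come from the baseline's quantiles, so each bin holds
    # roughly equal baseline mass; np.unique guards against duplicate edges.
    edges = np.unique(np.quantile(baseline, np.linspace(0.0, 1.0, n_bins + 1)))
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    recent_frac = np.histogram(recent, bins=edges)[0] / len(recent)
    # Clip to avoid log(0) in bins that one sample leaves empty.
    base_frac = np.clip(base_frac, 1e-6, None)
    recent_frac = np.clip(recent_frac, 1e-6, None)
    return float(np.sum((recent_frac - base_frac) * np.log(recent_frac / base_frac)))
```

With inference logs loaded into a dataframe, a drift check becomes one line per feature, e.g. `population_stability_index(train_df["income"], logs_df["income"])`.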
Wallaroo helps me do all that.
It’s designed to be a platform where I can interact with production teams, about production concerns, in a way that fits my natural (data scientist) working style. That’s definitely a benefit for me, and I hope it’s a benefit for the ML engineers who have to work with me, too!