Automating the Data Pipeline
This month we will be ramping up our coverage of DataOps and data engineering, so we can all learn how to streamline data operations in much the same way that code development and infrastructure management are being automated.
In the field of software development, we have tools like git to allow for distributed collaboration. And with GitOps, we can automatically push that code into production. On the operations side, we have the emergence of programmable infrastructure, which, with tools from HashiCorp, Pulumi and others, automates the process of rolling out deployments without manual intervention.
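To make "programmable infrastructure" concrete, here is a minimal sketch of what a Pulumi program can look like in Python; the AWS S3 bucket and its name are purely illustrative assumptions, not a prescribed setup.

```python
# A minimal Pulumi program: infrastructure declared as ordinary Python.
# Running `pulumi up` compares this desired state to what already exists
# and rolls out any changes without manual intervention.
import pulumi
from pulumi_aws import s3

# An S3 bucket to hold raw pipeline data; the name is illustrative.
raw_data = s3.Bucket("raw-pipeline-data")

# Export the generated bucket name so other stacks or jobs can reference it.
pulumi.export("raw_data_bucket", raw_data.id)
```

The point is the workflow rather than the particular resource: the desired state lives in version control, and the tool reconciles reality against it.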
But while the processes around code and computers are rapidly becoming automated, the management of data (or “Big Data” if you have a lot of data) still seems stuck in the 20th century. In a contributed post to The New Stack this week, Lenses.io co-founder and Chief Technology Officer Andrew Stevenson writes that “Businesses in every industry are data driven, and data professionals are feeling increasing pressure to work more efficiently and accelerate time to market for their products. The last thing a data professional wants is to become a bottleneck in the process.”
The good news is that help is on the way. Stevenson argues for instituting a form of DataOps, in which analysts, developers and other business-focused users can work with the data in a self-service fashion. A good DataOps system (Lenses.io offers a DataOps platform of its own) should also accommodate corporate compliance and governance needs.
Other companies are tackling this issue as well.
The Irish startup TerminusDB is just one of the projects working on creating a “git for data” system, as we learned from Susan Hall this week. TerminusDB is an open source in-memory graph database that allows different people to work on different versions of the same project at the same time.
For machine learning (ML) data, we learned from KubeCon that Microsoft has been pitching the idea of MLOps, a complete pipeline for managing data from training through to the model itself. This automates the management of data and, not incidentally, provides a baseline for security. “The truths you can’t avoid here: your models will be attacked, your pipelines will have issues. And the game is all about mitigation of harms and quick recovery. And you can do that using an MLOps pipeline,” Microsoft’s David Aronchick said.
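To sketch what such a pipeline looks like in practice, here is a minimal, library-free Python illustration; the stage names, quality threshold and stand-in training step are illustrative assumptions rather than Microsoft's actual MLOps tooling.

```python
"""A minimal, library-free sketch of an MLOps-style pipeline.

Each stage is an explicit, automatable step, so failures (bad data,
a degraded model) are caught and handled without a human in the loop.
All thresholds and stage names here are illustrative assumptions.
"""
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineRun:
    data: list                      # raw training records
    model: Optional[dict] = None    # trained model artifact
    accuracy: float = 0.0


def validate_data(run: PipelineRun) -> PipelineRun:
    # Reject obviously bad inputs before they ever reach training.
    if not run.data:
        raise ValueError("empty training set: fail the run, not the model")
    return run


def train(run: PipelineRun) -> PipelineRun:
    # Stand-in for a real training job (e.g. a Kubernetes batch job).
    run.model = {"weights": len(run.data)}
    run.accuracy = 0.93
    return run


def evaluate(run: PipelineRun) -> PipelineRun:
    # Gate deployment on a minimum quality bar.
    if run.accuracy < 0.9:
        raise ValueError("model below quality gate; keep the previous version")
    return run


def deploy(run: PipelineRun) -> None:
    # Stand-in for pushing the approved model to a serving environment.
    print("deployed model:", run.model)


if __name__ == "__main__":
    run = PipelineRun(data=[{"x": 1, "y": 0}, {"x": 2, "y": 1}])
    deploy(evaluate(train(validate_data(run))))
```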
Also at KubeCon, we got a preview of a new concept, called a “Feature Store,” which was also created to automate the data pipeline. A collaboration between Google and Indonesian startup Gojek, called Feast, provides a way for companies to organize commonly used data sets and formats so they can be easily accessed by developers and data scientists (and more easily managed in a uniform way behind the scenes).
“A typical flow is for data scientists to either push data into a feature store for storage, or to register transformations with a feature store that will generate data to be stored within the feature store. Once the data is available within the feature store, another team can consume those features for training a model, and can also retrieve features from the feature store for online serving,” Feast creator Willem Pienaar explained to TNS writer Kimberly Mok.
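That flow is easy to picture in code. Below is a toy, in-memory Python sketch of the push / register / train / serve cycle Pienaar describes; the class, method names and driver-order features are illustrative assumptions, not Feast's real API (Feast ships its own SDK and storage backends).

```python
"""A toy, in-memory feature store illustrating the flow described above.

This is a conceptual sketch, not Feast's actual API: data scientists push
precomputed features (or register transformations that generate them), and
other teams later read the same features for training or online serving.
"""
from typing import Callable, Dict, List, Optional


class ToyFeatureStore:
    def __init__(self) -> None:
        # feature name -> {entity id -> value}
        self._features: Dict[str, Dict[str, float]] = {}

    def push(self, feature: str, values: Dict[str, float]) -> None:
        """Store precomputed feature values keyed by entity id."""
        self._features.setdefault(feature, {}).update(values)

    def register_transformation(
        self,
        feature: str,
        fn: Callable[[Dict[str, float]], Dict[str, float]],
        source: Dict[str, float],
    ) -> None:
        """Generate and store a derived feature from raw data."""
        self.push(feature, fn(source))

    def get_training_frame(self, features: List[str]) -> Dict[str, Dict[str, float]]:
        """Batch retrieval of feature values for model training."""
        return {name: dict(self._features.get(name, {})) for name in features}

    def get_online_features(
        self, features: List[str], entity_id: str
    ) -> Dict[str, Optional[float]]:
        """Low-latency point lookup for serving a single entity."""
        return {name: self._features.get(name, {}).get(entity_id) for name in features}


# Illustrative usage: one team writes features, another team reads them.
store = ToyFeatureStore()
store.push("orders_last_7d", {"driver_1": 42.0, "driver_2": 17.0})
store.register_transformation(
    "orders_last_7d_sqrt",  # a derived feature
    lambda src: {k: round(v ** 0.5, 2) for k, v in src.items()},
    {"driver_1": 42.0, "driver_2": 17.0},
)
print(store.get_training_frame(["orders_last_7d", "orders_last_7d_sqrt"]))
print(store.get_online_features(["orders_last_7d", "orders_last_7d_sqrt"], "driver_1"))
```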
It’s a rapidly evolving field, and like GitOps or programmable infrastructure, one that demands automation. In an excellent QCon talk on data engineering a year ago, software engineer Chris Riccomini passed on an insightful quote from a Google SRE, namely that “If a human operator needs to touch your system during normal operations, you have a bug.”
The future is automation — for the code, infrastructure AND data.