
Running Machine Learning Pipelines with Kedro, Kubeflow and Airflow

One of the biggest challenges in today’s Machine Learning world is the lack of standardization in model training.

We all know that data needs to be cleaned and split into training and test sets, and that the model needs to be fitted on the training set and validated on observations from the test set. Maybe some cross-validation should be involved, and hyperparameter tuning is not a bad idea either. Every Data Scientist has a feel for how to work with models efficiently, but the lack of common standards makes their work hard to understand for other engineers who use a different methodology.

To overcome this issue, QuantumBlack open-sourced Kedro, a Python framework for creating reproducible, maintainable and modular data science code. Projects created with Kedro are universal enough to cover most of the tasks that Data Scientists may face. Additionally, the Data Catalog and Pipeline abstractions make the model building process look like a software project that can be configured and deployed easily, especially by engineers who didn’t take part in implementing it. Kedro offers a variety of plugins to log models and metrics in MLflow, ship the project as a Docker image and more. Based on the feedback gathered from our clients, we made Kedro a core part of the GetInData Machine Learning Operations Platform. We presented our MLOps platform blueprint during the Google Cloud Region launch in Warsaw. You can watch it here (currently only in Polish).
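To picture what the Data Catalog and Pipeline abstractions buy you, here is a toy sketch in plain Python. It is not Kedro's API: the `Node` tuple, the `run` helper and the dataset names are all made up for illustration. The idea it tries to show is that each step is a plain function, and the wiring between steps is expressed only through named datasets stored in a catalog.

```python
from typing import Callable, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node (not Kedro's class)."""
    func: Callable
    inputs: List[str]
    outputs: List[str]


def clean(raw):
    # Drop missing observations.
    return [row for row in raw if row is not None]


def split(rows):
    # Keep the first 75% of rows for training, the rest for testing.
    cut = int(len(rows) * 0.75)
    return rows[:cut], rows[cut:]


def train(train_rows):
    # The "model" is simply the mean of the training values.
    return sum(train_rows) / len(train_rows)


def validate(model, test_rows):
    # Mean absolute error of the constant prediction on the test set.
    return sum(abs(x - model) for x in test_rows) / len(test_rows)


PIPELINE = [
    Node(clean, ["raw"], ["cleaned"]),
    Node(split, ["cleaned"], ["train_set", "test_set"]),
    Node(train, ["train_set"], ["model"]),
    Node(validate, ["model", "test_set"], ["mae"]),
]


def run(pipeline, catalog):
    # The nodes above are already listed in dependency order, so a simple
    # loop is enough for this sketch.
    for node in pipeline:
        args = [catalog[name] for name in node.inputs]
        result = node.func(*args)
        if len(node.outputs) == 1:
            result = (result,)
        catalog.update(zip(node.outputs, result))
    return catalog


catalog = run(PIPELINE, {"raw": [1.0, None, 2.0, 3.0, 6.0]})
print(catalog["model"], catalog["mae"])
```

Because the functions never reference each other directly, an engineer who did not write them can still see, from the node list alone, what is produced and what is consumed at every step.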

However, having a Kedro project ready and well tested on a data sample doesn’t mean it is ready to go “into production” and be deployed quickly. Continuous training, hyperparameter tuning and continuous quality validation all need scheduling capabilities, distributed computing and powerful hardware to run efficiently. Kedro describes some ideas on how to deploy the models, but well, you know what they say, “work smarter, not harder - automate everything” ;-)

*Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Mariusz Strzelecki - Big Data Technology Warsaw Summit 2021*

Using Kubeflow? Meet kedro-kubeflow

Kubernetes is the core of our Machine Learning Operations platform and Kubeflow is a system that we often deploy for our clients. Therefore, we decided to automate the generation of a Kubeflow pipeline from an existing Kedro pipeline, so that it can be scheduled by Kubeflow Pipelines (a.k.a. KFP) and run on a Kubernetes cluster. Thankfully, the creators of Kedro gave us a little help by building a proof of concept of this integration and sharing interesting insights.

The result of our work is available on GitHub as a kedro-kubeflow plugin. You install it in your existing Kedro project and soon you can:

  • compile Kedro nodes as KFP steps and reflect dependencies,
  • upload the compiled KFP pipeline to the server,
  • trigger the execution using CLI or enable a schedule when you’re finally happy with how the pipeline works,
  • connect seamlessly with MLflow and log all the steps under one MLflow run (even if they are actually separate processes),
  • access Google AI Platform Pipelines API using IAM Proxy,
  • and more!
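The first item in the list, turning nodes into steps while reflecting their dependencies, can be sketched in a few lines. This is only an illustration of the idea, not the plugin's actual code: the `NODES` mapping and the `step_dependencies` helper are hypothetical, and the example uses the standard-library `graphlib` module (Python 3.9+) instead of the KFP SDK.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical node descriptions: name -> (input datasets, output datasets).
NODES = {
    "clean": (["raw"], ["cleaned"]),
    "split": (["cleaned"], ["train_set", "test_set"]),
    "train": (["train_set"], ["model"]),
    "validate": (["model", "test_set"], ["metrics"]),
}


def step_dependencies(nodes):
    """Map every step to the steps that produce its inputs."""
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    return {
        name: sorted({producer[i] for i in ins if i in producer})
        for name, (ins, _) in nodes.items()
    }


deps = step_dependencies(NODES)
# A valid execution order in which every step runs after its upstream steps.
order = list(TopologicalSorter(deps).static_order())

print(deps)
print(order)
```

A step such as `validate` ends up depending on both `split` and `train`, which is the kind of information a scheduler needs in order to start steps only when their inputs exist.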

We faced some challenges - for example Kedro expects the `data/` directory to be a place where nodes exchange the data, but in a distributed environment there are limited options to maintain shared storage between different processes. Thankfully, with a bit of hacking, we made it! 
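To see why shared storage matters, consider which datasets cross a process boundary once every node runs in its own container. The sketch below uses a hypothetical `NODES` mapping and a hypothetical `shared_datasets` helper, and says nothing about how the plugin solves the problem. It only lists the datasets that one step writes and another step reads.

```python
# Hypothetical node descriptions: name -> (input datasets, output datasets).
NODES = {
    "clean": (["raw"], ["cleaned"]),
    "split": (["cleaned"], ["train_set", "test_set"]),
    "train": (["train_set"], ["model"]),
    "validate": (["model", "test_set"], ["metrics"]),
}


def shared_datasets(nodes):
    """Datasets produced by one node and consumed by another."""
    produced = {out for _, outs in nodes.values() for out in outs}
    consumed = {inp for ins, _ in nodes.values() for inp in ins}
    return sorted(produced & consumed)


needs_shared_storage = shared_datasets(NODES)
print(needs_shared_storage)
```

In a single local process these intermediate datasets can simply sit in `data/`. In a distributed run, each of them has to live somewhere every step can reach.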

GetInData MLOps Platform: Kubeflow plugin

Using Airflow? Meet kedro-airflow-k8s

Some of our customers tend to avoid Kubeflow, as the system is quite complicated to install and maintain. Fortunately, Airflow can meet the same needs for Kedro pipeline deployment. There is an official kedro-airflow plugin, but it doesn’t support running in Docker containers inside a Kubernetes cluster, which is our preferred, most universal method.

Therefore, based on the experience of developing kedro-kubeflow, we created another plugin that we called kedro-airflow-k8s. It has the same capabilities and even the same CLI syntax as its older brother, but it compiles the Kedro pipeline to an Airflow DAG and deploys it by copying the file to the shared bucket that Airflow uses to synchronize its DagBag.
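The compile-and-copy flow described above can be sketched roughly as follows. Everything here is an assumption made for illustration and not the plugin's real output: the `render_dag` and `deploy` helpers, the image name, the template layout, the `kedro run --node` invocation inside the container, and the import path of `KubernetesPodOperator` (which differs between Airflow provider versions). A temporary local directory stands in for the shared bucket.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical step dependencies: step -> upstream steps.
DEPS = {"clean": [], "split": ["clean"], "train": ["split"]}

# Simplified DAG template; the real generated file will look different.
HEADER = '''\
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="{dag_id}",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
'''

TASK = '''\
    {name} = KubernetesPodOperator(
        task_id="{name}",
        name="{name}",
        namespace="{namespace}",
        image="{image}",
        cmds=["kedro"],
        arguments=["run", "--node", "{name}"],
    )
'''


def render_dag(dag_id, deps, image, namespace="default"):
    """Render DAG source text; step names must be valid Python identifiers."""
    parts = [HEADER.format(dag_id=dag_id)]
    for name in deps:
        parts.append(TASK.format(name=name, image=image, namespace=namespace))
    for name, upstream in deps.items():
        for parent in upstream:
            parts.append(f"    {parent} >> {name}\n")
    return "".join(parts)


def deploy(dag_source, dag_id, dags_folder):
    """Write the DAG file locally, then copy it to the synchronized folder."""
    with tempfile.TemporaryDirectory() as build_dir:
        local_file = Path(build_dir) / f"{dag_id}.py"
        local_file.write_text(dag_source)
        target = Path(dags_folder) / local_file.name
        shutil.copy(local_file, target)
    return target


source = render_dag(
    "kedro_pipeline", DEPS, image="registry.example.com/my-kedro-project:latest"
)
# A temporary directory plays the role of the bucket Airflow synchronizes from.
with tempfile.TemporaryDirectory() as dags_folder:
    deployed = deploy(source, "kedro_pipeline", dags_folder)
    deployed_text = deployed.read_text()
```

The point of the sketch is the shape of the workflow: one task per node, each running the project image, with upstream relations written out explicitly, and deployment reduced to placing a single file where Airflow can pick it up.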

Machine Learning pipelines: kedro-airflow plugin

Try the plugins and let us know your thoughts!

If you’re using Kubeflow, feel free to check the quickstart and the rest of the documentation. If you’re using Airflow, we have a quickstart as well. If you decide to give them a try, we’re waiting for your feedback! The plugins are in the Beta phase, but the main API (the way you call it from the Kedro CLI) is now stable, so don’t be afraid to integrate it into your CI/CD pipelines, as we did recently.

If you want to know more, please check our Machine Learning Platform and do not hesitate to contact us.

28 April 2021
