MLflow best practices and lessons learned

Updated: May 4

MLflow is an open source platform for the machine learning life cycle. It has three main features: tracking, models and projects. Before jumping into the MLflow best practices, I would like to briefly introduce the industry-standard processes for machine learning. If you want to skip this part, jump directly to Setting up your environment and installing MLflow.



Machine Learning Development Life Cycle (MLDLC)

The machine learning life cycle is the cyclical process that data science projects follow. There are a few industry standards for such cycles, such as CRISP-DM, ASUM-DM and TDSP, which help individuals and companies succeed in developing and deploying machine learning projects. Below are the 6 generic phases of CRISP-DM (Cross Industry Standard Process for Data Mining):

  1. Business Understanding: The focus of the first phase is on understanding project objectives and requirements from a business perspective and turning that into a data mining problem definition and a preliminary plan designed to achieve the objectives.

  2. Data Understanding: The data understanding phase deals with the initial data load and includes activities such as data quality checks, data discovery and checking how the data correlates with the business hypothesis.

  3. Data Preparation: The main task of the data preparation phase is to process the collected raw data and turn it into meaningful features. Typical tasks during this phase are data cleansing, transformation, normalization, etc. This phase is often said to take up to 80% of a machine learning project's time.

  4. Modelling: In this phase multiple algorithms are tried and compared to find the best model for the business problem. If the prepared data does not answer the business problem, one might go back to an earlier phase. For example, a Data Scientist might decide to add more data sources that they think might increase model performance.

  5. Evaluation: At this stage in the project, you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

  6. Deployment: Once a model fulfills the business requirements and is accurate enough, it can be deployed to production to be used by the business. For example, if you have built a product recommendation model such as Collaborative Filtering (CF), you might want to use it in production to recommend products when a customer visits a product page of your online shop, in order to increase your revenue. For this purpose, you need to deploy your trained model on a server and build an API through which your website can pass a customer profile to the model and receive the highest-scoring products to recommend to the customer.

These phases are at the very core of any ML or data science project, and during a project a Data Scientist or Machine Learning engineer may jump between these phases multiple times in order to fulfill the business requirements and achieve the desired model accuracy. This process can cause stress and confusion and can lead to the failure of the project. MLflow can support a Data Scientist or Machine Learning engineer in keeping track of all their experiments and in fine-tuning their model. It also makes it easy to differentiate between development and production environments. In this blog I will focus on the model tracking feature of MLflow and share lessons learned that you are likely to face as well when working with MLflow.

Setting up your environment and installing MLflow

I will use Conda to manage packages related to this example:

# create a new conda environment
home$ conda create -n mlflow_example
# activate created conda environment
home$ conda activate mlflow_example
# install necessary packages for this example (the conda package for sklearn is named scikit-learn)
(mlflow_example) home$ conda install python=3.6 mlflow xgboost scikit-learn matplotlib pandas

Setting up MLflow tracking server

Before using MLflow to track and log your parameters and metrics, you need to set up an MLflow tracking server. You can do this locally by running the commands below in your project folder:

home$ mlflow server --backend-store-uri ./mlruns --host 127.0.0.1 --port 8080
# the server keeps running in this terminal; run the following from a second terminal in the same folder
home$ ls
mlruns
home$ ls mlruns
0
home$ ls mlruns/0
meta.yaml
home$ cat mlruns/0/meta.yaml
artifact_location: ./mlruns/0
experiment_id: '0'
lifecycle_stage: active
name: Default

Above we have started the MLflow server locally on port 8080. If this port is already in use, simply use a different one. As you can see, after starting the server MLflow has created a folder called mlruns under your project directory. Inside this folder there is already a subfolder called 0, which is the first, automatically generated experiment, called Default. If you do not specify ./mlruns as your --backend-store-uri and, for instance, only specify ./, you will encounter an error when opening the MLflow UI in your browser. Now open your browser and go to 127.0.0.1:8080/. You will see the following dashboard.



Now your MLflow tracking server is up and running. All you need to do is run an experiment in Python or R and log your model parameters to MLflow. Below is a guide on how to run an experiment and use MLflow.


For this example I will run an experiment using an XGBoost (eXtreme Gradient Boosting) model in Python and show you how to log model parameters, metrics and artifacts to MLflow.

First we import the necessary libraries:

from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score, log_loss
import xgboost as xgb 
import matplotlib as mpl
import mlflow 
import mlflow.xgboost
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

I have used the red wine quality dataset from the UCI Machine Learning Repository. The repository provides two datasets, for the red and white variants of the Portuguese "Vinho Verde" wine. You can download the csv file from this link.


The data is loaded with the pandas package.

dataset = pd.read_csv('./winequality-red.csv',sep=';')
dataset.describe()

The goal is to predict the quality of wine based on the above features. Now let's look at the number of data records per quality class:

dataset.loc[:,'quality'].value_counts()







As we can see, the categories are not uniformly distributed, so we will use a stratified split.
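To make the idea concrete, here is a minimal pure-Python sketch of what a stratified split does: every class is divided in the same ratio, so rare classes show up in both parts. The function name stratified_split and the toy labels are made up for illustration. The 0.25 test fraction mirrors the default of scikit-learn's train_test_split, but the sketch will not reproduce the exact rows that scikit-learn selects.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=110):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up, imbalanced labels (not the wine data): 80 / 16 / 4 samples per class.
toy_labels = [0] * 80 + [1] * 16 + [2] * 4
train_idx, test_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

The split used in this article, shown next, gets the same effect from the stratify=y argument of train_test_split.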

X = dataset.drop('quality', axis=1)
y = dataset.loc[:, 'quality']
# map the quality scores to class ids 0..5
y = y.replace({5: 0, 6: 1, 7: 2, 4: 3, 8: 4, 3: 5})
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=110, stratify=y)
scaler = MinMaxScaler()
scaler.fit(X_train,y_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
dtrain = xgb.DMatrix(X_train_scaled, label=y_train)
dtest = xgb.DMatrix(X_test_scaled, label=y_test)
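A note on the replace call above: XGBoost's multi-class objectives expect labels numbered from 0 to num_class - 1, which is why the quality scores are mapped to class ids. The sketch below shows how predicted class ids could be translated back to quality scores. The dictionary is copied from the code above, while the names class_to_quality and example_pred are made up for illustration.

```python
# Mapping copied from the article: original quality score -> class id.
quality_to_class = {5: 0, 6: 1, 7: 2, 4: 3, 8: 4, 3: 5}

# Inverse mapping: class id -> original quality score.
class_to_quality = {cls: quality for quality, cls in quality_to_class.items()}

# Hypothetical predicted class ids, standing in for the model's predictions.
example_pred = [0, 1, 5, 2]
example_quality = [class_to_quality[c] for c in example_pred]
print(example_quality)  # [5, 6, 3, 7]
```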

Now our data is ready to be passed to the model.

# enable auto logging
mlflow.xgboost.autolog()
with mlflow.start_run():
    # train model
    params = {
        'objective': 'multi:softprob',
        'num_class': 6,
        'eval_metric': 'mlogloss',
        'colsample_bytree': 0.9,
        'subsample': 0.9,
        'seed': 6174,
    }
    model = xgb.train(params, dtrain, evals=[(dtrain, 'train')])

    # evaluate the model
    y_proba = model.predict(dtest)
    y_pred = y_proba.argmax(axis=1)
    loss = log_loss(y_test, y_proba)
    acc = accuracy_score(y_test, y_pred)

    # log metrics
    mlflow.log_metrics({'log_loss': loss, 'accuracy': acc})

As you can see above, we have used the mlflow.xgboost.autolog method, which automatically logs parameters, training metrics and artifacts without the need to call mlflow.log_param, mlflow.log_metric, etc. separately; only the two test-set metrics computed by hand are logged explicitly with mlflow.log_metrics. Now let's look at the MLflow UI. Under the Default experiment (0) there is now a new item (run). When you click on this run, you will see all model parameters, the metrics that were logged, and the artifacts.
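If you want to sanity-check what the two explicitly logged metrics mean, the sketch below recomputes them by hand on made-up probabilities. It is a plain-Python illustration of multi-class log loss and accuracy, not the scikit-learn implementation, and the toy numbers are not results from the wine model.

```python
import math


def multiclass_log_loss(y_true, y_proba):
    """Mean negative log of the probability given to the true class."""
    eps = 1e-15  # guards against log(0)
    total = 0.0
    for label, row in zip(y_true, y_proba):
        total += -math.log(max(row[label], eps))
    return total / len(y_true)


def accuracy(y_true, y_proba):
    """Share of samples whose highest-probability class is the true class."""
    hits = 0
    for label, row in zip(y_true, y_proba):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        hits += predicted == label
    return hits / len(y_true)


# Made-up probabilities for three samples and three classes.
toy_true = [0, 2, 1]
toy_proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
toy_loss = multiclass_log_loss(toy_true, toy_proba)
toy_acc = accuracy(toy_true, toy_proba)
print("log_loss:", toy_loss)
print("accuracy:", toy_acc)
```

A lower log loss means the model puts more probability on the correct class, so it can improve even when accuracy stays the same.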

Lessons Learned

One important feature of MLflow is that when you delete a run or an experiment, it is not deleted from the server's disk; it is only labeled as deleted in the lifecycle_stage field of its meta.yaml file. To restore the run or experiment you can either use the mlflow CLI commands below:

# for run:
(mlflow_example) home$ mlflow runs restore --run-id <run_id>
# for experiment:
(mlflow_example) home$ mlflow experiments restore --experiment-id <experiment_id>

or change the lifecycle_stage in the meta.yaml file of the run or experiment from deleted to active.
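The manual edit can also be scripted. The following is a minimal sketch, assuming the meta.yaml uses simple "key: value" lines like the file shown earlier. The helper set_lifecycle_stage is hypothetical and not part of MLflow, and the demo works on a throwaway file rather than a real mlruns folder. Backing up the file before touching a real store seems a sensible precaution.

```python
import os
import tempfile


def set_lifecycle_stage(meta_path, new_stage="active"):
    """Rewrite the lifecycle_stage line of a meta.yaml file in place."""
    with open(meta_path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()

    changed = False
    for i, line in enumerate(lines):
        if line.startswith("lifecycle_stage:"):
            lines[i] = f"lifecycle_stage: {new_stage}"
            changed = True

    if not changed:
        raise ValueError(f"no lifecycle_stage entry in {meta_path}")

    with open(meta_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


# Demo on a throwaway file with the same fields as the meta.yaml shown earlier.
demo_dir = tempfile.mkdtemp()
demo_meta = os.path.join(demo_dir, "meta.yaml")
with open(demo_meta, "w", encoding="utf-8") as handle:
    handle.write(
        "artifact_location: ./mlruns/0\n"
        "experiment_id: '0'\n"
        "lifecycle_stage: deleted\n"
        "name: Default\n"
    )

set_lifecycle_stage(demo_meta)
with open(demo_meta, encoding="utf-8") as handle:
    restored_text = handle.read()
print(restored_text)
```

Where the CLI commands are available, they are probably the safer route, since they go through MLflow itself.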

Another interesting point is that if you are running your MLflow server on a separate computer and multiple users log runs under the same experiments, you may need to take care of file permissions (ACLs). For several users to write to the same experiment, you may need to call os.umask(0o002). With this mask, the default directory permissions are 775 and the default file permissions are 664.
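The effect of the mask can be checked with a short sketch. It assumes a POSIX system, and the directory and file names are placeholders. In a real script the os.umask call would presumably go near the top, before the first MLflow call writes anything.

```python
import os
import tempfile

# Set the process umask as suggested above; os.umask returns the previous value.
previous_umask = os.umask(0o002)
try:
    base_dir = tempfile.mkdtemp()  # mkdtemp itself always creates a private directory
    shared_dir = os.path.join(base_dir, "shared_experiment")
    shared_file = os.path.join(shared_dir, "meta.yaml")

    os.mkdir(shared_dir)  # requested mode 0o777, reduced by the umask
    with open(shared_file, "w", encoding="utf-8") as handle:  # requested mode 0o666
        handle.write("name: demo\n")

    # Keep only the read/write/execute permission bits.
    dir_mode = os.stat(shared_dir).st_mode & 0o777
    file_mode = os.stat(shared_file).st_mode & 0o777
finally:
    os.umask(previous_umask)  # restore the original umask

print(oct(dir_mode), oct(file_mode))
```

Group members can then write into directories and files created by other users, provided all users share the same group.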

One other important fact to consider is that MLflow currently only supports postgresql, mysql, sqlite and mssql databases as the backend for registering models. The MLflow Model Registry component is a centralized model store. It provides model lineage (which MLflow experiment and run produced the model), model versioning, stage transitions (for example from staging to production), and annotations. In the above example our MLflow server is not connected to any such database backend. The steps for connecting one and registering a model with MLflow follow.
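As a quick guard, a script could check whether a backend store URI names one of these databases before relying on the registry. The sketch below is based only on the list given in this article, not on MLflow's own code. The helper looks_registry_capable is hypothetical, and the example URIs are placeholders.

```python
# Database dialects named in this article as supported by the Model Registry.
REGISTRY_DIALECTS = {"postgresql", "mysql", "sqlite", "mssql"}


def looks_registry_capable(backend_store_uri):
    """Rough check: does the URI scheme name one of the listed dialects?"""
    if "://" not in backend_store_uri:
        return False  # plain paths such as ./mlruns have no scheme
    scheme = backend_store_uri.split("://", 1)[0].lower()
    dialect = scheme.split("+", 1)[0]  # drop an optional driver, e.g. mysql+pymysql
    return dialect in REGISTRY_DIALECTS


for uri in ("./mlruns", "sqlite:///mlflow.db", "mysql://user:secret@localhost/db_mlflow"):
    print(uri, "->", looks_registry_capable(uri))
```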

How to register your model via MLflow

Before making any changes to the MLflow server, let's create a database in MySQL by executing a CREATE statement:

>CREATE DATABASE IF NOT EXISTS db_mlflow;

Once the database db_mlflow is created, we can use it when starting our server, this time specifying more options of the MLflow CLI. First, however, we need to install the mysqlclient package so that mlflow can write to the database:

(mlflow_example) home$ conda install mysqlclient
(mlflow_example) home$ mlflow server --backend-store-uri mysql://root:${password}@localhost/db_mlflow --default-artifact-root ./mlruns --host 127.0.0.1 --port 8080

Once you start the server and run a new experiment, you will notice that your experiment does not show up in the UI and is instead written to the ./mlruns directory. The reason is that by default mlflow writes to ./mlruns. Therefore, in this case you should set the tracking URI in your Python script before logging anything, using the following command:

# ${password} is a placeholder: Python does not expand it, so put the actual MySQL password here
mlflow.set_tracking_uri('mysql://root:${password}@localhost/db_mlflow')
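Since Python does not expand ${password} the way the shell does, one option is to assemble the URI in the script. The sketch below is a hedged illustration: build_mysql_uri is a hypothetical helper, MYSQL_PASSWORD is an assumed environment variable name, and the fallback password is a dummy value. Percent-encoding keeps characters such as @ or / in the password from breaking the URI.

```python
import os
from urllib.parse import quote_plus


def build_mysql_uri(user, password, host, database):
    """Assemble a mysql:// URI, percent-encoding the user name and password."""
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# MYSQL_PASSWORD is a hypothetical variable name; the fallback is a dummy value.
password = os.environ.get("MYSQL_PASSWORD", "p@ss/word")
tracking_uri = build_mysql_uri("root", password, "localhost", "db_mlflow")
print(tracking_uri)
```

The resulting string can then be passed to mlflow.set_tracking_uri in place of the hard-coded one.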

Calling mlflow.set_tracking_uri this way sends all logging to the MySQL database instead of ./mlruns. Now if you open the MLflow UI you should see your experiment run. Now let's register this model in the Model Registry. Simply go to your run page in the UI and click on the model folder under the Artifacts section. Select register and specify a name for your model. Once this is done, a new external link appears on the right side of the model. When you open the link, you can see all information about the registered model, and you can change the stage of the model between development and production.


Copyright © 2020 Asigmo. All rights reserved.