The ingredients of a reproducible machine learning model



Chloe Mawer, PhD

Principal Data Scientist, Lineage Logistics
Adjunct Lecturer, Masters of Science in Analytics, Northwestern University

Irreproducibility in the wild

Steps of a machine learning model

How to write unmaintainable code by Roedy Green

[Figure: forgetting how your own code works]

Why is this so hard?

Randomness is everywhere


  • Sampling of data for training
  • Train/test split
  • Model initialization
  • Sampling of data within algorithm
  • Order in which the model is exposed to the data during training
  • Sampling of data for evaluation and cross validation
  • And more!

The path is long

Steps of a machine learning model

[Figure: machine learning steps, with code and parameters]

[Figure: machine learning steps, with code, parameters and artifacts]

Ingredients of a reproducible model

Determinism

Find every random state and parameterize it
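
A minimal sketch of what parameterizing random states can look like, using only the Python standard library. The function names and the `seeds` dictionary are hypothetical; the point is that each random step takes its seed as an explicit argument instead of relying on hidden global state.

```python
import random


def split_train_test(rows, test_fraction, seed):
    """Shuffle a copy of rows with a dedicated, seeded generator, then split."""
    rng = random.Random(seed)  # local generator: no hidden global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def sample_for_evaluation(rows, k, seed):
    """Draw k rows for evaluation using its own seed."""
    return random.Random(seed).sample(list(rows), k)


# Hypothetical settings; in practice these would live in the versioned config.
seeds = {"split": 42, "evaluation": 7}
rows = list(range(100))

train, test = split_train_test(rows, test_fraction=0.2, seed=seeds["split"])
train_again, test_again = split_train_test(rows, test_fraction=0.2, seed=seeds["split"])
print(train == train_again and test == test_again)  # True
```

Giving each step its own seed means one step can be changed without silently shifting the random draws of the others.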

Versioning of everything

Versioning code is not enough

├── src                               <- Source code for the model
│   ├── ingest_data.py                <- Script for ingesting data from different sources 
│   ├── generate_features.py          <- Script for cleaning and transforming data for use in training and scoring.
│   ├── train_model.py                <- Script for training machine learning model(s)
│   ├── score_model.py                <- Script for scoring new predictions using a trained model.
│   ├── postprocess.py                <- Script for postprocessing predictions and model results
│   ├── evaluate_model.py             <- Script for evaluating model performance 
│
├── run.py                            <- Simplifies the execution of one or more of the src scripts 
├── requirements.txt                  <- Python package dependencies 

Parameters and settings

config.yml
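
One way to use such a file, sketched under assumptions: every step reads its arguments from its own section of the configuration, so no parameter is hard-coded in a script. The keys and the `generate_features` signature below are hypothetical, and the dictionary stands in for what a YAML parser would return after reading config.yml.

```python
# Hypothetical contents of config.yml, shown as the dict a YAML parser would return.
config = {
    "generate_features": {"columns": ["age", "income"], "fill_value": 0},
    "train_model": {"test_fraction": 0.2, "random_state": 42},
}


def generate_features(records, columns, fill_value):
    """Keep only the configured columns, filling missing values."""
    return [[r.get(c, fill_value) for c in columns] for r in records]


records = [{"age": 31, "income": 52000}, {"age": 45}]

# The step receives exactly the settings recorded under its own name.
features = generate_features(records, **config["generate_features"])
print(features)  # [[31, 52000], [45, 0]]
```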

Data

  • At minimum, version the explicit query and include any filters used in the configuration.
  • Source data can change, so even this is not sufficient in many cases.
  • Ideally, version the entire training dataset with tools like Git LFS, S3, or your own tables in HDFS or the database of your choosing.
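
Whichever storage is used, a content hash gives a cheap way to notice that the data changed. The sketch below is an illustration using Python's standard `hashlib`; the file name and its contents are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file standing in for the training data.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "training.csv"
    data_path.write_text("id,label\n1,0\n2,1\n")
    original = file_sha256(data_path)
    data_path.write_text("id,label\n1,0\n2,0\n")  # one label changed upstream
    changed = file_sha256(data_path)

print(original != changed)  # True
```

Recording the digest next to the trained model makes it possible to check later whether a retraining run saw the same data.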

Features

  • If a feature changes, the downstream models change too.
  • Often a feature is the output of another model.
  • Ideally, each feature should be treated as a model output and versioned accordingly.

Auxiliary data

  • Models can be highly dependent on auxiliary data, such as the options for categorical variables.
  • If this data gets out of sync with the model files or code, it can cause code to fail.
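
A hedged sketch of a guard against that drift: keep the categorical options seen in training with the model, and compare incoming values against them before scoring. The variable name and category values are hypothetical.

```python
def check_categories(expected, observed):
    """Return values present in new data that the model was never trained on."""
    return sorted(set(observed) - set(expected))


# Hypothetical auxiliary data saved alongside the trained model.
trained_categories = {"warehouse_type": ["cold", "dry", "frozen"]}

new_values = ["cold", "frozen", "ambient"]
unknown = check_categories(trained_categories["warehouse_type"], new_values)
print(unknown)  # ['ambient']
```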

Trained model objects

Workflows

  • Something needs to remember how the steps were strung together.
  • Use tools like Makefiles, Airflow, or Luigi, and version the workflow definitions.
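
The idea can be illustrated without any of those tools. In this toy sketch the step functions are stand-ins, and the only point is that the order of execution is written down in one versioned place.

```python
def ingest():
    return [3, 1, 2]


def generate_features(raw):
    return sorted(raw)


def train(features):
    return {"mean": sum(features) / len(features)}


# The order lives in one versioned place instead of in someone's memory.
PIPELINE = [ingest, generate_features, train]


def run(pipeline):
    """Run each step in order, feeding each output to the next step."""
    result = None
    for i, step in enumerate(pipeline):
        result = step() if i == 0 else step(result)
    return result


print(run(PIPELINE))  # {'mean': 2.0}
```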

Version them all together

  • Commit hashes
  • Manually curated version list
  • Dates
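
One possible shape for a combined version record, sketched with the standard library. The field names are assumptions, and the commit hash, digests, and date below are placeholders, not real values.

```python
import hashlib
import json
from datetime import date


def build_manifest(commit_hash, config, data_sha256, model_sha256, run_date):
    """Collect the version of every ingredient into one record."""
    # Sorting keys makes the config hash independent of key order.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "code_commit": commit_hash,
        "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
        "data_sha256": data_sha256,
        "model_sha256": model_sha256,
        "date": run_date.isoformat(),
    }


manifest = build_manifest(
    commit_hash="abc1234",                 # placeholder, e.g. from `git rev-parse HEAD`
    config={"train_model": {"random_state": 42}},
    data_sha256="0" * 64,                  # placeholder digest
    model_sha256="f" * 64,                 # placeholder digest
    run_date=date(2020, 1, 1),             # placeholder date
)
print(json.dumps(manifest, indent=2))
```

Saving this record next to the model object ties one set of predictions to one commit, one configuration, and one dataset.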

Reproducibility testing

Traditional software testing is not enough

(Though you should definitely still do it!)

Model testing
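
A minimal sketch of a reproducibility test: run the training step twice with the same inputs and seed, and require identical results. `train_model` here is a toy stand-in, not a real training routine.

```python
import random


def train_model(data, seed):
    """Stand-in for a real training step: the mean of a seeded bootstrap sample."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in range(len(data))]
    return sum(sample) / len(sample)


def test_training_is_reproducible():
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    first = train_model(data, seed=123)
    second = train_model(data, seed=123)
    # Same data, same seed, same code: the result must match exactly.
    assert first == second


test_training_is_reproducible()
```

The same pattern extends to comparing a fresh run against stored expected outputs from a known-good version.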

Environment management

requirements.txt
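
Pinned versions only help if the running environment matches them. The sketch below, assuming Python 3.8+ and simple `name==version` pins, compares a requirements file against what is installed using the standard `importlib.metadata` module. The package pins shown are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(requirements_text):
    """Parse 'name==version' lines, skipping blanks, comments and unpinned lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)}; installed is None if missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical pins for illustration only.
pins = parse_pins("# model dependencies\nnumpy==1.16.4\npandas==0.24.2\n")
print(pins)  # {'numpy': '1.16.4', 'pandas': '0.24.2'}
```

Calling `find_mismatches(pins)` at the start of a run gives an early warning when the environment has drifted from the pinned versions.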

conda

Code alignment

Steps of a machine learning model - offline

Steps of a machine learning model - online

Thank you!



You can find these slides at
https://cmawer.github.io/reproducible-model


and the reproducible model template repo at
https://github.com/cmawer/reproducible-model


Chloe Mawer | Lineage Logistics

cmawer@lineagelogistics.com | @chloemawer