How to write unmaintainable code by Roedy Green
├── src <- Source code for the model
│   ├── ingest_data.py <- Script for ingesting data from different sources
│   ├── generate_features.py <- Script for cleaning and transforming data for use in training and scoring
│   ├── train_model.py <- Script for training machine learning model(s)
│   ├── score_model.py <- Script for scoring new predictions using a trained model
│   ├── postprocess.py <- Script for postprocessing predictions and model results
│   ├── evaluate_model.py <- Script for evaluating model performance
│
├── run.py <- Simplifies the execution of one or more of the src scripts
├── requirements.txt <- Python package dependencies
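As an illustration of how a single entry point such as run.py could dispatch to the scripts in src, here is a minimal sketch using argparse subcommands. The step names mirror the script names in the tree, and the `--config` and `--csv` flags mirror the commands shown in these slides; the `build_parser` and `main` functions are hypothetical names and are not taken from the template repo.

```python
import argparse

# Step names assumed to match the scripts listed under src/.
STEPS = [
    "ingest_data",
    "generate_features",
    "train_model",
    "score_model",
    "postprocess",
    "evaluate_model",
]


def build_parser():
    """Build a parser with one subcommand per pipeline step."""
    parser = argparse.ArgumentParser(description="Run one step of the model pipeline")
    subparsers = parser.add_subparsers(dest="step", required=True)
    for step in STEPS:
        sub = subparsers.add_parser(step)
        sub.add_argument("--config", help="Path to the YAML configuration file")
        sub.add_argument("--csv", help="Optional path to an input CSV file")
    return parser


def main(argv=None):
    """Parse the command line and describe the step that would be run."""
    args = build_parser().parse_args(argv)
    # A real run.py would import and call the matching function from src/ here.
    return f"{args.step} (config={args.config}, csv={args.csv})"
```

Calling `main(["train_model", "--config=config/example-training-config.yml"])` parses the same style of command line that the slides pass to run.py.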
model:
  name: example-model
  author: Chloe Mawer
  version: AA1
  description: Predicts a random result given some arbitrary data inputs as an example of this config file
  tags:
    - classifier
    - housing
  dependencies: requirements.txt
load_data:
  how: csv
  csv:
    path: data/sample/boston_house_prices.csv
    usecols: [CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT]
generate_features:
  make_categorical:
    columns: RAD
    RAD:
      categories: [1, 2, 3, 5, 4, 8, 6, 7, 24]
      one_hot_encode: True
  bin_values:
    columns: CRIM
    quartiles: 2
  save_dataset: test/test/boston_house_prices_processed.csv
train_model:
  method: xgboost
  choose_features:
    features_to_use: [ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO]
  get_target:
    target: CRIM
  split_data:
    train_size: 0.5
    test_size: 0.25
    validate_size: 0.25
    random_state: 24
    save_split_prefix: test/test/example-boston
  params:
    max_depth: 100
    learning_rate: 50
    random_state: 1019
  fit:
    eval_metric: auc
    verbose: True
  save_tmo: models/example-boston-crime-prediction.pkl
evaluate_model:
  metrics: [auc, accuracy, logloss]
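To show why a seeded, config-driven step is reproducible, here is a minimal sketch that applies the `split_data` settings from the config to a list of rows. It uses only the standard library and is a stand-in for illustration, not the template's own implementation; the `split_rows` function is a hypothetical name, and `save_split_prefix` is left out because nothing is written to disk here.

```python
import random

# The split_data settings from the config, written as a Python dict.
split_config = {
    "train_size": 0.5,
    "test_size": 0.25,
    "validate_size": 0.25,
    "random_state": 24,
}


def split_rows(rows, train_size, test_size, validate_size, random_state):
    """Shuffle rows with a fixed seed and cut them into train/test/validate lists."""
    if abs(train_size + test_size + validate_size - 1.0) > 1e-9:
        raise ValueError("split sizes must sum to 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(random_state).shuffle(shuffled)
    n_train = int(len(shuffled) * train_size)
    n_test = int(len(shuffled) * test_size)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "validate": shuffled[n_train + n_test:],
    }


# Unpacking the config section as keyword arguments ties the code to the file.
splits = split_rows(range(100), **split_config)
```

Because every argument comes from the config and the shuffle is seeded, running the step twice with the same config yields the same three subsets.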
config.yml
gitlfs, S3, or your own tables in HDFS or the database of your choosing.
train_model:
  command: python run.py train_model --config=config/example-training-config.yml --csv=test/test/boston_house_prices_processed.csv
  true_dir: test/true/
  test_dir: test/test/
  files_to_compare:
    - example-boston-train-features.csv
    - example-boston-train-targets.csv
    - example-boston-test-features.csv
    - example-boston-test-targets.csv
    - example-boston-validate-features.csv
    - example-boston-validate-targets.csv
    - example-boston-fitted-params.yml

generate_features:
  command: python run.py generate_features --config=config/example-training-config.yml
  true_dir: test/true/
  test_dir: test/test/
  files_to_compare:
    - boston_house_prices_processed.csv
train_model:
  command: python run.py train_model --config=config/example-training-config.yml --csv=test/test/boston_house_prices_processed.csv
  true_dir: test/true/
  test_dir: test/test/
  files_to_compare:
    - example-boston-train-features.csv
    - example-boston-train-targets.csv
    - example-boston-test-features.csv
    - example-boston-test-targets.csv
    - example-boston-validate-features.csv
    - example-boston-validate-targets.csv
    - example-boston-fitted-params.yml
score_model:
  command: python run.py score_model --csv=test/test/example-boston-validate.csv --config=config/example-training-config.yml
  true_dir: test/true/
  test_dir: test/test/
  files_to_compare:
    - example_boston_scores.csv
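One way to read each entry of the test config: run `command`, then check that every file in `files_to_compare` is identical in `test_dir` and `true_dir`. The sketch below implements that reading with the standard library only. It is an assumption about how such a check could work, not the template's actual test runner, and `run_reproducibility_test` is a hypothetical name.

```python
import filecmp
import shlex
import subprocess
from pathlib import Path


def run_reproducibility_test(command, true_dir, test_dir, files_to_compare):
    """Run the command, then report per file whether output matches the expected copy."""
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(shlex.split(command), check=True)
    results = {}
    for name in files_to_compare:
        expected = Path(true_dir) / name
        actual = Path(test_dir) / name
        # shallow=False compares file contents rather than only os.stat metadata.
        results[name] = (
            expected.is_file()
            and actual.is_file()
            and filecmp.cmp(expected, actual, shallow=False)
        )
    return results
```

A byte-for-byte comparison is strict on purpose: any change in code, data, or package versions that alters an output file shows up as a `False` entry for that file.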
requirements.txt
conda
You can find these slides at https://cmawer.github.io/reproducible-model
and the reproducible model template repo at https://github.com/cmawer/reproducible-model
Chloe Mawer | Lineage Logistics
cmawer@lineagelogistics.com | @chloemawer