

Logging with a schema

Create a rubicon_ml project

[1]:
from rubicon_ml import Rubicon

rubicon = Rubicon(persistence="memory", auto_git_enabled=True)
project = rubicon.create_project(name="apply schema")
project
[1]:
<rubicon_ml.client.project.Project at 0x11c99e890>
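
The project above is held in memory, which is convenient for examples but not persistent. The same client can write to local disk instead; a minimal sketch, assuming the standard "filesystem" persistence mode and root_dir argument (adjust the path for your environment):

from rubicon_ml import Rubicon

# persist projects and experiments to local disk instead of in memory
rubicon_filesystem = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
persistent_project = rubicon_filesystem.get_or_create_project(name="apply schema")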

Train a RandomForestClassifier

Load a training dataset

[2]:
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)

Train an instance of the model the schema represents

[3]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    ccp_alpha=5e-3,
    criterion="log_loss",
    max_features="log2",
    n_estimators=24,
    oob_score=True,
    random_state=121,
)
rfc.fit(X, y)

print(rfc)
RandomForestClassifier(ccp_alpha=0.005, criterion='log_loss',
                       max_features='log2', n_estimators=24, oob_score=True,
                       random_state=121)
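
Because oob_score=True, the fitted model also carries an out-of-bag accuracy estimate. The schema logs it as a metric further down, but it can be checked directly on the estimator:

# out-of-bag accuracy computed by scikit-learn during fit
print(f"OOB score: {rfc.oob_score_:.4f}")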

Infer schema and log model metadata

Log the model metadata defined in the applied schema to a new experiment in the project with project.log_with_schema

Note: project.log_with_schema will infer the correct schema from the type of the object given to log

[4]:
experiment = project.log_with_schema(
    rfc,
    experiment_kwargs={  # additional kwargs to be passed to `project.log_experiment`
        "name": "log with schema",
        "model_name": "RandomForestClassifier",
        "description": "logged with the `RandomForestClassifier` `rubicon_schema`",
    },
)

print(f"inferred schema name: {project.schema_['name']}")
experiment
inferred schema name: sklearn__RandomForestClassifier
[4]:
<rubicon_ml.client.experiment.Experiment at 0x16d392b10>
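
For comparison, the same kind of metadata could be logged by hand with the lower-level client methods; a rubicon_schema simply automates those calls. A rough sketch, assuming the standard log_experiment, log_parameter, log_metric and log_feature methods (not a replacement for the schema workflow):

manual_experiment = project.log_experiment(name="log by hand")

# one call per parameter, metric and feature that the schema handles automatically
manual_experiment.log_parameter(name="n_estimators", value=rfc.n_estimators)
manual_experiment.log_metric(name="oob_score", value=rfc.oob_score_)
manual_experiment.log_feature(name="alcohol", importance=rfc.feature_importances_[0])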

View the experiment’s logged metadata

Each experiment contains all the data represented in the schema. More information on the data captured by a rubicon_schema can be found in the “Representing model metadata with a rubicon_schema” section

[5]:
vars(experiment._domain)
[5]:
{'project_name': 'apply schema',
 'id': 'ec4c3ead-3337-4623-9a97-c61f48e8de3d',
 'name': 'log with schema',
 'description': 'logged with the `RandomForestClassifier` `rubicon_schema`',
 'model_name': 'RandomForestClassifier',
 'branch_name': 'schema',
 'commit_hash': 'c9f696408a03c6a6fbf2fbff39fa48bbf722bae1',
 'training_metadata': None,
 'tags': [],
 'created_at': datetime.datetime(2023, 9, 25, 15, 47, 37, 552091)}
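
Logged experiments can be retrieved later through the usual client accessors; a small sketch assuming the standard project.experiments() method:

for logged_experiment in project.experiments():
    print(logged_experiment.name, logged_experiment.id)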

The features and their importances are logged as defined in the schema’s “features” section

[6]:
project.schema_["features"]
[6]:
[{'names_attr': 'feature_names_in_',
  'importances_attr': 'feature_importances_',
  'optional': True}]
[7]:
for feature in experiment.features():
    print(f"{feature.name} ({feature.importance})")
alcohol (0.1276831830349219)
malic_acid (0.03863837532736449)
ash (0.006168227239831861)
alcalinity_of_ash (0.025490751927615605)
magnesium (0.02935763050777937)
total_phenols (0.058427899304369986)
flavanoids (0.15309812550131274)
nonflavanoid_phenols (0.007414542189797497)
proanthocyanins (0.012615187741781065)
color_intensity (0.13608806341133572)
hue (0.0892558912217226)
od280/od315_of_diluted_wines (0.15604181694153108)
proline (0.15972030565063608)
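
Since each logged feature carries its importance, the retrieved objects can be ranked directly, for example:

# three most important features according to the fitted model
top_features = sorted(experiment.features(), key=lambda f: f.importance, reverse=True)
for feature in top_features[:3]:
    print(f"{feature.name}: {feature.importance:.3f}")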

Each parameter and its value are logged as defined in the schema’s “parameters” section

[8]:
project.schema_["parameters"]
[8]:
[{'name': 'bootstrap', 'value_attr': 'bootstrap'},
 {'name': 'ccp_alpha', 'value_attr': 'ccp_alpha'},
 {'name': 'class_weight', 'value_attr': 'class_weight'},
 {'name': 'criterion', 'value_attr': 'criterion'},
 {'name': 'max_depth', 'value_attr': 'max_depth'},
 {'name': 'max_features', 'value_attr': 'max_features'},
 {'name': 'min_impurity_decrease', 'value_attr': 'min_impurity_decrease'},
 {'name': 'max_leaf_nodes', 'value_attr': 'max_leaf_nodes'},
 {'name': 'max_samples', 'value_attr': 'max_samples'},
 {'name': 'min_samples_split', 'value_attr': 'min_samples_split'},
 {'name': 'min_samples_leaf', 'value_attr': 'min_samples_leaf'},
 {'name': 'min_weight_fraction_leaf',
  'value_attr': 'min_weight_fraction_leaf'},
 {'name': 'n_estimators', 'value_attr': 'n_estimators'},
 {'name': 'oob_score', 'value_attr': 'oob_score'},
 {'name': 'random_state', 'value_attr': 'random_state'}]
[9]:
for parameter in experiment.parameters():
    print(f"{parameter.name}: {parameter.value}")
bootstrap: True
ccp_alpha: 0.005
class_weight: None
criterion: log_loss
max_depth: None
max_features: log2
min_impurity_decrease: 0.0
max_leaf_nodes: None
max_samples: None
min_samples_split: 2
min_samples_leaf: 1
min_weight_fraction_leaf: 0.0
n_estimators: 24
oob_score: True
random_state: 121
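
Every name in the schema’s “parameters” section maps onto a RandomForestClassifier constructor argument, so the logged values should round-trip against the estimator itself; a quick check:

# compare each logged parameter to the corresponding scikit-learn parameter
logged_parameters = {p.name: p.value for p in experiment.parameters()}
assert all(rfc.get_params()[name] == value for name, value in logged_parameters.items())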

Each metric and its value are logged as defined in the schema’s “metrics” section

[10]:
project.schema_["metrics"]
[10]:
[{'name': 'classes', 'value_attr': 'classes_'},
 {'name': 'n_classes', 'value_attr': 'n_classes_'},
 {'name': 'n_features_in', 'value_attr': 'n_features_in_'},
 {'name': 'n_outputs', 'value_attr': 'n_outputs_'},
 {'name': 'oob_decision_function',
  'value_attr': 'oob_decision_function_',
  'optional': True},
 {'name': 'oob_score', 'value_attr': 'oob_score_', 'optional': True}]
[11]:
import numpy as np

for metric in experiment.metrics():
    if np.isscalar(metric.value):
        print(f"{metric.name}: {metric.value}")
    else:  # don't print long metrics
        print(f"{metric.name}: ...")
classes: ...
n_classes: 3
n_features_in: 13
n_outputs: 1
oob_decision_function: ...
oob_score: 0.9775280898876404
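
Individual metric values can be pulled back out of the logged collection as well; for example, the logged out-of-bag score comes straight from the estimator’s oob_score_ attribute:

oob_metric = [m for m in experiment.metrics() if m.name == "oob_score"][0]

print(f"logged oob_score: {oob_metric.value}")
print(f"model oob_score_: {rfc.oob_score_}")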

A copy of the trained model is logged as defined in the schema’s “artifacts” section

[12]:
project.schema_["artifacts"]
[12]:
['self']
[13]:
for artifact in experiment.artifacts():
    print(f"{artifact.name}:\n{artifact.get_data(unpickle=True)}")
RandomForestClassifier:
RandomForestClassifier(ccp_alpha=0.005, criterion='log_loss',
                       max_features='log2', n_estimators=24, oob_score=True,
                       random_state=121)
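
Because the artifact is the pickled model itself, it can be unpickled and used for inference straight from the experiment; a small usage sketch:

# reload the logged model and score a few training rows with it
logged_model = experiment.artifacts()[0].get_data(unpickle=True)
print(logged_model.predict(X.head()))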