Representing model metadata with a schema¶
A rubicon-ml schema is a YAML file defining how attributes of a Python object, generally representing a model, will be logged to a rubicon-ml experiment. Schema can be used to automatically instrument and standardize the rubicon-ml logging of commonly used model objects.
Schema are used to log experiments to an existing rubicon-ml project. Experiments consist of features, parameters, metrics, artifacts, and dataframes. More info on each of these can be found in rubicon-ml’s glossary.
A simple schema¶
Consider the following objects from a module called my_model:

import pandas as pd


class Optimizer:
    def optimize(self, X, y, target):
        # record that optimization ran so a schema can log it as a metric
        self.optimized_ = True
        return "optimized"


class Model:
    def __init__(self, alpha=1e-3, gamma=1e-3):
        self.alpha = alpha
        self.gamma = gamma

    def fit(self, X, y):
        self.optimizer = Optimizer()
        self.target = "y"
        self.feature_names_in_ = X.columns
        self.feature_importances_ = [1.0 / len(X.columns)] * len(X.columns)
        # delegate to the optimizer created above
        self.learned_attribute_ = self.optimizer.optimize(X, y, self.target)
        return self

    def score(self, X):
        self.score_ = 1.0
        # one-row summary of the parameters and learned values
        self.summary_ = pd.DataFrame(
            [[self.alpha, self.gamma, self.learned_attribute_, self.score_]],
            columns=["alpha", "gamma", "learned_attribute", "score"],
        )
        return self.score_
The following is a complete YAML representation of the Model object’s schema:

name: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

artifacts:
  - self
  - name: optimizer
    data_object_attr: optimizer
dataframes:
  - name: summary
    df_attr: summary_
features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true
  - name_attr: target
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true
  - name: score
    value_attr: score_
  - name: env_metric
    value_env: METRIC
parameters:
  - name: alpha
    value_attr: alpha
  - name: gamma
    value_attr: gamma
  - name: env_param
    value_env: PARAMETER
Schema metadata¶
The first section of the schema defines metadata about the schema itself, like the name and version. The name of a schema should be the name of the library that provides the class it represents and the name of the Python class itself, separated by a double underscore.

name: my_model__Model
version: 1.0.0
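As a minimal sketch of the naming convention, the helper below builds a schema name from a library name and a class. `schema_name` is a hypothetical helper written for illustration; it is not part of rubicon-ml.

```python
def schema_name(library: str, cls: type) -> str:
    """Join a library name and a class name with a double underscore."""
    return f"{library}__{cls.__name__}"


class Model:  # stand-in for `my_model.Model`
    pass


print(schema_name("my_model", Model))  # my_model__Model
```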
The next section defines any dependencies the model object has on external Python libraries. Generally, this will be at least the library the object is imported from. Reference documentation for the object to be logged can also be included in this section.

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html
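The text does not say how, or whether, rubicon-ml enforces these bounds. Purely as an illustration, here is one way a min_version and max_version pair could be checked against an installed version. The `parse_version` and `is_compatible` helpers are hypothetical, and the tuple comparison only handles purely numeric dotted versions such as "1.0.5".

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as "1.0.5" into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(installed: str, min_version=None, max_version=None) -> bool:
    """Return True when `installed` falls inside the optional, inclusive bounds."""
    current = parse_version(installed)
    if min_version and current < parse_version(min_version):
        return False
    if max_version and current > parse_version(max_version):
        return False
    return True


# the `pandas` entry from the schema; an empty `max_version:` parses to None
compatibility = {"pandas": {"max_version": None, "min_version": "1.0.5"}}

print(is_compatible("2.1.0", **compatibility["pandas"]))  # True
print(is_compatible("1.0.0", **compatibility["pandas"]))  # False
```

Leaving max_version empty, as the example schema does, means only the lower bound applies in this sketch.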
The remaining sections define how the attributes of the object will be logged to the rubicon-ml experiment. In general, each section is a list of attributes to log to rubicon-ml with a name for the logged metadata and the name of the attribute containing the value to log.
Artifacts¶
Define a rubicon_ml.Artifact for logging by providing a name for the logged artifact and the attribute data_object_attr containing the object to log. The special keyword self will log the full object the schema represents as an artifact with the same name as the object’s class.

artifacts:
  - self  # logs this Model as an artifact named "Model"
  - name: optimizer  # logs Optimizer in `optimizer` attribute as an artifact named "optimizer"
    data_object_attr: optimizer
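The sketch below illustrates how the two kinds of entries could be resolved into named objects. It is not rubicon-ml's implementation: `resolve_artifacts` is a hypothetical helper, the classes are empty stand-ins, and the `artifacts` list is written out by hand in the form the YAML above would parse to.

```python
class Optimizer:
    pass


class Model:
    def __init__(self):
        self.optimizer = Optimizer()


def resolve_artifacts(obj, artifact_items):
    """Return (artifact name, object) pairs for each entry in `artifacts`."""
    resolved = []
    for item in artifact_items:
        if item == "self":
            # the special keyword: use the whole object, named after its class
            resolved.append((type(obj).__name__, obj))
        else:
            resolved.append((item["name"], getattr(obj, item["data_object_attr"])))
    return resolved


# the `artifacts` section as it would parse from YAML
artifacts = ["self", {"name": "optimizer", "data_object_attr": "optimizer"}]

model = Model()
for name, data_object in resolve_artifacts(model, artifacts):
    print(name, type(data_object).__name__)
```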
Dataframes¶
Define a rubicon_ml.Dataframe for logging by providing a name for the logged dataframe and the attribute df_attr containing the DataFrame to log.

dataframes:
  - name: summary  # logs DataFrame in `summary_` attribute as a dataframe named "summary"
    df_attr: summary_
Features¶
Define a single rubicon_ml.Feature for logging by providing the attribute name_attr containing the name of the feature to log and optionally the attribute importance_attr containing the feature’s importance.
Lists of features can be defined for logging with the attributes names_attr containing a list of feature names to log and optionally importances_attr containing the corresponding importances.

features:
  - names_attr: feature_names_in_  # for each value in the `feature_names_in_` attribute, logs a feature named that
                                   # value with the corresponding importance in the `feature_importances_` attribute
    importances_attr: feature_importances_
    optional: true
  - name_attr: target  # logs a feature named the value of the `target` attribute
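To make the difference between the single and list forms concrete, the following sketch pairs names with importances for both. `resolve_features` is a hypothetical helper, not rubicon-ml code, and a `SimpleNamespace` with plain lists stands in for a fitted model.

```python
from types import SimpleNamespace


def resolve_features(obj, feature_items):
    """Return (feature name, importance) pairs for each entry in `features`."""
    features = []
    for item in feature_items:
        if "names_attr" in item:
            # list form: pair each name with its importance, if one was given
            names = list(getattr(obj, item["names_attr"]))
            if "importances_attr" in item:
                importances = list(getattr(obj, item["importances_attr"]))
            else:
                importances = [None] * len(names)
            features.extend(zip(names, importances))
        else:
            # single form: one name, with an optional importance
            name = getattr(obj, item["name_attr"])
            importance = None
            if "importance_attr" in item:
                importance = getattr(obj, item["importance_attr"])
            features.append((name, importance))
    return features


features_section = [
    {"names_attr": "feature_names_in_", "importances_attr": "feature_importances_"},
    {"name_attr": "target"},
]

fitted = SimpleNamespace(
    feature_names_in_=["a", "b"],
    feature_importances_=[0.5, 0.5],
    target="y",
)

print(resolve_features(fitted, features_section))
# [('a', 0.5), ('b', 0.5), ('y', None)]
```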
Metrics¶
Define a rubicon_ml.Metric for logging by providing a name for the logged metric and the attribute value_attr containing the metric value to log.
Metric values can also be extracted from the runtime environment. Replace value_attr with value_env to leverage os.environ to read the metric value from the available environment variables.

metrics:
  - name: learned_attribute  # logs value in `learned_attribute_` attribute as a metric named "learned_attribute"
    value_attr: learned_attribute_
    optional: true
  - name: score  # logs value in `score_` attribute as a metric named "score"
    value_attr: score_
  - name: env_metric  # logs value in `METRIC` environment variable as a metric named "env_metric"
    value_env: METRIC
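The choice between value_attr and value_env can be sketched as a single lookup function. This is an illustration under assumptions: `resolve_value` is a hypothetical helper, and the text does not say whether rubicon-ml converts environment values, which os.environ always returns as strings.

```python
import os
from types import SimpleNamespace


def resolve_value(obj, item):
    """Read a value from the environment or from an attribute of `obj`."""
    if "value_env" in item:
        # environment values are strings; a missing variable raises KeyError here
        return os.environ[item["value_env"]]
    return getattr(obj, item["value_attr"])


metrics_section = [
    {"name": "score", "value_attr": "score_"},
    {"name": "env_metric", "value_env": "METRIC"},
]

os.environ["METRIC"] = "0.9"  # set here only so the example is self-contained
model = SimpleNamespace(score_=1.0)

resolved = {item["name"]: resolve_value(model, item) for item in metrics_section}
print(resolved)  # {'score': 1.0, 'env_metric': '0.9'}
```

The same lookup would apply unchanged to entries in the parameters section, which use the same two keys.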
Parameters¶
Define a rubicon_ml.Parameter for logging by providing a name for the logged parameter and the attribute value_attr containing the parameter value to log.
Parameter values can also be extracted from the runtime environment. Replace value_attr with value_env to leverage os.environ to read the parameter value from the available environment variables.

parameters:
  - name: alpha  # logs value in `alpha` attribute as a parameter named "alpha"
    value_attr: alpha
  - name: gamma  # logs value in `gamma` attribute as a parameter named "gamma"
    value_attr: gamma
  - name: env_param  # logs value in `PARAMETER` environment variable as a parameter named "env_param"
    value_env: PARAMETER
Optional attributes¶
In some cases, the attribute containing the value to log may not always be set on the underlying object. A model may have been trained on a dataset with no feature names, or perhaps some learned attributes are only learned if certain parameters have certain values while fitting.
By default, schema logging will raise an exception if the attribute to be logged is not set. To suppress the errors and simply move on, items in the artifacts, dataframes, features, metrics, parameters, and schema lists may optionally contain a key optional with a true value.
The feature_names_in_ and learned_attribute_ attributes are both marked optional in the example schema above to handle cases where no feature names were present in the training data and learned_attribute_ was not learned:

features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true  # will not error if `feature_importances_` attribute is not set
  - name_attr: target  # **will** error if `target` attribute is not set
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true  # will not error if `learned_attribute_` attribute is not set

Note: Optional items in artifacts, dataframes, features, and schema will omit the associated entity from logging entirely if an optional attribute is not set. Optional items in metrics and parameters will log the associated entity with the given name and a value of None if an optional attribute is not set.
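The two behaviors described in the note can be sketched side by side. Everything here is illustrative rather than rubicon-ml's implementation: `read_attr` and `collect` are hypothetical helpers, and the `summary` dataframe is marked optional only for the sake of the example.

```python
from types import SimpleNamespace


def read_attr(obj, attr, optional=False):
    """Return (found, value); raise when a required attribute is missing."""
    if hasattr(obj, attr):
        return True, getattr(obj, attr)
    if optional:
        return False, None
    raise AttributeError(f"required attribute {attr!r} is not set")


def collect(obj, items, attr_key, keep_missing):
    """Collect named values, keeping missing optional items as None or dropping them."""
    collected = {}
    for item in items:
        found, value = read_attr(obj, item[attr_key], item.get("optional", False))
        if found or keep_missing:
            collected[item["name"]] = value
    return collected


metrics = [
    {"name": "learned_attribute", "value_attr": "learned_attribute_", "optional": True},
    {"name": "score", "value_attr": "score_"},
]
dataframes = [{"name": "summary", "df_attr": "summary_", "optional": True}]

model = SimpleNamespace(score_=1.0)  # `learned_attribute_` and `summary_` are not set

# metrics and parameters: the entity is still logged, with a value of None
print(collect(model, metrics, "value_attr", keep_missing=True))
# {'learned_attribute': None, 'score': 1.0}

# artifacts, dataframes, features, and nested schema: the entity is omitted
print(collect(model, dataframes, "df_attr", keep_missing=False))
# {}
```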
Nested schema¶
The following is a complete YAML representation of the Optimizer object’s schema:

name: my_model__Optimizer
version: 1.0.0

metrics:
  - name: optimized
    value_attr: optimized_
To apply another schema to one of the attributes of the original object, provide the schema name to be retrieved via registry.get_schema and the attribute attr containing the object to apply the schema to.

schema:
  - name: my_model__Optimizer  # logs a metric according to the above schema using the object in `optimizer`
    attr: optimizer

Note: Nested schema will add the logged entities to the original experiment created by the parent schema, not a new experiment. Nested schema cannot have names that conflict with the entities logged by the parent schema.
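A rough sketch of this behavior follows, with a plain dictionary standing in for the schema registry and another for the experiment. It assumes each nested entry carries both a name and an attr key. `SCHEMAS` and `log_metrics` are hypothetical names, and raising on a name conflict is this sketch's choice, not documented rubicon-ml behavior.

```python
from types import SimpleNamespace

# stand-in for the schema registry, keyed by schema name
SCHEMAS = {
    "my_model__Optimizer": {
        "metrics": [{"name": "optimized", "value_attr": "optimized_"}],
    },
    "my_model__Model": {
        "metrics": [{"name": "score", "value_attr": "score_"}],
        "schema": [{"name": "my_model__Optimizer", "attr": "optimizer"}],
    },
}


def log_metrics(obj, schema_name, experiment):
    """Log metrics for `obj`, then apply nested schema to the same experiment."""
    schema = SCHEMAS[schema_name]
    for item in schema.get("metrics", []):
        if item["name"] in experiment:
            raise ValueError(f"metric {item['name']!r} was already logged")
        experiment[item["name"]] = getattr(obj, item["value_attr"])
    for nested in schema.get("schema", []):
        # nested objects write into the parent's experiment, not a new one
        log_metrics(getattr(obj, nested["attr"]), nested["name"], experiment)
    return experiment


model = SimpleNamespace(score_=1.0, optimizer=SimpleNamespace(optimized_=True))
print(log_metrics(model, "my_model__Model", {}))
# {'score': 1.0, 'optimized': True}
```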
The complete schema now looks like this and will log an additional metric optimized as defined by the Optimizer schema to the original experiment:

name: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

artifacts:
  - self
  - name: optimizer
    data_object_attr: optimizer
dataframes:
  - name: summary
    df_attr: summary_
features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true
  - name_attr: target
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true
  - name: score
    value_attr: score_
  - name: env_metric
    value_env: METRIC
parameters:
  - name: alpha
    value_attr: alpha
  - name: gamma
    value_attr: gamma
  - name: env_param
    value_env: PARAMETER
schema:
  - name: my_model__Optimizer
    attr: optimizer
Hierarchical schema¶
Some objects may contain a list of other objects that are already represented by a schema, like a feature eliminator or hyperparameter optimizer that trained multiple iterations of an underlying model object.
The children key can be provided to log each of these underlying objects to a new experiment. This means that a single call to project.log_with_schema will log 1+n experiments to project, where n is the number of objects in the list specified by children.
Within the children key, provide the schema name for the children objects to be retrieved via registry.get_schema and the attribute attr containing the list of child objects.

children:
  - name: my_model__Optimizer  # defines the children's schema
    attr: optimizers  # logs an experiment according to the schema for each object in `optimizers`
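The 1+n count can be illustrated with a small sketch in which each experiment is a plain dictionary. It assumes each children entry carries both a name and an attr key. `SCHEMAS` and `log_experiments` are hypothetical names, and nothing here calls rubicon-ml.

```python
from types import SimpleNamespace

# stand-in for the schema registry, keyed by schema name
SCHEMAS = {
    "my_model__Optimizer": {
        "metrics": [{"name": "optimized", "value_attr": "optimized_"}],
    },
    "my_model__Model": {
        "metrics": [{"name": "score", "value_attr": "score_"}],
        "children": [{"name": "my_model__Optimizer", "attr": "optimizers"}],
    },
}


def log_experiments(obj, schema_name):
    """Return one experiment for `obj` followed by one per child object."""
    schema = SCHEMAS[schema_name]
    experiment = {
        "schema": schema_name,
        "metrics": {
            item["name"]: getattr(obj, item["value_attr"])
            for item in schema.get("metrics", [])
        },
    }
    experiments = [experiment]
    for child_spec in schema.get("children", []):
        # every object in the list gets its own, separate experiment
        for child in getattr(obj, child_spec["attr"]):
            experiments.extend(log_experiments(child, child_spec["name"]))
    return experiments


optimizers = [SimpleNamespace(optimized_=True) for _ in range(3)]
model = SimpleNamespace(score_=1.0, optimizers=optimizers)

experiments = log_experiments(model, "my_model__Model")
print(len(experiments))  # 4, i.e. 1 parent + 3 children
```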
If we replace the nested schema from the previous example with a list of children that adhere to the same Optimizer schema, the complete schema now looks like this. It will log a single experiment for Model containing all the information in the original Model schema, as well as an additional experiment as defined by the Optimizer schema for each of the objects in Model’s optimizers list.

name: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

artifacts:
  - self
  - name: optimizer
    data_object_attr: optimizer
children:
  - name: my_model__Optimizer
    attr: optimizers
dataframes:
  - name: summary
    df_attr: summary_
features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true
  - name_attr: target
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true
  - name: score
    value_attr: score_
  - name: env_metric
    value_env: METRIC
parameters:
  - name: alpha
    value_attr: alpha
  - name: gamma
    value_attr: gamma
  - name: env_param
    value_env: PARAMETER
Extending a schema¶
Consider an extension of Model named NewModel:

class NewModel(Model):
    def __init__(self, alpha=1e-3, gamma=1e-3, delta=1e-3):
        super().__init__(alpha=alpha, gamma=gamma)
        self.delta = delta

    def fit(self, X, y):
        super().fit(X, y)
        # scale the inherited learned attribute by delta;
        # this assumes `learned_attribute_` is numeric
        self.other_learned_attribute_ = self.delta * self.learned_attribute_
        return self
To extend an existing schema, provide the name of the schema to extend as the extends key’s value after the new schema’s name. This new schema will log everything in the schema represented by extends plus any additional values.

name: my_model__NewModel
extends: my_model__Model
version: 1.0.0
The following is a complete YAML representation of the NewModel object’s schema. This schema will log everything that the Model schema would with the addition of the other_learned_attribute metric and delta parameter from NewModel.

name: my_model__NewModel
extends: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/NewModel.html

metrics:
  - name: other_learned_attribute
    value_attr: other_learned_attribute_
parameters:
  - name: delta
    value_attr: delta
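The text does not describe how rubicon-ml resolves extends internally. One way to picture the result is a merge in which list sections from the parent are followed by the extension's own entries. The sketch below takes that approach with trimmed-down schema dictionaries; `SCHEMAS` and `resolve_extends` are hypothetical names used only for illustration.

```python
# trimmed-down versions of the two schemas, keyed by schema name
SCHEMAS = {
    "my_model__Model": {
        "name": "my_model__Model",
        "metrics": [{"name": "score", "value_attr": "score_"}],
        "parameters": [
            {"name": "alpha", "value_attr": "alpha"},
            {"name": "gamma", "value_attr": "gamma"},
        ],
    },
    "my_model__NewModel": {
        "name": "my_model__NewModel",
        "extends": "my_model__Model",
        "metrics": [
            {"name": "other_learned_attribute", "value_attr": "other_learned_attribute_"},
        ],
        "parameters": [{"name": "delta", "value_attr": "delta"}],
    },
}


def resolve_extends(schema_name):
    """Return a schema with everything inherited through `extends` merged in."""
    schema = dict(SCHEMAS[schema_name])
    parent_name = schema.pop("extends", None)
    if parent_name is None:
        return schema
    merged = resolve_extends(parent_name)
    for key, value in schema.items():
        if isinstance(value, list):
            # list sections: parent entries first, then the extension's own
            merged[key] = merged.get(key, []) + value
        else:
            # scalar metadata such as `name`: the extension wins
            merged[key] = value
    return merged


resolved = resolve_extends("my_model__NewModel")
print([p["name"] for p in resolved["parameters"]])  # ['alpha', 'gamma', 'delta']
print([m["name"] for m in resolved["metrics"]])  # ['score', 'other_learned_attribute']
```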
To see an extended schema in action, check out the “Register a custom schema” section.