Sharing Experiments
In the first part of the quick look, we learned how to log rubicon_ml experiments in the context of a simple classification problem. We also performed a small hyperparameter search to show how rubicon_ml can be used to compare the results of multiple model fits.
Inspecting our model fit results in the same session in which we trained the model is certainly useful, but sharing experiments lets us collaborate with teammates and compare new model training results to old experiments.
First, we’ll create a rubicon_ml entry point and get the project we logged in the first part of the quick look.
[1]:
from rubicon_ml import Rubicon
rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_project(name="classifying penguins")
project
[1]:
<rubicon_ml.client.project.Project at 0x16da73a60>
Let’s say we want to share the results of our hyperparameter search with a teammate so they can evaluate them. rubicon_ml’s publish function takes a list of experiments as input and uses intake to generate a catalog containing the bare-minimum metadata needed to retrieve each experiment, like its ID and filepath. More on intake can be found in its docs.
Hyperparameter searches can span thousands of combinations, so sharing every file rubicon_ml logs during training can be a lot. That’s why our publish function uses intake to share only what needs to be shared, in a single YAML file. Recipients can later use that YAML file to retrieve the experiments listed in it.
Note: Sharing experiments relies on both the sharer and the recipient having access to the same underlying data source. In this example, we’re using a local filesystem, so these experiments can only be shared with people who have access to this same physical machine. To get the most out of sharing, log your experiments to an S3 bucket that all teammates have access to.
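As a rough illustration of the note above, the sketch below classifies a root_dir value as local or remote by looking at its URL scheme. The helper name is_remote_root and the bucket name are hypothetical and are not part of rubicon_ml. The helper only inspects the string; it does not check permissions or connectivity.

```python
from urllib.parse import urlparse


def is_remote_root(root_dir: str) -> bool:
    """Return True if root_dir looks like a remote URL such as s3://bucket/path.

    Hypothetical helper, not part of rubicon_ml. It only inspects the string.
    """
    scheme = urlparse(root_dir).scheme
    # Single-letter schemes are treated as Windows drive letters, not URLs.
    return len(scheme) > 1 and scheme != "file"


# The local path used in this notebook.
print(is_remote_root("./rubicon-root"))
# A made-up bucket name, for illustration only.
print(is_remote_root("s3://my-bucket/rubicon-root"))
```

A catalog published from a root that this check reports as local will only resolve for readers who can reach that same path.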
[2]:
from rubicon_ml import publish
catalog = publish(
    project.experiments(tags=["parameter search"]),
    output_filepath="./penguin_catalog.yml",
)
!head -7 penguin_catalog.yml
sources:
  experiment_193ab005_671b_4991_b3a8_397311b390c6:
    args:
      experiment_id: 193ab005-671b-4991-b3a8-397311b390c6
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
Each catalog contains a “source” for each rubicon_ml experiment. These sources contain the minimum metadata needed to retrieve the associated experiment: the experiment_id, the project_name, and the urlpath to the root of the rubicon_ml repository used as an entry point. The rubicon_ml_experiment driver is provided by our library and uses the metadata in the YAML catalog to return the associated experiment objects.
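To make the shape of a source concrete, here is a minimal sketch that builds one catalog entry as a plain dictionary. The build_source helper is hypothetical and is not how publish is implemented. The key naming (experiment_ plus the ID with dashes replaced by underscores) is inferred from the catalog excerpt above, and the nesting of args and driver assumes the usual intake catalog layout.

```python
def build_source(experiment_id: str, project_name: str, urlpath: str) -> dict:
    """Build one catalog entry shaped like the YAML excerpt (illustrative only)."""
    # Key format inferred from the excerpt: dashes in the ID become underscores.
    source_name = "experiment_" + experiment_id.replace("-", "_")
    return {
        source_name: {
            "args": {
                "experiment_id": experiment_id,
                "project_name": project_name,
                "urlpath": urlpath,
            },
            "driver": "rubicon_ml_experiment",
        }
    }


# Values copied from the first source in the catalog excerpt.
entry = build_source(
    experiment_id="193ab005-671b-4991-b3a8-397311b390c6",
    project_name="classifying penguins",
    urlpath="./rubicon-root",
)
print(entry)
```

Nothing else about the experiment (parameters, metrics, artifacts) is stored in the catalog; those are read from urlpath when the source is loaded.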
Provided the recipient of the shared YAML catalog has read access to the filesystem represented by urlpath, they can now use intake directly to read the catalog and load the shared rubicon_ml experiments for their own inspection.
[3]:
import intake
catalog = intake.open_catalog("./penguin_catalog.yml")
for source in catalog:
    catalog[source].discover()
shared_experiments = [catalog[source].read() for source in catalog]
print("shared experiments:")
for experiment in shared_experiments:
    print(
        f"\tid: {experiment.id}, "
        f"parameters: {[(p.name, p.value) for p in experiment.parameters()]}, "
        f"metrics: {[(m.name, m.value) for m in experiment.metrics()]}"
    )
shared experiments:
id: 193ab005-671b-4991-b3a8-397311b390c6, parameters: [('strategy', 'median'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6826923076923077)]
id: 19d5f5ec-88c4-4fb7-a1cb-e7ccd86eed62, parameters: [('strategy', 'median'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]
id: 47eeaf42-18c4-44b0-b053-9198c5a942e8, parameters: [('strategy', 'median'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]
id: 4b188436-0af5-4917-8c57-9f463ec7fab4, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]
id: 5728479d-b524-4d49-ac0a-723d516f51a8, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]
id: 5def5ddb-b52e-44e1-ab17-00aa28c0714a, parameters: [('strategy', 'median'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]
id: 938d580b-e9b3-41d6-8f82-a18346b07355, parameters: [('strategy', 'mean'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]
id: d810d725-e449-4072-960e-c3be5edd4cd2, parameters: [('strategy', 'mean'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]
id: e172b79f-0964-44fc-aed8-7714732b2b83, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]
id: e2cdfd2e-004e-4b5d-a248-0c7ee7fd2500, parameters: [('strategy', 'mean'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]
id: e82f1ad2-e016-48e6-a189-7361b527f264, parameters: [('strategy', 'mean'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6826923076923077)]
id: ef909f56-22a1-4440-9bfa-11c09da772ce, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6730769230769231)]
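Once the experiments are loaded, a recipient will typically want to rank them. The sketch below does this on plain tuples copied from a few rows of the output above, so that it runs without rubicon_ml installed. With real experiment objects, the same values would come from experiment.parameters() and experiment.metrics(). The results list and the variable names are illustrative.

```python
# (strategy, n_neighbors, accuracy) rows copied from the printed output.
results = [
    ("median", 20, 0.6826923076923077),
    ("median", 15, 0.6442307692307693),
    ("median", 5, 0.6923076923076923),
    ("median", 10, 0.7019230769230769),
    ("mean", 10, 0.7019230769230769),
    ("most_frequent", 10, 0.7019230769230769),
    ("most_frequent", 20, 0.6730769230769231),
]

best_accuracy = max(accuracy for _, _, accuracy in results)
# Keep every configuration that ties for the best score.
best_runs = [row for row in results if row[2] == best_accuracy]

print(f"best accuracy: {best_accuracy:.4f}")
for strategy, n_neighbors, _ in best_runs:
    print(f"\tstrategy={strategy}, n_neighbors={n_neighbors}")
```

In the output above, the three best runs all use n_neighbors=10, one per strategy value.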
Updating Existing Catalogs
Suppose we have an existing intake catalog and would like to append new experiments to that same catalog file. To demonstrate, we’ll create a project with two new experiments and add them to the penguin_catalog.yml file we published earlier. To do this, we pass an optional argument called base_catalog_filepath to the publish function, as seen below. The new experiments are added to the existing penguin catalog, and they can be told apart from the original entries by the different project name they fall under.
[4]:
new_project = rubicon.get_or_create_project(name="update catalog example")
new_experiments = [new_project.log_experiment() for _ in range(2)]
updated_catalog = publish(
    base_catalog_filepath="./penguin_catalog.yml",
    experiments=new_experiments,
)
!head -7 penguin_catalog.yml
sources:
  experiment_193ab005_671b_4991_b3a8_397311b390c6:
    args:
      experiment_id: 193ab005-671b-4991-b3a8-397311b390c6
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
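Because head -7 only prints the first source, the new entries are not visible in the excerpt above. One way to check the update is to count sources per project. The sketch below does a simple line scan for project_name so that it needs no YAML parser. The count_sources_by_project helper is hypothetical, and the sample text uses made-up placeholder IDs for the two new experiments rather than real ones.

```python
from collections import Counter


def count_sources_by_project(catalog_text: str) -> Counter:
    """Count catalog sources per project by scanning for project_name lines."""
    counts = Counter()
    for line in catalog_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "project_name":
            counts[value.strip()] += 1
    return counts


# Sample text: the last two entries use placeholder IDs, not real experiment IDs.
sample = """\
sources:
  experiment_193ab005_671b_4991_b3a8_397311b390c6:
    args:
      experiment_id: 193ab005-671b-4991-b3a8-397311b390c6
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
  experiment_placeholder_1:
    args:
      experiment_id: placeholder-1
      project_name: update catalog example
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
  experiment_placeholder_2:
    args:
      experiment_id: placeholder-2
      project_name: update catalog example
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
"""

counts = count_sources_by_project(sample)
print(counts)
```

To run the same check against the real file, read penguin_catalog.yml into a string and pass it to the function in place of sample.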