Sharing Experiments

In the first part of the quick look, we learned how to log rubicon_ml experiments in the context of a simple classification problem. We also performed a small hyperparameter search to show how rubicon_ml can be used to compare the results of multiple model fits.

Inspecting our model fit results in the same session in which we trained the model is certainly useful, but sharing experiments lets us collaborate with teammates and compare new model training results to old experiments.

First, we’ll create a rubicon_ml entry point and get the project we logged in the first part of the quick look.

[1]:
from rubicon_ml import Rubicon

rubicon = Rubicon(persistence="filesystem", root_dir="./rubicon-root")
project = rubicon.get_project(name="classifying penguins")
project
[1]:
<rubicon_ml.client.project.Project at 0x16da73a60>

Let’s say we want to share the results of our hyperparameter search with a teammate so they can evaluate the results. rubicon_ml’s publish function takes a list of experiments as input and uses intake to generate a catalog containing the bare minimum of metadata needed to retrieve each experiment, like its ID and filepath. More on intake can be found in its docs.

Hyperparameter searches can span thousands of combinations, so sharing every file rubicon_ml logs during training quickly becomes unwieldy. That’s why the publish function uses intake to put only what needs to be shared into a single YAML file. Later, recipients can use that YAML file to retrieve the experiments listed in it.

Note: Sharing experiments relies on both the sharer and the recipient having access to the same underlying data source. In this example, we’re using a local filesystem, so these experiments couldn’t actually be shared with anyone who does not have access to this same physical machine. To get the most out of sharing, log your experiments to an S3 bucket that all teammates have access to.
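Before publishing, it can help to check whether the repository root is a location a teammate could actually reach. The sketch below is a hypothetical helper (is_remote_root is not part of rubicon_ml) that uses only the standard library to tell a local path such as ./rubicon-root apart from a URL-style root such as an S3 path. The bucket name is made up for illustration.

```python
from urllib.parse import urlparse


def is_remote_root(root_dir: str) -> bool:
    """Return True if ``root_dir`` looks like a URL with a remote scheme.

    Hypothetical helper, not part of rubicon_ml. A single-letter "scheme"
    is treated as a Windows drive letter (e.g. ``C:\\data``), i.e. local.
    """
    scheme = urlparse(root_dir).scheme
    return len(scheme) > 1 and scheme != "file"


# The root used in this notebook is local, so only users of this machine can read it.
print(is_remote_root("./rubicon-root"))
# A URL-style root (hypothetical bucket name) is reachable by anyone with access to it.
print(is_remote_root("s3://my-team-bucket/rubicon-root"))
```

A remote scheme is only a necessary condition: the recipient still needs read permission on the bucket or share itself.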

[2]:
from rubicon_ml import publish

catalog = publish(
    project.experiments(tags=["parameter search"]),
    output_filepath="./penguin_catalog.yml",
)

!head -7 penguin_catalog.yml
sources:
  experiment_193ab005_671b_4991_b3a8_397311b390c6:
    args:
      experiment_id: 193ab005-671b-4991-b3a8-397311b390c6
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment

Each catalog contains a “source” for each rubicon_ml experiment. These sources contain the minimum metadata needed to retrieve the associated experiment: the experiment_id, the project_name, and the urlpath to the root of the rubicon_ml repository used as an entry point. The rubicon_ml_experiment driver can be found within our library and leverages the metadata in the YAML catalog to return the experiment objects associated with it.

Provided the recipient of the shared YAML catalog has read access to the filesystem represented by urlpath, they can now use intake directly to read the catalog and load the shared rubicon_ml experiments for their own inspection.
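To make the structure of a source concrete, the sketch below rebuilds the entry from the catalog excerpt as a plain Python dictionary. It is only an illustration inferred from that excerpt: build_source_entry is a hypothetical helper, not rubicon_ml’s implementation, and the naming rule (prefix experiment_, hyphens replaced by underscores) is an assumption based on the single source name visible in the output.

```python
def build_source_entry(experiment_id: str, project_name: str, urlpath: str) -> dict:
    """Build one catalog source in the shape shown in the catalog excerpt."""
    # Assumed naming rule: "experiment_" + ID with hyphens turned into underscores.
    source_name = "experiment_" + experiment_id.replace("-", "_")
    return {
        source_name: {
            "args": {
                "experiment_id": experiment_id,
                "project_name": project_name,
                "urlpath": urlpath,
            },
            "driver": "rubicon_ml_experiment",
        }
    }


entry = build_source_entry(
    "193ab005-671b-4991-b3a8-397311b390c6",
    "classifying penguins",
    "./rubicon-root",
)
print(entry)
```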

[3]:
import intake

catalog = intake.open_catalog("./penguin_catalog.yml")

for source in catalog:
    catalog[source].discover()

shared_experiments = [catalog[source].read() for source in catalog]

print("shared experiments:")
for experiment in shared_experiments:
    print(
        f"\tid: {experiment.id}, "
        f"parameters: {[(p.name, p.value) for p in experiment.parameters()]}, "
        f"metrics: {[(m.name, m.value) for m in experiment.metrics()]}"
    )
shared experiments:
        id: 193ab005-671b-4991-b3a8-397311b390c6, parameters: [('strategy', 'median'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6826923076923077)]
        id: 19d5f5ec-88c4-4fb7-a1cb-e7ccd86eed62, parameters: [('strategy', 'median'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]
        id: 47eeaf42-18c4-44b0-b053-9198c5a942e8, parameters: [('strategy', 'median'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]
        id: 4b188436-0af5-4917-8c57-9f463ec7fab4, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]
        id: 5728479d-b524-4d49-ac0a-723d516f51a8, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]
        id: 5def5ddb-b52e-44e1-ab17-00aa28c0714a, parameters: [('strategy', 'median'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]
        id: 938d580b-e9b3-41d6-8f82-a18346b07355, parameters: [('strategy', 'mean'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]
        id: d810d725-e449-4072-960e-c3be5edd4cd2, parameters: [('strategy', 'mean'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]
        id: e172b79f-0964-44fc-aed8-7714732b2b83, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]
        id: e2cdfd2e-004e-4b5d-a248-0c7ee7fd2500, parameters: [('strategy', 'mean'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]
        id: e82f1ad2-e016-48e6-a189-7361b527f264, parameters: [('strategy', 'mean'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6826923076923077)]
        id: ef909f56-22a1-4440-9bfa-11c09da772ce, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6730769230769231)]
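
With the experiments loaded, the recipient can rank them like any other rubicon_ml experiments. The sketch below picks the experiments with the highest value for a metric. It relies only on the metrics(), name, value, and id attributes used in the cell above. top_experiments is a hypothetical helper, and the SimpleNamespace objects are stand-ins built from three rows of the output so the snippet runs without a rubicon_ml repository.

```python
from types import SimpleNamespace


def top_experiments(experiments, metric_name="accuracy"):
    """Return every experiment tied for the highest value of ``metric_name``."""
    scored = []
    for experiment in experiments:
        for metric in experiment.metrics():
            if metric.name == metric_name:
                scored.append((metric.value, experiment))
    if not scored:
        return []
    best = max(value for value, _ in scored)
    return [experiment for value, experiment in scored if value == best]


def stand_in(experiment_id, accuracy):
    """Mimic the small part of the experiment interface used above."""
    metric = SimpleNamespace(name="accuracy", value=accuracy)
    return SimpleNamespace(id=experiment_id, metrics=lambda: [metric])


# IDs and accuracies copied from three rows of the output above.
sample = [
    stand_in("193ab005-671b-4991-b3a8-397311b390c6", 0.6826923076923077),
    stand_in("5def5ddb-b52e-44e1-ab17-00aa28c0714a", 0.7019230769230769),
    stand_in("d810d725-e449-4072-960e-c3be5edd4cd2", 0.7019230769230769),
]
print([experiment.id for experiment in top_experiments(sample)])
```

In the notebook itself, the equivalent call would be top_experiments(shared_experiments). Returning all ties rather than a single winner matters here, since several parameter combinations reach the same accuracy.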

Updating Existing Catalogs

Suppose we have an existing intake catalog and would like to append new experiments to that same catalog file. To do this, let’s first create a project with two placeholder experiments. Then, to update the existing catalog file, we pass the path of penguin_catalog.yml to the publish function through its optional base_catalog_filepath argument, as seen below, which directly updates that catalog with our new experiments. The new experiments are added to the penguin catalog as additional sources, and we can recognize them by the different project name they fall under. Note that the head -7 output below shows only the first source in the file, which belongs to the original project.

[4]:
new_project = rubicon.get_or_create_project(name="update catalog example")
new_experiments = [new_project.log_experiment() for _ in range(2)]

updated_catalog = publish(
    base_catalog_filepath="./penguin_catalog.yml",
    experiments=new_experiments,
)

!head -7 penguin_catalog.yml
sources:
  experiment_193ab005_671b_4991_b3a8_397311b390c6:
    args:
      experiment_id: 193ab005-671b-4991-b3a8-397311b390c6
      project_name: classifying penguins
      urlpath: ./rubicon-root
    driver: rubicon_ml_experiment
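
To check the whole catalog file rather than its first few lines, one option is to count sources per project. The sketch below scans the catalog as text using only the standard library, assuming each source has exactly one project_name: line as in the excerpt above. count_sources_by_project is a hypothetical helper, and a YAML parser would be the more robust choice for catalogs with a different layout.

```python
from collections import Counter
from pathlib import Path


def count_sources_by_project(catalog_path):
    """Count catalog sources per project by scanning for ``project_name:`` lines."""
    counts = Counter()
    for line in Path(catalog_path).read_text().splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "project_name":
            counts[value.strip()] += 1
    return counts


catalog_path = Path("./penguin_catalog.yml")
if catalog_path.exists():
    # Expect one entry per project that has published experiments.
    print(count_sources_by_project(catalog_path))
```

If the update worked as described, the printed counter should include both "classifying penguins" and "update catalog example".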