{ "cells": [ { "cell_type": "markdown", "id": "ffc55deb-dc28-4d36-be12-c375ca24e5d3", "metadata": {}, "source": [ "# Sharing Experiments\n", "\n", "In the [first part](https://capitalone.github.io/rubicon-ml/quick-look/logging-experiments.html)\n", "of the quick look, we learned how to log ``rubicon_ml`` experiments in the context of a\n", "simple classification problem. We also performed a small hyperparameter search to show how ``rubicon_ml``\n", "can be used to compare the results of multiple model fits.\n", "\n", "Inspecting our model fit results in the same session that we trained the model is certainly useful, but\n", "sharing experiments can help us collaborate with teammates and compare new model training results to old\n", "experiments.\n", "\n", "First, we'll create a ``rubicon_ml`` entry point and get the project we logged in the first part of the\n", "quick look." ] }, { "cell_type": "code", "execution_count": 1, "id": "c66ddf15-dc7e-4a5e-8ce9-b8770c6e6177", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from rubicon_ml import Rubicon\n", "\n", "rubicon = Rubicon(persistence=\"filesystem\", root_dir=\"./rubicon-root\")\n", "project = rubicon.get_project(name=\"classifying penguins\")\n", "project" ] }, { "cell_type": "markdown", "id": "3e7c81be-8e5f-456a-8a4d-364001571acb", "metadata": {}, "source": [ "Let's say we want to share the results of our hyperparmeter search with another teammate so they can evaluate the results.\n", "``rubicon_ml``'s ``publish`` function takes a list of experiments as an input and uses ``intake`` to generate a catalog\n", "containing the bare-minimum amount of metadata needed to retrieve an experiment, like its ID and filepath. More on ``intake``\n", "can be found [in their docs](https://intake.readthedocs.io/en/latest/).\n", "\n", "Hyperparameter searches can span thousands of combos, so sharing every single file ``rubicon_ml`` logs during the training\n", "process can be a lot. That's why we use ``intake`` via our ``publish`` function to only share what needs to be shared in a\n", "single YAML file. Then, later, users can use said YAML file to retrieve the experiments shared within it.\n", "\n", "**Note**: Sharing experiments relys on both the sharer and the recipient having access to the same underlying data source.\n", "In this example, we're using a local filesystem - so these experiments couldn't actually be shared with anyone other than \n", "people on this same physical machine. To get the most out of sharing, log your experiments to an S3 bucket that all teammates\n", "have access to." ] }, { "cell_type": "code", "execution_count": 2, "id": "3696915b-e7d4-4760-a7f9-720cd2c8c781", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sources:\n", " experiment_193ab005_671b_4991_b3a8_397311b390c6:\n", " args:\n", " experiment_id: 193ab005-671b-4991-b3a8-397311b390c6\n", " project_name: classifying penguins\n", " urlpath: ./rubicon-root\n", " driver: rubicon_ml_experiment\n" ] } ], "source": [ "from rubicon_ml import publish\n", "\n", "catalog = publish(\n", " project.experiments(tags=[\"parameter search\"]),\n", " output_filepath=\"./penguin_catalog.yml\",\n", ")\n", "\n", "!head -7 penguin_catalog.yml" ] }, { "cell_type": "markdown", "id": "74b2175a-4644-479b-9ff6-f01305f8d568", "metadata": {}, "source": [ "Each catalog contains a \"source\" for each ``rubicon_ml`` experiment. 
{ "cell_type": "markdown", "id": "74b2175a-4644-479b-9ff6-f01305f8d568", "metadata": {}, "source": [ "Each catalog contains a \"source\" for each ``rubicon_ml`` experiment. These sources contain the minimum metadata needed\n", "to retrieve the associated experiment - the ``experiment_id``, ``project_name``, and ``urlpath`` to the root of the\n", "``rubicon_ml`` repository used as an entry point. The ``rubicon_ml_experiment`` driver can be found\n", "[within our library](https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/intake_rubicon/experiment.py)\n", "and leverages the metadata in the YAML catalog to return the experiment objects associated with it.\n", "\n", "Provided the recipient of the shared YAML catalog has read access to the filesystem represented by ``urlpath``,\n", "they can now use ``intake`` directly to read the catalog and load the shared ``rubicon_ml`` experiments\n", "for their own inspection." ] },
{ "cell_type": "code", "execution_count": 3, "id": "7aafdde6-e660-4567-a183-dfd926ee2ddb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shared experiments:\n", "\tid: 193ab005-671b-4991-b3a8-397311b390c6, parameters: [('strategy', 'median'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6826923076923077)]\n", "\tid: 19d5f5ec-88c4-4fb7-a1cb-e7ccd86eed62, parameters: [('strategy', 'median'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]\n", "\tid: 47eeaf42-18c4-44b0-b053-9198c5a942e8, parameters: [('strategy', 'median'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]\n", "\tid: 4b188436-0af5-4917-8c57-9f463ec7fab4, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]\n", "\tid: 5728479d-b524-4d49-ac0a-723d516f51a8, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]\n", "\tid: 5def5ddb-b52e-44e1-ab17-00aa28c0714a, parameters: [('strategy', 'median'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]\n", "\tid: 938d580b-e9b3-41d6-8f82-a18346b07355, parameters: [('strategy', 'mean'), ('n_neighbors', 15)], metrics: [('accuracy', 0.6442307692307693)]\n", "\tid: d810d725-e449-4072-960e-c3be5edd4cd2, parameters: [('strategy', 'mean'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]\n", "\tid: e172b79f-0964-44fc-aed8-7714732b2b83, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7019230769230769)]\n", "\tid: e2cdfd2e-004e-4b5d-a248-0c7ee7fd2500, parameters: [('strategy', 'mean'), ('n_neighbors', 5)], metrics: [('accuracy', 0.6923076923076923)]\n", "\tid: e82f1ad2-e016-48e6-a189-7361b527f264, parameters: [('strategy', 'mean'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6826923076923077)]\n", "\tid: ef909f56-22a1-4440-9bfa-11c09da772ce, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 20)], metrics: [('accuracy', 0.6730769230769231)]\n" ] } ], "source": [ "import intake\n", "\n", "catalog = intake.open_catalog(\"./penguin_catalog.yml\")\n", "\n", "for source in catalog:\n", "    catalog[source].discover()\n", "\n", "shared_experiments = [catalog[source].read() for source in catalog]\n", "\n", "print(\"shared experiments:\")\n", "for experiment in shared_experiments:\n", "    print(\n", "        f\"\\tid: {experiment.id}, \"\n", "        f\"parameters: {[(p.name, p.value) for p in experiment.parameters()]}, \"\n", "        f\"metrics: {[(m.name, m.value) for m in experiment.metrics()]}\"\n", "    )" ] },
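{ "cell_type": "markdown", "id": "d9f5a4c3-7e80-4fa1-b2c3-3e4f5a6b7c8d", "metadata": {}, "source": [ "Once loaded, the shared experiments behave like any other ``rubicon_ml`` experiments. As a quick illustration (a minimal sketch\n", "that isn't part of the original quick look, assuming ``pandas`` is available), we could flatten each experiment's parameters and\n", "metrics into a DataFrame to compare the shared runs side by side." ] },
{ "cell_type": "code", "execution_count": null, "id": "e0a6b5d4-8f91-40b2-83d4-4f5a6b7c8d9e", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# one row per shared experiment: its ID plus each logged parameter and metric\n", "records = []\n", "for experiment in shared_experiments:\n", "    record = {\"id\": experiment.id}\n", "    record.update({parameter.name: parameter.value for parameter in experiment.parameters()})\n", "    record.update({metric.name: metric.value for metric in experiment.metrics()})\n", "    records.append(record)\n", "\n", "# sort by the \"accuracy\" metric logged during the parameter search\n", "pd.DataFrame(records).sort_values(\"accuracy\", ascending=False)" ] },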
{}, "source": [ "Suppose we have an existing `intake catalog` and would like to update and append experiments to that the same catalog file. To do this, let's create a project with 2 random experiments Next, in order to update an exisiting catalog file with new experiments, we can utilize the `penguin_catalog` and then directly update it with our new experiments. To do this, we leverage an optional argument in the `publish` function called `base_catalog_filepath` as seen below. The result shows the new experiments added to the `penguin catalog`. We can verify this by noticing the different project name that the new experiments fall under.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "55d1021b-e003-4914-ae72-b30dcc52b6d3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sources:\n", " experiment_193ab005_671b_4991_b3a8_397311b390c6:\n", " args:\n", " experiment_id: 193ab005-671b-4991-b3a8-397311b390c6\n", " project_name: classifying penguins\n", " urlpath: ./rubicon-root\n", " driver: rubicon_ml_experiment\n" ] } ], "source": [ "new_project = rubicon.get_or_create_project(name=\"update catalog example\")\n", "new_experiments = [new_project.log_experiment() for _ in range(2)]\n", "\n", "updated_catalog = publish(\n", " base_catalog_filepath=\"./penguin_catalog.yml\",\n", " experiments = new_experiments,\n", ")\n", "\n", "!head -7 penguin_catalog.yml\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "25b1e4c5b858c72efcaa777acfbbe2a6a9238d41ae968ff19668f7de97f00b4c" } } }, "nbformat": 4, "nbformat_minor": 5 }