{ "cells": [ { "cell_type": "markdown", "id": "ffc55deb-dc28-4d36-be12-c375ca24e5d3", "metadata": {}, "source": [ "# Logging Experiments\n", "\n", "``rubicon_ml``'s core functionality is centered around logging **experiments** to explain and explore various\n", "model runs throughout the model development lifecycle. This example will take a quick look at how we can log\n", "model metadata to ``rubicon_ml`` in the context of a simple classification project.\n", "\n", "We'll leverage the ``palmerpenguins`` dataset collected by Dr. Kristen Gorman as our training/testing data. More\n", "information on the dataset can be [found here](https://allisonhorst.github.io/palmerpenguins/).\n", "\n", "Our goal is to create a simple classification model to differentiate the species of penguins present in the\n", "dataset. We'll leverage ``rubicon_ml`` logging to make it easy to compare runs of our model as well as preserve\n", "important information for reproducibility later." ] }, { "cell_type": "code", "execution_count": 1, "id": "934210f6-3701-47bb-9223-bd18171ea761", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: palmerpenguins in /Users/nvd215/opt/miniconda3/envs/rubicon-ml/lib/python3.10/site-packages (0.1.4)\n", "Requirement already satisfied: numpy in /Users/nvd215/opt/miniconda3/envs/rubicon-ml/lib/python3.10/site-packages (from palmerpenguins) (1.21.6)\n", "Requirement already satisfied: pandas in /Users/nvd215/opt/miniconda3/envs/rubicon-ml/lib/python3.10/site-packages (from palmerpenguins) (1.4.2)\n", "Requirement already satisfied: python-dateutil>=2.8.1 in /Users/nvd215/opt/miniconda3/envs/rubicon-ml/lib/python3.10/site-packages (from pandas->palmerpenguins) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /Users/nvd215/opt/miniconda3/envs/rubicon-ml/lib/python3.10/site-packages (from pandas->palmerpenguins) (2022.1)\n", "Requirement already satisfied: six>=1.5 in 
/Users/nvd215/opt/miniconda3/envs/rubicon-ml/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas->palmerpenguins) (1.16.0)\n" ] } ], "source": [ "! pip install palmerpenguins" ] }, { "cell_type": "markdown", "id": "e9d17efc-9147-4e90-a2c7-c83f8a0d9a22", "metadata": {}, "source": [ "First, we'll load the dataset and perform some basic data preparation. In many scenarios, this step will already have\n", "been done upstream, before the training/testing data is loaded and experimentation begins." ] }, { "cell_type": "code", "execution_count": 2, "id": "9d58148e-69af-4664-8cf5-328a884943ba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "target classes (species): ['Adelie' 'Gentoo' 'Chinstrap']\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th><th>year</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td><td>male</td><td>2007</td></tr>\n",
"    <tr><th>1</th><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186.0</td><td>3800.0</td><td>female</td><td>2007</td></tr>\n",
"    <tr><th>2</th><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td><td>female</td><td>2007</td></tr>\n",
"    <tr><th>3</th><td>Adelie</td><td>Torgersen</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>2007</td></tr>\n",
"    <tr><th>4</th><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193.0</td><td>3450.0</td><td>female</td><td>2007</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "3 Adelie Torgersen NaN NaN NaN \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 male 2007 \n", "1 3800.0 female 2007 \n", "2 3250.0 female 2007 \n", "3 NaN NaN 2007 \n", "4 3450.0 female 2007 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from palmerpenguins import load_penguins\n", "\n", "penguins_df = load_penguins()\n", "target_values = penguins_df['species'].unique()\n", "\n", "print(f\"target classes (species): {target_values}\")\n", "penguins_df.head()" ] }, { "cell_type": "markdown", "id": "c9e15f1a-a20b-4360-af4f-82b43a830285", "metadata": {}, "source": [ "Let's encode the string variables in our dataset to categoricals so our KNN can work with the data." ] }, { "cell_type": "code", "execution_count": 3, "id": "0e515256-72ae-4a27-9f5f-03a5313d6b61", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "target classes (species): [0 2 1]\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th><th>year</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>2</td><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td><td>1</td><td>2007</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>2</td><td>39.5</td><td>17.4</td><td>186.0</td><td>3800.0</td><td>0</td><td>2007</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>2</td><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td><td>0</td><td>2007</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>2</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>2</td><td>2007</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>2</td><td>36.7</td><td>19.3</td><td>193.0</td><td>3450.0</td><td>0</td><td>2007</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 0 2 39.1 18.7 181.0 \n", "1 0 2 39.5 17.4 186.0 \n", "2 0 2 40.3 18.0 195.0 \n", "3 0 2 NaN NaN NaN \n", "4 0 2 36.7 19.3 193.0 \n", "\n", " body_mass_g sex year \n", "0 3750.0 1 2007 \n", "1 3800.0 0 2007 \n", "2 3250.0 0 2007 \n", "3 NaN 2 2007 \n", "4 3450.0 0 2007 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "for column in [\"species\", \"island\", \"sex\"]:\n", " penguins_df[column] = LabelEncoder().fit_transform(penguins_df[column])\n", "\n", "print(f\"target classes (species): {penguins_df['species'].unique()}\")\n", "penguins_df.head()" ] }, { "cell_type": "markdown", "id": "82a8e959-ed52-4b33-9996-708b8eeb0876", "metadata": {}, "source": [ "Finally, we'll split the preprocessed data into a train and test set." ] }, { "cell_type": "code", "execution_count": 4, "id": "e6ab60bf-29db-4cb3-89ec-fba1f6f7c23a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((240, 7), (240,), (104, 7), (104,))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "train_penguins_df, test_penguins_df = train_test_split(penguins_df, test_size=.30)\n", "\n", "target_name = \"species\"\n", "feature_names = [c for c in train_penguins_df.columns if c != target_name]\n", "\n", "X_train, y_train = train_penguins_df[feature_names], train_penguins_df[target_name]\n", "X_test, y_test = test_penguins_df[feature_names], test_penguins_df[target_name]\n", "\n", "X_train.shape, y_train.shape, X_test.shape, y_test.shape" ] }, { "cell_type": "markdown", "id": "e80d22a6-a21b-4438-83a7-0746239d292d", "metadata": {}, "source": [ "Now we can create and train a simple Scikit-learn pipeline to organize our model training code. 
We'll use a `SimpleImputer`\n", "to fill in missing values, followed by a `KNeighborsClassifier` to classify the penguins." ] }, { "cell_type": "code", "execution_count": 5, "id": "d7bb797b-1757-4319-876c-41cd2156237a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7307692307692307" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.impute import SimpleImputer\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.pipeline import Pipeline\n", "\n", "# Hyperparameters are kept in variables so they can be logged later\n", "imputer_strategy = \"mean\"\n", "classifier_n_neighbors = 5\n", "\n", "steps = [\n", "    (\"si\", SimpleImputer(strategy=imputer_strategy)),\n", "    (\"kn\", KNeighborsClassifier(n_neighbors=classifier_n_neighbors)),\n", "]\n", "\n", "penguin_pipeline = Pipeline(steps=steps)\n", "penguin_pipeline.fit(X_train, y_train)\n", "\n", "# Mean accuracy on the held-out test set\n", "score = penguin_pipeline.score(X_test, y_test)\n", "score" ] }, { "cell_type": "markdown", "id": "bc131e02-8f65-4d6c-9148-27f58d6469ff", "metadata": {}, "source": [ "We've completed a training run, so let's finally log our results to ``rubicon_ml``! We'll create an entrypoint to the\n", "local filesystem and create a project called \"classifying penguins\" to store our results. ``rubicon_ml``'s ``log_*``\n", "methods can be placed throughout your model code to log any important information along the way. Entities available\n", "for logging via the ``log_*`` methods can be found in [our glossary](https://capitalone.github.io/rubicon-ml/glossary.html)."
] }, { "cell_type": "code", "execution_count": 6, "id": "0d234a1d-d143-4ad8-80ce-512fbc8327f0", "metadata": {}, "outputs": [], "source": [ "from rubicon_ml import Rubicon\n", "\n", "rubicon = Rubicon(\n", "    persistence=\"filesystem\",\n", "    root_dir=\"./rubicon-root\",\n", "    auto_git_enabled=True,\n", ")\n", "project = rubicon.get_or_create_project(name=\"classifying penguins\")\n", "experiment = project.log_experiment()\n", "\n", "for feature_name in feature_names:\n", "    experiment.log_feature(name=feature_name)\n", "\n", "_ = experiment.log_parameter(name=\"strategy\", value=imputer_strategy)\n", "_ = experiment.log_parameter(name=\"n_neighbors\", value=classifier_n_neighbors)\n", "_ = experiment.log_metric(name=\"accuracy\", value=score)" ] }, { "cell_type": "markdown", "id": "68a5598c-93d5-4bf4-898a-033c9e97aad9", "metadata": {}, "source": [ "After logging, we can inspect the various attributes of our logged entities. All available attributes can be found in\n", "[our API reference](https://capitalone.github.io/rubicon-ml/api_reference.html)."
] }, { "cell_type": "code", "execution_count": 7, "id": "9d9523e5-69ef-4af3-9488-bf73bfb65073", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Experiment(project_name='classifying penguins', id='c484caf8-bdc1-429f-b012-7a4e02dbc83a', name=None, description=None, model_name=None, branch_name='210-new-quick-look', commit_hash='490e8af895f2cd0636c72295c2762b21cd6c8102', training_metadata=None, tags=[], created_at=datetime.datetime(2022, 6, 30, 13, 51, 4, 958916))\n", "\n", "git info:\n", "\tbranch name: 210-new-quick-look\n", "\tcommit hash: 490e8af895f2cd0636c72295c2762b21cd6c8102\n", "features: ['island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']\n", "parameters: [('strategy', 'mean'), ('n_neighbors', 5)]\n", "metrics: [('accuracy', 0.7307692307692307)]\n" ] } ], "source": [ "print(experiment)\n", "print()\n", "print(\"git info:\")\n", "print(f\"\\tbranch name: {experiment.branch_name}\\n\\tcommit hash: {experiment.commit_hash}\")\n", "print(f\"features: {[f.name for f in experiment.features()]}\")\n", "print(f\"parameters: {[(p.name, p.value) for p in experiment.parameters()]}\")\n", "print(f\"metrics: {[(m.name, m.value) for m in experiment.metrics()]}\")" ] }, { "cell_type": "markdown", "id": "54d8e9e4-7f6c-400a-ac9c-39424864c3c2", "metadata": {}, "source": [ "Tracking the results of a single model fit is nice, but ``rubicon_ml`` really shines when we're iterating over numerous\n", "model fits, as in a hyperparameter search. The code below performs a very basic grid search over the ``strategy``\n", "of the ``SimpleImputer`` and the ``n_neighbors`` of the ``KNeighborsClassifier``, logging the results of each model\n", "fit to a new ``rubicon_ml`` experiment."
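, "\n", "\n", "
For reference, the nested loops in the next cell walk the full Cartesian product of the candidate values. A minimal standard-library sketch of that grid follows; the `search_space` and `grid` names are illustrative assumptions, and the candidate values are copied from the search cell.

```python
from itertools import product

# Candidate values copied from the search cell; the dict layout is an assumption
search_space = {
    'strategy': ['mean', 'median', 'most_frequent'],
    'n_neighbors': [5, 10, 15, 20],
}

# product() yields every combination in the same order as the nested for-loops
grid = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]

print(len(grid))
print(grid[0])
```

Each entry in `grid` corresponds to one model fit and therefore one logged experiment, so the search produces 3 x 4 = 12 experiments.
"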
] }, { "cell_type": "code", "execution_count": 8, "id": "9ebe5fa9-3ff7-4ab9-8213-bfd1199e0520", "metadata": {}, "outputs": [], "source": [ "from sklearn.base import clone\n", "\n", "for imputer_strategy in [\"mean\", \"median\", \"most_frequent\"]:\n", "    for classifier_n_neighbors in [5, 10, 15, 20]:\n", "        pipeline = clone(penguin_pipeline)\n", "        pipeline.set_params(\n", "            si__strategy=imputer_strategy,\n", "            kn__n_neighbors=classifier_n_neighbors,\n", "        )\n", "\n", "        pipeline.fit(X_train, y_train)\n", "        score = pipeline.score(X_test, y_test)\n", "\n", "        experiment = project.log_experiment(tags=[\"parameter search\"])\n", "\n", "        for feature_name in feature_names:\n", "            experiment.log_feature(name=feature_name)\n", "        experiment.log_parameter(name=\"strategy\", value=imputer_strategy)\n", "        experiment.log_parameter(name=\"n_neighbors\", value=classifier_n_neighbors)\n", "        experiment.log_metric(name=\"accuracy\", value=score)" ] }, { "cell_type": "markdown", "id": "312705c8-e7c3-48b9-98b1-34b6f5c5b71c", "metadata": {}, "source": [ "Now we can take a look at a few experiments and compare our results. Notice that we're still pulling experiments from the same\n", "project that we logged the first one to. However, we're only retrieving the experiments from the search above by using the\n", "\"parameter search\" tag when we get our experiments. Each experiment in the hyperparameter search above was tagged with\n", "\"parameter search\" when it was logged."
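, "\n", "\n", "
To make the tag filter concrete, here is a plain-Python sketch of the idea. It is only an illustration: the record layout, the `filter_by_tags` helper, and the accuracy values are hypothetical, and this is not how ``rubicon_ml`` implements `project.experiments(tags=...)`.

```python
# Hypothetical in-memory records standing in for logged experiments
experiments = [
    {'id': 'exp-0', 'tags': [], 'accuracy': 0.70},
    {'id': 'exp-1', 'tags': ['parameter search'], 'accuracy': 0.75},
    {'id': 'exp-2', 'tags': ['parameter search'], 'accuracy': 0.80},
]


def filter_by_tags(records, tags):
    # Keep records carrying at least one of the requested tags
    wanted = set(tags)
    return [r for r in records if wanted & set(r['tags'])]


search_runs = filter_by_tags(experiments, ['parameter search'])
best = max(search_runs, key=lambda r: r['accuracy'])

print([r['id'] for r in search_runs])
print(best['id'])
```

The untagged record plays the role of the first experiment we logged: it lives in the same project but is left out of the filtered view, so only the search runs are compared.
"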
] }, { "cell_type": "code", "execution_count": 9, "id": "79e60347-e4f0-4e17-bf77-846277aa7f50", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "experiments:\n", "\tid: a75b1258-2276-4eb1-beb5-caf83e9aacf3, parameters: [('strategy', 'mean'), ('n_neighbors', 5)], metrics: [('accuracy', 0.7307692307692307)]\n", "\tid: 02a89318-b8d9-49a5-9337-7e4368cc54da, parameters: [('strategy', 'mean'), ('n_neighbors', 10)], metrics: [('accuracy', 0.75)]\n", "\tid: ce24eeef-4686-4fc7-8c0a-e73d6c9cdb71, parameters: [('strategy', 'mean'), ('n_neighbors', 15)], metrics: [('accuracy', 0.7596153846153846)]\n", "\tid: 093a9d02-89f7-4e48-82b1-f9ade435ef03, parameters: [('strategy', 'mean'), ('n_neighbors', 20)], metrics: [('accuracy', 0.7211538461538461)]\n", "\tid: bc4d0503-32d1-4a11-8222-4151dae893cf, parameters: [('strategy', 'median'), ('n_neighbors', 5)], metrics: [('accuracy', 0.7211538461538461)]\n", "\tid: c1b6cb3a-0ad1-4932-914d-ba53a054891b, parameters: [('strategy', 'median'), ('n_neighbors', 10)], metrics: [('accuracy', 0.7403846153846154)]\n", "\tid: 9d6ffe67-088d-483f-9d3f-8f0fb34c22e8, parameters: [('strategy', 'median'), ('n_neighbors', 15)], metrics: [('accuracy', 0.7596153846153846)]\n", "\tid: f497245a-6149-4604-9ceb-da74ae9855d4, parameters: [('strategy', 'median'), ('n_neighbors', 20)], metrics: [('accuracy', 0.7211538461538461)]\n", "\tid: b2cd8067-ad4c-4ed5-87f7-2cd4536b2c73, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 5)], metrics: [('accuracy', 0.7211538461538461)]\n", "\tid: c4277327-381a-4885-aba4-a07c050463a5, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 10)], metrics: [('accuracy', 0.75)]\n", "\tid: d4ea2fe7-061e-4f5e-8958-e6ac29025708, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 15)], metrics: [('accuracy', 0.7596153846153846)]\n", "\tid: d9fe2005-824c-4e23-9809-e0459e57d78a, parameters: [('strategy', 'most_frequent'), ('n_neighbors', 20)], metrics: [('accuracy', 
0.7211538461538461)]\n" ] } ], "source": [ "print(\"experiments:\")\n", "for experiment in project.experiments(tags=[\"parameter search\"]):\n", " print(\n", " f\"\\tid: {experiment.id}, \"\n", " f\"parameters: {[(p.name, p.value) for p in experiment.parameters()]}, \"\n", " f\"metrics: {[(m.name, m.value) for m in experiment.metrics()]}\"\n", " )" ] }, { "cell_type": "markdown", "id": "2fd08a67-3bf5-48f3-a40b-18324f7da1f6", "metadata": {}, "source": [ "``rubicon_ml`` can log more complex data as well. Below we'll log our trained model as an artifact (generic binary) and a\n", "confusion matrix explaining the results as a dataframe (accepts both ``pandas`` and ``dask`` dataframes natively)." ] }, { "cell_type": "code", "execution_count": 10, "id": "2240e03d-b8a6-4c40-a570-512873ff7277", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNeighborsClassifier(n_neighbors=20)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>Adelie</th><th>Chinstrap</th><th>Gentoo</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>Adelie</th><td>37</td><td>0</td><td>3</td></tr>\n",
"    <tr><th>Chinstrap</th><td>19</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>Gentoo</th><td>6</td><td>0</td><td>38</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "           Adelie  Chinstrap  Gentoo\n", "Adelie         37          0       3\n", "Chinstrap      19          0       1\n", "Gentoo          6          0      38" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.metrics import confusion_matrix\n", "\n", "experiment = project.experiments(tags=[\"parameter search\"])[-1]\n", "\n", "# Use the public accessor for the classifier step rather than a private attribute\n", "trained_model = pipeline.named_steps[\"kn\"]\n", "experiment.log_artifact(data_object=trained_model, name=\"trained model\")\n", "\n", "# LabelEncoder assigns integer codes in sorted order of the class names, and\n", "# confusion_matrix orders rows/columns by sorted label, so sort the names to match\n", "class_names = sorted(target_values)\n", "\n", "y_pred = pipeline.predict(X_test)\n", "confusion_matrix_df = pd.DataFrame(\n", "    confusion_matrix(y_test, y_pred),\n", "    columns=class_names,\n", "    index=class_names,\n", ")\n", "experiment.log_dataframe(confusion_matrix_df, name=\"confusion matrix\")\n", "\n", "print(experiment.artifact(name=\"trained model\").get_data(unpickle=True))\n", "experiment.dataframe(name=\"confusion matrix\").get_data()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }