View this notebook on GitHub or run it yourself on Binder!


Integrate with Scikit-learn#

This example shows how you can integrate rubicon_ml into your Scikit-learn pipelines to enable automatic logging of parameters and metrics as you fit and score your models!

Simple pipeline run#

Using the RubiconPipeline class, you can set up an enhanced Scikit-learn pipeline with automated logging.

[1]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from rubicon_ml import Rubicon
from rubicon_ml.sklearn import RubiconPipeline


rubicon = Rubicon(persistence="memory")
project = rubicon.get_or_create_project("Rubicon Pipeline Example")

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = RubiconPipeline(
    project,
    [('scaler', StandardScaler()), ('svc', SVC())],
)
print(pipe)
RubiconPipeline(project=<rubicon_ml.client.project.Project object at 0x162eb6650>,
                steps=[('scaler', StandardScaler()), ('svc', SVC())])
[2]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
[2]:
0.88

During the pipeline run, an experiment was automatically created, and the corresponding parameters and metrics were logged to it. Afterwards, you can use the rubicon_ml library to pull these experiments back or view them by running the dashboard.

[3]:
for experiment in project.experiments():
    parameters = [(p.name, p.value) for p in experiment.parameters()]
    metrics = [(m.name, m.value) for m in experiment.metrics()]

    print(
        f"experiment {experiment.id}\n"
        f"parameters: {parameters}\nmetrics: {metrics}"
    )
experiment eea66305-73a0-453d-9177-cc8ccee5d7fb
parameters: [('scaler__copy', True), ('scaler__with_mean', True), ('scaler__with_std', True), ('svc__C', 1.0), ('svc__break_ties', False), ('svc__cache_size', 200), ('svc__class_weight', None), ('svc__coef0', 0.0), ('svc__decision_function_shape', 'ovr'), ('svc__degree', 3), ('svc__gamma', 'scale'), ('svc__kernel', 'rbf'), ('svc__max_iter', -1), ('svc__probability', False), ('svc__random_state', None), ('svc__shrinking', True), ('svc__tol', 0.001), ('svc__verbose', False)]
metrics: [('score', 0.88)]

By default, rubicon_ml’s logging is verbose: it captures every parameter passed to every stage of your pipeline. In the next example, we’ll see how to target our logging more precisely.

A more realistic example using GridSearch#

GridSearch is commonly used to test many different parameters across an estimator or pipeline in the hope of finding the optimal parameter set. The RubiconPipeline fits the Scikit-learn estimator specification, so it can be passed to Scikit-learn’s GridSearchCV to automatically log each set of parameters tried during the grid search to an individual experiment. Then, all of these experiments can be explored with the dashboard!

This example is adapted from this Scikit-learn example.

[4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV


categories = ["alt.atheism", "talk.religion.misc"]
data = fetch_20newsgroups(subset='train', categories=categories)

You can pass user-defined loggers to the RubiconPipeline to have more control over exactly which parameters are logged for specific estimators. For example, you can use the FilterEstimatorLogger class to select or ignore parameters on any estimator.

[5]:
import os

from rubicon_ml import Rubicon
from rubicon_ml.sklearn import FilterEstimatorLogger, RubiconPipeline


root_dir = os.environ.get("RUBICON_ROOT", "rubicon-root")
root_path = f"{os.path.dirname(os.getcwd())}/{root_dir}"

rubicon = Rubicon(persistence="filesystem", root_dir=root_path)
project = rubicon.get_or_create_project("Grid Search")

pipeline = RubiconPipeline(
    project,
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ],
    user_defined_loggers={
        "vect": FilterEstimatorLogger(select=["max_df"]),
        "tfidf": FilterEstimatorLogger(ignore_all=True),
        "clf": FilterEstimatorLogger(select=["max_iter", "alpha", "penalty"]),
    },
    experiment_kwargs={
        "name": "logged from a RubiconPipeline",
        "model_name": SGDClassifier.__name__,
    },
)

Let’s define a parameter grid and run some experiments!

[6]:
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (10, 20),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, refit=False)
grid_search.fit(data.data, data.target)

print(grid_search)
GridSearchCV(cv=2,
             estimator=RubiconPipeline(experiment_kwargs={'model_name': 'SGDClassifier',
                                                          'name': 'logged from '
                                                                  'a '
                                                                  'RubiconPipeline'},
                                       project=<rubicon_ml.client.project.Project object at 0x162fcf0d0>,
                                       steps=[('vect', CountVectorizer()),
                                              ('tfidf', TfidfTransformer()),
                                              ('clf', SGDClassifier())],
                                       user_defined_loggers={'clf': <rubicon_ml.sklearn.filter_estimator_l...
                                                             'tfidf': <rubicon_ml.sklearn.filter_estimator_logger.FilterEstimatorLogger object at 0x162f4d650>,
                                                             'vect': <rubicon_ml.sklearn.filter_estimator_logger.FilterEstimatorLogger object at 0x162c1b150>}),
             n_jobs=-1,
             param_grid={'clf__alpha': (1e-05, 1e-06),
                         'clf__max_iter': (10, 20),
                         'clf__penalty': ('l2', 'elasticnet'),
                         'vect__max_df': (0.5, 0.75, 1.0),
                         'vect__ngram_range': ((1, 1), (1, 2))},
             refit=False)
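
Each parameter set the grid search tried was logged to its own experiment, so the project now reflects the full sweep. As a quick check, we can count the logged experiments (a sketch; the exact count depends on the size of the parameter grid and the number of cross-validation folds):

[ ]:
# each fit the grid search performed logged its own experiment,
# so the count reflects the grid size and the CV folds
print(len(project.experiments()))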

Fetching the best parameters from the GridSearchCV object involves digging into the object’s properties and doesn’t easily paint a full picture of our experimentation.

[7]:
print(f"Best score: {grid_search.best_score_}")
full_results = grid_search.cv_results_
Best score: 0.9276708493998214
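
For instance, recovering the best parameter set without a refit means indexing into the parallel arrays of cv_results_ by rank. Here’s a minimal sketch using the full_results dict from the previous cell:

[ ]:
import numpy as np

# cv_results_ is a dict of parallel arrays; rank 1 marks the best candidate
best_index = np.argmin(full_results["rank_test_score"])

print(f"best parameters: {full_results['params'][best_index]}")
print(f"mean test score: {full_results['mean_test_score'][best_index]:.4f}")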

With rubicon_ml’s dashboard, we can view all of the experiments and easily compare them!

[8]:
from rubicon_ml.viz import Dashboard


Dashboard(project.experiments()).serve(in_background=True)
Dash is running on http://127.0.0.1:8050/

 * Serving Flask app 'rubicon_ml.viz.base'
 * Debug mode: off
[8]:
'http://localhost:8050'
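
If you’d rather stay in code, you can make the same comparison by querying the logged experiments directly. Here’s a sketch that ranks experiments by their logged "score" metric (the metric name logged by the pipeline’s score in the first example):

[ ]:
def best_score(experiment):
    # experiments that were scored have a "score" metric logged to them;
    # unscored experiments sort to the bottom
    scores = [m.value for m in experiment.metrics() if m.name == "score"]
    return max(scores) if scores else float("-inf")

best_experiment = max(project.experiments(), key=best_score)
print(f"experiment {best_experiment.id}: score {best_score(best_experiment)}")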

Hiding Warnings in RubiconPipeline#

RubiconPipeline has an ignore_warnings attribute (False by default) that, when set to True, hides warnings generated by its fit(), score(), and score_samples() methods. If you wish to see warnings again in future fits and scores, simply set ignore_warnings back to False.

Here, we instantiate a pipeline that ignores warnings.

[9]:
pipe_toggle_warnings = RubiconPipeline(
    project,
    [('scaler', StandardScaler()), ('svc', SVC())],
    ignore_warnings=True,
)

Warnings can be turned back on by setting the ignore_warnings attribute to False.

[10]:
pipe_toggle_warnings.ignore_warnings = False
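
Subsequent calls to fit() and score() on this pipeline will surface warnings as usual. For example, reusing the data from the first example:

[ ]:
# warnings raised during these calls are no longer suppressed
pipe_toggle_warnings.fit(X_train, y_train)
pipe_toggle_warnings.score(X_test, y_test)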