Integrate with Scikit-learn
This example shows how you can integrate rubicon_ml into your Scikit-learn pipelines to enable automatic logging of parameters and metrics as you fit and score your models!
Simple pipeline run
Using the RubiconPipeline class, you can set up an enhanced Scikit-learn pipeline with automated logging.
[1]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from rubicon_ml import Rubicon
from rubicon_ml.sklearn import RubiconPipeline
rubicon = Rubicon(persistence="memory")
project = rubicon.get_or_create_project("Rubicon Pipeline Example")
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = RubiconPipeline(
    project,
    [('scaler', StandardScaler()), ('svc', SVC())],
)
print(pipe)
RubiconPipeline(project=<rubicon_ml.client.project.Project object at 0x162eb6650>,
steps=[('scaler', StandardScaler()), ('svc', SVC())])
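Since RubiconPipeline is designed to fit the standard Scikit-learn estimator specification, the familiar Pipeline accessors should still apply. A quick sketch, assuming it behaves like a plain Pipeline here:

# assumes RubiconPipeline exposes the standard Pipeline API, e.g. named_steps
print(pipe.named_steps["svc"])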
[2]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
[2]:
0.88
During the pipeline run, an experiment was automatically created, and the corresponding parameters and metrics were logged to it. Afterwards, you can use the rubicon_ml library to pull these experiments back or view them by running the dashboard.
[3]:
for experiment in project.experiments():
    parameters = [(p.name, p.value) for p in experiment.parameters()]
    metrics = [(m.name, m.value) for m in experiment.metrics()]
    print(
        f"experiment {experiment.id}\n"
        f"parameters: {parameters}\nmetrics: {metrics}"
    )
experiment eea66305-73a0-453d-9177-cc8ccee5d7fb
parameters: [('scaler__copy', True), ('scaler__with_mean', True), ('scaler__with_std', True), ('svc__C', 1.0), ('svc__break_ties', False), ('svc__cache_size', 200), ('svc__class_weight', None), ('svc__coef0', 0.0), ('svc__decision_function_shape', 'ovr'), ('svc__degree', 3), ('svc__gamma', 'scale'), ('svc__kernel', 'rbf'), ('svc__max_iter', -1), ('svc__probability', False), ('svc__random_state', None), ('svc__shrinking', True), ('svc__tol', 0.001), ('svc__verbose', False)]
metrics: [('score', 0.88)]
By default, rubicon_ml’s logging is very verbose. It captures each parameter passed to each stage of your pipeline. In the next example, we’ll see how to target our logging a bit more.
A more realistic example using GridSearch
GridSearch is commonly used to test many different parameters across an estimator or pipeline in the hopes of finding the optimal parameter set. The RubiconPipeline fits the Scikit-learn estimator specification, so it can be passed to Scikit-learn’s GridSearchCV to automatically log each set of parameters tried in the grid search to an individual experiment. Then, all of these experiments can be explored with the dashboard!
This example is adapted from this Scikit-learn example.
[4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
categories = ["alt.atheism", "talk.religion.misc"]
data = fetch_20newsgroups(subset='train', categories=categories)
You can pass user-defined loggers to the RubiconPipeline to have more control over exactly which parameters are logged for specific estimators. For example, you can use the FilterEstimatorLogger class to select or ignore parameters on any estimator.
[5]:
import os
from rubicon_ml import Rubicon
from rubicon_ml.sklearn import FilterEstimatorLogger, RubiconPipeline
root_dir = os.environ.get("RUBICON_ROOT", "rubicon-root")
root_path = f"{os.path.dirname(os.getcwd())}/{root_dir}"
rubicon = Rubicon(persistence="filesystem", root_dir=root_path)
project = rubicon.get_or_create_project("Grid Search")
pipeline = RubiconPipeline(
    project,
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ],
    user_defined_loggers={
        "vect": FilterEstimatorLogger(select=["max_df"]),
        "tfidf": FilterEstimatorLogger(ignore_all=True),
        "clf": FilterEstimatorLogger(select=["max_iter", "alpha", "penalty"]),
    },
    experiment_kwargs={
        "name": "logged from a RubiconPipeline",
        "model_name": SGDClassifier.__name__,
    },
)
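The snippet below is a hedged sketch of a third filtering style: it assumes FilterEstimatorLogger also accepts an ignore list as the counterpart to select, which isn’t demonstrated above.

# hypothetical: log all CountVectorizer parameters except a couple of noisy ones;
# assumes FilterEstimatorLogger accepts an `ignore` list (not shown above)
vect_logger = FilterEstimatorLogger(ignore=["input", "encoding"])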
Let’s define a parameter grid and run some experiments!
[6]:
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (10, 20),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}
grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, refit=False)
grid_search.fit(data.data, data.target)
print(grid_search)
GridSearchCV(cv=2,
estimator=RubiconPipeline(experiment_kwargs={'model_name': 'SGDClassifier',
'name': 'logged from '
'a '
'RubiconPipeline'},
project=<rubicon_ml.client.project.Project object at 0x162fcf0d0>,
steps=[('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier())],
user_defined_loggers={'clf': <rubicon_ml.sklearn.filter_estimator_l...
'tfidf': <rubicon_ml.sklearn.filter_estimator_logger.FilterEstimatorLogger object at 0x162f4d650>,
'vect': <rubicon_ml.sklearn.filter_estimator_logger.FilterEstimatorLogger object at 0x162c1b150>}),
n_jobs=-1,
param_grid={'clf__alpha': (1e-05, 1e-06),
'clf__max_iter': (10, 20),
'clf__penalty': ('l2', 'elasticnet'),
'vect__max_df': (0.5, 0.75, 1.0),
'vect__ngram_range': ((1, 1), (1, 2))},
refit=False)
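Every fit of the pipeline inside the grid search logged its own experiment to the project. With 48 parameter combinations and 2 cross-validation folds, that should be on the order of 96 experiments — a quick sanity check, using only the experiments() accessor shown earlier:

# one experiment is logged per pipeline fit (parameter combination x CV fold)
print(len(project.experiments()))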
Fetching the best parameters from the GridSearchCV object involves digging into the object’s properties and doesn’t easily paint a full picture of our experimentation.
[7]:
print(f"Best score: {grid_search.best_score_}")
full_results = grid_search.cv_results_
Best score: 0.9276708493998214
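For instance, recovering the winning parameter set means indexing into cv_results_ by hand. A minimal sketch using standard GridSearchCV attributes (with a single scoring metric, best_index_ is available even though refit=False):

# the parameter combination behind best_score_, dug out of cv_results_
best_params = grid_search.cv_results_["params"][grid_search.best_index_]
print(best_params)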
With rubicon_ml’s dashboard, we can view all of the experiments and easily compare them!
[8]:
from rubicon_ml.viz import Dashboard
Dashboard(project.experiments()).serve(in_background=True)
Dash is running on http://127.0.0.1:8050/
* Serving Flask app 'rubicon_ml.viz.base'
* Debug mode: off
[8]:
'http://localhost:8050'
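The same accessors from earlier also make a quick programmatic comparison possible. A sketch that ranks the logged experiments by their "score" metric, assuming each experiment logged exactly one metric with that name, as in the output above:

# hedged sketch: rank experiments by their logged "score" metric
def score_of(experiment):
    return [m.value for m in experiment.metrics() if m.name == "score"][0]

top_experiments = sorted(project.experiments(), key=score_of, reverse=True)[:3]
for experiment in top_experiments:
    print(experiment.id, score_of(experiment))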
Hiding Warnings in RubiconPipeline
RubiconPipeline has an ignore_warnings attribute (default False) that, when set to True, hides warnings generated by its fit(), score(), and score_samples() methods. If you wish to see warnings again in future fits and scores, simply set the ignore_warnings attribute back to False.
Here we are instantiating a pipeline that ignores warnings.
[9]:
pipe_toggle_warnings = RubiconPipeline(
    project,
    [('scaler', StandardScaler()), ('svc', SVC())],
    ignore_warnings=True,
)
Warnings can be turned back on by setting the ignore_warnings attribute to False.
[10]:
pipe_toggle_warnings.ignore_warnings = False
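Subsequent calls will surface warnings as usual. A quick usage sketch, reusing the training data from the first example:

# with ignore_warnings back to False, warnings raised inside fit() and
# score() are no longer suppressed
pipe_toggle_warnings.fit(X_train, y_train)
pipe_toggle_warnings.score(X_test, y_test)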