Logging Training Metadata

We can’t train a model without a lot of data, and keeping track of where that data lives and how to get it can be difficult. rubicon_ml isn’t in the business of storing full training datasets, but it can store metadata about them on both projects (for high-level datasource configuration) and experiments (for individual model runs).

Below, we’ll use rubicon_ml to reference a dataset stored in S3.

[1]:
s3_config = {
    "region_name": "us-west-2",
    "signature_version": "v4",
    "retries": {
        "max_attempts": 10,
        "mode": "standard",
    }
}

bucket_name = "my-bucket"
key = "path/to/my/data.parquet"

We could use the following function to pull training data down from S3 to a local file.

Note: We’re reading the user’s account credentials from an external source rather than exposing them in the s3_config we created. rubicon_ml is not intended for storing secrets.
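
For instance, boto3’s default credential chain will pick credentials up from ~/.aws/credentials or from environment variables on its own. A minimal sketch (the values here are placeholders, never real keys):

import os

# boto3 reads these automatically; they never need to appear in
# anything we log with rubicon_ml
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"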

[2]:
def read_from_s3(config, bucket, key, local_output_path):
    import boto3
    from botocore.config import Config

    config = Config(**config)

    # assuming credentials are correct in `~/.aws` or set in environment variables
    client = boto3.client("s3", config=config)

    with open(local_output_path, "wb") as f:
        client.download_fileobj(bucket, key, f)

But we don’t actually need to reach out to S3 for this example, so we’ll use a no-op.

[3]:
def read_from_s3(config, bucket, key, local_output_path):
    return None

Let’s create a project for the experiments we’ll run in this example. We’ll use in-memory persistence so we don’t need to clean up after ourselves when we’re done!

[4]:
from rubicon_ml import Rubicon


rubicon = Rubicon(persistence="memory")
project = rubicon.get_or_create_project("Storing Training Metadata")

project
[4]:
<rubicon_ml.client.project.Project at 0x1065a1be0>
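
If we wanted this project and its metadata to stick around between sessions, we could use filesystem persistence instead. A minimal sketch (the root_dir path is an arbitrary choice):

# writes projects and experiments under `root_dir` instead of keeping them in memory
rubicon_fs = Rubicon(persistence="filesystem", root_dir="./rubicon-root")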

Experiment-level training metadata

Before we create an experiment, we’ll construct some training metadata to pass along so future collaborators, reviewers, or even future us can reference the same training dataset later.

[5]:
training_metadata = (s3_config, bucket_name, key)

experiment = project.log_experiment(
    training_metadata=training_metadata,
    tags=["S3", "training metadata"]
)
# then run the experiment and log everything to rubicon!

experiment.training_metadata
[5]:
({'region_name': 'us-west-2',
  'signature_version': 'v4',
  'retries': {'max_attempts': 10, 'mode': 'standard'}},
 'my-bucket',
 'path/to/my/data.parquet')

We can come back any time and use the experiment’s training metadata to pull the same dataset.

[6]:
experiment = project.experiments(tags=["S3", "training metadata"], qtype="and")[0]

training_metadata = experiment.training_metadata

read_from_s3(
    training_metadata[0],
    training_metadata[1],
    training_metadata[2],
    "./local_output.parquet",
)
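
Since the training metadata is just a tuple, the same call can also be written with argument unpacking:

read_from_s3(*training_metadata, "./local_output.parquet")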

If we’re referencing multiple keys within the same bucket, we can log a list of training metadata tuples.

[7]:
training_metadata = [
    (s3_config, bucket_name, "path/to/my/data_0.parquet"),
    (s3_config, bucket_name, "path/to/my/data_1.parquet"),
    (s3_config, bucket_name, "path/to/my/data_2.parquet"),
]

experiment = project.log_experiment(training_metadata=training_metadata)
experiment.training_metadata
[7]:
[({'region_name': 'us-west-2',
   'signature_version': 'v4',
   'retries': {'max_attempts': 10, 'mode': 'standard'}},
  'my-bucket',
  'path/to/my/data_0.parquet'),
 ({'region_name': 'us-west-2',
   'signature_version': 'v4',
   'retries': {'max_attempts': 10, 'mode': 'standard'}},
  'my-bucket',
  'path/to/my/data_1.parquet'),
 ({'region_name': 'us-west-2',
   'signature_version': 'v4',
   'retries': {'max_attempts': 10, 'mode': 'standard'}},
  'my-bucket',
  'path/to/my/data_2.parquet')]

training_metadata is simply a tuple or a list of tuples, so we can decide how best to structure our metadata. Here, the config and bucket are the same for every key, so there’s no need to duplicate them!

[8]:
training_metadata = (
    s3_config,
    bucket_name,
    [
        "path/to/my/data_0.parquet",
        "path/to/my/data_1.parquet",
        "path/to/my/data_2.parquet",
    ],
)

experiment = project.log_experiment(training_metadata=training_metadata)
experiment.training_metadata
[8]:
({'region_name': 'us-west-2',
  'signature_version': 'v4',
  'retries': {'max_attempts': 10, 'mode': 'standard'}},
 'my-bucket',
 ['path/to/my/data_0.parquet',
  'path/to/my/data_1.parquet',
  'path/to/my/data_2.parquet'])
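
To read the data back from this deduplicated structure, we can unpack it and loop over the keys. A short sketch reusing the read_from_s3 helper from above (the local filenames are arbitrary):

config, bucket, keys = experiment.training_metadata

for i, key in enumerate(keys):
    read_from_s3(config, bucket, key, f"./local_output_{i}.parquet")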

Since it’s all just tuples under the hood, we can even use a namedtuple to give whatever structure we settle on self-documenting field names.

[9]:
from collections import namedtuple


S3TrainingMetadata = namedtuple("S3TrainingMetadata", "config bucket keys")

training_metadata = S3TrainingMetadata(
    s3_config,
    bucket_name,
    [
        "path/to/my/data_0.parquet",
        "path/to/my/data_1.parquet",
        "path/to/my/data_2.parquet",
    ],
)

experiment = project.log_experiment(training_metadata=training_metadata)
experiment.training_metadata
[9]:
S3TrainingMetadata(config={'region_name': 'us-west-2', 'signature_version': 'v4', 'retries': {'max_attempts': 10, 'mode': 'standard'}}, bucket='my-bucket', keys=['path/to/my/data_0.parquet', 'path/to/my/data_1.parquet', 'path/to/my/data_2.parquet'])
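
Depending on the persistence layer, the namedtuple may be handed back as a plain tuple when we read it later. If so, it’s easy to rebuild (we’ll use this same pattern in the next section):

# rebuild the namedtuple from the stored value
training_metadata = S3TrainingMetadata(*experiment.training_metadata)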

Projects for complex training metadata

Each experiment on the S3 Training Metadata project below uses the same config to connect to S3, so there’s no need to duplicate it on every experiment. We’ll log it once, to the project. Then we’ll run three experiments, each using a different key to load data from S3. We can represent that per-experiment training metadata with a second namedtuple and log one to each experiment.

[10]:
S3Config = namedtuple("S3Config", "region_name signature_version retries")
S3DatasetMetadata = namedtuple("S3DatasetMetadata", "bucket key")

project = rubicon.get_or_create_project(
    "S3 Training Metadata",
    training_metadata=S3Config(**s3_config),
)

for key in [
    "path/to/my/data_0.parquet",
    "path/to/my/data_1.parquet",
    "path/to/my/data_2.parquet",
]:
    experiment = project.log_experiment(
        training_metadata=S3DatasetMetadata(bucket=bucket_name, key=key)
    )
    # then run the experiment and log everything to rubicon!

Later, we can use the project and experiments to reconnect to the same datasets! (Note that experiments() doesn’t necessarily return experiments in the order they were logged, as the output below shows.)

[11]:
project = rubicon.get_project("S3 Training Metadata")
s3_config = S3Config(*project.training_metadata)

print(s3_config)

for experiment in project.experiments():
    s3_dataset_metadata = S3DatasetMetadata(*experiment.training_metadata)

    print(s3_dataset_metadata)

    training_data = read_from_s3(
        s3_config._asdict(),
        s3_dataset_metadata.bucket,
        s3_dataset_metadata.key,
        "./local_output.parquet"
    )
S3Config(region_name='us-west-2', signature_version='v4', retries={'max_attempts': 10, 'mode': 'standard'})
S3DatasetMetadata(bucket='my-bucket', key='path/to/my/data_2.parquet')
S3DatasetMetadata(bucket='my-bucket', key='path/to/my/data_0.parquet')
S3DatasetMetadata(bucket='my-bucket', key='path/to/my/data_1.parquet')