Common Recipes

Let's go over some common use cases you may have when interacting with locopy.

I have a file I want to load to Redshift

This is probably the most common use case. What you need to start:

  • Read/Write S3 access

  • Read/Write Redshift Access

There are a couple of ways you can authenticate with Redshift. If you are running this process on a regular basis, it might make sense to use a YAML file, stored in a secure place, to hold most of the connection details. You will also need valid AWS tokens in order to access the appropriate AWS resources (the S3 bucket).
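
The YAML file itself is just a set of connection parameters handed through to your DB adapter's connect() call. A minimal sketch of what example.yaml might contain; the key names below are placeholders, and should match whatever your adapter (e.g. pg8000) expects:

# example.yaml -- values are passed through to the dbapi connect() call
host: my.redshift.cluster.example.com
port: 5439
database: my_database
user: my_user
password: my_password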

Assuming you have your YAML file set up, the following will occur:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
    redshift.load_and_copy(
        local_file="some_data_to_load.txt",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use",
        table_name="redshift_table_to_load",
        delim=",")

  • some_data_to_load.txt will be compressed (gzip) and the original file deleted

  • some_data_to_load.txt will be uploaded to the S3 bucket s3_bucket_to_use at the location s3_folder_to_use/s3_subfolder_to_use

  • A COPY command will be executed to load the contents of some_data_to_load.txt into redshift_table_to_load using , as the delimiter (for additional COPY options, see the sketch below)
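
If your file has a header row or otherwise needs extra Redshift COPY options, recent versions of load_and_copy also accept a copy_options argument (a list of strings appended to the COPY statement). If yours does, a short sketch for a file whose first line is a header might look like:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
    redshift.load_and_copy(
        local_file="some_data_to_load.txt",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use",
        table_name="redshift_table_to_load",
        delim=",",
        copy_options=["IGNOREHEADER AS 1"])  # skip the header row during COPY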

I’d like to split my file into n files

You can tweak the above code to include the splits option, which takes an integer:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
    redshift.load_and_copy(
        local_file="some_data_to_load.txt",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use",
        table_name="redshift_table_to_load",
        delim=",",
        splits=10)

splits is really useful when you have a very large file to load. Ideally this number will equal the number of slices available on your Redshift cluster; one way to look up the slice count is sketched below.
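
If you are unsure how many slices your cluster has, you can query the STV_SLICES system view. A minimal sketch, assuming your user can read the system tables and that your locopy version exposes execute() and the underlying cursor attribute:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
    # STV_SLICES contains one row per slice in the cluster
    redshift.execute("SELECT COUNT(*) FROM stv_slices")
    num_slices = redshift.cursor.fetchone()[0]

    # Reuse the slice count as the number of file splits
    redshift.load_and_copy(
        local_file="some_data_to_load.txt",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use",
        table_name="redshift_table_to_load",
        delim=",",
        splits=num_slices)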

I don’t want to compress my files

By default, compression is always on. If you'd like to turn it off, you can set compress=False:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
    redshift.load_and_copy(
        local_file="some_data_to_load.txt",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use",
        table_name="redshift_table_to_load",
        delim=",",
        compress=False)

Depending on your requirements it might make sense to turn compression off or leave it on. Combining this with splits can drastically alter the performance of the COPY command; a sketch combining the two options follows.
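
For example, a sketch that uploads a large file uncompressed and split into ten pieces (same placeholder names as above):

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
    redshift.load_and_copy(
        local_file="some_data_to_load.txt",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use",
        table_name="redshift_table_to_load",
        delim=",",
        compress=False,  # upload the raw splits without gzip
        splits=10)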

I want to export some data from Redshift to a local CSV

Data can be exported from Redshift to a CSV by supplying an export_path to locopy.Redshift.unload_and_copy():

import pg8000
import locopy

my_profile = "some_profile_with_valid_tokens"
with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="s3_bucket_to_use",
        export_path="output.csv")

Or a pipe-delimited file:

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="s3_bucket_to_use",
        export_path="output.tsv",
        delimiter="|")

Note

If your bucket has previously unloaded files, you may get an error when unloading. If you don't want to delete the older files, you can unload your data to a new folder or with a different prefix by specifying the s3_folder parameter. You can specify a folder to write to, e.g. s3_folder=s3_folder_to_use/, or create a unique filename prefix by omitting the trailing /, e.g. s3_folder=unique_file_prefix.

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/",
        export_path="output.csv")
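
One way to get a fresh prefix on every run is to build it from a timestamp. The prefix name below is arbitrary; the only thing that matters is that it has no trailing /:

import pg8000
import locopy
from datetime import datetime

my_profile = "some_profile_with_valid_tokens"

# e.g. "unload_20240101_120000" -- acts as a filename prefix, not a folder
unique_prefix = "unload_" + datetime.now().strftime("%Y%m%d_%H%M%S")

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="s3_bucket_to_use",
        s3_folder=unique_prefix,
        export_path="output.csv")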

I want to backup my table to S3

To simply export data to S3 and do nothing else, omit the export_path option so that the file is not downloaded, and set delete_s3_after=False to prevent the S3 files from being automatically deleted after the run.

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use/",
        delete_s3_after=False) # defaults to True

By default, Redshift unloads data to multiple files in S3 for performance reasons. The maximum size of a data file is 6.2 GB; if the data size exceeds that, UNLOAD creates additional files of up to 6.2 GB each. If you want the backup written as a single file (keeping in mind the 6.2 GB per-file limit still applies), you can turn parallelism off:

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="s3_bucket_to_use",
        s3_folder="s3_folder_to_use/s3_subfolder_to_use/",
        delete_s3_after=False,
        parallel_off=True)