Common Recipes
==============

Let's go over some common use cases people have when interacting with ``locopy``.

I have a file I want to load to Redshift
----------------------------------------

This is probably the most common use case. What you need to start:

- Read/Write S3 access
- Read/Write Redshift access

There are a couple of ways you can authenticate with Redshift. If you are running this
process on a regular basis, it might make sense to use a ``YAML`` file to store (in a
secure place) most of the connection details. You will also need valid AWS tokens in
order to access the appropriate AWS resources (the S3 bucket).
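A minimal ``example.yaml`` might look like the sketch below. locopy hands these values
to the DB API driver you pass in, so the exact key names (e.g. ``database`` vs.
``dbname``) depend on that driver; the keys shown here assume ``pg8000`` and the values
are placeholders.

.. code-block:: yaml

    # example.yaml -- connection details handed to the DB API driver (pg8000 assumed)
    host: my.redshift.cluster.com
    port: 5439
    database: mydatabase
    user: myusername
    password: mypassword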
Assuming you have your ``YAML`` file set up, the following will occur:

.. code-block:: python

    import pg8000
    import locopy

    with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
        redshift.load_and_copy(
            local_file="some_data_to_load.txt",
            s3_bucket="s3_bucket_to_use",
            s3_folder="s3_folder_to_use/s3_subfolder_to_use",
            table_name="redshift_table_to_load",
            delim=",")

- ``some_data_to_load.txt`` will be compressed (gzip) and the original file deleted
- ``some_data_to_load.txt`` will be uploaded to the S3 bucket ``s3_bucket_to_use`` under
  ``s3_folder_to_use/s3_subfolder_to_use``
- A ``COPY`` command will be executed to load the contents of ``some_data_to_load.txt``
  into ``redshift_table_to_load`` using ``,`` as the delimiter

I'd like to split my file into *n* files
----------------------------------------

You can tweak the above code to include the ``splits`` option, which takes an integer:

.. code-block:: python

    import pg8000
    import locopy

    with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
        redshift.load_and_copy(
            local_file="some_data_to_load.txt",
            s3_bucket="s3_bucket_to_use",
            s3_folder="s3_folder_to_use/s3_subfolder_to_use",
            table_name="redshift_table_to_load",
            delim=",",
            splits=10)

``splits`` is really useful when you have a **very large** file to load. Ideally this
number will equal the number of slices available on your Redshift cluster.

I don't want to compress my files
---------------------------------

By default compression is always on. If you'd like to turn it off, you can set
``compress=False``:

.. code-block:: python

    import pg8000
    import locopy

    with locopy.Redshift(dbapi=pg8000, config_yaml="example.yaml") as redshift:
        redshift.load_and_copy(
            local_file="some_data_to_load.txt",
            s3_bucket="s3_bucket_to_use",
            s3_folder="s3_folder_to_use/s3_subfolder_to_use",
            table_name="redshift_table_to_load",
            delim=",",
            compress=False)

Depending on your requirements it might make sense to turn compression off or leave it
on. Combining this with ``splits`` can drastically alter the performance of the
``COPY`` command.

I want to export some data from Redshift to a local CSV
--------------------------------------------------------

Data can be exported from Redshift to a CSV by supplying an ``export_path`` to
``locopy.Redshift.unload_and_copy()``:

.. code-block:: python

    import pg8000
    import locopy

    my_profile = "some_profile_with_valid_tokens"
    with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
        redshift.unload_and_copy(
            query="SELECT * FROM schema.table",
            s3_bucket="s3_bucket_to_use",
            export_path="output.csv")

Or a pipe-delimited file:

.. code-block:: python

    with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
        redshift.unload_and_copy(
            query="SELECT * FROM schema.table",
            s3_bucket="s3_bucket_to_use",
            export_path="output.tsv",
            delimiter="|")

.. note::
    If your bucket has previously unloaded files, you may get an error when unloading.
    If you don't want to delete the older files, you can unload your data to a new
    folder or with a different prefix by specifying the ``s3_folder`` parameter. You
    can specify a folder to write to, i.e. ``s3_folder=s3_folder_to_use/``, or create a
    unique filename prefix by omitting the trailing ``/``, i.e.
    ``s3_folder=unique_file_prefix``.

    .. code-block:: python

        with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
            redshift.unload_and_copy(
                query="SELECT * FROM schema.table",
                s3_bucket="s3_bucket_to_use",
                s3_folder="s3_folder_to_use/",
                export_path="output.csv")

I want to back up my table to S3
--------------------------------

To simply export data to S3 and do nothing else, omit the ``export_path`` option so
that the file is not downloaded, and set ``delete_s3_after=False`` to prevent the S3
files from being automatically deleted after the run.

.. code-block:: python

    import pg8000
    import locopy

    with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
        redshift.unload_and_copy(
            query="SELECT * FROM schema.table",
            s3_bucket="s3_bucket_to_use",
            s3_folder="s3_folder_to_use/s3_subfolder_to_use/",
            delete_s3_after=False)  # defaults to True

By default, Redshift unloads data to multiple files in S3 for performance reasons. The
maximum size of a data file is 6.2 GB; if the data size is greater than that, ``UNLOAD``
creates additional files of up to 6.2 GB each. If you want to back the table up as a
single file, you can run:

.. code-block:: python

    with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
        redshift.unload_and_copy(
            query="SELECT * FROM schema.table",
            s3_bucket="s3_bucket_to_use",
            s3_folder="s3_folder_to_use/s3_subfolder_to_use/",
            delete_s3_after=False,
            parallel_off=True)