locopy.utility module

Utility Module.

Module which contains utility functions for use within the application.

class locopy.utility.ProgressPercentage(filename)[source]

Bases: object

ProgressPercentage class is used by the S3Transfer upload_file callback.

Please see the following url for more information: http://boto3.readthedocs.org/en/latest/reference/customizations/s3.html#ref-s3transfer-usage.
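As an illustration of the callback pattern described above, here is a minimal sketch of a progress-reporting callable. The class name `ProgressSketch` and its internals are hypothetical and modelled on the general shape of such callbacks; they are not taken from locopy's source.

```python
import os
import sys
import threading


class ProgressSketch:
    """Hypothetical stand-in for a transfer progress callback."""

    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        # Transfer callbacks may be invoked from several threads.
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # Called with the number of bytes transferred since the last call.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100 if self._size else 100.0
            sys.stdout.write(
                f"\r{self._filename}  {self._seen_so_far} / {self._size:.0f}  ({percentage:.2f}%)"
            )
            sys.stdout.flush()
```

An instance of such a class would typically be passed as the `Callback` argument of boto3's `upload_file`, which calls it periodically with the byte count of each transferred chunk.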

locopy.utility.compress_file(input_file, output_file)[source]

Compresses a file (gzip).

Parameters:
  • input_file (str) – Path to input file to compress

  • output_file (str) – Path to write the compressed file

locopy.utility.compress_file_list(file_list)[source]

Compresses a list of files (gzip) and cleans up the old files.

Parameters:

file_list (list) – List of strings with the file paths of the files to compress

Returns:

List of strings with the file paths of the compressed files (original file name with gz appended)

Return type:

list
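The behaviour described for the two compression helpers can be sketched with the standard library alone. The function names `gzip_file` and `gzip_file_list` are hypothetical, and the `.gz` suffix is an assumption about how "gz appended" is spelled; this is not locopy's implementation.

```python
import gzip
import os
import shutil


def gzip_file(input_file, output_file):
    """Stream input_file into a gzip archive at output_file."""
    with open(input_file, "rb") as src, gzip.open(output_file, "wb") as dst:
        shutil.copyfileobj(src, dst)


def gzip_file_list(file_list):
    """Compress each file, delete the original, and return the new paths."""
    compressed = []
    for path in file_list:
        target = f"{path}.gz"  # assumed naming: ".gz" appended to the original name
        gzip_file(path, target)
        os.remove(path)  # clean up the uncompressed original
        compressed.append(target)
    return compressed
```

Streaming with `shutil.copyfileobj` avoids reading the whole file into memory, which matters for large extracts.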

locopy.utility.concatenate_files(input_list, output_file, remove=True)[source]

Concatenate a list of files into one file.

Parameters:
  • input_list (list) – List of strings with the paths to input files to concatenate

  • output_file (str) – Path of the output file

  • remove (bool, optional) – Removes the files from the input list if True. Defaults to True

Raises:

LocopyConcatError – If input_list is empty or there is an issue while concatenating the files into one
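A minimal sketch of the concatenation behaviour, assuming that an empty input list is treated as an error and that I/O failures are wrapped in a single exception type. `ConcatError` and `concat_sketch` are hypothetical names standing in for the library's own.

```python
import os
import shutil


class ConcatError(Exception):
    """Hypothetical stand-in for LocopyConcatError."""


def concat_sketch(input_list, output_file, remove=True):
    """Append every file in input_list to output_file, in order."""
    if not input_list:
        raise ConcatError("input_list is empty")
    try:
        with open(output_file, "wb") as out:
            for path in input_list:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        if remove:
            # Only delete the inputs once everything has been written.
            for path in input_list:
                os.remove(path)
    except OSError as exc:
        raise ConcatError(f"could not concatenate files: {exc}") from exc
```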

locopy.utility.find_column_type(dataframe, warehouse_type: str)[source]
locopy.utility.find_column_type(dataframe: DataFrame, warehouse_type: str)

Find data type of each column from the dataframe.

locopy.utility.find_column_type_pandas(dataframe: DataFrame, warehouse_type: str)[source]

Find data type of each column from the dataframe.

Following is the list of pandas data types that the function checks and their mapping in sql:

  • bool/pd.BooleanDtype -> boolean

  • datetime64[ns, <tz>] -> timestamp

  • M8[ns] -> timestamp

  • int/pd.Int64Dtype -> int

  • float/pd.Float64Dtype -> float

  • float object -> float

  • datetime object -> timestamp

  • object/pd.StringDtype -> varchar

For all other data types, the column will be mapped to varchar type.

Parameters:
  • dataframe (Pandas dataframe)

  • warehouse_type (str) – Required to properly determine format of uploaded data, either “snowflake” or “redshift”.

Returns:

A dictionary of columns with their data type

Return type:

dict
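The dtype mapping listed above can be illustrated with a simplified sketch that works on dtype names (the strings produced by `str(dtype)`) rather than on a real dataframe. The function names are hypothetical. The sketch ignores `warehouse_type` and does not inspect the values of object columns, so the "float object" and "datetime object" cases simply fall through to varchar here.

```python
def sql_type_for_dtype(dtype_name):
    """Map a pandas dtype name, e.g. 'int64' or 'Float64', to a SQL type name."""
    name = dtype_name.lower()
    if name in ("bool", "boolean"):
        return "boolean"
    if name.startswith("datetime64"):  # covers timezone-aware and naive variants
        return "timestamp"
    if name.startswith("int"):
        return "int"
    if name.startswith("float"):
        return "float"
    # object, string, and anything unrecognised map to varchar in this sketch
    return "varchar"


def column_types_sketch(dtypes_by_column):
    """Return a dict of column name to SQL type for a dict of dtype names."""
    return {col: sql_type_for_dtype(d) for col, d in dtypes_by_column.items()}
```

For a real dataframe `df`, the input dictionary could be built as `{col: str(dtype) for col, dtype in df.dtypes.items()}`.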

locopy.utility.find_column_type_polars(dataframe: DataFrame, warehouse_type: str)[source]

Find data type of each column from the dataframe.

Following is the list of polars data types that the function checks and their mapping in sql:

  • Boolean -> boolean

  • Date/Datetime/Duration/Time -> timestamp

  • int -> int

  • float/decimal -> float

  • float object -> float

  • datetime object -> timestamp

  • others -> varchar

For all other data types, the column will be mapped to varchar type.

Parameters:
  • dataframe (Polars dataframe)

  • warehouse_type (str) – Required to properly determine format of uploaded data, either “snowflake” or “redshift”.

Returns:

A dictionary of columns with their data type

Return type:

dict

locopy.utility.get_ignoreheader_number(options)[source]

Return the number_rows from IGNOREHEADER [ AS ] number_rows.

This doesn’t validate that the AS is valid.

Parameters:

options (list of str) – Copy options that should be appended to the COPY statement.

Returns:

The number_rows from IGNOREHEADER [ AS ] number_rows

Return type:

int

Raises:

LocopyIgnoreHeaderError – If more than one IGNOREHEADER is found in the options
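One way to extract the row count is a small regular expression over the option strings. This is a sketch under assumptions: `IgnoreHeaderError` and `ignoreheader_sketch` are hypothetical names, returning 0 when no IGNOREHEADER option is present is an assumption, and the pattern is stricter about the optional AS than the description above implies.

```python
import re


class IgnoreHeaderError(Exception):
    """Hypothetical stand-in for LocopyIgnoreHeaderError."""


# Matches "IGNOREHEADER 1" and "IGNOREHEADER AS 1", case-insensitively.
_PATTERN = re.compile(r"^\s*IGNOREHEADER\s+(?:AS\s+)?(\d+)\s*$", re.IGNORECASE)


def ignoreheader_sketch(options):
    """Return number_rows from a single IGNOREHEADER option, if any."""
    matches = [m for m in (_PATTERN.match(opt) for opt in options) if m]
    if len(matches) > 1:
        raise IgnoreHeaderError("more than one IGNOREHEADER option found")
    if not matches:
        return 0  # assumption: no IGNOREHEADER option means no header rows
    return int(matches[0].group(1))
```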

locopy.utility.read_config_yaml(config_yaml)[source]

Read a configuration YAML file.

Populate the database connection attributes, and validate required ones.

Example:

host: my.redshift.cluster.com
port: 5439
dbname: db
user: userid
password: password

Parameters:

config_yaml (str or file pointer) – String representing the file location of the configuration file, or a pointer to an open file object

Returns:

A dictionary of parameters for setting up a connection to the database.

Return type:

dict

Raises:

CredentialsError – If any connection items are missing from the YAML file

locopy.utility.split_file(input_file, output_file, splits=1, ignore_header=0)[source]

Split a file into equal files by lines.

For example: myinputfile.txt will be split into myoutputfile.txt.01, myoutputfile.txt.02, etc.

Parameters:
  • input_file (str) – Path to input file to split

  • output_file (str) – Name of the output file

  • splits (int, optional) – Number of splits to perform. Must be greater than zero. Defaults to 1

  • ignore_header (int, optional) – If ignore_header is > 0 then that number of rows will be removed from the beginning of the files as they are split. Defaults to 0

Returns:

List of strings with the file paths of the split files

Return type:

list

Raises:

LocopySplitError – If splits is less than 1 or a processing error occurs while splitting
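A minimal sketch of line-based splitting, written under several assumptions: lines are dealt to the output files in round-robin order, the header rows are dropped from the start of the input, and the output suffix is a two-digit counter starting at 01 (to match the example names above). `SplitError` and `split_sketch` are hypothetical names.

```python
import itertools


class SplitError(Exception):
    """Hypothetical stand-in for LocopySplitError."""


def split_sketch(input_file, output_file, splits=1, ignore_header=0):
    """Distribute the lines of input_file across `splits` output files."""
    if not isinstance(splits, int) or splits < 1:
        raise SplitError("splits must be an integer greater than zero")
    # Assumed suffix format: output_file.01, output_file.02, ...
    names = [f"{output_file}.{i:02d}" for i in range(1, splits + 1)]
    outputs = [open(name, "w") for name in names]
    try:
        with open(input_file) as src:
            # Skip the first ignore_header rows of the input.
            rows = itertools.islice(src, ignore_header, None)
            # Deal the remaining rows to the outputs in round-robin order.
            for row, out in zip(rows, itertools.cycle(outputs)):
                out.write(row)
    finally:
        for out in outputs:
            out.close()
    return names
```

Round-robin distribution keeps the output files within one line of each other in length without needing to count the input lines first.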

locopy.utility.write_file(data, delimiter, filepath, mode='w')[source]

Write data to a file.

Parameters: