Data Utils

Contains functions for data readers.

dataprofiler.data_readers.data_utils.data_generator(data_list: List[str]) Generator[str, None, None]

Take a list and return a generator on the list.

Parameters

data_list (list) – list of strings

Returns

item from the list

Return type

generator
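
A minimal usage sketch (the sample strings are illustrative):

    from dataprofiler.data_readers.data_utils import data_generator

    gen = data_generator(["alpha", "beta", "gamma"])
    for item in gen:
        print(item)  # yields each string in order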

dataprofiler.data_readers.data_utils.generator_on_file(file_object: Union[_io.StringIO, _io.BytesIO]) Generator[Union[str, bytes], None, None]

Take a file and return a generator that returns lines.

Parameters

file_object (Union[StringIO, BytesIO]) – file object to read lines from

Returns

Line from file

Return type

generator
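
A minimal sketch using an in-memory StringIO (per the signature, a BytesIO works the same way):

    from io import StringIO
    from dataprofiler.data_readers.data_utils import generator_on_file

    buffer = StringIO("line one\nline two\n")
    for line in generator_on_file(buffer):
        print(line)  # yields one line per iteration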

dataprofiler.data_readers.data_utils.convert_int_to_string(x: int) str

Convert the given input to string.

In particular, if the input is an int, it is converted while ensuring there is no trailing . or 00. In addition, if the input is np.nan, the output will be ‘nan’, which is what we need to handle the data properly.

Parameters

x (Union[int, float, str, numpy.nan]) – value to convert to a string

Returns

string representation of the input

Return type

str
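
A minimal sketch of the documented behavior (expected outputs follow the docstring above; not verified against the implementation):

    import numpy as np
    from dataprofiler.data_readers.data_utils import convert_int_to_string

    convert_int_to_string(5)       # expected '5', with no trailing '.' or '00'
    convert_int_to_string(np.nan)  # expected 'nan' per the docstring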

dataprofiler.data_readers.data_utils.unicode_to_str(data: Union[str, int, float, bool, None, List, Dict], ignore_dicts: bool = False) Union[str, int, float, bool, None, List, Dict]

Convert data to string representation if it is a unicode string.

Parameters
  • data (JSONType) – input data

  • ignore_dicts (boolean) – if set, ignore the dictionary type processing

Returns

string representation of data

Return type

str

dataprofiler.data_readers.data_utils.json_to_dataframe(json_lines: List[Union[str, int, float, bool, None, List, Dict]], selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series]

Take a list of JSON objects and return a dataframe representing the JSON list.

Parameters
  • json_lines (list(JSONType)) – list of json objects

  • selected_columns (list(str)) – a list of keys to be processed

  • read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

dataframe converted from the JSON list and the dtypes for each column

Return type

tuple(pd.DataFrame, pd.Series(dtypes))
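
A minimal usage sketch (the sample records are illustrative):

    from dataprofiler.data_readers.data_utils import json_to_dataframe

    records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    df, dtypes = json_to_dataframe(records, selected_columns=["id"], read_in_string=True)
    print(df.head())
    print(dtypes)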

dataprofiler.data_readers.data_utils.read_json_df(data_generator: Generator, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series]

Return an iterator that returns a chunk of data as dataframe in each call.

The source of input to this function is either a file or a list of JSON structured strings. If a file path is given as input, the file is expected to have one JSON structure on each line; lines that are not valid JSON will be ignored, so a file with pretty-printed JSON objects will not be considered valid. If the input is a data list, it is expected to be a list of strings where each string is a valid JSON object; if an individual object is not valid JSON, it will be ignored.

NOTE: data_list and file_path cannot both be passed at the same time.

Parameters
  • data_generator (generator) – The generator you want to read.

  • selected_columns (list(str)) – a list of keys to be processed

  • read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

an iterator that returns a chunk of the file as a dataframe on each call, along with the original dtypes of the dataframe columns

Return type

tuple(pd.DataFrame, pd.Series(dtypes))
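
A minimal sketch that feeds newline-delimited JSON strings through data_generator (the sample lines are illustrative):

    from dataprofiler.data_readers.data_utils import data_generator, read_json_df

    lines = ['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}']
    df, dtypes = read_json_df(data_generator(lines))
    print(df)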

dataprofiler.data_readers.data_utils.read_json(data_generator: Iterator, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) List[Union[str, int, float, bool, None, List, Dict]]

Return the lines of a JSON file as a list.

The source of input to this function is either a file or a list of JSON structured strings. If a file path is given as input, the file is expected to have one JSON structure on each line; lines that are not valid JSON will be ignored, so a file with pretty-printed JSON objects will not be considered valid. If the input is a data list, it is expected to be a list of strings where each string is a valid JSON object; if an individual object is not valid JSON, it will be ignored.

NOTE: data_list and file_path cannot both be passed at the same time.

Parameters
  • data_generator (generator) – The generator you want to read.

  • selected_columns (list(str)) – a list of keys to be processed

  • read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

the lines of the JSON file

Return type

list(dict)

dataprofiler.data_readers.data_utils.reservoir(file: _io.TextIOWrapper, sample_nrows: int) list

Implement the mathematical logic of Reservoir sampling.

Parameters
  • file (TextIOWrapper) – wrapper of the opened csv file

  • sample_nrows (int) – number of rows to sample

Raises

ValueError()

Returns

sampled values

Return type

list
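
For context, a standalone sketch of the underlying idea (Algorithm R); this illustrates the mathematics of reservoir sampling, not the library's exact implementation:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of unknown length."""
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)       # fill the reservoir first
            else:
                j = random.randint(0, i)  # inclusive on both ends
                if j < k:
                    sample[j] = item      # replace with probability k/(i+1)
        return sample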

dataprofiler.data_readers.data_utils.rsample(file_path: _io.TextIOWrapper, sample_nrows: int, args: dict) _io.StringIO

Implement Reservoir Sampling to sample n rows out of a total of M rows.

Parameters
  • file_path (TextIOWrapper) – wrapper of the opened csv file to be read in

  • sample_nrows (int) – number of rows being sampled

  • args (dict) – options to read the csv file

Returns

sampled rows as an in-memory text stream

Return type

StringIO

dataprofiler.data_readers.data_utils.read_csv_df(file_path: Union[str, _io.BytesIO, _io.TextIOWrapper], delimiter: Optional[str], header: Optional[int], sample_nrows: Optional[int] = None, selected_columns: List[str] = [], read_in_string: bool = False, encoding: Optional[str] = 'utf-8') pandas.core.frame.DataFrame

Read a CSV file in chunks and return a dataframe in the form of an iterator.

Parameters
  • file_path (str) – path to the CSV file.

  • delimiter (str) – character used to separate csv values.

  • header (int) – the header row in the csv file.

  • sample_nrows (int) – number of rows being sampled

  • selected_columns (list(str)) – a list of columns to be processed

  • read_in_string (bool) – if True, all the values in dataframe will be converted to string

  • encoding (str) – file encoding

Returns

Iterator

Return type

pd.DataFrame
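
A minimal usage sketch (the file path is hypothetical):

    from dataprofiler.data_readers.data_utils import read_csv_df

    df = read_csv_df("data.csv", delimiter=",", header=0, read_in_string=True)
    print(df.head())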

dataprofiler.data_readers.data_utils.convert_unicode_col_to_utf8(input_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Convert all unicode columns in input dataframe to utf-8.

Parameters

input_df (pd.DataFrame) – input dataframe

Returns

corrected dataframe

Return type

pd.DataFrame

dataprofiler.data_readers.data_utils.sample_parquet(file_path: str, sample_nrows: int, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series]

Read a Parquet file, sample the specified number of rows from it, and return a dataframe.

Parameters
  • file_path (str) – path to the Parquet file.

  • sample_nrows (int) – number of rows being sampled

  • selected_columns (list) – columns to be read

  • read_in_string (bool) – return as string type

Returns

sampled dataframe and the dtypes for each column

Return type

tuple(pd.DataFrame, pd.Series(dtypes))

dataprofiler.data_readers.data_utils.read_parquet_df(file_path: str, sample_nrows: Optional[int] = None, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series]

Return an iterator that returns one row group each time.

Parameters
  • file_path (str) – path to the Parquet file.

  • sample_nrows (int) – number of rows being sampled

  • selected_columns (list) – columns to be read

  • read_in_string (bool) – return as string type

Returns

dataframe and the original dtypes of the dataframe columns

Return type

tuple(pd.DataFrame, pd.Series(dtypes))
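
A minimal usage sketch (the file path is hypothetical):

    from dataprofiler.data_readers.data_utils import read_parquet_df

    df, dtypes = read_parquet_df("data.parquet", sample_nrows=100)
    print(df.head())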

dataprofiler.data_readers.data_utils.read_text_as_list_of_strs(file_path: str, encoding: Optional[str] = None) List[str]

Return a list of strings relative to the chunk size.

Each line is 1 chunk.

Parameters

  • file_path (str) – path to the file

  • encoding (str) – file encoding

Returns

Return type

list(str)

dataprofiler.data_readers.data_utils.detect_file_encoding(file_path: str, buffer_size: int = 1024, max_lines: int = 20) str

Determine encoding of files within initial max_lines of length buffer_size.

Parameters
  • file_path (str) – path to the file

  • buffer_size (int) – buffer length for each line being read

  • max_lines (int) – number of lines to read from file of length buffer_size

Returns

encoding type

Return type

str
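
A minimal usage sketch (the file path is hypothetical):

    from dataprofiler.data_readers.data_utils import detect_file_encoding

    encoding = detect_file_encoding("data.csv")
    print(encoding)  # e.g. 'utf-8'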

dataprofiler.data_readers.data_utils.detect_cell_type(cell: str) str

Detect the cell type (int, float, etc).

Parameters

cell (str) – String designated for data type detection

Returns

detected cell type

Return type

str

dataprofiler.data_readers.data_utils.get_delimiter_regex(delimiter: str = ',', quotechar: str = ',') Pattern[str]

Build regex for delimiter checks.

Parameters
  • delimiter (str) – Delimiter to be added to regex

  • quotechar (str) – Quotechar to be added to regex

Returns

compiled delimiter regex

Return type

Pattern[str]
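
A minimal sketch of inspecting the returned pattern (the exact matching semantics depend on the implementation):

    import re
    from dataprofiler.data_readers.data_utils import get_delimiter_regex

    pattern = get_delimiter_regex(delimiter=",", quotechar='"')
    print(re.findall(pattern, 'a,b,"c,d"'))  # inspect which delimiters the pattern matches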

dataprofiler.data_readers.data_utils.find_nth_loc(string: Optional[str] = None, search_query: Optional[str] = None, n: int = 0, ignore_consecutive: bool = True) Tuple[int, int]

Search string via search_query and return nth index in which query occurs.

If there are fewer than n occurrences, the location of the last occurrence is returned.

Parameters
  • string (str) – Input string, to be searched

  • search_query (str) – char(s) to find nth occurrence of

  • n (int) – The number of occurrences to iterate through

  • ignore_consecutive (bool) – Ignore consecutive matches in the search query.

Returns

idx – index of the nth or last occurrence of the search_query; id_count – number of identifications prior to idx

Return type

tuple(int, int)
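
A minimal usage sketch (expected values follow the description above; not verified against the implementation):

    from dataprofiler.data_readers.data_utils import find_nth_loc

    idx, id_count = find_nth_loc(string="a,b,c", search_query=",", n=2)
    print(idx, id_count)  # expected: index of the 2nd comma, and the match count seen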

dataprofiler.data_readers.data_utils.load_as_str_from_file(file_path: str, file_encoding: Optional[str] = None, max_lines: int = 10, max_bytes: int = 65536, chunk_size_bytes: int = 1024) str

Load data from a csv file up to a specific number of lines OR bytes.

Parameters
  • file_path (str) – Path to file to load data from

  • file_encoding (str) – File encoding

  • max_lines (int) – Maximum number of lines to load from file

  • max_bytes (int) – Maximum number of bytes to load from file

  • chunk_size_bytes (int) – Chunk size to load every data load

Returns

Data as string

Return type

str
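
A minimal usage sketch (the file path is hypothetical):

    from dataprofiler.data_readers.data_utils import load_as_str_from_file

    text = load_as_str_from_file("data.csv", file_encoding="utf-8", max_lines=5)
    print(text)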

dataprofiler.data_readers.data_utils.is_valid_url(url_as_string: Any) typing_extensions.TypeGuard[Url]

Determine whether a given string is a valid URL.

Parameters

url_as_string (str) – string to be tested as a URL

Returns

True if the string is a valid URL

Return type

boolean

dataprofiler.data_readers.data_utils.url_to_bytes(url_as_string: Url, options: Dict) _io.BytesIO

Read in a URL and convert it to a byte stream.

Parameters
  • url_as_string (str) – string to read as URL

  • options (dict) – options for the url

Returns

BytesIO stream of data downloaded from URL

Return type

BytesIO stream
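
A minimal sketch combining the two URL helpers (the URL is illustrative; the accepted options keys depend on the implementation):

    from dataprofiler.data_readers.data_utils import is_valid_url, url_to_bytes

    url = "https://example.com/data.csv"
    if is_valid_url(url):
        stream = url_to_bytes(url, options={})
        print(stream.read(100))  # first 100 bytes of the downloaded content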

class dataprofiler.data_readers.data_utils.S3Helper

Bases: object

A utility class for working with Amazon S3.

This class provides methods to check if a path is an S3 URI and to create an S3 client.

static is_s3_uri(path: str, logger: logging.Logger) bool

Check if the given path is an S3 URI.

This function checks for common S3 URI prefixes “s3://” and “s3a://”.

Parameters
  • path (str) – The path to check for an S3 URI.

  • logger (logging.Logger) – The logger instance for logging.

Returns

True if the path is an S3 URI, False otherwise.

Return type

bool

static create_s3_client(aws_access_key_id: Optional[str] = None, aws_secret_access_key: Optional[str] = None, aws_session_token: Optional[str] = None, region_name: Optional[str] = None) boto3.client

Create and return an S3 client.

Parameters
  • aws_access_key_id (str) – The AWS access key ID.

  • aws_secret_access_key (str) – The AWS secret access key.

  • aws_session_token (str) – The AWS session token (optional, typically used for temporary credentials).

  • region_name (str) – The AWS region name (default is ‘us-east-1’).

Returns

An S3 client instance.

Return type

boto3.client

static get_s3_uri(s3_uri: str, s3_client: boto3.client) _io.BytesIO

Download an object from an S3 URI and return its content as BytesIO.

Parameters
  • s3_uri (str) – The S3 URI specifying the location of the object to download.

  • s3_client (boto3.client) – An initialized AWS S3 client for accessing the S3 service.

Returns

A BytesIO object containing the content of the downloaded S3 object.

Return type

BytesIO
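
A minimal end-to-end sketch (the URI is hypothetical; with no explicit credentials, boto3 presumably falls back to its default credential chain):

    import logging
    from dataprofiler.data_readers.data_utils import S3Helper

    logger = logging.getLogger(__name__)
    uri = "s3://my-bucket/path/to/data.csv"  # hypothetical URI

    if S3Helper.is_s3_uri(uri, logger):
        client = S3Helper.create_s3_client(region_name="us-east-1")
        stream = S3Helper.get_s3_uri(uri, client)
        print(stream.read(100))  # first 100 bytes of the object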