Data Utils

dataprofiler.data_readers.data_utils.data_generator(data_list)

Takes a list and returns a generator on the list.

Parameters: data_list (list) – list of strings
Returns: item from the list
Return type: generator
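The behaviour described above can be illustrated with a minimal sketch. The name list_generator is a hypothetical stand-in written for this example, not the library's own code; it only shows what "a generator on the list" means in practice.

```python
def list_generator(data_list):
    """Yield each item of data_list in order (illustrative stand-in)."""
    for item in data_list:
        yield item


gen = list_generator(["a", "b", "c"])
first = next(gen)       # consumes the first item only
remaining = list(gen)   # the remaining items are produced on demand
```

Because items are produced lazily, a consumer can stop early without touching the rest of the list.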

dataprofiler.data_readers.data_utils.generator_on_file(file_object)

Takes a file object and returns a generator that yields its lines.

Parameters: file_object (file) – file object to read lines from
Returns: Line from file
Return type: generator
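A rough sketch of a line generator follows, assuming the argument is an open file-like object, as the file_object name in the signature suggests. The function lines_from_file is hypothetical and written only for illustration; an in-memory io.StringIO stands in for a real file.

```python
import io


def lines_from_file(file_object):
    """Yield lines from an open file object until it is exhausted."""
    while True:
        line = file_object.readline()
        if not line:  # an empty string signals end of file
            break
        yield line


fake_file = io.StringIO("first\nsecond\n")
lines = list(lines_from_file(fake_file))
```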

dataprofiler.data_readers.data_utils.convert_int_to_string(x)

Converts the given input to a string. In particular, if the input is an int, the conversion ensures there is no . or 00 in the resulting string. In addition, if the input is np.nan, the output will be ‘nan’, which is what we need to handle data properly.

Parameters: x (Union[int, float, str, numpy.nan]) –
Returns
Return type: str
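The conversion rule can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: to_clean_string is a hypothetical helper, and float("nan") is used in place of np.nan so the example needs only the standard library.

```python
import math


def to_clean_string(x):
    """Illustrative stand-in: render whole numbers without a decimal part."""
    if isinstance(x, float):
        if math.isnan(x):
            return "nan"          # missing values become the literal 'nan'
        if x.is_integer():
            return str(int(x))    # 3.0 -> '3', no trailing '.0'
    return str(x)                 # ints and strings pass through str()


examples = [to_clean_string(v) for v in (3, 3.0, float("nan"), "abc")]
```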

dataprofiler.data_readers.data_utils.unicode_to_str(data, ignore_dicts=False)

Convert data to string representation if it is a unicode string.

Parameters

data (str) – input data
ignore_dicts (boolean) – if set, ignore the dictionary type processing

Returns

string representation of data

Return type

str

dataprofiler.data_readers.data_utils.json_to_dataframe(json_lines, selected_columns=None, read_in_string=False)

This function takes a list of json objects and returns the dataframe representing the json list.

Parameters

json_lines (list(dict)) – list of json objects
selected_columns (list(str)) – a list of keys to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

dataframe converted from json list and list of dtypes for each column

Return type

tuple(pd.DataFrame, pd.Series(dtypes))

dataprofiler.data_readers.data_utils.read_json_df(data_generator, selected_columns=None, read_in_string=False)

This function returns an iterator that yields a chunk of data as a dataframe on each call. The input is either a file or a list of JSON-structured strings. If a file path is given, the file is expected to have one JSON structure per line. Lines that are not valid JSON are ignored; a file with pretty-printed JSON objects will therefore not be considered valid JSON. If the input is a data list, it is expected to be a list of strings where each string is a valid JSON object. Any individual object that is not valid JSON is ignored.

NOTE: data_list and file_path cannot both be passed at the same time.

Parameters

data_generator (generator) – The generator you want to read.
selected_columns (list(str)) – a list of keys to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

an iterator that yields a chunk of the file as a dataframe on each call, along with the original dtypes of the dataframe columns

Return type

tuple(Iterator(pd.DataFrame), pd.Series(dtypes))

dataprofiler.data_readers.data_utils.read_json(data_generator, selected_columns=None, read_in_string=False)

This function returns the lines of a JSON file. The input is either a file or a list of JSON-structured strings. If a file path is given, the file is expected to have one JSON structure per line. Lines that are not valid JSON are ignored; a file with pretty-printed JSON objects will therefore not be considered valid JSON. If the input is a data list, it is expected to be a list of strings where each string is a valid JSON object. Any individual object that is not valid JSON is ignored.

NOTE: data_list and file_path cannot both be passed at the same time.

Parameters

data_generator (generator) – The generator you want to read.
selected_columns (list(str)) – a list of keys to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

the lines of a JSON file

Return type

list(dict)
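The "one JSON structure per line, invalid lines ignored" rule can be sketched with the standard json module. The helper parse_json_lines is hypothetical and only approximates the described behaviour; it does not model selected_columns or read_in_string. The sample input includes a pretty-printed object split across three lines to show why such files yield no records for those lines.

```python
import json


def parse_json_lines(lines):
    """Parse each line as JSON on its own, skipping lines that fail."""
    records = []
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON by itself, so it is ignored
        records.append(obj)
    return records


raw = [
    '{"a": 1}',
    'not json',
    '{',            # a pretty-printed object spans several lines,
    '  "b": 2',     # none of which is valid JSON alone
    '}',
    '{"a": 3, "c": "x"}',
]
records = parse_json_lines(raw)
```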

dataprofiler.data_readers.data_utils.read_csv_df(file_path, delimiter, header, selected_columns=[], read_in_string=False, encoding='utf-8')

Reads a CSV file in chunks and returns a dataframe in the form of iterator.

Parameters

file_path (str) – path to the CSV file.
delimiter (str) – character used to separate csv values.
header (int) – the header row in the csv file.
selected_columns (list(str)) – a list of columns to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string

Returns

Iterator

Return type

pd.DataFrame

dataprofiler.data_readers.data_utils.read_parquet_df(file_path, selected_columns=None, read_in_string=False)

Returns an iterator that returns one row group each time.

Parameters: file_path (str) – path to the Parquet file.
Returns
Return type: Iterator(pd.DataFrame)

dataprofiler.data_readers.data_utils.read_text_as_list_of_strs(file_path, encoding=None)

Returns a list of strings relative to the chunk size. Each line is 1 chunk.

Parameters: file_path (str) – path to the file
Returns
Return type: list(str)

dataprofiler.data_readers.data_utils.detect_file_encoding(file_path, buffer_size=1024, max_lines=20)

Determines the encoding of a file from its first max_lines lines, reading up to buffer_size of each line.

Parameters

file_path (str) – path to the file
buffer_size (int) – buffer length for each line being read
max_lines (int) – number of lines to read from file of length buffer_size

Returns

encoding type

Return type

str

dataprofiler.data_readers.data_utils.detect_cell_type(cell)

Detects the cell type (int, float, etc.).

Parameters: cell (str) – String designated for data type detection

dataprofiler.data_readers.data_utils.get_delimiter_regex(delimiter=',', quotechar=',')

Builds regex for delimiter checks

Parameters

delimiter (str) – Delimiter to be added to regex
quotechar (str) – Quotechar to be added to regex

dataprofiler.data_readers.data_utils.find_nth_loc(string=None, search_query=None, n=0, ignore_consecutive=True)

Searches the string for the search_query and returns the index of its nth occurrence. If there are fewer than n occurrences, the index of the last one is returned.

Parameters

string (str) – Input string, to be searched
search_query (str) – char(s) to find nth occurrence of
n (int) – The number of occurrences to iterate through
ignore_consecutive (bool) – Ignore consecutive matches in the search query.

Returns

idx – Index of the nth or last occurrence of the search_query
id_count – Number of identifications prior to idx

Return type

idx (int), id_count (int)
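A simplified sketch of an nth-occurrence search is shown below. The function nth_occurrence is hypothetical: it ignores the ignore_consecutive option, and its handling of edge cases (for example, no match at all) is an assumption and may differ from the library's. It only illustrates the "nth, or last if fewer than n" rule together with the returned count.

```python
def nth_occurrence(string, search_query, n):
    """Return (index of the nth or last match, number of matches counted)."""
    idx = -1      # assumption: -1 when the query never occurs
    count = 0
    start = 0
    while count < n:
        found = string.find(search_query, start)
        if found == -1:
            break  # fewer than n matches: keep the last index found
        idx = found
        count += 1
        start = found + len(search_query)
    return idx, count


second_comma = nth_occurrence("a,b,c,d", ",", 2)
too_many = nth_occurrence("a,b,c,d", ",", 10)
```

In the second call only three commas exist, so the index of the last one is returned along with a count of three.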

dataprofiler.data_readers.data_utils.load_as_str_from_file(file_path, file_encoding=None, max_lines=10, max_bytes=65536, chunk_size_bytes=1024)

Loads data from a CSV file, stopping at max_lines lines or max_bytes bytes, whichever limit is reached first.

Parameters

file_path (str) – Path to file to load data from
file_encoding (str) – File encoding
max_lines (int) – Maximum number of lines to load from file
max_bytes (int) – Maximum number of bytes to load from file
chunk_size_bytes (int) – Chunk size to load every data load

Returns

Data as string

Return type

str
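The two stopping conditions can be sketched as below. The helper head_as_str is hypothetical and simplified: it reads line by line rather than in chunk_size_bytes chunks, and it assumes a line that would exceed max_bytes is dropped whole, which may not match the library's behaviour.

```python
import os
import tempfile


def head_as_str(file_path, max_lines=10, max_bytes=65536, encoding="utf-8"):
    """Read the start of a file, stopping at a line or byte limit."""
    collected = []
    total_bytes = 0
    with open(file_path, "r", encoding=encoding, newline="") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= max_lines:
                break  # line limit reached
            size = len(line.encode(encoding))
            if total_bytes + size > max_bytes:
                break  # byte limit would be exceeded
            collected.append(line)
            total_bytes += size
    return "".join(collected)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write("".join(f"row{i}\n" for i in range(5)))
    by_lines = head_as_str(path, max_lines=2)
    by_bytes = head_as_str(path, max_bytes=7)
```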

dataprofiler.data_readers.data_utils.is_valid_url(url_as_string)

Determines whether a given string is a valid URL

Parameters: url_as_string (str) – string to be tested if URL
Returns: true if string is a valid URL
Return type: boolean
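One common way to perform such a check is with urllib.parse from the standard library, as sketched below. The function looks_like_url is hypothetical, and the rule it applies (both a scheme and a network location must be present) is an assumption about what "valid" means here, not a statement of the library's exact test.

```python
from urllib.parse import urlsplit


def looks_like_url(candidate):
    """Treat a string as a URL only if it has a scheme and a host part."""
    parts = urlsplit(candidate)
    return bool(parts.scheme and parts.netloc)


remote = looks_like_url("https://example.com/data.csv")
local = looks_like_url("data/file.csv")
```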

dataprofiler.data_readers.data_utils.url_to_bytes(url_as_string, options)

Reads in URL and converts it to a byte stream

Parameters

url_as_string (str) – string to read as URL
options (dict) – options for the url

Returns

BytesIO stream of data downloaded from URL

Return type

BytesIO stream