Data Utils¶
Contains functions for data readers.
- dataprofiler.data_readers.data_utils.data_generator(data_list: List[str]) Generator[str, None, None] ¶
Take a list and return a generator on the list.
- Parameters
data_list (list) – list of strings
- Returns
item from the list
- Return type
generator
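The behavior can be sketched in a few lines (an illustrative re-implementation, not the library source):

```python
from typing import Generator, List

def data_generator(data_list: List[str]) -> Generator[str, None, None]:
    # Yield each item lazily instead of materializing a copy of the list.
    for item in data_list:
        yield item

gen = data_generator(["a", "b", "c"])
first = next(gen)  # "a"
```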
- dataprofiler.data_readers.data_utils.generator_on_file(file_object: Union[_io.StringIO, _io.BytesIO]) Generator[Union[str, bytes], None, None] ¶
Take a file and return a generator that returns lines.
- Parameters
file_object (Union[StringIO, BytesIO]) – file object to read lines from
- Returns
Line from file
- Return type
generator
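Because iterating a file object already yields one line at a time, a minimal sketch of this generator (illustrative, not the library source) is:

```python
import io
from typing import Generator, Union

def generator_on_file(
    file_object: Union[io.StringIO, io.BytesIO]
) -> Generator[Union[str, bytes], None, None]:
    # Iterating a file object yields one line at a time, so the generator
    # never holds more than the current line in memory.
    for line in file_object:
        yield line

lines = list(generator_on_file(io.StringIO("x\ny\n")))  # ["x\n", "y\n"]
```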
- dataprofiler.data_readers.data_utils.convert_int_to_string(x: int) str ¶
Convert the given input to a string.
In particular, if the input is an int, it is converted so that the result contains no `.` or trailing zeros. In addition, if the input is np.nan, the output will be ‘nan’, which is what we need to handle the data properly.
- Parameters
x (Union[int, float, str, numpy.nan]) –
- Returns
string representation of the input
- Return type
str
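The conversion rules above can be sketched with a stdlib-only stand-in (using `math.isnan` in place of np.nan handling; a hypothetical approximation, not the library source):

```python
import math
from typing import Any

def convert_int_to_string(x: Any) -> str:
    # Hypothetical stand-in: whole-valued numbers render without "." or ".0",
    # NaN becomes "nan", and anything else falls back to plain str().
    try:
        if isinstance(x, float) and math.isnan(x):
            return "nan"
        as_float = float(x)
        if as_float.is_integer():
            return str(int(as_float))
    except (TypeError, ValueError):
        pass
    return str(x)
```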
- dataprofiler.data_readers.data_utils.unicode_to_str(data: Union[str, int, float, bool, None, List, Dict], ignore_dicts: bool = False) Union[str, int, float, bool, None, List, Dict] ¶
Convert data to string representation if it is a unicode string.
- Parameters
data (JSONType) – input data
ignore_dicts (boolean) – if set, ignore the dictionary type processing
- Returns
string representation of data
- Return type
str
- dataprofiler.data_readers.data_utils.json_to_dataframe(json_lines: List[Union[str, int, float, bool, None, List, Dict]], selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series] ¶
Take a list of JSON objects and return a dataframe representing the list.
- Parameters
json_lines (list(JSONType)) – list of json objects
selected_columns (list(str)) – a list of keys to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string
- Returns
dataframe converted from json list and list of dtypes for each column
- Return type
tuple(pd.DataFrame, pd.Series(dtypes))
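The row-to-column pivot behind this conversion can be sketched without pandas; `json_lines_to_columns` below is a hypothetical stdlib-only helper, not part of the library:

```python
from typing import Any, Dict, List, Optional

def json_lines_to_columns(
    json_lines: List[Dict[str, Any]],
    selected_columns: Optional[List[str]] = None,
) -> Dict[str, List[Any]]:
    # Collect values per key across records, padding missing keys with None,
    # which mirrors what a dataframe constructor does with a list of dicts.
    keys = selected_columns or sorted({k for rec in json_lines for k in rec})
    return {k: [rec.get(k) for rec in json_lines] for k in keys}

cols = json_lines_to_columns([{"a": 1}, {"a": 2, "b": 3}])
# cols == {"a": [1, 2], "b": [None, 3]}
```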
- dataprofiler.data_readers.data_utils.read_json_df(data_generator: Generator, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series] ¶
Return an iterator that returns a chunk of data as dataframe in each call.
The source of input to this function is either a file or a list of JSON-structured strings. If a file path is given as input, the file is expected to have one JSON structure per line; lines that are not valid JSON are ignored, so a file with pretty-printed JSON objects will not be considered valid. If the input is a data list, it is expected to be a list of strings where each string is a valid JSON object; any individual object that is not valid JSON is ignored.
NOTE: both data_list and file_path cannot be passed at the same time.
- Parameters
data_generator (generator) – The generator you want to read.
selected_columns (list(str)) – a list of keys to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string
- Returns
an iterator that yields a chunk of the file as a dataframe on each call, together with the original dtypes of the dataframe columns.
- Return type
tuple(pd.DataFrame, pd.Series(dtypes))
- dataprofiler.data_readers.data_utils.read_json(data_generator: Iterator, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) List[Union[str, int, float, bool, None, List, Dict]] ¶
Return the lines of a JSON file as a list.
The source of input to this function is either a file or a list of JSON-structured strings. If a file path is given as input, the file is expected to have one JSON structure per line; lines that are not valid JSON are ignored, so a file with pretty-printed JSON objects will not be considered valid. If the input is a data list, it is expected to be a list of strings where each string is a valid JSON object; any individual object that is not valid JSON is ignored.
NOTE: both data_list and file_path cannot be passed at the same time.
- Parameters
data_generator (generator) – The generator you want to read.
selected_columns (list(str)) – a list of keys to be processed
read_in_string (bool) – if True, all the values in dataframe will be converted to string
- Returns
returns the lines of a json file
- Return type
list(dict)
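The "invalid JSON is ignored" behavior described above can be sketched as follows (`read_json_lines` is a hypothetical illustrative helper, not the library function):

```python
import json
from typing import Iterator, List

def read_json_lines(data_generator: Iterator[str]) -> List[dict]:
    # Parse each line as JSON; lines that fail to parse are silently skipped,
    # matching the documented "invalid json is ignored" behavior.
    records = []
    for line in data_generator:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records

rows = read_json_lines(iter(['{"a": 1}', "not json", '{"a": 2}']))
# rows == [{"a": 1}, {"a": 2}]
```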
- dataprofiler.data_readers.data_utils.reservoir(file: _io.TextIOWrapper, sample_nrows: int) list ¶
Implement the mathematical logic of Reservoir sampling.
- Parameters
file (TextIOWrapper) – wrapper of the opened csv file
sample_nrows (int) – number of rows to sample
- Raises
ValueError()
- Returns
sampled values
- Return type
list
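The mathematical logic referred to above is Algorithm R: keep the first k items, then replace a random slot with decreasing probability so every item is equally likely to survive. A minimal sketch over any iterable (illustrative; the library operates on a csv file wrapper):

```python
import random
from typing import Iterable, List

def reservoir_sample(iterable: Iterable, sample_nrows: int, seed: int = 0) -> List:
    # Algorithm R: fill the reservoir with the first k items, then for item i
    # (0-indexed) replace a uniformly chosen slot with probability k/(i+1).
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(iterable):
        if i < sample_nrows:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < sample_nrows:
                sample[j] = item
    return sample
```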
- dataprofiler.data_readers.data_utils.rsample(file_path: _io.TextIOWrapper, sample_nrows: int, args: dict) _io.StringIO ¶
Implement Reservoir Sampling to sample n rows out of a total of M rows.
- Parameters
file_path (TextIOWrapper) – wrapper of the opened csv file
sample_nrows (int) – number of rows being sampled
args (dict) – options to read the csv file
- Returns
sampled data as an in-memory text stream
- Return type
StringIO
- dataprofiler.data_readers.data_utils.read_csv_df(file_path: Union[str, _io.BytesIO, _io.TextIOWrapper], delimiter: Optional[str], header: Optional[int], sample_nrows: Optional[int] = None, selected_columns: List[str] = [], read_in_string: bool = False, encoding: Optional[str] = 'utf-8') pandas.core.frame.DataFrame ¶
Read a CSV file in chunks and return the dataframe in the form of an iterator.
- Parameters
file_path (str) – path to the CSV file.
delimiter (str) – character used to separate csv values.
header (int) – the header row in the csv file.
sample_nrows (int) – number of rows to sample from the file
selected_columns (list(str)) – a list of columns to be processed
read_in_string (bool) – if True, all the values in the dataframe will be converted to string
encoding (str) – file encoding of the csv file
- Returns
Iterator
- Return type
pd.DataFrame
- dataprofiler.data_readers.data_utils.convert_unicode_col_to_utf8(input_df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame ¶
Convert all unicode columns in input dataframe to utf-8.
- Parameters
input_df (pd.DataFrame) – input dataframe
- Returns
corrected dataframe
- Return type
pd.DataFrame
- dataprofiler.data_readers.data_utils.sample_parquet(file_path: str, sample_nrows: int, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series] ¶
Read parquet file, sample specified number of rows from it and return a data frame.
- Parameters
file_path (str) – path to the Parquet file.
sample_nrows (int) – number of rows being sampled
selected_columns (list) – columns need to be read
read_in_string (bool) – return as string type
- Returns
sampled dataframe and the original dtypes of its columns
- Return type
tuple(pd.DataFrame, pd.Series(dtypes))
- dataprofiler.data_readers.data_utils.read_parquet_df(file_path: str, sample_nrows: Optional[int] = None, selected_columns: Optional[List[str]] = None, read_in_string: bool = False) Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series] ¶
Return an iterator that returns one row group each time.
- Parameters
file_path (str) – path to the Parquet file.
sample_nrows (int) – number of rows being sampled
selected_columns (list) – columns need to be read
read_in_string (bool) – return as string type
- Returns
dataframe read from the parquet file and the original dtypes of its columns
- Return type
tuple(pd.DataFrame, pd.Series(dtypes))
- dataprofiler.data_readers.data_utils.read_text_as_list_of_strs(file_path: str, encoding: Optional[str] = None) List[str] ¶
Return list of strings relative to the chunk size.
Each line is 1 chunk.
- Parameters
file_path (str) – path to the file
encoding (str) – file encoding
- Returns
- Return type
list(str)
- dataprofiler.data_readers.data_utils.detect_file_encoding(file_path: str, buffer_size: int = 1024, max_lines: int = 20) str ¶
Determine encoding of files within initial max_lines of length buffer_size.
- Parameters
file_path (str) – path to the file
buffer_size (int) – buffer length for each line being read
max_lines (int) – number of lines to read from file of length buffer_size
- Returns
encoding type
- Return type
str
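One naive way to guess an encoding from a buffered read is a fallback chain of candidate decodes; the sketch below is a hypothetical simplification (the library's detection is more sophisticated than this):

```python
from typing import Tuple

def guess_encoding(
    raw: bytes,
    candidates: Tuple[str, ...] = ("utf-8", "utf-16", "latin-1"),
) -> str:
    # Return the first candidate that decodes the buffer cleanly.
    # latin-1 maps every byte value, so it acts as the catch-all default.
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "latin-1"
```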
- dataprofiler.data_readers.data_utils.detect_cell_type(cell: str) str ¶
Detect the cell type (int, float, etc).
- Parameters
cell (str) – String designated for data type detection
- dataprofiler.data_readers.data_utils.get_delimiter_regex(delimiter: str = ',', quotechar: str = ',') Pattern[str] ¶
Build regex for delimiter checks.
- Parameters
delimiter (str) – Delimiter to be added to regex
quotechar (str) – Quotechar to be added to regex
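One way to build such a pattern is to match the delimiter only when an even number of quote characters follows it, i.e. when it sits outside a quoted field. The sketch below is illustrative, not the library's implementation, and uses `"` as the quotechar default for readability:

```python
import re

def get_delimiter_regex(delimiter: str = ",", quotechar: str = '"') -> re.Pattern:
    # A delimiter is "real" only if the rest of the string contains an even
    # number of quotechars, meaning we are not inside a quoted field.
    d, q = re.escape(delimiter), re.escape(quotechar)
    return re.compile(f"{d}(?=(?:[^{q}]*{q}[^{q}]*{q})*[^{q}]*$)")

fields = get_delimiter_regex().split('a,"b,c",d')
# fields == ['a', '"b,c"', 'd'] — the comma inside quotes is not split on
```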
- dataprofiler.data_readers.data_utils.find_nth_loc(string: Optional[str] = None, search_query: Optional[str] = None, n: int = 0, ignore_consecutive: bool = True) Tuple[int, int] ¶
Search string via search_query and return nth index in which query occurs.
If there are fewer than n occurrences, the location of the last occurrence is returned.
- Parameters
string (str) – Input string, to be searched
search_query (str) – char(s) to find nth occurrence of
n (int) – The number of occurrences to iterate through
ignore_consecutive (bool) – Ignore consecutive matches in the search query.
- Return idx
Index of the nth or last occurrence of the search_query
- Rtype idx
int
- Return id_count
Number of identifications prior to idx
- Rtype id_count
int
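A simplified sketch of this search, with semantics approximated from the description above (the library's exact index conventions may differ):

```python
from typing import Optional, Tuple

def find_nth_loc(string: Optional[str] = None, search_query: Optional[str] = None,
                 n: int = 0, ignore_consecutive: bool = True) -> Tuple[int, int]:
    # Walk the string counting occurrences of search_query; runs of adjacent
    # matches collapse into one when ignore_consecutive is set. Returns the
    # index of the nth match (or the last match, if fewer than n exist) and
    # the number of matches counted.
    if not string or not search_query or n <= 0:
        return -1, 0
    idx, count, pos, prev_match = -1, 0, 0, False
    while pos <= len(string) - len(search_query):
        if string.startswith(search_query, pos):
            if not (ignore_consecutive and prev_match):
                count += 1
                idx = pos
                if count == n:
                    break
            prev_match = True
            pos += len(search_query)
        else:
            prev_match = False
            pos += 1
    return idx, count
```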
- dataprofiler.data_readers.data_utils.load_as_str_from_file(file_path: str, file_encoding: Optional[str] = None, max_lines: int = 10, max_bytes: int = 65536, chunk_size_bytes: int = 1024) str ¶
Load data from a csv file up to a specific line OR byte_size.
- Parameters
file_path (str) – Path to file to load data from
file_encoding (str) – File encoding
max_lines (int) – Maximum number of lines to load from file
max_bytes (int) – Maximum number of bytes to load from file
chunk_size_bytes (int) – Chunk size to load every data load
- Returns
Data as string
- Return type
str
- dataprofiler.data_readers.data_utils.is_valid_url(url_as_string: Any) typing_extensions.TypeGuard[Url] ¶
Determine whether a given string is a valid URL.
- Parameters
url_as_string (str) – string to be tested if URL
- Returns
true if string is a valid URL
- Return type
boolean
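A minimal version of such a check with the stdlib (an illustrative sketch, not the library source) treats a string as a URL when it parses with both a scheme and a network location:

```python
from typing import Any
from urllib.parse import urlparse

def is_valid_url(url_as_string: Any) -> bool:
    # A string counts as a URL here only if it parses with both a scheme
    # and a netloc, e.g. "https://example.com/data.csv".
    if not isinstance(url_as_string, str):
        return False
    parsed = urlparse(url_as_string)
    return bool(parsed.scheme and parsed.netloc)
```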
- dataprofiler.data_readers.data_utils.url_to_bytes(url_as_string: Url, options: Dict) _io.BytesIO ¶
Read in a URL and convert it to a byte stream.
- Parameters
url_as_string (str) – string to read as URL
options (dict) – options for the url
- Returns
BytesIO stream of data downloaded from URL
- Return type
BytesIO stream
- class dataprofiler.data_readers.data_utils.S3Helper¶
Bases:
object
A utility class for working with Amazon S3.
- This class provides methods to check if a path is an S3 URI and to create an S3 client.
- static is_s3_uri(path: str, logger: logging.Logger) bool ¶
Check if the given path is an S3 URI.
This function checks for common S3 URI prefixes “s3://” and “s3a://”.
- Parameters
path (str) – The path to check for an S3 URI.
logger (logging.Logger) – The logger instance for logging.
- Returns
True if the path is an S3 URI, False otherwise.
- Return type
bool
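The documented prefix check reduces to a one-liner (illustrative sketch without the logging parameter):

```python
def is_s3_uri(path: str) -> bool:
    # The documented check: an S3 URI starts with "s3://" or "s3a://".
    return path.startswith(("s3://", "s3a://"))
```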
- static create_s3_client(aws_access_key_id: Optional[str] = None, aws_secret_access_key: Optional[str] = None, aws_session_token: Optional[str] = None, region_name: Optional[str] = None) boto3.client ¶
Create and return an S3 client.
- Parameters
aws_access_key_id (str) – The AWS access key ID.
aws_secret_access_key (str) – The AWS secret access key.
aws_session_token (str) – The AWS session token (optional, typically used for temporary credentials).
region_name (str) – The AWS region name (default is ‘us-east-1’).
- Returns
An S3 client instance.
- Return type
boto3.client
- static get_s3_uri(s3_uri: str, s3_client: boto3.client) _io.BytesIO ¶
Download an object from an S3 URI and return its content as BytesIO.
- Parameters
s3_uri (str) – The S3 URI specifying the location of the object to download.
s3_client (boto3.client) – An initialized AWS S3 client for accessing the S3 service.
- Returns
A BytesIO object containing the content of the downloaded S3 object.
- Return type
BytesIO