Parquet Data

class dataprofiler.data_readers.parquet_data.ParquetData(input_file_path=None, data=None, options=None)

Bases: dataprofiler.data_readers.structured_mixins.SpreadSheetDataMixin, dataprofiler.data_readers.base_data.BaseData

SpreadsheetData class to save and load spreadsheet data

Data class for loading datasets of type PARQUET. The data can be supplied either as in-memory data or via a file path. Options pertaining to the PARQUET format may also be specified using the options dict parameter. Possible options:

options = dict(
    data_format= type: str, choices: "dataframe", "records", "json"
    selected_columns= type: list(str)
    header= type: any
)

data_format: user-selected format in which to return the data; must be one of the choices listed above
selected_columns: columns to select from the entire dataset

Parameters

input_file_path (str) – path to the file being loaded or None
data (multiple types) – data being loaded into the class instead of an input file
options (dict) – options pertaining to the data type

Returns

None
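The options described above can be assembled and checked before the reader is constructed. The following is a minimal sketch, not part of the library: `build_parquet_options` and `load_parquet` are hypothetical helper names, the helper's default of "dataframe" is an assumption, and the `header` option is left out because its accepted values are not documented here. Only the choices for data_format and the constructor signature are taken from this page.

```python
# Choices for data_format as listed in the options description.
ALLOWED_DATA_FORMATS = ("dataframe", "records", "json")


def build_parquet_options(data_format="dataframe", selected_columns=None):
    """Hypothetical helper: validate and assemble an options dict."""
    if data_format not in ALLOWED_DATA_FORMATS:
        raise ValueError(
            f"data_format must be one of {ALLOWED_DATA_FORMATS}, got {data_format!r}"
        )
    columns = list(selected_columns or [])
    if not all(isinstance(column, str) for column in columns):
        raise ValueError("selected_columns must be a list of strings")
    return {"data_format": data_format, "selected_columns": columns}


def load_parquet(path, options):
    """Hypothetical wrapper: construct the reader (needs dataprofiler installed)."""
    # Imported lazily so the validation helper works without the package.
    from dataprofiler.data_readers.parquet_data import ParquetData

    return ParquetData(input_file_path=path, options=options)


options = build_parquet_options("records", ["id", "name"])
print(options)
```

With the package installed, `load_parquet("example.parquet", options)` would pass the validated dict to the constructor; the file name is a placeholder.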

data_type = 'parquet'

property selected_columns

property is_structured

Determines compatibility with StructuredProfiler

classmethod is_match(file_path, options=None)

Test the given file to check if the file has valid Parquet format or not.

Parameters

file_path (str) – path to the file to be examined
options (dict) – parquet read options

Returns

True if the file is a Parquet file, False otherwise

Return type

bool
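To illustrate what a format check of this kind can look like, the sketch below tests for the 4-byte magic number "PAR1" that a standard, unencrypted Parquet file carries at both its start and its end. This is not how is_match is implemented in the library, which is not shown on this page; `looks_like_parquet` is a hypothetical name, and the check is only a cheap heuristic that does not validate the file's metadata.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"
# Smallest plausible file: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(file_path):
    """Hypothetical heuristic: check the leading and trailing magic bytes."""
    try:
        with open(file_path, "rb") as handle:
            head = handle.read(4)
            handle.seek(0, os.SEEK_END)
            if handle.tell() < MIN_PARQUET_SIZE:
                return False
            handle.seek(-4, os.SEEK_END)
            tail = handle.read(4)
    except OSError:
        # Missing or unreadable files are treated as non-matches.
        return False
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


with tempfile.TemporaryDirectory() as tmp_dir:
    # A stand-in with the right magic bytes (not a valid Parquet file).
    fake_path = os.path.join(tmp_dir, "fake.parquet")
    with open(fake_path, "wb") as handle:
        handle.write(PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC)

    text_path = os.path.join(tmp_dir, "notes.txt")
    with open(text_path, "wb") as handle:
        handle.write(b"id,name\n1,alice\n")

    fake_ok = looks_like_parquet(fake_path)
    plain_text = looks_like_parquet(text_path)

print(fake_ok, plain_text)
```

A magic-byte check is fast but permissive; a stricter test would try to parse the footer with a Parquet reader.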

reload(input_file_path=None, data=None, options=None)

Reload the data class with a new dataset. This erases all existing data and options and replaces them with the input data and options.

Parameters

input_file_path (str) – path to the file being loaded or None
data (multiple types) – data being loaded into the class instead of an input file
options (dict) – options pertaining to the data type

Returns

None

property data

property data_format

property file_encoding

get_batch_generator(batch_size)
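Only the signature of get_batch_generator is given here, so the following is a generic sketch of the batching pattern rather than the library's behavior. It assumes the data is a list of records and that batches are yielded in order with at most batch_size rows each; the actual method may order rows differently or yield another container type. `batch_generator` is a hypothetical stand-in.

```python
def batch_generator(rows, batch_size):
    """Hypothetical stand-in: yield consecutive slices of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [{"id": i} for i in range(5)]
batches = list(batch_generator(rows, 2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Because it is a generator, each batch is produced only when the consumer asks for it, which keeps just one slice in use at a time during iteration.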

info = None

property length

Returns the length of the dataset which is loaded.

Returns: length of the dataset