dataprofiler.data_readers.parquet_data module

Contains class to save and load parquet data.

class dataprofiler.data_readers.parquet_data.ParquetData(input_file_path: str | None = None, data: DataFrame | str | None = None, options: Dict | None = None)

Bases: SpreadSheetDataMixin, BaseData

SpreadsheetData class to save and load parquet data.

Initialize Data class for loading datasets of type PARQUET.

The data can be supplied either in memory or via a file path. Options pertaining to PARQUET may also be passed through the options dict parameter. Possible options:

options = dict(
    data_format= type: str, choices: "dataframe", "records", "json"
    selected_columns= type: list(str)
    header= type: any
)

data_format: user-selected format in which to return data; can only be one of the specified choices
selected_columns: columns being selected from the entire dataset

Parameters:
  • input_file_path (str) – path to the file being loaded or None

  • data (multiple types) – data being loaded into the class instead of an input file

  • options (dict) – options pertaining to the data type

Returns:

None
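
A minimal usage sketch, assuming the dataprofiler package and its Parquet dependencies are installed. The helper names build_parquet_options and load_parquet are hypothetical, introduced only for illustration; the option keys and the constructor arguments are taken from the description above.

```python
from typing import Dict, List, Optional

# Choices listed in the options description above.
VALID_DATA_FORMATS = ("dataframe", "records", "json")


def build_parquet_options(
    data_format: str = "dataframe",
    selected_columns: Optional[List[str]] = None,
) -> Dict:
    """Hypothetical helper: build an options dict for ParquetData."""
    if data_format not in VALID_DATA_FORMATS:
        raise ValueError(f"data_format must be one of {VALID_DATA_FORMATS}")
    options: Dict = {"data_format": data_format}
    if selected_columns is not None:
        options["selected_columns"] = list(selected_columns)
    return options


def load_parquet(path: str, options: Optional[Dict] = None):
    """Hypothetical helper: load a Parquet file through ParquetData."""
    # Imported lazily so the options helper above works without dataprofiler.
    from dataprofiler.data_readers.parquet_data import ParquetData

    return ParquetData(input_file_path=path, options=options)
```

For example, load_parquet("example.parquet", build_parquet_options(selected_columns=["id"])) would be expected to return a reader restricted to the "id" column, assuming such a file exists and contains that column. The file name is a placeholder.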

data_type: str = 'parquet'

property file_encoding: None

Set file encoding to None since it is not detected for Parquet.

property selected_columns: List[str]

Return selected columns.

property sample_nrows: int | None

Return sample_nrows.

property is_structured: bool

Determine compatibility with StructuredProfiler.

classmethod is_match(file_path: str | StringIO | BytesIO, options: Dict | None = None) → bool

Test the given file to check if the file has valid Parquet format.

Parameters:
  • file_path (str) – path to the file to be examined

  • options (dict) – parquet read options

Returns:

whether the file is a Parquet file

Return type:

bool
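
The source does not say how is_match validates a file. As an independent illustration only (not the library's implementation), a cheap pre-check can rely on the fact that a plain Parquet file starts and ends with the 4-byte marker PAR1. The function name looks_like_parquet is hypothetical, and files with an encrypted footer use a different marker and would not pass this check.

```python
import os

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a plain Parquet file


def looks_like_parquet(path: str) -> bool:
    """Hypothetical pre-check: compare the leading and trailing magic bytes."""
    # Header magic (4) + footer length (4) + footer magic (4) = 12 bytes minimum.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as handle:
        head = handle.read(4)
        handle.seek(-4, os.SEEK_END)
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A check like this only rules out obvious non-Parquet files; it does not prove that the footer metadata or the column data are readable.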

property data

Return data.

property data_format: str | None

Return data format.

get_batch_generator(batch_size: int) → Generator[DataFrame | List, None, None]

Get batch generator.
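
The source does not describe how batches are formed. The sketch below assumes the common pattern of consecutive slices of at most batch_size items, with a shorter final batch; batch_records is a hypothetical stand-in working on a plain list of records, not the library's method.

```python
from typing import Generator, List, Sequence


def batch_records(records: Sequence, batch_size: int) -> Generator[List, None, None]:
    """Hypothetical stand-in: yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


rows = [{"id": i} for i in range(5)]
batches = list(batch_records(rows, 2))
# Five rows with batch_size=2 give batches of 2, 2, and 1 rows.
```

Consuming the data through a generator keeps only one batch in use by the caller at a time, which is the usual reason for exposing a batch interface.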

info: str | None = None

property length: int

Return the length of the dataset which is loaded.

Returns:

length of the dataset

reload(input_file_path: str | None = None, data: Any | None = None, options: Dict | None = None) → None

Reload the data class with a new dataset.

This erases all existing data/options and replaces it with the input data/options.

Parameters:
  • input_file_path (str) – path to the file being loaded or None

  • data (multiple types) – data being loaded into the class instead of an input file

  • options (dict) – options pertaining to the data type

Returns:

None

options: Dict | None