dataprofiler.data_readers.text_data module

Contains class for saving and loading text files.

class dataprofiler.data_readers.text_data.TextData(input_file_path: str | None = None, data: List[str] | None = None, options: Dict | None = None)

Bases: BaseData

TextData class to save and load text files.

Initialize Data class for loading datasets of type TEXT.

Data can be supplied either in memory or via a file path. Options pertaining to TEXT data may also be specified using the options dict parameter. Possible options:

options = dict(
    data_format= type: str, choices: "text"
    samples_per_line= type: int
)

data_format: user-selected format in which to return data; can only be one of the specified types
samples_per_line: chunks by which to read in the specified dataset

Parameters:

input_file_path (str) – path to the file being loaded or None
data (multiple types) – data being loaded into the class instead of an input file
options (dict) – options pertaining to the data type

Returns:

None
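
As a rough illustration of the samples_per_line option, the sketch below splits a string into fixed-size chunks of characters. It assumes that "chunks" means consecutive character slices; chunk_text is a hypothetical helper written for this example, not part of dataprofiler, and the library's actual loading logic may differ.

```python
from typing import List


def chunk_text(text: str, samples_per_line: int) -> List[str]:
    """Split text into consecutive chunks of at most samples_per_line characters.

    Hypothetical helper for illustration only; not the dataprofiler implementation.
    """
    if samples_per_line < 1:
        raise ValueError("samples_per_line must be a positive integer")
    # Step through the string in strides of samples_per_line characters.
    return [
        text[start:start + samples_per_line]
        for start in range(0, len(text), samples_per_line)
    ]


chunks = chunk_text("abcdefghij", samples_per_line=4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

The last chunk may be shorter than samples_per_line, and joining the chunks recovers the original text.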

data_type: str = 'text'

property samples_per_line: int

Return samples per line.

property is_structured: bool

Determine compatibility with StructuredProfiler.

tokenize() → None

Tokenize data.

classmethod is_match(file_path: str, options: Dict | None = None) → bool

Return True if all are text files.

Parameters:

file_path (str) – path to the file to be examined
options (dict) – text file read options

Returns:

whether the file is a text file

Return type:

bool

reload(input_file_path: str | None = None, data: List[str] | None = None, options: Dict | None = None) → None

Reload the data class with a new dataset.

This erases all existing data and options and replaces them with the input data and options.

Parameters:

input_file_path (str) – path to the file being loaded or None
data (multiple types) – data being loaded into the class instead of an input file
options (dict) – options pertaining to the data type

Returns:

None

property data

Return data.

property data_format: str | None

Return data format.

property file_encoding: str | None

Return file encoding.

get_batch_generator(batch_size: int) → Generator[DataFrame | List, None, None]

Get batch generator.
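
To show the general shape of a batch generator over list data, here is a minimal plain-Python sketch. batch_generator is a hypothetical stand-in, not the dataprofiler method; it yields batches in their original order, which is an assumption made for this example, and the library may order or shuffle samples differently.

```python
from typing import Generator, List


def batch_generator(
    samples: List[str], batch_size: int
) -> Generator[List[str], None, None]:
    """Yield successive batches of at most batch_size samples, in order.

    Hypothetical helper for illustration only; not the dataprofiler implementation.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Slice the list in strides of batch_size; the final batch may be shorter.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


batches = list(batch_generator(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because the function is a generator, batches are produced lazily, so a caller can process one batch at a time without building the full list of batches.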

info: str | None = None

property length: int

Return the length of the dataset which is loaded.

Returns:

length of the dataset

options: Dict | None