.. _data_readers:

Data Readers
************

The `Data` class identifies the type of its input and returns the matching
`Data` class type described below. To use a data reader, pass the file path
to the `Data` object:

.. code-block:: python

    import dataprofiler as dp
    data = dp.Data("your_file.csv")

The supported file types are:

* CSV file (or any delimited file)
* JSON object
* Avro file
* Parquet file
* Text file
* Pandas DataFrame
* A URL that points to one of the supported file types above

It is also possible to call one of the data classes directly, for example:

.. code-block:: python

    from dataprofiler.data_readers.csv_data import CSVData
    data = CSVData("your_file.csv", options={"delimiter": ","})

Additionally, any of the data classes can be loaded using a URL:

.. code-block:: python

    import dataprofiler as dp
    data = dp.Data("https://your_website.com/your_file.file", options={"verify_ssl": "True"})

Below are descriptions of the various `Data` classes and the available options.

CSVData
=======

Data class for loading datasets of type CSV. Can be specified by passing
in-memory data or via a file path. Options pertaining to the CSV may also be
specified using the options dict parameter.

`CSVData(input_file_path=None, data=None, options=None)`

Possible `options`:

* delimiter - Must be a string, for example `"delimiter": ","`
* data_format - Must be a string, possible choices: "dataframe", "records"
* selected_columns - Columns being selected from the entire dataset, must be a
  list `["column 1", "ssn"]`
* header - Define the header, for example

  * `"header": 'auto'` for auto detection
  * `"header": None` for no header
  * `"header": <int>` to specify the header row (0-based index)

JSONData
========

Data class for loading datasets of type JSON. Can be specified by passing
in-memory data or via a file path. Options pertaining to the JSON may also be
specified using the options dict parameter.
JSON data can be accessed via the "data" property, the "metadata" property, and
the "data_and_metadata" property.

`JSONData(input_file_path=None, data=None, options=None)`

Possible `options`:

* data_format - Must be a string, choices: "dataframe", "records", "json",
  "flattened_dataframe"

  * "flattened_dataframe" is best used for JSON structures typically found in
    data streams that contain nested lists of dictionaries and a payload.
    For example: `{"data": [ columns ], "response": 200}`

* selected_keys - Columns being selected from the entire dataset, must be a
  list `["column 1", "ssn"]`
* payload_keys - The dictionary keys for the payload of the JSON, typically
  called "data" or "payload". Defaults to ["data", "payload", "response"].

AVROData
========

Data class for loading datasets of type AVRO. Can be specified by passing
in-memory data or via a file path. Options pertaining to the AVRO may also be
specified using the options dict parameter.

`AVROData(input_file_path=None, data=None, options=None)`

Possible `options`:

* data_format - Must be a string, choices: "dataframe", "records", "avro",
  "json", "flattened_dataframe"

  * "flattened_dataframe" is best used for AVROs with a JSON structure
    typically found in data streams that contain nested lists of dictionaries
    and a payload. For example: `{"data": [ columns ], "response": 200}`

* selected_keys - Columns being selected from the entire dataset, must be a
  list `["column 1", "ssn"]`

ParquetData
===========

Data class for loading datasets of type PARQUET. Can be specified by passing
in-memory data or via a file path. Options pertaining to the PARQUET may also
be specified using the options dict parameter.

`ParquetData(input_file_path=None, data=None, options=None)`

Possible `options`:

* data_format - Must be a string, choices: "dataframe", "records", "json"
* selected_keys - Columns being selected from the entire dataset, must be a
  list `["column 1", "ssn"]`

TextData
========

Data class for loading datasets of type TEXT.
Can be specified by passing in-memory data or via a file path. Options
pertaining to the TEXT may also be specified using the options dict parameter.

`TextData(input_file_path=None, data=None, options=None)`

Possible `options`:

* data_format - User-selected format in which to return data. Currently only
  supports "text".
* samples_per_line - Chunks by which to read in the specified dataset

Data Using a URL
================

Data class for loading datasets of any type using a URL. Specified by passing
in any valid URL that points to one of the valid data types. Options
pertaining to the URL may also be specified using the options dict parameter.

`Data(input_file_path=None, data=None, options=None)`

Possible `options`:

* verify_ssl - Must be a boolean string, choices: "True", "False". Set to
  "True" by default.
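
As a recap of the option names listed for each reader above, the following
sketch collects them into lookup tables and checks an options dict against
them before it is handed to a reader. The names `check_options`,
`DOCUMENTED_OPTIONS`, and `DOCUMENTED_DATA_FORMATS` are hypothetical and
introduced only for illustration; they are not part of dataprofiler, and the
library may validate options differently.

```python
# Option names per reader, transcribed from the sections above.
DOCUMENTED_OPTIONS = {
    "csv": {"delimiter", "data_format", "selected_columns", "header"},
    "json": {"data_format", "selected_keys", "payload_keys"},
    "avro": {"data_format", "selected_keys"},
    "parquet": {"data_format", "selected_keys"},
    "text": {"data_format", "samples_per_line"},
}

# Documented choices for the "data_format" option of each reader.
DOCUMENTED_DATA_FORMATS = {
    "csv": {"dataframe", "records"},
    "json": {"dataframe", "records", "json", "flattened_dataframe"},
    "avro": {"dataframe", "records", "avro", "json", "flattened_dataframe"},
    "parquet": {"dataframe", "records", "json"},
    "text": {"text"},
}


def check_options(reader, options):
    """Hypothetical helper: return a list of problems found in `options`."""
    problems = []
    # Flag option names that are not documented for this reader.
    for key in sorted(set(options) - DOCUMENTED_OPTIONS[reader]):
        problems.append(f"unknown option for {reader}: {key}")
    # Flag a data_format value outside the documented choices.
    fmt = options.get("data_format")
    if fmt is not None and fmt not in DOCUMENTED_DATA_FORMATS[reader]:
        problems.append(f"unsupported data_format for {reader}: {fmt}")
    return problems


csv_options = {
    "delimiter": ",",
    "header": 0,
    "selected_columns": ["ssn"],
    "data_format": "records",
}
print(check_options("csv", csv_options))  # prints []
print(check_options("parquet", {"data_format": "flattened_dataframe"}))
```

An options dict that passes this check can then be supplied as in the earlier
examples, e.g. `CSVData("your_file.csv", options=csv_options)`.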