
Intro to Data Readers

Within the Data Profiler, there are 5 data reader classes:

  • CSVData (delimited data: CSV, TSV, etc.)

  • JSONData

  • ParquetData

  • AVROData

  • TextData

Each of these classes can be used to read data individually; however, the Data Profiler provides the unique capability of automatically detecting and reading your data via the Data class.

import dataprofiler as dp
data = dp.Data('/path/to/mydata.abc')  # auto detects and reads your data
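
Each of these reader classes can also be imported and used directly, skipping auto detection. A minimal sketch (only the CSVData import path appears verbatim later in this tutorial; the other paths are assumed to follow the same pattern):

from dataprofiler.data_readers.csv_data import CSVData
from dataprofiler.data_readers.json_data import JSONData
from dataprofiler.data_readers.parquet_data import ParquetData
from dataprofiler.data_readers.avro_data import AVROData
from dataprofiler.data_readers.text_data import TextData

data = CSVData('/path/to/mydata.csv')  # reads the file as delimited data, no auto detection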

Automatically reading and detecting data

Below is a demonstration of the Data class, which detects the type of data in a given file and reads it automatically.

[ ]:
import os
import sys

try:
    # prefer the local repository copy of dataprofiler, if available
    sys.path.insert(0, '..')
    import dataprofiler as dp
except ImportError:
    # otherwise fall back to an installed dataprofiler package
    import dataprofiler as dp
[ ]:
# use data reader to read input data with different file types
data_folder = "../dataprofiler/tests/data"
csv_files = [
    "csv/aws_honeypot_marx_geo.csv",
    "csv/all-strings-skip-header-author.csv", # csv files with the author/description on the first line
    "csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]
json_files = [
    "json/complex_nested.json",
    "json/honeypot_intentially_mislabeled_file.csv", # json file with the .csv extension
]
parquet_files = [
    "parquet/nation.dict.parquet",
    "parquet/nation.plain.intentionally_mislabled_file.csv", # parquet file with the .csv extension
]
avro_files = [
    "avro/userdata1.avro",
    "avro/userdata1_intentionally_mislabled_file.json", # avro file with the .json extension
]
text_files = [
    "txt/discussion_reddit.txt",
]
all_files = csv_files + json_files + parquet_files + avro_files + text_files
print('filepath' + ' ' * 58 + 'data type')
print('='*80)
for file in all_files:
    filepath = os.path.join(data_folder, file)
    data = dp.Data(filepath)
    print("{:<65} {:<15}".format(file, data.data_type))
print("\n")
[ ]:
# importing from a url
data = dp.Data('https://raw.githubusercontent.com/capitalone/DataProfiler/main/dataprofiler/tests/data/csv/diamonds.csv')
data.head()

Specifying detection options of Data and loading pandas.DataFrame

The Data class also gives the user the ability to set options, should the data need to be loaded with specific requirements. The options for each data reader are specified in the docs: https://capitalone.github.io/DataProfiler/docs/0.4.4/html/dataprofiler.data_readers.html

import dataprofiler as dp

options = {...}  # allowed options are specified for each data reader.
data = dp.Data('/path/to/mydata.abc', options=options)

Later in this tutorial, the options for the CSVData class will be discussed.

Additionally, a user can directly load a pandas.DataFrame as any data reader they choose.

[ ]:
import pandas as pd
from dataprofiler.data_readers.csv_data import CSVData


df = pd.DataFrame(['my', 'random', 'data'])

# specify via the `Data` class
data = dp.Data(data=df, data_type='csv')
print('Data Type: ', data.data_type)

# specifically use the CSVData class
data = CSVData(data=df)
print('Data Type: ', data.data_type)

Accessing data and attributes

Once loaded, the data can be accessed via the data property of the object. The additional information available about the loaded data may differ between data readers.

For this example we will focus on CSVData.

[ ]:
filepath = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
data = dp.Data(filepath)
print('Data Type: ', data.data_type)
print('Data Filepath: ', data.input_file_path)
print('File Encoding: ', data.file_encoding)
print('Data Length (two techniques): ', len(data), data.length)
print("Data Access:")
data.data

Checking data file types with is_match

Each data reader has a class method is_match which determines whether or not a dataset is of a given data type.

  • CSVData.is_match

  • JSONData.is_match

  • ParquetData.is_match

  • AVROData.is_match

  • TextData.is_match
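
The same check works for any reader. As a quick sketch (the JSONData import path is assumed to mirror CSVData's; the file path comes from the test data used above):

from dataprofiler.data_readers.json_data import JSONData

print(JSONData.is_match("../dataprofiler/tests/data/json/complex_nested.json"))  # expected: True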
[ ]:
# supplemental function
def add_true_false_color(value):
    """Converts True to green and False to red in printed text."""
    if value:
        return "\x1b[92m  " + str(is_match) + "\x1b[0m"
    return "\x1b[31m " + str(is_match) + "\x1b[0m"
[ ]:
from dataprofiler.data_readers.csv_data import CSVData


non_csv_files = [
    'json/iris-utf-8.json',
    'json/honeypot_intentially_mislabeled_file.csv',
    'parquet/titanic.parq',
    'parquet/nation.plain.intentionally_mislabled_file.csv',
    'txt/code.txt',
    'txt/sentence.txt',
    'avro/users.avro',
    'avro/snappy_compressed_intentionally_mislabeled_file.csv',
]

print("Is the file a CSV?")
print('=' * 80)
for file in csv_files:
    filepath = os.path.join(data_folder, file)
    is_match = CSVData.is_match(filepath)
    print(add_true_false_color(is_match), ':', file)
    print('=' * 80)

for file in non_csv_files:
    filepath = os.path.join(data_folder, file)
    is_match = CSVData.is_match(filepath)
    print(add_true_false_color(is_match), ':', file)
    print('=' * 80)

Reloading data after altering options with reload

There are two cases for using the reload function, both of which require the data type to have been interpreted correctly:

1. The options were not correctly determined
2. The options were loaded correctly but a change is desired.

In the example below, the data_format for reading the data is changed and the data is then reloaded.

[ ]:
filepath = "../dataprofiler/tests/data/csv/diamonds.csv"

data = dp.Data(filepath)
print('original data:')
print('=' * 80)
print(data.data[:5])

print()
data.reload(options={'data_format': 'records', 'record_samples_per_line': 1})
print('reloaded data:')
print('=' * 80)
data.data[:5]

A deeper dive into CSVData

The rest of this tutorial will focus on how to use the data reader class CSVData. The CSVData class is used for reading delimited data. Delimited data are datasets whose columns are separated by a specific character, commonly the ,. E.g. from the diamonds.csv dataset:

carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75

However, the delimiter can be any character. Additionally, a quotechar, commonly ", can be specified, which allows a delimiter to be contained within a column value. E.g. from the blogposts.csv dataset:

Blog Post,Date,Subject,Field
"Monty Hall, meet Game Theory",4/13/2014,Statistics,Mathematics
Gaussian Quadrature,4/13/2014,Algorithms,Mathematics

Notice how "Monty Hall, meet Game Theory" is contained by the quotechar because it contains the delimiter value ,.

These delimited-dataset parameters (and more) can be automatically determined by the CSVData data reader; however, they can also be set via the options, as demonstrated later in this tutorial.

Intro to the CSVData data reader

Previously, it was shown that CSVData may automatically be detected using Data or can be manually specified by the user:

import dataprofiler as dp
from dataprofiler.data_readers.csv_data import CSVData

data = dp.Data(filepath)
data = CSVData(filepath)
[ ]:
# use data reader to read delimited data
data_folder = "../dataprofiler/tests/data"
csv_files = [
    "csv/diamonds.csv",
    "csv/all-strings-skip-header-author.csv", # csv files with the author/description on the first line
    "csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]

for file in csv_files:
    data = CSVData(os.path.join(data_folder, file))
    print(data.data.head())
    print('=' * 80)

CSVData Options

As mentioned previously, CSVData has options that can be set to fine-tune its detection or to ensure the data is being read in a specific manner. The options for CSVData are detailed below:

  • delimiter - delimiter used to decipher the csv input file

  • quotechar - quote character used in the delimited file

  • header - location of the header in the file

  • data_format - user-selected format in which to return the data; must be one of the allowed types ("dataframe" or "records")

  • selected_columns - columns being selected from the entire dataset

[ ]:
# options are set via a dictionary object in which the parameters are specified.
# these are the default values for each option
options = {
    "delimiter": ",",
    "quotechar": '"',
    "header": 'auto',
    "data_format": "dataframe",  # type: str, choices: "dataframe", "records"
    "selected_columns": list(),
}

Options: delimiter and quotechar

Below, both the auto detection and use of options will be illustrated for delimiter and quotechar.

[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/daily-activity-sheet-@-singlequote.csv"
num_lines = 10
with open(filepath) as fp:
    print(''.join(fp.readlines()[:num_lines]))
[ ]:
data = dp.Data(filepath)  # or use CSVData
print('Auto detected')
print('=' * 80)
print('delimiter: ', data.delimiter)
print('quotechar: ', data.quotechar)
data.data.head()
[ ]:
options = {'delimiter': '@', 'quotechar': "'"}
data = dp.Data(filepath, options=options)  # or use CSVData
print('manually set')
print('=' * 80)
print('delimiter: ', data.delimiter)
print('quotechar: ', data.quotechar)
data.data.head()
[ ]:
# intentional failure with incorrect options
options = {'delimiter': ',', 'quotechar': '"'}

# will be interpreted as TextData because the delimiter and quotechar were incorrect
data = dp.Data(filepath, options=options)
print('intentional failure set')
print('=' * 80)
try:
    print('delimiter: ', data.delimiter)  # AttributeError raised here, since data is TextData, not CSVData
    print('quotechar: ', data.quotechar)

    # should not reach this or something went wrong
    raise Exception('Should have failed because this is detected as TextData.')
except AttributeError:
    print('When the data_type is not set and CSVData does not match, the reader fails over to the\n'
          'next best reader. In this case it is "TextData"\n')
data.data

Options: header

Below, both the auto detection and use of options will be illustrated for header.

Notice how in the manually set example, the header is intentionally set incorrectly to illustrate what happens.

[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/sparse-first-and-last-column-header-and-author-description.txt"
num_lines = 10
with open(filepath) as fp:
    print(''.join(fp.readlines()[:num_lines]))
[ ]:
options = {'header': 'auto'}  # auto detected (default value)
data = dp.Data(filepath, options=options)  # or use CSVData
print('Data Header:', data.header)
print('=' * 80)
data.data.head()
[ ]:
options = {'header': 2}  # intentionally set incorrectly at value 2
data = dp.Data(filepath, options=options)  # or use CSVData
print('Data Header:', data.header)
print('=' * 80)
data.data.head()

Options: data_format

For CSVData, the data_format option can have the following values:

  • dataframe - (default) loads the dataset as a pandas.DataFrame

  • records - loads the data as rows of text values; the extra parameter record_samples_per_line specifies how many rows are combined into a single line

dataframe is used for conducting structured profiling of the dataset, while records is used for unstructured profiling.
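
As a hedged sketch of how each format feeds into profiling (dp.Profiler and its profiler_type parameter are taken from the Data Profiler docs; verify the exact signature for your installed version):

import dataprofiler as dp

data = dp.Data(filepath, options={'data_format': 'dataframe'})
profile = dp.Profiler(data)  # structured profiling of the DataFrame

data = dp.Data(filepath, options={'data_format': 'records', 'record_samples_per_line': 1})
profile = dp.Profiler(data, profiler_type='unstructured')  # unstructured profiling of the records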

Below, both the auto detection and use of options will be illustrated for data_format.

[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/diamonds.csv"
num_lines = 10
with open(filepath) as fp:
    print(''.join(fp.readlines()[:num_lines]))
[ ]:
options = {'data_format': 'dataframe'}  # default
data = dp.Data(filepath, options=options)  # or use CSVData
data.data[:5]
[ ]:
options = {'data_format': 'records', 'record_samples_per_line': 1}
data = dp.Data(filepath, options=options)
data.data[:5]

Options: selected_columns

By default, all columns of a dataset will be read and loaded into the data reader. However, selected_columns can be set to only load columns which the user requests.

[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
num_lines = 10
with open(filepath) as fp:
    print(''.join(fp.readlines()[:num_lines]))
[ ]:
options = {'selected_columns': ['datetime', 'host', 'src', 'proto']}
data = dp.Data(filepath, options=options)
data.data.head()