Intro to Data Readers¶
Within the Data Profiler, there are 5 data reader classes:
CSVData (delimited data: CSV, TSV, etc.)
JSONData
ParquetData
AVROData
TextData
Each of these classes can be used to read data individually, however the Data Profiler provides the unique capability of auto detecting what data you have and reading it automatically by using the Data
class.
import dataprofiler as dp
data = dp.Data('/path/to/mydata.abc') # auto detects and reads your data
Automatically reading and detecting data¶
Below is a demonstration of utilizing the Data
class which automatically detects the type of data for a given file and reads it automatically.
[ ]:
import os
import sys
try:
sys.path.insert(0, '..')
import dataprofiler as dp
except ImportError:
import dataprofiler as dp
[ ]:
# use data reader to read input data with different file types
data_folder = "../dataprofiler/tests/data"
csv_files = [
"csv/aws_honeypot_marx_geo.csv",
"csv/all-strings-skip-header-author.csv", # csv files with the author/description on the first line
"csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]
json_files = [
"json/complex_nested.json",
"json/honeypot_intentially_mislabeled_file.csv", # json file with the .csv extension
]
parquet_files = [
"parquet/nation.dict.parquet",
"parquet/nation.plain.intentionally_mislabled_file.csv", # parquet file with the .csv extension
]
avro_files = [
"avro/userdata1.avro",
"avro/userdata1_intentionally_mislabled_file.json", # avro file with the .json extension
]
text_files = [
"txt/discussion_reddit.txt",
]
all_files = csv_files + json_files + parquet_files + avro_files + text_files
print('filepath' + ' ' * 58 + 'data type')
print('='*80)
for file in all_files:
filepath = os.path.join(data_folder, file)
data = dp.Data(filepath)
print("{:<65} {:<15}".format(file, data.data_type))
print("\n")
[ ]:
# importing from a url
data = dp.Data('https://raw.githubusercontent.com/capitalone/DataProfiler/main/dataprofiler/tests/data/csv/diamonds.csv')
data.head()
Specifying detection options of Data
and loading pandas.DataFrame
¶
The Data
class also gives the ability to set options or if the user wants to load their data with specific requirements. Options for each data reader are specified in the docs: https://capitalone.github.io/DataProfiler/docs/0.4.4/html/dataprofiler.data_readers.html
import dataprofiler as dp
options = {...} # allowed options are specified for each data reader.
data = dp.Data(data, options=options)
Later in this tutorial, the options for the CSVData class will be discussed.
Additionally, a user can directly load a pandas.DataFrame
as any data reader they choose.
[ ]:
import pandas as pd
from dataprofiler.data_readers.csv_data import CSVData
df = pd.DataFrame(['my', 'random', 'data'])
# specify via the `Data` class
data = dp.Data(data=df, data_type='csv')
print('Data Type: ', data.data_type)
# specifically use the CSVData class
data = CSVData(data=df)
print('Data Type: ', data.data_type)
Accessing data and attributes¶
Once loaded, the data can be accessed via the data
property of the object. Additional information about the data loaded may differ between data readers.
For this example we will focus on CSVData
.
[ ]:
filepath = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
data = dp.Data(filepath)
print('Data Type: ', data.data_type)
print('Data Filepath: ', data.input_file_path)
print('File Encoding: ', data.file_encoding)
print('Data Length (two techniques): ', len(data), data.length)
print("Data Access:")
data.data
Checking data file types with is_match
¶
Each data reader has a class method is_match
which determines whether or not a dataset is of a given data type.
CSVData.is_match
JSONData.is_match
ParquetData.is_match
AVROData.is_match
TextData.is_match
[ ]:
# supplemental function
def add_true_false_color(value):
"""Converts True to green and False to red in printed text."""
if value:
return "\x1b[92m " + str(is_match) + "\x1b[0m"
return "\x1b[31m " + str(is_match) + "\x1b[0m"
[ ]:
from dataprofiler.data_readers.csv_data import CSVData
non_csv_files = [
'json/iris-utf-8.json',
'json/honeypot_intentially_mislabeled_file.csv',
'parquet/titanic.parq',
'parquet/nation.plain.intentionally_mislabled_file.csv',
'txt/code.txt',
'txt/sentence.txt',
'avro/users.avro',
'avro/snappy_compressed_intentionally_mislabeled_file.csv',
]
print("Is the file a CSV?")
print('=' * 80)
for file in csv_files:
filepath = os.path.join(data_folder, file)
is_match = CSVData.is_match(filepath)
print(add_true_false_color(is_match), ':', file)
print('=' * 80)
for file in non_csv_files:
filepath = os.path.join(data_folder, file)
is_match = CSVData.is_match(filepath)
print(add_true_false_color(is_match), ':', file)
print('=' * 80)
Reloading data after altering options with reload
¶
There are two cases for using the reload function, both of which require the data type to have been interpreted correctly:
1. The options were not correctly determined
2. The options were loaded correctly but a change is desired.
In the example below, the data_format
for reading the data is changed and the data is then reloaded.
[ ]:
filepath = "../dataprofiler/tests/data/csv/diamonds.csv"
data = dp.Data(filepath)
print('original data:')
print('=' * 80)
print(data.data[:5])
print()
data.reload(options={'data_format': 'records', 'record_samples_per_line': 1})
print('reloaded data:')
print('=' * 80)
data.data[:5]
[ ]:
A deeper dive into CSVData
¶
The rest of this tutorial will focus on how to use the data reader class: CSVData
. The CSVData
class is used for reading delimited data. Delimited data are datasets which have their columns specified by a specific character, commonly the ,
. E.g. from the diamonds.csv
dataset:
carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
However, the delimiter can be any character. Additionally, a quotechar
, commonly "
, can be specified which allows a delimiter to be contained within a column value. E.g. from the blogposts.csv
dataset:
Blog Post,Date,Subject,Field
"Monty Hall, meet Game Theory",4/13/2014,Statistics,Mathematics
Gaussian Quadrature,4/13/2014,Algorithms,Mathematics
Notice how "Monty Hall, meet Game Theory"
is contained by the quotechar because it contains the delimiter value ,
.
These delimiter dataset parameters (and more) can be automatically determined by the CSVData
data reader, however they can also be set via the options as demonstrated later in this tutorial.
Intro to the CSVData
data reader¶
Previously, it was shown that CSVData
may automatically be detected using Data
or can be manually specified by the user:
import dataprofiler as dp
from dataprofiler.data_readers.csv_data import CSVData
data = dp.Data(filepath)
data = CSVData(filepath)
[ ]:
# use data reader to read delimited data
data_folder = "../dataprofiler/tests/data"
csv_files = [
"csv/diamonds.csv",
"csv/all-strings-skip-header-author.csv", # csv files with the author/description on the first line
"csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]
for file in csv_files:
data = CSVData(os.path.join(data_folder, file))
print(data.data.head())
print('=' * 80)
CSVData Options¶
As mentioned preivously, CSVData
has options that can be set to finetune its detection or to ensure the data is being read in a specific manner. The options for CSVData
are detailed below:
delimiter - delimiter used to decipher the csv input file
quotechar - quote character used in the delimited file
header - location of the header in the file.
data_format - user selected format in which to return data can only be of specified types
selected_columns - columns being selected from the entire dataset
[ ]:
# options are set via a dictionary object in which the parameters are specified.
# these are the default values for each option
options = {
"delimiter": ",",
"quotechar": '"',
"header": 'auto',
"data_format": "dataframe", # type: str, choices: "dataframe", "records"
"selected_columns": list(),
}
Options: delimiter and quotechar¶
Below, both the auto detection and use of options will be illustrated for delimiter
and quotechar
.
[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/daily-activity-sheet-@-singlequote.csv"
num_lines = 10
with open(filepath) as fp:
print(''.join(fp.readlines()[:num_lines]))
[ ]:
data = dp.Data(filepath) # or use CSVData
print('Auto detected')
print('=' * 80)
print('delimiter: ', data.delimiter)
print('quotechar: ', data.quotechar)
data.data.head()
[ ]:
options = {'delimiter': '@', 'quotechar': "'"}
data = dp.Data(filepath, options=options) # or use CSVData
print('manually set')
print('=' * 80)
print('delimiter: ', data.delimiter)
print('quotechar: ', data.quotechar)
data.data.head()
[ ]:
# intentional failure with incorrect options
options = {'delimiter': ',', 'quotechar': '"'}
# will be interepted as TextData because the delimtier and quotechar were incorrect
data = dp.Data(filepath, options=options)
print('intentional faliure set')
print('=' * 80)
try:
print('delimiter: ', data.delimiter) # attribute error raised here, bc TextData, not CSVData
print('quotechar: ', data.quotechar)
# should not reach this or something went wrong
raise Exception('Should have failed because this is detected as TextData.')
except AttributeError:
print('When data_type is not set or the CSVData is not set, it will fail over to the\n'
'next best reader. In this case it is "TextData"\n')
data.data
Options: header¶
Below, both the auto detection and use of options will be illustrated for header
.
Notice how in the manually set mechanism, we are intentionally setting the header incorrectly to illustrate what happens.
[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/sparse-first-and-last-column-header-and-author-description.txt"
num_lines = 10
with open(filepath) as fp:
print(''.join(fp.readlines()[:num_lines]))
[ ]:
options = {'header': 'auto'} # auto detected (default value)
data = dp.Data(filepath, options=options) # or use CSVData
print('Data Header:', data.header)
print('=' * 80)
data.data.head()
[ ]:
options = {'header': 2} # intentionally set incorrectly at value 2
data = dp.Data(filepath, options=options) # or use CSVData
print('Data Header:', data.header)
print('=' * 80)
data.data.head()
Options: data_format¶
For CSVData, the data_format
option can have the following values:
dataframe - (default) loads the dataset as a pandas.DataFrame
records - loads the data as rows of text values, the extra parameter
record_samples_per_line
how many rows are combined into a single line
dataframe
is used for conducting structured profiling of the dataset while records
is for unstructured profiling.
Below, both the auto detection and use of options will be illustrated for data_format
.
[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/diamonds.csv"
num_lines = 10
with open(filepath) as fp:
print(''.join(fp.readlines()[:num_lines]))
[ ]:
options = {'data_format': 'dataframe'} # default
data = dp.Data(filepath, options=options) # or use CSVData
data.data[:5]
[ ]:
options = {'data_format': 'records', 'record_samples_per_line': 1}
data = dp.Data(filepath, options=options)
data.data[:5]
Options: selected columns¶
By default, all columns of a dataset will be read and loaded into the data reader. However, selected_columns
can be set to only load columns which the user requests.
[ ]:
# display the data we are reading
filepath = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
num_lines = 10
with open(filepath) as fp:
print(''.join(fp.readlines()[:num_lines]))
[ ]:
options = {'selected_columns': ['datetime', 'host', 'src', 'proto']}
data = dp.Data(filepath, options=options)
data.data.head()