Install

To install the full package from PyPI:

pip install DataProfiler[ml]

If the ML requirements are too strict (say, you don’t want to install TensorFlow), you can install a slimmer package. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

Install the slimmer package from PyPI:

pip install DataProfiler
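
One way to check which variant an environment ended up with is to look for the relevant modules without importing them. This is a minimal sketch, not part of DataProfiler. It assumes the package imports as `dataprofiler` and that the presence of `tensorflow` is a reasonable proxy for the `[ml]` extras. The helper name `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or partially initialised modules.
        return False


# Assumption: the package imports as "dataprofiler".
print("dataprofiler found:", is_importable("dataprofiler"))
# Assumption: TensorFlow being present is a proxy for the [ml] extras.
print("tensorflow found:", is_importable("tensorflow"))
```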

Snappy Installation

Snappy is required to profile Parquet/Avro datasets.

macOS (Intel chip) with Homebrew:

brew install snappy && CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy

macOS (Apple chip) with Homebrew:

brew install snappy && CPPFLAGS="-I/opt/homebrew/include -L/opt/homebrew/lib" pip install python-snappy

Linux install (Debian/Ubuntu, using apt-get):

sudo apt-get -y install libsnappy-dev
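
To confirm that the Python bindings are usable after installing Snappy, a quick round trip can help. This is a sketch under assumptions: it relies on the `compress` and `uncompress` functions of the `python-snappy` package and reports `False` if the module is missing. The helper name `snappy_roundtrip_ok` is hypothetical.

```python
def snappy_roundtrip_ok(payload=b"dataprofiler snappy check"):
    """Return True if python-snappy imports and round-trips a payload."""
    try:
        import snappy  # provided by the python-snappy distribution

        return snappy.uncompress(snappy.compress(payload)) == payload
    except (ImportError, AttributeError):
        # Missing bindings, or an unrelated module that happens to be named snappy.
        return False


print("python-snappy usable:", snappy_roundtrip_ok())
```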

Build From Scratch

NOTE: These instructions assume Python 3.

Install virtualenv:

python3 -m pip install virtualenv

Set up the virtual environment:

python3 -m virtualenv --python=python3 venv3
source venv3/bin/activate
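
Before installing requirements, it can be worth confirming that the active interpreter is the one inside the virtual environment. The following is a small sketch using only the standard library. The helper name `in_virtual_env` is hypothetical.

```python
import sys


def in_virtual_env():
    """Best-effort check for an active virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_env())
```

Run with the environment activated, the interpreter path should point inside the venv3 directory.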

Install requirements:

pip3 install -r requirements.txt

Install labeler dependencies:

pip3 install -r requirements-ml.txt

Install via the repo: build the distribution with setup.py and install it locally:

python3 setup.py sdist bdist bdist_wheel
pip3 install dist/DataProfiler*-py3-none-any.whl

If you see:

ERROR: Double requirement given:dataprofiler==X.Y.Z from dataprofiler/dist/DataProfiler-X.Y.Z-py3-none-any.whl (already in dataprofiler==X2.Y2.Z2 from dataprofiler/dist/DataProfiler-X2.Y2.Z2-py3-none-any.whl, name='dataprofiler')

This means that you have multiple versions of the DataProfiler distribution in the dist folder. To resolve, either remove the older one or delete the folder and rerun the steps above.
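
A quick way to see which wheels the glob in the pip command would match is sketched below. It assumes the wheels live in a dist folder under the current directory and uses file modification time as a rough stand-in for "older". The helper name `find_wheels` is hypothetical.

```python
import glob
import os


def find_wheels(dist_dir="dist"):
    """Return DataProfiler wheels in dist_dir, oldest first by modification time."""
    pattern = os.path.join(dist_dir, "DataProfiler*-py3-none-any.whl")
    return sorted(glob.glob(pattern), key=os.path.getmtime)


wheels = find_wheels()
if len(wheels) > 1:
    print("Multiple wheels found; the pip glob matches all of them:")
    for path in wheels[:-1]:
        print("  older:", path)
    print("  newest:", wheels[-1])
else:
    print("Wheels found:", wheels)
```

The sketch only lists files. Removing the older wheels is left as a manual step.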

Install via GitHub:

pip3 install git+https://github.com/capitalone/dataprofiler.git#egg=dataprofiler
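
Whichever install route is used, the installed version can be read from the package metadata. This is a sketch assuming Python 3.8 or newer (for `importlib.metadata`) and that the distribution is registered under the name `DataProfiler`, as in the pip commands. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name="DataProfiler"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("DataProfiler version:", installed_version())
```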

Testing

For testing, install test requirements:

pip3 install -r requirements-test.txt

To run all unit tests, use:

DATAPROFILER_SEED=0 python3 -m unittest discover -p "test*.py"

To run a single file of unit tests, use the form:

DATAPROFILER_SEED=0 python3 -m unittest discover -p test_profile_builder.py

To run a file with pytest, use:

DATAPROFILER_SEED=0 pytest dataprofiler/tests/data_readers/test_csv_data.py -v

To run an individual unit test, use the form:

DATAPROFILER_SEED=0 python3 -m unittest dataprofiler.tests.profilers.test_profile_builder.TestProfiler