Install
To install the full package from PyPI:
pip install DataProfiler[ml]
If the ML requirements are too strict (say, you don't want to install tensorflow), you can install a slimmer package. The slimmer package disables the default sensitive data detection / entity recognition (labeler).
Install the slimmer package from PyPI:
pip install DataProfiler
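To see which variant ended up in the current environment, a small check like the following can help. It is a minimal sketch: `has_module` is a hypothetical helper, and it assumes that the presence of `tensorflow` is a reasonable proxy for the `[ml]` extras having been installed.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for modules in an unusual or broken state.
        return False


# Assumption: tensorflow is only present when the [ml] extras were installed.
print("dataprofiler found:", has_module("dataprofiler"))
print("tensorflow found:  ", has_module("tensorflow"))
```

If `dataprofiler` is found but `tensorflow` is not, the environment most likely has the slimmer package.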
Snappy Installation
Snappy is required to profile Parquet/Avro datasets.
macOS with Homebrew:
brew install snappy && CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy
Linux (Debian/Ubuntu, using apt-get):
sudo apt-get -y install libsnappy-dev
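After installing the system library and the Python binding, a quick round trip can confirm that snappy is usable. This is a sketch under the assumption that the `python-snappy` package is installed, which exposes the `snappy` module; `snappy_available` is a hypothetical helper name.

```python
def snappy_available() -> bool:
    """Return True if the snappy binding imports and round-trips data."""
    try:
        import snappy  # provided by the python-snappy package
    except ImportError:
        return False
    payload = b"dataprofiler snappy check"
    # Compress and decompress a small payload to exercise the native library.
    return snappy.decompress(snappy.compress(payload)) == payload


print("snappy usable:", snappy_available())
```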
Build From Scratch
NOTE: These instructions are for Python 3.
Install virtualenv:
python3 -m pip install virtualenv
Set up the virtual environment:
python3 -m virtualenv --python=python3 venv3
source venv3/bin/activate
Install requirements:
pip3 install -r requirements.txt
Install labeler dependencies:
pip3 install -r requirements-ml.txt
Install from the repo by building the distribution with setup.py and installing the resulting wheel locally:
python3 setup.py sdist bdist bdist_wheel
pip3 install dist/DataProfiler*-py3-none-any.whl
If you see:
ERROR: Double requirement given:dataprofiler==X.Y.Z from dataprofiler/dist/DataProfiler-X.Y.Z-py3-none-any.whl (already in dataprofiler==X2.Y2.Z2 from dataprofiler/dist/DataProfiler-X2.Y2.Z2-py3-none-any.whl, name='dataprofiler')
This means that you have multiple versions of the DataProfiler distribution in the dist folder. To resolve, either remove the older one or delete the folder and rerun the steps above.
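Before running the `pip3 install dist/DataProfiler*-py3-none-any.whl` step, the dist folder can be checked for leftover wheels. The sketch below is illustrative only: `stale_wheels` is a hypothetical helper, and it assumes the most recently modified wheel is the one to keep.

```python
from pathlib import Path


def stale_wheels(dist_dir="dist"):
    """Return every DataProfiler wheel in dist_dir except the newest one."""
    wheels = sorted(Path(dist_dir).glob("DataProfiler*-py3-none-any.whl"))
    if len(wheels) <= 1:
        return []
    # Assumption: the newest file by modification time is the current build.
    newest = max(wheels, key=lambda p: p.stat().st_mtime)
    return [w for w in wheels if w != newest]


for wheel in stale_wheels():
    print("older wheel, consider removing:", wheel)
```

The helper only reports candidates. Deleting them, or deleting the whole folder and rebuilding, is left as a manual step.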
Install via GitHub:
pip3 install git+https://github.com/capitalone/dataprofiler.git#egg=dataprofiler
Testing
For testing, install test requirements:
pip3 install -r requirements-test.txt
To run all unit tests, use:
DATAPROFILER_SEED=0 python3 -m unittest discover -p "test*.py"
To run a single file of unit tests, use the form:
DATAPROFILER_SEED=0 python3 -m unittest discover -p test_profile_builder.py
To run a file with pytest, use:
DATAPROFILER_SEED=0 pytest dataprofiler/tests/data_readers/test_csv_data.py -v
To run an individual unit test class, use the form:
DATAPROFILER_SEED=0 python3 -m unittest dataprofiler.tests.profilers.test_profile_builder.TestProfiler
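The commands above all set `DATAPROFILER_SEED=0` in the environment before invoking the test runner. The same pattern can be scripted, for example for shells where the inline `VAR=value command` syntax is not available. This is a minimal sketch; `seeded_env`, `unittest_command`, and `run_tests` are hypothetical helper names, not part of DataProfiler.

```python
import os
import subprocess
import sys


def seeded_env(seed=0):
    """Copy the current environment and set DATAPROFILER_SEED."""
    env = dict(os.environ)
    env["DATAPROFILER_SEED"] = str(seed)
    return env


def unittest_command(pattern="test*.py"):
    """Build the unittest discovery command for a file name pattern."""
    return [sys.executable, "-m", "unittest", "discover", "-p", pattern]


def run_tests(pattern="test*.py", seed=0):
    """Run discovery with the seed set and return the exit code."""
    result = subprocess.run(unittest_command(pattern), env=seeded_env(seed))
    return result.returncode
```

Calling `run_tests("test_profile_builder.py")` from the repository root would mirror the single-file command shown earlier.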