.. _architecture:

Architecture & Design Overview
******************************

This section describes the design rationale, algorithmic choices, assumptions,
testing strategy, and contribution process used in the DataProfiler library.

Overview
--------

DataProfiler computes numeric statistics (e.g., mean, variance, skewness,
kurtosis) using **streaming algorithms** that allow efficient, incremental
updates without recomputing from raw data. Approximate quantile metrics such as
the median are calculated using histogram-based estimation, making the system
scalable for large or streaming datasets.

Additionally, DataProfiler uses a **Convolutional Neural Network (CNN)** to
detect and label entities (e.g., names, emails, credit cards) in unstructured
text. This supports critical tasks such as **PII detection**, **schema
inference**, and **data quality analysis** across structured and unstructured
data.

Algorithm Rationale
-------------------

The algorithms were chosen for **speed, scalability, and flexibility**:

- **Streaming numeric methods** (e.g., Welford's algorithm, moment-based
  metrics, histogram binning) efficiently summarize data without full
  recomputation.
- **CNNs for entity detection** are fast, high-throughput, and well-suited for
  sequence labeling tasks in production environments.

These choices align with the tool's goal of delivering fast, accurate data
profiling with minimal configuration. Illustrative sketches of the streaming
and labeling approaches appear at the end of this section.

Assumptions & Limitations
-------------------------

- **Consistent formatting** of sensitive entities is assumed (e.g., standardized
  credit card or SSN formats).
- **Overlapping entity types** (e.g., phone vs. SSN) may be misclassified when
  surrounding context is limited.
- **Synthetic training data** may not fully capture real-world diversity,
  reducing model accuracy on natural or unstructured text.
- **Quantile estimation** (e.g., the median) is approximate and based on
  binning rather than exact sorting.

Testing & Validation
--------------------

- Comprehensive **unit testing** is performed across Python 3.9, 3.10, and 3.11.
- Tests are executed on every pull request targeting the ``dev`` or ``main``
  branch.
- All pull requests require **two code reviewer approvals** before merging.
- Testing includes correctness, performance, and compatibility checks to ensure
  production readiness.

Versioning & Contributions
--------------------------

- Versioning and development are managed via **GitHub**.
- Future changes must follow the guidelines in ``CONTRIBUTING.md``, including:

  - Forking the repo and branching from ``dev`` or an active feature branch.
  - Ensuring **80%+ unit test coverage** for all new functionality.
  - Opening a PR and securing **two approvals** prior to merging.
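
Illustrative Examples
---------------------

To make the streaming approach concrete, below is a minimal sketch of a
Welford-style accumulator for running mean and variance. It is written for
this document only; the class name ``StreamingStats`` and its methods are
hypothetical and are not part of DataProfiler's API.

.. code-block:: python

    # Minimal sketch of Welford's algorithm for streaming mean/variance.
    # Illustrative only; not DataProfiler's internal implementation.


    class StreamingStats:
        """Accumulate count, mean, and variance incrementally."""

        def __init__(self):
            self.count = 0
            self.mean = 0.0
            self.m2 = 0.0  # sum of squared deviations from the running mean

        def update(self, value: float) -> None:
            """Fold one new observation into the running statistics in O(1)."""
            self.count += 1
            delta = value - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (value - self.mean)

        @property
        def variance(self) -> float:
            """Sample variance; 0.0 until at least two observations are seen."""
            return self.m2 / (self.count - 1) if self.count > 1 else 0.0


    stats = StreamingStats()
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(x)                     # no raw data is retained or re-read
    print(stats.mean, stats.variance)       # 5.0 and ~4.571 for this input

Because each update touches only the running aggregates, new batches of data
can be folded in without revisiting previously profiled records, which is the
incremental-update property described above.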
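
The approximate median mentioned under *Overview* can be viewed as
interpolation within a binned summary rather than an exact sort. The sketch
below, written for this document with NumPy, shows the general idea; the
function name ``approximate_median``, the bin count, and the linear
interpolation are assumptions, not DataProfiler's actual estimator.

.. code-block:: python

    # Minimal sketch of histogram-based (approximate) median estimation.
    # Bin count and interpolation scheme are illustrative assumptions.
    import numpy as np


    def approximate_median(values: np.ndarray, bins: int = 100) -> float:
        """Estimate the median from bin counts and edges instead of sorting."""
        counts, edges = np.histogram(values, bins=bins)
        cumulative = np.cumsum(counts)
        half = values.size / 2.0
        idx = int(np.searchsorted(cumulative, half))   # bin holding the median
        below = cumulative[idx - 1] if idx > 0 else 0  # count before that bin
        fraction = (half - below) / counts[idx] if counts[idx] else 0.0
        return float(edges[idx] + fraction * (edges[idx + 1] - edges[idx]))


    rng = np.random.default_rng(0)
    data = rng.normal(loc=10.0, scale=2.0, size=100_000)
    print(approximate_median(data), np.median(data))   # close, not identical

The accuracy of such an estimate depends on the bin layout, which is why
quantile metrics like the median are documented as approximate in the
assumptions above.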
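
Finally, the entity detector can be pictured as a character-level CNN that
emits one label per character position (sequence labeling). The sketch below
uses ``tf.keras`` to show the general shape of such a model; the layer sizes,
character vocabulary, and label set are placeholders chosen for this document
and do not describe DataProfiler's trained data labeler.

.. code-block:: python

    # Minimal sketch of a character-level CNN for sequence labeling.
    # All sizes and the label set are illustrative placeholders.
    import tensorflow as tf

    NUM_CHARS = 128   # assumed character vocabulary (e.g., ASCII code points)
    NUM_LABELS = 4    # e.g., OTHER, NAME, EMAIL, CREDIT_CARD (placeholder set)
    MAX_LEN = 256     # characters per input window

    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(NUM_CHARS, 32)(inputs)   # char id -> vector
    x = tf.keras.layers.Conv1D(64, kernel_size=5, padding="same",
                               activation="relu")(x)
    x = tf.keras.layers.Conv1D(64, kernel_size=5, padding="same",
                               activation="relu")(x)
    # One softmax over labels per character position.
    outputs = tf.keras.layers.Dense(NUM_LABELS, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()

Convolutions over fixed-size character windows keep inference fast and
batch-friendly, which is what makes this style of model suitable for the
high-throughput labeling described above.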