Roadmap¶
For more detailed tasks, check out the repo’s GitHub issues page: GitHub Issues.
Data Reader Updates¶
- Read data from S3 bucket
Within the current dp.Data() API paradigm, we want to enable passing an S3 bucket file path to read in data from AWS S3.
Pass list of data file paths to data reader
Pass in list of data frames to data reader
New Model¶
Transformer model for sensitive data detection
Historical Profiles¶
Conditional Report Metric¶
Based on what is populated in other metrics in the report, have “secondary” / “derivative” values of a number (or of that number in conjunction with another number) populate in the report as well.
For example, if null_count is not None, then populate a null_percent key whose value is the quotient null_count / sample_count.
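The null_percent example can be sketched as a small post-processing step over a report dictionary. This is only an illustration: the function name add_derived_metrics and the flat dictionary layout are assumptions, not the library's actual report structure.

```python
from typing import Any, Dict


def add_derived_metrics(report: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `report` with derived metrics added when inputs exist.

    Hypothetical helper: the real report layout may be nested differently.
    """
    derived = dict(report)
    null_count = report.get("null_count")
    sample_count = report.get("sample_count")
    # Populate null_percent only when null_count is present and the
    # division is defined (sample_count present and non-zero).
    if null_count is not None and sample_count:
        derived["null_percent"] = null_count / sample_count
    return derived


result = add_derived_metrics({"null_count": 5, "sample_count": 20})
skipped = add_derived_metrics({"null_count": None, "sample_count": 20})
```

Returning a copy keeps the original metrics untouched, so derived keys can be recomputed or dropped without affecting the base report.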
Space / Time Testing¶
- Automatic comparison testing for space and time analysis on PRs
Standardize a report for space time analysis for future comparisons (create baseline numbers)
Include those in integration tests that will automatically run on code when it is changed in PRs
This could be an optional test, run when there is concern that a change may degrade the library’s performance
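One way the baseline comparison above might look, as a rough sketch using only the standard library. The helper names measure and regressions, the metric keys, and the 10% tolerance are all assumptions chosen for illustration, not an agreed report format.

```python
import time
import tracemalloc
from typing import Callable, Dict, List


def measure(func: Callable[[], object]) -> Dict[str, float]:
    """Run `func` once and record wall-clock time and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    func()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": float(peak)}


def regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.10,  # assumed threshold; tune per metric in practice
) -> List[str]:
    """Return the metric names that exceed the baseline by more than `tolerance`."""
    return [
        name
        for name, base in baseline.items()
        if current[name] > base * (1 + tolerance)
    ]


sample = measure(lambda: sum(range(1000)))
flagged = regressions(
    {"seconds": 1.2, "peak_bytes": 100.0},
    {"seconds": 1.0, "peak_bytes": 100.0},
)
```

In a PR check, the baseline numbers would come from the standardized report, and a non-empty result from regressions would fail (or flag) the run.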
Testing Suite Upgrades¶
Add mocking to unit tests where mocking is not utilized
Separate integration testing from the unit testing suite, and determine how to run it only remotely during PRs
Backward compatibility testing, along with informative warnings and errors when a user is using incompatible versions of the library and a saved profile object
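The warnings and errors for incompatible versions could take a shape like the sketch below. Everything here is hypothetical: it assumes a saved profile records the library version that wrote it, and the rule used (error on a major-version mismatch, warning when the profile is newer than the installed library) is only one possible policy.

```python
import warnings
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' string; pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split(".")[:3])


def check_profile_compatibility(saved_version: str, library_version: str) -> None:
    """Raise or warn when a saved profile may not load correctly (assumed policy)."""
    saved = parse_version(saved_version)
    current = parse_version(library_version)
    if saved[0] != current[0]:
        raise ValueError(
            f"Profile saved with version {saved_version} is incompatible with "
            f"installed version {library_version} (major versions differ)."
        )
    if saved > current:
        warnings.warn(
            f"Profile saved with newer version {saved_version}; installed "
            f"version is {library_version}. Some fields may be ignored.",
            UserWarning,
        )
```

A backward compatibility test would then load profiles saved by earlier releases and assert that the expected warning or error is produced.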
Historical Versions¶
Legacy version upgrades to enable patches to prior versions of the Data Profiler
Miscellaneous¶
Refactor Pandas to Polars DataFrames
Spearman correlation calculation
Workflow Profiles
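For the Spearman correlation item above, the standard definition is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. The following is a minimal pure-Python sketch of that definition; the function names are illustrative and say nothing about how the calculation would be integrated into the profiler.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ranks matter, any strictly increasing relationship between the two columns gives a coefficient of 1, and any strictly decreasing one gives -1.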