dataprofiler.profilers.text_column_profile module¶

Text profile analysis for an individual column within structured profiling.

class dataprofiler.profilers.text_column_profile.TextColumn(name: str | None, options: TextOptions = None)¶

Bases: NumericStatsMixin[TextColumn], BaseColumnPrimitiveTypeProfiler[TextColumn]

Text column profile subclass of BaseColumnProfiler.

Represents a column in the dataset which is a text column.

Initialize column base properties and itself.

Parameters:

name (str | None) – Name of the data
options (TextOptions) – Options for the Text column

type: str | None = 'text'¶

report(remove_disabled_flag: bool = False) → dict¶: Report the profile attributes of the class, potentially popping values from self.profile.

property profile: dict¶

Return the profile of the column.

Returns:

diff(other_profile: TextColumn, options: dict | None = None) → dict¶

Find the differences for text columns.

Parameters: other_profile (TextColumn Profile) – profile to find the difference with
Returns: the differences between the text columns
Return type: dict

property data_type_ratio: float | None¶

Calculate the ratio of samples which match this data type.

NOTE: all values can be considered strings, so this property always returns 1.

Returns: ratio of data type
Return type: float

update(df_series: Series) → TextColumn¶

Update the column profile.

Parameters: df_series (pandas.core.series.Series) – pandas Series of data used to update the profile
Returns: updated TextColumn
Return type: TextColumn

classmethod load_from_dict(data, config: dict | None = None)¶

Parse attributes from a JSON dictionary into self.

Parameters:

data (dict[string, Any]) – dictionary with attributes and values.
config (Dict | None) – config for loading column profiler params from dictionary

Returns:

Profiler with attributes populated.

Return type:

TextColumn

col_type = None¶

static is_float(x: str) → bool¶

Return True if x is float.

For “0.80” this function returns True.
For “1.00” this function returns True.
For “1” this function returns True.

Parameters: x (str) – string to test
Returns: if is float or not
Return type: bool
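
The documented cases above are consistent with a check that simply attempts a float conversion. The following is a minimal sketch under that assumption; `is_float_like` is a hypothetical stand-in written for illustration and is not taken from the library's source.

```python
def is_float_like(x: str) -> bool:
    """Hypothetical helper: True if the string parses as a float."""
    try:
        float(x)
    except ValueError:
        # The string is not a valid float literal.
        return False
    return True


# Exercise the documented cases plus one non-numeric string.
for sample in ("0.80", "1.00", "1", "abc"):
    print(sample, is_float_like(sample))
```

Under this assumption, any string that Python's built-in `float()` accepts counts as a float, which includes integer-looking strings such as “1”.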

static is_int(x: str) → bool¶

Return True if x is integer.

For “0.80” this function returns False.
For “1.00” this function returns True.
For “1” this function returns True.

Parameters: x (str) – string to test
Returns: if is integer or not
Return type: bool
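
The documented cases suggest that a string counts as an integer when its numeric value has no fractional part. The following is a minimal sketch under that assumption; `is_int_like` is a hypothetical stand-in and may differ from the library's actual implementation.

```python
def is_int_like(x: str) -> bool:
    """Hypothetical helper: True if the string's numeric value is whole."""
    try:
        as_float = float(x)
        # int() raises OverflowError for infinity and ValueError for NaN.
        as_int = int(as_float)
    except (ValueError, OverflowError):
        return False
    # "1.00" parses to 1.0, which equals 1; "0.80" parses to 0.8, which does not equal 0.
    return as_float == as_int


# Exercise the documented cases plus one non-numeric string.
for sample in ("0.80", "1.00", "1", "abc"):
    print(sample, is_int_like(sample))
```

Comparing the float value with its truncated integer is what lets “1.00” pass while “0.80” fails.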

property kurtosis: float | np.float64¶: Return kurtosis value.

property mean: float | np.float64¶: Return mean value.

property median: float¶

Estimate the median of the data.

Returns: the median
Return type: float

property median_abs_deviation: float | np.float64¶

Get median absolute deviation estimated from the histogram of the data.

1. Subtract bin edges from the median value
2. Fold the histogram to positive and negative parts around zero
3. Impose the two bin edges from the two histograms
4. Calculate the counts for the two histograms with the imposed bin edges
5. Superimpose the counts from the two histograms
6. Interpolate the median absolute deviation from the superimposed counts

Returns: median absolute deviation
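
For reference, the quantity being estimated is the median of the absolute deviations from the median. The sketch below computes it exactly from raw values; it does not reproduce the histogram-based procedure described above, and `exact_mad` is a hypothetical name used only for illustration.

```python
from statistics import median


def exact_mad(values):
    """Hypothetical helper: exact median absolute deviation of raw values."""
    center = median(values)
    # Distance of every value from the median, then the median of those distances.
    deviations = [abs(v - center) for v in values]
    return median(deviations)


# The outlier 100 barely affects the result, unlike the standard deviation.
print(exact_mad([1, 2, 3, 4, 100]))
```

An exact computation like this needs all raw values in memory, whereas an estimate interpolated from a histogram only needs bin edges and counts.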

property mode: list[float]¶

Find an estimate for the mode(s) of the data.

Returns: the mode(s) of the data
Return type: list(float)

static np_type_to_type(val: Any) → Any¶

Convert numpy variables to base python type variables.

Parameters: val (numpy type or base type) – value to check & change
Returns: val converted to a base python type
Return type: int or float

property skewness: float | np.float64¶: Return skewness value.

property stddev: float | np.float64¶: Return stddev value.

property variance: float | np.float64¶: Return variance.

min: int | float | np.float64 | np.int64 | None¶

max: int | float | np.float64 | np.int64 | None¶

sum: int | float | np.float64 | np.int64¶

max_histogram_bin: int¶

min_histogram_bin: int¶

histogram_bin_method_names: list[str]¶

histogram_selection: str | None¶

user_set_histogram_bin: int | None¶

bias_correction: bool¶

num_zeros: int | np.int64¶

num_negatives: int | np.int64¶

histogram_methods: dict¶

quantiles: list[float] | None¶

match_count: int¶

name: str | None¶

sample_size: int¶

metadata: dict¶

times: dict¶

thread_safe: bool¶