Categorical Column Profile

class dataprofiler.profilers.categorical_column_profile.CategoricalColumn(name, options=None)

Bases: dataprofiler.profilers.base_column_profilers.BaseColumnProfiler

Categorical column profile, a subclass of BaseColumnProfiler. Represents a column in the dataset that is categorical.

Initializes the base column properties and the categorical profile itself.

Parameters: name (String) – Name of data

type = 'category'

diff(other_profile, options=None)

Finds the differences between this CategoricalColumn profile and another.

Parameters: other_profile (CategoricalColumn) – profile to find the difference with
Returns: the CategoricalColumn differences
Return type: dict

property profile: Property for profile. Returns the profile of the column. For categorical_count, it displays the top k most frequently occurring categories, in descending order of frequency.
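The "top k categories in descending order" behavior described for profile can be illustrated with a minimal standard-library sketch. The helper name top_k_categories and the sample data are hypothetical and only illustrate the idea; this is not the library's implementation.

```python
from collections import Counter


def top_k_categories(values, k):
    """Return the k most frequent categories as (category, count) pairs,
    ordered from most to least frequent. Hypothetical helper for illustration."""
    counts = Counter(values)
    # Counter.most_common sorts entries by count in descending order.
    return counts.most_common(k)


sample = ["red", "blue", "red", "green", "red", "blue"]
top_two = top_k_categories(sample, k=2)
print(top_two)
```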

property categories: Property for categories.

property categorical_counts: Property for the counts of each category.

property unique_ratio: Property for unique_ratio. Returns the ratio of unique categories to sample_size.
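As a rough sketch of the ratio described for unique_ratio, the number of distinct categories is divided by the number of samples. The function below is a hypothetical stand-alone helper, and returning None for empty input is a choice made for this sketch; the library may handle that case differently.

```python
def unique_ratio(values):
    """Ratio of distinct categories to the number of samples.
    Hypothetical helper; not the library's implementation."""
    sample_size = len(values)
    if sample_size == 0:
        # Assumption for this sketch: no samples means no meaningful ratio.
        return None
    return len(set(values)) / sample_size


ratio = unique_ratio(["a", "b", "a", "c"])  # 3 distinct values over 4 samples
print(ratio)
```

A ratio close to 1 means nearly every value is distinct, which usually suggests the column is not categorical.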

property is_match: Property for is_match. Returns True if the column is categorical.

update(df_series)

Updates the column profile.

Parameters: df_series (pandas.core.series.Series) – Data to profile.
Returns: None

property gini_impurity

Property for Gini impurity. Gini impurity measures the likelihood that a new instance of a random variable would be classified incorrectly if it were labeled at random according to the distribution of categories.

G = Σ(i=1; J): P(i) * (1 - P(i)), where i ranges over the J categories and P(i) is the proportion of samples in the column that belong to category i.

Returns: None or Gini Impurity probability
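The Gini impurity formula above can be sketched in plain Python by taking P(i) to be each category's share of the samples. The function is a hypothetical stand-alone helper, not the library's code, and returning None for empty input is an assumption made for this sketch.

```python
from collections import Counter


def gini_impurity(values):
    """Compute G = sum over categories of P(i) * (1 - P(i)).
    Hypothetical helper for illustration only."""
    n = len(values)
    if n == 0:
        # Assumption for this sketch: undefined without any samples.
        return None
    counts = Counter(values)
    # P(i) is the fraction of samples that fall in category i.
    return sum((c / n) * (1 - c / n) for c in counts.values())


g = gini_impurity(["a", "a", "b", "b"])  # 0.5 * 0.5 + 0.5 * 0.5
print(g)
```

A column with a single category gives 0, and the value grows as the samples spread more evenly across more categories.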

col_type = None

property unalikeability

Property for unalikeability. Unalikeability measures “how often observations differ from one another”. Reference: Perry, M. and Kader, G. Variation as Unalikeability. Teaching Statistics, Vol. 27, No. 2 (2005), pp. 58-60.

U = Σ(i=1,n)Σ(j=1,n): (Cij)/(n**2-n), where n is the number of observations and Cij = 1 if observations i and j differ, 0 if they are the same.

Returns: None or unalikeability probability
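The unalikeability formula can be sketched in two equivalent ways: directly over all ordered pairs of observations, and from category counts, since the number of differing ordered pairs equals n**2 minus the sum of squared category counts. Both functions are hypothetical helpers for illustration, and returning None when there are fewer than two observations (where n**2 - n is zero) is an assumption made for this sketch.

```python
from collections import Counter
from itertools import permutations


def unalikeability_pairwise(values):
    """Count ordered pairs (i, j), i != j, whose observations differ,
    divided by n**2 - n. Quadratic in n; illustration only."""
    n = len(values)
    if n < 2:
        # Assumption for this sketch: the denominator n**2 - n would be zero.
        return None
    differing = sum(1 for a, b in permutations(values, 2) if a != b)
    return differing / (n**2 - n)


def unalikeability_from_counts(values):
    """Same quantity computed from category counts in linear time."""
    n = len(values)
    if n < 2:
        return None
    counts = Counter(values)
    # Ordered pairs (including i == j) with equal values: sum of count squared.
    equal_pairs = sum(c * c for c in counts.values())
    return (n**2 - equal_pairs) / (n**2 - n)


data = ["a", "a", "b"]
u_pairs = unalikeability_pairwise(data)
u_counts = unalikeability_from_counts(data)
print(u_pairs, u_counts)
```

The count-based form avoids the quadratic pair enumeration, which matters for columns with many rows.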