Categorical Column Profile

class dataprofiler.profilers.categorical_column_profile.CategoricalColumn(name, options=None)

Bases: dataprofiler.profilers.base_column_profilers.BaseColumnProfiler

Categorical column profile subclass of BaseColumnProfiler. Represents a column in the dataset which is a categorical column.

Initializes the base column properties and the categorical column itself.

Parameters

name (String) – Name of data

type = 'category'
diff(other_profile, options=None)

Finds the differences between this CategoricalColumn and another.

Parameters

other_profile (CategoricalColumn) – profile to find the difference with

Returns

the CategoricalColumn differences

Return type

dict

property profile

Property for profile. Returns the profile of the column. For categorical_count, it displays the k most frequent categories in descending order of frequency.
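The "top k categories in descending order" behavior can be illustrated with a minimal standard-library sketch. The function name top_k_categories and the parameter k are hypothetical and are not part of the dataprofiler API; this only shows the ordering idea, not how the library builds its profile.

```python
from collections import Counter


def top_k_categories(values, k):
    """Return the k most frequent categories as (category, count) pairs.

    Hypothetical helper for illustration only; pairs are ordered from the
    most frequent category to the least frequent one.
    """
    return Counter(values).most_common(k)


# Small example column with three categories.
column = ["x", "y", "x", "z", "x", "y"]
top_two = top_k_categories(column, 2)
```

Here top_two holds the two most frequent categories, "x" first and "y" second.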

property categories

Property for categories.

property unique_ratio

Property for unique_ratio. Returns the ratio of unique categories to sample_size.
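As a rough illustration of the ratio described above, the following sketch divides the number of distinct categories by the number of samples. The function is a stand-alone helper, not the library's implementation, and returning 0 for an empty column is an assumption made here to avoid dividing by zero.

```python
def unique_ratio(values):
    """Ratio of distinct categories to the number of samples.

    Stand-alone sketch; returning 0 for an empty input is an assumption,
    chosen only to avoid a division by zero.
    """
    sample_size = len(values)
    if sample_size == 0:
        return 0
    return len(set(values)) / sample_size


# Three distinct categories among four samples.
ratio = unique_ratio(["a", "a", "b", "c"])
```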

property is_match

Property for is_match. Returns True if the column is categorical.

update(df_series)

Updates the column profile.

Parameters

df_series (pandas.core.series.Series) – Data to profile.

Returns

None

property gini_impurity

Property for Gini impurity. Gini impurity measures the likelihood that a new instance of a random variable would be classified incorrectly.

G = Σ(i=1..J) P(i) * (1 - P(i)), where i ranges over the J categories in the column and P(i) is the proportion of values that fall in category i.

Returns

None or Gini Impurity probability
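The Gini impurity sum can be worked through with a short standard-library sketch. It takes P(i) to be the share of values in category i, and it returns None for an empty column; both the helper name and the empty-input handling are assumptions for illustration, not the library's code.

```python
from collections import Counter


def gini_impurity(values):
    """Compute the sum over categories of P(i) * (1 - P(i)).

    P(i) is taken as count(i) / len(values). Returning None for an
    empty input is an assumption made for this sketch.
    """
    n = len(values)
    if n == 0:
        return None
    counts = Counter(values)
    return sum((c / n) * (1 - c / n) for c in counts.values())


# Two equally frequent categories: 0.5 * 0.5 + 0.5 * 0.5.
gini_balanced = gini_impurity(["a", "a", "b", "b"])
# A single category leaves no room for misclassification.
gini_pure = gini_impurity(["a", "a", "a"])
```

A column with a single category gives an impurity of 0, and the value grows as the categories become more evenly spread.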

property unalikeability

Property for unalikeability. Unalikeability measures “how often observations differ from one another”. Reference: Perry, M. and Kader, G. Variation as Unalikeability. Teaching Statistics, Vol. 27, No. 2 (2005), pp. 58-60.

U = Σ(i=1..n) Σ(j=1..n) Cij / (n**2 - n), where n is the number of observations and Cij = 1 if observations i and j differ, 0 if they are the same.

Returns

None or unalikeability probability
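The double sum can be evaluated without comparing every pair, reading Cij as 1 when observations i and j hold different values. The sketch below counts differing ordered pairs from the category counts: of the n**2 ordered pairs, those with equal values number the sum of the squared counts. The helper name and the choice to return None for fewer than two observations (where n**2 - n is zero) are assumptions for illustration.

```python
from collections import Counter


def unalikeability(values):
    """Share of ordered pairs (i, j), i != j, whose values differ.

    Returning None for fewer than two observations is an assumption
    made for this sketch, since the denominator n**2 - n would be zero.
    """
    n = len(values)
    if n < 2:
        return None
    counts = Counter(values)
    # Ordered pairs with equal values (including i == j) total sum(c**2),
    # so the remaining pairs are the ones that differ.
    differing_pairs = n * n - sum(c * c for c in counts.values())
    return differing_pairs / (n * n - n)


# 8 differing ordered pairs out of 4**2 - 4 = 12.
u_mixed = unalikeability(["a", "a", "b", "b"])
```

A column where all values are identical gives 0, and a column where every value is distinct gives 1.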

col_type = None