Encoding Categorical Variables

Most machine learning algorithms accept only numerical input, so categorical variables need to be converted to numeric values.


Main encoding methods

  • Ordinal Encoding: converts categories into ordered values (0,1,2,…).
    • Drawback: imposes an artificial order on categories that have no inherent ranking, which can mislead the model and degrade performance.
  • One-Hot Encoding: creates a 0/1 variable for each category.
    • Drawback: increases dimensionality as the number of categories grows, potentially degrading training performance.
  • Target Encoding: converts categories into target statistics (focus of this note).


Target Encoding (Mean Encoding)

  • Converts categories into target statistics.
    • Binary classification: probability of 1 within each category.
    • Regression: target mean within each category.
  • Advantages: no increase in dimensionality, avoids artificial ordinal relationships.
  • Extensions: higher moments such as variance, skewness, or kurtosis can also be used.
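The basic (unsmoothed) target statistic can be computed directly with a pandas groupby; a minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "y":    [1,   0,   1,   1,   0,   1],   # binary target
})

# Naive target (mean) encoding: per-category mean of the target.
# For a binary target this is the proportion of 1s in each category.
means = df.groupby("city")["y"].mean()
df["city_enc"] = df["city"].map(means)

print(means.to_dict())  # {'A': 0.5, 'B': 0.666..., 'C': 1.0}
```

Note that category C gets an extreme value of 1.0 from a single sample, which motivates the smoothing discussed next.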


a. Smoothing

Mitigates extreme values for categories with few samples.

  • Formula:
$$ encoding = \alpha \cdot p(t=1 \mid x=c_i) + (1-\alpha) \cdot p(t=1) $$
  • α calculation:
$$ \alpha = \frac{1}{1 + e^{-(n-k)/f}} $$

where:

  • n = number of samples in the category
  • k = minimum samples per leaf (α = 0.5 when n = k)
  • f = smoothing factor controlling how quickly α moves from 0 to 1
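The two formulas above can be combined into a small helper (the function name `smoothed_encoding` and the sample values are hypothetical, not from any library):

```python
import numpy as np

def smoothed_encoding(n, p_cat, p_global, k=1, f=1.0):
    """Blend the per-category probability with the global prior.

    alpha -> 1 as the category sample count n grows past k, so large
    categories keep their own statistic while rare categories are
    pulled toward the global mean.
    """
    alpha = 1.0 / (1.0 + np.exp(-(n - k) / f))
    return alpha * p_cat + (1.0 - alpha) * p_global

# A category with 1 sample (all 1s) vs. one with 100 samples
print(smoothed_encoding(n=1,   p_cat=1.0, p_global=0.3))  # 0.65, pulled to prior
print(smoothed_encoding(n=100, p_cat=1.0, p_global=0.3))  # ~1.0, keeps its own stat
```

With n = k the sigmoid gives α = 0.5, an even blend of the category statistic and the prior; far above k the category's own probability dominates.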


b. Target Leakage

Using the target variable in encoding can lead to overfitting.

  • Mitigation methods:
    • Leave-One-Out Target Encoding: exclude the target value of the current sample.
    • Leave-One-Fold-Out: exclude the fold that the current sample belongs to.
    • Smoothing: acts as regularization.
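Leave-One-Out encoding can be vectorized with pandas: subtract each row's own target from its category sum before averaging. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B"],
    "y":   [1,   0,   1,   1,   0],
})

# Leave-one-out: encode each row with the category mean computed
# over the *other* rows of the same category (its own target excluded).
grp = df.groupby("cat")["y"]
sums, counts = grp.transform("sum"), grp.transform("count")
df["loo"] = (sums - df["y"]) / (counts - 1)

print(df["loo"].tolist())  # [0.5, 1.0, 0.5, 0.0, 1.0]
```

Caveat: categories with a single sample divide by zero here, so in practice such rows need a fallback (e.g. the global mean).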


c. scikit-learn TargetEncoder

TargetEncoder uses the Leave-One-Fold-Out method by default to reduce target leakage.

The number of folds can be adjusted using the cv parameter (default=5).

1. Categorical Target Encoding Formula (Binary Classification)

  • The encoding value for category i is calculated as:

    $$ S_i = \lambda_i \frac{n_i^Y}{n_i} + (1-\lambda_i) \frac{n^Y}{n} $$

    • $S_i$: encoded value for category i
    • $n_i^Y$: number of samples with Y=1 in category i
    • $n^Y$: total number of samples with Y=1
    • $n_i$: number of samples in category i
    • $n$: total number of samples
    • $\lambda_i$: shrinkage factor

    $$ \lambda_i = \frac{n_i}{n_i + m} $$

    • m: smoothing parameter (default = "auto")
    • If using the default "auto", $m = \sigma_i^2 / \tau^2$,
      where $\sigma_i^2$ is the variance of the target in category i,
      and $\tau^2$ is the variance of the target across all samples

2. Numerical Target Encoding Formula (Regression)

  • For numeric targets, the encoded value for category i is:

    $$ S_i = \lambda_i \frac{\sum_{k \in L_i} Y_k}{n_i} + (1-\lambda_i) \frac{\sum_{k=1}^{n} Y_k}{n} $$

    • $L_i$: set of samples belonging to category i
    • $Y_k$: target value of sample k
    • Other symbols are the same as in binary classification.
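A worked example of the shrinkage formula for a numeric target, using a fixed smoothing parameter m (the value m = 2.0 is hypothetical; scikit-learn's "auto" estimates m from the variances as described above):

```python
import numpy as np

y = np.array([10.0, 12.0, 20.0, 30.0, 40.0])
cat = np.array(["A", "A", "B", "B", "B"])

m = 2.0
global_mean = y.mean()                      # 22.4
for c in ["A", "B"]:
    mask = cat == c
    n_i, cat_mean = mask.sum(), y[mask].mean()
    lam = n_i / (n_i + m)                   # shrinkage factor lambda_i
    S_i = lam * cat_mean + (1 - lam) * global_mean
    print(c, round(S_i, 2))                 # A 16.7, B 26.96
```

The smaller category A (n=2, mean 11.0) is pulled strongly toward the global mean 22.4, while the larger category B (n=3, mean 30.0) keeps more of its own statistic.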


Summary

  • Leave-One-Fold-Out cross fitting (cv=5 by default) reduces the risk of target leakage.
  • Binary classification: uses the proportion of 1s per category; Regression: uses the mean per category.
  • Smoothing (λ) helps stabilize encoding for categories with few samples.