Encoding Categorical Variables
Most machine learning algorithms accept only numerical input, so categorical variables need to be converted to numeric values.
Main encoding methods
- Ordinal Encoding: converts categories into ordered values (0, 1, 2, …).
  - Drawback: imposes an ordinal relationship on categories that are actually unrelated, potentially lowering model performance or causing unexpected behavior.
- One-Hot Encoding: creates a 0/1 indicator variable for each category.
  - Drawback: dimensionality grows with the number of categories, potentially degrading training performance.
- Target Encoding: converts categories into target statistics (focus of this note).
Target Encoding (Mean Encoding)
- Converts categories into target statistics.
- Binary classification: probability of 1 within each category.
- Regression: target mean within each category.
- Advantages: no increase in dimensionality, avoids artificial ordinal relationships.
- Extensions: higher moments such as variance, skewness, or kurtosis can also be used.
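The basic idea can be sketched with a plain pandas group-by (the column names and toy data here are made up for illustration):

```python
import pandas as pd

# Toy binary-classification data (hypothetical)
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "y":    [1,   0,   1,   1,   0,   1],
})

# Target encoding: replace each category with the mean of y in that category.
# For a binary target this is the proportion of y == 1 within the category.
means = df.groupby("city")["y"].mean()
df["city_encoded"] = df["city"].map(means)
```

Note that this naive version uses each row's own target value when computing its category mean, which is exactly the leakage problem discussed below.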
a. Smoothing
Mitigates extreme values for categories with few samples.
- Formula: $S_i = \alpha \cdot \bar{Y}_i + (1 - \alpha) \cdot \bar{Y}$, where $\bar{Y}_i$ is the category mean and $\bar{Y}$ the global mean.
- α calculation: $\alpha = \dfrac{1}{1 + e^{-(n_i - k)/f}}$
where:
f = smoothing factor (controls the slope of the sigmoid)
k = minimum samples per leaf (the sample count at which α = 0.5)
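A common smoothing scheme with these f and k parameters blends the category mean with the global mean using a sigmoid weight α = 1/(1 + exp(−(nᵢ − k)/f)). A minimal sketch (function and column names are illustrative, not a library API):

```python
import numpy as np
import pandas as pd

def smoothed_target_encode(df, col, target, k=1, f=1.0):
    """Blend each category's target mean with the global mean.

    k: minimum samples per leaf (midpoint of the sigmoid)
    f: smoothing factor (controls the slope of the sigmoid)
    """
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # alpha -> 1 for well-populated categories, -> 0 for rare ones
    alpha = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))
    smoothed = alpha * stats["mean"] + (1.0 - alpha) * global_mean
    return df[col].map(smoothed)
```

Rare categories are thereby pulled toward the global mean instead of taking an extreme value from a handful of samples.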
b. Target Leakage
Using the target variable in encoding can lead to overfitting.
- Mitigation methods:
- Leave-One-Out Target Encoding: exclude the target value of the current sample.
- Leave-One-Fold-Out: exclude the fold that the current sample belongs to.
- Smoothing: acts as regularization.
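Leave-one-out encoding can be computed vectorized by subtracting each row's own target from its category's sum before taking the mean (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "y":    [1,   0,   1,   1,   0],
})

# Leave-one-out: for each row, use the category mean computed
# from all OTHER rows in the same category.
grp = df.groupby("city")["y"]
sums = grp.transform("sum")
counts = grp.transform("count")
df["city_loo"] = (sums - df["y"]) / (counts - 1)
# Caveat: a singleton category divides by zero here (inf/NaN),
# which is one reason smoothing is usually combined with this.
```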
c. scikit-learn TargetEncoder
TargetEncoder uses the Leave-One-Fold-Out method by default to reduce target leakage.
The number of folds can be adjusted using the cv parameter (default=5).
1. Categorical Target Encoding Formula (Binary Classification)
- The encoding value for category i is calculated as:
$$ S_i = \lambda_i \frac{n_i^Y}{n_i} + (1-\lambda_i) \frac{n^Y}{n} $$
- $S_i$: encoded value for category i
- $n_i^Y$: number of samples with Y = 1 in category i
- $n^Y$: total number of samples with Y = 1
- $n_i$: total number of samples in category i
- $n$: total number of samples
- $\lambda_i$: shrinkage factor
$$ \lambda_i = \frac{n_i}{n_i + m} $$
- m: smoothing parameter (the `smooth` parameter; default = "auto")
- If using the default "auto", $m = \sigma_i^2 / \tau^2$,
where $\sigma_i^2$ is the variance of the target in category i,
and $\tau^2$ is the variance of the target across all samples
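The formula can be checked numerically. The sketch below computes $S_i$ for one category by hand, using `numpy`'s population variance (ddof=0) for the "auto" estimate of m; toy data is made up:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 0])                  # binary target
cat = np.array(["A", "A", "B", "B", "B", "C", "C", "C"])

n, n_Y = len(y), y.sum()                 # totals over all samples
mask = cat == "B"
n_i, n_i_Y = mask.sum(), y[mask].sum()   # totals within category B

# "auto" smoothing: m = sigma_i^2 / tau^2
sigma2 = y[mask].var()                   # target variance within category i
tau2 = y.var()                           # target variance over all samples
m = sigma2 / tau2

lam = n_i / (n_i + m)                    # shrinkage factor lambda_i
S_i = lam * (n_i_Y / n_i) + (1 - lam) * (n_Y / n)
```

The encoded value lands between the global rate (5/8) and the category rate (2/3), pulled toward the global rate by the shrinkage.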
2. Numerical Target Encoding Formula (Regression)
- For numeric targets, the encoded value for category i is:
$$ S_i = \lambda_i \frac{\sum_{k \in L_i} Y_k}{n_i} + (1-\lambda_i) \frac{\sum_{k=1}^{n} Y_k}{n} $$
- $L_i$: set of samples belonging to category i
- $Y_k$: target value of sample k
- Other symbols are the same as in binary classification.
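For a numeric target the first term is simply the category mean of Y, shrunk toward the global mean. A small worked example (the fixed m here is illustrative, not the "auto" value):

```python
import numpy as np

y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # numeric target
cat = np.array(["A", "A", "B", "B", "B"])

n = len(y)
mask = cat == "B"
n_i = mask.sum()

m = 2.0                                        # illustrative smoothing parameter
lam = n_i / (n_i + m)                          # lambda_i = 3 / 5

# Shrink the category mean (40.0) toward the global mean (30.0)
S_i = lam * y[mask].mean() + (1 - lam) * y.mean()
```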
Summary
- Leave-One-Fold-Out (LOFO, cv=5 by default) reduces the risk of target leakage.
- Binary classification: uses the proportion of 1s per category; Regression: uses the mean per category.
- Smoothing (λ) helps stabilize encoding for categories with few samples.