Encoding Categorical Variables

Most machine learning algorithms accept only numerical input, so categorical variables need to be converted to numeric values.


Main encoding methods

  • Ordinal Encoding: converts categories into ordered values (0,1,2,…).
    • Drawback: imposes an artificial order on categories that have no inherent ranking, which can mislead the model and degrade performance.
  • One-Hot Encoding: creates a 0/1 variable for each category.
    • Drawback: increases dimensionality as the number of categories grows, potentially degrading training performance.
  • Target Encoding: converts categories into target statistics (focus of this note).


Target Encoding (Mean Encoding)

  • Converts categories into target statistics.
    • Binary classification: probability of 1 within each category.
    • Regression: target mean within each category.
  • Advantages: no increase in dimensionality, avoids artificial ordinal relationships.
  • Extensions: higher moments such as variance, skewness, or kurtosis can also be used.
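The basic (unsmoothed) target statistic can be computed directly with a pandas groupby; a minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "y":    [1,   0,   1,   1,   0,   1],   # binary target
})

# Naive target (mean) encoding: per-category mean of the target.
# For a binary target this is the proportion of 1s in each category.
means = df.groupby("city")["y"].mean()
df["city_enc"] = df["city"].map(means)

print(means.to_dict())  # {'A': 0.5, 'B': 0.666..., 'C': 1.0}
```

Note that category C gets an extreme value of 1.0 from a single sample, which motivates the smoothing discussed next.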


a. Smoothing

Mitigates extreme values for categories with few samples.

  • Formula:
$$ encoding = \alpha \cdot p(t=1 \mid x=c_i) + (1-\alpha) \cdot p(t=1) $$
  • α calculation:
$$ \alpha = \frac{1}{1 + e^{-(n-k)/f}} $$

where:

  • n = number of samples in the category
  • k = minimum samples per leaf (α = 0.5 when n = k)
  • f = smoothing factor controlling how quickly α moves from 0 to 1
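The two formulas above can be combined into a small helper (the function name `smoothed_encoding` and the sample values are hypothetical, not from any library):

```python
import numpy as np

def smoothed_encoding(n, p_cat, p_global, k=1, f=1.0):
    """Blend the per-category probability with the global prior.

    alpha -> 1 as the category sample count n grows past k, so large
    categories keep their own statistic while rare categories are
    pulled toward the global mean.
    """
    alpha = 1.0 / (1.0 + np.exp(-(n - k) / f))
    return alpha * p_cat + (1.0 - alpha) * p_global

# A category with 1 sample (all 1s) vs. one with 100 samples
print(smoothed_encoding(n=1,   p_cat=1.0, p_global=0.3))  # 0.65, pulled to prior
print(smoothed_encoding(n=100, p_cat=1.0, p_global=0.3))  # ~1.0, keeps its own stat
```

With n = k the sigmoid gives α = 0.5, an even blend of the category statistic and the prior; far above k the category's own probability dominates.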


b. Target Leakage

Using the target variable in encoding can lead to overfitting.

  • Mitigation methods:
    • Leave-One-Out Target Encoding: exclude the target value of the current sample.
    • Leave-One-Fold-Out: exclude the fold that the current sample belongs to.
    • Smoothing: acts as regularization.
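Leave-One-Out encoding can be vectorized with pandas: subtract each row's own target from its category sum before averaging. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B"],
    "y":   [1,   0,   1,   1,   0],
})

# Leave-one-out: encode each row with the category mean computed
# over the *other* rows of the same category (its own target excluded).
grp = df.groupby("cat")["y"]
sums, counts = grp.transform("sum"), grp.transform("count")
df["loo"] = (sums - df["y"]) / (counts - 1)

print(df["loo"].tolist())  # [0.5, 1.0, 0.5, 0.0, 1.0]
```

Caveat: categories with a single sample divide by zero here, so in practice such rows need a fallback (e.g. the global mean).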


c. scikit-learn TargetEncoder

TargetEncoder uses the Leave-One-Fold-Out method by default to reduce target leakage.

The number of folds can be adjusted using the cv parameter (default=5).

1. Categorical Target Encoding Formula (Binary Classification)

  • The encoding value for category i is calculated as:

    $$ S_i = \lambda_i \frac{n_i^Y}{n_i} + (1-\lambda_i) \frac{n^Y}{n} $$

    • $S_i$: encoded value for category i
    • $n_i^Y$: number of samples with Y=1 in category i
    • $n^Y$: total number of samples with Y=1
    • $n_i$: number of samples in category i
    • $n$: total number of samples
    • $\lambda_i$: shrinkage factor

    $$ \lambda_i = \frac{n_i}{n_i + m} $$

    • m: smoothing parameter (default = "auto")
    • If using the default "auto", $m = \sigma_i^2 / \tau^2$,
      where $\sigma_i^2$ is the variance of the target in category i,
      and $\tau^2$ is the variance of the target across all samples

2. Numerical Target Encoding Formula (Regression)

  • For numeric targets, the encoded value for category i is:

    $$ S_i = \lambda_i \frac{\sum_{k \in L_i} Y_k}{n_i} + (1-\lambda_i) \frac{\sum_{k=1}^{n} Y_k}{n} $$

    • $L_i$: set of samples belonging to category i
    • $Y_k$: target value of sample k
    • Other symbols are the same as in binary classification.
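A worked example of the shrinkage formula for a numeric target, using a fixed smoothing parameter m (the value m = 2.0 is hypothetical; scikit-learn's "auto" estimates m from the variances as described above):

```python
import numpy as np

y = np.array([10.0, 12.0, 20.0, 30.0, 40.0])
cat = np.array(["A", "A", "B", "B", "B"])

m = 2.0
global_mean = y.mean()                      # 22.4
for c in ["A", "B"]:
    mask = cat == c
    n_i, cat_mean = mask.sum(), y[mask].mean()
    lam = n_i / (n_i + m)                   # shrinkage factor lambda_i
    S_i = lam * cat_mean + (1 - lam) * global_mean
    print(c, round(S_i, 2))                 # A 16.7, B 26.96
```

The smaller category A (n=2, mean 11.0) is pulled strongly toward the global mean 22.4, while the larger category B (n=3, mean 30.0) keeps more of its own statistic.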


Summary

  • Leave-One-Fold-Out cross fitting (cv=5 by default) reduces the risk of target leakage.
  • Binary classification: uses the proportion of 1s per category; Regression: uses the mean per category.
  • Smoothing (λ) helps stabilize encoding for categories with few samples.