Probabilistic forecasting I: Temperature

stephane degeye
15 min read · Aug 6, 2024


Probabilistic forecasting I: Temperature | Kaggle

A Kaggle competition on Probabilistic Forecasting, and an opportunity to try out the awesome Conformal Prediction (CP) framework.

This story shows the approach used to reach 1st place in this Kaggle Probabilistic Forecasting competition hosted by Carl McBride Ellis, PhD | LinkedIn.

Gentle reminder of usual Linear Regression tasks

With Simple Linear Regression (SLR) or Multiple Linear Regression (MLR) alone, without a Prediction Interval (PI), the predicted values are just Point Predictions of a single characteristic value of the data distribution, for example (a quick numerical check follows this list):

  • Conditional Mean [the loss function to be optimized is the L2 Loss, a.k.a. the Mean Squared Error (MSE), as in Ordinary Least Squares (OLS)]
  • Conditional Median [the loss function to be optimized is the L1 Loss, a.k.a. the Mean Absolute Error (MAE), more robust to outliers]
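As a quick numerical check of these two facts, here is a toy illustration (my own example, not from the competition code): minimizing MSE over a constant prediction recovers the sample mean, while minimizing MAE recovers a median.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])      # toy sample with an outlier
grid = np.linspace(0.0, 12.0, 1201)      # candidate constant predictions

mse = ((y[:, None] - grid) ** 2).mean(axis=0)
mae = np.abs(y[:, None] - grid).mean(axis=0)

print(grid[mse.argmin()])  # 4.0 -> the mean
print(grid[mae.argmin()])  # 2.0 -> a median (any value in [2, 3] minimizes MAE here)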

How can we improve this baseline?

A first possible improvement is to quantify the uncertainty by providing not only single predicted values (such as the Conditional Mean) but also Prediction Intervals (PI) around the Point Predictions, based on past experience.

How to quantify Uncertainty?

Several approaches exist to deal with Uncertainty Quantification (UQ):

  • Bootstrapping
  • Bayesian Optimization
  • Conformal Prediction (CP), the choice retained for this competition

Why Conformal Prediction?

  • Distribution Free
  • Model Agnostic
  • Based on User-defined confidence level with Coverage guarantee

Domain of application of Conformal Prediction

  • Classification problem
  • Regression problem (choice retained for this competition, Time Series will be processed as a Regression task)
  • Time Series Forecasting

Conformal Prediction Implementation

  • Transductive Conformal Prediction (TCP)
  • Inductive Conformal Prediction (ICP), the most used for performance reasons and the choice retained for this competition

Example of Regression with CP
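The original post shows a figure here. As a rough stand-in, below is a minimal sketch (a toy example of my own, not the competition code) of Inductive Conformal Prediction for regression, using a simple absolute-residual non-conformity score:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=1000)

# Split into proper training and calibration sets (the "inductive" part)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Non-conformity scores on the calibration set: absolute residuals
scores = np.abs(y_cal - model.predict(X_cal))

# 90% prediction interval with the finite-sample-corrected quantile
alpha = 0.10
q = np.quantile(scores, (1 - alpha) * (1 + 1 / len(y_cal)), method="higher")

X_new = np.array([[5.0]])
pred = model.predict(X_new)
print(pred - q, pred + q)   # conformal prediction interval for the new point

With this split-and-calibrate recipe, the interval width is driven by the calibration residuals and the user-chosen confidence level, which is exactly the property exploited later with CQR.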

Can this baseline be improved further?

Yes, we can. It is important to stop thinking only in terms of predicting the Conditional Mean or Conditional Median, and to start thinking in terms of the full Probability Distribution (predictions of all Conditional Quantiles).

The Catboost Multiquantiles Regressor was an obvious choice for me from the start as the ML model.

The Catboost Quantiles/Multiquantiles Regressor predicts the different quantiles, but these are still Point Predictions.

The question is: is it also possible to add uncertainty quantification with this setup, as we did for the Conditional Mean above?

Once again, the answer is Yes, and we are precisely in the use case proposed in this competition!

Yaniv Romano, Evan Patterson and Emmanuel Candès created a framework called Conformalized Quantile Regression (CQR), and that's the solution I chose to tackle the competition.

It is an obvious choice for generating adjusted Prediction Intervals (PI) by readjusting each quantile, thereby taking Heteroskedasticity into account.

This method is fairly simple to implement, and I chose to implement it from scratch without using a specialized library. It’s a pedagogical choice that doesn’t take any optimization into account!

The book, course and research papers of Valeriy Manokhin, PhD, MBA, CQF | LinkedIn are certainly a big part of the keys to this success. See the references for links to these resources.

This story is subject to enrichment with more details (and links to other stories) in the future but I already find it useful to share this baseline.

For greater readability, some parts of the code have been rearranged and some comments added in this story, but the functionality is identical to the original notebook, unchanged.

Strategy for the competition

  • Catboost Multiquantiles Regression as Model (for Quantiles Point Predictions)
  • Conformalized Quantile Regression (CQR) (to generate adjusted Prediction Intervals based on Non-Conformity Measure)

And now, let’s dive into the competition…

Some Libraries

import utility_script_crps_score as us
# help(us.crps)

# Data Structure
import numpy as np
import pandas as pd

# Model
from catboost import CatBoostRegressor

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

# Utils
from sklearn.model_selection import train_test_split

# Error Handling
import warnings
warnings.filterwarnings("ignore")

# Plot
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

plt.style.use('ggplot')
plt.rcParams.update(**{'figure.dpi':150})

%matplotlib inline

Data Loading

# Load the dataset (parse the 'date' column as datetime, it is used with the .dt accessor below)
train_df = pd.read_csv('/kaggle/input/probabilistic-forecasting-i-temperature/train_and_Public.csv', parse_dates=['date'])
kaggle_test_df = pd.read_csv('/kaggle/input/probabilistic-forecasting-i-temperature/test.csv', parse_dates=['date'])
submission_df = pd.read_csv('/kaggle/input/probabilistic-forecasting-i-temperature/sample_submission.csv')

train_df:

As of 01/07/2024, the Public Leaderboard was revealed and the initial train_df was therefore extended with the 28 days of the Public Leaderboard, giving a total of 67007 rows (698 days):

kaggle_test_df:

The kaggle_test_df dataset is unchanged and still contains 5360 rows (55.8 days). From 01/07/2024 there is therefore a 28-day overlap between train_df and kaggle_test_df, but this must be put into perspective: what is required is to predict the last 28 days of kaggle_test_df. For this reason we don't consider this a data-leakage situation (the first 28 days of kaggle_test_df could be initialized to null or anything else, it doesn't matter).

num_rec_train = train_df.shape[0]
print(f'Train dataframe contains {num_rec_train} records belonging to {round(num_rec_train / 96, 1)} days')

num_rec_test = kaggle_test_df.shape[0]
print(f'Test dataframe contains {num_rec_test} records belonging to {round(num_rec_test / 96, 1)} days')

Competition Datasets Setup

Exploratory Data Analysis

# Loop through each year
for year in train_df['date'].dt.year.unique():

    # Filter on the current year
    yearly_data = train_df[train_df['date'].dt.year == year]

    # Plot the time series
    plt.figure(figsize=(10, 5))
    plt.plot(yearly_data.date, yearly_data['Temperature'], linestyle='-', color='b')
    plt.title(f'Temperature Over Time for year {year}')
    plt.xlabel('Date')
    plt.ylabel('Temperature')
    plt.grid()
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

Temperature over time for Year 2016, Year 2017 and Year 2018 (one figure per year)

Missing Values

# Detecting missing values in train_df
missing_values = train_df.isnull().sum()
print(f"\nMissing values in each column of train_df:\n{missing_values}")

# Detecting missing values in kaggle_test_df
missing_values = kaggle_test_df.isnull().sum()
print(f"\nMissing values in each column of kaggle_test_df:\n{missing_values}")

There are no missing values, so no imputation is required in our case.

Data Distribution

# Set the theme for the plot
sns.set_theme(style="whitegrid")

# Create a histogram of the Temperature target
plt.figure(figsize=(10, 6))
sns.histplot(train_df['Temperature'], bins=30, kde=True, color='blue')

# Adding titles and labels
plt.title('Distribution of Temperature', fontsize=16)
plt.xlabel('Temperature', fontsize=14)
plt.ylabel('Frequency', fontsize=14)

# Show the plot
plt.show()
Target Distribution
Features Distribution

Check for Non-Stationarity

ADF_result = adfuller(train_df['Temperature'])

print(f'ADF Statistic: {ADF_result[0]}')
print(f'p-value: {ADF_result[1]}')

The p-value is < 0.05, so we can reject the null hypothesis: the time series is stationary according to the Augmented Dickey–Fuller test!

Partial Autocorrelation Visualization

The results of the Partial Autocorrelations will be decisive for feature engineering, particularly for the lag features.

# Plot the partial autocorrelation of each feature (same settings for all)
for feature in ['feature_AA', 'feature_AB', 'feature_BA', 'feature_BB', 'feature_CA', 'feature_CB']:
    plot_pacf(train_df[feature], lags=100)
    plt.ylim(-0.2, 1.1)

    # Adding titles and labels
    plt.title(f'Partial Autocorrelation {feature}', fontsize=16)
    plt.tight_layout()
    plt.show()

Features Engineering

Including exogenous variables (covariates) is very easy in a regression model, since it's simply a matter of adding new features.

# inform model about the temporal progression of the data
train_df['id'] = train_df['id'].astype(float)
kaggle_test_df['id'] = kaggle_test_df['id'].astype(float)

# Create Calendar Features
train_df['year'] = train_df['date'].dt.year.astype(int)
train_df['month'] = train_df['date'].dt.month.astype(int)
train_df['day'] = train_df['date'].dt.day.astype(int)
train_df['hour'] = train_df['date'].dt.hour.astype(int)
train_df['week_of_year'] = train_df['date'].apply(lambda x: x.isocalendar()[1])
train_df['day_of_week'] = train_df['date'].dt.dayofweek.astype(int)


kaggle_test_df['year'] = kaggle_test_df['date'].dt.year.astype(int)
kaggle_test_df['month'] = kaggle_test_df['date'].dt.month.astype(int)
kaggle_test_df['day'] = kaggle_test_df['date'].dt.day.astype(int)
kaggle_test_df['hour'] = kaggle_test_df['date'].dt.hour.astype(int)
kaggle_test_df['week_of_year'] = kaggle_test_df['date'].apply(lambda x: x.isocalendar()[1])
kaggle_test_df['day_of_week'] = kaggle_test_df['date'].dt.dayofweek.astype(int)


# Create Lag Features
features_prefix = "feature_"
features_suffix = "_lag_"

lag_features = [
    {'AA': ['1', '2', '4', '6', '97', '98']},
    {'AB': ['1', '2', '3', '98']},
    {'CA': ['1', '36', '97', '98']},
    {'CB': ['1', '2', '3']}
]

for lag_feature in lag_features:
    for feature_key, lag_list in lag_feature.items():
        for lag in lag_list:
            col = features_prefix + feature_key + features_suffix + lag
            train_df[col] = train_df[features_prefix + feature_key].shift(int(lag))
            kaggle_test_df[col] = kaggle_test_df[features_prefix + feature_key].shift(int(lag))


# Create Fourier terms for daily seasonality
train_fourier_day = np.sin(2 * np.pi * np.arange(len(train_df)) / 96)
train_fourier_day += np.cos(2 * np.pi * np.arange(len(train_df)) / 96)
train_df['fourier_terms_day'] = train_fourier_day

kaggle_test_fourier_day = np.sin(2 * np.pi * np.arange(len(kaggle_test_df)) / 96)
kaggle_test_fourier_day += np.cos(2 * np.pi * np.arange(len(kaggle_test_df)) / 96)
kaggle_test_df['fourier_terms_day'] = kaggle_test_fourier_day

# Create Fourier terms for weekly seasonality (period = 96 * 7 quarter-hour steps)
train_fourier_week = np.sin(2 * np.pi * np.arange(len(train_df)) / (96 * 7))
train_fourier_week += np.cos(2 * np.pi * np.arange(len(train_df)) / (96 * 7))
train_df['fourier_terms_week'] = train_fourier_week

kaggle_test_fourier_week = np.sin(2 * np.pi * np.arange(len(kaggle_test_df)) / (96 * 7))
kaggle_test_fourier_week += np.cos(2 * np.pi * np.arange(len(kaggle_test_df)) / (96 * 7))
kaggle_test_df['fourier_terms_week'] = kaggle_test_fourier_week
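The code above sums the sine and cosine terms into a single column. A more common encoding, shown here only as an alternative sketch (not what was used for the competition), keeps them as two separate features so the model can learn the phase itself:

# Alternative sketch: separate sin / cos columns for the daily period of 96 quarter-hour steps
t = np.arange(len(train_df))
train_df['fourier_day_sin'] = np.sin(2 * np.pi * t / 96)
train_df['fourier_day_cos'] = np.cos(2 * np.pi * t / 96)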

# Remove highly correlated features (unnecessarily increase model complexity)
train_df.drop('feature_BA', axis=1, inplace=True)
train_df.drop('feature_BB', axis=1, inplace=True)
train_df.drop('date', axis=1, inplace=True)

kaggle_test_df.drop('feature_BA', axis=1, inplace=True)
kaggle_test_df.drop('feature_BB', axis=1, inplace=True)
kaggle_test_df.drop('date', axis=1, inplace=True)

Pearson Pairwise Correlation

pear_corr = train_df.drop(['id'], axis=1).corr(method='pearson')
pear_corr.style.background_gradient(cmap='Greens', axis=0)

See the code for the full results…

ML Model : Catboost Multiquantiles Regressor

CQR requires the choice of a user-defined confidence level. For this competition we have to provide the full range of pairs of quantiles around the median; a Python dict was provided in a competition demo notebook:

quantile_mapping = {
    95: ['0.025', '0.975'],
    90: ['0.05', '0.95'],
    80: ['0.10', '0.90'],
    70: ['0.15', '0.85'],
    60: ['0.20', '0.80'],
    50: ['0.25', '0.75'],
    40: ['0.30', '0.70'],
    30: ['0.35', '0.65'],
    20: ['0.40', '0.60'],
    10: ['0.45', '0.55']
}

# The 21 quantile levels to be predicted: the 10 pairs above plus the median 0.50
# (reconstructed here, as the original definition of `quantiles` was not shown)
quantiles = sorted({float(q) for pair in quantile_mapping.values() for q in pair} | {0.50})
quantile_str = str(quantiles).replace('[', '').replace(']', '')

Copy the layout of the submission dataframe into a dataframe used for validation with respect to the metric:

submission_val_df = submission_df[:5360].copy()

Breakdown of the dataset and isolation of the target values:

The way data is sliced up can be a game changer for modeling!

Split features/target of the initial dataframe (train_df) into (X, y)

X = train_df.drop('Temperature', axis=1)
y = train_df['Temperature']

5360 records of the initial dataframe (X, y) are reserved for the validation set (X_val, y_val):

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=5360)

Split the remaining records (after removing the validation records) into two parts: the proper training set (X_proper_train, y_proper_train) and the calibration set (X_cal, y_cal):

X_proper_train, X_cal, y_proper_train, y_cal = train_test_split(X_train, y_train, test_size=0.12, random_state=1)

ICP (Inductive Conformal Prediction) Datasets Setup

Model Definition and Fitting on Proper Training Set

The aim here is to predict Conditional Quantiles (Point Predictions for each quantile).

The loss function used by the Catboost Quantiles/Multiquantiles regressor is the Quantile Loss function, a.k.a. the Pinball Loss function:

Pinball loss function

The Pinball loss function penalizes under-predictions and over-predictions differently (asymmetry), weighting the absolute error by the quantile (α) or by its complement (1 − α), depending on where the prediction lies relative to the true target value.

Example of Pinball Loss for Quantile 0.75

Note that when the Pinball loss is applied for quantile 0.5 (the Median), its optimization is identical to L1 Loss (Mean Absolute Error) minimization!

Get intuition of Pinball Loss
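As a rough stand-in for that illustration, here is a small sketch of the pinball loss (my own toy numbers, using α = 0.75 as in the example above):

import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    # Asymmetric penalty: weight alpha on under-predictions, (1 - alpha) on over-predictions
    diff = y_true - y_pred
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

# For alpha = 0.75, predicting too low costs three times more than predicting too high,
# which pushes the fitted value up towards the 75th percentile
print(pinball_loss(np.array([10.0]), np.array([8.0]), 0.75))   # under-prediction: 0.75 * 2 = 1.5
print(pinball_loss(np.array([10.0]), np.array([12.0]), 0.75))  # over-prediction:  0.25 * 2 = 0.5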

For now, back to our code with the Catboost model:

# model definition
multiquantiles_model = CatBoostRegressor(
    loss_function = f'MultiQuantile:alpha={quantile_str}',
    thread_count = -1,
    bootstrap_type = "Bernoulli",
    sampling_frequency = 'PerTree',
    iterations = 10000,
    **{
        'learning_rate': 0.015,
        'max_depth': 10,
        'subsample': 0.75,
        'colsample_bylevel': 0.9,
        'min_data_in_leaf': 45
    },
    verbose = 1000
)


# model fitting on training set
multiquantiles_model.fit(X_proper_train, y_proper_train, eval_set=(X_val, y_val))

The execution time for this model is ~ 4h 20 min.

I invite you to consult the Catboost website for details of all possible hyperparameters : Common parameters | CatBoost

Predictions

Prediction is performed on the Calibration, Validation and Test datasets:

predict_cal = multiquantiles_model.predict(X_cal)
predict_val = multiquantiles_model.predict(X_val)
predict_test = multiquantiles_model.predict(kaggle_test_df)
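A quick shape check (my addition, not in the original notebook): with the 10 quantile pairs plus the median, each prediction array should have 21 columns, sorted from 0.025 up to 0.975, so that column index and column 20 - index form one symmetric pair around the median:

print(predict_cal.shape)   # expected: (len(X_cal), 21)
print(predict_val.shape)   # expected: (len(X_val), 21)
print(predict_test.shape)  # expected: (len(kaggle_test_df), 21)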

Room for improvement:

  1. As you know, there is a cold-start problem when using lag features, because these values simply don't exist for the first records (by design). The current implementation doesn't deal with this aspect, by imputation or otherwise.
  2. The Catboost Multiquantiles Regressor does not explicitly enforce coherence between the predicted quantiles (quantile crossing). Post-processing techniques can be applied to ensure the quantiles are monotonically increasing; a simple sketch follows this list. To be honest, I still have a lot of investigating to do around this in the future.
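For point 2, one simple post-processing option (a hypothetical sketch, not something used in the competition notebook) is the monotone rearrangement: sort each row of the (n_samples, n_quantiles) prediction matrix so the quantiles are non-decreasing.

import numpy as np

def enforce_monotone_quantiles(pred):
    # Sort each row so the predicted quantiles are non-decreasing (no quantile crossing)
    return np.sort(pred, axis=1)

# Example (hypothetical): predict_test = enforce_monotone_quantiles(predict_test)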

Conformalized Quantiles Regression (CQR)

  1. The Non-conformity score, a.k.a. the Non-conformity Measure (Ei), must be defined (a heuristic notion of uncertainty). The goal is to check how the predicted quantiles (lower and upper) relate to the true target value (Yi). The calibration dataset is used for that:

Non-conformity Measure (from original paper)

The interpretation depends on where the true target value lies with respect to the predicted quantile interval:

A) The true value is above the upper quantile: the non-conformity measure is positive and equals the distance between the true value and the upper quantile.

B) The true value is below the lower quantile: the non-conformity measure is positive and equals the distance between the true value and the lower quantile.

C) The true value falls within the interval (between the lower and upper quantiles): the non-conformity measure is negative, and its magnitude is the distance between the true value and the nearest quantile.
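In symbols, the score from the CQR paper (which the code further down also uses) is Ei = max(q_lo(Xi) - Yi, Yi - q_hi(Xi)). A tiny numeric check of the three cases, as a sketch:

import numpy as np

def non_conformity(y_true, q_lo, q_hi):
    # Max of "how far below the lower quantile" and "how far above the upper quantile"
    return np.maximum(q_lo - y_true, y_true - q_hi)

# Predicted interval [10, 20]:
print(non_conformity(23.0, 10.0, 20.0))  # A) above the interval  ->  3.0
print(non_conformity( 8.0, 10.0, 20.0))  # B) below the interval  ->  2.0
print(non_conformity(15.0, 10.0, 20.0))  # C) inside the interval -> -5.0 (distance to the nearest quantile)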

2. Calculate the quantile that will serve as the correction factor applied to the lower/upper quantiles. This computed quantile is based on the chosen confidence level, applied to the distribution of non-conformity scores calculated in step 1:

Quantile Computation (from original paper)

3. Determine the Prediction Interval (PI) by applying the correction to the predicted quantiles (the correction is the same for the lower and upper quantiles):

Construction of the prediction interval (from original paper)

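In symbols (my paraphrase of the paper's construction): the corrected interval is C(X) = [q_lo(X) - Q, q_hi(X) + Q], where Q is the (1 - alpha)(1 + 1/n)-th empirical quantile of the non-conformity scores computed on the n calibration points, with 1 - alpha the chosen confidence level.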
# The loop below applies the three CQR steps above for each confidence level
for index, (level, q_pair) in enumerate(quantile_mapping.items()):

    print(f'level : {level} interval : [{q_pair[0]} , {q_pair[1]}]\n')

    # Keep only the predicted quantiles of the calibration set related to the current level
    quantile_regression_calibration_intervals = np.zeros([len(X_cal), 2])
    quantile_regression_calibration_intervals[:, 0] = predict_cal[:, index]
    quantile_regression_calibration_intervals[:, 1] = predict_cal[:, 20 - index]

    # Compute Non-Conformity Measures on the calibration set (how the predicted quantiles relate to the true value)
    non_conformity_scores = np.max(
        [
            quantile_regression_calibration_intervals[:, 0] - y_cal,
            y_cal - quantile_regression_calibration_intervals[:, 1],
        ],
        axis=0,
    )

    # Sort Non-Conformity Measures
    non_conformity_scores = np.sort(non_conformity_scores)[::-1]

    # Compute the quantile of the Non-Conformity Measure distribution, with the level as threshold
    empirical_quantile = (level / 100) * (1 + (1 / len(y_cal)))
    correction_factor = np.quantile(non_conformity_scores, empirical_quantile, method="higher")

    # Plot the non_conformity_scores distribution
    plt.hist(non_conformity_scores, bins='auto', color='magenta')

    # Add a vertical line for the quantile
    plt.axvline(correction_factor, color='black', linestyle='dashed', linewidth=1, label='quantile')

    plt.legend()
    plt.xlabel('Calibration Error')
    plt.ylabel('Frequency')
    plt.title('Histogram of Calibration Errors')

    plt.show()

    # Keep only the predicted quantiles of the Kaggle test set related to the current level
    quantile_regression_prediction_intervals_kaggle = np.zeros([len(kaggle_test_df), 2])
    quantile_regression_prediction_intervals_kaggle[:, 0] = predict_test[:, index]
    quantile_regression_prediction_intervals_kaggle[:, 1] = predict_test[:, 20 - index]

    # Apply the correction factor on the Kaggle test set
    correction_factor_kaggle = np.ones([len(kaggle_test_df), 2])
    correction_factor_kaggle[:, 0] *= correction_factor
    correction_factor_kaggle[:, 1] *= correction_factor

    submission_df[q_pair[0]] = quantile_regression_prediction_intervals_kaggle[:, 0] - correction_factor_kaggle[:, 0]
    submission_df[q_pair[1]] = quantile_regression_prediction_intervals_kaggle[:, 1] + correction_factor_kaggle[:, 1]

    # Keep only the predicted quantiles of the validation set related to the current level
    quantile_regression_prediction_intervals_validation = np.zeros([len(X_val), 2])
    quantile_regression_prediction_intervals_validation[:, 0] = predict_val[:, index]
    quantile_regression_prediction_intervals_validation[:, 1] = predict_val[:, 20 - index]

    # Apply the correction factor on the validation set
    correction_factor_validation = np.ones([len(X_val), 2])
    correction_factor_validation[:, 0] *= correction_factor
    correction_factor_validation[:, 1] *= correction_factor

    submission_val_df[q_pair[0]] = quantile_regression_prediction_intervals_validation[:, 0] - correction_factor_validation[:, 0]
    submission_val_df[q_pair[1]] = quantile_regression_prediction_intervals_validation[:, 1] + correction_factor_validation[:, 1]

Below are the Non-Conformity Score distributions for the different levels:

Distribution of Non-Conformity Measures along with the Quantile (one figure per confidence level)

Prediction Intervals (PI) Visualization

for (level, q_pair) in quantile_mapping.items():

    print(f'level : {level} interval : [{q_pair[0]} , {q_pair[1]} ]\n')

    title = 'interval : ' + q_pair[0] + ' - ' + q_pair[1]

    # Create a new figure
    plt.figure(figsize=(10, 6))

    # Plot the '0.50' (median) column of the submission dataframe
    plt.plot(submission_df['0.50'], label='median', color='blue')

    # Plot the prediction interval for the current level
    plt.fill_between(submission_df.index, submission_df[q_pair[0]], submission_df[q_pair[1]],
                     color='gray', alpha=0.2, label='Prediction Interval')

    # Add the legend
    plt.legend()

    # Set labels and title
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.title(title)

    # Show the plot
    plt.show()

Metrics for this competition

The script with the implementation of the Metrics was created and made available by Carl McBride Ellis, PhD | LinkedIn for this competition.

More info and references on the Kaggle site:

www.kaggle.com/competitions/probabilistic-forecasting-i-temperature/overview/evaluation

Continuous Ranked Probability Score

The Continuous Ranked Probability Score (CRPS) is the metric used to assess the accuracy of a probabilistic forecast. It measures the difference between the forecast CDF and the empirical CDF of the true value (represented by the Heaviside step function).

The closer the two functions are to each other, the smaller the area and the better the score.

Integration is performed numerically using the trapezoidal method.
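As a rough sketch of that computation (my own illustration, not the competition's utility script), the CRPS of a single observation can be approximated from the predicted quantiles by comparing the forecast CDF with the Heaviside step at the true value and integrating the squared difference with the trapezoidal rule:

import numpy as np

def crps_from_quantiles(quantile_levels, quantile_values, y_true):
    # Forecast CDF evaluated at the predicted quantile values: F(q_alpha) = alpha
    # Empirical CDF of the observation: Heaviside step at y_true
    x = np.asarray(quantile_values, dtype=float)
    cdf_forecast = np.asarray(quantile_levels, dtype=float)
    cdf_empirical = (x >= y_true).astype(float)
    return np.trapz((cdf_forecast - cdf_empirical) ** 2, x)

levels = np.linspace(0.025, 0.975, 21)   # hypothetical quantile levels, for illustration only
values = np.linspace(12.0, 18.0, 21)     # hypothetical predicted quantile values
print(crps_from_quantiles(levels, values, 15.0))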

Leaderboard

Conclusion

This implementation is a baseline that can of course be improved upon, and much can still be explored.

What's important to know is that investing in CP means investing in a foundation that will be useful in almost all AI fields (NLP, Recommender Systems, Computer Vision, Time Series Forecasting, …), in short, anything that has to do with predictions!

References

www.kaggle.com/competitions/probabilistic-forecasting-i-temperature/overview/suggested-reading

1st place — Catboost — CQR — 21/07/2024–01 (kaggle.com)

CatBoost — Yandex Technologies

[0706.3188] A tutorial on conformal prediction (arxiv.org)

[1905.03222] Conformalized Quantile Regression (arxiv.org)

Continuous Ranked Probability Score (CRPS) (lokad.com)

Acknowledgments

My thanks go first to the community for all the expressions of sympathy I’ve received. I’d also like to thank the other participants in this competition. We all gained from learning an important subject!

Of course, I'd also like to thank Carl McBride Ellis, PhD | LinkedIn, not only for organizing this competition so well, but also for inspiring us and sharing his knowledge on so many subjects (link to his book below).

https://www.packtpub.com/en-us/product/modern-time-series-forecasting-with-python-9781835883181

Thanks also to Wojtek Kuberski | LinkedIn and Hakim Elakhrass | LinkedIn (NannyML) for the opportunity to discover their products (a 6-month NannyML Cloud license), with a book as an additional prize!

The Little Book of ML Metrics (nannyml.com)

Last but not least, I'd like to thank Valeriy Manokhin, PhD, MBA, CQF | LinkedIn for all his work, his book and his course (and the awesome prize: his course on Conformal Prediction, valued at $750 USD).

If you haven’t already done so, buy his book and take his course, you won’t be disappointed!

Thanks for reading!
