class: center, middle, inverse, title-slide # CRPS-Learning ## Jonathan Berrisch, Florian Ziel ### University of Duisburg-Essen ### 2022-02-25 --- name: motivation # Motivation .pull-left[ The idea: - Combine multiple forecasts instead of choosing one - Combination weights may vary over **time**, over the **distribution**, or **both** Two popular options for combining distributions: - Combining across quantiles (this paper) - Horizontal aggregation, Vincentization - Combining across probabilities - Vertical aggregation ] .pull-right[ <div style="position:relative; margin-top:-50px; z-index: 0"> .panelset[ .panel[.panel-name[Time] ![](data:image/png;base64,#index_files/figure-html/unnamed-chunk-1-1.svg)<!-- --> ] .panel[.panel-name[Distribution] ![](data:image/png;base64,#index_files/figure-html/unnamed-chunk-2-1.svg)<!-- --> ]] ] --- name: pred_under_exp_advice # The Framework of Prediction under Expert Advice ### The sequential framework .pull-left[ Each day, `\(t = 1, 2, \ldots, T\)`: - The **forecaster** receives predictions `\(\widehat{X}_{t,k}\)` from `\(K\)` **experts** - The **forecaster** assigns weights `\(w_{t,k}\)` to each **expert** - The **forecaster** calculates her prediction: `\begin{equation} \widetilde{X}_{t} = \sum_{k=1}^K w_{t,k} \widehat{X}_{t,k}. \label{eq_forecast_def} \end{equation}` - The realization `\(Y_t\)` is observed ] .pull-right[ - The experts can be institutions, persons, or models - The forecasts can be point forecasts (e.g., mean or median) or full predictive distributions - We do not need any assumptions concerning the underlying data - <a id='cite-cesa2006prediction'></a><a href='#bib-cesa2006prediction'>Cesa-Bianchi and Lugosi (2006)</a> ] --- name: regret # The Regret Weights are updated sequentially according to the past performance of the `\(K\)` experts.
A loss function `\(\ell\)` is needed to compute the **cumulative regret** `\(R_{t,k}\)`: `\begin{equation} R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \left( \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i) \right) \label{eq_regret} \end{equation}` The cumulative regret: - Indicates the predictive accuracy of expert `\(k\)` up to time `\(t\)` - Measures how much the forecaster *regrets* not having followed the expert's advice Popular loss functions for point forecasting <a id='cite-gneiting2011making'></a><a href='#bib-gneiting2011making'>Gneiting (2011a)</a>: .pull-left[ - `\(\ell_2\)`-loss `\(\ell_2(x, y) = |x - y|^2\)` - optimal for mean prediction ] .pull-right[ - `\(\ell_1\)`-loss `\(\ell_1(x, y) = |x - y|\)` - optimal for median prediction ] --- name: popular_algs # Popular Algorithms and the Risk .pull-left[ ### Popular Aggregation Algorithms #### The naive combination `\begin{equation} w_{t,k}^{\text{Naive}} = \frac{1}{K} \end{equation}` #### The exponentially weighted average forecaster (EWA) `\begin{align} w_{t,k}^{\text{EWA}} & = \frac{e^{\eta R_{t,k}} }{\sum_{j = 1}^K e^{\eta R_{t,j}}} = \frac{e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }{\sum_{j = 1}^K e^{-\eta \ell(\widehat{X}_{t,j},Y_t)} w^{\text{EWA}}_{t-1,j} } \label{eq_ewa_general} \end{align}` ] .pull-right[ ### Optimality In stochastic settings, the cumulative risk should be analyzed <a id='cite-wintenberger2017optimal'></a><a href='#bib-wintenberger2017optimal'>Wintenberger (2017)</a>: `\begin{align} &\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \\ &\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}} \label{eq_def_cumrisk} \end{align}` ] --- # Optimal Convergence .pull-left[ ### The selection problem `\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \stackrel{t\to \infty}{\rightarrow} a \quad \text{with} \quad a \leq 0. \label{eq_opt_select} \end{equation}` The forecaster is asymptotically not worse than the best expert (cumulative risk `\(\widehat{\mathcal{R}}_{t,\min}\)`). ### The convex aggregation problem `\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \stackrel{t\to \infty}{\rightarrow} b \quad \text{with} \quad b \leq 0 . \label{eq_opt_conv} \end{equation}` The forecaster is asymptotically not worse than the best convex combination `\(\widehat{X}_{t,\pi}\)` in hindsight (**oracle**).
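To make these benchmarks concrete, here is a minimal R sketch (illustrative data, `\(\ell_2\)`-loss, and names of our choosing; not the profoc implementation) comparing the EWA forecaster \eqref{eq_ewa_general} with the best single expert and the best fixed convex combination in hindsight:

```r
set.seed(1)
T_len <- 500; K <- 3
Y <- rnorm(T_len)                        # realizations
X <- cbind(Y + rnorm(T_len),             # expert 1: unbiased
           Y + rnorm(T_len, mean = 0.5), # expert 2: biased
           rnorm(T_len))                 # expert 3: uninformative
loss <- function(x, y) (x - y)^2         # ell_2 loss

# EWA forecaster, multiplicative form of the update above
eta <- 0.1
w <- rep(1 / K, K)
ewa_loss <- numeric(T_len)
for (t in seq_len(T_len)) {
  ewa_loss[t] <- loss(sum(w * X[t, ]), Y[t])
  w <- w * exp(-eta * loss(X[t, ], Y[t]))  # exponential weighting
  w <- w / sum(w)                          # renormalize
}

# Benchmarks: best single expert and best fixed convex weights (oracle)
best_expert <- min(colSums(loss(X, Y)))
oracle <- optim(rep(0, K), function(v) {
  w_fix <- exp(v) / sum(exp(v))            # softmax keeps weights convex
  sum(loss(X %*% w_fix, Y))
})$value
c(EWA = sum(ewa_loss), best_expert = best_expert, oracle = oracle)
```

A larger `eta` makes EWA concentrate faster on the currently best expert.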
] .pull-right[ Optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} <a href='#bib-wintenberger2017optimal'>Wintenberger (2017)</a>: `\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & = \mathcal{O}\left(\frac{\log(K)}{t}\right)\label{eq_optp_select} \end{align}` `\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) & = \mathcal{O}\left(\sqrt{\frac{\log(K)}{t}}\right) \label{eq_optp_conv} \end{align}` Algorithms can satisfy both \eqref{eq_optp_select} and \eqref{eq_optp_conv}, depending on: - The loss function - Regularity conditions on `\(Y_t\)` and `\(\widehat{X}_{t,k}\)` - The weighting scheme ] --- name:crps .pull-left[ ## Probabilistic Setting An appropriate loss: `\begin{align*} \text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq_crps} \end{align*}` It is strictly proper <a id='cite-gneiting2007strictly'></a><a href='#bib-gneiting2007strictly'>Gneiting and Raftery (2007)</a>. Using the CRPS, we can calculate time-adaptive weights `\(w_{t,k}\)`. However, what if the experts' performance varies across parts of the distribution?
Utilize this relation: `\begin{align*} \text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p. \label{eq_crps_qs} \end{align*}` ... to combine the quantiles of the probabilistic forecasts individually using the quantile loss (pinball loss) `\(\text{QL}_p(q, y) = (\mathbb{1}\{y < q\} - p)(q - y)\)`. ] .pull-right[ ## Optimal Convergence
Exp-concavity of the loss is required for the optimal rates \eqref{eq_optp_select} and \eqref{eq_optp_conv}.
QL is convex but not exp-concave.
The Bernstein Online Aggregation (BOA) weakens the exp-concavity condition. The convergence rates of BOA are:
- Almost optimal w.r.t. *selection* \eqref{eq_optp_select} <a id='cite-gaillard2018efficient'></a><a href='#bib-gaillard2018efficient'>Gaillard and Wintenberger (2018)</a>.
- Almost optimal w.r.t. *convex aggregation* \eqref{eq_optp_conv} <a href='#bib-wintenberger2017optimal'>Wintenberger (2017)</a>. ] --- name: simple_example # A Probabilistic Example .pull-left[ Simple example: `\begin{align} Y_t & \sim \mathcal{N}(0,\,1) \\ \widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq:dgp_sim1} \end{align}` - True weights vary over `\(p\)` - Figures show the ECDFs and the calculated weights using `\(T=25\)` realizations - The pointwise solution is better than the constant one - However, the pointwise solution creates rough weight estimates We propose two smoothing procedures: - **P-spline smoothing** (upcoming slides) - Basis smoothing, <a id='cite-BERRISCH2021'></a><a href='#bib-BERRISCH2021'>Berrisch and Ziel (2021)</a>, Section 3.2 ] .pull-right[ <div style="position:relative; margin-top:-50px; z-index: 0"> .panelset[ .panel[.panel-name[CDFs] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Weights] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ] ]] --- # The P-Smooth Procedure .pull-left[ Penalized cubic B-splines for smoothing the weights: Let `\(\varphi=(\varphi_1,\ldots, \varphi_L)\)` be bounded basis functions on `\((0,1)\)`. Then we approximate `\(w_{t,k}\)` by `\begin{align} w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi \end{align}` with parameter vector `\(\beta\)`. The latter is estimated by penalized `\(L_2\)`-smoothing, which minimizes `\begin{equation} \| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2 \label{eq_function_smooth} \end{equation}` with differential operator `\(\mathcal{D}\)`. Computation is easy since an analytical solution exists. ] .pull-right[ Setting `\(d=1\)`, we obtain the constant solution for high values of `\(\lambda\)`: <center> <img src="weights_lambda.gif"> </center> ] --- name: simulation # Simulation Study .pull-left[ Data generating process of the [simple probabilistic example](#simple_example): - Constant solution (`\(\lambda \rightarrow \infty\)`) - Pointwise solution of the proposed BOAG - Smoothed solution of the proposed BOAG - Weights are smoothed during learning - Smoothed weights are used to calculate the regret, adjust weights, etc. ] .pull-right[ <div style="position:relative; margin-top:-50px; z-index: 0"> .panelset[ .panel[.panel-name[QL Deviation] Deviation from the best attainable `\(\boldsymbol{QL}_\boldsymbol{\mathcal{P}}\)` (1000 runs). ![](data:image/png;base64,#pre_vs_post.gif) ] .panel[.panel-name[Lambda] CRPS values for different `\(\lambda\)` (1000 runs) ![](data:image/png;base64,#pre_vs_post_lambda.gif) ]]] --- # Simulation Study The same simulation, carried out for different algorithms (1000 runs): <center> <img src="algos_constant.gif"> </center> --- # Simulation Study .pull-left-1[ **New DGP:** `\begin{align} Y_t & \sim \mathcal{N}\left(0.15 \operatorname{asinh}(\mu_t),\,1\right) \\ \mu_t &= 0.99 \mu_{t-1} + \varepsilon_t\\ \varepsilon_t & \sim \mathcal{N}(0,1) \\ \widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq_dgp_sim2} \end{align}`
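A single-run sketch of this DGP in R (seed, run length, and names are illustrative):

```r
set.seed(2022)
T_len <- 1000
# Latent AR(1) process driving the time-varying mean
eps <- rnorm(T_len)
mu <- numeric(T_len)
for (t in 2:T_len) mu[t] <- 0.99 * mu[t - 1] + eps[t]
Y <- rnorm(T_len, mean = 0.15 * asinh(mu), sd = 1)

# The two static experts, stored as quantile forecasts on a grid
p_grid <- 1:99 / 100
experts <- array(dim = c(T_len, 99, 2))
experts[, , 1] <- matrix(qnorm(p_grid, -1, 1), T_len, 99, byrow = TRUE)
experts[, , 2] <- matrix(qnorm(p_grid, 3, 2), T_len, 99, byrow = TRUE) # sd = sqrt(4)
```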
- The optimal weights change over time
- A single-run example is depicted on the right
- Without forgetting, the weights become almost constant in the long run <center> <img src="forget.png"> </center> ] .pull-right-2[ **Weights of expert 2** <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ] --- name:extensions # Possible Extensions .pull-left[ **Forgetting** - Only part of the past cumulative regret is taken into account - Exponential forgetting of past regret: `\begin{align*} R_{t,k} & = R_{t-1,k}(1-\xi) + \ell(\widetilde{F}_{t},Y_t) - \ell(\widehat{F}_{t,k},Y_t) \label{eq_regret_forget} \end{align*}` **Fixed Shares** <a id='cite-herbster1998tracking'></a><a href='#bib-herbster1998tracking'>Herbster and Warmuth (1998)</a> - Adding fixed shares to the weights - Shrinkage towards the constant solution `\begin{align*} \widetilde{w}_{t,k} = \rho \frac{1}{K} + (1-\rho) w_{t,k} \label{fixed_share_simple} \end{align*}` ] .pull-right[ **Non-Equidistant Knots** - Important regions can receive more knots - Destroys the shrinkage towards the constant solution - profoc utilizes the beta distribution to create knots <center> <img src="uneven_grid_a_b.gif"> </center> ] --- name: application # Application Study: Overview .pull-left-1[ .font90[ Data: - Forecasting European emission allowances (EUA) - Daily month-ahead prices - Jan 2013 to Dec 2020 (Phase III, 2092 obs.) Combination methods: - Naive, BOAG, EWAG, ML-PolyG, BMA Tuning parameter grids: - Smoothing penalty: `\(\Lambda= \{0\}\cup \{2^x|x\in \{-4,-3.5,\ldots,12\}\}\)` - Learning rates: `\(\mathcal{E}= \{2^x|x\in \{-1,-0.5,\ldots,9\}\}\)` ] ] .pull-right-2[ <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ] --- # Application Study: Experts .font90[ Simple exponential smoothing with additive errors (**ETS-ANN**): `\begin{align*} Y_{t} = l_{t-1} + \varepsilon_t \quad \text{with} \quad l_t = l_{t-1} + \alpha \varepsilon_t \quad \text{and} \quad \varepsilon_t \sim \mathcal{N}(0,\sigma^2) \end{align*}` Quantile regression (**QuantReg**): For each `\(p \in \mathcal{P}\)` we assume: `\begin{align*} F^{-1}_{Y_t}(p) = \beta_{p,0} + \beta_{p,1} Y_{t-1} + \beta_{p,2} |Y_{t-1}-Y_{t-2}| \end{align*}` ARIMA(1,0,1)-GARCH(1,1) with Gaussian errors (**ARMA-GARCH**): `\begin{align*} Y_{t} = \mu + \phi(Y_{t-1}-\mu) + \theta \varepsilon_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}` ARIMA(0,1,0)-I-EGARCH(1,1) with Gaussian errors (**I-EGARCH**): `\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \log(\sigma_t^2) = \omega + \alpha Z_{t-1}+ \gamma (|Z_{t-1}|-\mathbb{E}|Z_{t-1}|) + \beta \log(\sigma_{t-1}^2) \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}` ARIMA(0,1,0)-GARCH(1,1) with Student-t errors (**I-GARCHt**): `\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim t(0,1, \nu) \end{align*}` ] --- # Application Study: Results <div style="position:relative; margin-top:-25px; z-index: 0"> .panelset[ .panel[.panel-name[Significance] <table class=" lightable-material" style='font-family: "Source Sans Pro", helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:center;"> ETS-ANN </th> <th style="text-align:center;">
QuantReg </th> <th style="text-align:center;"> ARMA-GARCH </th> <th style="text-align:center;"> I-EGARCH </th> <th style="text-align:center;"> I-GARCHt </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;background-color: #FF808C !important;"> 2.103 (>.999) </td> <td style="text-align:center;background-color: #FF808C !important;"> 1.360 (>.999) </td> <td style="text-align:center;background-color: #FFB180 !important;"> 0.522 (0.993) </td> <td style="text-align:center;background-color: #FFB480 !important;"> 0.503 (0.999) </td> <td style="text-align:center;background-color: #F3FF80 !important;"> -0.035 (0.411) </td> </tr> </tbody> </table> <table class=" lightable-material" style='font-family: "Source Sans Pro", helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> BOAG </th> <th style="text-align:center;"> EWAG </th> <th style="text-align:center;"> ML-PolyG </th> <th style="text-align:center;"> BMA </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> pointwise </td> <td style="text-align:center;background-color: #99EE80 !important;"> -0.161 (0.067) </td> <td style="text-align:center;background-color: #D6FF80 !important;"> -0.085 (0.177) </td> <td style="text-align:center;background-color: #AFF580 !important;"> -0.136 (0.126) </td> <td style="text-align:center;background-color: #FFF980 !important;"> 0.030 (0.753) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> smooth </td> <td style="text-align:center;background-color: #85E780 !important;"> -0.185 (0.037) </td> <td style="text-align:center;background-color: #D1FF80 !important;"> -0.094 (0.150) </td> <td style="text-align:center;background-color: #99EE80 !important;"> -0.161 (0.066) </td> <td style="text-align:center;background-color: #FFF980 !important;"> 0.027 (0.722) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> constant </td> <td style="text-align:center;background-color: #AEF580 !important;"> -0.137 (0.020) </td> <td style="text-align:center;background-color: #E0FF80 !important;"> -0.067 (0.144) </td> <td style="text-align:center;background-color: #B3F780 !important;"> -0.132 (0.027) </td> <td style="text-align:center;background-color: #FFF880 !important;"> 0.035 (0.826) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> smooth* </td> <td style="text-align:center;background-color: #80E680 !important;"> -0.191 (0.023) </td> <td style="text-align:center;background-color: #9CEF80 !important;"> -0.158 (0.025) </td> <td style="text-align:center;background-color: #80E680 !important;"> -0.190 (0.021) </td> <td style="text-align:center;background-color: #FFFE80 !important;"> -0.009 (0.333) </td> </tr> </tbody> </table> CRPS difference to **Naive** (scaled by `\(10^4\)`) of single experts and four combination methods with four options. Additionally, we show the p-value of the DM-test, testing against **Naive**. The smallest value is bold. 
**smooth*** denotes the ex-post optimal selection of the tuning parameters ] .panel[.panel-name[QL] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Cumulative Loss Difference] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Weights] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ] ] --- name: multivariate # Application to Multivariate Settings .pull-left[ Apply BOA to (24-dimensional) day-ahead power forecasts. The expert forecasts stem from probabilistic neural networks. Pre-print to be released soon: Grzegorz Marcjasz, Michał Narajewski, Rafał Weron, and Florian Ziel. ```r # 182 observations are used as burn-in mod <- online( y = Y, # 736 days x 24 hours experts = experts, # 736 days, 24 hours, 99 quantiles, 2 experts tau = 1:99 / 100 # probability grid ) ```
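For orientation, a hypothetical construction of the `experts` array passed above; the dimension order (days, hours, quantiles, experts) mirrors the comment in the call, and the expert names anticipate the CRPS table below:

```r
# Hypothetical: two distributional network experts ("norm", "jsu"), each
# supplying 99 quantile forecasts for 24 hours on each of 736 days
experts <- array(
  NA_real_,
  dim = c(736, 24, 99, 2),
  dimnames = list(NULL, paste0("h", 1:24), paste0("q", 1:99), c("norm", "jsu"))
)
```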
**CRPS values**

| norm | jsu | comb |
|:----:|:---:|:----:|
| 1.448 | 1.372 | 1.334 |
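CRPS values like these can be approximated from quantile forecasts via the decomposition \eqref{eq_crps_qs}; a sketch (the function name is ours, and the probability grid is assumed equidistant):

```r
# CRPS as twice the average pinball loss over the probability grid tau
crps_quantiles <- function(q_mat, y, tau) {
  # q_mat: T x length(tau) quantile forecasts, y: length-T realizations
  ql <- sapply(seq_along(tau), function(j)
    mean((as.numeric(y < q_mat[, j]) - tau[j]) * (q_mat[, j] - y)))
  2 * mean(ql)
}
```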
] .pull-right[ <div style="position:relative; margin-top:50px; z-index: 0">
] --- # Smoothing in the Multivariate Setting .pull-left[ Smoothing parameters can be optimized online. Below, the parameters are chosen for illustrative purposes: ```r mod <- online( y = Y, experts = experts, tau = 1:99 / 100, p_smooth_pr = list( # smoothing over the probability grid lambda = c(20, 50, 100) # 50 selected ), p_smooth_mv = list( # smoothing over the multivariate (hourly) dimension lambda = c(20, 50, 100) # 50 selected ) ) ```
**CRPS values**

| norm | jsu | comb |
|:----:|:---:|:----:|
| 1.448 | 1.372 | 1.333 |
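For intuition, a minimal sketch of the smoothing step \eqref{eq_function_smooth} on a single weight curve, penalizing first-order differences of the B-spline coefficients as a stand-in for the derivative penalty (illustrative, not the profoc internals):

```r
library(splines)
p <- 1:99 / 100
w_raw <- 0.5 + 0.4 * sin(6 * p) + rnorm(99, sd = 0.05) # rough pointwise weights

B <- bs(p, df = 12, degree = 3, intercept = TRUE)       # cubic B-spline basis
D <- diff(diag(ncol(B)), differences = 1)               # d = 1 penalty matrix
lambda <- 10
beta <- solve(t(B) %*% B + lambda * t(D) %*% D, t(B) %*% w_raw)
w_smooth <- as.vector(B %*% beta)
```

As `lambda` grows, the first differences of `beta` are shrunk to zero, recovering the constant solution.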
] .pull-right[ <div style="position:relative; margin-top:50px; z-index: 0">
] --- name: conclusion # Wrap-Up .font90[ .pull-left[ Potential downsides: - Pointwise optimization can induce quantile crossing - This can be solved by sorting the predictions Upsides: - Pointwise learning significantly outperforms the Naive solution - Online learning is much faster than batch methods - Smoothing further improves the predictive performance - Asymptotically not worse than the best convex combination ] .pull-right[ Important: - The choice of the learning rate is crucial - The loss function has to meet certain criteria The [
profoc](https://profoc.berrisch.biz/) R package: - Implements all algorithms discussed above - Is written using RcppArmadillo, which makes it fast - Accepts vectors for most parameters - The best parameter combination is chosen online - Implements: - Forgetting, fixed share - Different loss functions + gradients - Multivariate inputs ] ] <a href="https://github.com/BerriJ" class="github-corner" aria-label="View source on Github"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#f2f2f2; color:#212121; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style> --- name:references # References <p><cite><a id='bib-BERRISCH2021'></a><a href="#cite-BERRISCH2021">Berrisch, J. and F. Ziel</a> (2021). “CRPS learning”. In: <em>Journal of Econometrics</em>.</cite></p> <p><cite><a id='bib-cesa2006prediction'></a><a href="#cite-cesa2006prediction">Cesa-Bianchi, N. and G. Lugosi</a> (2006). <em>Prediction, learning, and games</em>. Cambridge University Press.</cite></p> <p><cite><a id='bib-gaillard2018efficient'></a><a href="#cite-gaillard2018efficient">Gaillard, P. and O. Wintenberger</a> (2018). “Efficient online algorithms for fast-rate regret bounds under sparsity”. In: <em>Advances in Neural Information Processing Systems</em>, pp. 7026–7036.</cite></p> <p><cite><a id='bib-gneiting2011making'></a><a href="#cite-gneiting2011making">Gneiting, T.</a> (2011a). “Making and evaluating point forecasts”. In: <em>Journal of the American Statistical Association</em> 106.494, pp. 746–762.</cite></p> <p><cite><a id='bib-gneiting2007strictly'></a><a href="#cite-gneiting2007strictly">Gneiting, T. and A. E. Raftery</a> (2007). “Strictly proper scoring rules, prediction, and estimation”. In: <em>Journal of the American Statistical Association</em> 102.477, pp. 359–378.</cite></p> <p><cite><a id='bib-herbster1998tracking'></a><a href="#cite-herbster1998tracking">Herbster, M. and M. K. Warmuth</a> (1998). “Tracking the best expert”. In: <em>Machine Learning</em> 32.2, pp. 151–178.</cite></p> <p><cite><a id='bib-wintenberger2017optimal'></a><a href="#cite-wintenberger2017optimal">Wintenberger, O.</a> (2017). “Optimal learning with Bernstein online aggregation”. In: <em>Machine Learning</em> 106.1, pp. 119–141.</cite></p>