class: center, middle, inverse, title-slide # CRPS-Learning ## Jonathan Berrisch, Florian Ziel ### University of Duisburg-Essen ### 2021-05-27 --- class:middle name: content # Outline - [Motivation](#motivation) - [The Framework of Prediction under Expert Advice](#pred_under_exp_advice) - [The Continuous Ranked Probability Score](#crps) - [Optimality of (Pointwise) CRPS-Learning](#crps_optim) - [A Simple Probabilistic Example](#simple_example) - [The Proposed CRPS-Learning Algorithm](#proposed_algorithm) - [Simulation Results](#simulation) - [Possible Extensions](#extensions) - [Application Study](#application) - [Wrap-Up](#conclusion) - [References](#references) --- class: center, middle, sydney-blue # Motivation --- name: motivation # Motivation .pull-left[ The Idea: - Combine multiple forecasts instead of choosing one - Combination weights may vary over **time**, over the **distribution**, or **both** Two popular options for combining distributions: - Combining across quantiles (this paper) - Horizontal aggregation, vincentization - Combining across probabilities - Vertical aggregation ] .pull-right[ <div style="position:relative; margin-top:-50px; z-index: 0"> .panelset[ .panel[.panel-name[Time] ![](data:image/png;base64,#index_files/figure-html/unnamed-chunk-1-1.svg)<!-- --> ] .panel[.panel-name[Distribution] ![](data:image/png;base64,#index_files/figure-html/unnamed-chunk-2-1.svg)<!-- --> ]] ] --- class: center, middle, sydney-blue # The Framework of Prediction under Expert Advice --- name: pred_under_exp_advice # The Framework of Prediction under Expert Advice ## The sequential framework .pull-left[ Each day, `\(t = 1, 2, \ldots, T\)` - The **forecaster** receives predictions `\(\widehat{X}_{t,k}\)` from `\(K\)` **experts** - The **forecaster** assigns weights `\(w_{t,k}\)` to each **expert** - The **forecaster** calculates her prediction: `\begin{equation} \widetilde{X}_{t} = \sum_{k=1}^K w_{t,k} \widehat{X}_{t,k}. \label{eq_forecast_def} \end{equation}` - The realization for `\(t\)` is observed ] .pull-right[ - The experts can be institutions, persons, or models - The forecasts can be point forecasts (e.g., mean or median) or full predictive distributions - We do not need any assumptions concerning the underlying data - <a href='#bib-cesa2006prediction'>Cesa-Bianchi and Lugosi (2006)</a> ] --- name: regret # The Regret Weights are updated sequentially according to the past performance of the `\(K\)` experts. That is, a loss function `\(\ell\)` is needed. It is used to compute the **cumulative regret** `\(R_{t,k}\)` `\begin{equation} R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i) \label{eq_regret} \end{equation}` The cumulative regret: - Indicates the predictive accuracy of expert `\(k\)` up to time `\(t\)`.
- Measures how much the forecaster *regrets* not having followed the expert's advice Popular loss functions for point forecasting <a href='#bib-gneiting2011making'>Gneiting (2011a)</a>: .pull-left[ - `\(\ell_2\)`-loss `\(\ell_2(x, y) = |x - y|^2\)` - optimal for mean predictions ] .pull-right[ - `\(\ell_1\)`-loss `\(\ell_1(x, y) = |x - y|\)` - optimal for median predictions ] --- name: popular_algs # Popular Aggregation Algorithms #### The naive combination `\begin{equation} w_{t,k}^{\text{Naive}} = \frac{1}{K} \end{equation}` #### The exponentially weighted average forecaster (EWA) `\begin{align} w_{t,k}^{\text{EWA}} & = \frac{e^{\eta R_{t,k}} }{\sum_{k = 1}^K e^{\eta R_{t,k}}} = \frac{e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }{\sum_{k = 1}^K e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} } \label{eq_ewa_general} \end{align}` #### The polynomial weighted aggregation (PWA) `\begin{align} w_{t,k}^{\text{PWA}} & = \frac{ 2(R_{t,k})^{q-1}_{+} }{ \|(R_t)_{+}\|^{q-2}_q} \label{eq_pwa_general} \end{align}` with `\(q\geq 2\)` and `\(x_{+}\)` the vector of positive parts of `\(x\)`. --- # Optimality In stochastic settings, the cumulative risk should be analyzed <a href='#bib-wintenberger2017optimal'>Wintenberger (2017)</a>: `\begin{align} \underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \qquad\qquad\qquad \text{ and } \qquad\qquad\qquad \underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}} \label{eq_def_cumrisk} \end{align}` There are two problems that an algorithm should solve in iid settings: .pull-left[ ### The selection problem `\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \stackrel{t\to \infty}{\rightarrow} a \quad \text{with} \quad a \leq 0. \label{eq_opt_select} \end{equation}` The forecaster is asymptotically not worse than the best expert `\(\widehat{\mathcal{R}}_{t,\min}\)`. ] .pull-right[ ### The convex aggregation problem `\begin{equation} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \stackrel{t\to \infty}{\rightarrow} b \quad \text{with} \quad b \leq 0. \label{eq_opt_conv} \end{equation}` The forecaster is asymptotically not worse than the best convex combination `\(\widehat{X}_{t,\pi}\)` in hindsight (**oracle**). ] --- # Optimality Satisfying the convexity property \eqref{eq_opt_conv} comes at the cost of slower possible convergence.
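Before turning to these convergence rates: the regret recursion \eqref{eq_regret} and the EWA update \eqref{eq_ewa_general} take only a few lines of code. A minimal R sketch (not part of the original slides; the `\(\ell_1\)`-loss, the toy data, and the expert forecasts are made up purely for illustration):

```r
# Toy illustration of EWA (eq_ewa_general) driven by cumulative regret (eq_regret)
set.seed(1)
T <- 200; K <- 3; eta <- 0.5
y <- rnorm(T)
experts <- cbind(rnorm(T, -0.5), rnorm(T, 0), rnorm(T, 0.5))  # K point forecasts
loss <- function(x, y) abs(x - y)                             # l1-loss (median)

w <- rep(1 / K, K)   # start from the naive combination
R <- rep(0, K)       # cumulative regret per expert
for (t in 1:T) {
  forecast <- sum(w * experts[t, ])                         # combined prediction
  R <- R + loss(forecast, y[t]) - loss(experts[t, ], y[t])  # update regret
  w <- exp(eta * R) / sum(exp(eta * R))                     # EWA weights
}
round(w, 3)  # experts with smaller cumulative loss end up with larger weights
```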
According to <a href='#bib-wintenberger2017optimal'>Wintenberger (2017)</a>, an algorithm has optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} if `\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & = \mathcal{O}\left(\frac{\log(K)}{t}\right)\label{eq_optp_select} \end{align}` `\begin{align} \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) & = \mathcal{O}\left(\sqrt{\frac{\log(K)}{t}}\right) \label{eq_optp_conv} \end{align}` Algorithms can satisfy both \eqref{eq_optp_select} and \eqref{eq_optp_conv} depending on: - The loss function - Regularity conditions on `\(Y_t\)` and `\(\widehat{X}_{t,k}\)` - The weighting scheme --- # Optimality According to <a href='#bib-cesa2006prediction'>Cesa-Bianchi and Lugosi (2006)</a>, EWA \eqref{eq_ewa_general} satisfies the optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if: - The loss `\(\ell\)` is exp-concave - The learning rate `\(\eta\)` is chosen correctly These results carry over to stochastic iid settings <a href='#bib-kakade2008generalization'>Kakade and Tewari (2008)</a> <a href='#bib-gaillard2014second'>Gaillard, Stoltz, and Van Erven (2014)</a>. The optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the gradient trick. Thereby, the loss is linearized: `\begin{align} \ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x \end{align}` `\(\ell'\)` is the subgradient of `\(\ell\)` in its first coordinate, evaluated at the forecast combination `\(\widetilde{X}\)`. Combining probabilistic forecasts calls for a probabilistic loss function ??? We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \eqref{eq_optp_select} and \eqref{eq_optp_conv}. --- name: crps # The Continuous Ranked Probability Score .pull-left[ **An appropriate choice:** `\begin{align*} \text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq_crps} \end{align*}` It's strictly proper <a href='#bib-gneiting2007strictly'>Gneiting and Raftery (2007)</a>. Using the CRPS, we can calculate time-adaptive weights `\(w_{t,k}\)`. However, what if the experts' performance is not uniform over all parts of the distribution? The idea: utilize this relation: `\begin{align*} \text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p. \label{eq_crps_qs} \end{align*}` ] .pull-right[ to combine quantiles of the probabilistic forecasts individually using the quantile loss (QL): `\begin{align*} \text{QL}_p(q, y) & = (\mathbb{1}\{y < q\} -p)(q - y) \end{align*}` </br> **But is it optimal?** CRPS is exp-concave
EWA \eqref{eq_ewa_general} with CRPS satisfies \eqref{eq_optp_select} and \eqref{eq_optp_conv} </br> </br> QL is convex, but not exp-concave
Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition while almost keeping optimal convergence ] --- name: crps_optim # CRPS-Learning Optimality For convex losses, BOAG satisfies that there exists a `\(C>0\)` such that for `\(x>0\)` it holds that `\begin{equation} P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq 1-e^{-x} \label{eq_boa_opt_conv} \end{equation}`
Almost optimal w.r.t *convex aggregation* \eqref{eq_optp_conv} <a href='#bib-wintenberger2017optimal'>Wintenberger (2017)</a>. The same algorithm satisfies that there exists a `\(C>0\)` such that for `\(x>0\)` it holds that `\begin{equation} P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq 1-e^{-x} \label{eq_boa_opt_select} \end{equation}` if `\(Y_t\)` is bounded and the considered loss `\(\ell\)` is convex, `\(G\)`-Lipschitz, and weakly exp-concave in its first coordinate. This is the case for losses that satisfy **A1** and **A2**. --- # CRPS-Learning Optimality .pull-left[ **A1** For some `\(G>0\)` it holds for all `\(x_1,x_2\in \mathbb{R}\)` and `\(t>0\)` that $$ | \ell(x_1, Y_t)-\ell(x_2, Y_t) | \leq G |x_1-x_2|$$ **A2** For some `\(\alpha>0\)`, `\(\beta\in[0,1]\)` it holds for all `\(x_1,x_2 \in \mathbb{R}\)` and `\(t>0\)` that `\begin{align*} \mathbb{E}[ & \ell(x_1, Y_t)-\ell(x_2, Y_t) | \mathcal{F}_{t-1}] \leq \\ & \mathbb{E}[ \ell'(x_1, Y_t)(x_1 - x_2) |\mathcal{F}_{t-1}] \\ & + \mathbb{E}\left[ \left. \left( \alpha(\ell'(x_1, Y_t)(x_1 - x_2))^{2}\right)^{1/\beta} \right|\mathcal{F}_{t-1}\right] \end{align*}`
Almost optimal w.r.t *selection* \eqref{eq_optp_select} <a href='#bib-gaillard2018efficient'>Gaillard and Wintenberger (2018)</a>. ] .pull-right[ **Lemma 1** `\begin{align} 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\min} \label{eq_risk_ql_crps_expert} \\ 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\pi}. \label{eq_risk_ql_crps_convex} \end{align}` Pointwise learning can outperform constant procedures. QL is convex but not exp-concave: <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> Almost optimal convergence w.r.t. *convex aggregation* \eqref{eq_boa_opt_conv} <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#00b02f;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> </br> For almost optimal convergence w.r.t. *selection* \eqref{eq_boa_opt_select} we need to check **A1** and **A2**: QL is Lipschitz continuous: <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> **A1** holds <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#ffa600;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> </br> ] --- # CRPS-Learning Optimality .pull-left[ Conditional quantile risk: `\(\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y_t) | \mathcal{F}_{t-1}]\)`. <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> The convexity properties of `\(\mathcal{Q}_p\)` depend on the conditional distribution `\(Y_t|\mathcal{F}_{t-1}\)`.
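A quick way to see why (a standard calculation, not on the original slide): when `\(Y_t|\mathcal{F}_{t-1}\)` has cdf `\(F_t\)` and Lebesgue density `\(f_t\)`, differentiating the quantile risk with respect to `\(x\)` gives `\begin{align*} \mathcal{Q}_p'(x) = F_t(x) - p, \qquad \mathcal{Q}_p''(x) = f_t(x). \end{align*}` Hence `\(\mathcal{Q}_p\)` is strongly convex exactly where `\(f_t\)` is bounded away from zero, which is what the following proposition formalizes.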
**Proposition 1** Let `\(Y\)` be a univariate random variable with (Radon-Nikodym) `\(\nu\)`-density `\(f\)`, then for the second subderivative of the quantile risk `\(\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y) ]\)` of `\(Y\)` it holds for all `\(p\in(0,1)\)` that `\(\mathcal{Q}_p'' = f.\)` Additionally, if `\(f\)` is a continuous Lebesgue-density with `\(f\geq\gamma>0\)` for some constant `\(\gamma>0\)` on its support `\(\text{spt}(f)\)` then `\(\mathcal{Q}_p\)` is `\(\gamma\)`-strongly convex. Strong convexity with `\(\beta=1\)` implies **A2** <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#ffa600;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> <a href='#bib-gaillard2018efficient'>Gaillard and Wintenberger (2018)</a> ] .pull-right[ <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> **A1** and **A2** give us almost optimal convergence w.r.t. selection \eqref{eq_boa_opt_select} <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#00b02f;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> </br> **Theorem 1** The gradient-based fully adaptive Bernstein online aggregation (BOAG) applied pointwise for all `\(p\in(0,1)\)` on `\(\text{QL}\)` satisfies \eqref{eq_boa_opt_conv} with minimal CRPS given by `$$\widehat{\mathcal{R}}_{t,\pi} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi}.$$` If `\(Y_t|\mathcal{F}_{t-1}\)` is bounded and has a pdf `\(f_t\)` satisfying `\(f_t>\gamma >0\)` on its support `\(\text{spt}(f_t)\)` then \eqref{eq_boa_opt_select} holds with `\(\beta=1\)` and `$$\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min}$$`.
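The identity \eqref{eq_crps_qs} behind Lemma 1 and Theorem 1 is easy to verify numerically. An illustrative R sketch (not part of the original slides; it uses the well-known closed-form CRPS of the normal distribution):

```r
# Approximate CRPS(F, y) as the average of 2 * QL_p over a probability grid
# (eq_crps_qs) and compare with the closed-form CRPS of N(m, s^2).
ql <- function(q, y, p) (as.numeric(y < q) - p) * (q - y)  # quantile loss
crps_norm <- function(y, m, s) {                           # closed-form CRPS
  z <- (y - m) / s
  s * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}
p_grid <- seq(0.0005, 0.9995, by = 0.001)
y <- 0.3; m <- -1; s <- 1
mean(2 * ql(qnorm(p_grid, m, s), y, p_grid))  # quantile-based approximation
crps_norm(y, m, s)                            # both are approximately 0.83
```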
] --- name: simple_example # A Probabilistic Example .pull-left[ Simple example: `\begin{align} Y_t & \sim \mathcal{N}(0,\,1) \\ \widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq:dgp_sim1} \end{align}` - True weights vary over `\(p\)` - Figures show the ECDF and calculated weights using `\(T=25\)` realizations - The pointwise solution creates rough estimates - Pointwise is better than constant - The smooth solution is better than pointwise ] .pull-right[ <div style="position:relative; margin-top:-50px; z-index: 0"> .panelset[ .panel[.panel-name[CDFs] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Weights] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ]] ] --- # The Smoothing Procedure .pull-left[ We use penalized cubic B-splines: Let `\(\varphi=(\varphi_1,\ldots, \varphi_L)\)` be bounded basis functions on `\((0,1)\)`. Then we approximate `\(w_{t,k}\)` by `\begin{align} w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi \end{align}` with parameter vector `\(\beta\)`. The latter is estimated by penalized `\(L_2\)`-smoothing, which minimizes `\begin{equation} \| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2 \label{eq_function_smooth} \end{equation}` with differential operator `\(\mathcal{D}\)`. Smoothing can be applied ex-post or inside the algorithm (
[Simulation](#simulation)). ] .pull-right[ We obtain the constant solution for large values of `\(\lambda\)` when setting `\(d=1\)` <center> <img src="weights_lambda.gif"> </center> ] --- class: center, middle, sydney-blue # The Proposed CRPS-Learning Algorithm --- name:proposed_algorithm # The Proposed CRPS-Learning Algorithm .pull-left-3[ .font90[ **Initialization:** Array of expert predictions: `\(\widehat{X}_{t,k,p}\)` Vector of prediction targets: `\(Y_t\)` Starting weights: `\(w_0=(w_{0,1},\ldots, w_{0,K})\)`, Penalization parameter: `\(\lambda\geq 0\)` B-spline and penalty matrices `\(B\)` and `\(D\)` on `\(\mathcal{P}= (p_1,\ldots,p_M)\)` Hat matrix: `$$\mathcal{H} = B(B'B+ \lambda D'D)^{-1} B'$$` Cumulative regret: `\(R_{0,k} = 0\)` Range parameter: `\(E_{0,k}=0\)` ]] .pull-right-3[ .font90[ **Core**: for(t in 1:T) { for(p in `\(\mathcal{P}\)`) { `\(\widetilde{X}_{t}(p) = \sum_{k=1}^K w_{t-1,k,p} \widehat{X}_{t,k}(p)\)` .grey[\# Prediction] for(k in 1:K){ `\(r_{t,k,p} = \text{QL}_p^{\nabla}(\widehat{X}_{t,k}(p),Y_t) - \text{QL}_p^{\nabla}(\widetilde{X}_{t}(p),Y_t)\)` `\(E_{t,k,p} = \max(E_{t-1,k,p}, |r_{t,k,p}|)\)` `\(\eta_{t,k,p}=\min\left(\frac{1}{2E_{t,k,p}}, \sqrt{\log(K)/ \sum_{i=1}^t (r^2_{i, k,p})}\right)\)` `\(R_{t,k,p} = R_{t-1,k,p} + \frac{1}{2} \left( r_{t,k,p} \left( 1+ \eta_{t,k,p} r_{t,k,p} \right) + 2E_{t,k,p} \mathbb{1}(\eta_{t,k,p}r_{t,k,p} > \frac{1}{2}) \right)\)` `\(w_{t,k,p} = \eta_{t,k,p} \exp \left(- \eta_{t,k,p} R_{t,k,p} \right) w_{0,k,p} / \left( \frac{1}{K} \sum_{k = 1}^K \eta_{t,k,p} \exp \left( - \eta_{t,k,p} R_{t,k,p}\right) \right)\)` }.grey[\#k]}.grey[\#p] for(k in 1:K){ `\(w_{t,k} = \mathcal{H} w_{t,k}(\mathcal{P})\)` .grey[\# Smoothing] } .grey[\#k]} .grey[\#t] ] ] --- class: center, middle, sydney-blue # Simulation Study --- name: simulation # Simulation Study .pull-left[ Data generating process of the [simple probabilistic example](#simple_example) - Constant solution `\(\lambda \rightarrow \infty\)` - Pointwise solution of the proposed BOAG - Smoothed solution of the proposed BOAG - Weights are smoothed during learning - Smooth weights are used to calculate the regret, adjust weights, etc. - Smooth ex-post solution - Weights are smoothed after learning - The algorithm always uses non-smoothed weights ] .pull-right[ <div style="position:relative; margin-top:-50px; z-index: 0"> .panelset[ .panel[.panel-name[QL Deviation] Deviation from the best attainable `\(\text{QL}_p\)` (1000 runs). ![](data:image/png;base64,#pre_vs_post.gif) ] .panel[.panel-name[CRPS vs. Lambda] CRPS values for different `\(\lambda\)` (1000 runs) ![](data:image/png;base64,#pre_vs_post_lambda.gif) ]] ] --- # Simulation Study The same simulation, carried out for different algorithms (1000 runs): <center> <img src="algos_constant.gif"> </center> --- # Simulation Study .pull-left-1[ **New DGP:** `\begin{align} Y_t & \sim \mathcal{N}\left(\frac{\sin(0.005 \pi t )}{2},\,1\right) \\ \widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\ \widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq_dgp_sim2} \end{align}`
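For concreteness, the core loop of the algorithm box above can be transcribed to plain R and run on this DGP. This is a simplified sketch for illustration only, not the `profoc` implementation: it interprets the second argument of `\(\mathcal{N}(\cdot,\cdot)\)` as the variance, uses an arbitrary small B-spline basis for the smoothing step, and simply renormalizes the smoothed weights.

```r
library(splines)  # B-spline basis for the smoothing (hat) matrix
set.seed(1)
T <- 1000; K <- 2
p_grid <- seq(0.01, 0.99, 0.01); M <- length(p_grid)

# New DGP (eq_dgp_sim2): time-varying mean, two fixed expert distributions
y <- rnorm(T, sin(0.005 * pi * (1:T)) / 2, 1)
experts <- array(dim = c(T, K, M))
experts[, 1, ] <- matrix(qnorm(p_grid, -1, 1), T, M, byrow = TRUE)
experts[, 2, ] <- matrix(qnorm(p_grid, 3, 2), T, M, byrow = TRUE)  # sd = sqrt(4)

# Hat matrix H = B (B'B + lambda D'D)^{-1} B'
B <- bs(p_grid, df = 12, degree = 3, intercept = TRUE)
D <- diff(diag(ncol(B)), differences = 1)
lambda <- 10
H <- B %*% solve(t(B) %*% B + lambda * t(D) %*% D) %*% t(B)

w <- matrix(1 / K, K, M)        # weights w_{t-1,k,p}
R <- E <- V <- matrix(0, K, M)  # regret, range parameter, sum of squared r

for (t in 1:T) {
  X_tilde <- colSums(w * experts[t, , ])         # combined quantiles
  grad <- as.numeric(y[t] < X_tilde) - p_grid    # QL' at the combination
  r <- sweep(sweep(experts[t, , ], 2, X_tilde), 2, grad, `*`)  # r_{t,k,p}
  E <- pmax(E, abs(r)); V <- V + r^2
  eta <- pmin(1 / (2 * pmax(E, 1e-12)), sqrt(log(K) / pmax(V, 1e-12)))
  R <- R + 0.5 * (r * (1 + eta * r) + 2 * E * (eta * r > 0.5))
  w_raw <- eta * exp(-eta * R) / K
  w <- sweep(w_raw, 2, colSums(w_raw), `/`)      # normalize over experts
  w <- t(H %*% t(w))                             # smoothing step w = H w(P)
  w <- pmax(w, 0); w <- sweep(w, 2, colSums(w), `/`)  # keep weights valid
}
round(w[2, c(10, 50, 90)], 2)  # weight of expert 2 at p = 0.1, 0.5, 0.9
```

The `profoc` package referenced in the wrap-up provides an efficient C++ implementation of this procedure (including the extensions below); the sketch merely mirrors the pseudocode.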
Changing optimal weights
Single-run example depicted on the right
Without forgetting, the weights become almost constant in the long run ] .pull-right-2[ **Weights of expert 2** <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ] --- # Simulation Results The simulation using the new DGP, carried out for different algorithms (1000 runs): <center> <img src="algos_changing.gif"> </center> --- class: center, middle, sydney-blue # Possible Extensions --- name:extensions # Possible Extensions .pull-left[ **Forgetting** - Only taking part of the old cumulative regret into account - Exponential forgetting of past regret `\begin{align*} R_{t,k} & = R_{t-1,k}(1-\xi) + \ell(\widetilde{F}_{t},Y_t) - \ell(\widehat{F}_{t,k},Y_t) \label{eq_regret_forget} \end{align*}` **Fixed Shares** <a href='#bib-herbster1998tracking'>Herbster and Warmuth (1998)</a> - Adding fixed shares to the weights - Shrinkage towards a constant solution `\begin{align*} \widetilde{w}_{t,k} = \rho \frac{1}{K} + (1-\rho) w_{t,k}. \label{fixed_share_simple} \end{align*}` ] .pull-right[ **Non-Equidistant Knots** - A non-equidistant spline basis could be used - Potentially improves the tail behavior - Destroys shrinkage towards the constant solution <center> <img src="uneven_grid.gif"> </center> ] --- class: center, middle, sydney-blue # Application Study --- name: application # Application Study: Overview .pull-left-1[ .font90[ Data: - Forecasting European emission allowances (EUA) - Daily month-ahead prices - Jan 2013 - Dec 2020 (Phase III, 2092 obs.) Combination methods: - Naive, BOAG, EWAG, ML-PolyG, BMA Tuning parameter grids: - Smoothing penalty: `\(\Lambda= \{0\}\cup \{2^x|x\in \{-4,-3.5,\ldots,12\}\}\)` - Learning rates: `\(\mathcal{E}= \{2^x|x\in \{-1,-0.5,\ldots,9\}\}\)` ] ] .pull-right-2[ <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ] --- # Application Study: Experts .font90[ Simple exponential smoothing with additive errors (**ETS-ANN**): `\begin{align*} Y_{t} = l_{t-1} + \varepsilon_t \quad \text{with} \quad l_t = l_{t-1} + \alpha \varepsilon_t \quad \text{and} \quad \varepsilon_t \sim \mathcal{N}(0,\sigma^2) \end{align*}` Quantile regression (**QuantReg**): For each `\(p \in \mathcal{P}\)` we assume: `\begin{align*} F^{-1}_{Y_t}(p) = \beta_{p,0} + \beta_{p,1} Y_{t-1} + \beta_{p,2} |Y_{t-1}-Y_{t-2}| \end{align*}` ARIMA(1,0,1)-GARCH(1,1) with Gaussian errors (**ARMA-GARCH**): `\begin{align*} Y_{t} = \mu + \phi(Y_{t-1}-\mu) + \theta \varepsilon_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}` ARIMA(0,1,0)-I-EGARCH(1,1) with Gaussian errors (**I-EGARCH**): `\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \log(\sigma_t^2) = \omega + \alpha Z_{t-1}+ \gamma (|Z_{t-1}|-\mathbb{E}|Z_{t-1}|) + \beta \log(\sigma_{t-1}^2) \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1) \end{align*}` ARIMA(0,1,0)-GARCH(1,1) with Student-t errors (**I-GARCHt**): `\begin{align*} Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim t(0,1, \nu) \end{align*}` ] --- # Application Study: Results <div style="position:relative; margin-top:-25px; z-index: 0"> .panelset[ .panel[.panel-name[Significance] <table class=" lightable-material"
style='font-family: "Source Sans Pro", helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:center;"> ETS-ANN </th> <th style="text-align:center;"> QuantReg </th> <th style="text-align:center;"> ARMA-GARCH </th> <th style="text-align:center;"> I-EGARCH </th> <th style="text-align:center;"> I-GARCHt </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;background-color: #FF808C !important;"> 2.103 (>.999) </td> <td style="text-align:center;background-color: #FF808C !important;"> 1.360 (>.999) </td> <td style="text-align:center;background-color: #FFB180 !important;"> 0.522 (0.993) </td> <td style="text-align:center;background-color: #FFB480 !important;"> 0.503 (0.999) </td> <td style="text-align:center;background-color: #F3FF80 !important;"> -0.035 (0.411) </td> </tr> </tbody> </table> <table class=" lightable-material" style='font-family: "Source Sans Pro", helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> BOAG </th> <th style="text-align:center;"> EWAG </th> <th style="text-align:center;"> ML-PolyG </th> <th style="text-align:center;"> BMA </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> pointwise </td> <td style="text-align:center;background-color: #99EE80 !important;"> -0.161 (0.067) </td> <td style="text-align:center;background-color: #D6FF80 !important;"> -0.085 (0.177) </td> <td style="text-align:center;background-color: #AFF580 !important;"> -0.136 (0.126) </td> <td style="text-align:center;background-color: #FFF980 !important;"> 0.030 (0.753) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> smooth </td> <td style="text-align:center;background-color: #85E780 !important;"> -0.185 (0.037) </td> <td style="text-align:center;background-color: #D1FF80 !important;"> -0.094 (0.150) </td> <td style="text-align:center;background-color: #99EE80 !important;"> -0.161 (0.066) </td> <td style="text-align:center;background-color: #FFF980 !important;"> 0.027 (0.722) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> constant </td> <td style="text-align:center;background-color: #AEF580 !important;"> -0.137 (0.020) </td> <td style="text-align:center;background-color: #E0FF80 !important;"> -0.067 (0.144) </td> <td style="text-align:center;background-color: #B3F780 !important;"> -0.132 (0.027) </td> <td style="text-align:center;background-color: #FFF880 !important;"> 0.035 (0.826) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> smooth* </td> <td style="text-align:center;background-color: #80E680 !important;"> -0.191 (0.023) </td> <td style="text-align:center;background-color: #9CEF80 !important;"> -0.158 (0.025) </td> <td style="text-align:center;background-color: #80E680 !important;"> -0.190 (0.021) </td> <td style="text-align:center;background-color: #FFFE80 !important;"> -0.009 (0.333) </td> </tr> </tbody> </table> CRPS difference to **Naive** (scaled by `\(10^4\)`) of single experts and four combination methods with four options. Additionally, we show the p-value of the DM-test, testing against **Naive**. The smallest value is bold. 
We also report **smooth***, the ex-post optimal selection of the tuning parameters. ] .panel[.panel-name[QL] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Cumulative Loss Difference] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Weights] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ] ] --- class: center, middle, sydney-blue # Wrap-Up --- name: conclusion # Wrap-Up .font90[ .pull-left[ Potential Downsides: - Pointwise optimization can induce quantile crossing - Can be solved by sorting the predictions Upsides: - Pointwise learning outperforms the Naive solution significantly - Online learning is much faster than batch methods - Smoothing further improves the predictive performance - Asymptotically not worse than the best convex combination ] .pull-right[ Important: - The choice of the learning rate is crucial - The loss function has to meet certain criteria The [
profoc](https://profoc.berrisch.biz/) R Package: - Implements all algorithms discussed above - Is written using RcppArmadillo
it's fast - Accepts vectors for most parameters - The best parameter combination is chosen online - Implements - Forgetting, Fixed Share - Different loss functions + gradients ] ] <a href="https://github.com/BerriJ" class="github-corner" aria-label="View source on Github"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#f2f2f2; color:#212121; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style> ??? Execution Times: T = 5000 Opera: Ml-Poly > 157 ms Boa > 212 ms Profoc: Ml-Poly > 17 BOA > 16 --- class: center, middle [
CRPS-Learning](https://arxiv.org/abs/2102.00968) --- name:references # References 1 Cesa-Bianchi, N. and G. Lugosi (2006). _Prediction, learning, and games_. Cambridge University Press. Gaillard, P., G. Stoltz, and T. Van Erven (2014). "A second-order bound with excess losses". In: _Conference on Learning Theory_. PMLR, pp. 176-196. Gaillard, P. and O. Wintenberger (2018). "Efficient online algorithms for fast-rate regret bounds under sparsity". In: _Advances in Neural Information Processing Systems_, pp. 7026-7036. Gneiting, T. (2011a). "Making and evaluating point forecasts". In: _Journal of the American Statistical Association_ 106.494, pp. 746-762. Gneiting, T. and A. E. Raftery (2007). "Strictly proper scoring rules, prediction, and estimation". In: _Journal of the American Statistical Association_ 102.477, pp. 359-378. Herbster, M. and M. K. Warmuth (1998). "Tracking the best expert". In: _Machine Learning_ 32.2, pp. 151-178. Kakade, S. M. and A. Tewari (2008). "On the Generalization Ability of Online Strongly Convex Programming Algorithms". In: _NIPS_, pp. 801-808. --- # References 2 Wintenberger, O. (2017). "Optimal learning with Bernstein online aggregation". In: _Machine Learning_ 106.1, pp. 119-141. --- class: center, middle [
](#content)