2026-02-10
11 Adversarial Bandit & Exp3 Algo
A \(k\)-armed adversarial bandit is an arbitrary sequence of reward vectors \((x_t)_{t=1}^n\) with \(x_t\in[0,1]^k\). In each round, the learner chooses a distribution over actions, samples an action from that distribution, and receives the reward \(x_{t,A_t}\).
It is important to note that the regret is measured against the best FIXED action in hindsight, rather than against a globally best policy.
Sublinear worst-case regret can only be achieved by a randomized policy. Indeed, if the actions are chosen deterministically, the adversary can set \(x_{t,A_t}=0\) and \(x_{t,j}=1\) for \(j\neq A_t\); the learner then collects zero reward while the best fixed arm collects at least \(n(1-1/k)\), so the worst-case regret is \(\ge n(1-1/k)\).
Importance-weighted estimator: let \(P_{t,i}\) be the probability that \(A_t=i\) conditioned on the previous observations, and define \[ \hat X_{t,i}=\frac{[A_t=i]X_t}{P_{t,i}}. \] It is easy to check that this is an unbiased estimate of \(x_{t,i}\), with conditional variance \(\frac{x_{t,i}^2(1-P_{t,i})}{P_{t,i}}\).
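For completeness, the unbiasedness and the variance follow from a one-line conditional computation (writing \(E_t\) for expectation conditioned on the history before round \(t\)): \[ E_t[\hat X_{t,i}]=P_{t,i}\cdot\frac{x_{t,i}}{P_{t,i}}+(1-P_{t,i})\cdot 0=x_{t,i},\qquad E_t[\hat X^2_{t,i}]=P_{t,i}\cdot\frac{x_{t,i}^2}{P_{t,i}^2}=\frac{x_{t,i}^2}{P_{t,i}}, \] so the conditional variance is \(\frac{x_{t,i}^2}{P_{t,i}}-x_{t,i}^2=\frac{x_{t,i}^2(1-P_{t,i})}{P_{t,i}}\), which blows up when \(P_{t,i}\) is small.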
We can also define the loss \(y_{t,i}=1-x_{t,i}\) and use the analogous loss-based importance-weighted estimator \(\hat Y_{t,i}=\frac{[A_t=i]Y_t}{P_{t,i}}\), which has a similar variance.
Exp3 Algo: Let \(\hat S_{t,i}=\sum_{s\le t}\hat X_{s,i}\). In round \(t\), we sample the action from the distribution \(P_{t,i}\propto \exp(\eta\hat S_{t-1,i})\). Here \(\eta\) is called the learning rate and reflects how aggressively the algorithm exploits.
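To make the update concrete, here is a minimal Python sketch (not from the notes; the reward oracle `reward_fn(t, arm)` returning \(x_{t,\mathrm{arm}}\) is a hypothetical interface introduced only for illustration):

```python
import numpy as np

def exp3(reward_fn, n, k, eta, seed=0):
    """Minimal Exp3 sketch: sample P_{t,i} proportional to exp(eta * S_hat_{t-1,i}),
    then update the played arm's estimate by importance weighting."""
    rng = np.random.default_rng(seed)
    S_hat = np.zeros(k)                      # \hat S_{t-1,i}: cumulative reward estimates
    total_reward = 0.0
    for t in range(n):
        logits = eta * S_hat
        p = np.exp(logits - logits.max())    # subtract max for numerical stability
        p /= p.sum()                         # sampling distribution P_{t,i}
        arm = int(rng.choice(k, p=p))
        x = reward_fn(t, arm)                # observed reward X_t = x_{t, A_t}
        total_reward += x
        S_hat[arm] += x / p[arm]             # \hat X_{t, A_t}; all other arms get 0
    return total_reward
```

For the rate in the theorem below one would take `eta = np.sqrt(np.log(k) / (n * k))`.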
Theorem: Let \(\eta=\sqrt{\log k/(nk)}\), then \(R_n\le 2\sqrt{nk\log k}\).
Proof:
Define \(\hat S_n=\sum_{t=1}^{n}\sum_{j=1}^{k}P_{t,j}\hat X_{t,j}\), so that \(R_{n,i}=\sum_t x_{t,i}-E[\sum_t X_t]=E[\hat S_{n,i}-\hat S_n]\), and let \(W_t=\sum_{j=1}^k \exp(\eta \hat S_{t,j})\).
\(\exp(\eta \hat S_{n,i})\le W_n=W_0\prod_{t=1}^n \frac{W_t}{W_{t-1}}=k\prod_{t=1}^n \frac{W_t}{W_{t-1}}\), since \(W_0=k\).
\(\frac{W_t}{W_{t-1}}=\sum_j \frac{\exp(\eta \hat S_{t-1,j})\exp(\eta \hat X_{t,j})}{W_{t-1}}=\sum_j P_{t,j}\exp(\eta \hat X_{t,j})\).
Using \(\exp(x)\le 1+x+x^2\) for \(x\le 1\) and \(1+y\le \exp(y)\), we obtain \[ \frac{W_t}{W_{t-1}}\le \exp\Big(\eta \sum_j P_{t,j}\hat X_{t,j}+\eta^2\sum_j P_{t,j}\hat X^2_{t,j}\Big)\\ \exp(\eta\hat S_{n,i})\le k\exp\Big(\eta \hat S_{n}+\eta^2 \sum_t\sum_j P_{t,j}\hat X^2_{t,j}\Big)\\ \hat S_{n,i}-\hat S_n\le \frac{\log k}{\eta}+\eta \sum_t\sum_j P_{t,j}\hat X^2_{t,j} \] and taking expectations gives \(R_{n,i}\le \frac{\log k}{\eta}+\eta\,E\big[\sum_t\sum_j P_{t,j}\hat X^2_{t,j}\big]\). (The condition \(x\le 1\) holds because the proof runs with the estimator \(\hat X_{t,j}=1-\hat Y_{t,j}\le 1\), i.e. the loss-based estimator written in reward form, so \(\eta\hat X_{t,j}\le\eta\le 1\).)
Expand \(\hat X_{t,j}^2\) using \(y\): \[ E\Big[\sum_{j=1}^{k}P_{t,j}\hat X^2_{t,j}\Big]=E\Big[\sum_{j=1}^k P_{t,j}\Big(1-\frac{[A_t=j]y_{t,j}}{P_{t,j}}\Big)^2\Big]\\ = E\Big[1-2Y_t+\frac{Y_t^2}{P_{t,A_t}}\Big]=E\Big[1-2Y_t+\sum_j y_{t,j}^2\Big]\\ = E\Big[(1-Y_t)^2+\sum_{j\neq A_t} y^2_{t,j}\Big]\le 1+(k-1)=k \]
This gives the final expression \(R_{n,i}\le \frac{\log k}{\eta}+\eta nk\) for every arm \(i\). Substituting \(\eta=\sqrt{\log k/(nk)}\) yields the result.
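As a rough sanity check with hypothetical numbers \(k=10\) and \(n=10^5\), the bound gives \(R_n\le 2\sqrt{10^5\cdot 10\cdot \log 10}\approx 3.0\times 10^3\), far below the trivial linear bound of \(10^5\).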
12 Exp3-IX
Exp3 achieves sublinear expected regret, but the random regret has high variance. A better algorithm improves on this: for each \(\delta\in(0,1)\), with probability at least \(1-\delta\), \[ \hat R_n=\max_a \sum_{t=1}^{n}(y_{t,A_t}-y_{t,a})=O\Big(\sqrt{nk\log\frac{k}{\delta}}\Big). \] The high variance of Exp3 comes from its importance sampling; we can control the variance at the price of a small bias.
First, let us use the loss-based estimator for the rest of the chapter, with \(\hat L_{t,i}=\sum_{s\le t}\hat Y_{s,i}\) in place of \(\hat S\), and write \(\hat L_n=\sum_{t=1}^{n}\sum_{j=1}^{k}P_{t,j}\hat Y_{t,j}\). Repeating the argument of the previous section, now with \(\exp(-x)\le 1-x+\frac{x^2}{2}\) for \(x\ge 0\) (which is where the factor \(\frac{\eta}{2}\) comes from) and \(P_{t,j}\hat Y^2_{t,j}\le \hat Y_{t,j}\), gives the analogue of the reward-based inequality \[ \hat L_{n}-\hat L_{n,i}\le \frac{\log k}{\eta}+\frac{\eta}{2}\sum_{j=1}^{k}\hat L_{n,j} \]
Also define the observed loss \(\tilde L_n=\sum_{t=1}^{n} y_{t,A_t}\), the per-arm total loss \(L_{n,i}=\sum_{t=1}^{n}y_{t,i}\), and the per-arm regret \(\hat R_{n,i}=\tilde L_n-L_{n,i}\). Adding and subtracting \(\hat L_n\) and \(\hat L_{n,i}\) and applying the display above to the middle term gives \[ \hat R_{n,i}\le (\tilde L_n-\hat L_n)+(\hat L_{n,i}-L_{n,i})+\frac{\log k}{\eta}+\frac{\eta}{2}\sum_{j=1}^{k}\hat L_{n,j}, \] which brings the bias of the importance weighting into the inequality.
Exp3-IX: the new algorithm modifies the importance weighting to \[ \hat Y_{t,i}=\frac{[A_t=i]Y_t}{P_{t,i}+\gamma} \] for some \(\gamma>0\). IX stands for implicit exploration: since \(\hat Y_{t,i}\) under-estimates the loss, it encourages more exploration.
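A minimal Python sketch of the modified update (again illustrative; the loss oracle `loss_fn(t, arm)` returning \(y_{t,\mathrm{arm}}\) is a hypothetical interface, and the sampling rule \(P_{t,i}\propto\exp(-\eta\hat L_{t-1,i})\) is the exponential-weights rule implied by using \(\hat L\) in place of \(\hat S\)):

```python
import numpy as np

def exp3_ix(loss_fn, n, k, eta, gamma, seed=0):
    """Minimal Exp3-IX sketch: loss-based exponential weights plus the biased
    IX estimator with denominator P_{t,i} + gamma."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(k)                      # \hat L_{t-1,i}: cumulative loss estimates
    total_loss = 0.0
    for t in range(n):
        logits = -eta * L_hat
        p = np.exp(logits - logits.max())    # subtract max for numerical stability
        p /= p.sum()                         # sampling distribution P_{t,i}
        arm = int(rng.choice(k, p=p))
        y = loss_fn(t, arm)                  # observed loss Y_t = y_{t, A_t}
        total_loss += y
        L_hat[arm] += y / (p[arm] + gamma)   # IX: deliberately under-estimates the loss
    return total_loss
```

The analysis below takes `gamma = eta / 2`.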
Under this definition, the bias is \(\tilde L_n-\hat L_n=\gamma \sum_{j=1}^{k}\hat L_{n,j}\).
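To verify this identity, note that in round \(t\) \[ \sum_{j=1}^k P_{t,j}\hat Y_{t,j}=\frac{P_{t,A_t}Y_t}{P_{t,A_t}+\gamma}=Y_t-\frac{\gamma Y_t}{P_{t,A_t}+\gamma}=Y_t-\gamma\sum_{j=1}^k\hat Y_{t,j}, \] and summing over \(t\) gives \(\hat L_n=\tilde L_n-\gamma\sum_{j}\hat L_{n,j}\).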
On the other hand, we have the following concentration lemma.
- Lemma: for \(\delta\in(0,1)\), with probability at least \(1-\delta\), both \(\max_i (\hat L_{n,i}-L_{n,i})\le \frac{\log((k+1)/\delta)}{2\gamma}\) and \(\sum_i(\hat L_{n,i}-L_{n,i})\le \frac{\log((k+1)/\delta)}{2\gamma}\) hold.
Since \(L_{n,i}\le n\), we get \[ \hat R_n\le \frac{\log k}{\eta}+\Big(\gamma+\frac{\eta}{2}\Big)nk+\Big(\gamma+\frac{\eta}{2}+1\Big)\frac{\log((k+1)/\delta)}{2\gamma}. \] Substituting \(\gamma=\frac{\eta}{2}\) and choosing \(\eta\) of order \(\sqrt{\log(k/\delta)/(nk)}\) gives the claimed bound.
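Spelling out the last step: with \(\gamma=\frac{\eta}{2}\) the bound becomes \[ \hat R_n\le \frac{\log k}{\eta}+\eta nk+(\eta+1)\frac{\log((k+1)/\delta)}{\eta}= \frac{\log k+\log((k+1)/\delta)}{\eta}+\eta nk+\log\frac{k+1}{\delta}, \] and taking \(\eta\) proportional to \(\sqrt{\log((k+1)/\delta)/(nk)}\) balances the first two terms, giving \(\hat R_n=O\big(\sqrt{nk\log(k/\delta)}\big)\).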