2026-02-10
11 Adversarial Bandit & Exp3 Algo
A \(k\)-armed adversarial bandit is an arbitrary sequence of reward vectors \((x_t)_{t=1}^n\) with \(x_t\in[0,1]^k\). In each round, the learner chooses a distribution over actions, samples an action from that distribution, and receives the reward \(x_{t,A_t}\).
It is important to note that the regret is measured against the best FIXED action in hindsight, rather than against a globally best policy.
Sublinear worst-case regret can only be achieved by a randomized policy. Indeed, if the actions are chosen deterministically, the adversary can set \(x_{t,A_t}=0\) and \(x_{t,j}=1\) for \(j\neq A_t\); the learner then collects zero reward while the best fixed arm collects at least \(n(1-1/k)\), so the worst-case regret is \(\ge n(1-1/k)\).
Importance-weighted estimator: let \(P_{t,i}\) be the probability that \(A_t=i\) conditioned on the previous observations, and define \[ \hat X_{t,i}=\frac{[A_t=i]X_t}{P_{t,i}}. \] It is easy to check that this is an unbiased estimate of \(x_{t,i}\), with conditional variance \(\frac{x_{t,i}^2(1-P_{t,i})}{P_{t,i}}\).
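For completeness, the unbiasedness and the variance follow from a one-line conditional computation (writing \(E_t\) for expectation conditioned on the history before round \(t\)): \[ E_t[\hat X_{t,i}]=P_{t,i}\cdot\frac{x_{t,i}}{P_{t,i}}+(1-P_{t,i})\cdot 0=x_{t,i},\qquad E_t[\hat X^2_{t,i}]=P_{t,i}\cdot\frac{x_{t,i}^2}{P_{t,i}^2}=\frac{x_{t,i}^2}{P_{t,i}}, \] so the conditional variance is \(\frac{x_{t,i}^2}{P_{t,i}}-x_{t,i}^2=\frac{x_{t,i}^2(1-P_{t,i})}{P_{t,i}}\), which blows up when \(P_{t,i}\) is small.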
We can also define the loss \(y_{t,i}=1-x_{t,i}\) and use the analogous loss-based importance-weighted estimator \(\hat Y_{t,i}=\frac{[A_t=i]Y_t}{P_{t,i}}\), which has a similar variance.
Exp3 Algo: Let \(\hat S_{t,i}=\sum_{s\le t}\hat X_{s,i}\). In round \(t\), we sample the action from the distribution \(P_{t,i}\propto \exp(\eta\hat S_{t-1,i})\). Here \(\eta\) is called the learning rate and reflects how aggressively the algorithm exploits.
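To make the update concrete, here is a minimal Python sketch (not from the notes; the reward oracle `reward_fn(t, arm)` returning \(x_{t,\mathrm{arm}}\) is a hypothetical interface introduced only for illustration):

```python
import numpy as np

def exp3(reward_fn, n, k, eta, seed=0):
    """Minimal Exp3 sketch: sample P_{t,i} proportional to exp(eta * S_hat_{t-1,i}),
    then update the played arm's estimate by importance weighting."""
    rng = np.random.default_rng(seed)
    S_hat = np.zeros(k)                      # \hat S_{t-1,i}: cumulative reward estimates
    total_reward = 0.0
    for t in range(n):
        logits = eta * S_hat
        p = np.exp(logits - logits.max())    # subtract max for numerical stability
        p /= p.sum()                         # sampling distribution P_{t,i}
        arm = int(rng.choice(k, p=p))
        x = reward_fn(t, arm)                # observed reward X_t = x_{t, A_t}
        total_reward += x
        S_hat[arm] += x / p[arm]             # \hat X_{t, A_t}; all other arms get 0
    return total_reward
```

For the rate in the theorem below one would take `eta = np.sqrt(np.log(k) / (n * k))`.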
Theorem: Let \(\eta=\sqrt{\log k/(nk)}\), then \(R_n\le 2\sqrt{nk\log k}\).
Proof:
Define \(\hat S_n=\sum_{t=1}^{n}\sum_{j=1}^{k}P_{t,j}\hat X_{t,j}\), so that \(R_{n,i}=\sum_t x_{t,i}-E[\sum_t X_t]=E[\hat S_{n,i}-\hat S_n]\), and let \(W_t=\sum_{j=1}^k \exp(\eta \hat S_{t,j})\).
\(\exp(\eta \hat S_{n,i})\le W_n=W_0\prod_{t=1}^n \frac{W_t}{W_{t-1}}=k\prod_{t=1}^n \frac{W_t}{W_{t-1}}\), since \(W_0=k\).
\(\frac{W_t}{W_{t-1}}=\sum_j \frac{\exp(\eta \hat S_{t-1,j})\exp(\eta \hat X_{t,j})}{W_{t-1}}=\sum_j P_{t,j}\exp(\eta \hat X_{t,j})\).
Using \(\exp(x)\le 1+x+x^2\) for \(x\le 1\) and \(1+y\le \exp(y)\), we obtain \[ \frac{W_t}{W_{t-1}}\le \exp\Big(\eta \sum_j P_{t,j}\hat X_{t,j}+\eta^2\sum_j P_{t,j}\hat X^2_{t,j}\Big)\\ \exp(\eta\hat S_{n,i})\le k\exp\Big(\eta \hat S_{n}+\eta^2 \sum_t\sum_j P_{t,j}\hat X^2_{t,j}\Big)\\ \hat S_{n,i}-\hat S_n\le \frac{\log k}{\eta}+\eta \sum_t\sum_j P_{t,j}\hat X^2_{t,j} \] and taking expectations gives \(R_{n,i}\le \frac{\log k}{\eta}+\eta\,E\big[\sum_t\sum_j P_{t,j}\hat X^2_{t,j}\big]\). (The condition \(x\le 1\) holds because the proof runs with the estimator \(\hat X_{t,j}=1-\hat Y_{t,j}\le 1\), i.e. the loss-based estimator written in reward form, so \(\eta\hat X_{t,j}\le\eta\le 1\).)
Expand \(\hat X_{t,j}^2\) using \(y\): \[ E\Big[\sum_{j=1}^{k}P_{t,j}\hat X^2_{t,j}\Big]=E\Big[\sum_{j=1}^k P_{t,j}\Big(1-\frac{[A_t=j]y_{t,j}}{P_{t,j}}\Big)^2\Big]\\ = E\Big[1-2Y_t+\frac{Y_t^2}{P_{t,A_t}}\Big]=E\Big[1-2Y_t+\sum_j y_{t,j}^2\Big]\\ = E\Big[(1-Y_t)^2+\sum_{j\neq A_t} y^2_{t,j}\Big]\le 1+(k-1)=k \]
This gives the final expression \(R_{n,i}\le \frac{\log k}{\eta}+\eta nk\) for every arm \(i\). Substituting \(\eta=\sqrt{\log k/(nk)}\) yields the result.
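As a rough sanity check with hypothetical numbers \(k=10\) and \(n=10^5\), the bound gives \(R_n\le 2\sqrt{10^5\cdot 10\cdot \log 10}\approx 3.0\times 10^3\), far below the trivial linear bound of \(10^5\).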
12 Exp3-IX
Exp3 achieves sublinear expected regret, but the random regret has high variance. A better algorithm improves on this: for each \(\delta\in(0,1)\), with probability at least \(1-\delta\), \[ \hat R_n=\max_a \sum_{t=1}^{n}(y_{t,A_t}-y_{t,a})=O\Big(\sqrt{nk\log\frac{k}{\delta}}\Big). \] The high variance of Exp3 comes from its importance sampling; we can control the variance at the price of a small bias.
First, let us use the loss-based estimator for the rest of the chapter, with \(\hat L_{t,i}=\sum_{s\le t}\hat Y_{s,i}\) in place of \(\hat S\), and write \(\hat L_n=\sum_{t=1}^{n}\sum_{j=1}^{k}P_{t,j}\hat Y_{t,j}\). Repeating the argument of the previous section, now with \(\exp(-x)\le 1-x+\frac{x^2}{2}\) for \(x\ge 0\) (which is where the factor \(\frac{\eta}{2}\) comes from) and \(P_{t,j}\hat Y^2_{t,j}\le \hat Y_{t,j}\), gives the analogue of the reward-based inequality \[ \hat L_{n}-\hat L_{n,i}\le \frac{\log k}{\eta}+\frac{\eta}{2}\sum_{j=1}^{k}\hat L_{n,j} \]
Also define the observed loss \(\tilde L_n=\sum_{t=1}^{n} y_{t,A_t}\), the per-arm total loss \(L_{n,i}=\sum_{t=1}^{n}y_{t,i}\), and the per-arm regret \(\hat R_{n,i}=\tilde L_n-L_{n,i}\). Adding and subtracting \(\hat L_n\) and \(\hat L_{n,i}\) and applying the display above to the middle term gives \[ \hat R_{n,i}\le (\tilde L_n-\hat L_n)+(\hat L_{n,i}-L_{n,i})+\frac{\log k}{\eta}+\frac{\eta}{2}\sum_{j=1}^{k}\hat L_{n,j}, \] which brings the bias of the importance weighting into the inequality.
Exp3-IX: the new algorithm modifies the importance weighting to \[ \hat Y_{t,i}=\frac{[A_t=i]Y_t}{P_{t,i}+\gamma} \] for some \(\gamma>0\). IX stands for implicit exploration: since \(\hat Y_{t,i}\) under-estimates the loss, it encourages more exploration.
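A minimal Python sketch of the modified update (again illustrative; the loss oracle `loss_fn(t, arm)` returning \(y_{t,\mathrm{arm}}\) is a hypothetical interface, and the sampling rule \(P_{t,i}\propto\exp(-\eta\hat L_{t-1,i})\) is the exponential-weights rule implied by using \(\hat L\) in place of \(\hat S\)):

```python
import numpy as np

def exp3_ix(loss_fn, n, k, eta, gamma, seed=0):
    """Minimal Exp3-IX sketch: loss-based exponential weights plus the biased
    IX estimator with denominator P_{t,i} + gamma."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(k)                      # \hat L_{t-1,i}: cumulative loss estimates
    total_loss = 0.0
    for t in range(n):
        logits = -eta * L_hat
        p = np.exp(logits - logits.max())    # subtract max for numerical stability
        p /= p.sum()                         # sampling distribution P_{t,i}
        arm = int(rng.choice(k, p=p))
        y = loss_fn(t, arm)                  # observed loss Y_t = y_{t, A_t}
        total_loss += y
        L_hat[arm] += y / (p[arm] + gamma)   # IX: deliberately under-estimates the loss
    return total_loss
```

The analysis below takes `gamma = eta / 2`.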
Under this definition, the bias is \(\tilde L_n-\hat L_n=\gamma \sum_{j=1}^{k}\hat L_{n,j}\).
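To verify this identity, note that in round \(t\) \[ \sum_{j=1}^k P_{t,j}\hat Y_{t,j}=\frac{P_{t,A_t}Y_t}{P_{t,A_t}+\gamma}=Y_t-\frac{\gamma Y_t}{P_{t,A_t}+\gamma}=Y_t-\gamma\sum_{j=1}^k\hat Y_{t,j}, \] and summing over \(t\) gives \(\hat L_n=\tilde L_n-\gamma\sum_{j}\hat L_{n,j}\).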
On the other hand, we have the following concentration lemma.
- Lemma: for \(\delta\in(0,1)\), with probability at least \(1-\delta\), both \(\max_i (\hat L_{n,i}-L_{n,i})\le \frac{\log((k+1)/\delta)}{2\gamma}\) and \(\sum_i(\hat L_{n,i}-L_{n,i})\le \frac{\log((k+1)/\delta)}{2\gamma}\) hold.
Since \(L_{n,i}\le n\), we get \[ \hat R_n\le \frac{\log k}{\eta}+\Big(\gamma+\frac{\eta}{2}\Big)nk+\Big(\gamma+\frac{\eta}{2}+1\Big)\frac{\log((k+1)/\delta)}{2\gamma}. \] Substituting \(\gamma=\frac{\eta}{2}\) and choosing \(\eta\) of order \(\sqrt{\log(k/\delta)/(nk)}\) gives the claimed bound.
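Spelling out the last step: with \(\gamma=\frac{\eta}{2}\) the bound becomes \[ \hat R_n\le \frac{\log k}{\eta}+\eta nk+(\eta+1)\frac{\log((k+1)/\delta)}{\eta}= \frac{\log k+\log((k+1)/\delta)}{\eta}+\eta nk+\log\frac{k+1}{\delta}, \] and taking \(\eta\) proportional to \(\sqrt{\log((k+1)/\delta)/(nk)}\) balances the first two terms, giving \(\hat R_n=O\big(\sqrt{nk\log(k/\delta)}\big)\).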