13 Lower Bounds

Sometimes we are more interested in lower bounds for reasonable policies.

For a policy \(\pi\) and a set of environments \(\mathcal{E}\), we have the worst-case regret \(R_n(\pi,\mathcal{E})=\sup_{v\in \mathcal{E}} R_n(\pi,v)\). Thus for a set of policies \(\Pi\) it is reasonable to define the minimax regret \(R_n^*(\Pi,\mathcal{E})=\inf_{\pi\in\Pi} R_n(\pi,\mathcal{E})\). A minimax-optimal policy is one that minimizes the worst-case regret.

Though it is hard to find the exact minimax policy, we can find policies that are minimax optimal up to a constant factor. We already did that for Gaussian bandits: the \(O(\sqrt{kn})\) upper bound matches the \(\Omega(\sqrt{kn})\) lower bound on the minimax regret, which we will prove later.

Let us first sketch the idea via a stronger statement: for any policy \(\pi\), we can construct two \(k\)-armed Gaussian bandits \(P\) and \(P'\) with unit-variance arms \(N(\mu_i,1)\) and \(N(\mu'_i,1)\), such that \(\max \{R_n(\pi,P),R_n(\pi,P')\}\ge c\sqrt{kn}\) for a universal constant \(c\).

We want the two bandits to be hard to distinguish, yet to require different behavior from the policy. First set \(\mu=(\Delta,0,0,\dots)\), so that \(R_n(\pi,P)=\Delta(n-E[T_1])\). Let \(i\) be the suboptimal arm that \(\pi\) plays least often in \(P\); clearly \(E[T_i]\le \frac{n}{k-1}\). Then set \(\mu'=(\Delta,0,\dots,2\Delta,\dots)\), where \(\mu'_i=2\Delta\), so that arm \(i\) is optimal in \(P'\) while arm 1 has gap \(\Delta\), giving \(R_n(\pi,P')\ge \Delta E'[T_1]\). Though lacking rigor, when \(\Delta=\sqrt{\frac{1}{E[T_i(n)]}}\ge \sqrt{\frac{k-1}{n}}\), the two bandits are nearly indistinguishable from the policy's observations, so we expect \(E[T_1]\approx E'[T_1]\). Whether \(E[T_1]\) is large or small, one of the two regrets is then at least \(\frac{\Delta n}{2}\), which yields a lower bound of \(\frac{1}{2}\sqrt{n(k-1)}\).
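The two-instance construction can be illustrated numerically. A minimal sketch, assuming a simple explore-then-commit policy as the policy under attack (any concrete policy would do), and, unlike the proof, simply boosting the last arm rather than the least-played one:

```python
import numpy as np

def run_policy(mu, n, explore=10, rng=None):
    """Explore-then-commit on a Gaussian bandit with means mu:
    pull each arm `explore` times, then commit to the empirical best."""
    rng = rng or np.random.default_rng(0)
    k = len(mu)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(n):
        if t < k * explore:
            a = t % k                       # round-robin exploration
        else:
            a = int(np.argmax(sums / counts))  # commit to empirical best
        r = rng.normal(mu[a], 1.0)
        counts[a] += 1
        sums[a] += r
    # pseudo-regret: suboptimality gaps weighted by pull counts
    return float(np.sum((np.max(mu) - mu) * counts))

n, k = 10_000, 5
delta = np.sqrt((k - 1) / n)                # the gap from the argument
mu = np.array([delta] + [0.0] * (k - 1))    # bandit P
mu2 = mu.copy(); mu2[-1] = 2 * delta        # bandit P' (last arm boosted)

rng = np.random.default_rng(42)
worst = max(run_policy(mu, n, rng=rng), run_policy(mu2, n, rng=rng))
print(worst, 0.5 * np.sqrt(n * (k - 1)))
```

The proof's adversarial choice of arm \(i\) is what makes the bound hold for every policy; this sketch only shows the shape of the construction.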

15 Minimax Lower Bound

Divergence Decomposition:

  • Consider two \(k\)-armed bandits \(v=(P_i)\) and \(v'=(P'_i)\), along with a fixed policy \(\pi\). Let \(\mathbb P_v=\mathbb P_{v,\pi}\) be the probability measure on the canonical bandit model induced by the \(n\)-round interaction of \(\pi\) with \(v\).

  • We have \[ D(\mathbb P_v,\mathbb P_{v'})=\sum_{i=1}^{k}E_v(T_i) D(P_i,P'_i) \]
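For unit-variance Gaussian arms, \(D(P_i,P'_i)=(\mu_i-\mu'_i)^2/2\), so the decomposition reduces to a weighted sum of squared mean gaps. A minimal sketch (the function name is ours):

```python
import numpy as np

def divergence_decomposition(expected_pulls, mu, mu_prime):
    """D(P_v, P_v') for unit-variance Gaussian bandits:
    sum_i E_v[T_i] * (mu_i - mu'_i)^2 / 2."""
    expected_pulls = np.asarray(expected_pulls, dtype=float)
    gaps = np.asarray(mu, dtype=float) - np.asarray(mu_prime, dtype=float)
    return float(np.sum(expected_pulls * gaps**2 / 2))

# two bandits differing only in arm 3, whose mean is shifted by 2*Delta
delta = 0.1
d = divergence_decomposition([50, 30, 20], [delta, 0, 0], [delta, 0, 2 * delta])
print(d)  # 20 * (2*0.1)^2 / 2 = 0.4
```

Only the arm where the two bandits differ contributes, weighted by how often the policy is expected to pull it.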

Theorem: Consider the class of unit-variance Gaussian bandits. For any policy \(\pi\), there exists a mean vector \(\mu\) such that \(R_n(\pi,\mu)\ge \frac{1}{27}\sqrt{(k-1)n}\).

The idea of the proof was already sketched in the last section. The details are filled in using the divergence decomposition:

  • \(R_n(\pi,\mu)\ge \mathbb P_{\mu}(T_1\le n/2)\frac{n\Delta}{2}\), and \(R_n(\pi,\mu')\ge \mathbb P_{\mu'}(T_1\ge n/2)\frac{n\Delta}{2}\)
  • On the other hand, the Bretagnolle–Huber inequality shows that \(\mathbb P_{\mu}(T_1\le n/2)+\mathbb P_{\mu'}(T_1\ge n/2)\ge \frac{1}{2}\exp(-D(\mathbb P_{\mu},\mathbb P_{\mu'}))\).
  • Using the divergence decomposition and \(E[T_i]\le \frac{n}{k-1}\), we get \(D(\mathbb P_{\mu},\mathbb P_{\mu'})\le \frac{2n\Delta^2}{k-1}\).
  • Substituting \(\Delta=\sqrt{(k-1)/4n}\) (so the divergence is at most \(1/2\)) and using \(2\max(a,b)\ge a+b\) gives \(\max\{R_n(\pi,\mu),R_n(\pi,\mu')\}\ge \frac{n\Delta}{8}\exp(-1/2)\ge \frac{1}{27}\sqrt{(k-1)n}\).
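A quick numeric check of the constants in this chain, for an arbitrary \(n\) and \(k\):

```python
import math

n, k = 1000, 10
delta = math.sqrt((k - 1) / (4 * n))

# divergence bound: D <= 2*n*delta^2/(k-1), which equals 1/2 for this delta
kl_bound = 2 * n * delta**2 / (k - 1)

# max regret >= (n*delta/8) * exp(-D), via 2*max(a,b) >= a + b
lower = (n * delta / 8) * math.exp(-kl_bound)
ratio = lower / math.sqrt((k - 1) * n)
print(kl_bound, ratio)  # 0.5, ~0.0379 (and 0.0379 > 1/27 ~ 0.0370)
```

The final constant is \(\frac{1}{16}e^{-1/2}\approx 0.0379\), which the theorem rounds down to \(\frac{1}{27}\approx 0.0370\).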

16 Instance-Dependent Lower Bounds

Asymptotic Bounds

The minimax lower bound is a bit too conservative. There are reasonable policies that specialize in some (easier) bandit instances and achieve much smaller regret on them, though possibly performing badly on others.

Consistent: A policy \(\pi\) is consistent over a class of bandits \(E\) if for all \(p>0\) and \(v\in E\) it holds that \(\frac{R_n(\pi,v)}{n^p}\to 0\). That is, the regret grows slower than any polynomial in \(n\), e.g. logarithmically.

Let \(M\) be a set of distributions. For \(P\in M\), define \(d(P,\mu^*,M)=\inf_{P'\in M}\{D(P,P'):\mu(P')>\mu^*\}\), i.e. the smallest divergence between \(P\) and a \(P'\in M\) with mean larger than \(\mu^*\).

Theorem: For an unstructured class of bandits \(E=M_1\times\dots\times M_k\) and a consistent policy \(\pi\), it holds for all bandits \(v=(P_i)_{i\in[k]}\) that \[ \liminf_{n\to\infty} \frac{R_n(\pi,v)}{\log n}\ge c^*(v,E)=\sum_{i:\Delta_i>0}\frac{\Delta_i}{d(P_i,\mu^*,M_i)} \] where \(\mu^*\) is the mean of the optimal arm and \(\Delta_i\) is the suboptimality gap of arm \(i\).

For most classes, the lower bound \(c^*(v,E)\) can be attained, which motivates the notion of an asymptotically optimal policy on \(E\).
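For unit-variance Gaussian bandits, \(d(P_i,\mu^*,M_i)=\Delta_i^2/2\) (the cheapest way to push arm \(i\)'s mean above \(\mu^*\)), so \(c^*(v,E)=\sum_{i:\Delta_i>0} 2/\Delta_i\). A minimal sketch (the function name is ours):

```python
import numpy as np

def c_star_gaussian(mu):
    """c*(v, E) for unit-variance Gaussian bandits:
    d(P_i, mu*, M_i) = Delta_i^2 / 2, so each suboptimal arm
    contributes Delta_i / (Delta_i^2 / 2) = 2 / Delta_i."""
    mu = np.asarray(mu, dtype=float)
    gaps = np.max(mu) - mu
    return float(np.sum(2.0 / gaps[gaps > 0]))

print(c_star_gaussian([1.0, 0.5, 0.0]))  # 2/0.5 + 2/1.0 = 6.0
```

Note the inversion relative to the minimax picture: smaller gaps make an instance harder asymptotically, since each term scales like \(1/\Delta_i\).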

Finite-time bound

The above theorem only tells us about the limit, while in practice we face finite \(n\).

Lemma: Assume two bandits \(v=(P_i)\) and \(v'=(P'_i)\) differ only in the reward distribution of arm \(i\), and that \(i\) is uniquely optimal in \(v'\) while suboptimal in \(v\). Let \(\lambda=\mu_i(v')-\mu_i(v)\). Then for any policy \(\pi\) it holds that \[ E_{v\pi}[T_i(n)]\ge \frac{1}{D(P_i,P'_{i})}\left(\log\left(\frac{\min(\lambda-\Delta_i(v),\Delta_i(v))}{4}\right)+\log n-\log(R_n(v)+R_n(v'))\right) \] That is, for a low-regret policy the expected \(T_i\) is lower bounded, roughly, by \(\log n\) divided by \(D(P_i,P'_i)\).

This leads to the following theorem: Consider a Gaussian bandit \(v\) with suboptimality gaps \(\Delta_i\), and let \(E(v)=\{v':\mu_i(v')\in[\mu_i,\mu_i+2\Delta_i]\}\). Then for every consistent policy (with \(R_n\le Cn^p\)) and \(\epsilon\in(0,1)\), we have \[ R_n(\pi,v)\ge \frac{2}{(1+\epsilon)^2}\sum_{i:\Delta_i>0} \frac{(1-p)\log n+\log\left(\frac{\epsilon \Delta_i}{8C}\right)}{\Delta_i} \] The proof follows by plugging the lemma into the regret decomposition \(R_n=\sum_i \Delta_i E[T_i(n)]\).
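The bound is easy to evaluate for concrete instances. A minimal sketch (function name and the choice to clip negative log terms at zero are ours; for small \(n\) some terms of the bound are vacuous):

```python
import math

def finite_time_lower_bound(gaps, n, C=1.0, p=0.5, eps=0.5):
    """Evaluate (2/(1+eps)^2) * sum_i ((1-p)*log n + log(eps*Delta_i/(8C))) / Delta_i
    over suboptimal arms, clipping each arm's term at zero when the
    logarithmic part is still negative at this horizon n."""
    total = 0.0
    for d in gaps:
        if d <= 0:
            continue
        term = (1 - p) * math.log(n) + math.log(eps * d / (8 * C))
        total += max(term, 0.0) / d
    return 2 * total / (1 + eps) ** 2

print(finite_time_lower_bound([0.5, 1.0], n=10**6))
```

As expected from the \(\log n\) dependence, the bound grows as the horizon increases.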

17 High-Probability Lower Bounds

For bandits in this section, we assume \(\Delta_i\le 1\).

The regret \(R_n\) of a stochastic bandit is an expectation, so there is no randomness in it. Instead, we define the random pseudo-regret \(\bar R_n=\sum_i T_i(n)\Delta_i\). We say a policy is reasonable if for every bandit \(v\) it holds that \(R_n(\pi,v)\le B\sqrt{(k-1)n}\).

Then we have a high-probability lower bound: for any reasonable policy there exists a bandit such that \[ P\left(\bar R_n\ge \frac{1}{4}\min\left(n,\frac{1}{B}\sqrt{(k-1)n}\log\frac{1}{4\delta}\right)\right)\ge \delta \] The proof mirrors the minimax lower bound: start from a bandit with \(\mu=(\Delta,0,\dots)\), choose an arm \(i\) with minimal \(E[T_i]\), move to a second bandit with \(\mu'_i=2\Delta\), and bound the distance between the two via the divergence decomposition.

In detail, since the policy is reasonable, \(E[T_i(n)]\le \frac{B}{\Delta}\sqrt{\frac{n}{k-1}}\). Then \[ P(\bar R_n\ge \frac{\Delta n}{2})+P'(\bar R'_n\ge \frac{\Delta n}{2})\\ \ge P(T_i\ge \frac{n}{2})+P'(T_i<\frac{n}{2})\\ \ge \frac{1}{2}\exp(-D(P,P'))\ge \frac{1}{2}\exp\left(-2B\Delta \sqrt{\frac{n}{k-1}}\right) \] Choosing \(\Delta\) appropriately yields the result.
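The tuning of \(\Delta\) can be made explicit: setting \(\frac{1}{2}\exp(-2B\Delta\sqrt{n/(k-1)})=2\delta\) gives \(\Delta=\frac{1}{2B}\sqrt{\frac{k-1}{n}}\log\frac{1}{4\delta}\), and the event threshold \(\frac{\Delta n}{2}\) then matches the theorem's \(\frac{1}{4B}\sqrt{(k-1)n}\log\frac{1}{4\delta}\). A numeric sketch (function name is ours):

```python
import math

def high_prob_threshold(n, k, B, delta_prob):
    """Tune the gap so that (1/2)exp(-2*B*gap*sqrt(n/(k-1))) equals
    2*delta_prob, and return the resulting pseudo-regret level gap*n/2."""
    gap = math.sqrt((k - 1) / n) * math.log(1 / (4 * delta_prob)) / (2 * B)
    return gap * n / 2

n, k, B, dp = 10_000, 10, 1.0, 0.01
thr = high_prob_threshold(n, k, B, dp)
ref = 0.25 / B * math.sqrt((k - 1) * n) * math.log(1 / (4 * dp))
print(thr, ref)  # the two expressions agree
```

(The \(\min(n,\cdot)\) in the theorem handles the regime where the tuned \(\Delta\) would exceed the assumed gap range.)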