13 Lower Bounds

Sometimes we are more interested in lower bounds for reasonable policies.

For a policy \(\pi\) and a set of environments \(\mathcal{E}\), we have the worst-case regret \(R_n(\pi,\mathcal{E})=\sup_{v\in \mathcal{E}} R_n(\pi,v)\). Thus for a set of policies \(\Pi\) it is reasonable to define the minimax regret \(R_n^*(\Pi,\mathcal{E})=\inf_{\pi\in\Pi} R_n(\pi,\mathcal{E})\). A minimax-optimal policy is one that minimizes the worst-case regret.

Though it is hard to find the exact minimax policy, we can find policies that are minimax optimal up to a constant factor. We already did that for Gaussian bandits: the \(O(\sqrt{kn})\) upper bound matches the \(\Omega(\sqrt{kn})\) lower bound on the minimax regret, which we will prove later.

Let us first sketch the idea via a stronger statement: for any policy \(\pi\), we can construct two \(k\)-armed Gaussian bandits \(P\) and \(P'\) with unit-variance arms \(N(\mu_i,1)\) and \(N(\mu'_i,1)\), such that \(\max \{R_n(\pi,P),R_n(\pi,P')\}\ge c\sqrt{kn}\) for a universal constant \(c\).

We want the two bandits to be hard to distinguish, yet to require different behavior from the policy. First set \(\mu=(\Delta,0,0,\dots)\), so that \(R_n(\pi,P)=\Delta(n-E[T_1])\). Let \(i\) be the suboptimal arm that \(\pi\) plays least often in \(P\); clearly \(E[T_i]\le \frac{n}{k-1}\). Then set \(\mu'=(\Delta,0,\dots,2\Delta,\dots)\), where \(\mu'_i=2\Delta\), so that arm \(i\) is optimal in \(P'\) while arm 1 has gap \(\Delta\), giving \(R_n(\pi,P')\ge \Delta E'[T_1]\). Though lacking rigor, when \(\Delta=\sqrt{\frac{1}{E[T_i(n)]}}\ge \sqrt{\frac{k-1}{n}}\), the two bandits are nearly indistinguishable from the policy's observations, so we expect \(E[T_1]\approx E'[T_1]\). Whether \(E[T_1]\) is large or small, one of the two regrets is then at least \(\frac{\Delta n}{2}\), which yields a lower bound of \(\frac{1}{2}\sqrt{n(k-1)}\).
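The two-instance construction can be illustrated numerically. A minimal sketch, assuming a simple explore-then-commit policy as the policy under attack (any concrete policy would do), and, unlike the proof, simply boosting the last arm rather than the least-played one:

```python
import numpy as np

def run_policy(mu, n, explore=10, rng=None):
    """Explore-then-commit on a Gaussian bandit with means mu:
    pull each arm `explore` times, then commit to the empirical best."""
    rng = rng or np.random.default_rng(0)
    k = len(mu)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(n):
        if t < k * explore:
            a = t % k                       # round-robin exploration
        else:
            a = int(np.argmax(sums / counts))  # commit to empirical best
        r = rng.normal(mu[a], 1.0)
        counts[a] += 1
        sums[a] += r
    # pseudo-regret: suboptimality gaps weighted by pull counts
    return float(np.sum((np.max(mu) - mu) * counts))

n, k = 10_000, 5
delta = np.sqrt((k - 1) / n)                # the gap from the argument
mu = np.array([delta] + [0.0] * (k - 1))    # bandit P
mu2 = mu.copy(); mu2[-1] = 2 * delta        # bandit P' (last arm boosted)

rng = np.random.default_rng(42)
worst = max(run_policy(mu, n, rng=rng), run_policy(mu2, n, rng=rng))
print(worst, 0.5 * np.sqrt(n * (k - 1)))
```

The proof's adversarial choice of arm \(i\) is what makes the bound hold for every policy; this sketch only shows the shape of the construction.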

15 Minimax Lower Bound

Divergence Decomposition:

  • Consider two \(k\)-armed bandits \(v=(P_i)\) and \(v'=(P'_i)\), along with a fixed policy \(\pi\). Let \(\mathbb P_v=\mathbb P_{v,\pi}\) be the probability measure on the canonical bandit model induced by the \(n\)-round interaction of \(\pi\) with \(v\).

  • We have \[ D(\mathbb P_v,\mathbb P_{v'})=\sum_{i=1}^{k}E_v(T_i) D(P_i,P'_i) \]
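For unit-variance Gaussian arms, \(D(P_i,P'_i)=(\mu_i-\mu'_i)^2/2\), so the decomposition reduces to a weighted sum of squared mean gaps. A minimal sketch (the function name is ours):

```python
import numpy as np

def divergence_decomposition(expected_pulls, mu, mu_prime):
    """D(P_v, P_v') for unit-variance Gaussian bandits:
    sum_i E_v[T_i] * (mu_i - mu'_i)^2 / 2."""
    expected_pulls = np.asarray(expected_pulls, dtype=float)
    gaps = np.asarray(mu, dtype=float) - np.asarray(mu_prime, dtype=float)
    return float(np.sum(expected_pulls * gaps**2 / 2))

# two bandits differing only in arm 3, whose mean is shifted by 2*Delta
delta = 0.1
d = divergence_decomposition([50, 30, 20], [delta, 0, 0], [delta, 0, 2 * delta])
print(d)  # 20 * (2*0.1)^2 / 2 = 0.4
```

Only the arm where the two bandits differ contributes, weighted by how often the policy is expected to pull it.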

Theorem: Consider the class of unit-variance Gaussian bandits. For any policy \(\pi\), there exists a mean vector \(\mu\) such that \(R_n(\pi,\mu)\ge \frac{1}{27}\sqrt{(k-1)n}\).

The idea of the proof was already sketched in the last section. The details are filled in using the divergence decomposition:

  • \(R_n(\pi,\mu)\ge \mathbb P_{\mu}(T_1\le n/2)\frac{n\Delta}{2}\), and \(R_n(\pi,\mu')\ge \mathbb P_{\mu'}(T_1\ge n/2)\frac{n\Delta}{2}\)
  • On the other hand, the Bretagnolle–Huber inequality shows that \(\mathbb P_{\mu}(T_1\le n/2)+\mathbb P_{\mu'}(T_1\ge n/2)\ge \frac{1}{2}\exp(-D(\mathbb P_{\mu},\mathbb P_{\mu'}))\).
  • Using the divergence decomposition and \(E[T_i]\le \frac{n}{k-1}\), we get \(D(\mathbb P_{\mu},\mathbb P_{\mu'})\le \frac{2n\Delta^2}{k-1}\).
  • Substituting \(\Delta=\sqrt{(k-1)/4n}\) (so the divergence is at most \(1/2\)) and using \(2\max(a,b)\ge a+b\) gives \(\max\{R_n(\pi,\mu),R_n(\pi,\mu')\}\ge \frac{n\Delta}{8}\exp(-1/2)\ge \frac{1}{27}\sqrt{(k-1)n}\).
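A quick numeric check of the constants in this chain, for an arbitrary \(n\) and \(k\):

```python
import math

n, k = 1000, 10
delta = math.sqrt((k - 1) / (4 * n))

# divergence bound: D <= 2*n*delta^2/(k-1), which equals 1/2 for this delta
kl_bound = 2 * n * delta**2 / (k - 1)

# max regret >= (n*delta/8) * exp(-D), via 2*max(a,b) >= a + b
lower = (n * delta / 8) * math.exp(-kl_bound)
ratio = lower / math.sqrt((k - 1) * n)
print(kl_bound, ratio)  # 0.5, ~0.0379 (and 0.0379 > 1/27 ~ 0.0370)
```

The final constant is \(\frac{1}{16}e^{-1/2}\approx 0.0379\), which the theorem rounds down to \(\frac{1}{27}\approx 0.0370\).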

16 Instance-Dependent Lower Bounds

Asymptotic Bounds

The minimax lower bound is a bit too conservative. There are reasonable policies that specialize in some (easier) bandit instances and achieve much smaller regret on them, though possibly performing badly on others.

Consistent: A policy \(\pi\) is consistent over a class of bandits \(E\) if for all \(p>0\) and \(v\in E\) it holds that \(\frac{R_n(\pi,v)}{n^p}\to 0\). That is, the regret grows slower than any polynomial in \(n\), e.g. logarithmically.

Let \(M\) be a set of distributions. For \(P\in M\), define \(d(P,\mu^*,M)=\inf_{P'\in M}\{D(P,P'):\mu(P')>\mu^*\}\), i.e. the smallest divergence between \(P\) and a \(P'\in M\) with mean larger than \(\mu^*\).

Theorem: For an unstructured class of bandits \(E=M_1\times\dots\times M_k\) and a consistent policy \(\pi\), it holds for all bandits \(v=(P_i)_{i\in[k]}\) that \[ \liminf_{n\to\infty} \frac{R_n(\pi,v)}{\log n}\ge c^*(v,E)=\sum_{i:\Delta_i>0}\frac{\Delta_i}{d(P_i,\mu^*,M_i)} \] where \(\mu^*\) is the mean of the optimal arm and \(\Delta_i\) is the suboptimality gap of arm \(i\).

For most classes, the lower bound \(c^*(v,E)\) can be attained, which motivates the notion of an asymptotically optimal policy on \(E\).
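For unit-variance Gaussian bandits, \(d(P_i,\mu^*,M_i)=\Delta_i^2/2\) (the cheapest way to push arm \(i\)'s mean above \(\mu^*\)), so \(c^*(v,E)=\sum_{i:\Delta_i>0} 2/\Delta_i\). A minimal sketch (the function name is ours):

```python
import numpy as np

def c_star_gaussian(mu):
    """c*(v, E) for unit-variance Gaussian bandits:
    d(P_i, mu*, M_i) = Delta_i^2 / 2, so each suboptimal arm
    contributes Delta_i / (Delta_i^2 / 2) = 2 / Delta_i."""
    mu = np.asarray(mu, dtype=float)
    gaps = np.max(mu) - mu
    return float(np.sum(2.0 / gaps[gaps > 0]))

print(c_star_gaussian([1.0, 0.5, 0.0]))  # 2/0.5 + 2/1.0 = 6.0
```

Note the inversion relative to the minimax picture: smaller gaps make an instance harder asymptotically, since each term scales like \(1/\Delta_i\).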

Finite-time bound

The above theorem only tells us about the limit, while in practice we face finite \(n\).

Lemma: Assume two bandits \(v=(P_i)\) and \(v'=(P'_i)\) differ only in the reward distribution of arm \(i\), and that \(i\) is uniquely optimal in \(v'\) while suboptimal in \(v\). Let \(\lambda=\mu_i(v')-\mu_i(v)\). Then for any policy \(\pi\) it holds that \[ E_{v\pi}[T_i(n)]\ge \frac{1}{D(P_i,P'_{i})}\left(\log\left(\frac{\min(\lambda-\Delta_i(v),\Delta_i(v))}{4}\right)+\log n-\log(R_n(v)+R_n(v'))\right) \] That is, for a low-regret policy the expected \(T_i\) is lower bounded, roughly, by \(\log n\) divided by \(D(P_i,P'_i)\).

This leads to the following theorem: Consider a Gaussian bandit \(v\) with suboptimality gaps \(\Delta_i\), and let \(E(v)=\{v':\mu_i(v')\in[\mu_i,\mu_i+2\Delta_i]\}\). Then for every consistent policy (with \(R_n\le Cn^p\)) and \(\epsilon\in(0,1)\), we have \[ R_n(\pi,v)\ge \frac{2}{(1+\epsilon)^2}\sum_{i:\Delta_i>0} \frac{(1-p)\log n+\log\left(\frac{\epsilon \Delta_i}{8C}\right)}{\Delta_i} \] The proof follows by plugging the lemma into the regret decomposition \(R_n=\sum_i \Delta_i E[T_i(n)]\).
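The bound is easy to evaluate for concrete instances. A minimal sketch (function name and the choice to clip negative log terms at zero are ours; for small \(n\) some terms of the bound are vacuous):

```python
import math

def finite_time_lower_bound(gaps, n, C=1.0, p=0.5, eps=0.5):
    """Evaluate (2/(1+eps)^2) * sum_i ((1-p)*log n + log(eps*Delta_i/(8C))) / Delta_i
    over suboptimal arms, clipping each arm's term at zero when the
    logarithmic part is still negative at this horizon n."""
    total = 0.0
    for d in gaps:
        if d <= 0:
            continue
        term = (1 - p) * math.log(n) + math.log(eps * d / (8 * C))
        total += max(term, 0.0) / d
    return 2 * total / (1 + eps) ** 2

print(finite_time_lower_bound([0.5, 1.0], n=10**6))
```

As expected from the \(\log n\) dependence, the bound grows as the horizon increases.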

17 High-Probability Lower Bounds

For bandits in this section, we assume \(\Delta_i\le 1\).

The regret \(R_n\) of a stochastic bandit is an expectation, so there is no randomness in it. Instead, we define the random pseudo-regret \(\bar R_n=\sum_i T_i(n)\Delta_i\). We say a policy is reasonable if for every bandit \(v\) it holds that \(R_n(\pi,v)\le B\sqrt{(k-1)n}\).

Then we have a high-probability lower bound: for any reasonable policy there exists a bandit such that \[ P\left(\bar R_n\ge \frac{1}{4}\min\left(n,\frac{1}{B}\sqrt{(k-1)n}\log\frac{1}{4\delta}\right)\right)\ge \delta \] The proof mirrors the minimax lower bound: start from a bandit with \(\mu=(\Delta,0,\dots)\), choose an arm \(i\) with minimal \(E[T_i]\), move to a second bandit with \(\mu'_i=2\Delta\), and bound the distance between the two via the divergence decomposition.

In detail, since the policy is reasonable, \(E[T_i(n)]\le \frac{B}{\Delta}\sqrt{\frac{n}{k-1}}\). Then \[ P(\bar R_n\ge \frac{\Delta n}{2})+P'(\bar R'_n\ge \frac{\Delta n}{2})\\ \ge P(T_i\ge \frac{n}{2})+P'(T_i<\frac{n}{2})\\ \ge \frac{1}{2}\exp(-D(P,P'))\ge \frac{1}{2}\exp\left(-2B\Delta \sqrt{\frac{n}{k-1}}\right) \] Choosing \(\Delta\) appropriately yields the result.
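The tuning of \(\Delta\) can be made explicit: setting \(\frac{1}{2}\exp(-2B\Delta\sqrt{n/(k-1)})=2\delta\) gives \(\Delta=\frac{1}{2B}\sqrt{\frac{k-1}{n}}\log\frac{1}{4\delta}\), and the event threshold \(\frac{\Delta n}{2}\) then matches the theorem's \(\frac{1}{4B}\sqrt{(k-1)n}\log\frac{1}{4\delta}\). A numeric sketch (function name is ours):

```python
import math

def high_prob_threshold(n, k, B, delta_prob):
    """Tune the gap so that (1/2)exp(-2*B*gap*sqrt(n/(k-1))) equals
    2*delta_prob, and return the resulting pseudo-regret level gap*n/2."""
    gap = math.sqrt((k - 1) / n) * math.log(1 / (4 * delta_prob)) / (2 * B)
    return gap * n / 2

n, k, B, dp = 10_000, 10, 1.0, 0.01
thr = high_prob_threshold(n, k, B, dp)
ref = 0.25 / B * math.sqrt((k - 1) * n) * math.log(1 / (4 * dp))
print(thr, ref)  # the two expressions agree
```

(The \(\min(n,\cdot)\) in the theorem handles the regime where the tuned \(\Delta\) would exceed the assumed gap range.)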