13 Lower Bounds

Sometimes we are more interested in lower bounds: how much regret must any reasonable policy suffer?

For a policy \(\pi\) and a set of environments \(\mathcal{E}\), define the worst-case regret \(R_n(\pi,\mathcal{E})=\sup_{v\in \mathcal{E}} R_n(\pi,v)\). For a set of policies \(\Pi\) it is then natural to define the minimax regret \(R_n^*(\Pi,\mathcal{E})=\inf_{\pi\in\Pi} R_n(\pi,\mathcal{E})\). That is, the regret of a policy that minimizes the worst-case regret.
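The two definitions above are a sup over environments followed by an inf over policies. A minimal sketch over finite sets, with entirely hypothetical policy names and regret values:

```python
# Hypothetical regret table R_n(pi, v) for two policies and two environments.
regret = {
    ("ucb", "env1"): 10.0, ("ucb", "env2"): 25.0,
    ("etc", "env1"): 5.0,  ("etc", "env2"): 40.0,
}
policies = ["ucb", "etc"]
envs = ["env1", "env2"]

def worst_case(pi):
    # R_n(pi, E) = sup over environments v of R_n(pi, v)
    return max(regret[(pi, v)] for v in envs)

# R_n^*(Pi, E) = inf over policies of the worst-case regret
minimax = min(worst_case(pi) for pi in policies)
print(minimax)  # -> 25.0: "ucb" (worst case 25.0) beats "etc" (worst case 40.0)
```

Note that "etc" is better on `env1` yet loses the minimax comparison: the sup is taken first, so only the worst environment of each policy matters.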

Though finding an exactly minimax policy is hard, we can find policies that are minimax optimal up to a constant factor. We already did this for Gaussian bandits, thanks to the \(\Omega(\sqrt{kn})\) lower bound on the minimax regret of Gaussian bandits, which we will prove later.

Let's sketch the argument via a stronger statement: for any policy \(\pi\), we can construct two \(k\)-armed Gaussian bandits \(P\) and \(P'\), with unit-variance arms \(N(\mu_i,1)\) and \(N(\mu'_i,1)\), such that \(\max \{R_n(\pi,P),R_n(\pi,P')\}=\Omega(\sqrt{kn})\).

We want the two bandits to be hard to distinguish, yet to require different behavior. First set \(\mu=(\Delta,0,0,\dots,0)\), so that \(R_n(\pi,P)=\Delta(n-E[T_1])\). Let \(i\neq 1\) be the arm among \(2,\dots,k\) that \(\pi\) plays least often in \(P\); then \(E[T_i]\le \frac{n}{k-1}\). Now set \(\mu'=(\Delta,0,\dots,2\Delta,\dots,0)\) with \(\mu'_i=2\Delta\), so that \(R_n(\pi,P')\ge \Delta E'[T_1]\), since every pull of arm 1 loses \(\Delta\) against arm \(i\) in \(P'\). Heuristically, when \(\Delta=\sqrt{\frac{1}{E[T_i]}}\ge \sqrt{\frac{k-1}{n}}\), the policy cannot tell the two bandits apart, so \(E[T_1]\approx E'[T_1]\). Then either \(E[T_1]\le n/2\), making \(R_n(\pi,P)\ge \Delta n/2\), or \(E'[T_1]\gtrsim n/2\), making \(R_n(\pi,P')\gtrsim \Delta n/2\). Either way we achieve a lower bound of \(\frac{1}{2}\sqrt{n(k-1)}\).
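The case split above can be checked numerically, assuming the heuristic \(E[T_1]\approx E'[T_1]\) holds exactly. Treating \(E[T_1]\) as a free parameter of the policy, no choice of it escapes the bound:

```python
import math

# Heuristic check of the two-bandit construction, under the (non-rigorous)
# assumption E[T_1] = E'[T_1]. With Delta = sqrt((k-1)/n), whatever number
# of arm-1 pulls the policy achieves, one of the two bandits punishes it.
def max_regret(n, k, expected_T1):
    delta = math.sqrt((k - 1) / n)
    regret_P = delta * (n - expected_T1)   # arm 1 is optimal in P
    regret_P_prime = delta * expected_T1   # pulling arm 1 costs Delta in P'
    return max(regret_P, regret_P_prime)

n, k = 10_000, 10
target = 0.5 * math.sqrt(n * (k - 1))      # claimed lower bound
worst = min(max_regret(n, k, t) for t in range(0, n + 1, 100))
# Even the best split of pulls cannot beat the bound (up to float rounding).
assert worst >= target - 1e-9
```

The minimum is attained at \(E[T_1]=n/2\), exactly where the two regrets balance.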

15 Minimax Lower Bound

Divergence Decomposition:

  • Consider two k-armed bandits \(v=(P_i)\) and \(v'=(P'_i)\), along with a fixed policy \(\pi\). Let \(\mathbb P_v=\mathbb P_{v,\pi}\) be the probability measure on the canonical bandit model induced by the \(n\)-round interaction of \(\pi\) with \(v\).

  • We have \[ D(\mathbb P_v,\mathbb P_{v'})=\sum_{i=1}^{k}E_v(T_i) D(P_i,P'_i) \]
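For unit-variance Gaussian arms, \(D(P_i,P'_i)=(\mu_i-\mu'_i)^2/2\), so the decomposition is a one-line sum. A small sketch, with hypothetical pull counts, applied to the construction of the previous section:

```python
# Divergence decomposition for unit-variance Gaussian bandits:
# D(P_v, P_v') = sum_i E_v[T_i] * KL(N(mu_i,1), N(mu'_i,1)),
# where KL(N(a,1), N(b,1)) = (a - b)^2 / 2.
def gaussian_kl(a, b):
    return (a - b) ** 2 / 2

def divergence_decomposition(expected_pulls, mu, mu_prime):
    return sum(t * gaussian_kl(a, b)
               for t, a, b in zip(expected_pulls, mu, mu_prime))

# Example: the construction from the previous section with k = 3.
delta = 0.1
mu = [delta, 0.0, 0.0]
mu_p = [delta, 0.0, 2 * delta]   # only arm i = 3 differs
pulls = [500, 300, 200]          # hypothetical E_v[T_i] values
d = divergence_decomposition(pulls, mu, mu_p)
# Only the changed arm contributes: 200 * (2*delta)^2 / 2 = 4.0
```

Because \(\mu\) and \(\mu'\) agree on every arm except \(i\), the whole divergence is carried by \(E_v[T_i]\), which is exactly why we pick \(i\) to be a rarely-played arm.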

Theorem: Consider the class of Gaussian bandits \(N(\mu,1)\). For any policy \(\pi\), there exists a mean vector \(\mu\) such that \(R_n(\pi,\mu)\ge \frac{1}{27}\sqrt{(k-1)n}\).

The idea of the proof was already sketched in the last section. The details use the divergence decomposition:

  • With \(\mu\) and \(\mu'\) as in the construction above, \(R_n(\pi,\mu)\ge \mathbb P_{\mu}(T_1\le n/2)\frac{n\Delta}{2}\) and \(R_n(\pi,\mu')\ge \mathbb P_{\mu'}(T_1> n/2)\frac{n\Delta}{2}\).
  • On the other hand, the Bretagnolle–Huber inequality shows that \(\mathbb P_{\mu}(T_1\le n/2)+\mathbb P_{\mu'}(T_1> n/2)\ge \frac{1}{2}\exp(-D(\mathbb P_{\mu},\mathbb P_{\mu'}))\).
  • Using the divergence decomposition and \(E_\mu[T_i]\le \frac{n}{k-1}\), we get \(D(\mathbb P_{\mu},\mathbb P_{\mu'})=2\Delta^2 E_\mu[T_i]\le \frac{2n\Delta^2}{k-1}\).
  • Substituting \(\Delta=\sqrt{(k-1)/(4n)}\) makes the exponent at most \(1/2\); combining the bounds via \(2\max(a,b)\ge a+b\) and \(\exp(-D)\ge \exp(-1/2)\) brings us the result.
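The chain of bullets above can be verified numerically. A sketch of the constant-chasing, assuming the exponent bound \(D\le 2n\Delta^2/(k-1)\):

```python
import math

# Chasing the constants: with Delta = sqrt((k-1)/(4n)) the divergence bound
# 2 n Delta^2 / (k-1) equals 1/2, and chaining the bullets gives
#   max regret >= (n Delta / 8) * exp(-1/2) = sqrt(n(k-1)) / 16 * exp(-1/2).
n, k = 10_000, 10
delta = math.sqrt((k - 1) / (4 * n))
kl_bound = 2 * n * delta ** 2 / (k - 1)        # = 1/2
lower_bound = (n * delta / 8) * math.exp(-kl_bound)
claimed = math.sqrt((k - 1) * n) / 27
# The constant (1/16) e^{-1/2} ~ 0.0379 indeed dominates 1/27 ~ 0.0370,
# which is where the 1/27 in the theorem comes from.
assert math.exp(-0.5) / 16 >= 1 / 27
assert lower_bound >= claimed
```

The factor 8 comes from the \(\frac{n\Delta}{2}\) in the regret bounds, the \(\frac{1}{2}\) in Bretagnolle–Huber, and halving \(a+b\) to reach \(\max(a,b)\).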