2026-01-21
In the following sessions, we consider stochastic bandits with finitely many arms, where the reward distribution of each arm is \(1\)-subgaussian.
6 Explore Then Commit Algo
Explore each arm \(m\) times, then commit to the arm with the highest empirical mean reward for the remaining \(n-mk\) rounds, where \(k\) is the number of arms and \(n\) is the horizon.
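A minimal sketch of ETC in Python, assuming a hypothetical environment callback `pull(i)` that returns one reward sample from arm `i` (the names here are illustrative, not from the notes):

```python
import numpy as np

def explore_then_commit(pull, k, n, m):
    """Pull each of the k arms m times, then commit to the arm with
    the highest empirical mean for the remaining n - m*k rounds.
    `pull(i)` is an assumed callback returning one reward of arm i."""
    # Exploration phase: m samples per arm.
    means = np.array([np.mean([pull(i) for _ in range(m)]) for i in range(k)])
    # Commit phase: play the empirically best arm for the rest of the horizon.
    best = int(np.argmax(means))
    total = sum(pull(best) for _ in range(n - m * k))
    return best, total
```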
Let \(\mu^*=\max_i \mu_i\) and \(\Delta_i=\mu^*-\mu_i\). Without loss of generality, we assume \(\mu_1=\mu^*\).
Theorem: \(R_n\le m\sum \Delta_i + (n-mk)\sum \Delta_i \exp(-\frac{m\Delta_i^2}{4})\).
Proof:
- The \(m\sum \Delta_i\) term is simply the regret of the exploration phase. We focus on the regret caused by committing to the wrong arm.
- The probability of incorrectly committing to arm \(i\) is at most the probability that \(\hat\mu_i-\hat\mu_1\ge 0\), where \(\hat\mu_i\) denotes the empirical mean of arm \(i\) after exploration.
- In other words, we only need to bound the probability that \(X_i=(\hat\mu_i-\mu_i)-(\hat \mu_1-\mu_1)\ge \Delta_i\).
- \(\hat\mu_i-\mu_i\) and \(\hat\mu_1-\mu_1\) are both \(1/\sqrt{m}\)-subgaussian, so their difference \(X_i\) is \(\sqrt{2/m}\)-subgaussian (variance proxies add).
- The subgaussian tail bound then yields the result; the computation is spelled out below.
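In detail: a \(\sigma\)-subgaussian variable \(X\) satisfies \(P[X\ge\varepsilon]\le\exp(-\frac{\varepsilon^2}{2\sigma^2})\), so with \(\sigma^2=2/m\) and \(\varepsilon=\Delta_i\),
\[
P[X_i\ge\Delta_i]\le\exp\Big(-\frac{\Delta_i^2}{2\cdot 2/m}\Big)=\exp\Big(-\frac{m\Delta_i^2}{4}\Big),
\]
which, multiplied by the gap \(\Delta_i\) and the \(n-mk\) commit rounds, gives the second term of the theorem.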
7 Upper Confidence Bound Algo
We know that \(P[\mu\ge \hat \mu +\sqrt{\frac{2\log(1/\delta)}{n}}]\le\delta\) for the empirical mean \(\hat\mu\) of \(n\) samples, so it is natural to take the right-hand side as the Upper Confidence Bound of an arm.
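This width comes from the subgaussian tail bound: the empirical mean of \(n\) samples of a \(1\)-subgaussian distribution is \(1/\sqrt{n}\)-subgaussian, so
\[
P[\hat\mu\le\mu-\varepsilon]\le\exp\Big(-\frac{n\varepsilon^2}{2}\Big),
\]
and setting the right-hand side equal to \(\delta\) and solving gives \(\varepsilon=\sqrt{\frac{2\log(1/\delta)}{n}}\).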
That leads to the upper confidence bound algorithm:
- Let \(T_i(t)\) denote the number of times arm \(i\) has been chosen up to time \(t\), and \(\hat\mu_i(t)\) its empirical mean. Define \(UCB_i(t,\delta)=\hat\mu_i(t)+\sqrt{\frac{2\log(1/\delta)}{T_i(t)}}\).
- In each round \(t\), play the arm \(\operatorname{argmax}_i UCB_i(t-1,\delta)\) (see the sketch below).
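A minimal sketch of this index rule in Python, under the same assumed `pull(i)` callback as before; each arm is played once first so that every index is defined:

```python
import math
import numpy as np

def ucb(pull, k, n, delta):
    """UCB with a fixed confidence level delta (e.g. delta = 1/n**2).
    `pull(i)` is an assumed callback returning one reward of arm i."""
    counts = np.zeros(k, dtype=int)  # T_i(t): times arm i has been chosen
    sums = np.zeros(k)               # running sum of rewards per arm
    for t in range(n):
        if t < k:
            arm = t  # initialization: play each arm once
        else:
            # index = empirical mean + exploration bonus
            index = sums / counts + np.sqrt(2 * math.log(1 / delta) / counts)
            arm = int(np.argmax(index))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums / np.maximum(counts, 1)
```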
Intuitively, choosing the argmax of the UCBs is sensible: an index can be large for two reasons, either the arm is under-explored (the bonus term is large) or it has a high empirical reward.
The parameter \(\delta\) is called the confidence level; it controls the balance between exploration and exploitation. The quantity \(UCB_i\) is called the index of arm \(i\).
Consider the following failure case: the index of the optimal arm falls below its true mean. This can lead to the optimal arm never being played again, causing linear regret. To control the probability of this event, \(\delta\) has to be rather small, \(O(\frac{1}{n})\) or below, to ensure convergence. The following proof gives a bound for \(\delta=\frac{1}{n^2}\).
Theorem 1: \[ R_n\le 3\sum \Delta_i+\sum_{\Delta_i>0} \frac{16\log n}{\Delta_i} \] Proof:
WLOG, \(A_1\) is the optimal arm.
Let \(X_{j,i}\) be the reward observed when arm \(i\) is chosen for the \(j\)-th time, and define \(\hat\mu_{i,s}=\frac{1}{s}\sum_{j=1}^{s} X_{j,i}\), the empirical mean of the first \(s\) plays of arm \(i\).
For each suboptimal arm \(i\), fix a constant \(u_i\) and let \(v_{i,u_i}=\hat \mu_{i,u_i}+\sqrt{\frac{2\log(1/\delta)}{u_i}}\). Define the good event \(G_i=\{\mu_1<\min_t UCB_1(t,\delta)\} \cap \{v_{i,u_i}<\mu_1\}\).
If \(G_i\) holds, arm \(i\) is chosen at most \(u_i\) times. Otherwise, consider the round in which \(i\) is chosen for the \((u_i+1)\)-th time: just before this round, the index of \(i\) equals \(v_{i,u_i}<\mu_1<UCB_1\), so \(i\) cannot be the arm chosen in this round, a contradiction.
On the other hand, the probability of \(G_i^c\) is small:
\(P[\mu_1>\min_t UCB_1(t,\delta)]\le \sum_t P[\mu_1>UCB_1(t,\delta)]\le n\delta\), by a union bound over the at most \(n\) rounds.
For another constant \(c\in[0,1]\) satisfying \(\Delta_i-\sqrt{\frac{2\log(1/\delta)}{u_i}}\ge c\Delta_i\) (eq *), we have \[ P[v_{i,u_i}\ge\mu_1]=P\Big[\hat \mu_{i,u_i}-\mu_i\ge\Delta_i-\sqrt{\frac{2\log(1/\delta)}{u_i}}\Big]\le P[\hat\mu_{i,u_i}-\mu_i\ge c\Delta_i]\le \exp\Big(-\frac{u_i c^2 \Delta_i^2}{2}\Big) \]
Thus the expected number of times arm \(i\) is chosen is at most \(u_i+n(n\delta+\exp(-\frac{u_i c^2 \Delta_i^2}{2}))\). Setting \(\delta=\frac{1}{n^2}\), it remains to choose \(c\) and \(u_i\). Taking \(u_i\) to satisfy eq * with equality, the bound becomes \(\frac{2\log n^2}{(1-c)^2\Delta_i^2}+1+n^{1-2c^2/(1-c)^2}\).
Choosing \(c=1/2\) and summing \(\Delta_i\, E[T_i(n)]\) over the suboptimal arms gives the result; the substitution is spelled out below.
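Concretely, \(c=1/2\) gives \(2c^2/(1-c)^2=2\) and \(u_i=\lceil \frac{16\log n}{\Delta_i^2}\rceil\), so
\[
E[T_i(n)] \le \Big\lceil \frac{16\log n}{\Delta_i^2}\Big\rceil + 1 + n^{-1} \le \frac{16\log n}{\Delta_i^2}+3,
\qquad
R_n=\sum_i \Delta_i\, E[T_i(n)] \le 3\sum_i \Delta_i+\sum_{\Delta_i>0}\frac{16\log n}{\Delta_i}.
\]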
Theorem 1 depends on \(\frac{1}{\Delta_i}\): when some suboptimality gap is small, the bound becomes vacuous. We therefore have another bound that does not involve \(\frac{1}{\Delta_i}\).
Theorem 2: \[ R_n\le 8\sqrt{nk\log n}+3\sum \Delta_i \]