2026-01-19
1.1 The language of bandits
Bandit problem:
Horizon \(n\): the learner plays \(n\) rounds.
Arms (actions): In each round, the learner first chooses an action \(A_t\in A\). A \(k\)-armed bandit is one where the number of actions is \(k\).
Environment: In each round, after the learner chooses the action, the environment reveals a reward \(X_t\in \mathbb R\).
The true environment lies in an environment class \(E\), but the learner does not know which environment it is facing.
The learner can only decide based on the history of past actions \(A\) and rewards \(X\).
Regret: For any policy \(\pi\), the regret of the learner relative to \(\pi\) is the difference between the expected total reward of \(\pi\) and the expected total reward of the learner. (Since the learner does not know the environment, it will generally collect less reward than a good policy.)
Worst-case regret: Since the regret depends on the environment, the worst-case regret is the maximum regret over all environments in the class.
Stochastic Bernoulli Bandits: There exists a vector \(\mu\in[0,1]^k\) such that \(X_t(a)\sim Bern(\mu_a)\).
Stochastic Stationary Bandits: The reward distribution depends only on the current action, not on the round or the history.
One example is the linear model \(X_t=\langle A_t,\theta\rangle+\eta_t\) where \(\eta_t\) is Gaussian noise; another is obtained by replacing \(Bern\) with a Gaussian in the Bernoulli case.
Adversarial Bandit: The environment is too chaotic to model stochastically, so we only measure regret against the best constant policy (the best fixed action in hindsight) instead of the best overall policy.
Limitation (cf. reinforcement learning / partial monitoring): In bandit algorithms, an action only affects the current reward, so no long-term planning is needed; problems that require planning for the future fall under reinforcement learning. If the reward is not fully observed, the problem falls under partial monitoring.
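To make the vocabulary above concrete, here is a minimal simulation sketch (assuming numpy; the three-armed Bernoulli environment, the toy follow-the-leader policy, and the horizon are illustrative choices, not from the notes): the learner picks \(A_t\) from the history only, receives \(X_t\sim Bern(\mu_{A_t})\), and we report the realised regret against always playing the best arm.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(mu, policy, n):
    """Play n rounds of a k-armed Bernoulli bandit with mean vector mu."""
    k = len(mu)
    actions, rewards = [], []
    for t in range(n):
        a = policy(t, actions, rewards, k)   # A_t depends only on the history
        x = rng.binomial(1, mu[a])           # X_t ~ Bern(mu_a)
        actions.append(a)
        rewards.append(x)
    return np.array(actions), np.array(rewards)

def follow_the_leader(t, actions, rewards, k):
    """Toy policy: try each arm once, then always play the arm with the best empirical mean."""
    if t < k:
        return t
    actions, rewards = np.array(actions), np.array(rewards)
    means = [rewards[actions == a].mean() for a in range(k)]
    return int(np.argmax(means))

mu = np.array([0.5, 0.6, 0.45])              # one environment from the Bernoulli class
n = 10_000
actions, rewards = run_bandit(mu, follow_the_leader, n)
regret = n * mu.max() - rewards.sum()        # realised regret vs. always playing the best arm
print(f"realised regret over n={n} rounds: {regret:.1f}")
```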
4 Stochastic Bandits
We have a collection of distributions \((P_a:a\in A)\), and \(X_t\sim P_{A_t}\).
The learner's goal is to maximize the total reward \(S_n=\sum_{t=1}^n X_t\). The default choice is to maximize \(E[S_n]\), but other properties such as the variance or the tails of \(S_n\) are also worth considering in some cases.
The learner often has partial information about \(\{P_a\}\). The environments consistent with this information form an environment class \(E\).
Unstructured Bandits: There exist sets of distributions \(M_a\) for each \(a\in A\) such that \(E=\prod_a M_a\). That is to say, the reward of action \(a\) carries no information about the distribution of any other action \(b\).
Parametric: The number of degrees of freedom needed to describe \(E\) is finite, as for Bernoulli bandits.
Structured Bandits:
Eg1. \(A=\{0,1\}\) and \(E=\{(Bern(p),Bern(1-p)):p\in[0,1]\}\). We can learn \(p\), and hence both arms, just by playing action \(0\).
Eg2. (Stochastic linear bandit) \(P_a=N(\langle a,\theta\rangle,1)\) with \(A\subset\mathbb R^d\) and \(\theta\in\mathbb R^d\). We can learn \(\theta\) by playing just \(d\) linearly independent actions (see the sketch below).
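A sketch of Eg2 (assuming numpy; the standard-basis action set, noise level, and sample counts are illustrative): play \(d\) linearly independent actions repeatedly and recover \(\theta\) by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 5
theta = rng.normal(size=d)        # unknown parameter of the linear bandit
actions = np.eye(d)               # d linearly independent actions (standard basis)
m = 200                           # play each action m times

# Rewards X = <a, theta> + eta with eta ~ N(0, 1)
A = np.repeat(actions, m, axis=0)                    # (d*m, d) design matrix of played actions
X = A @ theta + rng.normal(size=A.shape[0])          # noisy rewards
theta_hat, *_ = np.linalg.lstsq(A, X, rcond=None)    # least-squares estimate of theta
print("max |theta_hat - theta| =", np.abs(theta_hat - theta).max())
```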
To formally define regret, take a stochastic bandit instance \(v=(P_a)_{a\in A}\) and define \(\mu_a(v)=E_{X\sim P_a}[X]\) and \(\mu^*(v)=\max_a \mu_a(v)\). That is, \(\mu^*(v)\) is the expected per-round reward when always choosing the best action.
Then the regret of a learner's policy \(\pi\) is \(R_n(\pi,v)=n\mu^*(v)-E[\sum_{t=1}^n X_t]\). The learner's goal is to minimize the regret.
We hope to find a policy with sublinear regret, that is, \(\lim_{n\to\infty} \frac{R_n(\pi,v)}{n}=0\).
Action gap / immediate regret: For action \(a\), \(\Delta_a(v)=\mu^*(v)-\mu_a(v)\). The play count \(T_a(t)=\sum_{s=1}^t [A_s=a]\) is the number of times action \(a\) has been chosen up to round \(t\).
Regret Decomposition Lemma: Regret \(R_n=\sum_a \Delta_a E[T_a(n)]\).
Provided that everything is appropriately measurable, the analogous statement with the sum replaced by an integral, \(R_n=\int_A \Delta_a E[T_a(n)]\), also holds for uncountable action sets.
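A quick Monte Carlo check of the decomposition (assuming numpy; the Gaussian environment and the uniformly random policy are illustrative): estimate \(R_n\) directly and via \(\sum_a \Delta_a E[T_a(n)]\), and compare.

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, 0.5, 0.2])          # Gaussian arms with these means, unit variance
Delta = mu.max() - mu                    # action gaps
n, n_runs = 1_000, 2_000

lhs, counts = 0.0, np.zeros(len(mu))
for _ in range(n_runs):
    A = rng.integers(len(mu), size=n)            # uniformly random policy
    X = rng.normal(mu[A], 1.0)                   # rewards X_t ~ N(mu_{A_t}, 1)
    lhs += n * mu.max() - X.sum()                # regret of this run
    counts += np.bincount(A, minlength=len(mu))  # T_a(n) for this run

lhs /= n_runs                                    # Monte Carlo estimate of R_n
rhs = np.dot(Delta, counts / n_runs)             # sum_a Delta_a * E[T_a(n)]
print(f"simulated regret {lhs:.2f}  vs  decomposition {rhs:.2f}")
```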
## 5.3 Subgaussian Random Variables
Moment GF: \(M_X(\lambda)=E[\exp(\lambda X)]\).
A random variable is heavy-tailed if \(M_X(\lambda)=\infty\) for all \(\lambda>0\).
Cumulant GF: \(\psi_X(\lambda)=\log M_X(\lambda)\).
Subgaussian: A random variable \(X\) is \(\sigma\)-subgaussian if \(\psi_X(\lambda)\le \frac{1}{2}\lambda^2\sigma^2\) for all \(\lambda\in\mathbb R\).
Theorem: A \(\sigma\)-subgaussian random variable \(X\) satisfies \(P(X\ge \epsilon)\le \exp(-\frac{\epsilon^2}{2\sigma^2})\) for all \(\epsilon\ge 0\).
An equivalent form: for any \(\delta\in(0,1)\), \(P\big(X \ge \sqrt{2\sigma^2\log(1/\delta)}\big)\le \delta\).
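A small numeric sanity check (assuming only the Python standard library; a standard normal is 1-subgaussian, so its exact tail should sit below the bound):

```python
import math

sigma = 1.0                      # a standard normal is 1-subgaussian
for eps in (0.5, 1.0, 2.0, 3.0):
    exact = 0.5 * math.erfc(eps / math.sqrt(2))   # P(X >= eps) for X ~ N(0, 1)
    bound = math.exp(-eps**2 / (2 * sigma**2))    # subgaussian tail bound
    print(f"eps={eps}: exact tail {exact:.4f} <= bound {bound:.4f}")

# delta-form: with probability at least 1 - delta, X < sqrt(2 sigma^2 log(1/delta))
delta = 0.05
print("confidence radius at delta=0.05:", math.sqrt(2 * sigma**2 * math.log(1 / delta)))
```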
Lemma: Suppose \(X\) is \(\sigma\)-subgaussian and \(X_1,X_2\) are independent \(\sigma_1\)- and \(\sigma_2\)-subgaussian random variables. Then:
- \(E[X]=0\), \(V[X]\le \sigma^2\).
- \(cX\) is \(|c|\sigma\)-subgaussian for any \(c\in\mathbb R\).
- \(X_1+X_2\) is \(\sqrt{\sigma_1^2+\sigma_2^2}\) subgaussian.
This shows that if \(X_1,\dots,X_n\) are independent and each \(X_t-\mu\) is \(\sigma\)-subgaussian, then \(\hat\mu-\mu=\frac{1}{n}\sum_{t=1}^n (X_t-\mu)\) is \(\frac{\sigma}{\sqrt{n}}\)-subgaussian, so \(P(\hat\mu-\mu\ge\epsilon)\le\exp(-\frac{n\epsilon^2}{2\sigma^2})\).
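A sketch checking this concentration bound empirically (assuming numpy; the Gaussian samples, mean, and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 0.3, 1.0
n, n_runs, eps = 100, 100_000, 0.2

samples = rng.normal(mu, sigma, size=(n_runs, n))  # each X_t - mu is sigma-subgaussian
mu_hat = samples.mean(axis=1)                      # empirical means; mu_hat - mu is sigma/sqrt(n)-subgaussian
empirical = np.mean(mu_hat - mu >= eps)            # observed frequency of the deviation event
bound = np.exp(-n * eps**2 / (2 * sigma**2))       # exp(-n eps^2 / (2 sigma^2))
print(f"empirical tail {empirical:.4f} <= bound {bound:.4f}")
```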