18 Contextual Bandit

In reality, we do not only observe action-reward pairs: before each decision we also observe some context, and we would like our decisions to depend on it.

One bandit per context

In this section we mainly focus on the adversarial bandit, where we observe a secretly chosen context \(c_t\) before we select the action distribution \(P_t\).

We want to compare our policy with the best fixed context-dependent policy. That is, \[ R_n=E\left[\sum_{c\in C} \max_{i\in[k]} \sum_{t:c_t=c}(x_{t,i}-X_t)\right] \] If \(C\) is finite, then a natural policy is to run a separate Exp3 instance for each \(c\), which gives a regret of \[ R_n=\sum_{c\in C} R_{n,c}\le \sum_{c\in C} 2\sqrt{\#\{t:c_t=c\}\, k\log k} \] When all contexts appear equally often, this yields the worst-case upper bound \(2\sqrt{|C|nk\log k}\).
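As a concrete sketch of this strategy (Python; the reward range \([0,1]\) and the loss-based importance-weighted estimator are assumptions carried over from the Exp4 section below), the one-bandit-per-context policy is just a dictionary of independent Exp3 instances, created lazily per observed context:

```python
import math
import random
from collections import defaultdict

class Exp3:
    """Exp3 with loss-based importance-weighted estimates; rewards in [0, 1]."""
    def __init__(self, k, eta):
        self.k, self.eta = k, eta
        self.s = [0.0] * k                     # cumulative reward estimates

    def distribution(self):
        m = max(self.s)                        # shift for numerical stability
        w = [math.exp(self.eta * (si - m)) for si in self.s]
        z = sum(w)
        return [wi / z for wi in w]

    def act(self):
        p = self.distribution()
        return random.choices(range(self.k), weights=p)[0], p

    def update(self, arm, x, p):
        # estimate: X~_j = 1 - [j = arm] (1 - x) / p_arm  (unbiased for x_j)
        for j in range(self.k):
            self.s[j] += 1.0 - (1.0 - x) * (j == arm) / p[arm]

# one bandit per context: lazily create an independent Exp3 per context
bandits = defaultdict(lambda: Exp3(k=3, eta=0.1))
```

Each context key gets its own learner, so the regret decomposes across contexts exactly as in the bound above.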

Bandit with expert advice & Exp4

In the bandit-with-expert-advice framework, we have \(m\) experts making recommendations. That is, for each timestep there is a matrix \(E^{(t)}\in[0,1]^{m\times k}\) whose rows are the action distributions recommended by the experts. In each timestep, we compute the learner's action distribution from \(E^{(t)}\) and receive our own reward. The regret is measured against the best single expert.

In this setting we have the Exp4 algorithm:

  • Maintain a distribution \(Q_t\) over the experts, initialized to \((\frac{1}{m},\dots,\frac{1}{m})\).

  • For each round, sample \(A_t\) from \(P_t=Q_tE^{(t)}\) and receive reward \(X_t=x_{t,A_t}\).

  • Calculate the loss-based importance-weighted estimates \(\tilde X_{t,i}=1-\frac{[A_t=i]}{P_{t,i}+\gamma}(1-X_t)\).

  • Update \(Q_t\) by \(Q_{t+1,j}\propto Q_{t,j}\exp(\eta \tilde S_{t,j})\), where \(\tilde S_{t,j}=\sum_{i=1}^k E^{(t)}_{j,i}\tilde X_{t,i}\) is expert \(j\)'s estimated reward.

When \(\gamma=0\) and \(\eta=\sqrt{\frac{2\log m}{nk}}\), Exp4 satisfies \(R_n\le \sqrt{2nk\log m}\).
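A minimal sketch of one Exp4 round in Python (NumPy; the reward callback and the round-by-round interface are illustrative assumptions, not part of the text):

```python
import numpy as np

def exp4_step(Q, E, reward_of, eta, gamma=0.0, rng=None):
    """One round of Exp4.

    Q         : distribution over the m experts, shape (m,)
    E         : expert advice for this round, shape (m, k); rows are
                action distributions
    reward_of : maps the sampled action to its observed reward in [0, 1]
    """
    if rng is None:
        rng = np.random.default_rng()
    P = Q @ E                                # learner's action distribution
    a = int(rng.choice(len(P), p=P))
    x = reward_of(a)
    # loss-based importance-weighted reward estimates for all actions
    Xhat = np.ones(len(P))
    Xhat[a] = 1.0 - (1.0 - x) / (P[a] + gamma)
    # propagate estimates to the experts, then exponential-weights update
    S = E @ Xhat
    Qn = Q * np.exp(eta * (S - S.max()))     # shift for numerical stability
    return Qn / Qn.sum(), a, x
```

With two deterministic experts where expert 0 always recommends the rewarding arm, repeated calls concentrate \(Q\) on expert 0.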

In the earlier one-bandit-per-context setting, we can view the class of experts as all functions \(C\to [k]\), so that \(m=k^{|C|}\) and \(\log m=|C|\log k\) in the Exp4 bound.

One way to measure the agreement between experts is \(E^*_t=\sum_{s=1}^t \sum_{i=1}^k \max_{j} E^{(s)}_{j,i}\). Clearly \(E^*_n\le n\min(k,m)\), so \(E^*_n/n\) can be read as an effective number of experts (or actions). In fact, setting \(\eta_t=\sqrt{\frac{\log m}{E^*_t}}\) achieves a bound of \(O(\sqrt{E^*_n\log m})\).

19 Stochastic Linear Bandits

In round \(t\), the learner observes the context \(c_t\) and takes an action \(a_t\). The difference is that the reward is \(X_t=r(c_t,a_t)+\eta_t\), where the noise \(\eta_t\) is assumed to be 1-subgaussian conditioned on all previous actions, contexts, and rewards.

A simple assumption that captures this extra structure is a feature map \(\psi:C\times [k]\to \mathbb R^d\) with \(r(c,a)=\langle\theta,\psi(c,a)\rangle\).

This leads to a simplified model in which \(X_t=\langle\theta,A_t\rangle+\eta_t\) and \(A_t\in \mathcal A_t\subset \mathbb R^d\). In this setting we define the random pseudo-regret \(\tilde R_n=\sum_{t=1}^n \max_{a\in \mathcal A_t}\langle\theta,a-A_t\rangle\) and the regret \(R_n=E[\tilde R_n]\). Setting \(\mathcal A_t=\{\psi(c_t,i):i\in[k]\}\) recovers the contextual linear model above.

We now generalize UCB to the stochastic linear bandit; the resulting algorithm is called LinUCB.

The main idea of LinUCB is to select a confidence set \(\mathcal C_t\subset \mathbb R^d\) each round, calculate \(UCB_t(a)=\max_{\theta\in\mathcal C_t}\langle \theta,a\rangle\), and play the action with the maximal UCB as \(A_t\).

The confidence set needs to contain the true \(\theta\) with high probability, while remaining small enough that the resulting upper confidence bounds are tight and easy to compute.

We can estimate \(\theta\) from previous observations by regularized least squares: \(\hat\theta_t=\arg\min_{\theta} \sum_{s\in [t]} (X_s-\langle\theta,A_s\rangle)^2+\lambda \|\theta\|^2\). The solution is available in closed form: \[ \hat\theta_t=V_t^{-1}\sum_{s\in[t]} A_sX_s,\qquad V_t=\lambda I+\sum_{s\in[t]} A_sA_s^{T} \] Then we simply choose \(\mathcal C_t=\{\theta\mid \|\theta-\hat \theta_t\|^2_{V_t}\le \beta_t\}\), an ellipsoid centered at \(\hat \theta_t\) whose axes are the eigenvectors of \(V_t\). Since the confidence guarantee has to hold over more and more rounds as the process goes on, \(\beta_t\) needs to increase accordingly.

When \(\|a\|\le L\) for all actions, the regret of LinUCB satisfies \(\hat R_n\le \sqrt{8dn\beta_n\log\big(\frac{d\lambda+nL^2}{d\lambda}\big)}\). Letting \(m\) be an upper bound on \(\|\theta\|\), we choose \(\sqrt{\beta_n}=\sqrt{\lambda}\,m+\sqrt{2\log (\frac{1}{\delta})+d\log\big(\frac{d\lambda+nL^2}{d\lambda}\big)}\). With \(\delta=1/n\), this gives a bound of \(O(d\sqrt n \log (nL))\).
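A sketch of LinUCB in Python (NumPy). The only ingredient beyond the text is the closed form of the maximum over the ellipsoid, \(UCB_t(a)=\langle\hat\theta_t,a\rangle+\sqrt{\beta_t}\,\|a\|_{V_t^{-1}}\); the fixed \(\beta\) passed per round is a simplifying assumption:

```python
import numpy as np

class LinUCB:
    """LinUCB with the ridge estimate and an ellipsoidal confidence set.

    The maximum of <theta, a> over the ellipsoid C_t has the closed form
    UCB_t(a) = <theta_hat, a> + sqrt(beta) * ||a||_{V^{-1}}.
    """
    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)      # V_t = lambda I + sum_s A_s A_s^T
        self.b = np.zeros(d)          # sum_s A_s X_s

    def choose(self, actions, beta):
        Vinv = np.linalg.inv(self.V)
        theta = Vinv @ self.b         # ridge estimate theta_hat_t
        ucb = [a @ theta + np.sqrt(beta * (a @ Vinv @ a)) for a in actions]
        return int(np.argmax(ucb))

    def update(self, a, x):
        self.V += np.outer(a, a)
        self.b += x * a
```

On a noiseless toy instance with \(\theta=(1,0)\) and the two basis vectors as actions, the exploration bonus shrinks along the played direction while the estimate converges to \(\theta\).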

21 Optimal Design for Least Squares Estimators

For easier calculation, we first drop the regularization term. This gives \(\hat \theta=V^{-1}\sum_{t=1}^n a_tX_t\) where \(V=\sum_{t=1}^n a_ta^T_t\).

This chapter develops a strategy that shrinks the confidence set \(\mathcal C_t\) as quickly as possible. Rather than a fixed sequence of actions, what we need is a design: a distribution \(\pi(a)\) over actions (the probability of playing action \(a\)) that gathers the most information. We then define \(V(\pi)=\sum_{a\in \mathcal A}\pi(a)\,aa^T\), which captures how much information the design collects in each direction.

G-optimal:

  • Define the maximum uncertainty across available actions: \(g(\pi)=\max_{a\in\mathcal A} ||a||_{V(\pi)^{-1}}^2\).
  • Minimize \(g(\pi)\).

D-optimal:

  • Work directly on \(V(\pi)\). Define \(d(\pi)=\log \det V(\pi)\).
  • Maximize \(d(\pi)\).

To ensure confidence width \(\le \epsilon\) with probability \(1-\delta\), it suffices to take each action \(n_a=\lceil\frac{2\pi(a)g(\pi)}{\epsilon^2}\log(\frac{1}{\delta})\rceil\) times, which bounds the total number of actions needed by \(n\le |{\rm Supp}(\pi)|+\frac{2g(\pi)}{\epsilon^2}\log(\frac{1}{\delta})\) (the ceilings account for the \(|{\rm Supp}(\pi)|\) term).

Kiefer-Wolfowitz theorem: G-optimality is equivalent to D-optimality, and the minimal value is \(g(\pi^*)=d\), the dimension of the linear space. Furthermore, there exists an optimal design with \(|{\rm Supp}(\pi)|\le d(d+1)/2\).
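An optimal design can be computed numerically. The sketch below (Python/NumPy) uses the standard Frank-Wolfe / Fedorov-Wynn exchange scheme, which is not described in the text and is my assumption here: repeatedly mix the design toward the currently most uncertain action, with the exact line-search step for \(\log\det\):

```python
import numpy as np

def squared_norms(pi, A):
    """||a||^2_{V(pi)^{-1}} for every row a of A, with V(pi) = sum_a pi(a) a a^T."""
    V = A.T @ (pi[:, None] * A)
    Vinv = np.linalg.inv(V)
    return np.einsum('ij,jk,ik->i', A, Vinv, A)

def optimal_design(A, iters=2000):
    """Approximate the G/D-optimal design over the rows of A by Frank-Wolfe."""
    n, d = A.shape
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        norms = squared_norms(pi, A)
        i = int(np.argmax(norms))
        g = norms[i]                          # g(pi) >= d always holds
        gamma = (g - d) / (d * (g - 1.0))     # exact step maximizing log det
        pi *= 1.0 - gamma
        pi[i] += gamma
    return pi
```

By Kiefer-Wolfowitz, \(\max_a\|a\|^2_{V(\pi)^{-1}}\) should approach \(d\); for instance, for the standard basis of \(\mathbb R^d\) the uniform design is already optimal and the value is exactly \(d\).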

22 Stochastic Linear Bandits with Finitely Many Arms

When the action set is fixed and finite, \(\mathcal A_t=\mathcal A\) with \(|\mathcal A|=k\), it is possible to reduce the regret to \(O(\sqrt{nd\log (k\log n / \delta)})\).

We have the following algorithm (initially \(l=1\)):

  1. Find a G-optimal design \(\pi_l\in P(\mathcal A)\). Let \(\epsilon=2^{-l}\).
  2. Set \(T_l(a)=\lceil\frac{2d\pi_l(a)}{\epsilon^2}\log(\frac{kl(l+1)}{\delta})\rceil\) and play action \(a\) exactly \(T_l(a)\) times.
  3. Compute the least-squares estimate \(\hat \theta_l=V_l^{-1}\sum_{t\in T_l} A_tX_t\), where \(T_l\) denotes the timesteps of the current phase and \(V_l=\sum_{a\in\mathcal A} T_l(a)\,aa^T\).
  4. Eliminate arms with low UCB: \(\mathcal A\leftarrow \{a\in\mathcal A:\max_{b\in \mathcal A} \langle\hat \theta_l,b-a\rangle \le 2\epsilon\}\), increment \(l\), and go back to 1.

Taking \(\delta=1/n\) gives a regret of \(O(\sqrt{nd\log (nk)})\).
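A toy sketch of the phase structure (Python/NumPy). I specialize to arms = standard basis vectors of \(\mathbb R^d\), where the G-optimal design over the active arms is uniform with \(g(\pi)\) equal to the number of active arms, so the least-squares estimate reduces to per-arm empirical means; the Gaussian noise, phase count, and direct sampling of each empirical mean are also assumptions of this sketch:

```python
import numpy as np

def phased_elimination(theta, n_phases=8, delta=0.1, seed=0):
    """Phased elimination with arms e_1, ..., e_d (standard basis)."""
    rng = np.random.default_rng(seed)
    active = list(range(len(theta)))
    for l in range(1, n_phases + 1):
        eps = 2.0 ** -l
        k = len(active)
        # per-arm pull count: 2 g(pi) pi(a) / eps^2 * log(k l (l+1) / delta),
        # with g(pi) = k and pi(a) = 1/k on the active set
        pulls = int(np.ceil(2.0 / eps**2 * np.log(k * l * (l + 1) / delta)))
        # the empirical mean of `pulls` unit-variance observations is
        # distributed as N(theta_a, 1/pulls); sample it directly
        means = {a: theta[a] + rng.normal(scale=1.0 / np.sqrt(pulls))
                 for a in active}
        best = max(means.values())
        # keep only arms whose estimated gap is at most 2 * eps
        active = [a for a in active if best - means[a] <= 2 * eps]
    return active
```

Arms with a large gap disappear in early phases, and only the optimal arm survives once \(2\epsilon\) falls below the smallest gap.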

23 Stochastic Linear Bandits with Sparsity

The difference here is that \(\theta\) is assumed to have many zero entries; in practice this means many features are redundant. Let \(\|\theta\|_0=\sum_i [\theta_i\neq 0]\). The sparsity assumption is that \(\|\theta\|_0\le m_0\) and \(\|\theta\|_2\le m_2\). We also assume \(\langle\theta,a\rangle\le 1\) for all actions.

Hypercube case

A simple case is when the action set is the hypercube \([-1,1]^{d}\), in which case the optimal action is \(a=\operatorname{sgn}(\theta)\), taken coordinatewise.

Assume each coordinate of \(A_t\) is sampled independently from a Rademacher distribution. Then \(E[A_{t,i}X_t]=\theta_i\) and \(V[A_{t,i}X_t]=E[A_{t,i}^2X_t^2]-\theta_i^2\le E[(\langle\theta,A_t\rangle+\eta_t)^2]\le 2\), using \(A_{t,i}^2=1\), \(\langle\theta,a\rangle\le 1\), and the 1-subgaussian noise.

Our policy for the hypercube case is simple: treat each dimension independently. Each coordinate is then a two-armed \(\pm 1\) bandit, for which we can use an explore-then-commit style method.

  1. Set \(E_{1,i}=1\) and \(C_i=\mathbb R\). The goal is to shrink each \(C_i\) into \((0,+\infty)\) or \((-\infty,0)\).
  2. For each \(i\), if \(0\in C_i\) sample \(A_{t,i}\) uniformly from \(\{\pm 1\}\); otherwise set \(A_{t,i}=\operatorname{sgn}(C_i)\), the common sign of the interval.
  3. Construct the exploration counts \(T_i(t)=\sum_{s=1}^t E_{s,i}\) and the empirical estimates \(\hat \theta_{t,i}=\frac{\sum_s E_{s,i}A_{s,i}X_{s}}{T_i(t)}\).
  4. Construct the confidence interval \(W_{t,i}=2\sqrt{(\frac{1}{T_i(t)}+\frac{1}{T_i^2(t)})\log (n\sqrt{2T_i(t)+1})}\) and set \(C_i\leftarrow(\hat \theta_{t,i}-W_{t,i},\hat \theta_{t,i}+W_{t,i})\).
  5. \(E_{t+1,i}=E_{t,i}\times [0\in C_i]\), i.e. keep exploring coordinate \(i\) only while its sign is still uncertain.

The regret satisfies \(R_n=O(\|\theta\|_1+\|\theta\|_0\sqrt{n\log n})\).
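The five steps above can be sketched as follows (Python/NumPy; the horizon argument, unit Gaussian noise, and the vectorized bookkeeping are assumptions of this sketch, and \(\operatorname{sgn}(C_i)\) is implemented via the sign of the interval's endpoints). The sketch returns the committed signs rather than tracking regret:

```python
import numpy as np

def setc(theta, n, seed=0):
    """Explore-then-commit per coordinate on the hypercube [-1, 1]^d."""
    rng = np.random.default_rng(seed)
    d = len(theta)
    lo = np.full(d, -np.inf)          # C_i = (lo_i, hi_i), initially all of R
    hi = np.full(d, np.inf)
    T = np.zeros(d)                   # exploration counts T_i(t)
    S = np.zeros(d)                   # running sums of A_{t,i} X_t while exploring
    for _ in range(n):
        exploring = (lo < 0) & (hi > 0)                 # E_{t,i} = [0 in C_i]
        a = np.where(exploring,
                     rng.choice([-1.0, 1.0], size=d),   # Rademacher exploration
                     np.sign(hi))                       # committed sign sgn(C_i)
        x = theta @ a + rng.normal()                    # reward with unit noise
        T += exploring
        S += exploring * a * x
        t_safe = np.maximum(T, 1)                       # avoid 0/0 before a pull
        m = S / t_safe                                  # estimates theta_hat_{t,i}
        W = 2 * np.sqrt((1 / t_safe + 1 / t_safe**2)
                        * np.log(n * np.sqrt(2 * T + 1)))
        lo = np.where(exploring, m - W, lo)             # freeze C_i once committed
        hi = np.where(exploring, m + W, hi)
    return np.sign(hi)
```

On a small instance the committed signs match \(\operatorname{sgn}(\theta)\), as intended: coordinates with larger \(|\theta_i|\) commit earlier, which is what drives the \(\|\theta\|_0\)-dependent regret bound above.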