Bayesian regret of Thompson sampling #
This file provides a Bayesian regret upper bound (integral_regret_le) for Thompson sampling under
the assumption (among others) that it has the correct prior over environments.
The Bayesian regret upper bound relies on a clipped upper confidence bound whose definition and properties are also given in this file.
Main definitions #
ucb A R l u σ2 δ a n: clipped upper confidence bound used in the regret analysis of Thompson sampling for a sequence of actionsA : ℕ → Ω → Fin K, rewardsR : ℕ → Ω → ℝ, reward lower boundl : ℝ, reward upper boundu : ℝ, sub-Gaussian variance proxyσ2 : ℝ, confidence parameterδ : ℝ, actiona : Fin K, and timen : ℕ.ucb' n h l u σ2 δ a: clipped upper confidence bound for actiona : Fin Kat timen : ℕgiven the historyh : Iic n → Fin K × ℝ(rather than the entire sequences of actions and rewards).
Main results #
integral_regret_le: if Thompson sampling has the correct prior over environments and every environment hasKactions, each of which has a corresponding reward betweenlanduthat is sub-Gaussian with variance proxyσ2after its mean is subtracted, then the Bayesian regret at timenis at most(2 * K + 1) * (u - l) + 8 * √(σ2 * K * n * Real.log n).
Clipped upper confidence bound used in the regret analysis of Thompson sampling.
Equations
- Bandits.ClippedUCB.ucb A R l u σ2 δ a n ω = if Learning.pullCount A a n ω = 0 then u else max l (min u (Learning.empMean A R a n ω + √(2 * σ2 * Real.log (1 / δ) / ↑(Learning.pullCount A a n ω))))
Instances For
Clipped upper confidence bound (history-based version).
Equations
- Bandits.ClippedUCB.ucb' n h l u σ2 δ a = if Learning.pullCount' n h a = 0 then u else max l (min u (Learning.empMean' n h a + √(2 * σ2 * Real.log (1 / δ) / ↑(Learning.pullCount' n h a))))
Instances For
If Thompson sampling has the correct prior over environments and every environment has K
actions, each of which has a corresponding reward between l and u that is sub-Gaussian with
variance proxy σ2 after its mean is subtracted, then the Bayesian regret at time n is at most
(2 * K + 1) * (u - l) + 8 * √(σ2 * K * n * Real.log n).