Bayesian regret of Thompson sampling

This file provides a Bayesian regret upper bound (integral_regret_le) for Thompson sampling under the assumption (among others) that it has the correct prior over environments.

The Bayesian regret upper bound relies on a clipped upper confidence bound whose definition and properties are also given in this file.

Main definitions

ucb A R l u σ2 δ a n : clipped upper confidence bound used in the regret analysis of Thompson sampling for a sequence of actions A : ℕ → Ω → Fin K, rewards R : ℕ → Ω → ℝ, reward lower bound l : ℝ, reward upper bound u : ℝ, sub-Gaussian variance proxy σ2 : ℝ, confidence parameter δ : ℝ, action a : Fin K, and time n : ℕ.
ucb' n h l u σ2 δ a : clipped upper confidence bound for action a : Fin K at time n : ℕ given the history h : Iic n → Fin K × ℝ (rather than the entire sequences of actions and rewards).

Main results

integral_regret_le: if Thompson sampling has the correct prior over environments and every environment has K actions, each of which has a corresponding reward between l and u that is sub-Gaussian with variance proxy σ2 after its mean is subtracted, then the Bayesian regret at time n is at most (2 * K + 1) * (u - l) + 8 * √(σ2 * K * n * Real.log n).

noncomputable def Bandits.ClippedUCB.ucb {K : ℕ} {Ω : Type u_1} (A : ℕ → Ω → Fin K) (R : ℕ → Ω → ℝ) (l u σ2 δ : ℝ) (a : Fin K) (n : ℕ) (ω : Ω) :

ℝ

Clipped upper confidence bound used in the regret analysis of Thompson sampling.

Equations

Bandits.ClippedUCB.ucb A R l u σ2 δ a n ω = if Learning.pullCount A a n ω = 0 then u else max l (min u (Learning.empMean A R a n ω + √(2 * σ2 * Real.log (1 / δ) / ↑(Learning.pullCount A a n ω))))
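In conventional notation, the equation above can be read as follows. This is only a restatement: $N_a(n)$ and $\hat\mu_a(n)$ are shorthand introduced here (they are not names from the library) for Learning.pullCount A a n ω and Learning.empMean A R a n ω, and $\sigma^2$ stands for the parameter σ2.

```latex
\[
\mathrm{ucb}_a(n) =
\begin{cases}
u & \text{if } N_a(n) = 0,\\[4pt]
\max\Bigl(l,\ \min\Bigl(u,\ \hat\mu_a(n)
  + \sqrt{\dfrac{2\sigma^2 \log(1/\delta)}{N_a(n)}}\Bigr)\Bigr)
  & \text{otherwise.}
\end{cases}
\]
```

Both branches lie in $[l, u]$ whenever $l \le u$, which is the content of ucb_mem_Icc.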

@[simp]

theorem Bandits.ClippedUCB.ucb_zero {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {a : Fin K} {ω : Ω} :

ucb A R l u σ2 δ a 0 ω = u

theorem Bandits.ClippedUCB.ucb_mem_Icc {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} (h : l ≤ u) {a : Fin K} {n : ℕ} {ω : Ω} :

ucb A R l u σ2 δ a n ω ∈ Set.Icc l u

theorem Bandits.ClippedUCB.measurable_ucb {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [MeasurableSpace Ω] {a : Fin K} {n : ℕ} (hA : ∀ (t : ℕ), Measurable (A t)) (hR : ∀ (t : ℕ), Measurable (R t)) :

Measurable (ucb A R l u σ2 δ a n)

theorem Bandits.ClippedUCB.measurable_uncurry_ucb_comp {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [MeasurableSpace Ω] (hA : ∀ (t : ℕ), Measurable (A t)) (hR : ∀ (t : ℕ), Measurable (R t)) {f : Ω → Fin K} (hf : Measurable f) {g : Ω → ℕ} (hg : Measurable g) :

Measurable fun (ω : Ω) => ucb A R l u σ2 δ (f ω) (g ω) ω

theorem Bandits.ClippedUCB.integrable_uncurry_ucb_comp {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [MeasurableSpace Ω] (hA : ∀ (t : ℕ), Measurable (A t)) (hR : ∀ (t : ℕ), Measurable (R t)) {f : Ω → Fin K} (hf : Measurable f) {g : Ω → ℕ} (hg : Measurable g) {P : MeasureTheory.Measure Ω} [MeasureTheory.IsFiniteMeasure P] :

MeasureTheory.Integrable (fun (ω : Ω) => ucb A R l u σ2 δ (f ω) (g ω) ω) P

noncomputable def Bandits.ClippedUCB.ucb' {K : ℕ} (n : ℕ) (h : ↥(Finset.Iic n) → Fin K × ℝ) (l u σ2 δ : ℝ) (a : Fin K) :

ℝ

Clipped upper confidence bound (history-based version).

Equations

Bandits.ClippedUCB.ucb' n h l u σ2 δ a = if Learning.pullCount' n h a = 0 then u else max l (min u (Learning.empMean' n h a + √(2 * σ2 * Real.log (1 / δ) / ↑(Learning.pullCount' n h a))))

theorem Bandits.ClippedUCB.measurable_uncurry_ucb' {K : ℕ} {l u σ2 δ : ℝ} {n : ℕ} :

Measurable fun (p : (↥(Finset.Iic n) → Fin K × ℝ) × Fin K) => ucb' n p.1 l u σ2 δ p.2

theorem Bandits.ClippedUCB.ucb_succ_eq_ucb' {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {a : Fin K} {n : ℕ} {ω : Ω} :

ucb A R l u σ2 δ a (n + 1) ω = ucb' n (Learning.history A R n ω) l u σ2 δ a

theorem Bandits.ClippedUCB.sum_ucb_sub_mean_le {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {n : ℕ} {ω : Ω} (μ : Fin K → ℝ) (hμ : ∀ (a : Fin K), μ a ∈ Set.Icc l u) (hi : l ≤ u) (hc : ∀ s < n, Learning.pullCount A (A s ω) s ω ≠ 0 → Learning.empMean A R (A s ω) s ω - μ (A s ω) < √(2 * σ2 * Real.log (1 / δ) / ↑(Learning.pullCount A (A s ω) s ω))) :

∑ s ∈ Finset.range n, (ucb A R l u σ2 δ (A s ω) s ω - μ (A s ω)) ≤ (u - l) * ↑K + 4 * √(2 * σ2 * Real.log (1 / δ) * ↑K * ↑n)
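The shape of this bound can be recovered from a standard pen-and-paper argument, sketched below. It is not claimed to be the formalized proof. The notation is introduced here for illustration: $c = 2\sigma^2\log(1/\delta)$ is assumed nonnegative, $A_s$ is the action played at time $s$, $N_s$ is the pull count of $A_s$ before time $s$, and $N_a(n)$ is the number of pulls of action $a$ before time $n$.

```latex
% Rounds with N_s = 0: ucb = u and mu >= l, so the summand is at most u - l.
% Each action has at most one such round, giving at most (u - l) K in total.
%
% Rounds with N_s >= 1: either the bound is clipped to l <= mu (summand <= 0),
% or ucb <= empMean + sqrt(c / N_s), and hypothesis hc gives
% empMean - mu < sqrt(c / N_s). In both cases the summand is < 2 sqrt(c / N_s).
\begin{align*}
\sum_{s < n,\ N_s \ge 1} 2\sqrt{\frac{c}{N_s}}
  &\le \sum_{a} \sum_{j=1}^{N_a(n)} 2\sqrt{\frac{c}{j}}
   \le \sum_{a} 4\sqrt{c\,N_a(n)}
   && \Bigl(\textstyle\sum_{j=1}^{m} j^{-1/2} \le 2\sqrt{m}\Bigr) \\
  &\le 4\sqrt{c\,K \sum_{a} N_a(n)}
   = 4\sqrt{c\,K\,n}
   && \text{(Cauchy--Schwarz, } \textstyle\sum_a N_a(n) = n\text{)}.
\end{align*}
```

Adding the two contributions gives $(u - l)K + 4\sqrt{2\sigma^2\log(1/\delta)\,K\,n}$, matching the right-hand side of the theorem.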

theorem Bandits.ClippedUCB.integral_sum_range_actionMean_bestAction_sub_ucb_bestAction_le {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] {E : Ω → 𝓔} {Q : MeasureTheory.Measure 𝓔} {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {alg : Learning.Algorithm (Fin K) ℝ} (h : Learning.IsBayesAlgEnvSeq Q κ alg E A R P) (hlu : l ≤ u) (hm : ∀ (e : 𝓔) (a : Fin K), ∫ (x : ℝ), id x ∂κ (e, a) ∈ Set.Icc l u) (hσ2 : 0 < σ2) (hs : ∀ (e : 𝓔) (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂κ (e, a)) ⟨σ2, ⋯⟩ (κ (e, a))) (hδ : 0 < δ) (n : ℕ) :

∫ (x : Ω), (fun (ω : Ω) => ∑ t ∈ Finset.range n, (Learning.IsBayesAlgEnvSeq.actionMean κ E (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) ω - ucb A R l u σ2 δ (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) t ω)) x ∂P ≤ (u - l) * (↑n - 1) * ↑n * δ

theorem Bandits.ClippedUCB.integral_sum_range_ucb_action_sub_actionMean_action_le {K : ℕ} {l u σ2 δ : ℝ} {Ω : Type u_1} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] {E : Ω → 𝓔} {Q : MeasureTheory.Measure 𝓔} {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {alg : Learning.Algorithm (Fin K) ℝ} (h : Learning.IsBayesAlgEnvSeq Q κ alg E A R P) (hlu : l ≤ u) (hm : ∀ (e : 𝓔) (a : Fin K), ∫ (x : ℝ), id x ∂κ (e, a) ∈ Set.Icc l u) (hσ2 : 0 < σ2) (hs : ∀ (e : 𝓔) (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂κ (e, a)) ⟨σ2, ⋯⟩ (κ (e, a))) (hδ : 0 < δ) (n : ℕ) :

∫ (x : Ω), (fun (ω : Ω) => ∑ t ∈ Finset.range n, (ucb A R l u σ2 δ (A t ω) t ω - Learning.IsBayesAlgEnvSeq.actionMean κ E (A t ω) ω)) x ∂P ≤ (u - l) * ↑K + 4 * √(2 * σ2 * Real.log (1 / δ) * ↑K * ↑n) + (u - l) * ↑K * (↑n - 1) * ↑n * δ

theorem Bandits.TS.integral_ucb_action_eq_integral_ucb_bestAction {K : ℕ} [Nonempty (Fin K)] {l u σ2 δ : ℝ} {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {E : Ω → 𝓔} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (n : ℕ) :

∫ (x : Ω), (fun (ω : Ω) => ClippedUCB.ucb A R l u σ2 δ (A n ω) n ω) x ∂P = ∫ (x : Ω), (fun (ω : Ω) => ClippedUCB.ucb A R l u σ2 δ (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) n ω) x ∂P
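This identity is the usual posterior-sampling property of Thompson sampling. A hedged sketch of the standard argument, which may differ from the formalized proof, is given below. The notation is introduced here: $U_n(a)$ abbreviates ucb A R l u σ2 δ a n ω, $A^*$ abbreviates the best action, and $H_n$ denotes the history of actions and rewards before time $n$. The sketch assumes that, conditionally on $H_n$, the played action $A_n$ and the best action $A^*$ have the same distribution, and it uses that $U_n(\cdot)$ depends on $\omega$ only through $H_n$ (compare ucb_zero and ucb_succ_eq_ucb').

```latex
\begin{align*}
\mathbb{E}\bigl[U_n(A_n)\bigr]
  &= \mathbb{E}\Bigl[\,\mathbb{E}\bigl[U_n(A_n) \mid H_n\bigr]\Bigr] \\
  &= \mathbb{E}\Bigl[\,\mathbb{E}\bigl[U_n(A^*) \mid H_n\bigr]\Bigr]
   = \mathbb{E}\bigl[U_n(A^*)\bigr].
\end{align*}
```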

theorem Bandits.TS.integral_regret_eq_add {K : ℕ} [Nonempty (Fin K)] {l u σ2 δ : ℝ} {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {E : Ω → 𝓔} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (hm : ∀ (e : 𝓔) (a : Fin K), ∫ (x : ℝ), id x ∂κ (e, a) ∈ Set.Icc l u) (n : ℕ) :

∫ (x : Ω), Learning.IsBayesAlgEnvSeq.regret κ E A n x ∂P = ∫ (x : Ω), (fun (ω : Ω) => ∑ t ∈ Finset.range n, (Learning.IsBayesAlgEnvSeq.actionMean κ E (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) ω - ClippedUCB.ucb A R l u σ2 δ (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) t ω)) x ∂P + ∫ (x : Ω), (fun (ω : Ω) => ∑ t ∈ Finset.range n, (ClippedUCB.ucb A R l u σ2 δ (A t ω) t ω - Learning.IsBayesAlgEnvSeq.actionMean κ E (A t ω) ω)) x ∂P
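The decomposition can be understood as adding and subtracting the clipped upper confidence bound inside the regret. The sketch below uses notation introduced here, and assumes that the regret at time $n$ is $\sum_{t<n}\bigl(\mu(A^*) - \mu(A_t)\bigr)$, where $\mu(a)$ abbreviates actionMean κ E a ω, $A^*$ the best action, and $U_t(a)$ the value ucb A R l u σ2 δ a t ω.

```latex
\begin{align*}
\mathbb{E}\Bigl[\sum_{t<n} \bigl(\mu(A^*) - \mu(A_t)\bigr)\Bigr]
  &= \mathbb{E}\Bigl[\sum_{t<n} \bigl(\mu(A^*) - U_t(A^*)\bigr)\Bigr]
   + \mathbb{E}\Bigl[\sum_{t<n} \bigl(U_t(A_t) - \mu(A_t)\bigr)\Bigr] \\
  &\quad + \sum_{t<n} \Bigl(\mathbb{E}\bigl[U_t(A^*)\bigr]
   - \mathbb{E}\bigl[U_t(A_t)\bigr]\Bigr).
\end{align*}
```

The last sum vanishes term by term by integral_ucb_action_eq_integral_ucb_bestAction, leaving the two integrals on the right-hand side of the theorem.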

theorem Bandits.TS.integral_regret_le {K : ℕ} [Nonempty (Fin K)] {l u σ2 : ℝ} {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {E : Ω → 𝓔} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (hlu : l ≤ u) (hm : ∀ (e : 𝓔) (a : Fin K), ∫ (x : ℝ), id x ∂κ (e, a) ∈ Set.Icc l u) (hσ2 : 0 < σ2) (hs : ∀ (e : 𝓔) (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂κ (e, a)) ⟨σ2, ⋯⟩ (κ (e, a))) (n : ℕ) :

∫ (x : Ω), Learning.IsBayesAlgEnvSeq.regret κ E A n x ∂P ≤ (2 * ↑K + 1) * (u - l) + 8 * √(σ2 * ↑K * ↑n * Real.log ↑n)

If Thompson sampling has the correct prior over environments and every environment has K actions, each of which has a corresponding reward between l and u that is sub-Gaussian with variance proxy σ2 after its mean is subtracted, then the Bayesian regret at time n is at most (2 * K + 1) * (u - l) + 8 * √(σ2 * K * n * Real.log n).
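The constants in this bound are consistent with combining integral_regret_eq_add with the two ClippedUCB integral bounds at the confidence level $\delta = 1/n^2$. That choice of $\delta$ is an assumption made here for illustration. The statement itself has no δ parameter, which suggests that some value is fixed inside the proof, but the page does not say which. For $n \ge 1$, the arithmetic works out as follows.

```latex
\begin{align*}
\log(1/\delta) &= 2\log n, \\
4\sqrt{2\sigma^2 \log(1/\delta)\,K\,n}
  &= 4\sqrt{4\,\sigma^2 K\,n \log n}
   = 8\sqrt{\sigma^2 K\,n \log n}, \\
(u-l)\,(n-1)\,n\,\delta &= (u-l)\,\frac{n-1}{n} \le u-l, \\
(u-l)\,K\,(n-1)\,n\,\delta &= (u-l)\,K\,\frac{n-1}{n} \le (u-l)\,K.
\end{align*}
% Summing the bound for the best-action term and the bound for the
% played-action term:
\[
(u-l) + (u-l)K + 8\sqrt{\sigma^2 K\,n\log n} + (u-l)K
  = (2K+1)(u-l) + 8\sqrt{\sigma^2 K\,n\log n}.
\]
```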

LeanMachineLearning.Online.Bandit.Algorithms.Regret.BayesRegretTS