Documentation

LeanMachineLearning.Online.Bandit.Algorithms.Regret.BayesRegretTS

Bayesian regret of Thompson sampling #

This file provides a Bayesian regret upper bound (integral_regret_le) for Thompson sampling under the assumption (among others) that it has the correct prior over environments.

The Bayesian regret upper bound relies on a clipped upper confidence bound whose definition and properties are also given in this file.

Main definitions #

Main results #

noncomputable def Bandits.ClippedUCB.ucb {K : } {Ω : Type u_1} (A : ΩFin K) (R : Ω) (l u σ2 δ : ) (a : Fin K) (n : ) (ω : Ω) :

Clipped upper confidence bound used in the regret analysis of Thompson sampling.

Equations
Instances For
    @[simp]
    theorem Bandits.ClippedUCB.ucb_zero {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} {a : Fin K} {ω : Ω} :
    ucb A R l u σ2 δ a 0 ω = u
    theorem Bandits.ClippedUCB.ucb_mem_Icc {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} (h : l u) {a : Fin K} {n : } {ω : Ω} :
    ucb A R l u σ2 δ a n ω Set.Icc l u
    theorem Bandits.ClippedUCB.measurable_ucb {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} [MeasurableSpace Ω] {a : Fin K} {n : } (hA : ∀ (t : ), Measurable (A t)) (hR : ∀ (t : ), Measurable (R t)) :
    Measurable (ucb A R l u σ2 δ a n)
    theorem Bandits.ClippedUCB.measurable_uncurry_ucb_comp {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} [MeasurableSpace Ω] (hA : ∀ (t : ), Measurable (A t)) (hR : ∀ (t : ), Measurable (R t)) {f : ΩFin K} (hf : Measurable f) {g : Ω} (hg : Measurable g) :
    Measurable fun (ω : Ω) => ucb A R l u σ2 δ (f ω) (g ω) ω
    theorem Bandits.ClippedUCB.integrable_uncurry_ucb_comp {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} [MeasurableSpace Ω] (hA : ∀ (t : ), Measurable (A t)) (hR : ∀ (t : ), Measurable (R t)) {f : ΩFin K} (hf : Measurable f) {g : Ω} (hg : Measurable g) {P : MeasureTheory.Measure Ω} [MeasureTheory.IsFiniteMeasure P] :
    MeasureTheory.Integrable (fun (ω : Ω) => ucb A R l u σ2 δ (f ω) (g ω) ω) P
    noncomputable def Bandits.ClippedUCB.ucb' {K : } (n : ) (h : (Finset.Iic n)Fin K × ) (l u σ2 δ : ) (a : Fin K) :

    Clipped upper confidence bound (history-based version).

    Equations
    Instances For
      theorem Bandits.ClippedUCB.measurable_uncurry_ucb' {K : } {l u σ2 δ : } {n : } :
      Measurable fun (p : ((Finset.Iic n)Fin K × ) × Fin K) => ucb' n p.1 l u σ2 δ p.2
      theorem Bandits.ClippedUCB.ucb_succ_eq_ucb' {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} {a : Fin K} {n : } {ω : Ω} :
      ucb A R l u σ2 δ a (n + 1) ω = ucb' n (Learning.history A R n ω) l u σ2 δ a
      theorem Bandits.ClippedUCB.sum_ucb_sub_mean_le {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} {n : } {ω : Ω} (μ : Fin K) ( : ∀ (a : Fin K), μ a Set.Icc l u) (hi : l u) (hc : s < n, Learning.pullCount A (A s ω) s ω 0Learning.empMean A R (A s ω) s ω - μ (A s ω) < (2 * σ2 * Real.log (1 / δ) / (Learning.pullCount A (A s ω) s ω))) :
      sFinset.range n, (ucb A R l u σ2 δ (A s ω) s ω - μ (A s ω)) (u - l) * K + 4 * (2 * σ2 * Real.log (1 / δ) * K * n)
      theorem Bandits.ClippedUCB.integral_sum_range_actionMean_bestAction_sub_ucb_bestAction_le {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} [Nonempty (Fin K)] [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] {E : Ω𝓔} {Q : MeasureTheory.Measure 𝓔} {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) } [ProbabilityTheory.IsMarkovKernel κ] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {alg : Learning.Algorithm (Fin K) } (h : Learning.IsBayesAlgEnvSeq Q κ alg E A R P) (hlu : l u) (hm : ∀ (e : 𝓔) (a : Fin K), (x : ), id x κ (e, a) Set.Icc l u) (hσ2 : 0 < σ2) (hs : ∀ (e : 𝓔) (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ) => x - (x : ), id x κ (e, a)) σ2, (κ (e, a))) ( : 0 < δ) (n : ) :
      (x : Ω), (fun (ω : Ω) => tFinset.range n, (Learning.IsBayesAlgEnvSeq.actionMean κ E (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) ω - ucb A R l u σ2 δ (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) t ω)) x P (u - l) * (n - 1) * n * δ
      theorem Bandits.ClippedUCB.integral_sum_range_ucb_action_sub_actionMean_action_le {K : } {l u σ2 δ : } {Ω : Type u_1} {A : ΩFin K} {R : Ω} [Nonempty (Fin K)] [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] {E : Ω𝓔} {Q : MeasureTheory.Measure 𝓔} {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) } [ProbabilityTheory.IsMarkovKernel κ] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {alg : Learning.Algorithm (Fin K) } (h : Learning.IsBayesAlgEnvSeq Q κ alg E A R P) (hlu : l u) (hm : ∀ (e : 𝓔) (a : Fin K), (x : ), id x κ (e, a) Set.Icc l u) (hσ2 : 0 < σ2) (hs : ∀ (e : 𝓔) (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ) => x - (x : ), id x κ (e, a)) σ2, (κ (e, a))) ( : 0 < δ) (n : ) :
      (x : Ω), (fun (ω : Ω) => tFinset.range n, (ucb A R l u σ2 δ (A t ω) t ω - Learning.IsBayesAlgEnvSeq.actionMean κ E (A t ω) ω)) x P (u - l) * K + 4 * (2 * σ2 * Real.log (1 / δ) * K * n) + (u - l) * K * (n - 1) * n * δ
      theorem Bandits.TS.integral_ucb_action_eq_integral_ucb_bestAction {K : } [Nonempty (Fin K)] {l u σ2 δ : } {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) } [ProbabilityTheory.IsMarkovKernel κ] {E : Ω𝓔} {A : ΩFin K} {R : Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (n : ) :
      (x : Ω), (fun (ω : Ω) => ClippedUCB.ucb A R l u σ2 δ (A n ω) n ω) x P = (x : Ω), (fun (ω : Ω) => ClippedUCB.ucb A R l u σ2 δ (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) n ω) x P
      theorem Bandits.TS.integral_regret_eq_add {K : } [Nonempty (Fin K)] {l u σ2 δ : } {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) } [ProbabilityTheory.IsMarkovKernel κ] {E : Ω𝓔} {A : ΩFin K} {R : Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (hm : ∀ (e : 𝓔) (a : Fin K), (x : ), id x κ (e, a) Set.Icc l u) (n : ) :
      (x : Ω), Learning.IsBayesAlgEnvSeq.regret κ E A n x P = (x : Ω), (fun (ω : Ω) => tFinset.range n, (Learning.IsBayesAlgEnvSeq.actionMean κ E (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) ω - ClippedUCB.ucb A R l u σ2 δ (Learning.IsBayesAlgEnvSeq.bestAction κ E ω) t ω)) x P + (x : Ω), (fun (ω : Ω) => tFinset.range n, (ClippedUCB.ucb A R l u σ2 δ (A t ω) t ω - Learning.IsBayesAlgEnvSeq.actionMean κ E (A t ω) ω)) x P
      theorem Bandits.TS.integral_regret_le {K : } [Nonempty (Fin K)] {l u σ2 : } {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) } [ProbabilityTheory.IsMarkovKernel κ] {E : Ω𝓔} {A : ΩFin K} {R : Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (hlu : l u) (hm : ∀ (e : 𝓔) (a : Fin K), (x : ), id x κ (e, a) Set.Icc l u) (hσ2 : 0 < σ2) (hs : ∀ (e : 𝓔) (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ) => x - (x : ), id x κ (e, a)) σ2, (κ (e, a))) (n : ) :
      (x : Ω), Learning.IsBayesAlgEnvSeq.regret κ E A n x P (2 * K + 1) * (u - l) + 8 * (σ2 * K * n * Real.log n)

      If Thompson sampling has the correct prior over environments and every environment has K actions, each of which has a corresponding reward between l and u that is sub-Gaussian with variance proxy σ2 after its mean is subtracted, then the Bayesian regret at time n is at most (2 * K + 1) * (u - l) + 8 * √(σ2 * K * n * Real.log n).