Thompson Sampling

This file defines the Thompson sampling algorithm, which samples each action according to its probability of being optimal under the posterior over environments given the history so far.

Main definitions

tsAlgorithm hK Q κ: a Thompson sampling algorithm with actions in Fin K given hK : 0 < K, a prior distribution over parameters Q : Measure 𝓔, and a Markov kernel κ : Kernel (𝓔 × Fin K) ℝ. This kernel defines how a parameter e : 𝓔 gives rise to a stationary environment: stationaryEnv (κ.sectR e) : Environment (Fin K) ℝ.

Main results

hasCondDistrib_action : if Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.

noncomputable def Bandits.TS.policy {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] (hK : 0 < K) (Q : MeasureTheory.Measure 𝓔) [MeasureTheory.IsProbabilityMeasure Q] (κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ) [ProbabilityTheory.IsMarkovKernel κ] (n : ℕ) :

ProbabilityTheory.Kernel (↥(Finset.Iic n) → Fin K × ℝ) (Fin K)

The Thompson sampling policy samples an action according to its probability of being optimal under the posterior over environments given the history so far. The posterior is computed under the uniform algorithm (Learning.uniformAlgorithm) rather than under Thompson sampling itself, which avoids a circular definition.

Equations

Bandits.TS.policy hK Q κ n = (Learning.IT.bayesTrajMeasurePosterior Q κ Learning.uniformAlgorithm n).map (Learning.IsBayesAlgEnvSeq.bestAction κ id)
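
As a rough illustration of the equation above, the following Python sketch computes the same kind of pushforward in a much simpler setting. It is not the library's definition: it assumes a finite parameter set, Bernoulli rewards with success probabilities strictly between 0 and 1, and a best action chosen by largest mean reward with ties going to the lowest index. The names best_action, posterior_over_params and ts_policy are hypothetical.

```python
def best_action(means_row):
    """Index of an action with the largest mean reward (ties go to the lowest index)."""
    return max(range(len(means_row)), key=lambda a: means_row[a])


def posterior_over_params(prior, means, history):
    """Posterior over a finite parameter set, assuming Bernoulli rewards.

    prior:   list of prior weights, one per parameter
    means:   means[e][a] = P(reward = 1 | parameter e, action a), assumed in (0, 1)
    history: list of (action, reward) pairs with reward in {0, 1}
    """
    weights = list(prior)
    for action, reward in history:
        for e in range(len(weights)):
            p = means[e][action]
            # Multiply by the likelihood of the observed reward under parameter e.
            weights[e] *= p if reward == 1 else 1.0 - p
    total = sum(weights)
    return [w / total for w in weights]


def ts_policy(prior, means, history):
    """Push the posterior forward through best_action.

    Returns, for each action, the posterior probability that it is the best one.
    """
    post = posterior_over_params(prior, means, history)
    probs = [0.0] * len(means[0])
    for e, weight in enumerate(post):
        probs[best_action(means[e])] += weight
    return probs
```

For example, with prior [0.5, 0.5] and means [[0.8, 0.2], [0.2, 0.8]], an empty history gives each action probability 0.5, while observing reward 1 after action 0 shifts the weights to roughly 0.8 and 0.2. In this finite sketch only reward likelihoods enter the update, because the probabilities with which past actions were chosen do not depend on the parameter and cancel in Bayes' rule.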

instance Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {hK : 0 < K} {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {n : ℕ} :

ProbabilityTheory.IsMarkovKernel (TS.policy hK Q κ n)

noncomputable def Bandits.TS.initialPolicy {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] (hK : 0 < K) (Q : MeasureTheory.Measure 𝓔) (κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ) :

MeasureTheory.Measure (Fin K)

The initial action is sampled according to its probability of being optimal under the prior over environments.

Equations

Bandits.TS.initialPolicy hK Q κ = MeasureTheory.Measure.map (Learning.IsBayesAlgEnvSeq.bestAction κ id) Q
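
Read informally, the equation above is a pushforward of the prior. Writing $a^*$ as shorthand for Learning.IsBayesAlgEnvSeq.bestAction κ id (this notation is an assumption made here, not the library's), and assuming $a^*$ is measurable, the initial action distribution $p_0$ assigns to each action the prior mass of the parameters for which that action is best:

```latex
p_0(\{a\}) \;=\; (a^*_{\#} Q)(\{a\}) \;=\; Q\bigl(\{\, e \in \mathcal{E} : a^*(e) = a \,\}\bigr),
\qquad a \in \mathrm{Fin}\ K .
```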

instance Bandits.instIsProbabilityMeasureFinInitialPolicy {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] {hK : 0 < K} {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} :

MeasureTheory.IsProbabilityMeasure (TS.initialPolicy hK Q κ)

noncomputable def Bandits.tsAlgorithm {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] (hK : 0 < K) (Q : MeasureTheory.Measure 𝓔) [MeasureTheory.IsProbabilityMeasure Q] (κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ) [ProbabilityTheory.IsMarkovKernel κ] :

Learning.Algorithm (Fin K) ℝ

The Thompson sampling algorithm with actions in Fin K, where Q : Measure 𝓔 is a prior distribution over parameters, and κ : Kernel (𝓔 × Fin K) ℝ is a Markov kernel that defines the stationary environment stationaryEnv (κ.sectR e) that corresponds to a parameter e : 𝓔.

At every time n, the Thompson sampling policy uses the posterior over the parameters given the history up to time n to derive the probability of each action being optimal. The next action, at time n + 1, is sampled according to these probabilities.

Equations

Bandits.tsAlgorithm hK Q κ = { policy := Bandits.TS.policy hK Q κ, h_policy := ⋯, p0 := Bandits.TS.initialPolicy hK Q κ, hp0 := ⋯ }
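
The structure above pairs an initial distribution p0 with a per-step policy. A minimal Python sketch of the resulting interaction loop, under simplifying assumptions that are not part of the library, might look as follows: the parameter set is finite, rewards are Bernoulli with success probabilities strictly between 0 and 1, and the best action is the one with the largest mean reward (ties to the lowest index). The function names and the true_param argument are hypothetical. Sampling a parameter from the posterior and playing its best action is one way to draw an action from the pushforward distribution.

```python
import random


def best_action(means_row):
    """Index of an action with the largest mean reward (ties go to the lowest index)."""
    return max(range(len(means_row)), key=lambda a: means_row[a])


def run_thompson_sampling(prior, means, true_param, horizon, rng):
    """Simulate Thompson sampling on a finite-parameter Bernoulli bandit.

    prior:      list of prior weights over parameters (assumed positive, summing to 1)
    means:      means[e][a] = P(reward = 1 | parameter e, action a), assumed in (0, 1)
    true_param: index of the parameter that generates the rewards
    """
    weights = list(prior)  # current posterior; equals the prior before any data
    history = []
    for _ in range(horizon):
        # Sample a parameter from the posterior, then play its best action.
        # On the first step this draws from the prior, as initialPolicy does.
        e = rng.choices(range(len(weights)), weights=weights)[0]
        action = best_action(means[e])
        # The environment draws a Bernoulli reward from the true parameter.
        reward = 1 if rng.random() < means[true_param][action] else 0
        # Bayes update: reweight each parameter by the reward likelihood.
        for i in range(len(weights)):
            p = means[i][action]
            weights[i] *= p if reward == 1 else 1.0 - p
        total = sum(weights)
        weights = [w / total for w in weights]
        history.append((action, reward))
    return history, weights
```

A call such as run_thompson_sampling([0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]], 0, 200, random.Random(0)) returns the list of (action, reward) pairs together with the final posterior weights.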

theorem Bandits.TS.hasCondDistrib_action {K : ℕ} [Nonempty (Fin K)] {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {E : Ω → 𝓔} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {κ : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel κ] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q κ (tsAlgorithm hK Q κ) E A R P) (n : ℕ) :

ProbabilityTheory.HasCondDistrib (A (n + 1)) (Learning.history A R n) 𝓛[Learning.IsBayesAlgEnvSeq.bestAction κ E | Learning.history A R n; P] P

If Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.
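
In informal notation, the statement can be read as follows. Here $H_n$ stands for Learning.history A R n and $a^*(E)$ for Learning.IsBayesAlgEnvSeq.bestAction κ E; both symbols are shorthand introduced for this note, and the precise meaning of HasCondDistrib is the one given by the library's definition.

```latex
\mathcal{L}\bigl(A_{n+1} \mid H_n\bigr) \;=\; \mathcal{L}\bigl(a^*(E) \mid H_n\bigr)
\quad P\text{-almost surely},
\qquad\text{i.e.}\qquad
P\bigl(A_{n+1} = a \mid H_n\bigr) \;=\; P\bigl(a^*(E) = a \mid H_n\bigr)
\quad\text{for every } a \in \mathrm{Fin}\ K .
```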

LeanMachineLearning.Online.Bandit.Algorithms.TS