Documentation

LeanMachineLearning.Online.Bandit.Algorithms.TS

Thompson Sampling #

This file defines the Thompson sampling algorithm. This algorithm samples an action according to its probability of being optimal under the posterior over environments given the history so far.

Main definitions #

Main results #

noncomputable def Bandits.TS.policy {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] (hK : 0 < K) (Q : MeasureTheory.Measure 𝓔) [MeasureTheory.IsProbabilityMeasure Q] (Îș : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ) [ProbabilityTheory.IsMarkovKernel Îș] (n : ℕ) :

The Thompson sampling policy samples an action according to its probability of being optimal under the posterior over environments given the history so far. The posterior under a uniform algorithm is used to avoid a circular definition.

Equations
Instances For
    noncomputable def Bandits.TS.initialPolicy {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] (hK : 0 < K) (Q : MeasureTheory.Measure 𝓔) (Îș : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ) :

    The initial action is sampled according to its probability of being optimal under the prior over environments.

    Equations
    Instances For
      noncomputable def Bandits.tsAlgorithm {K : ℕ} {𝓔 : Type u_1} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] (hK : 0 < K) (Q : MeasureTheory.Measure 𝓔) [MeasureTheory.IsProbabilityMeasure Q] (Îș : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ) [ProbabilityTheory.IsMarkovKernel Îș] :

      The Thompson sampling algorithm with actions in Fin K, where Q : Measure 𝓔 is a prior distribution over parameters, and Îș : Kernel (𝓔 × Fin K) ℝ is a Markov kernel that defines the stationary environment stationaryEnv (Îș.sectR e) that corresponds to a parameter e : 𝓔.

      At every time n, the Thompson sampling policy uses the posterior over the parameters given the history up to time n to derive the probability of each action being optimal. The action for time n is sampled according to these probabilities.

      Equations
      Instances For
        theorem Bandits.TS.hasCondDistrib_action {K : ℕ} [Nonempty (Fin K)] {Ω : Type u_1} [MeasurableSpace Ω] {𝓔 : Type u_2} [MeasurableSpace 𝓔] [StandardBorelSpace 𝓔] [Nonempty 𝓔] {E : Ω → 𝓔} {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} {Q : MeasureTheory.Measure 𝓔} [MeasureTheory.IsProbabilityMeasure Q] {Îș : ProbabilityTheory.Kernel (𝓔 × Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel Îș] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q Îș (tsAlgorithm hK Q Îș) E A R P) (n : ℕ) :

        If Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.