LeanMachineLearning exposition

2.7.ย Online.Bandit.Algorithms.TS๐Ÿ”—

Thompson Sampling

This file defines the Thompson sampling algorithm. This algorithm samples an action according to its probability of being optimal under the posterior over environments given the history so far.

Main definitions

  • tsAlgorithm hK Q ฮบ: a Thompson sampling algorithm with actions in Fin K given hK : 0 < K, a prior distribution over parameters Q : Measure ๐“”, and a Markov kernel ฮบ : Kernel (๐“” ร— Fin K) โ„. This kernel defines how a parameter e : ๐“” gives rise to a stationary environment: stationaryEnv (ฮบ.sectR e) : Environment (Fin K) โ„.

Main results

  • hasCondDistrib_action : if Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.

Module LeanMachineLearning.Online.Bandit.Algorithms.TS contains 6 exposed declarations.

policy๐Ÿ”—

DefinitionBandits.TS.policy

The Thompson sampling policy samples an action according to its probability of being optimal under the posterior over environments given the history so far. The posterior under a uniform algorithm is used to avoid a circular definition.

๐Ÿ”—def
Bandits.TS.policy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] (hK : 0 < K) (Q : MeasureTheory.Measure ๐“”) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„) [ProbabilityTheory.IsMarkovKernel ฮบ] (n : โ„•) : ProbabilityTheory.Kernel (โ†ฅ(Finset.Iic n) โ†’ Fin K ร— โ„) (Fin K)
Bandits.TS.policy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] (hK : 0 < K) (Q : MeasureTheory.Measure ๐“”) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„) [ProbabilityTheory.IsMarkovKernel ฮบ] (n : โ„•) : ProbabilityTheory.Kernel (โ†ฅ(Finset.Iic n) โ†’ Fin K ร— โ„) (Fin K)

Code

noncomputable
def TS.policy (hK : 0 < K) (Q : Measure ๐“”) [IsProbabilityMeasure Q] (ฮบ : Kernel (๐“” ร— Fin K) โ„)
    [IsMarkovKernel ฮบ] (n : โ„•) : Kernel (Iic n โ†’ (Fin K) ร— โ„) (Fin K) :=
  have : Nonempty (Fin K) := Fin.pos_iff_nonempty.mp hK
  (IT.bayesTrajMeasurePosterior Q ฮบ uniformAlgorithm n).map (bestAction ฮบ id)
Body uses (3)
Used by (2)

Actions: Source ยท Open Issue

instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy๐Ÿ”—

InstanceBandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy

No docstring.

๐Ÿ”—theorem
Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] {hK : 0 < K} {Q : MeasureTheory.Measure ๐“”} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„} [ProbabilityTheory.IsMarkovKernel ฮบ] {n : โ„•} : ProbabilityTheory.IsMarkovKernel (TS.policy hK Q ฮบ n)
Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] {hK : 0 < K} {Q : MeasureTheory.Measure ๐“”} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„} [ProbabilityTheory.IsMarkovKernel ฮบ] {n : โ„•} : ProbabilityTheory.IsMarkovKernel (TS.policy hK Q ฮบ n)

Code

instance {hK : 0 < K} {Q : Measure ๐“”} [IsProbabilityMeasure Q] {ฮบ : Kernel (๐“” ร— Fin K) โ„}
    [IsMarkovKernel ฮบ] {n : โ„•} : IsMarkovKernel (TS.policy hK Q ฮบ n)
Type uses (1)
Body uses (5)
Used by (1)

Actions: Source ยท Open Issue

Proof
Kernel.IsMarkovKernel.map _ (by fun_prop)

initialPolicy๐Ÿ”—

DefinitionBandits.TS.initialPolicy

The initial action is sampled according to its probability of being optimal under the prior over environments.

๐Ÿ”—def
Bandits.TS.initialPolicy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] (hK : 0 < K) (Q : MeasureTheory.Measure ๐“”) (ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„) : MeasureTheory.Measure (Fin K)
Bandits.TS.initialPolicy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] (hK : 0 < K) (Q : MeasureTheory.Measure ๐“”) (ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„) : MeasureTheory.Measure (Fin K)

Code

noncomputable
def TS.initialPolicy (hK : 0 < K) (Q : Measure ๐“”) (ฮบ : Kernel (๐“” ร— Fin K) โ„) : Measure (Fin K) :=
  have : Nonempty (Fin K) := Fin.pos_iff_nonempty.mp hK
  Q.map (bestAction ฮบ id)
Body uses (1)
Used by (2)

Actions: Source ยท Open Issue

instIsProbabilityMeasureFinInitialPolicy๐Ÿ”—

InstanceBandits.instIsProbabilityMeasureFinInitialPolicy

No docstring.

๐Ÿ”—theorem
Bandits.instIsProbabilityMeasureFinInitialPolicy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] {hK : 0 < K} {Q : MeasureTheory.Measure ๐“”} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„} : MeasureTheory.IsProbabilityMeasure (TS.initialPolicy hK Q ฮบ)
Bandits.instIsProbabilityMeasureFinInitialPolicy.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] {hK : 0 < K} {Q : MeasureTheory.Measure ๐“”} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„} : MeasureTheory.IsProbabilityMeasure (TS.initialPolicy hK Q ฮบ)

Code

instance {hK : 0 < K} {Q : Measure ๐“”} [IsProbabilityMeasure Q] {ฮบ : Kernel (๐“” ร— Fin K) โ„} :
    IsProbabilityMeasure (TS.initialPolicy hK Q ฮบ)
Type uses (1)
Body uses (2)
Used by (1)

Actions: Source ยท Open Issue

Proof
Measure.isProbabilityMeasure_map (by fun_prop)

tsAlgorithm๐Ÿ”—

DefinitionBandits.tsAlgorithm

The Thompson sampling algorithm with actions in Fin K, where Q : Measure ๐“” is a prior distribution over parameters, and ฮบ : Kernel (๐“” ร— Fin K) โ„ is a Markov kernel that defines the stationary environment stationaryEnv (ฮบ.sectR e) that corresponds to a parameter e : ๐“”.

At every time n, the Thompson sampling policy uses the posterior over the parameters given the history up to time n to derive the probability of each action being optimal. The action for time n is sampled according to these probabilities.

๐Ÿ”—def
Bandits.tsAlgorithm.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] (hK : 0 < K) (Q : MeasureTheory.Measure ๐“”) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„) [ProbabilityTheory.IsMarkovKernel ฮบ] : Learning.Algorithm (Fin K) โ„
Bandits.tsAlgorithm.{u_1} {K : โ„•} {๐“” : Type u_1} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] (hK : 0 < K) (Q : MeasureTheory.Measure ๐“”) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„) [ProbabilityTheory.IsMarkovKernel ฮบ] : Learning.Algorithm (Fin K) โ„

Code

noncomputable
def tsAlgorithm (hK : 0 < K) (Q : Measure ๐“”) [IsProbabilityMeasure Q] (ฮบ : Kernel (๐“” ร— Fin K) โ„)
    [IsMarkovKernel ฮบ] : Algorithm (Fin K) โ„ where
  policy := TS.policy hK Q ฮบ
  p0 := TS.initialPolicy hK Q ฮบ
Type uses (1)
Body uses (4)
Used by (4)

Actions: Source ยท Open Issue

hasCondDistrib_action๐Ÿ”—

LemmaBandits.TS.hasCondDistrib_action

If Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.

๐Ÿ”—theorem
Bandits.TS.hasCondDistrib_action.{u_1, u_2} {K : โ„•} [Nonempty (Fin K)] {ฮฉ : Type u_1} [MeasurableSpace ฮฉ] {๐“” : Type u_2} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] {E : ฮฉ โ†’ ๐“”} {A : โ„• โ†’ ฮฉ โ†’ Fin K} {R : โ„• โ†’ ฮฉ โ†’ โ„} {Q : MeasureTheory.Measure ๐“”} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„} [ProbabilityTheory.IsMarkovKernel ฮบ] {P : MeasureTheory.Measure ฮฉ} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q ฮบ (tsAlgorithm hK Q ฮบ) E A R P) (n : โ„•) : ProbabilityTheory.HasCondDistrib (A (n + 1)) (Learning.history A R n) ๐“›[Learning.IsBayesAlgEnvSeq.bestAction ฮบ E | Learning.history A R n; P] P
Bandits.TS.hasCondDistrib_action.{u_1, u_2} {K : โ„•} [Nonempty (Fin K)] {ฮฉ : Type u_1} [MeasurableSpace ฮฉ] {๐“” : Type u_2} [MeasurableSpace ๐“”] [StandardBorelSpace ๐“”] [Nonempty ๐“”] {E : ฮฉ โ†’ ๐“”} {A : โ„• โ†’ ฮฉ โ†’ Fin K} {R : โ„• โ†’ ฮฉ โ†’ โ„} {Q : MeasureTheory.Measure ๐“”} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐“” ร— Fin K) โ„} [ProbabilityTheory.IsMarkovKernel ฮบ] {P : MeasureTheory.Measure ฮฉ} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q ฮบ (tsAlgorithm hK Q ฮบ) E A R P) (n : โ„•) : ProbabilityTheory.HasCondDistrib (A (n + 1)) (Learning.history A R n) ๐“›[Learning.IsBayesAlgEnvSeq.bestAction ฮบ E | Learning.history A R n; P] P

Code

lemma TS.hasCondDistrib_action (hK : 0 < K) (h : IsBayesAlgEnvSeq Q ฮบ (tsAlgorithm hK Q ฮบ) E A R P)
    (n : โ„•) :
    HasCondDistrib (A (n + 1)) (history A R n)
      (condDistrib (bestAction ฮบ E) (history A R n) P) P where
  aemeasurable
Type uses (4)
Body uses (14)
Used by (1)

Actions: Source ยท Open Issue

Proof
((measurable_history h.measurable_action h.measurable_feedback n).prodMk
      (h.measurable_action (n + 1))).aemeasurable
  map_eq := by
    have hm : Measurable (bestAction ฮบ id) := by fun_prop
    rw [(h.hasCondDistrib_action' n).map_eq]
    refine Measure.compProd_congr ?_
    calc
      _ =แต[P.map (history A R n)]
          (IT.bayesTrajMeasurePosterior Q ฮบ uniformAlgorithm n).map (bestAction ฮบ id) := by rfl
      _ =แต[P.map (history A R n)]
          (condDistrib E (history A R n) P).map (bestAction ฮบ id) := by
          filter_upwards [(h.hasCondDistrib_env_history
            (IT.isBayesAlgEnvSeq_bayesTrajMeasure Q ฮบ uniformAlgorithm)
            absolutelyContinuous_uniformAlgorithm n).condDistrib_eq] with _ hc
          simp_rw [Kernel.map_apply _ hm, IT.bayesTrajMeasurePosterior, hc]
      _ =แต[P.map (history A R n)]
          condDistrib (bestAction ฮบ E) (history A R n) P :=
          (condDistrib_comp (history A R n) h.measurable_param.aemeasurable hm).symm
  1. Bandits.TS.policy
  2. Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy
  3. Bandits.TS.initialPolicy
  4. Bandits.instIsProbabilityMeasureFinInitialPolicy
  5. Bandits.tsAlgorithm
  6. Bandits.TS.hasCondDistrib_action