2.7.ย Online.Bandit.Algorithms.TS
Thompson Sampling
This file defines the Thompson sampling algorithm. This algorithm samples an action according to its probability of being optimal under the posterior over environments given the history so far.
Main definitions
-
tsAlgorithm hK Q ฮบ: a Thompson sampling algorithm with actions inFin KgivenhK : 0 < K, a prior distribution over parametersQ : Measure ๐, and a Markov kernelฮบ : Kernel (๐ ร Fin K) โ. This kernel defines how a parametere : ๐gives rise to a stationary environment:stationaryEnv (ฮบ.sectR e) : Environment (Fin K) โ.
Main results
-
hasCondDistrib_action: if Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.
Module LeanMachineLearning.Online.Bandit.Algorithms.TS contains 6 exposed declarations.
policy๐
Bandits.TS.policyThe Thompson sampling policy samples an action according to its probability of being optimal under the posterior over environments given the history so far. The posterior under a uniform algorithm is used to avoid a circular definition.
Bandits.TS.policy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] (hK : 0 < K) (Q : MeasureTheory.Measure ๐) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ) [ProbabilityTheory.IsMarkovKernel ฮบ] (n : โ) : ProbabilityTheory.Kernel (โฅ(Finset.Iic n) โ Fin K ร โ) (Fin K)Bandits.TS.policy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] (hK : 0 < K) (Q : MeasureTheory.Measure ๐) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ) [ProbabilityTheory.IsMarkovKernel ฮบ] (n : โ) : ProbabilityTheory.Kernel (โฅ(Finset.Iic n) โ Fin K ร โ) (Fin K)
Code
noncomputable
def TS.policy (hK : 0 < K) (Q : Measure ๐) [IsProbabilityMeasure Q] (ฮบ : Kernel (๐ ร Fin K) โ)
[IsMarkovKernel ฮบ] (n : โ) : Kernel (Iic n โ (Fin K) ร โ) (Fin K) :=
have : Nonempty (Fin K) := Fin.pos_iff_nonempty.mp hK
(IT.bayesTrajMeasurePosterior Q ฮบ uniformAlgorithm n).map (bestAction ฮบ id)Body uses (3)
Actions: Source ยท Open Issue
instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy๐
Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicyNo docstring.
Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] {hK : 0 < K} {Q : MeasureTheory.Measure ๐} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ} [ProbabilityTheory.IsMarkovKernel ฮบ] {n : โ} : ProbabilityTheory.IsMarkovKernel (TS.policy hK Q ฮบ n)Bandits.instIsMarkovKernelForallSubtypeNatMemFinsetIicProdFinRealPolicy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] {hK : 0 < K} {Q : MeasureTheory.Measure ๐} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ} [ProbabilityTheory.IsMarkovKernel ฮบ] {n : โ} : ProbabilityTheory.IsMarkovKernel (TS.policy hK Q ฮบ n)
Code
instance {hK : 0 < K} {Q : Measure ๐} [IsProbabilityMeasure Q] {ฮบ : Kernel (๐ ร Fin K) โ}
[IsMarkovKernel ฮบ] {n : โ} : IsMarkovKernel (TS.policy hK Q ฮบ n)Type uses (1)
Body uses (5)
Used by (1)
Actions: Source ยท Open Issue
Proof
Kernel.IsMarkovKernel.map _ (by fun_prop)
initialPolicy๐
Bandits.TS.initialPolicyThe initial action is sampled according to its probability of being optimal under the prior over environments.
Bandits.TS.initialPolicy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] (hK : 0 < K) (Q : MeasureTheory.Measure ๐) (ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ) : MeasureTheory.Measure (Fin K)Bandits.TS.initialPolicy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] (hK : 0 < K) (Q : MeasureTheory.Measure ๐) (ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ) : MeasureTheory.Measure (Fin K)
Code
noncomputable def TS.initialPolicy (hK : 0 < K) (Q : Measure ๐) (ฮบ : Kernel (๐ ร Fin K) โ) : Measure (Fin K) := have : Nonempty (Fin K) := Fin.pos_iff_nonempty.mp hK Q.map (bestAction ฮบ id)
Body uses (1)
Used by (2)
Actions: Source ยท Open Issue
instIsProbabilityMeasureFinInitialPolicy๐
Bandits.instIsProbabilityMeasureFinInitialPolicyNo docstring.
Bandits.instIsProbabilityMeasureFinInitialPolicy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] {hK : 0 < K} {Q : MeasureTheory.Measure ๐} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ} : MeasureTheory.IsProbabilityMeasure (TS.initialPolicy hK Q ฮบ)Bandits.instIsProbabilityMeasureFinInitialPolicy.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] {hK : 0 < K} {Q : MeasureTheory.Measure ๐} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ} : MeasureTheory.IsProbabilityMeasure (TS.initialPolicy hK Q ฮบ)
Code
instance {hK : 0 < K} {Q : Measure ๐} [IsProbabilityMeasure Q] {ฮบ : Kernel (๐ ร Fin K) โ} :
IsProbabilityMeasure (TS.initialPolicy hK Q ฮบ)Type uses (1)
Body uses (2)
Used by (1)
Actions: Source ยท Open Issue
Proof
Measure.isProbabilityMeasure_map (by fun_prop)
tsAlgorithm๐
Bandits.tsAlgorithm
The Thompson sampling algorithm with actions in Fin K, where Q : Measure ๐ is a prior
distribution over parameters, and ฮบ : Kernel (๐ ร Fin K) โ is a Markov kernel that defines the
stationary environment stationaryEnv (ฮบ.sectR e) that corresponds to a parameter e : ๐.
At every time n, the Thompson sampling policy uses the posterior over the parameters given the
history up to time n to derive the probability of each action being optimal. The action for time
n is sampled according to these probabilities.
Bandits.tsAlgorithm.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] (hK : 0 < K) (Q : MeasureTheory.Measure ๐) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ) [ProbabilityTheory.IsMarkovKernel ฮบ] : Learning.Algorithm (Fin K) โBandits.tsAlgorithm.{u_1} {K : โ} {๐ : Type u_1} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] (hK : 0 < K) (Q : MeasureTheory.Measure ๐) [MeasureTheory.IsProbabilityMeasure Q] (ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ) [ProbabilityTheory.IsMarkovKernel ฮบ] : Learning.Algorithm (Fin K) โ
Code
noncomputable
def tsAlgorithm (hK : 0 < K) (Q : Measure ๐) [IsProbabilityMeasure Q] (ฮบ : Kernel (๐ ร Fin K) โ)
[IsMarkovKernel ฮบ] : Algorithm (Fin K) โ where
policy := TS.policy hK Q ฮบ
p0 := TS.initialPolicy hK Q ฮบType uses (1)
Body uses (4)
Used by (4)
Actions: Source ยท Open Issue
hasCondDistrib_action๐
Bandits.TS.hasCondDistrib_actionIf Thompson sampling has the correct prior over environments, then the conditional distribution of the next action given the history so far is equal to the conditional distribution of the best action given the history so far.
Bandits.TS.hasCondDistrib_action.{u_1, u_2} {K : โ} [Nonempty (Fin K)] {ฮฉ : Type u_1} [MeasurableSpace ฮฉ] {๐ : Type u_2} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] {E : ฮฉ โ ๐} {A : โ โ ฮฉ โ Fin K} {R : โ โ ฮฉ โ โ} {Q : MeasureTheory.Measure ๐} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ} [ProbabilityTheory.IsMarkovKernel ฮบ] {P : MeasureTheory.Measure ฮฉ} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q ฮบ (tsAlgorithm hK Q ฮบ) E A R P) (n : โ) : ProbabilityTheory.HasCondDistrib (A (n + 1)) (Learning.history A R n) ๐[Learning.IsBayesAlgEnvSeq.bestAction ฮบ E | Learning.history A R n; P] PBandits.TS.hasCondDistrib_action.{u_1, u_2} {K : โ} [Nonempty (Fin K)] {ฮฉ : Type u_1} [MeasurableSpace ฮฉ] {๐ : Type u_2} [MeasurableSpace ๐] [StandardBorelSpace ๐] [Nonempty ๐] {E : ฮฉ โ ๐} {A : โ โ ฮฉ โ Fin K} {R : โ โ ฮฉ โ โ} {Q : MeasureTheory.Measure ๐} [MeasureTheory.IsProbabilityMeasure Q] {ฮบ : ProbabilityTheory.Kernel (๐ ร Fin K) โ} [ProbabilityTheory.IsMarkovKernel ฮบ] {P : MeasureTheory.Measure ฮฉ} [MeasureTheory.IsProbabilityMeasure P] (hK : 0 < K) (h : Learning.IsBayesAlgEnvSeq Q ฮบ (tsAlgorithm hK Q ฮบ) E A R P) (n : โ) : ProbabilityTheory.HasCondDistrib (A (n + 1)) (Learning.history A R n) ๐[Learning.IsBayesAlgEnvSeq.bestAction ฮบ E | Learning.history A R n; P] P
Code
lemma TS.hasCondDistrib_action (hK : 0 < K) (h : IsBayesAlgEnvSeq Q ฮบ (tsAlgorithm hK Q ฮบ) E A R P)
(n : โ) :
HasCondDistrib (A (n + 1)) (history A R n)
(condDistrib (bestAction ฮบ E) (history A R n) P) P where
aemeasurableType uses (4)
Body uses (14)
Used by (1)
Actions: Source ยท Open Issue
Proof
((measurable_history h.measurable_action h.measurable_feedback n).prodMk
(h.measurable_action (n + 1))).aemeasurable
map_eq := by
have hm : Measurable (bestAction ฮบ id) := by fun_prop
rw [(h.hasCondDistrib_action' n).map_eq]
refine Measure.compProd_congr ?_
calc
_ =แต[P.map (history A R n)]
(IT.bayesTrajMeasurePosterior Q ฮบ uniformAlgorithm n).map (bestAction ฮบ id) := by rfl
_ =แต[P.map (history A R n)]
(condDistrib E (history A R n) P).map (bestAction ฮบ id) := by
filter_upwards [(h.hasCondDistrib_env_history
(IT.isBayesAlgEnvSeq_bayesTrajMeasure Q ฮบ uniformAlgorithm)
absolutelyContinuous_uniformAlgorithm n).condDistrib_eq] with _ hc
simp_rw [Kernel.map_apply _ hm, IT.bayesTrajMeasurePosterior, hc]
_ =แต[P.map (history A R n)]
condDistrib (bestAction ฮบ E) (history A R n) P :=
(condDistrib_comp (history A R n) h.measurable_param.aemeasurable hm).symm