Regret, gap, best arm

Main definitions

gap ν a is the gap of an action a, i.e., the difference between the highest mean of the actions and the mean of a.
regret ν A t ω is the regret of a sequence of pulls A : ℕ → Ω → 𝓐 at time t for the reward kernel ν : Kernel 𝓐 ℝ and the outcome ω : Ω.
bestArm ν is an action with the highest mean.

source

noncomputable def Bandits.gap {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} (ν : ProbabilityTheory.Kernel 𝓐 ℝ) (a : 𝓐) :

ℝ

Gap of an action a: difference between the highest mean of the actions and the mean of a.

Equations

Bandits.gap ν a = (⨆ (i : 𝓐), ∫ (x : ℝ), id x ∂ν i) - ∫ (x : ℝ), id x ∂ν a


source

theorem Bandits.gap_nonneg {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {a : 𝓐} [Finite 𝓐] :

0 ≤ gap ν a

source

theorem Bandits.gap_nonneg_of_le {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {a : 𝓐} {u : ℝ} (h : ∀ (a : 𝓐), ∫ (x : ℝ), id x ∂ν a ≤ u) :

0 ≤ gap ν a

The gap is non-negative if the means are bounded by u : ℝ (even if 𝓐 is not Finite).
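As a hedged usage sketch (not taken from the library itself), the two non-negativity lemmas might be applied as follows. The import path is assumed to match the module name of this page, and the bound `1` in the second example is an arbitrary illustrative choice.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory MeasureTheory

variable {𝓐 : Type*} {m𝓐 : MeasurableSpace 𝓐} {ν : Kernel 𝓐 ℝ}

-- Finite action set: no hypothesis on the means is needed.
example [Finite 𝓐] (a : 𝓐) : 0 ≤ gap ν a := gap_nonneg

-- Arbitrary action set: a uniform upper bound on the means suffices.
-- The bound `1` is an illustrative choice, not something the library fixes.
example (a : 𝓐) (h : ∀ a, (∫ x, id x ∂ν a) ≤ 1) : 0 ≤ gap ν a :=
  gap_nonneg_of_le h
```

The second form is the one to reach for when 𝓐 is infinite, since `gap_nonneg` requires `Finite 𝓐`.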

source

theorem Bandits.gap_le_of_mem_Icc {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {a : 𝓐} [Nonempty 𝓐] {l u : ℝ} (h : ∀ (a : 𝓐), ∫ (x : ℝ), id x ∂ν a ∈ Set.Icc l u) :

gap ν a ≤ u - l
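For instance, when every mean lies in the unit interval, the gap is at most 1. The following is a minimal sketch under that assumption; the interval `[0, 1]` and the import path are illustrative choices, not part of the library statement.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory MeasureTheory

variable {𝓐 : Type*} {m𝓐 : MeasurableSpace 𝓐} {ν : Kernel 𝓐 ℝ}

-- With l = 0 and u = 1 the lemma gives `gap ν a ≤ 1 - 0`,
-- and `simpa` rewrites `1 - 0` to `1`.
example [Nonempty 𝓐] (a : 𝓐)
    (h : ∀ a, (∫ x, id x ∂ν a) ∈ Set.Icc (0 : ℝ) 1) :
    gap ν a ≤ 1 := by
  simpa using gap_le_of_mem_Icc (a := a) h
```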

source

noncomputable def Bandits.regret {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} (ν : ProbabilityTheory.Kernel 𝓐 ℝ) (A : ℕ → Ω → 𝓐) (t : ℕ) (ω : Ω) :

ℝ

Regret of a sequence of pulls A : ℕ → Ω → 𝓐 at time t for the reward kernel ν : Kernel 𝓐 ℝ and the outcome ω : Ω.

Equations

Bandits.regret ν A t ω = (↑t * ⨆ (a : 𝓐), ∫ (x : ℝ), id x ∂ν a) - ∑ s ∈ Finset.range t, ∫ (x : ℝ), id x ∂ν (A s ω)


source

theorem Bandits.regret_eq_sum_gap {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {ω : Ω} {t : ℕ} :

regret ν A t ω = ∑ s ∈ Finset.range t, gap ν (A s ω)
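Two small sanity checks illustrate this identity: the regret at time 0 is zero, and the regret at time 1 is the gap of the first action pulled. This is a sketch that relies on standard `simp` lemmas for sums over `Finset.range`; the import path is an assumption.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory

variable {𝓐 Ω : Type*} {m𝓐 : MeasurableSpace 𝓐} {ν : Kernel 𝓐 ℝ}

-- No pulls yet: the sum over `Finset.range 0` is empty.
example (A : ℕ → Ω → 𝓐) (ω : Ω) : regret ν A 0 ω = 0 := by
  simp [regret_eq_sum_gap]

-- One pull: the sum over `Finset.range 1` has the single term s = 0.
example (A : ℕ → Ω → 𝓐) (ω : Ω) : regret ν A 1 ω = gap ν (A 0 ω) := by
  simp [regret_eq_sum_gap]
```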

source

theorem Bandits.regret_nonneg {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {ω : Ω} {t : ℕ} [Finite 𝓐] :

0 ≤ regret ν A t ω

source

theorem Bandits.gap_eq_zero_of_regret_eq_zero {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {ω : Ω} {t : ℕ} [Finite 𝓐] (hr : regret ν A t ω = 0) {s : ℕ} (hs : s < t) :

gap ν (A s ω) = 0
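A hedged usage sketch for the two lemmas above: with a finite action set the regret is non-negative, and zero regret at time 3 forces the action pulled at step 2 to have zero gap. The concrete times 2 and 3 are illustrative choices, and the import path is an assumption.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory

variable {𝓐 Ω : Type*} {m𝓐 : MeasurableSpace 𝓐} {ν : Kernel 𝓐 ℝ}

example [Finite 𝓐] (A : ℕ → Ω → 𝓐) (t : ℕ) (ω : Ω) :
    0 ≤ regret ν A t ω :=
  regret_nonneg

-- Zero regret up to time 3 means every action pulled before time 3 is optimal;
-- the side goal `2 < 3` is discharged by `norm_num`.
example [Finite 𝓐] (A : ℕ → Ω → 𝓐) (ω : Ω) (hr : regret ν A 3 ω = 0) :
    gap ν (A 2 ω) = 0 :=
  gap_eq_zero_of_regret_eq_zero hr (by norm_num)
```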

source

theorem Bandits.regret_eq_sum_pullCount_mul_gap {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {ω : Ω} {t : ℕ} [Fintype 𝓐] :

regret ν A t ω = ∑ a : 𝓐, ↑(Learning.pullCount A a t ω) * gap ν a
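To make the decomposition concrete, here is a sketch for a two-armed bandit with actions indexed by `Fin 2`. It assumes that `Learning.pullCount` returns a natural number (as the coercion arrow in the statement suggests) and that the import path matches this page's module name.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory

variable {Ω : Type*}

-- Two arms: the sum over all actions unfolds into two terms.
-- Assumption: `Learning.pullCount A a t ω` is a natural number, coerced to ℝ here.
example {ν : Kernel (Fin 2) ℝ} (A : ℕ → Ω → Fin 2) (t : ℕ) (ω : Ω) :
    regret ν A t ω =
      Learning.pullCount A 0 t ω * gap ν 0 + Learning.pullCount A 1 t ω * gap ν 1 := by
  rw [regret_eq_sum_pullCount_mul_gap, Fin.sum_univ_two]
```

If arm 0 is optimal, its gap is zero and only the pulls of arm 1 contribute to the regret.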

source

theorem Bandits.integral_regret_eq_sum_gap_mul_integral_pullCount {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} {mΩ : MeasurableSpace Ω} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {n : ℕ} [StandardBorelSpace 𝓐] [Fintype 𝓐] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hA : ∀ (n : ℕ), Measurable (A n)) :

∫ (x : Ω), regret ν A n x ∂P = ∑ a : 𝓐, gap ν a * ∫ (x : Ω), (fun (ω : Ω) => ↑(Learning.pullCount A a n ω)) x ∂P

source

theorem Bandits.integral_regret_le_of_forall_integral_pullCount_le {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} {mΩ : MeasurableSpace Ω} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {R : ℕ → Ω → ℝ} {n : ℕ} [Nonempty 𝓐] [StandardBorelSpace 𝓐] [Fintype 𝓐] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {alg : Learning.Algorithm 𝓐 ℝ} {env : Learning.Environment 𝓐 ℝ} {B : 𝓐 → ℝ} (h : Learning.IsAlgEnvSeq A R alg env P) (h_le : ∀ (a : 𝓐), gap ν a ≠ 0 → ∫ (ω : Ω), ↑(Learning.pullCount A a n ω) ∂P ≤ B a) :

∫ (x : Ω), regret ν A n x ∂P ≤ ∑ a : 𝓐, gap ν a * B a

To bound the expected regret, it suffices to bound the expected number of pulls for each action with positive gap.

source

noncomputable def Bandits.bestArm {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} [Fintype 𝓐] [Nonempty 𝓐] (ν : ProbabilityTheory.Kernel 𝓐 ℝ) :

𝓐

An action with the highest mean.

Equations

Bandits.bestArm ν = ⋯.choose


source

theorem Bandits.le_bestArm {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} [Fintype 𝓐] [Nonempty 𝓐] (a : 𝓐) :

∫ (x : ℝ), id x ∂ν a ≤ ∫ (x : ℝ), id x ∂ν (bestArm ν)

source

theorem Bandits.gap_eq_bestArm_sub {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {a : 𝓐} [Fintype 𝓐] [Nonempty 𝓐] :

gap ν a = ∫ (x : ℝ), id x ∂ν (bestArm ν) - ∫ (x : ℝ), id x ∂ν a

source

@[simp]

theorem Bandits.gap_bestArm {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} [Fintype 𝓐] [Nonempty 𝓐] :

gap ν (bestArm ν) = 0

source

theorem Bandits.integral_eq_of_gap_eq_zero {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {a : 𝓐} [Fintype 𝓐] [Nonempty 𝓐] (hg : gap ν a = 0) :

∫ (x : ℝ), id x ∂ν (bestArm ν) = ∫ (x : ℝ), id x ∂ν a
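The `bestArm` lemmas might be used as in the following sketch: the `simp` attribute on `gap_bestArm` lets `simp` close the first goal, and `le_bestArm` yields non-negativity of the difference of means. The import path is an assumption, and the examples are illustrative rather than taken from the library.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory MeasureTheory

variable {𝓐 : Type*} {m𝓐 : MeasurableSpace 𝓐} {ν : Kernel 𝓐 ℝ}

-- `gap_bestArm` is tagged `@[simp]`, so `simp` applies it directly.
example [Fintype 𝓐] [Nonempty 𝓐] : gap ν (bestArm ν) = 0 := by
  simp

-- The best arm's mean dominates every other mean.
example [Fintype 𝓐] [Nonempty 𝓐] (a : 𝓐) :
    0 ≤ (∫ x, id x ∂ν (bestArm ν)) - (∫ x, id x ∂ν a) :=
  sub_nonneg.mpr (le_bestArm a)
```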

source

theorem Bandits.avg_mean_reward_tendsto_of_sublinear_regret {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {ω : Ω} (hr : (fun (x : ℕ) => regret ν A x ω) =o[Filter.atTop] fun (t : ℕ) => ↑t) :

Filter.Tendsto (fun (t : ℕ) => (∑ s ∈ Finset.range t, ∫ (x : ℝ), id x ∂ν (A s ω)) / ↑t) Filter.atTop (nhds (⨆ (a : 𝓐), ∫ (x : ℝ), id x ∂ν a))

If the regret is sublinear, the average mean reward tends to the highest mean of the arms.

source

theorem Bandits.pullCount_rate_tendsto_of_sublinear_regret {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 ℝ} {A : ℕ → Ω → 𝓐} {ω : Ω} {a : 𝓐} [Finite 𝓐] (hr : (fun (x : ℕ) => regret ν A x ω) =o[Filter.atTop] fun (t : ℕ) => ↑t) (hg : 0 < gap ν a) :

Filter.Tendsto (fun (t : ℕ) => ↑(Learning.pullCount A a t ω) / ↑t) Filter.atTop (nhds 0)

If the regret is sublinear, the rate of suboptimal arm pulls tends to zero.
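One way to see why this is plausible: by `regret_eq_sum_pullCount_mul_gap` and non-negativity of the gaps, the regret at time t is at least the pull count of a times gap ν a, so dividing by t and by the positive gap bounds the pull rate by a quantity that tends to zero. The sketch below only shows how the theorem might be invoked; it assumes `Learning.pullCount` returns a natural number and that the import path matches this page's module name.

```lean
-- Assumption: the import path matches the module name of this page.
import LeanMachineLearning.Online.Bandit.Regret

open Bandits ProbabilityTheory Asymptotics

variable {𝓐 Ω : Type*} {m𝓐 : MeasurableSpace 𝓐} {ν : Kernel 𝓐 ℝ}

-- Sublinear regret plus a strictly positive gap for `a`
-- gives a vanishing fraction of pulls of `a`.
example [DecidableEq 𝓐] [Finite 𝓐] (A : ℕ → Ω → 𝓐) (ω : Ω) (a : 𝓐)
    (hr : (fun t : ℕ => regret ν A t ω) =o[Filter.atTop] fun t : ℕ => (t : ℝ))
    (hg : 0 < gap ν a) :
    Filter.Tendsto (fun t : ℕ => (Learning.pullCount A a t ω : ℝ) / t)
      Filter.atTop (nhds 0) :=
  pullCount_rate_tendsto_of_sublinear_regret hr hg
```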

Documentation

LeanMachineLearning.Online.Bandit.Regret
