Documentation

LeanMachineLearning.Online.Bandit.Regret

Regret, gap, best arm #

Main definitions #

noncomputable def Bandits.gap {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} (ν : ProbabilityTheory.Kernel 𝓐 ) (a : 𝓐) :

Gap of an action a: difference between the highest mean of the actions and the mean of a.

Equations
Instances For
    theorem Bandits.gap_nonneg {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {a : 𝓐} [Finite 𝓐] :
    0 gap ν a
    theorem Bandits.gap_nonneg_of_le {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {a : 𝓐} {u : } (h : ∀ (a : 𝓐), (x : ), id x ν a u) :
    0 gap ν a

    The gap is non-negative if the means are bounded by u : ℝ (even if 𝓐 is not Finite).

    theorem Bandits.gap_le_of_mem_Icc {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {a : 𝓐} [Nonempty 𝓐] {l u : } (h : ∀ (a : 𝓐), (x : ), id x ν a Set.Icc l u) :
    gap ν a u - l
    noncomputable def Bandits.regret {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} (ν : ProbabilityTheory.Kernel 𝓐 ) (A : Ω𝓐) (t : ) (ω : Ω) :

    Regret of a sequence of pulls k : ℕ → 𝓐 at time t for the reward kernel ν ; Kernel 𝓐 ℝ.

    Equations
    Instances For
      theorem Bandits.regret_eq_sum_gap {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {ω : Ω} {t : } :
      regret ν A t ω = sFinset.range t, gap ν (A s ω)
      theorem Bandits.regret_nonneg {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {ω : Ω} {t : } [Finite 𝓐] :
      0 regret ν A t ω
      theorem Bandits.gap_eq_zero_of_regret_eq_zero {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {ω : Ω} {t : } [Finite 𝓐] (hr : regret ν A t ω = 0) {s : } (hs : s < t) :
      gap ν (A s ω) = 0
      theorem Bandits.regret_eq_sum_pullCount_mul_gap {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {ω : Ω} {t : } [Fintype 𝓐] :
      regret ν A t ω = a : 𝓐, (Learning.pullCount A a t ω) * gap ν a
      theorem Bandits.integral_regret_eq_sum_gap_mul_integral_pullCount {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} { : MeasurableSpace Ω} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {n : } [StandardBorelSpace 𝓐] [Fintype 𝓐] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] (hA : ∀ (n : ), Measurable (A n)) :
      (x : Ω), regret ν A n x P = a : 𝓐, gap ν a * (x : Ω), (fun (ω : Ω) => (Learning.pullCount A a n ω)) x P
      theorem Bandits.integral_regret_le_of_forall_integral_pullCount_le {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} { : MeasurableSpace Ω} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {R : Ω} {n : } [Nonempty 𝓐] [StandardBorelSpace 𝓐] [Fintype 𝓐] {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {alg : Learning.Algorithm 𝓐 } {env : Learning.Environment 𝓐 } {B : 𝓐} (h : Learning.IsAlgEnvSeq A R alg env P) (h_le : ∀ (a : 𝓐), gap ν a 0 (ω : Ω), (Learning.pullCount A a n ω) P B a) :
      (x : Ω), regret ν A n x P a : 𝓐, gap ν a * B a

      To bound the expected regret, it suffices to bound the expected number of pulls for each action with positive gap.

      noncomputable def Bandits.bestArm {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} [Fintype 𝓐] [Nonempty 𝓐] (ν : ProbabilityTheory.Kernel 𝓐 ) :
      𝓐

      action with the highest mean.

      Equations
      Instances For
        theorem Bandits.le_bestArm {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } [Fintype 𝓐] [Nonempty 𝓐] (a : 𝓐) :
        (x : ), id x ν a (x : ), id x ν (bestArm ν)
        theorem Bandits.gap_eq_bestArm_sub {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {a : 𝓐} [Fintype 𝓐] [Nonempty 𝓐] :
        gap ν a = (x : ), id x ν (bestArm ν) - (x : ), id x ν a
        @[simp]
        theorem Bandits.gap_bestArm {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } [Fintype 𝓐] [Nonempty 𝓐] :
        gap ν (bestArm ν) = 0
        theorem Bandits.integral_eq_of_gap_eq_zero {𝓐 : Type u_1} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {a : 𝓐} [Fintype 𝓐] [Nonempty 𝓐] (hg : gap ν a = 0) :
        (x : ), id x ν (bestArm ν) = (x : ), id x ν a
        theorem Bandits.avg_mean_reward_tendsto_of_sublinear_regret {𝓐 : Type u_1} {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {ω : Ω} (hr : (fun (x : ) => regret ν A x ω) =o[Filter.atTop] fun (t : ) => t) :
        Filter.Tendsto (fun (t : ) => (∑ sFinset.range t, (x : ), id x ν (A s ω)) / t) Filter.atTop (nhds (⨆ (a : 𝓐), (x : ), id x ν a))

        If the regret is sublinear, the average mean reward tends to the highest mean of the arms.

        theorem Bandits.pullCount_rate_tendsto_of_sublinear_regret {𝓐 : Type u_1} {Ω : Type u_2} [DecidableEq 𝓐] {m𝓐 : MeasurableSpace 𝓐} {ν : ProbabilityTheory.Kernel 𝓐 } {A : Ω𝓐} {ω : Ω} {a : 𝓐} [Finite 𝓐] (hr : (fun (x : ) => regret ν A x ω) =o[Filter.atTop] fun (t : ) => t) (hg : 0 < gap ν a) :
        Filter.Tendsto (fun (t : ) => (Learning.pullCount A a t ω) / t) Filter.atTop (nhds 0)

        If the regret is sublinear, the rate of suboptimal arm pulls tends to zero.