4. Defining an Algorithm🔗

This tutorial explains the structures used in the Lean Machine Learning library to describe algorithms, the environment they interact with, and how to state theorems about their interaction. We then illustrate them with the UCB bandit algorithm.

4.1. Algorithm and environment🔗

In LML, we prove theorems about the interaction of an algorithm with an environment. An algorithm takes actions, to which the environment responds with feedback (e.g., rewards for the bandit case, the gradient of a function in optimization problems). In general, both action and feedback can depend on the entire history up to the current time and can be randomized. The Algorithm structure is defined as follows:

structure

Learning.Algorithm.{u_4, u_5} (𝓐 : Type u_4) (𝓨 : Type u_5)
  [MeasurableSpace 𝓐] [MeasurableSpace 𝓨] : Type (max u_4 u_5)
Learning.Algorithm.{u_4, u_5}
  (𝓐 : Type u_4) (𝓨 : Type u_5)
  [MeasurableSpace 𝓐]
  [MeasurableSpace 𝓨] : Type (max u_4 u_5)

A stochastic, sequential algorithm.

Constructor

Learning.Algorithm.mk.{u_4, u_5}

Fields

policy : (n : ℕ) → ProbabilityTheory.Kernel (↥(Finset.Iic n) → 𝓐 × 𝓨) 𝓐

Policy or sampling rule: distribution of the next action.

h_policy : ∀ (n : ℕ), ProbabilityTheory.IsMarkovKernel (self.policy n)

The policy is a Markov kernel.

p0 : MeasureTheory.Measure 𝓐

Distribution of the first action.

hp0 : MeasureTheory.IsProbabilityMeasure self.p0

The first action distribution is a probability measure.

This structure refers to two types, the type of actions 𝓐 and the type of feedback 𝓨. Both are measurable spaces, since we consider stochastic algorithms and environments. The interaction will start with the algorithm playing a first action, which is in general random with distribution p0. The field hp0 registers that p0 is a probability measure (and it is in square brackets to tell Lean to infer it automatically whenever possible). After time n, there is a history of actions and feedbacks Iic n → 𝓐 × 𝓨 (n+1 pairs action and feedback). So after time 0 (the processes are 0-indexed) the history contains the action at time 0 and the feedback that followed. The policy field contain for each time n a kernel from that history to the action space. That is, it maps every possible history to a random next action (and that map is measurable). The h_policy field records that the measure describing the next action is a probability measure.

If the algorithms actions are not random, we can use the detAlgorithm definition to build an algorithm from the data of a measurable function for the next action and a choice for the first action.

def

Learning.detAlgorithm.{u_1, u_2} {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐} {m𝓨 : MeasurableSpace 𝓨}
  (nextA : (n : ℕ) → (↥(Finset.Iic n) → 𝓐 × 𝓨) → 𝓐)
  (h_next : ∀ (n : ℕ), Measurable (nextA n)) (action0 : 𝓐) :
  Algorithm 𝓐 𝓨
Learning.detAlgorithm.{u_1, u_2}
  {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐}
  {m𝓨 : MeasurableSpace 𝓨}
  (nextA :
    (n : ℕ) →
      (↥(Finset.Iic n) → 𝓐 × 𝓨) → 𝓐)
  (h_next :
    ∀ (n : ℕ), Measurable (nextA n))
  (action0 : 𝓐) : Algorithm 𝓐 𝓨

A deterministic algorithm, which chooses the action given by the function nextAction.

We can see here that we did not need to prove that the kernels are IsMarkovKernel and that the distribution of the first action is a probability measure. Lean knows that deterministic kernels are Markov.

The Environment structure is the mirror of the Algorithm structure, with a kernel for the feedback instead of the actions and a kernel for the first feedback instead of the first action.

structure

Learning.Environment.{u_4, u_5} (𝓐 : Type u_4) (𝓨 : Type u_5)
  [MeasurableSpace 𝓐] [MeasurableSpace 𝓨] : Type (max u_4 u_5)
Learning.Environment.{u_4, u_5}
  (𝓐 : Type u_4) (𝓨 : Type u_5)
  [MeasurableSpace 𝓐]
  [MeasurableSpace 𝓨] : Type (max u_4 u_5)

A stochastic environment.

Constructor

Learning.Environment.mk.{u_4, u_5}

Fields

feedback : (n : ℕ) → ProbabilityTheory.Kernel ((↥(Finset.Iic n) → 𝓐 × 𝓨) × 𝓐) 𝓨

Distribution of the next observation as function of the past history.

h_feedback : ∀ (n : ℕ), ProbabilityTheory.IsMarkovKernel (self.feedback n)

The feedback kernels are Markov kernels.

ν0 : ProbabilityTheory.Kernel 𝓐 𝓨

Distribution of the first observation given the first action.

hp0 : ProbabilityTheory.IsMarkovKernel self.ν0

The initial observation kernel is a Markov kernel.

ν0 gives the distribution of the first feedback given the first action, and feedback gives the distribution of the next feedback given the history and the next action.

In many applications the feedback depends only on the last action and not on the prior history. We provide an obliviousEnv definition that builds an environment for those cases.

def

Learning.obliviousEnv.{u_1, u_2} {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐} {m𝓨 : MeasurableSpace 𝓨}
  (ν : ℕ → ProbabilityTheory.Kernel 𝓐 𝓨)
  [∀ (n : ℕ), ProbabilityTheory.IsMarkovKernel (ν n)] : Environment 𝓐 𝓨
Learning.obliviousEnv.{u_1, u_2}
  {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐}
  {m𝓨 : MeasurableSpace 𝓨}
  (ν : ℕ → ProbabilityTheory.Kernel 𝓐 𝓨)
  [∀ (n : ℕ),
      ProbabilityTheory.IsMarkovKernel
        (ν n)] :
  Environment 𝓐 𝓨

An oblivious environment, in which the distribution of the next feedback depends only on the last action, but in a possibly time-dependent manner.

(ν (n + 1)).prodMkLeft _ is the kernel ν (n + 1) seen as a Kernel ((Iic n → 𝓐 × 𝓨) × 𝓐) 𝓨 by ignoring the history.

If furthermore the feedback kernel does not change with time, we can use the stationaryEnv definition to build the environment.

def

Learning.stationaryEnv.{u_1, u_2} {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐} {m𝓨 : MeasurableSpace 𝓨}
  (ν : ProbabilityTheory.Kernel 𝓐 𝓨)
  [ProbabilityTheory.IsMarkovKernel ν] : Environment 𝓐 𝓨
Learning.stationaryEnv.{u_1, u_2}
  {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐}
  {m𝓨 : MeasurableSpace 𝓨}
  (ν : ProbabilityTheory.Kernel 𝓐 𝓨)
  [ProbabilityTheory.IsMarkovKernel ν] :
  Environment 𝓐 𝓨

A stationary environment, in which the distribution of the next feedback depends only on the last action.

4.2. Sequences of actions and feedback, probability space🔗

Once algorithm and environment are defined, we can introduce sequences of actions and feedback and assume that they are generated by the interaction of the algorithm with the environment. This is done by the IsAlgEnvSeq structure.

structure

Learning.IsAlgEnvSeq.{u_1, u_2, u_3} {𝓐 : Type u_1} {𝓨 : Type u_2}
  {Ω : Type u_3} {m𝓐 : MeasurableSpace 𝓐} {m𝓨 : MeasurableSpace 𝓨}
  {mΩ : MeasurableSpace Ω} (A : ℕ → Ω → 𝓐) (Y : ℕ → Ω → 𝓨)
  (alg : Algorithm 𝓐 𝓨) (env : Environment 𝓐 𝓨)
  (P : MeasureTheory.Measure Ω) [MeasureTheory.IsFiniteMeasure P] : Prop
Learning.IsAlgEnvSeq.{u_1, u_2, u_3}
  {𝓐 : Type u_1} {𝓨 : Type u_2}
  {Ω : Type u_3} {m𝓐 : MeasurableSpace 𝓐}
  {m𝓨 : MeasurableSpace 𝓨}
  {mΩ : MeasurableSpace Ω} (A : ℕ → Ω → 𝓐)
  (Y : ℕ → Ω → 𝓨) (alg : Algorithm 𝓐 𝓨)
  (env : Environment 𝓐 𝓨)
  (P : MeasureTheory.Measure Ω)
  [MeasureTheory.IsFiniteMeasure P] : Prop

An algorithm-environment sequence: a sequence of actions and feedbacks generated by an algorithm interacting with an environment.

Constructor

Learning.IsAlgEnvSeq.mk.{u_1, u_2, u_3}

Fields

measurable_action : ∀ (n : ℕ), Measurable (A n)

The action sequence is measurable.

measurable_feedback : ∀ (n : ℕ), Measurable (Y n)

The feedback sequence is measurable.

hasLaw_action_zero : ProbabilityTheory.HasLaw (fun ω ↦ A 0 ω) alg.p0 P

The first action has the correct law.

hasCondDistrib_feedback_zero : ProbabilityTheory.HasCondDistrib (Y 0) (A 0) env.ν0 P

The first feedback has the correct conditional distribution.

hasCondDistrib_action : ∀ (n : ℕ), ProbabilityTheory.HasCondDistrib (A (n + 1)) (history A Y n) (alg.policy n) P

The next action has the correct conditional distribution given the history.

hasCondDistrib_feedback : ∀ (n : ℕ), ProbabilityTheory.HasCondDistrib (Y (n + 1)) (fun ω ↦ (history A Y n ω, A (n + 1) ω)) (env.feedback n) P

The next feedback has the correct conditional distribution given the history and next action.

This structure takes as input two sequences of random variables (two stochastic processes), A and Y, which represent the actions and feedback generated by the interaction of the algorithm with the environment. It states that those sequences are measurable and that they have the correct conditional distributions given by the algorithm and environment. The measurable space Ω and the measure P are not imposed: they can be chosen as we want, as long as the conditions of IsAlgEnvSeq are satisfied. This definition requires 𝓐 and 𝓨 to be nonempty standard Borel spaces, because Mathlib's theory about conditional distributions requires those assumptions. All spaces of interest in machine learning are standard Borel, so this is not a restriction.

Given any algorithm and environment, there always exists a sequence of actions and feedback that satisfies IsAlgEnvSeq by the Ionescu-Tulcea theorem. However other constructions of such sequences are possible, and it is easier to work with generic sequences satisfying IsAlgEnvSeq than with a specific construction. Which sequence we choose does not matter for the results we prove, since all such sequences are equal in distribution, as stated by the following theorem.

theorem

Learning.isAlgEnvSeq_unique.{u_1, u_2, u_4, u_5} {𝓐 : Type u_1}
  {𝓨 : Type u_2} {m𝓐 : MeasurableSpace 𝓐} {m𝓨 : MeasurableSpace 𝓨}
  {Ω : Type u_4} {Ω' : Type u_5} {mΩ : MeasurableSpace Ω}
  {mΩ' : MeasurableSpace Ω'} {alg : Algorithm 𝓐 𝓨}
  {env : Environment 𝓐 𝓨} {P : MeasureTheory.Measure Ω}
  [MeasureTheory.IsProbabilityMeasure P] {P' : MeasureTheory.Measure Ω'}
  [MeasureTheory.IsProbabilityMeasure P'] {A₁ : ℕ → Ω → 𝓐}
  {R₁ : ℕ → Ω → 𝓨} {A₂ : ℕ → Ω' → 𝓐} {R₂ : ℕ → Ω' → 𝓨}
  (h1 : IsAlgEnvSeq A₁ R₁ alg env P)
  (h2 : IsAlgEnvSeq A₂ R₂ alg env P') :
  MeasureTheory.Measure.map (trajectory A₁ R₁) P =
    MeasureTheory.Measure.map (trajectory A₂ R₂) P'
Learning.isAlgEnvSeq_unique.{u_1, u_2,
    u_4, u_5}
  {𝓐 : Type u_1} {𝓨 : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐}
  {m𝓨 : MeasurableSpace 𝓨} {Ω : Type u_4}
  {Ω' : Type u_5} {mΩ : MeasurableSpace Ω}
  {mΩ' : MeasurableSpace Ω'}
  {alg : Algorithm 𝓐 𝓨}
  {env : Environment 𝓐 𝓨}
  {P : MeasureTheory.Measure Ω}
  [MeasureTheory.IsProbabilityMeasure P]
  {P' : MeasureTheory.Measure Ω'}
  [MeasureTheory.IsProbabilityMeasure P']
  {A₁ : ℕ → Ω → 𝓐} {R₁ : ℕ → Ω → 𝓨}
  {A₂ : ℕ → Ω' → 𝓐} {R₂ : ℕ → Ω' → 𝓨}
  (h1 : IsAlgEnvSeq A₁ R₁ alg env P)
  (h2 : IsAlgEnvSeq A₂ R₂ alg env P') :
  MeasureTheory.Measure.map
      (trajectory A₁ R₁) P =
    MeasureTheory.Measure.map
      (trajectory A₂ R₂) P'

The law of the sequence of actions and observations generated by an algorithm-environment pair is unique: it does not depend on the probability space used.

4.3. Example: the UCB algorithm and a stochastic bandit environment🔗

We now illustrate the use of Algorithm, Environment, and IsAlgEnvSeq by defining the UCB bandit algorithm, a bandit environment, and stating a theorem about the regret of UCB in that environment.

In a stochastic bandit, an algorithm chooses at each time an action from a finite set (here Fin K, the type of natural numbers less than K) and receives a reward drawn from a distribution that depends only on the action, not on the prior history.

The environment is thus simply stationaryEnv ν for some kernel ν : Kernel (Fin K) ℝ.

4.3.1. Algorithm🔗

The UCB algorithm chooses at time n + 1 the action that maximizes the sum of the empirical mean reward and an exploration bonus. It starts by choosing each action once and then chooses \arg\max_a (\hat{\mu}_{n,a} + \sqrt{\frac{2c \log (n + 2)}{N_{n,a}}}), in which \hat{\mu}_{n,a} is the empirical mean reward of action a at time n (empMean' in the code), N_{n,a} is the number of times action a has been chosen up to time n (pullCount' in the code), and c is a parameter of the algorithm.

To define the algorithm, we first define the exploration bonus and the next action function, and then we use detAlgorithm to build the algorithm. We also need to prove that the next action function is measurable, which is done by the measurable_nextArm lemma. Note that we are careful to use a measurable version of the argmax function, argmax.

def

Bandits.ucbWidth' {K : ℕ} (c : ℝ) (n : ℕ)
  (h : ↥(Finset.Iic n) → Fin K × ℝ) (a : Fin K) : ℝ
Bandits.ucbWidth' {K : ℕ} (c : ℝ) (n : ℕ)
  (h : ↥(Finset.Iic n) → Fin K × ℝ)
  (a : Fin K) : ℝ

The exploration bonus of the UCB algorithm, which corresponds to the width of a confidence interval.

def

Bandits.UCB.nextArm {K : ℕ} (hK : 0 < K) (c : ℝ) (n : ℕ)
  (h : ↥(Finset.Iic n) → Fin K × ℝ) : Fin K
Bandits.UCB.nextArm {K : ℕ} (hK : 0 < K)
  (c : ℝ) (n : ℕ)
  (h : ↥(Finset.Iic n) → Fin K × ℝ) :
  Fin K

Arm pulled by the UCB algorithm at time n + 1.

theorem

Bandits.UCB.measurable_nextArm {K : ℕ} (hK : 0 < K) (c : ℝ) (n : ℕ) :
  Measurable (Bandits.UCB.nextArm hK c n)
Bandits.UCB.measurable_nextArm {K : ℕ}
  (hK : 0 < K) (c : ℝ) (n : ℕ) :
  Measurable (Bandits.UCB.nextArm hK c n)

def

Bandits.ucbAlgorithm {K : ℕ} (hK : 0 < K) (c : ℝ) : Algorithm (Fin K) ℝ
Bandits.ucbAlgorithm {K : ℕ} (hK : 0 < K)
  (c : ℝ) : Algorithm (Fin K) ℝ

The UCB algorithm.

The last line builds the algorithm using detAlgorithm and the function UCB.nextArm. Its measurability is proved by the fun_prop tactic, which proves measurability of functions by using lemmas tagged with @[fun_prop]. The last argument ⟨0, hK⟩ is the first action of the algorithm, which is 0 as an element of Fin K.

4.3.2. A theorem about UCB🔗

We can now state a theorem about the regret of UCB in a stochastic bandit environment (which we won't prove here). Let's first define the regret, which for stochastic bandits is the difference between the mean rewar that the algorithm would have obtained if it played always the best action, and the sum of mean rewards of the actions played.

def

Bandits.regret.{u_1, u_2} {𝓐 : Type u_1} {Ω : Type u_2}
  {m𝓐 : MeasurableSpace 𝓐} (ν : ProbabilityTheory.Kernel 𝓐 ℝ)
  (A : ℕ → Ω → 𝓐) (t : ℕ) (ω : Ω) : ℝ
Bandits.regret.{u_1, u_2} {𝓐 : Type u_1}
  {Ω : Type u_2} {m𝓐 : MeasurableSpace 𝓐}
  (ν : ProbabilityTheory.Kernel 𝓐 ℝ)
  (A : ℕ → Ω → 𝓐) (t : ℕ) (ω : Ω) : ℝ

Regret of a sequence of pulls k : ℕ → 𝓐 at time t for the reward kernel ν ; Kernel 𝓐 ℝ.

The quantity (ν a)[id] is the mean reward of action a in the environment defined by ν.

We can now state the regret bound for UCB.

theorem

Bandits.UCB.regret_le.{u_1} {K : ℕ} {hK : 0 < K} {c : ℝ}
  {ν : ProbabilityTheory.Kernel (Fin K) ℝ}
  [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1}
  {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω}
  [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K}
  {R : ℕ → Ω → ℝ} {σ2 : NNReal}
  (h :
    IsAlgEnvSeq A R (Bandits.ucbAlgorithm hK (c * ↑σ2))
      (stationaryEnv ν) P)
  (hν :
    ∀ (a : Fin K),
      ProbabilityTheory.HasSubgaussianMGF
        (fun x ↦ x - ∫ (x : ℝ), id x ∂ν a) σ2 (ν a))
  (hσ2 : σ2 ≠ 0) (hc : 0 < c) (n : ℕ) :
  ∫ (x : Ω), Bandits.regret ν A n x ∂P ≤
    ∑ a,
      (8 * c * ↑σ2 * Real.log (↑n + 1) / Bandits.gap ν a +
        Bandits.gap ν a * (2 + 2 * (Bandits.UCB.constSum c n).toReal))
Bandits.UCB.regret_le.{u_1} {K : ℕ}
  {hK : 0 < K} {c : ℝ}
  {ν : ProbabilityTheory.Kernel (Fin K) ℝ}
  [ProbabilityTheory.IsMarkovKernel ν]
  {Ω : Type u_1} {mΩ : MeasurableSpace Ω}
  {P : MeasureTheory.Measure Ω}
  [MeasureTheory.IsProbabilityMeasure P]
  {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ}
  {σ2 : NNReal}
  (h :
    IsAlgEnvSeq A R
      (Bandits.ucbAlgorithm hK (c * ↑σ2))
      (stationaryEnv ν) P)
  (hν :
    ∀ (a : Fin K),
      ProbabilityTheory.HasSubgaussianMGF
        (fun x ↦ x - ∫ (x : ℝ), id x ∂ν a)
        σ2 (ν a))
  (hσ2 : σ2 ≠ 0) (hc : 0 < c) (n : ℕ) :
  ∫ (x : Ω), Bandits.regret ν A n x ∂P ≤
    ∑ a,
      (8 * c * ↑σ2 * Real.log (↑n + 1) /
          Bandits.gap ν a +
        Bandits.gap ν a *
          (2 +
            2 *
              (Bandits.UCB.constSum c
                  n).toReal))

Regret bound for the UCB algorithm.

The arguments of the theorem are the following:

h states that the sequence of actions and rewards we are considering is generated by the interaction of UCB with parameter c * σ2 with the stationary environment defined by ν.
hν states that the reward distribution of each arm is subgaussian with variance proxy σ2.
hσ2 and hc state that the parameters σ2 and c are positive.
n is the time horizon.

The theorem gives an upper bound on the expected regret of UCB at time n.

4.4. Building vs analyzing algorithms🔗

When building an algorithm, we describe it with functions from the history (Iic n → 𝓐 × R) to the action space 𝓐. Thus, to construct UCB, we used the following empirical mean function.

def

Learning.empMean'.{u_1} {𝓐 : Type u_1} [DecidableEq 𝓐] (n : ℕ)
  (h : ↥(Finset.Iic n) → 𝓐 × ℝ) (a : 𝓐) : ℝ
Learning.empMean'.{u_1} {𝓐 : Type u_1}
  [DecidableEq 𝓐] (n : ℕ)
  (h : ↥(Finset.Iic n) → 𝓐 × ℝ) (a : 𝓐) :
  ℝ

Empirical mean of arm a at time n.

When analyzing an algorithm, we work with sequences of actions and rewards A : ℕ → Ω → 𝓐 and R' : ℕ → Ω → R that satisfy IsAlgEnvSeq. For the analysis, the empirical mean is defined as a stochastic process on the same probability space Ω.

def

Learning.empMean.{u_1, u_3} {𝓐 : Type u_1} {Ω : Type u_3}
  [DecidableEq 𝓐] (A : ℕ → Ω → 𝓐) (R' : ℕ → Ω → ℝ) (a : 𝓐) (t : ℕ)
  (ω : Ω) : ℝ
Learning.empMean.{u_1, u_3} {𝓐 : Type u_1}
  {Ω : Type u_3} [DecidableEq 𝓐]
  (A : ℕ → Ω → 𝓐) (R' : ℕ → Ω → ℝ) (a : 𝓐)
  (t : ℕ) (ω : Ω) : ℝ

Empirical mean reward obtained when pulling action a up to time t (exclusive).

empMean A R' a is a stochastic process with type ℕ → Ω → ℝ that gives the empirical mean of action a at each time.