miguelgondu's blog

Translating measure theory into probability

This post discusses how measure theory formalizes the operations that we usually do in probability (computing the probability of an event, getting the distributions of random variables...). With it, I hope to establish a translation table between abstract concepts (like measurable spaces and $\sigma$-additive measures) and their respective notions in probability. I hope to make these concepts more tangible by using a running example: counting cars and the Poisson distribution. Feel free to skip to the end for a quick summary.

A running example: counting cars

Let's say you and your friends want to start a drive-through business. In order to allocate work shifts or to predict your daily revenue, it would be great to estimate how many cars will go through your street in a fixed interval of time (say, Mondays between 2pm and 4pm). Let's call this amount $X$. $X$ is what we call a random variable: there is a stochastic, non-deterministic experiment going on out there in the world, and we want to be able to say things about it. We want to be able to answer questions about events that relate to $X$, questions like: how likely is it that we will get more than 50 cars in said time interval?

Probability theory sets up a formalism for answering these questions, and it does so by using a branch of mathematics called measure theory. I'll use this running example to make the abstract concepts that we'll face more grounded and real.

Formalizing events

In our scenario, we have several possible events (seeing between 10 and 20 cars, getting less than 5 customers...), and the theory that we will be establishing will allow us to say how likely each one of these events is.

But first, what do we expect of events? If $A$ and $B$ are events, we would expect $A \land B$ and $A \lor B$ to be possible events (seeing 5 cars go by and/or having 3 takeout orders), and we would also expect $\text{not}\, A$ to be an event. Collections of sets that are closed under these three operations (intersection, union and complement) are called $\sigma$-algebras:

Definition: A $\sigma$-algebra over a set $\Omega$ is a collection $\mathcal{F}$ of subsets of $\Omega$ (called events) that satisfies:

  1. Both $\Omega$ and the empty set $\varnothing$ are in $\mathcal{F}$.
  2. $\mathcal{F}$ is closed under (countable) unions, intersections and complements.

The pair $(\Omega, \mathcal{F})$ is called a measurable space.1

You can think of $\Omega$ as the set containing all possible outcomes $\omega$ of your experiment. In the context of e.g. coin tossing, $\Omega = \{\text{heads}, \text{tails}\}$, and in our example of cars going through your street, $\Omega = \{\text{seeing no cars}, \text{seeing 1 car}, \dots\}$. We have two special events (that are always included, according to the definition): $\Omega$ and $\varnothing$. You can think of them as tokens for absolute certainty and for impossibility, respectively.
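
To make the definition concrete, here is a small sketch in Python (with a hypothetical helper name, `is_sigma_algebra`) that checks the axioms for a finite $\Omega$, where countable unions reduce to finite ones:

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a family of subsets of a finite omega.

    In the finite case, countable unions reduce to finite ones, so it is enough
    to check closure under pairwise union, intersection and complement.
    """
    fam = {frozenset(s) for s in family}
    if frozenset() not in fam or frozenset(omega) not in fam:
        return False  # both Omega and the empty set must be events
    for a in fam:
        if frozenset(omega) - a not in fam:  # closed under complements
            return False
    for a, b in combinations(fam, 2):
        if a | b not in fam or a & b not in fam:  # closed under unions/intersections
            return False
    return True

# The coin-tossing example: Omega = {heads, tails}, with the power set as F.
omega = {"heads", "tails"}
power_set = [set(), {"heads"}, {"tails"}, omega]

print(is_sigma_algebra(omega, power_set))            # True
print(is_sigma_algebra(omega, [set(), {"heads"}]))   # False: Omega is missing
```

For small $\Omega$ the power set is always a valid choice of $\mathcal{F}$, which is exactly the discrete measurable space discussed next.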

There are two measurable spaces that I would like to discuss, both because they are examples of this definition, and because they come up when explaining some of the concepts that come next. The first one is the discrete measurable space (given by the naturals $\mathbb{N}$ and the $\sigma$-algebra of all possible subsets $\mathcal{P}(\mathbb{N})$), and the second one is the set of real numbers $\mathbb{R}$ with the Borel sets $\mathcal{B}$. You can think of the Borel $\sigma$-algebra as the smallest one that contains all open intervals $(a,b)$, their unions and intersections.2 The definition of $\mathcal{B}$ might seem arbitrary for now, but it will make more sense once we introduce random variables.

Notice that, in our running example, we are using the discrete measurable space (by identifying $\text{seeing no cars} = 0$, $\text{seeing 1 car} = 1$ and so on).

Formalizing probability

The probability of an event $E\in\mathcal{F}$ (for example $E = \{\text{seeing between 10 and 20 cars}\}$) is formalized using the notion of a measure. Let's start with the definition:

Definition: A $\sigma$-additive measure (or just measure) on a measurable space $(\Omega, \mathcal{F})$ is a function $\mu\colon\mathcal{F}\to[0, \infty]$ such that, if $\{A_i\}_{i=1}^\infty$ is a family of pairwise disjoint sets, then

$$\mu\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \mu(A_i).$$

In summary, we expect measures to be non-negative and to treat disjoint sets additively. Notice how this generalizes the idea of, say, volume: if two solids $A$ and $B$ in $\mathbb{R}^3$ are disjoint, we would expect the volume of $A\cup B$ to be the sum of their volumes.

In the previous section we talked about two measurable spaces; let's discuss the usual measures they have:

  • For $(\mathbb{R}, \mathcal{B})$, we should be specifying the length of open intervals $I = (a, b)$. Our intuition says we should be defining
$$\mu(I) = b - a.$$

This measure can be extended to arbitrary sets in $\mathcal{B}$, but how this is done is a story for another time.

  • For $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$, we need to specify the size of any possible subset $A$ of $\mathbb{N}$. The standard measure that we define in discrete measurable spaces (including finite ones) is the counting measure, literally given by counting how many elements are in $A$:
$$\mu(A) = \#(A) = \text{number of elements in }A.$$
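
The counting measure and its $\sigma$-additivity on (finitely many) pairwise disjoint sets can be sketched in a couple of lines of Python; a toy illustration, not a general implementation:

```python
def counting_measure(A):
    """The counting measure on (N, P(N)): mu(A) = number of elements of A."""
    return len(A)

# sigma-additivity on pairwise disjoint sets:
# mu(A ∪ B ∪ C) should equal mu(A) + mu(B) + mu(C).
A, B, C = {0, 1}, {2, 3, 4}, {7}
print(counting_measure(A | B | C))                                    # 6
print(counting_measure(A) + counting_measure(B) + counting_measure(C))  # 6
```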

There's a key difference between sizes and probabilities, though: we assume probabilities to be bounded.3 We expect events with probability 1 to be absolutely certain. Adding this restriction we get the following definition:

Definition: a probability space is a triplet $(\Omega, \mathcal{F}, \text{Prob})$ such that

  • $(\Omega, \mathcal{F})$ is a measurable space.
  • $\text{Prob}$ is a $\sigma$-additive measure that satisfies $\text{Prob}(\Omega) = 1$.4

The measure $\text{Prob}$ in a probability space satisfies all three of Kolmogorov's axioms, which attempt to formalize our intuitive notions of probability.
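
For a concrete instance of $\text{Prob}(\Omega) = 1$: the Poisson distribution from our running example assigns a mass to each natural number, and those masses sum to 1 over all of $\Omega$. A quick numerical check, truncating the infinite sum (the rate $\lambda = 4$ is an arbitrary choice for illustration):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Mass that the Poisson distribution with rate lam assigns to the outcome x."""
    return exp(-lam) * lam**x / factorial(x)

lam = 4.0
# Prob(Omega) = sum over all outcomes 0, 1, 2, ...; the tail beyond 100
# is negligibly small for this rate, so truncating there is enough.
total = sum(poisson_pmf(x, lam) for x in range(100))
print(abs(total - 1.0) < 1e-12)  # True: the masses add up to 1
```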

Formalizing random variables

Now we have a probability measure $\text{Prob}$ that allows us to measure the probability of events $E\in\mathcal{F}$. How can we link this to experiments that are running somewhere outside in the world?

The outcomes of experiments are measured using random variables. In our example, $X$ takes an outcome $\omega \in \Omega$ and associates it with a number $X(\omega)\in\mathbb{R}$. We have other examples, like $Y(\omega) =$ the number of customers after seeing $\omega$ cars, or $Z(\omega) =$ the revenue after seeing $\omega$ cars.

This association is formalized using measurable functions.

Definition: Let $(\Omega, \mathcal{F})$ and $(\Theta, \mathcal{G})$ be two measurable spaces. $f\colon\Omega\to\Theta$ is a measurable function if for all $B\in \mathcal{G}$, $f^{-1}(B) \in \mathcal{F}$. In other words, if the inverse image of a measurable set in $(\Theta, \mathcal{G})$ is a measurable set in $(\Omega, \mathcal{F})$.

This condition ($f^{-1}(B)\in\mathcal{F}$) says that it makes sense to query for events of the type $\{\omega\in\Omega\colon f(\omega) \in B\}$, since they will always be measurable sets (if $B$ is a measurable set in $(\Theta, \mathcal{G})$).

As we were discussing, the output of a random variable is a real number. This means that random variables are a particular kind of measurable function:

Definition: Let $(\Theta, \mathcal{G})$ be either the real numbers with Borel sets, or the discrete measurable space. A function $X\colon(\Omega, \mathcal{F})\to(\Theta, \mathcal{G})$ is a random variable if $X^{-1}(B)\in\mathcal{F}$ for all measurable sets $B\in\mathcal{G}$.

People usually make the distinction between continuous and discrete random variables (depending on whether $\Theta$ is $\mathbb{R}$ or $\mathbb{N}$, respectively). Thankfully, measure theory allows us to treat these two cases using the same symbols, as we will see when we discuss integration later on.

I still think this definition isn't completely transparent, because "Borel sets" sounds abstract. Remember that Borel sets are built from open intervals in $\mathbb{R}$ by taking (countable) unions, intersections and complements. Since the inverse image of a set is very well behaved with respect to unions, intersections and complements, it suffices to consider $B$ to be an interval. An alternative (and maybe more transparent) definition of a random variable is usually given as

Definition: Let $(\Theta, \mathcal{G})$ be either the real numbers with Borel sets, or the discrete measurable space. A function $X\colon(\Omega, \mathcal{F})\to(\Theta, \mathcal{G})$ is a random variable if $X^{-1}\left((-\infty, a)\right)\in\mathcal{F}$ for all $a\in\mathbb{R}$. That is, if the sets $\{\omega\in\Omega\colon X(\omega) < a\}$ are events for all $a\in \mathbb{R}$.

Now we are talking! In summary, random variables are formalized as measurable functions: functions because they associate outcomes $\omega$ with real numbers (or only natural numbers), and measurable because we want sets like $\{X(\omega) < a\}$ (e.g. seeing less than 10 cars) to have a meaningful probability.

The distribution of a random variable

Now let's talk about the distribution of a random variable. You might have heard things like "this variable is normally distributed" or "that variable follows the binomial distribution", and you might have run computations using this fact and the densities of these distributions. All of this is formalized as follows:

Definition: Let $(\Theta, \mathcal{G}, \mu)$ be either the real numbers with Borel sets, or the discrete measurable space. Any random variable $X\colon(\Omega, \mathcal{F}, \text{Prob})\to(\Theta, \mathcal{G}, \mu)$ induces a probability measure $P_X\colon\mathcal{G}\to[0,\infty)$ on $\Theta$ given by

$$P_X(A) = \text{Prob}(X\in A) = \text{Prob}(X^{-1}(A)).$$

$P_X$ is called the distribution of $X$.

In other words, the distribution of a random variable is a way of computing probabilities of events. If we have an event like $E = \{\text{seeing between 10 and 20 cars}\}$, we can compute its probability using our random variable:

$$\text{Prob}(E) = P_X(\{10, 11, \dots, 20\}),$$

and we already have plenty of probability distributions $P_X$ that model certain phenomena in the world. In the case of counting cars, people use the Poisson distribution. But how can we go from $P_X$ to an actual number? For that, we must rely on integration with respect to measures and on densities of distributions.

But before going through with these two topics, I want to define a concept that we see frequently in probability. Using the distribution $P_X\colon\mathcal{G}\to[0, \infty)$ we can define the cumulative distribution function as

$$P_X(x) = P_X\left((-\infty, x]\right) = \text{Prob}(X \leq x).$$

As you can see, we use the same symbol ($P_X$) to refer to a completely different function (in this case from $\Theta$ to $\mathbb{R}$). In many ways, this function defines the distribution of $X$, and we will see why after discussing integration and densities.
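
To see this function in action in the discrete case, here is a sketch that builds $\text{Prob}(X \leq x)$ by accumulating the masses of the Poisson distribution (whose formula appears later in the post; the rate $\lambda = 4$ is an arbitrary choice):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Mass that the Poisson distribution with rate lam assigns to the outcome k."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(x, lam):
    """Prob(X <= x): accumulate the mass of every outcome 0, 1, ..., x."""
    return sum(poisson_pmf(k, lam) for k in range(int(x) + 1))

lam = 4.0
print(poisson_cdf(3, lam))   # Prob of seeing at most 3 cars
print(poisson_cdf(50, lam))  # essentially 1: almost all the mass lies below 50
```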

An interlude: integration

Remember that the Riemann integral measures the area below a curve given by an integrable function $f\colon[a,b]\to\mathbb{R}$. This area is given by

$$\int_a^bf(x)\,\mathrm{d}x = \lim_{n\to\infty}\sum_{i=0}^{n-1} f(c_i)(x_{i+1} - x_i),$$

where we are partitioning the $[a,b]$ interval into $n$ segments of length $x_{i+1} - x_i$, and selecting $c_i\in[x_i, x_{i+1}]$.

Being handwavy,5 there is a way of defining an integral with respect to arbitrary measures $\mu$, and it looks like this:

$$\int_A f\,\mathrm{d}\mu \,\,``=" \sum_{E_i}f(c_i)\mu(E_i),$$

where $c_i\in E_i$ and $A$ is the disjoint union of all the sets $E_i$. Notice how this relates to the Riemann sums: we are measuring abstract lengths by replacing $x_{i+1} - x_i$ with $\mu(E_i)$. This is called the Lebesgue integral of $f$, and it extends Riemann integration beyond intervals and into more arbitrary measurable spaces, including probability spaces.

Let's discuss what this integral looks like when we integrate with respect to the measures that we defined for $(\mathbb{R}, \mathcal{B})$ and $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$:

  • Notice that when using the elementary measure $\mu((a, b)) = b-a$, we end up computing a number that coincides with the Riemann integral. Properly defining the Lebesgue integral (and checking that it matches when a function $f$ is both Riemann and Lebesgue integrable) would require some work, so I leave it for future posts.5
  • If we are in the discrete measurable space with the counting measure $\mu = \#\colon\mathcal{P}(\mathbb{N})\to[0,\infty]$, this integral takes a particular form: if $A = \{a_1, \dots, a_n\}$ is a subset of $\mathbb{N}$, we can easily decompose it into the following pairwise-disjoint sets: $E_1 = \{a_1\}$, $E_2 = \{a_2\}$ and so on... The resulting integral would look like
$$\int_A f\,\mathrm{d}\mu = \sum_{i=1}^n f(a_i).$$

This means that integrating with respect to the counting measure is just our everyday addition!
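
The "integrating with respect to the counting measure is just addition" claim can be sketched directly (`integrate_counting` is a hypothetical helper name, not a library function):

```python
def integrate_counting(f, A):
    """Lebesgue integral of f over A with respect to the counting measure.

    Each singleton {a} has measure 1, so the integral collapses to a plain sum.
    """
    return sum(f(a) for a in A)

# Integrating f(x) = x^2 over A = {1, 2, 3} is just 1 + 4 + 9.
print(integrate_counting(lambda x: x**2, {1, 2, 3}))  # 14
```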

Circling back to random variables: if we have a random variable $X$ and its distribution $P_X$, we can consider the integral of any6 real function $f$ over an event $A\in\mathcal{B}$ with respect to $P_X$:

$$\int_A f\,\mathrm{d}P_X\,\,``=" \sum_{A_i}f(c_i)P_X(A_i),$$

and if $f \equiv 1$:

$$\int_A \mathrm{d}P_X = P_X(A) = \text{Prob}(X\in A).$$

The density of a random variable

Let's summarize what we have discussed so far: we defined events as sets $E\subseteq \Omega$ in a $\sigma$-algebra $\mathcal{F}$, and we defined the probability of an event using $\text{Prob}$, a measure on $(\Omega, \mathcal{F})$ that is non-negative, normalized and $\sigma$-additive.

We also considered random variables as measurable functions $X\colon (\Omega, \mathcal{F}, \text{Prob})\to(\Theta, \mathcal{G}, \mu)$, where $\Theta$ is either the real numbers with Borel sets and the Lebesgue measure (for continuous random variables) or the natural numbers with all subsets and the counting measure (for discrete random variables). $X$ induces a measure on $(\Theta, \mathcal{G}, \mu)$ given by $P_X(A) = \text{Prob}(X\in A)$. This measure is the distribution of $X$, and it also defines the cumulative distribution function $P_X(x) = \text{Prob}(X \leq x)$.

We are still wondering how to compute $P_X(A) = \text{Prob}(X\in A)$, but we noticed that $\int_A\mathrm{d}P_X = P_X(A)$. The density of $X$ allows us to compute this integral:

Definition: Let $(\Theta, \mathcal{G}, \mu)$ be either the real numbers with Borel sets, or the discrete measurable space. A function $p_X\colon\Theta\to\mathbb{R}$ that satisfies $P_X(A) = \int_A \mathrm{d}P_X = \int_A p_X\,\mathrm{d}\mu$ is called the density of $X$ with respect to $\mu$. If $\Theta$ is the discrete measurable space, $p_X$ is usually called the mass of $X$.

Let's see how this definition plays out in our particular example. We know that $\Theta = \mathbb{N}$ and that $\mu$ is the counting measure, and it is well known7 that our variable $X$ (the number of cars that go by in an interval of time) follows the Poisson distribution. This means that8

$$p_X(x;\, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!},$$
$$\text{Prob}(10 \leq X \leq 20) = \int_{\{10, \dots, 20\}}\mathrm{d}P_X = \int_{\{10, \dots, 20\}}p_X\,\mathrm{d}\mu = \sum_{x=10}^{20}p_X(x;\, \lambda) = \sum_{x=10}^{20}\frac{e^{-\lambda}\lambda^x}{x!},$$

and this is a number that we can actually compute after specifying $\lambda$. A good question is: how do we know the actual $\lambda$ that makes this Poisson distribution describe the random process of cars going through our street? Estimating $\lambda$ from data is called inference, but that is a topic for another time.
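
Under the (purely illustrative) assumption that we expect around $\lambda = 15$ cars in our Monday time slot, the sum above is straightforward to evaluate, e.g. in Python:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Density (mass) of the Poisson distribution with rate lam."""
    return exp(-lam) * lam**x / factorial(x)

# Prob(10 <= X <= 20) = integral of p_X over {10, ..., 20} with respect to
# the counting measure, i.e. a plain sum of the masses.
lam = 15.0  # hypothetical rate chosen for illustration
prob = sum(poisson_pmf(x, lam) for x in range(10, 21))
print(prob)  # roughly 0.85 for this rate
```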

One final note regarding densities: if the cumulative distribution function $P_X(x) = \text{Prob}(X\leq x)$ is differentiable, you can reconstruct the density by taking the derivative: $p_X(x) = P_X'(x)$. Using the fundamental theorem of calculus, we realize that we can easily compute probabilities on intervals:

$$\text{Prob}(a\leq X \leq b) = \int_a^b\mathrm{d}P_X = P_X(b) - P_X(a).$$

Conclusion: A translation table

In this post we discussed how some concepts from probability theory are formalized using measure theory, ending up with this translation table:

Probability → Measure theory:

  • Event: $E\in\mathcal{F}$, where $(\Omega, \mathcal{F})$ is a measurable space.
  • Probability: a measure $\text{Prob}\colon \mathcal{F}\to[0,\infty)$ that satisfies $\text{Prob}(\Omega) = 1$.
  • Random variable: a measurable function $X\colon(\Omega, \mathcal{F})\to (\Theta, \mathcal{G})$, where $\Theta$ is either $\mathbb{R}$ or $\mathbb{N}$.
  • Distribution: the measure $P_X(A) = \text{Prob}(X\in A) = \text{Prob}(X^{-1}(A))$.
  • Cumulative distribution: the induced function $P_X\colon\Theta\to\mathbb{R}$ given by $P_X(x) = \text{Prob}(X\leq x)$.
  • Density (or mass): a function $p_X\colon\Theta\to\mathbb{R}$ that satisfies $P_X(A) = \int_A p_X\,\mathrm{d}\mu$.

References


  1. This definition is not standard. It is enough to say that $\mathcal{F}$ contains $\Omega$ and that it is closed under countable unions and complements.
  2. The technical definition is that the Borel sets are the $\sigma$-algebra generated by the standard topology on $\mathbb{R}$. You don't have to worry about what any of those words mean for now, but you can think of a topology on a set as defining what the "open intervals" should look like. The fact that Borel sets are defined this way allows all continuous functions on $\mathbb{R}$ to be measurable. See more in the references.
  3. Not really: there are formalizations of probability that ditch the normalization axiom. There are also formalizations that skip $\sigma$-additivity or positivity. These, however, belong more to the realm of philosophy of probability than to what mathematicians and statisticians use in their daily practice. The SEP has a great entry on it, in case you want to read more.
  4. Notice that $\sigma$-additivity implies that $\text{Prob}(\Omega) = \text{Prob}(\Omega \cup \varnothing) = \text{Prob}(\Omega) + \text{Prob}(\varnothing)$, which means that $\text{Prob}(\varnothing) = 0$.
  5. If you want a formal treatment of integration in Measure Theory, check the references.
  6. This any is a stretch. The functions have to be integrable with respect to $P_X$ (i.e. $\int_A f\,\mathrm{d}P_X$ should be a finite number).
  7. The Poisson distribution can be thought of as a limit of the Binomial. But there's also a different way to derive the fact that said $p_X$ relates to counting things in fixed intervals of time. Check this.
  8. The integral $\int_A\mathrm{d}P_X$ is transformed into a sum because $\mu$ is the counting measure on $\mathbb{N}$. Remember that measure theory allows us to treat probabilities of events in the continuous and discrete setting using the same symbols.