This post discusses how measure theory formalizes the operations that we usually do in probability (computing the probability of an event, getting the distributions of random variables...). With it, I hope to establish a translation table between abstract concepts (like measurable spaces and $\sigma$-additive measures) and their counterparts in probability, and to make these concepts more tangible by using a running example: counting cars and the Poisson distribution. Feel free to skip to the end for a quick summary.
Let's say you and your friends want to start a drive-through business. In order to allocate work shifts or to predict your daily revenue, it would be great to estimate how many cars will go through your street in a fixed interval of time (say, Mondays between 2pm and 4pm). Let's call this amount $N$. $N$ is what we call a random variable: there is a stochastic, non-deterministic experiment going on out there in the world, and we want to be able to say things about it. We want to be able to answer questions about events that relate to $N$, questions like: how likely is it that we will get more than 50 cars in said time interval?
Probability theory sets up a formalism for answering these questions, and it does so by using a branch of mathematics called measure theory. I'll use this running example to make the abstract concepts that we'll face more grounded and real.
In our scenario, we have several possible events (seeing between 10 and 20 cars, getting less than 5 customers...), and the theory that we will be establishing will allow us to say how likely each of these events is.
But first, what do we expect of events? If $A$ and $B$ are events, we would expect $A \cap B$ and $A \cup B$ to be possible events (seeing 5 cars go by and/or having 3 takeout orders), and we would also expect the complement $A^c$ to be an event. Collections of sets that are closed under these three operations (intersection, union and complement) are called $\sigma$-algebras:
Definition: A $\sigma$-algebra over a set $\Omega$ is a collection $\mathcal{F}$ of subsets of $\Omega$ (called events) that satisfies:

- $\Omega \in \mathcal{F}$,
- if $A \in \mathcal{F}$, then its complement $A^c = \Omega \setminus A$ is also in $\mathcal{F}$,
- if $A_1, A_2, \dots \in \mathcal{F}$, then the countable union $\bigcup_n A_n$ is also in $\mathcal{F}$.

(Closure under countable intersections follows from the last two properties by De Morgan's laws.)
The pair $(\Omega, \mathcal{F})$ is called a measurable space.1
You can think of $\Omega$ as the set containing all possible outcomes of your experiment. In the context of e.g. coin tossing, $\Omega = \{\text{heads}, \text{tails}\}$, and in our example of cars going through your street, $\Omega = \{\text{no cars}, \text{one car}, \text{two cars}, \dots\}$. We have two special events (that are always included, according to the definition): $\Omega$ and $\emptyset$. You can think of them as tokens for absolute certainty and for impossibility, respectively.
There are two measurable spaces that I would like to discuss, both because they are examples of this definition, and because they come up when explaining some of the concepts that come next. The first one is the discrete measurable space $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$, given by the naturals and the $\sigma$-algebra of all possible subsets, and the second one is the set of real numbers with Borel sets, $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. You can think of the Borel $\sigma$-algebra as the smallest one that contains all open intervals $(a, b)$, their unions and intersections.2 The definition of $\mathcal{B}(\mathbb{R})$ might seem arbitrary for now, but it will make more sense once we introduce random variables.
Notice that, in our running example, we are using the discrete measurable space (by identifying the outcome "no cars" with $0$, "one car" with $1$, and so on).
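To make the definition concrete, here is a small Python sketch (my own illustration, not part of the formal development) that builds the power set of a three-element $\Omega$ and checks the closure properties of a $\sigma$-algebra; in the finite case, countable unions reduce to finite ones.

```python
# Sketch: the power set of a finite Omega is a sigma-algebra (finite case, so
# countable unions are just finite unions). All names here are illustrative.
from itertools import chain, combinations

omega = frozenset({0, 1, 2})

def power_set(s):
    """All subsets of s, as frozensets."""
    return {frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

F = power_set(omega)

assert frozenset() in F and omega in F         # contains the empty set and Omega
assert all(omega - A in F for A in F)          # closed under complement
assert all(A | B in F for A in F for B in F)   # closed under union
print(len(F), "events in the sigma-algebra over", set(omega))
```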
The probability of an event (for example, the event of seeing more than 50 cars) is formalized using the notion of a measure. Let's start with the definition:
Definition: A $\sigma$-additive measure (or just measure) on a measurable space $(\Omega, \mathcal{F})$ is a function $\mu: \mathcal{F} \to [0, \infty]$ with $\mu(\emptyset) = 0$ such that, if $\{A_n\}_{n \in \mathbb{N}}$ is a family of pairwise disjoint sets, then

$$\mu\left(\bigcup_{n \in \mathbb{N}} A_n\right) = \sum_{n \in \mathbb{N}} \mu(A_n).$$
In summary, we expect measures to be non-negative and to treat disjoint sets additively. Notice how this generalizes the idea of, say, volume: if two solids $A$ and $B$ in $\mathbb{R}^3$ are disjoint, we would expect the volume of $A \cup B$ to be the sum of their volumes.
In the previous section we talked about two measurable spaces; let's discuss the usual measures they have:

- On the discrete measurable space $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$ we have the counting measure, which assigns to each subset the number of elements it contains: $\mu(A) = |A|$.
- On $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ we have the Lebesgue measure, which assigns to each interval its length: $\lambda((a, b)) = b - a$.
This last measure can be extended from intervals to arbitrary Borel sets in $\mathbb{R}$, but how this is done is a story for another time.
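As a quick illustration (the function names below are mine), both measures can be written down directly for simple sets, and additivity is easy to check on disjoint sets:

```python
# Sketch of the two "usual" measures: the counting measure on subsets of the naturals,
# and the length (Lebesgue measure) of a disjoint union of intervals.
def counting_measure(A):
    """mu(A) = number of elements of A, for a finite set A."""
    return len(A)

def length(intervals):
    """Lebesgue measure of a disjoint union of intervals [(a, b), ...]: the sum of lengths."""
    return sum(b - a for a, b in intervals)

A, B = {1, 2, 3}, {10, 11}    # disjoint subsets of the naturals
assert counting_measure(A | B) == counting_measure(A) + counting_measure(B)  # additivity

print(length([(0.0, 1.0), (2.0, 2.5)]))   # 1.5: lengths of disjoint intervals add up
```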
There's a key difference between sizes and probabilities, though: we assume probabilities to be bounded3. We expect events with probability 1 to be absolutely certain. Adding this restriction we get the following definition:
Definition: A probability space is a triplet $(\Omega, \mathcal{F}, P)$ such that

- $(\Omega, \mathcal{F})$ is a measurable space, and
- $P$ is a measure on $(\Omega, \mathcal{F})$ with $P(\Omega) = 1$.
The measure $P$ in a probability space satisfies all three of Kolmogorov's axioms (non-negativity, normalization $P(\Omega) = 1$, and $\sigma$-additivity), which attempt to formalize our intuitive notions of probability.
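For instance, we can check numerically that a Poisson distribution on the naturals (which we will meet again below) behaves like a probability measure; the rate $\lambda = 3$ here is just an arbitrary choice for the sketch.

```python
# Sketch: the Poisson masses are non-negative and sum to 1 (normalization), so they
# define a probability measure on the naturals. lam is an arbitrary illustrative rate.
from math import exp, factorial

lam = 3.0
mass = [exp(-lam) * lam**n / factorial(n) for n in range(100)]

assert all(p >= 0 for p in mass)   # non-negativity
print(sum(mass))                   # ~1.0: P(Omega) = 1
```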
Now we have a probability measure $P$ that allows us to measure the probability of events $A \in \mathcal{F}$. How can we link this to experiments that are running somewhere out there in the world?
The outcomes of experiments are measured using random variables. In our example, $N$ takes an outcome $\omega \in \Omega$ and associates it with a number $N(\omega) \in \mathbb{N}$. We have other examples, like the number of customers after seeing $n$ cars, or the revenue after seeing $n$ cars.
This association is formalized using measurable functions.
Definition: Let $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ be two measurable spaces. $f: \Omega_1 \to \Omega_2$ is a measurable function if for all $B \in \mathcal{F}_2$, $f^{-1}(B) \in \mathcal{F}_1$. In other words, if the inverse image of a measurable set in $\Omega_2$ is a measurable set in $\Omega_1$.
This condition ($f^{-1}(B) \in \mathcal{F}_1$) says that it makes sense to query for events of the type $\{\omega : f(\omega) \in B\} = f^{-1}(B)$, since they will always be measurable sets (if $B$ is a measurable set in $\Omega_2$).
As we were discussing, the output of a random variable is a real number. This means that random variables are a particular kind of measurable function:
Definition: Let $(E, \mathcal{E})$ be either the real numbers with Borel sets, or the discrete measurable space. A function $X: \Omega \to E$ is a random variable if $X^{-1}(B) \in \mathcal{F}$ for all measurable sets $B \in \mathcal{E}$.
People usually make the distinction between continuous and discrete random variables (depending on whether $E$ is $\mathbb{R}$ or $\mathbb{N}$, respectively). Thankfully, measure theory allows us to treat these two cases using the same symbols, as we will see when we discuss integration later on.
I still think this definition isn't completely transparent, because "Borel sets" sounds abstract. Remember that Borel sets are just (unions or complements of) open intervals in $\mathbb{R}$. Since the inverse image of a set is very well behaved with respect to unions, intersections and complements, it suffices to consider $B$ to be an interval. An alternative (and maybe more transparent) definition of a random variable is usually given as
Definition: Let $(E, \mathcal{E})$ be either the real numbers with Borel sets, or the discrete measurable space. A function $X: \Omega \to E$ is a random variable if $X^{-1}((-\infty, a]) \in \mathcal{F}$ for all $a \in \mathbb{R}$. That is, if the sets $\{\omega : X(\omega) \leq a\}$ are events for all $a \in \mathbb{R}$.
Now we are talking! In summary, random variables are formalized as measurable functions: functions because they associate outcomes with real numbers (or only natural numbers), and measurable because we want sets like $\{X \leq a\}$ (e.g. seeing fewer than 10 cars) to have a meaningful probability.
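Here is a toy sketch of this idea with a three-outcome $\Omega$ (the outcomes, probabilities and car counts below are all made up for illustration): the random variable is just a function on $\Omega$, and the sets $\{N \leq a\}$ are its preimages.

```python
# Sketch: a finite Omega, a probability P on it, a random variable N, and the
# preimage event {omega : N(omega) <= a}. All numbers are made up.
omega = ["quiet_monday", "normal_monday", "busy_monday"]
P = {"quiet_monday": 0.2, "normal_monday": 0.5, "busy_monday": 0.3}
N = {"quiet_monday": 3, "normal_monday": 12, "busy_monday": 40}   # N: Omega -> naturals

def preimage_leq(a):
    """The event N^{-1}((-inf, a]) = {omega : N(omega) <= a}."""
    return {w for w in omega if N[w] <= a}

event = preimage_leq(10)                 # "seeing at most 10 cars"
print(event, sum(P[w] for w in event))   # {'quiet_monday'} 0.2
```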
Now let's talk about the distribution of a random variable. You might have heard things like "this variable is normally distributed" or "that variable follows the binomial distribution", and you might have run computations using this fact, and the densities of these distributions. This is all formalized as follows:
Definition: Let $(E, \mathcal{E})$ be either the real numbers with Borel sets, or the discrete measurable space. Any random variable $X: \Omega \to E$ induces a probability measure $P_X$ on $(E, \mathcal{E})$ given by

$$P_X(B) = P(X^{-1}(B)) = P(\{\omega : X(\omega) \in B\}).$$
$P_X$ is called the distribution of $X$.
In other words, the distribution of a random variable is a way of computing probabilities of events. If we have an event like $B = \{n : n > 50\}$ (seeing more than 50 cars), we can compute its probability using our random variable:

$$P_N(B) = P(N^{-1}(B)) = P(\{\omega : N(\omega) > 50\}),$$
and we already have plenty of probability distributions that model certain phenomena in the world. In the case of counting cars, people use the Poisson distribution. But how can we go from $P_N(\{n : n > 50\})$ to an actual number? For that, we must rely on integration with respect to measures and on densities of distributions.
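Before moving on to integration, here is a small simulation sketch (toy numbers, mine) of this pushforward idea: we sample outcomes $\omega$ according to $P$, map them through $N$, and the empirical frequencies approximate the distribution $P_N$.

```python
# Sketch: sampling outcomes from (Omega, P) and pushing them through N approximates
# the distribution P_N. The outcomes, probabilities and car counts are made up.
import random

P = {"quiet_monday": 0.2, "normal_monday": 0.5, "busy_monday": 0.3}
N = {"quiet_monday": 3, "normal_monday": 12, "busy_monday": 40}

samples = random.choices(list(P), weights=list(P.values()), k=100_000)  # draws from Omega
values = [N[w] for w in samples]                                        # pushed through N

print(sum(v > 10 for v in values) / len(values))   # empirical P_N({n : n > 10})
print(P["normal_monday"] + P["busy_monday"])       # exact P({omega : N(omega) > 10}) = 0.8
```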
But before going through these two topics, I want to define a concept that we see frequently in probability. Using the distribution $P_X$ we can define the cumulative distribution function as

$$F_X(a) = P_X((-\infty, a]) = P(X \leq a).$$
As you can see, we use the same letter ($F$) to refer to a completely different function (in this case from $\mathbb{R}$ to $[0, 1]$). In many ways, this function determines the distribution of $X$, and we will see why after discussing integration and densities.
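To make this concrete, here is a sketch of the cumulative distribution function of a Poisson-distributed car count (the rate $\lambda = 30$ is an assumption made purely for illustration), using scipy for convenience:

```python
# Sketch: the CDF of a Poisson car count, F_N(a) = P(N <= a). The rate is illustrative.
from scipy.stats import poisson

lam = 30.0
F = lambda a: poisson.cdf(a, lam)   # F_N(a) = P(N <= a)

print(F(20), F(30), F(50))          # non-decreasing values approaching 1
print(1 - F(50))                    # P(N > 50), the question from the introduction
```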
Remember that the Riemann integral measures the area below a curve given by an integrable function $f: [a, b] \to \mathbb{R}$. This area is given by

$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i)\, \Delta x,$$

where we are partitioning the interval $[a, b]$ into segments of length $\Delta x = (b - a)/n$, and selecting a point $x_i$ inside each segment.
Being handwavy5, there is a way of defining an integral with respect to arbitrary measures $\mu$, and it looks like this:

$$\int_\Omega f\,d\mu \approx \sum_i f(x_i)\, \mu(A_i),$$

where $x_i \in A_i$ and $\Omega$ is the disjoint union of all the sets $A_i$. Notice how this relates to the Riemann sums: we are measuring abstract "lengths" by replacing $\Delta x$ with $\mu(A_i)$. This is called the Lebesgue integral of $f$, and it extends Riemann integration beyond intervals and into more arbitrary measurable spaces, including probability spaces.
Let's discuss what this integral looks like when we integrate with respect to the measures that we defined for $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$ and $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$:

- With the counting measure on $\mathbb{N}$, each point $\{n\}$ has measure $1$, and the integral becomes a sum: $\int_{\mathbb{N}} f\,d\mu = \sum_{n=0}^{\infty} f(n)$.
- With the Lebesgue measure on $\mathbb{R}$, the integral agrees with the Riemann integral whenever the latter exists: $\int_{[a, b]} f\,d\lambda = \int_a^b f(x)\,dx$.

This means that integrating with respect to the counting measure is just our everyday addition!
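As a sketch of this, integrating the function $f(n) = n$ against the Poisson masses is nothing more than a (truncated) sum, and it recovers the mean of the distribution; the rate $\lambda = 30$ is again just an illustrative choice.

```python
# Sketch: "integrating" f(n) = n with respect to the Poisson distribution's mass is a
# plain sum over the naturals (truncated here). This is the expected value of N.
from math import exp, factorial

lam = 30.0
mass = lambda n: exp(-lam) * lam**n / factorial(n)   # Poisson mass at n

integral = sum(n * mass(n) for n in range(150))      # a sum instead of an abstract integral
print(integral)                                      # ~30.0, the mean of the Poisson
```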
Circling back to random variables: if we have a random variable $X$ and its distribution $P_X$, we can consider the integral of any6 real function $g$ over an event $B$ with respect to $P_X$:

$$\int_B g\,dP_X,$$

and if $g \equiv 1$:

$$\int_B dP_X = P_X(B) = P(X \in B).$$
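In the discrete case this is again a sum. As a sketch (the revenue function and the rate are made-up assumptions), integrating a "revenue per car" function $g$ over the event $B = \{n : n > 50\}$ looks like this:

```python
# Sketch: integrating g over the event B = {n > 50} with respect to P_N, in the
# counting-measure case. The rate and the revenue function are made up.
from scipy.stats import poisson

lam = 30.0
B = range(51, 200)                                  # the event "more than 50 cars", truncated
g = lambda n: 5.0 * n                               # hypothetical revenue per car

print(sum(g(n) * poisson.pmf(n, lam) for n in B))   # integral of g over B w.r.t. P_N
print(sum(poisson.pmf(n, lam) for n in B))          # with g = 1 we recover P_N(B) = P(N > 50)
```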
Let's summarize what we have discussed so far: we have defined events as sets in a $\sigma$-algebra $\mathcal{F}$, and we defined the probability of an event $A$ as a measure $P$ on $(\Omega, \mathcal{F})$ that is non-negative, normalized and $\sigma$-additive.
We also considered random variables as measurable functions $X: \Omega \to E$, where $E$ is either the real numbers with Borel sets and Lebesgue measure (for continuous random variables) or the natural numbers with all subsets and the counting measure (for discrete random variables). $X$ induces a measure $P_X$ on $E$ given by $P_X(B) = P(X^{-1}(B))$. This measure is the distribution of $X$, and it also defines the cumulative distribution function $F_X(a) = P_X((-\infty, a])$.
We are still wondering how to compute $P(N > 50)$, but we noticed that $P(N > 50) = P_N(\{n : n > 50\}) = \int_{\{n > 50\}} dP_N$. The density of $N$ allows us to compute this integral:
Definition: Let $(E, \mathcal{E})$ be either the real numbers with Borel sets, or the discrete measurable space, and let $\mu$ be its usual measure (Lebesgue or counting, respectively). A function $f_X: E \to [0, \infty)$ that satisfies $P_X(B) = \int_B f_X\,d\mu$ for all measurable $B$ is called the density of $X$ with respect to $\mu$. If $(E, \mathcal{E})$ is the discrete measurable space, $f_X$ is usually called the mass of $X$.
Let's see how this definition plays out in our particular example. We know that $(E, \mathcal{E}) = (\mathbb{N}, \mathcal{P}(\mathbb{N}))$ and that $\mu$ is the counting measure, and it is well known7 that our variable $N$ (cars that go by in an interval of time) follows the Poisson distribution. This means that8

$$P(N > 50) = \int_{\{n > 50\}} f_N\,d\mu = \sum_{n=51}^{\infty} \frac{e^{-\lambda}\,\lambda^n}{n!},$$
and this is a number that we can actually compute after specifying $\lambda$. A good question is: how do we know the actual $\lambda$ that makes this Poisson distribution describe the random process of cars going through our street? Estimating $\lambda$ from data is called inference, but that is a topic for another time.
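If we pick a value for $\lambda$ (say $\lambda = 30$, purely as an assumption for illustration), the sum above becomes a number we can evaluate directly:

```python
# Sketch: evaluating P(N > 50) = sum_{n > 50} e^{-lam} lam^n / n! for an assumed rate.
from math import exp, factorial

lam = 30.0
p_more_than_50 = sum(exp(-lam) * lam**n / factorial(n) for n in range(51, 171))
print(p_more_than_50)   # small (~3e-4): more than 50 cars is unlikely when lam = 30
```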
One final note regarding densities: if the cumulative distribution function $F_X$ is differentiable, you can reconstruct the density by taking the derivative: $f_X = F_X'$. Using the fundamental theorem of calculus, we realize that we can easily compute probabilities on intervals:

$$P(a < X \leq b) = F_X(b) - F_X(a) = \int_a^b f_X(x)\,dx.$$
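A quick numerical sketch of this in the continuous case (the standard normal distribution here is just my choice of example): the numerical derivative of the CDF matches the density, and the integral of the density over an interval matches the difference of CDF values.

```python
# Sketch: for a continuous distribution, F' = f and P(a < X <= b) = F(b) - F(a) = integral of f.
from scipy.stats import norm
from scipy.integrate import quad

a, b, h = 0.5, 1.5, 1e-6
print((norm.cdf(a + h) - norm.cdf(a - h)) / (2 * h))   # numerical derivative of F at a
print(norm.pdf(a))                                     # the density f(a): they agree

area, _ = quad(norm.pdf, a, b)                         # integral of the density over (a, b]
print(area, norm.cdf(b) - norm.cdf(a))                 # both equal P(a < X <= b)
```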
In this post we discussed how some concepts from probability theory are formalized using measure theory, ending up with this translation table:
| Probability | Measure theory |
|---|---|
| Event | A set $A \in \mathcal{F}$, where $(\Omega, \mathcal{F})$ is a measurable space. |
| Probability | A measure $P$ on $(\Omega, \mathcal{F})$ that satisfies $P(\Omega) = 1$. |
| Random variable | A measurable function $X: \Omega \to E$, where $E$ is either $\mathbb{R}$ or $\mathbb{N}$. |
| Distribution | The induced measure $P_X(B) = P(X^{-1}(B))$. |
| Cumulative distribution | The induced function $F_X: \mathbb{R} \to [0, 1]$ given by $F_X(a) = P_X((-\infty, a])$. |
| Density (or mass) | A function $f_X$ that satisfies $P_X(B) = \int_B f_X\,d\mu$. |