This post provides an overview of what I learned in my first-semester probability course. The purpose of this post is to improve my own understanding and presentation of the material. The focus is on simplifying the content, so the treatment of the subject will not be rigorous. The article is intended for an undergraduate audience; I will provide intuition and refer interested readers to references.

Along with course lecture notes, the other material I used was from Resnick and Tabago, to get a simplified view of probability theory. Then I worked my way up to Cinlar, Durrett and Dembo. Ash's probability book is also a good book for beginners to follow. Those books provide greater detail and exposure to the topic and will do it more justice than I can in a single post.

The summary will be as follows:

  1. Measure Theory
    1. Measurable Spaces
    2. Random Variables
    3. Expectations
  2. Law of Large Numbers
    1. Weak Law
    2. Strong Law
  3. Central Limit Theorem
    1. Characteristic/Generating Functions
    2. Central Limit Theorem
  4. Markov Chains
  5. Martingales
  6. Poisson Processes
    Measure Theory:

    Measure theory provides a structure to formally develop probability theory

    Measure Spaces

    Suppose a researcher performs an experiment in which they know all the possible outcomes. This collection of outcomes/events is denoted by \(\Omega\) and is sometimes called the sample space. If the researcher can quantify the probability of an outcome, then they must surely know the probability of the event not happening (complement), as well as the probability of that event occurring in conjunction with others (intersection).

    A probability measure, \(\mathbb{P}\), is a mapping from events to a number between 0 and 1. It is perfectly fine for an outcome to have probability 0. Our researcher above would not handle probabilities of the outcomes in the sample space in isolation, because some events depend on others and they may be concerned with more than one event at a time. For this reason and others, a probability measure must take as input any combination of events. More specifically, the probability measure must work on sets that are the results of boolean operations (union, intersection, complement, ...). Therefore, it is not enough to enumerate all the outcomes in \(\Omega\); every combination of them must also be enumerated and stored in a collection denoted by \(\mathcal{F}\), which some may recognize as the power set. We call \(\mathcal{F}\) a \(\sigma\)-algebra, its main property being that it is closed under these boolean operations. So, formally, \(\mathbb{P}: \mathcal{F} \to [0, 1]\). (*Of course the \(\sigma\)-algebra need not be so large, but this definition is sufficient*)

    The tuple \((\Omega, \mathcal{F}, \mathbb{P})\) is called a probability (measure) space. Notice the generality of this framework: it applies to any experiment, whether discrete or not. As a matter of fact, we do not even have to think of probabilities in an experimental sense \(\frac{\text{number of successes}}{\text{number of trials}}\); any assignment that satisfies the above properties will work as a probability space for defining an experiment.
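
    To make the \((\Omega, \mathcal{F}, \mathbb{P})\) triple concrete, here is a minimal sketch in Python. The two-coin-flip experiment and the equal weighting of outcomes are my own made-up example, not something from the course: events are represented as subsets of a finite sample space, and the probability measure is a function on those subsets.

    ```python
    from itertools import product

    # Sample space: all outcomes of two fair coin flips (a made-up experiment).
    omega = set(product("HT", repeat=2))          # {('H','H'), ('H','T'), ...}

    def prob(event):
        """Probability measure: every outcome in omega is equally likely."""
        return len(event & omega) / len(omega)

    at_least_one_head = {w for w in omega if "H" in w}   # an event, i.e. a subset of omega
    both_tails = omega - at_least_one_head               # its complement

    print(prob(at_least_one_head))                 # 0.75
    print(prob(both_tails))                        # 0.25 = 1 - 0.75
    print(prob(at_least_one_head | both_tails))    # 1.0 for the union of complementary events
    ```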

    Side note: We have a collection closed under boolean operations, which can be thought of as very similar to a vector space being closed under linear operations. It also shares some similarities with the idea of a basis. For example, if the sample space can be partitioned into the "smallest" disjoint sets, then we get something similar to a "basis" of the sample space: the probability of the union of two such events is just the sum of their probabilities, taking the difference of two of them has no effect, and their intersections always have measure 0. It also makes computing expectations much easier.

    Random Variables

    A random variable \(X: \Omega \to \mathbb{R}\) is simply another way to represent the outcomes in \(\Omega\). Measure spaces are defined to be general; with random variables, however, the outcomes of interest are narrowly and explicitly stated. For example, a gambler may only be concerned with whether the difference in score of a game is less than 5, so they can let their random variable take the value 0 if the difference is at least 5 and 1 if the difference is less than 5, and assign probabilities to those two events.

    Let the probability density function (pdf) of \(X\) be \(f_X (x) := \mathbb{P}[\{\omega \in \Omega \vert X(\omega) = x\}]\), i.e. the probability measure of the events that make \(X=x\). The gambler may go about computing \(f_X (1)\) as the proportion of games that have been decided by fewer than 5 points during the season (*this is an estimate and not the true probability*). The point is that the gambler need not be too concerned with all the information in the underlying measure space after the random variable \(X\) is introduced. The random variable effectively reduces the sample space to the things we care about. The gambler's job is to pull back from \(\text{range}(X)\) to \(\Omega\) to correctly assign the probabilities, and therefore preimages under \(X\) must be events in \(\mathcal{F}\) (this is the measurability requirement on \(X\)).
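
    A rough sketch of that pull-back, with a made-up distribution over final score differences standing in for the underlying measure: the random variable X collapses the sample space to \(\{0, 1\}\), and its pmf at each value is computed by summing the measure of the outcomes mapped to that value.

    ```python
    # Hypothetical measure over score differences (the "underlying" sample space).
    p_diff = {1: 0.10, 2: 0.15, 3: 0.15, 4: 0.10, 5: 0.10,
              6: 0.10, 7: 0.10, 10: 0.10, 14: 0.10}      # probabilities sum to 1

    def X(diff):
        """Indicator random variable: 1 if the game is decided by fewer than 5 points."""
        return 1 if diff < 5 else 0

    def pmf_X(x):
        """Pull back: add up the probability of every outcome whose image under X is x."""
        return sum(p for diff, p in p_diff.items() if X(diff) == x)

    print(pmf_X(1))   # P(close game)
    print(pmf_X(0))   # P(not close) = 1 - P(close game)
    ```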

    Another way to represent the probabilities of \(X\) is by the cumulative distribution function (cdf) \(F_X (x) = \mathbb{P}[\{\omega \in \Omega \vert X(\omega) \leq x\}]\). The cdf and pdf carry the same information and each of them uniquely determines the distribution of the random variable. There are properties a cdf must satisfy to be properly defined, such as

    • \(\lim_{x \to \infty} F_X (x) = 1\)
    • \(\lim_{x \to -\infty} F_X (x) = 0\)
    • Right continuous and non-decreasing

    There is nothing stopping us from applying other functions to these random variables to transform them. In essence, it is the same as defining a new random variable with the transformed mapping. For example, if the gambler wanted the random variable to represent how much they win when the score difference is less than 5, they would still be dealing with the same pulled-back probability space.

    Expectations

    Expectation arises from a very natural concept: the weighted average. A researcher may ask what they can expect the random variable to be in the long run, and a good answer to that is the expectation. It can be argued that it is the "best" single number to characterize a probability distribution. Note, however, that the expectation does not need to be one of the possible outcomes.

    The expectation of \(X\) is defined as \(\mathbb{E}[X] = \int_{\Omega} X(\omega) \, d\mathbb{P}(\omega) = \int_{\mathbb{R}} x f_X (x) \, dx\), where the integral is taken to be a summation if the rv is discrete. This may resemble the Riemann integral taught in calculus, and the two agree in most cases. If the probability measure is replaced by the Lebesgue measure, which is just a measure of length (i.e. in 1-dim \(Leb(a,b)= b - a\)), we recover ordinary integration. Whereas the Riemann integral evaluates the function at a point and then weights it by small increments of x, the expectation evaluates the function (the random variable) at an outcome and weights it by the measure of that outcome. For that reason, expectation can be seen as the more general and robust concept; one does not need to know antiderivatives to define it.
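
    A minimal sketch of the two views of \(\mathbb{E}[X]\) for a discrete random variable: the exact weighted average over the pmf, and a long-run average of simulated draws. The pmf values and sample size below are made up for illustration.

    ```python
    import numpy as np

    values = np.array([0, 1, 2, 5])
    probs  = np.array([0.4, 0.3, 0.2, 0.1])      # a made-up pmf

    # Expectation as a weighted average: sum over x of x * f_X(x)
    exact = np.sum(values * probs)

    # Expectation as a long-run average of draws (the LLN makes this work)
    rng = np.random.default_rng(0)
    draws = rng.choice(values, size=100_000, p=probs)
    monte_carlo = draws.mean()

    print(exact, monte_carlo)   # both should be close to 1.2
    ```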

    Law of Large Numbers (LLN):

    This is the first fundamental theorem introduced, and it describes the relationship between a sequence of random variables and their expectation.

    Weak Law of Large Numbers:

    Let \(X_1, X_2,...\) be independent and identically distributed (iid) random variables with \(\mathbb{E}[\vert X_i \vert] \lt \infty \). Let \(S_n := \sum_{i=1}^{n} X_i\) be the partial sum. Then \[\frac{S_n}{n} \overset{p}{\to} \mathbb{E}[X_1]\]

    To thoroughly appreciate the LLN, it must be understood what is being stated. It says that there is a deterministic relationship between the sample mean, which is a random variable, and the expectation, which is simply a constant. The relationship is through a convergence, but not the one that is usually taught in calculus. Because these are random variables (i.e. their values are not known until observed), a new type of convergence must be defined to deal with this.

    We define convergence in probability as \(\lim_{n \to \infty} \mathbb{P} [\vert \frac{S_n}{n} - \mathbb{E} [X_1] \vert > \epsilon] = 0\) for all \(\epsilon > 0 \). In epsilon-delta terms: \(\forall \epsilon , \delta > 0, \exists N(\epsilon, \delta)\) such that \(\forall n \geq N(\epsilon, \delta), \mathbb{P} [\vert \frac{S_n}{n} - \mathbb{E} [X_1] \vert > \epsilon] \leq \delta \). In words, it becomes unlikely that the sample average deviates far from the individual expectation as n increases. Similar to ordinary convergence, the behavior at the beginning of the sequence doesn't matter much, only that the tail settles to the expectation, in metric distance and in probability. This is usually proved using the Chebyshev/Markov inequality.
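
    A quick simulation of that epsilon-delta picture, assuming fair coin flips (so \(\mathbb{E}[X_1] = 0.5\)); the tolerance and number of paths are arbitrary choices of mine. For a fixed \(\epsilon\), the proportion of sample paths whose running mean deviates by more than \(\epsilon\) shrinks as \(n\) grows.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    eps, mu = 0.05, 0.5                    # tolerance and true mean of a fair coin
    n_paths = 1000

    for n in (10, 100, 1000, 5000):
        flips = rng.integers(0, 2, size=(n_paths, n))     # each row is one experiment
        sample_means = flips.mean(axis=1)                  # S_n / n for each path
        deviation_prob = np.mean(np.abs(sample_means - mu) > eps)
        print(n, deviation_prob)   # empirical P(|S_n/n - mu| > eps) decreases with n
    ```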

    Strong Law of Large Numbers:

    Let \(X_1, X_2,...\) be independent and identically distributed (iid) random variables with \(\mathbb{E}[\vert X_i \vert] \lt \infty \). Let \(S_n := \sum_{i=1}^{n} X_i\) be the partial sum. Then \[\frac{S_n}{n} \overset{a.s.}{\to} \mathbb{E}[X_1]\]

    If there is a weak law then there should be a strong law. The strong law features a stronger mode of convergence called almost sure convergence (or almost everywhere convergence). The definition of a.s. convergence is \(\mathbb{P} [\lim_{n \to \infty} \frac{S_n}{n} = \mathbb{E} [X_1]] = 1\). In epsilon terms: with probability 1, \(\forall \epsilon > 0, \exists N(\epsilon, \omega)\) such that \(\forall n > N(\epsilon, \omega), \vert \frac{S_n}{n} - \mathbb{E} [X_1] \vert \leq \epsilon\); note that how far out we have to go depends on the particular sample path \(\omega\). In words, for any epsilon given, if we go far enough into the tail of the sequence \(\frac{S_n}{n}\), the set of sample paths that keep deviating far from the expectation has probability 0 (equivalently, the set of paths that eventually stay close has probability 1). This is usually proved using the Borel-Cantelli lemmas and a truncation argument.

    It may be hard to distinguish between the strong and the weak law, but there are differences. Convergence in probability only controls, for each large n separately, the probability \(\delta\) of the event that makes \(\frac{S_n}{n}\) "far away" from the expectation. Meanwhile, a.s. convergence requires that along almost every sample path the deviations eventually stop occurring altogether. So, as you can imagine, if something converges a.s., then it must converge in probability.

    The laws can also be proved without the iid assumption and with varying requirements on which moments are bounded. Notice that the weak law does not tell us anything about the rate of convergence; it just says that more samples will get us closer. This holds for all iid samples no matter their underlying distribution. We will see how this is used later in statistics.

    Central Limit Theorem (CLT)

    This is the second fundamental theorem that I learned, and it characterizes the distribution of sums of random variables.

    Characteristic Functions

    Before I introduce the Central Limit Theorem, I will discuss characteristic functions (cf). Partial sums of random variables have occurred pretty often in the previous paragraphs. Determining the distribution of these sums using straightforward methods such as convolution can be tedious. Therefore a new method of characterizing the distribution of sums of random variables is needed.

    Some may be familiar with moment generating functions (mgf), \(\mathbb{E} [e^{tX}]\). But the mgf does not always exist because the function \(e^{tX}\) is not bounded and therefore the expectation may be infinite. *It can be shown that in order for the mgf to exist, the tail of the distribution must be exponentially bounded.* Where the existence of the mgf is lacking, the cf makes up for it and adds a few properties of its own. The cf is defined as \(\mathbb{E} [e^{itX}] = \mathbb{E} [\cos(tX) + i\sin(tX)]\) (for those familiar with Euler's formula) *Look at how this relates to Fourier transforms*. Notice that \(\vert e^{itX} \vert = 1\) and therefore the cf always exists.

    Characteristic functions turn the sum of independent random variables into the product of their cfs. By the uniqueness theorem, each distribution has its own characteristic function. Also, by the continuity theorem, pointwise convergence of cfs (to a function continuous at 0) implies convergence in distribution of the random variables. Thus, if we can show that the characteristic function of a sum of random variables converges to something, then we have shown that the distribution of the sum converges to something. Through a Taylor expansion, we can see that the cf of the normalized sum converges to the cf of a normal random variable.
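
    A rough numerical check of the product rule, using two independent Exponential(1) variables chosen arbitrarily: a Monte Carlo estimate of \(\mathbb{E}[e^{it(X+Y)}]\) should roughly match \(\varphi_X(t)\varphi_Y(t)\), and both should match the known closed form \((1-it)^{-2}\) for the Gamma(2,1) sum.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    X = rng.exponential(1.0, n)
    Y = rng.exponential(1.0, n)

    def cf(samples, t):
        """Monte Carlo estimate of the characteristic function E[exp(itX)]."""
        return np.mean(np.exp(1j * t * samples))

    for t in (0.5, 1.0, 2.0):
        lhs = cf(X + Y, t)                          # cf of the sum
        rhs = cf(X, t) * cf(Y, t)                   # product of the individual cfs
        exact = (1 / (1 - 1j * t)) ** 2             # cf of a Gamma(2, 1) random variable
        print(t, lhs, rhs, exact)                   # all three should be close
    ```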

    We can also show that expectations of bounded continuous functions converge, by the Portmanteau theorem. The version of the CLT we learned uses the Lindeberg-Feller triangular array proof.

    Central Limit Theorem:

    Let \(X_1, X_2,...\) be independent and identically distributed (iid) random variables with \(\mathbb{E}[X_i] \lt \infty \) and \(\mathbb{E}[X_{i}^{2}] \lt \infty\). Let \(S_n := \sum_{i=1}^{n} X_i\) be the partial sum. Then, for large \(n\), \[F_{\frac{S_n}{n}} (x) \approx \Phi\left(\frac{x - \mathbb{E} [X_1]}{\sqrt{\mathbb{V}ar[X_1] / n}}\right)\] where \(\Phi(x)\) is the cdf of a Normal(0,1) random variable (the precise \(\overset{d}{\to}\) statement is given below).

    Indeed, the setup starts off the same as the LLN by computing the sample average of iid random variables. Then there is a claim that for a large enough n, the distribution of the sample mean looks roughly normal. There is an additional assumption of a finite 2nd moment because the variance is a parameter of the normal distribution. It extends the LLN, which stated that there is "high" probability of the sample mean being near the true expectation, by instead giving us a distribution for the deviations around that expectation.

    The new mode of convergence should be a familiar one. It is simply pointwise convergence of the cdfs at the continuity points of the normal distribution (i.e. all of \(\mathbb{R}\)). The cdf of \(\frac{S_n}{n}\) and the normal cdf, evaluated at these points, cannot stay far apart. The implications of this are mostly seen in statistics, where we gain the ability to get asymptotic (and approximate finite-sample) distributions for estimators. The normal distribution is also extremely simple to work with because it has nice, well-studied properties.

    The central limit theorem can also be rewritten as \[\frac{S_n - n \, \mathbb{E} [X_1]}{\sqrt{n \, \mathbb{V}ar[X_1]}} = \frac{\sqrt{n}}{\sqrt{\mathbb{V}ar[X_1]}}\left(\frac{S_n}{n} - \mathbb{E}[X_1]\right) \overset{d}{\to} N(0,1) \] The second form gives a bit more intuition about what is actually going on. For large n, the sample mean gets very close to the expectation by the LLN. Hence the difference \(\frac{S_n}{n} - \mathbb{E} [X_1]\) must get pretty small and converge to 0. In order to slow this convergence, we scale it by \(\sqrt{n}\) to magnify the deviations. The deviations tend to act like a normal random variable.

    Notice there is no mention of the actual underlying distribution of the \(X_i\). Also notice that the central limit theorem only describes what the resulting distribution of the normalized sum of random variables is.
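
    A small simulation of that last point, using (my arbitrary choice) Exponential(1) summands, which are quite skewed: the standardized sums \(\frac{S_n - n\mu}{\sqrt{n\sigma^2}}\) should have quantiles close to the standard normal ones even though the \(X_i\) themselves are nothing like normal.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 200, 20_000
    mu, var = 1.0, 1.0                        # mean and variance of Exponential(1)

    samples = rng.exponential(1.0, size=(reps, n))
    S_n = samples.sum(axis=1)
    Z = (S_n - n * mu) / np.sqrt(n * var)     # the normalized sum from the CLT

    # Compare a few empirical quantiles with the standard normal ones.
    for q, z in [(0.025, -1.96), (0.5, 0.0), (0.975, 1.96)]:
        print(q, np.quantile(Z, q), z)
    ```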

    Markov Chains

    A fairly common stochastic process that is simple to deal with

    A sequence of random variables \(\{X_t\}_{t \in I}\), where \(I\) is an index set (think time), is a stochastic process if they are all defined on the same measure space \((\Omega, \mathcal{F}, \mathbb{P})\). So an iid sample can be considered a stochastic process, but that isn't really interesting. Markov chains, on the other hand, are interesting and intuitive to understand.

    To define a MC, what is needed is the transition matrix \(\mathcal{P}\) and a state space \(S = \{1, 2, 3, ..., k\}\). I will only discuss homogeneous discrete-time MCs for now, so let the index set be \(I = \{1, 2, 3,...\}\). The transition matrix is \[\mathcal{P} = \begin{pmatrix} a_{11} & a_{12} & ... & a_{1k} \\ a_{21} & a_{22} & ... & a_{2k} \\ . & . & ... & . \\ a_{k1} & a_{k2} & ... & a_{kk} \end{pmatrix}\] where \(a_{ij} = \mathbb{P}[X_{n+1} = j \vert X_n = i]\), so each row sums to 1. The crucial point of a MC is that it possesses the Markov property. This means that the next step of the chain is independent of the path it took to get to its current position. The only thing that matters, in a probabilistic sense, is the position it is at today. More explicitly, \(\mathbb{P}[X_{n+1} = j \vert X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1] = \mathbb{P}[X_{n+1} = j \vert X_n = i_n]\)

    A good example of a MC is a random walk (i.e. flip a coin and we go right if heads, left if tails and handle edge cases however we want to).

    Usually, the only MCs we care about are the ergodic ones (aperiodic and irreducible) because those have long-run behavior that we can compute. Long-run behavior in the sense of the proportion of time the chain spends in each state. This distribution is called the stationary distribution, denoted by the vector \(\pi\) with \(\pi(x) = \lim_{n \to \infty} \mathbb{P}[X_n = x]\) for \(x \in S\). If \(\pi\) is a stationary distribution then \(\mathbf{\pi}^T \mathcal{P} = \mathbf{\pi}^T\). The stationary distribution can therefore be thought of as a left eigenvector of \(\mathcal{P}\) with eigenvalue 1, which can be shown to exist if the chain is irreducible. The stationary probability \(\pi(x)\) can also be viewed as \(\frac{1}{\mathbb{E}[\text{length of excursion from x back to x}]}\). Usually, the best way to compute such expectations is through first step analysis.
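
    A sketch of the eigenvector view for a small, hypothetical 3-state chain (the matrix below is made up): the stationary distribution is extracted as the left eigenvector of \(\mathcal{P}\) with eigenvalue 1, checked against \(\pi^T \mathcal{P} = \pi^T\), and inverted to read off expected return times.

    ```python
    import numpy as np

    # A hypothetical 3-state transition matrix (rows sum to 1).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    # Stationary distribution = left eigenvector of P with eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()                       # normalize so the probabilities sum to 1

    print(pi)                                # stationary distribution
    print(pi @ P)                            # should equal pi (pi^T P = pi^T)
    print(1.0 / pi)                          # expected return time to each state
    ```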

    We also get an extension of the LLN for MCs, called the Ergodic Theorem. If \(\{X_n\}_{n \geq 0}\) is a MC with transition matrix \(\mathcal{P}\) and stationary distribution \(\pi\), and \(f:S \to \mathbb{R}\), then \[\frac{1}{n} \sum_{i=0}^{n-1} f(X_i) \overset{a.s.}{\to} \sum_{x \in S} \pi (x) f(x)\] The right-hand side is the expectation of \(f\) under the stationary distribution. The sample average of \(f\) along the chain gets closer and closer to this mean, just as in the LLN.
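
    Continuing with the same hypothetical chain, here is a rough check of the Ergodic Theorem: the time average of an arbitrary function \(f\) along one long simulated path is compared to \(\sum_{x} \pi(x) f(x)\).

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Same hypothetical 3-state chain as above, and an arbitrary f on the state space.
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])
    f = np.array([1.0, 5.0, -2.0])

    # Simulate one long path of the chain starting from state 0.
    n = 100_000
    x = 0
    path = np.empty(n, dtype=int)
    for t in range(n):
        path[t] = x
        x = rng.choice(3, p=P[x])

    time_average = f[path].mean()            # (1/n) * sum of f(X_i)

    # Stationary expectation sum_x pi(x) f(x), with pi from the eigenvector method.
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()

    print(time_average, pi @ f)              # the two numbers should be close
    ```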

    The state space does not have to be finite; it can be extended to a countably infinite set. Extending \(S\) to an infinite set adds a new dimension to irreducible chains. In a finite irreducible chain, there is always a positive probability of going from one state to another in some number of steps. When the state space becomes infinite, there may be a chance of never coming back to a state after leaving it, even if all the states communicate. If there is a chance of never returning to a state after leaving it, then the chain is called transient; otherwise it is recurrent. Furthermore, if a chain is recurrent it can be positive recurrent or null recurrent. These cases can all be seen in the random walk example, where the probability of stepping right is less than, equal to, or greater than 1/2.

    The stationary distribution also extends to the infinite state space. If the chain is transient, it doesn't make sense to discuss the stationary distribution because the probability of being in any fixed state in the long run is 0. However, if the chain is positive recurrent, the stationary distribution exists and \(\pi(x)\) is the reciprocal of the expected return time to \(x\).

    Strong Markov property: conditional on a stopping time and the state at that time, the chain restarted there behaves like a fresh copy of the chain with the same transition matrix, independent of the past before the stopping time.

    Martingale

    A stochastic process obsessed with gambling

    Martingales appear after conditional probability. That is the right transition because conditioning on a filtration \(\{\mathcal{F}_t\}_{t \in I}\) is crucial to understanding martingales, so one has to be familiar with conditioning. A martingale is another kind of stochastic process. The defining property of a martingale \(\{X_t\}_{t \geq 0}\) is that \(\mathbb{E}[X_t \vert \mathcal{F}_s] = X_s\) where \(s \lt t\). More simply, my expectation for my profit tomorrow, based on what I have today, is the same as what I have today. Another way of saying this is that I am expected to have the same profit each round. Therefore \(\mathbb{E}[X_t]= \mathbb{E}[X_s] = \mathbb{E}[X_0]\).

    If a stopping time \(T := \inf\{n \geq 0 : X_n \in A\}\) (i.e. the first time something happens) is introduced, where \(A\) is some set of states, then the stopped process \(X_{\min\{n, T\}}\) is still a martingale. If we can make some claims about the boundedness of \(T\), or bound the process itself, then the Optional Stopping theorem states that \(\lim_{n \to \infty} \mathbb{E}[X_{\min\{n, T\}}] = \mathbb{E}[\lim_{n \to \infty} X_{\min\{n, T\}}] = \mathbb{E}[X_T] = \mathbb{E}[X_0]\). This is useful for computing probabilities of absorption.
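
    A sketch of how optional stopping pins down an absorption probability, using the symmetric random walk (a martingale) started at a made-up bankroll \(a\) and stopped at 0 or \(N\): since \(\mathbb{E}[X_T] = \mathbb{E}[X_0] = a\) and \(X_T\) is either 0 or \(N\), we get \(\mathbb{P}[\text{hit } N \text{ before } 0] = a/N\). The simulation below just checks that number.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    a, N = 3, 10                      # start at 3, stop on hitting 0 or 10
    reps = 10_000
    wins = 0

    for _ in range(reps):
        x = a
        while 0 < x < N:
            x += rng.choice((-1, 1))  # fair +/-1 steps keep the walk a martingale
        wins += (x == N)

    # Optional stopping: E[X_T] = E[X_0] = a = N * P(hit N) + 0 * P(hit 0).
    print(wins / reps, a / N)         # simulated vs. theoretical 0.3
    ```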

    When the martingale is bounded, Doob's Martingale Convergence theorem states that the martingale converges a.s. to a limiting random variable. In examples like the stopped random walk above, that limit is the absorbed value \(X_T\).

    Poisson Process:

    The last and probably most applicable stochastic process discussed

    The Poisson Process is a counting process that relates the exponential distribution and the Poisson distribution. It models the number of events that occur in a time interval as a Poisson random variable, and the time between events as an exponential random variable.
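
    A minimal sketch of that exponential/Poisson connection, with a made-up rate \(\lambda = 2\) and window \([0, T]\): arrivals are generated from iid Exponential(\(\lambda\)) interarrival times, and the resulting counts should look Poisson with mean (and variance) \(\lambda T\).

    ```python
    import numpy as np

    rng = np.random.default_rng(6)
    lam, T, reps = 2.0, 5.0, 20_000

    counts = np.empty(reps, dtype=int)
    for r in range(reps):
        t, n = 0.0, 0
        while True:
            t += rng.exponential(1.0 / lam)   # next interarrival time ~ Exp(lambda)
            if t > T:
                break
            n += 1
        counts[r] = n                         # number of arrivals in [0, T]

    print(counts.mean(), lam * T)             # both should be close to 10
    print(counts.var(),  lam * T)             # a Poisson's variance equals its mean
    ```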

    Using the superposition and thinning properties of the process, a few intuitive results about the Poisson and exponential distributions can be demonstrated; e.g., the minimum of independent exponentials is again exponential, which stream produces that minimum is a Bernoulli-type event with probability proportional to its rate, and given that there are n arrivals in total, the split of those arrivals among the thinned streams is multinomial.

    The Poisson process is the building block of continuous-time MCs. A continuous-time chain can be thought of as one that only transitions when the counting process gets incremented by 1, which happens after an exponentially distributed holding time. The transition matrix turns into the transition rate matrix.