Appendix A
Review of Discrete Probability

This appendix provides a review of the essentials of discrete probability. It is concerned with observations, experiments or actions that have a finite number of unpredictable outcomes. The set of all possible outcomes is called the sample space (standard terminology) and is denoted by the symbol $\Omega$. An element of $\Omega$ (an individual outcome) will be denoted by $\omega$. A coin toss, for example, has two possible outcomes: heads (H) or tails (T). The sample space is $\Omega=\{H,T\}$ and $\omega=H$ is one of the possible outcomes. Another example is the roll of a die, which has 6 outcomes, so that $\Omega=\{1,2,3,4,5,6\}$. A subset of the sample space is called an event and is denoted by a capital letter such as $A$ or $B$. In the die example, let $A$ be the event that an even number is rolled; then $A=\{2,4,6\}$.

Each outcome, $\omega$, will have a probability assigned to it, denoted $P(\omega)$. The probability is a real number ranging from $0$ to $1$ that signifies the likelihood that an outcome will occur. If $P(\omega)=0$ then $\omega$ will never occur and if $P(\omega)=1$ then $\omega$ will always occur. An intermediate value such as $P(\omega)=1/2$ means that $\omega$ will occur roughly half the time if the experiment is repeated many times. In general, if you perform the experiment a large number of times, $N$, and the number of times that $\omega$ occurs is $n(\omega)$, then the ratio $n(\omega)/N$ should approximately equal the probability of $\omega$. It is possible to define $P(\omega)$ as the limit of this ratio.

\begin{equation}\tag{A.1} P(\omega) = \lim_{N \to \infty} \frac{n(\omega)}{N} \end{equation}
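
As a rough illustration of equation A.1, the following Python sketch simulates rolls of a fair die and reports the ratio $n(\omega)/N$ for $\omega=3$. The fair die, the function name `relative_frequency` and the fixed seed are assumptions made for this example only.

```python
import random

def relative_frequency(outcome, num_trials, seed=0):
    """Return n(outcome)/N for num_trials simulated rolls of a fair die."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs repeat
    hits = sum(1 for _ in range(num_trials) if rng.randint(1, 6) == outcome)
    return hits / num_trials

# The ratio should drift toward 1/6 (about 0.1667) as N grows.
for num_trials in (100, 10_000, 100_000):
    print(num_trials, relative_frequency(3, num_trials))
```

A finite simulation can only approximate the limit, so the printed ratios will fluctuate around $1/6$ rather than equal it.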

The function $P(\omega)$, which assigns probabilities to outcomes, is called a probability distribution. We will now look at some of its defining properties. To begin with, if the probabilities are defined as in equation A.1, then clearly the sum of all the probabilities must equal 1.

\begin{equation}\tag{A.2} \sum_{\omega \in \Omega} P(\omega) = 1 \end{equation}

It is often necessary to determine the probability that one of a subset of all the possible outcomes will occur. If $A$ is a subset of $\Omega$ then $P(A)$ is the probability that one of the outcomes contained in $A$ will occur. Using the definition in equation A.1 it should be obvious that:

\begin{equation}\tag{A.3} P(A)=\sum_{\omega \in A} P(\omega) \end{equation}

Many other properties can be derived from the algebra of sets. Let $A + B$ be the set of all elements in either $A$ or $B$ (no duplicates) and let $AB$ be the set of all elements in both $A$ and $B$, then:

\begin{equation}\tag{A.4} P(A + B) = P(A) + P(B) - P(AB) \end{equation}

If $A$ and $B$ have no elements in common then they are exclusive events, i.e. they cannot both occur simultaneously. In this case equation A.4 reduces to $P(A + B) = P(A) + P(B)$. In general, the probability that any one of a number of exclusive events will occur is just equal to the sum of their individual probabilities.
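
Equations A.2 through A.4 can be checked numerically by treating events as Python sets. This is a minimal sketch assuming a fair die; the event `B` (a roll above 3) and the helper `prob` are introduced here purely for illustration.

```python
from fractions import Fraction

# Assumed fair die: every outcome in the sample space has probability 1/6.
P = {omega: Fraction(1, 6) for omega in range(1, 7)}

def prob(event):
    """Equation A.3: sum P(omega) over the outcomes in the event."""
    return sum((P[omega] for omega in event), Fraction(0))

A = {2, 4, 6}  # an even number is rolled (the event used in the text)
B = {4, 5, 6}  # a number above 3 is rolled (hypothetical second event)

total = prob(P.keys())                                 # equation A.2
union = prob(A | B)                                    # P(A + B)
inclusion_exclusion = prob(A) + prob(B) - prob(A & B)  # right side of A.4
print(total, union, inclusion_exclusion)  # 1 2/3 2/3
```

Exact fractions are used instead of floating point so that the two sides of equation A.4 can be compared with `==`.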

Conditional probabilities and the closely related concept of independence are very important and useful in probability calculations. Let $P(A|B)$ be the probability that $A$ has occurred given that we know $B$ has occurred. In short, we will refer to this as the probability of $A$ given $B$ or the probability of $A$ conditioned on $B$. What $P(A|B)$ really represents is the probability of $A$ using $B$ as the sample space instead of $\Omega$. If $A$ and $B$ have no elements in common then $P(A|B)=0$. If they have all elements in common so that $A=B$ then obviously $P(A|B)=1$. In general, for $P(B)>0$, we have

\begin{equation}\tag{A.5} P(A|B) = \frac{P(AB)}{P(B)} \end{equation}

Using a single roll of a fair die as an example, let $A=\{1,3\}$ and $B=\{3,5\}$. Then $AB=\{3\}$, $P(AB)=1/6$, $P(B)=1/3$ and

\begin{equation}\tag{A.6} P(A|B) = \frac{1/6}{1/3} = \frac{1}{2} \end{equation}

Knowledge that $B$ has occurred has increased the probability of $A$ from $P(A)=1/3$ to $P(A|B)=1/2$. The result can also be deduced by simple logic. We know that $B$ has occurred, therefore the roll was either a 3 or a 5. Half of the $B$ events are caused by a 3 and half by a 5, but only the 3 also counts as an $A$ event, therefore $P(A|B)=1/2$.
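
The same calculation can be written as a short Python sketch. The helper names `prob` and `conditional` are assumptions of this example, and the die is taken to be fair as in the text.

```python
from fractions import Fraction

def prob(event):
    """Probability of an event on a fair six-sided die."""
    return Fraction(len(event), 6)

def conditional(a, b):
    """Equation A.5: P(A|B) = P(AB) / P(B); requires P(B) > 0."""
    return prob(a & b) / prob(b)

A = {1, 3}
B = {3, 5}
print(prob(A))            # 1/3
print(conditional(A, B))  # 1/2
```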

Conditional probabilities are not necessarily symmetric. $P(B|A)$ need not be equal to $P(A|B)$. Using the definition in equation A.5, you can show that

\begin{equation}\tag{A.7} P(A|B) P(B) = P(B|A) P(A) \end{equation}

so the two conditional probabilities are only equal if $P(A)=P(B)$ (or if $P(AB)=0$, in which case both are zero). Another useful thing to keep in mind is that conditional probabilities obey the same properties as non-conditional probabilities. This means, for example, that if $A$ and $B$ are exclusive events then $P(A+B|C) = P(A|C) + P(B|C)$.
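
For completeness, equation A.7 follows in one line from equation A.5, since $AB$ and $BA$ denote the same set. This assumes $P(A)>0$ and $P(B)>0$ so that both conditional probabilities are defined.

\begin{equation*} P(A|B)\,P(B) = P(AB) = P(BA) = P(B|A)\,P(A) \end{equation*}

In the die example with $A=\{1,3\}$ and $B=\{3,5\}$, this gives $P(B|A) = (1/6)/(1/3) = 1/2$, which matches $P(A|B)$ because $P(A)=P(B)=1/3$.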

The concept of independence is naturally related to conditional probability. Two events are independent if the occurrence of one has no effect on the probability of the other. In terms of conditional probabilities this means that $P(A|B)=P(A)$. Independence is always symmetric: if $A$ is independent of $B$ then $B$ is independent of $A$. Using the definition in equation A.5 you can see that independence also implies that

\begin{equation}\tag{A.8} P(AB) = P(A) P(B) \end{equation}

This is often taken as the defining relation for independence.
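
Equation A.8 gives a direct test for independence. The sketch below applies it to a fair die; the three events and the function name `independent` are hypothetical choices for illustration.

```python
from fractions import Fraction

def prob(event):
    """Probability of an event on a fair six-sided die."""
    return Fraction(len(event), 6)

def independent(a, b):
    """Equation A.8: A and B are independent when P(AB) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
at_most_three = {1, 2, 3}
print(independent(even, at_most_four))   # True:  1/3 == 1/2 * 2/3
print(independent(even, at_most_three))  # False: 1/6 != 1/2 * 1/2
```

Knowing the roll is at most four leaves the chance of an even number at $1/2$, whereas knowing it is at most three lowers that chance to $1/3$.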

Another important concept in probability is the law of total probability. Let the sample space $\Omega$ be partitioned by the sets $B_1$ and $B_2$ so that every element in $\Omega$ is in one and only one of the two sets and we can write $\Omega = B_1 + B_2$. This means that the occurrence of $A$ coincides with the occurrence of $B_1$ or $B_2$ but not both and we can write

\begin{equation}\tag{A.9} A = A B_1 + A B_2 = A (B_1 + B_2) = A \Omega \end{equation}

The probability of $A$ is then

\begin{equation}\tag{A.10} P(A) = P(A B_1) + P(A B_2) \end{equation}

This can be extended to any number of sets that partition $\Omega$.
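
As a small check of equation A.10 and its extension to more sets, the following sketch sums $P(AB_k)$ over a partition of the die's sample space. The partitions and the function name `total_probability` are assumptions of this example.

```python
from fractions import Fraction

def prob(event):
    """Probability of an event on a fair six-sided die."""
    return Fraction(len(event), 6)

def total_probability(a, partition):
    """Equation A.10 for any number of sets: the sum of P(A B_k)."""
    return sum((prob(a & b_k) for b_k in partition), Fraction(0))

A = {2, 4, 6}
two_sets = [{1, 2, 3}, {4, 5, 6}]
three_sets = [{1, 2}, {3, 4}, {5, 6}]
print(total_probability(A, two_sets))    # 1/2
print(total_probability(A, three_sets))  # 1/2
print(prob(A))                           # 1/2
```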

To carry out any kind of probabilistic analysis we need random variables. A random variable is a bit like the probability distributions discussed above in that it assigns a number to each of the elements in the sample space. It is therefore really more like a function that maps elements in the sample space to real numbers. A random variable is usually denoted with an upper case letter such as $X$ and the values it can assume are given subscripted lower case letters such as $x_i$ for $i=1,2,\ldots,n$ where $n$ is the number of possible values. The mapping from an element $\omega$ to a value $x_i$ is denoted as $X(\omega)=x_i$. Note that it is not necessary that every element be assigned a unique value and the particular value assigned will depend on what you want to analyze.

A simple example is a coin toss betting game. You guess what the result of the toss will be. If your guess is correct you win \$1, otherwise you lose \$1. The sample space consists of only two elements, a correct guess and an incorrect guess: $\Omega=\{\mathrm{correct},\mathrm{incorrect}\}$. If you are interested in analyzing the amounts won and lost by playing several such games then the obvious choice for the random variable is $X(\mathrm{correct})=1$, $X(\mathrm{incorrect})=-1$. If you are just interested in the number of games won or lost then the random variable $Y(\mathrm{correct})=1$, $Y(\mathrm{incorrect})=0$ would be better. Often an analysis in terms of one variable can be converted into another variable by finding a relation between them. In the above example $X = 2Y - 1$ could be used to convert between the variables.

As another example consider tossing a coin three times. The sample space consists of 8 elements $\Omega=\{TTT,TTH,THT,THH,HTT,HTH,HHT,HHH\}$ where $T$ indicates the toss was a tail and $H$ a head. This time we let $X$ be the random variable that counts the number of heads in the three tosses. It can have values 0, 1, 2, or 3 and not every element in the sample space has a unique value. The values are $X(TTT)=0$, $X(TTH)=X(THT)=X(HTT)=1$, $X(THH)=X(HTH)=X(HHT)=2$, $X(HHH)=3$.

Probability distributions are most often expressed in terms of the values that a random variable can take. The usual notation is

\begin{equation}\tag{A.11} P(X=x_i) = p(x_i) \end{equation}

The function $p(x_i)$ is the probability distribution for the random variable $X$. It is often also called the probability mass function. Note that it is not necessarily the same as the probability distribution for the individual elements of the sample space since multiple elements may be mapped to the same value by the random variable. In the three coin toss example, each element in the sample space has a probability of $1/8$, assuming a fair coin. The probability distribution for $X$ however is $p(0)=1/8$, $p(1)=3/8$, $p(2)=3/8$, $p(3)=1/8$. It will always be true that the sum over all the probabilities must equal 1.

\begin{equation}\tag{A.12} \sum_i p(x_i) = 1 \end{equation}
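
The distribution for the three coin toss example can be built mechanically by enumerating the sample space and grouping outcomes by the value of $X$. This is a minimal sketch assuming a fair coin; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses: TTT, TTH, ..., HHH (8 outcomes).
sample_space = ["".join(tosses) for tosses in product("TH", repeat=3)]
p_outcome = Fraction(1, len(sample_space))  # fair coin assumed

def X(omega):
    """Random variable: the number of heads in the outcome."""
    return omega.count("H")

# Accumulate the probability of every outcome that maps to the same value.
pmf = {}
for omega in sample_space:
    pmf[X(omega)] = pmf.get(X(omega), Fraction(0)) + p_outcome

for x in sorted(pmf):
    print(x, pmf[x])  # 0 1/8, 1 3/8, 2 3/8, 3 1/8 on successive lines
```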

The two most important properties of a random variable are its expectation and variance. The expectation is simply the average value of the random variable. In the coin toss betting game, $X$ can have a value of +1 or -1 corresponding to winning or losing. In $N$ flips of the coin let $k$ be the number of wins and $N-k$ the number of losses. The total amount won is then

\begin{equation}\tag{A.13} W = k - (N-k) \end{equation}

and the average amount won per flip is

\begin{equation}\tag{A.14} \frac{W}{N} = \frac{k}{N} - (1-\frac{k}{N}) \end{equation}

As the number of flips becomes very large the ratio $k/N$ will approach $p(1)$, the probability of winning, and the equation then becomes the expectation of the random variable.

\begin{equation}\tag{A.15} E[X] = p(1) - p(-1) \end{equation}

Here $p(-1)=1-p(1)$ is the probability of losing and $E[X]$ is the usual notation for the expectation of $X$. In this case the expectation is the average amount that you can expect to win per flip if you play the game for a very long time.

In general if $X$ can take on $n$ values, $x_i$, $i=1,2,\ldots,n$ with corresponding probabilities $p(x_i)$ then the expectation is

\begin{equation}\tag{A.16} E[X] = \sum_{i=1}^n p(x_i) x_i \end{equation}
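
Equation A.16 translates directly into code when a distribution is stored as a mapping from values to probabilities. The sketch below assumes a fair coin for both examples; the dictionary layout and the name `expectation` are choices made for this illustration.

```python
from fractions import Fraction

def expectation(pmf):
    """Equation A.16: the sum of p(x_i) * x_i over all values x_i."""
    return sum((p * x for x, p in pmf.items()), Fraction(0))

# Betting game with a fair coin: win 1 or lose 1 with equal probability.
betting_game = {1: Fraction(1, 2), -1: Fraction(1, 2)}
# Number of heads in three fair tosses.
heads_in_three = {0: Fraction(1, 8), 1: Fraction(3, 8),
                  2: Fraction(3, 8), 3: Fraction(1, 8)}

print(expectation(betting_game))    # 0
print(expectation(heads_in_three))  # 3/2
```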

The expectation gives you the average but in reality large deviations from the average may be possible. The variance of a random variable gives a sense of how large those deviations can be. It measures the average of the squares of the deviations. The equation for the variance is:

\begin{equation}\tag{A.17} \mathrm{Var}[X] = \sum_{i=1}^n p(x_i) (x_i - E[X])^2 \end{equation}

The equation simplifies somewhat to

\begin{equation}\tag{A.18} \mathrm{Var}[X] = E[X^2] - E[X]^2 \end{equation}

where

\begin{equation}\tag{A.19} E[X^2] = \sum_{i=1}^n p(x_i) x_i^2 \end{equation}

is the expectation for the square of the random variable. In general the expectation for any function $g(X)$ is:

\begin{equation}\tag{A.20} E[g(X)] = \sum_{i=1}^n p(x_i) g(x_i) \end{equation}
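
The step from equation A.17 to equation A.18 can be filled in by expanding the square and using equations A.12, A.16 and A.19:

\begin{eqnarray*} \mathrm{Var}[X] & = & \sum_{i=1}^n p(x_i) (x_i^2 - 2 x_i E[X] + E[X]^2)\\ & = & \sum_{i=1}^n p(x_i) x_i^2 - 2E[X] \sum_{i=1}^n p(x_i) x_i + E[X]^2 \sum_{i=1}^n p(x_i)\\ & = & E[X^2] - 2E[X]^2 + E[X]^2\\ & = & E[X^2] - E[X]^2 \end{eqnarray*}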

Another useful measure of deviation from the average is called the standard deviation, $\sigma$. It is found by taking the square root of the variance.

\begin{equation}\tag{A.21} \sigma = \sqrt{\mathrm{Var}[X]} \end{equation}
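
The following sketch computes the variance of the number of heads in three fair tosses in two ways, from the definition in equation A.17 and from the shortcut in equation A.18, and then takes the square root for $\sigma$. The function names are assumptions of this example.

```python
from fractions import Fraction
from math import sqrt

def expectation(pmf, g=lambda x: x):
    """Equation A.20: E[g(X)] = sum of p(x_i) * g(x_i)."""
    return sum((p * g(x) for x, p in pmf.items()), Fraction(0))

def variance(pmf):
    """Equation A.17: the average squared deviation from E[X]."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)

# Number of heads in three fair tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

var_definition = variance(pmf)                                            # A.17
var_shortcut = expectation(pmf, lambda x: x * x) - expectation(pmf) ** 2  # A.18
sigma = sqrt(var_definition)                                              # A.21
print(var_definition, var_shortcut)  # 3/4 3/4
print(sigma)                         # roughly 0.866
```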

As we saw above, a sample space can have more than one random variable defined on it. If we have two variables $X$ and $Y$ then we can define the probability that $X=x_i$ at the same time that $Y=y_j$. This is called the joint probability distribution for $X$ and $Y$.

\begin{equation}\tag{A.22} P(X=x_i,Y=y_j) = p(x_i,y_j) \end{equation}

The individual distributions, $p(x_i)$ and $p(y_j)$, are recovered by summing the joint distribution over one of the variables. To get $p(x_i)$ you sum $p(x_i,y_j)$ over all the possible values of $Y$.

\begin{equation}\tag{A.23} p(x_i) = \sum_j p(x_i,y_j) \end{equation}

and likewise for $p(y_j)$

\begin{equation}\tag{A.24} p(y_j) = \sum_i p(x_i,y_j) \end{equation}

From these last two equations it is obvious that if you sum over both variables of the distribution, the result should equal 1.

\begin{equation}\tag{A.25} \sum_i \sum_j p(x_i,y_j) = 1 \end{equation}

It is possible to construct a joint distribution for any number of random variables, not just 2. For example $p(x_i,y_j,z_k)$ would be a joint distribution for the variables $X$, $Y$, and $Z$.
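
To make equations A.22 through A.25 concrete, the sketch below builds a joint distribution for two variables on the three coin toss sample space and recovers the individual distributions by summation. The second variable $Y$ (1 if the first toss is a head, 0 otherwise) is a hypothetical choice for this example, as are the helper names.

```python
from fractions import Fraction
from itertools import product

sample_space = ["".join(tosses) for tosses in product("TH", repeat=3)]
p_outcome = Fraction(1, 8)  # fair coin assumed

def X(omega):
    return omega.count("H")  # number of heads

def Y(omega):
    return 1 if omega[0] == "H" else 0  # hypothetical: first toss is a head

# Joint distribution p(x, y), keyed by the pair of values (A.22).
joint = {}
for omega in sample_space:
    key = (X(omega), Y(omega))
    joint[key] = joint.get(key, Fraction(0)) + p_outcome

def marginal(joint, axis):
    """Sum the joint distribution over the other variable (A.23, A.24)."""
    result = {}
    for key, p in joint.items():
        result[key[axis]] = result.get(key[axis], Fraction(0)) + p
    return result

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print(p_x[1], p_y[1], sum(joint.values()))  # 3/8 1/2 1
```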

With a joint distribution you can calculate the expectation and variance for functions of variables. The expectation for the sum $X + Y$ is:

\begin{eqnarray}\tag{A.26} E[X+Y] & = & \sum_i \sum_j p(x_i,y_j)(x_i + y_j)\\ & = & \sum_i x_i \sum_j p(x_i,y_j) + \sum_j y_j \sum_i p(x_i,y_j)\nonumber\\ & = & \sum_i x_i p(x_i) + \sum_j y_j p(y_j)\nonumber\\ & = & E[X] + E[Y]\nonumber \end{eqnarray}

The property that the expectation for a sum of variables is equal to the sum of their expectations is called linearity and it is true for the sum of any number of variables. For three variables, for example, $E[X+Y+Z]=E[X]+E[Y]+E[Z]$. Another easily verifiable consequence of linearity is that for any constants $a$ and $b$

\begin{equation}\tag{A.27} E[aX+bY] = aE[X] + bE[Y] \end{equation}

In the example of the coin toss game we had two random variables that were related by $X = 2Y -1$. The linearity property of the expectation means that $E[X] = 2E[Y] - 1$, where we used the fact that the expectation of a constant is just the constant.
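
A quick numerical check of the relation $E[X] = 2E[Y] - 1$ is sketched below. The winning probability of $3/5$ is a hypothetical value chosen so that the expectations are not trivially zero.

```python
from fractions import Fraction

def expectation(pmf):
    """Expectation of a distribution stored as {value: probability}."""
    return sum((p * x for x, p in pmf.items()), Fraction(0))

p_win = Fraction(3, 5)  # hypothetical probability of a correct guess

pmf_y = {1: p_win, 0: 1 - p_win}                  # Y counts wins
pmf_x = {2 * y - 1: p for y, p in pmf_y.items()}  # X = 2Y - 1 is the payoff

print(expectation(pmf_x), 2 * expectation(pmf_y) - 1)  # 1/5 1/5
```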

The expectation for the product $XY$ is

\begin{equation}\tag{A.28} E[XY] = \sum_i \sum_j p(x_i,y_j) x_i y_j \end{equation}

If the variables $X$ and $Y$ are independent then the joint distribution can be factored into a product of the individual distributions, $p(x_i,y_j) = p(x_i) p(y_j)$. In this case you can show that the expectation of the product is the product of the expectations, $E[XY] = E[X] E[Y]$.
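
The factorization can be verified by substituting $p(x_i,y_j) = p(x_i) p(y_j)$ into equation A.28 and separating the double sum:

\begin{eqnarray*} E[XY] & = & \sum_i \sum_j p(x_i) p(y_j) x_i y_j\\ & = & \left( \sum_i p(x_i) x_i \right) \left( \sum_j p(y_j) y_j \right)\\ & = & E[X] E[Y] \end{eqnarray*}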

For the variance of a sum we have

\begin{equation}\tag{A.29} \mathrm{Var}[X+Y] = E[(X - E[X] + Y - E[Y])^2] \end{equation}

after expanding and simplifying this becomes

\begin{equation}\tag{A.30} \mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\mathrm{Cov}[X,Y] \end{equation}

where $\mathrm{Cov}[X,Y]$ is called the covariance of $X$ and $Y$. The covariance is defined as:

\begin{equation}\tag{A.31} \mathrm{Cov}[X,Y] = E[XY] - E[X]E[Y] \end{equation}
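
The expansion that leads from equation A.29 to equation A.30 can be written out as follows, using linearity of the expectation:

\begin{eqnarray*} \mathrm{Var}[X+Y] & = & E[(X - E[X])^2] + E[(Y - E[Y])^2] + 2E[(X - E[X])(Y - E[Y])]\\ & = & \mathrm{Var}[X] + \mathrm{Var}[Y] + 2E[(X - E[X])(Y - E[Y])] \end{eqnarray*}

The cross term reduces to the covariance in equation A.31:

\begin{eqnarray*} E[(X - E[X])(Y - E[Y])] & = & E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y]\\ & = & E[XY] - E[X]E[Y] \end{eqnarray*}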

For independent variables the covariance is zero. The variance of the sum is then just the sum of the variances.
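
Equations A.29 through A.31 can be checked against each other on a small joint distribution. The sketch below assumes three fair coin tosses with $X$ the number of heads and $Y$ equal to 1 if the first toss is a head (a hypothetical pair of dependent variables); the table of probabilities and the helper `expect` are written out for this example only.

```python
from fractions import Fraction

F = Fraction
# Joint distribution p(x, y) for three fair coin tosses (assumed example):
# X = number of heads, Y = 1 if the first toss is a head, else 0.
joint = {(0, 0): F(1, 8), (1, 0): F(2, 8), (2, 0): F(1, 8),
         (1, 1): F(1, 8), (2, 1): F(2, 8), (3, 1): F(1, 8)}

def expect(g):
    """E[g(X, Y)] as a double sum over the joint distribution."""
    return sum((p * g(x, y) for (x, y), p in joint.items()), F(0))

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y          # A.31
var_sum = expect(lambda x, y: (x + y - mean_x - mean_y) ** 2)  # A.29

print(cov_xy)                               # 1/4
print(var_sum, var_x + var_y + 2 * cov_xy)  # 3/2 3/2
```

The covariance is positive here because a head on the first toss also raises the head count, so the variance of the sum exceeds the sum of the variances.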

This completes the review of discrete probability. You do not need to understand everything in this review in order to understand the contents of this book. The more you do understand, however, the more likely it is that you will be able to extend the concepts in this book to build even more powerful trading strategies.
