B. Math for Simulations, part 2

In the last post, I wrote an introduction to random variables and probability distributions, including the discrete binomial and Poisson distributions. I also mentioned the uniform distribution. Now I’ll attempt to interpret the normal (Gaussian) distribution and the central limit theorem.

Previous: Math for Simulations, part 1

As mentioned before, these posts are my aide-memoires. They do not replace the course slides, but present some additional notes that may help understand the theory behind the techniques.

B.2 Concepts in Math with Continuous Samples

B.2.1 The Summation Function ∑

Summation (and its big brother, the product ∏) appears a lot in MMAI, so it is good to know it well. This is old hat to anyone who took discrete math in high school, but it’s worth reviewing.

Take the following set: {1, 3, 9}

The concept of summation is pretty easy. If you add all the elements in the set together, you get the answer 13. Using summation notation, this would be $$ \color{#515151}{\sum [1, 3, 9]= 1 + 3 + 9 = 13} $$

The summation symbol, ∑ (sigma), basically means: add everything that follows. Now, usually one does not punch in every number in the set. You would assign the numbers to positional variables instead.

So \(\color{#515151}{\sum\{x_1, x_2, x_3\}=x_1+x_2+x_3=13}\).

Not much better. The power of summation notation lies in using an iterator, or index, denoted “i” to indicate each element in turn. For our example, we want to add the elements from position 1 through position 3, so we set the limit to 3 and the starting value of i to 1.

So \(\color{#515151}{\sum^3_{i=1}x_i=x_1+x_2+x_3=13}\).

Often, especially when programming, we won’t know the length of the set, i.e. the number of elements. It is simpler just to put down a symbol meaning “the length of the set”. Usually that symbol is n.

So now we get: \(\color{#515151}{\sum^n_{i=1}x_i=13}\).

In Python, you might program it thus. Note that in programming, the first element of a list is indexed 0, not 1.

X = [1, 3, 9]  # The set
answer = 0
n = len(X)  # i.e., 3 elements in the set
for i in range(n):  # i starts at 0 (the first position), then 1, then 2, ...
    answer += X[i]
print(answer)
13

Of course it is simpler and faster to use a library function:

import numpy as np

print(np.sum([1, 3, 9]))
13

B.2.1.1 Summation of Squares

Q: What is \(\color{#515151}{\sum^n_{i=1}x_i^2}\), assuming X is still {1, 3, 9} ?

A: It is the sum of the squares of each element in X. \(\color{#515151}{\sum^n_{i=1}x_i^2=x_1^2+x_2^2+x_3^2=91}\)

In Python:

X = [1, 3, 9]
answer = 0
n = len(X)
for i in range(n):  # the first position is 0, not 1, in Python
    answer += X[i] ** 2  # each value raised to the power of 2
print(answer)
91

List comprehensions are used a lot in MMAI ML. You do not need to set the start of the index i or the limit n, because Python handles those for you:

import numpy as np
X = [1, 3, 9]
squared_X = [x ** 2 for x in X]  # list comprehension: square every x
print(squared_X)
[1, 9, 81]

answer = np.sum(squared_X)
print(answer)
91

And making it even simpler…

import numpy as np

print(np.sum(np.square([1, 3, 9])))
91

B.2.2 Concepts for Statistics

B.2.2.1 Variance

Variance is a measurement of the spread between numbers in a dataset relative to the mean (i.e., relative to the average of the set).1 It is depicted by the symbol \(\color{#515151}{\sigma^2} \). Variance is calculated with the function: $$ \color{#515151}{\sigma^2 = {{\sum}_{i=1}^n(x_i - \overline x)^2 \over n }} $$

where \(\color{#515151}{x_i}\) is each value in the set, \(\color{#515151}{\overline x}\) is the mean of the set, and n is the number of values.

There’s that summation equation! So, let’s consider the set:

X = {1, 7, 3, 8, 9, 2, 5}

The smallest value is 1, and the highest value is 9. Since 9 - 1 = 8, the values are spread over a range of 8, so a variance somewhere in the ballpark of 8 would not be surprising.

Punching this into a calculator sucks. In Python, either use a list comprehension as a handy-dandy summation function, or use numpy’s built-in var function:

import numpy as np

my_set = np.array([1, 7, 3, 8, 9, 2, 5])
print(round(np.var(my_set), 4))
8.2857
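
The list-comprehension route might look like this; a minimal sketch that applies the variance formula above directly:

X = [1, 7, 3, 8, 9, 2, 5]
mean = sum(X) / len(X)  # x-bar = 5.0
print(round(sum([(x - mean) ** 2 for x in X]) / len(X), 4))  # divide by n
8.2857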

I tried the same calculation with Python’s statistics.variance() function. It came up with something different, since it divides by N - 1 rather than N, treating the data as a sample from a “hypothetical infinite population.” I won’t go into the theory here.
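
For comparison, the standard library has both flavours: statistics.pvariance() divides by N (population), while statistics.variance() divides by N - 1 (sample):

from statistics import pvariance, variance

data = [1, 7, 3, 8, 9, 2, 5]
print(round(pvariance(data), 4))  # population variance: divide by N
8.2857
print(round(variance(data), 4))  # sample variance: divide by N - 1
9.6667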

B.2.2.2 Standard Deviation

Standard Deviation is, just like variance, the measurement of the dispersion of a set relative to its mean.2 It is the square root of variance. The main difference between variance and standard deviation is that standard deviation is expressed in the same units as the original values (metres, visits per hour), whereas variance is the original values squared (metres squared, visits per hour squared). 3

Standard deviation’s symbol is lower-case sigma, σ. Taking the set [1, 7, 3, 8, 9, 2, 5] above, its variance was calculated as 8.2857, so its standard deviation is √8.2857 ≈ 2.8785.
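
In numpy, np.std() computes it directly (by default the population standard deviation, matching np.var()):

import numpy as np

my_set = np.array([1, 7, 3, 8, 9, 2, 5])
print(round(np.std(my_set), 4))  # same as np.sqrt(np.var(my_set))
2.8785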

B.2.2.3 Expected Value

Expected value is just a fancy term for the mean (i.e., the average) of the outcomes assigned to the random variable, assuming every outcome is equally likely. It is as simple as (1 + 3 + 5)/3 = 9/3 = 3. Average = mean = expected value.

$$ \color{#515151}{E(X) = {\sum_{i=1}^n x_i \over n}} $$
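
In Python, the expected value of equally likely outcomes is just the mean; a trivial sketch:

import numpy as np

print(np.mean([1, 3, 5]))  # average = mean = expected value
3.0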

B.2.2.4 Population vs Sample
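
A population is every outcome of interest; a sample is a subset drawn from that population. Statistics computed from a sample divide by n - 1 rather than n (as statistics.variance() did above) to compensate for a sample’s tendency to underestimate the spread of its population. numpy exposes this choice through the ddof (“delta degrees of freedom”) parameter; a quick sketch:

import numpy as np

data = np.array([1, 7, 3, 8, 9, 2, 5])
print(round(np.var(data), 4))  # ddof=0 (default): population variance
8.2857
print(round(np.var(data, ddof=1), 4))  # ddof=1: sample variance, divide by n - 1
9.6667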

B.2.2.5 Covariance

Remember random variables?

Covariance, in the context of random variables, determines how much the values of two random variables change together; basically, a measure of how the two vary in tandem.4 5 Take the following two random variables with the same sample space {1, 2, 3}:

X = {1, 2, 3} and Y = {1, 2, 3}

Let us say two samplings of a population give us the following outcomes:

Ω(1) = [1, 3, 2, 3, 3, 1]  # 1 is counted 2x, 2 just 1x, and 3 is 3x.
Ω(2) = [3, 3, 3, 1, 2, 1]  # The same counts!

Their frequency distributions are assigned to the r.v.’s:

f(X) = f[Ω(1)] = [2, 1, 3]
f(Y) = f[Ω(2)] = [2, 1, 3]

Gosh, they have identical distributions! They have a highly positive covariance: here cov(X, Y) works out to 1, and the normalized version of covariance, the correlation coefficient, is +1, as high as it gets.

Now, say we have outrageously different samples.

X = {1, 2, 3} and Y = {1, 2, 3}

Ω(1) = [1, 2, 1, 1, 2, 3]
Ω(2) = [3, 3, 2, 3, 2, 1]

f(X) = f[Ω(1)] = [3, 2, 1]
f(Y) = f[Ω(2)] = [1, 2, 3]

Here the frequencies of outcomes 1 and 3 are completely swapped!
cov(X, Y) = -1, and the correlation coefficient is -1, as low as it gets: a highly negative covariance.

Covariance between the outcomes of two random variables is formally defined as the expected value of the product of their deviations from their individual expected values. $$ \color{#515151}{cov(X, Y) = E[(X-E[X])(Y - E[Y])]} $$

That’s pretty hard to digest. Let’s put it into language we can understand, a summation equation: the expectation becomes a sum over the paired deviations, divided by n: $$ \color{#515151}{E[(X-E[X])(Y - E[Y])] = {\sum_{i=1}^n(x_i - \overline{X})(y_i - \overline{Y}) \over n}} $$

This is the covariance for a population. We are dealing with samples, not the population, so the equation is altered: instead of dividing the summation by the set’s length n, we divide by n - 1 (the same Bessel’s correction that statistics.variance() applied above).

$$ \color{#515151}{cov(X, Y) = {1 \over n-1}\sum_{i=1}^n(x_i - \overline{X})(y_i - \overline{Y})} $$

Note that when doing this on a computer, the data should be centered to prevent catastrophic cancellation. More on that in ML later on.

Take again the second series of outcomes from above:

f(X) = [3, 2, 1]
f(Y) = [1, 2, 3]
E[X] = 6/3 = 2
E[Y] = 6/3 = 2

(x1 - EX) * (y1 - EY) = (3 - 2) * (1 - 2) = -1
(x2 - EX) * (y2 - EY) = (2 - 2) * (2 - 2) = 0
(x3 - EX) * (y3 - EY) = (1 - 2) * (3 - 2) = -1

-1 + 0 - 1  = -2
-2 / (n - 1) = -2 / (3 - 1) = -2/2 = -1

cov(X,Y) = -1
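
To mirror that arithmetic in code, here is a minimal sketch of the sample-covariance formula:

X = [3, 2, 1]
Y = [1, 2, 3]
n = len(X)
mean_x = sum(X) / n  # E[X] = 2.0
mean_y = sum(Y) / n  # E[Y] = 2.0
print(sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1))
-1.0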

Or use the standard library’s statistics.covariance() function (available since Python 3.10):

from statistics import covariance

# Positive covariance
X = [1, 2, 3]
Y = [1, 2, 3]
covariance(X, Y)
1.0

# Negative covariance
X = [3, 2, 1]
Y = [1, 2, 3]
covariance(X, Y)
-1.0

# Zero covariance
X = [3, 3, 3]  # i.e. a horizontal line
Y = [1, 2, 3]  # i.e. a diagonal line up
covariance(X, Y)
0.0

B.2.3 Normal Distribution (The Bell Curve)

The Bell Curve, a.k.a. the Normal Distribution, a.k.a. the Gaussian Distribution, describes a great number of phenomena, such as the height of a group of people or the width of a part manufactured in a production process. The “height of a group of people” is nicely considered by Neal Stephenson in his book, Cryptonomicon.

The density function for the Normal Distribution gives the relative likelihood of a single test of a random variable coming up at a given value (for a continuous variable, the probability of any exact value is technically zero; the density is what traces the bell curve). It is calculated: $$ \color{#515151}{f(x) = {1 \over\sigma\sqrt{2\pi}}e^{-{1 \over 2}({x-\mu \over \sigma})^2}} $$

where µ is the mean of the distribution, σ is its standard deviation, and x is the value being tested.

(BTW, there is no need to memorize the formulae for MMAI. It is sufficient to punch in the values into Excel or a programming function such as Python’s scipy.stats.norm().)

Let’s take our famous set {1, 7, 3, 8, 9, 2, 5}. Its µ is 5. Its σ is 2.8785. What is the probability density at exactly 6? How about the probability of any value 8 and above?

$$ \color{#515151}{f(6) = {1 \over{2.8785\sqrt{2\pi}}}e^{-{1 \over 2}({6-5 \over 2.8785})^2}} $$

I’m not going to bother doing this by hand. In Python:

import numpy as np
from scipy.stats import norm
 
population = np.array([1, 7, 3, 8, 9, 2, 5])
x = 6

mean = np.mean(population)  # Should be 5
print(round(mean, 4))
5.0

std_deviation = np.std(population)  # Should be 2.8785
print(round(std_deviation, 4))
2.8785

prob = norm.pdf(x, loc=mean, scale=std_deviation)  # Probability density function
print(round(prob, 4))
0.1305

As for “values 8 and above,” it would be useful to know about the cumulative distribution function. The CDF calculates the probability that an r.v. will take a value less than or equal to its input. So we could punch “5” into a CDF to get the probability of any value up to and including 5. To get the probability of 8 and above, we just subtract the CDF at 8 from 1.0.

...

min_x = 8
cdf = norm.cdf(min_x, loc=mean, scale=std_deviation)
prob = 1 - cdf
print(round(prob, 4))
0.1487
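
As an aside, scipy also provides the survival function, norm.sf, which computes 1 - CDF in one step (and with better numerical precision far out in the tail):

prob = norm.sf(min_x, loc=mean, scale=std_deviation)  # survival function = 1 - CDF
print(round(prob, 4))
0.1487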

Bear in mind that a population of seven elements is hardly earth-shattering. The small set is used just to demonstrate the mathematics. The power of probability distributions is best used with increasingly large data sets. Enter the central limit theorem.

B.2.4 The Central Limit Theorem

The Central Limit Theorem holds that the distribution of sample means will look increasingly like a normal distribution around the population mean (i.e., a bell curve) as one takes more and more samples, so long as:

- the samples are drawn randomly and independently of one another; and
- the sample size is sufficiently large.

This holds regardless of the population’s actual distribution shape. 6 For populations whose probability distribution is skewed away from a normal distribution about its mean, you will want samples of 30 or more. Otherwise, fewer may do. A small simulation (see the sketch below) makes the point.
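
Here is a minimal sketch (assuming numpy) of that idea: draw many samples of size 30 from a heavily skewed exponential population and inspect the distribution of the sample means:

import numpy as np

rng = np.random.default_rng(42)

# An exponential population is heavily skewed, nothing like a bell curve,
# yet the means of 5,000 samples of size 30 pile up symmetrically
# around the population mean.
sample_means = [rng.exponential(scale=2.0, size=30).mean() for _ in range(5_000)]

print(round(float(np.mean(sample_means)), 4))  # close to the population mean, 2.0
print(round(float(np.std(sample_means)), 4))   # close to sigma/sqrt(30) = 2/sqrt(30), about 0.365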

If an electorate is close to normally distributed about a mean, fewer samples (e.g., fewer polling boxes) are sufficient to predict the winner of an election. Think of the 2022 Quebec election that was declared 12 minutes after the polls closed.

B.3 Conclusion

Next post is on simulation, now that the math is out of the way. The concepts in the last two posts will come back again and again for MMAI, so it is good to have a clear understanding of them. If this post isn’t clear enough, Google away! Now go back to the course slides.


  1. Investopedia. Variance. ↩︎

  2. Investopedia. Standard Deviation. ↩︎

  3. Scribbr. What’s the difference between standard deviation and variance? ↩︎

  4. CFI Team. Covariance. Corporate Finance Institute. ↩︎

  5. Wikipedia. Covariance. ↩︎

  6. Investopedia. Central Limit Theorem. ↩︎