B. Math for Simulations, part 2
In the last post, I wrote an introduction to random variables and probability distributions, including the discrete binomial and Poisson distributions, and I also mentioned the uniform distribution. Now I’ll attempt to interpret the normal (Gaussian) distribution and the central limit theorem.
Previous: Math for Simulations, part 1
As mentioned before, these posts are my aide-memoires. They do not replace the course slides, but present some additional notes that may help understand the theory behind the techniques.
B.2 Concepts in Math with Continuous Samples
B.2.1 The Summation Function ∑
Summation (and its big brother, the product ∏) appears a lot in MMAI, so it is good to know it well. Obviously, this is old hat to anyone who took discrete math in high school, but it’s worth reviewing.
Take the following set: {1, 3, 9}
The concept of summation is pretty easy. If you add all the elements in the set together, you get the answer 13. Using summation notation, this would be $$ \color{#515151}{\sum [1, 3, 9]= 1 + 3 + 9 = 13} $$
The summation symbol, ∑ (sigma), basically means: add everything that follows. Now, usually one does not punch in every number in the set. You would assign the numbers to positional variables instead.
- The first element of the set we would denote \(\color{#515151}{x_1=1}\)
- The second element is denoted \(\color{#515151}{x_2=3}\)
- The third element is \(\color{#515151}{x_3=9}\).
So \(\color{#515151}{\sum\{x_1, x_2, x_3\}=x_1+x_2+x_3=13}\).
Not much better. The power of summation notation lies in using an iterator, or index, denoted “i”, to indicate each element in turn. For our example, we want to add the elements from position 1 through position 3, so we set the starting value of i to 1 and the limit to 3.
So \(\color{#515151}{\sum^3_{i=1}x_i=x_1+x_2+x_3=13}\).
Often, especially when programming, we won’t know the length of the set, i.e. the number of elements. It is simpler just to put down a symbol meaning “the length of the set”. Usually that symbol is n.
So now we get: \(\color{#515151}{\sum^n_{i=1}x_i=13}\).
In programming, you might write it thus. Note that in Python, the first element of a list is at index 0, not 1.
X = [1, 3, 9]          # The set
answer = 0
n = len(X)             # i.e., 3 elements in the set
for i in range(0, n):  # i starts at 0 (not 1), then 1, then 2
    answer += X[i]
print(answer)
13
Of course it is simpler and faster to use a library function:
import numpy as np
print(np.sum([1, 3, 9]))
13
B.2.1.1 Summation of squares
Q: What is \(\color{#515151}{\sum^n_{i=1}x_i^2}\), assuming X is still {1, 3, 9} ?
A: It is the sum of the squares of each element in X. \(\color{#515151}{\sum^n_{i=1}x_i^2=x_1^2+x_2^2+x_3^2=91}\)
In Python:
X = [1, 3, 9]
answer = 0
n = len(X)
for i in range(0, n):    # the first position is 0, not 1, in Python
    answer += X[i] ** 2  # each value raised to the power of 2
print(answer)
91
One uses list comprehensions a lot in MMAI ML. You do not need to set the start of the index i or the limit n, because Python works those out for you:
import numpy as np
X = [1, 3, 9]
squared_X = [ x ** 2 for x in X ] # list comprehension: square every x
print(squared_X)
[1, 9, 81]
answer = np.sum(squared_X)
print(answer)
91
And making it even simpler…
import numpy as np
print(np.sum(np.square([1, 3, 9])))
91
B.2.2 Concepts for Statistics
B.2.2.1 Variance
Variance is a measure of the spread between numbers in a dataset relative to the mean (i.e., relative to the average of the set).1 It is denoted by the symbol \(\color{#515151}{\sigma^2} \). Variance is calculated with the function: $$ \color{#515151}{\sigma^2 = {{\sum}_{i=1}^n(x_i - \overline x)^2 \over N }} $$
where
- \(\color{#515151}{x_i}\) is each value in the set
- \(\color{#515151}{\overline x}\) is the average of all the values in the set. The “overline” over the x (\(\color{#515151}{\overline{x}} \)) always means “the mean of all values of x.”
- N is the number of values in the set. Here it is equivalent to n.
There’s that summation equation! So, let’s consider the set:
X = {1, 7, 3, 8, 9, 2, 5}
The smallest value is 1 and the highest is 9, so the range is 9 − 1 = 8; the spread measures should land somewhere in that ballpark.
- N = n = 7, the length of the set.
- \(\color{#515151}{\overline x}\) is (1 + 7 + 3 + 8 + 9 + 2 + 5)/7 = 5.0
- \(\color{#515151}{{\sum}_{i=1}^n(x_i - \overline x)^2 = (1 - 5.0)^2 + (7 - 5.0)^2 + … + (5 - 5.0)^2 = 58}\)
- \(\color{#515151}{\sigma^2 = { 58 \over 7 } ≈ 8.2857 }\)
Punching this into a calculator sucks. In Python, either use a list comprehension as a handy-dandy summation function, or use numpy’s built-in var function:
import numpy as np
my_set = np.array([1, 7, 3, 8, 9, 2, 5])
print(round(np.var(my_set), 4))
8.2857
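The list-comprehension route mentioned above works too; here is a minimal sketch (the variable names are my own):
X = [1, 7, 3, 8, 9, 2, 5]
mean = sum(X) / len(X)                    # x-bar = 5.0
sq_devs = [ (x - mean) ** 2 for x in X ]  # each (x_i - x-bar)^2
print(round(sum(sq_devs) / len(X), 4))    # sum, then divide by N
8.2857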
I tried the same calculation with Python’s statistics.variance() function. It came up with something different, because it divides by N − 1 rather than N: that version is the sample variance, used when estimating the variance of a “hypothetical infinite population.” I won’t go into that here.
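Still, a quick sanity check makes the difference concrete. Both functions live in the standard statistics module (pvariance is the population version matching our formula):
from statistics import variance, pvariance
data = [1, 7, 3, 8, 9, 2, 5]
print(round(pvariance(data), 4)) # divides by N, as above
8.2857
print(round(variance(data), 4))  # divides by N - 1
9.6667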
B.2.2.2 Standard Deviation
Standard Deviation is, just like variance, a measurement of the dispersion of a set relative to its mean.2 It is the square root of variance. The main difference between variance and standard deviation is that standard deviation is expressed in the same units as the original values (metres, visits per hour), whereas variance is expressed in those units squared (metres squared, visits per hour squared).3
Standard deviation’s symbol is lower-case sigma, σ. Taking the set [1, 7, 3, 8, 9, 2, 5] above, its variance is calculated as 8.2857.
- The square root of variance 8.2857 is \(\color{#515151}{\sqrt{8.2857} = 2.8785}\)
- Therefore \(\color{#515151}{\sigma = 2.8785 }\)
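In Python, numpy’s std function gets there directly; a quick sketch:
import numpy as np
my_set = np.array([1, 7, 3, 8, 9, 2, 5])
print(round(np.std(my_set), 4)) # the square root of np.var
2.8785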
B.2.2.3 Expected Value
Expected value is just a fancy term for the mean (i.e., the average) of the outcomes assigned to the random variable, provided each outcome is equally likely. It is as simple as (1 + 3 + 5)/3 = 9/3 = 3. Average = mean = expected value.
$$ \color{#515151}{E(X) = {\sum_{i=1}^n x_i \over N}} $$
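In Python, that is just the mean:
import numpy as np
X = [1, 3, 5]
print(np.mean(X)) # expected value, assuming equally likely outcomes
3.0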
B.2.2.4 Population vs Sample
- A population represents the full set of available data on a subject.
- A sample is a subset of a population. For our purposes, a sample should be a perfectly random selection.
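As a minimal sketch (my own example), drawing a random sample from a population in Python:
import random
population = list(range(100))             # the full set of available data
sample = random.sample(population, k=10)  # a random subset, drawn without replacement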
B.2.2.5 Covariance
Remember random variables?
Covariance, in the context of random variables, measures how much the values of two random variables change together: basically, a measure of the variance between the two.4 5 Take the following two random variables with the same sample space 1, 2, and 3:
X = {1, 2, 3} and Y = {1, 2, 3}
Let us say the first sampling of a population gives us the outcomes 1, 3, 2, 3, 3, 1, and a second sampling gives 3, 3, 3, 1, 2, 1. Their frequency distributions are assigned to the r.v.s:
Ω(1) = [1, 3, 2, 3, 3, 1] # 1 is counted 2x, 2 just 1x, and 3 is 3x.
Ω(2) = [3, 3, 3, 1, 2, 1] # The same counts!
f(X) = f[Ω(1)] = [2, 1, 3]
f(Y) = f[Ω(2)] = [2, 1, 3]
Gosh, they have identical distributions! They have a highly positive covariance: here cov(X, Y) = 1. (Strictly speaking, covariance is unbounded; it is the normalized correlation coefficient that is pinned between −1 and +1. For these small sets the two happen to coincide.)
Now, say we have outrageously different samples.
X = {1, 2, 3} and Y = {1, 2, 3}
Ω(1) = [1, 2, 1, 1, 2, 3]
Ω(2) = [3, 3, 2, 3, 2, 1]
f(X) = f[Ω(1)] = [3, 2, 1]
f(Y) = f[Ω(2)] = [1, 2, 3]
Here the frequencies of outcomes 1 and 3 are completely opposite!
cov(X, Y) = −1, which for this data coincides with a correlation of −1: a highly negative covariance.
Covariance between the outcomes of two random variables is formally defined as the expected value of the product of their deviations from their individual expected values. $$ \color{#515151}{cov(X, Y) = E[(X-E[X])(Y - E[Y])]} $$
That’s pretty hard to digest, since expectations over whole random variables are a subject best left for probability and linear algebra. Let’s put this into language we can understand, a summation equation: $$ \color{#515151}{E[(X-E[X])(Y - E[Y])] = {\sum_{i=1}^n(x_i - \overline{X})(y_i - \overline{Y}) \over n}} $$
This is the covariance for a population. We are dealing with samples, not the whole population, so the summation equation is altered: instead of dividing the sum by n, we divide by n − 1 (Bessel’s correction).
$$ \color{#515151}{cov(X, Y) = {1 \over n-1}\sum_{i=1}^n(x_i - \overline{X})(y_i - \overline{Y})} $$
Note that when computing with a computer, data should be centered to prevent catastrophic cancellation. More on that in ML later on.
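A tiny illustration (my own example, not from the course) of why centring matters: the textbook shortcut cov(X, Y) = E[XY] − E[X]E[Y] subtracts two huge, nearly equal numbers when the data sit far from zero, while the centred sum stays stable:
import numpy as np
rng = np.random.default_rng(0)
x = rng.random(1_000) + 1e9                      # values sitting far from zero
y = x + rng.random(1_000)                        # correlated with x; true cov is about 1/12
print(np.mean(x * y) - np.mean(x) * np.mean(y))  # shortcut: typically wildly off
print(np.mean((x - x.mean()) * (y - y.mean())))  # centred: close to 1/12 ≈ 0.083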
Take again the second series of outcomes from above:
f(X) = [3, 2, 1]
f(Y) = [1, 2, 3]
E[X] = 6/3 = 2
E[Y] = 6/3 = 2
(x1 - EX) * (y1 - EY) = (3 - 2) * (1 - 2) = -1
(x2 - EX) * (y2 - EY) = (2 - 2) * (2 - 2) = 0
(x3 - EX) * (y3 - EY) = (1 - 2) * (3 - 2) = -1
-1 + 0 - 1 = -2
-2 / (n - 1) = -2 / (3 - 1) = -2/2 = -1
cov(X,Y) = -1
In Python:
from statistics import covariance # available since Python 3.10
# Positive covariance
X = [1, 2, 3]
Y = [1, 2, 3]
covariance(X, Y)
1.0
# Negative covariance
X = [3, 2, 1]
Y = [1, 2, 3]
covariance(X, Y)
-1.0
# Zero covariance
X = [3, 3, 3] # i.e. a horizontal line
Y = [1, 2, 3] # i.e. a diagonal line up
covariance(X, Y)
0.0
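To tie the library call back to the summation formula, here is a hand-rolled sample covariance (a sketch; the helper name is mine):
def cov(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

print(cov([3, 2, 1], [1, 2, 3])) # -1.0, matching the hand calculation above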
B.2.3 Normal Distribution (The Bell Curve)
The Bell Curve, a.k.a. the Normal Distribution, a.k.a. the Gaussian Distribution, describes a great number of phenomena, such as the height of a group of people or the width of a part manufactured in a production process. The “height of a group of people” example is nicely considered by Neal Stephenson in his book, “Cryptonomicon.”
The density function for the Normal Distribution gives the relative likelihood that a single test of a random variable comes up at a given value. (For a continuous variable, exact probabilities come from areas under the curve; the density at a single point is a relative measure.) It is calculated: $$ \color{#515151}{f(x) = {1 \over\sigma\sqrt{2\pi}}e^{-{1 \over 2}({x-\mu \over \sigma})^2}} $$
where
- µ represents the mean, i.e., \(\color{#515151}{\overline x}\) from the previous example.
- σ is the standard deviation, the measurement of dispersion in a dataset.
(BTW, there is no need to memorize the formulae for MMAI. It is sufficient to punch the values into Excel or a programming function such as Python’s scipy.stats.norm().)
Let’s take our famous set {1, 7, 3, 8, 9, 2, 5}. Its µ is 5. Its σ is 2.8785. What is the probability density at exactly 6? How about the probability of any value 8 and above?
$$ \color{#515151}{f(6) = {1 \over{2.8785\sqrt{2\pi}}}e^{-{1 \over 2}({6-5 \over 2.8785})^2}} $$
I’m not going to bother doing this by hand. In Python:
import numpy as np
from scipy.stats import norm
population = np.array([1, 7, 3, 8, 9, 2, 5])
x = 6
mean = np.mean(population) # Should be 5
print(round(mean, 4))
5.0
std_deviation = np.std(population) # Should be 2.8785
print(round(std_deviation, 4))
2.8785
prob = norm.pdf(x, loc=mean, scale=std_deviation) # Probability density function
print(round(prob, 4))
0.1305
As for “values 8 and above,” it would be useful to know about the cumulative distribution function. The CDF calculates the probability that an r.v. will take a value less than or equal to its input. So we could punch “5” into a CDF to get the probability of any value up to and including 5 (technically, from −∞ to 5). To get the probability of 8 and above, we just subtract the CDF at 8 from 1.0.
...
min_x = 8
cdf = norm.cdf(min_x, loc=mean, scale=std_deviation)
prob = 1 - cdf
print(round(prob, 4))
0.1487
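As an aside, scipy exposes this subtraction directly as the survival function, norm.sf, which computes 1 − cdf in one call:
prob = norm.sf(min_x, loc=mean, scale=std_deviation) # survival function = 1 - cdf
print(round(prob, 4))
0.1487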
Bear in mind that a population of seven elements is hardly earth-shattering. The small set is used just to demonstrate the mathematics. The power of probability distributions is best used with increasingly large data sets. Enter the central limit theorem.
B.2.4 The Central Limit Theorem
The Central Limit Theorem holds that, as you take more and more samples from a population, the distribution of the sample means will look increasingly like a normal distribution around the population mean (i.e., a bell curve), so long as:
- The samples are the same size
- The samples are independent, i.e. the values in one sample do not influence the values in another. For example, two samples drawn from the same socioeconomic class of voters in a riding will be more skewed than samples drawn from two different classes. Checking the covariance of the samples (remember that?) may be of interest.
This holds regardless of the population’s actual distribution shape. 6 For populations whose probability distribution is heavily skewed away from normal about the mean, you will want samples of 30 or more observations; otherwise, fewer may do. As sampling grows:
- The mean of the sample means will get closer to the mean of the population
- The sample variances will likewise get closer to the variance of the population
- So basically: more samples = more accurate results!
If an electorate is close to normally distributed about its mean, fewer samples (e.g., fewer ballot boxes counted) are sufficient to predict the winner of an election. Think of the 2022 Quebec election, which was declared 12 minutes after the polls closed.
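A minimal sketch of the theorem in action (my own example): draw many samples of size 30 from a deliberately skewed population and watch the sample means settle around the population mean.
import numpy as np
rng = np.random.default_rng(42)
# Exponential population: heavily skewed, with mean 1.0 and standard deviation 1.0
sample_means = [rng.exponential(scale=1.0, size=30).mean() for _ in range(10_000)]
print(round(np.mean(sample_means), 4)) # ~1.0, the population mean
print(round(np.std(sample_means), 4))  # ~0.1826, i.e. sigma / sqrt(30)
A histogram of sample_means comes out bell-shaped even though the population itself is anything but.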
B.3 Conclusion
Next post is on simulation, now that the math is out of the way. The concepts in the last two posts will come back again and again for MMAI, so it is good to have a clear understanding of them. If this post isn’t clear enough, Google away! Now go back to the course slides.
- Investopedia. Standard Deviation ↩︎
- What’s the difference between standard deviation and variance? Scribbr ↩︎
- CFI Team. Covariance. Corporate Finance Institute ↩︎
- Wikipedia. Covariance ↩︎
- Investopedia. Central Limit Theorem ↩︎