Probability and Statistics Primer
AI leans on statistics and probability. The data sifted in machine learning is rarely laser-like in its patterns, so statistics and probabilities are used to make predictions as accurate as they can be given the data at hand. This post is based on the MMAI 863 course. It does not explain the slides, but serves as a set of examples to assist the author. It is not complete, either: basic set theory definitions (unions, intersections, etc.) are not included, as they are (at least to the author) quite obvious.
This primer does not rely on the notes on calculus or linear algebra.
Definitions
General
- Statistics: the collection, analysis, interpretation, organization, and presentation of numerical data.
- Descriptive statistics: data summarization (mean, median, frequency, standard deviation).
- Inferential statistics: conclusions about a population given a sample.
- Probability: conclusions about samples given a population.
Technique
- Random variable: a variable whose value is determined by the outcome of a probabilistic experiment. Denoted with a capital Roman letter.
- Sample space \(\Omega\): the set of all possible outcomes of a probabilistic experiment. Each outcome in the sample space is denoted by a small omega, \(\omega\).
- Event: a subset of outcomes of a probabilistic experiment.
- Sample statistics: random variables that are descriptive measures of a sample.
- Population parameters: constant – but usually unknown – characteristics of a population.
Assignment and Interpretation
- Classical approach: all outcomes are equally probable: \(1\over{n}\), where n is the count of all outcomes.
- Relative frequency (frequentist) approach: long-run relative frequency, \(P(Event)={{c}\over{n}}\), where c is the number of times the event occurs and n is the total number of trials (see the sketch after this list).
- Subjective/Indifference approach: “Type 1” decision-making; intuition, beliefs, etc.
- Interpretation: Probability is the relative frequency of an event. It is based on experiment, not subjectivity. Applied engineering and science depend on repeatability of experiments.
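A minimal Python sketch (the die example and the 100,000-roll count are my own illustrative choices, not from the course) contrasting the classical assignment with the long-run relative frequency:

```python
import random

# Classical approach: six equally likely faces, so P(rolling a 3) = 1/6.
classical = 1 / 6

# Relative-frequency approach: estimate P(rolling a 3) as c/n over many trials.
n = 100_000
c = sum(1 for _ in range(n) if random.randint(1, 6) == 3)
frequentist = c / n

print(f"classical   P(3) = {classical:.4f}")
print(f"frequentist P(3) = {frequentist:.4f}  (after {n} rolls)")
```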
Probability Properties
- Disjoint: when two events are mutually exclusive.
  A: Schrödinger’s cat is alive.
  B: Schrödinger’s cat is dead.
  Since alive implies not dead, \(A\cap{B}=\emptyset\).
- Joint: when two events are not mutually exclusive.
  A: Bob got COVID.
  B: Bob got toxoplasmosis.
  Since Bob can have both COVID and toxoplasmosis at the same time, lucky fellow, \(A\cap{B}\neq\emptyset\).
- MECE: “Mutually exclusive, collectively exhaustive.” From Paul Millerd on LinkedIn: the set {1, 2, 3, 4, 5, 6} is MECE when rolling a six-sided die. Should you roll a 3, you cannot get any other number (mutually exclusive), and the set fully represents all possible results (collectively exhaustive).
The Addition Rule
- \(P(A+B)=P(A)+P(B)\), if \(A\cap{B}=\emptyset\)
  A: Schrödinger’s cat has a 70% chance of being alive (depends on the half-life?)
  B: Schrödinger’s cat has a 30% chance of being dead.
  \(P(A+B)=0.7+0.3=1.0\)
- Similarly, for any MECE set of more than one event:
  \(P(A_{1}+A_{2}+\dots+A_{n})=1\)
- \(P(A+B)=P(A)+P(B)-P(A\cap{B})\), if \(A\cap{B}\neq\emptyset\)
  A: Bob has an 8% chance of having COVID this week.
  B: Bob has a 0.5% chance of suffering from toxoplasmosis this week.
  C: Bob has thankfully only a 0.04% chance of suffering from both.
  \(P(A+B)=0.08+0.005-0.0004=0.0846\)
  Bob has an 8.46% chance of suffering from COVID or toxoplasmosis (or both); see the sketch after this list.
- For any non-MECE set of three events:
  \(P(A_1+A_2+A_3)=P(A_1)+P(A_2)+P(A_3)-P(A_1\cap{A_2})-P(A_1\cap{A_3})-P(A_2\cap{A_3})+P(A_1\cap{A_2}\cap{A_3})\)
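A quick Python sketch reusing Bob’s made-up numbers from above, to check the addition rule for joint events and to show that excluding “both” gives a slightly different answer:

```python
# Hypothetical probabilities from the example above.
p_a = 0.08       # COVID this week
p_b = 0.005      # toxoplasmosis this week
p_both = 0.0004  # both at once

# Addition rule for joint (non-disjoint) events: P(A or B), "both" included.
p_a_or_b = p_a + p_b - p_both            # 0.0846

# For contrast, "A or B but not both" subtracts the overlap twice.
p_exactly_one = p_a + p_b - 2 * p_both   # 0.0842

print(f"P(A or B)             = {p_a_or_b:.4f}")
print(f"P(exactly one of A,B) = {p_exactly_one:.4f}")
```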
The Product Rule (Conditional Probability)
Conditional probability: The probability of event B occurring when it is known that event A has occurred, denoted \(P(B|A)\). It is read, “the probability that B occurs, given that A occurs.”
$$ P(B|A)={P(A\cap{B})\over{P(A)}} $$
and
$$ P(A\cap{B})=P(B|A)P(A) $$
Independent events: If \(P(A\cap{B})=P(A)P(B)\), then the events A and B are independent.
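A small enumeration sketch (two fair dice; the events are my own choice, not from the course) illustrating the conditional probability formula and the independence check:

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# A: the first die shows a 6.  B: the total is at least 10.
A = {o for o in outcomes if o[0] == 6}
B = {o for o in outcomes if sum(o) >= 10}

p_a = len(A) / len(outcomes)
p_b = len(B) / len(outcomes)
p_a_and_b = len(A & B) / len(outcomes)

# Conditional probability: P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a
print(f"P(B|A) = {p_b_given_a:.3f}")   # 0.5: second die must be 4, 5, or 6

# Independence check: is P(A and B) equal to P(A) * P(B)?
print(f"P(A and B) = {p_a_and_b:.3f}, P(A)P(B) = {p_a * p_b:.3f}")  # not equal -> dependent
```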
Generalization
This is an awful example, but let’s see what happens.
A: Bob has a 25% chance of going to a restaurant this week.
B: Bob has a 10% chance of eating sushi in the same time period.
C: Bob has a 0.05% chance of getting toxoplasmosis.
D: Bob has a 50% chance of spilling his soya sauce, knowing Bob.
What’s the chance that all of this is going to happen?
$$
P(A,B,C,D)=P(A)P(B|A)P(C|A\cap{B})P(D|A\cap{B}\cap{C})
$$
What is \(P(B|A)\)? Let us say, from history, half of the time Bob goes to a restaurant, he goes to the sushi restaurant in the food court. \(P(B|A)\) is therefore 50%.
$$
P(A,B,C,D)=0.25\times0.5\times{P(C|A\cap{B})}\times{P(D|A\cap{B}\cap{C})}
$$
What about \(P(C|A\cap{B})\)? Well, Bob has never gotten toxoplasmosis while eating sushi in a restaurant. That time Bob ate his own home-made sushi is out of scope.
$$
P(A,B,C,D)=0.25\times0.5\times{0}\times{P(D|A\cap{B}\cap{C})}
$$
There is no point in going on: since \(P(C|A\cap{B})=0\), the whole product is zero. Historically, Bob has never gotten sushi-restaurant toxoplasmosis, so the chain says he never will, no matter how likely the soya-sauce spill. With any luck, one gets the picture.
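A sketch of the chain-rule product using the fictional numbers above; the conditional values are just the guesses made in the text:

```python
# Chain rule: P(A,B,C,D) = P(A) * P(B|A) * P(C|A,B) * P(D|A,B,C)
p_a = 0.25          # Bob goes to a restaurant this week
p_b_given_a = 0.5   # given a restaurant visit, half the time it's sushi
p_c_given_ab = 0.0  # Bob has never caught toxoplasmosis from restaurant sushi
p_d_given_abc = 0.5 # spilling the soya sauce (never actually reached)

p_all = p_a * p_b_given_a * p_c_given_ab * p_d_given_abc
print(f"P(A,B,C,D) = {p_all}")   # 0.0 - one zero factor sinks the whole chain
```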
Independent Product Rule
When the events are independent, every conditional factor in the chain rule reduces to its unconditional probability:
$$ P(A,B,C,D)=P(A)P(B)P(C)P(D) $$
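A trivial sketch with the same made-up figures, treated (purely for the arithmetic) as if the four events were independent:

```python
# For independent events, each conditional factor equals its unconditional probability.
p = [0.25, 0.10, 0.0005, 0.50]   # hypothetical P(A), P(B), P(C), P(D)

p_all = 1.0
for p_i in p:
    p_all *= p_i

print(f"P(A,B,C,D) = {p_all}")   # 0.25 * 0.10 * 0.0005 * 0.50 = 6.25e-06
```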
Bayes’ Rule
This is big for AI. It demonstrates how systems can learn based on new data. Peter Gleeson, author of Bayes Rule Explained for Beginners, does a nice job in his post. Read it. Understand it. Some quick notes:
Bayes’ Rule:
$$ P(A|B) = P(A)\times{{P(B|A)\over{P(B)}}} $$
- P(A|B) is posterior probability, i.e. the updated probability after the evidence is considered (Gleeson: what are the chances of a goal, given the cheering coming from the neighbour watching the game on TV?)
- P(A) is the prior probability (what’s the likelihood of a goal at all?)
- P(B|A) is the likelihood, the probability of the evidence, given that A is true (What’s the likelihood of the neighbour cheering if there is a goal?)
- P(B) is the marginal probability, the probability of the evidence, no matter the circumstance (what’s the probability of cheering at all? Doesn’t matter why, perhaps it is for a goal, a break-away, an opposing player being tripped without the ref seeing, the streaker being chased by the cops ..?)
Bayes’ Rule is derived from the definition of conditional probability.
$$\begin{aligned} P(A\cap{B}) &= P(A|B)P(B) \\ P(A\cap{B}) &= P(B|A)P(A) \\ P(A|B)P(B) &= P(B|A)P(A) \\ P(A|B) &= {P(B|A)P(A)\over{P(B)}} \end{aligned}$$
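A sketch of Gleeson’s cheering example; the 5%, 70%, and 10% figures are my own assumptions for illustration, not from his post:

```python
# A: a goal was scored.  B: the neighbour cheers.
p_goal = 0.05              # prior: chance of a goal right now (assumed)
p_cheer_given_goal = 0.70  # likelihood: neighbour cheers if there is a goal (assumed)
p_cheer = 0.10             # marginal: neighbour cheers for any reason at all (assumed)

# Bayes' Rule: P(A|B) = P(A) * P(B|A) / P(B)
p_goal_given_cheer = p_goal * p_cheer_given_goal / p_cheer
print(f"P(goal | cheer) = {p_goal_given_cheer:.2f}")   # 0.35
```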
Generalization
Where \(A_r\) is one event chosen from a MECE set of events \(\{A_1, A_2, \dots, A_k\}\) that partitions the sample space:
$$\begin{aligned} P(A_r|B) &= {P(A_r\cap{B})\over{\Sigma^k_{i=1}P(A_i\cap{B})}}\\ P(A_r|B) &= {P(A_r)P(B|A_r)\over{\Sigma^k_{i=1}P(A_i)P(B|A_i)}} \end{aligned}$$
Diachronic Bayes is a term for updating the probability of a hypothesis, \(P(A|B)\), as new data B arrives over time. It is just a name for applying Bayes’ Rule repeatedly in practice.
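A sketch of diachronic updating over a MECE set of hypotheses (a fair-versus-biased coin, entirely my own example): each flip’s likelihood re-weights the current beliefs, and the posteriors become the priors for the next flip.

```python
# Hypotheses (a MECE partition): the coin is fair, or it is biased towards heads.
priors = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}   # likelihood of heads under each hypothesis

def update(priors, flip):
    """One diachronic step: posterior_i is proportional to prior_i * P(flip | hypothesis_i)."""
    likelihood = {h: p_heads[h] if flip == "H" else 1 - p_heads[h] for h in priors}
    unnormalized = {h: priors[h] * likelihood[h] for h in priors}
    evidence = sum(unnormalized.values())   # P(flip), the marginal over all hypotheses
    return {h: unnormalized[h] / evidence for h in priors}

beliefs = priors
for flip in "HHTHH":                        # observed data, one flip at a time
    beliefs = update(beliefs, flip)
    print(flip, {h: round(p, 3) for h, p in beliefs.items()})
```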
Discrete Distribution Functions
Reread Decision Analytics’ Math for Simulations 1.
Continuous Distribution Functions
A continuous random variable can take any of infinitely many values in an interval (for example, any real number between 0 and 1). The probability of hitting one exact value is 0, so probabilities are stated over ranges using the probability density function (PDF) and the cumulative distribution function (CDF). Both use integral calculus to calculate area under the curve (AUC).
- PDF \(f(x)\): \(P(a<X<b)=\int_a^b f(x)\,dx\), the AUC between bounds a and b.
- CDF \(F(x)\): \(P(X\le{x})=\int_{-\infty}^{x} f(t)\,dt\), the AUC bounded on the right by x, with no left bound (i.e., \(-\infty\)).
BTW, get used to AUC. It comes up as a measure of an ML model’s fitness for purpose.
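A sketch connecting the PDF, the CDF, and AUC, using SciPy’s standard normal; the distribution and the bounds are my own choices for illustration:

```python
from scipy.stats import norm   # standard normal: mean 0, standard deviation 1

a, b = -1.0, 1.0

# P(X = a) is exactly 0 for a continuous variable; the density f(a) is not a probability.
print(f"density f({a}) = {norm.pdf(a):.4f}")

# CDF: P(X <= b), the area under the PDF from -infinity up to b.
print(f"P(X <= {b}) = {norm.cdf(b):.4f}")

# P(a < X < b) is the area between the bounds: CDF(b) - CDF(a).
print(f"P({a} < X < {b}) = {norm.cdf(b) - norm.cdf(a):.4f}")   # ~0.6827
```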
Covariance and the Correlation Coefficient
Reread Decision Analytics’ Math for Simulations 2.
Conclusion
These notes on math should be enough to understand basic Machine Learning concepts. Some linear algebra may be necessary; where it is, the ML notes will have asides.