Probability and Statistics Primer

AI relies on probability and statistics. The data sifted through in machine learning rarely shows clean, laser-sharp patterns, so probability and statistics are used to make predictions as accurate as the data allows.

This post is based on the MMAI 863 course. It does not explain the slides, but offers examples to assist the author. It is not complete, either: basic set theory definitions (unions, intersections, etc.) are not included, as they are (at least to the author) quite obvious.

This primer does not rely on the notes on calculus or linear algebra.

Definitions

General

Technique

Assignment and Interpretation

Probability Properties

The Addition Rule

  1. \(P(A+B)=P(A)+P(B)\), if \(A\cap{B}=\emptyset\)
    A: Schrödinger’s cat has a 70% chance of being alive (depends on the half-life?)
    B: Schrödinger’s cat has a 30% chance of being dead.
    \(P(A+B)=0.7+0.3=1.0\)

  2. Similarly, for any MECE set of more than one event:
    \(P(A_{1}+A_{2}+\dots+A_{n})=1\)

  3. \(P(A+B)=P(A)+P(B)-P(A\cap{B})\), if \(A\cap{B}\neq\emptyset\)
    A: Bob has an 8% chance of having COVID this week.
    B: Bob has a 0.5% chance of suffering from toxoplasmosis this week.
    C: Thankfully, Bob has only a 0.04% chance of suffering from both.
    \(P(A+B)=0.08+0.005-0.0004=0.0846\)

Bob has an 8.46% chance of suffering from either COVID or toxoplasmosis (or both).

  4. For any non-MECE set of three events (a quick numeric check of these addition rules follows after this list):
    \(P(A_1+A_2+A_3)=\)
    \(P(A_1)+P(A_2)+P(A_3)-P(A_1\cap{A_2})-P(A_1\cap{A_3})-P(A_2\cap{A_3})+P(A_1\cap{A_2}\cap{A_3})\)
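
A minimal Python sketch checking these rules: the two-event numbers are the ones from the examples above, while the three-event overlap values are invented purely for illustration.

```python
# A quick numeric check of the addition rule, using the examples above.
# The three-event overlap figures are made up purely for illustration.

def union_two(p_a, p_b, p_ab=0.0):
    """P(A + B) = P(A) + P(B) - P(A ∩ B); use p_ab=0 for mutually exclusive events."""
    return p_a + p_b - p_ab

def union_three(p1, p2, p3, p12, p13, p23, p123):
    """Inclusion-exclusion for three possibly-overlapping events."""
    return p1 + p2 + p3 - p12 - p13 - p23 + p123

print(union_two(0.7, 0.3))                          # Schrödinger's cat: 1.0
print(round(union_two(0.08, 0.005, 0.0004), 4))     # COVID or toxoplasmosis: 0.0846
print(round(union_three(0.2, 0.3, 0.1,
                        0.05, 0.02, 0.03, 0.01), 4))  # invented overlaps: 0.51
```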

The Product Rule (Conditional Probability)

Conditional probability: The probability of event B occurring when it is known that event A has occurred, denoted \(P(B|A)\). It is read, “the probability that B occurs, given that A occurs.”

$$ P(B|A)={P(A\cap{B})\over{P(A)}} $$

and

$$ P(A\cap{B})=P(B|A)P(A) $$

Independent events: If \(P(A\cap{B})=P(A)P(B)\), then the events A and B are independent.
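
A minimal sketch of the product rule and the independence check; the probabilities below are invented for illustration.

```python
# Conditional probability, the product rule, and an independence check.
# The probabilities below are invented for illustration.

p_a = 0.5          # P(A)
p_b = 0.4          # P(B)
p_a_and_b = 0.2    # P(A ∩ B)

p_b_given_a = p_a_and_b / p_a          # P(B|A) = P(A ∩ B) / P(A)
print(p_b_given_a)                     # 0.4

print(p_b_given_a * p_a)               # product rule recovers P(A ∩ B): 0.2

# Independence: does P(A ∩ B) equal P(A) P(B)?
print(abs(p_a_and_b - p_a * p_b) < 1e-12)   # True -> A and B are independent here
```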

Generalization

This is an awful example, but let’s see what happens.

A: Bob has a 25% chance of going to a restaurant this week.
B: Bob has a 10% chance of eating sushi in the same time period.
C: Bob has a 0.05% chance of getting toxoplasmosis.
D: Bob has a 50% chance of spilling his soya sauce, knowing Bob.

What’s the chance that all of this is going to happen?

$$ P(A,B,C,D)=P(A)P(B|A)P(C|A\cap{B})P(D|A\cap{B}\cap{C}) $$

What is \(P(B|A)\)? Let us say, from history, half of the time Bob goes to a restaurant, he goes to the sushi restaurant in the food court. \(P(B|A)\) is therefore 50%.

$$ P(A,B,C,D)=0.25\times0.5\times{P(C|A\cap{B})}\times{P(D|A\cap{B}\cap{C})} $$

What about \(P(C|A\cap{B})\)? Well, Bob has never gotten toxoplasmosis while eating sushi in a restaurant. That time Bob ate his own home-made sushi is out of scope.

$$ P(A,B,C,D)=0.25\times0.5\times{0}\times{P(D|A\cap{B}\cap{C})} $$

There is no point in going on. Bob will probably never spill his soya sauce while getting toxoplasmosis from restaurant sushi, because historically he has never gotten sushi-restaurant-toxoplasmosis before. With any luck, one gets the picture.
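
Here is the same chain written as a minimal Python sketch, using the values assumed above; the soya-sauce conditional never matters because the zero wipes everything out.

```python
# Chain rule (general product rule) for Bob's restaurant example.
# P(A, B, C, D) = P(A) * P(B|A) * P(C|A ∩ B) * P(D|A ∩ B ∩ C)

p_a = 0.25            # Bob goes to a restaurant this week
p_b_given_a = 0.5     # half the time he goes, it is the sushi place
p_c_given_ab = 0.0    # he has never gotten toxoplasmosis from restaurant sushi
p_d_given_abc = 0.5   # spilling soya sauce, knowing Bob (value never matters here)

p_all = p_a * p_b_given_a * p_c_given_ab * p_d_given_abc
print(p_all)   # 0.0 -- a single zero factor kills the whole product
```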

Product Rule for Independent Events

If the events are mutually independent, each conditional term in the chain collapses to its unconditional probability, and the product rule simplifies:

$$ P(A,B,C,D)=P(A)P(B)P(C)P(D) $$
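
A tiny sketch with made-up independent events: four fair coin flips all landing heads.

```python
# Product rule for independent events: four fair coin flips all landing heads.
p_heads = 0.5
print(p_heads ** 4)   # 0.0625
```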

Bayes’ Rule

This is big for AI. It demonstrates how systems can learn based on new data. Peter Gleeson, author of Bayes’ Rule Explained for Beginners, does a nice job in his post. Read it. Understand it. Some quick notes:

Bayes’ Rule:

$$ P(A|B) = P(A)\times{{P(B|A)\over{P(B)}}} $$

Bayes’ Rule is derived from the definition of conditional probability.

$$\begin{aligned} P(A\cap{B}) &= P(A|B)P(B) \\ P(A\cap{B}) &= P(B|A)P(A) \\ P(A|B)P(B) &= P(B|A)P(A) \\ P(A|B) &= {P(B|A)P(A)\over{P(B)}} \end{aligned}$$
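
As a sketch, here is Bayes’ Rule applied to a made-up diagnostic-test example; the prevalence, sensitivity, and false-positive figures are invented for illustration.

```python
# Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)
# A = "Bob has the disease", B = "the test comes back positive" (invented numbers).

p_a = 0.01                # prior: 1% of people have the disease
p_b_given_a = 0.95        # P(positive | disease)
p_b_given_not_a = 0.05    # P(positive | no disease)

# P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # 0.161 -- a positive test is far from a sure thing
```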

Generalization

Where the sample space is partitioned into mutually exclusive events \(\{A_1, A_2, \dots, A_k\}\) and \(A_r\) is one of them:

$$\begin{aligned} P(A_r|B) &= {P(A_r\cap{B})\over{\sum^k_{i=1}P(A_i\cap{B})}}\\ P(A_r|B) &= {P(A_r)P(B|A_r)\over{\sum^k_{i=1}P(A_i)P(B|A_i)}} \end{aligned}$$
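
A minimal sketch of the partitioned form, using an invented three-machine quality-control example: the partition is which machine made a part, and B is “the part is defective”.

```python
# Generalized Bayes over a partition {A_1, ..., A_k}, with invented numbers.

p_machine = [0.5, 0.3, 0.2]           # P(A_i): share of production per machine
p_defect_given = [0.01, 0.02, 0.05]   # P(B | A_i): defect rate per machine

# Denominator: the total probability of a defective part
p_b = sum(p * d for p, d in zip(p_machine, p_defect_given))

# P(A_r | B): which machine most likely produced a defective part?
posteriors = [p * d / p_b for p, d in zip(p_machine, p_defect_given)]
print([round(x, 3) for x in posteriors])   # [0.238, 0.286, 0.476]
```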

Diachronic Bayes is a term referring to updating the probability of a hypothesis \(P(A|B)\) as new data \(B\) arrives over time. It’s just a term for applying Bayes’ Rule in practice.
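
A short sketch of the diachronic idea, where yesterday’s posterior becomes today’s prior; the biased-coin hypothesis and the observed flips are invented for illustration.

```python
# Diachronic Bayes: update a hypothesis one observation at a time.
# H: "the coin lands heads 80% of the time" vs. the alternative "the coin is fair".

p_h = 0.5                   # initial prior for the biased-coin hypothesis
p_heads_h, p_heads_fair = 0.8, 0.5

for flip in ["H", "H", "T", "H"]:            # invented observations arriving over time
    like_h = p_heads_h if flip == "H" else 1 - p_heads_h
    like_fair = p_heads_fair if flip == "H" else 1 - p_heads_fair
    evidence = like_h * p_h + like_fair * (1 - p_h)
    p_h = like_h * p_h / evidence            # posterior becomes the new prior
    print(flip, round(p_h, 3))               # belief rises on heads, falls on tails
```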

Discrete Distribution Functions

Reread Decision Analytics’ Math for Simulations 1.

Continuous Distribution Functions

A continuous random variable can take any value in an interval of real numbers, so there are infinitely many possible outcomes and the probability of hitting any single exact value is 0. Instead, the probability density function (PDF) and cumulative distribution function (CDF) are used: integrating the PDF over an interval, i.e. taking the area under the curve (AUC), gives the probability that the value falls in that interval, and the CDF gives the accumulated probability up to a point.
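
A small sketch using SciPy’s standard normal distribution, assuming scipy is available; the interval \([0, 0.5]\) is arbitrary.

```python
# PDF, CDF, and area under the curve (AUC) for a continuous distribution.
from scipy.stats import norm   # standard normal by default

# The probability of hitting exactly x = 0.5 is 0; the PDF only gives a density there.
print(norm.pdf(0.5))

# Probability that the value lands in [0, 0.5]: the area under the PDF on that
# interval, i.e. the difference of the CDF at the two endpoints.
print(norm.cdf(0.5) - norm.cdf(0.0))   # ≈ 0.1915
```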

BTW, get used to AUC. It comes up as a measure of an ML model’s fitness for purpose.

Covariance and the Correlation Coefficient

Reread Decision Analytics’ Math for Simulations 2.

Conclusion

These notes on math should be enough to understand the basics of machine learning concepts. Some linear algebra may be necessary, in which case the ML notes will have asides.