C. Simulation, part 1

Simulation is a one-to-one virtual model of reality. A series of variables represents real-life items. A gumball machine simulator represents, say, a series of gumballs, the machine mouth, perhaps the coins, and describes how those items interact. A model needs just enough detail to give an analyst the insight he or she needs. Think of the Pareto principle: 80% of consequences come from 20% of the causes. To exaggerate: one does not need to model the weight of each gumball, the Coriolis force, or the air-pressure dynamics outside the machine. Too much detail is computationally expensive and gives diminishing returns. Simulation is a brute-force (but fast) approach to problems that cannot be reduced to simple math.

In this post, we will run a simulation of random sales volumes generated by a normal distribution. It may seem silly, since one does not need a simulation to represent a normal distribution: statistical arithmetic would do just fine. That is exactly what makes it a good introductory proof of concept, though: the simulated results can be checked against known analytical answers.

The previous post discussed statistical distributions, but before charging ahead, there is one more concept to go over: the Confidence Interval (CI).

C.1 Simulation Proof of Concept

C.1.1 Confidence Interval

A confidence interval gives insight into the quality of an estimate. The quality (or lack thereof) is determined by the unavoidable sampling errors that crop up when pulling samples from a population. The CI is used later on in the course to test the null hypothesis. More on the null hypothesis later, but know that it is critical.

According to Investopedia, CI is the probability that a “population parameter will fall between a set of values for a certain proportion of times.”

Given a confidence level, say 95%, the confidence interval is a range built around a point estimate (an approximate value derived from a sample, e.g. a mean or a minimum) that is wide enough to contain the true population value at least 95% of the time. The smaller the sample size, the wider the CI. The more data, the narrower (i.e., the more precise) the CI.

The formula is: \(\color{#515151} {CI = \overline{X} \pm Z({S \over{\sqrt{n}}})}\), where \(\overline{X}\) is the mean of the input series X, n is the size of the sample, S is the population standard deviation, and Z is the Z-value of the chosen confidence level. The Z-value can be calculated or, better still, looked up in a z-table of commonly-used confidence levels such as 95% or 99%.
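If a z-table is not handy, the Z-value can also be computed with scipy. A minimal sketch, using norm.ppf (the inverse of the CDF):

from scipy.stats import norm

# A two-sided 95% confidence level leaves 2.5% in each tail, so the
# Z-value is the 97.5th percentile of the standard normal distribution.
confidence = 0.95
z_value = norm.ppf(1 - (1 - confidence) / 2)
print(round(z_value, 2))

>> 1.96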

C.1.1.1 CI Example

Take the sample heights (in centimeters): {168, 174, 132, 203, 197, 145, 188, 182, 170, 154}, and a given confidence level of 95%:

$$
\scriptsize{\color{#717171}{
\begin{aligned}
X &= [168, 174, 132, 203, 197, 145, 188, 182, 170, 154]\\
n &= 10\\
\overline{X} &= {\sum X\over{n}} = 171.3 \\
S &= 21.4805 \text{ (Standard Deviation of X, } \sigma \text{)}\\
Z &= 1.96 \text{ (The standard Z for 95\% confidence, from a look-up table)}\\
CI &= \overline{X} \pm Z({S\over{\sqrt{n}}})\\
CI &= 171.3 \pm 1.96({21.4805\over{\sqrt{10}}})\\
CI &= 171.3 \pm 13.3138\\
CI &= [157.99, 184.61]
\end{aligned}
}}
$$

As always, one can use numpy and scipy to do the work, with a small twist: the “norm.interval()” function takes the standard error instead of the standard deviation.

import numpy as np
import scipy.stats as st

my_set = np.array([168, 174, 132, 203, 197, 145, 188, 182, 170, 154])

confidence = 0.95

# st.norm.interval() uses the standard error of the mean (SEM) instead
# of the standard deviation. Since this is a POC, we set ddof to 0
# instead of the default 1, treating the sample as the full population
# so the result matches the hand calculation above.
se = st.sem(my_set, ddof=0)
mu = my_set.mean()

# One normally uses norm.interval() for n > 30 (a t-interval suits
# smaller samples), but we'll use it here anyway.
ci = st.norm.interval(confidence=confidence, loc=mu, scale=se)
print(ci)

>> (157.9865199348312, 184.61348006516883)
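For the record, the small-sample alternative hinted at above looks like this; a minimal sketch using scipy's t-distribution (an aside, not part of the main example):

# The t-interval widens the range to account for estimating the
# standard deviation from only n = 10 points.
ci_t = st.t.interval(
    confidence=confidence,
    df=len(my_set) - 1,
    loc=mu,
    scale=st.sem(my_set)  # default ddof=1 this time
)
print(ci_t)

The resulting interval is wider than the normal-based one, reflecting the extra uncertainty of a small sample.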

C.1.2 An easy simulation from a probability distribution

Let us pretend that we are selling widgets. The company's fixed cost covers rent, salaries, flat-rate electricity, etc. The variable cost is how much it costs to produce each widget. The expected sales volume and its distribution were determined by looking at the firm's sales history.

The idea of the simulation is to generate random inputs based on the probability distribution, and then analyze the outputs for insights such as probability of losing money on sales and probability of getting filthy rich.

C.1.2.1 Expected profit and variance this year

Linear relationships make for an easy simulation. Basically, the simulation will fire up a long series of random numbers according to the probability distribution. In Excel, this is done with the “@Risk” add-in, which has some pretty graphs. In Python, our friends scipy and numpy come to help. I’ll skip the graphs.

First, run the simulation:

import numpy as np
from scipy.stats import norm, sem

unit_price = 1000
unit_cost = 500
fixed_costs = 3000000

expected_sales_volume = 8000
expected_min_sales = 5000
expected_max_sales = 11000
z_score = 1.96  # 95% of the area under bell curve within
                # 1.96 standard deviations of the mean.

expected_profit = (expected_sales_volume * unit_price) - \
                  ((expected_sales_volume * unit_cost) + fixed_costs)

max_profit = (expected_max_sales * unit_price) - \
             ((expected_max_sales * unit_cost) + fixed_costs)

print(f"Expected Profit: {expected_profit}")
"1000000"

print(f"Maximum Profit: {max_profit}")
"2500000"

# What is the standard deviation for normal dist?
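# Assumption baked into this step: the expected maximum corresponds to
# the top of the 95% range, i.e. about 1.96 standard deviations above
# the mean.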
mu = expected_profit
standard_deviation = (max_profit - mu) / z_score
print("Standard deviation: ", round(standard_deviation))
"765306"

DATA_SIZE = 100  # "Iterations" in @Risk-speak
simulated_profits = np.random.normal(
    loc=mu,
    scale=standard_deviation,
    size=DATA_SIZE
)
sample_mu = np.mean(simulated_profits)
sample_standard_deviation = np.std(simulated_profits)
print("Sample standard deviation =", round(sample_standard_deviation))

Executing the code gives us:

Expected Profit: 1000000
Maximum Profit: 2500000
Standard deviation:  765306
Sample standard deviation = 718493

Note how the sample standard deviation differs from the population standard deviation. Simulations calculate estimates. The next time the simulation is run, it will give a different answer than this one, but the two values should be in the same ballpark. As more samples go into the simulation, the sample values tighten up toward the true values.
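A quick, illustrative way to see this convergence (the sample sizes below are arbitrary, chosen only to show the trend):

# The sample SD drifts toward the population SD (765306) as the number
# of iterations grows. Each run prints slightly different numbers.
for n in (100, 10_000, 1_000_000):
    draws = np.random.normal(loc=mu, scale=standard_deviation, size=n)
    print(n, round(np.std(draws)))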

Let’s ask ourselves some questions:

  1. What is the expected profit? (One doesn’t really need a simulation if we already know the expected sales volume, but let’s play the game).

    sample_expected_profit = round(sample_mu, 2)
    print("Sample expected profit = ${:,}".format(
       sample_expected_profit)
    )
    

    Which gives us:

    Sample expected profit = $1,133,369.46
    

    Did you see how the sample expected profit is not the same as the expected profit of $1,000,000.00? Running the simulation again and again will give different answers, but roughly 95% of the time they will land within the confidence interval around the true mean.

  2. What is the probability that we won’t break even, i.e. P(Y<0|X)?

    This one requires the use of the CI. We will calculate the probability of landing below the lower bound of the CI around the break-even point, which is profit = 0. The confidence level is 95%.

    target_profit = 0  # Break-even. Neg = loss, Pos = profit
    confidence = 0.95
    # stats.norm uses the Standard Error of the Mean (SE or SEM)
    # to calculate the confidence interval
    sample_se = sem(simulated_profits, ddof=0)
    target_ci = norm.interval(
        confidence=confidence, loc=target_profit, scale=sample_se
    )
    
    # Remember the CDF? It gives the probability of being at or below a
    # given point, i.e. the area under the curve up to that point.
    prob_no_break_even = norm.cdf(
        target_ci[0],  # Lower-end of the CI for target profit $0
        loc=sample_expected_profit,
        scale=sample_standard_deviation
    )
    print("Probability of not breaking even: ",
          round(prob_no_break_even, 4)
    )
    

    The answer:

    Probability of not breaking even:  0.0381
    

    Hmmm. A 4% chance. Doesn’t seem so bad. Of course, this is a rather small sample, and rerunning it can push the estimate up toward 10%, but we’ll get into that later down the page when we look at precision.

  3. What is the probability that we make more than $1M, i.e. P(Y>1000000|X)?

    target_profit = 1000000
    target_ci = norm.interval(
        confidence=confidence, loc=target_profit, scale=sample_se
    )
    print("Probability of scoring more than 1M in sales: ",
        round(
            1 - norm.cdf(
                target_ci[1],  # The upper bound of the CI this time
                loc=sample_expected_profit,
                scale=sample_standard_deviation),
            4
        )
    )
    
    Probability of scoring more than 1M in sales:  0.4959
    

    There is a 50% chance we’re rolling in the dough.
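Before moving on, a quick sanity check: since the simulated draws are already in hand, the same two probabilities can be estimated empirically by simply counting outcomes. This cross-check is not part of the workflow above, just a sketch:

# Fraction of simulated runs below break-even and above the $1M mark.
p_loss = np.mean(simulated_profits < 0)
p_above_1m = np.mean(simulated_profits > 1_000_000)
print("Empirical P(profit < 0):   ", round(p_loss, 4))
print("Empirical P(profit > $1M): ", round(p_above_1m, 4))

With only 100 iterations these counted estimates are crude, but they should land in the same ballpark as the figures above.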

C.1.3 The power of added data

We derived those values from a pretty small sample. The more samples we add, the closer the simulation gets to the true values. However, adding too many iterations brings diminishing returns for the computational expense. To optimize the number, one must decide how precise the simulation needs to be, a value represented by q. Once one has q, one can calculate the number of samples required, n:

$$ \color{#515151} { n = \left[{ZS_n\over{q}}\right]^2 } $$

where \(S_n\) is the sample standard deviation, Z is the z-score of the confidence level, and q is the precision. Here, precision is defined as half of the width of the confidence interval.

For example, in the case of the unit sales with the confidence value of 95%, perhaps we want a precision of $20000 in profits, that is to say we want +/- $20000 in our answers. We set q, quite simply, to 20000. Z is set to 1.96.

\(S_n\) needs to be determined by experiment, but fortunately a moderately small run is enough to estimate it. We already did that above with a sample size of only 100. Here it is again:

Sample standard deviation = 718493

This sample standard deviation is off from the population standard deviation because 100 samples really is not that much. The estimate is still good enough to set a second, better sample size. Why? Because with a normal distribution, a sample size of only 100 is not a grave sin. Remember the central limit theorem? Now let’s get the sample size we want for a precision of $20000:

$$ \color{#515151} { n = \left[{1.96\times718493\over{20000}}\right]^2 \approx 4958 } $$

To test it out:

def simulate_profit(
    mu: float,
    standard_deviation: float,
    precision_in_dollars: int
) -> tuple[float, tuple[float, float]]:
    """Simulate profits on a sample SD and a given precision q."""
    OPTIMAL_SAMPLE_SIZE = int(
        ((z_score * standard_deviation)/precision_in_dollars)**2
    )
    print(f"For a precision of ${precision_in_dollars}, "
          f"use {OPTIMAL_SAMPLE_SIZE} iterations.")
    simulated_profits = np.random.normal(
        loc=mu,
        scale=standard_deviation,
        size=OPTIMAL_SAMPLE_SIZE
    )
    expected_profit = np.mean(simulated_profits)
    sample_se = sem(simulated_profits, ddof=0)
    target_ci = norm.interval(
        confidence=confidence, loc=expected_profit, scale=sample_se
    )
    return expected_profit, target_ci


simulated_profit, target_ci = simulate_profit(
    expected_profit, 718493, 20000
)
print("""
The simulated profit is ${}, within a range of \n${} - ${}
""".format(
    round(simulated_profit, 2),
    round(target_ci[0], 2),
    round(target_ci[1], 2))
)
For a precision of $20000, use 4900 iterations.

The simulated profit is $994889.65, within a range of
$973219.93 - $1016559.36

Pretty close to the target of +/- $20K, and close to the actual profit of $1M! How about a precision of $40000?

$$ \color{#515151} { n = \left[{1.96\times718493\over{40000}}\right]^2 = 1239 } $$

Calling simulate_profit() above with a precision of $40000 (which works out to 1239 iterations) gives us:

For a precision of $40000, use 1239 iterations.

The simulated profit is $967948.03, within a range of
$925222.65 - $1010673.4

In short, roughly $0.97M +/- $40K. The simulated profit has moved away from the true value of $1M.

In the next post, we will continue simulations with demand estimation.