Ch 4: Probability & Statistics - Intermediate¶
Read online or run locally
You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-04-probability-statistics/notebooks/02_intermediate.ipynb in Jupyter.
Chapter 4: Probability & Statistics¶
Notebook 02 - Intermediate¶
Key probability distributions, Central Limit Theorem, Bayes' theorem, and Maximum Likelihood Estimation—all with ML-relevant examples.
What you'll learn:
- Bernoulli, Binomial, Normal, Poisson, Uniform distributions
- Central Limit Theorem: why normal appears everywhere
- Bayes' theorem: medical test & spam classification
- Expected value, variance, MLE intuition
Time estimate: 3 hours
Generated by Berta AI | Created by Luigi Pascal Rondanini
1. Probability Distributions Overview¶
Each distribution models a different kind of randomness. Think of them as templates: when your data looks a certain way, you pick the matching distribution.
- Bernoulli: A single yes/no question. "Is this email spam?" One trial, two outcomes. Parameter p = P(success).
- Binomial: How many yeses in N tries. "How many spam emails in 100?" Same as N independent Bernoulli trials.
- Normal: The bell curve. Most things cluster around the average—heights, test scores, measurement errors. Central Limit Theorem makes it show up everywhere.
- Poisson: How many rare events in a time period. "Server crashes per day." "Customer arrivals per hour."
- Uniform: Every outcome equally likely. "Random number between 0 and 1."
| Distribution | Use Case | Parameters |
|---|---|---|
| Bernoulli | Single binary trial (correct/incorrect) | p |
| Binomial | n Bernoulli trials (accuracy over n samples) | n, p |
| Normal | Continuous, CLT, activations | μ, σ² |
| Poisson | Counts per interval (events, requests) | λ |
| Uniform | Equal probability over range | a, b |
See assets/diagrams/probability_distributions.svg for relationships.
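Before plotting, a quick sanity check on the Bernoulli→Binomial relationship in the table: summing n Bernoulli(p) draws should reproduce Binomial(n, p). A minimal sketch (illustrative sample counts, using numpy's random generator):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.5

# A Binomial(n, p) draw is the sum of n independent Bernoulli(p) trials
bernoulli_sums = rng.binomial(1, p, size=(100_000, n)).sum(axis=1)
binomial_draws = rng.binomial(n, p, size=100_000)

print(f"Mean of Bernoulli sums: {bernoulli_sums.mean():.2f}  (theory: n*p = {n * p})")
print(f"Mean of Binomial draws: {binomial_draws.mean():.2f}")
print(f"Var  of Bernoulli sums: {bernoulli_sums.var():.2f}  (theory: n*p*(1-p) = {n * p * (1 - p)})")
```

Both empirical means land near n·p = 10 and both variances near n·p·(1−p) = 5, confirming the "Binomial = N independent Bernoulli trials" claim above.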
Plotting all four distributions. The code below builds a 2×2 grid: Bernoulli (two bars), Binomial (bell shape), Normal (smooth curve), Poisson (counts). We implement Binomial and Normal from scratch to see the math; Poisson uses scipy for convenience.
Each distribution appears in ML: Bernoulli for single predictions (spam or not), Binomial for accuracy over N samples, Normal for errors and activations, Poisson for count data (clicks, events).
import numpy as np
import matplotlib.pyplot as plt
from math import factorial, exp, sqrt, pi
def binomial_pmf(k, n, p):
"""P(X=k) for Binomial(n,p). From scratch, no scipy."""
if k < 0 or k > n:
return 0.0
coef = factorial(n) / (factorial(k) * factorial(n - k))
return coef * (p ** k) * ((1 - p) ** (n - k))
def normal_pdf(x, mu=0, sigma=1):
"""Normal PDF from scratch. φ(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))"""
return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)
x_binom = np.arange(0, 21)
pmf_binom = [binomial_pmf(k, 20, 0.5) for k in x_binom]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Bernoulli
axes[0,0].bar([0, 1], [0.4, 0.6], color=['#e74c3c', '#2ecc71'], edgecolor='black')
axes[0,0].set_xlabel('Outcome')
axes[0,0].set_ylabel('P(X)')
axes[0,0].set_title('Bernoulli(p=0.6)')
# Binomial
axes[0,1].bar(x_binom, pmf_binom, color='steelblue', edgecolor='navy')
axes[0,1].set_xlabel('k (successes)')
axes[0,1].set_ylabel('P(X=k)')
axes[0,1].set_title('Bin(20, 0.5)')
# Normal
x_norm = np.linspace(-4, 4, 200)
y_norm = [normal_pdf(x) for x in x_norm]
axes[1,0].plot(x_norm, y_norm, 'b-', linewidth=2)
axes[1,0].fill_between(x_norm, y_norm, alpha=0.3)
axes[1,0].set_xlabel('x')
axes[1,0].set_ylabel('φ(x)')
axes[1,0].set_title('Normal(0, 1)')
# Poisson (exact pmf via scipy)
from scipy.stats import poisson
x_poiss = np.arange(0, 15)
pmf_poiss = poisson.pmf(x_poiss, 4)
axes[1,1].bar(x_poiss, pmf_poiss, color='coral', edgecolor='darkred')
axes[1,1].set_xlabel('k')
axes[1,1].set_ylabel('P(X=k)')
axes[1,1].set_title('Poisson(λ=4)')
plt.tight_layout()
plt.savefig('../assets/diagrams/distributions_overview.svg')
plt.show()
What just happened: We drew the four distribution shapes. Bernoulli: just two bars (failure/success). Binomial: peaks in the middle (most likely to get ~10 heads in 20 flips). Normal: the iconic bell. Poisson: skewed for rare events (λ=4 means typically 3–5 events).
Try it yourself: Change the Binomial to Bin(20, 0.2)—now successes are rare. The peak moves left. Change Poisson to λ=10—the shape becomes more symmetric, approaching Normal. The CLT explains why!
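That λ=10 claim can be checked numerically. A sketch (assuming scipy is available, as in the Poisson cell above) comparing the Poisson(10) pmf at each integer to a Normal density with matching mean and variance (both equal λ):

```python
import numpy as np
from scipy.stats import norm, poisson

lam = 10
k = np.arange(0, 25)
pois_pmf = poisson.pmf(k, lam)
# Normal with the same mean and variance (both λ for a Poisson)
normal_vals = norm.pdf(k, loc=lam, scale=np.sqrt(lam))

max_gap = np.max(np.abs(pois_pmf - normal_vals))
print(f"Max |Poisson pmf - Normal pdf| at λ={lam}: {max_gap:.4f}")
```

The largest pointwise gap is already small at λ=10, and it shrinks further as λ grows.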
2. Central Limit Theorem¶
No matter what shape your data has, averages of samples will look normal. This is almost magical. Take uniform data (flat), skewed data, or weird multi-modal data—if you repeatedly draw samples of size n and average them, those averages form a bell curve. The larger n, the more normal it gets.
Why is this so important for AI? Sample means are everywhere: average accuracy over runs, mean of a minibatch, ensemble predictions. The CLT justifies using normal-based confidence intervals and hypothesis tests even when the raw data isn't normal.
Predict: If we repeatedly sample means of 30 random values from a Uniform distribution, what shape will the distribution of those means have?
Simulating the CLT. We take 10,000 samples of size 30 from Uniform(0,1), compute each sample's mean, and histogram those 10,000 means. The left panel shows raw uniform data (flat), the middle shows means of 5 draws (starting to mound), and the right shows means of 30 draws—bell-shaped!
# Simulate CLT: sample means from Uniform(0,1)
n_samples = 10000 # number of sample means
sample_size = 30 # each mean is from 30 draws
means = [np.random.uniform(0, 1, sample_size).mean() for _ in range(n_samples)]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
# Original: single uniform sample
single_sample = np.random.uniform(0, 1, 1000)
axes[0].hist(single_sample, bins=30, color='steelblue', edgecolor='black', density=True)
axes[0].set_title('Uniform(0,1) — Original')
axes[0].set_xlabel('x')
# Sample means with n=5
means_5 = [np.random.uniform(0, 1, 5).mean() for _ in range(n_samples)]
axes[1].hist(means_5, bins=40, color='green', alpha=0.6, edgecolor='black', density=True, label='n=5')
axes[1].set_title('Means of 5 samples')
axes[1].set_xlabel('Sample mean')
# Sample means with n=30
axes[2].hist(means, bins=50, color='purple', alpha=0.7, edgecolor='black', density=True, label='n=30')
x_fit = np.linspace(0.3, 0.7, 100)
y_fit = [normal_pdf(x, 0.5, sqrt(1/12/30)) for x in x_fit] # σ² = 1/12 for Uniform
axes[2].plot(x_fit, y_fit, 'r-', linewidth=2, label='Normal fit')
axes[2].set_title('Means of 30 samples → Normal!')
axes[2].set_xlabel('Sample mean')
axes[2].legend()
plt.tight_layout()
plt.savefig('../assets/diagrams/clt_simulation.svg')
plt.show()
print("As n increases, distribution of sample means becomes Normal (CLT).")
What just happened: The three panels show: (1) Uniform data is flat. (2) Means of 5 samples—starting to mound. (3) Means of 30 samples—clearly normal. The red curve is the theoretical Normal fit. The CLT works!
Common mistake: Assuming your raw data must be normal to use normal-based tests. The CLT says sample means are normal—so even if raw data is skewed or uniform, the mean of 30+ samples is approximately normal. That justifies t-tests and z-tests.
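The CLT also quantifies how fast the spread shrinks: the standard deviation of the sample mean is σ/√n. A sketch checking this for Uniform(0,1), whose variance is 1/12 (sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = np.sqrt(1 / 12)  # std of Uniform(0, 1)

for n in (5, 30, 100):
    # 10,000 sample means, each from n draws
    means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: empirical std of means = {means.std():.4f}, "
          f"theory σ/√n = {sigma / np.sqrt(n):.4f}")
```

This σ/√n factor is exactly what the red "Normal fit" curve in the previous plot used for its width.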
3. Bayes' Theorem¶
Updating beliefs with evidence. You start with a prior belief P(H). You see evidence E. Bayes' theorem tells you how to update to the posterior:

P(H|E) = P(E|H) · P(H) / P(E)

A full worked example: 1% of people have a rare disease. The test is 95% sensitive (catches true cases) and 98% specific (correctly clears healthy people). You test positive. What's P(disease|positive)? Most people guess 95%—but it's only ~32%! Why? The disease is rare, so most positives are false positives from the 99% of people who are healthy.
- Prior P(H): belief before evidence
- Likelihood P(E|H): how likely the evidence is given the hypothesis
- Posterior P(H|E): updated belief after evidence
See assets/diagrams/bayes_theorem.svg for the medical test example.
Medical test in code. We implement the Bayes update and apply it to the disease example. The key is computing P(positive) = P(positive|disease)×P(disease) + P(positive|healthy)×P(healthy)—the denominator.
def bayes_update(p_prior, p_likelihood, p_evidence):
"""Posterior = Likelihood * Prior / Evidence"""
return (p_likelihood * p_prior) / p_evidence if p_evidence > 0 else 0.0
# Medical test: P(disease)=0.01, sensitivity=0.95, specificity=0.98
p_disease = 0.01
p_pos_given_disease = 0.95 # sensitivity
p_neg_given_healthy = 0.98 # specificity
p_pos_given_healthy = 1 - p_neg_given_healthy
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = bayes_update(p_disease, p_pos_given_disease, p_pos)
print("Medical test example:")
print(f" P(disease) = {p_disease}")
print(f" P(positive|disease) = {p_pos_given_disease}")
print(f" P(disease|positive) = {p_disease_given_pos:.3f}")
print("→ Even with positive test, only ~32% chance of disease (rare prior).")
What just happened: Even with a positive test (95% sensitive!), the posterior is only ~32% because the prior was 1%. Rare diseases need very specific tests. This is why Bayes matters: the prior shifts the conclusion dramatically.
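One way to feel the strength of the prior is to run the update twice. A sketch, assuming a second test that is conditionally independent of the first given disease status (with the same sensitivity and specificity): the posterior from the first positive result becomes the prior for the second.

```python
def bayes_update(p_prior, p_likelihood, p_evidence):
    """Posterior = Likelihood * Prior / Evidence (same helper as above)."""
    return (p_likelihood * p_prior) / p_evidence if p_evidence > 0 else 0.0

p = 0.01                 # prior P(disease)
sens, spec = 0.95, 0.98  # sensitivity, specificity

for test in (1, 2):
    # P(positive) under the current belief p
    p_pos = sens * p + (1 - spec) * (1 - p)
    p = bayes_update(p, sens, p_pos)
    print(f"After positive test {test}: P(disease) = {p:.3f}")
```

A second positive pushes the probability from ~32% to ~96%: repeated independent evidence overwhelms even a strong prior.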
Spam classification: Same formula, different domain. "Free" appears in 60% of spam but only 10% of ham. Given that an email contains "free," we update P(spam) upward.
# Spam classification: P(spam)=0.4, P('free'|spam)=0.6, P('free'|ham)=0.1
p_spam = 0.4
p_free_given_spam = 0.6
p_free_given_ham = 0.1
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = bayes_update(p_spam, p_free_given_spam, p_free)
print("\nSpam classification example:")
print(f" P(spam) = {p_spam}")
print(f" P('free'|spam) = {p_free_given_spam}")
print(f" P(spam|'free') = {p_spam_given_free:.3f}")
print("→ Word 'free' raises spam probability significantly.")
What just happened: P(spam|"free") jumps from the prior 0.4 to 0.8. One word can strongly shift the probability—that's how Naive Bayes spam filters work (using many words).
Try it yourself: Change p_spam to 0.1 (less spam overall). How does P(spam|"free") change? The prior influences the posterior: with less spam in the world, the word "free" is less indicative.
4. Expected Value and Variance¶
E[X] is the long-run average. Var(X) is the spread. For model selection, both matter: a model with higher mean accuracy might still be worse if its variance is huge—you can't trust any single run.
ML example: Two models A and B. A has a higher mean but higher variance. Under a limited evaluation budget (e.g., 5 random seeds), B might be preferred (more reliable).
Simulating model variability. We draw 100 accuracy values for each model. Model A: mean 0.92, std 0.05. Model B: mean 0.88, std 0.02. The boxplot shows spread.
Think of it like choosing a job: higher salary (mean) is nice, but stability (low variance) matters too. A model that sometimes gets 0.99 and sometimes 0.75 is risky for production.
# Simulated model accuracies over 100 runs (e.g., different random seeds)
np.random.seed(42)
model_a = np.random.normal(0.92, 0.05, 100) # higher mean, higher variance
model_b = np.random.normal(0.88, 0.02, 100) # lower mean, lower variance
fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot([model_a, model_b], labels=['Model A (high var)', 'Model B (low var)'])
ax.set_ylabel('Accuracy')
ax.set_title('Model Selection: Mean vs Variance (100 runs each)')
ax.axhline(y=0.9, color='gray', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('../assets/diagrams/expected_value_variance.svg')
plt.show()
print(f"Model A: E[X]={model_a.mean():.3f}, Var={model_a.var():.4f}")
print(f"Model B: E[X]={model_b.mean():.3f}, Var={model_b.var():.4f}")
What just happened: Model A's box is wider—more variability. Model B is more consistent. If you only had 3 runs, A might give 0.95, 0.89, 0.92—hard to know the true mean. B would be more predictable.
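Another way to compare the two models is tail risk: how often does each dip below an acceptance threshold? A sketch using the same assumed means and stds (with a larger, illustrative sample count so the estimates are stable):

```python
import numpy as np

rng = np.random.default_rng(0)
acc_a = rng.normal(0.92, 0.05, 100_000)  # Model A: higher mean, higher variance
acc_b = rng.normal(0.88, 0.02, 100_000)  # Model B: lower mean, lower variance

threshold = 0.85
risk_a = (acc_a < threshold).mean()
risk_b = (acc_b < threshold).mean()
print(f"P(Model A < {threshold}) ≈ {risk_a:.3f}")
print(f"P(Model B < {threshold}) ≈ {risk_b:.3f}")
```

With these numbers, Model B falls below 0.85 slightly less often than Model A despite its lower mean—the variance drives the tail.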
5. Maximum Likelihood Estimation (MLE) Intuition¶
MLE: Finding the most likely explanation for your data. Given observed data, which parameter values make that data most probable? For Bernoulli: given [1,0,1,1,0], what p makes this sequence most likely? The answer: p = proportion of 1s = 3/5. Simple!
Predict: If we observe 7 heads in 10 flips, what is the MLE of p?
Bernoulli MLE in code. We compute p_hat = 7/10 and plot the likelihood function L(p) = p^7 × (1-p)^3. It peaks at p=0.7.
def mle_bernoulli(data):
"""MLE for Bernoulli: p_hat = proportion of successes."""
return sum(data) / len(data) if data else 0.0
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] # 7 ones
p_mle = mle_bernoulli(data)
print(f"Data: {data}")
print(f"MLE p_hat = {p_mle}")
# Visualize likelihood as function of p
p_vals = np.linspace(0.01, 0.99, 100)
likelihoods = [np.prod([p if x==1 else (1-p) for x in data]) for p in p_vals]
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(p_vals, likelihoods, 'b-')
ax.axvline(x=p_mle, color='red', linestyle='--', label=f'MLE p={p_mle}')
ax.set_xlabel('p')
ax.set_ylabel('Likelihood L(p|data)')
ax.set_title('Bernoulli MLE: Likelihood maximized at p = 7/10')
ax.legend()
plt.tight_layout()
plt.savefig('../assets/diagrams/mle_bernoulli.svg')
plt.show()
What just happened: The likelihood curve shows how probable the data is for each p. At p=0.7, it's maximized. MLE is the foundation of most parameter estimation in ML—including neural network training (cross-entropy is log-likelihood).
Try it yourself: Change the data to [1,1,1,0,0] (3 heads in 5 flips). Run mle_bernoulli. The MLE should be 0.6. Then try [1,1,1,1,1] — what happens? The MLE is 1.0: the likelihood L(p) = p^5 increases all the way up to the boundary, so the maximum sits at the edge of the parameter range.
Common mistake: Confusing MLE with the true parameter. MLE gives the best guess from the data. With only 10 flips, p_hat=0.7 could easily happen even if the true coin is fair (p=0.5). More data tightens the estimate.
Why this matters for AI: When we train a model, we are often doing MLE. The loss function we minimize corresponds to negative log-likelihood. Understanding MLE connects optimization to probability.
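To make that connection concrete, a sketch: minimizing the Bernoulli negative log-likelihood—which is exactly the binary cross-entropy summed over the observations—recovers the same p_hat as the closed-form MLE above.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # the 7-heads-in-10 data from above

def neg_log_likelihood(p, data):
    """Bernoulli NLL = binary cross-entropy summed over the observations."""
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

p_grid = np.linspace(0.01, 0.99, 99)  # grid with step 0.01
nll = np.array([neg_log_likelihood(p, data) for p in p_grid])
p_best = p_grid[nll.argmin()]
print(f"p minimizing NLL: {p_best:.2f} (matches the closed-form MLE p_hat = {data.mean():.2f})")
```

Maximizing likelihood and minimizing cross-entropy are the same optimization—the log just turns the product of probabilities into a sum.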
6. Summary¶
- Distributions: Bernoulli (one trial), Binomial (n trials), Normal (bell curve), Poisson (counts), Uniform—each models a different scenario
- CLT: Sample means → Normal as n increases; justifies normal-based inference
- Bayes: Update beliefs with evidence (medical test, spam)—prior matters!
- E[X], Var(X): Mean and spread both matter for model selection
- MLE: Choose parameters that maximize likelihood of observed data
Next: Hypothesis testing, confidence intervals, A/B testing.