Bayesian Statistics for Programmers

—– Part 1: Conditional Probability —–


  • Why did I write this?
    • This information is already widely available, but often buried behind jargon and overly complex or generalized explanations
    • Aiming to be a relatively short guide to Bayesian Statistics
    • Think 80/20, 80% of the information in 20% of the words
  • Why learn about Bayesian statistics?
    • Understanding probability is important for everyone
    • Bayesian statistics provides a different, but arguably more intuitive, method for calculating and interpreting probabilities
    • The framework provided by Bayesian statistics can be applied in your day-to-day life (often called a Bayesian mindset)
  • Assumptions of knowledge
    • Solid understanding of basic conditional probability
    • Solid understanding of basic probability distributions
    • Solid understanding of basic calculus (Integration is most important)
    • That’s it. Any notation and jargon will be explained
    • If you’re not sure you know enough, try to follow along anyway. You’ll probably be able to pick it up
  • What to expect
    • Written by a programmer, for programmers
    • Informal explanations with minimal jargon
    • References to specific statistical material
    • Covers mostly theory, not practical application
  • Covered content
  • Side notes
    • More in-depth explanations
    • Not necessary knowledge, but helps build a deeper understanding

Bayes’ Theorem

  • What is Bayes’ Theorem? (Conditional probability)
  • How is it derived? (It’s basically definitional)
  • Why use a Bayesian approach over a frequentist one?
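Bayes' theorem is simple enough to apply directly in a few lines of code. The numbers below (a diagnostic test with a 1% base rate, 99% sensitivity, and a 5% false positive rate) are made up for the example:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    """Posterior P(A|B) from likelihood, prior, and evidence."""
    return p_b_given_a * p_a / p_b

p_a = 0.01              # prior: P(disease)
p_b_given_a = 0.99      # likelihood: P(positive | disease)
p_b_given_not_a = 0.05  # false positive rate: P(positive | healthy)

# Evidence P(positive), via the law of total probability:
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes(p_b_given_a, p_a, p_b)
print(round(posterior, 4))  # 0.1667
```

Even with a 99% sensitive test, a positive result only implies about a 17% chance of disease here, because the prior (the 1% base rate) pulls the posterior down.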

The Bayesian Mindset

  • Conditional probability (Event A given event B)
  • Bake beliefs into the model using priors
  • Models produce a “degree of belief” (Bayesian) instead of a “frequency of occurrence” (Frequentist)
    • Frequentist: If the experiment were repeated many times, 95 out of 100 computed intervals y +- z would contain x
    • Bayesian: There is a 95% probability that x lies inside the range y +- z
  • Update the model as new information is acquired

Simple Application of Bayes’ Theorem

  • Computing posterior probability based on values
  • Bayesian tables
  • Simple conditional probability
  • Multiple ways of computing the answer
  • These are discrete
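A Bayesian table is easy to express in code: list the hypotheses, multiply each prior by its likelihood, and normalize. The fair-vs-biased coin below is a made-up example:

```python
# Hypotheses: the coin is fair (p = 0.5) or biased toward heads (p = 0.8).
# Observation: 3 heads in a row.
priors = {"fair": 0.5, "biased": 0.5}
likelihoods = {"fair": 0.5 ** 3, "biased": 0.8 ** 3}

# Multiply prior by likelihood for each row of the table...
unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
# ...then normalize by P(data) (the law of total probability).
total = sum(unnormalized.values())
posteriors = {h: v / total for h, v in unnormalized.items()}

print(posteriors)  # the biased coin is now roughly 4x more probable
```

Note that the same answer falls out however you arrange the computation, as long as each hypothesis row gets prior times likelihood and the rows are normalized to sum to 1.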

Side Note: Bayesian Networks - Chaining Bayes' Theorem

  • Chaining events into one another
  • Directed Acyclic Graph (DAG) of network
  • Conditional independence between nodes
  • Updating a network
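A sketch of the chaining idea, using a toy chain-shaped network A -> B -> C with invented probabilities. Each node depends only on its parent (conditional independence), so marginalizing proceeds one link at a time:

```python
# Chain-shaped Bayesian network: A -> B -> C.
# Given its parent, each node is independent of everything upstream.
p_a = 0.3                              # P(A = true)
p_b_given_a = {True: 0.9, False: 0.2}  # P(B = true | A)
p_c_given_b = {True: 0.7, False: 0.1}  # P(C = true | B)

# Marginalize over A to get P(B), then over B to get P(C).
p_b = p_b_given_a[True] * p_a + p_b_given_a[False] * (1 - p_a)
p_c = p_c_given_b[True] * p_b + p_c_given_b[False] * (1 - p_b)

print(round(p_b, 3), round(p_c, 3))  # 0.41 0.346
```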

—– Part 2: Distributions —–

Using Bayes’ Theorem with Continuous Distributions

  • Replace static prior with distribution
  • Posterior is also now a distribution
  • Show analytical calculation of posterior distribution
  • Explain the law of total probability (used to compute the evidence term)
  • Alpha and Beta hyperparameters
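One way to see the prior and posterior as distributions is a simple grid approximation (a numerical sketch, not a recommended production method). The coin-flip data and uniform prior below are assumptions for the example:

```python
import math

# Parameter of interest: a coin's bias p, on a grid over [0, 1].
n_grid = 1001
dx = 1 / (n_grid - 1)
grid = [i * dx for i in range(n_grid)]
prior = [1.0] * n_grid  # uniform prior density over p

# Made-up data: 7 heads in 10 flips, Binomial likelihood at each grid point.
heads, flips = 7, 10
likelihood = [math.comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)
              for p in grid]

# Law of total probability, approximated as a Riemann sum: the evidence
# p(data) integrates prior * likelihood over all values of p.
unnorm = [pr * li for pr, li in zip(prior, likelihood)]
evidence = sum(unnorm) * dx
posterior = [u / evidence for u in unnorm]  # posterior density values

mean = sum(p * d for p, d in zip(grid, posterior)) * dx
print(round(mean, 3))  # analytic answer is the Beta(8, 4) mean, 2/3
```

The posterior here is a whole curve over p, not a single number, and the evidence in the denominator is exactly the integral that the law of total probability describes.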

Side Note: Bayesian Hierarchical Modelling - Stacking Priors

  • Hyperparameters
  • Hyperpriors
  • Extended Bayes' Theorem: Modelling with hyperpriors (I.e. multiple layers of priors)

Conjugate Priors

  • Conjugate prior distributions
  • Basic model/prior conjugates
  • Conjugates explained intuitively
    • Certain distributions are underpinned by more generic distributions defined by the Alpha and Beta hyperparameters
  • If you want to learn more about conjugate priors (Be warned, it’s very math heavy and quite complicated):
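As a sketch of why conjugate priors are convenient: for a Beta prior with a Binomial likelihood (the classic conjugate pair), the posterior has a closed form, so updating is just adding counts to the Alpha and Beta hyperparameters. The numbers are illustrative:

```python
# Beta(alpha, beta) prior on a coin's bias + Binomial data
# => Beta(alpha + heads, beta + tails) posterior. No integration needed.
alpha, beta = 2, 2   # prior hyperparameters
heads, tails = 7, 3  # observed data

post_alpha, post_beta = alpha + heads, beta + tails
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, round(posterior_mean, 3))  # 9 5 0.643
```

Compare this with grinding through the evidence integral by hand: the conjugate update sidesteps it entirely, which is exactly why these pairs are so useful.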

—– Part 3: Simulations —–

Markov Chain Monte Carlo (MCMC) Simulations

  • Common modern technique
  • What happens without a conjugate prior or high dimensional problems?
    • The evidence term p(x) often can’t be simplified analytically
    • This is a very common situation in real-world problems
  • Allows solving intractable p(x) integrals numerically instead of analytically
  • Basically random sampling based on a probability distribution

Side Note: MCMC Simulations Explained

  • What exactly are Monte Carlo simulations?
  • Purpose of Monte Carlo simulations: Approximating intractable problems numerically
  • Specifics behind the process they use
First, let's quickly go over what "Markov Chain Monte Carlo" actually means. A Markov Chain is a memoryless random process: the next state depends only on the current state, not on any of the states before it. "Monte Carlo" refers to a family of algorithms known as Monte Carlo methods, which are generally used for numerically approximating an intractable or otherwise difficult computation. They approach this by randomly sampling from a probability distribution, then aggregating the sampled values to calculate an approximation. For example, a Monte Carlo algorithm can approximate the integral of p(x) (i.e. the area under the curve).
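As a concrete (made-up) example of a Monte Carlo method, here is a sketch that approximates the integral of f(x) = x^2 over [0, 1] by uniform random sampling; the true value is 1/3:

```python
import random

random.seed(0)  # fixed seed for reproducibility
n = 100_000
# The average of f at uniformly random points approximates the
# integral of f over [0, 1] (i.e. the area under the curve).
estimate = sum(random.uniform(0.0, 1.0) ** 2 for _ in range(n)) / n
print(round(estimate, 2))  # converges toward 1/3 as n grows
```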

Side Note: The Metropolis-Hastings and Hamiltonian Monte Carlo Sampling Algorithms

  • These are common MCMC sampling algorithms
  • MH is older, HMC is more modern and common nowadays
  • Brief overview of each
  • Similarities and differences
  • Why is HMC 'better'?


Metropolis-Hastings does not sample directly from the target distribution, but uses a clever technique to *effectively* sample from it. To generate samples, the algorithm only requires a function f(x) that is *proportional* to the target distribution P(x). That is, f(x) * k = P(x) for some (possibly unknown) normalizing constant k. How does this work? We'll step through the algorithm to find out:

1) Choose an arbitrary starting point xt for the initial sample. Also choose an arbitrary probability distribution g(x|y) (called the "proposal density" or "jumping distribution") to serve as a sample generator. This is usually a Gaussian (normal) distribution centered on y, which means that points close to y are more likely to be selected as the next sample candidate.

2a) Randomly generate a candidate sample x' for this iteration by drawing from g(x'|xt).

2b) Calculate the acceptance ratio of the candidate. This is the special sauce. Earlier we said that the algorithm only requires a function f(x) which is proportional to P(x), rather than P(x) itself. Calculating the acceptance ratio leverages this fact:

a = f(x')/f(xt) = P(x')/P(xt)

This works because the normalizing constant k cancels out:

P(x')/P(xt) = (f(x') * k)/(f(xt) * k) = f(x')/f(xt)

What we are calculating here is the probability of the new sample relative to the previous sample. If the new sample is more probable, it lies in a higher-density (i.e. higher-frequency) region of the P(x) distribution. This is the core reason why the algorithm converges on P(x) over many samples: values from high-density regions of P(x) are sampled more frequently simply by virtue of being more likely, and vice versa.

2c) Accept or reject the candidate sample. Here is where the acceptance ratio introduces randomness into the process. Generate a random number u in the range [0, 1] from a uniform distribution (i.e. every value has an equal chance). If u <= a, we accept the candidate: xt+1 is set to the newly generated sample x'. If u > a, we reject it: xt+1 keeps the current value xt.

3) Repeat step 2 until you reach the desired number of iterations.

The Metropolis-Hastings algorithm works well, but it has a couple of drawbacks. Remember how Markov Chains generate new states based on the current state? That means that successive states (and therefore generated samples) are somewhat correlated. This is bad because locally, samples will not reflect P(x) (though globally the samples will converge on P(x)). A related problem is that even though the samples eventually converge on P(x), the initial samples may not. This may require specifying a "burn-in" period, where initial samples are discarded to reduce bias in the generated samples.
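The steps above can be sketched in a short Python program (a minimal illustration, not a production sampler; the unnormalized standard normal target is an assumption chosen for the example):

```python
import math
import random

def metropolis_hastings(f, x0, n_samples, step=1.0, seed=0):
    """Draw samples from a distribution proportional to f."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Propose a candidate from a Gaussian centered on the current state.
        candidate = rng.gauss(x, step)
        # Acceptance ratio: only needs f, because any normalizing
        # constant cancels in the division.
        a = f(candidate) / f(x)
        if rng.random() <= a:
            x = candidate        # accept: move to the candidate
        samples.append(x)        # on rejection, repeat the current value
    return samples

# f is proportional to the standard normal N(0, 1) density; the
# constant 1/sqrt(2*pi) is deliberately left out.
f = lambda x: math.exp(-x * x / 2)
samples = metropolis_hastings(f, x0=0.0, n_samples=50_000)
burned = samples[5_000:]         # discard a burn-in period
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(round(mean, 2), round(var, 2))  # should land near 0 and 1
```

Two of the drawbacks mentioned above are visible here: rejected iterations repeat the previous value (so neighboring samples are correlated), and the chain only converges on the target after the burn-in samples are thrown away.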

Hamiltonian Monte Carlo

—– Part 4: Real-World Applications —–

Written on January 11, 2021