Diving into the world of computational biology without a traditional computer science background has been a bit like learning to swim in the deep end. I have found that one of the best ways to grasp a concept is to try to explain it. Over the past year, I have quickly entered the fun yet sometimes scary world of probabilistic programming, and I thought it would be a good time to take a step back and make sure I understand the basics. In this post, we will start off simple and slowly walk through the Bayesian approach to linear regression before diving into variational inference.

Linear Regression: A Primer

Linear regression is a statistical method enabling us to predict a dependent variable using one or more independent variables. It assumes a linear relationship between these variables, summarized by the formula:

$$ Y = \beta_0 + \beta_1X + \epsilon $$

Where:

- $Y$ is the dependent variable we want to predict,
- $X$ is the independent variable (the predictor),
- $\beta_0$ is the intercept (the value of $Y$ when $X = 0$),
- $\beta_1$ is the slope coefficient for $X$, and
- $\epsilon$ is the error term, capturing variation in $Y$ not explained by the linear relationship.

For multiple independent variables, the equation extends to:

$$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n + \epsilon $$

Fitting the Model: The process of "fitting" a linear regression model involves finding the values of the β coefficients that best explain the observed data. This is typically done using the method of least squares, which finds the best-fitting line by minimizing the sum of the squares of the vertical distances (residuals) of the points from the line.
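
As a minimal sketch of what this looks like in practice (the data are synthetic and the variable names are purely illustrative), we can fit a line with NumPy's least-squares solver:

```python
import numpy as np

# Synthetic data: y = 2 + 3x + noise (illustrative true values)
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)

# Design matrix with a column of ones for the intercept (beta_0)
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimize the sum of squared residuals
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"Estimated intercept: {beta[0]:.3f}, slope: {beta[1]:.3f}")
```

With enough data and well-behaved noise, the estimated intercept and slope land close to the values used to generate the data.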

Below is a basic scatter plot of data points that follow a linear trend, with a linear regression line (the best-fit line) fitted through them.

[Figure: scatter plot of data points with the fitted linear regression line]

By quantifying the strength and direction of the relationship between predictors and a dependent variable, linear regression provides a powerful tool for prediction and for understanding how each predictor influences the outcome. However, the commonly used least squares approach assumes a linear relationship and is particularly sensitive to outliers, which can distort the slope and intercept of the best-fit line. Least absolute deviation (which minimizes absolute errors rather than squared errors) is a variant of linear regression that is less influenced by outliers.
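
To see the contrast, here is a rough sketch that fits both objectives on the same synthetic data after adding a single large outlier. It uses SciPy's general-purpose optimizer rather than a dedicated LAD routine, so treat it as an illustration rather than the standard way to fit such a model:

```python
import numpy as np
from scipy.optimize import minimize

# Same synthetic data as before, plus one large outlier
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)
y[0] += 50.0  # a single extreme observation

X = np.column_stack([np.ones_like(x), x])

def sum_abs_residuals(beta):
    # Least absolute deviation objective: sum of |y - X @ beta|
    return np.abs(y - X @ beta).sum()

def sum_sq_residuals(beta):
    # Ordinary least squares objective, for comparison
    return ((y - X @ beta) ** 2).sum()

lad = minimize(sum_abs_residuals, x0=np.zeros(2), method="Nelder-Mead")
ols = minimize(sum_sq_residuals, x0=np.zeros(2), method="Nelder-Mead")
print("LAD (intercept, slope):", lad.x)
print("OLS (intercept, slope):", ols.x)
```

The least squares fit gets pulled noticeably toward the outlier, while the least absolute deviation fit stays much closer to the underlying trend.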

Transitioning to Bayesian Thinking: Prior vs. Posterior

Although classical linear regression can capture uncertainty through confidence intervals—either by assuming normal errors or using resampling methods like the bootstrap—these approaches remain rooted in frequentist concepts. Bayesian methods, including variational inference, instead model a full probability distribution over parameter values, offering a more comprehensive view of uncertainty. Rather than asking, “What is the single best estimate of our parameter?” or “What is the confidence interval around it?” Bayesian approaches pose the question, “What is the probability distribution of possible parameter values, given the data?” Grounded in Bayes’ theorem, this perspective updates our beliefs about parameters when new data is observed. Central to this framework are the prior and posterior distributions, which represent our knowledge about the parameters before and after seeing the data.
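
In symbols, Bayes’ theorem for the regression parameters $\theta$ (the coefficients $\beta_0, \beta_1, \dots$, plus any noise parameters) given observed data $\mathcal{D}$ reads:

$$ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} $$

Here $p(\theta)$ is the prior, $p(\mathcal{D} \mid \theta)$ is the likelihood, $p(\mathcal{D})$ is the evidence (marginal likelihood) that normalizes the expression, and $p(\theta \mid \mathcal{D})$ is the posterior we are after.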

Prior Distribution, p(θ)