Diving into the world of computational biology without a traditional computer science background has been a bit like learning to swim in the deep end. I have learned that one of the best ways to grasp a concept is to try and explain it. In the past year, I have quickly entered the fun yet sometimes scary world of probabilistic programming and I thought it would be a good time to take a step back and make sure I understand the basics. In this post, we will start off simple and slowly walk through the Bayesian approach to a linear regression before diving into variational inference.
Linear regression is a statistical method enabling us to predict a dependent variable using one or more independent variables. It assumes a linear relationship between these variables, summarized by the formula:
$$ Y = \beta_0 + \beta_1X + \epsilon $$
Where:
- $Y$ is the dependent variable we want to predict,
- $X$ is the independent variable (the predictor),
- $\beta_0$ is the intercept (the value of $Y$ when $X = 0$),
- $\beta_1$ is the slope (the change in $Y$ for a one-unit change in $X$), and
- $\epsilon$ is the error term, capturing variation not explained by the linear relationship.
For multiple independent variables, the equation extends to:
$$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \epsilon $$
Fitting the Model: "Fitting" a linear regression model means finding the values of the β coefficients that best describe the data. This is typically done with the method of least squares, which finds the best-fitting line by minimizing the sum of the squared vertical distances (residuals) between the observed points and the line.
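As a quick sketch of what that looks like in code (the data and variable names here are made up for illustration, not taken from this post), the least-squares solution can be computed directly from a design matrix:

```python
import numpy as np

# Toy data following an assumed linear trend y ≈ 1.5 + 0.8x, plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=100)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes the sum of squared residuals ||y - X @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated intercept and slope:", beta_hat)
```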
Below, we create a basic scatter plot of data points that follow a linear trend and then fit a linear regression line (the best-fit line) through these points.
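A minimal sketch of that figure, using synthetic data since the original plot isn't reproduced here, might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic points along a linear trend with noise (illustrative values)
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=60)
y = 2.0 + 1.2 * x + rng.normal(0, 1.5, size=60)

# Fit a first-degree polynomial (a line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

plt.scatter(x, y, alpha=0.6, label="data")
plt.plot(np.sort(x), intercept + slope * np.sort(x), color="red", label="best-fit line")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
```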
By quantifying the strength and direction of the relationship between the predictors and the dependent variable, linear regression provides a powerful tool for prediction and understanding the linear effect of individual variables. However, it assumes a linear relationship between the dependent and independent variables and is sensitive to outliers, which can significantly affect the slope and intercept of the best-fit line.
Linear regression offers single-point estimates for parameters, lacking an assessment of uncertainty. Bayesian methods, including variational inference, enrich this perspective by considering a spectrum of possible parameter values, encapsulated in a probability distribution. Instead of asking, "What is the single best estimate of our parameter?" Bayesian methods ask, "What is the distribution of possible values for our parameter, given the data?" This approach is grounded in Bayes' theorem, which updates our beliefs about the parameters as we observe new data. Central to this approach are the concepts of prior and posterior distributions, which embody our knowledge about the model parameters at different stages.
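Written out for a parameter vector $\theta$ and observed data $D$, Bayes' theorem is:

$$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} $$

where $p(\theta)$ is the prior, $p(D \mid \theta)$ is the likelihood, and $p(\theta \mid D)$ is the posterior. As a rough sketch of what this gives us for linear regression (the prior values and noise level below are illustrative assumptions, not numbers from this post), placing a Gaussian prior on the coefficients and assuming a known noise variance yields a closed-form Gaussian posterior:

```python
import numpy as np

# Synthetic data from an assumed model y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept column
sigma = 1.0                                      # assumed known noise standard deviation
y = X @ np.array([2.0, 3.0]) + rng.normal(0, sigma, size=50)

# Gaussian prior over [beta_0, beta_1]: zero mean, broad covariance
m0 = np.zeros(2)
S0 = 10.0 * np.eye(2)

# Conjugate Gaussian posterior (closed form because the noise variance is known)
S0_inv = np.linalg.inv(S0)
S_N = np.linalg.inv(S0_inv + (X.T @ X) / sigma**2)   # posterior covariance
m_N = S_N @ (S0_inv @ m0 + (X.T @ y) / sigma**2)     # posterior mean

print("Posterior mean:", m_N)                          # point estimates, analogous to least squares
print("Posterior std devs:", np.sqrt(np.diag(S_N)))    # uncertainty around each coefficient
```

Those posterior standard deviations are exactly the uncertainty that a single least-squares point estimate does not provide. When the posterior has no such closed form, we turn to approximations like variational inference, which is where this post is headed.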