Introduction to Model Fitting

Data is everywhere. From cosmology to public health, interested parties measure and tabulate descriptive features of the processes and objects they want to understand. Implicit in this activity is the hope (or, in less emotional terms, the assumption) that within this cacophony of description there are relationships, or principles, that explain what is observed. For our purposes here, to “explain” a set of data is to functionally relate all or a subset of that data (the inputs) to another set of data (the outputs). This definition leaves a few questions open, perhaps most pressing: how do we select this functional relationship? How do we even know which data to relate to which? More esoterically, one could ask what it means to “functionally relate” at all, and what constitutes a good versus a bad explanation. I think all of these are interesting, but I will address none of them here. Instead, I aim to describe how to select the best functional relationship once a class of functional relationships (i.e. a model) has been chosen and we have a clear idea of which variables are meant to explain the output data of interest. In much simpler terms: what does it mean to “fit” a model to data?

All non-trivial models have parameters that dictate how inputs are mapped, or transformed, to yield outputs. A linear function, y = mx + b, has parameters for slope, m, and y-intercept, b. Given control over these two parameters, one could draw every conceivable line (i.e. generate y values) for some arbitrary set of inputs x. Turning back to model fitting, we have some input data we would like to relate, via a model, to some output data. For an arbitrary dataset, what delineates inputs from outputs is determined by experimental hypotheses or practical considerations; in other words, the input and output datasets are generally prescribed by what the model intends to explain or what it aims to accomplish. We are not interested in every conceivable parameterization of our model, but rather in the specific parameterization that best relates our input data to our output observations. This is the basic idea of model fitting: find parameter values such that the model’s predictions best resemble the measured data. This requires that, out of all possible parameterizations of the model, we establish a rule or strategy for selecting a particular set of parameters that accomplishes what we want, namely, generating data that closely resembles the data collected. One strategy could be to randomly select parameterizations of the model. This approach, however, is inefficient and inaccurate, and it lacks the guiding principles that afford a model explainability and reproducibility. This is not to say that randomness as a governing principle within a model is inappropriate, rather that relying on a random process to select model parameterizations makes the connection between the data and the model ambiguous.
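To make this concrete, here is a minimal sketch in Python (the function name, the fake data, and the search ranges are all my own invention, not anything standard) of a parameterized linear model and the naive random-search strategy just described:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_model(x, slope, intercept):
    """Map inputs to outputs for one particular parameterization."""
    return slope * x + intercept

# Some inputs and the outputs we would like the model to explain.
x = np.linspace(0, 1, 50)
y_observed = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Naive strategy: draw parameterizations at random and keep whichever one
# produces predictions closest (in squared error) to the observations.
best_params, best_mismatch = None, np.inf
for _ in range(1000):
    slope, intercept = rng.uniform(-5, 5, size=2)
    mismatch = np.sum((y_observed - linear_model(x, slope, intercept)) ** 2)
    if mismatch < best_mismatch:
        best_params, best_mismatch = (slope, intercept), mismatch

print(best_params)   # usually somewhere near (2.0, 1.0), but only by brute luck
```

Even for a model this simple, the random search wastes most of its effort on parameterizations that are nowhere near the data, which is exactly why a more principled selection rule is needed.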

To address these issues we need a more sophisticated strategy for searching through parameter space. In general this presents as an optimization problem, in which some “objective function” is either minimized or maximized. An objective function maps relevant variables (e.g. data recordings and model predictions) onto a single value that represents a desired objective. In the context of model fitting, this objective function depends on characteristics of the model, such as the predictions a given parameterization produces or even the sum total of the parameter values themselves. Indeed, an objective function can be constrained as desired in order to shape which model parameterizations are deemed best.

Now we are equipped to move through an example. In my experience, I first ran across “model fitting” in a statistics course via linear regression, and I view this as a good starting point for introducing the topic. So let’s begin there. The type of regression most people are familiar with is least-squares linear regression. In least-squares regression, for some arbitrary input data x and outputs y, a line (i.e. a linear model) is fit to these measurements in a least-squares sense, which means we aim to find the line (y = mx + b) that minimizes the sum of the squared errors between the actual measured values and the values predicted by the line.

Here we have a great example of an objective function! We have a function that measures a model’s fit to the data through a comparison between the predictions the model makes and the actual output observations. Squaring the error, i.e. the difference between the actual and predicted y values, keeps every term non-negative so that errors of opposite sign cannot cancel one another. One could imagine achieving the same thing by simply taking the absolute value of the errors, but this would present problems during optimization later on [1].

F(\theta, Y, X) = \sum_{i}\left(y_{i} - \hat{y}_{i}\right)^{2} = \sum_{i}\left(y_{i} - \left(\beta_{1}x_{i} + \beta_{0}\right)\right)^{2}

For simplicity in notation, I have packaged the linear model parameters into a set denoted by \theta.

\theta = \left\{\beta_{0},\beta_{1}\right\}
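In code, an objective function of this form might look like the following minimal sketch (the function and variable names are my own, not a standard API):

```python
import numpy as np

def sum_squared_errors(theta, x, y):
    """F(theta, Y, X): the sum of squared differences between data and predictions."""
    b0, b1 = theta                     # intercept and slope, matching the text
    y_hat = b1 * x + b0                # linear model predictions
    return np.sum((y - y_hat) ** 2)

# Example: evaluate the objective for one candidate parameterization.
x = np.array([0.0, 0.5, 1.0])
y = np.array([1.1, 1.9, 3.2])
print(sum_squared_errors((1.0, 2.0), x, y))
```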

This is a great approach, and intuitively it makes a lot of sense. For a line to represent your data, you’d want that line to be as close as it can be to all the data points; minimizing the sum of the squared vertical distances (the residuals) from the data points to the line accomplishes exactly this.

As it turns out, least squares can be seen as a special case of a more general approach to model fitting: maximum likelihood estimation (MLE). In MLE model fitting, parameters (slope and intercept in the case of linear regression) are estimated such that the probability of observing the data at hand (called the “likelihood”) is maximized. In other words, the more “likely” it is for a given choice of parameter values to produce the data that we observed, the more we favor those parameter values.

This all sounds great, but we still do not have a complete picture of what “likely” means. To address this we take the stance that the model we seek to fit represents the true, ideal relationship between input data and output observations, and that the observed variability around this relationship is due to noise. Another way to think about this: if we take a slice of our output data at some x-value, the observed outputs can be considered draws from a probability distribution.

Figure 1: At left, some fake data with a least-squares linear model fit; example residuals are shown in black. At right, conditioning on an x-value lets us view the observed output data as a random variable.
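One way to make this generative view concrete is to simulate it. The sketch below (the slope, intercept, and noise level are invented purely for illustration) draws each output from a Gaussian whose mean sits on the line at the corresponding input, which is exactly the kind of fake data shown in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
slope, intercept, noise_sd = 2.0, 1.0, 0.3   # an assumed "true" relationship plus noise level

x = rng.uniform(0, 1, size=100)
# Each observed output is a draw from a Gaussian whose mean sits on the line at that x.
y = rng.normal(loc=slope * x + intercept, scale=noise_sd)

# Many hypothetical draws at a single x-value (here 0.2), as in the right panel of the figure.
y_at_x = rng.normal(loc=slope * 0.2 + intercept, scale=noise_sd, size=1_000)
```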

Here we start to see the route toward maximizing likelihood. What we would like to do is find the parameters of the probability distribution (e.g. the Gaussian distribution) such that the resulting distribution would generate the observed data with the highest probability, and we will need to do this conditioned on each x-value. The expectation value of this probability distribution (i.e. its average) represents the model prediction for that particular x-value. In our figure, this means the peak of the Gaussian curve at right maps onto the point on the red line predicting an output for an x-value of 0.2. This is great; we now have a function that, when conditioned on x and given model parameters, returns a measure of likelihood! Let’s now take some time to formalize all of this with some math.

First let’s define our likelihood function. Let’s call it L, and recall that we intend for it to be a function of our model parameters (again packaged into \theta) and our inputs X and outputs Y. This function tells us how likely, or how probable, the selected model parameters are given the observed data. Recalling how we sliced our data into conditional probability distributions, we can begin to see how the model parameters impact the likelihood. In linear regression, the choice of model parameters dictates the mean of the Gaussian probability distribution conditioned on a particular input x-value. The closer the corresponding output y-value is to this mean, the higher the probability that the data point could have been sampled from this distribution. So we are looking for model parameters that yield the probability distributions most likely to have generated the observed output y-value for each input x-value.

Since we have a large set of input-output pairs, we want our likelihood function to represent these probabilities for ALL of our data points, not just each pair individually. From probability theory we know that if we treat each point as an independent sample, then the likelihood for ALL data points is the product of the likelihoods for each point individually. For example, if I asked you the probability of flipping two heads in a row with a fair coin, the answer is 0.5 \times 0.5 = 0.25. This is the product rule in probability, and the same idea is used here, except that instead of a coin flip (which follows a Bernoulli distribution) we are sampling from a Gaussian distribution at each data point. We can write all of this as follows, where the likelihood of our model parameters given the data equals the product, over data points, of the probabilities that each observed output was generated by a Gaussian distribution parameterized by \theta and conditioned on the corresponding input x-value.

L(\theta|X,Y) = \prod_{i}P\left(y_{i}|x_{i}, \theta\right)

Now, because we aren’t just interested in calculating the likelihood for some arbitrary parameters \theta but are instead motivated to find the “best” parameters, we must optimize our likelihood function. In other words, we want to find the parameters \theta that maximize L. Theoretically we already have the function in a form where this optimization can take place, but computationally there are some idiosyncrasies in the above equation that do not translate well to accurate and efficient optimization. First off, the magnitudes of the individual likelihood terms are typically much smaller than 1, so for even a reasonably sized dataset, multiplying them together results in an exceedingly small number. Theoretically this is not a problem, but computationally, working with exceedingly small or large numbers is unstable. We can address this by instead maximizing the logarithm of the likelihood; because the logarithm is a monotonically increasing function, the parameters that maximize the log-likelihood also maximize the likelihood itself. The log transform converts very small positive numbers into moderate negative values that computers can represent accurately. Remembering our log rules, the logarithm of a product is the sum of the logarithms of its factors. This can be written as follows:

\log\left(L(\theta|X,Y)\right) = \log\left(\prod_{i}P\left(y_{i}|x_{i}, \theta\right)\right) = \sum_{i}\log\left(P\left(y_{i}|x_{i}, \theta\right)\right)
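To see the computational motivation concretely, here is a small sketch (with invented data and an assumed Gaussian noise model) comparing the raw product of per-point probabilities with the sum of their logarithms:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10_000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)  # fake observations
y_hat = 2.0 * x + 1.0                                    # predictions for one candidate model

# Per-point probabilities of each observation under Gaussians centered on the predictions.
per_point = norm.pdf(y, loc=y_hat, scale=0.3)

print(np.prod(per_point))            # underflows to 0.0 for a dataset this size
print(np.sum(np.log(per_point)))     # a perfectly representable negative number
print(np.sum(norm.logpdf(y, loc=y_hat, scale=0.3)))  # the same sum, computed more stably
```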

Let’s cast our attention to the probability term on the right-hand side of the log-likelihood equation. As discussed earlier, we are assuming that each output y-value is sampled from a Gaussian distribution conditioned on the corresponding input x-value. A generic Gaussian distribution is described by the following equation:

f(y) = \frac{1}{\sigma\sqrt{2 \pi}} \exp\left(- \frac{1}{2}\frac{\left(y - \mu\right)^{2}}{\sigma^{2}}\right)

where \mu is the distribution mean and \sigma is the standard deviation. This function returns the probability density of drawing a particular y-value from the distribution parameterized by \mu and \sigma. Therefore we should expect the function to be maximized when y and \mu are equal [2]. Before we are able to plug this into our likelihood equation, we need to think again about how the parameters of our linear model interact with the parameters of this Gaussian distribution. Recall that in linear regression we take the mean of the Gaussian distribution conditioned on an x-value to be the prediction of the linear model for that x-value. The prediction of a linear model is simply the output of the function for the x-value of interest, given the slope and intercept parameters:

\hat{y}_{i} = \beta_{1}x_{i} + \beta_{0}

So we can plug this in for the \mu term when we use the equation for the Gaussian distribution in our likelihood function.

P\left(y_{i}|x_{i}, \theta\right) = \frac{1}{\sigma\sqrt{2 \pi}} \exp\left(- \frac{1}{2}\frac{\left(y_{i} - \left(\beta_{1}x_{i} + \beta_{0}\right)\right)^{2}}{\sigma^{2}}\right)

Putting all of this together we have worked our way to our full log-likelihood equation!

\log\left(L(\theta|X,Y)\right) =  \sum_{i}\log\left(\frac{1}{\sigma\sqrt{2 \pi}} \exp\left(- \frac{1}{2}\frac{\left(y_{i} - \left(\beta_{1}x_{i} + \beta_{0}\right)\right)^{2}}{\sigma^{2}}\right)\right)

This is the function whose output we want to maximize with respect to the parameters \beta_1 and \beta_0. By convention, computational optimization routines typically minimize rather than maximize a function of interest, so to accommodate this we simply recast our objective once more as a NEGATIVE log-likelihood. Minimizing the negative log-likelihood yields the same result as maximizing the log-likelihood.
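As a sketch of how this might look in practice (the data, the fixed noise scale, and the starting guess are all my own choices, not part of any standard recipe), here is the negative log-likelihood minimized numerically with scipy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)  # fake data: slope 2, intercept 1

def negative_log_likelihood(params, x, y, sigma=0.3):
    """Sum of log Gaussian densities, sign-flipped so a minimizer can be used."""
    b0, b1 = params                          # intercept and slope, as in the text
    y_hat = b1 * x + b0                      # model predictions (the Gaussian means)
    return -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))

# Minimize the negative log-likelihood starting from an arbitrary guess.
result = minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(x, y))
b0_hat, b1_hat = result.x
print(f"intercept ~ {b0_hat:.3f}, slope ~ {b1_hat:.3f}")
```

Note that the noise scale \sigma is held fixed here rather than fitted, which mirrors the text: \theta contains only the \beta’s, and \sigma will drop out of the comparison in the next step.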

This brings us close to the end, but in closing I would like to show how this formulation reduces to the special case of least-squares linear regression that most people are familiar with. The approach is to investigate our likelihood function and identify components that do not change when a different parameter set is tested; in other words, we are looking for parts of the function that do not depend on our \beta’s. The leading factor \frac{1}{\sigma\sqrt{2 \pi}} is one such component: inside the logarithm it contributes only an additive constant, so we can drop it and simplify to:

\log\left(L(\theta|X,Y)\right) = \sum_{i}\log\left(\exp\left(- \frac{1}{2}\frac{\left(y_{i} - \left(\beta_{1}x_{i} + \beta_{0}\right)\right)^{2}}{\sigma^{2}}\right)\right) + \text{const}

But the log now serves simply to cancel the exponential, leaving us with the following:

\log\left(L(\theta|X,Y)\right) = -\frac{1}{2\sigma^{2}}\sum_{i}\left(y_{i} - \left(\beta_{1}x_{i} + \beta_{0}\right)\right)^{2} + \text{const}

The factor \frac{1}{2\sigma^{2}} is a positive constant that does not depend on the model parameters, so it too can be ignored when deciding which \beta’s are best. Maximizing this log-likelihood is therefore equivalent to minimizing the sum of squared errors, \sum_{i}\left(y_{i} - \left(\beta_{1}x_{i} + \beta_{0}\right)\right)^{2}, which is exactly the least-squares approach to linear regression.
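A quick numerical check of this equivalence (again with invented data and a fixed noise scale) is to evaluate both objectives at a few parameter settings: their changes agree exactly once the sum of squared errors is scaled by 1/(2\sigma^{2}), so the two are minimized by the same \beta’s.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 100)
sigma = 0.3
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=x.size)

def neg_log_likelihood(b0, b1):
    """Negative Gaussian log-likelihood with the noise scale held fixed."""
    return -np.sum(norm.logpdf(y, loc=b1 * x + b0, scale=sigma))

def sum_squared_errors(b0, b1):
    return np.sum((y - (b1 * x + b0)) ** 2)

# Changes in the negative log-likelihood between any two parameter settings equal the
# corresponding changes in the sum of squared errors divided by 2 * sigma**2: the two
# objectives differ only by a constant offset and a positive scale, so the same betas
# minimize both.
reference = (1.0, 2.0)
for candidate in [(0.0, 0.0), (0.5, 1.5), (-1.0, 3.0)]:
    delta_nll = neg_log_likelihood(*candidate) - neg_log_likelihood(*reference)
    delta_sse = sum_squared_errors(*candidate) - sum_squared_errors(*reference)
    print(delta_nll, delta_sse / (2 * sigma**2))  # the two columns match
```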

Maximum likelihood is important to understand because it is more general than the commonly encountered least-squares approach, and is thus applicable to more settings where, for example, different assumptions are made about the distribution of the measurement noise in the output data. Secondly, I find that posing the problem in terms of the likelihood function is an interesting twist because doing so takes the stance that there is a relationship between input and output data that accounts for both the explainable and the unexplainable (measurement noise) aspects of the input-output relationship. In our linear regression example, the output data is taken to be a linear function of the input data, plus some noise which remains unexplained (beyond assuming a particular noise probability distribution). To me this has strong ties to philosophy of science and perhaps even epistemology. For example, if we consider a model, like our linear model, to represent knowledge about relationships between measured entities, then the MLE model-fitting approach proposes a formal strategy for how knowledge manifests, with both explainable and unexplainable facets. This line of thinking spawns new questions related to those posed in the introductory paragraph, namely: how are model parameterizations and sets of input and output measurements selected so as to fit meaningful models and thus establish knowledge? I am admittedly not supremely confident in a 1:1 mapping of knowledge onto statistical models, but at the very least I see model fitting as a mathematical formalism capable of discussing the topic of knowledge. I intend in future posts to continue exploring this kind of relationship between the mathematical formalism of model fitting and more philosophical questions of knowledge and the scientific method.

[1] This is because performing optimization on an objective function like we have here requires taking the derivative of the objective function. The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to optimization.

[2] Through inspection we can see this is the case: setting y equal to \mu zeros out the term in the exponential, and the exponential of zero is 1. This leaves us only with the leading term, which is controlled by \sigma. Because the term inside the exponential is never positive (due to the leading negative sign), any nonzero value yields a number less than 1, approaching zero as the term grows in magnitude. The result is a multiplicative reduction of the leading term to some fraction of its maximum, which is achieved only when the term in the exponential is zero.
