Regularized Cost Functions

Regularization is a technique that comes in several different versions and under a multitude of names (e.g. ridge regression, LASSO, L1/L2 penalties…).  In the context of machine learning cost functions, regularization serves to alleviate overfitting and, in my opinion, is best explained from a Bayesian perspective.

The crux of Bayesian analysis is to think of the weights as random variables with a probability distribution of their own (the prior distribution).  Combining this prior distribution on the weights with the likelihood function results in another distribution, called the posterior distribution, which can then be used to estimate the weights.  A reasonable way of choosing our weights is to select the set of weights that is most probable according to the posterior distribution (maximum a posteriori, or MAP, estimation).  This is related to the method of maximum likelihood, but instead of optimizing over the likelihood function we optimize over the posterior distribution.  In fact, one could consider maximum likelihood a special case of this technique (one with a uniform prior distribution).

According to Bayesian statistics we have:

f(\mathbf{w}|\mathbf{y},\mathbf{x})\:\propto\: L(\mathbf{w})g(\mathbf{w})

Where f(\mathbf{w}|\mathbf{y},\mathbf{x}) is the posterior distribution for the weights, L(\mathbf{w}) is the likelihood function as we’ve talked about before, and g(\mathbf{w}) is the prior distribution for the weights.  To find the most probable set of weights we need to choose the set that maximizes the term L(\mathbf{w})g(\mathbf{w}).  That is:

\hat{\mathbf{w}}_{MAP}=\underset{\mathbf{w}}{\arg \max}\: L(\mathbf{w})g(\mathbf{w})

Since we already have the likelihood functions for normally distributed and categorical response variables (discussed here and here), all we need is a prior distribution.  Without any real prior knowledge of the data the choice seems somewhat arbitrary, but what we really want is a distribution that helps with the overfitting problem and is also relatively easy to work with.  A popular choice is to assume the weights are independent, zero-mean, and normally distributed, all with equal variances:

g(\mathbf{w}) = \prod_{p=1}^P\frac{1}{\sqrt{2\pi\sigma_w^2}}e^{-\frac{w_p^2}{2\sigma_w^2}}=C_1e^{-\frac{\sum_{p=1}^P w_p^2}{2\sigma_w^2}}

The reason I write it with the constant C_1 is that the normalizing constant does not depend on the weights and is not needed to find the MAP estimate.  Also notice that this is a prior on w_1, w_2, ..., w_P.  The bias term w_0 is usually not included in the regularization; equivalently, you can think of it as having a flat (uniform) prior distribution.

Using this prior distribution we can derive our cost functions by writing out the posterior and rewriting it as a cost function.  In the case of normally distributed response variables we have:

f(\mathbf{w}|\mathbf{y},\mathbf{x}) = Ce^{-\frac{\sum_{i=1}^m (y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))^2}{2\sigma_y^2}}e^{-\frac{\sum_{p=1}^P w_p^2}{2\sigma_w^2}} = Ce^{-(\frac{1}{2\sigma_y^2}\sum_{i=1}^m (y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))^2+\frac{1}{2\sigma_w^2}\sum_{p=1}^P w_p^2)}

In order to maximize this function we need to minimize the term:

\frac{1}{2\sigma_y^2}\sum_{i=1}^m (y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))^2+\frac{1}{2\sigma_w^2}\sum_{p=1}^P w_p^2

Multiplying the equation by the constant \sigma_y^2 yields:

Cost=\frac{1}{2}\sum_{i=1}^m (y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))^2+\frac{\sigma_y^2}{2\sigma_w^2}\sum_{p=1}^P w_p^2

This is essentially the least squares cost function with an additional term added to it.  The additional term is called an L2 penalty, since larger values of \mathbf{w} increase it and add to the cost.  It’s also interesting to note how the constant \frac{\sigma_y^2}{\sigma_w^2} affects things.  If \sigma_y^2 is large in comparison to \sigma_w^2 then \frac{\sigma_y^2}{\sigma_w^2} will be a large number, and the best way to minimize the cost function is to drive all of the (non-bias) weights toward zero.  If, on the other hand, \sigma_y^2 is small in comparison to \sigma_w^2, then \frac{\sigma_y^2}{\sigma_w^2} will be close to zero and we essentially have the ordinary least squares cost function.  In practice we don’t know the correct value of \frac{\sigma_y^2}{\sigma_w^2}, so you treat it as a tuning parameter and try several values (e.g. by cross-validation) to see what works best.

In general, the penalty term results in smaller values for the weights, which results in less complicated hypothesis functions.

We can also calculate derivatives for use in gradient descent:

\frac{\partial Cost}{\partial w_k}=-\sum_{i=1}^m(y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))\frac{\partial h(\mathbf{x}^{(i)};\mathbf{w})}{\partial w_k} + \alpha w_k

Where \alpha = \frac{\sigma_y^2}{\sigma_w^2}.
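
To make this concrete, here is a minimal R sketch of the regularized cost and its gradient for a linear hypothesis h(\mathbf{x};\mathbf{w}) = \mathbf{X}\mathbf{w}.  It assumes X is an m x (p+1) design matrix whose first column is all ones, so that w[1] plays the role of the unpenalized bias; the names are just for illustration.

# L2 (ridge) regularized least squares cost, matching the cost function above.
# X: m x (p+1) design matrix with a leading column of ones; w[1] is the bias.
ridge_cost <- function(w, X, y, alpha) {
  r <- y - as.vector(X %*% w)                  # residuals y - h(x; w)
  0.5 * sum(r^2) + (alpha / 2) * sum(w[-1]^2)  # bias is not penalized
}

# Gradient: -sum (y - h) x_k + alpha * w_k, with zero penalty on the bias
ridge_gradient <- function(w, X, y, alpha) {
  r <- y - as.vector(X %*% w)
  as.vector(crossprod(X, -r)) + alpha * c(0, w[-1])
}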

Similarly with the cost function for categorical response variables we have:

f(\mathbf{w}|\mathbf{y},\mathbf{x}) = \prod_{i=1}^m(h_1(\mathbf{x}^{(i)}; \mathbf{w})^{y_1^{(i)}}h_2(\mathbf{x}^{(i)}; \mathbf{w})^{y_2^{(i)}}...h_k(\mathbf{x}^{(i)}; \mathbf{w})^{y_k^{(i)}})Ce^{-\frac{\sum_{p=1}^P w_p^2}{2\sigma_w^2}}

Taking the log of this function we end up with:

\sum_{i=1}^m(y_1^{(i)}log(h_1(\mathbf{x}^{(i)}; \mathbf{w}))+...+y_k^{(i)}log(h_k(\mathbf{x}^{(i)}; \mathbf{w})))+log(C)-\frac{\sum_{p=1}^P w_p^2}{2\sigma_w^2}

We define the cost function as the negative of this function without the constant log(C) (since it doesn’t affect the optimization):

Cost = -\sum_{i=1}^m(y_1^{(i)}log(h_1(\mathbf{x}^{(i)}; \mathbf{w}))+...+y_k^{(i)}log(h_k(\mathbf{x}^{(i)}; \mathbf{w})))+\frac{\alpha}{2}\sum_{p=1}^P w_p^2

With \alpha = \frac{1}{\sigma_w^2}

The derivatives are therefore:

\frac{\partial Cost}{\partial w_t}=-\sum_{y_1^{(i)}=1}\frac{1}{h_1(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh_1(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}-...-\sum_{y_k^{(i)}=1}\frac{1}{h_k(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh_k(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}+\alpha w_t

As you can see, the end result for our gradient descent algorithm is simply an extra term added to the derivatives.  You can find my implementations of regularized linear regression and multinomial logistic regression here and here respectively.
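
To drive home how little changes, here is a small, hedged R sketch that bolts an L2 penalty onto any existing cost/gradient pair, whatever the underlying hypothesis.  It assumes the bias weight sits in position 1 of the weight vector and is left unpenalized; cost_fn and grad_fn are hypothetical stand-ins for an unregularized cost and its gradient.

# Wrap an unregularized cost and gradient with an L2 penalty of strength alpha.
# The first element of w is treated as the bias and is not penalized.
add_l2_penalty <- function(cost_fn, grad_fn, alpha) {
  penalized <- function(w) c(0, w[-1])   # zero out the bias before penalizing
  list(
    cost = function(w, ...) cost_fn(w, ...) + (alpha / 2) * sum(penalized(w)^2),
    grad = function(w, ...) grad_fn(w, ...) + alpha * penalized(w)
  )
}

Any gradient descent loop can then be run on the wrapped pair unchanged.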

Another popular version of regularization is the LASSO or L1 penalty.  This can be derived similarly using a Laplace distribution as a prior instead of a Normal distribution.

Multinomial Logistic Regression

Multinomial logistic regression (aka softmax regression) is a generalization of binomial logistic regression that allows the response variable to have more than two classes.  There also seems to be less information out there about the multinomial case than about the binomial case, so I’ve decided to write this post.

As described here, the cost function for a multinomial distributed response variable is:

Cost=-\sum_{y_1^{(i)}=1}log(h_1(\mathbf{x}^{(i)};\mathbf{w}))-\sum_{y_2^{(i)}=1}log(h_2(\mathbf{x}^{(i)};\mathbf{w}))-...-\sum_{y_k^{(i)}=1}log(h_k(\mathbf{x}^{(i)};\mathbf{w}))

In logistic regression with K classes, the hypotheses are:

h_1(\mathbf{x}^{(i)};\mathbf{w}) = \frac{e^{\mathbf{w}_1\mathbf{x}^{(i)}}}{1+\sum_{j=1}^{K-1} e^{\mathbf{w}_j\mathbf{x}^{(i)}}}

h_2(\mathbf{x}^{(i)};\mathbf{w}) = \frac{e^{\mathbf{w}_2\mathbf{x}^{(i)}}}{1+\sum_{j=1}^{K-1} e^{\mathbf{w}_j\mathbf{x}^{(i)}}}

h_K(\mathbf{x}^{(i)};\mathbf{w}) = \frac{1}{1+\sum_{j=1}^{K-1} e^{\mathbf{w}_j\mathbf{x}^{(i)}}}

One can think of this as running a set of (K-1) binary logistic regressions using the Kth category as the reference point.  You can read more about it here.  For each of the K-1 categories we’ll have a set of weights:

\mathbf{w}_1 = \begin{bmatrix} w_{01}&w_{11}&w_{21}&...&w_{p1}\end{bmatrix}

\mathbf{w}_2 = \begin{bmatrix} w_{02}&w_{12}&w_{22}&...&w_{p2}\end{bmatrix}

\mathbf{w}_{K-1} = \begin{bmatrix} w_{0(K-1)}&w_{1(K-1)}&w_{2(K-1)}&...&w_{p(K-1)}\end{bmatrix}

Note that each set contains (p+1) weights (including the bias) and there are (K-1) sets, leaving (p+1)(K-1) weights that need to be estimated.
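
To make the bookkeeping concrete, here is a minimal R sketch of these hypotheses.  It assumes X is the m x (p+1) design matrix with a leading column of ones and W is a (K-1) x (p+1) matrix whose rows are \mathbf{w}_1, ..., \mathbf{w}_{K-1}; the names are illustrative, not taken from my implementation.

# Multinomial logistic hypotheses with class K as the reference category.
# X: m x (p+1) design matrix (first column all ones)
# W: (K-1) x (p+1) weight matrix, one row per non-reference class
multinomial_hypothesis <- function(X, W) {
  scores <- exp(X %*% t(W))         # m x (K-1) matrix of e^{w_j x^(i)}
  denom  <- 1 + rowSums(scores)     # 1 + sum_j e^{w_j x^(i)}
  cbind(scores / denom, 1 / denom)  # columns h_1, ..., h_{K-1}, h_K
}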

In order to apply an algorithm like gradient descent we need to calculate the derivatives of the cost function with respect to the weights, e.g.:

\frac{\partial Cost}{\partial w_{11}} = -\sum_{y_1^{(i)}=1}1/h_1\frac{\partial h_1}{\partial w_{11}} - \sum_{y_2^{(i)}=1}1/h_2\frac{\partial h_2}{\partial w_{11}} - ... - \sum_{y_K^{(i)}=1}1/h_K\frac{\partial h_K}{\partial w_{11}}

We can see that:

\frac{\partial h_1}{\partial w_{11}} = \frac{e^{\mathbf{w}_1\mathbf{x}^{(i)}}x^{(i)}_1(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})-e^{\mathbf{w}_1\mathbf{x}^{(i)}}e^{\mathbf{w}_1\mathbf{x}^{(i)}}x^{(i)}_1}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})^2}=x^{(i)}_1\frac{e^{\mathbf{w}_1\mathbf{x}^{(i)}}}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})}\frac{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})-e^{\mathbf{w}_1\mathbf{x}^{(i)}}}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})}\\=x^{(i)}_1h_1(1-h_1)

\frac{\partial h_2}{\partial w_{11}} = \frac{-e^{\mathbf{w}_2\mathbf{x}^{(i)}}e^{\mathbf{w}_1\mathbf{x}^{(i)}}x^{(i)}_1}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})^2}=-x^{(i)}_1\frac{e^{\mathbf{w}_1\mathbf{x}^{(i)}}}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})}\frac{e^{\mathbf{w}_2\mathbf{x}^{(i)}}}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})}=-x^{(i)}_1h_1h_2

\frac{\partial h_K}{\partial w_{11}} = \frac{-e^{\mathbf{w}_1\mathbf{x}^{(i)}}x^{(i)}_1}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})^2}=-x^{(i)}_1\frac{e^{\mathbf{w}_1\mathbf{x}^{(i)}}}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})}\frac{1}{(1+\sum_{k=1}^{K-1} e^{\mathbf{w}_k\mathbf{x}^{(i)}})}=-x^{(i)}_1h_1h_K

Note that I’ve shortened the notation of the hypothesis functions for brevity, i.e. h_1 = h_1(\mathbf{x}^{(i)}; \mathbf{w}).

Which leads to:

\frac{\partial Cost}{\partial w_{11}} = -\sum_{y_1^{(i)}=1}x^{(i)}_1(1-h_1) + \sum_{y_2^{(i)}=1}x^{(i)}_1h_1+ ... + \sum_{y_K^{(i)}=1}x^{(i)}_1h_1

A similar pattern follows for the other derivatives.  In general we have:

\frac{\partial h_j}{\partial w_{tj}}=x^{(i)}_th_j(1-h_j),\:\:\:\: j=1,2,...,K-1,\:\: t=1,2,...,p

\frac{\partial h_s}{\partial w_{tj}}=-x^{(i)}_th_jh_s,\:\:\:\: \:\:\:s=1,2,...,K,\:\:j=1,2,...,K-1,\:\: s\neq j,\:\: t=1,2,...,p

Here are more examples:

\frac{\partial Cost}{\partial w_{21}} = -\sum_{y_1^{(i)}=1}x^{(i)}_2(1-h_1) + \sum_{y_2^{(i)}=1}x^{(i)}_2h_1+ ... + \sum_{y_K^{(i)}=1}x^{(i)}_2h_1

\frac{\partial Cost}{\partial w_{32}} = \sum_{y_1^{(i)}=1}x^{(i)}_3h_2 - \sum_{y_2^{(i)}=1}x^{(i)}_3(1-h_2)+ ... + \sum_{y_K^{(i)}=1}x^{(i)}_3h_2

\frac{\partial Cost}{\partial w_{p(K-1)}} = \sum_{y_1^{(i)}=1}x^{(i)}_ph_{K-1} + \sum_{y_2^{(i)}=1}x^{(i)}_ph_{K-1}+ ... - \sum_{y_{K-1}^{(i)}=1}x^{(i)}_p(1-h_{K-1}) + \sum_{y_K^{(i)}=1}x^{(i)}_ph_{K-1}
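
All of these derivatives share the pattern \frac{\partial Cost}{\partial w_{tj}} = \sum_i x_t^{(i)}(h_j - y_j^{(i)}), so they can be computed in one shot.  The sketch below is one hedged way to do that in R, reusing the hypothetical multinomial_hypothesis, X, and W from above together with an m x K indicator matrix Y of the responses.

# Gradient of the multinomial cost for all (K-1) x (p+1) weights at once.
# Entry (j, t) is sum_i x_t^(i) * (h_j - y_j^(i)), matching the derivatives above.
multinomial_gradient <- function(X, W, Y) {
  H <- multinomial_hypothesis(X, W)               # m x K fitted probabilities
  K <- ncol(Y)
  crossprod(H[, 1:(K - 1)] - Y[, 1:(K - 1)], X)   # (K-1) x (p+1) gradient matrix
}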

Because we are really finding the maximum likelihood estimates of the weights, we can estimate the standard errors using the observed information matrix.

Since the cost function was defined as the negative of the log-likelihood (see here) we have:

l(\mathbf{w})=\sum_{y_1^{(i)}=1}log(h_1(\mathbf{x}^{(i)}; \mathbf{w})) + \sum_{y_2^{(i)}=1}log(h_2(\mathbf{x}^{(i)}; \mathbf{w}))+ ... + \sum_{y_K^{(i)}=1}log(h_K(\mathbf{x}^{(i)}; \mathbf{w}))

In order to get the observed information we take the negative of the Hessian of the log-likelihood, which involves finding the second derivatives.  Using the results above, one can show that the entries of the observed information matrix are:

-\frac{\partial^2 l(\mathbf{w})}{\partial w_{jk}^2}=\sum_{i=1}^{m}x_{j}^{(i)2}h_k(1-h_k), \:\:\:j=1,2,...,p,\:\:k=1,2,...,K-1

-\frac{\partial^2 l(\mathbf{w})}{\partial w_{jk}\partial w_{lk}}=\sum_{i=1}^{m}x_{j}^{(i)}x_{l}^{(i)}h_k(1-h_k), \:\:\:j,l=1,2,...,p,\:\:k=1,2,...,K-1

-\frac{\partial^2 l(\mathbf{w})}{\partial w_{jk}\partial w_{ls}}=-\sum_{i=1}^{m}x_{j}^{(i)}x_{l}^{(i)}h_kh_s, \:\:\:j,l=1,2,...,p,\:\:k,s=1,2,...,K-1,\:\:k\neq s

These formulas will be evaluated at the weights that were found by our algorithm (hopefully close to the MLE).

Once we find the observed information we use its inverse as the estimate of the variance matrix of the weights.
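
To show how the pieces fit together, here is a hedged R sketch of assembling the observed information and pulling out standard errors; it is only an illustration, not the implementation linked below.  It assumes X is the m x (p+1) design matrix (including the column of ones, so the bias weights get standard errors too), H is the m x K matrix of fitted probabilities, and the weights are stacked class by class.

# Observed information for multinomial logistic regression (reference class K).
# Block (k, s) is sum_i h_k (delta_ks - h_s) x_i x_i^T, as in the formulas above.
observed_information <- function(X, H) {
  K <- ncol(H)
  q <- ncol(X)                             # weights per class, including the bias
  info <- matrix(0, q * (K - 1), q * (K - 1))
  for (k in 1:(K - 1)) {
    for (s in 1:(K - 1)) {
      wts <- H[, k] * ((k == s) - H[, s])  # per-observation weight h_k(delta - h_s)
      rows <- ((k - 1) * q + 1):(k * q)
      cols <- ((s - 1) * q + 1):(s * q)
      info[rows, cols] <- crossprod(X, X * wts)
    }
  }
  info
}

# Standard errors: square roots of the diagonal of the inverse
# se <- sqrt(diag(solve(observed_information(X, H))))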

You can find my implementation of gradient descent and getting the standard errors of the weights here.

As in my earlier implementation of linear regression, I first simulate some data with three categories in the response, and perform feature scaling on the predictors before doing gradient descent.  While there are already implementations of multinomial logistic regression in other R packages, it is still worthwhile to know what is going on behind the scenes.

Maximum Likelihood Estimates

In previous posts about cost functions (ordinary least squares, generalized least squares, and categorical) we found that minimizing the cost functions equated to maximizing the likelihood function for the weights.  Such an approach is aptly named maximum likelihood estimation.  Why would this be a good estimate of the weights?  It turns out that maximum likelihood estimators (MLE) have some nice limiting properties under certain regularity conditions, which are met for many practical distributions.  They are as follows:

1.  Consistency.  ie.  \widehat{\mathbf{w}}_{mle}\xrightarrow{\:\:\:p\:\:\:} \mathbf{w}_0.  Where \widehat{\mathbf{w}}_{mle} is the MLE and \mathbf{w}_0 is the true value of the parameter.  This is basically saying that as the number of observations gets really large, you won’t be able to distinguish between the MLE and the true value.

2.  Asymptotically normal and efficient.  ie.

\sqrt{m}(\widehat{\mathbf{w}}_{mle}-\mathbf{w}_0)\xrightarrow{\:\:\:d\:\:\:} N(\mathbf{0}, I(\mathbf{w})^{-1}).

Where I(\mathbf{w}) is the Fisher information with respect to the parameters \mathbf{w}.

This property states that as the number of observations (m) gets really large the MLE follows a multivariate normal distribution.  The fact that the asymptotic variance is the inverse of the Fisher information matrix means that it achieves the lowest variance possible for an unbiased estimator, and is hence termed efficient (see the Cramér-Rao bound).  This is a nice property if you want to calculate standard errors or confidence regions for the weights, since:

\widehat{\mathbf{w}}_{mle}\:\overset{approx}{\sim}\: N(\mathbf{w}_0, (mI(\mathbf{w}))^{-1}).

If the sample is large enough, the MLE approximately follows a multivariate normal distribution, and you can use the observed information of the data (which approximates mI(\mathbf{w})) to obtain an estimated confidence region for the parameters using this characteristic of the multivariate normal.  I believe most software programs take the square roots of the diagonal of the inverse of the observed information matrix and simply list them as the standard errors for each parameter.  While this is simple, it could be misleading if the predictors are correlated.

Hopefully this post justified the use of maximum likelihood methods that are inherent in the cost functions.  For more information about MLE’s you could take a look at the Wikipedia page.  Since likelihood methods are widely used, there is a lot of information you could find about them on the Internet.

Cost function for categorical response variables

The least squares cost function is usually appropriate for response values that are continuous on the real number line, since the errors are often indeed close to normally distributed.  For categorical responses this is not the case, since the responses take values in a discrete set of categories and the quantities we model are probabilities between zero and one.

In the case of two categories the response value y can take on the value zero or one, with probabilities conditioned on the predictors x.  In other words:

Pr(Y^{(i)}=0)=1-p_i\\Pr(Y^{(i)}=1)=p_i

Where p_i is in the form of our hypothesis h(\mathbf{x}^{(i)}; \mathbf{w}).  Since it represents a probability it would range from zero to one.

One can see that Y follows a Bernoulli distribution, which we can write down as:

Pr(Y^{(i)}=k)=h(\mathbf{x}^{(i)}; \mathbf{w})^{k}(1-h(\mathbf{x}^{(i)}; \mathbf{w}))^{1-k}, \:\:\:\:\:\: k = 0,1

Assuming that the observations are independent we can write down the likelihood function as the product of these Bernoulli’s:

L(\mathbf{w})=\prod_{i=1}^mh(\mathbf{x}^{(i)}; \mathbf{w})^{y^{(i)}}(1-h(\mathbf{x}^{(i)}; \mathbf{w}))^{1-y^{(i)}}

Taking the log of this equation results in:

log(L(\mathbf{w}))=\sum_{i=1}^m(y^{(i)}log(h(\mathbf{x}^{(i)}; \mathbf{w}))+(1-y^{(i)})log(1-h(\mathbf{x}^{(i)}; \mathbf{w})))

In order to maximize the log-likelihood function we could minimize the negative of it, which would lead to the cost function:

Cost=-\sum_{i=1}^m(y^{(i)}log(h(\mathbf{x}^{(i)}; \mathbf{w}))+(1-y^{(i)})log(1-h(\mathbf{x}^{(i)}; \mathbf{w})))

Another way of writing this is

Cost=-\sum_{y^{(i)}=0}log(1-h(\mathbf{x}^{(i)};\mathbf{w}))-\sum_{y^{(i)}=1}log(h(\mathbf{x}^{(i)};\mathbf{w}))

We can use an algorithm like gradient descent to minimize this cost function by calculating the derivatives:

\frac{\partial Cost}{\partial w_t}=\sum_{y^{(i)}=0}\frac{1}{1-h(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}-\sum_{y^{(i)}=1}\frac{1}{h(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}

What the logistic regression model does is specify a particular form for this hypothesis and minimize the cost function with respect to the weights.
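
As a concrete (and hedged) illustration, here is a minimal R sketch of this cost and its derivatives when the hypothesis is the usual logistic sigmoid of a linear combination of the predictors; X, y, and w are placeholder names for the design matrix, the 0/1 response vector, and the weights.

# Logistic hypothesis: h(x; w) = 1 / (1 + exp(-x w))
sigmoid <- function(z) 1 / (1 + exp(-z))

# Negative log-likelihood cost for a 0/1 response
logistic_cost <- function(w, X, y) {
  h <- as.vector(sigmoid(X %*% w))
  -sum(y * log(h) + (1 - y) * log(1 - h))
}

# For the sigmoid hypothesis the derivatives collapse to sum_i (h_i - y_i) x_i
logistic_gradient <- function(w, X, y) {
  h <- as.vector(sigmoid(X %*% w))
  as.vector(crossprod(X, h - y))
}

Both functions can be handed to optim() or to a hand-rolled gradient descent loop.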

Taking the concept a step further, consider the case where there are more than two categories in the response.  That is, each observation y can be of category 1, 2, …, k.  One can represent each response value as a vector of size k, with each element of the vector being an indicator for the category the observation belongs to.

Eg.  If there are three categories and the response for observation i is in category 1 then the vector would be

\mathbf{y}^{(i)}=\begin{bmatrix} 1&0&0\end{bmatrix}

in category 2 it would be:

\mathbf{y}^{(i)}=\begin{bmatrix} 0&1&0\end{bmatrix}

category 3:

\mathbf{y}^{(i)}=\begin{bmatrix} 0&0&1\end{bmatrix}

For every observation there is one element in the vector that has a value of one while the other elements are zero.

The response is now a vector y^{(i)}=[y_1^{(i)}\:y_2^{(i)}\:...\:y_k^{(i)}].
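
As a quick illustration, here is one way to build this indicator representation in R from a vector of category labels (a minimal sketch; the names are arbitrary):

# Turn a vector of category labels into an m x k matrix of 0/1 indicators
labels <- factor(c("cat1", "cat2", "cat1", "cat3"))
Y <- model.matrix(~ labels - 1)   # one indicator column per category
colnames(Y) <- levels(labels)
Y                                 # each row has a single 1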

As in the two-category case, we can write down a probability distribution:

Pr(y_1^{(i)}=1)=h_1(\mathbf{x}^{(i)}; \mathbf{w}),\:\:...,\:\:Pr(y_k^{(i)}=1)=h_k(\mathbf{x}^{(i)}; \mathbf{w})

With our hypothesis vector \mathbf{h}(\mathbf{x}^{(i)};\mathbf{w}) = [h_1(\mathbf{x}^{(i)};\mathbf{w})\:...\:h_k(\mathbf{x}^{(i)};\mathbf{w})]

In this case, Y follows a multinomial distribution.  Thus the likelihood function would be:

L(\mathbf{w})=\prod_{i=1}^mh_1(\mathbf{x}^{(i)}; \mathbf{w})^{y_1^{(i)}}h_2(\mathbf{x}^{(i)}; \mathbf{w})^{y_2^{(i)}}...h_k(\mathbf{x}^{(i)}; \mathbf{w})^{y_k^{(i)}}

And the log-likelihood:

log(L(\mathbf{w}))=\sum_{i=1}^m(y_1^{(i)}log(h_1(\mathbf{x}^{(i)}; \mathbf{w}))+y_2^{(i)}log(h_2(\mathbf{x}^{(i)}; \mathbf{w}))+...+y_k^{(i)}log(h_k(\mathbf{x}^{(i)}; \mathbf{w})))

Leading to the cost function:

Cost=-\sum_{i=1}^m(y_1^{(i)}log(h_1(\mathbf{x}^{(i)}; \mathbf{w}))+y_2^{(i)}log(h_2(\mathbf{x}^{(i)}; \mathbf{w}))+...+y_k^{(i)}log(h_k(\mathbf{x}^{(i)}; \mathbf{w})))

Since only one of the elements in the \mathbf{y}^{(i)} vector can be one (while the others are zero), we can write the cost function in the following way:

Cost=-\sum_{y_1^{(i)}=1}log(h_1(\mathbf{x}^{(i)};\mathbf{w}))-\sum_{y_2^{(i)}=1}log(h_2(\mathbf{x}^{(i)};\mathbf{w}))-...-\sum_{y_k^{(i)}=1}log(h_k(\mathbf{x}^{(i)};\mathbf{w}))

With the derivatives:

\frac{\partial Cost}{\partial w_t}=-\sum_{y_1^{(i)}=1}\frac{1}{h_1(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh_1(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}-\sum_{y_2^{(i)}=1}\frac{1}{h_2(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh_2(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}-...-\sum_{y_k^{(i)}=1}\frac{1}{h_k(\mathbf{x}^{(i)};\mathbf{w})}\frac{dh_k(\mathbf{x}^{(i)};\mathbf{w})}{dw_t}

Again, one could use an algorithm like gradient descent to minimize the cost function as long as the derivatives of the hypothesis are able to be calculated.  One of the more common hypotheses of this form is that of multinomial logistic regression (AKA softmax regression), which we will look at in another post.

Generalized least squares cost function

In the last post we talked about the least squares cost function and how minimizing it is equivalent to maximizing the likelihood function of a Normal distribution under independence and equal-variance assumptions.  The full name for that version is “ordinary least squares”.  In this post we will explore what happens when the independence and equal-variance assumptions are relaxed, which leads to what is called generalized least squares.

Before going into the details, it is worth noting that ordinary least squares is fairly robust and will give reasonable results as long as the violations of the assumptions are not too severe.  In the case of independence but unequal variances (heteroscedasticity) one might try a transformation (such as log(y)) and then fit with ordinary least squares to see if that helps.

When deriving the least squares cost function we were able to write down the likelihood function as a product of Normal distributions due to the independence assumption.  If we do not make this assumption we model the errors as a multivariate normal distribution.  That is:

L(\mathbf{\mu})=\frac{1}{\sqrt{(2\pi)^m\left |\mathbf{\Sigma}\right |}}e^{-\frac{1}{2}(\mathbf{y}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{y}-\mathbf{\mu})}

Where,

\mathbf{y}=\begin{bmatrix}y^{(1)}\\y^{(2)}\\...\\y^{(m)}\end{bmatrix},\:\:\mathbf{\mu}=\begin{bmatrix}\mu_1\\\mu_2\\...\\\mu_m\end{bmatrix},\:\:\mathbf{\Sigma}=\begin{bmatrix}\sigma_1^2 & \sigma_{12} & ... & \sigma_{1m}\\\sigma_{21} & \sigma_2^2 & ... & \sigma_{2m}\\... & ... & ... & ...\\\sigma_{m1} & \sigma_{m2} & ... & \sigma_m^2\end{bmatrix}

\sigma_i^2 is the variance of the ith error, \sigma_{ij} is the covariance of the ith and jth errors.

In order to maximize this function we’ll need to minimize:

(\mathbf{y}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{y}-\mathbf{\mu})

Assuming \mathbf{\mu} takes the form of our hypothesis we have:

Cost=(\mathbf{y}-\mathbf{h}(\mathbf{x};\mathbf{w}))^T\Sigma^{-1}(\mathbf{y}-\mathbf{h}(\mathbf{x};\mathbf{w}))

where,

\mathbf{h}(\mathbf{x};\mathbf{w}) =\begin{bmatrix}h(\mathbf{x}^{(1)}; \mathbf{w})\\h(\mathbf{x}^{(2)}; \mathbf{w})\\...\\h(\mathbf{x}^{(m)}; \mathbf{w})\end{bmatrix}

Defining \mathbf{\Sigma}^{-1}=\begin{bmatrix}c_{11} & c_{12} & ... & c_{1m}\\c_{21} & c_{22} & ... & c_{2m}\\... & ... & ... & ...\\c_{m1} & c_{m2} & ... & c_{mm}\end{bmatrix}

And carrying out the matrix calculations:

\begin{bmatrix} y^{(1)}-h(\mathbf{x}^{(1)}; \mathbf{w}) & ... & y^{(m)}-h(\mathbf{x}^{(m)};\mathbf{w})\end{bmatrix}\begin{bmatrix}c_{11} & c_{12} & ... & c_{1m}\\c_{21} & c_{22} & ... & c_{2m}\\... & ... & ... & ...\\c_{m1} & c_{m2} & ... & c_{mm}\end{bmatrix}\begin{bmatrix}y^{(1)}-h(\mathbf{x}^{(1)}; \mathbf{w}) \\ ... \\ y^{(m)}-h(\mathbf{x}^{(m)}; \mathbf{w})\end{bmatrix}

Leads to the cost function:

Cost=\sum_{i=1}^m(y^{(i)}-h(\mathbf{x}^{(i)}; \mathbf{w}))^2c_{ii} + \sum_{j\neq k} (y^{(j)}-h(\mathbf{x}^{(j)}; \mathbf{w}))(y^{(k)}-h(\mathbf{x}^{(k)}; \mathbf{w}))c_{jk}

As you can see, if the observations are independent then c_{jk} for j \neq k equals zero and the cost function has none of the cross terms.  If the observations are independent and all have equal variances, then we recover the ordinary least squares cost function that we derived earlier (up to a constant factor).
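
Here is a minimal R sketch of this cost written directly as the quadratic form above; y, h, and Sigma are placeholders for the response vector, the vector of fitted values, and whatever covariance matrix you have settled on.

# Generalized least squares cost: (y - h)' Sigma^{-1} (y - h)
gls_cost <- function(y, h, Sigma) {
  r <- y - h                        # residual vector
  as.numeric(t(r) %*% solve(Sigma, r))
}

# With Sigma = sigma^2 * I the cross terms vanish and the cost is just the
# ordinary least squares cost scaled by 1 / sigma^2:
#   gls_cost(y, h, diag(sigma^2, length(y)))  equals  sum((y - h)^2) / sigma^2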

How do we minimize this cost function?  The most difficult part is finding an appropriate \mathbf{\Sigma} matrix.  If you don’t know the correlation structure it is somewhat difficult to estimate, and I haven’t found a straightforward method yet, so you’ll have to research this outside of this blog.

Once you have the covariance matrix you can again use gradient descent to minimize the cost, as the derivatives are relatively straightforward:

\frac{\partial Cost}{\partial w_l}=-2\sum_{i=1}^{m}c_{ii}(y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))\frac{\partial h(\mathbf{x}^{(i)}; \mathbf{w})}{\partial w_l}\\-\sum_{j \neq k}c_{jk}((y^{(k)}-h(\mathbf{x}^{(k)};\mathbf{w}))\frac{\partial h(\mathbf{x}^{(j)}; \mathbf{w})}{\partial w_l} + (y^{(j)}-h(\mathbf{x}^{(j)};\mathbf{w}))\frac{\partial h(\mathbf{x}^{(k)}; \mathbf{w})}{\partial w_l})

One concern is that this is much more computationally expensive than ordinary least squares, as it requires O(m^2) operations to evaluate the derivatives (the second term is actually a double sum).  This would be problematic for large datasets, since each iteration costs far more than an iteration of ordinary least squares.

That’s basically the end of this post.  The purpose wasn’t necessarily to show how to implement generalized least squares but rather to illustrate how the cost function changes for data that is heteroscedastic and/or not independent.

The Least Squares Cost Function

In the previous post I posed a question involving the least squares cost function.  This post will attempt to show where it comes from.

The least squares cost function is of the form:

c\sum_{i=1}^{m}(y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))^2

Where c is a constant, y the target and h the hypothesis of our model, which is a function of x and parameterized by the weights w.  The goal is to minimize this function when we have the form of our hypothesis.  People generally use this cost function when the response variable (y) is a real number.

I first saw this function when learning about linear regression, and one of my first questions was why we use this particular function instead of something else, like least absolute deviations.  It turns out that least squares maximizes the likelihood function for observations with Gaussian (normally distributed) errors.

To get the most out of this post, you should have an understanding of likelihood functions and the Normal distribution.  Both of these concepts are good subjects to learn for machine learning and statistics in general.

Now for the derivation:

Assume that each of the observations are independent and the response values (y) are normally distributed about a mean value and a specific variance.  ie.

y^{(i)} \: \mathtt{\sim} \:Normal(\mu_i,\sigma_i^{2})

The likelihood function for \mu_1, \mu_2,...,\mu_m is:

L(\mu_1,\mu_2,...,\mu_m)=\prod_{i=1}^{m}\frac{1}{\sigma_i\sqrt{2\pi}}e^{-\frac{(y^{(i)}-\mu_i)^2}{2\sigma_i^2}}

It is easier to deal with the log-likelihood since we can replace the products with sums, hence:

\log{L(\mu_1,\mu_2,...,\mu_m)}=-\sum_{i=1}^{m}\log{\sqrt{2\pi}}-\sum_{i=1}^{m}\log{\sigma_i}-\frac{1}{2}\sum_{i=1}^{m}\frac{(y^{(i)}-\mu_i)^2}{\sigma_i^2}

The next assumption that we will make is that all the variances are the same.

\sigma_1=\sigma_2=\:...\:=\sigma_m=\sigma

The log-likelihood function becomes:

\log{L(\mu_1,\mu_2,...,\mu_m)}=-\sum_{i=1}^{m}\log{\sqrt{2\pi}}-\sum_{i=1}^{m}\log{\sigma}-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(y^{(i)}-\mu_i)^2

Since \sigma and 2\pi are constants, in order to maximize the likelihood function we’ll need to minimize the term:

\sum_{i=1}^{m}(y^{(i)}-\mu_i)^2

Since our hypothesis is attempting to predict the average value of the response given a set of predictors, we define:

\mu_i=h(\mathbf{x}^{(i)};\mathbf{w})

Which leads us to the least squares cost function:

\sum_{i=1}^{m}(y^{(i)}-h(\mathbf{x}^{(i)};\mathbf{w}))^2

In linear regression h(\mathbf{x}^{(i)};\mathbf{w})=w_0 + w_1x_1^{(i)}+ w_2x_2^{(i)} + ... + w_px_p^{(i)} but these calculations don’t necessarily make this assumption.

In other words, finding a set of weights that minimizes the least squares cost function is equivalent to maximizing the likelihood function assuming the observations are independent, with Normally distributed error terms all with the same variance.  If these assumptions do not hold, then there might be better cost functions to use.  However there may be transformations you can do to the data so that the assumptions are roughly met.
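
As a quick sanity check of this equivalence, the snippet below simulates some data, minimizes the least squares cost numerically with optim(), and compares the result to the fit from R’s built-in lm(), which maximizes the normal likelihood; the two sets of weights should agree closely.

set.seed(1)
m <- 100
x <- rnorm(m)
y <- 2 + 3 * x + rnorm(m, sd = 0.5)   # true weights: w0 = 2, w1 = 3

# Least squares cost for the linear hypothesis h(x; w) = w0 + w1 * x
cost <- function(w) sum((y - (w[1] + w[2] * x))^2)

optim(c(0, 0), cost)$par   # weights found by minimizing the cost
coef(lm(y ~ x))            # maximum likelihood fit under the normal model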

Gradient Descent for Linear Regression

One of the basic learning examples is that of using gradient descent for linear regression.  The model is very popular in statistics and is as follows:

y^{(i)} = \beta_0 + \beta_1x_1^{(i)} + \beta_2x_2^{(i)} + ... + \beta_px_p^{(i)} + \epsilon_i

y^{(i)}\:\textnormal{is the response value of observation i}
x_1^{(i)},x_2^{(i)},...,x_p^{(i)}\:\textnormal{are the predictors for observation i}
\beta_0,\beta_1,...,\beta_p\:\textnormal{are the weights for the predictors}

We form the following hypothesis to make predictions:

h(\mathbf{x}^{(i)}) = w_0 + w_1x_1^{(i)} + w_2x_2^{(i)} + ... + w_px_p^{(i)}

Note here that the bold \mathbf{x}^{(i)} represents the vector [1, x_1^{(i)}, x_2^{(i)}, ..., x_p^{(i)}], with a leading 1 so that w_0 acts as the intercept.

The goal is to find the weights (w’s) that will minimize the squared error between the prediction (h(x)) and the target value (y).  The cost function is therefore:

Cost = \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)}-h(\mathbf{x}^{(i)}))^2

Before proceeding further, one might ask why we use least squares over anything else.  One reason is that it is easy to solve both analytically and numerically.  The derivatives of the cost function with respect to the w’s are easy to compute and can be used to find a minimum as we’ll see a little later.  Another reason is that it is equivalent to maximizing the likelihood equation of a Normal distribution.  We’ll get more into that in another post.

The method that we will employ is gradient descent.  It is based on the property that the gradient of a surface points in the direction of steepest ascent.  It follows that the opposite (negative) direction of the gradient is the direction of steepest descent, which we will use to find the minimum of the cost function.

Consider the example with response y and three predictors x_1, x_2, x_3.

The gradient of the cost function is:

\nabla Cost=\left[\begin{matrix}  \frac{\partial Cost}{\partial w_0} & \frac{\partial Cost}{\partial w_1} & \frac{\partial Cost}{\partial w_2} & \frac{\partial Cost}{\partial w_3}  \end{matrix}\right]

Where,

\frac{\partial Cost}{\partial w_0}=\frac{1}{2m}\sum_{i=1}^{m}\frac{\partial (y^{(i)}-h(\mathbf{x}^{(i)}))^2}{\partial h(\mathbf{x}^{(i)})}\frac{\partial h(\mathbf{x}^{(i)})}{\partial w_0} \vspace{2 mm}\\    =\frac{1}{2m}\sum_{i=1}^{m}2(y^{(i)}-h(\mathbf{x}^{(i)}))(-1)(1)\vspace{2mm}\\    =\frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})

Going through similar steps you will find:

\frac{\partial Cost}{\partial w_1}=\frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})x_1^{(i)}

\frac{\partial Cost}{\partial w_2}=\frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})x_2^{(i)}

\frac{\partial Cost}{\partial w_3}=\frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})x_3^{(i)}

During each iteration of the algorithm, a step of size \alpha is taken in the direction of steepest descent.  This continues until a stopping criterion is met.

The pseudocode for our example would be:

Choose initial values for w_0, w_1, w_2, w_3

Choose step size \alpha
while (stopping criterion not met)  {
   w_0 <- w_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})
   w_1 <- w_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})x_1^{(i)}
   w_2 <- w_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})x_2^{(i)}
   w_3 <- w_3 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})x_3^{(i)}
   (all four sums use the h values computed from the previous iteration's weights)
   Update the quantities used in the stopping criterion
}
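
Below is a minimal, self-contained R sketch of this loop on simulated data.  It is only an illustration of the pseudocode above, not the downloadable implementation mentioned below.

set.seed(42)
m <- 200
X <- cbind(1, scale(matrix(rnorm(m * 3), m, 3)))   # intercept + 3 scaled predictors
w_true <- c(1, 2, -3, 0.5)
y <- as.vector(X %*% w_true + rnorm(m))

alpha <- 0.1                    # step size
w <- rep(0, 4)                  # initial weights
for (iter in 1:10000) {
  h <- as.vector(X %*% w)                      # predictions with the current weights
  grad <- as.vector(crossprod(X, h - y)) / m   # (1/m) * sum (h - y) * x
  w <- w - alpha * grad                        # step in the direction of steepest descent
  if (sqrt(sum(grad^2)) < 1e-8) break          # stop when the gradient norm is tiny
}
w                      # weights found by gradient descent
coef(lm(y ~ X - 1))    # analytical least squares solution for comparison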

I’ve taken the liberty of implementing this algorithm in R. You can download it here from my Dropbox.

The code does a bit more than just implement gradient descent.  Some other practical considerations must be taken.  In a nutshell here are the things that the code does:

  1. Generate test data.  Uses R’s built-in random generators to create the predictors and the response values.
  2. Perform feature scaling on the predictors.  What I found was that the algorithm did not converge without doing this step.  Feature scaling makes the error surface look more like a circular bowl when the inputs are on very different scales, which is the kind of surface gradient descent works best on.  For more information about feature scaling look at the Wikipedia page and the lecture titled “A bag of tricks for mini-batch gradient descent” in Geoffrey Hinton’s Neural Networks course.
  3. Calculates the least squares estimate using the built-in glm function from R.  As you may know, the exact least squares solution can be found analytically (see here), so we use it to check that our algorithm gets to a reasonable answer.  Why use gradient descent then?  Geoffrey Hinton gives two reasons in one of his class lectures.  Firstly, many researchers are interested in how the brain learns, and it is more likely that it learns iteratively, like gradient descent, than by solving an equation.  The second reason is that iterative methods generalize to more complex cost functions and hypotheses.
  4. Checks the derivatives calculated in our equations by comparing them to finite difference estimates (see the sketch after this list).  This is something I learned from Andrew Ng’s course (lecture “Gradient Checking”).  Although it’s not really necessary for this simple example, it’s something we should get into the habit of doing, as it is almost essential when tackling more complicated algorithms.
  5. Calculates the Euclidean norm of the gradient for use in the stopping criterion.  A local (global in this case) minimum of the cost function occurs where all the partial derivatives are zero, so we set a threshold and stop the algorithm when the norm of the gradient is “sufficiently” small.
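
Here is a minimal sketch of the finite-difference check mentioned in point 4, reusing X, y, m, and the fitted w from the gradient descent sketch above (the helper name is made up for this illustration):

# Central finite-difference approximation of the gradient of a cost function
numeric_gradient <- function(cost, w, eps = 1e-6) {
  sapply(seq_along(w), function(j) {
    e <- rep(0, length(w)); e[j] <- eps
    (cost(w + e) - cost(w - e)) / (2 * eps)
  })
}

cost <- function(w) sum((y - as.vector(X %*% w))^2) / (2 * m)
analytic <- as.vector(crossprod(X, as.vector(X %*% w) - y)) / m
max(abs(analytic - numeric_gradient(cost, w)))   # should be very small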

Gradient descent is a simple algorithm for finding solutions that minimize a cost.  While there are more advanced methods that converge more quickly, this algorithm is more intuitive and provides a good foundation for machine learning algorithms.

A few machine learning resources

Coursera is one of the coolest sites on the internet for me.  It offers free training courses on many different subjects.  Lucky for us, the founders of the site (Andrew Ng and Daphne Koller) are computer science professors, so it has a lot of content related to machine learning.

Although I have yet to officially complete a course, I’ve looked through the lecture notes of two classes:

  • Machine Learning (by Andrew Ng) is a very practical course designed for people just getting started with machine learning.  I particularly appreciated the content in the “Advice for Applying Machine Learning” and “Machine Learning System Design” lectures.  They contain valuable advice and get you thinking about how to approach problems in general.
  • Neural Networks for Machine Learning (by Geoffrey Hinton) teaches you about one of the current state-of-the-art techniques in machine learning.  Neural networks have been very successful in recent competitions, so it’s a good idea to know about them.  This course is more involved than Andrew Ng’s, but if you take the time to study the material it will be worthwhile.

I also find Wikipedia extremely handy with very good content on machine learning and statistics.  I find myself looking up things there before trying Google.

A lot of the notes that will be on this website will be from one of these resources.

First Post!

Welcome to my blog.  This blog is a vehicle for me to write down some of my notes and thoughts about machine learning and statistics.  I hope you enjoy reading and find some things helpful.