Generalized Linear Models

Sun Jul 13 2025


If you've taken a statistics class, you've certainly come across Linear Regression, and possibly other models like Logistic Regression. You may have been taught both separately, with their respective hypothesis functions pulled out of thin air; however, they can both be derived quite beautifully from the same underlying framework.

This article is my attempt at summarizing Generalized Linear Models, both to cement my own learning and hopefully to help the rare reader. It is based entirely on Andrew Ng's CS229 (Autumn 2018) series; the particular video I learnt from can be found here: Lecture 4. Generalized Linear Models extend naturally from the Exponential Family of statistical distributions, so we'll cover that first.

Exponential Family

A distribution that is part of the exponential family of distributions must fit the following form:

p(y;\eta) = b(y) \exp(\eta^T T(y) - a(\eta))

Here, \eta is the natural parameter, T(y) is the sufficient statistic (often, T(y) = y) and a(\eta) is the log-partition function. p(y;\eta) is the probability density or mass function, depending on whether your chosen distribution is continuous or discrete, respectively. The natural parameter \eta is a reparametrization of the parameter that appears in the PDF or PMF: for example, in a Bernoulli distribution defined as \phi^y (1 - \phi)^{1-y}, \eta will become a function purely of \phi. To prove that a statistical distribution is part of the exponential family, you need to show that its PDF or PMF can be manipulated to fit the above form.

Bernoulli Distribution

To illustrate this, we'll use the Bernoulli distribution as an example. We begin by defining the distribution's PMF and slowly manipulating it to fit the exponential family form.

p(y;\phi) = \phi^y (1 - \phi)^{1-y}
= \exp(\ln(\phi^y (1 - \phi)^{1-y}))
= \exp(\ln(\phi^y) + \ln((1 - \phi)^{1-y}))
= \exp(y\ln(\phi) + (1 - y)\ln(1 - \phi))
= \exp(y\ln(\phi) + \ln(1 - \phi) - y\ln(1 - \phi))
= \exp\left(y\ln\left(\frac{\phi}{1 - \phi}\right) + \ln(1 - \phi)\right)

We've successfully manipulated the distribution to fit the exponential family definition, proving that the Bernoulli distribution is a family within the exponential family (I'll get into what a family within a family means soon). Doing some pattern matching, we can extract the following parameters:

b(y) = 1
\eta = \ln\left(\frac{\phi}{1 - \phi}\right)
T(y) = y
a(\eta) = -\ln(1 - \phi) = \ln(1 + e^\eta)

The latter definition for a(\eta) can be found by rearranging the definition of \eta to express \phi in terms of \eta, then substituting back into the former definition. This is simple but algebraically long, so I won't include every step here.
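The key steps, compressed: starting from the definition of \eta and solving for \phi,

\eta = \ln\left(\frac{\phi}{1 - \phi}\right) \implies \phi = \frac{e^\eta}{1 + e^\eta} \implies 1 - \phi = \frac{1}{1 + e^\eta}

so a(\eta) = -\ln(1 - \phi) = -\ln\left(\frac{1}{1 + e^\eta}\right) = \ln(1 + e^\eta).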

Constructing GLMs

Constructing GLMs involves the same set of recurring steps. Depending on the task at hand, an appropriate distribution must be chosen, manipulated into the exponential family form, and its respective set of parameters found. To finally construct the GLM, two main assumptions/design choices need to be made.

  • The first design choice is that \eta = \theta^T x, where \theta, x \in \mathbb{R}^n. \theta is a set of learnable parameters, and n is the number of features you have.
  • At test time, the output of the model is the expected value of the distribution, i.e. E[y|x; \theta] (see the sketch just below).
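To make these two design choices concrete, here is a minimal sketch in Python (the function names are mine, not from the lecture) of how a GLM produces a prediction once you've picked a distribution and worked out its expected value as a function of \eta:

```python
import numpy as np

def natural_parameter(theta, x):
    # Design choice 1: the natural parameter is linear in the input features.
    # theta and x are both length-n vectors.
    return theta @ x

def glm_predict(theta, x, expected_value):
    # Design choice 2: at test time, output E[y | x; theta].
    # `expected_value` maps eta to E[y] and depends on which exponential
    # family member you chose (we derive it for the Bernoulli case below).
    eta = natural_parameter(theta, x)
    return expected_value(eta)
```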

GLMs have a nice property whereby the expected value of the distribution is given by the derivative of a(\eta) with respect to \eta, which is a lot nicer than the traditional integral approach to calculating the expected value of a random variable. To convince you of this, I'll write a quick derivation using the log-partition function we derived for the Bernoulli distribution:

a(\eta) = \ln(1 + e^\eta)
\frac{\partial}{\partial \eta} \ln(1 + e^\eta) = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}}

This is indeed the sigmoid function that we use in Logistic Regression! Combined with the first design choice, \eta = \theta^T x, the model's output E[y|x; \theta] becomes \frac{1}{1 + e^{-\theta^T x}}, which is exactly the Logistic Regression hypothesis. If you're interested in a similar derivation for Linear Regression using the Gaussian distribution, I've uploaded full derivations for everything here, since the CS229 lecture notes tend to skip over some algebra (which is fair, but some people may want to see all the steps).
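As a quick sanity check (not from the lecture, just a sketch assuming NumPy), we can also verify numerically that the derivative of the Bernoulli log-partition function matches the sigmoid:

```python
import numpy as np

def log_partition(eta):
    # a(eta) = ln(1 + e^eta) for the Bernoulli distribution.
    return np.log1p(np.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Central finite differences: da/d(eta) should equal sigmoid(eta).
eta = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric_grad = (log_partition(eta + h) - log_partition(eta - h)) / (2 * h)

print(np.allclose(numeric_grad, sigmoid(eta), atol=1e-6))  # expected: True
```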