Hieu Nguyen Phi, FRM

How I enjoy my life



Linear Discriminant Analysis (LDA)

According to Bayes’ theorem, the posterior probability that a sample \(x\) belongs to class \(C_l\) is given by

\[\pi (y=C_l|x) = \frac{\pi (y=C_l)\pi(x|y=C_l)}{\sum_{k=1}^C \pi(y=C_k)\pi(x|y=C_k)}\]

where \(\pi(y=C_l)\) is the prior probability of membership in class \(C_l\), and \(\pi(x|y=C_l)\) is the conditional density of the predictors \(x\) given that the data come from class \(C_l\).

The rule that minimizes the total probability of misclassification says to classify \(x\) into class \(C_k\) if \(\pi(y=C_k)\pi(x|y=C_k)\) has the largest value across all of the \(C\) classes.

If we assume that \(\pi(x|y=C_l)\) follows a multivariate Gaussian distribution \(\Phi(\mu_l,\Sigma)\) with a class-specific mean vector \(\mu_l\) and a common covariance matrix \(\Sigma\), then the log of \(\pi(y=C_l)\pi(x|y=C_l)\), after dropping the terms that do not depend on the class (the normalizing constant and the quadratic term \(-0.5x^T\Sigma^{-1}x\)), is given by

\[x^T \Sigma^{-1}\mu_l - 0.5\mu_l^T \Sigma^{-1}\mu_l + \log(\pi(y=C_l))\]

which is a linear function in \(x\) that defines separating class boundaries, hence the name LDA.
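To make the decision rule concrete, here is a minimal one-dimensional sketch (the means, variance, and priors are made-up toy numbers, not taken from any dataset) that evaluates the linear discriminant for each class and picks the largest:

```r
# Minimal 1-D sketch of the LDA decision rule (toy numbers, not iris)
mu     <- c(0, 3)      # class-specific means mu_l
sigma2 <- 1            # common variance Sigma
prior  <- c(0.5, 0.5)  # prior probabilities pi(y = C_l)

# linear discriminant: x * mu_l / sigma2 - 0.5 * mu_l^2 / sigma2 + log(prior_l)
delta <- function(x) x * mu / sigma2 - 0.5 * mu^2 / sigma2 + log(prior)

which.max(delta(1))  # class 1
which.max(delta(2))  # class 2, since the boundary sits at x = 1.5
```

With equal priors the boundary is the midpoint of the two means, which is exactly what solving \(\delta_1(x)=\delta_2(x)\) gives here.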

LDA can be seen from two different angles. The first classifies a given sample of predictors \(x\) to the class \(C_l\) with the highest posterior probability \(\pi(y=C_l|x)\); as mentioned, this minimizes the total probability of misclassification. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

In this post, I will present the first aspect of LDA. The second will be shown in the next post.

LDA as a classifier

We are going to investigate LDA using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers’ species.
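The summaries printed below were presumably produced by something along these lines (a sketch; the original code is not shown in the post):

```r
# first rows of the built-in iris data
head(iris)

# class-specific means of the four predictors, one block per species
for (sp in levels(iris$Species))
  print(colMeans(iris[iris$Species == sp, 1:4]))
```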

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## setosa:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.006        3.428        1.462        0.246
## versicolor:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.936        2.770        4.260        1.326
## virginica:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        6.588        2.974        5.552        2.026

Next, we compute the pooled (within-class) covariance matrix:

p <- 4L  # number of predictors
g <- 3L  # number of classes
Sp <- matrix(0, p, p)
nx <- rep(0, g)
lev <- levels(iris$Species)
for (k in 1:g) {
  x <- iris[iris$Species == lev[k], 1:p]
  nx[k] <- nrow(x)
  Sp <- Sp + cov(x) * (nx[k] - 1)
}
Sp <- Sp / (sum(nx) - g)
round(Sp, 3)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length        0.265       0.093        0.168       0.038
## Sepal.Width         0.093       0.115        0.055       0.033
## Petal.Length        0.168       0.055        0.185       0.043
## Petal.Width         0.038       0.033        0.043       0.042

Now we call the `lda` function from the MASS package to fit the LDA model.

library(MASS)
ldamod <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))
names(ldamod)
##  [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"      
##  [8] "call"    "terms"   "xlevels"

As we can see, R’s `lda` function returns several useful components: the prior probability of each class (`prior`), the counts for each class in the data (`counts`), the class-specific means (`means`), the linear combination coefficients for each linear discriminant (`scaling`), and the singular values (`svd`), which give the ratio of the between- to within-group standard deviations on the linear discriminant variables.
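For example, the singular values can be turned into the proportion of between-group variance captured by each discriminant, the “proportion of trace” that `lda`’s print method reports (this check is my addition, not in the original post):

```r
library(MASS)
# refit here so the snippet is self-contained
ldamod <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))

# squared singular values are proportional to the between-group
# variance explained by each linear discriminant
prop <- ldamod$svd^2 / sum(ldamod$svd^2)
round(prop, 4)  # LD1 captures the vast majority of the separation
```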

Next we check the LDA coefficients and create the (centered) discriminant scores.

# check the LDA coefficients/scalings
ldamod$scaling
##                     LD1         LD2
## Sepal.Length  0.8293776  0.02410215
## Sepal.Width   1.5344731  2.16452123
## Petal.Length -2.2012117 -0.93192121
## Petal.Width  -2.8104603  2.83918785
# the scalings sphere the pooled covariance: t(scaling) %*% Sp %*% scaling = I
crossprod(ldamod$scaling, Sp) %*% ldamod$scaling
##              LD1          LD2
## LD1  1.00000e+00 -7.21645e-16
## LD2 -7.21645e-16  1.00000e+00
# create the (centered) discriminant scores
mu.k <- ldamod$means
mu <- colMeans(mu.k)
dscores <- scale(iris[, 1:p], center = mu, scale = FALSE) %*% ldamod$scaling
sum((dscores - predict(ldamod)$x)^2)
## [1] 1.658958e-28

Finally, we visualize our analysis.

# plot the scores and coefficients
library(ggplot2)
scores.df <- data.frame(Species = iris$Species, dscores)
p1 <- ggplot(scores.df) +
  geom_point(aes(LD1, LD2, colour = Species), size = 2.5) +
  labs(title = "Discriminant Scores") +
  scale_x_continuous(limits = c(-10, 10)) +
  scale_y_continuous(limits = c(-3, 3))
p2 <- ggplot(data.frame(ldamod$scaling),
             aes(LD1, LD2, label = rownames(ldamod$scaling))) +
  geom_point(size = 6) +
  geom_text(vjust = 2) +
  labs(title = "Discriminant Coefficients") +
  scale_x_continuous(limits = c(-4, 3)) +
  scale_y_continuous(limits = c(-1, 3))
gridExtra::grid.arrange(p1, p2, ncol = 2)

# visualize the LDA partitions (partimat is from the klaR package)
library(klaR)
species <- factor(rep(c("s", "c", "v"), each = 50))
partimat(x = dscores[, 2:1], grouping = species, method = "lda")

# visualize the LDA partitions (for all pairs of predictors)
partimat(x = iris[, 1:4], grouping = species, method = "lda")




I am Hieu Nguyen Phi, FRM, a Quantitative Risk Analyst from Ho Chi Minh City, Vietnam. I have a background in International Finance from Foreign Trade University. I have also held the FRM certification from [GARP](https://www.garp.org/#!/home) since 2018, after 2 years of experience in risk management in the banking and investing sectors. I am also inspired by Mr Sheldon L. Cooper, a fictional character in the CBS television series The Big Bang Theory. He is who he is and doesn't pretend to be someone he isn't. He is serious about everything he does, from academic tasks to real-life businesses. With my inherent laziness, I always look for the best and most efficient (in terms of time and my energy capacity) approach to resolve my practical issues. Currently, I am employed by [Viet Capital Bank](https://www.vietcapitalbank.com.vn) as a Risk Analyst and Modeler. My responsibility is to build risk models and to perform required analyses for the bank's credit risk portfolio. For further details about my job, please contact me at my personal email **hieunguyenphi94@gmail.com**.