Suppose we have a dataset \(\mathbf{X}\) with n data points and p
features (n>p): \(\mathbf{X}\in\mathbb{R}^{n\times p}\)
We are looking for a vector \(\mathbf{b}\) of coefficient estimates:
\(\mathbf{b}\in\mathbb{R}^{p}\)
so that the response vector \(\mathbf{y}\in\mathbb{R}^{n}\) can be
expressed as \(\mathbf{y}=\mathbf{Xb}+\mathbf{\epsilon}\)
where \(\mathbf{\epsilon}\) is a vector
of random errors.
With PCA we obtain \(\mathbf{V}\in\mathbb{R}^{p\times p}\), whose columns are the eigenvectors. Then we calculate the matrix \(\mathbf{T}_{n\times p}=\mathbf{X}_{n\times p}\mathbf{V}_{p\times p}\), where each column is one of the p principal components. (If we want to reduce dimensionality, we can keep only the first k columns of \(\mathbf{V}\); in this way \(\mathbf{T}_{n\times k}=\mathbf{X}_{n\times p}\mathbf{V}_{p\times k}\).)
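A minimal sketch with simulated data illustrates the score computation (the toy matrix X_toy and all names in it are made up for illustration; prcomp's own output is used as the reference):
# minimal sketch with simulated data (X_toy is made up for illustration)
set.seed(42)
X_toy <- matrix(rnorm(20 * 4), nrow = 20)       # n = 20 points, p = 4 features
pca_toy <- prcomp(X_toy, scale. = TRUE)
V <- pca_toy$rotation                           # p x p matrix with eigenvectors as columns
T_full <- scale(X_toy) %*% V                    # T = XV computed on the scaled data
all.equal(T_full, pca_toy$x)                    # should be TRUE: T matches the prcomp scores
T_k <- scale(X_toy) %*% V[, 1:2]                # reduced version: T_(n x k) = X_(n x p) V_(p x k) with k = 2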
Now we regress \(\mathbf{y}\) on \(\mathbf{T}\) (instead of \(\mathbf{X}\)) and get coefficients \(\mathbf{b_T}=(\mathbf{T}^T\mathbf{T})^{-1}\mathbf{T}^T\mathbf{y}\) (normal equations) for the PCs. Because \(\mathbf{\hat{y}}=\mathbf{T}\mathbf{b_T}=(\mathbf{XV})\mathbf{b_T}=\mathbf{X}(\mathbf{Vb_T})=\mathbf{Xb}\), we can get back the coefficients for the variables \(\mathbf{b}=\mathbf{Vb_T}\)…(1).
If we normalized the data (centered and scaled) before doing PCA, this vector of coefficients \(\mathbf{b}\) still refers to the normalized data. That is, \(y = b_0+b_1x_{n1}+b_2x_{n2}+...+b_px_{np}\)…(2), where \(x_{n1},x_{n2},...,x_{np}\) are obtained from the original data \(x_{u1},x_{u2},...,x_{up}\) by \(x_n=(x_u-\bar{x}_u)/s\)…(3). To get the coefficients \(\beta_0,\beta_1,...,\beta_p\) for the original variables, \(y=\beta_0+\beta_1x_{u1}+\beta_2x_{u2}+...+\beta_px_{up}\)…(4), we rearrange (3) and substitute into (4):
\(\begin{align*} y&=\beta_0+\beta_1(x_{n1}\cdot s_1+\bar{x}_{u1})+\beta_2(x_{n2}\cdot s_2+\bar{x}_{u2})+...+\beta_p(x_{np}\cdot s_p+\bar{x}_{up})\\ &=\beta_0+\beta_1x_{n1}s_1+\beta_1\bar{x}_{u1}+\beta_2x_{n2}s_2+\beta_2\bar{x}_{u2}+...+\beta_px_{np}s_p+\beta_p\bar{x}_{up}...(5) \end{align*}\)
Comparing (2) and (5), we get:
\(\begin{align*} \beta_1 &= b_1/s_1\\ \beta_2 &= b_2/s_2\\ &...\\ \beta_p &= b_p/s_p...(6)\\ \beta_0 &= b_0-(\beta_1\bar{x}_{u1}+\beta_2\bar{x}_{u2}+...+\beta_p\bar{x}_{up})...(7) \end{align*}\)
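As a sanity check on equations (1), (6), and (7), here is a minimal sketch on simulated data (all names such as X_toy and y_toy are made up for illustration): regressing on all p principal components and transforming the coefficients back should reproduce an ordinary lm() fit on the original variables.
# sketch: verify b = V %*% b_T and the unscaling in (6)-(7) on simulated data
set.seed(1)
n <- 50; p <- 3
X_toy <- matrix(rnorm(n * p, mean = 5, sd = 2), nrow = n)
y_toy <- drop(1 + X_toy %*% c(2, -1, 0.5) + rnorm(n))
pca_toy <- prcomp(X_toy, scale. = TRUE)
fit_pc <- lm(y_toy ~ pca_toy$x)                           # regress y on all p PCs
b_T <- coef(fit_pc)[-1]                                   # coefficients for the PCs
b <- pca_toy$rotation %*% b_T                             # equation (1): back to the (scaled) variables
beta <- b / pca_toy$scale                                 # equation (6): slopes on the original scale
beta_0 <- coef(fit_pc)[1] - sum(pca_toy$center * beta)    # equation (7): intercept on the original scale
cbind(c(beta_0, beta), coef(lm(y_toy ~ X_toy)))           # the two columns should match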
library(factoextra) # for extracting and visualizing the output of the PCA model
uscrime <- read.csv("./data 9.1/uscrime.txt",sep = "")
X<-uscrime[1:15] # explanatory/independent variables
y<-uscrime[16] # response/dependent variable
pca<-prcomp(X, scale. =TRUE) # PCA on the scaled explanatory variables
# from the output of the PCA model, we can get the following info
# pca$rotation # loadings (the matrix of variable loadings; columns are the eigenvectors)
# pca$x # scores (new coordinates of the original data points on the PCs)
# pca$sdev # standard deviations of the principal components
# pca$center # variable means (the means that were subtracted)
# pca$scale # variable standard deviations (the scaling applied to each variable)
# the following functions report similar output
# summary(pca) # gives the standard deviation of each PC
get_eig(pca) # from the factoextra package, gives the eigenvalue and variance explained of each PC
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 6.018952657 40.1263510 40.12635
## Dim.2 2.801847026 18.6789802 58.80533
## Dim.3 2.004944334 13.3662956 72.17163
## Dim.4 1.162207801 7.7480520 79.91968
## Dim.5 0.958298972 6.3886598 86.30834
## Dim.6 0.553193900 3.6879593 89.99630
## Dim.7 0.321818687 2.1454579 92.14176
## Dim.8 0.307401270 2.0493418 94.19110
## Dim.9 0.235155292 1.5677019 95.75880
## Dim.10 0.199880931 1.3325395 97.09134
## Dim.11 0.175685403 1.1712360 98.26258
## Dim.12 0.128190107 0.8546007 99.11718
## Dim.13 0.069341691 0.4622779 99.57945
## Dim.14 0.058467765 0.3897851 99.96924
## Dim.15 0.004614165 0.0307611 100.00000
fviz_pca_biplot(pca, repel = T, # Avoid text overlapping
col.var = "red", # Variables color
col.ind = "blue") # Individuals color
A biplot is a very informative plot for PCA. In blue, each individual (observation, in this case a state) is plotted on the first two PCs. The shorter the distance between two observations, the more similar they are (e.g., 31 and 42 are very similar states, while 29 and 38 are very different states). In red, each variable is plotted based on its rotation (loadings) on the first two PCs. This shows how strongly each variable influences a PC (by its loading on that PC) and how the variables correlate with each other (by the angle between two vectors). For example, Wealth and Ineq strongly influence PC1 (not PC2) but their effects are opposite (they are negatively correlated). Pop and Time strongly influence PC2 and are somewhat correlated. Po1 and Po2 influence both PC1 and PC2 and are highly correlated with each other.
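As a quick sanity check of this reading (a sketch using the explanatory variables already loaded in X), the raw correlations should agree with the angles between the variable vectors in the biplot:
# check the biplot reading against the raw correlations
cor(X$Po1, X$Po2)      # expected to be strongly positive (nearly parallel vectors)
cor(X$Wealth, X$Ineq)  # expected to be strongly negative (vectors in roughly opposite directions)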
fviz_eig(pca,addlabels = TRUE) # scree plot to visualize the variance explained by each component
The scree plot above shows that the first PC explains 40.13% of the variance, the second PC explains 18.68%, and so on.
From here, we have several options to decide how many PCs to keep. According to the Kaiser-Guttman criterion (https://second.wiki/wiki/kaiser-guttman-kriterium), we only keep the PCs whose eigenvalue is greater than 1, which here gives 4 PCs. Alternatively, we could pick the elbow point subjectively from the scree plot. A third way is to use cross validation, for which I give details in part 3.3. The conclusion is that I will keep the first five PCs based on the cross validation results; cumulatively they explain 86.31% of the variance.
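The Kaiser-Guttman count can be read straight off the eigenvalues; a short sketch (for scaled data the eigenvalues are the squared component standard deviations):
# Kaiser-Guttman criterion: number of PCs with eigenvalue > 1
sum(pca$sdev^2 > 1)                # should give 4 here
sum(get_eig(pca)$eigenvalue > 1)   # equivalent check via factoextra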
pc5data<-data.frame(cbind(pca$x[,1:5],y)) # make a data frame with the first five PCs and the response variable
colnames(pc5data)[6]<-'Crime' # re-name the last column
pc5lr<-lm(Crime~.,data = pc5data) # perform linear regression
summary(pc5lr)
##
## Call:
## lm(formula = Crime ~ ., data = pc5data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.79 -185.01 12.21 146.24 447.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.09 35.59 25.428 < 2e-16 ***
## PC1 65.22 14.67 4.447 6.51e-05 ***
## PC2 -70.08 21.49 -3.261 0.00224 **
## PC3 25.19 25.41 0.992 0.32725
## PC4 69.45 33.37 2.081 0.04374 *
## PC5 -229.04 36.75 -6.232 2.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared: 0.6452, Adjusted R-squared: 0.6019
## F-statistic: 14.91 on 5 and 41 DF, p-value: 2.446e-08
3.1 Transform the coefficients for the PCs back to the scale of the original variables (one coefficient per original variable)
V <- as.matrix(pca$rotation[,1:5]) # the first five eigenvectors as the transformation matrix
b_T <- as.matrix(pc5lr$coefficients[2:6]) # coefficients for the PCs (not including intercept)
# equation (1), we get the coefficients for the explanatory variables but they are still on the normalized scale
b <- V%*%b_T; b # use %*% for matrix multiplication
## [,1]
## M 60.794349
## So 37.848243
## Ed 19.947757
## Po1 117.344887
## Po2 111.450787
## LF 76.254902
## M.F 108.126558
## Pop 58.880237
## NW 98.071790
## U1 2.866783
## U2 32.345508
## Wealth 35.933362
## Ineq 22.103697
## Prob -34.640264
## Time 27.205022
# equation (6) coefficients for the variables on the original scale
beta<-b/pca$scale; beta
## [,1]
## M 4.837374e+01
## So 7.901922e+01
## Ed 1.783120e+01
## Po1 3.948484e+01
## Po2 3.985892e+01
## LF 1.886946e+03
## M.F 3.669366e+01
## Pop 1.546583e+00
## NW 9.537384e+00
## U1 1.590115e+02
## U2 3.829933e+01
## Wealth 3.724014e-02
## Ineq 5.540321e+00
## Prob -1.523521e+03
## Time 3.838779e+00
# equation (7) intercept on the original scale
beta_0<-pc5lr$coefficients[1]-pca$center%*%beta; beta_0
## [,1]
## [1,] -5933.837
Above, beta and beta_0 are the coefficients (slopes and intercept) in terms of the original variables, as the question asked. Now we can predict the new city’s Crime using the linear regression line \(\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_px_p\), shown below.
test_point <- data.frame('M' = 14.0, 'So' = 0, 'Ed' = 10.0, 'Po1' = 12.0, 'Po2' = 15.5,
                         'LF' = 0.640, 'M.F' = 94.0, 'Pop' = 150, 'NW' = 1.1, 'U1' = 0.120,
                         'U2' = 3.6, 'Wealth' = 3200, 'Ineq' = 20.1, 'Prob' = 0.04, 'Time' = 39.0)
# predict Crime for the new city using the linear regression line
as.matrix(test_point)%*%beta+beta_0
## [,1]
## [1,] 1388.926
3.2 (alternative method to 3.1) PCA-transform the test point to the selected PC coordinates and use the coefficients directly from the regression output to predict
transformed_test_point <- predict(pca,test_point) # project test point onto the PC coordinates
# alternatively, this step can be done by manually scaling the test point and applying the rotation matrix (PCA loadings)
scaled_test_point <- (test_point-pca$center)/pca$scale # scale the test point manually
transformed_test_point == as.matrix(scaled_test_point)%*%as.matrix(pca$rotation) # matches the first way
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# predict Crime for the new city, result is the same as in 3.1
predict(pc5lr,data.frame(transformed_test_point))
## 1
## 1388.926
3.3 (another alternative method) Use the pcr() function in the pls package to perform PCA and linear regression in one step, with no coordinate transformation needed
library(pls)
set.seed(1)
# PCA and linear regression in one step with a 5-fold CV
pcr_cv<-pcr(Crime~., data = uscrime, scale = T, validation="CV", segments=5)
summary(pcr_cv)
## Data: X dimension: 47 15
## Y dimension: 47 1
## Fit method: svdpc
## Number of components considered: 15
##
## VALIDATION: RMSEP
## Cross-validated using 5 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 390.9 362.5 348.7 363.2 354.9 253.2 258.0
## adjCV 390.9 360.8 346.5 359.3 355.2 249.7 253.7
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 255.4 267.2 270.3 280.7 279.6 254.7 257.1
## adjCV 247.1 261.6 265.2 274.1 276.7 247.1 251.8
## 14 comps 15 comps
## CV 255.5 256.7
## adjCV 245.8 246.6
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 40.13 58.81 72.17 79.92 86.31 90.00 92.14 94.19
## Crime 17.11 26.31 27.16 30.91 64.52 65.86 68.82 68.99
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 95.76 97.09 98.26 99.12 99.58 99.97 100.00
## Crime 69.20 69.63 69.74 76.93 77.24 79.11 80.31
# predict new city's Crime with the first 5 PCs, result is the same as above
predict(pcr_cv,test_point, ncomp=5)
## , , 5 comps
##
## Crime
## 1 1388.926
# extract metrics from CV and plot them against the number of PCs
RMSEP(pcr_cv,estimate ='CV')
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 390.9 362.5 348.7 363.2 354.9 253.2
## 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## 258.0 255.4 267.2 270.3 280.7 279.6
## 12 comps 13 comps 14 comps 15 comps
## 254.7 257.1 255.5 256.7
rmse1 <- RMSEP(pcr_cv,estimate ='CV')$val[6] # CV RMSE using 5 PCs (position 1 of $val is the intercept-only model), to be compared with another model later
validationplot(pcr_cv, val.type="RMSEP", cex.axis=0.7)
Using RMSEP (Root Mean Square Error of Prediction) as the metric for CV, we can see that 5 PCs give the lowest error. That is why I chose to keep five PCs in part 2.
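The same conclusion can be extracted programmatically; a small sketch (assuming, as in the output above, that the RMSEP()$val array stores the intercept-only model in position 1 followed by the 1-15 component models):
# pick the number of components with the lowest cross-validated RMSEP
cv_rmsep <- RMSEP(pcr_cv, estimate = 'CV')$val
which.min(cv_rmsep[1, 1, -1])      # drop the intercept-only entry; expected to point to 5 comps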
References:
https://en.wikipedia.org/wiki/Principal_component_regression (the principle and details of the method)
https://learnche.org/pid/latent-variable-modelling/principal-components-regression (math and the general procedures)
https://jbhender.github.io/Stats506/F17/Projects/G18.html (demonstration of the pls and factoextra packages)
https://rpubs.com/esobolewska/pcr-step-by-step (math and cross validation)