0. Math behind PCA and PCR (principal component regression)

Suppose we have a dataset \(\mathbf{X}\) with n data points and p features (n>p): \(\mathbf{X}\in\mathbb{R}^{n\times p}\).
We are looking for a vector \(\mathbf{b}\in\mathbb{R}^{p}\) of estimated coefficients
so that the response vector \(\mathbf{y}\in\mathbb{R}^{n}\) can be expressed as \(\mathbf{y}=\mathbf{Xb}+\boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon}\) is a vector of random errors.

With PCA we obtain \(\mathbf{V}\in\mathbb{R}^{p\times p}\), whose columns are the eigenvectors of the covariance (or correlation) matrix of \(\mathbf{X}\). Then we calculate the matrix \(\mathbf{T}_{n\times p}=\mathbf{X}_{n\times p}\mathbf{V}_{p\times p}\), where each column is one of the p principal components. (If we want to reduce dimensions, we can keep only the first k columns of \(\mathbf{V}\), so that \(\mathbf{T}_{n\times k}=\mathbf{X}_{n\times p}\mathbf{V}_{p\times k}\).)

Now we regress \(\mathbf{y}\) on \(\mathbf{T}\) (instead of \(\mathbf{X}\)) and get coefficients \(\mathbf{b_T}=(\mathbf{T}^T\mathbf{T})^{-1}\mathbf{T}^T\mathbf{y}\) (normal equations) for the PCs. Because \(\mathbf{\hat{y}}=\mathbf{T}\mathbf{b_T}=(\mathbf{XV})\mathbf{b_T}=\mathbf{X}(\mathbf{Vb_T})=\mathbf{Xb}\), we can recover the coefficients for the original variables as \(\mathbf{b}=\mathbf{Vb_T}\)…(1).
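
As a sanity check of (1): when all p PCs are kept, the back-transformed coefficients \(\mathbf{Vb_T}\) coincide with the OLS coefficients from regressing \(\mathbf{y}\) on \(\mathbf{X}\) directly. A minimal sketch on simulated data (all names here are illustrative):

# minimal check of equation (1) on simulated data (illustrative names)
set.seed(42)
X_sim <- matrix(rnorm(100 * 3), nrow = 100)                # simulated 100 x 3 design matrix
y_sim <- rnorm(100)                                        # simulated response
V_sim <- eigen(crossprod(X_sim), symmetric = TRUE)$vectors # eigenvectors as columns
T_sim <- X_sim %*% V_sim                                   # scores: T = XV
b_T_sim <- solve(crossprod(T_sim), crossprod(T_sim, y_sim)) # normal equations on T
b_sim <- V_sim %*% b_T_sim                                 # equation (1): b = V b_T
all.equal(as.numeric(b_sim),
          as.numeric(solve(crossprod(X_sim), crossprod(X_sim, y_sim)))) # TRUE: all-PC PCR = OLS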

If we have normalized the data (centered and scaled) before doing PCA, this vector of coefficients \(\mathbf{b}\) is still based on the normalized data. That is, \(y = b_0+b_1x_{n1}+b_2x_{n2}+...+b_px_{np}\)…(2), where \(x_{n1},x_{n2},...,x_{np}\) are normalized from the original data \(x_{u1},x_{u2},...,x_{up}\) by \(x_n=(x_u-\bar{x}_u)/s\)…(3). To get the coefficients \(\beta_0,\beta_1,...,\beta_p\) for the original variables, \(y=\beta_0+\beta_1x_{u1}+\beta_2x_{u2}+...+\beta_px_{up}\)…(4), we rearrange (3) and substitute it into (4):

\(\begin{align*} y&=\beta_0+\beta_1(x_{n1}\cdot s_1+\bar{x}_{u1})+\beta_2(x_{n2}\cdot s_2+\bar{x}_{u2})+...+\beta_p(x_{np}\cdot s_p+\bar{x}_{up})\\ &=\beta_0+\beta_1x_{n1}s_1+\beta_1\bar{x}_{u1}+\beta_2x_{n2}s_2+\beta_2\bar{x}_{u2}+...+\beta_px_{np}s_p+\beta_p\bar{x}_{up}...(5) \end{align*}\)

Comparing (2) and (5), we get:

\(\begin{align*} \beta_1 &= b_1/s_1\\ \beta_2 &= b_2/s_2\\ &...\\ \beta_p &= b_p/s_p...(6)\\ \beta_0 &= b_0-(\beta_1\bar{x}_{u1}+\beta_2\bar{x}_{u2}+...+\beta_p\bar{x}_{up})...(7) \end{align*}\)
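
A quick numeric check of (6) and (7), again on simulated data with illustrative names: fitting on the normalized variables and descaling reproduces the coefficients of a fit on the original variables.

# minimal check of equations (6) and (7) on simulated data (illustrative names)
set.seed(1)
sim <- data.frame(x1 = rnorm(50, mean = 10, sd = 2), x2 = rnorm(50, mean = -5, sd = 3))
sim$y <- 1 + 2 * sim$x1 - 3 * sim$x2 + rnorm(50)
fit_norm <- lm(y ~ scale(x1) + scale(x2), data = sim)  # fit on normalized variables
s_sim <- c(sd(sim$x1), sd(sim$x2))                     # scaling factors s
m_sim <- c(mean(sim$x1), mean(sim$x2))                 # centers x_bar
beta_sim  <- coef(fit_norm)[-1] / s_sim                # equation (6)
beta0_sim <- coef(fit_norm)[1] - sum(beta_sim * m_sim) # equation (7)
all.equal(as.numeric(c(beta0_sim, beta_sim)),
          as.numeric(coef(lm(y ~ x1 + x2, data = sim)))) # TRUE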

1. Perform PCA on the explanatory variables to obtain the principal components (PCs) and select a subset of PCs
library(factoextra) # for extracting and visualizing the output of the PCA model
## Warning: package 'ggplot2' was built under R version 4.1.3
uscrime <- read.csv("./data 9.1/uscrime.txt", sep = "")
X <- uscrime[1:15] # explanatory/independent variables
y <- uscrime[16]   # response/dependent variable
pca <- prcomp(X, scale. = TRUE) # PCA on the scaled explanatory variables

# from the output of the PCA model, we can get the following info
# pca$rotation  # loadings (the matrix of variable loadings (columns are eigenvectors))
# pca$x         # scores (new coordinates of the original data points on the PCs)
# pca$sdev      # standard deviations of the principal components
# pca$center    # variable means (means that were subtracted)
# pca$scale     # variable standard deviations (the scaling applied to each variable)

# the following functions report similar output
# summary(pca) # summary gives sd
get_eig(pca) # (from the factoextra package) get_eig gives the eigenvalues (variances)
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  6.018952657       40.1263510                    40.12635
## Dim.2  2.801847026       18.6789802                    58.80533
## Dim.3  2.004944334       13.3662956                    72.17163
## Dim.4  1.162207801        7.7480520                    79.91968
## Dim.5  0.958298972        6.3886598                    86.30834
## Dim.6  0.553193900        3.6879593                    89.99630
## Dim.7  0.321818687        2.1454579                    92.14176
## Dim.8  0.307401270        2.0493418                    94.19110
## Dim.9  0.235155292        1.5677019                    95.75880
## Dim.10 0.199880931        1.3325395                    97.09134
## Dim.11 0.175685403        1.1712360                    98.26258
## Dim.12 0.128190107        0.8546007                    99.11718
## Dim.13 0.069341691        0.4622779                    99.57945
## Dim.14 0.058467765        0.3897851                    99.96924
## Dim.15 0.004614165        0.0307611                   100.00000
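
The same table can be reproduced in base R, since the eigenvalues are the squared standard deviations of the PCs (eig is an illustrative name):

# reproduce the eigenvalue table in base R from the PC standard deviations
eig <- pca$sdev^2
head(cbind(eigenvalue = eig,
           variance.percent = 100 * eig / sum(eig),
           cumulative.variance.percent = cumsum(100 * eig / sum(eig))))
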
fviz_pca_biplot(pca, repel = T, # Avoid text overlapping
                col.var = "red", # Variables color
                col.ind = "blue")  # Individuals color

A biplot is a very informative plot for PCA. In blue, each individual (observation; here, a state) is plotted on the first two PCs. The shorter the distance between two observations, the more similar they are (e.g., 31 and 42 are very similar states, while 29 and 38 are very different). In red, each variable is plotted according to its rotation (loadings) on the first two PCs. This shows how strongly each variable influences a PC (by its loading on that PC) and how the variables correlate with each other (by the angle between two vectors). For example, Wealth and Ineq strongly influence PC1 (not PC2), but in opposite directions (they are negatively correlated). Pop and Time strongly influence PC2 and are somewhat correlated. Po1 and Po2 influence both PC1 and PC2 and are highly correlated with each other.
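
These readings can be sanity-checked directly against the data:

# sanity-check the correlations suggested by the biplot
cor(X$Wealth, X$Ineq) # strongly negative, matching the opposite directions along PC1
cor(X$Po1, X$Po2)     # close to 1, matching the nearly overlapping vectors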

fviz_eig(pca,addlabels = TRUE) # scree plot to visualize the variance explained by each component

The scree plot above shows that the first PC explains 40.13% of the variance, the second PC explains 18.68%, and so on.

From here, we have several options to decide how many PCs to keep. According to the Kaiser-Guttman criterion (https://second.wiki/wiki/kaiser-guttman-kriterium), we keep only the PCs whose eigenvalue is greater than 1, which would give four. Alternatively, we could pick the elbow point subjectively from the scree plot. A third way is to use cross validation, which I detail in part 3.3. My conclusion is to keep the first five PCs based on the cross validation results; cumulatively they explain 86.31% of the variance.
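
The Kaiser-Guttman count can be confirmed in one line (for prcomp, the eigenvalues are the squared component standard deviations):

sum(pca$sdev^2 > 1) # number of PCs with eigenvalue greater than 1: 4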

2. Regress the response variable on the selected PCs as covariates, using linear regression to get a vector of estimated coefficients (with a dimension equal to the number of selected PCs)
pc5data <- data.frame(cbind(pca$x[,1:5], y)) # make a data frame with the first five PCs and the response variable
colnames(pc5data)[6] <- 'Crime' # rename the last column
pc5lr <- lm(Crime ~ ., data = pc5data) # perform linear regression
summary(pc5lr)
## 
## Call:
## lm(formula = Crime ~ ., data = pc5data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.79 -185.01   12.21  146.24  447.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      35.59  25.428  < 2e-16 ***
## PC1            65.22      14.67   4.447 6.51e-05 ***
## PC2           -70.08      21.49  -3.261  0.00224 ** 
## PC3            25.19      25.41   0.992  0.32725    
## PC4            69.45      33.37   2.081  0.04374 *  
## PC5          -229.04      36.75  -6.232 2.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared:  0.6452, Adjusted R-squared:  0.6019 
## F-statistic: 14.91 on 5 and 41 DF,  p-value: 2.446e-08
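
A property worth noting before moving on: the PC scores from prcomp are centered and mutually orthogonal, so each coefficient is estimated independently, and dropping or adding PCs leaves the remaining estimates unchanged. A minimal check (pc4lr is an illustrative name):

# the PC scores are centered and orthogonal, so a regression on the first
# four PCs alone reproduces the corresponding coefficients of pc5lr
pc4lr <- lm(Crime ~ ., data = pc5data[, c(1:4, 6)])
coef(pc4lr) # matches the intercept and PC1-PC4 slopes above
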
3. Predict Crime for the new city

3.1 Transform the coefficients for the PCs back to the scale of the original variables (with a dimension equal to the total number of variables)

V <- as.matrix(pca$rotation[,1:5]) # the first five eigenvectors as the transformation matrix
b_T <- as.matrix(pc5lr$coefficients[2:6]) # coefficients for the PCs (not including intercept)

# equation (1): we get the coefficients for the explanatory variables, but they are still on the normalized scale
b <- V%*%b_T; b # use %*% for matrix multiplication
##              [,1]
## M       60.794349
## So      37.848243
## Ed      19.947757
## Po1    117.344887
## Po2    111.450787
## LF      76.254902
## M.F    108.126558
## Pop     58.880237
## NW      98.071790
## U1       2.866783
## U2      32.345508
## Wealth  35.933362
## Ineq    22.103697
## Prob   -34.640264
## Time    27.205022
# equation (6): coefficients for the variables on the original scale
beta <- b/pca$scale; beta
##                 [,1]
## M       4.837374e+01
## So      7.901922e+01
## Ed      1.783120e+01
## Po1     3.948484e+01
## Po2     3.985892e+01
## LF      1.886946e+03
## M.F     3.669366e+01
## Pop     1.546583e+00
## NW      9.537384e+00
## U1      1.590115e+02
## U2      3.829933e+01
## Wealth  3.724014e-02
## Ineq    5.540321e+00
## Prob   -1.523521e+03
## Time    3.838779e+00
# equation (7): intercept on the original scale
beta_0 <- pc5lr$coefficients[1] - pca$center %*% beta; beta_0
##           [,1]
## [1,] -5933.837

Above, beta and beta_0 are the coefficients (slopes and intercept) in terms of the original variables, as the question asked. Now we can predict the new city's Crime with the regression equation \(\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_px_p\), shown below.

test_point <- data.frame('M' = 14.0, 'So' = 0, 'Ed' = 10.0, 'Po1' = 12.0, 'Po2' = 15.5,
                         'LF' = 0.640, 'M.F' = 94.0, 'Pop' = 150, 'NW' = 1.1, 'U1' = 0.120,
                         'U2' = 3.6, 'Wealth' = 3200, 'Ineq' = 20.1, 'Prob' = 0.04, 'Time' = 39.0)

# predict Crime for the new city using the regression equation
as.matrix(test_point) %*% beta + beta_0
##          [,1]
## [1,] 1388.926

3.2 (alternative method to 3.1) PCA-transform the test point into the selected PC coordinates and use the coefficients directly from the regression output to predict

transformed_test_point <- predict(pca,test_point) # project test point onto the PC coordinates

# alternatively, this step can be done by manually scaling the test point and applying the transformation matrix (PCA loadings)
scaled_test_point <- (test_point - pca$center)/pca$scale # scale the test point manually
transformed_test_point == as.matrix(scaled_test_point) %*% as.matrix(pca$rotation) # matches the first way
##       PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 PC11 PC12 PC13 PC14 PC15
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# predict Crime for the new city, result is the same as in 3.1
predict(pc5lr,data.frame(transformed_test_point))
##        1 
## 1388.926
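
Note that the exact == comparison above returns TRUE only because both routes perform the same arithmetic; for floating-point results, a tolerance-based comparison such as all.equal() is generally safer:

# tolerance-based comparison, safer than == for floating-point results
all.equal(as.numeric(transformed_test_point),
          as.numeric(as.matrix(scaled_test_point) %*% as.matrix(pca$rotation)))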

3.3 (another alternative method) use the pcr() function in the pls package to perform PCA and linear regression in one step, with no coordinate transformation needed

library(pls)
## Warning: package 'pls' was built under R version 4.1.3
set.seed(1)

# PCA and linear regression in one step with a 5-fold CV
pcr_cv <- pcr(Crime ~ ., data = uscrime, scale = TRUE, validation = "CV", segments = 5)
summary(pcr_cv)
## Data:    X dimension: 47 15 
##  Y dimension: 47 1
## Fit method: svdpc
## Number of components considered: 15
## 
## VALIDATION: RMSEP
## Cross-validated using 5 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           390.9    362.5    348.7    363.2    354.9    253.2    258.0
## adjCV        390.9    360.8    346.5    359.3    355.2    249.7    253.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       255.4    267.2    270.3     280.7     279.6     254.7     257.1
## adjCV    247.1    261.6    265.2     274.1     276.7     247.1     251.8
##        14 comps  15 comps
## CV        255.5     256.7
## adjCV     245.8     246.6
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X        40.13    58.81    72.17    79.92    86.31    90.00    92.14    94.19
## Crime    17.11    26.31    27.16    30.91    64.52    65.86    68.82    68.99
##        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X        95.76     97.09     98.26     99.12     99.58     99.97    100.00
## Crime    69.20     69.63     69.74     76.93     77.24     79.11     80.31
# predict the new city's Crime with the first 5 PCs; the result is the same as above
predict(pcr_cv, test_point, ncomp = 5)
## , , 5 comps
## 
##      Crime
## 1 1388.926
# extract metrics from CV and plot them against the number of PCs
RMSEP(pcr_cv,estimate ='CV')
## (Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
##       390.9        362.5        348.7        363.2        354.9        253.2  
##     6 comps      7 comps      8 comps      9 comps     10 comps     11 comps  
##       258.0        255.4        267.2        270.3        280.7        279.6  
##    12 comps     13 comps     14 comps     15 comps  
##       254.7        257.1        255.5        256.7
rmse1 <- RMSEP(pcr_cv,estimate ='CV')$val[6] # RMSE using 5 PCs for linear regression, to be compared with another model later
validationplot(pcr_cv, val.type="RMSEP", cex.axis=0.7)

Using RMSEP (Root Mean Square Error of Prediction) as the metric for CV, we can see that five PCs give the smallest error (CV RMSEP = 253.2). That is why I chose to keep five PCs in parts 1 and 2.
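
The minimizing component count can also be read off programmatically (cv_rmsep is an illustrative name):

# confirm which number of components minimizes the CV RMSEP
cv_rmsep <- drop(RMSEP(pcr_cv, estimate = 'CV')$val) # named vector, intercept-only model first
which.min(cv_rmsep[-1]) # drop the intercept-only entry; the minimum is at 5 comps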

References:
- https://en.wikipedia.org/wiki/Principal_component_regression (the principle and details of the method)
- https://learnche.org/pid/latent-variable-modelling/principal-components-regression (math and the general procedures)
- https://jbhender.github.io/Stats506/F17/Projects/G18.html (demonstration of the pls and factoextra packages)
- https://rpubs.com/esobolewska/pcr-step-by-step (math and cross validation)