Data Notes — Introduction to Statistics and Data Analysis for Sports and Baseball (3): Factor Analysis and Principal Component Analysis

Introduction to Baseball and Sabermetrics with Python in Cantonese — Factor Analysis and Principal Component Analysis

LI Wai Yin · Mar 17

Preface

Previous article: Data Notes — Introduction to Statistics and Data Analysis for Sports and Baseball (2) — WAR. Introduction to Baseball and Sabermetrics with Python in Cantonese — Wins Above Replacement (medium.com)

The previous article (already a year ago by now) covered some of the more traditional metrics for evaluating a baseball player, and at the end I even said I wanted to use those metrics to predict game results. If I weren't so lazy, this might now be a post about those prediction results; but since I never wrote it, and still can't think of how to, let's keep procrastinating. This time I'd rather write about another topic: Factor Analysis and Principal Component Analysis.

Factor Analysis and PCA

Factor Analysis (FA) and Principal Component Analysis (PCA) are both fairly old statistical methods, and both are common in social science research. In machine learning they are generally treated as preprocessing steps that reduce the dimensionality of the data, making models faster and more convenient. That is not the focus this time, though; the focus is on FA and PCA as statistical methods in their own right, and as ways for humans to look at data. At the end I'll also mention a rather fun application of PCA: image processing. Since hardly anyone reads this, I won't go into the mathematics. Briefly: when a pile of variables are correlated with one another, PCA finds linear combinations of the variables that re-explain as much of them as possible; FA instead builds common factors to re-explain the original variables, so that each common factor is formed from the more strongly correlated variables.

Types of Factor Analysis

FA's popularity as a research method probably comes down to how easy it is to write up. FA has two schools: Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). As the names say, the former is for exploration and the latter for confirmation. EFA aims to reorganize the data to uncover significant latent factors. CFA instead posits latent factors up front and checks whether those factors really are the main ones. Mathematically, the difference is that in EFA the common factors are shared across all variables, while in CFA the common factors are defined by explicit equations over chosen variables. For example, suppose we've run a marketing survey and people have answered a pile of questions. Analyst A can't think of anything better to do, so he runs EFA to see whether the collected data share any common structure. Analyst B hypothesizes that height relates to weight and that looks relate to income, so he constructs two equations:

CF1 = a1 * height + a2 * weight
CF2 = b1 * looks + b2 * income

By constructing different equations and comparing which common factors form the more reasonable combinations, he can check whether his hunch is right. I haven't done any CFA this time; if you want to play with it yourself, see:

The lavaan Project — lavaan latent variable analysis (lavaan.ugent.be)
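For reference, analyst B's model could be written in lavaan syntax roughly as below; a minimal sketch, where the data frame survey and its columns height, weight, looks and income are hypothetical names for this illustration, not from the article:

library(lavaan)

# "=~" reads "is measured by"; cfa() estimates the loadings
# (the a1, a2, b1, b2 above) and the covariance between the factors.
model <- '
  CF1 =~ height + weight
  CF2 =~ looks + income
'
fit <- cfa(model, data = survey)
summary(fit, fit.measures = TRUE)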


Helmet on first, as a disclaimer: I'm not that familiar with Factor Analysis myself, and I don't think I've ever used it formally.

Next we get into the topic for real, but before that, a small digression. There are plenty of methods similar to PCA; one that is also very commonly used in social science research is Multidimensional Scaling (MDS). What MDS does is reduce dimensionality while preserving the ordering of the original data; in other words, it works from the distances between data points. MDS also spawned another very famous non-linear dimensionality reduction algorithm: Isomap.
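As a quick illustration (my own example on a built-in dataset, not from the article), classical MDS is available in base R as cmdscale:

# Classical MDS: find a 2-D configuration of points whose pairwise
# distances approximate those of the original high-dimensional data.
d <- dist(scale(mtcars))
mds <- cmdscale(d, k = 2)
plot(mds, type = "n", xlab = "Dim 1", ylab = "Dim 2", main = "MDS of mtcars")
text(mds, labels = rownames(mtcars), cex = 0.7)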

Factor Analysis on Baseball

This time I used R rather than Python. The reason is absolutely not that the PCA assignment I did back at university was in R; nor is it that, after deciding to write on this topic, I saw a blog post doing the same thing in R; and it is even less that I wanted to try the BBC's bbplot.

bbc/bbplot — R package that helps create and export ggplot2 charts in the style used by the BBC News data team (github.com)

Once I actually used it, though, I found I didn't really know how, so the plots came out uglier than not using it at all. I'll find some time to explore it properly. And although R can be run in a Jupyter Notebook too, I personally find RStudio more comfortable.

> library(bbplot)
> library(ggplot2)
> library(psych)
> data <- read.csv("FanGraphs Leaderboard.csv", header=T, sep=",")

Computing PCA in R doesn't even require calling a library; it's already built in. By the way, if you want to try computing it once from scratch, have a look at:

Principal component analysis: the basics you should read — R software and data mining, Easy Guides. "Understanding the details of PCA requires knowledge of linear algebra. In this section, we'll explain the basics with…" (www.sthda.com)
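In that spirit, here is a minimal from-scratch sketch (my addition): PCA on standardized variables is just the eigendecomposition of their correlation matrix, so the following should match the prcomp results computed next, up to the sign of each eigenvector:

# PCA by hand: eigendecomposition of the correlation matrix.
X <- scale(data[, c("X1B", "X2B", "X3B", "HR", "RBI", "SB", "BB")])
eig <- eigen(cor(X))
sqrt(eig$values)         # PC standard deviations (should match pca$sdev)
eig$vectors              # eigenvectors (pca$rotation, up to sign)
head(X %*% eig$vectors)  # transformed data (the scores)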

> pca <- prcomp(formula=~X1B+X2B+X3B+HR+RBI+SB+BB, data=data, scale=TRUE)

Printing pca shows the standard deviations of the PCs (that is, the square roots of the eigenvalues of the correlation matrix) and the rotation (the eigenvectors of the correlation matrix); the pca object carries $rotation, $sdev, $x and so on…

> plot(pca, type="line", main="pca plot 1", col="red") + abline(h=1, col="blue")

[Figure: variance plot of the PCs from the PCA model]

Simply put, this chart shows how much of the original data's variance each additional PC can explain. It's actually easier to follow plotted the other way round, cumulatively:

> pca.var <- data.frame((pca$sdev)^2)
> colnames(pca.var)[1] <- "Variance"
> props <- pca.var / sum(pca.var)
> cum.props <- data.frame(cumsum(props))
> qplot(seq_along(cum.props$Variance), cum.props$Variance, xlab="Principal Component", ylab="Variance Explained(%)", main="Pareto Plot of PCA") + geom_hline(yintercept=0.9, col="blue")

[Figure: cumulative plot of variance explained (%)]

Now it's easy to see: with a single PC we explain roughly 40% of the original dataset; adding a second PC brings it to about 60%; and by four PCs we can explain close to 90% of the variance.
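Incidentally (my addition, not in the original post), summary() on a prcomp object prints the same information directly — each component's standard deviation, proportion of variance and cumulative proportion:

# "Importance of components" table for the fitted PCA.
summary(pca)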

Back to the plot(pca) above: it is in fact plotting the variance of each PC, so we can produce the same chart with the method below.

> plot(pca.var$Variance, type="b", xlab=NULL, ylab="Variance", main="pca plot 2", col="red") + abline(h=1, col="blue")

While I'm at it, let's try bbplot here too:

> qplot(seq_along(pca.var$Variance), pca.var$Variance, xlab="pc", ylab="Variance", main="pca plot 3") + bbc_style()

Nothing like the examples in their GitHub…

Next up is computing the transformed data. To do it yourself, it's a matrix multiplication of the data with the eigenvectors:

> data.sub <- scale(data[, c("X1B", "X2B", "X3B", "HR", "RBI", "SB", "BB")])
> pca.data.self <- data.sub %*% pca$rotation
> print(pca.data.self[1:3, 1:3])
           PC1      PC2         PC3
[1,]  2.566136 2.639637 -0.02509547
[2,]  1.827203 1.716816 -1.22368571
[3,] -1.184034 2.641246  0.53347784

But R's prcomp already computes the transformed values of the input by itself:

> pca.data <- pca$x
> print(pca.data[1:3, 1:3])
        PC1      PC2         PC3
1  2.566136 2.639637 -0.02509547
2  1.827203 1.716816 -1.22368571
3 -1.184034 2.641246  0.53347784

In theory, these two results should be identical.
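We can verify that directly (my addition); unname() strips the dimnames so that only the values are compared:

# Should return TRUE: the hand-computed scores equal prcomp's $x.
all.equal(unname(pca.data.self), unname(pca$x))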

So what are these PC1, PC2 and PC3 actually saying? We can take a look through the loading plots.

> pca.loading.1 <- pca$rotation[, 1]
> dotchart(pca.loading.1[order(pca.loading.1, decreasing = FALSE)], main="loading of pc1", xlab="loading", col="blue")

[Figure: loading plot of pc1]

You can see that PC1 is mainly made up of RBI (runs batted in), HR (home runs) and BB (walks); it should be describing a batter's run-production ability. Why are BB and HR so strongly related? Most likely because when a pitcher sees a big slugger on the other side, he usually just intentionally walks him. Judging from this, though, intentional walks don't seem to get much in the way of the sluggers producing runs.

[Figure: loading plot of pc2]

PC2 is composed of SB (stolen bases) and 1B (singles); it seems to be small-ball ability.

[Figure: loading plot of pc3]

PC3 is mainly made up of 3B (triples); triples don't seem to have much to do with the other abilities? For one thing, PCA is a linear transformation, and for another, the data considered here is fairly one-sided, so data related to 3B simply doesn't show up much.

Besides the 1-D loading plots, we can also use biplot to draw a 2-D loading plot:

> biplot(pca, choices=1:2)

No idea why biplot(pca) wouldn't let me add labels, when biplot(fa) clearly does…
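If I read ?biplot.default correctly, the observation labels go through the xlabs argument rather than labels, and biplot.prcomp forwards extra arguments on to it, so something like this should work (my untested guess, not from the article):

# Label the observation points with team names instead of row numbers.
biplot(pca, choices=1:2, xlabs=data$Team)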

From this chart we can see how the different data points (teams, in this case) are distributed over the different PCs, and at the same time roughly how each PC is composed. The ones toward the right-hand side (e.g. 13, 14, 16) are the home-run-heavy teams. Checking back against the data:

> data[order(data$HR, decreasing=TRUE), c("Team", "HR")][1:3, ]
        Team  HR
16   Yankees 267
14   Dodgers 235
13 Athletics 227

Exactly those three teams: the Yankees, the Dodgers and the Athletics. But looking a little further:

> data <- data[order(data$HR, decreasing=TRUE), ]
> data$OrderHR <- seq(1, 30)
> data <- data[order(data$BB, decreasing=TRUE), ]
> data$OrderBB <- seq(1, 30)
> data <- data[order(data$RBI, decreasing=TRUE), ]
> data$OrderRBI <- seq(1, 30)
> data <- data[order(data$X3B, decreasing=TRUE), ]
> data$OrderX3B <- seq(1, 30)
> data[order(data$RBI, decreasing=TRUE), ][c("Team", "OrderRBI", "OrderBB", "OrderHR", "OrderX3B")][1:10, ]
        Team OrderRBI OrderBB OrderHR OrderX3B
1    Red Sox        1       6       9       11
16   Yankees        2       3       1        2
22   Indians        3      12       6       26
13 Athletics        4      13       3       25
7     Astros        5       8      10       27
14   Dodgers        6       1       2        9
6    Rockies        7      20       8        3
9  Nationals        8       2      13       17
17 Cardinals        9      17      11       30
4       Cubs       10       5      22        7

You can see that while the RBI, BB and HR rankings do bear some relation to one another, it is really not all that strong. Then again, if you also look at 3B (the variable with the largest negative loading in PC1), you'll find it runs opposite to the other factors. Incidentally, 2018 was the year the Red Sox beat the Dodgers.
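As an aside (my rewrite, same result apart from tie handling), the four rank columns above can be built without repeatedly re-sorting the data frame, by calling rank() on the negated values:

# rank() of the negated column gives descending ranks directly.
data$OrderHR  <- rank(-data$HR,  ties.method = "first")
data$OrderBB  <- rank(-data$BB,  ties.method = "first")
data$OrderRBI <- rank(-data$RBI, ties.method = "first")
data$OrderX3B <- rank(-data$X3B, ties.method = "first")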

Factor Analysis

Let's play with Factor Analysis while we're at it. R itself ships with a factor analysis function, factanal, but for some reason the built-in biplot can't consume its result directly, whereas the fa function from the psych package can be plotted straight away, so I adopted psych::fa.

> fa2 <- fa(data.sub, 2, scores=TRUE)
> fa2
Factor Analysis using method =  minres
Call: fa(r = data.sub, nfactors = 2, scores = TRUE)
Standardized loadings (pattern matrix) based upon correlation matrix
      MR1   MR2   h2     u2 com
X1B -0.22  0.55 0.35  0.646 1.3
X2B  0.45  0.21 0.25  0.752 1.4
X3B -0.23  0.24 0.11  0.888 2.0
HR   0.88 -0.26 0.84  0.162 1.2
RBI  0.99  0.27 1.06 -0.062 1.2
SB   0.03  0.53 0.28  0.720 1.0
BB   0.68  0.04 0.46  0.541 1.0

(Here h2 is each variable's communality, u2 = 1 − h2 its uniqueness, and com its complexity.)

> biplot(fa2, labels=data$Team, main="biplot – fa2")

[Figure: biplot of factor analysis with number of factors = 2]

You can see that MR1 is actually quite similar to PC1: it is likewise composed of RBI, HR and BB. The representation of the magnitudes feels somewhat weaker, though; at least the biplot looks a bit off to me.

So what if we take three factors?

> fa3 <- fa(data.sub, 3, scores=TRUE)
> print(fa3$Structure)

Loadings:
       MR1    MR2    MR3
X1B -0.307  0.807 -0.509
X2B  0.437  0.220
X3B -0.355  0.413  0.841
HR   0.853 -0.134
RBI  0.974  0.341
SB          0.430
BB   0.673  0.118  0.134

> biplot(fa3, main="biplot – fa3")

[Figure: biplot of factor analysis with number of factors = 3]

This chart is honestly too small to read, so let's plot the panels separately:

> biplot(fa3, choose=c(1, 2), labels=data$Team, main="biplot – fa3 (1:2)")
> biplot(fa3, choose=c(1, 3), labels=data$Team, main="biplot – fa3 (1:3)")

[Figure: biplot of factor 1 and factor 2 from the three-factor model]
[Figure: biplot of factor 1 and factor 3 from the three-factor model]

My feeling is that PCA still comes out a bit better. Next time I can look into how to set up a factor analysis properly.

PCA on Image Processing

Right — since the above has already run far too long, I'll wrap this one up in two sentences. Back before Deep Learning was played with so much, data with too many dimensions apparently could not be thrown straight into training, so PCA and its variant Kernel PCA were used as preprocessing to reduce the dimensionality first, and training happened afterwards. An example below: the slightly creepy eigenfaces.

[Figure: Eigenfaces from AT&T Laboratories Cambridge. Source: https://en.wikipedia.org/wiki/Eigenface]
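As a tiny sketch of the idea (my own toy example; random noise stands in for real face images, and the sizes are made up): flatten each image into a row, run PCA, and reconstruct from the first k components. Reshaping a column of the rotation matrix back to the image dimensions gives one "eigenface".

# 100 fake 32x32 "images", one flattened image per row (toy data).
imgs <- matrix(runif(100 * 32 * 32), nrow = 100)
pca.img <- prcomp(imgs, center = TRUE)

k <- 20                                        # keep the first 20 components
eigenfaces <- pca.img$rotation[, 1:k]          # each column is one "eigenface"
recon <- pca.img$x[, 1:k] %*% t(eigenfaces)    # project back to pixel space
recon <- sweep(recon, 2, pca.img$center, "+")  # add the mean image back
image(matrix(recon[1, ], 32, 32), main = "reconstruction of image 1")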

More on that when there's another chance. If you're interested, go look up Eigenface on Wikipedia.

Postscript

Next time I'd like to talk about the MDS mentioned above and Structural Equation Modeling (SEM), but I also want to settle the prediction topic owed from the previous article, and there's also applying sabermetrics to football or to toys — no idea when any of that will happen. Thanks for reading.

Reference:

Although everything is already written out above, this time's code has been thrown up on Git as usual:
mlb2018 · master · Jasper Li / medium — share my code of medium articles (gitlab.com)

[1] R筆記–(7)主成份分析(2012美國職棒MLB): https://rpubs.com/skydome20/R-Note7-PCA
[2] FanGraphs Database: https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=0&season=2018&month=0&season1=2018&ind=0&team=0%2cts&rost=0&age=0&filter=&players=0
[3] Faces dataset decompositions: https://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html
[4] Eigenface: https://en.wikipedia.org/wiki/Eigenface
