¾ 技術文 — 用 R 整 Heat Map

¾ 技術文 — 用 R 整 Heat MapEricBlockedUnblockFollowFollowingApr 2這兩天 華田 Watin 、Ellan Ou 分別先後寫了以 Excel、Python 製作 Heatmap 的技術文,而 Medium 華文圈似乎有個以文會友的風氣,趁還未有人談 R,且容不才湊熱鬧 (留名等人講吓 Power BI XD)。很多初學者會問究竟學 Python 還是 R 好,實在視乎用來做甚麼,不過(暫時)依然覺得 R 的數據視覺化比較美觀 (當然 Python 已改善了很多)。1.前期設定及讀取數據跟隊用同樣的原始數據 (https://www.

censtatd.

gov.

hk/fd.

jsp?file=B10500142018AN18B01.

xlsx&product_id=B1050014&lang=1),雖然有個 csv 版本,不過照用 Excel 檔示範吧。日常工作的話,Excel 檔就真的可免就免,實在太慢了,而且行數又有限,數據多點就會打爆個 Excel 檔。而要在 R 讀取 Excel 檔,或許需要安裝額外 Package,今次就試用 “xlsx” 這個 package。最頂那堆 code 會檢查是否已安裝所需 packages,並自動安裝欠奉的,然後就載入那些 packages,以及更改 working directory 至 Excel 檔所在的資料夾。由於 Excel 檔有些中文字,要順道更改 Locale 避免亂碼,最後當然是讀取 Excel 檔特定的工作表。## INSTALL THE PACKAGES (IF NECESSARY)if(“xlsx” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“xlsx”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“ggplot2” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“ggplot2”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“tidyr” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“tidyr”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“RColorBrewer” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“RColorBrewer”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“scales” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“scales”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}## LOAD THE PACKAGESlibrary(“xlsx”)library(“ggplot2”)library(“tidyr”)library(“RColorBrewer”)library(“scales”)## SET THE WORKING DIRECTORYsetwd(“D:/Blog/Medium/R/Heatmap”)## SET THE LOCALE TO CHINESESys.

setlocale(locale = “Chinese”)## ACCESS THE EXCEL FILE AND COPY THE WORKSHEET CALLED “T07”df = read.

xlsx(“B10500142018AN18B01.

xlsx”, sheetName=”T07", header=FALSE, encoding = “UTF-8”, stringsAsFactors=FALSE)2.Data Cleaning載入檔案後檢視 dataframe 內容,毫無懸念,成堆垃圾咁樣,無論你用 Python 抑或是 R 都不會改變原始數據格式並非為了方便程式讀取的這個事實,那又只好做些 Data Cleaning 的工作。誠如前文所言, Data Cleaning 往往佔上專案六成以上時間,但又實在只能頂硬上,鬼叫你窮咩。如果絕對肯定格式固定,你還可以 hard-code 一些 Data Cleaning 工序 (註:Hard-code 此舉動非常極度萬分危險,畢竟若然數據提供方更改格式而你不知曉就有請小鳯姐了,不過在此僅作展示 Data Cleaning 的最終結果何如,俾我偷吓懶啦~) 工序大致與 Ellan 兄那篇類近 (借嚟參考,實情係懶得諗),建議與該文同時服用,用文火三碗水煲埋一碗趁熱飲,則收事半功倍之效,一理通百理明,唔明隨便問。垃圾咁樣## REPLACE EMPTY CELLS WITH NA, REMOVE UNNECESSARY ROWSdf[df==” “] <- NAdf2 = df[!rowSums(is.

na(df[c(3,11,14,17)])) > 2, c(3,11,14,17)]## MERGE THE COLUMN NAMESdf2[4,2:4] = paste(df2[3,2:4],df2[4,2:4])3.抽取需要的數據把數據清理過後,自 dataframe 抽取第一個表格 (全體僱員) 的數字供製圖之用。別問我為何 Python 的 df.

iloc[8:13] 不包括 Row 13, 但 R 的 df[8:13] 又包括 Row 13,對,還有, Python Pandas dataframe 第一行的 index 是 0,但 R dataframe 第一行的 index 是 1,別數錯,當你每日同時用上數種編程語言,對住不同 syntax 不同 convention,真係頭都大晒。仲有變數名稱其實有一定限制,需要依循某些法則,在此不贅,同盡量避免有空格,在此有個反面教材,如果變數名稱有空格,其後使用它時就很麻煩了。## EXTRACT THE FIGURES OF THE FIRST TABLE (Rows 4–9)df3 = df2[4:9,]## RENAME THE COLUMNS USING ROW 1 AND THEN DELETE ROW 1colnames(df3) = df3[1, ]df3 = df3[-1, ]## CREATE FACTOR LEVELS FOR ‘Age group’ AND PRESERVE THE ORDERdf3$`Age group` = factor(df3$`Age group`, rev(as.

character(unique(df3$`Age group`))))抽取所需數據後,那就要用到 “tidyr” 這個 package去將數據由 wide form 變為 long form,那甚麼是 Wide,甚麼是 Long,我唔直接答你,用例子答你:Wide formLong formLong form 是否真的長點呢?Wide form 的話,例如對於特定一個年齡層 (Column 1),25th percentile、 50th percentile、75th percentile 各自有個 column,總共 3 個 columns,而 Long form 則是 Column 1 繼續儲存 age group,Column 2 就告訴大家這記錄是儲存甚麼資料,25th/50th/75th Percentile 其中一個,數值全放在 Column 3。至於為何要這樣做,因為下面製圖的 Function 要求 Long form。其實有其他 package 可以直接用 wide form,不過 ggplot2 較為常用,先苦後甜吧。4.用 ggplot2 製圖ggplot2 內建大量不同製圖功能,Python 與之對應的大抵是 matplotlib,段程式碼水蛇春咁長,因為加了很多自訂設定弄得(自以為的)美觀點。之前那篇關於急症室的文章都用了 ggplot2 的 geom_tile()來製圖。邊間醫院內科病房最人滿為患?簡單分析 2014–2019年香港公立醫院在服務高峰期的急症室服務及內科住院病床使用率的數據medium.

com## PLOTTING USING ggplot2ggplot(data = df_long , aes(x=Percentile, y=`Age group`, fill=Value)) + ## DEFINE THE DATA SOURCE, X, Y ## USED TO CREATE A HEATMAP geom_tile()先來看看只用預設設定會是甚麼模樣:做乜同 Ellan 兄嗰張差十萬八千里咁遠預設顏色只是深藍、淺藍,要弄得色彩繽紛點,同樣借助一些常用的 Color Palettes,這裏用了 ‘YlGnBu’,究竟是甚麼暗語?## CREATE A FUNCTION THAT INTERPOLATES A SET OF GIVEN COLORS TO CREATE A NEW COLOR PALETTEpal_c <- colorRampPalette((brewer.

pal(9, ‘YlGnBu’)), space=’Lab’)## PLOTTING USING ggplot2ggplot(data = df_long , aes(x=Percentile, y=`Age group`, fill=Value)) + ## DEFINE THE DATA SOURCE, X, Y ## USED TO CREATE A HEATMAP geom_tile() + ## DEFINE THE HOW TO MAP THE VALUE TO COLOR scale_fill_gradientn(colours = pal_c(100),labels=dollar)先看暫時成果:都有七成似了,但方格內沒有數值,那就加 (真的是打個"+"號) 句 geom_text(……),最後那數行是為了避免深藍色方格用上黑字以致看不到 (改用白字加強對比)。## CREATE A FUNCTION THAT INTERPOLATES A SET OF GIVEN COLORS TO CREATE A NEW COLOR PALETTEpal_c <- colorRampPalette((brewer.

pal(9, ‘YlGnBu’)), space=’Lab’)## PLOTTING USING ggplot2ggplot(data = df_long , aes(x=Percentile, y=`Age group`, fill=Value)) + ## DEFINE THE DATA SOURCE, X, Y ## USED TO CREATE A HEATMAP geom_tile() + ## DEFINE THE HOW TO MAP THE VALUE TO COLOR scale_fill_gradientn(colours = pal_c(100),labels=dollar)+ ## ADD THE DATA LABELS (BASED ON ‘Value’) AND CUSTOMIZE THEM ## REQUIRE THE FUNCTION scales::dollar TO ADD THE DOLLAR SIGN AND THOUSANDS SEPARATOR geom_text(size = 5, aes(label = dollar(Value), color = Value > min(df_long$Value)+(max(df_long$Value)-min(df_long$Value))/2)) + ## FINE-TUNE THE COLORS OF DATA LABELS TO IMPROVE CONTRAST scale_color_manual(guide = FALSE, values = c(“black”, “white”))最後調整一些細節,例如文字大小、Legend bar 的高度等等,就大功告成了,程式碼的 Comments 應該 self-explanatory (當年上第一堂編程課教授就說Comments 何其重要,可惜惰性實在太重,平常是是旦旦打就算,善哉善哉。)Inspired by:(由於無 Ellan 兄嗰篇咁深入淺出,如果本身唔多識 R,呢篇應該睇到一頭霧水,所以唔敢叫半技術文,暫且當住係 ¾ 技術文)半技術文:如果想用熊貓 (Pandas) 做 Heat-map 又可以點做?逐步逐步黎,但求電腦白痴都學得識medium.

comExcel 閑談 — 善用 Heat Map 成製圖高手有時候無圖更勝有圖呢medium.

com完整原始碼 (Source code):## INSTALL THE PACKAGES (IF NECESSARY)if(“xlsx” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“xlsx”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“ggplot2” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“ggplot2”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“tidyr” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“tidyr”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“RColorBrewer” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“RColorBrewer”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}if(“scales” %in% rownames(installed.

packages()) == FALSE) { install.

packages(“scales”, repos=”http://lib.

stat.

cmu.

edu/R/CRAN")}## LOAD THE PACKAGESlibrary(“xlsx”)library(“ggplot2”)library(“tidyr”)library(“RColorBrewer”)library(“scales”)## SET THE WORKING DIRECTORYsetwd(“D:/Blog/Medium/R/Heatmap”)## SET THE LOCALE TO CHINESESys.

setlocale(locale = “Chinese”)## ACESS THE EXCEL FILE AND COPY THE WORKSHEET CALLED “T07”df = read.

xlsx(“B10500142018AN18B01.

xlsx”, sheetName=”T07", header=FALSE, encoding = “UTF-8”, stringsAsFactors=FALSE)## REPLACE EMPTY CELLS WITH NA, REMOVE UNNECESSARY ROWSdf[df==” “] <- NAdf2 = df[!rowSums(is.

na(df[c(3,11,14,17)])) > 2, c(3,11,14,17)]## MERGE THE COLUMN NAMESdf2[4,2:4] = paste(df2[3,2:4],df2[4,2:4])## EXTRACT THE FIGURES OF THE FIRST TABLE (Rows 4–9)df3 = df2[4:9,]## RENAME THE COLUMNS USING ROW 1 AND THEN DELETE ROW 1colnames(df3) = df3[1, ]df3 = df3[-1, ]## CREATE FACTOR LEVELS FOR ‘Age group’ AND PRESERVE THE ORDERdf3$`Age group` = factor(df3$`Age group`, rev(as.

character(unique(df3$`Age group`))))## CONVERT FROM WIDE FORM TO LONG FORM USING tidyr::gatherdf_long = gather(df3, Percentile, Value, 2:4, factor_key=TRUE)## PRINT THE DATA IN WIDE-FORMprint(df3)## PRINT THE DATA IN LONG-FORMprint(df_long)## CHECK THE DATA TYPE IN THE DATAFRAMEstr(df_long)## CONVERT Value FROM CHARACTER TYPE TO NUMBERdf_long$Value = as.

numeric(df_long$Value)## CREATE A FUNCTION THAT INTERPOLATES A SET OF GIVEN COLORS TO CREATE A NEW COLOR PALETTEpal_c <- colorRampPalette((brewer.

pal(9, ‘YlGnBu’)), space=’Lab’)## PLOTTING USING ggplot2ggplot(data = df_long , aes(x=Percentile, y=`Age group`, fill=Value)) + ## DEFINE THE DATA SOURCE, X, Y ## USED TO CREATE A HEATMAP geom_tile() + ## DEFINE THE HOW TO MAP THE VALUE TO COLOR scale_fill_gradientn(colours = pal_c(100),labels=dollar)+ ## ADD THE DATA LABELS (BASED ON ‘Value’) AND CUSTOMIZE THEM ## REQUIRE THE FUNCTION scales::dollar TO ADD THE DOLLAR SIGN AND THOUSANDS SEPARATOR geom_text(size = 5, aes(label = dollar(Value), color = Value > min(df_long$Value)+(max(df_long$Value)-min(df_long$Value))/2)) + ## FINE-TUNE THE COLORS OF DATA LABELS TO IMPROVE CONTRAST scale_color_manual(guide = FALSE, values = c(“black”, “white”))+ ## FURTHER FINE-TUNING: CHANGE Y-AXIS TITLE, REMOVE X-AXIS TITLE, LEGEND TITLE, CHANGE FONT SIZES, ylab(“Age Group”)+ theme(axis.

title.

x = element_blank(), legend.

title = element_blank(), axis.

text = element_text(size = 13), axis.

title = element_text(size = 15), legend.

text = element_text(size = 12))+ ## RESCALE THE HEATMAP TO FIT THE SIZE OF THE BACKGROUND scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0)) + ## ADJUST THE HEIGHT OF THE COLOR BAR OF THE LEGEND guides(fill= guide_colorbar(barheight=15))## RESET TO DEFAULT LOCALESys.

setlocale().

. More details

Leave a Reply