Data Retention: Handling Data with Many Missing Values and Less Than 1000 Observations

Asel Mendis, Nov 5

The data used in the current project contains a number of diagnostic measures of type 2 diabetes in women of Pima Indian heritage, along with whether or not each individual has type 2 diabetes. ...

The variables in the dataset are:

* Pregnancies
* Glucose: the blood plasma glucose concentration after a 2-hour oral glucose tolerance test.
* BloodPressure: diastolic blood pressure (mm Hg).
* SkinThickness: skinfold thickness of the triceps (mm).
* Insulin: 2-hour serum insulin (mu U/ml).
* BMI: body mass index (kg/m squared).
* DiabetesPedigreeFunction: a function that estimates the risk of type 2 diabetes based on family history; the larger the value, the higher the risk.
* Age
* Outcome: whether the person is diagnosed with type 2 diabetes (1 = yes, 0 = no).

Preprocessing

    library(readr)
    library(tidyverse)
    library(dplyr)
    library(knitr)

Overview

    glimpse(Diabetes)

    Observations: 768
    Variables: 9
    $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4…

... The Outcome variable is converted to a factor, with its levels ordered to accommodate later analysis.

    Diabetes$Outcome <- as.factor(unlist(Diabetes$Outcome))
    Diabetes$Outcome <- factor(Diabetes$Outcome, levels = c("1", "0"), labels = c("Positive", "Negative"))
    summary(Diabetes$Outcome)

    Positive Negative
         268      500

We can see that there are almost twice as many people without diabetes as there are with diabetes. ...

For example, a woman has a zero record of pregnancies because she has not been pregnant. This is one example of how careful you have to be when preprocessing your data for missing values. ...

Insulin is such an important variable concerning diabetes, but when a variable is rife with missing values you have to do something about it. ...

To categorize the levels of glucose tolerance, we will use the following criteria:

* Hypoglycemia (low blood sugar): <2.2 mmol/L
* Normal / no diabetes: >=2.2 mmol/L and <=7.8 mmol/L
* Prediabetes (hyperglycemia / high blood sugar): >7.8 mmol/L and <=11.1 mmol/L
* Diabetes: >11.1 mmol/L

Although one of the levels says this person has diabetes, it is not a final diagnosis. ...

Multiplying the current results (recorded in mg/dL) by 0.0555 converts them to mmol/L.

    Diabetes$Glucose <- Diabetes$Glucose * 0.0555

    Diabetes$Glucose <- if_else(Diabetes$Glucose < 2.2, "Hypoglycemia",
      if_else(Diabetes$Glucose >= 2.2 & Diabetes$Glucose <= 7.8, "Normal",
        if_else(Diabetes$Glucose > 7.8 & Diabetes$Glucose <= 11.1, "Hyperglycemia", "Diabetes"))) %>%
      factor()

    list(`Test Result` = summary(Diabetes$Glucose))

    $`Test Result`
    Hyperglycemia  Hypoglycemia        Normal
              192             5           571

It appears that 74% of participants have normal glucose levels and 25% have prediabetes (hyperglycemia / high blood sugar). Only 1% have hypoglycemia (low blood sugar).

My main concern with the Hypoglycemia level is that, when I take the data through a machine learning process, the level may not be present in every fold during k-fold cross-validation. ... I cannot say for sure at this point, but it is worth noting and keeping in mind for the future.

Blood Pressure

When measuring blood pressure, two measures are used:

* Systolic: measures the pressure in blood vessels when the heart beats.
* Diastolic: measures the pressure in blood vessels when the heart rests between beats.

In this dataset, only the diastolic blood pressure is reported. ... This is also a variable with a lot of noise.

Categorical Variables (% of Outcome)

Pregnancies

    (pregnant <- table(Diabetes$Pregnancies, Diabetes$Outcome, dnn = c("Pregnant", "Outcome")))

            Outcome
    Pregnant Positive Negative
         No        38       73
         Yes      230      427

    pregnant %>% prop.table(2) %>% round(2) %>% kable(format = 'html')

         Positive Negative
    No       0.14     0.15
    Yes      0.86     0.85

It seems that having been pregnant does not necessarily increase your chances of having diabetes, as roughly the same proportion of women with and without diabetes (86% and 85%) had at least one pregnancy.

Obesity

    (bmi <- table(Diabetes$BMI, Diabetes$Outcome, dnn = c("BMI", "Outcome")))

                 Outcome
    BMI           Positive Negative
      Underweight        2       13
      Normal             7      101
      Overweight        44      136
      Obese            215      250

    bmi %>% prop.table(2) %>% round(2) %>% kable(format = 'html')

                 Positive Negative
    Underweight      0.01     0.03
    Normal           0.03     0.20
    Overweight       0.16     0.27
    Obese            0.80     0.50

Unsurprisingly, 80% of diabetic women were obese while 16% were overweight. ... Among the women who do not have diabetes, 50% were obese, 27% overweight and 20% normal.

Glucose

    (glucose <- table(Diabetes$Glucose, Diabetes$Outcome, dnn = c("Glucose Level", "Outcome")))

                  Outcome
    Glucose Level  Positive Negative
      Hyperglycemia     132       60
      Hypoglycemia        2        3
      Normal            134      437

    glucose %>% prop.table(2) %>% round(2) %>% kable(format = 'html')

                   Positive Negative
    Hyperglycemia      0.49     0.12
    Hypoglycemia       0.01     0.01
    Normal             0.50     0.87

49% of women who have diabetes were positive for hyperglycemia and 50% had normal glucose levels. ... Obviously, people with hyperglycemia are more likely to have diabetes, but the magnitude is very low according to the above table. Unsurprisingly, 87% of women without diabetes had normal glucose levels.

Final Data

    summary(Diabetes)

    Pregnancies
     No :111
     Yes:657

    Glucose
     Hyperglycemia:192
     Hypoglycemia :  5
     Normal       :571

    BMI
     Underweight: 15
     Normal     :108
     Overweight :180
     Obese      :465

    DiabetesPedigreeFunction
     Min. ...
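As a hedged alternative to the nested if_else() call used earlier for the glucose levels, the same thresholds can be sketched in base R. The helper name categorize_glucose is hypothetical and not part of the original analysis; it assumes readings in mg/dL and fixes all four levels in clinical order, so a level with no observations (such as Diabetes in this dataset) still shows up with a count of zero.

```r
# Hypothetical helper (not part of the original analysis): convert glucose
# readings from mg/dL to mmol/L and bin them with the article's thresholds.
categorize_glucose <- function(mg_dl) {
  mmol <- mg_dl * 0.0555                    # mg/dL -> mmol/L
  labels <- rep("Diabetes", length(mmol))   # default band: > 11.1 mmol/L
  # Overwrite from the highest threshold down, so each value ends up
  # in the lowest band whose upper bound it satisfies.
  labels[which(mmol <= 11.1)] <- "Hyperglycemia"
  labels[which(mmol <= 7.8)]  <- "Normal"
  labels[which(mmol < 2.2)]   <- "Hypoglycemia"
  labels[is.na(mmol)] <- NA_character_      # keep missing readings missing
  factor(labels,
         levels = c("Hypoglycemia", "Normal", "Hyperglycemia", "Diabetes"))
}

# Made-up example readings in mg/dL, chosen away from the cut-offs.
example_levels <- categorize_glucose(c(30, 100, 150, 210, NA))
table(example_levels, useNA = "ifany")
```

On the article's data, this would be applied to the raw column, for example Diabetes$Glucose <- categorize_glucose(Diabetes$Glucose), in place of both the unit conversion and the if_else() step.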
