Can Haberman’s Survival Data Set Help You Diagnose Cancer?

Below you can see the 1D scatter plot using the features Age and Axillary nodes.

Here you can observe that the points for short survival status mostly overlap those for long survival status, so you cannot draw any conclusion from this plot. You get a clearer picture if you plot the PDF or CDF of the data instead. Let me explain the concepts of PDF and CDF at a high level.

PDF (Probability Density Function): It shows the density of the data, that is, how many data points lie around a given value. The PDF has a high peak where many data points are present and is flat or has a small peak where few are present. It is a smooth curve drawn through the tops of the histogram bars.

CDF (Cumulative Distribution Function): It is the cumulative form of the PDF, i.e. at every point it plots the running total of the PDF up to that point.

The Seaborn library helps you plot the PDF and CDF of any data so that you can easily visualise the density of the data around a specific point. The code snippet below plots the PDF.

Let's plot the PDF of each feature and see which one separates the two classes best.

PDF of Age

Observation: In the plot above, across the age range 30–75 the densities for survival and death are about the same, so this feature alone cannot predict anything.

PDF of Operation Age

Observation: Similarly, we cannot predict anything from these histograms because the density is about equal at every point. The PDFs of the two classes even overlap each other.

PDF of Axillary Nodes

Observation: People tend to survive longer if fewer axillary nodes are detected, and vice versa. It is still hard to classify, but this is the best feature to choose among all three. So I accept the PDF of Axillary nodes and conclude the following:

if (AxillaryNodes ≤ 0)
    Patient = Long survival
else if (AxillaryNodes > 0 && AxillaryNodes ≤ 3.5 (approx.))
    Patient = Long survival chances are high
else if (AxillaryNodes > 3.5)
    Patient = Short survival

So from the PDF above we
can estimate a patient's survival status, but we cannot say exactly what percentage of patients fall into short or long survival. For that we have another distribution, the CDF. The CDF is the cumulative plot of the PDF, so you can read off the percentage of patients in each survival status directly.

Let's plot the CDF for our selected feature, Axillary nodes.

The code above gives the CDF for the long survival status. Here we only use the cumsum function from NumPy, which cumulatively sums the PDF of that feature. The CDF of the long survival status is shown in the plot in orange.

From the CDF above you can observe that, according to the orange line, about 85% of the long-survival patients have fewer than 5 axillary nodes detected. As the number of axillary nodes increases, the chance of survival goes down: 80% to 85% of the people who survive long have few axillary nodes, and the curve reaches 100% only at around 40 nodes, so almost no long-survival patient has more nodes than that.

Let's plot the CDF for both classes in a single plot. To do so, add the code below to the existing code written for Long Survival.

The image below shows the CDF for short survival as a red line.

In the combined CDF above, the observation for long survival is the same, but for short survival only about 55% of patients have fewer than 5 nodes, and the curve reaches nearly 100% only at around 40 nodes.

We can also describe a patient's status with summary statistics such as the mean and the standard deviation. The mean is the average of all the data, and the standard deviation is the spread of the data, i.e. how widely the values are spread around the mean. Python's NumPy library can compute each of these in a single line.

In line 3 of the snippet I have added an outlier (a value that is very large or very small compared with the rest of the data.
It may be an error or an exceptional case that arose while collecting the data), and even so the mean is not much affected. You can observe that for long survival the mean is 2.79, and with the outlier included it is 3, which is almost the same. The mean for short survival is 7.4, which is much higher than for long survival, so patients with short survival tend to have more axillary nodes. If you look at the standard deviation, long survival has only 2.79 and short survival has 7.45, which means the spread of the data for short survival is larger.

Median, Quantiles and Percentile

You can also compute a few more statistics, such as the median, quantiles and percentiles.

The code snippet above gives you the median, quantiles and nth percentiles.

The median is the centre value of the data. Quantiles are the values of a feature at the nth percentage for n = 25, 50, 75, and the nth percentile is similar, but n can be any number from 1 to 100. For our data set, the values of these terms are as follows.

Observation: From the values above it is clear that the median number of axillary nodes is 0 for long survival and 4 for short survival.
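
To make the steps above concrete, here is a minimal sketch of how the PDF, the CDF and the summary statistics can be computed with NumPy. It is not the original code of this post. The two arrays are small made-up samples (assumptions for illustration, not the real Haberman values), and the helper names `pdf_and_cdf` and `summarize` are hypothetical.

```python
import numpy as np

# Hypothetical axillary-node counts, NOT the real Haberman values.
long_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 15])
short_nodes = np.array([0, 1, 3, 4, 5, 9, 13, 21, 23, 35])


def pdf_and_cdf(values, bins=10):
    """Return the bin edges, the per-bin probability and its running sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of patients falling in each bin
    cdf = np.cumsum(pdf)          # cumulative fraction up to each bin's right edge
    return bin_edges, pdf, cdf


def summarize(values):
    """Collect the summary statistics discussed in the post."""
    return {
        "mean": float(np.mean(values)),
        "std": float(np.std(values)),
        "median": float(np.median(values)),
        "quantiles": np.percentile(values, [25, 50, 75]),
        "p90": float(np.percentile(values, 90)),
    }


for label, nodes in (("long", long_nodes), ("short", short_nodes)):
    edges, pdf, cdf = pdf_and_cdf(nodes)
    print(label, "CDF:", np.round(cdf, 2))
    print(label, "summary:", summarize(nodes))
```

To draw the curves, you could pass `edges[1:]` and `cdf` to Matplotlib's `plt.plot`, once per class, so that both CDFs appear in the same figure. The last CDF value is always 1 because the per-bin fractions sum to the whole sample.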
