Using Machine Learning Models for Breast Cancer Detection

Cancer is one of the deadliest diseases in the world, taking the lives of millions of people every single year, and we still haven’t been able to find a cure for it. By merging the power of artificial intelligence and human intelligence, we may be able to optimize the cancer treatment process step by step, from screening to effectively diagnosing and eradicating cancer cells!

In this article, I will discuss how we can leverage several machine learning models to obtain higher accuracy in breast cancer detection. ... Let’s see how it works!

Phase 1: Preparing Data

First, I downloaded the breast cancer dataset from the UCI Machine Learning Repository. The dataset was created by Dr. ... For instance, 1 means that the cancer is malignant, and 0 means that the cancer is benign.

The scikit-learn library also allows us to split our data set into a training set and a test set. ...

... Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.

Now, unlike most other methods of classification, kNN falls under lazy learning. (And no, it doesn’t mean that the algorithm does nothing like chubby lazy polar bears, just in case you were like me and that was your first thought!) In actuality, what this means is that there is no explicit training phase before classification. ...

Essentially, kNN can be broken down into three main steps:

1. Compute a distance value between the item to be classified and every item in the training data set.
2. Pick the k closest data points/items.
3. Conduct a “majority vote” among the data points. ...

A small value of k means that noise will have a higher influence on the result, and a large value makes it computationally expensive.

There are many ways to compute the distance, two popular ones being Euclidean distance and cosine similarity. Euclidean distance is essentially the magnitude of the vector obtained by subtracting the training data point from the point to be classified. It can be determined using the equation
below, where x and y are the coordinates of a given data point and the subscripts 1 and 2 denote the two points being compared (assuming the data lie nicely on a 2D plane; if the data lie in a higher-dimensional space, there would just be more coordinates):

d = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2)

Another method is cosine similarity. ... Instead of explicitly computing the distance between two points, cosine similarity uses the difference in directions of two vectors A and B, using the equation:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Next, how do we find the value of k? Usually, data scientists choose k as an odd number if the number of classes is 2. Another simple approach is to set k = sqrt(n), where n is the number of samples in the training set. ...

... These transformations are called kernels.

You can see where we are going with this: overall, the objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. Intuitively, we want to find a plane that has the maximum margin, i.e., the maximum distance between data points of both classes. ... Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.

Following this intuition, I imported the algorithm from scikit-learn and achieved an accuracy rate of 96.5%.

Naïve Bayes

The Naïve Bayes algorithm is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. ...

... As the name suggests, this algorithm creates a forest from a number of trees. Before diving into a random forest, let’s think about what a single decision tree looks like!

A decision tree is drawn upside down with its root at the top. The bold text in black represents a condition/internal node, based on which the tree splits into branches/edges. The end of a branch that doesn’t split anymore is the decision/leaf, in this case, whether the passenger died or survived, represented as red and green text respectively.

Now, let’s consider the following two-dimensional data, which has one of four class labels:

A simple decision tree built on this data will iteratively
split the data along one or the other axis according to some quantitative criterion.
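To make the idea of splitting along one axis more concrete, here is a minimal sketch of how a single split could be chosen. It assumes Gini impurity as the quantitative criterion, which this article does not specify; the function names `gini` and `best_split` and the toy points are hypothetical and only there for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(points, labels):
    """Return (feature_index, threshold, weighted_gini) for the best
    axis-aligned split, i.e. the one with the lowest weighted impurity."""
    best = None
    n = len(points)
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        # Candidate thresholds sit midway between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for p, lab in zip(points, labels) if p[feature] <= threshold]
            right = [lab for p, lab in zip(points, labels) if p[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical toy data: the two classes separate cleanly along the first axis.
toy_points = [(1, 5), (2, 1), (3, 4), (7, 2), (8, 5), (9, 1)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
feature, threshold, impurity = best_split(toy_points, toy_labels)
```

A full decision tree would apply `best_split` again to each of the two resulting groups until the leaves are pure or some stopping rule is reached, and a random forest would build many such trees on random subsets of the data.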
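The three kNN steps described earlier (compute the distances, pick the k closest points, take a majority vote) can also be sketched in plain Python. This is only an illustration of the idea, not the scikit-learn implementation behind the results in this article; the function names and the toy points are made up for the example, and the vote here assumes Euclidean distance.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Magnitude of the difference vector between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(train_points, train_labels, query, k=3):
    # Step 1: distance from the query to every training point.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 0 = benign, 1 = malignant, as in the encoding above.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
```

Because there is no training phase, all of the work happens inside `knn_predict` at classification time, which is exactly what "lazy learning" refers to.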
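For readers who want to try the four models discussed in this article, below is a rough sketch using scikit-learn. It is not the exact code behind the accuracy figures quoted here. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic breast cancer data (`load_breast_cancer`), a 75/25 train/test split, feature scaling for kNN and SVM, and mostly default hyperparameters, all of which are assumptions made for this example. Note that in the bundled copy the labels are encoded as 0 = malignant and 1 = benign, the reverse of the encoding described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(test_size=0.25, seed=0):
    """Fit four classifiers on the same split and return their test accuracy."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    models = {
        # kNN and SVM are distance-based, so the features are scaled first.
        "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
        "Naive Bayes": GaussianNB(),
        "Random forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # fraction classified correctly
    return scores
```

Calling `compare_models()` returns a dictionary mapping each model name to its accuracy on the held-out test set. The exact numbers depend on the split and the hyperparameters, so they will not necessarily match the figures reported in this article.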
