The Long Tail of Medical Data

By Thijs Kooi, Merantix

The most commonly used word in the English language is 'the', accounting for about 7% of all words in an average text. A machine learning model is trained on a set of examples, and ideally we want this set to comprise all possible examples that are out there. Unfortunately, the data we have to train the model is often limited. Because there are thousands of words, many occurring only once, it is impossible for a sample smaller than the dictionary to contain them all.

Now let's assume we have two such sets: a set of English words and a set of Japanese words, and we want to train a model to predict which language a word is from: a 'word in, class label out' task. Feeding words directly to the algorithm is not possible, because our model works with numbers. Instead, we represent each word by numeric features, for example the number of syllables in the word (Japanese words tend to have short syllables). We can plot the words as points in a two-dimensional space, train our model on these points, and plot the discriminant produced by the model, which is supposed to separate the two classes. However, when we add the rare word 'onomatopoeia', which has a relatively large number of syllables, we see it ends up on the wrong side of the discriminant.

A solution, of course, is to collect more training data until we have examples of all possible words in a language, which, for language, is possible in theory and perhaps even in practice. For images this is far harder: although the distribution of possible images may not follow the same power law as language, the 'long tail' of rare cases is often present. In academia, these rare samples can simply be brushed aside, and in some industrial settings, such as face detection in commodity cameras, they may not cause major problems either. In fields such as medicine, however, where we make predictions such as whether a patient has cancer or not, and human lives depend on accurate predictions, we cannot simply ignore the problem of rare samples.
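The 'word in, class label out' task above can be sketched in a few lines of code. The tiny word lists, the hand-counted syllable numbers, and the nearest-centroid rule below are all illustrative assumptions standing in for whatever discriminant the model in the article learns; the point is only to show how a rare, many-syllabled word lands on the wrong side.

```python
# Classify words as English or Japanese from two numeric features:
# the number of letters and the number of syllables.
import math

# (letters, syllables) per word -- hand-counted, illustrative only
train = {
    "english": [(3, 1), (5, 1), (5, 2), (8, 3)],    # cat, house, water, computer
    "japanese": [(8, 4), (6, 3), (8, 4), (10, 5)],  # arigatou, sakura, tabemono, konnichiwa
}

def centroid(points):
    """Mean feature vector of one class."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

centroids = {label: centroid(pts) for label, pts in train.items()}

def classify(features):
    """Assign the class whose centroid is nearest in feature space."""
    return min(centroids, key=lambda c: math.dist(centroids[c], features))

# A typical short English word lands on the English side...
print(classify((3, 1)))   # e.g. 'dog': prints 'english'
# ...but 'onomatopoeia' (12 letters, 6 syllables) has so many
# syllables that it falls on the Japanese side of the boundary.
print(classify((12, 6)))  # prints 'japanese'
```

A nearest-centroid rule is the simplest possible linear discriminant, which is enough here: the failure mode does not depend on the classifier, only on the rare word lying far outside the region the training data covers.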
Experts in both health care and artificial intelligence are starting to realize the largely untapped amounts of medical data and the immense potential machine learning can have in medicine. Unfortunately, just as in the language example, we will always have a limited amount of training data, sampled from an incredibly large space of possible anomalies in images. This means many rare samples are, almost by definition, not included in the training set.

We see two types of rare samples:

Rare 'within-class' samples

In the language example above, we tried to group words into two classes: Japanese and English. As shown in the plot of our data, rare samples within these classes can deviate so much from the data distribution that they end up in areas of the feature space for which the model has not been optimized.

The most common sign of cancer on mammograms are masses: white blobs that are typically quite small, on the order of one to three centimeters. Occasionally, however, much larger masses occur. If a model is trained on a random sample of screening data, it is unlikely that any of these large masses are found in this limited sample.

Rare classes

If we generate a random sample of words from all possible languages, rare languages are likely to be highly underrepresented. Similarly, in mammography, masses are a common but not the only sign of breast cancer.
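The claim that a limited random sample will usually miss rare cases can be made concrete with a little arithmetic. The 0.5% prevalence used below is an illustrative assumption, not a figure from the article; the sketch only shows how slowly the chance of missing a rare case shrinks as the training set grows.

```python
# If a rare case occurs with a given prevalence, the probability that
# an i.i.d. random sample of a given size contains no example of it
# is (1 - prevalence) ** sample_size.
def prob_rare_case_missed(prevalence: float, sample_size: int) -> float:
    """Chance that none of the drawn examples is the rare case."""
    return (1.0 - prevalence) ** sample_size

# With an assumed 0.5% prevalence, even a sample of 100 cases misses
# the rare case more often than not.
for n in (100, 1000, 10000):
    print(n, round(prob_rare_case_missed(0.005, n), 4))
```

This is why "just collect more data" works so slowly on the long tail: to halve the chance of missing a case that is ten times rarer, the sample must grow roughly tenfold.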