Applying Logistic Regression to PubMed

Since MeSH is often used to describe the subject content of journal articles in PubMed, we can use an article's MeSH terms as a good indicator of whether it is about Musculoskeletal Diseases. We could collect the MeSH terms related to Musculoskeletal Diseases manually from the official MeSH web site, but there is an easier solution. The NLM site provides an ASCII file called d2019.bin with a complete list of MeSH terms. The structure of the file is fairly difficult to work with. Here is what one MeSH record looks like:

[Figure: MeSH record]

All we need to do is collect all the MeSH IDs that belong to the desired section of the MeSH tree, which is C05. To achieve this, we will look at all the records in the vocabulary and fetch the MeSH ID (denoted as “UI”) of every record whose tree position (denoted as “MN”) starts with C05. Beware that some MeSH terms might have several tree positions, so each MN value of a record needs to be checked.

[Figure: MeSH IDs related to Musculoskeletal Diseases (a fraction of the full list)]

Okay, now we are ready to label our dataset. Assuming that our articles are already stored in a database (let’s say MongoDB), it’s easy to label them by checking whether an article has MeSH IDs that belong to the newly collected list of C05 MeSH IDs. An efficient approach to this task is to use Celery to process the work in parallel.

Now comes the exciting part

To classify a given abstract into one of the two categories, we are going to build a Logistic Regression model. But first, we need to change the representation of the abstracts. We will use the Universal Sentence Encoder to encode each abstract into a high-dimensional vector. The Universal Sentence Encoder is pre-trained on a large corpus and can be used in a variety of tasks (sentiment analysis, classification and so on). The model takes a word, a sentence or a paragraph as input and outputs a 512-dimensional vector.

Finally, we will use Python’s scikit-learn library to fit the Logistic Regression model and make predictions.

[Figure: Confusion matrix]

Looks like we’ve done a pretty good job.
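The MeSH ID collection and labeling steps described above can be sketched roughly as follows. This is a minimal sketch, not the original code: it assumes the file is laid out as records that each begin with a `*NEWRECORD` line followed by `KEY = value` lines, and the function names (`parse_records`, `collect_ids`, `is_musculoskeletal`) and the two sample records are hypothetical, made up purely for illustration.

```python
def parse_records(lines):
    """Yield one dict per record, mapping a field name to a list of its values."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if line == "*NEWRECORD":
            if record:
                yield record
            record = {}
        elif " = " in line:
            key, value = line.split(" = ", 1)
            # A field such as MN can repeat within a record, so keep every value.
            record.setdefault(key, []).append(value)
    if record:
        yield record


def collect_ids(lines, prefix="C05"):
    """Return the UI of every record that has at least one MN starting with prefix."""
    ids = set()
    for record in parse_records(lines):
        if any(mn.startswith(prefix) for mn in record.get("MN", [])):
            ids.update(record.get("UI", []))
    return ids


def is_musculoskeletal(article_mesh_ids, c05_ids):
    """Label an article positive if it shares any MeSH ID with the C05 set."""
    return not c05_ids.isdisjoint(article_mesh_ids)


# Hypothetical records in the assumed layout (not real MeSH entries).
SAMPLE = """\
*NEWRECORD
MH = Example Bone Term
MN = C17.300
MN = C05.116
UI = D900001

*NEWRECORD
MH = Example Chemical
MN = D03.633
UI = D900002
"""

c05_ids = collect_ids(SAMPLE.splitlines())
print(c05_ids)
print(is_musculoskeletal(["D900002", "D900001"], c05_ids))
```

With the real file, the same function could be fed an open file handle, for example `collect_ids(open("d2019.bin", encoding="utf-8"))`. The encoding is an assumption here. Because `is_musculoskeletal` is a pure function, it could be wrapped in a Celery task without changes.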
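The final modeling step can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions rather than the original pipeline: the 512-dimensional inputs are synthetic random vectors standing in for Universal Sentence Encoder embeddings (so the example runs without downloading the encoder), and the class sizes, split ratio and `max_iter` value are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, dim = 200, 512

# Stand-in data: synthetic vectors, NOT real abstract embeddings.
# Class 1 is shifted slightly so that the two classes are separable.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, dim)),  # label 0: other articles
    rng.normal(0.3, 1.0, size=(n_per_class, dim)),  # label 1: C05 articles
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Hold out a quarter of the data for evaluation, keeping classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

In a real run, `X` would hold one encoder output per abstract and `y` the labels derived from the C05 MeSH ID check.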
