Use Embeddings to Predict Therapeutic Area of Clinical Studies

For example, here are the concepts closest to headache or hepatitis:

[Figure: Most similar concepts]

Connecting Studies to Therapeutic Areas with Concept Embeddings

We will connect a clinical study to therapeutic areas with concept embeddings by following this path:

1. Get the study MeSH terms (Medical Subject Headings) from the browse_conditions.txt file from AACT
2. Convert the MeSH terms to their unique identifiers (a.k.a. codes) using a file from the National Library of Medicine (NLM)
3. Load the UMLS concept unique identifiers (CUIs) and associate them with MeSH codes using UMLS files
4. Manually associate therapeutic areas with UMLS CUIs
5. Use the pre-trained embeddings to find the area whose CUIs (found in step 4) are the most similar to those of the study (found in step 3)

The set of data sources and transformations is summarized in this diagram:

[Figure: data sources and transformations]

1. Get MeSH Terms Associated to a Study

The CTTI offers very well organized and documented study data, along with associated data, all of which can be downloaded from the CTTI download site.

Here's the subset of the data we're interested in:

[Figure: schema diagram provided by CTTI]

Studies are in the file studies.txt and are linked to MeSH terms in browse_conditions.txt.

Here's an example of the MeSH terms for study NCT01007578, about atherosclerosis:

    grep NCT01007578 ./clinical-trials-gov/browse_conditions.txt
    1508569|NCT01007578|Atherosclerosis|atherosclerosis
    1508570|NCT01007578|Peripheral Arterial Disease|peripheral arterial disease
    1508571|NCT01007578|Peripheral Vascular Diseases|peripheral vascular diseases
    1508572|NCT01007578|Arterial Occlusive Diseases|arterial occlusive diseases

It's easy to load those rows and build a dictionary from each study ID to its set of MeSH terms. load_df() is a small wrapper function around pandas' read_csv(), and the dictionary is then built from the resulting data frame.

2. Convert the MeSH Terms to their Unique Identifiers

The relevant MeSH file can be downloaded from the National Library of Medicine. Parsing details are in the notebook.

[Code: Build MeSH terms to MeSH codes dictionary]

3. Load UMLS Concepts

We'll now load the Concepts and Sources file (MRCONSO.RRF), which comes with the huge set of UMLS files. Its format is documented by the NLM.

Before loading this pipe-delimited file as a pandas data frame, we can cut its size (7,148,656 rows, 897 MB) roughly in half by retaining only the columns we need.

[Code: Reduce Concepts File Size]

The short load_conso() function loads the reduced file into a data frame.
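As a rough illustration of this step, here is a minimal sketch of loading only a few MRCONSO.RRF columns and grouping CUIs by MeSH code. It is not the author's load_conso(). The function names load_conso_sketch() and mesh_code_to_cuis(), the choice of kept columns, and the sample rows (made-up identifiers laid out like MRCONSO.RRF) are all assumptions for illustration. It also selects columns while reading rather than writing a reduced file first. The column order and the "MSH" source abbreviation for MeSH follow the UMLS documentation.

```python
import csv
import io

import pandas as pd

# MRCONSO.RRF column names in file order. Each row ends with a trailing "|",
# so a dummy 19th name absorbs the empty last field.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
    "_TRAILING",
]

# Assumption: which columns are "the columns we need" is a guess.
KEPT_COLUMNS = ["CUI", "LAT", "SAB", "CODE", "STR"]


def load_conso_sketch(source):
    """Read an MRCONSO-style file (path or file object), keeping few columns."""
    return pd.read_csv(
        source,
        sep="|",
        header=None,
        names=MRCONSO_COLUMNS,
        usecols=KEPT_COLUMNS,
        dtype=str,
        keep_default_na=False,   # keep strings such as "NA" as text
        quoting=csv.QUOTE_NONE,  # term strings may contain quote characters
    )


def mesh_code_to_cuis(conso):
    """Map each MeSH code to the set of CUIs that English MeSH rows attach to it."""
    mesh = conso[(conso["SAB"] == "MSH") & (conso["LAT"] == "ENG")]
    return mesh.groupby("CODE")["CUI"].apply(set).to_dict()


# Hypothetical rows in MRCONSO layout (identifiers are invented, not UMLS data).
SAMPLE = (
    "C9990001|ENG|P|L1|PF|S1|Y|A1||M1|D999001|MSH|MH|D999001|Example disease|0|N||\n"
    "C9990001|FRE|P|L2|PF|S2|Y|A2||M1|D999001|MSHFRE|MH|D999001|Maladie exemple|3|N||\n"
    "C9990002|ENG|P|L3|PF|S3|Y|A3||12345||SNOMEDCT_US|PT|12345|Other concept|9|N||\n"
    "C9990003|ENG|S|L4|PF|S4|N|A4||M2|D999001|MSH|ET|D999001|Example disorder|0|N||\n"
)

conso = load_conso_sketch(io.StringIO(SAMPLE))
mapping = mesh_code_to_cuis(conso)
print(mapping)
```

With the real file, passing the path to MRCONSO.RRF instead of the in-memory sample should work the same way. Keeping a set of CUIs per code allows for the case where one MeSH code is attached to more than one concept.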
