Using Electronic Health Records to predict future diagnosis codes with Gated Recurrent Units

Unless your EHR system has uniquely identifiable Admission IDs for each patient's visit, it can be difficult to associate each patient ID with a unique Admission ID.

To demonstrate this, we deliberately created double-digit Admission IDs, one of which (Admission ID: 34) was repeated for both patients.

To avoid this, we took a precautionary step and created a hash key that is a unique combination of the first half of the unique PatientID hyphenated with the patient's specific Admission ID.
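As a quick illustration, the combined key might be built like the minimal sketch below; the DataFrame, IDs, and column names are made up for illustration, not the actual tables from Part 1.

```python
import pandas as pd

# Minimal sketch: build a key from the first half of the PatientID hyphenated
# with the AdmissionID, so the key stays unique even when AdmissionIDs repeat.
admissions = pd.DataFrame({
    "PatientID": ["A1B2C3D4E5F6", "Z9Y8X7W6V5U4"],
    "AdmissionID": [34, 34],  # deliberately repeated across both patients
})

admissions["AdmissionKey"] = (
    admissions["PatientID"].str[:6] + "-" + admissions["AdmissionID"].astype(str)
)
print(admissions[["PatientID", "AdmissionID", "AdmissionKey"]])
```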

Final Admission and Diagnosis Tables generated with fake EHR data
Admission table with artificially generated data
Diagnosis table with artificially generated data
Write tables to csv files (a minimal sketch of this step appears below)
Part 2: Pre-processing artificially generated EHR data
In this section we will demonstrate how to process the data in preparation for modeling.
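Before the pre-processing begins, the "Write tables to csv files" step referenced above might look like this minimal sketch; the placeholder DataFrames and file names stand in for the tables built in Part 1.

```python
import pandas as pd

# Placeholder tables standing in for the Admission and Diagnosis tables from Part 1.
admission_df = pd.DataFrame({"AdmissionKey": ["A1B2C3-34"], "AdmissionDate": ["2019-01-07"]})
diagnosis_df = pd.DataFrame({"AdmissionKey": ["A1B2C3-34"], "DiagnosisCode": ["D_250.00"]})

# Write both tables to csv so Part 2 can load them for pre-processing.
admission_df.to_csv("admission.csv", index=False)
diagnosis_df.to_csv("diagnosis.csv", index=False)
```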

The intent of this tutorial is to provide a detailed walkthrough of how EHR data should be pre-processed for use in RNNs with PyTorch.

The Doctor AI paper is one of the few that provides a code base for taking a detailed look at how we can build generic models that leverage temporal modeling to predict future clinical events.

However, while this highly cited paper is open source (written in Theano: https://github.com/mp2893/doctorai), it assumes quite a bit of prior knowledge from its readers.

As such, we have modernized the code for Python 3+ and provided a detailed explanation of each step, so that anyone with a computer and access to healthcare data can begin developing innovative solutions to healthcare challenges.

Important Disclaimer: This data set was artificially created with two patients in Part 1 of this series to help provide readers with a clear understanding of the basic structure of EHR data.

Please note that each EHR system is designed to meet a specific provider's needs; this is just a basic example of the data typically contained in most systems.

Additionally, it is key to note that this tutorial begins after all of the desired exclusion and inclusion criteria related to your research question have been applied.

Therefore, at this step your data should already be fully wrangled and cleaned.

Load data: A quick review of the artificial EHR data we created in Part 1.
Step 1: Create mappings of patient IDs
In this step we are going to create a dictionary that maps each patient to his or her specific visit or Admission ID.
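A minimal sketch of Step 1, with made-up rows and variable names standing in for the artificial data from Part 1:

```python
# Step 1 sketch: map each patient ID to the list of his or her admission IDs.
admissions = [
    ("Pt-001", "Pt-001-34", "2019-01-07"),
    ("Pt-001", "Pt-001-35", "2019-02-11"),
    ("Pt-002", "Pt-002-34", "2019-01-22"),
]

pidAdmMap = {}
for pid, admId, admDate in admissions:
    pidAdmMap.setdefault(pid, []).append(admId)

print(pidAdmMap)
# {'Pt-001': ['Pt-001-34', 'Pt-001-35'], 'Pt-002': ['Pt-002-34']}
```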

Step 2: Create diagnosis codes mapped to each unique patient and visit
This step, as with all subsequent steps, is important because the patient's diagnosis codes must be kept in the correct visit order.
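A sketch of Step 2, again with made-up codes and admission keys, grouping the diagnosis codes by admission so the per-visit grouping is preserved:

```python
# Step 2 sketch: group the diagnosis codes assigned during each admission (visit).
diagnoses = [
    ("Pt-001-34", "D_250.00"),
    ("Pt-001-34", "D_401.9"),
    ("Pt-001-35", "D_428.0"),
    ("Pt-002-34", "D_486"),
]

admDxMap = {}
for admId, dxCode in diagnoses:
    admDxMap.setdefault(admId, []).append(dxCode)

print(admDxMap)
```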

Step 3: Embed diagnosis codes into the patient-admission visit mapping
This step essentially adds each code assigned to the patient directly into the dictionary containing the patient-admission ID mapping and the visit date mapping (visitMap).

This gives us a list of lists of the diagnosis codes that each patient received during each visit.
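A sketch of Step 3, with small inline dictionaries standing in for the outputs of Steps 1 and 2 plus a hypothetical admDateMap of visit dates:

```python
# Step 3 sketch: for each patient, order the visits chronologically and attach
# the diagnosis codes of each visit, yielding a list of (date, [codes]) pairs.
pidAdmMap = {"Pt-001": ["Pt-001-34", "Pt-001-35"]}
admDateMap = {"Pt-001-34": "2019-01-07", "Pt-001-35": "2019-02-11"}
admDxMap = {"Pt-001-34": ["D_250.00", "D_401.9"], "Pt-001-35": ["D_428.0"]}

visitMap = {}
for pid, admIdList in pidAdmMap.items():
    sortedVisits = sorted((admDateMap[admId], admDxMap[admId]) for admId in admIdList)
    visitMap[pid] = sortedVisits

print(visitMap)
# {'Pt-001': [('2019-01-07', ['D_250.00', 'D_401.9']), ('2019-02-11', ['D_428.0'])]}
```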

Step 4a: Extract patient IDs, visit dates, and diagnoses
In this step, we will create a list of all of the diagnosis codes; this list will then be used in Step 4b to convert the code strings into integers for modeling.
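A sketch of Step 4a, flattening visitMap into parallel lists of patient IDs, visit dates, and per-visit code lists:

```python
# Step 4a sketch: extract patient IDs, visit dates, and the list of lists of
# diagnosis codes (one inner list per visit) from visitMap.
visitMap = {"Pt-001": [("2019-01-07", ["D_250.00", "D_401.9"]),
                       ("2019-02-11", ["D_428.0"])]}

pids, dates, seqs = [], [], []
for pid, visits in visitMap.items():
    pids.append(pid)
    dates.append([visitDate for visitDate, _ in visits])
    seqs.append([codes for _, codes in visits])

print(seqs)  # [[['D_250.00', 'D_401.9'], ['D_428.0']]]
```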

Step 4b: Create a dictionary of the unique diagnosis codes assigned at each visit for each unique patient
Here we need to make sure that the codes are not only converted to integers but are also kept in the unique order in which they were assigned to each unique patient.
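A sketch of Step 4b (and, at the end, the Step 6 pickling) that maps each code string to an integer while preserving the per-visit order; the variable and file names are arbitrary choices for illustration:

```python
import pickle

# Step 4b sketch: convert code strings to integer ids, preserving visit order.
seqs = [[["D_250.00", "D_401.9"], ["D_428.0"]]]  # one patient, two visits

types = {}      # code string -> integer id
newSeqs = []
for patient in seqs:
    newPatient = []
    for visit in patient:
        newVisit = []
        for code in visit:
            if code not in types:
                types[code] = len(types)
            newVisit.append(types[code])
        newPatient.append(newVisit)
    newSeqs.append(newPatient)

# Step 6 sketch: dump the integer-coded visit sequences and the code dictionary.
pickle.dump(newSeqs, open("visit_sequences.seqs", "wb"), -1)
pickle.dump(types, open("code_types.dict", "wb"), -1)
```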

Step 6: Dump the data into a pickled list of lists
Full Script
Part 3: Doctor AI PyTorch minimal implementation
We will now apply the knowledge gained from the GRUs tutorial and Part 1 of this series to a larger publicly available EHR dataset.

This study will utilize the MIMIC-III electronic health record (EHR) dataset, which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates.

This dataset is a collection of de-identified intensive care unit stays at the Beth Israel Deaconess Medical Center from June 2001 to October 2012.

Although de-identified, this EHR dataset still contains information about the patients' demographics, vital sign measurements made at the bedside (approximately one per hour), laboratory test results, billing codes, medications, caregiver notes, imaging reports, and mortality (during and after hospitalization).

Using the pre-processing methods demonstrated on the artificially generated dataset in Parts 1 and 2, we will create a companion cohort for use in this study.

Model Architecture
Doctor AI model architecture
Checking for GPU availability
This model was trained on a GPU-enabled system, which is highly recommended.
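The GPU check itself is a one-liner in PyTorch:

```python
import torch

# Use a CUDA-capable GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
```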

Load data
The pre-processed datasets will be loaded and split into training, test, and validation sets at a 75%:15%:10% ratio.
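A minimal sketch of the load-and-split step, assuming the pickled sequences produced in Part 2; the file name and shuffling scheme are illustrative, not the exact script:

```python
import pickle
import numpy as np

# Load the pickled visit sequences and split them into train/test/validation sets.
seqs = pickle.load(open("visit_sequences.seqs", "rb"))

np.random.seed(0)
idx = np.random.permutation(len(seqs))
n_train = int(0.75 * len(seqs))
n_test = int(0.15 * len(seqs))

train_set = [seqs[i] for i in idx[:n_train]]
test_set = [seqs[i] for i in idx[n_train:n_train + n_test]]
valid_set = [seqs[i] for i in idx[n_train + n_test:]]
```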

Padding the inputs
The input tensors were padded with zeros; note that the inputs are padded so that the RNN can handle variable-length inputs.

A mask was then created to provide the algorithm with information about the padding.

Note that this can also be done with PyTorch's pack_padded_sequence and pad_packed_sequence utilities (in torch.nn.utils.rnn).

However, given the nested nature of this dataset, the encoded inputs were first multi-hot encoded.

This of course creates high-dimensional, sparse inputs; however, these inputs were then projected into a lower-dimensional space using an embedding layer.
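Putting the last few points together, here is a sketch of padding, masking, multi-hot encoding, and the embedding projection; the dimensions and code ids are tiny made-up examples:

```python
import torch
import torch.nn as nn

# Sketch: pad variable-length visit sequences with zeros, build a mask, multi-hot
# encode the diagnosis codes, and project them with an embedding layer.
num_codes, emb_dim = 4, 8
patients = [[[0, 2], [1]],   # patient 1: two visits
            [[3]]]           # patient 2: one visit (will be padded)
max_visits = max(len(p) for p in patients)

x = torch.zeros(len(patients), max_visits, num_codes)   # padded multi-hot inputs
mask = torch.zeros(len(patients), max_visits)           # 1 = real visit, 0 = padding
for i, visits in enumerate(patients):
    for j, codes in enumerate(visits):
        x[i, j, codes] = 1.0
        mask[i, j] = 1.0

embed = nn.Linear(num_codes, emb_dim)   # projects sparse multi-hot vectors down
emb = torch.tanh(embed(x))              # shape: (batch, max_visits, emb_dim)
```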

GRU Class
This class contains the randomly initialized weights needed to begin calculating the hidden states of the algorithm.

Note that in the paper the author used an embedding matrix (W_emb) generated with the skip-gram algorithm, which outperformed the randomly initialized approach shown in this step.
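For intuition, a from-scratch GRU cell with randomly initialized weights might look like the sketch below; this is illustrative only, not the author's original Theano code.

```python
import torch
import torch.nn as nn

class CustomGRUCell(nn.Module):
    """Illustrative GRU cell with randomly initialized weight matrices."""

    def __init__(self, input_size, hidden_size):
        super().__init__()

        def rand_w(rows, cols):
            # small random initialization for a weight matrix
            return nn.Parameter(torch.randn(rows, cols) * 0.01)

        # update gate (z), reset gate (r), and candidate state parameters
        self.W_z, self.U_z = rand_w(input_size, hidden_size), rand_w(hidden_size, hidden_size)
        self.W_r, self.U_r = rand_w(input_size, hidden_size), rand_w(hidden_size, hidden_size)
        self.W_h, self.U_h = rand_w(input_size, hidden_size), rand_w(hidden_size, hidden_size)
        self.b_z = nn.Parameter(torch.zeros(hidden_size))
        self.b_r = nn.Parameter(torch.zeros(hidden_size))
        self.b_h = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x, h_prev):
        z = torch.sigmoid(x @ self.W_z + h_prev @ self.U_z + self.b_z)          # update gate
        r = torch.sigmoid(x @ self.W_r + h_prev @ self.U_r + self.b_r)          # reset gate
        h_tilde = torch.tanh(x @ self.W_h + (r * h_prev) @ self.U_h + self.b_h)  # candidate state
        return (1 - z) * h_prev + z * h_tilde                                    # new hidden state

cell = CustomGRUCell(input_size=8, hidden_size=16)
x_t = torch.rand(2, 8)     # embedded codes for one visit (batch of 2)
h = torch.zeros(2, 16)     # initial hidden state
h = cell(x_t, h)
```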

Custom layer for handling the two-layer GRU
The purpose of this class is to perform the initial embedding, followed by calculating the hidden states and applying dropout between the layers.
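A sketch of such a wrapper; for brevity it swaps in PyTorch's built-in nn.GRU rather than the custom cell above, and the layer sizes are made up:

```python
import torch
import torch.nn as nn

class TwoLayerGRU(nn.Module):
    """Sketch: embedding -> GRU layer 1 -> dropout -> GRU layer 2 -> output layer."""

    def __init__(self, num_codes, emb_dim, hidden_dim, dropout=0.5):
        super().__init__()
        # W_emb analogue: projects sparse multi-hot code vectors into a dense space
        self.embed = nn.Linear(num_codes, emb_dim)
        self.gru1 = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.gru2 = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, num_codes)  # per-visit logits over codes

    def forward(self, x, mask):
        # x: (batch, max_visits, num_codes) multi-hot; mask: (batch, max_visits)
        emb = torch.tanh(self.embed(x))
        h1, _ = self.gru1(emb)
        h1 = self.dropout(h1) * mask.unsqueeze(-1)   # zero out padded visits
        h2, _ = self.gru2(h1)
        h2 = self.dropout(h2) * mask.unsqueeze(-1)
        return self.out(h2)

model = TwoLayerGRU(num_codes=10, emb_dim=8, hidden_dim=16)
x = torch.rand(2, 5, 10).round()   # (batch, visits, codes) multi-hot stand-in
mask = torch.ones(2, 5)
logits = model(x, mask)            # shape: (2, 5, 10)
```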

Train model
This model is a minimal implementation of the Doctor AI algorithm created by Edward Choi; while functional, it requires significant tuning.

This will be demonstrated in a subsequent tutorial.
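To make the training step concrete, here is a minimal, self-contained loop with random stand-in tensors; the real run would iterate over mini-batches of the MIMIC-III cohort and will need the tuning mentioned above, and the optimizer and loss choices here are assumptions.

```python
import torch
import torch.nn as nn

# Random tensors stand in for the padded multi-hot inputs, mask, and next-visit targets.
batch, max_visits, num_codes, hidden = 4, 5, 10, 16
x = torch.rand(batch, max_visits, num_codes).round()
mask = torch.ones(batch, max_visits)
y = torch.rand(batch, max_visits, num_codes).round()   # next-visit codes (0/1)

gru = nn.GRU(num_codes, hidden, num_layers=2, batch_first=True, dropout=0.5)
out_layer = nn.Linear(hidden, num_codes)
optimizer = torch.optim.Adadelta(list(gru.parameters()) + list(out_layer.parameters()))
criterion = nn.BCEWithLogitsLoss(reduction="none")

for epoch in range(10):
    optimizer.zero_grad()
    h, _ = gru(x)
    logits = out_layer(h)
    # average the per-code loss per visit, then mask out padded visits
    loss = (criterion(logits, y).mean(dim=-1) * mask).sum() / mask.sum()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```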

Final Notes / Next Steps:
This should serve as starter code to get the model up and running.

As noted before, a significant amount of tuning will be required as this was built using custom classes.

We will walk through the process in a future tutorial.

References:
Doctor AI: Predicting Clinical Events via Recurrent Neural Networks (https://arxiv.org/abs/1511.05942).
