The RMSE equation for this work is given as follows, where (n) is the number of hospital admission records, (ŷ_i) is the predicted LOS, and (y_i) is the actual LOS.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

The ultimate goal is to develop a prediction model that results in a lower RMSE than the average or median models.

There is a multitude of regression models available for predicting LOS. To determine the best regression model for this work (among the subset of models that will be evaluated), the R2 (R-squared) score will be used. R2 is a measure of the goodness of fit of a model. In other words, it is the proportion of the variance in the dependent variable that is predictable from the independent variables. R2 is defined by the following equation, where (y_i) is an observed data point, (ȳ) is the mean of the observed data, and (f_i) is the predicted model value.

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - f_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$

The best possible R2 score is 1.0, and a negative value means the model is worse than a constant model (the average or median in this case).

Data Exploration and Feature Engineering

After reviewing the contents of the various tables in the MIMIC database over several iterations, I selected the following tables and loaded them into DataFrames using Pandas: ADMISSIONS.csv, PATIENTS.csv, DIAGNOSES_ICD.csv, and ICUSTAYS.csv.

The ADMISSIONS table gives information such as SUBJECT_ID (unique patient identifier), HADM_ID (hospital admission ID), ADMITTIME (admission date/time), DISCHTIME (discharge time), DEATHTIME, and more. The table had 58,976 admission events and 46,520 unique patients, which seemed like a reasonable amount of data for a prediction model study. To start with, I created a length-of-stay column by taking the difference between the admission and discharge time for each row. I opted to drop rows that had a negative LOS since those were cases where the patient died prior to admission. Additionally, I found that 9.8% of the admission events resulted in death, so I removed these since they are not included as part of typical LOS metrics. The distribution of the LOS in days is right-skewed, with a mean of 10.13 days, a median of 6.56 days, and a maximum of 295 days.

For the admission ethnicity column, there were 30+ categories that could easily be reduced to five broader categories. Interestingly, the Asian category has the lowest median LOS in the dataset.

For religion, I reduced the list to three categories: unobtainable (13% of admissions), religious (66% of admissions), or not specified (20% of admissions). The unobtainable group has the lowest median LOS.

Hospital admission types were reduced to four categories: urgent, newborn, emergency, and elective.
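To make the RMSE and R2 definitions from earlier in this section concrete, here is a minimal pure-Python sketch that scores the two constant baselines (average and median) against a handful of LOS values. The function names `rmse` and `r2_score` and the sample numbers are assumptions made for illustration; they are not taken from the MIMIC data or from the original analysis code.

```python
import math
from statistics import mean, median


def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted LOS values."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical LOS values in days (illustrative only, not from MIMIC).
actual_los = [2.0, 4.0, 6.0, 8.0, 30.0]

# Constant baselines: always predict the average or the median LOS.
mean_baseline = [mean(actual_los)] * len(actual_los)
median_baseline = [median(actual_los)] * len(actual_los)

rmse_mean = rmse(actual_los, mean_baseline)
rmse_median = rmse(actual_los, median_baseline)
r2_mean = r2_score(actual_los, mean_baseline)
r2_median = r2_score(actual_los, median_baseline)

print(f"mean baseline:   RMSE={rmse_mean:.3f}, R2={r2_mean:.3f}")
print(f"median baseline: RMSE={rmse_median:.3f}, R2={r2_median:.3f}")
```

On this toy sample the average baseline scores an R2 of zero by construction, while the median baseline comes out slightly negative, which is why a candidate model needs a clearly positive R2 (and a lower RMSE) to count as an improvement.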
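The length-of-stay preparation described earlier in this section (compute LOS from ADMITTIME and DISCHTIME, drop negative values, drop admissions that ended in death) could look roughly like the sketch below. It uses only the standard library so it runs anywhere; the same steps map onto DataFrame operations in Pandas. The sample rows, the timestamp format, and the use of a non-empty DEATHTIME to flag in-hospital deaths are all assumptions for illustration, not details confirmed by the original analysis.

```python
from datetime import datetime

# Assumed timestamp format for ADMITTIME / DISCHTIME / DEATHTIME.
FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical admission rows (not real MIMIC records).
admissions = [
    {"HADM_ID": 1, "ADMITTIME": "2130-01-01 08:00:00",
     "DISCHTIME": "2130-01-05 20:00:00", "DEATHTIME": None},
    # Discharge before admission: yields a negative LOS.
    {"HADM_ID": 2, "ADMITTIME": "2130-02-10 12:00:00",
     "DISCHTIME": "2130-02-10 06:00:00", "DEATHTIME": "2130-02-10 06:00:00"},
    # Positive LOS, but the admission ended in death.
    {"HADM_ID": 3, "ADMITTIME": "2130-04-01 10:00:00",
     "DISCHTIME": "2130-04-03 10:00:00", "DEATHTIME": "2130-04-03 10:00:00"},
    {"HADM_ID": 4, "ADMITTIME": "2130-03-01 00:00:00",
     "DISCHTIME": "2130-03-02 12:00:00", "DEATHTIME": None},
]


def los_days(row):
    """Length of stay in days: discharge time minus admission time."""
    admit = datetime.strptime(row["ADMITTIME"], FMT)
    disch = datetime.strptime(row["DISCHTIME"], FMT)
    return (disch - admit).total_seconds() / 86400


# Add the LOS column to every row.
for row in admissions:
    row["LOS"] = los_days(row)

# Keep rows with a non-negative LOS and no recorded death
# (assumption: a missing DEATHTIME means the patient survived).
kept = [r for r in admissions if r["LOS"] >= 0 and r["DEATHTIME"] is None]

for r in kept:
    print(r["HADM_ID"], r["LOS"])
```

Filtering on LOS first and on death second mirrors the order in the text, although the two conditions are independent and could be applied in either order.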