Towards Well-Being, with Data Science (part 1)

This is the description Apple provides: The Health app makes it easy to learn about your health and start reaching your goals.

It consolidates health data from iPhone, Apple Watch, and third-party apps you already use, so you can view all your progress in one convenient place.

And it recommends other helpful apps to round out your collection — making it simpler than ever to move your health forward.

Motivation

I currently have an iPhone X, which comes with the Apple Health app.

It also has the Health Records API, which you can connect to in order to extract more interesting data, but we will use a simpler method in this article.

As far as I know, Apple Health is also included on earlier iPhone models, so you are not left out if you have an older phone.

Furthermore, if you own an Apple Watch, you have additional metrics such as heart rate monitoring, ECG, fall detection, vitals, and blood pressure.

The Apple Health app comes with interactive visualizations for looking at your data, so I am mostly doing this for my own entertainment.

That said, it lacks tools for an aggregate view, so I figured I could use some Python libraries to visualize my current form.

Downloading the Data

Before you convert the data, you need to export the data file from the Apple Health app.

While there are several ways to do this, I emailed myself the data export.

To do that, just navigate within your phone: Apple Health > Health Data > Profile Icon > Export Health Data

To convert the data, I used markwk’s ‘Apple Health Extractor and Data Analysis’ tool.

See the link for more info: https://github.com/markwk/qs_ledger/tree/master/apple_health

The first script in markwk’s code converts the XML file into CSV. I ran it from within a Jupyter Notebook:

    # %run -i 'apple-health-data-parser' 'export.xml'
    %run -i '/Users/stephenhyungilkim/qs_ledger-master/apple_health/apple-health-data-parser.py' '/Users/stephenhyungilkim/apple_health_export/export.xml'

The script splits the data into several features, including: Height, BodyMass, StepCount, DistanceWalkingRunning, ActiveEnergyBurned, FlightsClimbed, and Workout data.
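To give a feel for what such a conversion involves, here is a minimal sketch using only the Python standard library. It is not markwk's script: the function names, the chosen fields, and the tiny inline XML (with made-up values) are my own assumptions, modeled on the `<Record>` elements found in an Apple Health export.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for export.xml; the records and values here are made up.
SAMPLE_XML = """<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="iPhone"
          unit="count" startDate="2019-01-01 08:00:00 -0500"
          endDate="2019-01-01 08:05:00 -0500" value="34"/>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="iPhone"
          unit="count" startDate="2019-01-01 09:00:00 -0500"
          endDate="2019-01-01 09:10:00 -0500" value="120"/>
  <Record type="HKQuantityTypeIdentifierHeight" sourceName="Health"
          unit="ft" startDate="2018-06-01 10:00:00 -0500"
          endDate="2018-06-01 10:00:00 -0500" value="5.9"/>
</HealthData>"""

PREFIX = "HKQuantityTypeIdentifier"
FIELDS = ["sourceName", "unit", "startDate", "endDate", "value"]


def records_by_type(xml_text):
    """Group <Record> attributes by their shortened type name."""
    grouped = defaultdict(list)
    root = ET.fromstring(xml_text)
    for record in root.iter("Record"):
        kind = record.get("type", "")
        if kind.startswith(PREFIX):
            kind = kind[len(PREFIX):]  # e.g. "StepCount"
        grouped[kind].append({f: record.get(f, "") for f in FIELDS})
    return dict(grouped)


def to_csv_text(rows):
    """Serialize one group of records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tables = records_by_type(SAMPLE_XML)
step_csv = to_csv_text(tables["StepCount"])
print(step_csv)
```

A real export can be very large, so a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be a better fit than loading everything at once.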


Exploring the Data

As the first step of any data analysis project, we must seek to understand the different dimensions of the data and what the files contain! Let’s start with one of the features, steps:

    steps = pd.read_csv("….csv", parse_dates=["startDate"], index_col="startDate")
    steps.describe()

figure 1

This gives us some general statistics about the data.

For example, we can see that there are 6,833 records of ‘steps’, and that the mean is 374 steps… isn’t the recommended average 10,000 steps per day though? :O

    steps.tail()

figure 2

In figure 2, we can see my most recent data before I did the export.

Some of the information we can derive is that the data source is from my iPhone ‘device’, and we also have the value.

I am starting to get defensive though, as the data seems to suggest I am very lazy.

Is it even possible to take only 34 steps in a day?

Then again, I realize that 34 steps is not the total for the day, since ‘startDate’ has multiple timestamps within the same day.

I went back to my phone to see my total steps for that day, and to my relief, it surpassed the 34 steps.

    len(steps)

This once again confirms that we have 6,833 records.

    steps.columns

figure 3

In figure 3, we can tell that there are 8 features, but for our time series use case we can keep ‘value’ and ‘startDate’ and get rid of the rest (which I will do shortly, after more exploring).

    steps.value.sum()

figure 4

Figure 4 shows us the sum of all the steps I have taken over the given range.

10,000 steps roughly equals 5 miles, so you can do the math.
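For anyone who would rather not do the math by hand, the rule of thumb works out to about 2,000 steps per mile. A tiny sketch, where the constant and the function name are my own and the ratio is only a rough approximation that varies with stride length:

```python
# Rule of thumb implied by "10,000 steps is about 5 miles"; an approximation.
STEPS_PER_MILE = 2_000


def steps_to_miles(steps, steps_per_mile=STEPS_PER_MILE):
    """Convert a step count to an approximate distance in miles."""
    return steps / steps_per_mile


print(steps_to_miles(10_000))
```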

This might seem like a lot but remember, the daily number is more important than the summation for good health.

Gotta get those steps in every day, right?

    steps.value

figure 5

Figure 5 starts to look something like a time series, where we have a date and a value.

This is promising and seems to be plottable.

I do realize that if I want to have a daily view, I would need to aggregate per day, instead of using the startDate feature.
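A minimal sketch of that per-day aggregation, assuming a dataframe indexed by ‘startDate’ like the one above. The readings here are made up, and `resample("D")` is just one way to do it (grouping on the index date would work too):

```python
import pandas as pd

# Made-up intraday readings standing in for the real step records.
readings = pd.DataFrame(
    {"value": [34, 120, 900, 50, 2500]},
    index=pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 09:00", "2019-01-01 18:30",
        "2019-01-02 07:45", "2019-01-02 12:10",
    ]),
)
readings.index.name = "startDate"

# Bucket the readings by calendar day and add them up.
daily_steps = readings["value"].resample("D").sum()
print(daily_steps)
```

With this view, a single 34-step reading no longer looks like a whole day's effort.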

We still need to get rid of the other features, as we mentioned before.

    steps_new = steps.drop(['sourceName','sourceVersion','device','type','unit','creationDate','endDate'], axis=1)

The code above drops the unnecessary features from the dataframe.

    type(steps_new)

figure 5

In order for us to work with time series, we need to modify the data accordingly.

The code above shows us what type of data we are working with.

Note that when I used ‘read_csv’, I made sure that ‘startDate’ was parsed as a datetime (not a string or any other type), and that ‘startDate’ was the index column.
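To see why those two ‘read_csv’ arguments matter, here is a small self-contained sketch. The CSV text is made up and read from memory rather than from the exported file:

```python
import io

import pandas as pd

# Made-up CSV text standing in for the exported step file.
csv_text = """startDate,value,sourceName
2019-01-01 08:00:00,34,iPhone
2019-01-01 09:00:00,120,iPhone
"""

# Default read: startDate stays a plain text column, the index is just 0, 1, ...
as_strings = pd.read_csv(io.StringIO(csv_text))

# Parsing the dates and using them as the index gives a DatetimeIndex.
as_dates = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["startDate"], index_col="startDate"
)

print(type(as_strings.index).__name__)
print(type(as_dates.index).__name__)
```

A datetime index is what makes time-based operations such as resampling by day or slicing by month possible later on.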

Other Dimensions

Since there are 6 more dimensions, I run the same code on each of them to understand the data.

Some of the features like ‘height’ are not really exciting to look at given that my height has remained the same.

For the sake of continuity, I will keep exploring the ‘steps’ dimension.


Analysis and Plotting

While there are several interesting columns, we should mainly focus on ‘value’ and ‘startDate’.

These will be the axes of our line graph, so we can see the time series representation of my ‘health’.

    …
    plt.title('How Many Steps Has Stephen Taken?')
    plt.…
    plt.xlabel('Date')

figure 6

This is an interesting first plot… We can see huge activity from October until mid-November.
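A plot along these lines could be sketched as follows. This is not necessarily the exact code behind figure 6: the daily totals are made up, the y-axis label is my own assumption, and the `Agg` backend line is only there so the sketch runs without a display (in a notebook you would drop it).

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily totals standing in for the real step data.
daily_steps = pd.Series(
    [4200, 8100, 10350, 6900],
    index=pd.date_range("2018-10-01", periods=4, freq="D"),
    name="value",
)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(daily_steps.index, daily_steps.values)
ax.set_title("How Many Steps Has Stephen Taken?")
ax.set_xlabel("Date")
ax.set_ylabel("Steps")  # assumed label
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```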

I immediately have some initial thoughts.

It was warmer during those days, and when I exported my data, I noticed that it was pulling data from three or four sources: my iPhone, my golf GPS app, my Nike run app, and my Fitbit (until it broke).

My initial thought is that I used all those devices more in those months, hence the spike.
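One way to check that hunch would be to break the steps down by source and month. A minimal sketch, assuming the ‘sourceName’ column is still available (that is, before it gets dropped); the source names and values below are made up:

```python
import pandas as pd

# Made-up records; the real sourceName strings in an export may differ.
records = pd.DataFrame(
    {
        "sourceName": ["iPhone", "Fitbit", "iPhone", "Nike Run Club", "iPhone"],
        "value": [3000, 4500, 2800, 5200, 3100],
    },
    index=pd.to_datetime(
        ["2018-09-15", "2018-10-03", "2018-10-03", "2018-11-02", "2018-12-20"]
    ),
)
records.index.name = "startDate"

# Tag each record with its calendar month, then total the steps per source.
records = records.assign(month=records.index.to_period("M"))
by_source = (
    records.groupby(["month", "sourceName"])["value"].sum().unstack(fill_value=0)
)
print(by_source)
```

If several devices recorded the same walk, their steps would be added together in the raw totals, which could by itself inflate those months.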

My Part 1 Conclusions

This concludes Part 1 for now.

In Part 2, I will delve deeper into the plot, talk about time series considerations, and apply a model called ARIMA (the Box-Jenkins method) to our data.

I will also touch on time series factors like Trend, Seasonality, Mean, and Variance.



Shoutout to my friend *WH, who inspired me to pursue this! While I do not want to disclose his name, to protect his privacy, his previous data explorations, including using Python to map where he frequents the most within New York and using Apple’s ‘Screen Time’ app to monitor his digital wellbeing, have inspired me to put data to good use in my own personal life.

