Physical Activity Monitoring Using Smartphone Sensors and Machine Learning

Paul Schimek · Jun 5

Sitting: One of the Most Dangerous Activities

Sedentary behavior has become a major public health risk around the world.

Experts tell us that a minimum amount of daily physical activity (PA) is necessary to maintain health and reduce the risk of chronic diseases such as diabetes, heart disease, and cancer.

Some researchers have suggested that sitting for long periods of time may in itself contribute to the problem, in addition to the total amount of inactivity.

A study published in 2017 found that sitting for longer than 30 minutes at a time increased mortality risk after controlling for other factors.

The total amount of sedentary time was separately a risk factor.

A follow-up study by the same team, published in January 2019, found that there was no benefit of reducing the duration of episodes of sitting unless those episodes were replaced with physical activity (of any intensity).

A study published in April 2019 found that daily sitting time increased by an hour between 2007 and 2016, to more than six and a half hours for adults and nearly eight and a half hours for adolescents.

A Multi-Sensor in Everyone’s Pocket

With the proliferation of screens of all sizes for both work and entertainment, many people find themselves sitting for much of the day.

However, the same devices can remind people to be active.

Smartphones contain numerous sensors that can be used to classify movement, including an accelerometer, gyroscope, GPS, and magnetometer (compass).

GPS location can only be used outdoors, but the others are potentially effective anywhere.

Continuously polling smartphone sensors uses a lot of power and would drain phone batteries quickly.

Therefore, a useful algorithm should be able to classify activities based on relatively infrequent polling of sensors.

Several sources indicate that smartphone gyroscopes use considerably more power than accelerometers, so they should only be used if absolutely necessary to solve the problem at hand.

A Problem of Human Activity Recognition

There is an extensive literature on using sensors from smartphones and wearable devices to detect and classify different types of human activities.

The frequently-used UCI-HAR (Human Activity Recognition) dataset, published on UCI’s machine learning repository, has accelerometer observations from 30 subjects labeled as WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, or LAYING (sic).

Although the UCI dataset and many others ask subjects to engage in many different types of activities, it is not essential to distinguish between each activity type for the proposed use case (monitoring and promotion of physical activity).

We need to be able to distinguish between activity and inactivity, and we would also like to know the intensity of the activity, such as walking compared to running.

(While some research has suggested that sitting may be worse than standing, we can only say with certainty that the risk lies in sedentary behavior.)

The goal of this project is to develop a model using sensor data from ordinary smartphones that can distinguish between (a) active and inactive states and (b) vigorous and less intense physical activity.

The Real World Dataset

There are many publicly available datasets of human activity data that could be used for model development and testing.

I used the Real World Dataset created by the University of Mannheim — Research Group Data and Web Science.

This dataset includes two additional, more vigorous activities (running and jumping), along with the same six activities as the UCI-HAR dataset.

In total the 8 activity classes are: climbing down stairs, climbing up stairs, jumping, lying, standing, sitting, running/jogging, and walking.

The researchers outfitted 15 subjects with smartphones or smartwatches in 7 body positions: chest, forearm, head, shin, thigh, upper arm, and waist.

I decided to use only the data from the “thigh” (pocket) position under the assumption that it is the most commonly used of those tested.

The dataset includes readings from the following sensors: acceleration, GPS, gyroscope, light, magnetic field, and sound level.

The subjects were asked to perform each activity for 10 minutes (except for jumping).

The sensors were recorded at 50 Hz (50 observations per second).

My research was limited to the triaxial (x, y, and z axes) data from the accelerometer and gyroscope.

Exploratory Data Analysis

I created scatterplots of the accelerometer and gyroscope data for each of the 15 subjects performing each of the 8 activities, with the x, y, and z dimensions shown in red, blue, and green.

For example, a plot of the accelerometer data for a subject who was walking looks very different from a plot of the same subject sitting.

As can be seen from these plots, the subjects were asked to wait a short time before beginning the assigned activity.

Thus the labels for the beginning and end of each activity period are frequently incorrect.

In addition, it is clear from inspecting the plots that not all subjects performed the assigned activity continuously.

For example, Subject 4 stood for several periods between bouts of running, as can be seen in the corresponding plot.

Correcting the Labels

Because the subjects did not always perform the assigned activities continuously for the entire period, it was necessary to filter the data to remove samples that would otherwise be mislabeled.

I divided the raw data into two-second windows, each with 100 observations (at 50 Hz).

I labeled each run of 100 consecutive observations in each subject/activity recording with a sample number, dropping any remainder.

I then calculated the standard deviation of the sensor data for each sample and plotted the results.
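The windowing and per-sample standard deviation steps can be sketched as follows. This is a minimal illustration on synthetic data; the array name and stream length are hypothetical, not from the original code.

```python
import numpy as np

fs = 50            # sampling rate in Hz
window = 2 * fs    # 100 observations = one 2-second sample

# Hypothetical 1-D stream of y-axis accelerometer readings.
acc_y = np.random.default_rng(1).normal(size=1234)

# Drop any remainder so the stream divides evenly into 100-observation
# windows, then compute the standard deviation of each window.
n_full = (len(acc_y) // window) * window
samples = acc_y[:n_full].reshape(-1, window)
window_std = samples.std(axis=1)   # one value per 2-second sample
```

The same reshape-then-aggregate pattern applies to each sensor axis independently.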

For example, in a plot of the standard deviations of the samples of the accelerometer data for Subject 4, running, the times when the subject was not running can clearly be seen.

By inspecting this and similar plots for all the subjects, a threshold of 5 for the standard deviation of the y accelerometer reading was picked as a cutoff.

All samples with standard deviation of the y accelerometer less than 5 were deleted from the data.

Similarly, thresholds were established for all the other activities: maxima for stationary activities (sitting, lying) and minima for moving activities.

No filtering was done for standing, since the subjects did not transition to or from standing during the recording periods.
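The threshold filter described above reduces to a simple boolean mask over the per-window standard deviations. The numbers below are hypothetical stand-ins for a "running" recording with standing pauses, not values from the dataset.

```python
import numpy as np

# Per-window std of the y accelerometer for a hypothetical 'running'
# recording: low values are standing pauses, high values are running.
window_std = np.array([0.8, 1.1, 7.9, 8.3, 9.0, 1.0, 8.5])

# Keep only windows whose y-accelerometer std reaches the threshold of 5;
# windows below it are dropped as mislabeled pauses.
THRESHOLD = 5
keep = window_std >= THRESHOLD
```

For stationary activities the comparison flips: windows *above* a maximum threshold are dropped instead.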

From 8 Activities to 3 Classes

I initially tested models of all eight activities.

Although machine learning models could be fit to training data, they would not generalize with sufficient accuracy on data from subjects not included in the training set.

The best classification model of the eight activities could only produce 75% accuracy on the validation data (when the latter consisted entirely of new subjects).

This is insufficient accuracy to be useful for the intended purpose, since too many errors will cause users to mistrust, ignore, and remove our potential phone app.

Moreover, for PA promotion it is not necessary to know the exact type of activity.

Thus the activities were grouped into three PA classes as follows:

- Sedentary (standing, sitting, lying)
- Light-Moderate PA (walking, going up stairs, going down stairs)
- Vigorous PA (running, jumping)

All of the model results below are based on classifying the data into these three groupings.
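The grouping is just a lookup table. Activity names below follow this write-up; the exact label strings in the Real World dataset files may differ.

```python
# Mapping from the 8 recorded activities to the 3 PA classes.
CLASS_MAP = {
    "standing": "Sedentary",
    "sitting": "Sedentary",
    "lying": "Sedentary",
    "walking": "Light-Moderate PA",
    "climbing up stairs": "Light-Moderate PA",
    "climbing down stairs": "Light-Moderate PA",
    "running/jogging": "Vigorous PA",
    "jumping": "Vigorous PA",
}
```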

Separating Data into Train, Test, and Holdout Sets

Data from the first 10 subjects was used to develop and train the model.

Data from the remaining 5 subjects was held out for final testing.

The training data was divided into train and test samples.

I tried two methods: random selection of samples using the “validation split” setting in Keras and manually dividing the sample into a training group of subjects 1 to 7 and a test group of subjects 8 to 10.

I found that it was necessary to use a subject-based split of the training and test data.

A random split means that there are observations from all 10 subjects in the training data.

Although the resulting test scores are high, the model performs poorly on data from subjects not seen before, since there are significant differences between subjects.

Because our use case involves detecting PA from any (previously unseen) user, it is essential to ensure that the model generalizes to new subjects.
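A subject-based split is a mask on the subject id rather than a random shuffle of rows. The sketch below uses synthetic data with hypothetical shapes; only the 1-7 / 8-10 split follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature table: one row per 2-second sample, plus a subject id.
X = rng.normal(size=(1000, 18))
y = rng.integers(0, 3, size=1000)
subject = rng.integers(1, 11, size=1000)   # subjects 1..10

# Subject-based split: subjects 1-7 train, 8-10 test, so no subject's
# data leaks between the two sets.
train_mask = subject <= 7
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]
```

A random row-level split would instead put samples from every subject in both sets, producing the optimistic scores described above.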

“K-fold cross validation” is another form of random splitting of the training and test data.

The common practice in Human Activity Recognition of creating overlapping samples compounds the problem by including the same information in the training and test data.

A study of these problems in HAR research found that “k-fold cross validation artificially increases the performance of recognizers by about 10%, and even by 16% when overlapping windows are used.”

Models Using Handcrafted Features

I created a summary dataset in which samples of activities conducted by a subject were grouped into windows of 100 observations (representing 2 seconds of measurement).

I calculated the mean, standard deviation, and range of the triaxial measurements (x, y, and z dimensions) for both the accelerometer and the gyroscope.

These 18 (3 x 3 x 2) features were used in various machine learning models to predict the class (sedentary, light-moderate PA, vigorous PA).
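Computing the 18 features for one window can be sketched as below. The window contents are synthetic; the channel ordering (3 accelerometer axes then 3 gyroscope axes) is an assumption for illustration.

```python
import numpy as np

# One 2-second sample: 100 observations x 6 channels
# (assumed order: accel x, y, z, then gyro x, y, z).
sample = np.random.default_rng(3).normal(size=(100, 6))

# Mean, standard deviation, and range per channel:
# 3 statistics x 6 channels = 18 features.
features = np.concatenate([
    sample.mean(axis=0),
    sample.std(axis=0),
    sample.max(axis=0) - sample.min(axis=0),   # range
])
```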

The models tested were logistic regression, KNN, various decision trees including XGBoost, SVM, and Naive Bayes.

By a small margin over the next-best models, logistic regression provided the greatest accuracy (0.973) and F1 score (0.956) on the validation data.
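Fitting and scoring such a model with scikit-learn takes a few lines. The data below is a synthetic stand-in for the 18-feature summary dataset, with one feature nudged to make the classes learnable; none of it reproduces the paper's actual numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(4)
# Synthetic stand-in: 600 samples x 18 handcrafted features, 3 PA classes.
X = rng.normal(size=(600, 18))
y = rng.integers(0, 3, size=600)
X[:, 0] += 2 * y   # make the classes separable enough to learn

# Train on the first 400 samples, score on the remaining 200.
clf = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
pred = clf.predict(X[400:])
acc = accuracy_score(y[400:], pred)
f1 = f1_score(y[400:], pred, average="weighted")
```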

Tests on Holdout Data

The logistic regression model was tested on the holdout data consisting of the remaining five subjects.

The overall accuracy of the predictions was 94%, and the weighted average F1 score was similarly high.

This is a plot of the confusion matrix by class.

An inspection of the correlation coefficients showed that many of the gyroscope features were highly correlated with the accelerometer features.

A logistic regression model including only the accelerometer features had the same accuracy and F1 score as the model including both the accelerometer and the gyroscope features.

There was no benefit to using the gyroscope data in this model.

This result is significant: using just one sensor and simple statistical features can predict physical activity with very high accuracy.

Although the overall accuracy of this simple model was very high, about 600 samples of light-moderate PA (11% of the class) were misclassified as “sedentary.” Can a more complicated model do better?

Neural Network Model

Instead of using “handcrafted features,” a neural network model uses the raw sensor data, formatted in this case as an array of 100 observations (2 seconds) x 6 sensor readings.

Because there is a time dimension to the data, it is appropriate to use a recurrent layer.

I used the gated recurrent unit (GRU) technique because it has fewer parameters than LSTM and may have better performance on smaller datasets.

Convolutional layers can improve the result by helping the model to learn important features, preventing overfitting, and reducing the number of parameters.

Batch normalization was used to further reduce overfitting.

The final model consisted of:

- two convolutional layers, each with 50 filters, a kernel size of 3, and ReLU activation, each followed by a pooling layer and a batch normalization layer (first pool size set to 4, second pool size set to 2)
- two recurrent (GRU) layers, each with 64 neurons and tanh activation (the second layer with a recurrent dropout of 0.2)
- a dropout layer with a dropout rate of 0.1
- a final dense layer representing the three classes to be predicted
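The architecture above can be sketched in Keras as follows. Layer sizes and activations follow the text; the optimizer, loss, and softmax output are assumptions, since the original code is not shown here.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100, 6)),          # 2-second window x 6 sensor channels
    layers.Conv1D(50, 3, activation="relu"),
    layers.MaxPooling1D(4),                # first pool size 4
    layers.BatchNormalization(),
    layers.Conv1D(50, 3, activation="relu"),
    layers.MaxPooling1D(2),                # second pool size 2
    layers.BatchNormalization(),
    layers.GRU(64, activation="tanh", return_sequences=True),
    layers.GRU(64, activation="tanh", recurrent_dropout=0.2),
    layers.Dropout(0.1),
    layers.Dense(3, activation="softmax"), # the three PA classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With the training setup described below, fitting would look like `model.fit(X_train, y_train, batch_size=25, epochs=2)`.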

There are more than 55,000 parameters to train in this model.

While far more computationally intensive than the logistic regression model, this neural network is relatively small: some NNs have millions of parameters to train.

The model was trained using minibatches of 25 samples over just two epochs.

The computational time was modest.

The neural network model performance was even better than the logistic regression model, achieving 99% accuracy.

About 100 samples that were actually vigorous PA were misclassified as light-moderate PA.

This is a very small error rate, and is a better type of error than the previous model in that the mislabeling concerns the degree of physical activity rather than a confusion between movement and sedentary behavior.

Further Improvements

This model was tested on only 15 subjects.

I would like to incorporate other public datasets to create a more robust model with more subjects, different hardware, different test situations, and different environments.

In addition, I would like to see if a reduction from a sampling rate of 50 Hz to 10–25 Hz still produces a robust model, since a lower sampling rate represents a significant reduction in power consumption.
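The simplest version of that rate reduction is decimation of each window. This is a sketch of the resampling step only; a production pipeline would likely low-pass filter first to avoid aliasing, and the array here is synthetic.

```python
import numpy as np

fs_in, fs_out = 50, 10
step = fs_in // fs_out                     # keep every 5th reading

# One 2-second window at 50 Hz: 100 observations x 6 sensor channels.
window = np.random.default_rng(5).normal(size=(100, 6))

# Decimate to 10 Hz: still 2 seconds, but only 20 observations.
downsampled = window[::step]
```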

The Python code for this research is available on my GitHub page.
