How to use Jupyter to conduct preliminary data analysis for health sciences: R/tidyverse edition

I’d recommend that after you have installed Jupyter Notebook (or Jupyter Lab), or have accessed a hosted instance of either, you read the following two tutorials published in “Towards Data Science”:

- Bringing the best out of Jupyter Notebooks for data science
- Jupyter Lab: evolution of the Jupyter Notebook

You will develop a working knowledge of what Jupyter notebooks are and how to use them.

The purpose of this tutorial

In this tutorial, I will show you how you can use Jupyter Notebooks/Jupyter Lab to conduct real world data analysis from scratch using R (tidyverse).

I will write about using R (tidyverse and ggplot) to do data analysis.

In a future post, I will show you how you can use Python (pandas, matplotlib, statsmodel, and seaborn) to conduct data analysis and create graphics.

These are open-source and free tools (free as in free beer and free as in free speech; click on the image below to listen to an interview with Richard Stallman about what that means 🙂 ). So learning how to use these tools for your data analysis and writing will make it easy and intuitive for you to do and share your data analysis projects with everyone.

What will you need to complete this tutorial?

- A working knowledge of R
- A web browser and a computer that has Jupyter Notebook or Jupyter Lab installed

However, you do not have to install Jupyter Notebook/Jupyter Lab on your own computer to follow along.

You can create free accounts on one of these websites to follow along as well.

Practically any browser will work.

If any browser does not work with these, please let me know in the comments.

You will also need to know how to write in markdown (covered below).

What do I mean by “a working knowledge of R”?

If you have not used R before, now is a good time to start.

R is a statistical programming environment (this means you can not only conduct statistical data analysis but you can also develop “routines” or procedures that you can share with others using R, which is a good thing).

You can find out more about R from the following websites (what R is, how to obtain and install it):

- R home page: https://www.r-project.org/
- Download R from: https://cran.r-project.org/
- A series of pages on how you can adopt R in your work: https://cran.r-project.org/web/views/

In this tutorial, I will make use of “tidyverse”, a suite of packages in R you can use for data science; here is the link: https://www.tidyverse.org/

I recommend that you make yourself familiar with the following online text, written by Hadley Wickham and Garrett Grolemund, for doing data science in R. It is free and easy to follow along: https://r4ds.had.co.nz/

These are the basic resources to get you started.

Here, I will introduce some essential key components of R that you can use to follow along.

In this tutorial, we will use R and tidyverse to read a data set, “clean and preprocess” the data set, and then find some meaning in the data set.

We will create simple graphics using a package within R referred to as “ggplot2” (Grammar of Graphics Plotting Version 2, based on Leland Wilkinson’s book “The Grammar of Graphics”).

A quick cook’s tour of R and markdown

Let’s get started.

We will use a free Jupyter Lab instance hosted on Cocalc (URL: https://cocalc.com); if you want to use this and follow along, consider creating a free account there and starting a Jupyter Lab server. When you visit cocalc.com, it looks as follows:

Figure 1. Cocalc online Jupyter environment. You can sign in with your other social media sign-ons.

After you sign in and bring up a Jupyter Lab environment, it should look as follows:

Figure 2. Jupyter Lab interface

I will not get into the details of the Jupyter Lab interface here, as it has already been well documented in the link I shared above; please read the associated document, it’s really well written.

Instead, let’s focus on the most interesting tasks at hand:

- The file extension is ‘.ipynb’: “interactive IPython notebook”.
- We will need to set a filename instead of “untitled.ipynb”; let’s call it ‘first_analysis.ipynb’. You can right-click on the filename in the second, large white panel and change the filename.
- Then we need to grab some data and write code and text to get to work.

But before we do so, we need to touch on two more points:

- let’s learn a little about R as such, tidyverse, and tidy data, and
- let’s cover a little about markdown syntax to write your text.

The basic minimum information about R to get going

I have provided some annotated code in the following “greyed out” box that you can copy and paste into a code block in Jupyter Lab; when you are ready to run it, press “Shift+Enter”:

# Comments in R code blocks are written using 'hash' marks, and
# you should use detailed comments in all your code
x <- 1      # '<-' is an assignment operator. Here x is set to the value of 1
            # This is the same as x = 1
x == 1      # Here, we evaluate whether x is equal to 1
# Everything in R is an object and you can find out about objects
# using the function typeof()
typeof(x)   # should output 'double'
# 'double' tells you that x is a number, as in 1.00

# You can and should write functions to accomplish repetitive
# tasks in R. The syntax of a function is:
# my_function = function(x){ function statements }
# You can call functions anywhere in R using the function name and
# by writing the function parameters within parentheses,
# as in my_function(x)

# An example of a function where two numbers x and y are multiplied
multiply_fn = function(x, y){
  z = x * y
  return(z)
}
multiply_fn(2, 3)   # will produce 6
# You can see that a function can call another function within itself

# Combinations of data and functions are referred to as packages
# You can load packages using the library() function
# You can install packages using the install.packages() function
# You can find help in R using the help("topic") function
# In our case we will use the "tidyverse" package.
# We will need to install this first if not already installed

So at the end of this little exercise, after we have loaded the library “tidyverse”, this is how it looks:

Figure 3. Our basic set up with the tidyverse library ready to go

The basic minimum about markdown to get going

HTML is the lingua franca of the world wide web, but it is also quite complex to write by hand. So, John Gruber devised a language called markdown, in which you can write all the basic elements of a text using simple decorations.

You can write headings (I usually use two levels of headings), links, and insert codes.

You will also be able to use tables and citation marks in the flavour of markdown that is used for jupyter notebooks, so you can pretty much write your entire paper or thesis here.

In the box below I have provided a version of a short text written in markdown and the rendered version thereafter.

This should get you going writing in jupyter notebooks as well, hopefully.

This is what we have written in the text box in the Jupyter notebook:

# Purpose

This above was a first-level header, where we put one hash mark before the header. If we wanted to put a second-level header, we would add another hash mark. Here, for example, the goal is to demonstrate the principles of writing in markdown and conducting data analyses entirely in a web browser after installing Jupyter Notebook and optionally Jupyter Lab on the computer or the server.

## What we will do?

We will do the following:

- We will learn a little about R and markdown
- We will go grab some data sets
- After loading the data set in Jupyter, we will clean the data set
- We will run some tables and visualisations

## So is it possible to add tables?

Yes, to add tables, do something like:

| Task | Software |
|------|----------|
| Data analysis | Any statistical programme will do |
| Data analysis plus writing | An integrated solution will work |
| Examples of integrations | RStudio, Jupyter, Stata |

## How do we add links?

If you wanted to add a link to something, say Google, you would insert the following code: [Name the URL should point to, say Google](http://www.google.com), and it would put an underlined URL or name in your text.

## Where can I learn more about markdown and its various "flavours"?

Try the following links:

- [Markdown](https://daringfireball.net/projects/markdown/)
- [Github flavoured markdown](https://github.github.com/gfm/)
- [Academic markdown](http://scholarlymarkdown.com/)

And this is how it looks (partial; you can replicate it in Jupyter Lab or Jupyter Notebook):

Screen output of the markdown rendered in a text block in Jupyter

Let’s grab some data

Now that we have:

- learned a little about R and installed and loaded up tidyverse, and
- learned how to write in markdown so that we can describe our results,

it’s time to grab some data off the web and do some analyses using these tools.

Now, where will you get data? Data are everywhere, but some sites make it easier than others for you to grab and work with them.

How to obtain data could be another post by itself, so I will not go into the details, but generally:

- You can search for data (we will use two such sites: Figshare and Google Data Explorer)
- You can use application programming interfaces (APIs) to get data off websites and play with them
- Several sites set up toy data or real world data for you to play with (e.g., the Data and Story Library)
- Government sites and health departments provide “truckloads” of data for you to analyse and make sense of

The key here is to learn how to make sense of that data. This is where data science comes into the picture. Specifically, you can:

- Download data for free
- Shape it up in ways that will make sense for you to understand its patterns and make it “tidy data”
- Run visualisations using available tools
- Preprocess data using different software (we will use R and tidyverse in this tutorial, but you can use Python, or other specialised tools such as OpenRefine, and spreadsheets)

For the purpose of this exercise, let’s visit Figshare and search for data for our work.

Let’s say we are interested in identifying data on health issues in workplaces, and we want to see what data we can get off Figshare to get our work done.

As this is a demo exercise, we will keep our data set small so that we can learn the most essential points about grabbing data off the web and analysing them.

It’s actually quite intuitive to do that using tidyverse.

When I log in to Figshare, this is how my dashboard looks:

Figshare dashboard

We searched the database with “work” and “health” and identified a raw data set from a study where the researchers examined the relationship between behavioural activation, depression, and quality of life. When we searched, we made sure that we would download only data that were freely and openly shared (with a CC0 licence), and that the results were data sets.

The data were taken from the following publication (journals.plos.org): “Assessing the relationship between quality of life and behavioral activation using the Japanese…”

In describing the study, the authors of the paper wrote in the abstract:

Quality of life (QOL) is an important health-related concept.

Identifying factors that affect QOL can help develop and improve health-promotion interventions.

Previous studies suggest that behavioral activation fosters subjective QOL, including well-being.

However, the mechanism by which behavioral activation improves QOL is not clear.

Considering that QOL improves when depressive symptoms improve post-treatment and that behavioral activation is an effective treatment for depression, it is possible that behavioral activation affects QOL indirectly rather than directly.

To clarify the mechanism of the influence of behavioral activation on QOL, it is necessary to examine the relationships between factors related to behavioral activation, depressive symptoms, and QOL.

Therefore, we attempted to examine the relationship between these factors.

Participants comprised 221 Japanese undergraduate students who completed questionnaires on behavioral activation, QOL, and depressive symptoms: the Japanese versions of the Behavioral Activation for Depression Scale-Short Form (BADS-SF), WHO Quality of Life-BREF (WHOQOL-26), and Center for Epidemiologic Studies Depression Scale (CES-D).

The BADS-SF comprises two subscales, Activation and Avoidance, and the WHOQOL-26 measures overall QOL and four domains, Physical Health, Psychological Health, Social Relationships, and Environment.

Mediation analyses were conducted with BADS-SF activation and avoidance as independent variables, CES-D as a mediator variable, and each WHO-QOL as an outcome variable.

Results indicated that depression completely mediated the relationship between Avoidance and QOL, and partially mediated the relationship between Activation and QOL.

In addition, analyses of each domain of QOL showed that Activation positively affected all aspects of QOL directly and indirectly, but Avoidance had a negative influence on only part of QOL mainly through depression.

The present study provides behavioral activation strategies aimed at QOL enhancement.

The full paper is open access, and you can download and read it from journals.plos.org: “Assessing the relationship between quality of life and behavioral activation using the Japanese…”

You can directly download the data from here: https://ndownloader.figshare.com/files/9446491

Here’s how the data appear:

A look at how the data appear

Here, we will use the data set and run an annotated Jupyter notebook to show the different steps of:

- Obtaining the data
- Reading the data into R
- Cleaning the data set
- Asking questions
- Answering the questions

If you want, you can reproduce the authors’ paper using the data set they provided, but we will not do that in this tutorial.

So far in this tutorial, we have introduced the fact that, using a web-based tool, you can work with Jupyter Notebooks/Jupyter Lab to conduct data analyses.

I have introduced Cocalc, but there are other similar tools as well.

Some examples:

- Microsoft Azure Notebooks: a range of languages available, all for free; you can use R, Python, or other languages, and you can sign up and start working right away
- Google Colaboratory: from Google, and free; you can work on Python notebooks
- Try Jupyter: gives you free notebooks in different languages and in different set ups

So, we can start and end our analyses in any of these as these notebooks are interoperable.

You can also host notebooks on github and in binder and share your notebooks with the rest of the world.

Step 1: Download the data

You can see that this is an Excel spreadsheet file.

You can either open it in a spreadsheet programme (such as Excel or OpenOffice Calc) and export the file as a comma separated value file.

Alternatively, we can read the file directly in Jupyter.

We will do this here.

In order to do this, we will need to use the package “readxl”.

So we do as follows:

# first load the package
library(readxl)
# If you find that your Jupyter instance does not have the "readxl"
# package installed, then you will need to install it.
# Easiest: install.packages("readxl") and then library(readxl)

After loading, this is how it looks:

After loading a data set, this is how it looks

Well, nothing happened! Why? Because the read_excel() function read the contents of the Excel file “S1_Dataset.xlsx” and stored them in an object named mydata. We will now use this object mydata to examine what is inside it.

Step 2: Wrangle and preprocess the data

Now that we have read the data set, let’s delve into finding out what’s in there.

The first thing is to find out the header information. We can do that in R using the head() function.

So here is how it looks:

Output of head(mydata)

We also want to find out the list of variables in the data set, so we do:

names(mydata) # produces the list of variables
'Sex' 'Age' 'BADS-SF' 'BADS-SF_Activation' 'BADS-SF_Avoidance' 'CES-D' 'WHOQOL-26_Mean total score' 'WHOQOL-26_Phisical health' 'WHOQOL-26_Psychological health' 'WHOQOL-26_Social relationships' 'WHOQOL-26_Environment' 'WHOQOL-26_Overall QOL'

As you can see, this data set contains 12 variables with “expressive” names.
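If you want to run these inspection steps yourself, a minimal sketch (assuming mydata from the reading step above, with tidyverse loaded for glimpse()) is:

library(tidyverse)
head(mydata)      # shows the first six rows
names(mydata)     # lists the variable (column) names
glimpse(mydata)   # compact overview: one line per variable, with its type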

Some variables have names that contain spaces in them.

We will rename these variables so that these are meaningful for us.

Variables such as “Sex”, and “Age” are relatively straightforward to understand, but we may need to rename the other variables.

You will find more information about these variable names from the main paper and the accompanying data description in Figshare.

The following table maps the variable names to the concepts they stand for, with a short description:

| Variable Name | What it stands for |
|---------------|--------------------|
| Sex | 1 = Male, 2 = Female |
| Age | Age in years |
| BADS-SF | Behavioral Activation for Depression Scale-Short Form |
| BADS-SF_Activation | BADS for activation |
| BADS-SF_Avoidance | BADS for avoidance |
| CES-D | Center for Epidemiologic Studies Depression Scale |
| WHOQOL-26_Mean total score | Japanese version of the WHO Quality of Life-BREF mean total score |
| WHOQOL-26_Phisical health | WHOQoL-26 physical health |
| WHOQOL-26_Psychological health | WHOQoL-26 psychological health |
| WHOQOL-26_Social relationships | WHOQoL-26 social relationships |
| WHOQOL-26_Environment | WHOQoL-26 environment |
| WHOQOL-26_Overall QOL | WHOQoL-26 overall |

QoL in these variable names is short for Quality of Life. You can learn more about the WHO QoL from the relevant website here: https://www.who.int/substance_abuse/research_tools/whoqolbref/en/

We will not cover the details here, as we are soldiering on to the next step.

At this stage, you can see that:

- The variable names are kind of long, so you may want to shorten them but keep them understandable as you keep working
- Many variables have spaces in their names, so these need to be taken care of (we suggest that you use underscores to represent spaces)

How do we rename the variables in tidyverse? This is what we do (a code sketch follows the notes below):

Renaming variables in the data set

Several things to note:

- We store the renamed variables in a new data set (otherwise our changes will not persist beyond the code we have just executed, so make sure to save the result in a differently named object)
- We use the symbol ‘%>%’; this symbol is referred to as the ‘pipe’ and it reads as “and then”. Here we chain a number of different operations or commands together using it. If you want to learn more about the pipe and how to use it, please refer to Hadley Wickham’s excellent book on data science with R, where he writes: “Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code.”
- Then we used the rename() function, and in there the “new variable name” was the FIRST element, followed by an “equal sign”, and the “old variable name” was the SECOND element, indicating that we store the old variable’s values under the new variable name.

Note that at the end of this, once we executed it, we got a new data set with all values intact but with new variable names.

If you do not want to transform the names of some variables, keep them as they are (as we did for “Age” and “Sex”).

Let’s learn a few “grammars” for wrangling the data

The first thing we do when we get a brand new data set to play with is to “wrangle” the data.

We examine the variables, we plot them, we recode them to suit our purpose, we ask questions and we try to answer them.

Then, once we have discovered some patterns that we see we can delve in, we start modelling the data in various ways.

In this tutorial, I will confine myself to writing about the first part: how to produce simple tables and graphs.

Again, if you are working with R and tidyverse (which I recommend), your main text is Hadley Wickham’s book: R for Data Science.

The book is available for free; you can work with free software on the web to learn and practice, so please give it a try.

As you have seen, you can get data for free as well from different sites.

I will not repeat in detail everything in that book, but I will point to five grammars that you will find helpful in working with data:

Select. Use this to select columns from the data set. Let’s say we want to work with only the BADS and CES-D variables for this population; we will have to select the columns we want to work with:

Using Select

Filter. If we want to work with certain “individuals” in the data set, we filter the rows based on some criteria we set on the columns. Say we want to work with women (“Sex == 2”) who are teenagers (“Age < 20”); see what we do (a code sketch of select and filter follows these notes):

Grammar: filter()

Note the simplicity:

- We use the “==” equivalence operator to set the filtering conditions
- We separate different conditions using a “,” (comma)
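Here is a minimal sketch of these two verbs, assuming the renamed data set from the previous step is called mydata2 and uses the shortened names suggested there:

# select: keep only the BADS and CES-D columns
bads_cesd <- mydata2 %>%
  select(bads_sf, bads_activation, bads_avoidance, cesd)

# filter: keep only rows for women (Sex == 2) who are teenagers (Age < 20)
female_teens <- mydata2 %>%
  filter(Sex == 2, Age < 20)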

Arrange. If we want to view the data set in ascending or descending order, we use the verb arrange() to do so, as in:

Grammar: arrange()

As you can see, the youngest person in the data set is aged 2 years! You need to go back and correct this, or check whether this was indeed the value. This is where data cleaning and preprocessing becomes important. What sense do we make of this?
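A quick sketch of arrange(), again assuming the renamed data set mydata2:

mydata2 %>% arrange(Age)         # ascending: youngest respondents first
mydata2 %>% arrange(desc(Age))   # descending: oldest respondents first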

Mutate. You want to create new variables and store them in the data set, typically building them out of existing variables using other functions. Say, now that we know that a person’s age was entered as 2 years and is possibly a mistake, we do not want to lose that person’s information, and we want to cut the age variable into 4 groups. What do we do? We use the mutate verb, like so (a code sketch follows this block):

Grammar: mutate

So you see:

- Use mutate to recode one variable into another
- Use mutate to create new variables
- Mutate can use other functions

Let’s count the age_rec variable and see what we get:

Count of recoded age based on Age

As you see, if you let the computer decide, it can create categories that are not of much use to us. So, let’s recode and mutate a new age group and save it to a new data set:

Binarised Age

Did you see what we did?

- We created a new data set (mydata4) and in it we created a new binarised age variable (“age_bin”), where we distinguished teenagers from respondents who were 20 years and older (actually it should be ≥ 20)
- You can also assign labels of your choice (so we could put “Teenagers” and “Twenty plus” if we wanted)
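A minimal sketch of these two mutate steps, assuming mydata2 from the renaming sketch; the intermediate object mydata3 and the labels "teen"/"twenty_plus" are my own choices:

# let cut() choose 4 age intervals and tabulate them
mydata3 <- mydata2 %>%
  mutate(age_rec = cut(Age, breaks = 4))   # 4 machine-chosen age groups
mydata3 %>% count(age_rec)

# binarise age ourselves: teenagers versus respondents aged 20 and over
mydata4 <- mydata3 %>%
  mutate(age_bin = if_else(Age < 20, "teen", "twenty_plus"))
mydata4 %>% count(age_bin)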

Mutate is a powerful verb that lets you recreate data sets and code variables.

You should master this.

Summarise and group_by. Selecting, filtering, and arranging the data set, and mutating or changing the variables, may sound like a lot of useful things to do, but you will also need to peek deeper into the data and learn how to find meaning in it.

You will need to find aggregated summaries, and average values for those variables that are continuous, and tabulate those that are categorical.

In this way, you can gain insights into the data.

Let me introduce you to the summarise() and group_by() functions. Let’s say you are interested in finding out the average age of the respondents in this survey, and then want to see whether the males were on average older than the females.

How would you do it? See:

# What is the average age of the respondents?
average_age = mydata4 %>% summarise(average_age = mean(Age, na.rm = T))
average_age
# Will return 19.36652

See that this is a single number that consolidates the entire data set for all individuals.

That is fine, and you can get a whole list of such averages for the continuous variables you specify.

However, if we were to split the data set into different discrete groups (such as binarised age groups, or Sex for that matter), and then calculate the averages of the other variables, it would make for an interesting display to identify patterns.

So, basically, after splitting the data set, we apply some functions (in our case, say, the mean), and then we combine the results back together for presentation; that makes for an interesting and meaningful display of results.

For a comprehensive introduction to this topic, read the following article (download it as a PDF):

So, let’s put this into action and answer two questions (see the sketch after this list):

- Are males older than females in our data?
- Do teenagers have lower average scores on health-related quality of life (mean WHOQoL)?

The code and results for males vs. females

So, what are your take-away lessons?

- We first passed the data set, then
- we asked tidyverse to group by the categorical variable (do this first, because this is how you “split” the data), and then
- we asked for a summary of the variable we wanted.
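A minimal sketch of this split-apply-combine pattern, assuming mydata4, age_bin, and the renamed whoqol_mean variable from the steps above:

# are males older than females, on average?
mydata4 %>%
  group_by(Sex) %>%                                  # split by gender (1 = male, 2 = female)
  summarise(average_age = mean(Age, na.rm = TRUE))   # apply: mean age within each group

# do teenagers have a lower mean WHOQoL score?
mydata4 %>%
  group_by(age_bin) %>%
  summarise(mean_whoqol = mean(whoqol_mean, na.rm = TRUE))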

Gathering and spreading. Let’s put some of these concepts into action and try to answer the question: how do quality of life scores differ between males and females across age groups? Let’s find out the average scores first:

We summarised all quality of life scores

This is fine, but it is quite complex to interpret.

We have to first look at the first row and then find the scores for individual quality of life parameters for the male gender (Sex equals 1) and then read the next row and so on.

It would help if we had one variable, say “quality_of_life”, that listed qol, physical, psych, soc, env, and so on, so that we could read them down a single column. That is, we would have to gather these various scores and put them under one variable, say “quality_of_life”, and put all the scores we see there under another variable next to it; we can call that variable “all_means” (as these are essentially mean values).

In tidyverse, we now introduce another verb, called “gather”, that does exactly that:

- Use a “key” to collect the names of the variables that you want to group together; here, for instance, “quality_of_life”
- Use a “value” to hold their respective values

What names you give the key and value is up to you; just give them meaningful names.

So, here is the code for you to examine (a sketch follows below):

Code and results of “gather”

See what’s happening? We have now “gathered”, for all age groups and both sexes, their quality of life scores (the various parameters).
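A minimal sketch of the gathering step; it assumes mydata4 and the renamed score variables (whoqol_mean, whoqol_physical, whoqol_psych) from the earlier sketches, and only three scores are used here for brevity:

# build the grouped summary of mean scores first
qol_summary <- mydata4 %>%
  group_by(Sex, age_bin) %>%
  summarise(qol      = mean(whoqol_mean, na.rm = TRUE),
            physical = mean(whoqol_physical, na.rm = TRUE),
            psych    = mean(whoqol_psych, na.rm = TRUE)) %>%
  ungroup()

# gather: the former column names go into the key, their values into the value
qol_long <- qol_summary %>%
  gather(key = "quality_of_life", value = "all_means", qol, physical, psych)
qol_long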

But this is still quite complex as there is repetition for Sex and repetition for age groups.

If we spread it out so that the values of the “all_means” score appear under the keys of Sex (1 and 2), we could read them quite intuitively. Here’s the code again (a sketch follows below):

Code and results of spread

So, now you can read, within each age group, each of the scores and compare between males and females.
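Continuing the sketch from qol_long above, the spreading step might look like this:

# spread: the levels of Sex (1 and 2) become columns holding the all_means values
qol_wide <- qol_long %>%
  spread(key = Sex, value = all_means)
qol_wide   # one row per age group and quality of life score, one column per sex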

This way you can make some interesting comparisons.

One area where “spread” becomes particularly useful is when you crosstabulate variables, as in when you cross tabulate binarised age and Sex for instance in this data set.

First, take a look at what happens when you run:

crosstab = mydata4 %>% count(age_bin, Sex) # crosstabs binarised age and Sex

Counting binarised age and sex, code and result

Fine, but we want a proper cross-tabulation where Sex appears in the columns and binarised age in the rows. So, we use “spread” to achieve this; see:

Cross-tabulation of Age with Sex

Much better.

Now you can see how binarised age groups are spread across the genders 1 and 2 (remember that 1 = male, 2 = female).

Also, note that this is a data frame, which means you can use mutate() to add a column to find out the row wise percentages of teenagers in each gender.

How can you do that? (A sketch follows below.)

Code and results for finding out the male percentage in the binarised age groups

As you can see, males form a higher percentage of the teenage group and a lower percentage (26%) of the 20-plus age group.
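A minimal sketch of this cross-tabulation and row-wise percentage, assuming mydata4 and age_bin from the steps above (the column names `1` and `2` arise because Sex is coded numerically):

# cross-tabulate binarised age and Sex, with the sexes as columns
crosstab_wide <- mydata4 %>%
  count(age_bin, Sex) %>%          # counts per age group and sex combination
  spread(key = Sex, value = n)     # columns `1` (male) and `2` (female)

# add a row-wise percentage of males using mutate()
crosstab_wide %>%
  mutate(pct_male = `1` / (`1` + `2`) * 100)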

So, when you assess the relationship between some of the scores and gender, keep in mind that the age distribution can act as a confounding variable.

Step 3: Visualise the data

In the first two steps, we saw how to obtain data from a variety of sources (for free) and how to use freely available software such as Jupyter to write and work on the data to make it suitable for analyses.

But preparing tables and summaries and cleaning the data is half the fun.

Being able to create simple graphs to dive deeper into the data is satisfying.

So in this third and final step today, we will learn how to create simple graphs in tidyverse.

I suggest you use ggplot() to create graphs.

Ggplot() is a vast topic and there are books available on the web for you to read.

Read the following resources to get started:

- Tidyverse ggplot docs
- R Graphics Cookbook
- The data visualisation section of R for Data Science

In addition to the above sites, I’d recommend you read the following paper to gain an insight into graph construction (principles at least), irrespective of whether you use tidyverse or not:

In order to plot graphs to visualise data, we will use the ggplot() function in this tutorial, and we will follow the general scheme:

ggplot([data]) +
  geom_[geometry](mapping = aes(x = [variable], y = [variable]), stat = "<choices>") +
  facet_<wrap | grid>() +
  coord_<choices>() +
  labs("title of the plot") +
  xlab("label for x axis") +
  ylab("label for y axis")

For a complete understanding of how ggplot() works, please follow the links I have listed above.

Here, as we are introducing and getting our feet “wet”, I will only cover the basics.

Remember some rules:

- The function is ggplot(), and data is mandatory (this is why I have placed it in square brackets)
- Then there is a “+” sign, and the + sign must ALWAYS be at the end of a line, NEVER at the start
- The plus sign indicates a “layer”, from the “layered grammar of graphics”
- You must always specify the aesthetics (“aes”) values, otherwise the graph will not build
- The rest of the options are kind of optional (I say “kind of” because you will learn as you go on using this)

Let’s start with mydata4 and see how it appears now:

The mydata4 data frame

Let’s graphically explore:

- What does the age distribution look like?
- What does it look like for males and females?
- What is the relationship between bads_sf and whoqol_mean? Is it similar for males and females?

We can go on exploring, but answering these questions graphically will provide you with ideas that you can use for exploring your own questions in this data set and in future data sets that you may want to work with.

What does the age distribution look like?

First, study the code:

mydata4 %>%
  ggplot() +
  geom_bar(aes(x = Age)) +
  ggtitle("A bar plot of age") +
  xlab("Age in years") +
  ylab("Counts") +
  ggsave("age_bar.png")

- mydata4 was the data set that contained all the variables we wanted to plot
- The pipe symbol sent the information about mydata4 to the ggplot() function
- Then we started a new layer and added layers with the + symbol
- We asked for a bar plot, as the age values are whole numbers and hence can be represented with bars; we only had to specify an X axis and ggplot would figure out the raw counts
- We titled the graph with ggtitle() and inserted the title in quote marks
- We labelled the x and y axes using the xlab() and ylab() functions
- We saved the figure using the ggsave() function

This is how it looks:

Bar plot of age

Does the bar plot of the age distribution look the same for males and females?

Now we should use the facet function to split the graph into two plots. First, see the code:

mydata4 %>%
  ggplot() +
  geom_bar(aes(x = Age)) +
  facet_wrap(~Sex) +
  ggtitle("A bar plot of age by gender") +
  xlab("Age in years") +
  ylab("Counts") +
  ggsave("age_bar_sex.png")

png")Note that we added to wrap the figure into two parts by adding the Sex variable so that it splits the Figure into two parts as in the figure below.

What do you think?.Does it look similar for males and females?Plot of the Age distribution by genderWhat is the relationship between bads_sf and whoqol_mean?.Is it similar for males and females?This time, we are going to explore the association or relationship between two variables.

What graph we will draw will depend on the nature of the variables we want to work with.

The following table will provide you with some ideas:

| X variable | Y variable | Type of graph | Geometry |
|------------|------------|---------------|------------------|
| Continuous | Continuous | Scatterplot | geom_point() |
| Continuous | Categorical | Boxplot | geom_boxplot() |
| Categorical | Continuous | Boxplot | geom_boxplot() |
| Categorical | None | Barplot | geom_bar() |
| Continuous | None | Histogram | geom_histogram() |

Because both bads_sf and whoqol_mean are continuous, we can draw a scatterplot.

But which variable is going to explain which? If we think that there may be a relationship such that as bads_sf scores increase, so will whoqol_mean, we can examine this using a scatterplot and draw linear regression lines to show their relationship (linear regression or modelling is beyond the scope of this tutorial, so we will only show the graph and not write more about it here).

Here’s the code:

mydata4 %>%
  ggplot() +
  geom_point(aes(x = bads_sf, y = whoqol_mean)) +
  ggtitle("Association between bads_sf and whoqol") +
  xlab("BADS-SF score") +
  ylab("WHOQoL Score") +
  ggsave("bads_who.png")

Here’s the graph of association:

Association between BADS-SF score and WHOQoL score; you see a positive association

So, we add a regression line, and colour the points and the lines differently for males and females. Here’s the code first:

mydata4 %>%
  mutate(gender = as.factor(Sex)) %>%
  ggplot() +
  geom_point(aes(x = bads_sf, y = whoqol_mean, colour = gender)) +
  geom_smooth(aes(x = bads_sf, y = whoqol_mean, colour = gender), method = "lm") +
  ggtitle("Association between bads_sf and whoqol") +
  xlab("BADS-SF score") +
  ylab("WHOQoL Score") +
  ggsave("bads_who_lm.png")

Then the figure:

Association between BADS-SF and WHOQoL scores

So, what’s going on?

We had to add a new variable, “gender”, from “Sex”, so we used the mutate function to convert it to a factor variable rather than the character variable it was.

A factor variable is a categorical variable but explicitly has levels that are arranged according to alphabetical or numerical order.

This was necessary for the subsequent steps where we wanted to test what would be the differences in the relationships.

We wanted to test how the points would differ so, we coloured them differently using the gender variable.

Note that we have placed the colour argument within the aesthetics of the mapping.

Similarly, we wanted to use a smoothing function, and in our case, this was “linear model”, hence “lm” as a method of choice.

But this has to be placed outside of the aesthetics parameters.

Within the aesthetics of the smoothing lines, we would need to indicate that we want different coloured lines pertaining to the levels of the gender variable.

So, what do you think you see in the relationship?
