Sina_plot recreation

Sina_plot recreationJulian TagellBlockedUnblockFollowFollowingMay 2I recently got asked recently to recreate a ggplot chart.

knitr::include_graphics("images/image002.

jpg")From looking at it, I was able to determine some dummy data, to get myself started…# dummy datasmall_group_size <- 20large_group_size <- 2000set.

seed(123)small_sample_1 <- tibble(value1 = rnorm(small_group_size, 70, 20), value2 = rnorm(small_group_size, 150, 20), group = "Group A") small_sample_2 <- tibble(value1 = rnorm(small_group_size, 30, 20), value2 = rnorm(small_group_size, 180, 20), group = "Group B")control <- tibble(value1 = rnorm(large_group_size, 50, 20), value2 = rnorm(large_group_size, 220, 20), group = "Group C")# let's put it all together and add a gender column, as welldata_collected <- bind_rows( small_sample_1, small_sample_2, control) %>% mutate(person_id_number = row_number()) %>% mutate_at(c("value1", "value2"), round, 2) %>% mutate(gender = sample(c("M", "F"), n(), replace = T)) %>% mutate(group = factor(group, levels = c("Group A", "Group B", "Group C")))# and tidy things up (if you want.

object nuking, everying but)rm(list = setdiff(ls(), "data_collected"))I found out there is something similar, called a sina_plot, in the ggforce package.

Similar but not quite….

Let’s see the sina_plot out of the boxdata_collected %>% ggplot() + aes(x = group, y = value1) + ggforce::geom_sina()There are some interesting features of the chart I was presented with that the straight sina_plot isn’t giving me.

Firstly, the y-axis values were binned and the scale was a log of some sort.

Plus the x-axis jittering is a not irregular.

Let’s see if we can implement that…given_levels <- data_collected %>% mutate(value_measured = value1, level_group = (value1 %/% 5) * 5, level_group = pmax(level_group, 0)) %>% group_by(group, level_group) %>% mutate(count_of_entries = n(), is_group_count_even = count_of_entries %% 2 == 0)Let’s take a look, now, with colour.

given_levels %>% ggplot() + aes(x = group, y = level_group) + ggforce::geom_sina()So it looks like I’m going to have to determine both the x and y-axis locations in the data-prep and using a facet for the different groups… and use a geom_point chart… And worry about the colour later.

given_levels_and_width_values <- given_levels %>% mutate(count_of_entries = n(), width_of_group = count_of_entries %/% 2, width_position = row_number() – width_of_group – 1) %>% mutate(width_position = if_else(is_group_count_even & width_position == 0, width_of_group, width_position)) %>% # this is tricky.

Basically, if there's an even number, we want to relocate that x=0 point somehow.

by putting it at the end.

this is actually going to cause us problems later 🙂 mutate(width_position = case_when(is_group_count_even & width_position > 0 ~ width_position – 0.

5, is_group_count_even & width_position < 0 ~ width_position + 0.

5, TRUE ~ width_position))Let’s plot it nowgiven_levels_and_width_values %>% ggplot() + aes(width_position, level_group) + geom_point() + facet_grid(.

~ group, scales = "free_x")Groovy!.A bit more neatening up…distance <- 5unique_grouping_levels <- unique(given_levels_and_width_values$group) dummy_data <- tibble(column_to_group = rep(unique_grouping_levels, 2), width_position = c(rep(-distance, length(unique_grouping_levels)), rep(distance, length(unique_grouping_levels))), level_group = rep(NA_integer_, length(unique_grouping_levels)*2)) median_lines <- given_levels_and_width_values %>% group_by(group) %>% summarise(level_group = median(value_measured, na.

rm = T), to = max(width_of_group, na.

rm = T) + distance/2, from = -to) %>% gather(key, width_position, -1, -2) %>% select(-key)break_steps <- seq(min(given_levels_and_width_values$level_group, na.

rm = T), max(given_levels_and_width_values$level_group, na.

rm = T), 25) nearly_finished_plot <- given_levels_and_width_values %>% ggplot() + aes(x = width_position, y = level_group) + geom_point(shape = 1) + geom_blank(data = dummy_data) + labs(x = NULL, y = NULL) + scale_y_continuous(breaks = break_steps) + facet_grid(.

~ group, scales = "free_x", switch = "both") + theme(axis.

text.

x = element_blank(), axis.

ticks.

x = element_blank(), panel.

background = element_blank(), axis.

line = element_line(colour = "black"), strip.

background = element_rect(fill = "white"), strip.

placement = "outside", panel.

spacing.

x = unit(0, "lines")) + geom_line(data = median_lines, colour = "red", size = 1) nearly_finished_plotWe may also want to see the dots coloured by another column…only_colour_points <- given_levels_and_width_values %>% filter(!is.

na(gender)) %>% arrange(gender) %>% mutate(count_of_entries = n(), width_of_group = count_of_entries %/% 2, width_position = row_number() – width_of_group – 1) %>% mutate(width_position = if_else(is_group_count_even & width_position == 0, width_of_group, width_position)) %>% # this is tricky.

Basically, if there's an even number, we want to relocate that x=0 point somehow.

by putting it at the end.

this is actually going to cause us problems later 🙂 mutate(width_position = case_when(is_group_count_even & width_position > 0 ~ width_position – 0.

5, is_group_count_even & width_position < 0 ~ width_position + 0.

5, TRUE ~ width_position))nearly_finished_plot + geom_point(data = only_colour_points, aes(colour = gender)) # these colour points are actually being overlayed over the original points, and loaded in from the left.

this is beneficial because it allows you to see proportions (roughly) of each group (and even NA values, if there are any)Okay, so this is all great.

We’ve worked out the data prep steps and also all the little ggplot additions… from here let’s make each of these steps into functions.

(1) Assign width groups (bins) (2) Assign width poisitions (3) If applicable, assign width positions by a colour variable (4) ggplot it up!While we’re at it, let’s make this as dynamic as possible… So that the y-axis and the faceting (grouping) variable can be specified in the function call.

assign_width_groups <- function(data, column_to_group, column_to_bin, min_value_to_plot, bin_size) { column_to_bin <- rlang::enquo(column_to_bin) column_to_group <- rlang::enquo(column_to_group) data %>% mutate(value_measured = !!.column_to_bin, column_to_group = !!.column_to_group, level_group = ((!! column_to_bin) %/% bin_size) * bin_size, level_group = pmax(level_group, min_value_to_plot)) %>% group_by(column_to_group, level_group) %>% mutate(count_of_entries = n(), is_group_count_even = count_of_entries %% 2 == 0)}assign_width_positions <- function(data) { data %>% mutate(count_of_entries = n(), width_of_group = count_of_entries %/% 2, width_position = row_number() – width_of_group – 1) %>% mutate(width_position = if_else(is_group_count_even & width_position == 0, width_of_group, width_position)) %>% mutate(width_position = case_when(is_group_count_even & width_position > 0 ~ width_position – 0.

5, is_group_count_even & width_position < 0 ~ width_position + 0.

5, TRUE ~ width_position))}plotting_function <- function(data, distance = 5, colour_selection = NULL) { colour_selection <- rlang::enquo(colour_selection) unique_grouping_levels <- unique(data$column_to_group) dummy_data <- tibble(column_to_group = rep(unique_grouping_levels, 2), width_position = c(rep(-distance, length(unique_grouping_levels)), rep(distance, length(unique_grouping_levels))), level_group = rep(NA_integer_, length(unique_grouping_levels)*2)) median_lines <- data %>% group_by(column_to_group) %>% summarise(level_group = median(value_measured, na.

rm = T), to = max(width_of_group, na.

rm = T) + distance/2, from = -to) %>% gather(key, width_position, -1, -2) %>% select(-key) break_steps <- seq(min(data$level_group, na.

rm = T), max(data$level_group, na.

rm = T), 25) plot <- data %>% ggplot() + aes(x = width_position, y = level_group) + geom_point(shape = 1) + geom_blank(data = dummy_data) + labs(x = NULL, y = NULL) + scale_y_continuous(breaks = break_steps) + facet_grid(.

~ column_to_group, scales = "free_x", switch = "both") + theme(axis.

text.

x = element_blank(), axis.

ticks.

x = element_blank(), panel.

background = element_blank(), axis.

line = element_line(colour = "black"), strip.

background = element_rect(fill = "white"), strip.

placement = "outside", panel.

spacing.

x = unit(0, "lines")) + geom_line(data = median_lines, colour = "red", size = 1) # we also need to add a final step for when a colour component is selected, as well.

if (!is.

null(colour_selection)) { only_colour_points <- data %>% # this is the data after both steps of the data-prep.

mutate(colour_column = !! colour_selection) %>% filter(!is.

na(colour_column)) %>% arrange(colour_column) %>% assign_width_positions() plot <- plot + geom_point(data = only_colour_points, aes(colour = colour_column)) } plot }And now it can all be put together by only a few linesdata_collected %>% assign_width_groups(gender, value1, 0, 5) %>% assign_width_positions() %>% plotting_function(colour_selection = group)The median line was also tricky.

Firstly, it’s falling behind the coloured dots (that’s easily fixed) and, secondly, it’s calculated from the actual values instead of the binned values (that’s why it’s not exactly on -or in between -a certain displayed point).

There is also one other tiny issue, but I think it will go unnoticed, until I can be bothered to fix it (it’s not the log scale… I thought it looked better without it).

Anyways, with all these functions all set to go, we can now go ahead and put it all into a nifty, little flexdashboard…Check it outConclusionOh boy, this took a long time to work out.

Customised (and customisable) charts can take a long time in ggplot / R when compared to Tableau… but now that I know how.

watch out world 🙂 implementing the next time will be a sinch.

.. More details

Leave a Reply