How to do reproducible research in Ruby with gKnit

How to do reproducible research in Ruby with gKnitRodrigo BotafogoBlockedUnblockFollowFollowingApr 30Rodrigo Botafogo & Daniel MosséIntroductionThe idea of “literate programming” was first introduced by Donald Knuth in the 1980’s (Knuth 1984).

The main intention of this approach was to develop software interspersing macro snippets, traditional source code, and a natural language such as English in a document that could be compiled into executable code and at the same time easily read by a human developer.

According to Knuth “The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style.

”The idea of literate programming evolved into the idea of reproducible research, in which all the data, software code, documentation, graphics etc.

needed to reproduce the research and its reports could be included in a single document or set of documents that when distributed to peers could be rerun generating the same output and reports.

The R community has put a great deal of effort in reproducible research.

In 2002, Sweave was introduced and it allowed mixing R code with Latex generating high quality PDF documents.

A Sweave document could include code, the results of executing the code, graphics and text such that it contained the whole narrative to reproduce the research.

In 2012, Knitr, developed by Yihui Xie from RStudio was released to replace Sweave and to consolidate in one single package the many extensions and add-on packages that were necessary for Sweave.

With Knitr, R markdown was also developed, an extension to the Markdown format.

With R markdown and Knitr it is possible to generate reports in a multitude of formats such as HTML, markdown, Latex, PDF, dvi, etc.

R markdown also allows the use of multiple programming languages such as R, Ruby, Python, etc.

in the same document.

In R markdown, text is interspersed with code chunks that can be executed and both the code and its results can become part of the final report.

Although R markdown allows multiple programming languages in the same document, only R and Python (with the reticulate package) can persist variables between chunks.

For other languages, such as Ruby, every chunk will start a new process and thus all data is lost between chunks, unless it is somehow stored in a data file that is read by the next chunk.

Being able to persist data between chunks is critical for literate programming otherwise the flow of the narrative is lost by all the effort of having to save data and then reload it.

Although this might, at first, seem like a small nuisance, not being able to persist data between chunks is a major issue.

For example, let’s take a look at the following simple example in which we want to show how to create a list and the use it.

Let’s first assume that data cannot be persisted between chunks.

In the next chunk we create a list, then we would need to save it to file, but to save it, we need somehow to marshal the data into a binary format:lst = R.

list(a: 1, b: 2, c: 3)lst.


rds")then, on the next chunk, where variable ‘lst’ is used, we need to read back it’s valuelst = R.


rds")puts lst## $a## [1] 1## ## $b## [1] 2## ## $c## [1] 3Now, any single code has dozens of variables that we might want to use and reuse between chunks.

Clearly, such an approach becomes quickly unmanageable.

Probably, because of this problem, it is very rare to see any R markdown document in the Ruby community.

When variables can be used accross chunks, then no overhead is needed:@lst = R.

list(a: 1, b: 2, c: 3)# any other code can be added hereputs @lst## $a## [1] 1## ## $b## [1] 2## ## $c## [1] 3In the Python community, the same effort to have code and text in an integrated environment started around the first decade of 2000.

In 2006 iPython 0.


2 was released.

In 2014, Fernando Pérez, spun off project Jupyter from iPython creating a web-based interactive computation environment.

Jupyter can now be used with many languages, including Ruby with the iruby gem (https://github.


In order to have multiple languages in a Jupyter notebook the SoS kernel was developed (https://vatlab.



gKnitting a DocumentThis document describes gKnit.

gKnit is based on knitr and R markdown and can knit a document written both in Ruby and/or R and output it in any of the available formats of R markdown.

gKnit allows ruby developers to do literate programming and reproducible research by allowing them to have in a single document, text and code.

gKnit runs atop of GraalVM, and Galaaz (an integration library between Ruby and R — see bellow).

In gKnit, Ruby variables are persisted between chunks, making it an ideal solution for literate programming in this language.

Also, since it is based on Galaaz, Ruby chunks can have access to R variables and Polyglot Programming with Ruby and R is quite natural.

Galaaz has already been describe in the following posts:https://towardsdatascience.




org/how-to-make-beautiful-ruby-plots-with-galaaz-320848058857This is not a blog post on R markdown, and the interested user is directed to the following links for detailed information on its capabilities and use.



com/ orhttps://bookdown.

org/yihui/rmarkdown/In this post, we will describe just the main aspects of R markdown, so the user can start gKnitting Ruby and R documents quickly.

The Yaml headerAn R markdown document should start with a Yaml header and be stored in a file with ‘.

Rmd’ extension.

This document has the following header for gKitting an HTML document.

—title: "How to do reproducible research in Ruby with gKnit"author: – "Rodrigo Botafogo" – "Daniel Mossé – University of Pittsburgh"tags: [Tech, Data Science, Ruby, R, GraalVM]date: "20/02/2019"output: html_document: self_contained: true keep_md: true pdf_document: includes: in_header: [".



sty"] number_sections: yes—For more information on the options in the Yaml header, check https://bookdown.



R Markdown formattingDocument formatting can be done with simple markups such as:Headers# Header 1## Header 2### Header 3ListsUnordered lists:* Item 1* Item 2 + Item 2a + Item 2bOrdered Lists1.

Item 12.

Item 23.

Item 3 + Item 3a + Item 3bFor more R markdown formatting go to https://rmarkdown.




R chunksRunning and executing Ruby and R code is actually what really interests us is this blog.

Inserting a code chunk is done by adding code in a block delimited by three back ticks followed by an open curly brace (‘{’) followed with the engine name (r, ruby, rb, include, …), an any optional chunk_label and options, as shown bellow:“`{engine_name [chunk_label], [chunk_options]}“`for instance, let’s add an R chunk to the document labeled ‘first_r_chunk’.

This is a very simple code just to create a variable and print it out, as follows:“`{r first_r_chunk}vec <- c(1, 2, 3)print(vec)“`If this block is added to an R markdown document and gKnitted the result will be:vec <- c(1, 2, 3)print(vec)## [1] 1 2 3Now let’s say that we want to do some analysis in the code, but just print the result and not the code itself.

For this, we need to add the option ‘echo = FALSE’.

“`{r second_r_chunk, echo = FALSE}vec2 <- c(10, 20, 30)vec3 <- vec * vec2print(vec3) “`Here is how this block will show up in the document.

Observe that the code is not shown and we only see the execution result in a white box## [1] 10 40 90A description of the available chunk options can be found in https://yihui.


Let’s add another R chunk with a function definition.

In this example, a vector ‘r_vec’ is created and a new function ‘reduce_sum’ is defined.

The chunk specification is“`{r data_creation}r_vec <- c(1, 2, 3, 4, 5)reduce_sum <- function(.

) { Reduce(sum, as.


))}“`and this is how it will look like once executed.

From now on, to be concise in the presentation we will not show chunk definitions any longer.

r_vec <- c(1, 2, 3, 4, 5)reduce_sum <- function(.

) { Reduce(sum, as.


))}We can, possibly in another chunk, access the vector and call the function as follows:print(r_vec)## [1] 1 2 3 4 5print(reduce_sum(r_vec))## [1] 15R Graphics with ggplotIn the following chunk, we create a bubble chart in R using ggplot and include it in this document.

Note that there is no directive in the code to include the image, this occurs automatically.

The ‘mpg’ dataframe is natively available to R and to Galaaz as well.

For the reader not knowledgeable of ggplot, ggplot is a graphics library based on “the grammar of graphics” (Wilkinson 2005).

The idea of the grammar of graphics is to build a graphics by adding layers to the plot.

More information can be found in https://towardsdatascience.


In the plot bellow the ‘mpg’ dataset from base R is used.

“The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

” (Quinlan, 1993)First, the ‘mpg’ dataset if filtered to extract only cars from the following manumactures: Audi, Ford, Honda, and Hyundai and stored in the ‘mpg_select’ variable.

Then, the selected dataframe is passed to the ggplot function specifying in the aesthetic method (aes) that ‘displacement’ (disp) should be plotted in the ‘x’ axis and ‘city mileage’ should be on the ‘y’ axis.

In the ‘labs’ layer we pass the ‘title’ and ‘subtitle’ for the plot.

To the basic plot ‘g’, geom_jitter is added, that plots cars from the same manufactures with the same color (col=manufactures) and the size of the car point equal its high way consumption (size = hwy).

Finally, a last layer is plotter containing a linear regression line (method = “lm”) for every manufacturer.

# load package and datalibrary(ggplot2)data(mpg, package="ggplot2")mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]# Scatterplottheme_set(theme_bw()) # pre-set the bw theme.

g <- ggplot(mpg_select, aes(displ, cty)) + labs(subtitle="mpg: Displacement vs City Mileage", title="Bubble chart")g + geom_jitter(aes(col=manufacturer, size=hwy)) + geom_smooth(aes(col=manufacturer), method="lm", se=F)Ruby chunksIncluding a Ruby chunk is just as easy as including an R chunk in the document: just change the name of the engine to ‘ruby’.

It is also possible to pass chunk options to the Ruby engine; however, this version does not accept all the options that are available to R chunks.

Future versions will add those options.

“`{ruby first_ruby_chunk}“`In this example, the ruby chunk is called ‘first_ruby_chunk’.

One important aspect of chunk labels is that they cannot be duplicated.

If a chunk label is duplicated, gKnit will stop with an error.

Another important point with Ruby chunks is that they are evaluated in the scope of a class called RubyChunk.

To make sure that variables are available between chunks, they should be made as instance variables of the RubyChunk class.

In the following chunk, variable ‘@a’, ‘@b’ and ‘@c’ are standard Ruby variables and ‘@vec’ and ‘@vec2’ are two vectors created by calling the ‘c’ method on the R module.

In Galaaz, the R module allows us to access R functions transparently.

The ‘c’ function in R, is a function that concatenates its arguments making a vector.

It should be clear that there is no requirement in gknit to call or use any R functions.

gKnit will knit standard Ruby code, or even general text without any code.

@a = [1, 2, 3]@b = "US$ 250.

000"@c = "The 'outputs' function"@vec = R.

c(1, 2, 3)@vec2 = R.

c(10, 20, 30)In this next block, variables ‘@a’, ‘@vec’ and ‘@vec2’ are used and printed.

puts @aputs @vec * @vec2## [1, 2, 3]## [1] 10 40 90Note that @a is a standard Ruby Array and @vec and @vec2 are vectors that behave accordingly, where multiplication works as expected.

Accessing R from RubyOne of the nice aspects of Galaaz on GraalVM, is that variables and functions defined in R, can be easily accessed from Ruby.

This next chunk, reads data from R and uses the ‘reduce_sum’ function defined previously.

To access an R variable from Ruby the ‘~’ function should be applied to the Ruby symbol representing the R variable.

Since the R variable is called ‘r_vec’, in Ruby, the symbol to acess it is ‘:r_vec’ and thus ‘~:r_vec’ retrieves the value of the variable.

puts ~:r_vec## [1] 1 2 3 4 5In order to call an R function, the ‘R.

’ module is used as followsputs R.

reduce_sum(~:r_vec)## [1] 15Ruby PlottingWe have seen an example of plotting with R.

Plotting with Ruby does not require anything different from plotting with R.

In the following example, we plot a diverging bar graph using the ‘mtcars’ dataframe from R.

This data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

The ten aspects are:mpg: Miles/(US) galloncyl: Number of cylindersdisp: Displacement (cu.


)hp: Gross horsepowerdrat: Rear axle ratiowt: Weight (1000 lbs)qsec: 1/4 mile timevs: Engine (0 = V-shaped, 1 = straight)am: Transmission (0 = automatic, 1 = manual)gear: Number of forward gearscarb: Number of carburetors# copy the R variable :mtcars to the Ruby mtcars variable@mtcars = ~:mtcars# create a new column 'car_name' to store the car names so that it can be# used for plotting.

The 'rownames' of the data frame cannot be used as# data for plotting@mtcars.

car_name = R.

rownames(:mtcars)# compute normalized mpg and add it to a new column called mpg_z# Note that the mean value for mpg can be obtained by calling the 'mean'# function on the vector 'mtcars.


The same with the standard# deviation 'sd'.

The vector is then rounded to two digits with 'round 2'@mtcars.

mpg_z = ((@mtcars.

mpg – @mtcars.





round 2# create a new column 'mpg_type'.

Function 'ifelse' is a vectorized function# that looks at every element of the mpg_z vector and if the value is below# 0, returns 'below', otherwise returns 'above'@mtcars.

mpg_type = (@mtcars.

mpg_z < 0).

ifelse("below", "above")# order the mtcar data set by the mpg_z vector from smaler to larger values@mtcars = @mtcars[@mtcars.


order, :all]# convert the car_name column to a factor to retain sorted order in plot@mtcars.

car_name = @mtcars.


factor levels: @mtcars.

car_name# let's look at the first records of the final data frameputs @mtcars.

headrequire 'ggplot'puts @mtcars.


aes(x: :car_name, y: :mpg_z, label: :mpg_z)) + R.


aes(fill: :mpg_type), stat: 'identity', width: 0.

5) + R.

scale_fill_manual(name: 'Mileage', labels: R.

c('Above Average', 'Below Average'), values: R.

c('above': '#00ba38', 'below': '#f8766d')) + R.

labs(subtitle: "Normalised mileage from 'mtcars'", title: "Diverging Bars") + R.

coord_flipInline Ruby codeWhen using a Ruby chunk, the code and the output are formatted in blocks as seen above.

This formatting is not always desired.

Sometimes, we want to have the results of the Ruby evaluation included in the middle of a phrase.

gKnit allows adding inline Ruby code with the ‘rb’ engine.

The following chunk specification will create and inline Ruby text:This is some text with inline Ruby accessing variable @b which has value:“`{rb puts @b}“`and is followed by some other text!This is some text with inline Ruby accessing variable @b which has value: US$ 250.

000 and is followed by some other text!Note that it is important not to add any new line before of after the code block if we want everything to be in only one line, resulting in the following sentence with inline Ruby code.

The ‘outputs’ functionHe have previously used the standard ‘puts’ method in Ruby chunks in order produce output.

The result of a ‘puts’, as seen in all previous chunks that use it, is formatted inside a white box that follows the code block.

Many times however, we would like to do some processing in the Ruby chunk and have the result of this processing generate and output that is “included” in the document as if we had typed it in R markdown document.

For example, suppose we want to create a new heading in our document, but the heading phrase is the result of some code processing: maybe it’s the first line of a file we are going to read.

Method ‘outputs’ adds its output as if typed in the R markdown document.

Take now a look at variable ‘@c’ (it was defined in a previous block above) as ‘@c = “The ‘outputs’ function”.

“The ‘outputs’ function” is actually the name of this section and it was created using the ’outputs’ function inside a Ruby chunk.

The ruby chunk to generate this heading is:“`{ruby heading}outputs "### #{@c}"“`The three ‘###’ is the way we add a Heading 3 in R markdown.

HTML Output from Ruby ChunksWe’ve just seen the use of method ‘outputs’ to add text to the the R markdown document.

This technique can also be used to add HTML code to the document.

In R markdown, any html code typed directly in the document will be properly rendered.

Here, for instance, is a table definition in HTML and its output in the document:<table style="width:100%"> <tr> <th>Firstname</th> <th>Lastname</th> <th>Age</th> </tr> <tr> <td>Jill</td> <td>Smith</td> <td>50</td> </tr> <tr> <td>Eve</td> <td>Jackson</td> <td>94</td> </tr></table>FirstnameLastnameAgeJillSmith50EveJackson94But manually creating HTML output is not always easy or desirable, specially if we intend the document to be rendered in other formats, for example, as Latex.

Also, The above table looks ugly.

The ‘kableExtra’ library is a great library for creating beautiful tables.

Take a look at https://cran.



htmlIn the next chunk, we output the ‘mtcars’ dataframe from R in a nicely formatted table.

Note that we retrieve the mtcars dataframe by using ‘~:mtcars’.


install_and_loads('kableExtra')outputs (~:mtcars).


kable_stylingIncluding Ruby files in a chunkR is a language that was created to be easy and fast for statisticians to use.

As far as I know, it was not a language to be used for developing large systems.

Of course, there are large systems and libraries in R, but the focus of the language is for developing statistical models and distribute that to peers.

Ruby on the other hand, is a language for large software development.

Systems written in Ruby will have dozens, hundreds or even thousands of files.

To document a large system with literate programming, we cannot expect the developer to add all the files in a single ‘.

Rmd’ file.

gKnit provides the ‘include’ chunk engine to include a Ruby file as if it had being typed in the ‘.

Rmd’ file.

To include a file, the following chunk should be created, where is the name of the file to be included and where the extension, if it is ‘.

rb’, does not need to be added.

If the ‘relative’ option is not included, then it is treated as TRUE.

When ‘relative’ is true, ruby’s ‘require_relative’ semantics is used to load the file, when false, Ruby’s $LOAD_PATH is searched to find the file and it is ’require’d.

“`{include <filename>, relative = <TRUE/FALSE>}“`Bellow we include file ‘model.

rb’, which is in the same directory of this blog.

This code uses R ‘caret’ package to split a dataset in a train and test sets.

The ‘caret’ package is a very important a useful package for doing Data Analysis, it has hundreds of functions for all steps of the Data Analysis workflow.

To use ‘caret’ just to split a dataset is like using the proverbial cannon to kill the fly.

We use it here only to show that integrating Ruby and R and using even a very complex package as ‘caret’ is trivial with Galaaz.

A word of advice: the ‘caret’ package has lots of dependencies and installing it in a Linux system is a time consuming operation.

Method ‘R.

install_and_loads’ will install the package if it is not already installed and can take a while.

“`{include model}“`require 'galaaz'# Loads the R 'caret' package.

If not present, installs it R.

install_and_loads 'caret'class Model attr_reader :data attr_reader :test attr_reader :train #========================================================== # #========================================================== def initialize(data, percent_train:, seed: 123) R.

set__seed(seed) @data = data @percent_train = percent_train @seed = seed end #========================================================== # #========================================================== def partition(field) train_index = R.


send(field), p: @percet_train, list: false, times: 1) @train = @data[train_index, :all] @test = @data[-train_index, :all] end endmtcars = ~:mtcarsmodel = Model.

new(mtcars, percent_train: 0.


partition(:mpg)puts model.


headputs model.


headDocumenting GemsgKnit also allows developers to document and load files that are not in the same directory of the ‘.

Rmd’ file.

Here is an example of loading the ‘find.

rb’ file from TruffleRuby.

In this example, relative is set to FALSE, so Ruby will look for the file in its $LOAD_PATH, and the user does not need to no it’s directory.

“`{include find, relative = FALSE}“`# frozen_string_literal: true## find.

rb: the Find module for processing all files under a given directory.

### The +Find+ module supports the top-down traversal of a set of file paths.

## For example, to total the size of all files under your home directory,# ignoring anything in a "dot" directory (e.



ssh):## require 'find'## total_size = 0## Find.

find(ENV["HOME"]) do |path|# if FileTest.

directory?(path)# if File.

basename(path)[0] == ?.

# Find.

prune # Don't look any further into this directory.

# else# next# end# else# total_size += FileTest.

size(path)# end# end#module Find # # Calls the associated block with the name of every file and directory listed # as arguments, then recursively on their subdirectories, and so on.

# # Returns an enumerator if no block is given.

# # See the +Find+ module documentation for an example.

# def find(*paths, ignore_error: true) # :yield: path block_given?.or return enum_for(__method__, *paths, ignore_error: ignore_error) fs_encoding = Encoding.

find("filesystem") paths.

collect!{|d| raise Errno::ENOENT, d unless File.

exist?(d); d.


each do |path| path = path.

to_path if path.

respond_to?.:to_path enc = path.

encoding == Encoding::US_ASCII?.fs_encoding : path.

encoding ps = [path] while file = ps.

shift catch(:prune) do yield file.


taint begin s = File.

lstat(file) rescue Errno::ENOENT, Errno::EACCES, Errno::ENOTDIR, Errno::ELOOP, Errno::ENAMETOOLONG raise unless ignore_error next end if s.

directory?.then begin fs = Dir.

children(file, encoding: enc) rescue Errno::ENOENT, Errno::EACCES, Errno::ENOTDIR, Errno::ELOOP, Errno::ENAMETOOLONG raise unless ignore_error next end fs.


reverse_each {|f| f = File.

join(file, f) ps.

unshift f.

untaint } end end end end nil end # # Skips the current file or directory, restarting the loop with the next # entry.

If the current file is a directory, that directory will not be # recursively entered.

Meaningful only within the block associated with # Find::find.

# # See the +Find+ module documentation for an example.

# def prune throw :prune end module_function :find, :pruneendConverting to PDFOne of the beauties of knitr is that the same input can be converted to many different outputs.

One very useful format, is, of course, PDF.

In order to converted an R markdown file to PDF it is necessary to have LaTeX installed on the system.

We will not explain here how to install LaTeX as there are plenty of documents on the web showing how to proceed.

gKnit comes with a simple LaTeX style file for gknitting this blog as a PDF document.

Here is the Yaml header to generate this blog in PDF format instead of HTML:—title: "gKnit – Ruby and R Knitting with Galaaz in GraalVM"author: "Rodrigo Botafogo"tags: [Galaaz, Ruby, R, TruffleRuby, FastR, GraalVM, knitr, gknit]date: "29 October 2018"output: pdf_document: includes: in_header: [".



sty"] number_sections: yes—The PDF document can be seen at: https://www.


net/publication/332766270_How_to_do_reproducible_research_in_Ruby_with_gKnitConclusionIn order to do reproducible research, one of the main basic tools needed is a systhem that allows “literate programming” where text, code and possibly a set of files can be compiled onto a report that can be easily distributed to peers.

Peers should be able to use this same set of files to rerun the compilation by their own obtaining the exact same original report.

gKnit is such a system for Ruby and R.

It uses R Markdown to integrate text and code chunks, where code chunks can either be part of the R Markdwon file or be imported from files in the system.

Ideally, in reproducible research, all the files needed to rebuild a report should be easilly packed together (in the same zipped directory) and distributed to peers for reexecution.

One of the promises of Oracle’s GraalVM is that users/developers will be able to use the best tool for their task at hand, independently of the programming language the tool was written on.

We developed and implemented Galaaz atop the GraalVM and Truffle interop messages and the time and effort to wrap Ruby over R — Galaaz — or to wrap Knitr with gKnit was a fraction of a fraction of a fraction (one man effort for a couple of hours a day, for approximately six months) of the time require to implement the original tools.

Trying to reimplement all R packages in Ruby would require the same effort it is taking Python to implement NumPy, Pandas and all supporting libraries and it is unlikely that this effort would ever be done.

GraalVM has allowed Ruby to profit “almost for free” from this huge set of libraries and tools that make R one of the most used languages for data analysis and machine learning.

More interesting than wrapping the R libraries with Ruby, is that Ruby adds value to R, by allowing developers to use powerful and modern constructs for code reuse that are not the strong points of R.

As shown in this blog, R and Ruby can easily communicate and R can be structured in classes and modules in a way that greatly expands its power and readability.

Installing gKnitPrerequisitesGraalVM (>= rc15)TruffleRubyFastRThe following R packages will be automatically installed when necessary, but could be installed prior to using gKnit if desired:ggplot2gridExtraknitrInstallation of R packages requires a development environment and can be time consuming.

In Linux, the gnu compiler and tools should be enough.

I am not sure what is needed on the Mac.

Preparationgem install galaazUsagegknit <filename>ReferencesKnuth, Donald E.


“Literate Programming.

” Comput.


27 (2).

Oxford, UK: Oxford University Press: 97–111.






Wilkinson, Leland.


The Grammar of Graphics (Statistics and Computing).

Berlin, Heidelberg: Springer-Verlag.


. More details

Leave a Reply