Supporting R with Feedzai OpenML Engine

Paulo Pereira, Apr 26

R is a statistical programming language that is very popular among data scientists.

It offers several features to simplify the manipulation of large datasets and their visualization.

One of the greatest strengths of this language is the variety of classical and modern statistical techniques.

The capabilities of the language can be extended by installing new packages, which cover a wide range of fields, such as finance, machine learning, and social sciences.

If you want to use a statistical method in R, it is very likely that there is already a package for it.

With the introduction of OpenML, Feedzai made it possible to integrate any machine learning (ML) model into its platform.

In other words, it opened the doors to start supporting R ML models.

To this end, two new providers were created.

Caret OpenML provider

One of the most popular packages to build predictive models in R is Caret (Classification And REgression Training).

We decided to build a provider for this package due to its popularity and to the large list of classification models that it supports.

The Caret OpenML provider is a loading provider, which means that it’s possible to import a Caret model into the Feedzai platform.

However, it’s not possible to train a Caret model inside the platform.

Even though the models share a common interface, each one has its own parameters that allow it to create models with better performance.

When loading a Caret model, you need to provide the file with the Caret model and the schema of the dataset used to train that model.

To store the Caret model into a file you should use the saveRDS function.
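As a minimal sketch of that step, the snippet below trains a small Caret model and stores it with saveRDS. The dataset (iris), the rpart method, and the file name are assumptions chosen for illustration only; they are not requirements of the provider.

```r
library(caret)

set.seed(42)

# Train a small classifier on the built-in iris dataset
# (illustrative choice, not something the provider mandates).
model <- train(
  Species ~ .,
  data = iris,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5)
)

# Hypothetical file name; any path can be used.
model_path <- file.path(tempdir(), "caret-model.rds")

# Serialize the trained model to a single file.
saveRDS(model, file = model_path)

# Sanity check: the file can be read back and used for prediction.
restored <- readRDS(model_path)
predictions <- predict(restored, newdata = head(iris))
```

The resulting file is what you would hand to the provider, together with the schema of the training dataset.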

Want to try it out? Here is where you can find the Caret OpenML provider.

GitHub repository: https://github.com/feedzai/feedzai-openml-r/tree/master/openml-caret

Artifacts of the released versions: https://mvnrepository.com/artifact/com.feedzai/openml-caret

Generic R OpenML provider

So we have a specific provider for the Caret models, but what if you want to use a model that is not supported by Caret? We could create a provider for each of the most popular frameworks, but that would be very time-consuming.

In order to solve this problem we created the Generic R OpenML provider.

Similar to the Caret provider, the generic provider is also a loading provider that needs to receive the schema of the dataset used to create the model.

However, this is a generic provider and therefore it’s not possible to assume a specific format for the models to load.

Instead of receiving a file with the model, it receives an R script that should implement the following three methods.

    # Loads a model. This can be a no-op if there's no need to save state.
    loadModel <- function() {}

    # Scores the instance and returns an array with the probability for each of the classes.
    getClassDistribution <- function(instance) {}

    # Returns the predicted class of the instance.
    classify <- function(instance) {}

Here is where you can find the Generic R OpenML provider.
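To make the three functions more concrete, here is a minimal sketch of what such a script could look like. It rests on several assumptions that the provider may not share: that `instance` arrives as a one-row data frame with the training columns, that the model can be fitted in place rather than read from disk, and that `classify` may return the class as 0 or 1. Check the repository for the actual contract.

```r
# Holds the model between calls (assumed to live in the script's global state).
model <- NULL

# Loads a model. Assumption: fit a tiny logistic regression on the built-in
# mtcars dataset instead of reading a previously saved model from a file.
loadModel <- function() {
  model <<- glm(am ~ wt + hp, data = mtcars, family = binomial())
  invisible(NULL)
}

# Scores the instance and returns the probability of each class (0, then 1).
# Assumption: `instance` is a one-row data frame with columns wt and hp.
getClassDistribution <- function(instance) {
  p <- unname(predict(model, newdata = instance, type = "response"))
  c(1 - p, p)
}

# Returns the predicted class (0 or 1) as the most probable one.
classify <- function(instance) {
  which.max(getClassDistribution(instance)) - 1L
}

# Local usage example, outside of the platform.
loadModel()
instance <- mtcars[1, c("wt", "hp")]
distribution <- getClassDistribution(instance)
predicted <- classify(instance)
```

Keeping the model in a variable filled by loadModel means the scoring functions do not have to reload it for every instance.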

GitHub repository: https://github.com/feedzai/feedzai-openml-r/tree/master/openml-generic-r

Artifacts of the released versions: https://mvnrepository.com/artifact/com.feedzai/openml-generic-r

How to integrate R with Java?

One of the biggest technical challenges of these providers was how to integrate R code within a Java application.

Our platform is mostly written in Java, so it needed a way to execute R code.

In the past we had a similar problem with the Feedzai Open Scoring Server (fos-r) project, and at the time we decided to work with Rserve.

That project was implemented to run in Linux and Windows, and because of that it didn’t support concurrent connections due to a limitation of Rserve in Windows.

Because of that, this time we decided to take our time and explore the alternatives.

We ended up focusing our efforts on two frameworks.

Rserve

Rserve consists of a TCP/IP server written in C with a Java client.

The server includes the R dynamic libraries, making it easy to use.

You only have to execute three lines in order to install Rserve and launch it.

    install.packages("Rserve",,"http://rforge.net")
    library(Rserve)
    Rserve()

The client creates the connection and makes R requests to the server.

Each connection is independent from the others.

This means that the resources created in one connection are only visible on that connection.
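The isolation between connections can be checked with a short experiment. The sketch below uses the RSclient package, an R client for Rserve, rather than the Java client used by the providers. It assumes that RSclient is installed and that an Rserve server is already running on localhost on the default port 6311 of a Linux machine.

```r
library(RSclient)

# Open two separate connections to the same Rserve server
# (assumed to be listening on localhost:6311).
first <- RS.connect(host = "localhost", port = 6311L)
second <- RS.connect(host = "localhost", port = 6311L)

# Create a variable in the workspace of the first connection only.
RS.eval(first, x <- 42)

# Ask each connection whether it can see the variable.
visible_in_first <- RS.eval(first, exists("x"))
visible_in_second <- RS.eval(second, exists("x"))

RS.close(first)
RS.close(second)
```

On a non-Windows server, `visible_in_first` should be TRUE and `visible_in_second` should be FALSE, since each connection works in its own workspace.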

The Windows version of Rserve has the downside that it doesn’t support concurrent connections.

In this version, the connections are not independent and they share the same namespace and sessions.

The Feedzai platform doesn’t run on Windows, and therefore this limitation does not affect us.

rJava

rJava allows you to run R inside Java applications as a single thread, with the disadvantage that it does not support multi-threading.

Basically, it loads the R dynamic libraries inside a JVM, meaning that the connections are very fast.

Verdict

In our providers, parallelization is very important, since we want several independent models running at the same time, scoring different events in parallel while each model is isolated from the others.

That’s why we decided to use Rserve.

Another point we considered was the ease of setup.

After installing Rserve we only needed to execute two commands to initialize the server.

In rJava, after the installation we would need to set up several environment variables with paths and dynamic libraries.

Interested in the OpenML initiative? Follow our page and stay tuned for the following posts.

Feel free to raise any questions and to contribute directly on GitHub! Or if you want to take part in fighting fraud, join our team.
