We’ll assume in each case that the relationship between mpg and each of our features is linear.

```r
# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

fit
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Formula: mpg ~ wt + cyl
##
## Coefficients:
## (Intercept)          wt         cyl
##   33.499452   -2.818463   -0.923187
```

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.

```r
summary(fit)
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -1.752 -1.134 -0.499  1.296  2.282
##
## Coefficients:
## (Intercept)          wt         cyl
##   33.499452   -2.818463   -0.923187
##
## R-Squared: 0.8274
## Root Mean Squared Error: 1.422
```

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines.

You can learn more about machine learning with sparklyr, and about the package in general, here:

sparklyr: An R interface to Spark
spark.rstudio.com

2. drake — An R-focused pipeline toolkit for reproducibility and high-performance computing

Drake programming? Nope, just kidding. But the name of the package is drake!

https://github.com/ropensci/drake

This is such an amazing package. I’ll create a separate post with more details about it, so wait for that!

drake is a package created as a general-purpose workflow manager for data-driven tasks.
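To make that concrete: a drake workflow is an ordinary R script that declares a plan of targets and then asks drake to build them. The sketch below is illustrative only; drake_plan() and make() are the package’s real entry points, but the two targets and the clean() and fit_model() helpers are placeholders I made up.

```r
library(drake)

# A plan maps target names to the R commands that produce them.
plan <- drake_plan(
  clean_data = clean(raw_values),     # 'clean' is a placeholder function
  model_fit  = fit_model(clean_data)  # ...and so is 'fit_model'
)

# Build every target (the first call runs everything).
make(plan)

# Edit fit_model() and call make(plan) again: drake re-runs
# only model_fit, because clean_data is still up to date.
```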
It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date. Not every run-through starts from scratch, and completed workflows provide tangible evidence of reproducibility.

Reproducibility, good management, and tracking experiments are all necessary for easily testing others’ work and analysis. It’s a huge deal in data science, and you can read more about it in these pieces from Zach Scott:

Data Science’s Reproducibility Crisis
towardsdatascience.com

Toward Reproducibility: Balancing Privacy and Publication
towardsdatascience.com

And in an article by me :)

Manage your Machine Learning Lifecycle with MLflow — Part 1
towardsdatascience.com

With drake, you can automatically:

- Launch the parts that changed since last time.
- Skip the rest.

Installation:

```r
# Install the latest stable release from CRAN.
install.packages("drake")

# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")
```

There are some known errors when installing from CRAN; for more on these, see the drake R Package User Manual (ropenscilabs.github.io). I ran into one of them myself, so for now I recommend installing the package from GitHub.

OK, so let’s reproduce a simple example with a twist: I added a simple plot of the linear model to drake’s main example. With this code, you create a plan for executing your whole project. First, we read the data.
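A plan along these lines could be sketched as follows. This is a reconstruction from the description, not the post’s exact code: the helper functions (prepare(), create_hist(), create_plot()) and the file names are assumptions, while file_in(), knitr_in(), and file_out() are drake’s real file-tracking markers.

```r
library(drake)

plan <- drake_plan(
  raw_data    = read.csv(file_in("data.csv")),   # read the data
  data        = prepare(raw_data),               # prepare it for analysis
  data_hist   = create_hist(data),               # a simple histogram
  correlation = cor(data$x, data$y),             # calculate the correlation
  fit         = lm(y ~ x, data = data),          # fit the model
  lm_plot     = create_plot(fit, data),          # plot the linear model
  report      = rmarkdown::render(               # the final report
    knitr_in("report.Rmd"),
    output_file = file_out("report.html")
  )
)

make(plan)
```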
Then we prepare it for analysis, create a simple histogram, calculate the correlation, fit the model, plot the linear model, and finally create an R Markdown report.

The code I used for the final report is here:

If we change some of our functions or analysis, then the next time we execute the plan, drake will know what has changed and will run only those parts. It creates a graph so you can see what’s happening:

Graph for the analysis

In RStudio this graph is interactive, and you can save it as HTML for later analysis.

There are more awesome things you can do with drake that I’ll show in a future post :)

3. DALEX — Descriptive mAchine Learning EXplanations

https://github.com/pbiecek/DALEX

Explaining machine learning models isn’t always easy, yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. For example:

thomasp85/lime — Local Interpretable Model-Agnostic Explanations (R port of the original Python package)
github.com

(By the way, sometimes a simple visualization with ggplot can help you explain a model.
For more on this, check the awesome article below by Matthew Mayo.)

Interpreting Machine Learning Models: An Overview
www.kdnuggets.com

In many applications, we need to know, understand, or prove how input variables are used in a model and how they impact the final predictions. DALEX is a set of tools that helps explain how complex models work.

To install from CRAN, just run:

```r
install.packages("DALEX")
```

They have amazing documentation on how to use DALEX with different ML packages:

- How to use DALEX with caret
- How to use DALEX with mlr
- How to use DALEX with H2O
- How to use DALEX with the xgboost package
- How to use DALEX for teaching, Part 1
- How to use DALEX for teaching, Part 2
- breakDown vs lime vs shapleyR

Great cheat sheets are available in the repository:
https://github.com/pbiecek/DALEX

Here’s an interactive notebook where you can learn more about the package:
mybinder.org

And finally, some book-style documentation on DALEX, machine learning, and explainability:

DALEX: Descriptive mAchine Learning EXplanations. Do not trust a black-box model. Unless it explains itself.
pbiecek.github.io

Check it out in the original repository, pbiecek/DALEX (github.com), and remember to star it :)

Thanks to the amazing team at Ciencia y Datos for helping with these digests.

Thanks also for reading this. I hope you found something interesting here :) If these articles are helping you, please share them with your friends!

If you have questions, just follow me on Twitter:

Favio Vázquez (@FavioVaz) | Twitter
twitter.com

and LinkedIn:

Favio Vázquez — Founder — Ciencia y Datos | LinkedIn
View Favio Vázquez’s profile on LinkedIn, the world’s largest professional community.
Favio has 16 jobs listed on their…
www.linkedin.com

See you there :)

Editor’s Note: 2018 has been an incredible year for AI. Check out the following Heartbeat recaps that detail all the year’s best and brightest:

- 2018 Year-in-Review: AI & Machine Learning Conferences
- Best of Machine Learning in 2018: Reddit Edition
- 2018 Year-in-Review: Machine Learning Open Source Projects & Frameworks

Get a jumpstart on 2019 by joining Heartbeat on Slack to chat with the author and a growing community of machine learners, mobile developers, and more. More details