Step-by-Step Guide to Creating R and Python Libraries (in JupyterLab)

Sean McClure · Mar 30

R and Python are the bread and butter of today's machine learning languages.

R provides powerful statistics and quick visualizations, while Python offers an intuitive syntax, abundant support, and is the choice interface to today’s major AI frameworks.

In this article we’ll look at the steps involved in creating libraries in R and Python.

This is a skill every machine learning practitioner should have in their toolbox.

Libraries help us organize our code and share it with others, offering packaged functionality to the data community.

NOTE: In this article I use the terms “library” and “package” interchangeably.

While some people differentiate these words I don’t find this distinction useful, and rarely see it done in practice.

We can think of a library (or package) as a directory of scripts containing functions.

Those functions are grouped together to help engineers and scientists solve challenges.

THE IMPORTANCE OF CREATING LIBRARIES

Building today's software doesn't happen without extensive use of libraries.

Libraries dramatically cut down the time and effort required for a team to bring work to production.

By leveraging the open source community, engineers and scientists can bring their unique contributions to a larger audience and improve the quality of their code.

Companies of all sizes use these libraries to build their work on top of existing functionality, making product development more productive and focused.

But creating libraries isn’t just for production software.

Libraries are critical to rapidly prototyping ideas, helping teams validate hypotheses and craft experimental software quickly.

While popular libraries enjoy massive community support and a set of best practices, smaller projects can be converted into libraries overnight.

By learning to create lighter-weight libraries we develop an ongoing habit of maintaining code and sharing our work.

Our own development is sped up dramatically, and we anchor our coding efforts around a tangible unit of work we can improve over time.

ARTICLE SCOPE

In this article we will focus on creating libraries in R and Python as well as hosting them on, and installing from, GitHub.

This means we won’t look at popular hosting sites like CRAN for R and PyPI for Python.

These are extra steps that are beyond the scope of this article.

Focusing only on GitHub helps encourage practitioners to develop and share libraries more frequently.

CRAN and PyPI have a number of criteria that must be met (and they change frequently), which can slow down the process of releasing our work.

Rest assured, it is just as easy for others to install our libraries from GitHub.

Also, the steps for CRAN and PyPI can always be added later should you feel your library would benefit from a hosting site.

We will build both R and Python libraries using the same environment (JupyterLab), with the same high-level steps for both languages.

This should help you build a working knowledge of the core steps required to package your code as a library.

Let’s get started.

SETUP

We will be creating a library called datapeek in both R and Python.

The datapeek library is a simple package offering a few useful functions for handling raw data.

These functions are:

- encode_and_bind
- remove_features
- apply_function_to_column
- get_closest_string

We will look at these functions later.

For now we need to set up R and Python environments to create datapeek, along with a few libraries to support packaging our code.

We will be using JupyterLab inside a Docker container, along with a “docker stack” that comes with the pieces we need.

Install and Run Docker

The Docker stack we will use is called jupyter/datascience-notebook.

This image contains both R and Python environments, along with many of the packages typically used in machine learning.

Since these run inside Docker you must have Docker installed on your machine.

So install Docker if you don't already have it, and once installed, run the following in a terminal to pull the datascience-notebook:

```
docker pull jupyter/datascience-notebook
```

This will pull the most recent image hosted on Docker Hub.

NOTE: Anytime you pull a project from Docker Hub you get the latest build.

If some time passes since your last pull, pull again to update your image.

Immediately after running the above command you will see Docker begin pulling the image layers. Once everything has been pulled, we can confirm our new image exists by running:

```
docker images
```

Now that we have our Docker stack, let's set up JupyterLab.

JupyterLab

We will create our libraries inside a JupyterLab environment.

JupyterLab is a web-based user interface for programming.

With JupyterLab we have a lightweight IDE in the browser, making it convenient for building quick applications.

JupyterLab provides everything we need to create libraries in R and Python, including:

- a terminal environment for running shell commands and downloading/installing libraries;
- an R and Python console for working interactively with these languages;
- a simple text editor for creating files with various extensions;
- Jupyter Notebooks for prototyping ML work.

The datascience-notebook we just pulled contains an installation of JupyterLab so we don’t need to install this separately.

Before running our Docker image we need to mount a volume to ensure our work is saved outside the container.

First, create a folder called datapeek on your desktop (or anywhere you wish) and change into that directory.

We need to run our Docker container with JupyterLab, so our full command looks as follows:

```
docker run -it -v `pwd`:/home/jovyan/work -p 8888:8888 jupyter/datascience-notebook start.sh jupyter lab
```

You can learn more about Docker commands here.

Importantly, the above command exposes our environment on port 8888, meaning we can access our container through the browser.

After running the above command, the output ends with a URL we should copy and paste into the browser. Open your browser, add the link in the address bar and hit enter (your token will be different):

```
localhost:8888/?token=11e5027e9f7cacebac465d79c9548978b03aaf53131ce5fd
```

This will automatically open JupyterLab in your browser as a new tab. We are now ready to start building libraries.

We begin this article with R, then look at Python.

CREATING LIBRARIES IN R

R is one of the "big 2" languages of machine learning.

At the time of this writing it has well over 10,000 libraries.

Going to Available CRAN Packages By Date of Publication and running the following in the browser console gives me 13,858:

```javascript
document.getElementsByTagName('tr').length
```

Minus the header and final row, this gives 13,856 packages.

Needless to say, R is not short on variety.

With strong community support and a concise (if not intuitive) language, R sits comfortably at the top of statistical languages worth learning.

The most well-known treatise on creating R packages is Hadley Wickham's book R Packages.

Its contents are available for free online.

For a deeper dive on the topic I recommend looking there.

We will use Hadley’s devtools package to abstract away the tedious tasks involved in creating packages.

devtools is already installed in our Docker Stacks environment.

We also require the roxygen2 package, which helps us document our functions.

Since this doesn’t come pre-installed with our image let’s install that now.

NOTE: From now on we’ll use the terminal in JupyterLab in order to conveniently keep our work within the browser.

Open terminal inside JupyterLab’s Launcher:NOTE: If you’d like to change your JupyterLab to dark theme, click on Settings at the top, JupyterLab Theme, then JupyterLab Dark:Inside the console type R, then….

install.

packages("roxygen2")library("roxygen2")With the necessary packages installed we’re ready to tackle each step.

STEP 1: Create Package Framework

We need to create a directory for our package.

We can do this in one line of code, using the devtools create function.

In the R console run:

```r
devtools::create("datapeek")
```

This automatically creates the bare-bones files and directories needed to define our R package.

In JupyterLab you will see a set of new folders and files created on the left side.

NOTE: You will also see your new directory structure created on your desktop (or wherever you chose to create it) since we mounted a volume to our container during setup.

If we inspect our package in JupyterLab we now see:

```
datapeek
├── R
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE
```

The R folder will eventually contain our R code. The datapeek.Rproj file is specific to the RStudio IDE, so we can ignore it. The DESCRIPTION file holds our package's metadata (a detailed discussion can be found here). Finally, NAMESPACE is a file that ensures our library plays nicely with others, and is more of a CRAN requirement.

Naming Conventions

We must follow these rules when naming an R package:

- it must be unique on CRAN (you can check all current R libraries here);
- it can only consist of letters, numbers and periods;
- it cannot contain an underscore or hyphen;
- it must start with a letter;
- it cannot end in a period.

You can read more about naming packages here.

Our package name “datapeek” passes the above criteria.

Let’s head over to CRAN and do a Command+F search for “datapeek” to ensure it’s not already taken:Command + F search on CRAN to check for package name uniqueness.

…look’s like we’re good.

STEP 2: Fill Out Description Details

The job of the DESCRIPTION file is to store important metadata about our package.

These include other packages required to run our library, our license, and our contact information.

Technically, the definition of a package in R is any directory containing a DESCRIPTION file, so always ensure this is present.

Click on the DESCRIPTION file in JupyterLab’s directory listing.

You will see the basic details created automatically when we ran devtools::create("datapeek"). Let's add our specific details so our package contains the necessary metadata.

Simply edit this file inside JupyterLab.

Here are the details I am adding:

```
Package: datapeek
Title: Provides useful functions for working with raw data.
Version: 0.0.0.1
Authors@R: person("Sean", "McClure", email = "sean.mcclure@example.com", role = c('aut', 'cre'))
Description: The datapeek package helps users transform raw data for machine learning development.
Depends: R (>= 3.5.1)
License: MIT
Encoding: UTF-8
LazyData: true
```

Of course you should fill out these parts with your own details.

STEP 3: Add Functions

3A: Add Functions to R Folder

Our library wouldn't do much without functions.

Let’s add the 4 functions mentioned in the beginning of this article.

The gist accompanying the original article shows these functions in R; a sketch follows the steps below. We have to add our functions to the R folder, since this is where R looks for any functions inside a library. Since our library only contains 4 functions, we will place all of them into a single file called utilities.R, residing inside the R folder. Let's do that now:

1. Go into the datapeek directory in JupyterLab and open the R folder.
2. Click on Text File in the Launcher and paste in our 4 R functions.
3. Right-click the file and rename it to utilities.R.
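Since the embedded gist isn't visible in this text, here is a minimal sketch of the four functions. The bodies are my own assumptions based on how the functions are described and used later in the article (one-hot encoding via mltools and data.table, base R string distance for matching):

```r
library(data.table)
library(mltools)

# One-hot encode a feature and bind the encoded columns to the original frame
encode_and_bind <- function(df, feature) {
  dt <- as.data.table(df[feature])
  dt[[feature]] <- as.factor(dt[[feature]])
  cbind(df, one_hot(dt))
}

# Drop the named features from a data frame
remove_features <- function(df, features) {
  df[, !(names(df) %in% features), drop = FALSE]
}

# Apply an expression in x to a column, storing the result in a new column
apply_function_to_column <- function(df, column, new_column_name, func) {
  df[[new_column_name]] <- sapply(df[[column]], function(x) eval(parse(text = func)))
  df
}

# Return the string in a vector with the smallest edit distance to the target
get_closest_string <- function(strings, target) {
  strings[which.min(adist(strings, target))]
}
```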

3B: Export our FunctionsIt isn’t enough to have a bunch of R functions in our file.

Each function must be exported to expose them to users of our library.

This is accomplished by adding the @export tag above each function.

The export syntax comes from Roxygen, and ensures our function gets added to the NAMESPACE.

Let’s add the @export tag to our first function:Do this for the remaining function as well.

3C: Document our Functions

It is important to document our functions. This also comes from roxygen2: it means that when a user of our package types ?datapeek they will get information about our package. There are 2 things we will do here:

- add the documentation annotations;
- run devtools::document().

— Add the Documentation Annotations

Documentation is added above each function, directly above its #' @export line. Add the annotations to the first function, then to the remaining functions. You can read more about documenting functions here.

— Run devtools::document()

With documentation added to our functions, we then run the following (be sure to change directories into datapeek first):

```r
devtools::document()
```

You may get the error:

```
Error: 'roxygen2' >= 5.0.0 must be installed for this functionality.
```

In this case open your terminal in JupyterLab and install roxygen2. You should also install data.table and mltools, since our first function uses these:

```r
install.packages('roxygen2')
install.packages('data.table')
install.packages('mltools')
```

Run devtools::document() again. This will generate .Rd files inside a new man folder; you'll notice one .Rd file is created for each function in our package. If you look at your DESCRIPTION file it will now show a new line at the bottom (the RoxygenNote version). This will also generate a NAMESPACE file, where we can see our 4 functions have been exposed.
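The generated NAMESPACE should look something like:

```
# Generated by roxygen2: do not edit by hand

export(apply_function_to_column)
export(encode_and_bind)
export(get_closest_string)
export(remove_features)
```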

STEP 4: List External Dependencies

It is common for our functions to require functions found in other libraries. There are 2 things we must do to ensure external functionality is made available to our library's functions:

- use double colons to specify which library we are relying on;
- add imports to our DESCRIPTION file.

You’ll notice above we simply listed our libraries at the top.

While this works well in stand-alone R scripts it isn’t the way to use dependencies in an R package.

When creating R packages we must use the “double-colon approach” to …The only function in our datapeek package requiring additional packages is our first one:Using the double-colon approach to specify dependent packages in R.

Notice each time we call an external function we must preface it with the external library and double colons.

Note the use of double-colons between the name of the library we are using and the function we are using from that library.

Above, we are adding mltools:: and data.

table:: before the appropriate functions.
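Applied to the earlier sketch, the first function becomes (body assumed, as before):

```r
encode_and_bind <- function(df, feature) {
  dt <- data.table::as.data.table(df[feature])
  dt[[feature]] <- as.factor(dt[[feature]])
  cbind(df, mltools::one_hot(dt))
}
```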

Now we can take the second step of adding our imports to the DESCRIPTION file. NOTE: any packages our library depends on must be listed as additional comma-separated lines under Imports in the DESCRIPTION file.
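For datapeek, assuming just the two packages our first function uses, the entry reads:

```
Imports:
    data.table,
    mltools
```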

Notice we didn’t specify any versions for our external dependencies.

If we need to we can specify it in parentheses after the package name:Imports: data.

table (>= 1.

12.

0)Since our encode_and_bind function isn’t taking advantage of any bleeding-edge updates I will leave it without any version specified.

STEP 5: Add Data

Sometimes it makes sense to include data inside our library.

This makes it easier for us to demonstrate to users how to use the functions inside our library.

It also helps with testing, since machine learning packages will always contain functions that ingest and transform data.

The 4 options for adding external data to an R package are:

- binary data
- parsed data
- raw data
- serialized data

You can learn more about these different approaches here.

For this article we will stick with the most common approach, which is to add binary data to the package's data folder.

Let’s add the Iris dataset to our library in order to provide users a quick way to test our functions.

The data must be in the .rda format, created using R's save() function, and the object must have the same name as the file. We can ensure all of this is taken care of by using the devtools use_data function:

```r
iris <- read.csv("http://bit.ly/2HuTS0Z")
devtools::use_data(iris)
```

Above, I read in the Iris dataset from its URL and pass the data frame to devtools::use_data().

In JupyterLab we see a new data folder has been created, along with our iris.rda dataset:

```
datapeek
├── data
├── man
├── R
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE
```

We will use our added dataset to run tests in the following section.

STEP 6: Add Tests

Testing is an important part of software development.

Testing helps ensure our code works as expected, and makes debugging our code a much faster and more effective process.

Learn more about testing R packages here.

In data science we often work in REPL environments like Jupyter Notebooks, where we can run our code at any time to see the output it produces.

"Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead." — Martin Fowler

The above quote from Martin is a good way to think about when to write tests.

If you prototype applications regularly you’ll find yourself writing to the console frequently to see if a piece of code returns what you expect.

In data science, writing interactive code is even more common, since machine learning work is highly experimental.

On one hand this provides ample opportunity to think about which tests to write.

On the other hand, the non-deterministic nature of machine learning code means testing it has its own unique challenges.

This conversation is beyond the scope of this article.

Suffice it to say it’s a good idea to add at least a few tests to the libraries we create.

Testing our library means others can quickly make custom changes to our library, or extend it into their own library, and ensure their changes don’t break the code we already wrote.

While much of the testing we do in data science is manual, what we want in our packages is automated testing.

Automated testing helps us when we revisit our code months later.

Just as important, it makes the entire codebase more maintainable when we collaborate with others.

While there are many kinds of tests in software, here we are talking about "unit tests." Thinking in terms of unit tests forces us to break up our code into more modular components, which is good practice in software design.

NOTE: If you are used to testing in languages like Python, note that R is more functional in nature (i.e., methods belong to functions, not classes), so there will be some differences.

There are 2 parts:

- 6A: creating the tests/testthat folder;
- 6B: writing tests.

— 6A: Creating the tests/testthat folder

Just as R expects our R scripts and data to be in specific folders, it also expects tests to be located in a specific folder. To create the folder, we run the following in JupyterLab's R console:

```r
devtools::use_testthat()
```

You may get the following error:

```
Error: 'testthat' >= 1.0.2 must be installed for this functionality.
```

If so, use the same approach we used above for installing roxygen2 in Jupyter's terminal:

```r
install.packages('testthat')
```

After running you should see:

```
* Adding testthat to Suggests
* Creating `tests/testthat`.
* Creating `tests/testthat.R` from template.
```

There should now be a new tests folder in our main directory:

```
datapeek
├── data
├── man
├── R
├── tests
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE
```

We also created a file called testthat.R inside the tests folder.

This file runs all your tests when R CMD check runs (we'll look at that in a bit). You'll also notice testthat has been added under Suggests in our DESCRIPTION file.

— 6B: Writing Tests

testthat is the most popular unit testing package for R, used by at least 2,600 CRAN packages, not to mention libraries on GitHub. You can check out the latest news regarding testthat on the Tidyverse page here. Also, here is the documentation.

There are 3 parts to consider when testing with testthat:

- expectation (assertion): the expected result of a computation;
- test: groups together multiple expectations from a single function, or related functionality from across multiple functions;
- file: groups together multiple related tests. Files are given a human-readable name with context().

Assertions

Assertions are the functions included in the testing library we choose. We use assertions to check whether our own functions return the expected output. Assertions come in many flavors, depending on what is being checked. In this section I will cover the main ones used in R programming, showing each one failing its test.

Equality Assertions

expect_equal(), expect_identical(), expect_equivalent()

```r
# test for equality
a <- 10
expect_equal(a, 14)
> Error: `a` not equal to 14.

# test for identical
expect_identical(42, 2)
> Error: 42 not identical to 2.

# test for equivalence
expect_equivalent(10, 12)
> Error: 10 not equivalent to 12.
```

There are subtle differences between the examples above. For example, expect_equal is used to check for equality within a numerical tolerance, while expect_identical tests for exact equivalence:

```r
expect_equal(10, 10 + 1e-7)     # true
expect_identical(10, 10 + 1e-7) # false
```

As you write more tests you'll understand when to use which one.

Of course always refer to the documentation referenced above when in doubt.

Testing for String Matches

expect_match()

```r
# test for string matching
expect_match("Machine Learning is Fun", "But also rewarding.")
> Error: "Machine Learning is Fun" does not match "But also rewarding.".
```

Testing for Length

expect_length()

```r
# test for length
vec <- 1:10
expect_length(vec, 12)
> Error: `vec` has length 10, not length 12.
```

Testing for Comparison

expect_lt(), expect_gt()

```r
# test for less than
a <- 11
expect_lt(a, 10)
> Error: `a` is not strictly less than 10. Difference: 1

# test for greater than
a <- 11
expect_gt(a, 12)
> Error: `a` is not strictly more than 12. Difference: -1
```

Testing for Logic

expect_true(), expect_false()

```r
# test for truth
expect_true(5 == 2)
> Error: 5 == 2 isn't true.

# test for false
expect_false(2 == 2)
> Error: 2 == 2 isn't false.
```

Testing for Outputs

expect_output(), expect_message()

```r
# test for output
expect_output(str(mtcars), "31 obs")
> Error: `str(mtcars)` does not match "31 obs".

# test for message
f <- function(x) {
  if (x < 0) {
    message("*x* is already negative")
  }
}
expect_message(f(1))
> Error: `f(1)` did not produce any messages.
```

There are many more included in the testthat library.

If you are new to testing, start writing a few simple ones to get used to the process.

With time you’ll build an intuition around what to test and when.

Writing Tests

A test is a group of assertions. We write tests in testthat as follows:

```r
test_that("this functionality does what it should", {
  # group of assertions here
})
```

We have both a description (the test name) and the code (containing the assertions). The description completes the sentence "test that…"; above, we are saying "test that this functionality does what it should." The assertions are the things we want to test. For example:

```r
test_that("trigonometric functions match identities", {
  expect_equal(sin(pi / 4), 1 / sqrt(2))
  expect_equal(cos(pi / 4), 1 / sqrt(10))
  expect_equal(tan(pi / 4), 1)
})
> Error: Test failed: 'trigonometric functions match identities'
```

NOTE: The discussion above on cohesion and coupling remains true for our test files. As stated in Hadley's book, "the two extremes are clearly bad (all tests in one file, one file per test). You need to find a happy medium that works for you. A good starting place is to have one file of tests for each complicated function."

Creating Files

The last thing we do in testing is create files.

A “file” in testing is a group of tests covering a related set of functionality.

Our test file must live inside the tests/testthat/ directory. Here is an example test file from the stringr package on GitHub: the file is called test-case.R (it starts with "test") and lives inside the tests/testthat/ directory. The context() at the top simply allows us to provide a short description of the file's contents, which appears in the console when we run our tests.

Let’s create our test file, which will contain tests and assertions related to our 4 functions.

As usual, we use JupyterLab’s Text File in its Launcher to create and rename a new file:Creating a Test File in RNow let’s add our tests:For the first function I am going to make sure a data frame with the correct number of features is returned:Notice how we called our encode_and_bind function, then simply checked the equality between the dimensions and the expected output.

We run our automated tests at any point to ensure our test files runs and that we get the expected output.

Running devtools::test() in the console runs our tests:We get a smiley face too!Let’s add the rest of our tests and then run the full “test suite” to see….

Since our second function removes a specified feature, I will use the same kind of test as above, checking the dimensions of the returned frame. Our third function applies a specified function to a chosen column, so I will write a test that checks the result of a given function. Finally, our fourth function returns the closest matching string, so I will simply check the returned string for a known input.
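The article's test-file gist isn't reproduced here; below is a minimal sketch consistent with the descriptions above. The expected dimensions and values are my assumptions about the functions' behavior and the bundled iris data:

```r
context("datapeek utility functions")

load("../../data/iris.rda")

test_that("encode_and_bind returns the correct number of features", {
  result <- encode_and_bind(iris, 'species')
  expect_equal(dim(result)[2], 8)  # 5 original columns plus 3 one-hot columns
})

test_that("remove_features drops the specified columns", {
  result <- remove_features(iris, c('petal_length', 'petal_width'))
  expect_equal(dim(result)[2], 3)
})

test_that("apply_function_to_column applies the given function", {
  result <- apply_function_to_column(iris, 'sepal_length', 'times_4', 'x*4')
  expect_equal(result$times_4[1], iris$sepal_length[1] * 4)
})

test_that("get_closest_string returns the closest match", {
  result <- get_closest_string(c('hey there', 'we are', 'howdy doody'), 'doody')
  expect_equal(result, 'howdy doody')
})
```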

Here is our full test file (sketched above). NOTE: the path to the data in the test file must use a ../../ prefix, since the test file is inside its own directory.

Testing our Package

As we did above, we run our tests using the following command:

```r
devtools::test()
```

This will run all tests in any test files we placed inside the testthat directory.

Running our tests from above produces a summary: we had 5 assertions across 4 unit tests, placed in one test file.

Looks like we’re good.

If any of our tests had failed, we would see it flagged in this printout.

It’s good practice to write tests continuously, as we add more functions to our library.

STEP 7: Create Documentation

This has traditionally been done using "vignettes" in R.

You can learn about creating R vignettes for your R package here.

Personally, I find this a dated approach to documentation.

I prefer to use things like Sphinx or Julep.

Documentation should be easily shared, searchable and hosted.

Click on the question mark at julepcode.com to learn how to use Julep. I created and hosted some simple documentation for our R datapeek library on Julep, which you can find here.

Of course we will also have the library on GitHub, which I cover below.

STEP 8: Share your R Library

As I mentioned in the introduction, we should be creating libraries on a regular basis so others can benefit from and extend our work.

The best way to do this is through GitHub, which is the standard way to distribute and collaborate on open source software projects.

In case you’re new to GitHub here’s a quick tutorial to get you started so we can push our datapeek project to a remote repo.

Sign up/in to GitHub and create a new repository, which will provide us with the usual screen. With our remote repo set up, we can initialize our local repo on our machine and send our first commit.

Open Terminal in JupyterLab and change into the datapeek directory. Initialize the local repo:

```
git init
```

Add the remote origin (your link will be different):

```
git remote add origin https://github.com/sean-mcclure/datapeek.git
```

Now run git add . to add all modified and new (untracked) files in the current directory and all subdirectories to the staging area:

```
git add .
```

Don't forget the "dot" in the above command.

Now we can commit our changes, which adds any new code to our local repo. But since we are working inside a Docker container, the username and email associated with our local repo cannot be autodetected. We can set these by running the following in the terminal:

```
git config --global user.email {emailaddress}
git config --global user.name {name}
```

Use the email address and username you use to sign into GitHub.

Now we can commit:

```
git commit -m 'initial commit'
```

With our new code committed we can do our push, which transfers the last commit(s) to our remote repo:

```
git push origin master
```

NOTE: Since we are in Docker you'll likely get asked for authentication. Simply add your GitHub username and password when prompted, then run the above command again.

Some readers will notice we didn't place a .gitignore file in our directory. It is usually fine to push all files inside smaller R libraries. For larger libraries, or libraries containing large datasets, you can use the site gitignore.io to see what common .gitignore files look like. A common .gitignore file for an R package follows.
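The article's example image isn't shown here; these entries come from the standard R template on gitignore.io:

```
.Rproj.user
.Rhistory
.RData
.Ruserdata
```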

To recap, git add adds all modified and new (untracked) files in the current directory to the staging area.

Commit adds any changes to our local repo, and push transfers the last commit(s) to our remote repo.

While git add might seem superfluous, it exists because sometimes we want to commit only certain files; staging lets us select them before committing.

Above, we staged all files by using the “dot” after git add.

Now, anyone can use our library.

????.Let’s see how.

STEP 9: Install your R Library

All anyone has to do is run the following command from their local machine:

```r
devtools::install_github("yourusername/mypackage")
```

So, if I wanted to share datapeek (or something more interesting) with my team, I would run the following:

```r
devtools::install_github("sean-mcclure/datapeek")
```

This will install our package like any other package we get from CRAN. We then load the library as usual and we're good to go:

```r
library(datapeek)
```

I think using GitHub is best for most cases, since I encourage people to build packages regularly, and there are no strict criteria for adding and sharing libraries on GitHub. However, we often see R libraries on CRAN. There are more steps involved to host your libraries on CRAN, which you can read about here.

CREATING LIBRARIES IN PYTHON

Creating Python libraries follows the same high-level steps we saw previously for R.

We require a basic directory structure with proper naming conventions, functions with descriptions, imports, specified dependencies, added datasets, documentation, and the ability to share and allow others to install our library.

We will use JupyterLab to build our Python library, just as we did for R.

Library vs Package vs Module

In the beginning of this article I discussed the difference between a "library" and a "package", and how I prefer to use these terms interchangeably.

The same holds for Python libraries.

“Modules” are another term, and in Python simply refer to any file containing Python code.

Python libraries obviously contain modules as scripts.

Before we start: I stated in the introduction that we will host and install our libraries on and from GitHub.

This encourages rapid creation and sharing of libraries without getting bogged down by publishing criteria on popular package hosting sites for R and Python.

The most popular hosting site for Python is the Python Package Index (PyPI).

This is a place for finding, installing and publishing Python libraries. Whenever you run pip install <package_name> (or easy_install) you are fetching a package from PyPI.

While we won’t cover hosting our package on PyPI it’s still a good idea to see if our library name is unique.

This will minimize confusion with other popular Python libraries and improve the odds our library name is distinctive, should we decide to someday host it on PyPI.

First, we should follow a few naming conventions for Python libraries.

Python Library Naming Conventions

- Use all lowercase;
- make the name unique on PyPI (search for the name on PyPI);
- no hyphens (you can use underscores to separate words).

Our library name is datapeek, so the first and third criteria are met; let's check PyPI for uniqueness. All good. We're now ready to move through each step required to create a Python library.

STEP 1: Create Package Framework

JupyterLab should be up and running as per the instructions in the setup section of this article. Use JupyterLab's New Folder and Text File options to create the following directory structure and files:

```
datapeek
├── datapeek
│   ├── __init__.py
│   └── utilities.py
├── setup.py
```

We will refer to the inner datapeek folder as the "module directory" and the outer datapeek directory as the "root directory."

”The following video shows me creating our directory in JupyterLab:There will be files we do not want to commit to source control.

These are files that are created by the Python build system.

As such, let’s also add the following .

gitignore file to our package framework:NOTE: At the time of this writing, JupyterLab still lacks a front-end setting to toggle hidden files in the browser.

As such, we will simply name our file gitignore (no preceding dot); we will change it to a hidden file later prior to pushing to GitHub.
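The embedded example isn't visible in this text; a reasonable Python .gitignore for a project like this covers build output and caches:

```
__pycache__/
*.py[cod]
build/
dist/
*.egg-info/
.eggs/
```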

Add your gitignore file as a simple text file to the root directory:

```
datapeek
├── datapeek
│   ├── __init__.py
│   └── utilities.py
├── setup.py
├── gitignore
```

STEP 2: Fill Out Description Details

Just as we did for R, we should add metadata about our new library.

We do this using Setuptools.

Setuptools is a Python library designed to facilitate packaging Python projects.

Open setup.py and add the basic details for our library, sketched below. Of course you should change the authoring to your own. We will add more details to this file later. You can learn more about what can be added to the setup.py file here.
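The screenshot of the initial setup.py isn't reproduced here; a minimal version consistent with the rest of the article would look something like this (the description and email are placeholders):

```python
from setuptools import setup, find_packages

setup(
    name='datapeek',
    version='0.1',
    description='Useful functions for working with raw data.',
    author='Sean McClure',
    author_email='sean.mcclure@example.com',
    packages=find_packages()
)
```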

STEP 3: Add Functions

Our library obviously requires functions to be useful. For larger libraries we would organize our modules so as to balance cohesion/coupling, as discussed in the introduction. Since our library is small, we will simply keep all functions inside a single file. We will add the same functions we did above for R, this time written in Python (sketched below). Add these functions to the utilities.py module, inside datapeek's module directory.

STEP 4: List External Dependencies

Our library will often require other packages as dependencies.

Our user’s Python environment will need to be aware of these when installing our library (so these other packages can also be installed).

Setuptools provides the install_requires keyword to list any packages our library depends on.

Our datapeek library depends on the fuzzywuzzy package for fuzzy string matching, and the pandas package for high-performance manipulation of data structures.

To specify our dependencies, add the following to your setup.py file:

```python
install_requires=[
    'fuzzywuzzy',
    'pandas'
]
```

We can confirm all is in order by running the following in a JupyterLab terminal session, from datapeek's root directory:

```
python setup.py develop
```

After running the command you should see output with an ending that reads:

```
Finished processing dependencies for datapeek==0.1
```

If one or more of our dependencies is not available on PyPI, but is available on GitHub (e.g. a bleeding-edge machine learning package only available on GitHub, or another of our team's libraries hosted only on GitHub), we can use dependency_links inside our setup call:

```python
setup(
    # ...other arguments as before...
    dependency_links=['http://github.com/user/repo/tarball/master#egg=package-1.0'],
)
```

If we want to add additional metadata, such as status, licensing, language version, etc., we can use classifiers like this:

```python
setup(
    # ...other arguments as before...
    classifiers=[
        'Development Status :: 3 - Alpha',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 2.7',
        'Topic :: Text Processing :: Linguistic',
    ],
)
```

To learn more about the different classifiers that can be added to our setup.py file, see here.

STEP 5: Add Data

Just as we did above in R, we can add data to our Python library.

In Python these are called Non-Code Files and can include things like images, data, documentation, etc.

We add data to our library’s module directory, so that any code that requires those data can use a relative path from the consuming module’s __file__ variable.

Let’s add the Iris dataset to our library in order to provide users a quick way to test our functions.

First, use the New Folder button in JupyterLab to create a new folder called data inside the module directory:

```
datapeek
├── datapeek
│   ├── __init__.py
│   ├── utilities.py
│   └── data
├── setup.py
├── gitignore
```

…then make a new Text File inside the data folder called iris.csv, and paste the data from here into the new file. If you close and reopen the new CSV file, it will render inside JupyterLab as a proper table.

We specify non-code files using a MANIFEST.in file. Create another Text File called MANIFEST.in, placing it inside your root folder:

```
datapeek
├── datapeek
│   ├── __init__.py
│   ├── utilities.py
│   └── data
├── MANIFEST.in
├── setup.py
├── gitignore
```

…and add this line to the file:

```
include datapeek/data/iris.csv
```

We also need to include the following line in our setup.py call:

```python
include_package_data=True
```

Our setup.py file should now look like the sketch below.
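A cumulative sketch, carrying over the assumed details from earlier:

```python
from setuptools import setup, find_packages

setup(
    name='datapeek',
    version='0.1',
    description='Useful functions for working with raw data.',
    author='Sean McClure',
    author_email='sean.mcclure@example.com',
    packages=find_packages(),
    install_requires=[
        'fuzzywuzzy',
        'pandas'
    ],
    include_package_data=True
)
```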

STEP 6: Add Tests

As with our R library, we should add tests so others can import and run them, should they wish to extend our work. Add a tests folder to our library's module directory:

```
datapeek
├── datapeek
│   ├── __init__.py
│   ├── utilities.py
│   ├── data
│   └── tests
├── MANIFEST.in
├── setup.py
├── gitignore
```

Our tests folder should have its own __init__.py file, as well as the test file itself. Create those now using JupyterLab's Text File option:

```
datapeek
├── datapeek
│   ├── __init__.py
│   ├── utilities.py
│   ├── data
│   └── tests
│       ├── __init__.py
│       └── datapeek_tests.py
├── MANIFEST.in
├── setup.py
├── gitignore
```

Our datapeek directory structure is now set to house test functions, which we will write now.

Writing Tests

Writing tests in Python is similar to doing so in R.

Assertions are used to check the expected outputs produced by our library’s functions.

We can use these “unit tests” to check a variety of expected outputs depending on what might be expected to fail.

For example, we might want to ensure a data frame is returned, or perhaps the correct number of columns after some known transformation.

I will add a simple test for each of our 4 functions.

Feel free to add your own tests.

Think about what should be checked, and keep in mind Martin Fowler’s quote shown in the R section of this article.

We will use unittest, a popular unit testing framework in Python.

Add unit tests to the datapeek_tests.py file, ensuring the unittest and datapeek libraries are imported; a sketch follows below.
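The test gist isn't reproduced here; this is a minimal sketch mirroring the R tests. The expected dimensions and values are my assumptions about the functions' behavior and the bundled data:

```python
import unittest
import pandas as pd
from datapeek.utilities import (encode_and_bind, remove_features,
                                apply_function_to_column, get_closest_string)

iris = pd.read_csv('datapeek/data/iris.csv')


class DatapeekTests(unittest.TestCase):

    def test_encode_and_bind(self):
        result = encode_and_bind(iris, 'species')
        self.assertEqual(result.shape[1], 8)  # 5 original + 3 encoded columns

    def test_remove_features(self):
        result = remove_features(iris, ['petal_length', 'petal_width'])
        self.assertEqual(result.shape[1], 3)

    def test_apply_function_to_column(self):
        result = apply_function_to_column(iris, ['sepal_length'], 'times_4', 'x*4')
        self.assertEqual(result['times_4'][0], iris['sepal_length'][0] * 4)

    def test_get_closest_string(self):
        result = get_closest_string(['hey there', 'we are', 'howdy doody'], 'doody')
        self.assertEqual(result, 'howdy doody')


if __name__ == '__main__':
    unittest.main()
```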

To run these tests we can use nose, which extends unittest to make testing easier. Install nose using a terminal session in JupyterLab:

```
pip install nose
```

We also need to add the following lines to setup.py:

```python
setup(
    # ...other arguments as before...
    test_suite='nose.collector',
    tests_require=['nose'],
)
```

Run the following from the root directory to run our tests:

```
python setup.py test
```

Setuptools will take care of installing nose if required and running the test suite.

After running the above, you should see output confirming that all our tests have passed. If any test fails, the unittest framework will show which functions did not pass.

At this point, check to ensure you are calling the function correctly and that the output is indeed what you expected.

It can also be good practice to purposely write tests to fail first, then write your functions until they pass.

STEP 7: Create Documentation

As I mentioned in the R section, I use Julep to rapidly create sharable and searchable documentation.

This avoids writing cryptic annotations and provides the ability to immediately host our documentation.

Of course this doesn’t come with the IDE hooks that other documentation does, but for rapidly communicating it works.

You can find the documentation I created for this library here.

STEP 8: Share Your Python Library

The standard approach for sharing Python libraries is through PyPI.

Just as we didn’t cover CRAN with R, we will not cover hosting our library on PyPI.

While the requirements are fewer than those associated with CRAN there are still a number of steps that must be taken to successfully host on PyPI.

Keep in mind that these steps for creating libraries are standard, and thus hosting on sites other than GitHub can always be added later.

GitHub

We covered the steps for adding a project to GitHub in the R section.

The same steps apply here.

I mentioned above the need to rename our gitignore file to make it a hidden file.

You can do that by running the following in the terminal:

```
mv gitignore .gitignore
```

You'll notice this file is no longer visible in our JupyterLab directory (it eventually disappears). Since JupyterLab still lacks a front-end setting to toggle hidden files, simply run the following in the terminal at any time to see hidden files:

```
ls -a
```

We can make it visible again at any time, should we need to view or edit the file in JupyterLab, by running:

```
mv .gitignore gitignore
```

Here is a quick recap on pushing our library to GitHub (change the git URL to your own):

1. Create a new repo on GitHub called datapeek_py.
2. Initialize your library's directory using git init.
3. Configure your local repo with your GitHub email and username (if using Docker) using git config --global user.email {emailaddress} and git config --global user.name {name}.
4. Add your new remote origin using git remote add origin https://github.com/sean-mcclure/datapeek_py.git.
5. Stage your library using git add .
6. Commit all files using git commit -m 'initial commit'.
7. Push your library to the remote repo using git push origin master (authenticate when prompted).

Now, anyone can use our Python library.

????.Let’s see how.

STEP 9: Install your Python Library

We usually install Python libraries using the following command:

```
pip install <package_name>
```

…but this requires hosting our library on PyPI, which as explained above is beyond the scope of this article.

Instead we will learn how to install our Python libraries from GitHub, as we did for R.

This still uses the pip install command, but it’s followed by the GitHub URL instead of the package name.

Installing our Python Library from GitHub

With our library hosted on GitHub, we simply use pip install git+ followed by the URL provided on our GitHub repo (available by clicking the Clone or Download button on the GitHub website):

```
pip install git+https://github.com/sean-mcclure/datapeek_py
```

Now we can import our library into our Python environment. For a single function:

```python
from datapeek.utilities import encode_and_bind
```

…and for all functions:

```python
from datapeek.utilities import *
```

Let's do a quick check in a new Python environment to ensure our functions are available.

Spinning up a new Docker container, I run the following. Fetch a dataset:

```python
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
```

Check functions:

```python
encode_and_bind(iris, 'species')
remove_features(iris, ['petal_length', 'petal_width'])
apply_function_to_column(iris, ['sepal_length'], 'times_4', 'x*4')
get_closest_string(['hey there', 'we we are', 'howdy doody'], 'doody')
```

Success!

SUMMARY

In this article we looked at how to create both R and Python libraries.

Creating libraries is a critical skill for any machine learning practitioner, and something I encourage others to do regularly.

Creating packages helps isolate our work inside useful abstractions, improves reproducibility, makes our work shareable, and is the first step towards designing better software.

Using a lightweight approach ensures we can prototype and share quickly, with the option to add more detailed practices and publishing criteria later as needed.

As always, please ask questions in the comments section should you run into any issues.

Happy coding.

If you enjoyed this article you might also enjoy:

- Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to…
- Graduating from Toy Visuals to Real Applications with D3.js
- GUI-fying the Machine Learning Workflow: Towards Rapid Discovery of Viable Pipelines

FURTHER READING AND RESOURCES

- R Packages by Hadley Wickham
- Testing by Hadley Wickham
- Python Packaging by Scott Torborg
- Jupyter Data Science Notebook
- Docker — Orientation and Setup
- JupyterLab Documentation
- Available CRAN Packages By Date of Publication
- Documenting Functions in R
- Julep
- gitignore.io
- The Python Package Index
- Setuptools Documentation
- Iris Dataset on GitHub
- Unit Tests — Wikipedia Article
- unittest — A Unit Testing Framework
- Nose — Nicer Testing for Python
- Test-Driven Development — Article on Wikipedia
- Docker Run Reference
