Deep Learning Literature with Kaggle and Google Cloud Platform

Rob Harrand, Mar 24

Creating an auto-updating dataset and kernel

I first discovered Kaggle around 4 years ago, when I was starting out on my journey into the world of data.

Their ‘Titanic’ dataset and associated kernels helped me to get up to speed on a lot of modern data science tools and techniques, and I’ve been hooked ever since.

The platform has several parts, including machine-learning competitions, datasets, kernels, discussion forums and courses.

The dataset section is my personal favourite, as it offers a vast playground to explore open data from any field you can imagine.

You can then share your work via kernels (notebooks), which are a mix of text, code and code outputs.

Recently, Kaggle have created an API that allows both datasets and kernels to be updated from the command line.

By using Google Cloud Platform (GCP), we can schedule this to occur regularly, providing an auto-updating ‘dashboard’.

This blog post outlines the process.

Google Cloud Platform

GCP is a cloud computing service that not only provides free credits upon sign-up, but also offers completely free tiers of computing and storage.

Of course, for a fee, you can also spin up huge clusters of virtual machines, not to mention dozens of other features, including some amazing big data and machine-learning products.

Other cloud platforms are available, but given that Kaggle is a Google company and there is a set of excellent courses on Coursera teaching various aspects of GCP, I’ve gone down the GCP route.

Tutorial aims

This section of the kernel aims to show you how to:

- Create a GCP account
- Create a virtual machine and storage bucket
- Grab data from an API
- Create a regularly updating dataset on Kaggle via GCP
- Create a regularly updating kernel on Kaggle via GCP

Specific kernel aims

Several times over the years, I or people I’ve worked with wanted to get a sense of what a particular area of the scientific literature was doing.

Recently, this led me to the National Center for Biotechnology Information (NCBI) API, which allows the details of scientific papers, including full abstracts, to be downloaded for a specified date range.

This gave me the idea of using an R script to regularly download such details, to give me and my colleagues an overview of the past and a snapshot of the present.
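To get a feel for what the API returns, here is the search endpoint used later in this post, called directly from the command line (the same query the R script below builds, with the terms ‘deep’ and ‘learning’ over the last day); it returns a JSON list of matching PubMed IDs:

# Search PubMed for papers matching 'deep AND learning' from the last 1 day (up to 10,000 IDs)
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=deep+AND+learning&reldate=1&retmode=json&retmax=10000"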

Once the script was written, I needed a way of automating it, which led me to GCP.

Finally, I realised this may be of interest to other people, and so decided to create this post explaining the methodology.

It would be remiss of me not to mention Rachael Tatman’s great series of kernels on dashboarding, which gave me the inspiration for this project.

Below is an overview of this project, giving you an idea of how all the parts slot together.

Create a Google Cloud Platform account

Head over to GCP and sign up.

Now, at some point, you’ll be asked to set up a billing account, which I was initially reluctant to do.

Basically, all I could think was, “What if I accidentally spin up the world’s largest cluster, forget to deactivate it for 6 months, and end up bankrupt?”

However, I then found their page on pricing, and paid close attention to the section on all of the free stuff.

Namely, $300 of credit for 12 months plus a completely free tier.

See here for more details.

Also, keep an eye on the billing section of your account over time, to ensure nothing crazy is going on.

Create an Instance

Once you’re in, you’ll see something like this…

This can be a bit intimidating at first.

I’ve just completed the excellent GCP Data Engineering specialisation on Coursera, and have only used a fraction of the features on offer.

For this post, we’ll focus on the essentials, namely instances and buckets.

An instance is basically a virtual machine on GCP.

On the bar along the left, click COMPUTE ENGINE, followed by CREATE INSTANCE.

Now, here is where the money could come in.

You’re presented with a cornucopia of different options for setting up your machine, from geographic region to number of CPUs.

Thankfully, there is a cost estimator on the right that’s updated as you play around with the options.

First, give your instance a name of your choice.

Second, choose a region (I chose us-central1, as it was included in the free tier; not all regions are, so check the pricing estimate on the right).

Third, from the Machine Type section, choose micro (1 shared vCPU).

This is (at the time of writing) covered by the free tier.

I left the rest as defaults.

Click CREATE.
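If you’d rather script this step than click through the console, a rough equivalent with the Cloud SDK looks like the sketch below (the instance name and zone are just examples; f1-micro is the shared-vCPU machine type that fell under the free tier at the time of writing):

# Create a micro instance in a free-tier-eligible region
gcloud compute instances create literature-instance \
    --zone=us-central1-a \
    --machine-type=f1-micro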

Create a Bucket

Next you’ll need somewhere to store your data, especially if you spin down your instance throughout the day.

To do this, go back to the main console screen by clicking Google Cloud Platform in the top-left, and then from the bar along the left click STORAGE, followed by CREATE BUCKET.

Give it a name of your choice, leave everything else as default, then click CREATE.

Note that the default location is United States.

I left this, despite being in the UK, as all the data transfers will be happening with Kaggle, and it makes sense to keep everything in the same country.
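The same bucket can also be created from the command line; a minimal sketch, using the bucket name that appears later in this post (bucket names are globally unique, so you’ll need your own):

# Make a new Cloud Storage bucket (default location settings)
gsutil mb gs://kaggle_dataset_data/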

Install the Kaggle API

Now it’s time to have a play with the instance you’ve created.

Go back to the main console, then click COMPUTE ENGINE to see your instance.

Then, click SSH to access its shell.

From this point, you can do anything you normally would with Linux.
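If you prefer not to use the browser-based SSH button, the Cloud SDK offers an equivalent from your own machine or Cloud Shell (the instance name and zone here are just examples; use your own):

# Open an SSH session to the instance
gcloud compute ssh literature-instance --zone=us-central1-a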

Install R

At this point, you might want to do your own thing.

For example, I use R to get data from the NCBI API.

However, you could use a different language and/or a different API.

In my case, I installed R by typing sudo apt-get update followed by sudo apt-get install r-base.

For getting the data from the API, I used the RCurl R package, and after a bit of hacking away with R code, I realised that I needed to install a few other bits and pieces:

sudo apt-get install aptitude
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libxml2-dev

Once done, launch R with sudo R, then type install.packages('RCurl') and select an appropriate mirror.

Again, a US site makes sense.

Do the same for the packages jsonlite and XML.

Exit R with q().

It will ask you if you want to save the workspace image.

Type n and press enter.
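Pulling the R setup steps above together, here is one way to do it as a single non-interactive sequence instead of typing install.packages() at the R prompt (a sketch; the CRAN mirror URL is just one common choice):

# Install R plus the system libraries needed by RCurl and XML
sudo apt-get update
sudo apt-get install r-base aptitude libcurl4-openssl-dev libxml2-dev

# Install the R packages without opening an interactive R session
sudo R -e "install.packages(c('RCurl', 'jsonlite', 'XML'), repos = 'https://cloud.r-project.org')"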

I then created an R script from the command line with nano literature_update.R

The script below is what I used to get the latest paper details from the API.

It uses two API commands.

The first gets all the papers in a given date range (actually, the number of days in the past from today’s date) that match certain search terms.

In this case, I’m using the terms ‘deep’ and ‘learning’ in the last day.

The result of this is an XML file containing ID numbers for each paper.

The script then goes through this list, requesting the paper details for each ID.

These details include everything you would expect, such as the article title, author details, and even the full abstract.

I save each one to a separate XML file.

Copy and paste the R script (or write your own using the appropriate API commands), then save and exit (CTRL-X, followed by ‘Y’).

Note that for some APIs, such as the Kaggle API, you may need to specify a username and key, either as environment variables, or in a JSON file.

The API of your choice will guide you on that.

The NCBI API has no such requirements.

library('RCurl')
library('jsonlite')
library('XML')

search1 = 'deep'
search2 = 'learning'
since = '1' #days

#Folder to hold one XML file per paper
xml_folder = paste('xml_files_', search1, '_', search2, '/', sep = "")
dir.create(file.path(xml_folder), showWarnings = FALSE)

#Function to get paper data
get_and_parse = function(id) {
  data <- getURL(paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id=", id, sep = ""))
  data <- xmlParse(data)
  return(data)
}

#Get the new papers
new_papers <- getURL(paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=", search1, "+AND+", search2, "&reldate=", since, "&retmode=json&retmax=10000", sep=""))
new_papers_parsed <- fromJSON(new_papers)
new_paper_ids = new_papers_parsed$esearchresult$idlist
l = length(new_paper_ids)

if(l==0) {
  'Nothing published!'
} else {
  paste('Found', l, 'new articles in the last', since, 'days relating to', search1, 'and', search2)
}

#Get all the papers and save each one to an XML file
i = 1
while (i<=l) {
  data_temp = get_and_parse(new_paper_ids[i])
  saveXML(data_temp, file = paste(xml_folder, new_paper_ids[i], '.xml', sep = ""))
  i = i+1
}

Note: In the above code, I’m searching for new papers in the last ‘1’ day.

When I first ran this script, I ran it to search for the last 365 days to get a year of data, and then changed it to 1 day for the regular updates.

Run an R script with search terms and a date range of your choice

Now that I have the R script, I need to run it. I do this with sudo Rscript literature_update.R

Authorise access to the bucket

I now have a bunch of XML files corresponding to all the articles I’ve downloaded, stored in a dedicated folder.

Now, I want to back these up to the bucket that I created.

First, I need to give this instance permission to access the bucket. Do this with the following command:

gcloud auth login

This will take you to a set of instructions.

Follow them to provide access to the bucket (you’ll need to copy and paste a code from your browser).
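On a headless instance, the variation below can make this easier by printing a URL for you to open in your own browser rather than trying to launch one on the VM (depending on your SDK version, gcloud may do this automatically anyway):

gcloud auth login --no-launch-browser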

Copy the XMLs to the bucket

It’s time to copy the XMLs to the bucket. However, there is no point doing this one file at a time. Instead, let’s stick them into a tar.gz file and upload that.

Create the archive with tar -zcvf literature.tar.gz xml_files_deep_learning (you may have a different folder name) and then transfer to the bucket with gsutil cp literature.tar.gz gs://kaggle_dataset_data/ (with the name of your bucket).

Note the use of gsutil.

This is a useful tool for accessing GCP buckets from the command line.

Learn more here.

At this point, we’re ready to create a Kaggle dataset, using either the data on the instance, or the data in the bucket.

However, by default, the data in the bucket is not publicly available, which it needs to be if we want to use the Kaggle website to create it.

Change the permission with gsutil acl ch -u AllUsers:R gs://kaggle_dataset_data/literature.tar.gz (changing the bucket name and filename accordingly).
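Taken together, the archive, upload, and permission steps from this section look like this (folder and bucket names as used in this post; substitute your own):

# Archive the XML files, copy the archive to the bucket, and make it publicly readable
tar -zcvf literature.tar.gz xml_files_deep_learning
gsutil cp literature.tar.gz gs://kaggle_dataset_data/
gsutil acl ch -u AllUsers:R gs://kaggle_dataset_data/literature.tar.gz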

Prepare for using the Kaggle API

The Kaggle API allows you to fully interact with Kaggle, and gives us everything we need for updating our dataset and kernel.

See here for more details.

First, we need to install the Kaggle API, which requires Python 3 to be installed. Do this with sudo apt install python python-dev python3 python3-dev

Second, install pip with wget https://bootstrap.pypa.io/get-pip.py then sudo python get-pip.py

Third, install the Kaggle API with sudo pip install kaggle.

Then, go to your Kaggle account and click ‘create new API token’. Download and open the file, copy the contents, then in your GCP shell, type nano kaggle.json. In the file, paste the contents in, then save and close the file.

Lastly, move this into the required folder with mkdir .kaggle and then mv kaggle.json .kaggle/, and finish with some permission setting using chmod 600 /home/<your username>/.kaggle/kaggle.json
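As one sequence, the Kaggle API setup described above looks roughly like this (a sketch; it assumes you are working from your home directory and have already pasted your API token into kaggle.json as described):

# Install Python, pip and the Kaggle API
sudo apt install python python-dev python3 python3-dev
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install kaggle

# Put the API token where the Kaggle API expects it, readable only by you
mkdir ~/.kaggle
mv kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json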

One more thing.

I later encountered an error when using the kaggle API, specifically ImportError: No module named ordered_dict.

After a bit of searching, running the following fixed the issue: sudo pip install -U requests

Create a dataset in Kaggle from the XML tar file

At this point, you have a choice.

If you want Kaggle to do the updating, do this from the bucket.

This option is best if your data is very large and/or you’re planning on deactivating your instance for most of the time.

If that’s your requirement, from the GCP console, go to STORAGE and then click on your bucket.

You should see your newly uploaded tar.gz file, and the ‘Public access’ column should be set to ‘Public’.

Next to the word ‘Public’, you should see a link symbol.

Right click and select Copy link location (or equivalent in your browser of choice).

Next, in Kaggle, go to Datasets followed by Create new dataset.

Give it a title, and select the ‘link’ symbol (highlighted in red below).

Paste the address of your GCP bucket file, and click add remote file.

Once the dataset is created, go to settings and select the update frequency of your choice.

For me, given the instance is in the free tier and my data is small, I’m going to update from the instance storage using the API (note that the Bucket instructions are still useful for backing data and files up).

To do this, forget about the instructions above for creating a dataset from the Kaggle website.

Instead, from the GCP shell, create a new Kaggle dataset first with kaggle datasets init -p /path/to/dataset.

This will create a metadata JSON file in the corresponding directory.

Go to this directory, and edit the file.

You’ll see default values for the slug and title.

Change these to the directory name and a title of your choice, respectively.
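For illustration only, the edited metadata file might end up looking something like the sketch below (the exact filename and field names depend on your version of the Kaggle API, so treat this as a rough guide; ‘your-username’, the slug and the title are placeholders):

# Hypothetical contents of the metadata file after editing
cat > deep_learning_literature/dataset-metadata.json <<'EOF'
{
  "title": "Deep Learning Literature",
  "id": "your-username/deep-learning-literature",
  "licenses": [{"name": "CC0-1.0"}]
}
EOF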

Then, exit out and change the permissions on the tar.gz file you’re about to upload with chmod 777 file_name, then go back to your home directory and type kaggle datasets create -p /path/to/dataset.

You should get the message Your private Dataset is being created.

Please check progress at….

For my project I used:

mkdir deep_learning_literature
mv literature.tar.gz deep_learning_literature
kaggle datasets init -p deep_learning_literature/
kaggle datasets create -p deep_learning_literature/

Next, go to Kaggle and check that the dataset has been created.

Tweak it with a title, subtitle, background image, etc, as you see fit.

Create a kernel analysing as you wish

We now have a dataset using data from GCP.

At this point, simply create a kernel to analyse the data however you wish.

In the kernel on Kaggle (link below), I have some code that pulls out the key pieces of data from the XML files and plots some graphs.

Automate the process

If you’ve chosen to let Kaggle update from your bucket, all you need to worry about is updating the bucket and your kernel.

If you’re using the API to update the dataset from the instance storage, you’ll also need to handle that.

My approach was to put all of this into a single shell script, which you can see below.

This has 3 parts.

First, the script runs the R script that gets the latest deep learning articles.

It then creates a tar.gz file and copies it both to the bucket (for backup or Kaggle auto-updates) and to a folder on the instance. It also sets the permissions of the file on the bucket.

Next, it waits 15s and then updates the dataset with kaggle datasets version -p ~/deep_learning_literature -m “Updated via API from Google Cloud”.

Finally, it waits 60s (to allow the dataset to get updated), followed by a kernel update using kaggle kernels pull tentotheminus9/deep-learning-literature-and-gcp-tutorial -m and then kaggle kernels push (change the names of the folders, bucket, dataset and kernel to match yours).

Here is the full script.

Create an appropriate file with nano update_kaggle.sh or similar.

#!/usr/bin/env bash

# 1. Get the latest paper details from the NCBI API
sudo Rscript literature_update.R

# 2. Archive the XML files, copy the archive to the bucket (and make it public) and to the dataset folder
tar -zcvf literature.tar.gz xml_files_deep_learning
gsutil cp literature.tar.gz gs://kaggle_dataset_data/
gsutil acl ch -u AllUsers:R gs://kaggle_dataset_data/literature.tar.gz
cp literature.tar.gz deep_learning_literature/

# 3. Update the Kaggle dataset, wait for it to process, then update the kernel
sleep 15s
kaggle datasets version -p ~/deep_learning_literature -m "Updated via API from Google Cloud"
sleep 60s
kaggle kernels pull tentotheminus9/deep-learning-literature-and-gcp-tutorial -m
kaggle kernels push

OK, now for the final step.

I need to tell my instance to trigger the above script regularly.

First, make the script executable with chmod +x literature_bucket.sh (or whatever your script is called).

Then, we use cron to set the timings (cron is a tool in Linux to schedule jobs).

Create a cron job by typing EDITOR="nano" crontab -e.

In the file that opens, you set the timings using five fields (minute, hour, day of month, month, and day of week), followed by the command to run.

It can take a while to figure out, so I recommend this online tool to help.

Below is my cron file.

There are some settings at the start, followed by a command to trigger my shell script at 7am every morning:

#!/usr/bin/env bash
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/tentotheminus9/
0 7 * * * 1-literature_bucket_update.sh

Of course, you can (and should) test your script first.

Do this with ./update_kaggle.sh (or whatever you called it).

And that’s it.

All being well, the details of new papers are downloaded each morning, the data is backed up, the Kaggle dataset is updated, and the kernel is updated accordingly.

Future Improvements

The first area of possible improvement could be around error-checking, as per Rachael Tatman’s 5th notebook.

I’ve been looking at XML validation, but haven’t quite got there yet. So far, the NCBI XML files look perfectly set up and I haven’t had any errors.

The second area is around scheduling not only the triggering of the script on GCP, but also the spinning up and spinning down of the instance, with an instance start-up script reinstalling everything each day.

From what I’ve read so far, this sort of scheduling is a little tricky to implement, but I’ll see what I can do in the future.

This would of course save a lot of money for larger-scale resources.

The Kernel

See the full Kaggle kernel here.

The first part is a repeat of this post, and the second part shows some visualisations of the dataset.

For example, below is a breakdown of the most frequent deep learning paper authors…

I hope this post has been useful.

Thanks!
