Top 5 Data Science GitHub Repositories and Reddit Discussions (February 2019)

You should also check out our top GitHub and Reddit picks for January here: January 2019 Edition

Top Data Science GitHub Repositories (February 2019)

StyleGAN – Generating Life-Like Human Faces

The above image seems like a typical collage – nothing to see here.

What if I told you none of the people in this collection are real? That’s right – these folks do not exist.

All these faces were produced by an algorithm called StyleGAN.

While GANs have been getting steadily better since their invention a few years back, StyleGAN has taken the game up by several notches.

The developers have proposed two new, automated methods to quantify the quality of these images and also open sourced a massive high-quality dataset of faces.

This repository contains the official TensorFlow implementation of the algorithm.
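For a sense of how you’d use it, here’s a minimal sketch along the lines of the repository’s pretrained example. It assumes you’ve downloaded the FFHQ generator pickle (see the Google Drive folder below) and have TensorFlow 1.x plus the repo’s dnnlib package available – the exact helper names are from memory, so treat the details as assumptions and check the source:

```python
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib  # utilities bundled with the StyleGAN repo

tflib.init_tf()

# Load the pretrained networks from the pickle shared by the authors
# (downloaded beforehand from the Google Drive folder linked below).
with open('karras2019stylegan-ffhq-1024x1024.pkl', 'rb') as f:
    _G, _D, Gs = pickle.load(f)

# Sample a random latent vector and map it to a 1024x1024 face image.
latents = np.random.RandomState(42).randn(1, Gs.input_shape[1])
fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
images = Gs.run(latents, None, truncation_psi=0.7,
                randomize_noise=True, output_transform=fmt)
PIL.Image.fromarray(images[0], 'RGB').save('generated_face.png')
```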

Below are a few key resources to learn more about StyleGAN:

- http://stylegan.xyz/paper – Paper PDF
- http://stylegan.xyz/video – Result video
- http://stylegan.xyz/code – Source code
- http://stylegan.xyz/ffhq – Flickr-Faces-HQ dataset
- http://stylegan.xyz/drive – Google Drive folder

OpenAI’s Ground-Breaking Language Model – GPT-2

GPT-2 won the unofficial “most talked about” Natural Language Processing (NLP) library award in February.

The way they went about launching GPT-2 raised quite a few eyebrows.

The team claims that the model works so well they cannot fully open source it for fear of malicious use.

You can imagine why that attracted headlines and questions.

They have, however, released a smaller version of the model, which is available in the GitHub repository we’ve linked above.

GPT-2 is a large language model with 1.5 billion parameters, trained on a dataset of 8 million web pages.

The aim behind the model is to predict the next word, given all the previous words within some text.

Is it state-of-the-art? We’ll have to take OpenAI’s word for it (for now).
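To make the objective concrete, here’s a toy sketch of autoregressive generation. A hand-written stand-in distribution plays the role of the model – nothing like the real GPT-2 – since the point is only to show the “predict the next word given all previous words” loop:

```python
import random

def sample_text(next_word_probs, prompt, max_words=20):
    """Autoregressive generation: repeatedly predict the next word
    given all of the words produced so far - the core idea behind GPT-2."""
    words = prompt.split()
    for _ in range(max_words):
        probs = next_word_probs(tuple(words))   # distribution over candidate next words
        next_word = random.choices(list(probs.keys()),
                                   weights=list(probs.values()))[0]
        if next_word == "<eos>":                # stop token ends the generation
            break
        words.append(next_word)
    return " ".join(words)

# Toy stand-in for the model: a fixed distribution (the real GPT-2 conditions
# on the full prefix with a 1.5-billion-parameter Transformer).
def toy_model(prefix):
    return {"science": 0.5, "beats": 0.3, "<eos>": 0.2}

print(sample_text(toy_model, "data"))
```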

Here are a couple of additional resources to learn more about GPT-2:

- Blog Post
- Official Paper

SC-FEGAN: Face Editing Generative Adversarial Network with User’s Sketch and Color

Another GAN library?! That’s right – GANs are taking the data science world by storm.

SC-FEGAN is every bit as cool, style-wise, as the StyleGAN algorithm we covered above.

The above image perfectly illustrates what SC-FEGAN does.

You can edit all sorts of facial images using the deep neural network the developers have trained.

We can all become artists just sitting in front of our computers! The repository helpfully includes steps to help you build the SC-FEGAN model on your own machine.

Give it a try! And if computational power is a challenge, hop over to Google Colaboratory and utilize their free GPU offering.
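If you do go the Colab route, here’s a quick sanity check (assuming a TensorFlow runtime, which SC-FEGAN uses) that the free GPU is actually enabled:

```python
import tensorflow as tf

# Returns something like '/device:GPU:0' when the GPU runtime is active
# (Runtime -> Change runtime type -> GPU); an empty string means CPU only.
print(tf.test.gpu_device_name())
```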

LazyNLP for Creating Massive Text Datasets

The premise behind LazyNLP is simple – it enables you to crawl, clean up and deduplicate websites to create massive monolingual datasets.

What do I mean by massive? According to the developer, LazyNLP will allow you to create datasets larger than the one used by OpenAI for training GPT-2 – the full-scale model.

That certainly had my full attention.

This GitHub repository lists the five steps you’ll need to follow to create your own custom NLP dataset.

If you’re in any way interested in NLP, you should definitely check out this release.
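To give a flavour of the clean-up and deduplication steps, here’s a conceptual sketch in plain Python – the function names are mine, not LazyNLP’s API, which you should look up in the repository:

```python
import hashlib
import re

def clean_page(html_text):
    """Conceptual clean-up step: strip tags and collapse whitespace.
    (LazyNLP's own cleaning is far more thorough; this is just the idea.)"""
    text = re.sub(r"<[^>]+>", " ", html_text)
    return re.sub(r"\s+", " ", text).strip()

def dedup(pages):
    """Conceptual deduplication step: keep one copy of each distinct page
    by hashing its cleaned text."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha1(page.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = [clean_page(p) for p in ["<p>hello world</p>", "<div>hello  world</div>"]]
print(dedup(pages))  # -> ['hello world']
```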

Subsync – Automating Subtitle Synchronization with the Video

How often have you found yourself frustrated at subtitles being out of sync with the video? This repository happens to be your savior in such situations.

Subsync is about “language-agnostic automatic synchronization of subtitles to video, so that subtitles are aligned to the correct starting point within the video”.

The algorithm was built using the Fast Fourier Transform technique in Python.

Subsync works inside the VLC Media Player as well! The model takes about 20-30 seconds to train (depending on the video length).
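As a rough illustration of the idea (a conceptual sketch, not Subsync’s actual code), you can estimate the offset between two speech-activity signals by cross-correlating them in the frequency domain with the FFT:

```python
import numpy as np

def estimate_offset(video_activity, subtitle_activity, samples_per_second=100):
    """Conceptual FFT-based alignment: cross-correlate two binary
    speech-activity signals and return the lag (in seconds) where they
    line up best. Not Subsync's actual implementation."""
    n = len(video_activity) + len(subtitle_activity) - 1
    size = 1 << (n - 1).bit_length()           # next power of two for the FFT
    v = np.fft.rfft(video_activity, size)
    s = np.fft.rfft(subtitle_activity, size)
    corr = np.fft.irfft(v * np.conj(s), size)  # circular cross-correlation
    lag = int(np.argmax(corr))
    if lag > size // 2:                        # wrap negative lags
        lag -= size
    return lag / samples_per_second

# Toy example: the subtitle events lead the video events by 50 samples (0.5 s)
video = np.zeros(1000)
video[200:300] = 1
subs = np.zeros(1000)
subs[150:250] = 1
print(estimate_offset(video, subs))  # ~0.5
```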

BONUS: Flickr-Faces-HQ Dataset (FFHQ)

I wanted to include this in the article for anyone searching for high-quality face images.

The dataset consists of 70,000 super high-quality images (1024 x 1024).

There’s a lot of variety in the faces, such as age, ethnicity, image background, etc.

It’s ideal for learning and experimenting with GANs.

Let me know in the comments section below if you use it!

Reddit Discussions

Are you Expected to Solve Hard Coding Challenges to Work in the Machine Learning Industry?

I like this question because of how relevant it is in today’s world.

The thread has close to 200 comments from experienced data scientists and machine learning researchers debating whether these coding challenges are a good or bad thing in an interview round.

There’s a lot of experience in this thread, so it’s a discussion you really should pay close attention to.

The essential question it comes down to is this – should data science and machine learning professionals be judged strictly on their coding skills, or should algorithms and concepts take precedence? We also aim to help you crack these data science interviews in our course offering.

Make sure you check it out!
