How to Scrape Like a (Hype)Beast

But since I am a student who cannot afford expensive resale prices and needs plenty of coding practice, I decided to develop a web scraper that tells me which products on the brand websites I visit deserve my attention.

I would like to share the entire development process with you in the most straightforward way possible.

Hopefully you can learn the basics of scraping and apply them to something you enjoy!

Note: This scraper is for educational purposes only.

Tips before Scraping

Scraping is all about finding patterns in websites.

If you can determine which tags and classes the content you want belongs to, the rest of the coding becomes very straightforward.

Unless the company decides to redo the layout of the website, your scraper should work just fine even when the content is updated.

Disclaimer: Scraping data from websites for personal or business use may breach the Terms of Service issued by the companies.

Plan

Since I am kind of (not really) familiar with Python, I decided to write this basic scraper with the help of the Python libraries Requests and BeautifulSoup.

I will be looking into brands such as Nike and Adidas and retail shops such as Juice Store to put together a list of rare products that I might like.

Then, I will create a text file that tells me what is currently on sale or in a raffle on which website, with whatever details I can find to help me consider the product.

Now, before we begin any scraping, let’s import the libraries we need:

    import requests
    from bs4 import BeautifulSoup

Now, let the scraping begin!

Investigate and Develop

First up, Nike.

I live in Hong Kong, and Nike’s Hong Kong website has a specific URL, https://www.nike.com.hk/draw/list.htm, with a list of all the shoes that require lucky draws to purchase.

If you go to their site and inspect the HTML (right-click or press F12), you will see a small window on the right showing the site's long, daunting code.

Don’t worry, all you have to do is click the cursor-looking button at the top left and hover over the content you want to find: product names, colorways and prices in my case.

This will lead you to the corresponding lines of HTML, and all we have to do now is remember their tags and classes: <p> and “tn_s”, “tn_n”, “tn_p”.

Fig. 1: The text I want and the corresponding HTML code

Next, to the code.

(My methods may not be optimal; feel free to propose better ones.)

First, we have to get a response using the GET method of requests:

    url = 'https://www.nike.com.hk/draw/list.htm'
    response = requests.get(url)

We can put the response from that URL into the BeautifulSoup parser:

    soup = BeautifulSoup(response.text, "html.parser")

Here comes the important part.

In order to muddle through the perplexing code scraped from Nike, we can limit the search in soup to just the tags this content is supposed to be in.

For example, since all the content I need is in <p> tags, the code will look something like this:

    for i in soup.findAll('p'):
        text = str(i)  # string form of the tag, including its class attribute

Converting i into a string lets us check for the ones with “tn_s”, “tn_n” or “tn_p” in it.

With some simple manipulation of those strings, we can extract all the product data.

Done.
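As a rough illustration of the check described above, here is a minimal sketch that runs against a small hand-written HTML snippet instead of the live Nike page. The snippet, its values, and the helper name extract_fields are made up for this example; only the <p> tag and the class names tn_s, tn_n and tn_p come from the inspection above, and the sketch does not assume which class holds the name, the colorway or the price. It also reads the class attribute directly rather than searching the tag's string form, which is a swapped-in shortcut and not necessarily what the full code does.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the lucky draw page (hypothetical content).
SAMPLE_HTML = """
<div class="item">
  <p class="tn_s">Sample Label</p>
  <p class="tn_n">Sample Shoe</p>
  <p class="tn_p">HK$1,299</p>
  <p class="other">ignore me</p>
</div>
"""

WANTED_CLASSES = ("tn_s", "tn_n", "tn_p")


def extract_fields(html):
    """Return (class_name, text) pairs for every <p> whose class is wanted."""
    soup = BeautifulSoup(html, "html.parser")
    fields = []
    for p in soup.find_all("p"):
        # p.get("class") is a list of class names, or None if there is no class.
        for class_name in p.get("class") or []:
            if class_name in WANTED_CLASSES:
                fields.append((class_name, p.get_text(strip=True)))
    return fields


fields = extract_fields(SAMPLE_HTML)
for class_name, text in fields:
    print(class_name, text)
```

Once it is clear which class means what on the real page, the pairs can be grouped into one record per product.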

Next, Adidas Yeezys.

Adidas is a bit trickier, as their website https://www.adidas.com/us/yeezy blocks automated Python requests (I think this applies to all Adidas websites).

One way to bypass this is to pretend to be a web browser by creating a header like this:

    header = {'User-Agent': 'My User Agent 1.0', 'From': 'youremail@domain.com'}
    response = requests.get(url, headers=header)

The rest of the code will be similar to the method we used to scrape Nike’s lucky draw list.
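Here is a small sketch of the same header idea that can be tried without actually contacting Adidas. It builds the request with requests.Request and prepare() so the outgoing headers can be inspected before anything is sent. The User-Agent and From values are the placeholders from above, the helper name build_request is hypothetical, and there is no guarantee that any particular header gets past a site's bot protection.

```python
import requests

URL = "https://www.adidas.com/us/yeezy"

# Placeholder values from the article; swap in your own.
HEADERS = {
    "User-Agent": "My User Agent 1.0",
    "From": "youremail@domain.com",
}


def build_request(url, headers):
    """Prepare (but do not send) a GET request carrying the custom headers."""
    return requests.Request("GET", url, headers=headers).prepare()


prepared = build_request(URL, HEADERS)
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# To actually send it (network access required):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     response.raise_for_status()
```

Checking the prepared headers first is a cheap way to confirm that the script really identifies itself the way you intended.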

One thing to note is that the response from Adidas is mainly JavaScript.

As there are no HTML tags, you may have to extract the data by parsing the JSON in the text into Python dictionaries to organize it.
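For readers who want to try the JSON route, here is a minimal sketch. It assumes the page embeds its data as a JSON object assigned to a JavaScript variable inside a <script> tag; the variable name window.DATA, the sample page, and the field names are invented for illustration and are not taken from the Adidas site.

```python
import json
import re

# Invented page: real sites use different variable names and structures.
SAMPLE_PAGE = """
<html><head><title>Sample</title></head>
<body>
<script>window.DATA = {"products": [{"name": "Sample Shoe", "price": 220}]};</script>
</body></html>
"""


def extract_embedded_json(page, variable="window.DATA"):
    """Find `variable = {...}` in the page text and parse the object."""
    pattern = re.escape(variable) + r"\s*=\s*"
    match = re.search(pattern, page)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    data, _end = json.JSONDecoder().raw_decode(page, match.end())
    return data


data = extract_embedded_json(SAMPLE_PAGE)
for product in data["products"]:
    print(product["name"], product["price"])
```

The result is an ordinary Python dictionary, so the usual indexing and loops apply from there.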

However, I am a very lazy person with a final exam in 2 days, and Adidas just so happens to set their newest Yeezy release as their website title, so I scraped the title.

Fig. 2: <title> of HTML for Adidas Yeezy website

    for i in soup.findAll('title'):
        print(i.get_text())  # the title text names the newest release

Finally, Juice Store.

Juice Store does not have a list of all the raffles.

However, they do have an editorial page that posts relevant news, and every raffle is headlined as “Raffle: …”.

With that in mind, we can search through the URL https://juicestore.com/blogs/editorial/tagged/raffle.

By limiting the soup.findAll() tags to “h2” and “time”, we can extract the products that are in recent or upcoming raffles along with the date the news was released.
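A minimal sketch of that idea, again on a hand-written snippet rather than the live page. The sample HTML, the dates, and the helper name extract_raffles are hypothetical, and the sketch assumes each headline's <time> tag comes after its <h2> in the page, which may not match the real layout.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the editorial page (hypothetical content).
SAMPLE_HTML = """
<article>
  <h2>Raffle: Sample Sneaker A</h2>
  <time datetime="2020-01-02">January 2, 2020</time>
</article>
<article>
  <h2>Store News: Something Else</h2>
  <time datetime="2020-01-01">January 1, 2020</time>
</article>
"""


def extract_raffles(html):
    """Pair each "Raffle:" headline with the first <time> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    raffles = []
    for heading in soup.find_all("h2"):
        title = heading.get_text(strip=True)
        if not title.startswith("Raffle:"):
            continue
        time_tag = heading.find_next("time")
        date = time_tag.get_text(strip=True) if time_tag else ""
        raffles.append((title, date))
    return raffles


raffles = extract_raffles(SAMPLE_HTML)
for title, date in raffles:
    print(date, "-", title)
```

Filtering on the “Raffle:” prefix keeps the result clean even if other posts show up on the same page.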

Fig. 3: Headline with “Raffle” and the corresponding HTML tag

Save the Results

After all the data is collected and organized, we can write it into a text file for future reading.

This can be done through the following code:

    # "w" mode creates the file or empties an existing one, so no truncate() call is needed
    with open("name.txt", "w") as f:
        f.write('what you want to write')

After some tuning of how the text displays, I created something like this in the end:

Fig. 4: Scraped Results as a .txt File

This is it! A simple scraper that pulls all the information I need from three websites, so now I don’t have to spend 3 minutes every day browsing through them.

Yes, OK, I know it is not THAT useful, but the development process was indeed fun.

Hopefully this article is helpful :)

Full code on: https://github.com/ttchengab/ScrapeBeast

Side note: This is my first post on Medium! I chose a topic that is perhaps simpler but hopefully still helpful.

I will be writing more advanced articles/tutorials regarding data science and machine learning.

Follow if you want to see more!
