An introduction to web scraping using R

Once we are there, we need to load the required packages as shown below:

    # loading the packages
    library(xml2)
    library(rvest)
    library(stringr)

Step 2: Reading the HTML content from Amazon

    # Specifying the URL of the website to be scraped
    url <- ''

    # Reading the HTML content from Amazon
    webpage <- read_html(url)

In this code, we read the HTML content from the given URL and assign it to the webpage variable.

Step 3: Scrape product details from Amazon

Now, as the next step, we will extract the following information from the website:

Title: The title of the product.
Price: The price of the product.
Description: The description of the product.
Rating: The user rating of the product.
Size: The size of the product.
Color: The color of the product.

[Screenshot: how these fields are arranged on the product page]

Next, we will use Inspect Element to find the HTML tags that hold the data we want, such as the title and the price of the product.

To find the class of an HTML tag, use the following steps:

=> go to Chrome browser => go to this URL => right-click => Inspect Element

NOTE: If you are not using the Chrome browser, check out this article.

We will scrape the data from the HTML using CSS selectors such as class and id.

To find the CSS selector for the product title, right-click on the title and select "Inspect" or "Inspect Element".

Below, I extract the title of the product with html_nodes(), passing it webpage (which holds the HTML content) and the selector for the title, h1#title. I then get the title text with html_text() and print it with the head() function.

    # scrape title of the product
    title_html <- html_nodes(webpage, 'h1#title')
    title <- html_text(title_html)
    head(title)

The title comes back with spaces and… The next step is to remove spaces and new lines with the help of the str_replace_all() function in the stringr library.

    # remove all carriage returns and new lines
    title <- str_replace_all(title, "[\r\n]", "")
    title

Now we need to extract the other related information about the product following the same process.

Price of the product:

    # scrape the price of the product
    price_html <- html_nodes(webpage, 'span#priceblock_ourprice')
    price <- html_text(price_html)

    # remove carriage returns and new lines
    price <- str_replace_all(price, "[\r\n]", "")

    # print price value
    head(price)

Product description:

    # scrape product description
    desc_html <- html_nodes(webpage, 'div#productDescription')
    desc <- html_text(desc_html)

    # remove new lines, then trim surrounding spaces
    desc <- str_replace_all(desc, "[\r\n]", "")
    desc <- str_trim(desc)
    head(desc)

Rating of the product:

    # scrape product rating
    rate_html <- html_nodes(webpage, 'span#acrPopover')
    rate <- html_text(rate_html)

    # remove new lines, then trim surrounding spaces and tabs
    rate <- str_replace_all(rate, "[\r\n]", "")
    rate <- str_trim(rate)

    # print rating of the product
    head(rate)

Size of the product:

    # scrape size of the product
    size_html <- html_nodes(webpage, 'div#variation_size_name')
    size_html <- html_nodes(size_html, 'span.selection')
    size <- html_text(size_html)

    # trim surrounding whitespace (including tabs)
    size <- str_trim(size)

    # print product size
    head(size)

Color of the product:

    # scrape product color
    color_html <- html_nodes(webpage, 'div#variation_color_name')
    color_html <- html_nodes(color_html, 'span.selection')
    color <- html_text(color_html)

    # trim surrounding whitespace (including tabs)
    color <- str_trim(color)

    # print product color
    head(color)

Step 4: We have successfully extracted data from all the fields, which can be used to compare the product information with that from another site.

Let's combine them into a data frame and inspect its structure.

    # Combining all the vectors to form a data frame
    product_data <- data.frame(Title = title, Price = price,
                               Description = desc, Rating = rate,
                               Size = size, Color = color)

    # Structure of the data frame
    str(product_data)

The output of str() lists all the scraped fields in the data frame.

Step 5: Store data in JSON format

Once the data is collected, we can carry out different tasks on it, such as comparing and analyzing it to arrive at business insights. We could also consider training machine learning models on this data.

We will store the data in JSON format for further processing. Follow the code below to get the JSON result.

    # Include the 'jsonlite' library to convert to JSON
    library(jsonlite)

    # convert data frame into JSON format
    json_data <- toJSON(product_data)

    # print output
    cat(json_data)

In the code above, I have included the jsonlite library to use the toJSON() function, which converts the data frame object into JSON form. At the end of the process, we have stored the data in JSON format and printed it. If we wish, it is also possible to store the data in a CSV file or in a database for further processing.

Following this practical example, you can also extract the relevant data for the same product from another site and compare it with Amazon to work out the fair value of the product. In the same way, you can use the data to compare it with other websites.

4. End note

As you can see, R can give you great leverage in scraping data from different websites. With this practical illustration of how R can be used, you can now explore it on your own and extract product data from Amazon or any other e-commerce website.

A word of caution: certain websites have anti-scraping policies. If you overdo it, you will be blocked and you will begin to see captchas instead of product details.
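The steps in this walkthrough can also be tried offline before pointing them at a live page. The sketch below is a minimal illustration, not the article's exact code: the HTML snippet is a made-up stand-in for a product page (not Amazon's real markup), and extract_text() is a hypothetical helper introduced here. It uses html_node() (singular), so a selector that matches nothing yields NA rather than an empty vector, which keeps data.frame() from failing when a field is absent.

```r
library(rvest)     # re-exports read_html() from xml2
library(stringr)
library(jsonlite)

# Hypothetical stand-in for a product page; NOT Amazon's real markup.
sample_html <- '
<html><body>
  <h1 id="title">
    Example Product
  </h1>
  <span id="priceblock_ourprice"> $19.99 </span>
  <div id="variation_size_name"><span class="selection"> Medium </span></div>
</body></html>'

page <- read_html(sample_html)

# Hypothetical helper: cleaned text of the first node matching `css`,
# or NA when the selector matches nothing.
extract_text <- function(page, css) {
  node <- html_node(page, css)                   # missing node if no match
  text <- html_text(node)                        # NA for a missing node
  text <- str_replace_all(text, "[\r\n\t]", "")  # drop line breaks and tabs
  str_trim(text)                                 # trim surrounding spaces
}

product_data <- data.frame(
  Title = extract_text(page, "h1#title"),
  Price = extract_text(page, "span#priceblock_ourprice"),
  Size  = extract_text(page, "div#variation_size_name span.selection"),
  # No color block in the sample page, so this field becomes NA.
  Color = extract_text(page, "div#variation_color_name span.selection"),
  stringsAsFactors = FALSE
)

str(product_data)

# na = "null" writes missing fields as JSON null.
json_data <- toJSON(product_data, na = "null", pretty = TRUE)
cat(json_data)
```

To run the same helper against a live page, replace sample_html with the product URL. When fetching several pages in a row, pausing between requests (for example with Sys.sleep()) is one simple way to avoid overdoing it.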
