In 10 minutes: Web Scraping with Beautiful Soup and Selenium for Data Professionals

Vincent Tatan, Jun 17

Introduction

Web Scraping is a process to extract valuable information from websites and online content.

It is a free method to extract information and receive datasets for further analysis.

In this era where information is highly interconnected, I believe the need for Web Scraping to extract alternative data is enormous, especially for data professionals like me.

The objective of this article is for you to understand several ways to scrape publicly available information using quick and dirty Python code.

Just spend 10 minutes to read this article — or even better, contribute.

Then you will get a quick glimpse of how to code your first Web Scraping tool.

In this article, we are going to learn how to scrape data from Wikipedia and e-commerce (Lazada).

We will clean up, process, and save the data into a .csv file.

We will use Beautiful Soup and Selenium as our main Web Scraping Libraries.

What are Beautiful Soup and Selenium

Beautiful Soup

Beautiful Soup parses HTML into an easy, machine-readable tree format so you can extract DOM elements quickly.

It allows extraction of specific paragraph and table elements by HTML ID, class, or XPATH.
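For instance, here is a minimal sketch (my own illustration with a made-up HTML snippet, not code from this article) of how Beautiful Soup turns markup into a searchable tree:

from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page
html = """
<html><body>
  <p id="intro">Hello, scraper!</p>
  <table class="capitals">
    <tr><td>Singapore</td><td>Singapore</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("p", id="intro").text)        # -> Hello, scraper!
print(soup.find("table", class_="capitals"))  # the whole table element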

(Image: Parsing of DOM elements compared to a tree of directories and folders)

Whenever I need a quick and dirty approach to extract information online, I will always use BS as my first approach. Usually it takes me less than 10 minutes and fewer than 15 lines of code to extract what I need.

Beautiful Soup Documentation – Beautiful Soup 4.4.0 documentation: Beautiful Soup 4 is published through PyPi, so if you can't install it with the system packager, you can install it… (www.crummy.com)

Selenium

Selenium is a tool designed to automate web browsers.

It is commonly used by Quality Assurance (QA) engineers to automate their testing of browser applications.

Additionally, it is very useful for web scraping because of these automation capabilities:

- Clicking specific form buttons
- Inputting information into text fields (see the short sketch after this list)
- Extracting DOM elements from the browser's HTML code

Selenium – Web Browser Automation: Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a… (www.seleniumhq.org)
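As a quick illustration of the first two capabilities, here is a minimal sketch (the URL and element IDs are assumptions for illustration only, not taken from any site used in this article):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path='chromedriver')  # driver set up as described later in this article
driver.get('https://www.example.com')

# Input information into a text field (assumed element ID)
search_box = driver.find_element(By.ID, 'search-input')
search_box.send_keys('web scraping')

# Click a specific form button (assumed element ID)
driver.find_element(By.ID, 'search-button').click()

driver.quit()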

Coding your first Web Scraping Tool

(Github is available at the end of this article)

Beautiful Soup

Problem Statement

Imagine you were a UN ambassador, aiming to visit cities all around the world to discuss the status of the Kyoto Protocol on Climate Change.

You need to plan your travel, but you do not know the capital city of each country.

Therefore, you googled and found this link on Wikipedia.

List of national capitals – Wikipedia: This is a list of national capitals, including capitals of territories and dependencies, non-sovereign states including… (en.wikipedia.org)

Inside this link, there is a table which maps each country to its capital city.

You find this is good, but you do not stop there.

As a data scientist and UN ambassador, you want to extract the table from Wikipedia and dump it into your data application.

You took up the challenge to write some scripts with Python and BeautifulSoup.

Steps

We will leverage the following steps:

1. Pip install beautifulsoup4 and pip install requests. Requests will get the HTML from the URL; this becomes the input for BS to parse.

2. Check which DOM element the table corresponds to. Right click on the page and choose Inspect Element. The shortcut is Ctrl+Shift+I in the Chrome browser.

3. Click on the inspect button at the top left corner of the developer tools to highlight the elements you want to extract. Now you know that the element is a table element in the HTML document.

(Image: National Capitals table elements on Wikipedia)

4. Add headers and url to your request. This will create a request to the Wikipedia link. The headers are useful to spoof your request so that it looks like it comes from a legitimate browser. For Wikipedia it might not matter, as all the information is open source and publicly available. But some other sites, such as the financial trading site SGX, might block requests which do not have legitimate headers.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://en.wikipedia.org/wiki/List_of_national_capitals"
r = requests.get(url, headers=headers)

5. Initiate BS and a list, then extract all the rows in the table.

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, "html.parser")
table = soup.find_all('table')[1]
rows = table.find_all('tr')
row_list = list()

6. Iterate through all of the rows in the table and go through each cell, appending it to row and row_list.

for tr in rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row_list.append(row)

7. Create a Pandas DataFrame and export the data into csv.

import pandas as pd

df_bs = pd.DataFrame(row_list, columns=['City', 'Country', 'Notes'])
df_bs.set_index('Country', inplace=True)
df_bs.to_csv('beautifulsoup.csv')

(Image: Result of web scraping in csv)

Congratulations! You have become a web scraping professional in only 7 steps and within 15 lines of code.

The Limitations of Beautiful Soup

So far BS has been really successful at web scraping for us.

But I discovered there are some limitations, depending on the problem:

- Requests takes the HTML response prematurely, without waiting for asynchronous JavaScript calls to render in the browser. This means it does not get the most recent DOM elements that are generated by asynchronous JavaScript calls (AJAX, etc.); see the sketch after this list.

- Online retailers such as Amazon or Lazada put anti-bot software throughout their websites, which might stop your crawler. Such retailers will shut down any requests from Beautiful Soup, as they know the requests do not come from legitimate browsers.
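To make the first limitation concrete, here is a small sketch (the URL and class name are hypothetical, not taken from this article) showing how content rendered by JavaScript never appears in the response that Requests hands to BS:

import requests
from bs4 import BeautifulSoup

# Hypothetical page whose product cards are injected by JavaScript after load
r = requests.get("https://www.example-shop.com/deals")
soup = BeautifulSoup(r.content, "html.parser")

# Likely prints []: the cards exist only after a browser executes the JavaScript,
# which a plain HTTP request never does
print(soup.find_all("div", class_="product-card"))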

Note

If we run Beautiful Soup on e-commerce websites such as Lazada and Amazon, we will run into this connection error, which is caused by their anti-scraping software to deter bots from making HTTP requests:

HTTPSConnectionPool(host='www.amazon.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)'),))

One way to fix it is to use client browsers and automate our browsing behavior.

We can achieve this by using Selenium.
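Before we switch tools, here is a minimal sketch (my own illustration; the URL is just an example) of guarding the plain Requests call so the crawler reports the block instead of crashing:

import requests

url = "https://www.amazon.com"  # example of a site that rejects non-browser clients
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except requests.exceptions.RequestException as err:
    # SSLError, ConnectionError and HTTPError all inherit from RequestException
    print(f"Plain requests was blocked or failed: {err}")
    # fall back to driving a real browser with Selenium instead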

All hail Selenium!

Selenium

Problem Statement

Imagine you were creating a price fluctuation model to analyze e-commerce providers such as Lazada and Amazon.

Without a web scraping tool, you would need to hire somebody to manually browse through numerous product pages and copy-paste the prices one by one into an Excel sheet.

This process would be very repetitive, especially if you would like to collect the data points every day or every hour. It would also be very time consuming, as it involves many manual clicks and page visits to duplicate the information.

What if I told you that you can automate this process:

- By having Selenium do the exploration of products and the clicking for you.
- By having Selenium open your Google Chrome browser to mimic legitimate user browsing behavior.
- By having Selenium pump all of the information into lists and csv files for you.

Well, you are in luck, because all you need to do is write a simple Selenium script, and you can then run the web scraping program while having a good night's sleep.
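If you want those daily or hourly data points, a simple scheduling loop is enough. This is only a sketch; scrape_prices is a placeholder for the Selenium logic developed below, not a function defined in this article:

import time

def scrape_prices():
    # placeholder for the Selenium scraping logic developed below
    print("scraping the latest prices...")

while True:
    scrape_prices()
    time.sleep(60 * 60 * 24)  # wait one day before the next run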

(Image: Extracting Lazada information and products manually is time consuming and repetitive)

Setting Up

1. Pip install selenium.

2. Install the Selenium browser driver. Please refer to this link to identify the driver for your favorite browser (Chrome, Firefox, IE, etc.). Put it in the same directory as your project. Feel free to download it from my Github link below if you are not sure which one to use.

3. Include these imports:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

4. Drive the Selenium Chrome browser by inserting the executable path and url. In my case, I used the relative path to find the chromedriver.exe located in the same directory as my script.

driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.lazada.sg/#')

(Image: Selenium running Chrome and extracting Lazada and Redmart data)

5. Wait for the page to load and find the element. This is how Selenium differs from Requests and BS: you can instruct the page to wait until a certain DOM element is rendered, and only then continue running the web scraping logic. The wait stops once the Expected Conditions (EC) are met, here finding the element with ID "Level_1_Category_No1". If 30 seconds pass without finding such an element, a TimeoutException is raised and we shut down the browser.

timeout = 30
try:
    WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.ID, "Level_1_Category_No1"))
    )
except TimeoutException:
    driver.quit()

Congrats!

We have set up Selenium to use our Chrome browser.

Now we are ready to automate the Information Extraction.

Information Extraction

Let us identify several attributes from our Lazada website and extract their DOM elements.

(Image: Extracting the DOM elements via ID, Class, and XPATH attributes)

1. find_element by ID to return the relevant category listing.

category_element = driver.find_element(By.ID, 'Level_1_Category_No1').text
# result: Electronic Devices as the first category listing

2. Get the unordered list XPATH (ul) and extract the values for each list item (li).

You could inspect the element, right click, and select copy>XPATH to easily generate the relevant XPATH.

Feel free to open the following link for further detail.

How to Locate Elements in Chrome and IE Browsers for Building Selenium Scripts – Selenium Tutorial: This is tutorial #7 in our Selenium Online Training Series. If you want to check all Selenium tutorials in this series… (www.softwaretestinghelp.com)

list_category_elements = driver.find_element(By.XPATH, '//*[@id="J_icms-5000498-1511516689962"]/div/ul')
links = list_category_elements.find_elements(By.CLASS_NAME, "lzd-site-menu-root-item")
for i in range(len(links)):
    print("element in list ", links[i].text)
# result: {Electronic Devices, Electronic Accessories, etc.}

Clicks and Actions

Automate actions.

Suppose you want to browse to Redmart from the Lazada homepage; you can mimic the click with the ActionChains object.

element = driver.find_elements_by_class_name('J_ChannelsLink')[1]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()

Extracting all product listings from Redmart

1. Create a list of product titles.

We can extract and print them as follows:

product_titles = driver.find_elements_by_class_name('title')
for title in product_titles:
    print(title.text)

(Image: Redmart best seller title extractions)

2. Extract the product title, pack size, price, and rating. We will open several lists to contain every item and then dump them into a DataFrame.

# open the lists that will hold every item
product_titles, pack_sizes, product_prices, rating_counts = [], [], [], []

product_containers = driver.find_elements_by_class_name('product_container')
for container in product_containers:
    product_titles.append(container.find_element_by_class_name('title').text)
    pack_sizes.append(container.find_element_by_class_name('pack_size').text)
    product_prices.append(container.find_element_by_class_name('product_price').text)
    rating_counts.append(container.find_element_by_class_name('ratings_count').text)

data = {'product_title': product_titles, 'pack_size': pack_sizes, 'product_price': product_prices, 'rating_count': rating_counts}

3. Dump the information into a Pandas DataFrame and csv.

df_product = pd.DataFrame.from_dict(data)
df_product.to_csv('product_info.csv')

(Image: CSV dump for each of the products in Redmart's best sellers)

Congrats! You have effectively expanded your skills to extract any information found online!

Purpose, Github Code and Your Contributions

This Proof Of Concept (POC) was created as a part of my own side project.

The goal of this application is to use web scraping tools to extract any publicly available information without much cost or manpower.

In this POC, I used Python as the scripting language and the Beautiful Soup and Selenium libraries to extract the necessary information.

The Github Python Code is located below.

VincentTatan/Web-Scraping: Web Scraping with Beautiful Soup and Selenium. Contribute to VincentTatan/Web-Scraping development by creating an… (github.com)

Feel free to clone the repository and contribute whenever you have time.

Beautiful Soup and Stocks Investing

In line with today's topics about Python and web scraping, you could also visit another of my publications on web scraping for aspiring investors.

You should try this walkthrough, which guides you through coding quick and dirty Python to scrape, analyze, and visualize stocks.

Value Investing Dashboard with Python Beautiful Soup and Dash Python: An Overview of Web Scraping with a Quick Dash Visualization for Value Investing (towardsdatascience.com)

Hopefully, from that publication, you can learn how to scrape critical information and develop a useful application.

Please read and reach out to me if you like it.

Finally…

Whew… That's it: my idea, formulated into writing.

I really hope this has been a great read for you guys.

With that, I hope my idea could be a source of inspiration for you to develop and innovate.

Please comment below to give suggestions and feedback.

Happy coding :)

About the Author

Vincent Tatan is a Data and Technology enthusiast with relevant working experience from Visa Inc. and Lazada, implementing microservice architectures, data engineering, and analytics pipeline projects.

Vincent is a native Indonesian with a record of accomplishments in problem solving with strengths in Full Stack Development, Data Analytics, and Strategic Planning.

He has been actively consulting for the SMU BI & Analytics Club, guiding aspiring data scientists and engineers from various backgrounds, and opening up his expertise for businesses to develop their products.

Please reach out to Vincent via LinkedIn, Medium, or his Youtube Channel.

Disclaimer

This disclaimer informs readers that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author's employer, organization, committee, or other group or individual.

References are picked from the list, and any similarities with other works are purely coincidental. This article was made purely as the author's side project and is in no way driven by any hidden agenda.
