Data Science Skills: Web scraping javascript using python

The techniques used will be the following:

- Using Selenium with the Firefox web driver
- Using a headless browser with PhantomJS
- Making an API call using a REST client or the python requests library

TL;DR: For examples of scraping javascript web pages in python, you can find the complete code as covered in this tutorial over on GitHub.

First steps

To start the tutorial, I first needed to find a website to scrape. For the API approach later in the tutorial I will be using Insomnia, but feel free to use whichever REST client you prefer!

Scraping the web page using BeautifulSoup

Following the standard steps outlined in my introductory tutorial on web scraping, I have inspected the webpage and want to extract the repeated HTML element:

    <div data-cid="XXXX" class="listing category_templates clearfix productListing ">…</div>

As a first step, you might try using BeautifulSoup to extract this information using the following script.

    # import libraries
    import urllib.request
    from bs4 import BeautifulSoup

    # specify the url
    urlpage = 'https://groceries.asda.com/search/yogurt'
    print(urlpage)

    # query the website and return the html to the variable 'page'
    page = urllib.request.urlopen(urlpage)

    # parse the html using beautiful soup and store in variable 'soup'
    soup = BeautifulSoup(page, 'html.parser')

    # find product items
    results = soup.find_all('div', attrs={'class': 'listing category_templates clearfix productListing'})
    print('Number of results', len(results))

Unexpectedly, when running the python script, the number of results returned is 0 even though I see many results on the web page!

    https://groceries.asda.com/search/yoghurt
    Number of results 0

When further inspecting the page, there are many dynamic features on the web page which suggest that javascript is used to present these results. By right-clicking and selecting View Page Source, there are many <script> elements in use, and searching for the element above containing the data we are interested in returns no matches.

The first approach to scrape this webpage is to use the Selenium web driver to call the browser, search for the elements of interest and return the results.

Scraping the web page using Selenium

1. Selenium with geckodriver

Since we are unable to access the content of the web page using Beautiful Soup, we first need to set up a web driver in our python script.

    # import libraries
    import urllib.request
    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time
    import pandas as pd

    # specify the url
    urlpage = 'https://groceries.asda.com/search/yogurt'
    print(urlpage)

    # run firefox webdriver from executable path of your choice
    driver = webdriver.Firefox(executable_path='your/directory/of/choice')

As mentioned when installing geckodriver, if the executable file is not on an executable path, we are able to define the path in our python script.

Below is a simple example to get the page to scroll; there will be more efficient ways to do this, so why not test your own javascript here and let me know in the comments what works best for you! We also add a sleep time as another method to wait for the page to fully load.

    # get web page
    driver.get(urlpage)
    # execute script to scroll down the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # sleep for 30s
    time.sleep(30)
    # driver.quit()

If we run the script now (you can also uncomment driver.quit() at the end to ensure the browser closes), then as the python script runs, Firefox will open the url specified and scroll down the page.
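One of the more efficient alternatives to a fixed sleep mentioned above is Selenium's explicit waits, which poll the page and return as soon as a condition is met rather than always pausing for the full 30 seconds. Below is a minimal sketch, assuming the driver has been set up as above and that the product listings still use the class we found during inspection:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # wait up to 30s for at least one product listing to be present,
    # returning as soon as the elements appear instead of always sleeping
    wait = WebDriverWait(driver, 30)
    results = wait.until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, 'div.listing.category_templates.clearfix.productListing')
        )
    )
    print('Number of results', len(results))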
2. Using a headless browser with PhantomJS

To run without opening a browser window, we can follow the method above but change the line that initialises the web driver, which becomes:

    # run phantomJS webdriver from executable path of your choice
    driver = webdriver.PhantomJS(executable_path='your/directory/of/choice')

Note here that Selenium support for PhantomJS has been deprecated and produces a warning.

It is also possible to use headless mode with geckodriver by using the headless option:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options, executable_path='your/directory/of/choice')

By using the headless browser, we should see an improvement in the time it takes the script to run since we aren't opening a browser, but not all results are scraped in the same way as when using the Firefox web driver in normal mode.

Making an API request

The final approach we will discuss in this tutorial is making a request to an API. By inspecting the web page we can find the details of the HTTP request that returns the search results, and a simple GET request can be made directly from a browser or REST client. For other cases, the REST client allows you to enter any additional request parameters that you can get from the inspect tool when gathering the request details.

Python request

We can also make the same request from python using the urllib.request library, in the same way that we connect to a web page before scraping (a version using the requests library is sketched at the end of this post).

The JSON response can be made more readable by adding a few parameters for indenting and sorting the keys, so that we can now open the file and see the response data provided to the webpage when a search is made.

    # import libraries
    import json
    import urllib.request

    # request url
    urlreq = 'https://groceries.asda.com/api/items/search?keyword=yogurt'

    # get response
    response = urllib.request.urlopen(urlreq)

    # load as json
    jresponse = json.load(response)

    # write to file as pretty print
    with open('asdaresp.json', 'w') as outfile:
        json.dump(jresponse, outfile, sort_keys=True, indent=4)

For now, we will keep all the data.

To summarise, we have covered several methods for scraping a javascript-driven web page in python. These methods include:

- Using a web driver to scrape content (a consolidated sketch of these steps follows at the end of this post):
  - Using the selenium web driver to connect to a web page, either with the Firefox web driver, PhantomJS or a headless browser
  - Using the web driver to find the elements of interest
  - Looping over the results and saving the variables of interest
  - Saving the data to a dataframe
  - Writing it to a csv file
- Making a HTTP request:
  - Inspecting the web page to find the HTTP request details
  - Making the GET request using either a browser, a REST client or python

Whilst the HTTP request method is quicker to implement in this tutorial and provides all the data we need from one request, this is not always the case.
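As the introduction mentioned the python requests library as an alternative way of making the API call, here is a minimal sketch of the same GET request using requests instead of urllib.request; it assumes the third-party package is installed (pip install requests):

    # a requests-based version of the GET request above
    import json
    import requests  # assumption: installed via pip install requests

    urlreq = 'https://groceries.asda.com/api/items/search?keyword=yogurt'

    # make the GET request, fail loudly on HTTP errors and decode the JSON body
    response = requests.get(urlreq)
    response.raise_for_status()
    jresponse = response.json()

    # write to file as pretty print, as before
    with open('asdaresp.json', 'w') as outfile:
        json.dump(jresponse, outfile, sort_keys=True, indent=4)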
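And to tie together the web driver steps listed in the summary, here is a hedged sketch of the full loop. The CSS class comes from the inspection earlier, but the inner markup of each listing (and therefore how you would split element.text into separate fields) is an assumption to verify against the live page, and the csv filename is illustrative:

    # consolidated sketch: connect, wait, extract, save to csv
    import time
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    urlpage = 'https://groceries.asda.com/search/yogurt'

    driver = webdriver.Firefox(executable_path='your/directory/of/choice')
    driver.get(urlpage)
    time.sleep(30)  # wait for the javascript to render the listings

    # find the repeated product listings identified during inspection
    results = driver.find_elements(
        By.CSS_SELECTOR, 'div.listing.category_templates.clearfix.productListing')

    # loop over the results, saving the rendered text of each listing;
    # splitting this into name/price fields depends on the page's inner markup
    rows = [{'listing_text': element.text} for element in results]

    driver.quit()

    # save the data to a dataframe and write it to a csv file
    df = pd.DataFrame(rows)
    df.to_csv('asda_yogurt.csv', index=False)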
