Web Scraping using Selenium and BeautifulSoup

And then we need to ensure that we go back to the main page after we are finished with a page.

```python
num_links = len(driver.find_elements_by_link_text('Watch'))

code_blocks = []
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    # get soup
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
    tutorial_soup = BeautifulSoup(driver.page_source, 'html.parser')
    tutorial_code_soup = tutorial_soup.find_all('div', attrs={'class': 'code-toolbar'})
    tutorial_code = [block.getText() for block in tutorial_code_soup]
    code_blocks.append(tutorial_code)
    # go back to initial page
    driver.execute_script("window.history.go(-1)")

print(code_blocks)
```

Figure 4: Scraping data

This outputs an array of arrays containing all the code of my Keras tutorials.

```
[['import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom keras.datasets import mnist\nfrom keras.utils import to_categorical ', 'def getData(): Copy', 'def getData():\n(X_train, y_train), (X_test, y_test) = mnist.load_data()\nimg_rows, img_cols = 28, 28 ', ' y_train = to_categorical(y_train, num_classes=10)\ny_test = to_categorical(y_test, num_classes=10) Copy', ' X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)\nX_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1) ', …
```

Lastly, you should always close the browser instance.

```python
driver.quit()
```

Save data

Now that we have the data stored in an array, we can save it to disk. We will save the code from each tutorial in a separate .txt file.

```python
for i, tutorial_code in enumerate(code_blocks):
    with open('code_blocks{}.txt'.format(i), 'w') as f:
        for code_block in tutorial_code:
            f.write(code_block + "\n")
```

Conclusion

Selenium is a browser automation tool that can be used for many purposes, including testing and web scraping. It can be used on its own or in combination with another scraping library like BeautifulSoup.

If you liked this article, consider subscribing to my Youtube Channel and following me on social media. The code covered in this article is available as a Github Repository. If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.
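As a quick sanity check, the save step above can be exercised without launching a browser by substituting a hand-written `code_blocks` list for the scraped data (the strings below are hypothetical placeholders, not actual scraper output):

```python
import os
import tempfile

# Hypothetical stand-in for the scraped data: one inner list per
# tutorial, each containing that tutorial's code blocks as strings.
code_blocks = [
    ["import numpy as np", "print('hello')"],
    ["def getData():\n    pass"],
]

# Write each tutorial's code blocks to its own .txt file,
# mirroring the save loop from the article.
out_dir = tempfile.mkdtemp()
for i, tutorial_code in enumerate(code_blocks):
    path = os.path.join(out_dir, 'code_blocks{}.txt'.format(i))
    with open(path, 'w') as f:
        for code_block in tutorial_code:
            f.write(code_block + "\n")

# Read the first file back to confirm the round trip.
with open(os.path.join(out_dir, 'code_blocks0.txt')) as f:
    contents = f.read()
print(contents)
```

Separating each block with a newline keeps the output files readable; any delimiter works as long as you account for it when reading the files back in.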
