Mapping Physical Activity by Webscraping with Selenium in R

Here, I use a relative xpath starting from the header of the HTML document, so my xpath starts with //header.

    login <- remDr$findElement(using = "xpath", "//header/div[2]/nav[2]/ul/li[1]/a")

[Figure: The HTML for MapMyRun's homepage, with the relative xpath to the Log In link highlighted.]

Now we tell ChromeDriver to click the login element.

    login$clickElement()

This takes us to the Login page, where we need to input our email and password. But the input fields for these are both <input> elements, so how will we differentiate one from the other? Lucky for us, web developers have already solved this problem by giving elements identifying attributes. Many types of attributes exist, but the three that we will mainly use are name, ID, and class. An ID should be used only once per webpage, while a class can be shared by many elements. Names are usually used in forms, as is the case for our Email/Password input form.

[Figure: The HTML for the input form, with the name and ID of our input elements highlighted.]

    # Either name or id will work to select the element.
    email <- remDr$findElement(using = "name", value = "email")
    password <- remDr$findElement(using = "id", value = "password")

Use the sendKeysToElement function to input the email and password.

    # Put your own email address and password inside the quotes.
    email$sendKeysToElement(list(""))
    password$sendKeysToElement(list("Password"))

To click the Log In button, we can select it by its class, of which it has several. To deal with compound classes in RSelenium, we need to use css as our selector.

    login <- remDr$findElement(using = "css", "[class='success-2jB0o button-2M08K medium-3PyzS button-2UrgI button-3ptrG']")
    login$clickElement()

[Figure: screenshot with some personal information blocked out.]

We've logged in and are at our homepage. To access the posted routes, we need to navigate to the Find Routes page, which is inside the Routes dropdown on the top left. I select it with another relative xpath.

    findRoutes <- remDr$findElement(using = "xpath", "//header/div[1]/nav[1]/ul/li[2]/ul/li/a")

[Figure: The HTML for our homepage, with the xpath to the Find Routes link highlighted.]

    findRoutes$clickElement()

We've made it to the routes page, but now we need to input our city in the Near section. MapMyRun will have a default location for you (called a placeholder), but this can be wrong: in my case, it combines Baltimore, MD with the zip code for Columbus, Ohio! We'll empty that by clicking the x next to it, which is a <span> with a class of "Select-clear".

    clear <- remDr$findElement(using = "class", value = "Select-clear")
    clear$clickElement()

Since the Near <input> element does not have a unique ID, name, or class, we instead find all of the <input> elements on the page using findElements, choose the second one, and send our city's information followed by the tab key to autofill.

    # Find all input elements with findElements.
    inputElements <- remDr$findElements(using = "css", value = "input")
    # Choose the second one (stored here as `city`).
    city <- inputElements[[2]]
    # Send city info and the code for the tab key, \uE004.
    city$sendKeysToElement(list("Baltimore, MD, 21231", "\uE004"))

One more thing: we want to get all uploaded activities, not just those above three miles.

    distance <- remDr$findElement(using = "name", value = "distanceMinimum")
    distance$sendKeysToElement(list(key = "backspace", "0"))

We can now click the search button, which has a compound class.

    searchButton <- remDr$findElement(using = "css", "[class = 'primary-xvWQU button-2M08K medium-3PyzS']")
    searchButton$clickElement()
    # Scroll down the page to see the results.
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))

Each of the links in the middle of the HTML table leads to the webpage where the data for that run is kept. We need code that will download the 20 links (which are <a> HTML elements), press the Next button, and repeat. I do this with a while loop that continues as long as there is both a Previous and a Next button at the bottom of the page, which will cause it to end once it has gone through all of the workouts. Since both buttons are <span> elements inside <a> elements with class pageLink-3961h, I tell ChromeDriver to keep downloading as long as there are two such elements.

    library(tibble)
    # Initialize a starter tibble with 20 empty rows (one column, named url here).
    runs <- tibble(url = rep(NA_character_, 20))
    # Start the while loop.
    while (length(remDr$findElements(using = "xpath", "//a[@class= 'pageLink-3961h']/span")) == 2) {
      # Find the 20 links on the page that contain "routes/view".
      links <- remDr$findElements("xpath", "//a[contains(@href, 'routes/view')]")
      # Save these in the top rows of the runs tibble.
      for (i in seq_along(links)) {
        link <- links[[i]]
        # getElementAttribute returns a list; take its first element.
        href <- link$getElementAttribute("href")
        runs[i, 1] <- href[[1]]
      }
      # Add 20 more empty rows on top to put the next set of links in.
      runs <- rbind(tibble(url = rep(NA_character_, 20)), runs)
      # Click the Next button (`next` is a reserved word in R, so use another name).
      nextButton <- remDr$findElement(using = "xpath", "//a[@class= 'pageLink-3961h'][2]/span")
      nextButton$clickElement()
      # Wait to make sure the webpage fully loads.
      Sys.sleep(10)
    }

We now have a list of 20,000 URLs in the runs data frame, each of which links to a run's webpage.
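The pagination loop can also be written so that links accumulate in a list instead of a pre-sized tibble. The sketch below is not the original code: `collect_pages`, `get_links`, `has_next`, `go_next`, and `max_pages` are hypothetical names, and the three callbacks stand in for the RSelenium calls (finding the links, checking for the Next button, clicking it and waiting) so the logic can be run without a browser.

```r
# Hypothetical helper: walk through result pages and gather every link.
# get_links(): returns a character vector of hrefs on the current page
# has_next():  returns TRUE while another page is available
# go_next():   advances to the next page (e.g. clicks Next, then waits)
collect_pages <- function(get_links, has_next, go_next, max_pages = 1000) {
  pages <- list()
  n <- 0
  repeat {
    n <- n + 1
    pages[[n]] <- get_links()
    # Stop after the last page, or at the safety cap.
    if (!has_next() || n >= max_pages) break
    go_next()
  }
  # Flatten the per-page vectors into one data frame, dropping duplicates.
  data.frame(url = unique(unlist(pages)), stringsAsFactors = FALSE)
}

# Stand-in "site" with three pages of fake links, used only to exercise the loop.
fake_site <- list(
  c("/routes/view/1", "/routes/view/2"),
  c("/routes/view/3", "/routes/view/4"),
  c("/routes/view/5")
)
page <- 1
runs_demo <- collect_pages(
  get_links = function() fake_site[[page]],
  has_next  = function() page < length(fake_site),
  go_next   = function() page <<- page + 1
)
print(nrow(runs_demo))
```

With a live session, `get_links` could wrap `remDr$findElements` and `getElementAttribute("href")`, and `go_next` could click the Next button and call `Sys.sleep`. Because the check for another page happens after the links are collected, the final page is included as well.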
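A note on the compound-class selectors used for the Log In and search buttons: `[class='a b c']` is an attribute selector, so it matches only when the class attribute is exactly that string, in that order. A chained class selector such as `.a.b.c` matches regardless of class order or additional classes. The helper below, `class_selector`, is a hypothetical name and not part of RSelenium; it is a sketch that has not been tried against MapMyRun's current markup.

```r
# Hypothetical helper: turn a space-separated class attribute into a
# chained CSS class selector, e.g. "a b" becomes ".a.b".
class_selector <- function(classes) {
  # Split on any run of whitespace after trimming the ends.
  parts <- strsplit(trimws(classes), "\\s+")[[1]]
  # Prefix each class with a dot and join them with no separator.
  paste0(".", parts, collapse = "")
}

selector <- class_selector("primary-xvWQU button-2M08K medium-3PyzS")
print(selector)
```

The resulting string could then be passed to `remDr$findElement(using = "css", selector)` in place of the attribute selector.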
