Creating an Easy Website Scraper for Data Science | Sports Prediction PT.1

Well the truth is, maybe you don’t.

The first step to finding web data is coming up with a plan of attack.

Maybe all of your desired information is stored on the same page, in which case you could skip this step and just take all of the data you need in one sweep.

If you aren’t able to use this URL trick, Selenium offers functions for “manual” clicking that you can use to navigate through the website; you can find those in the Selenium documentation.
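For instance, here is a minimal sketch of that kind of navigation, assuming a Chrome driver and a hypothetical “next page” button (the URL and CSS selector are placeholders, not the real stats page markup):

# a minimal Selenium sketch: load a page, click through it, and grab the HTML
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.nhl.com/stats/teams")  # placeholder URL

# click a pagination button to move through the site "manually"
next_button = driver.find_element(By.CSS_SELECTOR, "button.next-page")  # hypothetical selector
next_button.click()

html = driver.page_source  # rendered HTML, ready to hand to BeautifulSoup
driver.quit()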

Next, I will go over how to read through HTML yourself so that you can tell your program what to look for.

Reading HTML

In order to do this next part, you will need to be able to do some HTML reading.

The first step is to navigate to a page that has the data you need; I went to a random daily NHL stats page.

Take a look at where your data is on the page.

You can see that the game statistics are all embedded in some kind of table, so it is likely that this table is its own class.

Your data might not be inside of a unique JavaScript element like mine is, but the process is still the same.

Next, right-click on the text information that you want to use and select “Inspect Element”.

This will bring up the inspection pane in your web browser where you can find all of the classes and sub-classes of the page.

Look through these and find the “route” to the class in which your desired object, chart, or information is located.

You need this in order to tell your scraper where to look.

When you select the “rt-table” class, the table on the web page is highlighted.

In the code, BeautifulSoup’s find() function does the looking for you.

You don’t need to make the program finger through every fold in the HTML to get to the table; you just need to specify which class to look for, and BeautifulSoup will find it.
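Before find() can search anything, you need a soup object to call it on. Here is a minimal sketch of that setup, assuming the rendered page source came from Selenium (the html variable is a placeholder):

# a minimal sketch (assumed setup): parse the rendered page into a soup object
from bs4 import BeautifulSoup

# 'html' is the rendered page source, e.g. Selenium's driver.page_source
nhl_soup = BeautifulSoup(html, "html.parser")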

# finding table data
table_body = nhl_soup.find(attrs="rt-table").find(attrs='rt-tbody')

This one-liner is what locates the class in Python that holds the table data and turns it into the object “table_body”, which we can then parse through to record the data we need.

The table is effectively in the format of a two-dimensional array, with each game being two arrays.

Each array holds the statistics for one team’s perspective of that game.
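To make that concrete, here is a hypothetical sketch of what one parsed game might look like (the team names and columns are simplified placeholders, not the real table layout):

# hypothetical illustration: one game yields two rows,
# one from each team's perspective (columns simplified)
game = [
    ["2018-01-01", "TOR", "MTL", 1, 0],  # Toronto's view: a win
    ["2018-01-01", "MTL", "TOR", 0, 1],  # Montreal's view: a loss
]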

Collecting information

The large chunk of code below may look like the hard part, but really we are on the home stretch.

# loading the game data
game_data = []

# finding the chart and looping through rows
for rows in table_body.find_all(attrs="rt-tr-group"):
    row_data = []

Using what we know from reading the HTML, we can simply wrap the find_all() function in a for loop, and it will automatically pull everything within the class, one row at a time.

But we can’t stop there, because the table element includes more than just the data we want; it also has some data we don’t want, such as headers and divisions.

If we look at the specific HTML for the data portion of the table, we can see that it has a different tag: td (table data).

This means we need a nested for loop to narrow the search down to only this table data.

    # looping through row data
    for td in rows.find_all():
        row_data.append(td.text)

    # (team_name and home_away are presumably set earlier in the full function)
    game_date = f"{year}-{month}-{day}"
    enemy_team = row_data[4][14:].lstrip()
    win = row_data[8]
    loss = row_data[9]
    over_time = row_data[11]
    points = row_data[12]
    goals_for = row_data[13]
    goals_against = row_data[14]
    shots_for = row_data[17]
    shots_against = row_data[18]
    power_play_goals_for = row_data[19]
    power_play_opportunities = row_data[20]
    power_play_percent = row_data[21]
    power_play_goals_against = row_data[23]
    penalty_kill_percent = row_data[24]
    faceoff_wins = row_data[25]
    faceoff_losses = row_data[26]
    faceoff_win_pct = row_data[27]

    row_data = [game_date, team_name, enemy_team, home_away, win, loss,
                over_time, points, goals_for, goals_against, shots_for,
                shots_against, power_play_goals_for, power_play_opportunities,
                power_play_percent, power_play_goals_against,
                penalty_kill_percent, faceoff_wins, faceoff_losses,
                faceoff_win_pct]
    game_data.append(row_data)

return game_data

This finds a list of the row data for us, one row at a time.

We can then select what information we want, sort it, and save it.

The only thing left to do… save it

To wrap this all up in a nice bow, I defined a batch collection method.

This method calls the nhl_daily_data() method one day at a time for a range of dates.

This creates a chart of everything we have collected, and then writes it to your computer as a CSV file.

import csv

import pandas as pd
from numpy import array  # the array() call below assumes this import

def batch_collection(start_date, end_date):
    season_dates = pd.date_range(start=start_date, end=end_date)
    for date in season_dates:
        print(date)
        year = date.year
        month = date.month
        day = date.day
        chart = array(nhl_daily_data(year, month, day))
        # .txt to CSV
        with open('2017-2018season.txt', "a") as output:
            writer = csv.writer(output, lineterminator='\n')
            writer.writerows(chart)

From here this data will need a lot more processing, but everything we need is now lumped into one file in a format that is easy to read.
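As a quick usage sketch, assuming the nhl_daily_data() method from earlier in the post and illustrative season dates:

# hypothetical usage: scrape a season's worth of games, then load the file back
batch_collection("2017-10-04", "2018-04-08")

season = pd.read_csv("2017-2018season.txt", header=None)
print(season.head())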

