Web Scraping with Beautiful Soup — A Use Case

")The next step is query and retrieve the data for each of the foundation’s URLs.

We have to keep two things in mind.

First, we need to query the server only once, since the data will then be stored locally.
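One way to honor this first point is to write each raw response to disk and read it back on later runs. Here is a minimal sketch (the cache directory and helper name are my own, not from the original script):

```python
import os
import requests

CACHE_DIR = 'data/raw_pages'  # assumed location, not from the original script

def fetch_once(session, url, headers):
    """Return cached HTML if available; otherwise query the server once and save it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, url.rstrip('/').split('/')[-1] + '.html')
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return f.read()
    html = session.get(url, headers=headers).text
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return html

# Usage: html = fetch_once(requests.Session(), some_url, my_headers)
```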

Second, we need to be polite: we do not want to overload the server with requests that could break it or time out.

This is where the time.sleep() function comes in. In this case, I added 10 seconds between requests.

```python
import re
import time

# container holds the parsed listing-page elements from the previous step;
# session is a requests.Session() and my_headers a dict with a User-Agent,
# both defined earlier in the script.
subresponse = []
for lines in container:
    if lines.name == 'h3':
        url_fou = lines.find_all("a", href=re.compile("cfc_locations"))[0].get('href')
        subresponse.append(session.get(url_fou, headers=my_headers))
        time.sleep(10)
```

We can now parse the data with BS4 and extract the rest of the information, such as the address. In the case of the CFC format, one can use a regular expression to split the address on the vertical bars included in the text.

```python
html_subsoup = []
for counter in range(1, len(subresponse)):
    html_subsoup.append(BeautifulSoup(subresponse[counter].text, 'html.parser'))
    # Index with -1 (the soup just appended), since html_subsoup is shorter
    # than subresponse by one.
    c_location = html_subsoup[-1].find_all('p', class_='meta-line location')
    # The pipe must be escaped; an unescaped | is regex alternation.
    address_array = re.split(r' \| ', c_location[0].text)
    print(address_array)
```

Similarly, we proceed with the person’s name, title, etc.

Genderize

The other Python library used here is Genderize, as the title prefixing the contact’s name is also required (Mr. or Ms.). This library is a client for the Genderize.io web service. Its API is free but limited to 1000 names/day, so one should not debug the code with it!

Genderize will return “male” or “female” given a name, so I create a dictionary to return the prefix.

```python
>>> genderDict = {"male": 'Mr.', "female": 'Ms.'}
>>> gen = Genderize().get(['John'])[0]['gender']
>>> print(genderDict.get(gen, "None"))
Mr.
```
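Given the 1000 names/day limit, it may be worth batching the lookups and caching the results so reruns do not hit the API again. A minimal sketch (the cache dict and helper are my own additions; Genderize().get() does accept a list of names):

```python
from genderize import Genderize

gender_cache = {}  # name -> 'male' / 'female' / None, filled once per run

def lookup_genders(names):
    # Only query names we have not seen before, in a single batched call.
    missing = [n for n in names if n not in gender_cache]
    if missing:
        for result in Genderize().get(missing):
            gender_cache[result['name']] = result['gender']
    return {n: gender_cache[n] for n in names}
```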

Pandas

After working with all the data (the full code can be found here), the last step is to write the information into a pandas dataframe and save it to a CSV file.

```python
df = pd.DataFrame({'Organization': organization,
                   'Title': gender_title,
                   'Addressee': person,
                   'Addressee Job Title': person_title,
                   'Civic Address 1 (Street Address)': street,
                   'Civic Address 2 (PO Box)': pobox,
                   'Municipality': municipality,
                   'Province or Territory': provinces,
                   'Postal Code': postalCode,
                   'Phone': phone,
                   'Website': org_url})
# The column order follows the keys of the dataframe above.
cols = ['Organization', 'Title', 'Addressee', 'Addressee Job Title',
        'Civic Address 1 (Street Address)', 'Civic Address 2 (PO Box)',
        'Municipality', 'Province or Territory', 'Postal Code',
        'Phone', 'Website']
df.to_csv('data/cfcMailingAddresses.csv', encoding='utf-8', index=False, columns=cols)
```

Final Product

Here is a snapshot of the CSV file:

While there is room for improvement, such as names that were not found in the Genderize database, or addressing Quebecers by M. or Mme, the script served its general purpose.
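One possible refinement along those lines, sketched under the assumption that the province field is available at that point (the helper and lookup table are my own):

```python
# Hypothetical helper: pick a French prefix for Quebec addressees.
prefix_by_province = {
    'Quebec': {'male': 'M.', 'female': 'Mme'},
    'default': {'male': 'Mr.', 'female': 'Ms.'},
}

def title_for(gender, province):
    table = prefix_by_province.get(province, prefix_by_province['default'])
    return table.get(gender, '')
```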

One can further refine the code by adding assertions and raising exceptions.
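As an illustration of that kind of refinement, a small sketch of a defensive check on the postal code field (the function name and regex are my own; Canadian postal codes follow the A1A 1A1 pattern):

```python
import re

POSTAL_CODE_RE = re.compile(r'^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$')

def validate_postal_code(code):
    """Raise early instead of writing a malformed code to the CSV."""
    if not POSTAL_CODE_RE.match(code):
        raise ValueError(f"Unexpected postal code format: {code!r}")
    return code
```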

NLP

As part of this learning experience, I decided to try two Natural Language Processing (NLP) libraries, NLTK and spaCy, to parse the address. Here are the results.

NLTK did not give the proper tags for an address. Most of the tokens were identified as nouns, including a place such as Banff.

```python
import nltk
from nltk.corpus import stopwords

# en_stop: English stopwords; assumed here to come from NLTK's list.
en_stop = set(stopwords.words('english'))

def preprocess_without_stopwords(sent):
    sent = nltk.word_tokenize(sent)
    sent = [word for word in sent if word not in en_stop]
    sent = nltk.pos_tag(sent)
    return sent

preprocessed_address = preprocess_without_stopwords(address_test)
```

spaCy did not give the proper tags for an address either.

While it did better than NLTK by identifying Banff Avenue as a place, Banff on its own was identified as a person.

```python
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')  # assumed model; the original setup is not shown
addr = nlp(address_test)
sentences = [x for x in addr.sents]
displacy.render(nlp(str(sentences[0])), jupyter=True, style='ent')
```

Training a model on geographical data could be another very interesting project on its own!

Full Jupyter notebook on GitHub: https://github.com/brodriguezmilla/WebScrapingCFCBS4
