How to extract online data using Python

Let’s return to our initial example of the website with the URL https://www.mainwebsite.com.

Let’s review the facts: we have a main website with three links to three different sections.

In each section, we have a list of links to documents.

Each section has a specific URL, e.g. https://www.mainwebsite.com/topic1.

Every link takes us to the document content that we are interested in.

We can find every link in the HTML structure of each section.

First, we’ll design our file architecture.

Let’s explore our folders.

We’ve created a master folder called scraper where we are going to store all the files related to our scraper.

Then, we’ll collect all the scraped data in JSON files.

Each of those files will be saved in the JSON folder.

There is also a common folder, which contains another folder called spiders.

There, we’ll save one file for each spider.

And we’ll create one spider for each topic, so three spiders in total.
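To make this concrete, the folder layout described above might look roughly like this (the placement of settings.py and the launcher script is an assumption; only the scraper, json, common and spiders folders are named in the text):

```
scraper/
├── scraper.py            # launches the spiders (assumed location)
├── settings.py           # Scrapy settings (assumed location)
├── json/                 # one JSON file per scraped document
└── common/
    └── spiders/
        ├── topic1.py     # one spider per topic
        ├── topic2.py
        └── topic3.py
```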

Now, it’s time to understand the files we’ve created.

Let’s start with settings.py.

The Scrapy settings allow us to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and spiders themselves.

There, we can specify the name of the bot implemented by the Scrapy project, a list of modules where Scrapy will look for spiders, and whether the HTTP cache will be enabled, among other things.
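For instance, a minimal settings.py along those lines might contain something like this (the bot name and module path are assumptions):

```python
# settings.py -- a minimal sketch; the values depend on your own project.
BOT_NAME = "scraper"                  # name of the bot implemented by the project (assumed)
SPIDER_MODULES = ["common.spiders"]   # modules where Scrapy will look for spiders (assumed path)
HTTPCACHE_ENABLED = True              # whether the HTTP cache is enabled
```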

Now, we arrive at the main two files.

We’ll start with the topic1.py spider.

We’ll examine only one example as they are all very similar.

The first thing that we need to do is import all the needed libraries.

Obviously, we need to import scrapy.

The re module will allow us to extract information using regular expressions.

The json module will help us when saving information.

The os module is useful to handle directories.
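A sketch of the top of topic1.py, with those imports:

```python
# topic1.py -- imports used by the spider
import scrapy  # the Scrapy framework itself
import re      # regular expressions, used later to extract the date
import json    # saving the scraped information as JSON files
import os      # handling directories and file paths
```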

We stated before that a spider has to inherit from scrapy.Spider, so we’ll create a class called FirstSpider that subclasses it.

We’ll assign the name topic1.

Then, we’ll define the allowed_domains list.

We also need to create the start_requests() method to initialize the requests.

Inside the method, we define a list of URLs for the requests.

In our case, this list only contains the URL www.mainwebsite.com/topic1.

Then, we are going to make a request with scrapy.Request.

We’ll use yield instead of return.

We’ll tell Scrapy to handle the downloaded content with the parse() method by passing it as the callback argument.
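Putting those pieces together, a minimal sketch of the spider could look like this (the exact start URL is an assumption based on the description):

```python
class FirstSpider(scrapy.Spider):
    name = "topic1"
    allowed_domains = ["www.mainwebsite.com"]

    def start_requests(self):
        # The list of URLs to request; here it only contains the topic1 section.
        urls = ["https://www.mainwebsite.com/topic1"]
        for url in urls:
            # yield (not return) a request and let parse() handle the response
            yield scrapy.Request(url=url, callback=self.parse)
```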

Until now, you might think that the explanation about HTML and XPath was quite useless.

Well, now is the moment we need it.

After we define our method to start the initial request, we need to define the method that will handle our downloaded information.

In other words, we need to decide what we want to do with all the data and which information is worth saving.

For this, let’s suppose this is the HTML structure of our website.

As you can see in the picture, the highlighted element is the element we need to get to extract our links.

Let’s construct our path to get there.

From all (//) the div elements that have the class col-md-12 (div[@class='col-md-12']), we need the attribute href from their a children (a/@href).

So, we have then our XPath: //div[@class='col-md-12']/a/@href.

In our parse method, we'll use response.xpath() to indicate the path and extract() to extract the content of every element.

We are expecting to get a list of links.

We want to extract what is shown in those links.

The spider will need to follow each of them and parse their content using a second parse method that we’ll call parse_first.

Notice that this time we send the links through response.follow() instead of creating a new scrapy.Request.
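Continuing the sketch, the parse() method could look like this, using the XPath built above (response.follow is available in Scrapy 1.4 and later):

```python
    def parse(self, response):
        # All hrefs matching our XPath: the list of document links in this section.
        links = response.xpath("//div[@class='col-md-12']/a/@href").extract()
        for link in links:
            # response.follow builds the request for us (it also handles relative URLs)
            yield response.follow(link, callback=self.parse_first)
```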

Next, we have to define the parse_first method to tell the spider what to do with each followed link.

We are going to extract the title and the body of the document.

After exploring the HTML structure of one document, we’ll get any element whose id is titleDocument, and all paragraphs that are children of any element whose id is BodyDocument.

Because we don’t care which tag those elements have, we’ll use the wildcard *.

After getting each paragraph, we’ll append it to a list.

After that, we’ll join all the paragraphs in the text list together.

We’ll extract the date.

Finally, we’ll define a dictionary with the date, title and text.

Lastly, we’ll save the data into a JSON file.
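A sketch of parse_first() under the same assumptions; the exact text extraction, the file naming and the way the date is obtained are guesses based on the description:

```python
    def parse_first(self, response):
        # Title: any element (*) whose id is titleDocument.
        title = response.xpath("//*[@id='titleDocument']//text()").extract_first()

        # Body: every paragraph that is a child of any element whose id is BodyDocument.
        text = []
        for paragraph in response.xpath("//*[@id='BodyDocument']/p"):
            text.append(paragraph.xpath("string(.)").extract_first())
        body = " ".join(text)

        # Date pulled out of the text with a regular expression (see extractdate below).
        date = extractdate(body)

        # Dictionary with the date, title and text, saved as one JSON file per document.
        data = {"date": date, "title": title, "text": body}
        os.makedirs("json", exist_ok=True)
        filename = os.path.join("json", f"{self.name}_{response.url.split('/')[-1]}.json")
        with open(filename, "w") as f:
            json.dump(data, f)
```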

Here is the definition of the extractdate function, where we use a regular expression to extract the date.
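Since the original pattern isn’t reproduced here, this is just one plausible version, assuming the date appears somewhere in the text in a DD/MM/YYYY format:

```python
def extractdate(text):
    # Hypothetical pattern: the real one depends on how dates appear on the website.
    match = re.search(r"\d{1,2}/\d{1,2}/\d{4}", text)
    return match.group(0) if match else None
```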

Now, our spider is complete.

It’s time to investigate the scraper.py file.

Not only do we need to create spiders, we also need to launch them.

First, we’ll import the required modules from Scrapy.

CrawlerProcess will initiate the crawling process, and the settings module will let us configure that process.

We’ll also import the three spider classes created for each topic.

After that, we initiate a crawling process, tell the process which spiders to use, and finally start the crawling.
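A sketch of scraper.py along those lines; the module paths and the names of the second and third spider classes are assumptions, and get_project_settings() is used here as one common way to load settings.py:

```python
# scraper.py -- launches the three spiders (names and paths are assumed).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from common.spiders.topic1 import FirstSpider
from common.spiders.topic2 import SecondSpider
from common.spiders.topic3 import ThirdSpider

# Initiate a crawling process with the project settings.
process = CrawlerProcess(settings=get_project_settings())

# Tell the process which spiders to use...
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.crawl(ThirdSpider)

# ...and finally start the crawling (this blocks until all spiders finish).
process.start()
```

Assuming the launcher is saved as scraper.py, it can then be run from inside the scraper folder with python3 scraper.py.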

Perfect! We now have our scraper built!

But wait, how do we actually start scraping our website?

In the terminal, we navigate to our scraper folder on the command line (using cd).

Once inside, we just launch the spiders with the python3 command mentioned above.

And voilà! The spiders are crawling the website!

Here, I’ve listed a couple of very nice resources and courses to learn more about web scraping:

- DataCamp Course: Web Scraping tutorial
- Scrapy documentation
- HTML long and short explanation

