Crawling data from a website with Html Agility Pack (.NET / C#)

Beribey · Feb 24

This is my first tutorial on Medium.

Currently, the demand for data collection is increasing.

For some big sites like Facebook, Google, or Steam, we can use the APIs they provide to get data.

In many other cases, we end up extracting data manually (open the website, copy the data into Word or Excel files, and so on), which is both tedious and takes a lot of time and effort.

This week, I received a project from my teacher: build a newspaper-reader application that collects information from a website.

Suppose the target is a fairly large forum page and, of course, there is no API for getting its data.

Here, collecting the data manually is not an option.

The only solution is to write software that extracts the data from the site itself.

I will show you how to do this extraction using the HTMLAgilityPack and Fizzler libraries.

HTMLAgilityPack is a powerful HTML parsing library. The reason it is popular is that it can handle most HTML, both valid and invalid (in fact, the number of websites with invalid HTML is endless; other libraries tend to choke on them, but HTMLAgilityPack does not).

The knowledge in this article will be quite useful if you later need to extract information from another website.

You can google more with the keyword: web crawler.

Step 1: Create a new project.

Here I’m creating a new Console App.

Step 2: Installing Fizzler and Html Agility Pack.

 Go to Tools -> Library Package Manager -> Package Manager Console.

Type the following command to install the library: Install-Package Fizzler.Systems.HtmlAgilityPack

Alternatively, open the NuGet Package Manager, go to “Browse”, search for the keyword Fizzler.Systems.HtmlAgilityPack, and install it for your solution.

After installation, if you see the three references as shown below, everything is OK.
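If you prefer the command line, the same package can be installed with the .NET CLI instead of the Package Manager Console (this is an equivalent alternative, not a step from the original tutorial; the package name is the real NuGet id):

```shell
# Adds Fizzler.Systems.HtmlAgilityPack (which also pulls in HtmlAgilityPack
# and Fizzler as dependencies) to the project in the current directory.
dotnet add package Fizzler.Systems.HtmlAgilityPack
```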

I will explain some of the objects and methods of Html Agility Pack.

–HtmlDocument: a class holding information about an HTML document (encoding, InnerHtml, etc.).

We can load data into an HtmlDocument from a URL or from a file.

–HtmlNode: an HtmlNode corresponds to a tag (li, ul, div, etc.) in HTML.

The largest node, containing everything, is DocumentNode.

Some properties and methods of HtmlNode that we often use:

+ Name: the name of the node (div, ul, li).

+ Attributes: the list of attributes (an attribute is a piece of information on the node, such as src, href, id, class, ...).

+ InnerHtml, OuterHtml: self-explanatory.

+ SelectNodes(string xpath): finds the child nodes of the current node matching the given XPath.

+ SelectSingleNode(string xpath): finds the first child node of the current node matching the given XPath.

+ Descendants(): returns the list of descendant HtmlNodes of the current node.

And we can use Fizzler to make this easier.

Fizzler supports CSS selectors, allowing us to select nodes with the same syntax we use in stylesheets.

Fizzler is built on top of HTMLAgilityPack, adding the following two extension methods to HtmlNode:

+ QuerySelectorAll: finds the child nodes of the current node matching the given CSS selector.

+ QuerySelector: finds the first child node of the current node matching the given CSS selector.
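As a quick sketch of the methods above (assuming the two NuGet packages from Step 2 are installed), here is a small in-memory example; the HTML snippet and variable names are made up purely for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;                  // HtmlDocument, HtmlNode
using Fizzler.Systems.HtmlAgilityPack;  // QuerySelector / QuerySelectorAll extensions

var doc = new HtmlDocument();
doc.LoadHtml("<ul id='menu'>" +
             "<li class='item'><a href='/a'>First</a></li>" +
             "<li class='item'><a href='/b'>Second</a></li>" +
             "</ul>");

// XPath, built into Html Agility Pack:
var first = doc.DocumentNode.SelectSingleNode("//ul[@id='menu']/li/a");
Console.WriteLine(first.InnerText);

// CSS selectors, added by Fizzler:
foreach (var a in doc.DocumentNode.QuerySelectorAll("li.item a"))
    Console.WriteLine(a.Attributes["href"].Value);
```

Both calls walk the same parsed tree; which one you use is mostly a matter of taste, though CSS selectors are usually shorter for class-based lookups.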

Step 3: Write the code. My experience is that you should pick a tag with an id, located as close as possible to the data you need.

```csharp
var document = new HtmlDocument();
document.Load("your page"); // Load returns void, so do not assign its result

var items = new List<object>();
var threadItems = document.DocumentNode.QuerySelectorAll("div.title-wrap").ToList();

foreach (var item in threadItems)
{
    var title = item.QuerySelector("a").InnerText;
    var link = item.QuerySelector("a").Attributes["href"].Value;
    items.Add(new { title, link });
}
```

To understand more about CSS selectors, you can read the link below.

https://www.w3schools.com/cssref/css_selectors.asp

Step 4: Export the results somewhere.

At this point everything is done; you can save the results to a database or export them to a text file, depending on the intended use.
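For example, a minimal sketch of the text-file option (the file name and the sample data standing in for the crawled results are both made up):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sample data standing in for the items collected in Step 3.
var items = new List<(string Title, string Link)>
{
    ("Hello world", "/threads/1"),
    ("Second post", "/threads/2"),
};

// Write one "title | link" pair per line; "threads.txt" is an arbitrary name.
File.WriteAllLines("threads.txt",
    items.Select(i => $"{i.Title} | {i.Link}"));
```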

Hope this article will be useful for your programming career.
