SoftCircuits.HtmlMonkey 2.1.2

Targets: .NET 5.0, .NET Standard 2.0

Package Manager Console (Visual Studio):
NuGet\Install-Package SoftCircuits.HtmlMonkey -Version 2.1.2

.NET CLI:
dotnet add package SoftCircuits.HtmlMonkey --version 2.1.2

PackageReference (copy this XML node into the project file):
<PackageReference Include="SoftCircuits.HtmlMonkey" Version="2.1.2" />

Paket CLI:
paket add SoftCircuits.HtmlMonkey --version 2.1.2

Script and interactive (F# Interactive, C# scripting, .NET Interactive):
#r "nuget: SoftCircuits.HtmlMonkey, 2.1.2"

Cake:
// Install SoftCircuits.HtmlMonkey as a Cake Addin
#addin nuget:?package=SoftCircuits.HtmlMonkey&version=2.1.2

// Install SoftCircuits.HtmlMonkey as a Cake Tool
#tool nuget:?package=SoftCircuits.HtmlMonkey&version=2.1.2

HtmlMonkey

Install-Package SoftCircuits.HtmlMonkey

Overview

HtmlMonkey is a lightweight HTML/XML parser written in C#. It allows you to parse HTML or XML into a hierarchy of document node objects, which can then be traversed, or queried using jQuery-like selectors. The node objects can be modified or even built from scratch using code. Finally, you can use the classes to generate HTML or XML strings from the data.

The code also includes a WinForms application that displays the parsed data nodes. It was written mostly to test the parser, but it can also be useful for inspecting the original markup.

Getting Started

You can use either of the static methods HtmlDocument.FromHtml() or HtmlDocument.FromFile() to parse HTML and create an HtmlDocument object. (Note: If you're using WinForms, watch out for conflict with System.Windows.Forms.HtmlDocument.)
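If a file needs both namespaces, a using alias is one way to sidestep the naming conflict. The sketch below assumes the library's types live in the SoftCircuits.HtmlMonkey namespace (matching the package name). The PageLoader class and its method names are hypothetical and exist only for illustration.

```csharp
// Alias the HtmlMonkey type so the name HtmlDocument resolves to it,
// even when System.Windows.Forms is also imported in this file.
// Assumption: the library namespace is SoftCircuits.HtmlMonkey.
using HtmlDocument = SoftCircuits.HtmlMonkey.HtmlDocument;

// Hypothetical helper class, not part of the library.
public static class PageLoader
{
    // Parses markup that is already held in memory.
    public static HtmlDocument LoadFromString(string html) => HtmlDocument.FromHtml(html);

    // Parses markup stored on disk; "path" is whatever file you want to read.
    public static HtmlDocument LoadFromFile(string path) => HtmlDocument.FromFile(path);
}
```

A using alias takes precedence over types brought in by ordinary using directives, so the rest of the file can refer to HtmlDocument without qualification.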

Parse an HTML Document
string html = "...";   // HTML markup
HtmlDocument document = HtmlDocument.FromHtml(html);

This code parses the HTML document into a hierarchy of nodes, which are then stored in the HtmlDocument object.

The node types include HtmlElementNode, which represents an HTML tag with attributes and any number of child nodes. HtmlTextNode nodes contain only text. HtmlCDataNode nodes hold text that the parser preserves but otherwise ignores. Examples of content placed in HtmlCDataNode nodes include CDATA content, comments and the content of <script> tags.

The code also supports the specialized HtmlHeaderNode and XmlHeaderNode nodes.

HtmlMonkey provides a number of ways to navigate parsed nodes. The HtmlDocument.RootNodes property contains the root nodes in the document. Each HtmlElementNode node includes a Children property, which can be used to access all the other nodes in the document. In addition, all nodes have NextNode, PrevNode, and ParentNode properties, which you can use to navigate the nodes in every direction.
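As a rough illustration of these navigation properties, the sketch below walks the tree downward through Children and upward through ParentNode. It assumes that RootNodes and Children can be enumerated as sequences of HtmlNode, that HtmlElementNode exposes a TagName property, and that root nodes have a null ParentNode. The TreeDumper class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using SoftCircuits.HtmlMonkey;

// Hypothetical helper class, not part of the library.
public static class TreeDumper
{
    // Writes one line per node, indented two spaces per level of depth.
    public static void Dump(IEnumerable<HtmlNode> nodes, int depth = 0)
    {
        foreach (HtmlNode node in nodes)
        {
            string indent = new string(' ', depth * 2);
            if (node is HtmlElementNode element)
            {
                // TagName is assumed to hold the element's tag name.
                Console.WriteLine($"{indent}<{element.TagName}>");
                Dump(element.Children, depth + 1);   // recurse into child nodes
            }
            else
            {
                // Non-element nodes (text, CDATA, headers) have no children.
                Console.WriteLine($"{indent}{node.GetType().Name}");
            }
        }
    }

    // Counts a node's ancestors by following ParentNode until it is null.
    public static int Depth(HtmlNode node)
    {
        int depth = 0;
        for (var parent = node.ParentNode; parent != null; parent = parent.ParentNode)
            depth++;
        return depth;
    }
}
```

Calling TreeDumper.Dump(document.RootNodes) would print an outline of the whole document.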

The HtmlDocument class also includes a Find() method, which accepts a predicate argument. This method will recursively find all the nodes in the document for which the predicate returns true, and return those nodes in a flat list.

// Returns all nodes that are the first node of their parent
IEnumerable<HtmlNode> nodes = document.Find(n => n.PrevNode == null);

You can also use the FindOfType() method. This method traverses the entire document tree to find all the nodes of the specified type.

// Returns all text nodes
IEnumerable<HtmlTextNode> nodes = document.FindOfType<HtmlTextNode>();

The FindOfType() method is also overloaded to accept an optional predicate argument.

// Returns all HtmlElementNodes that have children
IEnumerable<HtmlElementNode> nodes = document.FindOfType<HtmlElementNode>(n => n.Children.Any());
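The predicate overload combines naturally with LINQ. The following sketch collects the non-blank text in a document. It assumes that HtmlTextNode exposes its content through a Text property; the TextExtractor class is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using SoftCircuits.HtmlMonkey;

// Hypothetical helper class, not part of the library.
public static class TextExtractor
{
    // Returns the trimmed content of every text node that is not empty or whitespace.
    public static List<string> GetNonBlankText(HtmlDocument document)
    {
        return document
            .FindOfType<HtmlTextNode>(n => !string.IsNullOrWhiteSpace(n.Text)) // Text is an assumed property
            .Select(n => n.Text.Trim())
            .ToList();
    }
}
```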

Using Selectors

The HtmlDocument.Find() method also has an overload that supports using jQuery-like selectors to find nodes. Selectors provide a powerful and flexible way to locate nodes.

Specifying Tag Names

You can specify a tag name to return all the nodes with that tag.

// Get all <p> tags in the document
// Search is not case-sensitive
IEnumerable<HtmlElementNode> nodes = document.Find("p");

// Get all HtmlElementNode nodes (tags) in the document
// Same result as not specifying the tag name
// Also the same result as document.FindOfType<HtmlElementNode>();
nodes = document.Find("*");

Specifying Attributes

There are several ways to search for nodes with specific attributes. You can use the pound (#), period (.) or colon (:) to specify a value for the id, class or type attribute, respectively.

// Get any nodes with the attribute id="center-ad"
IEnumerable<HtmlElementNode> nodes = document.Find("#center-ad");

// Get any <div> tags with the attribute class="align-right"
nodes = document.Find("div.align-right");

// Returns all <input> tags with the attribute type="button"
nodes = document.Find("input:button");

For greater control over attributes, you can use square brackets ([]). This is similar to specifying attributes in jQuery, but there are some differences. The first difference is that HtmlMonkey does not support the jQuery variations that match the start, middle or end of a value. To make up for this limitation, you can use the := operator to specify that the value is a regular expression; the selector then matches if the attribute value matches that regular expression.

// Get any <p> tags with the attribute id="center-ad"
IEnumerable<HtmlElementNode> nodes = document.Find("p[id=\"center-ad\"]");

// Get any <p> tags that have both attributes id="center-ad" and class="align-right"
// Quotes within the square brackets are optional unless the value contains whitespace or most kinds of punctuation.
nodes = document.Find("p[id=center-ad][class=align-right]");

// Returns all <a> tags that have an href attribute
// The value of that attribute does not matter
nodes = document.Find("a[href]");

// Get any <p> tags with the attribute data-id with a value that matches the regular
// expression "abc-\d+"
// Not case-sensitive
nodes = document.Find("p[data-id:=\"abc-\\d+\"]");

// Finds all <a> links that link to blackbeltcoder.com
// Uses a regular expression to allow optional http:// or https://, and www. prefix
// This example is also not case-sensitive
nodes = document.Find("a[href:=\"^(http:\\/\\/|https:\\/\\/)?(www\\.)?blackbeltcoder\\.com\"]");

Note that there is one key difference when using square brackets. When using a pound (#), period (.) or colon (:) to specify an attribute value, it is considered a match if it matches any value within that attribute. For example, the selector div.right-align would match the attribute class="main-content right-align". When using square brackets, it must match the entire value (although there are exceptions to this when using regular expressions).
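The difference can be seen by running both forms against the same markup. This is only a sketch of the rule as described above: the MatchModes class and its tuple field names are hypothetical, and the expected counts follow from that description rather than from a recorded run.

```csharp
using System.Linq;
using SoftCircuits.HtmlMonkey;

// Hypothetical helper class, not part of the library.
public static class MatchModes
{
    // Counts matches for the shorthand form, the bracket form with a partial
    // value, and the bracket form with the entire attribute value.
    public static (int Shorthand, int BracketPartial, int BracketFull) Count(string html)
    {
        HtmlDocument document = HtmlDocument.FromHtml(html);

        // Shorthand: matches any one value within the class attribute.
        int shorthand = document.Find("div.right-align").Count();

        // Brackets: must match the entire attribute value.
        int bracketPartial = document.Find("div[class=right-align]").Count();
        int bracketFull = document.Find("div[class=\"main-content right-align\"]").Count();

        return (shorthand, bracketPartial, bracketFull);
    }
}
```

For the markup <div class="main-content right-align"></div>, the rule above suggests that the shorthand and full-value selectors each find the tag, while the partial bracket selector finds nothing.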

Multiple Selectors

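You can combine selectors in several ways: a comma-separated list matches any of the listed selectors, a space matches descendants, and a greater-than sign (>) matches direct children.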

// Returns all <a>, <div> and <p> tags
IEnumerable<HtmlElementNode> nodes = document.Find("a, div, p");

// Returns all <span> tags that are descendants of a <div> tag
nodes = document.Find("div span");

// Returns all <span> tags that are a direct child of a <div> tag
nodes = document.Find("div > span");

Selector Performance

Parsing selectors adds some overhead. If you use the same selectors more than once, you can optimize your code by parsing the selectors into data structures once and then passing those data structures to the find methods. The following code is further optimized by first finding a set of container nodes, and then potentially performing multiple searches against those container nodes.

// Parse selectors into SelectorCollections
SelectorCollection containerSelectors = Selector.ParseSelector("div.container");
SelectorCollection itemSelectors = Selector.ParseSelector("p.item");

// Search document for container nodes
IEnumerable<HtmlElementNode> containerNodes = containerSelectors.Find(document.RootNodes);

// Finally, search container nodes for item nodes
IEnumerable<HtmlElementNode> itemNodes = itemSelectors.Find(containerNodes);
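The same idea extends to processing many documents with one set of selectors. The sketch below parses a selector once and reuses it for every page. It relies only on the Selector.ParseSelector() and SelectorCollection.Find() calls shown above; the ItemCounter class and the "p.item" selector are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using SoftCircuits.HtmlMonkey;

// Hypothetical helper class, not part of the library.
public static class ItemCounter
{
    // Parsed once when the class is first used, then reused for every page.
    private static readonly SelectorCollection ItemSelectors = Selector.ParseSelector("p.item");

    // Returns the total number of matching nodes across all pages.
    public static int CountItems(IEnumerable<string> pages)
    {
        int total = 0;
        foreach (string html in pages)
        {
            HtmlDocument document = HtmlDocument.FromHtml(html);
            total += ItemSelectors.Find(document.RootNodes).Count();
        }
        return total;
    }
}
```

Whether the saving is noticeable depends on how many documents are processed and how complex the selectors are, so it is worth measuring before restructuring code around it.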

Enhancing the Library

This is my initial attempt at this library and I would appreciate and be responsive to any feedback from people working with it. I want to keep the library small but would like to see more testing done on a wide variety of input markup. What sort of scenarios does the library not handle correctly? This is the type of information I'd be curious about.

Product          Versions
.NET             net5.0, net5.0-windows, net6.0, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows
.NET Core        netcoreapp2.0, netcoreapp2.1, netcoreapp2.2, netcoreapp3.0, netcoreapp3.1
.NET Standard    netstandard2.0, netstandard2.1
.NET Framework   net461, net462, net463, net47, net471, net472, net48
MonoAndroid      monoandroid
MonoMac          monomac
MonoTouch        monotouch
Tizen            tizen40, tizen60
Xamarin.iOS      xamarinios
Xamarin.Mac      xamarinmac
Xamarin.TVOS     xamarintvos
Xamarin.WatchOS  xamarinwatchos

Dependencies

  • .NETStandard 2.0: No dependencies.
  • net5.0: No dependencies.
  • net6.0: No dependencies.

NuGet packages (2)

Showing the top 2 NuGet packages that depend on SoftCircuits.HtmlMonkey:

Package Downloads
SoftCircuits.WebScraper

.NET library to scrape content from the Internet. Use it to extract information from Web pages in your own application. Extracted data is written to a CSV file. Supports paging and can cycle through all combinations of any number of replacement tags. Now targets .NET Standard 2.0 or .NET 5.0, and supports nullable reference types.

TwemojiSharp

Unofficial C# wrapper of twemoji

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
2.1.2 942 5/19/2022
2.1.1 195 5/7/2022
2.1.0 2,284 12/12/2021
2.0.7 344 8/29/2021
2.0.6 1,626 4/26/2021
2.0.5 251 3/26/2021
2.0.4 262 3/13/2021
2.0.3 374 2/22/2021
2.0.2 170 2/22/2021
2.0.1 160 2/21/2021
2.0.0 169 2/20/2021
1.3.1 484 11/1/2020
1.3.0 224 10/29/2020
1.2.1 315 10/18/2020
1.2.0 250 10/16/2020
1.1.5 280 10/4/2020
1.1.4 310 9/9/2020
1.1.3 550 8/13/2020
1.1.2 323 5/19/2020
1.1.1 539 5/10/2020
1.1.0 304 5/10/2020
1.0.4 311 2/26/2020
1.0.3 362 2/21/2020
1.0.2 2,366 7/16/2019
1.0.1 340 7/7/2019
1.0.0 355 7/2/2019

Additional tweaks and enhancements.