RobotsTxtParser
A simple .NET 9 library for parsing robots.txt files.
Supports:
- Reading and parsing User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives
- Checking whether a given URL path is allowed for a particular user-agent, with three different “Allow vs Disallow” strategies
- Retrieving crawl-delays (as TimeSpan) for specified user-agents, with optional fallback overloads
- Collecting all Sitemap URLs declared in a robots.txt
Apart from the switchable AllowRuleImplementation strategy, the parser is immutable after construction and exposes a clean, intuitive API.
Installation
Via .NET CLI
dotnet add package RobotsTxtParser --version 2025.9.4
Via Package Manager (Visual Studio)
Install-Package RobotsTxtParser -Version 2025.9.4
Direct Reference
If you prefer to reference the local project, copy the RobotsTxtParser folder into your solution and add the project as a reference. The library targets net9.0 with C# 11 and has <Nullable>enable</Nullable> by default.
Quick Start
using RobotsTxtParser;
// Suppose you have robots.txt content as a string:
string robotsTxtContent = @"
User-agent: Googlebot
Disallow: /private
Allow: /public
User-agent: *
Disallow: /tmp
Crawl-delay: 1.5
Sitemap: https://example.com/sitemap.xml
";
// Parse it:
var robots = Robots.Load(robotsTxtContent);
// Check if "/public/page.html" is allowed for "Googlebot":
bool canGooglebotAccess = robots.IsPathAllowed("Googlebot", "/public/page.html");
// Check crawl-delay for a generic crawler:
TimeSpan defaultDelay = robots.CrawlDelay("SomeOtherBot");
// Retrieve all sitemap URLs:
foreach (var site in robots.Sitemaps)
{
if (site.Url != null)
Console.WriteLine($"Valid sitemap URL: {site.Url}");
else
Console.WriteLine($"Malformed sitemap entry: {site.Value}");
}
For more use cases, review the unit tests in the RobotsTxtParser.Tests project.
API Reference
Robots class
namespace RobotsTxtParser
{
public class Robots : IRobotsParser
{
// Properties
public string Raw { get; } // Original robots.txt content
public List<Sitemap> Sitemaps { get; private set; }
public bool Malformed { get; private set; } // True if any line was malformed
public bool HasRules { get; private set; } // True if ≥1 Access or Crawl-delay rule parsed
public bool IsAnyPathDisallowed { get; private set; }
public AllowRuleImplementation AllowRuleImplementation { get; set; }
// Static Factory
public static Robots Load(string content);
// IRobotsParser Implementation
public bool IsPathAllowed(string userAgent, string path);
public TimeSpan CrawlDelay(string userAgent);
public TimeSpan CrawlDelay(string userAgent, int fallbackAmount);
public TimeSpan CrawlDelay(string userAgent, TimeSpan fallbackAmount);
}
}
Load(string content)
Parses the entire robots.txt content and returns a Robots instance. If content is null or whitespace, no rules are parsed and HasRules == false.

IsPathAllowed(string userAgent, string path) : bool
Returns true if the given path is allowed for the specified userAgent, after normalizing path. Throws ArgumentException if userAgent is null/empty/whitespace. If there are no rules, or if no Disallow rules exist, always returns true. The logic respects the chosen AllowRuleImplementation.

CrawlDelay(string userAgent) : TimeSpan
Returns the crawl-delay (in milliseconds, as a TimeSpan) for the userAgent. Throws ArgumentException if userAgent is null/empty/whitespace. If no crawl-delay rule matches, returns TimeSpan.Zero. Specific rules are checked first; if none match, the global (*) rule is used.

CrawlDelay(string userAgent, int fallbackAmount) : TimeSpan
Same as CrawlDelay(string), but if no matching rule (specific or global) is found, returns TimeSpan.FromMilliseconds(fallbackAmount) instead of zero.

CrawlDelay(string userAgent, TimeSpan fallbackAmount) : TimeSpan
Same as above, but the fallback is a TimeSpan directly. If no rule is found, returns fallbackAmount.

Raw
The unmodified string passed into Load(...).

Sitemaps
A list of Sitemap objects representing each Sitemap: directive.

Malformed
true if at least one line was out of expected context (e.g. Disallow before any User-agent, or an unrecognized directive). Parsed valid rules still apply.

HasRules
true if at least one Allow/Disallow or Crawl-delay directive was successfully recorded under some User-agent.

IsAnyPathDisallowed
true if there is at least one Disallow with a non-empty path (meaning not “Disallow: ”).

AllowRuleImplementation
Determines how to resolve conflicts when multiple Allow/Disallow rules match a path. Default is MoreSpecific.
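As a quick illustration, the following minimal sketch (assuming the members listed above) inspects the diagnostic properties after parsing:

using System;
using RobotsTxtParser;

string content = @"
User-agent: *
Disallow: /admin
Crawl-delay: 2
";

var robots = Robots.Load(content);

Console.WriteLine(robots.HasRules);            // True  - a Disallow and a Crawl-delay rule were recorded
Console.WriteLine(robots.IsAnyPathDisallowed); // True  - "/admin" is a non-empty Disallow path
Console.WriteLine(robots.Malformed);           // False - every line was understood
Console.WriteLine(robots.Raw == content);      // True  - Raw holds the original text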
IRobotsParser interface
namespace RobotsTxtParser
{
public interface IRobotsParser
{
bool IsPathAllowed(string userAgent, string path);
TimeSpan CrawlDelay(string userAgent);
TimeSpan CrawlDelay(string userAgent, int fallbackAmount);
TimeSpan CrawlDelay(string userAgent, TimeSpan fallbackAmount);
}
}
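Because Robots implements IRobotsParser, crawler code can depend on the interface rather than the concrete class, which makes it easy to substitute a fake parser in tests. A minimal sketch follows; the ShouldFetch helper is illustrative and not part of the library:

using System;
using RobotsTxtParser;

// Hypothetical helper that only depends on the interface.
static bool ShouldFetch(IRobotsParser robots, string userAgent, string path, out TimeSpan delay)
{
    delay = robots.CrawlDelay(userAgent, fallbackAmount: 1000); // assume 1 s when no rule matches
    return robots.IsPathAllowed(userAgent, path);
}

IRobotsParser parser = Robots.Load("User-agent: *\nDisallow: /private\nCrawl-delay: 2");
bool fetch = ShouldFetch(parser, "MyCrawler", "/docs/index.html", out TimeSpan wait);
Console.WriteLine($"{fetch}, wait {wait.TotalMilliseconds} ms"); // True, wait 2000 ms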
AllowRuleImplementation enum
namespace RobotsTxtParser
{
public enum AllowRuleImplementation
{
Standard, // Pick the matched rule with lowest “order” (first-seen)
AllowOverrides, // If any matching rule is Allow, path is allowed
MoreSpecific // Pick the rule with longest Path, then by order
}
}
Sitemap class
namespace RobotsTxtParser
{
public class Sitemap
{
public string Value { get; } // Raw text after “Sitemap:” (never null)
public Uri? Url { get; } // Parsed absolute Uri, or null if invalid
internal static Sitemap FromLine(Line line);
}
}
Use robots.Sitemaps after calling Robots.Load(...); each item has:
- Value – the exact substring from robots.txt after “Sitemap:”
- Url – a Uri if Value is a well-formed absolute URL; otherwise null.
Usage Examples
Basic “Allow/Disallow” check
string robotsTxt = @"
User-agent: *
Disallow: /private
Allow: /public
";
var robots = Robots.Load(robotsTxt);
// Default is MoreSpecific
Console.WriteLine(robots.IsPathAllowed("anybot", "/public/index.html")); // True
Console.WriteLine(robots.IsPathAllowed("anybot", "/private/data.txt")); // False
Switching “Allow” rule strategy
string robotsTxt = @"
User-agent: *
Disallow: /foo
Allow: /foo
";
var r = Robots.Load(robotsTxt);
// Standard: pick first-seen → Disallow
r.AllowRuleImplementation = AllowRuleImplementation.Standard;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // False
// AllowOverrides: any Allow wins → allowed
r.AllowRuleImplementation = AllowRuleImplementation.AllowOverrides;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // True
// MoreSpecific: tie-break by order (since both are "/foo")
r.AllowRuleImplementation = AllowRuleImplementation.MoreSpecific;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // False (Disallow first)
Crawl-delay retrieval
string robotsTxt = @"
User-agent: MyBot
Crawl-delay: 4.25
User-agent: *
Crawl-delay: 2
";
var robots = Robots.Load(robotsTxt);
// “MyBot” → 4250 ms
TimeSpan myDelay = robots.CrawlDelay("MyBot");
Console.WriteLine(myDelay.TotalMilliseconds); // 4250
// Other bots → 2000 ms
TimeSpan otherDelay = robots.CrawlDelay("OtherBot");
Console.WriteLine(otherDelay.TotalMilliseconds); // 2000
// If no matching rule (and no global "*"), returns TimeSpan.Zero
var empty = Robots.Load(@"User-agent: BotOnly");
Console.WriteLine(empty.CrawlDelay("BotOnly") == TimeSpan.Zero); // True
Crawl-delay with fallback overloads
string robotsTxt = @"
User-agent: BotA
Crawl-delay: 3
User-agent: *
Crawl-delay: 1
";
var robots = Robots.Load(robotsTxt);
// Specific rule exists (3s):
TimeSpan result1 = robots.CrawlDelay("BotA", 10000);
Console.WriteLine(result1.TotalMilliseconds); // 3000
// No specific for "OtherBot" → global (1s):
TimeSpan result2 = robots.CrawlDelay("OtherBot", 5000);
Console.WriteLine(result2.TotalMilliseconds); // 1000
// If no global either, returns fallback:
var limited = Robots.Load(@"User-agent: BotX
Crawl-delay: 2.5");
TimeSpan fallbackTs = TimeSpan.FromMilliseconds(750);
TimeSpan result3 = limited.CrawlDelay("NoMatch", fallbackTs);
Console.WriteLine(result3.TotalMilliseconds); // 750
Extracting all Sitemap URLs
string robotsTxt = @"
User-agent: *
Sitemap: https://example.com/sitemap1.xml
Sitemap: not_a_real_url
Sitemap: https://cdn.example.com/other-sitemap.xml
";
var robots = Robots.Load(robotsTxt);
foreach (var site in robots.Sitemaps)
{
Console.WriteLine($"Raw value: '{site.Value}'");
if (site.Url != null)
Console.WriteLine($" Parsed URI: {site.Url}");
else
Console.WriteLine(" (Invalid URI)");
}
// Output:
// Raw value: 'https://example.com/sitemap1.xml'
// Parsed URI: https://example.com/sitemap1.xml
// Raw value: 'not_a_real_url'
// (Invalid URI)
// Raw value: 'https://cdn.example.com/other-sitemap.xml'
// Parsed URI: https://cdn.example.com/other-sitemap.xml
Handling malformed lines
string content = @"
Disallow: /private # no preceding User-agent → malformed
User-agent: *
Allow: /public
FooBar: /ignored # unknown field → malformed
";
var robots = Robots.Load(content);
Console.WriteLine($"Malformed? {robots.Malformed}"); // True
Console.WriteLine($"HasRules? {robots.HasRules}"); // True (because “Allow” under valid UA)
Console.WriteLine(robots.IsPathAllowed("any", "/private")); // True (early Disallow ignored)
Console.WriteLine(robots.IsPathAllowed("any", "/public")); // True
Notes & Caveats
Normalization of path
IsPathAllowed(...) calls NormalizePath(path):
- Converts null or whitespace to "/".
- Ensures a leading /.
- Collapses repeated // into a single /.
Matching logic strips the leading / before comparing to rule paths (see the sketch below).
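For example, under these normalization rules the following calls should all resolve to the same path before matching (a sketch, assuming the behavior described above):

using System;
using RobotsTxtParser;

var robots = Robots.Load("User-agent: *\nDisallow: /private");

Console.WriteLine(robots.IsPathAllowed("bot", "/private/file.txt"));   // False
Console.WriteLine(robots.IsPathAllowed("bot", "//private//file.txt")); // False - repeated "/" collapsed into one
Console.WriteLine(robots.IsPathAllowed("bot", "private/file.txt"));    // False - leading "/" added
Console.WriteLine(robots.IsPathAllowed("bot", " "));                   // True  - whitespace becomes "/", which is not disallowed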
Wildcard & $ support
- * in a rule path matches any sequence of characters (see the sketch below).
- A trailing $ means “end-of-string” match.
- Internally, IsPathMatch(pathWithoutSlash, rulePathWithoutSlash) implements these recursively.
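A short sketch of those pattern semantics, assuming * and a trailing $ are honored in Disallow paths as described:

using System;
using RobotsTxtParser;

string robotsTxt = @"
User-agent: *
Disallow: /private*.html$
";

var robots = Robots.Load(robotsTxt);

Console.WriteLine(robots.IsPathAllowed("bot", "/private-page.html"));     // False - "*" covers "-page", "$" matches the end
Console.WriteLine(robots.IsPathAllowed("bot", "/private-page.html.bak")); // True  - extra characters after ".html" break the "$" anchor
Console.WriteLine(robots.IsPathAllowed("bot", "/public/page.html"));      // True  - the "/private" prefix is required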
Case-insensitive matching
- Directive names (User-agent, Allow, Disallow, etc.) are matched case-insensitively.
- User-agent value matching (in AccessRule and CrawlDelayRule) is also a case-insensitive substring match (see the sketch below).
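In practice this means lowercase directives and differently cased user-agents still match (a sketch, assuming the substring matching described above):

using System;
using RobotsTxtParser;

string robotsTxt = @"
user-agent: GoogleBot
disallow: /secret
";

var robots = Robots.Load(robotsTxt);

// Lowercase directive names are recognized, and "GoogleBot" is found as a
// case-insensitive substring of the crawler's full user-agent string.
Console.WriteLine(robots.IsPathAllowed("googlebot/2.1", "/secret")); // False
Console.WriteLine(robots.IsPathAllowed("SomeOtherBot", "/secret"));  // True - no matching group and no global "*" group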
Malformed lines
- A line is marked malformed if it appears out of context (e.g. Disallow before any User-agent) or if the field name is unrecognized.
- Malformed lines set robots.Malformed = true, but valid rules under a valid User-agent still apply.
Global (*) rules
- A rule’s UserAgent of "*" is stored in global lists.
- If no specific rule matches a given user-agent, the parser falls back to the global rule.
- If multiple global rules exist, the first one (lowest Order) is used unless MoreSpecific is in effect (see the sketch below).
License
RobotsTxtParser is licensed under the GNU Affero General Public License, version 3 (AGPL-3.0-or-later). See LICENSE for full text.
Commercial Licensing
While RobotsTxtParser is available under the AGPL-3.0 for all free/open-source usage, a separate commercial license is required to incorporate this code into proprietary or closed-source products without adhering to AGPL’s copyleft obligations.
To purchase a commercial license, please contact:
Hossein Esmati
Email: desmati@gmail.com
The commercial license will be provided under mutually agreed terms, which supersede AGPL-3.0 for your proprietary usage.
Release Notes
Initial release of RobotsTxtParser v2025.9.4 under AGPL-3.0. Commercial licensing available via email.