RobotsTxtParser
A simple .NET 9 library for parsing robots.txt files.
Supports:
- Reading and parsing User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives
- Checking whether a given URL path is allowed for a particular user-agent, with three different “Allow vs Disallow” strategies
- Retrieving crawl-delays (as TimeSpan) for specified user-agents, with optional fallback overloads
- Collecting all Sitemap URLs declared in a robots.txt
Apart from the switchable AllowRuleImplementation strategy, the parser is immutable after construction and exposes a clean, intuitive API.
Installation
Via .NET CLI
dotnet add package RobotsTxtParser --version 2025.9.4
Via Package Manager (Visual Studio)
Install-Package RobotsTxtParser -Version 2025.9.4
Direct Reference
If you prefer to reference the local project, copy the RobotsTxtParser folder into your solution and add the project as a reference. The library targets net9.0 with C# 11 and has <Nullable>enable</Nullable> by default.
Quick Start
using RobotsTxtParser;
// Suppose you have robots.txt content as a string:
string robotsTxtContent = @"
User-agent: Googlebot
Disallow: /private
Allow: /public
User-agent: *
Disallow: /tmp
Crawl-delay: 1.5
Sitemap: https://example.com/sitemap.xml
";
// Parse it:
var robots = Robots.Load(robotsTxtContent);
// Check if "/public/page.html" is allowed for "Googlebot":
bool canGooglebotAccess = robots.IsPathAllowed("Googlebot", "/public/page.html");
// Check crawl-delay for a generic crawler:
TimeSpan defaultDelay = robots.CrawlDelay("SomeOtherBot");
// Retrieve all sitemap URLs:
foreach (var site in robots.Sitemaps)
{
if (site.Url != null)
Console.WriteLine($"Valid sitemap URL: {site.Url}");
else
Console.WriteLine($"Malformed sitemap entry: {site.Value}");
}
For more use cases, review the unit tests in the RobotsTxtParser.Tests project.
API Reference
Robots class
namespace RobotsTxtParser
{
public class Robots : IRobotsParser
{
// Properties
public string Raw { get; } // Original robots.txt content
public List<Sitemap> Sitemaps { get; private set; }
public bool Malformed { get; private set; } // True if any line was malformed
public bool HasRules { get; private set; } // True if ≥1 Access or Crawl-delay rule parsed
public bool IsAnyPathDisallowed { get; private set; }
public AllowRuleImplementation AllowRuleImplementation { get; set; }
// Static Factory
public static Robots Load(string content);
// IRobotsParser Implementation
public bool IsPathAllowed(string userAgent, string path);
public TimeSpan CrawlDelay(string userAgent);
public TimeSpan CrawlDelay(string userAgent, int fallbackAmount);
public TimeSpan CrawlDelay(string userAgent, TimeSpan fallbackAmount);
}
}
Load(string content)
Parses the entire robots.txt content and returns a Robots instance. If content is null or whitespace, no rules are parsed and HasRules == false.

IsPathAllowed(string userAgent, string path) : bool
Returns true if the given path is allowed for the specified userAgent, after normalizing path. Throws ArgumentException if userAgent is null/empty/whitespace. If there are no rules, or if no Disallow rules exist, always returns true. The logic respects the chosen AllowRuleImplementation.

CrawlDelay(string userAgent) : TimeSpan
Returns the crawl-delay (in milliseconds, as a TimeSpan) for the userAgent. Throws ArgumentException if userAgent is null/empty/whitespace. If no crawl-delay rule matches, returns TimeSpan.Zero. Specific rules are checked first; if none match, the global (*) rule is used.

CrawlDelay(string userAgent, int fallbackAmount) : TimeSpan
Same as CrawlDelay(string), but if no matching rule (specific or global) is found, returns TimeSpan.FromMilliseconds(fallbackAmount) instead of zero.

CrawlDelay(string userAgent, TimeSpan fallbackAmount) : TimeSpan
Same as above, but the fallback is a TimeSpan directly. If no rule is found, returns fallbackAmount.

Raw
The unmodified string passed into Load(...).

Sitemaps
A list of Sitemap objects representing each Sitemap: directive.

Malformed
true if at least one line was out of expected context (e.g. Disallow before any User-agent, or an unrecognized directive). Parsed valid rules still apply.

HasRules
true if at least one Allow/Disallow or Crawl-delay directive was successfully recorded under some User-agent.

IsAnyPathDisallowed
true if there is at least one Disallow with a non-empty path (meaning not “Disallow: ”).

AllowRuleImplementation
Determines how to resolve conflicts when multiple Allow/Disallow rules match a path. Default is MoreSpecific.
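As a quick illustration, the following minimal sketch (assuming the members listed above) inspects the diagnostic properties after parsing:

using System;
using RobotsTxtParser;

string content = @"
User-agent: *
Disallow: /admin
Crawl-delay: 2
";

var robots = Robots.Load(content);

Console.WriteLine(robots.HasRules);            // True  - a Disallow and a Crawl-delay rule were recorded
Console.WriteLine(robots.IsAnyPathDisallowed); // True  - "/admin" is a non-empty Disallow path
Console.WriteLine(robots.Malformed);           // False - every line was understood
Console.WriteLine(robots.Raw == content);      // True  - Raw holds the original text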
IRobotsParser interface
namespace RobotsTxtParser
{
public interface IRobotsParser
{
bool IsPathAllowed(string userAgent, string path);
TimeSpan CrawlDelay(string userAgent);
TimeSpan CrawlDelay(string userAgent, int fallbackAmount);
TimeSpan CrawlDelay(string userAgent, TimeSpan fallbackAmount);
}
}
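Because Robots implements IRobotsParser, crawler code can depend on the interface rather than the concrete class, which makes it easy to substitute a fake parser in tests. A minimal sketch follows; the ShouldFetch helper is illustrative and not part of the library:

using System;
using RobotsTxtParser;

// Hypothetical helper that only depends on the interface.
static bool ShouldFetch(IRobotsParser robots, string userAgent, string path, out TimeSpan delay)
{
    delay = robots.CrawlDelay(userAgent, fallbackAmount: 1000); // assume 1 s when no rule matches
    return robots.IsPathAllowed(userAgent, path);
}

IRobotsParser parser = Robots.Load("User-agent: *\nDisallow: /private\nCrawl-delay: 2");
bool fetch = ShouldFetch(parser, "MyCrawler", "/docs/index.html", out TimeSpan wait);
Console.WriteLine($"{fetch}, wait {wait.TotalMilliseconds} ms"); // True, wait 2000 ms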
AllowRuleImplementation enum
namespace RobotsTxtParser
{
public enum AllowRuleImplementation
{
Standard, // Pick the matched rule with lowest “order” (first-seen)
AllowOverrides, // If any matching rule is Allow, path is allowed
MoreSpecific // Pick the rule with longest Path, then by order
}
}
Sitemap class
namespace RobotsTxtParser
{
public class Sitemap
{
public string Value { get; } // Raw text after “Sitemap:” (never null)
public Uri? Url { get; } // Parsed absolute Uri, or null if invalid
internal static Sitemap FromLine(Line line);
}
}
Use robots.Sitemaps after calling Robots.Load(...); each item has:
- Value – the exact substring from robots.txt after “Sitemap:”
- Url – a Uri if Value is a well-formed absolute URL; otherwise null.
Usage Examples
Basic “Allow/Disallow” check
string robotsTxt = @"
User-agent: *
Disallow: /private
Allow: /public
";
var robots = Robots.Load(robotsTxt);
// Default is MoreSpecific
Console.WriteLine(robots.IsPathAllowed("anybot", "/public/index.html")); // True
Console.WriteLine(robots.IsPathAllowed("anybot", "/private/data.txt")); // False
Switching “Allow” rule strategy
string robotsTxt = @"
User-agent: *
Disallow: /foo
Allow: /foo
";
var r = Robots.Load(robotsTxt);
// Standard: pick first-seen → Disallow
r.AllowRuleImplementation = AllowRuleImplementation.Standard;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // False
// AllowOverrides: any Allow wins → allowed
r.AllowRuleImplementation = AllowRuleImplementation.AllowOverrides;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // True
// MoreSpecific: tie-break by order (since both are "/foo")
r.AllowRuleImplementation = AllowRuleImplementation.MoreSpecific;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // False (Disallow first)
Crawl-delay retrieval
string robotsTxt = @"
User-agent: MyBot
Crawl-delay: 4.25
User-agent: *
Crawl-delay: 2
";
var robots = Robots.Load(robotsTxt);
// “MyBot” → 4250 ms
TimeSpan myDelay = robots.CrawlDelay("MyBot");
Console.WriteLine(myDelay.TotalMilliseconds); // 4250
// Other bots → 2000 ms
TimeSpan otherDelay = robots.CrawlDelay("OtherBot");
Console.WriteLine(otherDelay.TotalMilliseconds); // 2000
// If no matching rule (and no global "*"), returns TimeSpan.Zero
var empty = Robots.Load(@"User-agent: BotOnly");
Console.WriteLine(empty.CrawlDelay("BotOnly") == TimeSpan.Zero); // True
Crawl-delay with fallback overloads
string robotsTxt = @"
User-agent: BotA
Crawl-delay: 3
User-agent: *
Crawl-delay: 1
";
var robots = Robots.Load(robotsTxt);
// Specific rule exists (3s):
TimeSpan result1 = robots.CrawlDelay("BotA", 10000);
Console.WriteLine(result1.TotalMilliseconds); // 3000
// No specific for "OtherBot" → global (1s):
TimeSpan result2 = robots.CrawlDelay("OtherBot", 5000);
Console.WriteLine(result2.TotalMilliseconds); // 1000
// If no global either, returns fallback:
var limited = Robots.Load(@"User-agent: BotX
Crawl-delay: 2.5");
TimeSpan fallbackTs = TimeSpan.FromMilliseconds(750);
TimeSpan result3 = limited.CrawlDelay("NoMatch", fallbackTs);
Console.WriteLine(result3.TotalMilliseconds); // 750
Extracting all Sitemap URLs
string robotsTxt = @"
User-agent: *
Sitemap: https://example.com/sitemap1.xml
Sitemap: not_a_real_url
Sitemap: https://cdn.example.com/other-sitemap.xml
";
var robots = Robots.Load(robotsTxt);
foreach (var site in robots.Sitemaps)
{
Console.WriteLine($"Raw value: '{site.Value}'");
if (site.Url != null)
Console.WriteLine($" Parsed URI: {site.Url}");
else
Console.WriteLine(" (Invalid URI)");
}
// Output:
// Raw value: 'https://example.com/sitemap1.xml'
// Parsed URI: https://example.com/sitemap1.xml
// Raw value: 'not_a_real_url'
// (Invalid URI)
// Raw value: 'https://cdn.example.com/other-sitemap.xml'
// Parsed URI: https://cdn.example.com/other-sitemap.xml
Handling malformed lines
string content = @"
Disallow: /private # no preceding User-agent → malformed
User-agent: *
Allow: /public
FooBar: /ignored # unknown field → malformed
";
var robots = Robots.Load(content);
Console.WriteLine($"Malformed? {robots.Malformed}"); // True
Console.WriteLine($"HasRules? {robots.HasRules}"); // True (because “Allow” under valid UA)
Console.WriteLine(robots.IsPathAllowed("any", "/private")); // True (early Disallow ignored)
Console.WriteLine(robots.IsPathAllowed("any", "/public")); // True
Notes & Caveats
Normalization of path
IsPathAllowed(...) calls NormalizePath(path):
- Converts null or whitespace to "/".
- Ensures a leading /.
- Collapses repeated // into a single /.
Matching logic strips the leading / before comparing to rule paths (see the sketch below).
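For example, under these normalization rules the following calls should all resolve to the same path before matching (a sketch, assuming the behavior described above):

using System;
using RobotsTxtParser;

var robots = Robots.Load("User-agent: *\nDisallow: /private");

Console.WriteLine(robots.IsPathAllowed("bot", "/private/file.txt"));   // False
Console.WriteLine(robots.IsPathAllowed("bot", "//private//file.txt")); // False - repeated "/" collapsed into one
Console.WriteLine(robots.IsPathAllowed("bot", "private/file.txt"));    // False - leading "/" added
Console.WriteLine(robots.IsPathAllowed("bot", " "));                   // True  - whitespace becomes "/", which is not disallowed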
Wildcard & $ support
- * in a rule path matches any sequence of characters (see the sketch below).
- A trailing $ means “end-of-string” match.
- Internally, IsPathMatch(pathWithoutSlash, rulePathWithoutSlash) implements these recursively.
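A short sketch of those pattern semantics, assuming * and a trailing $ are honored in Disallow paths as described:

using System;
using RobotsTxtParser;

string robotsTxt = @"
User-agent: *
Disallow: /private*.html$
";

var robots = Robots.Load(robotsTxt);

Console.WriteLine(robots.IsPathAllowed("bot", "/private-page.html"));     // False - "*" covers "-page", "$" matches the end
Console.WriteLine(robots.IsPathAllowed("bot", "/private-page.html.bak")); // True  - extra characters after ".html" break the "$" anchor
Console.WriteLine(robots.IsPathAllowed("bot", "/public/page.html"));      // True  - the "/private" prefix is required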
Case-insensitive matching
- Directive names (User-agent, Allow, Disallow, etc.) are matched case-insensitively.
- User-agent value matching (in AccessRule and CrawlDelayRule) is also a case-insensitive substring match (see the sketch below).
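In practice this means lowercase directives and differently cased user-agents still match (a sketch, assuming the substring matching described above):

using System;
using RobotsTxtParser;

string robotsTxt = @"
user-agent: GoogleBot
disallow: /secret
";

var robots = Robots.Load(robotsTxt);

// Lowercase directive names are recognized, and "GoogleBot" is found as a
// case-insensitive substring of the crawler's full user-agent string.
Console.WriteLine(robots.IsPathAllowed("googlebot/2.1", "/secret")); // False
Console.WriteLine(robots.IsPathAllowed("SomeOtherBot", "/secret"));  // True - no matching group and no global "*" group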
Malformed lines
- A line is marked malformed if it appears out of context (e.g. Disallow before any User-agent) or if the field name is unrecognized.
- Malformed lines set robots.Malformed = true, but valid rules under a valid User-agent still apply.
Global (*) rules
- A rule’s UserAgent of "*" is stored in global lists.
- If no specific rule matches a given user-agent, the parser falls back to the global rule.
- If multiple global rules exist, the first one (lowest Order) is used unless MoreSpecific is in effect (see the sketch below).
License
RobotsTxtParser is licensed under the GNU Affero General Public License, version 3 (AGPL-3.0-or-later). See LICENSE for full text.
Commercial Licensing
While RobotsTxtParser is available under the AGPL-3.0 for all free/open-source usage, a separate commercial license is required to incorporate this code into proprietary or closed-source products without adhering to AGPL’s copyleft obligations.
To purchase a commercial license, please contact:
Hossein Esmati
Email: desmati@gmail.com
The commercial license will be provided under mutually agreed terms, which supersede AGPL-3.0 for your proprietary usage.
Release Notes
Initial release of RobotsTxtParser v2025.9.4 under AGPL-3.0. Commercial licensing available via email.