FieldCure.DocumentParsers
2.0.1
Lightweight document text extraction for .NET — DOCX, HWPX, XLSX, PPTX, HTML, and PDF. Structured Markdown output for LLM / RAG consumption. Pure managed, no native binaries.
Features
- DOCX — Headings, paragraphs, tables (nested), math → LaTeX, metadata → YAML, footnotes, endnotes, comments, headers/footers
- HWPX — Korean standard (KS X 6101 / OWPML). Headings, paragraphs, tables, math → LaTeX, metadata → YAML, footnotes, endnotes, memos, headers/footers
- XLSX — Sheets as markdown tables, metadata → YAML
- PPTX — Slide text, tables, speaker notes, metadata → YAML
- HTML — Readable content extraction via SmartReader → GitHub-flavored Markdown via ReverseMarkdown
- PDF — Page-by-page text extraction via PdfPig with `## Page {n}` headers (auto-registered)
- Math equations — DOCX (`m:oMath`) and HWPX (`hp:equation`) converted to `[math: LaTeX]`
- Metadata — YAML front matter (`title`, `author`, `created`, `modified`, `subject`, `keywords`, `description`)
- Footnotes / Endnotes — `[^N]` / `[^enN]` inline references with definition sections
- Comments — Inline blockquote `> **[Comment — author]:**` format
- Factory pattern — `DocumentParserFactory.GetParser(".docx")` returns the right parser
- Zero platform dependency — Targets `net8.0` and `net10.0`, no native binaries, no Windows-specific APIs
- Extensible — Implement `IDocumentParser` and call `DocumentParserFactory.Register()`
Install
dotnet add package FieldCure.DocumentParsers
Quick Start
using FieldCure.DocumentParsers;

// Auto-detect parser by extension. PDF is registered out of the box.
var parser = DocumentParserFactory.GetParser(".pdf");
if (parser is not null)
{
    var bytes = File.ReadAllBytes("report.pdf");
    var text = parser.ExtractText(bytes);
    Console.WriteLine(text);
}

// Check all supported extensions
foreach (var ext in DocumentParserFactory.SupportedExtensions)
    Console.WriteLine(ext); // .docx, .hwpx, .xlsx, .pptx, .html, .htm, .pdf

// Opt-out control for metadata, footnotes, etc.
var docxParser = new DocxParser();
var options = new ExtractionOptions
{
    IncludeMetadata = false,
    IncludeFootnotes = false
};
// Named docxText so it does not clash with the `text` local declared in the block above.
var docxText = docxParser.ExtractText(File.ReadAllBytes("report.docx"), options);
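The Quick Start calls can be combined into a small batch converter. The sketch below uses only `DocumentParserFactory.GetParser` and `ExtractText` as shown above. The `docs` folder name, the `.md` output naming, and the `NormalizeExtension` helper are illustrative assumptions, as is the expectation that extensions are matched in lowercase with a leading dot and that `ExtractText` returns a string.

```csharp
using System;
using System.IO;
using FieldCure.DocumentParsers;

// Hypothetical input folder; pass a different one as the first argument.
var inputDir = args.Length > 0 ? args[0] : "docs";

// GetFiles takes a snapshot, so the .md files written below are not re-enumerated.
foreach (var path in Directory.GetFiles(inputDir))
{
    // Assumption: the factory expects ".pdf"-style keys, as in the Quick Start.
    var ext = NormalizeExtension(path);

    var parser = DocumentParserFactory.GetParser(ext);
    if (parser is null)
    {
        Console.WriteLine($"skip  {path} (no parser for '{ext}')");
        continue;
    }

    // Assumption: ExtractText returns the Markdown as a string.
    var markdown = parser.ExtractText(File.ReadAllBytes(path));

    // Append ".md" instead of replacing the extension so that
    // report.docx and report.pdf do not overwrite each other.
    File.WriteAllText(path + ".md", markdown);
    Console.WriteLine($"ok    {path}");
}

// Lowercased extension including the leading dot, e.g. "Report.PDF" -> ".pdf".
static string NormalizeExtension(string path) =>
    Path.GetExtension(path).ToLowerInvariant();
```

Files with no registered parser are skipped rather than treated as errors, which matches the null check in the Quick Start.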
Output Format
Headings are prefixed with # markers. Tables are rendered as markdown.
Documents with metadata include YAML front matter; footnotes/endnotes are rendered as reference-style links:
---
title: 2026 Business Plan
author: Alice
created: 2026-04-01
---
> **[Header]:** Company Confidential
# 2026 Business Plan
Please refer to the table below[^1] for details.
| Category | Q1 | Q2 |
| --- | --- | --- |
| Revenue | 100 | 150 |
| Cost | 80 | 90 |
> **[Footer]:** Page 1
## Footnotes
[^1]: Source: internal finance report.
Pipe characters inside cells are escaped as \| to preserve table structure.
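As an illustration of that escaping rule, here is a minimal sketch of how a table row could be built. It is not the library's internal code; `MarkdownTable`, `EscapeCell`, and `Row` are hypothetical names, and only the pipe character is handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Prints: | a\|b | c |
Console.WriteLine(MarkdownTable.Row(new[] { "a|b", "c" }));

// Hypothetical helper, not part of FieldCure.DocumentParsers.
static class MarkdownTable
{
    // A literal pipe would end the cell early, so it is written as \| instead.
    public static string EscapeCell(string value) => value.Replace("|", "\\|");

    // Joins the escaped cells into one Markdown table row.
    public static string Row(IEnumerable<string> cells) =>
        "| " + string.Join(" | ", cells.Select(EscapeCell)) + " |";
}
```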
Use ExtractionOptions to selectively disable metadata, footnotes, comments, or headers/footers.
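When metadata is left enabled, a consumer such as a RAG indexer may want the YAML front matter separated from the body. The following is a minimal sketch that assumes the output begins with a `---` line and that the closing `---` is followed by a newline, as in the sample above. `FrontMatter.Split` is a hypothetical helper, not part of the package.

```csharp
using System;

var sample = "---\ntitle: 2026 Business Plan\nauthor: Alice\n---\n# 2026 Business Plan\n";
var (meta, body) = FrontMatter.Split(sample);
Console.WriteLine(meta); // the two YAML lines
Console.WriteLine(body); // the Markdown body

// Hypothetical helper, not part of FieldCure.DocumentParsers.
static class FrontMatter
{
    // Returns (yaml, body). If no complete front matter block is found,
    // yaml is empty and body is the input unchanged.
    public static (string Yaml, string Body) Split(string markdown)
    {
        const string fence = "---";
        var text = markdown.Replace("\r\n", "\n");

        if (!text.StartsWith(fence + "\n", StringComparison.Ordinal))
            return ("", markdown);

        // Look for the closing fence on its own line after the opening one.
        var close = text.IndexOf("\n" + fence + "\n", fence.Length, StringComparison.Ordinal);
        if (close < 0)
            return ("", markdown);

        var yamlStart = fence.Length + 1;
        // An empty block ("---\n---\n") yields close < yamlStart.
        var yaml = close >= yamlStart ? text.Substring(yamlStart, close - yamlStart) : "";
        var rest = text.Substring(close + fence.Length + 2);
        return (yaml, rest);
    }
}
```

The YAML text is returned as-is; parsing it into keys and values is left to whichever YAML library the caller already uses.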
Supported Formats
| Format | Extension | Parser | Description |
|---|---|---|---|
| Word | .docx | DocxParser | OpenXML (Office 2007+) |
| Hangul | .hwpx | HwpxParser | OWPML (Hancom Office) |
| Excel | .xlsx | XlsxParser | OpenXML spreadsheets |
| PowerPoint | .pptx | PptxParser | OpenXML presentations |
| HTML | .html, .htm | HtmlParser | SmartReader + ReverseMarkdown |
| PDF | .pdf | PdfParser | PdfPig (pure managed, text only) |
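Because PDF output is emitted page by page under `## Page {n}` headers, downstream code can split it into per-page chunks, for example to keep page numbers alongside text in a RAG index. This is a minimal sketch assuming each header is a line of the exact form `## Page <number>`. `PageSplitter` is a hypothetical helper, not part of the package.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var sample = "## Page 1\nIntro text\n\n## Page 2\nMore text\n";
foreach (var (page, text) in PageSplitter.Split(sample))
    Console.WriteLine($"{page}: {text}");

// Hypothetical helper, not part of FieldCure.DocumentParsers.
static class PageSplitter
{
    // Assumption: page headers look like "## Page 12" on a line of their own.
    private static readonly Regex Header =
        new(@"^## Page (\d+)[ \t]*\r?$", RegexOptions.Multiline);

    // Returns one (page number, trimmed text) pair per header found.
    // Text before the first header is ignored.
    public static List<(int Page, string Text)> Split(string markdown)
    {
        var pages = new List<(int Page, string Text)>();
        var matches = Header.Matches(markdown);
        for (var i = 0; i < matches.Count; i++)
        {
            var start = matches[i].Index + matches[i].Length;
            var end = i + 1 < matches.Count ? matches[i + 1].Index : markdown.Length;
            var number = int.Parse(matches[i].Groups[1].Value);
            pages.Add((number, markdown.Substring(start, end - start).Trim()));
        }
        return pages;
    }
}
```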
Optional PDF Extensions
PDF text extraction is built in. Two sibling packages add extra PDF capabilities that require native binaries:
- Page image rendering — FieldCure.DocumentParsers.Imaging adds `PdfImageRenderer : IMediaDocumentParser` (PDFium via PDFtoImage).
- OCR fallback for scanned PDFs — FieldCure.DocumentParsers.Ocr adds `OcrPdfParser` + `TesseractOcrEngine` (English + Korean tessdata).
// Imaging (page → PNG)
using FieldCure.DocumentParsers.Imaging;
DocumentParserFactoryImagingExtensions.AddImagingSupport();
// OCR (scanned PDFs)
using FieldCure.DocumentParsers.Ocr;
using var ocr = DocumentParserFactoryOcrExtensions.AddOcrSupport();
Related Packages
- FieldCure.DocumentParsers.Imaging — PDF page image rendering (PDFium)
- FieldCure.DocumentParsers.Ocr — Tesseract OCR fallback for scanned PDFs
- FieldCure.AssistStudio.Core — AI provider client library that uses this package for document attachments
License
MIT — Copyright (c) 2026 FieldCure Co., Ltd.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies
- net10.0
  - DocumentFormat.OpenXml (>= 3.5.1)
  - PdfPig (>= 0.1.14)
  - ReverseMarkdown (>= 5.3.0)
  - SmartReader (>= 0.11.0)
- net8.0
  - DocumentFormat.OpenXml (>= 3.5.1)
  - PdfPig (>= 0.1.14)
  - ReverseMarkdown (>= 5.3.0)
  - SmartReader (>= 0.11.0)
NuGet packages (4)
Showing the top 4 NuGet packages that depend on FieldCure.DocumentParsers:
- FieldCure.Ai.Providers: AI provider clients for Claude, OpenAI, Gemini, Ollama, and Groq. Shared models and streaming support.
- FieldCure.DocumentParsers.Audio: Audio transcription parser for FieldCure.DocumentParsers. Converts MP3, WAV, M4A, OGG, FLAC, and WebM audio into timestamped Markdown transcripts via Whisper.net.
- FieldCure.DocumentParsers.Pdf: PDF text extraction and page image rendering for FieldCure.DocumentParsers.
- FieldCure.DocumentParsers.Imaging: PDF page image rendering for FieldCure.DocumentParsers via PDFtoImage (PDFium). Adds IMediaDocumentParser capability to the core PDF parser.
v2.0.1 — `ExtractionOptions` is no longer sealed so downstream parser packages (e.g. FieldCure.DocumentParsers.Audio) can subclass it. Adds `ExtractionOptions.SourceExtension` so callers can hint the source format and let parsers skip probing.