ManagedCode.MarkItDown
0.0.1
MarkItDown
A modern C#/.NET library for converting a wide range of document formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, etc.) into high-quality Markdown suitable for Large Language Models (LLMs), search indexing, and text analytics. The project mirrors the original Microsoft Python implementation while embracing .NET idioms, async APIs, and new integrations.
Table of Contents
- Features
- Format Support
- Quick Start
- Usage
- Architecture
- Development & Contributing
- Roadmap
- Performance
- Configuration
- License
- Acknowledgments
- Support
Features
- Modern .NET - Targets .NET 9.0 with up-to-date language features
- NuGet Package - Drop-in dependency for libraries and automation pipelines
- Async/Await - Fully asynchronous pipeline for responsive apps
- LLM-Optimized - Markdown tailored for AI ingestion and summarisation
- Extensible - Register custom converters or plug additional caption/transcription services
- Smart Detection - Automatic MIME, charset, and file-type guessing (including data/file URIs)
- High Performance - Stream-friendly, minimal allocations, zero temp files
Format Support
| Format | Extension | Status | Description |
|---|---|---|---|
| HTML | .html, .htm | Supported | Full HTML to Markdown conversion |
| Plain Text | .txt, .md | Supported | Direct text processing |
| PDF | .pdf | Supported | Adobe PDF documents with text extraction |
| Word | .docx | Supported | Microsoft Word documents with formatting |
| Excel | .xlsx | Supported | Microsoft Excel spreadsheets as tables |
| PowerPoint | .pptx | Supported | Microsoft PowerPoint presentations |
| Images | .jpg, .png, .gif, .bmp, .tiff, .webp | Supported | Exif metadata extraction + optional captions |
| Audio | .wav, .mp3, .m4a, .mp4 | Supported | Metadata extraction + optional transcription |
| CSV | .csv | Supported | Comma-separated values as Markdown tables |
| JSON | .json, .jsonl, .ndjson | Supported | Structured JSON data with formatting |
| XML | .xml, .xsd, .xsl, .rss, .atom | Supported | XML documents with structure preservation |
| EPUB | .epub | Supported | E-book files with metadata and content |
| ZIP | .zip | Supported | Archive processing with recursive file conversion |
| Jupyter Notebook | .ipynb | Supported | Python notebooks with code and markdown cells |
| RSS/Atom Feeds | .rss, .atom, .xml | Supported | Web feeds with structured content and metadata |
| YouTube URLs | YouTube links | Supported | Video metadata extraction and link formatting |
| Wikipedia Pages | wikipedia.org | Supported | Article-only extraction with clean Markdown |
| Bing SERPs | bing.com/search | Supported | Organic result summarisation |
HTML Conversion Features (AngleSharp powered)
- Headers (H1-H6) → Markdown headers
- Bold/Strong text → **bold**
- Italic/Emphasis text → *italic*
- Links → [text](url)
- Images → ![alt](src)
- Lists (ordered/unordered)
- Tables with header detection and Markdown table output
- Code blocks and inline code
- Blockquotes, sections, semantic containers
PDF Conversion Features
- Text extraction with page separation
- Header detection based on formatting
- List item recognition
- Title extraction from document content
Office Documents (DOCX/XLSX/PPTX)
- Word (.docx): Headers, paragraphs, tables, bold/italic formatting
- Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
- PowerPoint (.pptx): Slide-by-slide content with title recognition
CSV Conversion Features
- Automatic table formatting with headers
- Proper escaping of special characters
- Support for various CSV dialects
- Handles quoted fields and embedded commas
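The table formatting and escaping described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: `EscapeCell` and `ToMarkdownTable` are hypothetical helper names, and the input is assumed to be already parsed into rows of fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: neutralise characters that would break a Markdown table cell.
static string EscapeCell(string value) =>
    value.Replace("|", "\\|")        // a raw pipe would start a new column
         .Replace("\r\n", "<br>")    // table rows must stay on one line
         .Replace("\n", "<br>");

// Hypothetical helper: the first row is treated as the header row.
static string ToMarkdownTable(IReadOnlyList<string[]> rows)
{
    if (rows.Count == 0) return string.Empty;

    var lines = new List<string>
    {
        "| " + string.Join(" | ", rows[0].Select(EscapeCell)) + " |",
        "| " + string.Join(" | ", rows[0].Select(_ => "---")) + " |",
    };
    foreach (var row in rows.Skip(1))
    {
        lines.Add("| " + string.Join(" | ", row.Select(EscapeCell)) + " |");
    }
    return string.Join("\n", lines);
}

var table = ToMarkdownTable(new[]
{
    new[] { "Name", "Note" },
    new[] { "Widget", "a|b" },
});
Console.WriteLine(table);
```

The pipe inside `a|b` is emitted as `a\|b`, so the cell stays in one column when the Markdown is rendered.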
JSON Conversion Features
- Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
- JSON Lines Support: Processes .jsonl and .ndjson files line by line
- Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
- Nested Objects: Handles complex nested structures with proper indentation
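As a rough illustration of line-by-line JSON Lines handling, the sketch below parses each non-empty line as its own JSON document with `System.Text.Json`. `DescribeJsonLines` is a hypothetical helper and the output format is an assumption made for the example, not what the built-in converter emits.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical helper: one summary string per JSON line, blank lines skipped.
static List<string> DescribeJsonLines(TextReader reader)
{
    var summaries = new List<string>();
    while (reader.ReadLine() is { } line)
    {
        if (string.IsNullOrWhiteSpace(line)) continue;

        // Each line is an independent JSON document.
        using var document = JsonDocument.Parse(line);
        var root = document.RootElement;

        // ValueKind reports the JSON type (Object, Array, String, Number, ...).
        summaries.Add($"{root.ValueKind}: {root.GetRawText()}");
    }
    return summaries;
}

var summaries = DescribeJsonLines(new StringReader("{\"id\":1}\n\n[true,null]\n"));
foreach (var summary in summaries)
{
    Console.WriteLine(summary);
}
```

Reading through a `TextReader` keeps memory use proportional to the longest line rather than to the whole file.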
XML Conversion Features
- Structure Preservation: Maintains XML hierarchy as Markdown headings
- Attributes Handling: Converts XML attributes to Markdown lists
- Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
- CDATA Support: Properly handles CDATA sections as code blocks
EPUB Conversion Features
- Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
- Content Order: Processes content files in proper reading order using spine information
- HTML Processing: Converts XHTML content using the HTML converter
- Table of Contents: Maintains document structure from the original EPUB
ZIP Archive Features
- Recursive Processing: Extracts and converts all supported files within archives
- Structure Preservation: Maintains original file paths and organization
- Multi-Format Support: Processes different file types within the same archive
- Error Handling: Continues processing even if individual files fail
- Size Limits: Protects against memory issues with large files
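The error-handling and size-limit behaviour listed above could look something like the following sketch, built on `System.IO.Compression.ZipArchive`. `ConvertArchive`, its `maxEntryBytes` parameter, and the `convertEntry` callback are hypothetical names introduced for illustration; the actual `ZipConverter` may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical sketch: walk an archive, skip oversized entries, and keep going when one entry fails.
static List<string> ConvertArchive(Stream zipStream, long maxEntryBytes, Func<string, Stream, string> convertEntry)
{
    var sections = new List<string>();
    using var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true);
    foreach (var entry in archive.Entries)
    {
        if (entry.FullName.EndsWith('/')) continue; // directory entry, nothing to convert

        if (entry.Length > maxEntryBytes)
        {
            sections.Add($"## {entry.FullName}\n\n_Skipped: exceeds size limit._");
            continue;
        }

        try
        {
            using var entryStream = entry.Open();
            sections.Add($"## {entry.FullName}\n\n{convertEntry(entry.FullName, entryStream)}");
        }
        catch (Exception ex)
        {
            // One bad file should not abort the whole archive.
            sections.Add($"## {entry.FullName}\n\n_Failed: {ex.Message}_");
        }
    }
    return sections;
}

// Build a small archive in memory so the example is self-contained.
var buffer = new MemoryStream();
using (var zipWriter = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
{
    foreach (var (path, text) in new[] { ("docs/a.txt", "hello"), ("big.txt", new string('x', 100)) })
    {
        using var entryWriter = new StreamWriter(zipWriter.CreateEntry(path).Open());
        entryWriter.Write(text);
    }
}
buffer.Position = 0;

// Entries above 50 bytes are skipped; everything else is read as plain text.
var sections = ConvertArchive(buffer, 50, (entryName, content) => new StreamReader(content).ReadToEnd());
Console.WriteLine(string.Join("\n\n", sections));
```

Passing the per-entry conversion as a callback keeps the archive walk independent of any particular format; a callback that itself calls `ConvertArchive` for nested `.zip` entries would give the recursive behaviour.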
Jupyter Notebook Conversion Features
- Cell Type Support: Processes markdown, code, and raw cells appropriately
- Metadata Extraction: Extracts notebook title, kernel information, and language details
- Code Output Handling: Captures and formats execution results, streams, and errors
- Syntax Highlighting: Preserves language information for proper code block formatting
RSS/Atom Feed Conversion Features
- Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
- Feed Metadata: Extracts title, description, last update date, and author information
- Article Processing: Converts feed items with proper title linking and content formatting
- Date Formatting: Normalizes publication dates across different feed formats
YouTube URL Conversion Features
- URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
- Metadata Extraction: Extracts video ID and URL parameters with descriptions
- Embed Integration: Provides thumbnail images and multiple access methods
- Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)
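A minimal sketch of the URL recognition step, assuming only the two URL shapes named above (`youtube.com/watch?v=...` and `youtu.be/...`). `TryGetVideoId` is a hypothetical helper and the video ID in the demo is a made-up placeholder.

```csharp
using System;
using System.Web;

// Hypothetical helper: pull the video ID out of standard and shortened YouTube URLs.
static bool TryGetVideoId(string url, out string videoId)
{
    videoId = string.Empty;
    if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return false;

    var host = uri.Host.ToLowerInvariant();
    if (host == "youtu.be")
    {
        // Short links carry the ID as the path: https://youtu.be/<id>
        videoId = uri.AbsolutePath.Trim('/');
    }
    else if (host == "youtube.com" || host.EndsWith(".youtube.com", StringComparison.Ordinal))
    {
        // Standard links carry the ID in the "v" query parameter.
        videoId = HttpUtility.ParseQueryString(uri.Query)["v"] ?? string.Empty;
    }
    return videoId.Length > 0;
}

var foundStandard = TryGetVideoId("https://www.youtube.com/watch?v=abc123def45&t=42s", out var standardId);
var foundShort = TryGetVideoId("https://youtu.be/abc123def45?t=42", out var shortId);
Console.WriteLine($"{foundStandard} {standardId}");
Console.WriteLine($"{foundShort} {shortId}");
```

Other parameters such as `t` or `list` can be read from the same `ParseQueryString` result.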
Image Conversion Features
- Support for JPEG, PNG, GIF, BMP, TIFF, WebP
- Exif metadata extraction via exiftool (optional)
- Optional multimodal image captioning hook (LLM integration ready)
- Graceful fallback when metadata/captioning unavailable
Audio Conversion Features
- Handles WAV/MP3/M4A/MP4 containers
- Extracts key metadata (artist, album, duration, channels, etc.)
- Optional transcription delegate for speech-to-text results
- Markdown summary highlighting metadata and transcript
Quick Start
Installation
Install via NuGet Package Manager:
# Package Manager Console
Install-Package ManagedCode.MarkItDown
# .NET CLI
dotnet add package ManagedCode.MarkItDown
# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.1" />
Prerequisites
- .NET 9.0 SDK or later
- Compatible with .NET 9 apps and libraries
Optional Dependencies for Advanced Features
- PDF Support: Provided via PdfPig (bundled)
- Office Documents: Provided via DocumentFormat.OpenXml (bundled)
- Image metadata: Install ExifTool for richer output (brew install exiftool, choco install exiftool)
- Image captions: Supply an ImageCaptioner delegate (e.g., calls to an LLM or vision service)
- Audio transcription: Supply an AudioTranscriber delegate (e.g., Azure Cognitive Services, OpenAI Whisper)
Note: External tools are optional; MarkItDown degrades gracefully when they are absent.
Usage
Basic API Usage
using MarkItDown.Core;
// Simple conversion
var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync("document.html");
Console.WriteLine(result.Markdown);
Advanced Usage with Logging
using MarkItDown.Core;
using Microsoft.Extensions.Logging;
// With logging and HTTP client for web content
using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<Program>();
using var httpClient = new HttpClient();
var markItDown = new MarkItDown(logger, httpClient);
// Convert from file
var fileResult = await markItDown.ConvertAsync("document.html");
// Convert from URL
var urlResult = await markItDown.ConvertFromUrlAsync("https://example.com");
// Convert from URI (file:, data:, http:, https:)
var dataResult = await markItDown.ConvertUriAsync("data:text/html;base64,PGgxPkhlbGxvPC9oMT4=");
// Convert from stream with optional overrides
using var stream = File.OpenRead("document.html");
var streamInfo = new StreamInfo(mimeType: "text/html", extension: ".html");
var streamResult = await markItDown.ConvertAsync(stream, streamInfo);
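The `data:` URI passed to `ConvertUriAsync` above is simply the base64 encoding of `<h1>Hello</h1>`. A small sketch of how such a URI could be built from an in-memory HTML string; `ToHtmlDataUri` is a hypothetical helper and is not part of MarkItDown.

```csharp
using System;
using System.Text;

// Hypothetical helper: wrap an HTML string in a base64 data: URI.
static string ToHtmlDataUri(string html) =>
    "data:text/html;base64," + Convert.ToBase64String(Encoding.UTF8.GetBytes(html));

var dataUri = ToHtmlDataUri("<h1>Hello</h1>");
Console.WriteLine(dataUri); // prints data:text/html;base64,PGgxPkhlbGxvPC9oMT4=
```

This can be handy in tests, where a snippet can be converted without touching the file system or the network.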
Custom Converters
Create your own format converters by implementing IDocumentConverter:
using MarkItDown.Core;
public class MyCustomConverter : IDocumentConverter
{
    public bool Accepts(Stream stream, StreamInfo streamInfo, CancellationToken cancellationToken = default)
    {
        return streamInfo.Extension == ".mycustomformat";
    }

    public async Task<DocumentConverterResult> ConvertAsync(
        Stream stream,
        StreamInfo streamInfo,
        CancellationToken cancellationToken = default)
    {
        // Your conversion logic here
        var markdown = "# Converted from custom format\n\nContent here...";
        return new DocumentConverterResult(markdown, "Document Title");
    }
}
// Register the custom converter
var markItDown = new MarkItDown();
markItDown.RegisterConverter(new MyCustomConverter(), ConverterPriority.SpecificFileFormat);
Architecture
Core Components
- MarkItDown - Main entry point for conversions
- IDocumentConverter - Interface for format-specific converters
- DocumentConverterResult - Contains the converted Markdown and optional metadata
- StreamInfo - Metadata about the input stream (MIME type, extension, charset, etc.)
- ConverterRegistration - Associates converters with priority for selection
Built-in Converters
- PlainTextConverter - Handles text, JSON, NDJSON, Markdown, etc.
- HtmlConverter - Converts HTML to Markdown using AngleSharp
- PdfConverter - PdfPig-based extraction with Markdown heuristics
- Docx/Xlsx/Pptx Converters - Office Open XML processing
- ImageConverter - Exif metadata + optional captions
- AudioConverter - Metadata + optional transcription
- WikipediaConverter - Article-only extraction from Wikipedia
- BingSerpConverter - Summaries for Bing search result pages
- YouTubeUrlConverter - Video metadata markdown
- ZipConverter - Recursive archive handling
- RssFeedConverter, JsonConverter, CsvConverter, XmlConverter, JupyterNotebookConverter, EpubConverter
Converter Priority & Detection
- Priority-based dispatch (lower values processed first)
- Automatic stream sniffing via StreamInfoGuesser
- Manual overrides via MarkItDownOptions or StreamInfo
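To make "lower values processed first" concrete, here is a self-contained sketch of priority-based dispatch. It does not use the library's `ConverterRegistration` type; the tuple shape, the `SelectConverter` helper, and the priority numbers 0 and 10 are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical dispatch: try registrations in ascending priority order and pick the first that accepts.
static string SelectConverter(
    IEnumerable<(int Priority, string Name, Func<string, bool> Accepts)> registrations,
    string extension)
{
    // OrderBy is a stable sort, so equal priorities keep their registration order.
    foreach (var candidate in registrations.OrderBy(entry => entry.Priority))
    {
        if (candidate.Accepts(extension))
        {
            return candidate.Name;
        }
    }
    return "(no converter)";
}

var registrations = new List<(int Priority, string Name, Func<string, bool> Accepts)>
{
    (10, "PlainText", _ => true),                  // generic fallback, tried last
    (0, "Html", ext => ext is ".html" or ".htm"),  // specific format, tried first
};

var htmlChoice = SelectConverter(registrations, ".html");
var textChoice = SelectConverter(registrations, ".txt");
Console.WriteLine(htmlChoice); // Html
Console.WriteLine(textChoice); // PlainText
```

Giving format-specific converters a lower number than catch-all converters means the generic fallback only runs when nothing more specific accepts the input.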
Development & Contributing
Building from Source
# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown
# Build the solution
dotnet build
# Run tests
dotnet test
# Create NuGet package
dotnet pack --configuration Release
Project Structure
├── src/
│   └── MarkItDown.Core/            # Core library
│       ├── Converters/             # Format-specific converters (HTML, PDF, audio, etc.)
│       ├── MarkItDown.cs           # Main conversion engine
│       ├── StreamInfoGuesser.cs    # MIME/charset/extension detection helpers
│       ├── MarkItDownOptions.cs    # Runtime configuration flags
│       └── ...                     # Shared utilities (UriUtilities, MimeMapping, etc.)
├── tests/
│   └── MarkItDown.Tests/           # xUnit + Shouldly tests, Python parity vectors (WIP)
├── Directory.Build.props           # Shared build + packaging settings
└── README.md                       # This document
Contributing Guidelines
- Fork the repository.
- Create a feature branch (git checkout -b feature/my-feature).
- Add tests with xUnit/Shouldly mirroring relevant Python vectors.
- Run dotnet test (CI enforces green builds + coverage upload).
- Update docs or samples if behaviour changes.
- Submit a pull request for review.
Roadmap
Near-Term
- Azure Document Intelligence converter (options already scaffolded)
- Outlook .msg ingestion via MIT-friendly dependencies
- Expanded CLI commands (batch mode, globbing, JSON output)
- Richer regression suite mirroring Python test vectors
Future Ideas
- Plugin discovery & sandboxing
- Built-in LLM caption/transcription providers
- Incremental/streaming conversion APIs
- Cloud-native samples (Functions, Containers, Logic Apps)
Performance
MarkItDown is designed for high performance with:
- Stream-based processing: Avoids writing temporary files by default
- Async/await everywhere: Non-blocking I/O with cancellation support
- Minimal allocations: Smart buffer reuse and pay-for-play converters
- Fast detection: Lightweight sniffing before converter dispatch
- Extensible hooks: Offload captions/transcripts to background workers
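"Lightweight sniffing" generally means peeking at the first few bytes of a stream and rewinding before any converter runs. The sketch below shows that idea for three well-known file signatures. `SniffMimeType` and `HasPrefix` are hypothetical helpers, not the library's `StreamInfoGuesser`, and the stream is assumed to be seekable.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: does the buffer start with the given signature?
static bool HasPrefix(byte[] data, int length, byte[] prefix)
{
    if (length < prefix.Length) return false;
    for (var i = 0; i < prefix.Length; i++)
    {
        if (data[i] != prefix[i]) return false;
    }
    return true;
}

// Hypothetical sniffing step: read a few header bytes, then rewind so the converter sees the full stream.
static string SniffMimeType(Stream stream)
{
    var start = stream.Position;
    var header = new byte[8];
    var read = stream.Read(header, 0, header.Length); // a single Read may return fewer bytes than requested
    stream.Position = start;

    if (HasPrefix(header, read, new byte[] { 0x25, 0x50, 0x44, 0x46, 0x2D })) return "application/pdf"; // "%PDF-"
    if (HasPrefix(header, read, new byte[] { 0x50, 0x4B, 0x03, 0x04 })) return "application/zip";       // "PK", also used by DOCX/XLSX/PPTX/EPUB containers
    if (HasPrefix(header, read, new byte[] { 0x89, 0x50, 0x4E, 0x47 })) return "image/png";
    return "application/octet-stream";
}

var pdfStream = new MemoryStream(Encoding.ASCII.GetBytes("%PDF-1.7\n"));
var pdfGuess = SniffMimeType(pdfStream);
var unknownGuess = SniffMimeType(new MemoryStream(Encoding.ASCII.GetBytes("hello")));
Console.WriteLine(pdfGuess);     // application/pdf
Console.WriteLine(unknownGuess); // application/octet-stream
```

Because only eight bytes are read and the position is restored, the check costs little and leaves the stream untouched for whichever converter is dispatched next.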
Configuration
var options = new MarkItDownOptions
{
    EnableBuiltins = true,
    EnablePlugins = false,
    ExifToolPath = "/usr/local/bin/exiftool",
    ImageCaptioner = async (bytes, info, token) =>
    {
        // Call your preferred vision or LLM service here
        return await Task.FromResult("A scenic mountain landscape at sunset.");
    },
    AudioTranscriber = async (bytes, info, token) =>
    {
        // Route to speech-to-text provider
        return await Task.FromResult("Welcome to the MarkItDown demo.");
    }
};
var markItDown = new MarkItDown(options);
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
This project is a C# conversion of the original Microsoft MarkItDown Python library. The original project was created by the Microsoft AutoGen team.
Support
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Create an issue for support
<div align="center">
⭐ Star this repository if you find it useful!
Made with ❤️ by ManagedCode
</div>
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies
- net9.0
  - AngleSharp (>= 1.0.0)
  - DocumentFormat.OpenXml (>= 3.1.0)
  - Microsoft.Extensions.DependencyInjection.Abstractions (>= 8.0.1)
  - Microsoft.Extensions.Logging.Abstractions (>= 8.0.1)
  - PdfPig (>= 0.1.9)
  - SkiaSharp (>= 2.88.8)
  - System.Text.Encoding.CodePages (>= 8.0.0)
  - System.Text.Json (>= 8.0.5)