ManagedCode.MarkItDown
0.0.5
MarkItDown
Transform any document into LLM-ready Markdown with this powerful C#/.NET library!
MarkItDown is a comprehensive document conversion library that transforms diverse file formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, and more) into clean, high-quality Markdown. Perfect for AI workflows, RAG (Retrieval-Augmented Generation) systems, content processing pipelines, and text analytics applications.
Why MarkItDown for .NET?
- Built for modern C# developers - Native .NET 9 library with async/await throughout
- LLM-optimized output - Clean Markdown that AI models love to consume
- Zero-friction NuGet package - Just run dotnet add package ManagedCode.MarkItDown and go
- Disk-first stream processing - Handle large documents efficiently using managed workspaces instead of MemoryStream
- Highly extensible - Add custom converters or integrate with AI services for captions/transcription
This is a high-fidelity C# port of Microsoft's original MarkItDown Python library, reimagined for the .NET ecosystem with modern async patterns, improved performance, and enterprise-ready features.
Why Choose MarkItDown?
For AI & LLM Applications
- Perfect for RAG systems - Convert documents to searchable, contextual Markdown chunks
- Token-efficient - Clean output maximizes your LLM token budget
- Structured data preservation - Tables, headers, and lists maintain semantic meaning
- Metadata extraction - Rich document properties for enhanced context
For .NET Developers
- Native performance - Built from the ground up for .NET, not a wrapper
- Modern async/await - Non-blocking I/O with full cancellation support
- Memory efficient - Stream-based processing avoids loading entire files into memory
- Enterprise ready - Proper error handling, logging, and configuration options
For Content Processing
- 22+ file formats supported - From Office documents to web pages to archives
- Batch processing ready - Handle hundreds of documents efficiently
- Extensible architecture - Add custom converters for proprietary formats
- Smart format detection - Automatic MIME type and encoding detection
Table of Contents
- Features
- Format Support
- Extended Format Support
- Quick Start
- Usage
- Architecture
- Development & Contributing
- Roadmap
- Performance
- Configuration
- License
- Acknowledgments
- Support
Features
- Modern .NET - Targets .NET 9.0 with up-to-date language features
- NuGet Package - Drop-in dependency for libraries and automation pipelines
- Async/Await - Fully asynchronous pipeline for responsive apps
- LLM-Optimized - Markdown tailored for AI ingestion and summarisation
- Extensible - Register custom converters or plug in additional caption/transcription services
- Conversion middleware - Compose post-processing steps with IConversionMiddleware (AI enrichment ready)
- Raw artifacts API - Inspect text blocks, tables, and images via DocumentConverterResult.Artifacts
- Smart Detection - Automatic MIME, charset, and file-type guessing (including data/file URIs)
- High Performance - Stream-friendly, disk-backed buffers prevent large sources from exhausting RAM
Format Support
| Format | Extension | Status | Description |
|---|---|---|---|
| HTML | .html, .htm | ✅ Supported | Full HTML to Markdown conversion |
| Plain Text | .txt, .md | ✅ Supported | Direct text processing |
| PDF | .pdf | ✅ Supported | Adobe PDF documents with text extraction |
| Word | .docx | ✅ Supported | Microsoft Word documents with formatting |
| Excel | .xlsx | ✅ Supported | Microsoft Excel spreadsheets as tables |
| PowerPoint | .pptx | ✅ Supported | Microsoft PowerPoint presentations |
| Images | .jpg, .png, .gif, .bmp, .tiff, .webp | ✅ Supported | Exif metadata extraction + optional captions |
| Audio | .wav, .mp3, .m4a, .mp4 | ✅ Supported | Metadata extraction + optional transcription |
| CSV | .csv | ✅ Supported | Comma-separated values as Markdown tables |
| JSON | .json, .jsonl, .ndjson | ✅ Supported | Structured JSON data with formatting |
| XML | .xml, .xsd, .xsl, .rss, .atom | ✅ Supported | XML documents with structure preservation |
| EPUB | .epub | ✅ Supported | E-book files with metadata and content |
| Email | .eml | ✅ Supported | Email files with headers, content, and attachment info |
| ZIP | .zip | ✅ Supported | Archive processing with recursive file conversion |
| Jupyter Notebook | .ipynb | ✅ Supported | Python notebooks with code and markdown cells |
| RSS/Atom Feeds | .rss, .atom, .xml | ✅ Supported | Web feeds with structured content and metadata |
| YouTube URLs | YouTube links | ✅ Supported | Video metadata extraction and link formatting |
| Wikipedia Pages | wikipedia.org | ✅ Supported | Article-only extraction with clean Markdown |
| Bing SERPs | bing.com/search | ✅ Supported | Organic result summarisation |
Extended Format Support
| Format | Extension | Status | Description |
|---|---|---|---|
| DocBook | .xml, .docbook | ✅ Supported | Technical documentation with section hierarchy |
| JATS / NISO | .xml | ✅ Supported | Journal Article Tag Suite articles with enriched metadata |
| OPML | .opml | ✅ Supported | Outline Processor markup trees converted to Markdown lists |
| FictionBook (FB2) | .fb2 | ✅ Supported | Narrative e-books with cover art and metadata |
| EndNote XML | .xml | ✅ Supported | Bibliographic exports with citation data |
| BibTeX | .bib, .bibtex | ✅ Supported | Reference entries rendered as Markdown tables |
| RIS | .ris | ✅ Supported | Research citations emitted with field/value mapping |
| CSL-JSON | .csl.json, .json | ✅ Supported | Citation Style Language exports ready for RAG indexing |
| LaTeX | .tex | ✅ Supported | Text and math blocks preserved as Markdown or fenced code |
| reStructuredText | .rst | ✅ Supported | Converts directives, lists, and code blocks |
| AsciiDoc | .adoc, .asciidoc | ✅ Supported | Handles attributes, admonitions, and tables |
| Org Mode | .org | ✅ Supported | Emacs Org headlines and property drawers |
| Djot | .djot | ✅ Supported | Djot lightweight markup translation |
| Typst | .typ | ✅ Supported | Emerging typesetting language support |
| Textile | .textile | ✅ Supported | Textile markup to Markdown |
| Wiki Markup | .mediawiki, .wiki | ✅ Supported | MediaWiki-style formatting |
| Mermaid | .mmd, .mermaid | ✅ Supported | Diagram source preserved in fenced code blocks |
| Graphviz DOT | .dot | ✅ Supported | Graph definitions retained for rendering |
| PlantUML | .puml, .plantuml | ✅ Supported | UML diagrams emitted as fenced code |
| TikZ | .tikz | ✅ Supported | LaTeX TikZ drawings preserved for reuse |
| MetaMD | .metamd, .markdown | ✅ Supported | Round-trips existing MetaMD documents defined in docs/MetaMD.md |
HTML Conversion Features (AngleSharp powered)
- Headers (H1-H6) → Markdown headers
- Bold/Strong text → **bold**
- Italic/Emphasis text → *italic*
- Links → [text](url)
- Images → ![alt](src)
- Lists (ordered/unordered)
- Tables with header detection and Markdown table output
- Code blocks and inline code
- Blockquotes, sections, semantic containers
PDF Conversion Features
- Text extraction with page separation
- Header detection based on formatting
- List item recognition
- Title extraction from document content
- Page snapshot artifacts ensure every page can be sent through AI enrichment (OCR, diagram-to-Mermaid, chart narration) even when the PDF exposes selectable text
Office Documents (DOCX/XLSX/PPTX)
- Word (.docx): Headers, paragraphs, tables, bold/italic formatting, and embedded images captured for AI enrichment (OCR, Mermaid-ready diagrams)
- Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
- PowerPoint (.pptx): Slide-by-slide content with title recognition plus image artifacts primed for detailed AI captions and diagrams
CSV Conversion Features
- Automatic table formatting with headers
- Proper escaping of special characters
- Support for various CSV dialects
- Handles quoted fields and embedded commas
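To make the escaping point concrete, here is a minimal, standalone sketch of rendering rows as a Markdown table. It does not use MarkItDown's own CSV converter; EscapeCell and ToMarkdownTable are hypothetical helpers, and the choice of escaping pipes as \| and line breaks as <br> is an assumption about what "proper escaping" could look like, not a description of the library's behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: neutralise characters that would break a Markdown table cell.
static string EscapeCell(string value) =>
    value.Replace("|", "\\|")      // a bare pipe would start a new column
         .Replace("\r\n", "<br>")  // table rows must stay on a single line
         .Replace("\n", "<br>");

// Hypothetical helper: render a header row plus data rows as a Markdown table.
static string ToMarkdownTable(IReadOnlyList<string> header, IEnumerable<IReadOnlyList<string>> rows)
{
    var lines = new List<string>
    {
        "| " + string.Join(" | ", header.Select(EscapeCell)) + " |",
        "|" + string.Concat(header.Select(_ => "---|"))
    };
    lines.AddRange(rows.Select(r => "| " + string.Join(" | ", r.Select(EscapeCell)) + " |"));
    return string.Join("\n", lines);
}

var table = ToMarkdownTable(
    new[] { "Name", "Note" },
    new[] { new[] { "Widget", "size: 3|4" } });
Console.WriteLine(table);
```

The embedded pipe in the sample cell is emitted as 3\|4, so the row keeps its two columns.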
JSON Conversion Features
- Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
- JSON Lines Support: Processes .jsonl and .ndjson files line by line
- Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
- Nested Objects: Handles complex nested structures with proper indentation
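As an illustration of line-by-line JSON Lines handling, the sketch below parses each non-empty line as its own JSON document with System.Text.Json and reports the value kind. It is independent of MarkItDown's JsonConverter; DescribeJsonLines is a hypothetical name, and skipping blank lines is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical helper: parse JSON Lines input one record at a time.
static IEnumerable<string> DescribeJsonLines(TextReader reader)
{
    string? line;
    var index = 0;
    while ((line = reader.ReadLine()) is not null)
    {
        if (string.IsNullOrWhiteSpace(line)) continue; // assumption: blank lines are ignored
        index++;
        using var doc = JsonDocument.Parse(line);
        // ValueKind distinguishes Object, Array, String, Number, True, False, and Null.
        yield return $"record {index}: {doc.RootElement.ValueKind}";
    }
}

var input = "{\"id\": 1}\n\n[1, 2]\n\"text\"\n";
var descriptions = new List<string>(DescribeJsonLines(new StringReader(input)));
foreach (var description in descriptions)
{
    Console.WriteLine(description);
}
```

Because each line is parsed separately, one malformed record raises a JsonException for that line only, and a caller can decide whether to skip it or stop.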
XML Conversion Features
- Structure Preservation: Maintains XML hierarchy as Markdown headings
- Attributes Handling: Converts XML attributes to Markdown lists
- Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
- CDATA Support: Properly handles CDATA sections as code blocks
EPUB Conversion Features
- Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
- Content Order: Processes content files in proper reading order using spine information
- HTML Processing: Converts XHTML content using the HTML converter
- Table of Contents: Maintains document structure from the original EPUB
ZIP Archive Features
- Recursive Processing: Extracts and converts all supported files within archives
- Structure Preservation: Maintains original file paths and organization
- Multi-Format Support: Processes different file types within the same archive
- Error Handling: Continues processing even if individual files fail
- Size Limits: Protects against memory issues with large files
Jupyter Notebook Conversion Features
- Cell Type Support: Processes markdown, code, and raw cells appropriately
- Metadata Extraction: Extracts notebook title, kernel information, and language details
- Code Output Handling: Captures and formats execution results, streams, and errors
- Syntax Highlighting: Preserves language information for proper code block formatting
RSS/Atom Feed Conversion Features
- Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
- Feed Metadata: Extracts title, description, last update date, and author information
- Article Processing: Converts feed items with proper title linking and content formatting
- Date Formatting: Normalizes publication dates across different feed formats
YouTube URL Conversion Features
- URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
- Metadata Extraction: Extracts video ID and URL parameters with descriptions
- Embed Integration: Provides thumbnail images and multiple access methods
- Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)
Image Conversion Features
- Support for JPEG, PNG, GIF, BMP, TIFF, WebP
- Exif metadata extraction via exiftool (optional)
- Optional multimodal image captioning hook (LLM integration ready)
- Graceful fallback when metadata/captioning unavailable
Audio Conversion Features
- Handles WAV/MP3/M4A/MP4 containers
- Extracts key metadata (artist, album, duration, channels, etc.)
- Optional transcription delegate for speech-to-text results
- Markdown summary highlighting metadata and transcript
Quick Start
Installation
Install via NuGet Package Manager:
# Package Manager Console
Install-Package ManagedCode.MarkItDown
# .NET CLI
dotnet add package ManagedCode.MarkItDown
# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.3" />
Prerequisites
- .NET 9.0 SDK or later
- Compatible with .NET 9 apps and libraries
Minimal usage
using MarkItDown;
var client = new MarkItDownClient();
await using var result = await client.ConvertAsync("document.pdf");
Console.WriteLine(result.Title);
Console.WriteLine(result.Markdown);
Convert a stream
await using var stream = File.OpenRead("invoice.html");
var info = new StreamInfo(extension: ".html", mimeType: "text/html");
await using var result = await client.ConvertAsync(stream, info);
Convert a URL
await using var result = await client.ConvertFromUrlAsync("https://contoso.example/blog");
Console.WriteLine(result.Markdown.Length);
Optional Dependencies for Advanced Features
- PDF Support: Provided via PdfPig (bundled)
- Office Documents: Provided via DocumentFormat.OpenXml (bundled)
- Image metadata: Install ExifTool for richer output (brew install exiftool, choco install exiftool)
- Image captions: Supply an ImageCaptioner delegate (e.g., calls to an LLM or vision service)
- Audio transcription: Supply an AudioTranscriber delegate (e.g., Azure Cognitive Services, OpenAI Whisper)
Note: External tools are optional; MarkItDown degrades gracefully when they are absent.
Usage
- ConvertAsync(string path) converts any supported file on disk and returns a DocumentConverterResult.
- ConvertAsync(Stream stream, StreamInfo info) handles non-seekable or remote streams once you supply basic metadata (extension/MIME type).
- ConvertFromUrlAsync(string url) downloads HTTP(S) content using the optional HttpClient you pass into the constructor.
- Always dispose the result (await using var result = …) so temporary workspaces and artifacts are cleaned up.
- DocumentConverterResult exposes Markdown, Title, Segments, Artifacts, and Metadata for downstream processing.
- Apply custom behaviour through MarkItDownOptions (segment settings, AI providers, middleware) when constructing the client.
CLI
Prefer a guided experience? Run the bundled CLI to batch files or URLs:
dotnet run --project src/MarkItDown.Cli -- path/to/input
Use dotnet publish with your preferred runtime identifier if you need a self-contained binary.
Architecture
Core Components
- MarkItDown - Main entry point for conversions
- IDocumentConverter - Interface for format-specific converters
- DocumentConverterResult - Contains the aggregate Markdown plus structured DocumentSegment entries
- StreamInfo - Metadata about the input stream (MIME type, extension, charset, etc.)
- ConverterRegistration - Associates converters with priority for selection
Note: MIME detection and normalization rely on ManagedCode.MimeTypes.
Built-in Converters
MarkItDown includes these converters in priority order:
- YouTubeUrlConverter - Video metadata from YouTube URLs
- HtmlConverter - HTML to Markdown using AngleSharp
- WikipediaConverter - Clean article extraction from Wikipedia pages
- BingSerpConverter - Search result summaries from Bing
- RssFeedConverter - RSS/Atom feeds with article processing
- JsonConverter - Structured JSON data with formatting
- JupyterNotebookConverter - Python notebooks with code and markdown cells
- CsvConverter - CSV files as Markdown tables
- EpubConverter - E-book content and metadata
- EmlConverter - Email files with headers and attachments
- XmlConverter - XML documents with structure preservation
- ZipConverter - Archive processing with recursive conversion
- PdfConverter - PDF text extraction using PdfPig
- DocxConverter - Microsoft Word documents
- XlsxConverter - Microsoft Excel spreadsheets
- PptxConverter - Microsoft PowerPoint presentations
- AudioConverter - Audio metadata and optional transcription
- ImageConverter - Image metadata via ExifTool and optional captions
- PlainTextConverter - Plain text, Markdown, and other text formats (fallback)
Structured Segments & Metadata
Every conversion populates DocumentConverterResult.Segments with strongly typed DocumentSegment instances. Segments preserve natural breakpoints (pages, slides, sheets, archive entries, audio ranges) alongside rich metadata:
- Type and Number expose what the segment represents (for example page/slide numbers)
- Label carries human-readable descriptors when available
- StartTime/EndTime capture media timelines for audio/video content
- AdditionalMetadata holds contextual properties such as archive entry paths or sheet names
await using var result = await markItDown.ConvertAsync("report.pdf");
foreach (var segment in result.Segments)
{
Console.WriteLine($"[{segment.Type}] #{segment.Number}: {segment.Label}");
}
Runtime behaviour is controlled through SegmentOptions on MarkItDownOptions. Enabling IncludeSegmentMetadataInMarkdown emits inline annotations like [page:1], [sheet:Sales], or [timecode:00:01:00-00:02:00] directly in the Markdown stream. Audio transcripts honour Segments.Audio.SegmentDuration, while still collapsing short transcripts into a single, time-aware slice.
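The annotation shapes quoted above can be pictured with a small standalone sketch. It only reproduces the three example strings from this section; Annotation and Timecode are hypothetical helpers, and nothing here calls into MarkItDown or claims to mirror how the library formats its output internally.

```csharp
using System;

// Hypothetical formatter for [kind:value] annotations such as [page:1].
static string Annotation(string kind, string value) => $"[{kind}:{value}]";

// Hypothetical formatter for a time range, using the hh:mm:ss shape shown in the text.
static string Timecode(TimeSpan start, TimeSpan end) =>
    Annotation("timecode", start.ToString(@"hh\:mm\:ss") + "-" + end.ToString(@"hh\:mm\:ss"));

var page = Annotation("page", "1");
var sheet = Annotation("sheet", "Sales");
var range = Timecode(TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(2));

Console.WriteLine(page);
Console.WriteLine(sheet);
Console.WriteLine(range);
```

A downstream chunker could split on markers of this shape to keep page, sheet, or time boundaries attached to each chunk.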
Cloud Intelligence Providers
MarkItDown exposes optional abstractions for running documents through cloud services:
- IDocumentIntelligenceProvider - structured page, table, and layout extraction.
- IImageUnderstandingProvider - OCR, captioning, and object detection for embedded images.
- IMediaTranscriptionProvider - timed transcripts for audio and video inputs.
The AzureIntelligenceOptions, GoogleIntelligenceOptions, and AwsIntelligenceOptions helpers wire the respective cloud Document AI/Vision/Speech stacks without forcing the dependency on consumers. You can still bring your own implementation by assigning the provider interfaces directly on MarkItDownOptions.
MarkItDownClient emits structured ILogger events and OpenTelemetry spans by default. Toggle instrumentation with MarkItDownOptions.EnableTelemetry, supply a custom ActivitySource/Meter, or provide a LoggerFactory to integrate with your application's logging pipeline.
Azure AI setup (keys and managed identity)
Docs: Document Intelligence, Computer Vision Image Analysis, Video Indexer authentication.
API keys / connection strings: store your Cognitive Services key in configuration (for example appsettings.json or an Azure App Configuration connection string) and hydrate the options:
var configuration = host.Services.GetRequiredService<IConfiguration>();
var azureOptions = new AzureIntelligenceOptions
{
    DocumentIntelligence = new AzureDocumentIntelligenceOptions
    {
        Endpoint = configuration["Azure:DocumentIntelligence:Endpoint"],
        ApiKey = configuration.GetConnectionString("AzureDocumentIntelligenceKey"),
        ModelId = "prebuilt-layout"
    },
    Vision = new AzureVisionOptions
    {
        Endpoint = configuration["Azure:Vision:Endpoint"],
        ApiKey = configuration.GetConnectionString("AzureVisionKey")
    },
    Media = new AzureMediaIntelligenceOptions
    {
        AccountId = configuration["Azure:VideoIndexer:AccountId"],
        AccountName = configuration["Azure:VideoIndexer:AccountName"],
        Location = configuration["Azure:VideoIndexer:Location"],
        SubscriptionId = configuration["Azure:VideoIndexer:SubscriptionId"],
        ResourceGroup = configuration["Azure:VideoIndexer:ResourceGroup"],
        ResourceId = configuration["Azure:VideoIndexer:ResourceId"],
        ArmAccessToken = configuration.GetConnectionString("AzureVideoIndexerArmToken")
    }
};
Managed identity: omit the ApiKey/ArmAccessToken properties and the providers automatically fall back to DefaultAzureCredential. Assign the managed identity the Cognitive Services User role for Document Intelligence and Vision, and follow the Video Indexer managed identity instructions to authorize uploads.
Video Indexer tips: Video uploads require both the Video Indexer account (ID + region) and either the full resource ID or the trio of subscription ID, resource group, and account name, plus an ARM token or Azure AD identity with Contributor access on the Video Indexer resource. The interactive CLI exposes dedicated prompts for these values under "Configure cloud providers".
var azureOptions = new AzureIntelligenceOptions
{
    DocumentIntelligence = new AzureDocumentIntelligenceOptions
    {
        Endpoint = "https://contoso.cognitiveservices.azure.com/"
    },
    Vision = new AzureVisionOptions
    {
        Endpoint = "https://contoso.cognitiveservices.azure.com/"
    },
    Media = new AzureMediaIntelligenceOptions
    {
        AccountId = "<video-indexer-account-id>",
        AccountName = "<video-indexer-account-name>",
        Location = "trial"
    }
};
Google Cloud setup
Docs: Document AI, Vision API, Speech-to-Text.
Service account JSON / ADC: place your service account JSON on disk or load it from Secret Manager, then point the options at it (or provide a GoogleCredential instance). If CredentialsPath/JsonCredentials/Credential are omitted, the providers use Application Default Credentials:
var googleOptions = new GoogleIntelligenceOptions
{
    DocumentIntelligence = new GoogleDocumentIntelligenceOptions
    {
        ProjectId = "my-project",
        Location = "us",
        ProcessorId = "processor-id",
        CredentialsPath = Environment.GetEnvironmentVariable("GOOGLE_APPLICATION_CREDENTIALS")
    },
    Vision = new GoogleVisionOptions
    {
        JsonCredentials = Environment.GetEnvironmentVariable("GOOGLE_VISION_JSON")
    },
    Media = new GoogleMediaIntelligenceOptions
    {
        Credential = GoogleCredential.GetApplicationDefault(),
        LanguageCode = "en-US"
    }
};
Workload identity / managed identities: host the app on GKE, Cloud Run, or Cloud Functions with Workload Identity Federation. The Google SDK's automatic credential chain will pick up the ambient identity, and the providers will work without JSON keys.
AWS setup
Docs: Textract, Rekognition, Transcribe, .NET credential management.
Access keys / connection strings: populate the options directly from configuration when you must supply static credentials (for example from AWS Secrets Manager or an encrypted connection string):
var awsOptions = new AwsIntelligenceOptions
{
    DocumentIntelligence = new AwsDocumentIntelligenceOptions
    {
        AccessKeyId = configuration["AWS:AccessKeyId"],
        SecretAccessKey = configuration["AWS:SecretAccessKey"],
        Region = configuration.GetValue<string>("AWS:Region")
    },
    Vision = new AwsVisionOptions
    {
        AccessKeyId = configuration["AWS:AccessKeyId"],
        SecretAccessKey = configuration["AWS:SecretAccessKey"],
        Region = configuration.GetValue<string>("AWS:Region"),
        MinConfidence = 80f
    },
    Media = new AwsMediaIntelligenceOptions
    {
        AccessKeyId = configuration["AWS:AccessKeyId"],
        SecretAccessKey = configuration["AWS:SecretAccessKey"],
        Region = configuration.GetValue<string>("AWS:Region"),
        InputBucketName = configuration["AWS:Transcribe:InputBucket"],
        OutputBucketName = configuration["AWS:Transcribe:OutputBucket"]
    }
};
IAM roles / AWS managed identity: leave the credential fields null to use the default AWS credential chain (environment variables, shared credentials file, EC2/ECS/EKS IAM roles, or AWS SSO). Ensure the execution role has permissions for textract:AnalyzeDocument, rekognition:DetectLabels, rekognition:DetectText, transcribe:StartTranscriptionJob, and S3 access for the specified buckets.
YouTube metadata & captions
Docs: YoutubeExplode (used under the hood).
Out of the box: YouTubeUrlConverter now enriches Markdown with title, channel, stats, thumbnails, and (when available) auto-generated captions laid out as timecoded segments.
Custom provider: supply MarkItDownOptions.YouTubeMetadataProvider to disable network access, inject caching, or swap to an alternative implementation.
var options = new MarkItDownOptions
{
    YouTubeMetadataProvider = new YoutubeExplodeMetadataProvider(), // default
    // You can plug in a stub or caching decorator instead:
    // YouTubeMetadataProvider = new MyCachedYouTubeProvider(inner: new YoutubeExplodeMetadataProvider())
};
When a provider returns null, the converter falls back to URL-derived metadata, so YouTube support remains fully optional.
For LLM-style post-processing, assign MarkItDownOptions.AiModels with an IAiModelProvider. The built-in StaticAiModelProvider accepts Microsoft.Extensions.AI clients (chat models, speech-to-text, etc.), enabling you to share application-wide model builders.
Converter Priority & Detection
- Priority-based dispatch (lower values processed first)
- Automatic stream sniffing via StreamInfoGuesser
- Manual overrides via MarkItDownOptions or StreamInfo
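The priority rule above (lower values are tried first) can be sketched in isolation. The snippet below is a simplified model, not the library's ConverterRegistration: the tuple layout, the priority numbers, and the extension-only Accepts predicates are all assumptions chosen to show why a catch-all fallback must carry the highest value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical registrations: (priority, name, predicate over the file extension).
var registrations = new List<(int Priority, string Name, Func<string, bool> Accepts)>
{
    (1000, "PlainText", _ => true),                  // catch-all fallback, so it must sort last
    (100,  "Html",      ext => ext is ".html" or ".htm"),
    (200,  "Pdf",       ext => ext == ".pdf"),
};

// Lower priority values are tried first; the first converter that accepts the input wins.
string PickConverter(string extension) =>
    registrations
        .OrderBy(r => r.Priority)
        .First(r => r.Accepts(extension))
        .Name;

Console.WriteLine(PickConverter(".pdf"));
Console.WriteLine(PickConverter(".htm"));
Console.WriteLine(PickConverter(".log"));
```

In this model, registering the fallback with a low number would shadow every other converter, which is why specific converters get small values and the fallback gets a large one.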
Error Handling & Troubleshooting
Common Exceptions
using MarkItDown;
var markItDown = new MarkItDownClient();
try
{
    // Dispose the result so temporary workspaces are cleaned up
    await using var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);
}
catch (UnsupportedFormatException ex)
{
// File format not supported by any converter
Console.WriteLine($"Cannot process this file type: {ex.Message}");
}
catch (FileNotFoundException ex)
{
// File path doesn't exist
Console.WriteLine($"File not found: {ex.Message}");
}
catch (UnauthorizedAccessException ex)
{
// Permission issues
Console.WriteLine($"Access denied: {ex.Message}");
}
catch (MarkItDownException ex)
{
// General conversion errors (corrupt files, parsing issues, etc.)
Console.WriteLine($"Conversion failed: {ex.Message}");
if (ex.InnerException != null)
Console.WriteLine($"Details: {ex.InnerException.Message}");
}
Troubleshooting Tips
File Format Detection Issues:
// Force specific format detection
var streamInfo = new StreamInfo(
mimeType: "application/pdf", // Explicit MIME type
extension: ".pdf", // Explicit extension
fileName: "document.pdf" // Original filename
);
await using var result = await markItDown.ConvertAsync(stream, streamInfo);
Memory Issues with Large Files:
// Use cancellation tokens to prevent runaway processing
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));
try
{
    await using var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
}
catch (OperationCanceledException)
{
Console.WriteLine("Conversion timed out - file may be too large or complex");
}
Network Issues (URLs):
// Configure HttpClient for better reliability
using var httpClient = new HttpClient();
httpClient.Timeout = TimeSpan.FromSeconds(30);
httpClient.DefaultRequestHeaders.Add("User-Agent", "MarkItDown/1.0");
var markItDown = new MarkItDownClient(httpClient: httpClient);
Logging for Diagnostics:
using Microsoft.Extensions.Logging;
using var loggerFactory = LoggerFactory.Create(builder =>
builder.AddConsole().SetMinimumLevel(LogLevel.Debug));
var logger = loggerFactory.CreateLogger<MarkItDownClient>();
var markItDown = new MarkItDownClient(logger: logger);
// Now you'll see detailed conversion progress in console output
Development & Contributing
Migration from Python MarkItDown
If you're familiar with the original Python library, here are the key differences:
| Python | C#/.NET | Notes |
|---|---|---|
| MarkItDown() | new MarkItDownClient() | Similar constructor |
| markitdown.convert("file.pdf") | await markItDown.ConvertAsync("file.pdf") | Async pattern |
| markitdown.convert(stream, file_extension=".pdf") | await markItDown.ConvertAsync(stream, streamInfo) | StreamInfo object |
| markitdown.convert_url("https://...") | await markItDown.ConvertFromUrlAsync("https://...") | Async URL conversion |
| llm_client=... parameter | ImageCaptioner, AudioTranscriber delegates | More flexible callback system |
| Plugin system | Not yet implemented | Planned for future release |
Example Migration:
# Python version
import markitdown
md = markitdown.MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
// C# version
using MarkItDown;
var markItDown = new MarkItDownClient();
await using var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);
.NET SDK Setup
MarkItDown targets .NET 9.0. If your environment does not have the required SDK, run the helper script once:
./eng/install-dotnet.sh
The script installs the SDK into ~/.dotnet using the official dotnet-install bootstrapper and prints the environment
variables to add to your shell profile so the dotnet CLI is available on subsequent sessions.
Building from Source
# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown
# Build the solution
dotnet build
# Run tests
dotnet test
# Create NuGet package
dotnet pack --configuration Release
Tests & Coverage
dotnet test --collect:"XPlat Code Coverage"
The command emits standard test results plus a Cobertura coverage report at
tests/MarkItDown.Tests/TestResults/<guid>/coverage.cobertura.xml. Tools such as
ReportGenerator can turn this into
HTML or Markdown dashboards.
✅ The regression suite now exercises DOCX and PPTX conversions with embedded imagery, ensuring conversion middleware runs and enriched descriptions remain attached to the composed Markdown.
✅ Additional image-placement regressions verify that AI-generated captions are injected immediately after each source placeholder for DOCX, PPTX, and PDF outputs.
Project Structure
├── src/
│   └── MarkItDown/                # Core library
│       ├── Converters/            # Format-specific converters (HTML, PDF, audio, etc.)
│       ├── MarkItDown.cs          # Main conversion engine
│       ├── StreamInfoGuesser.cs   # MIME/charset/extension detection helpers
│       ├── MarkItDownOptions.cs   # Runtime configuration flags
│       └── ...                    # Shared utilities (UriUtilities, MimeMapping, etc.)
├── tests/
│   └── MarkItDown.Tests/          # xUnit + Shouldly tests, Python parity vectors
├── Directory.Build.props          # Shared build + packaging settings
└── README.md                      # This document
Contributing Guidelines
- Fork the repository.
- Create a feature branch (git checkout -b feature/my-feature).
- Add tests with xUnit/Shouldly mirroring relevant Python vectors.
- Run dotnet test (CI enforces green builds + coverage upload).
- Update docs or samples if behaviour changes.
- Submit a pull request for review.
Roadmap
Near-Term
- Azure Document Intelligence converter (options already scaffolded)
- Outlook .msg ingestion via MIT-friendly dependencies
- Performance optimizations and memory usage improvements
- Enhanced test coverage mirroring Python test vectors
Future Ideas
- Plugin discovery & sandboxing for custom converters
- Built-in LLM caption/transcription providers (OpenAI, Azure AI)
- Incremental/streaming conversion APIs for large documents
- Cloud-native integration samples (Azure Functions, AWS Lambda)
- Command-line interface (CLI) for batch processing
Performance
MarkItDown is designed for high-performance document processing in production environments:
Performance Characteristics
| Feature | Benefit | Impact |
|---|---|---|
| Disk-first stream processing | Large inputs are buffered in managed on-disk workspaces rather than MemoryStream | Memory use stays bounded on large documents |
| Async/await throughout | Non-blocking operations | Better scalability, responsive UIs |
| Memory efficient | Smart buffer reuse | Lower memory footprint for large documents |
| Fast format detection | Lightweight MIME/extension sniffing | Quick routing to appropriate converter |
| Parallel processing ready | Thread-safe converter instances | Handle multiple documents concurrently |
Performance Considerations
MarkItDown's performance depends on:
- Document size and complexity - Larger files with more formatting take longer to process
- File format - Some formats (like PDF) require more processing than others (like plain text)
- Available system resources - Memory, CPU, and I/O capabilities
- Optional services - Image captioning and audio transcription add processing time
Performance will vary based on your specific documents and environment. For production workloads, we recommend benchmarking with your actual document types and sizes.
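One way to act on the benchmarking advice is a small timing harness. The sketch below is self-contained and uses a stand-in delegate with a simulated delay instead of a real conversion, so the numbers it prints are meaningless until you swap in your own call (for example a lambda wrapping ConvertAsync); MeasureAsync and the sample file names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Stand-in for a real conversion call; replace with your own delegate when measuring.
Func<string, Task<string>> convert = async p =>
{
    await Task.Delay(10); // simulated work, not a real conversion
    return $"# {p}";
};

// Time each input once, sequentially, and return (path, elapsed) pairs.
async Task<(string Path, TimeSpan Elapsed)[]> MeasureAsync(string[] paths)
{
    var results = new (string Path, TimeSpan Elapsed)[paths.Length];
    for (var i = 0; i < paths.Length; i++)
    {
        var stopwatch = Stopwatch.StartNew();
        await convert(paths[i]);
        stopwatch.Stop();
        results[i] = (paths[i], stopwatch.Elapsed);
    }
    return results;
}

var timings = await MeasureAsync(new[] { "a.pdf", "b.docx" });
foreach (var (path, elapsed) in timings)
{
    Console.WriteLine($"{path}: {elapsed.TotalMilliseconds:F1} ms");
}
```

Running inputs sequentially keeps the measurements from competing for I/O; for steadier numbers, repeat each input several times and discard the first (warm-up) run.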
Optimization Tips
// 1. Reuse one MarkItDownClient instance (it is thread-safe)
var markItDown = new MarkItDownClient();
var results = await Task.WhenAll(
    markItDown.ConvertAsync("file1.pdf"),
    markItDown.ConvertAsync("file2.docx"),
    markItDown.ConvertAsync("file3.html")
);
// Dispose each result once you are done with it
foreach (var item in results)
    await item.DisposeAsync();
// 2. Use cancellation tokens for timeouts
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
await using var timedResult = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
// 3. Configure HttpClient for web content (reuse connections)
using var httpClient = new HttpClient();
var webClient = new MarkItDownClient(httpClient: httpClient);
// 4. Pre-specify StreamInfo to skip format detection
var streamInfo = new StreamInfo(mimeType: "application/pdf", extension: ".pdf");
await using var streamResult = await markItDown.ConvertAsync(stream, streamInfo);
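When a batch is much larger than three files, starting every conversion at once can oversubscribe disk and CPU. A common pattern, sketched here under assumptions, is to cap concurrency with SemaphoreSlim. The convert delegate is a stand-in with a simulated delay rather than a MarkItDown call, and ConvertAllAsync and the limit of 2 are hypothetical choices for illustration.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Stand-in conversion delegate; replace with a call into your own client.
Func<string, CancellationToken, Task<string>> convert = async (p, token) =>
{
    await Task.Delay(10, token); // simulated work, not a real conversion
    return $"# {p}";
};

// Run at most maxParallel conversions at a time; results keep the input order.
async Task<string[]> ConvertAllAsync(string[] paths, int maxParallel, CancellationToken ct)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var tasks = paths.Select(async p =>
    {
        await gate.WaitAsync(ct);
        try
        {
            return await convert(p, ct);
        }
        finally
        {
            gate.Release(); // always free the slot, even when a conversion throws
        }
    });
    return await Task.WhenAll(tasks);
}

using var batchCts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var outputs = await ConvertAllAsync(new[] { "a.pdf", "b.docx", "c.html" }, 2, batchCts.Token);
Console.WriteLine(string.Join(", ", outputs));
```

Task.WhenAll returns results in the order the tasks were supplied, so outputs line up with the input paths regardless of which conversion finishes first.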
Configuration
Basic Configuration
var options = new MarkItDownOptions
{
EnableBuiltins = true, // Use built-in converters (default: true)
EnablePlugins = false, // Plugin system (reserved for future use)
ExifToolPath = "/usr/local/bin/exiftool" // Path to exiftool binary (optional)
};
var markItDown = new MarkItDownClient(options);
Advanced AI Integration
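using Microsoft.Extensions.AI; // IChatClient and ISpeechToTextClient live here
// MyChatClient and MySpeechToTextClient are placeholders for your own implementations
var openAIChatClient = new MyChatClient(); // IChatClient from Microsoft.Extensions.AI
var whisperSpeechClient = new MySpeechToTextClient(); // ISpeechToTextClient from Microsoft.Extensions.AI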
var options = new MarkItDownOptions
{
AiModels = new StaticAiModelProvider(openAIChatClient, whisperSpeechClient),
AzureIntelligence = new AzureIntelligenceOptions
{
DocumentIntelligence = new AzureDocumentIntelligenceOptions
{
Endpoint = "https://your-document-intelligence.cognitiveservices.azure.com/",
ApiKey = "<document-intelligence-key>"
},
Vision = new AzureVisionOptions
{
Endpoint = "https://your-computervision.cognitiveservices.azure.com/",
ApiKey = "<vision-key>"
}
}
};
var markItDown = new MarkItDownClient(options);
Conversion Middleware & Raw Artifacts
Every conversion exposes the raw extraction artifacts that feed the Markdown composer. Use DocumentConverterResult.Artifacts to inspect page text, tables, or embedded images before they are flattened into Markdown. To add processing steps, register IConversionMiddleware instances through MarkItDownOptions.ConversionMiddleware. Middleware executes after extraction and can mutate segments, enrich metadata, or call external AI services.

When an IChatClient is supplied and EnableAiImageEnrichment remains true (the default), MarkItDown automatically adds the built-in AiImageEnrichmentMiddleware to describe charts, diagrams, and other visuals. This middleware keeps the enriched prose anchored to the exact Markdown placeholder emitted during extraction, so captions, Mermaid diagrams, and OCR text land beside the original image instead of drifting to the end of the section.
var options = new MarkItDownOptions
{
AiModels = new StaticAiModelProvider(chatClient: myChatClient, speechToTextClient: null),
ConversionMiddleware = new IConversionMiddleware[]
{
new MyDomainSpecificMiddleware()
}
};
var markItDown = new MarkItDownClient(options);
var result = await markItDown.ConvertAsync("docs/diagram.docx");
foreach (var image in result.Artifacts.Images)
{
Console.WriteLine($"Image {image.Label}: {image.DetailedDescription}");
}
Set EnableAiImageEnrichment to false when you need a completely custom pipeline with no default AI step.
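To illustrate the middleware idea in isolation, here is a self-contained sketch of a post-extraction pipeline. Every type in it (`SketchSegment`, `ISketchMiddleware`, `WordCountMiddleware`, `SketchPipeline`) is a hypothetical stand-in written for this example. None of them are MarkItDown types, and the real IConversionMiddleware signature may differ, so treat this as a picture of the pattern and check the library source before implementing the actual interface.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for an extracted piece of a document (NOT a MarkItDown type).
public sealed class SketchSegment
{
    public string Text { get; set; } = "";
    public Dictionary<string, string> Metadata { get; } = new();
}

// Stand-in middleware contract (NOT the library's IConversionMiddleware).
public interface ISketchMiddleware
{
    Task InvokeAsync(IList<SketchSegment> segments, CancellationToken cancellationToken);
}

// Example step: enrich each segment's metadata with a word count.
public sealed class WordCountMiddleware : ISketchMiddleware
{
    public Task InvokeAsync(IList<SketchSegment> segments, CancellationToken cancellationToken)
    {
        foreach (var segment in segments)
        {
            cancellationToken.ThrowIfCancellationRequested();
            // A null separator array splits on any whitespace.
            var words = segment.Text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
            segment.Metadata["wordCount"] = words.Length.ToString();
        }
        return Task.CompletedTask;
    }
}

// Runs each step in registration order over the already-extracted segments.
public static class SketchPipeline
{
    public static async Task RunAsync(
        IList<SketchSegment> segments,
        IEnumerable<ISketchMiddleware> middleware,
        CancellationToken cancellationToken = default)
    {
        foreach (var step in middleware)
        {
            await step.InvokeAsync(segments, cancellationToken);
        }
    }
}
```

Because each step sees the output of the previous one, ordering matters: a step that rewrites segment text should run before a step that computes metadata from that text.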
Production Configuration with Error Handling
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.DependencyInjection;
// Set up dependency injection
var services = new ServiceCollection();
services.AddLogging(builder => builder.AddConsole().SetMinimumLevel(LogLevel.Information));
services.AddHttpClient();
var serviceProvider = services.BuildServiceProvider();
var logger = serviceProvider.GetRequiredService<ILogger<MarkItDownClient>>();
var httpClientFactory = serviceProvider.GetRequiredService<IHttpClientFactory>();
var options = new MarkItDownOptions
{
// Graceful degradation for image processing
ImageCaptioner = async (bytes, info, token) =>
{
try
{
// Your AI service call here
return await CallVisionServiceAsync(bytes, token);
}
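catch (Exception ex) when (ex is not OperationCanceledException)
{
// Log the full exception and degrade gracefully; cancellation still propagates to the caller
logger.LogWarning(ex, "Image captioning failed for {FileName}", info.FileName);
return $"[Image: {info.FileName ?? "unknown"}]"; // Fallback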
}
}
};
var markItDown = new MarkItDownClient(options, logger, httpClientFactory.CreateClient());
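The same graceful-degradation idea can be factored into a small reusable helper. The sketch below is an assumption-labeled illustration, not part of the library: `Fallback.TryAsync` is a hypothetical name. It returns a fallback value on failure but rethrows when the caller requested cancellation, so timeouts are not silently converted into placeholder captions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of MarkItDown): run an operation with a fallback value.
public static class Fallback
{
    public static async Task<T> TryAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, T> fallback,
        CancellationToken cancellationToken = default)
    {
        try
        {
            return await operation(cancellationToken);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // The caller asked to stop; do not mask that with a fallback value.
            throw;
        }
        catch (Exception ex)
        {
            // Any other failure degrades to the fallback (log inside `fallback` if needed).
            return fallback(ex);
        }
    }
}
```

Inside an image-captioning callback, the operation would wrap the vision-service call and the fallback would log the exception and return a placeholder such as `[Image: unknown]`.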
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
This project is a comprehensive C# port of the original Microsoft MarkItDown Python library, created by the Microsoft AutoGen team. We've reimagined it specifically for the .NET ecosystem while maintaining compatibility with the original's design philosophy and capabilities.
Key differences in this .NET version:
- Native .NET performance - Built from scratch in C#, not a Python wrapper
- Modern async patterns - Full async/await support with cancellation tokens
- NuGet ecosystem integration - Easy installation and dependency management
- Enterprise features - Comprehensive logging, error handling, and configuration
- Enhanced performance - Stream-based processing and memory optimizations
Maintained by: ManagedCode team
Original inspiration: Microsoft AutoGen team
License: MIT (same as the original Python version)
We're committed to maintaining feature parity with the upstream Python project while delivering the performance and developer experience that .NET developers expect.
Support
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Create an issue for support
⭐ Star this repository if you find it useful!
Made with ❤️ by ManagedCode
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies for net9.0:
- AngleSharp (>= 1.3.0)
- AWSSDK.Rekognition (>= 4.0.2.8)
- AWSSDK.S3 (>= 4.0.7.10)
- AWSSDK.Textract (>= 4.0.2.8)
- AWSSDK.TranscribeService (>= 4.0.4)
- Azure.AI.FormRecognizer (>= 4.1.0)
- Azure.AI.Vision.ImageAnalysis (>= 1.0.0)
- Azure.Identity (>= 1.17.0)
- DocumentFormat.OpenXml (>= 3.3.0)
- Google.Cloud.DocumentAI.V1 (>= 3.22.0)
- Google.Cloud.Speech.V1 (>= 3.8.0)
- Google.Cloud.Vision.V1 (>= 3.7.0)
- ManagedCode.MimeTypes (>= 1.0.5)
- ManagedCode.Storage.Aws (>= 9.2.1)
- ManagedCode.Storage.Azure (>= 9.2.1)
- ManagedCode.Storage.Core (>= 9.2.1)
- ManagedCode.Storage.FileSystem (>= 9.2.1)
- ManagedCode.Storage.Gcp (>= 9.2.1)
- Microsoft.Extensions.AI (>= 9.10.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 9.0.10)
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.10)
- Microsoft.Extensions.Options (>= 9.0.10)
- MimeKit (>= 4.14.0)
- PdfPig (>= 0.1.11)
- PDFtoImage (>= 5.1.1)
- Sep (>= 0.11.2)
- SkiaSharp (>= 3.119.1)
- System.Text.Encoding.CodePages (>= 9.0.10)
- System.Text.Json (>= 9.0.10)
- YoutubeExplode (>= 6.5.5)