ManagedCode.MarkItDown 0.0.5

Prefix Reserved
dotnet add package ManagedCode.MarkItDown --version 0.0.5
                    
NuGet\Install-Package ManagedCode.MarkItDown -Version 0.0.5
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.5" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="ManagedCode.MarkItDown" Version="0.0.5" />
                    
Directory.Packages.props
<PackageReference Include="ManagedCode.MarkItDown" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add ManagedCode.MarkItDown --version 0.0.5
                    
#r "nuget: ManagedCode.MarkItDown, 0.0.5"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package ManagedCode.MarkItDown@0.0.5
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=ManagedCode.MarkItDown&version=0.0.5
                    
Install as a Cake Addin
#tool nuget:?package=ManagedCode.MarkItDown&version=0.0.5
                    
Install as a Cake Tool

MarkItDown

.NET NuGet License: MIT

πŸš€ Transform any document into LLM-ready Markdown with this powerful C#/.NET library!

MarkItDown is a comprehensive document conversion library that transforms diverse file formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, and more) into clean, high-quality Markdown. Perfect for AI workflows, RAG (Retrieval-Augmented Generation) systems, content processing pipelines, and text analytics applications.

Why MarkItDown for .NET?

  • 🎯 Built for modern C# developers - Native .NET 9 library with async/await throughout
  • 🧠 LLM-optimized output - Clean Markdown that AI models love to consume
  • πŸ“¦ Zero-friction NuGet package - Just dotnet add package ManagedCode.MarkItDown and go
  • πŸ”„ Disk-first stream processing - Handle large documents efficiently using managed workspaces instead of MemoryStream
  • πŸ› οΈ Highly extensible - Add custom converters or integrate with AI services for captions/transcription

This is a high-fidelity C# port of Microsoft's original MarkItDown Python library, reimagined for the .NET ecosystem with modern async patterns, improved performance, and enterprise-ready features.

🌟 Why Choose MarkItDown?

For AI & LLM Applications

  • Perfect for RAG systems - Convert documents to searchable, contextual Markdown chunks
  • Token-efficient - Clean output maximizes your LLM token budget
  • Structured data preservation - Tables, headers, and lists maintain semantic meaning
  • Metadata extraction - Rich document properties for enhanced context

For .NET Developers

  • Native performance - Built from the ground up for .NET, not a wrapper
  • Modern async/await - Non-blocking I/O with full cancellation support
  • Memory efficient - Stream-based processing avoids loading entire files into memory
  • Enterprise ready - Proper error handling, logging, and configuration options

For Content Processing

  • 22+ file formats supported - From Office documents to web pages to archives
  • Batch processing ready - Handle hundreds of documents efficiently
  • Extensible architecture - Add custom converters for proprietary formats
  • Smart format detection - Automatic MIME type and encoding detection

Table of Contents

Features

✨ Modern .NET - Targets .NET 9.0 with up-to-date language features
πŸ“¦ NuGet Package - Drop-in dependency for libraries and automation pipelines
πŸ”„ Async/Await - Fully asynchronous pipeline for responsive apps 🧠 LLM-Optimized - Markdown tailored for AI ingestion and summarisation πŸ”§ Extensible - Register custom converters or plug additional caption/transcription services 🧩 Conversion middleware - Compose post-processing steps with IConversionMiddleware (AI enrichment ready) πŸ“‚ Raw artifacts API - Inspect text blocks, tables, and images via DocumentConverterResult.Artifacts 🧭 Smart Detection - Automatic MIME, charset, and file-type guessing (including data/file URIs) ⚑ High Performance - Stream-friendly, disk-backed buffers prevent large sources from exhausting RAM

πŸ“‹ Format Support

Format Extension Status Description
HTML .html, .htm βœ… Supported Full HTML to Markdown conversion
Plain Text .txt, .md βœ… Supported Direct text processing
PDF .pdf βœ… Supported Adobe PDF documents with text extraction
Word .docx βœ… Supported Microsoft Word documents with formatting
Excel .xlsx βœ… Supported Microsoft Excel spreadsheets as tables
PowerPoint .pptx βœ… Supported Microsoft PowerPoint presentations
Images .jpg, .png, .gif, .bmp, .tiff, .webp βœ… Supported Exif metadata extraction + optional captions
Audio .wav, .mp3, .m4a, .mp4 βœ… Supported Metadata extraction + optional transcription
CSV .csv βœ… Supported Comma-separated values as Markdown tables
JSON .json, .jsonl, .ndjson βœ… Supported Structured JSON data with formatting
XML .xml, .xsd, .xsl, .rss, .atom βœ… Supported XML documents with structure preservation
EPUB .epub βœ… Supported E-book files with metadata and content
Email .eml βœ… Supported Email files with headers, content, and attachment info
ZIP .zip βœ… Supported Archive processing with recursive file conversion
Jupyter Notebook .ipynb βœ… Supported Python notebooks with code and markdown cells
RSS/Atom Feeds .rss, .atom, .xml βœ… Supported Web feeds with structured content and metadata
YouTube URLs YouTube links βœ… Supported Video metadata extraction and link formatting
Wikipedia Pages wikipedia.org βœ… Supported Article-only extraction with clean Markdown
Bing SERPs bing.com/search βœ… Supported Organic result summarisation

πŸ“š Extended Format Support

Format Extension Status Description
DocBook .xml, .docbook βœ… Supported Technical documentation with section hierarchy
JATS / NISO .xml βœ… Supported Journal Article Tag Suite articles with enriched metadata
OPML .opml βœ… Supported Outline Processor markup trees converted to Markdown lists
FictionBook (FB2) .fb2 βœ… Supported Narrative e-books with cover art and metadata
EndNote XML .xml βœ… Supported Bibliographic exports with citation data
BibTeX .bib, .bibtex βœ… Supported Reference entries rendered as Markdown tables
RIS .ris βœ… Supported Research citations emitted with field/value mapping
CSL-JSON .csl.json, .json βœ… Supported Citation Style Language exports ready for RAG indexing
LaTeX .tex βœ… Supported Text and math blocks preserved as Markdown or fenced code
reStructuredText .rst βœ… Supported Converts directives, lists, and code blocks
AsciiDoc .adoc, .asciidoc βœ… Supported Handles attributes, admonitions, and tables
Org Mode .org βœ… Supported Emacs Org headlines and property drawers
Djot .djot βœ… Supported Djot lightweight markup translation
Typst .typ βœ… Supported Emerging typesetting language support
Textile .textile βœ… Supported Textile markup to Markdown
Wiki Markup .mediawiki, .wiki βœ… Supported MediaWiki-style formatting
Mermaid .mmd, .mermaid βœ… Supported Diagram source preserved in fenced code blocks
Graphviz DOT .dot βœ… Supported Graph definitions retained for rendering
PlantUML .puml, .plantuml βœ… Supported UML diagrams emitted as fenced code
TikZ .tikz βœ… Supported LaTeX TikZ drawings preserved for reuse
MetaMD .metamd, .markdown βœ… Supported Round-trips existing MetaMD documents defined in docs/MetaMD.md

HTML Conversion Features (AngleSharp powered)

  • Headers (H1-H6) β†’ Markdown headers
  • Bold/Strong text β†’ bold
  • Italic/Emphasis text β†’ italic
  • Links β†’ text
  • Images β†’ alt
  • Lists (ordered/unordered)
  • Tables with header detection and Markdown table output
  • Code blocks and inline code
  • Blockquotes, sections, semantic containers

PDF Conversion Features

  • Text extraction with page separation
  • Header detection based on formatting
  • List item recognition
  • Title extraction from document content
  • Page snapshot artifacts ensure every page can be sent through AI enrichment (OCR, diagram-to-Mermaid, chart narration) even when the PDF exposes selectable text

Office Documents (DOCX/XLSX/PPTX)

  • Word (.docx): Headers, paragraphs, tables, bold/italic formatting, and embedded images captured for AI enrichment (OCR, Mermaid-ready diagrams)
  • Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
  • PowerPoint (.pptx): Slide-by-slide content with title recognition plus image artifacts primed for detailed AI captions and diagrams

CSV Conversion Features

  • Automatic table formatting with headers
  • Proper escaping of special characters
  • Support for various CSV dialects
  • Handles quoted fields and embedded commas

JSON Conversion Features

  • Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
  • JSON Lines Support: Processes .jsonl and .ndjson files line by line
  • Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
  • Nested Objects: Handles complex nested structures with proper indentation

XML Conversion Features

  • Structure Preservation: Maintains XML hierarchy as Markdown headings
  • Attributes Handling: Converts XML attributes to Markdown lists
  • Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
  • CDATA Support: Properly handles CDATA sections as code blocks

EPUB Conversion Features

  • Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
  • Content Order: Processes content files in proper reading order using spine information
  • HTML Processing: Converts XHTML content using the HTML converter
  • Table of Contents: Maintains document structure from the original EPUB

ZIP Archive Features

  • Recursive Processing: Extracts and converts all supported files within archives
  • Structure Preservation: Maintains original file paths and organization
  • Multi-Format Support: Processes different file types within the same archive
  • Error Handling: Continues processing even if individual files fail
  • Size Limits: Protects against memory issues with large files

Jupyter Notebook Conversion Features

  • Cell Type Support: Processes markdown, code, and raw cells appropriately
  • Metadata Extraction: Extracts notebook title, kernel information, and language details
  • Code Output Handling: Captures and formats execution results, streams, and errors
  • Syntax Highlighting: Preserves language information for proper code block formatting

RSS/Atom Feed Conversion Features

  • Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
  • Feed Metadata: Extracts title, description, last update date, and author information
  • Article Processing: Converts feed items with proper title linking and content formatting
  • Date Formatting: Normalizes publication dates across different feed formats

YouTube URL Conversion Features

  • URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
  • Metadata Extraction: Extracts video ID and URL parameters with descriptions
  • Embed Integration: Provides thumbnail images and multiple access methods
  • Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)

Image Conversion Features

  • Support for JPEG, PNG, GIF, BMP, TIFF, WebP
  • Exif metadata extraction via exiftool (optional)
  • Optional multimodal image captioning hook (LLM integration ready)
  • Graceful fallback when metadata/captioning unavailable

Audio Conversion Features

  • Handles WAV/MP3/M4A/MP4 containers
  • Extracts key metadata (artist, album, duration, channels, etc.)
  • Optional transcription delegate for speech-to-text results
  • Markdown summary highlighting metadata and transcript

πŸš€ Quick Start

Installation

Install via NuGet Package Manager:

# Package Manager Console
Install-Package ManagedCode.MarkItDown

# .NET CLI
dotnet add package ManagedCode.MarkItDown

# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.3" />

Prerequisites

  • .NET 9.0 SDK or later
  • Compatible with .NET 9 apps and libraries

Minimal usage

using MarkItDown;

var client = new MarkItDownClient();
await using var result = await client.ConvertAsync("document.pdf");

Console.WriteLine(result.Title);
Console.WriteLine(result.Markdown);

Convert a stream

await using var stream = File.OpenRead("invoice.html");
var info = new StreamInfo(extension: ".html", mimeType: "text/html");

await using var result = await client.ConvertAsync(stream, info);

Convert a URL

await using var result = await client.ConvertFromUrlAsync("https://contoso.example/blog");
Console.WriteLine(result.Markdown.Length);

Optional Dependencies for Advanced Features

  • PDF Support: Provided via PdfPig (bundled)
  • Office Documents: Provided via DocumentFormat.OpenXml (bundled)
  • Image metadata: Install ExifTool for richer output (brew install exiftool, choco install exiftool)
  • Image captions: Supply an ImageCaptioner delegate (e.g., calls to an LLM or vision service)
  • Audio transcription: Supply an AudioTranscriber delegate (e.g., Azure Cognitive Services, OpenAI Whisper)

Note: External tools are optionalβ€”MarkItDown degrades gracefully when they are absent.

πŸ’» Usage

  • ConvertAsync(string path) converts any supported file on disk and returns a DocumentConverterResult.
  • ConvertAsync(Stream stream, StreamInfo info) handles non-seekable or remote streams once you supply basic metadata (extension/MIME type).
  • ConvertFromUrlAsync(string url) downloads HTTP(S) content using the optional HttpClient you pass into the constructor.
  • Always dispose the result (await using var result = …) so temporary workspaces and artifacts are cleaned up.
  • DocumentConverterResult exposes Markdown, Title, Segments, Artifacts, and Metadata for downstream processing.
  • Apply custom behaviour through MarkItDownOptions (segment settings, AI providers, middleware) when constructing the client.

CLI

Prefer a guided experience? Run the bundled CLI to batch files or URLs:

dotnet run --project src/MarkItDown.Cli -- path/to/input

Use dotnet publish with your preferred runtime identifier if you need a self-contained binary.

πŸ—οΈ Architecture

Core Components

  • MarkItDown - Main entry point for conversions
  • IDocumentConverter - Interface for format-specific converters
  • DocumentConverterResult - Contains the aggregate Markdown plus structured DocumentSegment entries
  • StreamInfo - Metadata about the input stream (MIME type, extension, charset, etc.)
  • ConverterRegistration - Associates converters with priority for selection

ℹ️ MIME detection and normalization rely on ManagedCode.MimeTypes.

Built-in Converters

MarkItDown includes these converters in priority order:

  • YouTubeUrlConverter - Video metadata from YouTube URLs
  • HtmlConverter - HTML to Markdown using AngleSharp
  • WikipediaConverter - Clean article extraction from Wikipedia pages
  • BingSerpConverter - Search result summaries from Bing
  • RssFeedConverter - RSS/Atom feeds with article processing
  • JsonConverter - Structured JSON data with formatting
  • JupyterNotebookConverter - Python notebooks with code and markdown cells
  • CsvConverter - CSV files as Markdown tables
  • EpubConverter - E-book content and metadata
  • EmlConverter - Email files with headers and attachments
  • XmlConverter - XML documents with structure preservation
  • ZipConverter - Archive processing with recursive conversion
  • PdfConverter - PDF text extraction using PdfPig
  • DocxConverter - Microsoft Word documents
  • XlsxConverter - Microsoft Excel spreadsheets
  • PptxConverter - Microsoft PowerPoint presentations
  • AudioConverter - Audio metadata and optional transcription
  • ImageConverter - Image metadata via ExifTool and optional captions
  • PlainTextConverter - Plain text, Markdown, and other text formats (fallback)

Structured Segments & Metadata

Every conversion populates DocumentConverterResult.Segments with strongly typed DocumentSegment instances. Segments preserve natural breakpoints (pages, slides, sheets, archive entries, audio ranges) alongside rich metadata:

  • Type and Number expose what the segment represents (for example page/slide numbers)
  • Label carries human-readable descriptors when available
  • StartTime/EndTime capture media timelines for audio/video content
  • AdditionalMetadata holds contextual properties such as archive entry paths or sheet names
var result = await markItDown.ConvertAsync("report.pdf");

foreach (var segment in result.Segments)
{
    Console.WriteLine($"[{segment.Type}] #{segment.Number}: {segment.Label}");
}

Runtime behaviour is controlled through SegmentOptions on MarkItDownOptions. Enabling IncludeSegmentMetadataInMarkdown emits inline annotations like [page:1], [sheet:Sales], or [timecode:00:01:00-00:02:00] directly in the Markdown stream. Audio transcripts honour Segments.Audio.SegmentDuration, while still collapsing short transcripts into a single, time-aware slice.

Cloud Intelligence Providers

MarkItDown exposes optional abstractions for running documents through cloud services:

  • IDocumentIntelligenceProvider – structured page, table, and layout extraction.
  • IImageUnderstandingProvider – OCR, captioning, and object detection for embedded images.
  • IMediaTranscriptionProvider – timed transcripts for audio and video inputs.

The AzureIntelligenceOptions, GoogleIntelligenceOptions, and AwsIntelligenceOptions helpers wire the respective cloud Document AI/Vision/Speech stacks without forcing the dependency on consumers. You can still bring your own implementation by assigning the provider interfaces directly on MarkItDownOptions.

MarkItDownClient emits structured ILogger events and OpenTelemetry spans by default. Toggle instrumentation with MarkItDownOptions.EnableTelemetry, supply a custom ActivitySource/Meter, or provide a LoggerFactory to integrate with your application's logging pipeline.

Azure AI setup (keys and managed identity)
  • Docs: Document Intelligence, Computer Vision Image Analysis, Video Indexer authentication.

  • API keys / connection strings: store your Cognitive Services key in configuration (for example appsettings.json or an Azure App Configuration connection string) and hydrate the options:

    var configuration = host.Services.GetRequiredService<IConfiguration>();
    
    var azureOptions = new AzureIntelligenceOptions
    {
        DocumentIntelligence = new AzureDocumentIntelligenceOptions
        {
            Endpoint = configuration["Azure:DocumentIntelligence:Endpoint"],
            ApiKey = configuration.GetConnectionString("AzureDocumentIntelligenceKey"),
            ModelId = "prebuilt-layout"
        },
        Vision = new AzureVisionOptions
        {
            Endpoint = configuration["Azure:Vision:Endpoint"],
            ApiKey = configuration.GetConnectionString("AzureVisionKey")
        },
        Media = new AzureMediaIntelligenceOptions
        {
            AccountId = configuration["Azure:VideoIndexer:AccountId"],
            AccountName = configuration["Azure:VideoIndexer:AccountName"],
            Location = configuration["Azure:VideoIndexer:Location"],
            SubscriptionId = configuration["Azure:VideoIndexer:SubscriptionId"],
            ResourceGroup = configuration["Azure:VideoIndexer:ResourceGroup"],
            ResourceId = configuration["Azure:VideoIndexer:ResourceId"],
            ArmAccessToken = configuration.GetConnectionString("AzureVideoIndexerArmToken")
        }
    };
    
  • Managed identity: omit the ApiKey/ArmAccessToken properties and the providers automatically fall back to DefaultAzureCredential. Assign the managed identity the Cognitive Services User role for Document Intelligence and Vision, and follow the Video Indexer managed identity instructions to authorize uploads.

  • Video Indexer tips: Video uploads require both the Video Indexer account (ID + region) and either the full resource ID or the trio of subscription id/resource group/account name, plus an ARM token or Azure AD identity with Contributor access on the Video Indexer resource. The interactive CLI exposes dedicated prompts for these values under β€œConfigure cloud providers”.

    var azureOptions = new AzureIntelligenceOptions
    {
        DocumentIntelligence = new AzureDocumentIntelligenceOptions
        {
            Endpoint = "https://contoso.cognitiveservices.azure.com/"
        },
        Vision = new AzureVisionOptions
        {
            Endpoint = "https://contoso.cognitiveservices.azure.com/"
        },
        Media = new AzureMediaIntelligenceOptions
        {
            AccountId = "<video-indexer-account-id>",
            AccountName = "<video-indexer-account-name>",
            Location = "trial"
        }
    };
    
Google Cloud setup
  • Docs: Document AI, Vision API, Speech-to-Text.

  • Service account JSON / ADC: place your service account JSON on disk or load it from Secret Manager, then point the options at it (or provide a GoogleCredential instance). If CredentialsPath/JsonCredentials/Credential are omitted the providers use Application Default Credentials:

    var googleOptions = new GoogleIntelligenceOptions
    {
        DocumentIntelligence = new GoogleDocumentIntelligenceOptions
        {
            ProjectId = "my-project",
            Location = "us",
            ProcessorId = "processor-id",
            CredentialsPath = Environment.GetEnvironmentVariable("GOOGLE_APPLICATION_CREDENTIALS")
        },
        Vision = new GoogleVisionOptions
        {
            JsonCredentials = Environment.GetEnvironmentVariable("GOOGLE_VISION_JSON")
        },
        Media = new GoogleMediaIntelligenceOptions
        {
            Credential = GoogleCredential.GetApplicationDefault(),
            LanguageCode = "en-US"
        }
    };
    
  • Workload identity / managed identities: host the app on GKE, Cloud Run, or Cloud Functions with Workload Identity Federation. The Google SDK automatic credential chain will pick up the ambient identity and the providers will work without JSON keys.

AWS setup
  • Docs: Textract, Rekognition, Transcribe, .NET credential management.

  • Access keys / connection strings: populate the options directly from configuration when you must supply static credentials (for example from AWS Secrets Manager or an encrypted connection string):

    var awsOptions = new AwsIntelligenceOptions
    {
        DocumentIntelligence = new AwsDocumentIntelligenceOptions
        {
            AccessKeyId = configuration["AWS:AccessKeyId"],
            SecretAccessKey = configuration["AWS:SecretAccessKey"],
            Region = configuration.GetValue<string>("AWS:Region")
        },
        Vision = new AwsVisionOptions
        {
            AccessKeyId = configuration["AWS:AccessKeyId"],
            SecretAccessKey = configuration["AWS:SecretAccessKey"],
            Region = configuration.GetValue<string>("AWS:Region"),
            MinConfidence = 80f
        },
        Media = new AwsMediaIntelligenceOptions
        {
            AccessKeyId = configuration["AWS:AccessKeyId"],
            SecretAccessKey = configuration["AWS:SecretAccessKey"],
            Region = configuration.GetValue<string>("AWS:Region"),
            InputBucketName = configuration["AWS:Transcribe:InputBucket"],
            OutputBucketName = configuration["AWS:Transcribe:OutputBucket"]
        }
    };
    
  • IAM roles / AWS managed identity: leave the credential fields null to use the default AWS credential chain (environment variables, shared credentials file, EC2/ECS/EKS IAM roles, or AWS SSO). Ensure the execution role has permissions for textract:AnalyzeDocument, rekognition:DetectLabels, rekognition:DetectText, transcribe:StartTranscriptionJob, and S3 access for the specified buckets.

YouTube metadata & captions
  • Docs: YoutubeExplode (used under the hood).

  • Out of the box: YouTubeUrlConverter now enriches Markdown with title, channel, stats, thumbnails, and (when available) auto-generated captions laid out as timecoded segments.

  • Custom provider: supply MarkItDownOptions.YouTubeMetadataProvider to disable network access, inject caching, or swap to an alternative implementation.

    var options = new MarkItDownOptions
    {
        YouTubeMetadataProvider = new YoutubeExplodeMetadataProvider(), // default
        // You can plug in a stub or caching decorator instead:
        // YouTubeMetadataProvider = new MyCachedYouTubeProvider(inner: new YoutubeExplodeMetadataProvider())
    };
    

    When a provider returns null the converter falls back to URL-derived metadata, so YouTube support remains fully optional.

For LLM-style post-processing, assign MarkItDownOptions.AiModels with an IAiModelProvider. The built-in StaticAiModelProvider accepts Microsoft.Extensions.AI clients (chat models, speech-to-text, etc.), enabling you to share application-wide model builders.

Converter Priority & Detection

  • Priority-based dispatch (lower values processed first)
  • Automatic stream sniffing via StreamInfoGuesser
  • Manual overrides via MarkItDownOptions or StreamInfo

🚨 Error Handling & Troubleshooting

Common Exceptions

using MarkItDown;

var markItDown = new MarkItDownClient();

try
{
    var result = await markItDown.ConvertAsync("document.pdf");
    Console.WriteLine(result.Markdown);
}
catch (UnsupportedFormatException ex)
{
    // File format not supported by any converter
    Console.WriteLine($"Cannot process this file type: {ex.Message}");
}
catch (FileNotFoundException ex)
{
    // File path doesn't exist
    Console.WriteLine($"File not found: {ex.Message}");
}
catch (UnauthorizedAccessException ex)
{
    // Permission issues
    Console.WriteLine($"Access denied: {ex.Message}");
}
catch (MarkItDownException ex)
{
    // General conversion errors (corrupt files, parsing issues, etc.)
    Console.WriteLine($"Conversion failed: {ex.Message}");
    if (ex.InnerException != null)
        Console.WriteLine($"Details: {ex.InnerException.Message}");
}

Troubleshooting Tips

File Format Detection Issues:

// Force specific format detection
var streamInfo = new StreamInfo(
    mimeType: "application/pdf",  // Explicit MIME type
    extension: ".pdf",            // Explicit extension
    fileName: "document.pdf"      // Original filename
);

var result = await markItDown.ConvertAsync(stream, streamInfo);

Memory Issues with Large Files:

// Use cancellation tokens to prevent runaway processing
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

try 
{
    var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine("Conversion timed out - file may be too large or complex");
}

Network Issues (URLs):

// Configure HttpClient for better reliability
using var httpClient = new HttpClient();
httpClient.Timeout = TimeSpan.FromSeconds(30);
httpClient.DefaultRequestHeaders.Add("User-Agent", "MarkItDown/1.0");

var markItDown = new MarkItDownClient(httpClient: httpClient);

Logging for Diagnostics:

using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder => 
    builder.AddConsole().SetMinimumLevel(LogLevel.Debug));

var logger = loggerFactory.CreateLogger<MarkItDown>();
var markItDown = new MarkItDownClient(logger: logger);

// Now you'll see detailed conversion progress in console output

πŸ”„ Development & Contributing

Migration from Python MarkItDown

If you're familiar with the original Python library, here are the key differences:

Python C#/.NET Notes
MarkItDownClient() new MarkItDownClient() Similar constructor
markitdown.convert("file.pdf") await markItDown.ConvertAsync("file.pdf") Async pattern
markitdown.convert(stream, file_extension=".pdf") await markItDown.ConvertAsync(stream, streamInfo) StreamInfo object
markitdown.convert_url("https://...") await markItDown.ConvertFromUrlAsync("https://...") Async URL conversion
llm_client=... parameter ImageCaptioner, AudioTranscriber delegates More flexible callback system
Plugin system Not yet implemented Planned for future release

Example Migration:

# Python version
import markitdown
md = markitdown.MarkItDownClient()
result = md.convert("document.pdf")
print(result.text_content)
// C# version  
using MarkItDown;
var markItDown = new MarkItDownClient();
var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);

.NET SDK Setup

MarkItDown targets .NET 9.0. If your environment does not have the required SDK, run the helper script once:

./eng/install-dotnet.sh

The script installs the SDK into ~/.dotnet using the official dotnet-install bootstrapper and prints the environment variables to add to your shell profile so the dotnet CLI is available on subsequent sessions.

Building from Source

# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown

# Build the solution
dotnet build

# Run tests
dotnet test

# Create NuGet package
dotnet pack --configuration Release

Tests & Coverage

dotnet test --collect:"XPlat Code Coverage"

The command emits standard test results plus a Cobertura coverage report at tests/MarkItDown.Tests/TestResults/<guid>/coverage.cobertura.xml. Tools such as ReportGenerator can turn this into HTML or Markdown dashboards.

βœ… The regression suite now exercises DOCX and PPTX conversions with embedded imagery, ensuring conversion middleware runs and enriched descriptions remain attached to the composed Markdown.

βœ… Additional image-placement regressions verify that AI-generated captions are injected immediately after each source placeholder for DOCX, PPTX, and PDF outputs.

Project Structure

β”œβ”€β”€ src/
β”‚   └── MarkItDown/                 # Core library
β”‚       β”œβ”€β”€ Converters/             # Format-specific converters (HTML, PDF, audio, etc.)
β”‚       β”œβ”€β”€ MarkItDown.cs          # Main conversion engine
β”‚       β”œβ”€β”€ StreamInfoGuesser.cs   # MIME/charset/extension detection helpers
β”‚       β”œβ”€β”€ MarkItDownOptions.cs   # Runtime configuration flags
β”‚       └── ...                    # Shared utilities (UriUtilities, MimeMapping, etc.)
β”œβ”€β”€ tests/
β”‚   └── MarkItDown.Tests/          # xUnit + Shouldly tests, Python parity vectors
β”œβ”€β”€ Directory.Build.props          # Shared build + packaging settings
└── README.md                      # This document

Contributing Guidelines

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/my-feature).
  3. Add tests with xUnit/Shouldly mirroring relevant Python vectors.
  4. Run dotnet test (CI enforces green builds + coverage upload).
  5. Update docs or samples if behaviour changes.
  6. Submit a pull request for review.

πŸ—ΊοΈ Roadmap

🎯 Near-Term

  • Azure Document Intelligence converter (options already scaffolded)
  • Outlook .msg ingestion via MIT-friendly dependencies
  • Performance optimizations and memory usage improvements
  • Enhanced test coverage mirroring Python test vectors

🎯 Future Ideas

  • Plugin discovery & sandboxing for custom converters
  • Built-in LLM caption/transcription providers (OpenAI, Azure AI)
  • Incremental/streaming conversion APIs for large documents
  • Cloud-native integration samples (Azure Functions, AWS Lambda)
  • Command-line interface (CLI) for batch processing

πŸ“ˆ Performance

MarkItDown is designed for high-performance document processing in production environments:

πŸš€ Performance Characteristics

Feature Benefit Impact
Stream-based processing No temporary files created Faster I/O, lower disk usage
Async/await throughout Non-blocking operations Better scalability, responsive UIs
Memory efficient Smart buffer reuse Lower memory footprint for large documents
Fast format detection Lightweight MIME/extension sniffing Quick routing to appropriate converter
Parallel processing ready Thread-safe converter instances Handle multiple documents concurrently

πŸ“Š Performance Considerations

MarkItDown's performance depends on:

  • Document size and complexity - Larger files with more formatting take longer to process
  • File format - Some formats (like PDF) require more processing than others (like plain text)
  • Available system resources - Memory, CPU, and I/O capabilities
  • Optional services - Image captioning and audio transcription add processing time

Performance will vary based on your specific documents and environment. For production workloads, we recommend benchmarking with your actual document types and sizes.

⚑ Optimization Tips

// 1. Reuse MarkItDown instances (they're thread-safe)
var markItDown = new MarkItDownClient();
await Task.WhenAll(
    markItDown.ConvertAsync("file1.pdf"),
    markItDown.ConvertAsync("file2.docx"),
    markItDown.ConvertAsync("file3.html")
);

// 2. Use cancellation tokens for timeouts
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);

// 3. Configure HttpClient for web content (reuse connections)
using var httpClient = new HttpClient();
var markItDown = new MarkItDownClient(httpClient: httpClient);

// 4. Pre-specify StreamInfo to skip format detection
var streamInfo = new StreamInfo(mimeType: "application/pdf", extension: ".pdf");
var result = await markItDown.ConvertAsync(stream, streamInfo);

πŸ”§ Configuration

Basic Configuration

var options = new MarkItDownOptions
{
    EnableBuiltins = true,      // Use built-in converters (default: true)
    EnablePlugins = false,      // Plugin system (reserved for future use)
    ExifToolPath = "/usr/local/bin/exiftool"  // Path to exiftool binary (optional)
};

var markItDown = new MarkItDownClient(options);

Advanced AI Integration

using Azure;
using OpenAI;

var openAIChatClient = new MyChatClient(); // IChatClient from Microsoft.Extensions.AI
var whisperSpeechClient = new MySpeechToTextClient(); // ISpeechToTextClient from Microsoft.Extensions.AI

var options = new MarkItDownOptions
{
    AiModels = new StaticAiModelProvider(openAIChatClient, whisperSpeechClient),

    AzureIntelligence = new AzureIntelligenceOptions
    {
        DocumentIntelligence = new AzureDocumentIntelligenceOptions
        {
            Endpoint = "https://your-document-intelligence.cognitiveservices.azure.com/",
            ApiKey = "<document-intelligence-key>"
        },
        Vision = new AzureVisionOptions
        {
            Endpoint = "https://your-computervision.cognitiveservices.azure.com/",
            ApiKey = "<vision-key>"
        }
    }
};

var markItDown = new MarkItDownClient(options);

Conversion Middleware & Raw Artifacts

Every conversion now exposes the raw extraction artifacts that feed the Markdown composer. Use DocumentConverterResult.Artifacts to inspect page text, tables, or embedded images before they are flattened into Markdown. You can plug additional processing by registering IConversionMiddleware instances through MarkItDownOptions.ConversionMiddleware. Middleware executes after extraction and can mutate segments, enrich metadata, or call external AI services. When an IChatClient is supplied and EnableAiImageEnrichment remains true (default), MarkItDown automatically adds the built-in AiImageEnrichmentMiddleware to describe charts, diagrams, and other visuals. The middleware keeps enriched prose anchored to the exact Markdown placeholder emitted during extraction, ensuring captions, Mermaid diagrams, and OCR text land beside the original image instead of drifting to the end of the section.

var options = new MarkItDownOptions
{
    AiModels = new StaticAiModelProvider(chatClient: myChatClient, speechToTextClient: null),
    ConversionMiddleware = new IConversionMiddleware[]
    {
        new MyDomainSpecificMiddleware()
    }
};

var markItDown = new MarkItDownClient(options);
var result = await markItDown.ConvertAsync("docs/diagram.docx");

foreach (var image in result.Artifacts.Images)
{
    Console.WriteLine($"Image {image.Label}: {image.DetailedDescription}");
}

Set EnableAiImageEnrichment to false when you need a completely custom pipeline with no default AI step.

Production Configuration with Error Handling

using Microsoft.Extensions.Logging;
using Microsoft.Extensions.DependencyInjection;

// Set up dependency injection
var services = new ServiceCollection();
services.AddLogging(builder => builder.AddConsole().SetMinimumLevel(LogLevel.Information));
services.AddHttpClient();

var serviceProvider = services.BuildServiceProvider();
var logger = serviceProvider.GetRequiredService<ILogger<MarkItDown>>();
var httpClientFactory = serviceProvider.GetRequiredService<IHttpClientFactory>();

var options = new MarkItDownOptions
{
    // Graceful degradation for image processing
    ImageCaptioner = async (bytes, info, token) =>
    {
        try
        {
            // Your AI service call here
            return await CallVisionServiceAsync(bytes, token);
        }
        catch (Exception ex)
        {
            logger.LogWarning("Image captioning failed: {Error}", ex.Message);
            return $"[Image: {info.FileName ?? "unknown"}]";  // Fallback
        }
    }
};

var markItDown = new MarkItDownClient(options, logger, httpClientFactory.CreateClient());

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

This project is a comprehensive C# port of the original Microsoft MarkItDown Python library, created by the Microsoft AutoGen team. We've reimagined it specifically for the .NET ecosystem while maintaining compatibility with the original's design philosophy and capabilities.

Key differences in this .NET version:

  • 🎯 Native .NET performance - Built from scratch in C#, not a Python wrapper
  • πŸ”„ Modern async patterns - Full async/await support with cancellation tokens
  • πŸ“¦ NuGet ecosystem integration - Easy installation and dependency management
  • πŸ› οΈ Enterprise features - Comprehensive logging, error handling, and configuration
  • πŸš€ Enhanced performance - Stream-based processing and memory optimizations

Maintained by: ManagedCode team
Original inspiration: Microsoft AutoGen team
License: MIT (same as the original Python version)

We're committed to maintaining feature parity with the upstream Python project while delivering the performance and developer experience that .NET developers expect.

πŸ“ž Support


<div align="center">

⭐ Star this repository if you find it useful!

Made with ❀️ by ManagedCode

</div>

Product Compatible and additional computed target framework versions.
.NET net9.0 is compatible.  net9.0-android was computed.  net9.0-browser was computed.  net9.0-ios was computed.  net9.0-maccatalyst was computed.  net9.0-macos was computed.  net9.0-tvos was computed.  net9.0-windows was computed.  net10.0 was computed.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
0.0.5 561 10/17/2025
0.0.4 674 9/28/2025
0.0.3 421 9/24/2025
0.0.2 161 9/24/2025
0.0.1 162 9/20/2025