ManagedCode.MarkItDown 0.0.1


There is a newer version of this package available.
See the version list below for details.


MarkItDown

A modern C#/.NET library for converting a wide range of document formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, etc.) into high-quality Markdown suitable for Large Language Models (LLMs), search indexing, and text analytics. The project mirrors the original Microsoft Python implementation while embracing .NET idioms, async APIs, and new integrations.

Table of Contents

Features
Format Support
Quick Start
Usage
Architecture
Development & Contributing
Roadmap
Performance
Configuration
License
Acknowledgments
Support

Features

✨ Modern .NET - Targets .NET 9.0 with up-to-date language features
📦 NuGet Package - Drop-in dependency for libraries and automation pipelines
🔄 Async/Await - Fully asynchronous pipeline for responsive apps
🧠 LLM-Optimized - Markdown tailored for AI ingestion and summarisation
🔧 Extensible - Register custom converters or plug additional caption/transcription services
🧭 Smart Detection - Automatic MIME, charset, and file-type guessing (including data/file URIs)
⚡ High Performance - Stream-friendly, minimal allocations, zero temp files

📋 Format Support

| Format | Extension | Status | Description |
| --- | --- | --- | --- |
| HTML | `.html`, `.htm` | ✅ Supported | Full HTML to Markdown conversion |
| Plain Text | `.txt`, `.md` | ✅ Supported | Direct text processing |
| PDF | `.pdf` | ✅ Supported | Adobe PDF documents with text extraction |
| Word | `.docx` | ✅ Supported | Microsoft Word documents with formatting |
| Excel | `.xlsx` | ✅ Supported | Microsoft Excel spreadsheets as tables |
| PowerPoint | `.pptx` | ✅ Supported | Microsoft PowerPoint presentations |
| Images | `.jpg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp` | ✅ Supported | Exif metadata extraction + optional captions |
| Audio | `.wav`, `.mp3`, `.m4a`, `.mp4` | ✅ Supported | Metadata extraction + optional transcription |
| CSV | `.csv` | ✅ Supported | Comma-separated values as Markdown tables |
| JSON | `.json`, `.jsonl`, `.ndjson` | ✅ Supported | Structured JSON data with formatting |
| XML | `.xml`, `.xsd`, `.xsl`, `.rss`, `.atom` | ✅ Supported | XML documents with structure preservation |
| EPUB | `.epub` | ✅ Supported | E-book files with metadata and content |
| ZIP | `.zip` | ✅ Supported | Archive processing with recursive file conversion |
| Jupyter Notebook | `.ipynb` | ✅ Supported | Python notebooks with code and markdown cells |
| RSS/Atom Feeds | `.rss`, `.atom`, `.xml` | ✅ Supported | Web feeds with structured content and metadata |
| YouTube URLs | YouTube links | ✅ Supported | Video metadata extraction and link formatting |
| Wikipedia Pages | wikipedia.org | ✅ Supported | Article-only extraction with clean Markdown |
| Bing SERPs | bing.com/search | ✅ Supported | Organic result summarisation |

HTML Conversion Features (AngleSharp powered)

Headers (H1-H6) → Markdown headers
Bold/Strong text → `**bold**`
Italic/Emphasis text → `*italic*`
Links → `[text](url)`
Images → `![alt](src)`
Lists (ordered/unordered)
Tables with header detection and Markdown table output
Code blocks and inline code
Blockquotes, sections, semantic containers

PDF Conversion Features

Text extraction with page separation
Header detection based on formatting
List item recognition
Title extraction from document content

Office Documents (DOCX/XLSX/PPTX)

Word (.docx): Headers, paragraphs, tables, bold/italic formatting
Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
PowerPoint (.pptx): Slide-by-slide content with title recognition

CSV Conversion Features

Automatic table formatting with headers
Proper escaping of special characters
Support for various CSV dialects
Handles quoted fields and embedded commas
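
To make the quoting and escaping behaviour above concrete, here is a minimal sketch of turning CSV lines into a Markdown table. It is not the library's `CsvConverter`; the `CsvMarkdownSketch` class and its method names are hypothetical, it assumes a comma delimiter and a header in the first line, and it does not handle quoted fields that span multiple lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; not part of the ManagedCode.MarkItDown API.
public static class CsvMarkdownSketch
{
    // Splits one CSV line, honouring double-quoted fields and "" escapes.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        var inQuotes = false;

        for (var i = 0; i < line.Length; i++)
        {
            var c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // "" inside quotes is a literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas inside quotes are kept as data
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());
        return fields;
    }

    // Escapes backslashes and pipes so a value cannot break a table cell.
    private static string EscapeCell(string value) =>
        value.Replace("\\", "\\\\").Replace("|", "\\|").Trim();

    private static string FormatRow(IEnumerable<string> cells) =>
        "| " + string.Join(" | ", cells) + " |\n";

    // Treats the first non-empty line as the header row.
    public static string ToMarkdownTable(IEnumerable<string> lines)
    {
        var rows = lines.Where(l => l.Length > 0).Select(SplitLine).ToList();
        if (rows.Count == 0) return string.Empty;

        var sb = new StringBuilder();
        sb.Append(FormatRow(rows[0].Select(EscapeCell)));
        sb.Append(FormatRow(rows[0].Select(_ => "---")));
        foreach (var row in rows.Skip(1))
            sb.Append(FormatRow(row.Select(EscapeCell)));
        return sb.ToString();
    }
}
```

Calling `CsvMarkdownSketch.ToMarkdownTable` with the lines `name,note` and `"Smith, J","a|b"` yields a two-column table in which the embedded comma stays inside the first cell and the pipe is emitted as `\|`.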

JSON Conversion Features

Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
JSON Lines Support: Processes .jsonl and .ndjson files line by line
Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
Nested Objects: Handles complex nested structures with proper indentation

XML Conversion Features

Structure Preservation: Maintains XML hierarchy as Markdown headings
Attributes Handling: Converts XML attributes to Markdown lists
Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
CDATA Support: Properly handles CDATA sections as code blocks

EPUB Conversion Features

Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
Content Order: Processes content files in proper reading order using spine information
HTML Processing: Converts XHTML content using the HTML converter
Table of Contents: Maintains document structure from the original EPUB

ZIP Archive Features

Recursive Processing: Extracts and converts all supported files within archives
Structure Preservation: Maintains original file paths and organization
Multi-Format Support: Processes different file types within the same archive
Error Handling: Continues processing even if individual files fail
Size Limits: Protects against memory issues with large files

Jupyter Notebook Conversion Features

Cell Type Support: Processes markdown, code, and raw cells appropriately
Metadata Extraction: Extracts notebook title, kernel information, and language details
Code Output Handling: Captures and formats execution results, streams, and errors
Syntax Highlighting: Preserves language information for proper code block formatting

RSS/Atom Feed Conversion Features

Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
Feed Metadata: Extracts title, description, last update date, and author information
Article Processing: Converts feed items with proper title linking and content formatting
Date Formatting: Normalizes publication dates across different feed formats

YouTube URL Conversion Features

URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
Metadata Extraction: Extracts video ID and URL parameters with descriptions
Embed Integration: Provides thumbnail images and multiple access methods
Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)
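
As a rough illustration of the URL recognition described above, the sketch below extracts a video ID from the two URL shapes the list mentions. The `YouTubeUrlSketch` class is hypothetical and only uses `System.Uri` and `System.Web.HttpUtility` from the base class library; how the library's own converter parses URLs is not shown here.

```csharp
#nullable enable
using System;
using System.Web; // HttpUtility.ParseQueryString

// Hypothetical helper for illustration; not part of the ManagedCode.MarkItDown API.
public static class YouTubeUrlSketch
{
    // Returns the video ID for youtube.com/watch?v=... and youtu.be/... URLs, or null.
    public static string? TryGetVideoId(string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return null;

        var host = uri.Host.ToLowerInvariant();

        // Shortened form: the ID is the path segment.
        if (host == "youtu.be")
        {
            var id = uri.AbsolutePath.Trim('/');
            return id.Length > 0 ? id : null;
        }

        // Standard form: the ID is the "v" query parameter.
        if (host == "youtube.com" || host.EndsWith(".youtube.com"))
        {
            var query = HttpUtility.ParseQueryString(uri.Query);
            return query["v"];
        }

        return null;
    }
}
```

Other parameters such as a playlist or timestamp could be read from the same parsed query collection; which keys the converter actually decodes is not specified in this README.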

Image Conversion Features

Support for JPEG, PNG, GIF, BMP, TIFF, WebP
Exif metadata extraction via exiftool (optional)
Optional multimodal image captioning hook (LLM integration ready)
Graceful fallback when metadata/captioning unavailable

Audio Conversion Features

Handles WAV/MP3/M4A/MP4 containers
Extracts key metadata (artist, album, duration, channels, etc.)
Optional transcription delegate for speech-to-text results
Markdown summary highlighting metadata and transcript

🚀 Quick Start

Installation

Install via NuGet Package Manager:

# Package Manager Console
Install-Package ManagedCode.MarkItDown

# .NET CLI
dotnet add package ManagedCode.MarkItDown

# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.1" />

Prerequisites

.NET 9.0 SDK or later
Compatible with .NET 9 apps and libraries

Optional Dependencies for Advanced Features

PDF Support: Provided via PdfPig (bundled)
Office Documents: Provided via DocumentFormat.OpenXml (bundled)
Image metadata: Install ExifTool for richer output (brew install exiftool, choco install exiftool)
Image captions: Supply an ImageCaptioner delegate (e.g., calls to an LLM or vision service)
Audio transcription: Supply an AudioTranscriber delegate (e.g., Azure Cognitive Services, OpenAI Whisper)

Note: External tools are optional—MarkItDown degrades gracefully when they are absent.

💻 Usage

Basic API Usage

using MarkItDown.Core;

// Simple conversion
var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync("document.html");
Console.WriteLine(result.Markdown);

Advanced Usage with Logging

using MarkItDown.Core;
using Microsoft.Extensions.Logging;

// With logging and HTTP client for web content
using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<Program>();

using var httpClient = new HttpClient();
var markItDown = new MarkItDown(logger, httpClient);

// Convert from file
var fileResult = await markItDown.ConvertAsync("document.html");

// Convert from URL
var urlResult = await markItDown.ConvertFromUrlAsync("https://example.com");

// Convert from URI (file:, data:, http:, https:); this payload is Base64 for <h1>Hello</h1>
var dataResult = await markItDown.ConvertUriAsync("data:text/html;base64,PGgxPkhlbGxvPC9oMT4=");

// Convert from stream with optional overrides
using var stream = File.OpenRead("document.html");
var streamInfo = new StreamInfo(mimeType: "text/html", extension: ".html");
var streamResult = await markItDown.ConvertAsync(stream, streamInfo);

Custom Converters

Create your own format converters by implementing IDocumentConverter:

using MarkItDown.Core;

public class MyCustomConverter : IDocumentConverter
{
    public bool Accepts(Stream stream, StreamInfo streamInfo, CancellationToken cancellationToken = default)
    {
        return streamInfo.Extension == ".mycustomformat";
    }

    public Task<DocumentConverterResult> ConvertAsync(
        Stream stream,
        StreamInfo streamInfo,
        CancellationToken cancellationToken = default)
    {
        // Your conversion logic here. Nothing is awaited in this stub, so the
        // result is wrapped in a completed task instead of marking the method async.
        var markdown = "# Converted from custom format\n\nContent here...";
        return Task.FromResult(new DocumentConverterResult(markdown, "Document Title"));
    }
}

// Register the custom converter
var markItDown = new MarkItDown();
markItDown.RegisterConverter(new MyCustomConverter(), ConverterPriority.SpecificFileFormat);

🏗️ Architecture

Core Components

MarkItDown - Main entry point for conversions
IDocumentConverter - Interface for format-specific converters
DocumentConverterResult - Contains the converted Markdown and optional metadata
StreamInfo - Metadata about the input stream (MIME type, extension, charset, etc.)
ConverterRegistration - Associates converters with priority for selection

Built-in Converters

PlainTextConverter - Handles text, JSON, NDJSON, Markdown, etc.
HtmlConverter - Converts HTML to Markdown using AngleSharp
PdfConverter - PdfPig-based extraction with Markdown heuristics
Docx/Xlsx/Pptx Converters - Office Open XML processing
ImageConverter - Exif metadata + optional captions
AudioConverter - Metadata + optional transcription
WikipediaConverter - Article-only extraction from Wikipedia
BingSerpConverter - Summaries for Bing search result pages
YouTubeUrlConverter - Video metadata markdown
ZipConverter - Recursive archive handling
RssFeedConverter, JsonConverter, CsvConverter, XmlConverter, JupyterNotebookConverter, EpubConverter

Converter Priority & Detection

Priority-based dispatch (lower values processed first)
Automatic stream sniffing via StreamInfoGuesser
Manual overrides via MarkItDownOptions or StreamInfo
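
The selection rule above (lower priority values are tried first, and the first converter that accepts the input wins) can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: `SketchConverter` and `SketchRegistry` are hypothetical stand-ins, acceptance is reduced to a check on the file extension, and ties are resolved in registration order, which the README does not specify.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a converter; the real IDocumentConverter inspects a stream and StreamInfo.
public sealed class SketchConverter
{
    private readonly Func<string, bool> _accepts;

    public SketchConverter(string name, Func<string, bool> accepts)
    {
        Name = name;
        _accepts = accepts;
    }

    public string Name { get; }

    public bool Accepts(string extension) => _accepts(extension);
}

// Hypothetical registry that pairs each converter with a numeric priority.
public sealed class SketchRegistry
{
    private readonly List<(double Priority, SketchConverter Converter)> _entries = new();

    public void Register(SketchConverter converter, double priority) =>
        _entries.Add((priority, converter));

    // Lower priority values are tried first. OrderBy is a stable sort, so
    // converters with equal priority keep their registration order (an assumption).
    public SketchConverter? Resolve(string extension) =>
        _entries
            .OrderBy(e => e.Priority)
            .Select(e => e.Converter)
            .FirstOrDefault(c => c.Accepts(extension));
}
```

With a catch-all text converter registered at priority 10 and an HTML converter at priority 0 (both values illustrative), `Resolve(".html")` returns the HTML converter even though the catch-all was registered first, while `Resolve(".log")` falls through to the catch-all.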

🔄 Development & Contributing

Building from Source

# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown

# Build the solution
dotnet build

# Run tests
dotnet test

# Create NuGet package
dotnet pack --configuration Release

Project Structure

├── src/
│   ├── MarkItDown.Core/           # Core library
│   │   ├── Converters/            # Format-specific converters (HTML, PDF, audio, etc.)
│   │   ├── MarkItDown.cs          # Main conversion engine
│   │   ├── StreamInfoGuesser.cs   # MIME/charset/extension detection helpers
│   │   ├── MarkItDownOptions.cs   # Runtime configuration flags
│   │   └── ...                    # Shared utilities (UriUtilities, MimeMapping, etc.)
├── tests/
│   └── MarkItDown.Tests/          # xUnit + Shouldly tests, Python parity vectors (WIP)
├── Directory.Build.props          # Shared build + packaging settings
└── README.md                      # This document

Contributing Guidelines

Fork the repository.
Create a feature branch (git checkout -b feature/my-feature).
Add tests with xUnit/Shouldly mirroring relevant Python vectors.
Run dotnet test (CI enforces green builds + coverage upload).
Update docs or samples if behaviour changes.
Submit a pull request for review.

🗺️ Roadmap

🎯 Near-Term

Azure Document Intelligence converter (options already scaffolded)
Outlook .msg ingestion via MIT-friendly dependencies
Expanded CLI commands (batch mode, globbing, JSON output)
Richer regression suite mirroring Python test vectors

🎯 Future Ideas

Plugin discovery & sandboxing
Built-in LLM caption/transcription providers
Incremental/streaming conversion APIs
Cloud-native samples (Functions, Containers, Logic Apps)

📈 Performance

MarkItDown is designed for high performance with:

Stream-based processing – Avoids writing temporary files by default
Async/await everywhere – Non-blocking I/O with cancellation support
Minimal allocations – Smart buffer reuse and pay-for-play converters
Fast detection – Lightweight sniffing before converter dispatch
Extensible hooks – Offload captions/transcripts to background workers

🔧 Configuration

var options = new MarkItDownOptions
{
    EnableBuiltins = true,
    EnablePlugins = false,
    ExifToolPath = "/usr/local/bin/exiftool",
    ImageCaptioner = async (bytes, info, token) =>
    {
        // Call your preferred vision or LLM service here
        return await Task.FromResult("A scenic mountain landscape at sunset.");
    },
    AudioTranscriber = async (bytes, info, token) =>
    {
        // Route to speech-to-text provider
        return await Task.FromResult("Welcome to the MarkItDown demo.");
    }
};

var markItDown = new MarkItDown(options);

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This project is a C# port of the original Microsoft MarkItDown Python library, which was created by the Microsoft AutoGen team.

📞 Support

📚 Documentation: GitHub Wiki
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions
📧 Email: Create an issue for support

⭐ Star this repository if you find it useful!

Made with ❤️ by ManagedCode


Product	Compatible and additional computed target framework versions.
.NET	net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.


net9.0
- AngleSharp (>= 1.0.0)
- DocumentFormat.OpenXml (>= 3.1.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 8.0.1)
- Microsoft.Extensions.Logging.Abstractions (>= 8.0.1)
- PdfPig (>= 0.1.9)
- SkiaSharp (>= 2.88.8)
- System.Text.Encoding.CodePages (>= 8.0.0)
- System.Text.Json (>= 8.0.5)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.0.5	561	10/17/2025
0.0.4	674	9/28/2025
0.0.3	421	9/24/2025
0.0.2	161	9/24/2025
0.0.1	162	9/20/2025