ZentrixLabs.OcrCorrection 1.0.1

ZentrixLabs.OcrCorrection

A comprehensive .NET library for correcting common OCR errors in English text, specifically designed for subtitle extraction and document digitization workflows.

Features

~837 Comprehensive Patterns - Extensively tested correction patterns for English OCR
Multi-Pass Processing - Iterative correction until convergence or max passes reached
High Performance - Pre-compiled regex patterns process feature-length films in ~1 second
Zero False Positives on the Tested Corpus - Specific, curated patterns avoid breaking valid text (see Testing Results)
Modular Architecture - Organized pattern categories (Character Substitution, Spacing, Apostrophes, Numbers)
Detailed Logging - Track corrections by category with performance metrics
Dependency Injection Ready - Designed for modern .NET applications

Common OCR Errors Fixed

Capital I ↔ Lowercase l Confusion

This is the most common OCR error in subtitles. The library handles it with ~660 word-specific patterns:

HeIIo → Hello
I'm gIad → I'm glad
TeII me → Tell me
stiII → still
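
To show what a word-specific pattern means in practice, here is a minimal sketch that uses only System.Text.RegularExpressions. The four rules and the FixCapitalI helper are hypothetical and written for illustration; they are not the library's actual pattern set or API.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical hand-written rules: each one targets a single misread word,
// anchored with \b so it cannot fire inside a longer word.
var patterns = new (Regex Pattern, string Replacement)[]
{
    (new Regex(@"\bHeIIo\b", RegexOptions.Compiled), "Hello"),
    (new Regex(@"\bgIad\b", RegexOptions.Compiled), "glad"),
    (new Regex(@"\bTeII\b", RegexOptions.Compiled), "Tell"),
    (new Regex(@"\bstiII\b", RegexOptions.Compiled), "still"),
};

// Applies every rule once, in order.
string FixCapitalI(string input)
{
    foreach (var (pattern, replacement) in patterns)
        input = pattern.Replace(input, replacement);
    return input;
}

Console.WriteLine(FixCapitalI("HeIIo, I'm gIad you're stiII here. TeII me more."));
// prints: Hello, I'm glad you're still here. Tell me more.
```

Because each rule names one exact word, a legitimate capital I (as in "I'm") is left alone.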

Spacing Errors (~281 patterns)

Extra spaces: th e → the, wh at → what
Missing spaces: thejob → the job, ofthose → of those
-tion/-ation words: confus i on → confusion, inform at i on → information

Apostrophe Issues (~37 patterns)

Missing: dont → don't, youre → you're
Malformed: you)re → you're, I)m → I'm

Number Confusion (~20 patterns)

Letter to number: I 00 → 100, $I O → $10
Number to letter: Standalone 0 → O, 1 → I

Sentence Start Issues

lt's → It's, ltm → I'm, ltll → I'll

Installation

dotnet add package ZentrixLabs.OcrCorrection

Or via NuGet Package Manager:

Install-Package ZentrixLabs.OcrCorrection

Quick Start

Basic Usage

using ZentrixLabs.OcrCorrection.Core;
using ZentrixLabs.OcrCorrection.Patterns;

// Create the correction engine
var patternProvider = new EnglishPatternProvider();
var engine = new OcrCorrectionEngine(patternProvider);

// Correct OCR errors
var text = "HeIIo! I dont think th is looks right.";
var result = engine.CorrectText(text);

Console.WriteLine(result.CorrectedText);
// Output: "Hello! I don't think this looks right."

Console.WriteLine($"Corrections made: {result.CorrectionsMade}");
// Output: "Corrections made: 4"

With Detailed Logging

var options = new CorrectionOptions
{
    IncludeDetailedLog = true,
    IncludePerformanceMetrics = true,
    IncludeCorrectionDetails = true
};

var result = engine.CorrectText(text, options);

foreach (var detail in result.CorrectionDetails)
{
    Console.WriteLine($"{detail.Pattern}: '{detail.Original}' → '{detail.Corrected}'");
}

Multi-Pass Processing

using ZentrixLabs.OcrCorrection.Passes;

var multiPass = new MultiPassProcessor(engine);

var result = await multiPass.ProcessAsync(
    text, 
    maxPasses: 5,
    options: new CorrectionOptions { IncludeDetailedLog = true }
);

Console.WriteLine($"Converged after {result.PassesCompleted} passes");
Console.WriteLine($"Total corrections: {result.TotalCorrections}");
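
The idea of iterating until convergence can be sketched without the library. The example below is an assumption-laden illustration, not the implementation of MultiPassProcessor: CorrectUntilStable and both rules are hypothetical. It shows why a second pass can matter: a spacing fix in pass 1 produces text that a character-substitution rule only matches in pass 2.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical rules, applied in this fixed order on every pass.
var patterns = new (Regex Pattern, string Replacement)[]
{
    (new Regex(@"\bTeII\b", RegexOptions.Compiled), "Tell"),   // capital I -> lowercase l
    (new Regex(@"\bTeI I\b", RegexOptions.Compiled), "TeII"),  // remove a stray space
};

// Repeats full passes until a pass changes nothing or maxPasses is reached.
// Returns the final text and the number of passes that changed it.
(string Text, int Passes) CorrectUntilStable(string input, int maxPasses)
{
    var current = input;
    for (var pass = 1; pass <= maxPasses; pass++)
    {
        var next = current;
        foreach (var (pattern, replacement) in patterns)
            next = pattern.Replace(next, replacement);

        if (next == current)
            return (current, pass - 1); // converged: this pass was a no-op
        current = next;
    }
    return (current, maxPasses);
}

var (text, passes) = CorrectUntilStable("TeI I me", maxPasses: 5);
Console.WriteLine($"{text} ({passes} passes changed the text)");
// prints: Tell me (2 passes changed the text)
```

The maxPasses cap guards against rule sets that never settle, for example two rules that undo each other.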

Dependency Injection

using Microsoft.Extensions.DependencyInjection;
using ZentrixLabs.OcrCorrection.Extensions;

var services = new ServiceCollection();
services.AddOcrCorrection();

var serviceProvider = services.BuildServiceProvider();
var engine = serviceProvider.GetRequiredService<IOcrCorrectionEngine>();

Batch Processing Example (SRT Files)

// Read SRT file
var srtContent = File.ReadAllText("movie.srt");

// Correct OCR errors
var result = engine.CorrectText(srtContent);

// Save corrected SRT
File.WriteAllText("movie_corrected.srt", result.CorrectedText);

Console.WriteLine($"✅ Corrected {result.CorrectionsMade} errors");
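
If you would rather not run patterns over cue numbers and timestamps at all, one option is to correct only the text lines of each cue. The following is a minimal sketch under stated assumptions: CorrectSrtTextOnly is a hypothetical helper, and the correct delegate stands in for whatever correction call you use (for example s => engine.CorrectText(s).CorrectedText from the example above).

```csharp
using System;
using System.Linq;

// Hypothetical helper: applies `correct` to subtitle text lines only,
// leaving blank lines, numeric cue indices, and "-->" timing lines untouched.
string CorrectSrtTextOnly(string srt, Func<string, string> correct)
{
    var lines = srt.Split('\n');
    for (var i = 0; i < lines.Length; i++)
    {
        var hadCarriageReturn = lines[i].EndsWith('\r');
        var line = lines[i].TrimEnd('\r');

        var isIndex = line.Length > 0 && line.All(char.IsDigit);
        var isTiming = line.Contains("-->");
        if (line.Length == 0 || isIndex || isTiming)
            continue;

        // Preserve the original line ending (CRLF or LF).
        lines[i] = correct(line) + (hadCarriageReturn ? "\r" : "");
    }
    return string.Join("\n", lines);
}

// Stand-in correction for demonstration only.
var sample = "1\n00:00:01,000 --> 00:00:02,000\nHeIIo there\n";
Console.Write(CorrectSrtTextOnly(sample, s => s.Replace("HeIIo", "Hello")));
```

One limitation of this sketch: a subtitle line made up solely of digits (such as "1985") is treated as a cue index and skipped.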

Performance

Tested on feature-length films (900-1,500 subtitle entries):

Average processing time: ~900ms per film
Typical corrections: 2-200 errors per film
Success rate: 100% on tested corpus

Testing Results

The library was tested on six Tesseract PGS extractions, each finishing with no remaining OCR errors and no false positives:

Film	Subtitles	Corrections	Result
28 Weeks Later (2007)	1,237	2	✅ Perfect
28 Years Later (2025)	1,231	4	✅ Perfect
28 Days Later (2002)	1,232	42	✅ Perfect
Alien (1979)	984	109	✅ Perfect
Alien: Covenant (2017)	1,515	158	✅ Perfect
A View to a Kill (1985)	965	190	✅ Perfect

Total: 7,164 subtitles processed, 505 errors corrected, 0 remaining OCR errors, 0 false positives.

Pattern Categories

Character Substitution: ~660 patterns (primarily capital I ↔ lowercase l)
Spacing Patterns: ~281 patterns (extra/missing spaces, compound words)
Apostrophe Patterns: ~37 patterns (contractions, missing apostrophes)
Number Patterns: ~20 patterns (letter/number confusion in numeric contexts)

Configuration Options

var options = new CorrectionOptions
{
    // Include detailed correction log
    IncludeDetailedLog = true,
    
    // Include performance metrics (processing time, etc.)
    IncludePerformanceMetrics = true,
    
    // Include details about each correction made
    IncludeCorrectionDetails = true,
    
    // Exclude specific pattern categories
    ExcludedCategories = new[] { "Numbers" },
    
    // Context-aware capitalization (experimental, disabled by default)
    UseContextAwareCapitalization = false
};

Advanced Usage

Filtering by Category

var patternProvider = new EnglishPatternProvider();

// Get only spacing-related patterns
var spacingPatterns = patternProvider.GetPatternsByCategory("Spacing");

// Get all available categories
var categories = patternProvider.GetCategories();
// Returns: ["Apostrophes", "Capitalization", "Character Substitution", "Numbers", "Spacing"]

Custom Pattern Providers

public class MyCustomPatternProvider : IPatternProvider
{
    public string Name => "Custom Patterns";
    public string LanguageCode => "en";
    
    public IEnumerable<CorrectionPattern> GetPatterns()
    {
        return new[]
        {
            new CorrectionPattern(@"\bcustomerror\b", "custom error", "Custom")
            {
                Description = "Fix custom error",
                Priority = 50
            }
        };
    }
    
    // ... implement other interface members
}

// Use custom provider
var engine = new OcrCorrectionEngine(new MyCustomPatternProvider());

Use Cases

Subtitle Extraction - Clean up OCR errors from PGS/VobSub/ASS subtitle extraction
Document Digitization - Fix OCR errors in scanned documents
Historical Text - Correct OCR errors in digitized historical documents
Post-Processing Pipeline - Integrate into automated OCR workflows

Requirements

.NET 8.0 or higher
No external dependencies (uses built-in .NET libraries)

Contributing

Contributions are welcome! If you find OCR errors that aren't being corrected:

Check the OCR Pattern Analysis Workflow
Add specific patterns (avoid overly generic patterns)
Test thoroughly to avoid false positives
Submit a pull request

Important Notes

Dangerous Patterns Removed

Early versions included generic "missing space" patterns like:

// ❌ DANGEROUS - breaks valid words
(\w)(are)(\s) → "$1 $2$3"  // Breaks: "fanfare" → "fanf are"
(\w)(he)(\s) → "$1 $2$3"   // Breaks: "she" → "s he"

These were removed in favor of specific patterns only. See Pattern Analysis Workflow for details.
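
The breakage is easy to reproduce with System.Text.RegularExpressions alone. This sketch runs the two generic patterns quoted above against valid text, then contrasts them with a word-specific rule. The "youare" rule is a hypothetical example, not necessarily one of the library's patterns.

```csharp
using System;
using System.Text.RegularExpressions;

// The two generic "missing space" patterns quoted above.
var genericAre = new Regex(@"(\w)(are)(\s)");
var genericHe = new Regex(@"(\w)(he)(\s)");

Console.WriteLine(genericAre.Replace("A fanfare played.", "$1 $2$3"));
// prints: A fanf are played.
Console.WriteLine(genericHe.Replace("Then she left.", "$1 $2$3"));
// prints: Then s he left.

// A specific rule (hypothetical) only fires on the exact run-together word.
var specific = new Regex(@"\byouare\b");
Console.WriteLine(specific.Replace("A fanfare played, and youare late.", "you are"));
// prints: A fanfare played, and you are late.
```

The generic patterns match any word that merely ends in "are" or "he", while the specific rule is bounded by \b on both sides and leaves "fanfare" intact.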

Context-Aware Capitalization

The library includes experimental context-aware capitalization (disabled by default) that attempts to capitalize words only at sentence starts. Because of edge cases with contractions and proper nouns, keep it disabled in production.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Tested extensively on Tesseract OCR output from Blu-ray PGS subtitle extraction
Pattern library built from analysis of real-world OCR errors in feature films
Special thanks to the .NET community for regex optimization techniques

Version	Downloads	Last Updated
1.0.1	223	10/13/2025
1.0.0	146	10/11/2025

v1.0.0: Production-ready release with ~837 comprehensive patterns for English OCR correction. Tested on 7,164 subtitles across 6 feature films with 100% success rate and zero false positives. Includes modular pattern organization (NumberPatterns, ApostrophePatterns), multi-pass processing, detailed logging, and performance optimizations. Perfect for Tesseract PGS subtitle extraction workflows.