ZentrixLabs.OcrCorrection
1.0.1
A comprehensive .NET library for correcting common OCR errors in English text, specifically designed for subtitle extraction and document digitization workflows.
Features
- ~837 Comprehensive Patterns - Extensively tested correction patterns for English OCR
- Multi-Pass Processing - Iterative correction until convergence or max passes reached
- High Performance - Pre-compiled regex patterns process the subtitles of a feature-length film in about 1 second
- Zero False Positives on the Tested Corpus - Carefully curated, specific patterns avoid breaking valid text
- Modular Architecture - Organized pattern categories (Character Substitution, Spacing, Apostrophes, Numbers)
- Detailed Logging - Track corrections by category with performance metrics
- Dependency Injection Ready - Designed for modern .NET applications
Common OCR Errors Fixed
Capital I ↔ Lowercase l Confusion
The most common OCR error in subtitles. Our library handles ~660 specific patterns:
- HeIIo → Hello
- I'm gIad → I'm glad
- TeII me → Tell me
- stiII → still
Spacing Errors (~281 patterns)
- Extra spaces: th e → the, wh at → what
- Missing spaces: thejob → the job, ofthose → of those
- -tion/-ation words: confus i on → confusion, inform at i on → information
Apostrophe Issues (~37 patterns)
- Missing: dont → don't, youre → you're
- Malformed: you)re → you're, I)m → I'm
Number Confusion (~20 patterns)
- Letter to number: I 00 → 100, $I O → $10
- Number to letter: standalone 0 → O, 1 → I
Sentence Start Issues
- lt's → It's
- ltm → I'm
- ltll → I'll
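One way to picture the categories above is as an ordered table of narrow, word-bounded regex rules. The sketch below is not the library's implementation: the `rules` table and the `ApplyRules` helper are hypothetical names, the patterns are a handful of examples taken from the lists above, and only `System.Text.RegularExpressions` is used.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical rule table: each entry pairs a narrow, word-bounded pattern
// with its replacement. Illustrative only, not the library's pattern set.
var rules = new (Regex Pattern, string Replacement)[]
{
    (new Regex(@"\bHeIIo\b", RegexOptions.Compiled), "Hello"),
    (new Regex(@"\bstiII\b", RegexOptions.Compiled), "still"),
    (new Regex(@"\bth e\b", RegexOptions.Compiled), "the"),
    (new Regex(@"\bthejob\b", RegexOptions.Compiled), "the job"),
    (new Regex(@"\bdont\b", RegexOptions.Compiled), "don't"),
    (new Regex(@"\byou\)re\b", RegexOptions.Compiled), "you're"),
};

// Applies every rule once, in table order.
string ApplyRules(string text)
{
    foreach (var (pattern, replacement) in rules)
        text = pattern.Replace(text, replacement);
    return text;
}

Console.WriteLine(ApplyRules("HeIIo, you)re stiII on thejob, dont leave th e room."));
// Prints: Hello, you're still on the job, don't leave the room.
```

Because every rule names a concrete misspelling, text that does not contain one of those exact strings passes through unchanged.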
Installation
dotnet add package ZentrixLabs.OcrCorrection
Or via NuGet Package Manager:
Install-Package ZentrixLabs.OcrCorrection
Quick Start
Basic Usage
using ZentrixLabs.OcrCorrection.Core;
using ZentrixLabs.OcrCorrection.Patterns;
// Create the correction engine
var patternProvider = new EnglishPatternProvider();
var engine = new OcrCorrectionEngine(patternProvider);
// Correct OCR errors
var text = "HeIIo! I dont think th is looks right.";
var result = engine.CorrectText(text);
Console.WriteLine(result.CorrectedText);
// Output: "Hello! I don't think this looks right."
Console.WriteLine($"Corrections made: {result.CorrectionsMade}");
// Output: "Corrections made: 4"
With Detailed Logging
var options = new CorrectionOptions
{
    IncludeDetailedLog = true,
    IncludePerformanceMetrics = true,
    IncludeCorrectionDetails = true
};
var result = engine.CorrectText(text, options);
foreach (var detail in result.CorrectionDetails)
{
    Console.WriteLine($"{detail.Pattern}: '{detail.Original}' → '{detail.Corrected}'");
}
Multi-Pass Processing
using ZentrixLabs.OcrCorrection.Passes;
var multiPass = new MultiPassProcessor(engine);
var result = await multiPass.ProcessAsync(
    text,
    maxPasses: 5,
    options: new CorrectionOptions { IncludeDetailedLog = true }
);
Console.WriteLine($"Converged after {result.PassesCompleted} passes");
Console.WriteLine($"Total corrections: {result.TotalCorrections}");
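Repeating the correction pass matters because one fix can expose another. The following is a minimal, self-contained sketch of "repeat until nothing changes or the pass limit is hit". It does not show how `MultiPassProcessor` is implemented; `CorrectUntilStable`, `SinglePass`, and the two rules are hypothetical, and the convention of counting the final no-change pass is an assumption of this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

// Two illustrative rules. The first only matches after the second has run.
var passRules = new (Regex Pattern, string Replacement)[]
{
    (new Regex(@"\binform ation\b"), "information"),
    (new Regex(@"\bat i on\b"), "ation"),
};

// One pass applies each rule once, in order.
string SinglePass(string text)
{
    foreach (var (pattern, replacement) in passRules)
        text = pattern.Replace(text, replacement);
    return text;
}

// Repeats singlePass until a pass changes nothing or maxPasses is reached.
// passesRun counts every pass executed, including the final no-change pass.
string CorrectUntilStable(string input, Func<string, string> singlePass, int maxPasses, out int passesRun)
{
    var current = input;
    passesRun = 0;
    while (passesRun < maxPasses)
    {
        passesRun++;
        var next = singlePass(current);
        if (next == current)
            break; // converged
        current = next;
    }
    return current;
}

var stable = CorrectUntilStable("inform at i on", SinglePass, 5, out var passes);
Console.WriteLine($"{stable} after {passes} passes");
// Prints: information after 3 passes
```

Pass 1 produces "inform ation", pass 2 produces "information", and pass 3 confirms that nothing else changes.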
Dependency Injection
using Microsoft.Extensions.DependencyInjection;
using ZentrixLabs.OcrCorrection.Extensions;
var services = new ServiceCollection();
services.AddOcrCorrection();
var serviceProvider = services.BuildServiceProvider();
var engine = serviceProvider.GetRequiredService<IOcrCorrectionEngine>();
Batch Processing Example (SRT Files)
// Read SRT file
var srtContent = File.ReadAllText("movie.srt");
// Correct OCR errors
var result = engine.CorrectText(srtContent);
// Save corrected SRT
File.WriteAllText("movie_corrected.srt", result.CorrectedText);
Console.WriteLine($"✅ Corrected {result.CorrectionsMade} errors");
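The example above sends the whole file, cue numbers and timestamps included, through the engine. If you would rather correct only the dialogue lines, something along these lines could work. This is a sketch under stated assumptions: `CorrectCueText` and the `correct` delegate are hypothetical, and in a real program the delegate might wrap `engine.CorrectText(line).CorrectedText`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Matches an SRT timing line such as "00:01:02,345 --> 00:01:04,000".
var timing = new Regex(@"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}");

// Runs `correct` on dialogue lines only; blank lines, cue indexes, and
// timing lines are left untouched. Line endings are normalized to "\n".
string CorrectCueText(string srt, Func<string, string> correct)
{
    var lines = srt.Replace("\r\n", "\n").Split('\n');
    for (var i = 0; i < lines.Length; i++)
    {
        var trimmed = lines[i].Trim();
        bool isBlank = trimmed.Length == 0;
        bool isTiming = timing.IsMatch(lines[i]);
        // A cue index is a digits-only line immediately followed by a timing line.
        bool isIndex = !isBlank && trimmed.All(char.IsDigit)
                       && i + 1 < lines.Length && timing.IsMatch(lines[i + 1]);
        if (!isBlank && !isTiming && !isIndex)
            lines[i] = correct(lines[i]);
    }
    return string.Join("\n", lines);
}

var sample = "1\n00:00:01,000 --> 00:00:02,500\nHeIIo there\n";
// Stand-in corrector for the demo; swap in the real engine call here.
Console.WriteLine(CorrectCueText(sample, line => line.Replace("HeIIo", "Hello")));
```

A dialogue line that consists only of digits is still passed to the corrector, because it is not followed by a timing line.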
Performance
Tested on feature-length films (900-1,500 subtitle entries):
- Average processing time: ~900ms per film
- Typical corrections: 2-200 errors per film
- Success rate: 100% on tested corpus
Testing Results
The library was tested on six Tesseract PGS extractions. Each file came out with no remaining OCR errors and no false positives:
| Film | Subtitles | Corrections | Result |
|---|---|---|---|
| 28 Weeks Later (2007) | 1,237 | 2 | ✅ Perfect |
| 28 Years Later (2025) | 1,231 | 4 | ✅ Perfect |
| 28 Days Later (2002) | 1,232 | 42 | ✅ Perfect |
| Alien (1979) | 984 | 109 | ✅ Perfect |
| Alien: Covenant (2017) | 1,515 | 158 | ✅ Perfect |
| A View to a Kill (1985) | 965 | 190 | ✅ Perfect |
Total: 7,164 subtitles processed, 505 errors corrected, 0 remaining OCR errors, 0 false positives.
Pattern Categories
- Character Substitution: ~660 patterns (primarily capital I ↔ lowercase l)
- Spacing Patterns: ~281 patterns (extra/missing spaces, compound words)
- Apostrophe Patterns: ~37 patterns (contractions, missing apostrophes)
- Number Patterns: ~20 patterns (letter/number confusion in numeric contexts)
Configuration Options
var options = new CorrectionOptions
{
    // Include detailed correction log
    IncludeDetailedLog = true,
    // Include performance metrics (processing time, etc.)
    IncludePerformanceMetrics = true,
    // Include details about each correction made
    IncludeCorrectionDetails = true,
    // Exclude specific pattern categories
    ExcludedCategories = new[] { "Numbers" },
    // Context-aware capitalization (experimental, disabled by default)
    UseContextAwareCapitalization = false
};
Advanced Usage
Filtering by Category
var patternProvider = new EnglishPatternProvider();
// Get only spacing-related patterns
var spacingPatterns = patternProvider.GetPatternsByCategory("Spacing");
// Get all available categories
var categories = patternProvider.GetCategories();
// Returns: ["Apostrophes", "Capitalization", "Character Substitution", "Numbers", "Spacing"]
Custom Pattern Providers
public class MyCustomPatternProvider : IPatternProvider
{
    public string Name => "Custom Patterns";
    public string LanguageCode => "en";

    public IEnumerable<CorrectionPattern> GetPatterns()
    {
        return new[]
        {
            new CorrectionPattern(@"\bcustomerror\b", "custom error", "Custom")
            {
                Description = "Fix custom error",
                Priority = 50
            }
        };
    }

    // ... implement other interface members
}
// Use custom provider
var engine = new OcrCorrectionEngine(new MyCustomPatternProvider());
Use Cases
- Subtitle Extraction - Clean up OCR errors from PGS/VobSub/ASS subtitle extraction
- Document Digitization - Fix OCR errors in scanned documents
- Historical Text - Correct OCR errors in digitized historical documents
- Post-Processing Pipeline - Integrate into automated OCR workflows
Requirements
- .NET 8.0 or higher
- No third-party dependencies; the only package reference is Microsoft.Extensions.Logging.Abstractions (>= 8.0.2)
Contributing
Contributions are welcome! If you find OCR errors that aren't being corrected:
- Check the OCR Pattern Analysis Workflow
- Add specific patterns (avoid overly generic patterns)
- Test thoroughly to avoid false positives
- Submit a pull request
Important Notes
Dangerous Patterns Removed
Early versions included generic "missing space" patterns like:
// ❌ DANGEROUS - breaks valid words
(\w)(are)(\s) → "$1 $2$3" // Breaks: "fanfare" → "fanf are"
(\w)(he)(\s) → "$1 $2$3" // Breaks: "she" → "s he"
These were removed in favor of specific patterns only. See Pattern Analysis Workflow for details.
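The failure mode is easy to reproduce with plain `System.Text.RegularExpressions`. The sketch below runs the generic `are` pattern quoted above next to a narrow, word-bounded alternative. The `youare` rule and the helper names are illustrative assumptions, not patterns taken from the library.

```csharp
using System;
using System.Text.RegularExpressions;

// The generic pattern quoted above: any word character, then "are", then whitespace.
var generic = new Regex(@"(\w)(are)(\s)");
// A narrow alternative that names one concrete misspelling (illustrative only).
var specific = new Regex(@"\byouare\b");

string ApplyGeneric(string text) => generic.Replace(text, "$1 $2$3");
string ApplySpecific(string text) => specific.Replace(text, "you are");

Console.WriteLine(ApplyGeneric("youare late"));        // you are late
Console.WriteLine(ApplyGeneric("a fanfare sounded"));  // a fanf are sounded (valid word broken)
Console.WriteLine(ApplySpecific("youare late"));       // you are late
Console.WriteLine(ApplySpecific("a fanfare sounded")); // a fanfare sounded (left alone)
```

The generic rule fixes the intended case but also fires inside any word that happens to end in "are", which is why specific patterns scale more safely even though more of them are needed.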
Context-Aware Capitalization
The library includes experimental context-aware capitalization (disabled by default) that attempts to only capitalize words at sentence starts. Due to edge cases with contractions and proper nouns, it's recommended to keep this disabled for production use.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Tested extensively on Tesseract OCR output from Blu-ray PGS subtitle extraction
- Pattern library built from analysis of real-world OCR errors in feature films
- Special thanks to the .NET community for regex optimization techniques
Built with ❤️ by ZentrixLabs
Dependencies (net8.0)
- Microsoft.Extensions.Logging.Abstractions (>= 8.0.2)
v1.0.0: Production-ready release with ~837 comprehensive patterns for English OCR correction. Tested on 7,164 subtitles across 6 feature films with 100% success rate and zero false positives. Includes modular pattern organization (NumberPatterns, ApostrophePatterns), multi-pass processing, detailed logging, and performance optimizations. Perfect for Tesseract PGS subtitle extraction workflows.