DocumentAtom.Pdf
1.0.0
<img src="https://github.com/jchristn/DocumentAtom/blob/main/assets/icon.png" width="256" height="256">
DocumentAtom
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
New in v1.0.x
- Initial release
Motivation
Parsing documents and extracting constituent parts is one part science and one part black magic. I make no claims about the accuracy of extraction; rather, the library aims for perfection and hopefully falls short at excellence. If you find ways to improve processing and extraction in any way, we would love your feedback so we can make this library more accurate, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Bugs, Quality, Feedback, or Enhancement Requests
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
Types Supported
DocumentAtom supports the following input file types:
- Text
- Markdown
- Microsoft Word (.docx)
- Microsoft Excel (.xlsx)
- Microsoft PowerPoint (.pptx)
- PNG images
- PDF (via this package, DocumentAtom.Pdf)
Simple Example
Refer to the various Test projects for working examples.
The following example shows processing a markdown (.md) file.
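using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;

// Placeholder path; point this at the markdown file you want to process
string filename = "README.md";

// Create the processor with default settings
MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
MarkdownProcessor processor = new MarkdownProcessor(settings);

// Extract each atom from the file and print it
foreach (MarkdownAtom atom in processor.Extract(filename))
    Console.WriteLine(atom.ToString());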
Atom Types
DocumentAtom parses input data assets into a variety of Atom objects. Each Atom includes top-level metadata including:
- GUID
- Type - including Text, Image, Binary, Table, and List
- PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- Position - the ordinal position of the Atom, relative to others
- Length - the length of the Atom's content
- MD5Hash - the MD5 hash of the Atom content
- SHA1Hash - the SHA1 hash of the Atom content
- SHA256Hash - the SHA256 hash of the Atom content
- Quarks - sub-atomic particles created from the Atom content, for instance, when chunking text
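To make the Length, hash, and Quarks metadata concrete, here is a minimal sketch of how such values could be derived from a piece of text content. The class AtomMetadataSketch and its methods are hypothetical and not part of DocumentAtom. UTF-8 encoding, lowercase hexadecimal hashes, and fixed-size chunking are assumptions made for illustration; the library may compute these values differently.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of DocumentAtom.
public static class AtomMetadataSketch
{
    // Converts hash bytes to a lowercase hexadecimal string.
    private static string Hex(byte[] bytes)
    {
        return Convert.ToHexString(bytes).ToLowerInvariant();
    }

    // Derives length and hash metadata from content (assumes UTF-8 bytes).
    public static (int Length, string Md5Hash, string Sha1Hash, string Sha256Hash) Describe(string content)
    {
        byte[] data = Encoding.UTF8.GetBytes(content);
        return (
            data.Length,
            Hex(MD5.HashData(data)),
            Hex(SHA1.HashData(data)),
            Hex(SHA256.HashData(data)));
    }

    // Splits content into fixed-size pieces, one simple way to form "quarks".
    public static List<string> Chunk(string content, int size)
    {
        if (size < 1) throw new ArgumentOutOfRangeException(nameof(size));
        List<string> quarks = new List<string>();
        for (int i = 0; i < content.Length; i += size)
            quarks.Add(content.Substring(i, Math.Min(size, content.Length - i)));
        return quarks;
    }
}
```

Calling AtomMetadataSketch.Chunk("abcdefg", 3) yields three pieces: "abc", "def", and "g". Hashing each piece with Describe would give every quark its own fingerprint, which is handy for de-duplication.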
The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:
- BinaryAtom - includes a Bytes property
- DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
- ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
- PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
- TableAtom - includes Rows, Columns, Irregular, and Table properties
- TextAtom - includes Text
- XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties
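Because processors return different Atom types, consuming code often needs to branch on the concrete type. The sketch below shows one way to do that with C# pattern matching. The classes here are hypothetical stand-ins that borrow only the property names listed above; the real classes ship with the DocumentAtom packages, and their inheritance from AtomBase and exact property types are assumptions.

```csharp
using System;

// Hypothetical stand-ins, not the library's actual classes.
public abstract class AtomBase
{
    public int Position { get; set; }
}

public class TextAtom : AtomBase
{
    public string Text { get; set; } = "";
}

public class BinaryAtom : AtomBase
{
    public byte[] Bytes { get; set; } = Array.Empty<byte>();
}

public static class AtomDispatchSketch
{
    // Produces a one-line summary, branching on the concrete atom type.
    public static string Summarize(AtomBase atom)
    {
        return atom switch
        {
            TextAtom t => $"#{t.Position} text: {t.Text}",
            BinaryAtom b => $"#{b.Position} binary: {b.Bytes.Length} bytes",
            _ => $"#{atom.Position} other"
        };
    }
}
```

A switch expression keeps each type's handling in one place, and the discard arm covers any atom type the caller has not handled explicitly.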
Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.
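Once a table has been converted to a native System.Data.DataTable, it can be walked with standard ADO.NET types. The sketch below covers only that last step. The conversion call itself is omitted because its exact name is not shown here, so a hand-built table stands in for one extracted from an Atom. DataTableSketch and its sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper, not part of DocumentAtom.
public static class DataTableSketch
{
    // Flattens each row into a "Column=Value, Column=Value" line.
    public static List<string> Describe(DataTable table)
    {
        List<string> lines = new List<string>();
        foreach (DataRow row in table.Rows)
        {
            List<string> cells = new List<string>();
            foreach (DataColumn column in table.Columns)
                cells.Add($"{column.ColumnName}={row[column]}");
            lines.Add(string.Join(", ", cells));
        }
        return lines;
    }

    // Builds a small table by hand, standing in for one converted from an Atom.
    public static DataTable Sample()
    {
        DataTable table = new DataTable("Sample");
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Pages", typeof(int));
        table.Rows.Add("report.pdf", 12);
        table.Rows.Add("notes.pdf", 3);
        return table;
    }
}
```

Passing DataTableSketch.Sample() to Describe returns one line per row, for example "Name=report.pdf, Pages=12" for the first row.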
Underlying Libraries
DocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.
Each of these libraries was integrated as a NuGet package, and no source was included or modified from these packages.
My libraries used within DocumentAtom:
Version History
Please refer to CHANGELOG.md for version history.
Thanks
Special thanks to iconduck.com and the content authors for producing this icon.
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. |
net8.0 dependencies:
- DocumentAtom (>= 1.0.0)
- PdfPig (>= 0.1.9)
- Tabula (>= 0.1.3)
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 49 | 1/19/2025 |