GroupDocs.Parser 24.8.0

There is a newer version of this package available.
See the version list below for details.
dotnet add package GroupDocs.Parser --version 24.8.0                
NuGet\Install-Package GroupDocs.Parser -Version 24.8.0                
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="GroupDocs.Parser" Version="24.8.0" />                
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add GroupDocs.Parser --version 24.8.0                
#r "nuget: GroupDocs.Parser, 24.8.0"                
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install GroupDocs.Parser as a Cake Addin
#addin nuget:?package=GroupDocs.Parser&version=24.8.0

// Install GroupDocs.Parser as a Cake Tool
#tool nuget:?package=GroupDocs.Parser&version=24.8.0                

Advanced Document Parsing API for .NET

Product Page Docs Demos API Ref Examples Blog Releases Support Temp License

Important Note: Starting from 24.2.0, the GroupDocs.Parser package has been split into two distinct platform packages: .NET Standard and .NET Framework. The GroupDocs.Parser package is specifically designed to support the .NET Standard platform, making it compatible with .NET Core, .NET 5, .NET 6, etc. It includes backward compatibility improvements, allowing it to function with .NET Framework versions starting from 4.6.2. In addition, we have introduced the GroupDocs.Parser.NETFramework package, which is optimized to run seamlessly in the .NET Framework runtime because it includes all the GroupDocs product libraries in their respective .NET Framework versions. It is tailored specifically for .NET Framework users and offers better dependency resolution for those utilizing the .NET Framework. We hope these changes will enhance your experience and provide a more streamlined approach to using the GroupDocs.Parser package. If you have any further questions or concerns, please don't hesitate to reach out to our free support forum.

GroupDocs.Parser for .NET is a powerful API designed for advanced document parsing, offering extensive features like text extraction, metadata retrieval, and image extraction across various document formats, including PDFs, Word, Excel, and PowerPoint. This robust API supports .NET Standard and .NET Framework, making it compatible with .NET Core, .NET 5, and .NET 6, while also providing backward compatibility with older .NET Framework versions. With specialized parsing capabilities for PDF documents, email parsing, and template-based data extraction, GroupDocs.Parser ensures high-performance, secure parsing and scalability, suitable for cross-platform environments including Windows, Linux, and macOS. It's the ideal solution for developers needing to integrate efficient document processing into their .NET applications.

Text Extraction

  • Extract text from various document formats (PDF, Microsoft Word, Excel, etc.).
  • Extract text with its formatting retained, including font styles, sizes, and colors.
  • Search for specific text within a document and extract it.
  • Optical Character Recognition (OCR) support to extract text from images within documents.

Metadata Extraction

  • Extract metadata from documents, including properties like author, title, subject, etc.
  • Extract document properties such as creation date, modification date, and more.
  • Extract field-specific data such as invoice numbers, dates, and other custom fields.

Image and Attachment Extraction

  • Extract images embedded within documents.
  • Extract file attachments from documents such as PDF and email files.
  • Extract and recognize barcodes from documents.

Document Structure Analysis

  • Analyze and extract information from structured documents like tables, lists, and paragraphs.
  • Extract tables and their content from documents.
  • Extract hyperlinks from documents.
  • Extract bookmarks from documents like PDF files.

PDF-Specific Parsing

  • Specialized parsing capabilities for PDF documents, including text extraction, image extraction, and metadata retrieval.
  • Extract page count and other PDF-specific properties.
  • Extract and manage PDF bookmarks.

Email Parsing

  • Extract text, attachments, and metadata from email formats like EML, MSG, etc.
  • Extract email-specific properties such as sender, receiver, subject, and email body.

Spreadsheet Parsing

  • Extract text, metadata, and other data from Excel spreadsheets.
  • Extract specific ranges, cells, or entire sheets from Excel documents.

Presentation Parsing

  • Extract text, images, and metadata from PowerPoint presentations.
  • Extract slide-specific content, including notes, shapes, and text.

Template-Based Data Extraction

  • Define and use templates to extract data based on specific document structures.
  • Template editor for creating and editing templates for structured data extraction.
  • Custom parsing logic to implement specific content extraction rules based on custom templates.

Advanced Features

  • Support for multiple file formats such as PDF, DOCX, XLSX, PPTX, RTF, TXT, and more.
  • Cross-Platform Compatibility: Works across Windows, Linux, and macOS.
  • Integration with .NET applications for seamless functionality.
  • High performance optimized for handling large documents efficiently.
  • Secure parsing that does not compromise the document's security and integrity.
  • Scalability to handle large volumes of documents in batch processing.

Additional Features

  • Retrieve the number of pages in a document.
  • Extract data from forms and other interactive elements within documents.
  • Support for content-aware parsing to detect and extract specific types of data.

Supported Document Formats

Word Processing

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
DOC - Microsoft Word Document
DOT - Microsoft Word Document Template
DOCX - Office Open XML Document
DOCM - Office Open XML Macro-Enabled Document
DOTX - Office Open XML Document Template
DOTM - Office Open XML Document Macro-Enabled Template
TXT - Plain text
ODT - Open Document Text
OTT - Open Document Text Template
RTF - Rich Text Format

PDF

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
PDF - Portable Document Format

Markup

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
XHTML - Extensible Hypertext Markup Language File
MHTML - MIME HTML File
MD - Markdown ✔ (Formatted Text is Not supported)
XML - XML File

Ebook

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
CHM - Compiled HTML Help File
EPUB - Digital E-Book File Format
FB2 - FictionBook 2.0 File
MOBI - Mobipocket
AZW3 - Kindle Format 8

Spreadsheet

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
XLS - Microsoft Excel Spreadsheet
XLT - Microsoft Excel Template
XLSX - Office Open XML Spreadsheet
XLSM - Office Open XML Macro-Enabled Spreadsheet
XLSB - Office Open XML Binary Spreadsheet
XLTX - Office Open XML Spreadsheet Template
XLTM - Office Open XML Macro-Enabled Spreadsheet Template
ODS - Open Document Spreadsheet
OTS - Open Document Spreadsheet Template
CSV - Comma Separated Values
XLA - Excel Add-In File
XLAM - Excel Open XML Macro-Enabled Add-In
NUMBERS - Apple iWork Numbers

Presentation

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
PPT - PowerPoint Presentation
PPS - PowerPoint Slideshow
POT - PowerPoint Template
PPTX - Office Open XML Presentation
PPTM - Office Open XML Macro-Enabled Presentation
POTX - Office Open XML Presentation Template
POTM - Office Open XML Macro-Enabled Presentation Template
PPSX - Office Open XML Presentation Slideshow
PPSM - Office Open XML Macro-Enabled Presentation Slideshow
ODP - Open Document Presentation
OTP - Open Document Presentation Template

Email

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
PST - Outlook Personal Information Store File
OST - Outlook Offline Data File
EML - E-Mail Message
EMLX - Apple Mail Message
MSG - Outlook Mail Message

Note

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
ONE - OneNote Document

Archive

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
7Z* - 7Z File
ZIP - Zipped File
RAR - Rar File
TAR - Tar File
GZ - GZip file
BZ2 - BZip2 File

Note: Encrypted 7-zip archives are not supported.

Image*

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
BMP - Bitmap Image file
GIF - Graphics Interchange Format
JP2 - JPEG 2000
JPG, JPEG - JPEG Image file
PNG - Portable Network Graphics
TIF, TIFF - Tagged Image File Format
DICOM - DICOM (Digital Imaging and Communications in Medicine)
DJVU - DjVu File Format
EMF - Enhanced metafile
J2K - JPEG 2000
PS - PostScript File Format
PSD - Photoshop Document
SVG - Scalable Vector Graphics file
SVGZ - Scalable Vector Graphics file (with gzip compression)
WEBP - WebP Image File Format
WMF - Microsoft Windows Metafile

Database

Document Type | Parse Document by Template | Extract Text (Accurate) | Extract Text (Raw) | Extract Structured Text and Formatted Text | Extract Text Areas | Extract Metadata | Extract Images | Extract Containers and Attachments | Parse Form Data | Extract Table of Contents | Scan Barcode
ADO.NET

Platform Independence

GroupDocs.Parser for .NET does not require any external software or third-party tools to be installed. It supports any 32-bit or 64-bit operating system where the .NET or Mono framework is installed. The details are as follows:

Microsoft Windows: Microsoft Windows Desktop (x86, x64) (XP & up), Microsoft Windows Server (x86, x64) (2000 & up), Windows Azure
Mac OS: Mac OS X
Linux: Linux (Ubuntu, OpenSUSE, CentOS and others)
Development Environments: Microsoft Visual Studio (2010 & up), Xamarin.Android, Xamarin.IOS, Xamarin.Mac, MonoDevelop 2.4 and later.
Supported Frameworks: GroupDocs.Parser for .NET supports .NET and Mono frameworks.

Get Started

Ready to give GroupDocs.Parser for .NET a try? Run Install-Package GroupDocs.Parser from the Package Manager Console in Visual Studio to fetch and reference the GroupDocs.Parser assembly in your project. If you already have GroupDocs.Parser for .NET and want to upgrade it, run Update-Package GroupDocs.Parser to get the latest version.

Please check the GitHub Repository for other common usage scenarios.

How to Install GroupDocs.Parser for .NET

1. Install from NuGet
Option 1: Using Package Manager GUI
  1. Open Visual Studio:

    • Load your solution/project.
  2. Access NuGet Package Manager:

    • Go to Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution.
    • Alternatively, right-click the solution or project in Solution Explorer and select Manage NuGet Packages.
  3. Search for GroupDocs.Parser:

    • Navigate to the Browse tab.
    • Type “GroupDocs.Parser” in the search box.
  4. Install the Package:

    • Click the Install button to add the latest version of GroupDocs.Parser to your project.
Option 2: Using Package Manager Console
  1. Open Visual Studio:

    • Load your solution/project.
  2. Open Package Manager Console:

    • Go to Tools -> NuGet Package Manager -> Package Manager Console.
  3. Install GroupDocs.Parser:

    • Type the command Install-Package GroupDocs.Parser and press Enter.
  4. Verify Installation:

    • GroupDocs.Parser should now be referenced in your application.
2. Handling .NET Framework and .NET Standard
  • Starting with version 24.2, GroupDocs.Parser is split into two packages: one for .NET Framework and one for .NET Standard.
  • For .NET Framework projects:
    • Ensure AutoGenerateBindingRedirects is enabled.
    • Add the following to your project file for unit tests:
<PropertyGroup>
    <AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
    <GenerateBindingRedirectsOutputType>true</GenerateBindingRedirectsOutputType>
</PropertyGroup>
3. Install from the Official GroupDocs Website
  1. Download GroupDocs.Parser:

    • Visit the official GroupDocs website and download the package.
  2. Unpack or Install:

    • Unzip the archive or run the MSI installer.
  3. Add a Reference in Visual Studio:

    • In Solution Explorer, right-click the References node of your project and select Add Reference.
    • If you used the MSI installer, select GroupDocs.Parser from the .NET tab. Otherwise, browse to the location of the GroupDocs.Parser.dll file.
  4. Confirm Reference:

    • Ensure GroupDocs.Parser appears under the References node in your project.
4. Additional Considerations
  • .NET Standard 2.0 Version:

    • This version has external references to several packages like System.Drawing.Common, System.Text.Encoding.CodePages, SkiaSharp, etc.
  • Linux Environment:

    • Install the following packages for proper functionality:
      • libgdiplus
      • libc6-dev
      • ttf-mscorefonts-installer (e.g., sudo apt-get install ttf-mscorefonts-installer)
    • Also, ensure SkiaSharp.NativeAssets.Linux.NoDependencies is installed.

GroupDocs.Parser for .NET Coding Samples

Code Sample 1: Extracting Text from a PDF Document

This code loads a PDF file (sample.pdf) and extracts its text content using the GetText() method. The extracted text is then displayed in the console.

using System;
using System.IO;
using GroupDocs.Parser;

public class ExtractTextFromPdf
{
    public static void Run()
    {
        // Load the PDF document
        using (Parser parser = new Parser("sample.pdf"))
        // GetText() returns a TextReader, or null if text extraction isn't supported
        using (TextReader reader = parser.GetText())
        {
            // Output the extracted text
            Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
        }
    }
}

Code Sample 2: Extracting Images from a Word Document

This code loads a Word document (sample.docx) and extracts all images found within the document. Each image is saved as a separate PNG file.

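using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

public class ExtractImagesFromWord
{
    public static void Run()
    {
        // Load the Word document
        using (Parser parser = new Parser("sample.docx"))
        {
            // Get images from the document (null if image extraction isn't supported)
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                return;
            }

            // Convert each image to PNG on save so the content matches the .png extension
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 1;
            foreach (PageImageArea image in images)
            {
                image.Save($"image{imageNumber++}.png", options);
            }
        }
    }
}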

Code Sample 3: Parsing Metadata from an Excel Spreadsheet

This code loads an Excel spreadsheet (sample.xlsx) and extracts its metadata, such as author, title, and creation date. The metadata is then displayed in the console.

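using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractMetadataFromExcel
{
    public static void Run()
    {
        // Load the Excel spreadsheet
        using (Parser parser = new Parser("sample.xlsx"))
        {
            // Get the document's metadata (null if metadata extraction isn't supported)
            IEnumerable<MetadataItem> metadata = parser.GetMetadata();
            if (metadata == null)
            {
                Console.WriteLine("Metadata extraction isn't supported");
                return;
            }

            // Output each metadata item as "name: value"
            foreach (MetadataItem item in metadata)
            {
                Console.WriteLine($"{item.Name}: {item.Value}");
            }
        }
    }
}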
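
The feature list earlier on this page also mentions retrieving the page count and searching for text. The following is a minimal sketch of both, not an official sample. It assumes that GetDocumentInfo() and Search() behave as described in the vendor's API reference and that Search() returns null for formats without search support. The file name sample.pdf, the class name, and the keyword "invoice" are placeholders chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

public class InspectAndSearchDocument
{
    public static void Run()
    {
        // "sample.pdf" is a placeholder path
        using (Parser parser = new Parser("sample.pdf"))
        {
            // Basic document information: detected file type and page count
            IDocumentInfo info = parser.GetDocumentInfo();
            Console.WriteLine($"File type: {info.FileType}, pages: {info.PageCount}");

            // Assumption: Search returns null when the format doesn't support text search
            IEnumerable<SearchResult> results = parser.Search("invoice");
            if (results == null)
            {
                Console.WriteLine("Search isn't supported for this document");
                return;
            }

            // Print the position and matched text of each occurrence
            foreach (SearchResult result in results)
            {
                Console.WriteLine($"At {result.Position}: {result.Text}");
            }
        }
    }
}
```

Checking for null before iterating follows the same pattern as the samples above, where an extraction method may report an unsupported feature by returning null instead of throwing.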

Tags

.NET | Text Parsing | Document Parsing | NuGet | Data Extraction | Metadata Extraction | Document Automation | OCR | PDF Parsing | Email Parsing | Spreadsheet Parsing | Presentation Parsing | Template-based Parsing | Cross Platform | High Performance | API | Batch Processing | Secure Parsing | Document Security | Scalable API | Microsoft Word | Excel | PowerPoint | PDF | Email | Barcode Recognition | Linux | macOS | Windows | Software Development | C# | Programming | Application Development | Content Extraction | Structured Data Parsing | Document Structure Analysis | Hyperlink Extraction | Bookmark Extraction | Table Extraction | Form Parsing | Image Extraction | File Attachment Extraction

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 was computed.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 was computed.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
.NET Core netcoreapp2.0 was computed.  netcoreapp2.1 was computed.  netcoreapp2.2 was computed.  netcoreapp3.0 was computed.  netcoreapp3.1 was computed. 
.NET Standard netstandard2.0 is compatible.  netstandard2.1 was computed. 
.NET Framework net461 was computed.  net462 was computed.  net463 was computed.  net47 was computed.  net471 was computed.  net472 was computed.  net48 was computed.  net481 was computed. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen40 was computed.  tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
24.10.0 1,075 11/1/2024
24.9.0 2,230 9/30/2024
24.8.0 29,459 8/30/2024
24.7.0 1,534 7/24/2024
24.6.0 2,710 6/29/2024
24.5.0 5,471 5/31/2024
24.4.0 5,862 4/23/2024
24.2.1 7,201 3/13/2024
24.2.0 1,309 2/29/2024
23.12.0 134,040 12/23/2023
23.11.0 36,739 11/24/2023
23.10.0 13,560 10/21/2023
23.8.0 65,528 8/18/2023
23.5.0 84,970 5/31/2023
23.3.0 16,093 3/31/2023
23.2.0 22,868 3/1/2023
22.11.1 25,281 1/17/2023
22.11.0 38,896 11/29/2022
22.8.0 74,431 8/12/2022
22.6.0 31,446 6/7/2022
22.2.0 37,310 2/25/2022
21.5.0 63,323 5/31/2021
21.2.0 50,949 2/22/2021
20.12.0 24,429 12/30/2020
20.10.0 169,301 10/27/2020
20.8.0 49,008 8/19/2020
20.6.1 47,470 6/30/2020
20.6.0 20,077 6/19/2020
20.5.0 35,188 5/8/2020
20.3.0 48,429 3/19/2020
20.1.0 35,725 1/31/2020
19.12.0 33,537 12/27/2019
19.11.0 28,458 11/22/2019
19.9.0 2,809 9/27/2019
19.5.0 3,039 5/29/2019
18.12.0 3,214 12/11/2018
18.11.0 2,701 11/8/2018
18.10.0 2,785 10/10/2018
18.9.0 2,772 9/5/2018
18.8.0 2,841 8/7/2018
18.7.0 2,791 7/3/2018
18.5.0 3,013 5/23/2018