GroupDocs.Parser 24.8.0

There is a newer version of this package available.
See the version list below for details.

dotnet add package GroupDocs.Parser --version 24.8.0

NuGet\Install-Package GroupDocs.Parser -Version 24.8.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="GroupDocs.Parser" Version="24.8.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add GroupDocs.Parser --version 24.8.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: GroupDocs.Parser, 24.8.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install GroupDocs.Parser as a Cake Addin
#addin nuget:?package=GroupDocs.Parser&version=24.8.0

// Install GroupDocs.Parser as a Cake Tool
#tool nuget:?package=GroupDocs.Parser&version=24.8.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Advanced Document Parsing API for .NET

Important Note: Starting from 24.2.0, the GroupDocs.Parser package has been split into two distinct platform packages: .NET Standard and .NET Framework. The GroupDocs.Parser package is specifically designed to support the .NET Standard platform, making it compatible with .NET Core, .NET 5, .NET 6, etc. It includes backward compatibility improvements, allowing it to function with .NET Framework versions starting from 4.6.2. In addition, we have introduced the GroupDocs.Parser.NETFramework package, which is optimized to run seamlessly in the .NET Framework runtime because it includes all the GroupDocs product libraries in their respective .NET Framework versions. It is tailored specifically for .NET Framework users and offers better dependency resolution for those utilizing the .NET Framework. We hope these changes will enhance your experience and provide a more streamlined approach to using the GroupDocs.Parser package. If you have any further questions or concerns, please don't hesitate to reach out to our free support forum.

GroupDocs.Parser for .NET is a document parsing API that extracts text, metadata, and images from PDF, Word, Excel, PowerPoint, and other document formats. It targets .NET Standard and .NET Framework, so it works with .NET Core, .NET 5, and .NET 6 as well as older .NET Framework versions. It also provides PDF-specific parsing, email parsing, and template-based data extraction, and it runs on Windows, Linux, and macOS.

Text Extraction

Extract text from various document formats (PDF, Microsoft Word, Excel, etc.).
Extract text with its formatting retained, including font styles, sizes, and colors.
Search for specific text within a document and extract it.
Optical Character Recognition (OCR) support to extract text from images within documents.
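
The text search feature listed above can be sketched as follows. This is a minimal illustration rather than an official sample: the file name sample.pdf and the keyword "invoice" are placeholders, and the Search call and SearchResult members are used here as understood from the library's public API, so check them against the API reference for your version.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class SearchTextInDocument
{
    public static void Run()
    {
        // "sample.pdf" and the keyword "invoice" are placeholders
        using (Parser parser = new Parser("sample.pdf"))
        {
            // Search returns null when the format does not support text search
            IEnumerable<SearchResult> results = parser.Search("invoice");
            if (results == null)
            {
                Console.WriteLine("Search isn't supported for this document");
                return;
            }

            foreach (SearchResult result in results)
            {
                // Position is the offset of the match within the document text
                Console.WriteLine($"At {result.Position}: {result.Text}");
            }
        }
    }
}
```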

Metadata Extraction

Extract metadata from documents, including properties like author, title, subject, etc.
Extract document properties such as creation date, modification date, and more.
Extract field-specific data such as invoice numbers, dates, and other custom fields.

Image and Attachment Extraction

Extract images embedded within documents.
Extract file attachments from documents such as PDF and email files.
Extract and recognize barcodes from documents.

Document Structure Analysis

Analyze and extract information from structured documents like tables, lists, and paragraphs.
Extract tables and their content from documents.
Extract hyperlinks from documents.
Extract bookmarks from documents like PDF files.
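
As a rough sketch of the hyperlink extraction mentioned above, the following lists every link in a document. The file name sample.docx is a placeholder, and GetHyperlinks and the PageHyperlinkArea members are assumed to behave as described in the library's API reference.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractHyperlinks
{
    public static void Run()
    {
        // "sample.docx" is a placeholder file name
        using (Parser parser = new Parser("sample.docx"))
        {
            // GetHyperlinks returns null when the format does not support hyperlink extraction
            IEnumerable<PageHyperlinkArea> links = parser.GetHyperlinks();
            if (links == null)
            {
                Console.WriteLine("Hyperlink extraction isn't supported for this document");
                return;
            }

            foreach (PageHyperlinkArea link in links)
            {
                // Print the zero-based page index, the link text, and the target URL
                Console.WriteLine($"Page {link.Page.Index}: {link.Text} -> {link.Url}");
            }
        }
    }
}
```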

PDF-Specific Parsing

Specialized parsing capabilities for PDF documents, including text extraction, image extraction, and metadata retrieval.
Extract page count and other PDF-specific properties.
Extract and manage PDF bookmarks.

Email Parsing

Extract text, attachments, and metadata from email formats like EML, MSG, etc.
Extract email-specific properties such as sender, receiver, subject, and email body.
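
Attachment extraction from an email might look like the sketch below. The file name sample.msg is a placeholder, and the GetContainer call and ContainerItem members are used as understood from the library's public API rather than taken from this page.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ListEmailAttachments
{
    public static void Run()
    {
        // "sample.msg" is a placeholder file name
        using (Parser parser = new Parser("sample.msg"))
        {
            // GetContainer returns null when the format does not support container extraction
            IEnumerable<ContainerItem> attachments = parser.GetContainer();
            if (attachments == null)
            {
                Console.WriteLine("Container extraction isn't supported for this document");
                return;
            }

            foreach (ContainerItem item in attachments)
            {
                // FilePath is the attachment's path inside the container; Size is in bytes
                Console.WriteLine($"{item.FilePath} ({item.Size} bytes)");
            }
        }
    }
}
```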

Spreadsheet Parsing

Extract text, metadata, and other data from Excel spreadsheets.
Extract specific ranges, cells, or entire sheets from Excel documents.

Presentation Parsing

Extract text, images, and metadata from PowerPoint presentations.
Extract slide-specific content, including notes, shapes, and text.

Template-Based Data Extraction

Define and use templates to extract data based on specific document structures.
Template editor for creating and editing templates for structured data extraction.
Custom parsing logic to implement specific content extraction rules based on custom templates.

Advanced Features

Support for multiple file formats such as PDF, DOCX, XLSX, PPTX, RTF, TXT, and more.
Cross-Platform Compatibility: Works across Windows, Linux, and macOS.
Integration with .NET applications for seamless functionality.
High performance optimized for handling large documents efficiently.
Secure parsing that does not compromise the document's security and integrity.
Scalability to handle large volumes of documents in batch processing.

Additional Features

Retrieve the number of pages in a document.
Extract data from forms and other interactive elements within documents.
Support for content-aware parsing to detect and extract specific types of data.
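
Retrieving the page count can be sketched with the document info call, as below. The file name sample.pdf is a placeholder, and GetDocumentInfo with its FileType, PageCount, and Size members is assumed from the library's API reference; verify it against the version you install.

```csharp
using System;
using GroupDocs.Parser;

public class PrintDocumentInfo
{
    public static void Run()
    {
        // "sample.pdf" is a placeholder file name
        using (Parser parser = new Parser("sample.pdf"))
        {
            // Basic document information: detected file type, page count, and size in bytes
            var info = parser.GetDocumentInfo();

            Console.WriteLine($"FileType: {info.FileType}");
            Console.WriteLine($"PageCount: {info.PageCount}");
            Console.WriteLine($"Size: {info.Size}");
        }
    }
}
```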

Supported Document Formats

Word Processing

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
DOC - Microsoft Word Document	✔	✔	✔	✔	✔	✔	✔		✔	✔
DOT - Microsoft Word Document Template	✔	✔	✔	✔	✔	✔	✔		✔	✔
DOCX - Office Open XML Document	✔	✔	✔	✔	✔	✔	✔		✔	✔
DOCM - Office Open XML Macro-Enabled Document	✔	✔	✔	✔	✔	✔	✔		✔	✔
DOTX - Office Open XML Document Template	✔	✔	✔	✔	✔	✔	✔		✔	✔
DOTM - Office Open XML Document Macro-Enabled Template	✔	✔	✔	✔	✔	✔	✔		✔	✔
TXT - Plain text		✔
ODT - Open Document Text	✔	✔	✔	✔	✔			✔	✔	✔
OTT - Open Document Text Template	✔	✔	✔	✔	✔			✔	✔	✔
RTF - Rich Text Format	✔	✔	✔	✔	✔			✔	✔	✔

PDF

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
PDF - Portable Document Format	✔	✔	✔		✔	✔	✔	✔	✔	✔	✔

Markup

Document Type	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Metadata
XHTML - Extensible Hypertext Markup Language File	✔		✔
MHTML - MIME HTML File	✔		✔
MD - Markdown	✔	✔ (formatted text is not supported)
XML - XML File	✔

Ebook

Document Type	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Metadata	Extract Containers and Attachments	Scan Barcode
CHM - Compiled HTML Help File	✔	✔	✔	✔	✔
EPUB - Digital E-Book File Format	✔	✔	✔	✔	✔
FB2 - FictionBook 2.0 File	✔	✔
MOBI - Mobipocket	✔
AZW3 - Kindle Format 8	✔

Spreadsheet

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments
XLS - Microsoft Excel Spreadsheet	✔	✔	✔	✔	✔	✔	✔	✔
XLT - Microsoft Excel Template	✔	✔	✔	✔	✔	✔	✔	✔
XLSX - Office Open XML Spreadsheet	✔	✔	✔	✔	✔	✔	✔	✔
XLSM - Office Open XML Macro-Enabled Spreadsheet	✔	✔	✔	✔	✔	✔	✔	✔
XLSB - Office Open XML Binary Spreadsheet	✔	✔		✔	✔	✔	✔	✔
XLTX - Office Open XML Spreadsheet Template	✔	✔	✔	✔	✔	✔	✔	✔
XLTM - Office Open XML Macro-Enabled Spreadsheet Template	✔	✔	✔	✔	✔	✔	✔	✔
ODS - Open Document Spreadsheet	✔	✔		✔	✔	✔
OTS - Open Document Spreadsheet Template	✔	✔		✔	✔	✔
CSV - Comma Separated Values		✔
XLA - Excel Add-In File	✔	✔	✔	✔	✔	✔	✔
XLAM - Excel Open XML Macro-Enabled Add-In	✔	✔	✔	✔	✔	✔	✔
NUMBERS - Apple iWork Numbers	✔	✔		✔		✔

Presentation

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Scan Barcode
PPT - PowerPoint Presentation	✔	✔	✔	✔	✔	✔	✔	✔	✔
PPS - PowerPoint Slideshow	✔	✔	✔	✔	✔	✔	✔	✔	✔
POT - PowerPoint Template	✔	✔	✔	✔	✔	✔	✔	✔	✔
PPTX - Office Open XML Presentation	✔	✔	✔	✔	✔	✔	✔	✔	✔
PPTM - Office Open XML Macro-Enabled Presentation	✔	✔	✔	✔	✔	✔	✔	✔	✔
POTX - Office Open XML Presentation Template	✔	✔	✔	✔	✔	✔	✔	✔	✔
POTM - Office Open XML Macro-Enabled Presentation Template	✔	✔	✔	✔	✔	✔	✔	✔	✔
PPSX - Office Open XML Presentation Slideshow	✔	✔	✔	✔	✔	✔	✔	✔	✔
PPSM - Office Open XML Macro-Enabled Presentation Slideshow	✔	✔	✔	✔	✔	✔	✔	✔	✔
ODP - Open Document Presentation	✔	✔		✔	✔	✔			✔
OTP - Open Document Presentation Template	✔	✔		✔	✔	✔			✔

Email

Document Type	Extract Text (Accurate)	Extract Structured Text and Formatted Text	Extract Metadata	Extract Images	Extract Containers and Attachments
PST - Outlook Personal Information Store File					✔
OST - Outlook Offline Data File					✔
EML - E-Mail Message	✔	✔	✔	✔	✔
EMLX - Apple Mail Message	✔	✔	✔	✔	✔
MSG - Outlook Mail Message	✔	✔	✔	✔	✔

Note

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
ONE - OneNote Document		✔

Image*

Document Type	Extract Text (Accurate)	Extract Table of Contents
BMP - Bitmap Image file	✔	✔
GIF - Graphics Interchange Format		✔
JP2 - JPEG 2000		✔
JPG, JPEG - JPEG Image file	✔	✔
PNG - Portable Network Graphics	✔	✔
TIF, TIFF - Tagged Image File Format	✔	✔
DICOM - DICOM (Digital Imaging and Communications in Medicine)		✔
DJVU - DjVu File Format	✔	✔
EMF - Enhanced metafile		✔
J2K - JPEG 2000		✔
PS - PostScript File Format		✔
PSD - Photoshop Document		✔
SVG - Scalable Vector Graphics file		✔
SVGZ - Scalable Vector Graphics file (with gzip compression)		✔
WEBP - WebP Image File Format		✔
WMF - Microsoft Windows Metafile		✔

Database

Document Type	Parse Document by Template	Extract Text (Accurate)	Extract Text (Raw)	Extract Structured Text and Formatted Text	Extract Text Areas	Extract Metadata	Extract Images	Extract Containers and Attachments	Parse Form Data	Extract Table of Contents	Scan Barcode
ADO.NET		✔									✔

Platform Independence

GroupDocs.Parser for .NET does not require any external software or third-party tool to be installed. GroupDocs.Parser for .NET supports any 32-bit or 64-bit operating system where .NET or Mono framework is installed. The other details are as follows:

Microsoft Windows: Microsoft Windows Desktop (x86, x64) (XP & up), Microsoft Windows Server (x86, x64) (2000 & up), Windows Azure
Mac OS: Mac OS X
Linux: Linux (Ubuntu, OpenSUSE, CentOS and others)
Development Environments: Microsoft Visual Studio (2010 & up), Xamarin.Android, Xamarin.iOS, Xamarin.Mac, MonoDevelop 2.4 and later.
Supported Frameworks: GroupDocs.Parser for .NET supports .NET and Mono frameworks.

Get Started

To try GroupDocs.Parser for .NET, run Install-Package GroupDocs.Parser from the Package Manager Console in Visual Studio to fetch the GroupDocs.Parser assembly and reference it in your project. If you already have GroupDocs.Parser for .NET and want to upgrade, run Update-Package GroupDocs.Parser to get the latest version.

Please check the GitHub Repository for other common usage scenarios.

How to Install GroupDocs.Parser for .NET

1. Install from NuGet

Option 1: Using Package Manager GUI

Open Visual Studio:
- Load your solution/project.
Access NuGet Package Manager:
- Go to Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution.
- Alternatively, right-click the solution or project in Solution Explorer and select Manage NuGet Packages.
Search for GroupDocs.Parser:
- Navigate to the Browse tab.
- Type “GroupDocs.Parser” in the search box.
Install the Package:
- Click the Install button to add the latest version of GroupDocs.Parser to your project.

Option 2: Using Package Manager Console

Open Visual Studio:
- Load your solution/project.
Open Package Manager Console:
- Go to Tools -> NuGet Package Manager -> Package Manager Console.
Install GroupDocs.Parser:
- Type the command Install-Package GroupDocs.Parser and press Enter.
Verify Installation:
- GroupDocs.Parser should now be referenced in your application.

2. Handling .NET Framework and .NET Standard

Starting with version 24.2, GroupDocs.Parser is split into two packages: one for .NET Framework and one for .NET Standard.
For .NET Framework projects:
- Ensure AutoGenerateBindingRedirects is enabled.
- Add the following to your project file for unit tests:

<PropertyGroup>
    <AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
    <GenerateBindingRedirectsOutputType>true</GenerateBindingRedirectsOutputType>
</PropertyGroup>

3. Install from the Official GroupDocs Website

Download GroupDocs.Parser:
- Visit the official GroupDocs website and download the package.
Unpack or Install:
- Unzip the archive or run the MSI installer.
Add a Reference in Visual Studio:
- In Solution Explorer, right-click the References node of your project and select Add Reference.
- If you used the MSI installer, select GroupDocs.Parser from the .NET tab. Otherwise, browse to the location of the GroupDocs.Parser.dll file.
Confirm Reference:
- Ensure GroupDocs.Parser appears under the References node in your project.

4. Additional Considerations

.NET Standard 2.0 Version:
- This version has external references to several packages like System.Drawing.Common, System.Text.Encoding.CodePages, SkiaSharp, etc.
Linux Environment:
- Install the following packages for proper functionality:
  - libgdiplus
  - libc6-dev
  - ttf-mscorefonts-installer (e.g., sudo apt-get install ttf-mscorefonts-installer)
- Also, ensure SkiaSharp.NativeAssets.Linux.NoDependencies is installed.

GroupDocs.Parser for .NET Coding Samples

Code Sample 1: Extracting Text from a PDF Document

This code loads a PDF file (sample.pdf) and extracts its text content using the GetText() method. The extracted text is then displayed in the console.

using System;
using System.IO;
using GroupDocs.Parser;

public class ExtractTextFromPdf
{
    public static void Run()
    {
        // Load the PDF document
        using (Parser parser = new Parser("sample.pdf"))
        // GetText() returns a TextReader, or null if text extraction is not supported
        using (TextReader reader = parser.GetText())
        {
            // Output the extracted text
            Console.WriteLine(reader == null
                ? "Text extraction isn't supported"
                : reader.ReadToEnd());
        }
    }
}

Code Sample 2: Extracting Images from a Word Document

This code loads a Word document (sample.docx) and extracts all images found within the document. Each image is saved as a separate PNG file.

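using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

public class ExtractImagesFromWord
{
    public static void Run()
    {
        // Load the Word document
        using (Parser parser = new Parser("sample.docx"))
        {
            // Get images from the document (null if image extraction is not supported)
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                return;
            }

            // Convert each image to PNG when saving, regardless of its original format
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 1;
            foreach (PageImageArea image in images)
            {
                image.Save($"image{imageNumber++}.png", options);
            }
        }
    }
}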

Code Sample 3: Parsing Metadata from an Excel Spreadsheet

This code loads an Excel spreadsheet (sample.xlsx) and extracts its metadata, such as author, title, and creation date. The metadata is then displayed in the console.

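using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractMetadataFromExcel
{
    public static void Run()
    {
        // Load the Excel spreadsheet
        using (Parser parser = new Parser("sample.xlsx"))
        {
            // Get the document's metadata (null if metadata extraction is not supported)
            IEnumerable<MetadataItem> metadata = parser.GetMetadata();
            if (metadata == null)
            {
                Console.WriteLine("Metadata extraction isn't supported");
                return;
            }

            // Output each metadata item as "name: value"
            foreach (MetadataItem item in metadata)
            {
                Console.WriteLine($"{item.Name}: {item.Value}");
            }
        }
    }
}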

Product	Compatible and additional computed target framework versions.
.NET	net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.
.NET Core	netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed.
.NET Standard	netstandard2.0 is compatible. netstandard2.1 was computed.
.NET Framework	net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed.
MonoAndroid	monoandroid was computed.
MonoMac	monomac was computed.
MonoTouch	monotouch was computed.
Tizen	tizen40 was computed. tizen60 was computed.
Xamarin.iOS	xamarinios was computed.
Xamarin.Mac	xamarinmac was computed.
Xamarin.TVOS	xamarintvos was computed.
Xamarin.WatchOS	xamarinwatchos was computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last updated
24.10.0	1,075	11/1/2024
24.9.0	2,230	9/30/2024
24.8.0	29,459	8/30/2024
24.7.0	1,534	7/24/2024
24.6.0	2,710	6/29/2024
24.5.0	5,471	5/31/2024
24.4.0	5,862	4/23/2024
24.2.1	7,201	3/13/2024
24.2.0	1,309	2/29/2024
23.12.0	134,040	12/23/2023
23.11.0	36,739	11/24/2023
23.10.0	13,560	10/21/2023
23.8.0	65,528	8/18/2023
23.5.0	84,970	5/31/2023
23.3.0	16,093	3/31/2023
23.2.0	22,868	3/1/2023
22.11.1	25,281	1/17/2023
22.11.0	38,896	11/29/2022
22.8.0	74,431	8/12/2022
22.6.0	31,446	6/7/2022
22.2.0	37,310	2/25/2022
21.5.0	63,323	5/31/2021
21.2.0	50,949	2/22/2021
20.12.0	24,429	12/30/2020
20.10.0	169,301	10/27/2020
20.8.0	49,008	8/19/2020
20.6.1	47,470	6/30/2020
20.6.0	20,077	6/19/2020
20.5.0	35,188	5/8/2020
20.3.0	48,429	3/19/2020
20.1.0	35,725	1/31/2020
19.12.0	33,537	12/27/2019
19.11.0	28,458	11/22/2019
19.9.0	2,809	9/27/2019
19.5.0	3,039	5/29/2019
18.12.0	3,214	12/11/2018
18.11.0	2,701	11/8/2018
18.10.0	2,785	10/10/2018
18.9.0	2,772	9/5/2018
18.8.0	2,841	8/7/2018
18.7.0	2,791	7/3/2018
18.5.0	3,013	5/23/2018

https://releases.groupdocs.com/parser/net/release-notes/2024/groupdocs-parser-for-net-24-8-release-notes/

Archive

Document Type	Extract Images	Extract Containers and Attachments
7Z* - 7Z File	✔	✔
ZIP - Zipped File	✔	✔
RAR - Rar File	✔	✔
TAR - Tar File	✔	✔
GZ - GZip file	✔	✔
BZ2 - BZip2 File	✔	✔