drittich.SemanticSlicer 1.1.0

.NET CLI:
dotnet add package drittich.SemanticSlicer --version 1.1.0

Package Manager:
NuGet\Install-Package drittich.SemanticSlicer -Version 1.1.0
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference:
<PackageReference Include="drittich.SemanticSlicer" Version="1.1.0" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.

Paket CLI:
paket add drittich.SemanticSlicer --version 1.1.0

Script & Interactive:
#r "nuget: drittich.SemanticSlicer, 1.1.0"
The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or the source code of the script to reference the package.

Cake:
// Install drittich.SemanticSlicer as a Cake Addin
#addin nuget:?package=drittich.SemanticSlicer&version=1.1.0

// Install drittich.SemanticSlicer as a Cake Tool
#tool nuget:?package=drittich.SemanticSlicer&version=1.1.0

SemanticSlicer

SemanticSlicer is a C# library for slicing text data into smaller pieces while attempting to preserve context.

GitHub: https://github.com/drittich/SemanticSlicer

Overview

This library accepts text and breaks it into smaller chunks, which is typically useful when creating embeddings for LLMs.

Sample Usage

Simple text document:

// The default options use text separators, a max chunk size of 1,000 tokens,
// and cl100k_base encoding to count tokens.
var slicer = new Slicer();
var text = File.ReadAllText("MyDocument.txt");
var documentChunks = slicer.GetDocumentChunks(text);

Markdown document:

var options = new SlicerOptions { MaxChunkTokenCount = 600, Separators = Separators.Markdown };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.md");
var documentChunks = slicer.GetDocumentChunks(text);

HTML document:

var options = new SlicerOptions { MaxChunkTokenCount = 600, Separators = Separators.Html };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.html");
var documentChunks = slicer.GetDocumentChunks(text);

Removing HTML tags:

For any content, you can choose to strip HTML tags from the chunks to reduce the number of tokens. The inner text is preserved:

var options = new SlicerOptions { MaxChunkTokenCount = 600, Separators = Separators.Html, StripHtml = true };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.html");
var documentChunks = slicer.GetDocumentChunks(text);

Custom separators:

You can pass in your own list of separators if you wish, e.g., to add support for other document types.

Chunk Order

Chunks are returned in the order they were found in the document. Each chunk contains an Index property you can use to put the chunks back in order if necessary.
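
As a rough illustration of restoring order, the sketch below sorts chunks by Index after they have come back from storage in arbitrary order. ChunkStandIn and its Content property are hypothetical stand-ins so the example runs without the package; only the Index property is taken from the description above, and the library's real chunk type may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Chunks as they might come back from a database or vector store,
// no longer in document order.
var stored = new List<ChunkStandIn>
{
    new(2, "third"),
    new(0, "first"),
    new(1, "second"),
};

// Sorting by Index restores the original document order.
var ordered = stored.OrderBy(c => c.Index).ToList();
var rebuilt = string.Join(" ", ordered.Select(c => c.Content));
Console.WriteLine(rebuilt);

// Hypothetical stand-in for the library's chunk type (an assumption);
// only the Index property is described in this README.
record ChunkStandIn(int Index, string Content);
```

The same OrderBy call should apply to the chunks returned by GetDocumentChunks, assuming Index is an integer that increases with position in the document.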

Additional Metadata

You can pass any additional metadata you wish in as a dictionary, and it will be returned with each document chunk, so it's easy to persist. You might use the metadata to store the document id, title or last modified date.

var slicer = new Slicer();
var text = File.ReadAllText("MyDocument.txt");
var metadata = new Dictionary<string, object?>();
metadata["Id"] = 123;
metadata["FileName"] = "MyDocument.txt";
var documentChunks = slicer.GetDocumentChunks(text, metadata);
// All chunks returned will have a Metadata property with the data you passed in.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions or feedback, please open an issue on this repository.

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 was computed.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 was computed.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
.NET Core netcoreapp3.0 was computed.  netcoreapp3.1 was computed. 
.NET Standard netstandard2.1 is compatible. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
1.4.0 442 11/9/2024
1.3.4 91 11/9/2024
1.2.0 4,397 12/3/2023
1.1.0 137 12/2/2023
1.0.0 168 11/13/2023

- Added support for chunking HTML documents
- Added support for stripping HTML tags