EasyReasy.KnowledgeBase.BertTokenization

← Back to EasyReasy System

BERT-based tokenizer implementation for EasyReasy.KnowledgeBase. Provides accurate token counting and text processing using the FastBertTokenizer library with BERT base uncased vocabulary.

Features

  • 🤖 BERT Tokenization: Industry-standard BERT base uncased vocabulary
  • 📊 Token Counting: Accurate token counts for chunking and size management
  • 🔄 Encode/Decode: Full tokenization and detokenization support
  • ⚡ Async Creation: Asynchronous initialization that loads the BERT base uncased model from Hugging Face
  • 🛡️ Truncation Control: Configurable maximum token limits

Quick Start

Installation

dotnet add package EasyReasy.KnowledgeBase.BertTokenization

Basic Usage

using EasyReasy.KnowledgeBase.BertTokenization;

// Create tokenizer
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();

// Count tokens
int tokenCount = tokenizer.CountTokens("Hello, world!");

// Encode text to tokens
int[] tokens = tokenizer.Encode("This is a test sentence.");

// Decode tokens back to text
string decoded = tokenizer.Decode(tokens);

Console.WriteLine($"Token count: {tokenCount}");
Console.WriteLine($"Tokens: [{string.Join(", ", tokens)}]");
Console.WriteLine($"Decoded: {decoded}");

Using with KnowledgeBase

using EasyReasy.KnowledgeBase.BertTokenization;
using EasyReasy.KnowledgeBase.Chunking;

// Create tokenizer for use with document processing
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();

// Use with section reader factory
SectionReaderFactory factory = new SectionReaderFactory(embeddingService, tokenizer);
using Stream stream = File.OpenRead("document.md");
SectionReader reader = factory.CreateForMarkdown(stream, maxTokensPerChunk: 100, maxTokensPerSection: 1000);

await foreach (List<KnowledgeFileChunk> chunks in reader.ReadSectionsAsync())
{
    // Each yielded list contains the chunks for one section, sized using accurate token counts
}
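
For illustration, the loop body might re-check each chunk against the 100-token budget configured above. The Content property on KnowledgeFileChunk is assumed here for the sketch; the actual member name may differ.

foreach (KnowledgeFileChunk chunk in chunks)
{
    // Re-count tokens for the chunk text (Content is an assumed property name)
    int chunkTokens = tokenizer.CountTokens(chunk.Content);
    Console.WriteLine($"Chunk: {chunkTokens} tokens");
}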

Custom Configuration

// Raise the maximum encoding token limit so longer inputs are not truncated as early
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
tokenizer.MaxEncodingTokens = 4096; // Default is 2048

// Count tokens for longer texts
int tokenCount = tokenizer.CountTokens("Very long document text...");

API Reference

BertTokenizer

Creation

static Task<BertTokenizer> CreateAsync()
static Task<BertTokenizer> CreateAsync(FastBertTokenizer.BertTokenizer tokenizer)
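
The parameterless overload loads the default BERT base uncased vocabulary; the second overload wraps a FastBertTokenizer instance you have already configured. A minimal sketch of the second form, assuming FastBertTokenizer's LoadFromHuggingFaceAsync loader (verify against the FastBertTokenizer version you use):

// Load the underlying FastBertTokenizer instance yourself
FastBertTokenizer.BertTokenizer inner = new FastBertTokenizer.BertTokenizer();
await inner.LoadFromHuggingFaceAsync("bert-base-uncased");

// Hand the preloaded instance to the EasyReasy wrapper
BertTokenizer tokenizer = await BertTokenizer.CreateAsync(inner);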

Properties

  • MaxEncodingTokens: Maximum tokens allowed during encoding (default: 2048)

Methods

  • CountTokens(string text): Count tokens in text
  • Encode(string text): Encode text to token array
  • Decode(int[] tokens): Decode tokens back to text

Implementation Details

  • Vocabulary: Uses BERT base uncased model from Hugging Face
  • Token Range: Handles standard BERT vocabulary (30,522 tokens)
  • Truncation: Automatically truncates at MaxEncodingTokens limit
  • Performance: Optimized for repeated tokenization operations
  • Memory: Loads vocabulary once during initialization
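
To observe the truncation behavior described above, lower the limit and encode a deliberately long input. This is a sketch of the documented behavior, not additional API surface:

BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
tokenizer.MaxEncodingTokens = 16; // deliberately small so truncation is easy to observe

string longText = string.Join(" ", Enumerable.Repeat("tokenization", 100));
int[] tokens = tokenizer.Encode(longText);

// Expected to print at most 16, because input beyond MaxEncodingTokens is truncated
Console.WriteLine(tokens.Length);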

Dependencies

  • .NET 8.0+: Modern async/await patterns
  • EasyReasy.KnowledgeBase: Core interfaces (ITokenizer)
  • FastBertTokenizer: High-performance BERT tokenization library
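
Because BertTokenizer implements ITokenizer from EasyReasy.KnowledgeBase, it can stand in wherever the core library expects a tokenizer. A minimal sketch, assuming ITokenizer declares the CountTokens method shown above (the namespace ITokenizer lives in is not shown here and may need adjusting):

ITokenizer tokenizer = await BertTokenizer.CreateAsync();

// Any component written against ITokenizer now gets BERT-accurate token counts
int tokens = tokenizer.CountTokens("Interfaces keep the chunking pipeline tokenizer-agnostic.");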

License

MIT
