EasyReasy.KnowledgeBase.BertTokenization
1.0.0
dotnet add package EasyReasy.KnowledgeBase.BertTokenization --version 1.0.0
NuGet\Install-Package EasyReasy.KnowledgeBase.BertTokenization -Version 1.0.0
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="EasyReasy.KnowledgeBase.BertTokenization" Version="1.0.0" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="EasyReasy.KnowledgeBase.BertTokenization" Version="1.0.0" />
<PackageReference Include="EasyReasy.KnowledgeBase.BertTokenization" />
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add EasyReasy.KnowledgeBase.BertTokenization --version 1.0.0
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
#r "nuget: EasyReasy.KnowledgeBase.BertTokenization, 1.0.0"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package EasyReasy.KnowledgeBase.BertTokenization@1.0.0
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=EasyReasy.KnowledgeBase.BertTokenization&version=1.0.0
#tool nuget:?package=EasyReasy.KnowledgeBase.BertTokenization&version=1.0.0
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
EasyReasy.KnowledgeBase.BertTokenization
BERT-based tokenizer implementation for EasyReasy.KnowledgeBase. Provides accurate token counting and text processing using the FastBertTokenizer library with BERT base uncased vocabulary.
Features
- 🤖 BERT Tokenization: Industry-standard BERT base uncased vocabulary
- 📊 Token Counting: Accurate token count for chunking and size management
- 🔄 Encode/Decode: Full tokenization and detokenization support
- ⚡ Async Creation: Async initialization with Hugging Face model loading
- 🛡️ Truncation Control: Configurable maximum token limits
Quick Start
Installation
dotnet add package EasyReasy.KnowledgeBase.BertTokenization
Basic Usage
using EasyReasy.KnowledgeBase.BertTokenization;
// Create tokenizer
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
// Count tokens
int tokenCount = tokenizer.CountTokens("Hello, world!");
// Encode text to tokens
int[] tokens = tokenizer.Encode("This is a test sentence.");
// Decode tokens back to text
string decoded = tokenizer.Decode(tokens);
Console.WriteLine($"Token count: {tokenCount}");
Console.WriteLine($"Tokens: [{string.Join(", ", tokens)}]");
Console.WriteLine($"Decoded: {decoded}");
Using with KnowledgeBase
using EasyReasy.KnowledgeBase.BertTokenization;
using EasyReasy.KnowledgeBase.Chunking;
// Create tokenizer for use with document processing
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
// Use with section reader factory
SectionReaderFactory factory = new SectionReaderFactory(embeddingService, tokenizer);
using Stream stream = File.OpenRead("document.md");
SectionReader reader = factory.CreateForMarkdown(stream, maxTokensPerChunk: 100, maxTokensPerSection: 1000);
await foreach (List<KnowledgeFileChunk> chunks in reader.ReadSectionsAsync())
{
// Process sections with accurate token counts
}
Custom Configuration
// Configure maximum encoding tokens to prevent truncation
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
tokenizer.MaxEncodingTokens = 4096; // Default is 2048
// Count tokens for longer texts
int tokenCount = tokenizer.CountTokens("Very long document text...");
API Reference
BertTokenizer
Creation
static Task<BertTokenizer> CreateAsync()
static Task<BertTokenizer> CreateAsync(FastBertTokenizer.BertTokenizer tokenizer)
Properties
MaxEncodingTokens
: Maximum tokens allowed during encoding (default: 2048)
Methods
CountTokens(string text)
: Count tokens in textEncode(string text)
: Encode text to token arrayDecode(int[] tokens)
: Decode tokens back to text
Implementation Details
- Vocabulary: Uses BERT base uncased model from Hugging Face
- Token Range: Handles standard BERT vocabulary (30,522 tokens)
- Truncation: Automatically truncates at
MaxEncodingTokens
limit - Performance: Optimized for repeated tokenization operations
- Memory: Loads vocabulary once during initialization
Dependencies
- .NET 8.0+: Modern async/await patterns
- EasyReasy.KnowledgeBase: Core interfaces (
ITokenizer
) - FastBertTokenizer: High-performance BERT tokenization library
License
MIT
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.
-
net8.0
- EasyReasy.KnowledgeBase (>= 1.0.0)
- FastBertTokenizer (>= 1.0.28)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last Updated |
---|---|---|
1.0.0 | 11 | 8/18/2025 |