Microsoft.ML.Tokenizers
1.0.0
Prefix Reserved
dotnet add package Microsoft.ML.Tokenizers --version 1.0.0
NuGet\Install-Package Microsoft.ML.Tokenizers -Version 1.0.0
<PackageReference Include="Microsoft.ML.Tokenizers" Version="1.0.0" />
paket add Microsoft.ML.Tokenizers --version 1.0.0
#r "nuget: Microsoft.ML.Tokenizers, 1.0.0"
// Install Microsoft.ML.Tokenizers as a Cake Addin
#addin nuget:?package=Microsoft.ML.Tokenizers&version=1.0.0
// Install Microsoft.ML.Tokenizers as a Cake Tool
#tool nuget:?package=Microsoft.ML.Tokenizers&version=1.0.0
About
Microsoft.ML.Tokenizers provides implementations of the tokenizers used in NLP transforms and models.
Key Features
- Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, Decoder
- BPE - Byte pair encoding model
- English Roberta model
- Tiktoken model
- Llama model
- Phi2 model
How to Use
using System;
using System.IO;
using System.Net.Http;
using Microsoft.ML.Tokenizers;
//
// Using Tiktoken Tokenizer
//
// initialize the tokenizer for `gpt-4` model
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string source = "Text tokenization is the process of splitting a string into a list of tokens.";
Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16
var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// 5 tokens from end: a list of tokens.
trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// 5 tokens from start: Text tokenization is the
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
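//
// Decode the ids back to text. Decode is part of the shared Tokenizer API, so the same call works
// with any tokenizer in this package. (Illustrative addition; the expected output assumes the
// gpt-4 encoding round-trips this string exactly.)
Console.WriteLine(tokenizer.Decode(ids));
// expected: Text tokenization is the process of splitting a string into a list of tokens.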
//
// Using Llama Tokenizer
//
// Open stream of remote Llama tokenizer model data file
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
// Create the Llama tokenizer using the remote stream
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991
Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
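//
// Inspect the individual tokens behind those ids. (Illustrative sketch: assumes the EncodeToTokens
// overload with an out normalizedText parameter and the EncodedToken Id/Value properties;
// output not verified here.)
foreach (EncodedToken token in llamaTokenizer.EncodeToTokens(input, out string? normalizedText))
    Console.WriteLine($"{token.Id} -> {token.Value}");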
Main Types
The main types provided by this library are:
- Microsoft.ML.Tokenizers.Tokenizer
- Microsoft.ML.Tokenizers.BpeTokenizer
- Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
- Microsoft.ML.Tokenizers.TiktokenTokenizer
- Microsoft.ML.Tokenizers.Normalizer
- Microsoft.ML.Tokenizers.PreTokenizer
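All of the concrete tokenizers above derive from the abstract Microsoft.ML.Tokenizers.Tokenizer type, so application code can be written once against the base class and reused with any of them. A minimal sketch (the FitsInBudget helper is illustrative, not part of the library):
using Microsoft.ML.Tokenizers;
// Hypothetical helper: accepts any Tokenizer-derived instance.
static bool FitsInBudget(Tokenizer tokenizer, string text, int maxTokens)
    => tokenizer.CountTokens(text) <= maxTokens;
Tokenizer gpt4Tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
Console.WriteLine(FitsInBudget(gpt4Tokenizer, "Text tokenization is the process of splitting a string into a list of tokens.", 20));
// prints: True (the string encodes to 16 tokens with the gpt-4 tokenizer, as shown above)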
Additional Documentation
Related Packages
Feedback & Contributing
Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
Dependencies
.NETStandard 2.0
- Google.Protobuf (>= 3.27.1)
- Microsoft.Bcl.HashCode (>= 6.0.0)
- Microsoft.Bcl.Memory (>= 9.0.0)
- System.Text.Json (>= 8.0.5)
net8.0
- Google.Protobuf (>= 3.27.1)
- System.Text.Json (>= 8.0.5)
NuGet packages (18)
Showing the top 5 NuGet packages that depend on Microsoft.ML.Tokenizers:
- Microsoft.KernelMemory.Core: The package contains the core logic and abstractions of Kernel Memory, not including extensions.
- Microsoft.KernelMemory.AI.OpenAI: Provide access to OpenAI LLM models in Kernel Memory to generate embeddings and text.
- Microsoft.ML.TorchSharp: Microsoft.ML.TorchSharp contains ML.NET integration of TorchSharp.
- Microsoft.KernelMemory.AI.TikToken: Provide TikToken tokenizers in Kernel Memory.
- Microsoft.Teams.AI: SDK focused on building AI based applications for Microsoft Teams.
GitHub repositories (10)
Showing the top 5 popular GitHub repositories that depend on Microsoft.ML.Tokenizers:
- microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps.
- microsoft/kernel-memory: RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
- microsoft/teams-ai: SDK focused on building AI based applications and extensions for Microsoft Teams and other Bot Framework channels.
- dotnet/ai-samples
- axzxs2001/Asp.NetCoreExperiment: All earlier projects have been moved to the **OleVersion** directory for preservation. New examples primarily target .NET 5.0; some upgrade previous examples, while others distill past working experience for reference.
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 1,665 | 11/14/2024 |
0.22.0 | 1,113 | 11/13/2024 |
0.22.0-preview.24526.1 | 1,632 | 10/27/2024 |
0.22.0-preview.24522.7 | 1,187 | 10/23/2024 |
0.22.0-preview.24378.1 | 92,369 | 7/29/2024 |
0.22.0-preview.24271.1 | 144,411 | 5/21/2024 |
0.22.0-preview.24179.1 | 141,442 | 4/2/2024 |
0.22.0-preview.24162.2 | 19,997 | 3/13/2024 |
0.21.1 | 92,829 | 1/18/2024 |
0.21.0 | 51,446 | 11/27/2023 |
0.21.0-preview.23511.1 | 51,610 | 10/13/2023 |
0.21.0-preview.23266.6 | 51,200 | 5/17/2023 |
0.21.0-preview.22621.2 | 2,106 | 12/22/2022 |
0.20.1 | 86,553 | 2/1/2023 |
0.20.1-preview.22573.9 | 2,299 | 11/24/2022 |
0.20.0 | 30,651 | 11/8/2022 |
0.20.0-preview.22551.1 | 233 | 11/1/2022 |