Searchify 1.0.0
dotnet add package Searchify --version 1.0.0
NuGet\Install-Package Searchify -Version 1.0.0
<PackageReference Include="Searchify" Version="1.0.0" />
paket add Searchify --version 1.0.0
#r "nuget: Searchify, 1.0.0"
// Install Searchify as a Cake Addin #addin nuget:?package=Searchify&version=1.0.0 // Install Searchify as a Cake Tool #tool nuget:?package=Searchify&version=1.0.0
Searchify
<img alt="Searchify Icon" src="https://raw.githubusercontent.com/Kir-Antipov/Searchify/HEAD/media/icon.png" width="128">
Searchify
is a high-performance .NET library designed to enhance search operations. It brings together efficient fuzzy search, spell checking, and data indexing capabilities - providing a practical solution for quick and precise data retrieval. With easy adaptability to large data sets and an innate ability to manage misspelled queries and flexible search parameters, it ensures optimized search performance. Ideal for data-heavy .NET projects, Searchify
delivers solid performance without overcomplicating the process.
NuGet Packages
Package | Latest Version |
---|---|
Searchify | |
Searchify.Abstractions |
Getting Started
Installation
To get started, first add the Searchify package to your project. You can do this by running the following command:
dotnet add package Searchify
Alternatively, you can install it via the Package Manager Console with this command:
Install-Package Searchify
Usage
SearchProvider
SearchProvider
is the magnum opus of this library that combines all its other features to deliver straightforward yet powerful search functionality.
SearchProvider.Create
provides a range of useful overloads, empowering you to index any collection via any property of any type. However, since string search queries are used far more often than anything else in the real world scenarios, we will focus on them. If you have a collection of articles and wish to index them by their titles to later perform searches on them, it's as easy as:
var search = SearchProvider.Create(articles, x => x.Title);
var result = search.Search("Some Title");
The new search provider indexes all supplied titles, utilizing the comprising words (you can provide a custom tokenizer), and provides fuzzy search based on the Levenshtein distance (a custom distance metric or an entire spell checker can also be used).
To delve deeper, let's consider the use case that inspired me to develop this library. I was working on a Steam-related application that necessitated a user search feature across games and their corresponding achievements. Considering users are unlikely to remember the exact wording of an achievement or its full description, fuzzy search support was crucial. Hence, I adopted approximate string matching via the Levenshtein distance and implemented a basic loop that compared a query against every entry in a given collection. This approach functioned reasonably well until I discovered that PayDay 2 has over 1300 achievements and is not even the holder of the most significant achievement count on Steam! As you can guess, the rudimentary for loop ain't gonna cut it for such scenarios. A single search would take between 100 to 150 ms on my machine, which was not that disastrous, but still quite and quite noticeable. Instead of settling for displaying a spinning preloader of some sort and only showing query results when a user finishes typing, I craved real-time results and, therefore, created this library. Now, it is absolutely effortless for anybody to index anything based on any amount of any properties and carry out instantaneous fuzzy searches with built-in spell checking:
var search = SearchProvider.Create(
achievements,
[
x => x.Name,
x => x.Description,
],
DistanceMetric.LevenshteinIgnoreCase
);
// ...
var query = "compete da heist stealthy with minigun and r-laucnher in shadow raid";
var result = search.Search(query, new() { MaxSuggestions = 3 });
// ...
In the example above, we indexed a collection of achievement objects by their names and descriptions using a case-insensitive version of the Levenshtein distance. The outcome of the executed query is as follows:
1. In the Shadow Raid job, complete the heist in stealth while having the Vulcan Minigun and HRL-7 Rocket Launcher equipped.
2. Complete any day of a heist in stealth with a Locomotive 12G shotgun modified with the "Silent Killer Suppressor" equipped. Unlocks the "Suppressed Barrel" for the Street Sweeper shotgun, "Rutger" mask, "Banana Peel" material and "Banana" pattern.
3. In the Art Gallery job, complete the heist in stealth within 4 minutes with each crew member wearing the Improved Combined Tactical Vest and no Armor Bag deployable equipped. Unlocks the "Classic Stock" for the AK weapon family, "2 Piece Stock" for the AK and CAR weapon families, "Pachy" mask, "Fossil" material and "Prehistoric Predator" pattern.
Elapsed: 2,3115 ms
Processing a natural language query and locating the best matches in the indexed collection took only 2 milliseconds on my laptop (compared with 100-150 ms using the initial approach). By integrating a synonym-normalization layer with built-in spell checkers, you could even create a fully-featured in-memory search engine - and, admit it, that's pretty darn neat!
SpellChecker
SpellChecker
does exactly what you think it does - it checks the spelling of the input.
There are several built-in methods to create a spell checker, one of which is through a provided vocabulary and a distance metric used to determine the similarity between words:
var spellChecker = SpellChecker.Create(
["book", "books", "cake", "boo", "boon", "cook", "cake", "cape", "cart"],
DistanceMetric.Levenshtein
);
// IsCorrect = false, Suggestions = ["cook"]
var result = spellChecker.CheckSpelling("cool");
// true, "cook"
var isWordCorrected = spellChecker.TryFixSpelling("cool", out var correctedWord);
However, you also have the freedom to create your own SpellChecker
according to your project requirements.
Levenshtein
Levenshtein
offers a bunch of methods useful for approximate string matching based on the Levenshtein distance algorithm.
The API is designed to be familiar to those accustomed to using Regex
:
IsMatch
- Determines if there is a subsequence within the pattern that, given a maximum specified distance for edits, matches the input sequence.Match
- Searches for the first subsequence within the pattern sequence that matches the input sequence when edited to within a specified maximum distance.LastMatch
- Searches for the last subsequence within the pattern sequence that matches the input sequence when edited to within a specified maximum distance.Matches
- Searches for all subsequences within the pattern sequence that match the input sequence when edited to within a specified maximum distance.EnumerateMatches
- Searches for all subsequences within the pattern sequence that match the input sequence when edited to within a specified maximum distance.Count
- Searches for all subsequences within the pattern sequence that match the input sequence when edited to within a specified maximum distance and returns the number of matches.
For its key feature, Levenshtein
provides methods that calculate the distance/similarity ratio between two sequences:
Distance
- Calculates the Levenshtein distance between two sequences.SubsequenceDistance
- Calculates the Levenshtein distance between the input sequence and its most comparable subsequence in the pattern sequence.Ratio
- Calculates a normalized similarity ratio between two sequences.SubsequenceRatio
- Calculates a normalized subsequence similarity ratio between two sequences.
Levenshtein.Distance("word", "World");
Levenshtein.Distance("word", "World", CharComparer.CurrentCultureIgnoreCase);
Levenshtein.Distance("word", "World", CharComparer.CurrentCultureIgnoreCase, substitutionCost: 0);
It's safe to say that this is one of the most advanced implementations of the Levenshtein distance algorithm available in .NET. It stands out for being not only one of the fastest and the most memory-efficient, but also the most flexible one. This flexibility extends beyond merely allowing you to provide a custom character comparer and configure weights for different edit operations (i.e., deletions, insertions, and substitutions) - it also permits the calculation of the Levenshtein distance between any sequence, not limited to strings or arrays of characters.
Here are a few benchmarks against the most popular existing implementation I've discovered on NuGet that continues to be actively maintained:
Method | Parameters | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Distance_FuzzySearchNet | (Word, World) | 2,104.08 ns | 1.00 | 3056 B | 1.00 |
Distance_Searchify | (Word, World) | 97.25 ns | 0.05 | - | 0.00 |
IsMatch_FuzzySearchNet | (Thou(...)End]) [471203] | 6821.44 ms | 1.00 | 20872 B | 1.000 |
IsMatch_Searchify | (Thou(...)End]) [471203] | 115.71 ms | 0.02 | 184 B | 0.009 |
Imagine allocating (especially in 2024) an entire DOOM level, only to discover that the edit distance between the words "Word" and "World" is 1, while also being 20 times slower than a solution that doesn't allocate at all in such trivial scenarios. While you might argue that the time difference doesn't matter when the operation takes nanoseconds, the scalability issue becomes evident when the optimized solution takes milliseconds, and the non-optimized one takes seconds. Therefore, refrain from using naive implementations of popular algorithms; opt for Searchify
, which offers a greater level of flexibility, requires significantly fewer allocations, and operates more efficiently all at the same time 😉
DistanceMetric
DistanceMetric
serves as a facade for distance functions (e.g., Levenshtein distance), essential for defining a metric space.
Currently, there are two built-in implementations of DistanceMetric
in this library: DistanceMetric.Levenshtein
and DistanceMetric.LevenshteinIgnoreCase
.
// 2 (replace 'l' with 'L', insert missing 'h')
int distance = DistanceMetric.Levenshtein.Calculate("ligt", "Light");
// 1 (insert missing 'h')
int distanceIgnoreCase = DistanceMetric.LevenshteinIgnoreCase.Calculate("ligt", "Light");
Nonetheless, you can always wrap any distance function using the DistanceMetric.Create
method.
BKTree
BKTree
is an ICollection<T>
implementation of the Burkhard-Keller Tree.
var tree = new BKTree<string, int>(DistanceMetric.Levenshtein)
{
"book",
"books",
"cake",
"boo",
"boon",
"cook",
"cake",
"cape",
"cart",
};
var theCakeIsReal = tree.Contains("cake"); // true
var theCakeIsRemoved = tree.Remove("cake"); // true
var theCakeIsALie = !tree.Contains("cake"); // true
// Value = cook, Distance = 1
var result = tree.Find("cool");
// Value = cook, Distance = 1
// Value = boon, Distance = 2
// Value = boo, Distance = 2
// Value = book, Distance = 2
var results = tree.FindAll("cool", maxDistance: 2);
Tokenizer
Tokenizer
may seem like an intimidating term to some folks, but its concept is really quite simple - it takes a specific input and breaks it into smaller components.
Currently, the only built-in Tokenizer
in this library is Tokenizer.Words
. Given that the library is primarily targeted towards spell checking and fuzzy string matching, operations become simpler when conducted at the level of individual words:
// ["Hello", "world", "This", "is", "a", "test"]
var tokens = Tokenizer.Words.Tokenize("Hello, world! This is a test...");
Nevertheless, you can always create your very own Tokenizer
suited to any data type of your preference.
License
Licensed under the terms of the MIT License.
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. |
-
net8.0
- Searchify.Abstractions (>= 1.0.0)
- Spanned (>= 1.0.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 136 | 2/5/2024 |