Clara.Analysis.Morfologik 0.1.37

.NET 6.0 .NET Standard 2.0 .NET Framework 4.6.2

dotnet add package Clara.Analysis.Morfologik --version 0.1.37

NuGet\Install-Package Clara.Analysis.Morfologik -Version 0.1.37

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Clara.Analysis.Morfologik" Version="0.1.37" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add Clara.Analysis.Morfologik --version 0.1.37

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Clara.Analysis.Morfologik, 0.1.37"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install Clara.Analysis.Morfologik as a Cake Addin
#addin nuget:?package=Clara.Analysis.Morfologik&version=0.1.37

// Install Clara.Analysis.Morfologik as a Cake Tool
#tool nuget:?package=Clara.Analysis.Morfologik&version=0.1.37

Clara

A simple yet feature-complete in-memory search engine.

Highlights

This library is meant for relatively small document sets (up to tens of thousands of documents) while maintaining fast query times (measured in low milliseconds). Index updates are not supported by design; a full index rebuild is required to reflect changes in the source data.

Main features are:

Inspired by the well-known Lucene design
Fast in-memory searching
Low memory allocation for search execution
Stemming and stop word handling for 31 languages
Text, keyword, hierarchy and range fields
Index-time synonym maps with multi-token support
Cross-field searching with BM25 scoring
Filtering keyword and hierarchy fields by any or all values, and range fields by value subrange
Faceting without restricting facet values by field filters
Result sorting by document score or range field values
Fully configurable and extendable text analysis pipeline
Fluent query builder

Supported Languages

Internally

Porter (English)

Using Clara.Analysis.Snowball package

Arabic, Armenian, Basque, Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Tamil, Turkish, Yiddish

Using Clara.Analysis.Morfologik package

Polish

A Quick Example

Given a sample product data set from dummyjson.com.

[
  {
    "id": 1,
    "title": "iPhone 9",
    "description": "An apple mobile which is nothing like apple",
    "price": 549,
    "discountPercentage": 12.96,
    "rating": 4.69,
    "stock": 94,
    "brand": "Apple",
    "category": "smartphones"
  }
]

We define the data model as follows.

public class Product
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public decimal? Price { get; set; }
    public double? DiscountPercentage { get; set; }
    public double? Rating { get; set; }
    public int? Stock { get; set; }
    public string Brand { get; set; }
    public string Category { get; set; }
}

Now we need to define an index model mapper. The mapper defines how the index is built from source documents and what capabilities it provides afterwards.

Clara only supports single-field searching, so all text that is to be indexed has to be combined into a single field. We can define more text fields, for example when we want to support multiple languages from a single index. In that case we would combine the text for each language and use the appropriate analyzer.

For simple fields we define delegates that provide raw values for indexing. Each field can provide zero, one or more values; null values are automatically skipped during indexing. All non-text fields can be marked as filterable or facetable, while only range fields can be made sortable.

Built indexes have no persistence and reside only in memory. If an index needs updating, it should be rebuilt and the old one discarded. This is why fields have no names and can be referenced only by their definition, which is usually static.

The IIndexMapper<TSource> interface is straightforward. It provides the collection of all fields, a method to access the document key and a method to access the indexed document value. The indexed document value, which is returned in query results, can differ from the index source document. To indicate this distinction, use the IIndexMapper<TSource, TDocument> type instead and return the proper document type in the GetDocument method implementation.

public sealed class ProductMapper : IIndexMapper<Product>
{
    public TextField<Product> Text { get; } = new(GetText, new PorterAnalyzer());
    public DecimalField<Product> Price { get; } = new(x => x.Price, isFilterable: true, isFacetable: true, isSortable: true);
    public DoubleField<Product> DiscountPercentage { get; } = new(x => x.DiscountPercentage, isFilterable: true, isFacetable: true, isSortable: true);
    public DoubleField<Product> Rating { get; } = new(x => x.Rating, isFilterable: true, isFacetable: true, isSortable: true);
    public Int32Field<Product> Stock { get; } = new(x => x.Stock, isFilterable: true, isFacetable: true, isSortable: true);
    public KeywordField<Product> Brand { get; } = new(x => x.Brand, isFilterable: true, isFacetable: true);
    public HierarchyField<Product> Category { get; } = new(x => x.Category, separator: "-", root: "all", HierarchyValueHandling.Path, isFilterable: true, isFacetable: true);

    public IEnumerable<Field> GetFields()
    {
        yield return Text;
        yield return Price;
        yield return DiscountPercentage;
        yield return Rating;
        yield return Stock;
        yield return Brand;
        yield return Category;
    }

    public string GetDocumentKey(Product item) => item.Id.ToString();

    public Product GetDocument(Product item) => item;

    private static IEnumerable<string?> GetText(Product product)
    {
        yield return product.Id.ToString();
        yield return product.Title;
        yield return product.Description;
        yield return product.Brand;
        yield return product.Category;
    }
}

Then we build our index.

var mapper = new ProductMapper();
var index = IndexBuilder.Build(products, mapper);
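Because a built index cannot be updated and has to be rebuilt when the source data changes, an application that refreshes its data needs some way to swap the old index for the new one. A minimal sketch of one possible approach follows. IndexHolder<TIndex> is a hypothetical helper that is not part of Clara, and the build delegate is an assumption that would typically wrap the IndexBuilder.Build(products, mapper) call.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of Clara: keeps the current index and
// replaces it atomically once a full rebuild has finished.
public sealed class IndexHolder<TIndex>
    where TIndex : class
{
    private readonly Func<TIndex> build;
    private TIndex current;

    public IndexHolder(Func<TIndex> build)
    {
        this.build = build ?? throw new ArgumentNullException(nameof(build));
        this.current = build();
    }

    // Readers always observe a fully built index.
    public TIndex Current => Volatile.Read(ref this.current);

    // Builds a new index first, then publishes it in a single step.
    // Returns the previous index so the caller can discard it.
    public TIndex Rebuild()
    {
        var next = this.build();
        return Interlocked.Exchange(ref this.current, next);
    }
}
```

Queries that started against the previous index keep their own reference to it, so the old instance is only collected once they have finished.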

With the index built, we can run queries against it. Result documents can be accessed via the Documents property and facet results via Facets. Documents are not paged: the search engine builds the whole result set each time in order to compute facet values, using pooled buffers for result construction. If paging is needed, it can be added with simple Skip/Take logic on top of the Documents collection.

// Query result must always be disposed in order to return pooled buffers for reuse
using var result = index.Query(
    q => q
        .Search(mapper.Text, SearchMode.Any, "watch ring leather bag")
        .Filter(mapper.Brand, FilterMode.Any, "Eastern Watches", "Bracelet", "Copenhagen Luxe")
        .Filter(mapper.Category, FilterMode.Any, "womens")
        .Filter(mapper.Price, valueFrom: 10, valueTo: 90)
        .Facet(mapper.Brand)
        .Facet(mapper.Category)
        .Facet(mapper.Price)
        .Sort(mapper.Price, SortDirection.Descending));

Console.WriteLine("Documents:");

foreach (var document in result.Documents.Take(10))
{
    Console.WriteLine($"  [{document.Document.Title}] ${document.Document.Price} => {document.Score}");
}

Console.WriteLine("Brands:");

foreach (var value in result.Facets.Field(mapper.Brand).Values.Take(5))
{
    Console.WriteLine($"  {(value.IsSelected ? "(x)" : "( )")} [{value.Value}] => {value.Count}");
}

Console.WriteLine("Categories:");

foreach (var value in result.Facets.Field(mapper.Category).Values.Take(5))
{
    Console.WriteLine($"  (x) [{value.Value}] => {value.Count}");

    foreach (var child in value.Children)
    {
        Console.WriteLine($"    ( ) [{child.Value}] => {child.Count}");
    }
}

var priceFacet = result.Facets.Field(mapper.Price);

Console.WriteLine("Price:");
Console.WriteLine($"  [Min] => {priceFacet.Min}");
Console.WriteLine($"  [Max] => {priceFacet.Max}");

Running this query against the sample data results in the following output.

Documents:
  [Fashion Magnetic Wrist Watch] $60 => 3,31469
  [Leather Hand Bag] $57 => 6,3701906
  [Fancy hand clutch] $44 => 6,559135
  [Steel Analog Couple Watches] $35 => 3,4505985
  [Stainless Steel Women] $35 => 3,4802377
Brands:
  (x) [Eastern Watches] => 2
  (x) [Bracelet] => 2
  (x) [Copenhagen Luxe] => 1
  ( ) [LouisWill] => 2
  ( ) [Luxury Digital] => 1
Categories:
  (x) [womens] => 5
    ( ) [womens-watches] => 3
    ( ) [womens-bags] => 2
Price:
  [Min] => 35
  [Max] => 100
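Since Documents is not paged, the Skip/Take approach mentioned earlier could be wrapped in a small extension method. The sketch below is only an illustration: PagingExtensions and Page are hypothetical names that are not part of Clara, and the code assumes nothing about Documents beyond it being enumerable, as the Take(10) call in the example suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Clara.
public static class PagingExtensions
{
    // Returns the items of a 1-based page from any sequence.
    public static IEnumerable<T> Page<T>(this IEnumerable<T> source, int pageNumber, int pageSize)
    {
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        // Skip all items of the preceding pages, then take one page.
        return source.Skip((pageNumber - 1) * pageSize).Take(pageSize);
    }
}
```

With this in place, a call such as result.Documents.Page(2, 10) would yield documents 11 to 20. The returned sequence is evaluated lazily, so it is probably safest to materialize it (for example with ToList()) before the query result is disposed.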

Field Mapping

TODO

Text Fields

TODO

Keyword Fields

TODO

Hierarchy Fields

TODO

Range Fields

Range fields represent index fields for struct values that implement the IComparable<T> interface, that is, comparable values that can fall within a specific range. Range fields allow filtering by subrange, and their facet values contain the minimum and maximum values of the matched documents. Sorting depends on the direction, Ascending or Descending. When sorting in Ascending order, the minimum document value is used and documents without any value are treated as if they had maxValue. When sorting in Descending order, the maximum document value is used and documents without values have minValue assigned to them.

Internally, the DateTime, Decimal, Double and Int32 types are supported. Implementors can support any type that fulfills the requirements by directly using RangeField<TValue> and providing minValue and maxValue for the given type. For example, given the DateOnly type, an index field can be defined as follows.
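The sort key rules described above can be sketched in a few lines. This is an illustration of the described behaviour only, not Clara's implementation; RangeSortKey and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustration only; hypothetical names, not Clara's internal code.
public static class RangeSortKey
{
    // Ascending: the smallest value of the document is the sort key.
    // A document without values gets maxValue, which places it at the end.
    public static T Ascending<T>(IReadOnlyCollection<T> values, T maxValue)
        where T : struct, IComparable<T>
        => values.Count == 0 ? maxValue : values.Min();

    // Descending: the largest value of the document is the sort key.
    // A document without values gets minValue, which also places it at the end.
    public static T Descending<T>(IReadOnlyCollection<T> values, T minValue)
        where T : struct, IComparable<T>
        => values.Count == 0 ? minValue : values.Max();
}
```

In both directions the effect is that documents without a value end up after documents that have one, unless a document actually holds the boundary value itself.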

public static RangeField<DateOnly> DateOfBirth { get; } = new(x => x.DateOfBirth, minValue: DateOnly.MinValue, maxValue: DateOnly.MaxValue, isFilterable: true, isFacetable: true, isSortable: true);

Alternatively, it is possible to provide your own concrete implementation by creating a subclass of RangeField<TValue>.

public sealed class DateOnlyField<TSource> : RangeField<TSource, DateOnly>
{
    public DateOnlyField(Func<TSource, DateOnly?> valueMapper, bool isFilterable = false, bool isFacetable = false, bool isSortable = false)
        : base(
            valueMapper: valueMapper,
            minValue: DateOnly.MinValue,
            maxValue: DateOnly.MaxValue,
            isFilterable: isFilterable,
            isFacetable: isFacetable,
            isSortable: isSortable)
    {
    }

    public DateOnlyField(Func<TSource, IEnumerable<DateOnly>> valueMapper, bool isFilterable = false, bool isFacetable = false, bool isSortable = false)
        : base(
            valueMapper: valueMapper,
            minValue: DateOnly.MinValue,
            maxValue: DateOnly.MaxValue,
            isFilterable: isFilterable,
            isFacetable: isFacetable,
            isSortable: isSortable)
    {
    }
}

The DateOnlyField defined above can now be used directly, without the need to specify TValue and minValue/maxValue each time.

public static DateOnlyField DateOfBirth { get; } = new(x => x.DateOfBirth, isFilterable: true, isFacetable: true, isSortable: true);

Text Analysis

TODO

Analyzers

TODO

Internally, only PorterAnalyzer is provided for English language stemming. For other languages, the Clara.Analysis.Snowball or Clara.Analysis.Morfologik packages can be used. These packages provide stem and stop token filters for all supported languages.

TODO

Synonym Maps

TODO

Extending Analysis Pipeline

TODO

Benchmarks

Index and query benchmarks and tests are performed using a sample data set of 100 products. Benchmark variants with the x100 suffix are based on that data set multiplied 100 times.

BenchmarkDotNet v0.13.10, Windows 11 (10.0.22631.2715/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12900K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

Tokenization Benchmarks

Method	Mean	Error	StdDev	Allocated
StandardTokenizer	136.9 ns	0.41 ns	0.37 ns	-
StandardAnalyzer	241.2 ns	0.79 ns	0.70 ns	-
PorterAnalyzer	636.0 ns	1.13 ns	1.06 ns	-
SynonymMap	849.9 ns	1.31 ns	1.16 ns	-
EnglishAnalyzer	1,359.9 ns	3.61 ns	3.38 ns	-
MorfologikAnalyzer	4,531.1 ns	20.32 ns	18.01 ns	944 B

Indexing Benchmarks

Method	Mean	Error	StdDev	Allocated
IndexInstance_x100	67,686.6 μs	1,331.34 μs	2,331.73 μs	25793.36 KB
IndexInstance	631.5 μs	1.63 μs	1.36 μs	598.36 KB
IndexShared_x100	67,934.6 μs	1,321.08 μs	1,356.65 μs	24511.86 KB
IndexShared	572.2 μs	2.09 μs	1.96 μs	485.55 KB

Querying Benchmarks

Method	Mean	Error	StdDev	Allocated
QueryComplex_x100	338,031.6 ns	5,036.67 ns	4,464.87 ns	904 B
QueryComplex	8,598.6 ns	27.15 ns	25.40 ns	904 B
QuerySearch	3,658.2 ns	9.13 ns	8.09 ns	352 B
QueryFilter	733.5 ns	1.80 ns	1.68 ns	424 B
QueryFacet	6,956.1 ns	19.35 ns	17.15 ns	624 B
QuerySort	1,721.3 ns	3.48 ns	3.08 ns	392 B
Query	610.4 ns	1.45 ns	1.28 ns	296 B

Memory Allocations

Clara depends heavily on internal buffer pooling in order to keep the memory footprint of query execution minimal. As a result, memory allocation per search execution is constant after the initial buffer allocation. Some compromises are made, however: Query and QueryResult objects are still allocated, to provide ease of use and proper disposal of internal buffers.
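To make the pooling idea concrete, here is a minimal sketch of the general technique using the standard ArrayPool<T> type from System.Buffers. It does not show Clara's internal code, and PooledResult is a hypothetical class. It only illustrates why a result object that borrows buffers has to be disposed.

```csharp
using System;
using System.Buffers;

// Illustration of buffer pooling in general; not Clara's internal code.
public sealed class PooledResult : IDisposable
{
    private int[]? buffer;

    public PooledResult(int capacity)
    {
        // Rent may hand back an array larger than requested.
        this.buffer = ArrayPool<int>.Shared.Rent(capacity);
    }

    public int Count { get; private set; }

    public ReadOnlySpan<int> Items
    {
        get
        {
            var items = this.buffer ?? throw new ObjectDisposedException(nameof(PooledResult));
            return items.AsSpan(0, this.Count);
        }
    }

    public void Add(int documentId)
    {
        var items = this.buffer ?? throw new ObjectDisposedException(nameof(PooledResult));
        if (this.Count == items.Length) throw new InvalidOperationException("Buffer is full.");
        items[this.Count++] = documentId;
    }

    // Returns the rented array so that later results can reuse it.
    public void Dispose()
    {
        var items = this.buffer;
        if (items is not null)
        {
            this.buffer = null;
            ArrayPool<int>.Shared.Return(items);
        }
    }
}
```

The only per-use allocation in this sketch is the small PooledResult object itself, which mirrors the compromise described above for Query and QueryResult.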

License

Product	Compatible and additional computed target framework versions.
.NET	net5.0 was computed. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.
.NET Core	netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed.
.NET Standard	netstandard2.0 is compatible. netstandard2.1 is compatible.
.NET Framework	net461 was computed. net462 is compatible. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed.
MonoAndroid	monoandroid was computed.
MonoMac	monomac was computed.
MonoTouch	monotouch was computed.
Tizen	tizen40 was computed. tizen60 was computed.
Xamarin.iOS	xamarinios was computed.
Xamarin.Mac	xamarinmac was computed.
Xamarin.TVOS	xamarintvos was computed.
Xamarin.WatchOS	xamarinwatchos was computed.

.NETFramework 4.6.2
- Clara (>= 0.1.37)
- Morfologik.Polish (>= 2.1.7)
.NETStandard 2.0
- Clara (>= 0.1.37)
- Morfologik.Polish (>= 2.1.7)
.NETStandard 2.1
- Clara (>= 0.1.37)
- Morfologik.Polish (>= 2.1.7)
net6.0
- Clara (>= 0.1.37)
- Morfologik.Polish (>= 2.1.7)
net7.0
- Clara (>= 0.1.37)
- Morfologik.Polish (>= 2.1.7)
net8.0
- Clara (>= 0.1.37)
- Morfologik.Polish (>= 2.1.7)

Version	Downloads	Last updated
0.1.37	301	11/23/2023
0.1.36	148	11/1/2023
0.1.35	123	10/28/2023
0.1.34	130	10/26/2023
0.1.33	120	10/26/2023
0.1.32	123	10/13/2023
0.1.31	118	10/12/2023
0.1.30	121	10/4/2023
0.1.29	105	9/28/2023
0.1.28	101	9/27/2023
0.1.27	118	9/24/2023
0.1.26	98	9/24/2023
0.1.25	111	9/23/2023
0.1.24	100	9/21/2023
0.1.23	113	9/19/2023
0.1.22	102	9/19/2023
0.1.21	118	9/18/2023
0.1.20	108	9/17/2023
