Duraxium.Text 3.0.0

.NET CLI:
dotnet add package Duraxium.Text --version 3.0.0

Package Manager Console (Visual Studio):
NuGet\Install-Package Duraxium.Text -Version 3.0.0

PackageReference (copy this XML node into the project file):
<PackageReference Include="Duraxium.Text" Version="3.0.0" />

Paket CLI:
paket add Duraxium.Text --version 3.0.0

F# Interactive and Polyglot Notebooks (#r directive):
#r "nuget: Duraxium.Text, 3.0.0"

Cake:
// Install Duraxium.Text as a Cake Addin
#addin nuget:?package=Duraxium.Text&version=3.0.0

// Install Duraxium.Text as a Cake Tool
#tool nuget:?package=Duraxium.Text&version=3.0.0

Duraxium ANSEL Encoding Library

Version Notes
1.0.0 Origination.
1.1.0 Changed encoder replacement byte from 0xA0 to 0x81 – undefined in ASCII, ANSEL, UTF-8, and Windows code page 1252 (ANSI). Added decoder recognition of 0xA0, 0xBB, 0xDE, and 0xDF for extended GEDCOM.
2.0.0 Fixed bug related to candrabindu byte, and another related to ogonek byte encoding. Refactored code to reduce memory allocation and improve speed.
2.1.0 Removed dependency on .NET Standard as it can cause projects to bloat with many dependent NuGet packages.
3.0.0 Removed invalid double low line characters (two total). Added MARC 21 encodings for the eszett and euro sign characters. Created a byte order distinction between macron+diaeresis versus diaeresis+macron. Added 23 new character mappings.

This NuGet package translates between ANSEL characters and composite (non-combining) Unicode characters.

ANSEL is the American National Standard for Extended Latin alphabet coded character set. The character set uses the ASCII characters defined for values 0 – 7F hex, thirty-four additional characters in the range A1 – C6 hex, and twenty-nine combining diacritic characters in the range E0 – FE hex. The MARC 21 standard defines two additional characters but is otherwise identical to ANSEL.

Though originally developed for bibliographic use, the LDS Church specified ANSEL as a valid character encoding for GEDCOM genealogical data.

The public types in this package are the AnselEncoding class and the GedcomOptions enumeration. The AnselEncoding class is derived from the .NET Encoding class, thereby providing both ANSEL encoding and decoding.

Translation

An ANSEL character is represented by a sequence of one, two, or three bytes. A single-byte sequence is always a non-combining Latin character, with a value in the range 0 – FE hex. A two-byte sequence consists of one combining character, with a value in the range E0 – FE hex, followed by one Latin base character. A three-byte sequence consists of two combining characters followed by one Latin base character.
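As a rough illustration of this layout, the sketch below groups a raw byte array into sequences of combining bytes followed by a base byte. It is not the library's decoder: the class and method names (AnselSequences, IsCombining, Split) are hypothetical, and it assumes that every byte in the combining range E0 – FE hex acts as a prefix to the next non-combining byte.

```csharp
using System.Collections.Generic;

public static class AnselSequences
{
    // Assumption: combining diacritics occupy the range E0 - FE hex.
    public static bool IsCombining(byte b)
    {
        return b >= 0xE0 && b <= 0xFE;
    }

    // Groups bytes so that each group holds zero or more combining bytes
    // followed by the base byte that they modify.
    public static List<byte[]> Split(byte[] data)
    {
        var sequences = new List<byte[]>();
        var current = new List<byte>();

        foreach (byte b in data)
        {
            current.Add(b);
            if (!IsCombining(b))
            {
                // A non-combining byte closes the current sequence.
                sequences.Add(current.ToArray());
                current.Clear();
            }
        }

        // Trailing combining bytes with no base byte form an incomplete sequence.
        if (current.Count > 0)
        {
            sequences.Add(current.ToArray());
        }

        return sequences;
    }
}
```

For the input 41 E2 65 E5 E8 75 hex, Split returns three groups: one single-byte sequence, one two-byte sequence, and one three-byte sequence.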

The Unicode character produced by an ANSEL three-byte sequence is the same regardless of the order of the first two bytes, except for the macron+diaeresis sequences.

ANSEL has three undefined character byte values – FC, FD, and FF hex. Eight other byte values do not appear in composite Unicode characters, so they are not used by the encoder or decoder. The unused codes are ligature left half (EB), ligature right half (EC), comma above right (ED), candrabindu (EF), comma below (F7), left half ring below (F8), double tilde left half (FA), and double tilde right half (FB).

The table below summarizes the combining character codes:

Combining Characters (glyphs)
Byte (hex) | ANSEL Name | Unicode Name | Notes
E0 | low rising tone mark | hook above |
E1 | grave accent | grave |
E2 | acute accent | acute |
E3 | circumflex accent | circumflex |
E4 | tilde | tilde |
E5 | macron | macron |
E6 | breve | breve |
E7 | dot above | dot above |
E8 | umlaut (diaeresis) | diaeresis |
E9 | hacek | caron |
EA | circle above | ring above |
EB | ligature, left half | ligature, left half | No translation.
EC | ligature, right half | ligature, right half | No translation.
ED | high comma, off center | comma above (right) | No translation.
EE | double acute accent | double acute |
EF | candrabindu | N/A | No translation.
F0 | cedilla | cedilla |
F1 | right hook | ogonek |
F2 | dot below | dot below |
F3 | double dot below | diaeresis below |
F4 | circle below | ring below |
F5 | double underscore | double low line | No translation.
F6 | underscore | low line |
F7 | left hook | comma below |
F8 | right cedilla | left half ring below | No translation.
F9 | half circle below | breve below |
FA | double tilde, left half | double tilde, left half | No translation.
FB | double tilde, right half | double tilde, right half | No translation.
FC | diacritic slash | long solidus | GEDCOM only; not ANSEL/MARC
FD | UNDEFINED | N/A | No translation.
FE | high comma, centered | comma above |
FF | UNDEFINED | N/A | No translation.

Translation between ANSEL and Unicode is neither complete nor symmetrical. Many ANSEL byte sequences have no corresponding composite Unicode character. Translation is not symmetrical because, with one exception (the macron+diaeresis sequences), the two combining bytes of a three-byte ANSEL sequence can appear in either order, so both orderings decode to the same Unicode character. The GEDCOM extensions add further asymmetry. For these reasons, an ANSEL data set converted to Unicode and then back to ANSEL may not match the original byte for byte.
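One way to find out whether a particular data set survives the round trip is to decode it, encode the result again, and compare the bytes. The helper below is a minimal sketch of that check. The names RoundTripCheck and IsByteExact are hypothetical, and the method accepts any System.Text.Encoding, so an AnselEncoding instance can be passed in because the class derives from Encoding.

```csharp
using System.Linq;
using System.Text;

public static class RoundTripCheck
{
    // Returns true when decoding and then re-encoding reproduces the
    // original bytes exactly, false when any byte differs.
    public static bool IsByteExact(Encoding encoding, byte[] original)
    {
        string text = encoding.GetString(original);
        byte[] reencoded = encoding.GetBytes(text);
        return original.SequenceEqual(reencoded);
    }
}
```

A false result does not necessarily mean the data is corrupt. It may only mean that the combining bytes were stored in the other order, or that a GEDCOM extension byte was decoded to a character that the encoder writes differently.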

AnselEncoding

The AnselEncoding class can convert a sequence of Unicode characters to a sequence of ANSEL bytes and vice versa.

public AnselEncoding(bool useFallback = true, GedcomOptions options = GedcomOptions.Enhanced)

Construct an AnselEncoding object.

useFallback – true to emit a 'fallback' sequence for an unrecognized sequence, or false to emit nothing for an unrecognized sequence

options – GedcomOptions.None, GedcomOptions.Standard, or GedcomOptions.Enhanced.

Invalid Inputs

An invalid input can either be ignored (skipped) or it can produce a 'fallback' sequence. By default, the AnselEncoding class emits a fallback sequence whether encoding or decoding.

When the built-in fallback mechanism is used, the encoder will emit the byte value 81 hex for an unrecognized Unicode input character. This value is used because it is undefined in ASCII, ANSEL, UTF-8, and Windows code page 1252. For any invalid ANSEL sequence, the decoder will emit the Latin base character (if present) or the Unicode replacement character, FFFD hex.

Because a valid ANSEL byte sequence cannot have more than two combining characters, any extra bytes will be ignored. Only the two bytes immediately preceding the base character will be used.
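The trimming rule can be sketched as follows. This only illustrates the behaviour described above and is not the library's code. The names SequenceTrimmer and KeepNearestTwo are hypothetical, and the input is assumed to be one sequence of combining bytes ending in its base byte.

```csharp
using System;

public static class SequenceTrimmer
{
    // Keeps at most the two combining bytes nearest the base byte,
    // plus the base byte itself (the last element of the input).
    public static byte[] KeepNearestTwo(byte[] sequence)
    {
        const int MaxLength = 3; // two combining bytes + one base byte

        if (sequence.Length <= MaxLength)
        {
            return sequence;
        }

        var trimmed = new byte[MaxLength];
        Array.Copy(sequence, sequence.Length - MaxLength, trimmed, 0, MaxLength);
        return trimmed;
    }
}
```

Given E1 E2 E3 61 hex, the sketch drops the leading E1 and keeps E2 E3 61.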

Some ANSEL-coded data (notably GEDCOM) includes the invalid three byte sequence EF BF BD hex, which is the UTF-8 sequence for the Unicode replacement character. To prevent these sequences from causing spurious decoding errors, the decoder translates this sequence as the Unicode replacement character.

The encoder's response to invalid inputs can be customized by overriding the OnEncodeFailure method. The decoder's response can be customized by overriding the OnDecodeFailure method.

GEDCOM Options
Option | Description
None | ANSEL/MARC only, with no GEDCOM extensions.
Standard | ANSEL/MARC plus standard GEDCOM extensions. Beyond ANSEL, the byte values BE hex (empty box), BF hex (black box), CD hex (midline e), CE hex (midline o), CF hex (lowercase sharp S), and FC hex (combining long solidus) are translated. The decoder translates the midline e and midline o sequences to Unicode 'e' (0065) and 'o' (006F). The encoder emits only standard 'e' and 'o' values, never the bytes CD or CE hex.
Enhanced | ANSEL/MARC plus non-standard GEDCOM extensions. Single-byte characters in the range 80 – 9F hex are decoded as Windows code page 1252. Although these sequences are non-standard, they appear in some real GEDCOM data. Additionally, multibyte ANSEL sequences involving the base character 'i' (69 hex) are supported as alternatives to the dotless 'i' character (B8 hex). These byte sequences are never emitted by the encoder, however.

Currently, the encoder maps 475 Unicode characters to ANSEL byte sequences, plus 13 more if GEDCOM extensions are used. The decoder maps 566 ANSEL byte sequences to Unicode characters, plus 22 more with the standard GEDCOM option or 42 more with the enhanced GEDCOM option.

public int GetByteCount(string s)

Determine the number of ANSEL bytes needed to encode a Unicode character string.

s – the Unicode character string

returns – the number of ANSEL bytes

public int GetByteCount(char[] chars)

Determine the number of ANSEL bytes needed to encode an array of Unicode characters.

chars – the Unicode character array

returns – the number of ANSEL bytes

public int GetByteCount(char[] chars, int index, int count)

Determine the number of ANSEL bytes needed to encode an array of Unicode characters.

chars – the Unicode character array

index – the index of the character to start with

count – the number of characters to use

returns – the number of ANSEL bytes

public int GetMaxByteCount(int charCount)

Determine the maximum number of ANSEL bytes needed to encode an array of Unicode characters.

charCount – the number of Unicode characters

returns – the maximum number of ANSEL bytes

public override int GetCharCount(byte[] bytes)

Determine the number of Unicode characters needed to decode an array of ANSEL bytes.

bytes – the ANSEL byte array

returns – the number of Unicode characters

public int GetCharCount(byte[] bytes, int index, int count)

Determine the number of Unicode characters needed to decode an array of ANSEL bytes.

bytes – the ANSEL byte array

index – the index of the byte to start with

count – the number of bytes to use

returns – the number of Unicode characters

public int GetMaxCharCount(int byteCount)

Determine the maximum number of Unicode characters needed to decode an array of ANSEL bytes.

byteCount – the number of ANSEL bytes

returns – the maximum number of Unicode characters

public byte[] GetBytes(string s)

Encode a Unicode character string to create an equivalent array of ANSEL bytes.

s – the Unicode character string

returns – the array of ANSEL bytes

public byte[] GetBytes(char[] chars)

Encode an array of Unicode characters to create an equivalent array of ANSEL bytes.

chars – the Unicode character array

returns – the array of ANSEL bytes

public byte[] GetBytes(char[] chars, int index, int count)

Encode an array of Unicode characters to create an equivalent array of ANSEL bytes.

chars – the Unicode character array

index – the index of the first character to use

count – the number of characters to use

returns – the array of ANSEL bytes

public int GetBytes(string s, int charIndex, int charCount, byte[] bytes, int byteIndex)

Encode a Unicode character string to write the equivalent ANSEL bytes to a byte array.

s – the Unicode character string

charIndex – the index of the first character to use

charCount – the number of characters to use

bytes – the ANSEL byte array

byteIndex – the index of the first byte in the array to use

returns – the number of bytes written to the byte array

public int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)

Encode an array of Unicode characters to write the equivalent ANSEL bytes to a byte array.

chars – the Unicode character array

charIndex – the index of the first character to encode

charCount – the number of characters to encode

bytes – the ANSEL byte array. The array must be large enough to hold the encoded bytes (e.g. by first calling GetByteCount or GetMaxByteCount).

byteIndex – the index in the byte array at which to start writing the encoded bytes

returns – the number of bytes written to the byte array
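The buffer-based overloads are useful when several strings are packed into one preallocated byte array. The sketch below checks the remaining capacity with GetByteCount before calling GetBytes. BufferWriter and Append are hypothetical names, and the method is written against System.Text.Encoding so that it can be called with an AnselEncoding instance or with any other encoding.

```csharp
using System;
using System.Text;

public static class BufferWriter
{
    // Encodes chars into destination starting at offset and returns the
    // offset just past the last byte written.
    public static int Append(Encoding encoding, char[] chars, byte[] destination, int offset)
    {
        // Ask the encoding how many bytes it needs before writing anything.
        int needed = encoding.GetByteCount(chars, 0, chars.Length);
        if (needed > destination.Length - offset)
        {
            throw new ArgumentException("Destination buffer is too small.", "destination");
        }

        int written = encoding.GetBytes(chars, 0, chars.Length, destination, offset);
        return offset + written;
    }
}
```

GetMaxByteCount can replace GetByteCount when a worst-case size is acceptable and a second pass over the characters is not wanted.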

public char[] GetChars(byte[] bytes)

Decode an array of ANSEL bytes to an array of Unicode characters.

bytes – the ANSEL byte array

returns – the Unicode character array

public char[] GetChars(byte[] bytes, int index, int count)

Decode an array of ANSEL bytes to an array of Unicode characters.

bytes – the ANSEL byte array

index – the index of the first byte to use

count – the number of bytes to use

returns – the Unicode character array

public int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)

Write characters to a Unicode character array based on an array of ANSEL bytes.

bytes – the ANSEL byte array

byteIndex – the index of the first byte to decode

byteCount – the number of bytes to decode

chars – the Unicode character array. The array must be large enough to hold the decoded characters (e.g. by first calling GetCharCount or GetMaxCharCount).

charIndex – the index in the character array at which to start writing the decoded characters

returns – the number of characters written to the character array

public string GetString(byte[] bytes, int index, int count)

Decode an array of ANSEL bytes to a Unicode character string.

bytes – the ANSEL byte array

index – the index of the first byte to use

count – the number of bytes to use

returns – the Unicode character string

public Decoder GetDecoder()

Get the ANSEL decoder.

public Encoder GetEncoder()

Get the ANSEL encoder.
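A Decoder is mainly useful when data arrives in chunks, because a multibyte sequence can be split across a chunk boundary. The sketch below follows the general .NET Decoder contract, in which incomplete trailing bytes are held until the next call. This document does not state whether the decoder returned by AnselEncoding keeps such state, so treat that as an assumption. ChunkedDecoding and DecodeInChunks are hypothetical names.

```csharp
using System;
using System.Text;

public static class ChunkedDecoding
{
    // Decodes data in fixed-size chunks using a single Decoder instance.
    public static string DecodeInChunks(Encoding encoding, byte[] data, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException("chunkSize");
        }

        Decoder decoder = encoding.GetDecoder();
        var result = new StringBuilder();

        // Large enough for the worst-case output of one chunk.
        char[] buffer = new char[encoding.GetMaxCharCount(chunkSize)];

        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);

            // Flush only on the final chunk so pending bytes are not dropped early.
            bool flush = offset + count >= data.Length;

            int written = decoder.GetChars(data, offset, count, buffer, 0, flush);
            result.Append(buffer, 0, written);
        }

        return result.ToString();
    }
}
```

Calling Encoding.GetChars directly on each chunk would treat a sequence cut in half as two invalid inputs. Reusing one Decoder across the chunks avoids that, provided the decoder is stateful.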

public byte[] GetPreamble()

Get a sequence of bytes that specify the encoding being used.

returns – always an empty (zero-length) byte array, since no preamble is defined for ANSEL encoding.

public virtual byte[] OnEncodeFailure(char c)

Called whenever encoding fails for a Unicode character.

c – the unrecognized character

returns – the ANSEL byte sequence to be emitted as the fallback

By overriding this method, you can customize the ANSEL byte sequence emitted for the unrecognized Unicode character. If OnEncodeFailure returns an empty (zero element) byte sequence, no bytes will be emitted.

public virtual char? OnDecodeFailure(byte[] sequence)

Called whenever decoding fails for an ANSEL byte sequence.

sequence – the unrecognized byte sequence

returns – the Unicode character to be emitted as the fallback

By overriding this method, you can customize the Unicode character emitted for the unrecognized ANSEL sequence. If OnDecodeFailure returns null, no character will be emitted.

protected bool UseFallback { get; }

Returns true if a fallback sequence should be used, else false. If UseFallback is false, the overrides for OnEncodeFailure and OnDecodeFailure should return an empty byte array and null, respectively, instead of a fallback sequence.
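A derived class that substitutes a question mark for unrecognized input might look like the sketch below. It uses only the members documented above (OnEncodeFailure, OnDecodeFailure, UseFallback). The namespace in the using directive is an assumption based on the package name and may need adjusting, and QuestionMarkAnselEncoding is a hypothetical name.

```csharp
using Duraxium.Text; // ASSUMPTION: the namespace matches the package name

public class QuestionMarkAnselEncoding : AnselEncoding
{
    // Emit ASCII '?' (3F hex) for a Unicode character with no ANSEL mapping,
    // or nothing at all when fallback output is disabled.
    public override byte[] OnEncodeFailure(char c)
    {
        return UseFallback ? new byte[] { 0x3F } : new byte[0];
    }

    // Emit '?' for an unrecognized ANSEL sequence, or null (no character)
    // when fallback output is disabled.
    public override char? OnDecodeFailure(byte[] sequence)
    {
        return UseFallback ? '?' : (char?)null;
    }
}
```

Because the class adds no constructor of its own, it relies on the base constructor's default arguments (useFallback = true, options = GedcomOptions.Enhanced).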

C# Examples

Decoding

The simplest way to perform ANSEL decoding is to use the AnselEncoding class with a FileStream and StreamReader as shown below:

// Requires System.IO and the namespace that contains AnselEncoding.
public string Decode(string anselFile)
{
   AnselEncoding encoding = new AnselEncoding();

   // Open the file named by the caller and decode its contents as ANSEL.
   using (FileStream stream = new FileStream(anselFile, FileMode.Open))
   using (StreamReader reader = new StreamReader(stream, encoding))
   {
      return reader.ReadToEnd();
   }
}

Encoding

The simplest way to perform ANSEL encoding is to use the AnselEncoding class with a FileStream and StreamWriter as shown below:

public void Encode(string text, string outputFile)
{
   AnselEncoding encoding = new AnselEncoding();

   using (FileStream stream = new FileStream(outputFile, FileMode.Create))
   {
      using (StreamWriter writer = new StreamWriter(stream, encoding))
      {
         writer.Write(text);
      }
   }
}

Target frameworks
.NET Framework 4.0 and .NET Framework 4.5 are included in the package, with no dependencies. Later .NET Framework versions (4.0.3 through 4.8.1) are computed as compatible.


Version Downloads Last updated
3.0.0 356 1/1/2024
2.1.0 742 4/30/2021
2.0.0 770 2/18/2021
1.1.0 777 2/9/2021
1.0.0 737 1/5/2021
