Duraxium.Text 3.0.0

.NET Framework 4.0

dotnet add package Duraxium.Text --version 3.0.0

NuGet\Install-Package Duraxium.Text -Version 3.0.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Duraxium.Text" Version="3.0.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.


Duraxium ANSEL Encoding Library

Version	Notes
1.0.0	Origination.
1.1.0	Changed encoder replacement byte from 0xA0 to 0x81, which is undefined in ASCII, ANSEL, UTF-8 and Windows code page 1252 (ANSI). Added decoder recognition of 0xA0, 0xBB, 0xDE, and 0xDF for extended GEDCOM.
2.0.0	Fixed bug related to candrabindu byte, and another related to ogonek byte encoding. Refactored code to reduce memory allocation and improve speed.
2.1.0	Removed dependency on .NET Standard as it can cause projects to bloat with many dependent NuGet packages.
3.0.0	Removed invalid double low line characters (two total). Added MARC 21 encodings for the eszett and euro sign characters. Created a byte order distinction between macron+diaeresis versus diaeresis+macron. Added 23 new character mappings.

This NuGet package translates between ANSEL characters and composite (non-combining) Unicode characters.

ANSEL is the American National Standard for Extended Latin alphabet coded character set. The character set uses the ASCII characters defined for values 0 – 7F hex, thirty-four additional characters in the range A1 – C6 hex, and twenty-nine combining diacritic characters in the range E0 – FE hex. The MARC 21 standard defines two additional characters but is otherwise identical to ANSEL.

Though originally developed for bibliographic use, the LDS Church specified ANSEL as a valid character encoding for GEDCOM genealogical data.

The public types in this package are the AnselEncoding class and the GedcomOptions enumeration. The AnselEncoding class is derived from the .NET Encoding class, thereby providing both ANSEL encoding and decoding.

Translation

An ANSEL character is represented by a sequence of one, two, or three bytes. A single-byte sequence is always a non-combining character with a value in the range 0 – DF hex. A two-byte sequence consists of one combining character with a value in the range E0 – FE hex, followed by one Latin base character. A three-byte sequence consists of two combining characters followed by one Latin base character.

The Unicode character produced by an ANSEL three-byte sequence is the same regardless of the order of the first two bytes, except for the macron+diaeresis sequences.
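Unicode itself treats the macron and diaeresis pair as order-sensitive, which may help explain the exception above. The sketch below uses only the .NET string.Normalize method, not this package; the class and method names are hypothetical and chosen for illustration.

```csharp
using System;
using System.Text;

public static class CombiningOrderDemo
{
    // Compose a base letter followed by two combining marks into NFC form.
    public static string Compose(char baseChar, char first, char second)
    {
        string decomposed = new string(new[] { baseChar, first, second });
        return decomposed.Normalize(NormalizationForm.FormC);
    }

    // True when swapping the two combining marks changes the composed result.
    public static bool OrderMatters(char baseChar, char markA, char markB)
    {
        return Compose(baseChar, markA, markB) != Compose(baseChar, markB, markA);
    }
}
```

With U+0304 (combining macron) and U+0308 (combining diaeresis) on the base letter 'u', the two orders compose to different characters (U+01D6 and U+1E7B). With U+0323 (dot below) and U+0302 (circumflex) on 'a', both orders compose to U+1EAD.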

ANSEL has three undefined byte values: FC, FD, and FF hex. Eight other byte values do not appear in composite Unicode characters, so they are not used by the encoder or decoder. The unused codes are ligature left half (EB), ligature right half (EC), comma above right (ED), candrabindu (EF), double low line (F5), left half ring below (F8), double tilde left half (FA), and double tilde right half (FB).

The table below summarizes the combining character codes:

Combining Characters (glyphs)

Byte (hex)	ANSEL Name	Unicode Name	Notes
E0	low rising tone mark	hook above
E1	grave accent	grave
E2	acute accent	acute
E3	circumflex accent	circumflex
E4	tilde	tilde
E5	macron	macron
E6	breve	breve
E7	dot above	dot above
E8	umlaut (diaeresis)	diaeresis
E9	hacek	caron
EA	circle above	ring above
EB	ligature, left half	ligature, left half	No translation.
EC	ligature, right half	ligature, right half	No translation.
ED	high comma, off center	comma above (right)	No translation.
EE	double acute accent	double acute
EF	candrabindu	N/A	No translation.
F0	cedilla	cedilla
F1	right hook	ogonek
F2	dot below	dot below
F3	double dot below	diaeresis below
F4	circle below	ring below
F5	double underscore	double low line	No translation.
F6	underscore	low line
F7	left hook	comma below
F8	right cedilla	left half ring below	No translation.
F9	half circle below	breve below
FA	double tilde, left half	double tilde, left half	No translation.
FB	double tilde, right half	double tilde, right half	No translation.
FC	diacritic slash	long solidus	GEDCOM only; not ANSEL/MARC
FD	UNDEFINED	N/A	No translation.
FE	high comma, centered	comma above
FF	UNDEFINED	N/A	No translation.

Translation between ANSEL and Unicode is neither complete nor symmetrical. Many ANSEL byte sequences have no corresponding composite Unicode character. In addition, with one exception (macron and diaeresis), the two combining bytes of a three-byte ANSEL sequence can appear in either order, so both orderings decode to the same Unicode character. The GEDCOM extensions add further asymmetry. For these reasons, an ANSEL data set converted to Unicode and then back to ANSEL may not match the original byte for byte.

AnselEncoding

The AnselEncoding class can convert a sequence of Unicode characters to a sequence of ANSEL bytes and vice versa.

public AnselEncoding(bool useFallback = true, GedcomOptions options = GedcomOptions.Enhanced)

Construct an AnselEncoding object.

useFallback – true to emit a 'fallback' sequence for an unrecognized sequence, or false to emit nothing for an unrecognized sequence

options – GedcomOptions.None, GedcomOptions.Standard, or GedcomOptions.Enhanced.

Invalid Inputs

An invalid input can either be ignored (skipped) or it can produce a 'fallback' sequence. By default, the AnselEncoding class emits a fallback sequence whether encoding or decoding.

When the built-in fallback mechanism is used, the encoder will emit the byte value 81 hex for an unrecognized Unicode input character. This value is used because it is undefined in ASCII, ANSEL, UTF-8, and Windows code page 1252. For any invalid ANSEL sequence, the decoder will emit the Latin base character (if present) or the Unicode replacement character, FFFD hex.

Because a valid ANSEL byte sequence cannot have more than two combining characters, any extra bytes will be ignored. Only the two bytes immediately preceding the base character will be used.
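The grouping rule described above can be sketched without the package. This is an illustration only, not the library's decoder: the class and method names are hypothetical, every byte at or above E0 hex is treated as combining (the rows of the combining table), and special cases such as the EF BF BD sequence are not handled.

```csharp
using System;
using System.Collections.Generic;

public static class AnselSequenceSketch
{
    // Bytes at E0 hex and above are treated as combining in this sketch.
    public static bool IsCombining(byte b)
    {
        return b >= 0xE0;
    }

    // Split raw bytes into sequences of zero to two combining bytes plus one base byte.
    // When more than two combining bytes precede a base byte, only the last two are kept.
    public static List<byte[]> Split(byte[] data)
    {
        var result = new List<byte[]>();
        var pending = new List<byte>();

        foreach (byte b in data)
        {
            if (IsCombining(b))
            {
                pending.Add(b);
                if (pending.Count > 2)
                {
                    pending.RemoveAt(0); // drop the oldest extra combining byte
                }
            }
            else
            {
                pending.Add(b);          // the base character closes the sequence
                result.Add(pending.ToArray());
                pending.Clear();
            }
        }

        if (pending.Count > 0)
        {
            result.Add(pending.ToArray()); // trailing combining bytes with no base
        }
        return result;
    }
}
```

For the input 41 E2 65 E5 E8 75 hex, this yields three sequences: 41, then E2 65, then E5 E8 75. For E1 E2 E3 61 hex, the extra leading byte E1 is dropped and the single sequence E2 E3 61 remains.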

Some ANSEL-coded data (notably GEDCOM) includes the invalid three-byte sequence EF BF BD hex, which is the UTF-8 sequence for the Unicode replacement character. To prevent these sequences from causing spurious decoding errors, the decoder translates this sequence as the Unicode replacement character.
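To see where EF BF BD comes from, and to check how often it occurs in a file before decoding, a small helper along these lines may be useful. It relies only on the .NET base class library; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class ReplacementCharacterDemo
{
    // Return the UTF-8 encoding of U+FFFD as a hyphen-separated hex string.
    public static string Utf8HexOfReplacementChar()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("\uFFFD");
        return BitConverter.ToString(utf8);
    }

    // Count non-overlapping occurrences of EF BF BD in raw data.
    public static int CountUtf8ReplacementSequences(byte[] data)
    {
        int count = 0;
        for (int i = 0; i + 2 < data.Length; i++)
        {
            if (data[i] == 0xEF && data[i + 1] == 0xBF && data[i + 2] == 0xBD)
            {
                count++;
                i += 2; // skip the rest of the matched sequence
            }
        }
        return count;
    }
}
```

Utf8HexOfReplacementChar returns "EF-BF-BD". A nonzero count from CountUtf8ReplacementSequences suggests the data passed through a UTF-8 conversion at some point and lost characters before it was labeled as ANSEL.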

The encoder's response to invalid inputs can be customized by overriding the OnEncodeFailure method. The decoder's response can be customized by overriding the OnDecodeFailure method.

GEDCOM Options

Option	Description
None	ANSEL/MARC only, with no GEDCOM extensions.
Standard	ANSEL/MARC plus standard GEDCOM extensions. Beyond ANSEL, the byte values BE hex (empty box), BF hex (black box), CD hex (midline e), CE hex (midline o), CF hex (lowercase sharp S), and FC hex (combining long solidus) are translated. The decoder translates the midline e and midline o bytes to Unicode 'e' (0065) and 'o' (006F). The encoder emits only the standard 'e' and 'o' values, never the bytes CD or CE hex.
Enhanced	ANSEL/MARC plus non-standard GEDCOM extensions. Single-byte characters in the range 80 – 9F hex are decoded as Windows code page 1252. Although these values are non-standard, they appear in some real GEDCOM data. Additionally, multibyte ANSEL sequences involving the base character 'i' (69 hex) are supported as alternatives to the dotless 'i' character (B8 hex). The encoder never emits these byte sequences, however.

Currently, the encoder maps 475 Unicode characters to ANSEL byte sequences, or 13 more if GEDCOM extensions are used. The decoder maps 566 ANSEL byte sequences to Unicode characters, plus 22 more if the standard GEDCOM option is used or 42 more if the enhanced GEDCOM option is used.

public int GetByteCount(string s)

Determine the number of ANSEL bytes needed to encode a Unicode character string.

s – the Unicode character string

returns – the number of ANSEL bytes

public int GetByteCount(char[] chars)

Determine the number of ANSEL bytes needed to encode an array of Unicode characters.

chars – the Unicode character array

returns – the number of ANSEL bytes

public int GetByteCount(char[] chars, int index, int count)

Determine the number of ANSEL bytes needed to encode an array of Unicode characters.

chars – the Unicode character array

index – the index of the character to start with

count – the number of characters to use

returns – the number of ANSEL bytes

public int GetMaxByteCount(int charCount)

Determine the maximum number of ANSEL bytes needed to encode a given number of Unicode characters.

charCount – the number of Unicode characters

returns – the maximum number of ANSEL bytes

public override int GetCharCount(byte[] bytes)

Determine the number of Unicode characters needed to decode an array of ANSEL bytes.

bytes – the ANSEL byte array

returns – the number of Unicode characters

public int GetCharCount(byte[] bytes, int index, int count)

Determine the number of Unicode characters needed to decode an array of ANSEL bytes.

bytes – the ANSEL byte array

index – the index of the byte to start with

count – the number of bytes to use

returns – the number of Unicode characters

public int GetMaxCharCount(int byteCount)

Determine the maximum number of Unicode characters needed to decode a given number of ANSEL bytes.

byteCount – the number of ANSEL bytes

returns – the maximum number of Unicode characters

public byte[] GetBytes(string s)

Encode a Unicode character string to create an equivalent array of ANSEL bytes.

s – the Unicode character string

returns – the array of ANSEL bytes

public byte[] GetBytes(char[] chars)

Encode an array of Unicode characters to create an equivalent array of ANSEL bytes.

chars – the Unicode character array

returns – the array of ANSEL bytes

public byte[] GetBytes(char[] chars, int index, int count)

Encode an array of Unicode characters to create an equivalent array of ANSEL bytes.

chars – the Unicode character array

index – the index of the first character to use

count – the number of characters to use

returns – the array of ANSEL bytes

public int GetBytes(string s, int charIndex, int charCount, byte[] bytes, int byteIndex)

Encode a Unicode character string to write the equivalent ANSEL bytes to a byte array.

s – the Unicode character string

charIndex – the index of the first character to use

charCount – the number of characters to use

bytes – the ANSEL byte array

byteIndex – the index of the first byte in the array to use

returns – the number of bytes written to the byte array

public int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)

Encode an array of Unicode characters to write the equivalent ANSEL bytes to a byte array.

chars – the Unicode character array

charIndex – the index of the first character to encode

charCount – the number of characters to encode

bytes – the ANSEL byte array. The array must be large enough to hold the encoded bytes (e.g. by first calling GetByteCount or GetMaxByteCount).

byteIndex – the index in the byte array at which to start writing the encoded bytes

returns – the number of bytes written to the byte array
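The buffer-sizing requirement on this overload can be met by calling GetByteCount first. The sketch below is written against the .NET Encoding base class, so it compiles without the package; since AnselEncoding derives from Encoding, an instance could presumably be passed in the same way. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class BufferEncodingSketch
{
    // Encode text into an exactly sized buffer using the five-argument GetBytes overload.
    public static byte[] EncodeExact(Encoding encoding, string text)
    {
        char[] chars = text.ToCharArray();

        // Ask the encoding how many bytes these characters will need.
        int needed = encoding.GetByteCount(chars, 0, chars.Length);
        byte[] buffer = new byte[needed];

        // Write the encoded bytes starting at index 0 of the buffer.
        int written = encoding.GetBytes(chars, 0, chars.Length, buffer, 0);
        if (written != needed)
        {
            throw new InvalidOperationException("Byte count mismatch.");
        }
        return buffer;
    }
}
```

GetMaxByteCount can replace GetByteCount when a single reusable buffer is preferred over an exact allocation, at the cost of some unused space.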

public char[] GetChars(byte[] bytes)

Decode an array of ANSEL bytes to an array of Unicode characters.

bytes – the ANSEL byte array

returns – the Unicode character array

public char[] GetChars(byte[] bytes, int index, int count)

Decode an array of ANSEL bytes to an array of Unicode characters.

bytes – the ANSEL byte array

index – the index of the first byte to use

count – the number of bytes to use

returns – the Unicode character array

public int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)

Write characters to a Unicode character array based on an array of ANSEL bytes.

bytes – the ANSEL byte array

byteIndex – the index of the first byte to decode

byteCount – the number of bytes to decode

chars – the Unicode character array. The array must be large enough to hold the decoded characters (e.g. by first calling GetCharCount or GetMaxCharCount).

charIndex – the index in the character array at which to start writing the decoded characters

returns – the number of characters written to the character array

public string GetString(byte[] bytes, int index, int count)

Decode an array of ANSEL bytes to a Unicode character string.

bytes – the ANSEL byte array

index – the index of the first byte to use

count – the number of bytes to use

returns – the Unicode character string

public Decoder GetDecoder()

Get the ANSEL decoder.
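A Decoder is typically used to decode data that arrives in chunks, where a multibyte sequence may be split across two reads. The sketch below shows the general .NET pattern using the Encoding and Decoder base classes only. Whether the ANSEL decoder carries a partial sequence from one call to the next is an assumption not confirmed by this document, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class ChunkedDecodingSketch
{
    // Decode a byte array in fixed-size chunks with a single stateful Decoder.
    public static string DecodeInChunks(Encoding encoding, byte[] data, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException("chunkSize");
        }

        Decoder decoder = encoding.GetDecoder();
        var text = new StringBuilder();
        char[] chars = new char[encoding.GetMaxCharCount(chunkSize)];

        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            bool flush = offset + count >= data.Length; // true only on the last chunk
            int produced = decoder.GetChars(data, offset, count, chars, 0, flush);
            text.Append(chars, 0, produced);
        }
        return text.ToString();
    }
}
```

With a stateful decoder, the result should match a single GetString call over the whole array, regardless of the chunk size.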

public Encoder GetEncoder()

Get the ANSEL encoder.

public byte[] GetPreamble()

Get the sequence of bytes that identifies the encoding being used.

returns – always an empty (zero-length) array, since no preamble is defined for ANSEL encoding.

public virtual byte[] OnEncodeFailure(char c)

Called whenever encoding fails for a Unicode character.

c – the unrecognized character

returns – the ANSEL byte sequence to be emitted as the fallback

By overriding this method, you can customize the ANSEL byte sequence emitted for the unrecognized Unicode character. If OnEncodeFailure returns an empty (zero element) byte sequence, no bytes will be emitted.

public virtual char? OnDecodeFailure(byte[] sequence)

Called whenever decoding fails for an ANSEL byte sequence.

sequence – the unrecognized byte sequence

returns – the Unicode character to be emitted as the fallback

By overriding this method, you can customize the Unicode character emitted for the unrecognized ANSEL sequence. If OnDecodeFailure returns null, no character will be emitted.

protected bool UseFallback { get; }

Returns true if a fallback sequence should be used, else false. If UseFallback is false, the overrides for OnEncodeFailure and OnDecodeFailure should return an empty byte array and null, respectively, instead of a fallback sequence.

C# Examples

Decoding

The simplest way to perform ANSEL decoding is to use the AnselEncoding class with a FileStream and StreamReader as shown below:

public string Decode(string anselFile)
{
   // AnselEncoding derives from Encoding, so StreamReader can use it directly.
   AnselEncoding encoding = new AnselEncoding();
   string output = string.Empty;

   // Open the ANSEL file named by the caller.
   using (FileStream stream = new FileStream(anselFile, FileMode.Open))
   {
      using (StreamReader reader = new StreamReader(stream, encoding))
      {
         output = reader.ReadToEnd();
      }
   }

   return output;
}

Encoding

The simplest way to perform ANSEL encoding is to use the AnselEncoding class with a FileStream and StreamWriter as shown below:

public void Encode(string text, string outputFile)
{
   AnselEncoding encoding = new AnselEncoding();

   using (FileStream stream = new FileStream(outputFile, FileMode.Create))
   {
      using (StreamWriter writer = new StreamWriter(stream, encoding))
      {
         writer.Write(text);
      }
   }
}

Product	Compatible and additional computed target framework versions.
.NET Framework	net40 is compatible. net403 was computed. net45 is compatible. net451 was computed. net452 was computed. net46 was computed. net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed.


.NETFramework 4.0
- No dependencies.
.NETFramework 4.5
- No dependencies.


Version	Downloads	Last updated
3.0.0	311	1/1/2024
2.1.0	712	4/30/2021
2.0.0	742	2/18/2021
1.1.0	751	2/9/2021
1.0.0	714	1/5/2021

