Duraxium.Text
3.0.0
dotnet add package Duraxium.Text --version 3.0.0
NuGet\Install-Package Duraxium.Text -Version 3.0.0
<PackageReference Include="Duraxium.Text" Version="3.0.0" />
paket add Duraxium.Text --version 3.0.0
#r "nuget: Duraxium.Text, 3.0.0"
// Install Duraxium.Text as a Cake Addin
#addin nuget:?package=Duraxium.Text&version=3.0.0
// Install Duraxium.Text as a Cake Tool
#tool nuget:?package=Duraxium.Text&version=3.0.0
Duraxium ANSEL Encoding Library
Version | Notes |
---|---|
1.0.0 | Origination. |
1.1.0 | Changed encoder replacement byte from 0xA0 to 0x81, which is undefined in ASCII, ANSEL, UTF-8, and Windows code page 1252 (ANSI). Added decoder recognition of 0xA0, 0xBB, 0xDE, and 0xDF for extended GEDCOM. |
2.0.0 | Fixed bug related to candrabindu byte, and another related to ogonek byte encoding. Refactored code to reduce memory allocation and improve speed. |
2.1.0 | Removed dependency on .NET Standard as it can cause projects to bloat with many dependent NuGet packages. |
3.0.0 | Removed invalid double low line characters (two total). Added MARC 21 encodings for the eszett and euro sign characters. Created a byte order distinction between macron+diaeresis versus diaeresis+macron. Added 23 new character mappings. |
This NuGet package translates between ANSEL characters and composite (non-combining) Unicode characters.
ANSEL is the American National Standard for Extended Latin alphabet coded character set. The character set uses the ASCII characters defined for values 0 – 7F hex, thirty-four additional characters in the range A1 – C6 hex, and twenty-nine combining diacritic characters in the range E0 – FE hex. The MARC 21 standard defines two additional characters but is otherwise identical to ANSEL.
Though originally developed for bibliographic use, the LDS Church specified ANSEL as a valid character encoding for GEDCOM genealogical data.
The public types in this package are the AnselEncoding class and the GedcomOptions enumeration. The AnselEncoding class is derived from the .NET Encoding class, thereby providing both ANSEL encoding and decoding.
Translation
An ANSEL character is represented by a sequence of one, two, or three bytes. A single-byte sequence is always a non-combining Latin character, having a value in the range 0 – FE hex. A two-byte sequence consists of one combining character having a value in the range E0 – FF hex, followed by one Latin base character. A three-byte sequence consists of two combining characters followed by one Latin base character.
The Unicode character produced by an ANSEL three-byte sequence is the same regardless of the order of the first two bytes, except for the macron+diaeresis sequences.
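As a rough illustration of the sequence structure described above, the sketch below groups raw bytes into candidate ANSEL sequences. It assumes that every byte in the range E0 – FF hex is a combining character that attaches to the next base byte. The `AnselSequences` class and its methods are hypothetical helpers written for this example; they are not part of the package, and the package's decoder may group bytes differently.

```csharp
using System;
using System.Collections.Generic;

public static class AnselSequences
{
    // Assumption: bytes E0-FF hex are combining characters.
    public static bool IsCombining(byte b)
    {
        return b >= 0xE0;
    }

    // Groups bytes so that each group ends with a base character
    // preceded by zero or more combining characters.
    public static List<byte[]> Split(byte[] data)
    {
        var result = new List<byte[]>();
        var current = new List<byte>();
        foreach (byte b in data)
        {
            current.Add(b);
            if (!IsCombining(b))
            {
                result.Add(current.ToArray());
                current.Clear();
            }
        }
        // Trailing combining bytes with no base character form an incomplete group.
        if (current.Count > 0)
        {
            result.Add(current.ToArray());
        }
        return result;
    }
}
```

For example, `AnselSequences.Split(new byte[] { 0x41, 0xE2, 0x65, 0xE5, 0xE8, 0x75 })` returns three groups: `41`, `E2 65`, and `E5 E8 75`.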
ANSEL has three undefined character byte values – FC, FD, and FF hex. Eight other byte values do not appear in composite Unicode characters, so they are not used by the encoder or decoder. The unused codes are ligature left half (EB), ligature right half (EC), comma above right (ED), candrabindu (EF), comma below (F7), left half ring below (F8), double tilde left half (FA), and double tilde right half (FB).
The table below summarizes the combining character codes:
Combining Characters (glyphs)
Byte (hex) | ANSEL Name | Unicode Name | Notes |
---|---|---|---|
E0 | low rising tone mark | hook above | |
E1 | grave accent | grave | |
E2 | acute accent | acute | |
E3 | circumflex accent | circumflex | |
E4 | tilde | tilde | |
E5 | macron | macron | |
E6 | breve | breve | |
E7 | dot above | dot above | |
E8 | umlaut (diaeresis) | diaeresis | |
E9 | hacek | caron | |
EA | circle above | ring above | |
EB | ligature, left half | ligature, left half | No translation. |
EC | ligature, right half | ligature, right half | No translation. |
ED | high comma, off center | comma above (right) | No translation. |
EE | double acute accent | double acute | |
EF | candrabindu | N/A | No translation. |
F0 | cedilla | cedilla | |
F1 | right hook | ogonek | |
F2 | dot below | dot below | |
F3 | double dot below | diaeresis below | |
F4 | circle below | ring below | |
F5 | double underscore | double low line | No translation. |
F6 | underscore | low line | |
F7 | left hook | comma below | |
F8 | right cedilla | left half ring below | No translation. |
F9 | half circle below | breve below | |
FA | double tilde, left half | double tilde, left half | No translation. |
FB | double tilde, right half | double tilde, right half | No translation. |
FC | diacritic slash | long solidus | GEDCOM only; not ANSEL/MARC |
FD | UNDEFINED | N/A | No translation. |
FE | high comma, centered | comma above | |
FF | UNDEFINED | N/A | No translation. |
Translation between ANSEL and Unicode is neither complete nor symmetrical. It is incomplete because many ANSEL byte sequences do not have corresponding composite Unicode characters. It is not symmetrical because, with one exception, the first two bytes of a three-byte ANSEL sequence can appear in either order, so both orderings map to the same Unicode character. The GEDCOM extensions also increase the asymmetry. For these reasons, an ANSEL data set converted to Unicode and then converted back to ANSEL may not match the original byte for byte.
AnselEncoding
The AnselEncoding class can convert a sequence of Unicode characters to a sequence of ANSEL bytes and vice versa.
public AnselEncoding(bool useFallback = true, GedcomOptions options = GedcomOptions.Enhanced)
Construct an AnselEncoding object.
useFallback – true to emit a 'fallback' sequence for an unrecognized sequence, or false to emit nothing for an unrecognized sequence
options – GedcomOptions.None, GedcomOptions.Standard, or GedcomOptions.Enhanced.
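The constructor parameters can be exercised with a short round trip, sketched below. The `using Duraxium.Text;` directive is an assumption, since the text above names only the package and not its namespace. `AnselOptionsExample` and `RoundTrip` are hypothetical names for this illustration. `GetString(byte[])` is inherited from the .NET `Encoding` base class.

```csharp
using System;
using System.Text;
using Duraxium.Text; // assumed namespace; the source names only the package

public static class AnselOptionsExample
{
    // Encodes text to ANSEL bytes and decodes it again with the same settings.
    public static string RoundTrip(string text, bool useFallback, GedcomOptions options)
    {
        Encoding ansel = new AnselEncoding(useFallback, options);
        byte[] bytes = ansel.GetBytes(text);
        return ansel.GetString(bytes);
    }
}
```

Calling `AnselOptionsExample.RoundTrip(text, false, GedcomOptions.None)` applies strict ANSEL/MARC rules and emits nothing for unrecognized input. Passing `true` and `GedcomOptions.Enhanced` matches the constructor defaults.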
Invalid Inputs
An invalid input can either be ignored (skipped) or it can produce a 'fallback' sequence. By default, the AnselEncoding class emits a fallback sequence whether encoding or decoding.
When the built-in fallback mechanism is used, the encoder will emit the byte value 81 hex for an unrecognized Unicode input character. This value is used because it is undefined in ASCII, ANSEL, UTF-8, and Windows code page 1252. For any invalid ANSEL sequence, the decoder will emit the Latin base character (if present) or the Unicode replacement character, FFFD hex.
Because a valid ANSEL byte sequence cannot have more than two combining characters, any extra bytes will be ignored. Only the two bytes immediately preceding the base character will be used.
Some ANSEL-coded data (notably GEDCOM) includes the invalid three byte sequence EF BF BD hex, which is the UTF-8 sequence for the Unicode replacement character. To prevent these sequences from causing spurious decoding errors, the decoder translates this sequence as the Unicode replacement character.
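The byte values quoted above can be checked with the standard .NET UTF-8 encoder alone. The sketch below uses no package types; `ReplacementCharBytes` is a hypothetical helper name.

```csharp
using System;
using System.Text;

public static class ReplacementCharBytes
{
    // Returns the UTF-8 bytes of the Unicode replacement character (U+FFFD)
    // as a hyphen-separated hex string.
    public static string Utf8Hex()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("\uFFFD");
        return BitConverter.ToString(utf8);
    }
}
```

`ReplacementCharBytes.Utf8Hex()` returns `"EF-BF-BD"`. In ANSEL terms the first byte, EF, is the candrabindu combining character, which the table above lists as having no translation.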
The encoder's response to invalid inputs can be customized by overriding the OnEncodeFailure method. The decoder's response can be customized by overriding the OnDecodeFailure method.
GEDCOM Options
Option | Description |
---|---|
None | ANSEL/MARC only, with no GEDCOM extensions. |
Standard | ANSEL/MARC plus standard GEDCOM extensions. <br /><br />Beyond ANSEL, the byte values BE hex (empty box), BF hex (black box), CD hex (midline e), CE hex (midline o), CF hex (lowercase sharp S), and FC hex (combining long solidus) are translated. The decoder translates the midline e and midline o sequences to Unicode ‘e’ (0065) and ‘o’ (006F). The encoder emits only standard 'e' and 'o' values, never the bytes CD or CE hex. |
Enhanced | ANSEL/MARC plus non-standard GEDCOM extensions.<br /><br />Single-byte characters in the range from 80 – 9F hex are decoded as Windows code page 1252. Although these sequences are non-standard, they appear in some real GEDCOM data. Additionally, multibyte ANSEL sequences involving the base character 'i' (69 hex) are supported as alternatives to the dotless 'i' character (B8 hex). However, the encoder never emits these byte sequences. |
Currently, the encoder maps 475 Unicode characters to ANSEL byte sequences, or 13 more if GEDCOM extensions are used. The decoder maps 566 ANSEL byte sequences to Unicode characters, or 22 more if the standard GEDCOM option is used, or 42 more if the enhanced GEDCOM option is used.
public int GetByteCount(string s)
Determine the number of ANSEL bytes needed to encode a Unicode character string.
s – the Unicode character string
returns – the number of ANSEL bytes
public int GetByteCount(char[] chars)
Determine the number of ANSEL bytes needed to encode an array of Unicode characters.
chars – the Unicode character array
returns – the number of ANSEL bytes
public int GetByteCount(char[] chars, int index, int count)
Determine the number of ANSEL bytes needed to encode an array of Unicode characters.
chars – the Unicode character array
index – the index of the character to start with
count – the number of characters to use
returns – the number of ANSEL bytes
public int GetMaxByteCount(int charCount)
Determine the maximum number of ANSEL bytes needed to encode a given number of Unicode characters.
charCount – the number of Unicode characters
returns – the maximum number of ANSEL bytes
public override int GetCharCount(byte[] bytes)
Determine the number of Unicode characters needed to decode an array of ANSEL bytes.
bytes – the ANSEL byte array
returns – the number of Unicode characters
public int GetCharCount(byte[] bytes, int index, int count)
Determine the number of Unicode characters needed to decode an array of ANSEL bytes.
bytes – the ANSEL byte array
index – the index of the byte to start with
count – the number of bytes to use
returns – the number of Unicode characters
public int GetMaxCharCount(int byteCount)
Determine the maximum number of Unicode characters needed to decode a given number of ANSEL bytes.
byteCount – the number of ANSEL bytes
returns – the maximum number of Unicode characters
public byte[] GetBytes(string s)
Encode a Unicode character string to create an equivalent array of ANSEL bytes.
s – the Unicode character string
returns – the array of ANSEL bytes
public byte[] GetBytes(char[] chars)
Encode an array of Unicode characters to create an equivalent array of ANSEL bytes.
chars – the Unicode character array
returns – the array of ANSEL bytes
public byte[] GetBytes(char[] chars, int index, int count)
Encode an array of Unicode characters to create an equivalent array of ANSEL bytes.
chars – the Unicode character array
index – the index of the first character to use
count – the number of characters to use
returns – the array of ANSEL bytes
public int GetBytes(string s, int charIndex, int charCount, byte[] bytes, int byteIndex)
Encode a Unicode character string to write the equivalent ANSEL bytes to a byte array.
s – the Unicode character string
charIndex – the index of the first character to use
charCount – the number of characters to use
bytes – the ANSEL byte array
byteIndex – the index of the first byte in the array to use
returns – the number of bytes written to the byte array
public int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
Encode an array of Unicode characters to write the equivalent ANSEL bytes to a byte array.
chars – the Unicode character array
charIndex – the index of the first character to encode
charCount – the number of characters to encode
bytes – the ANSEL byte array. The array must be large enough to hold the encoded bytes (e.g. by first calling GetByteCount or GetMaxByteCount).
byteIndex – the index in the byte array at which to start writing
returns – the number of bytes written to the byte array
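The sizing advice for the bytes parameter can be sketched as follows. The helper is written against the .NET `Encoding` base class so that it compiles without the package. Because AnselEncoding derives from `Encoding`, an AnselEncoding instance can be passed in. `BufferEncoding` and `EncodeExact` are hypothetical names.

```csharp
using System;
using System.Text;

public static class BufferEncoding
{
    // Sizes the output buffer with GetByteCount, then encodes into it
    // starting at byte index 0.
    public static byte[] EncodeExact(Encoding encoding, string text)
    {
        char[] chars = text.ToCharArray();
        int needed = encoding.GetByteCount(chars, 0, chars.Length);
        byte[] buffer = new byte[needed];
        int written = encoding.GetBytes(chars, 0, chars.Length, buffer, 0);
        // Trim defensively if fewer bytes were written than predicted.
        if (written != needed)
        {
            Array.Resize(ref buffer, written);
        }
        return buffer;
    }
}
```

With the package referenced, a call might look like `BufferEncoding.EncodeExact(new AnselEncoding(), text)`. Using GetMaxByteCount instead avoids the extra counting pass at the cost of a larger buffer.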
public char[] GetChars(byte[] bytes)
Decode an array of ANSEL bytes to an array of Unicode characters.
bytes – the ANSEL byte array
returns – the Unicode character array
public char[] GetChars(byte[] bytes, int index, int count)
Decode an array of ANSEL bytes to an array of Unicode characters.
bytes – the ANSEL byte array
index – the index of the first byte to use
count – the number of bytes to use
returns – the Unicode character array
public int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
Write characters to a Unicode character array based on an array of ANSEL bytes.
bytes – the ANSEL byte array
byteIndex – the index of the first byte to decode
byteCount – the number of bytes to decode
chars – the Unicode character array. The array must be large enough to hold the decoded characters (e.g. by first calling GetCharCount or GetMaxCharCount).
charIndex – the index in the character array at which to start writing
returns – the number of characters written to the character array
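Decoding into a caller-supplied buffer follows the same pattern. The sketch below sizes the buffer for the worst case with GetMaxCharCount and keeps only the characters actually written. It is written against the .NET `Encoding` base class, from which AnselEncoding derives. `BufferDecoding` and `DecodeWithMax` are hypothetical names.

```csharp
using System;
using System.Text;

public static class BufferDecoding
{
    // Allocates a worst-case buffer, decodes into it from index 0,
    // and returns only the characters that were written.
    public static string DecodeWithMax(Encoding encoding, byte[] bytes)
    {
        char[] buffer = new char[encoding.GetMaxCharCount(bytes.Length)];
        int written = encoding.GetChars(bytes, 0, bytes.Length, buffer, 0);
        return new string(buffer, 0, written);
    }
}
```

For ANSEL input the worst-case buffer is usually larger than needed, because a two- or three-byte sequence decodes to a single composite character. The return value of GetChars is therefore the reliable length.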
public string GetString(byte[] bytes, int index, int count)
Decode an array of ANSEL bytes to a Unicode character string.
bytes – the ANSEL byte array
index – the index of the first byte to use
count – the number of bytes to use
returns – the Unicode character string
public Decoder GetDecoder()
Get the ANSEL decoder.
public Encoder GetEncoder()
Get the ANSEL encoder.
public byte[] GetPreamble()
Get a sequence of bytes that specifies the encoding being used.
returns – always an empty (zero-length) array, since no preamble has been defined for ANSEL encoding.
public virtual byte[] OnEncodeFailure(char c)
Called whenever encoding fails for a Unicode character.
c – the unrecognized character
returns – the ANSEL byte sequence to be emitted as the fallback
By overriding this method, you can customize the ANSEL byte sequence emitted for the unrecognized Unicode character. If OnEncodeFailure returns an empty (zero element) byte sequence, no bytes will be emitted.
public virtual char? OnDecodeFailure(byte[] sequence)
Called whenever decoding fails for an ANSEL byte sequence.
sequence – the unrecognized byte sequence
returns – the Unicode character to be emitted as the fallback
By overriding this method, you can customize the Unicode character emitted for the unrecognized ANSEL sequence. If OnDecodeFailure returns null, no character will be emitted.
protected bool UseFallback { get; }
Returns true if a fallback sequence should be used, else false. If UseFallback is false, the overrides for OnEncodeFailure and OnDecodeFailure should return an empty byte array and null, respectively, instead of a fallback sequence.
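A minimal sketch of a subclass that customizes both fallbacks is shown below. It relies on the signatures listed above. The `using Duraxium.Text;` directive is an assumption about the package's namespace, and `QuestionMarkAnselEncoding` is a hypothetical class name.

```csharp
using System;
using System.Text;
using Duraxium.Text; // assumed namespace; the source names only the package

public class QuestionMarkAnselEncoding : AnselEncoding
{
    public QuestionMarkAnselEncoding(bool useFallback = true)
        : base(useFallback, GedcomOptions.Standard)
    {
    }

    // Emit ASCII '?' (3F hex) instead of the built-in 81 hex,
    // or nothing when fallback is disabled.
    public override byte[] OnEncodeFailure(char c)
    {
        return UseFallback ? new byte[] { 0x3F } : new byte[0];
    }

    // Emit '?' for an unrecognized ANSEL sequence,
    // or no character when fallback is disabled.
    public override char? OnDecodeFailure(byte[] sequence)
    {
        return UseFallback ? (char?)'?' : null;
    }
}
```

Both overrides check UseFallback so that constructing the subclass with `useFallback: false` still suppresses output for unrecognized input, as the property description recommends.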
C# Examples
Decoding
The simplest way to perform ANSEL decoding is to use the AnselEncoding class with a FileStream and StreamReader as shown below:
// Requires "using System.IO;" plus the namespace that contains AnselEncoding.
public string Decode(string anselFile)
{
    AnselEncoding encoding = new AnselEncoding();
    // Open the file named by the parameter rather than a hard-coded path.
    using (FileStream stream = new FileStream(anselFile, FileMode.Open))
    using (StreamReader reader = new StreamReader(stream, encoding))
    {
        return reader.ReadToEnd();
    }
}
Encoding
The simplest way to perform ANSEL encoding is to use the AnselEncoding class with a FileStream and StreamWriter as shown below:
public void Encode(string text, string outputFile)
{
    AnselEncoding encoding = new AnselEncoding();
    using (FileStream stream = new FileStream(outputFile, FileMode.Create))
    {
        using (StreamWriter writer = new StreamWriter(stream, encoding))
        {
            writer.Write(text);
        }
    }
}
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET Framework | net40 is compatible. net403 was computed. net45 is compatible. net451 was computed. net452 was computed. net46 was computed. net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
- .NETFramework 4.0: No dependencies.
- .NETFramework 4.5: No dependencies.