Character encoding support for Rust

Overview

Encoding 0.3.0-dev


Character encoding support for Rust (also known as rust-encoding). It is based on the WHATWG Encoding Standard, and also provides an advanced interface for error detection and recovery.

This documentation is for the development version (0.3). Please see the stable documentation for 0.2.x versions.

Complete Documentation (stable)

Usage

Put this in your Cargo.toml:

[dependencies]
encoding = "0.3"

Then put this in your crate root:

extern crate encoding;

Data Table

By default, Encoding comes with ~480 KB of data tables ("indices"). These allow Encoding to encode and decode legacy encodings efficiently, but the size might not be desirable for some applications.

Encoding provides the no-optimized-legacy-encoding Cargo feature to reduce the size of the encoding tables (to ~185 KB) at the expense of encoding performance (typically 5x to 20x slower). Decoding performance remains identical. This feature is strictly intended for end users. Do not try to enable this feature from library crates, ever.
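For an end-user binary, enabling the feature might look like this in Cargo.toml (a sketch; only the crate version and feature name are taken from the text above):

```toml
# End-user binaries only: trade encoding speed for ~295 KB of table size.
[dependencies.encoding]
version = "0.3"
features = ["no-optimized-legacy-encoding"]
```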

For finer-tuned optimization, see src/index/gen_index.py for custom table generation.

Overview

To encode a string:

use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),
           Ok(vec![99,97,102,233]));

To encode a string with unrepresentable characters:

use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_2;

assert!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Strict).is_err());
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Replace),
           Ok(vec![65,99,109,101,63]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Ignore),
           Ok(vec![65,99,109,101]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::NcrEscape),
           Ok(vec![65,99,109,101,38,35,49,54,57,59]));
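The NcrEscape output above is just the ASCII text "Acme&#169;" (169 is the decimal code point of U+00A9). A quick std-only check of that byte sequence, independent of the crate:

```rust
fn main() {
    // The bytes produced by EncoderTrap::NcrEscape for "Acme\u{a9}":
    let bytes = vec![65u8, 99, 109, 101, 38, 35, 49, 54, 57, 59];
    // All bytes are ASCII, so a byte-to-char mapping recovers the text.
    let text: String = bytes.iter().map(|&b| b as char).collect();
    assert_eq!(text, "Acme&#169;");
    assert_eq!('\u{a9}' as u32, 169); // U+00A9 in decimal
}
```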

To decode a byte sequence:

use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.decode(&[99,97,102,233], DecoderTrap::Strict),
           Ok("caf\u{e9}".to_string()));

To decode a byte sequence with invalid sequences:

use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_6;

assert!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Strict).is_err());
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Replace),
           Ok("Acme\u{fffd}".to_string()));
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Ignore),
           Ok("Acme".to_string()));

To encode or decode the input into an already allocated buffer:

use encoding::{Encoding, EncoderTrap, DecoderTrap};
use encoding::all::{ISO_8859_2, ISO_8859_6};

let mut bytes = Vec::new();
let mut chars = String::new();

assert!(ISO_8859_2.encode_to("Acme\u{a9}", EncoderTrap::Ignore, &mut bytes).is_ok());
assert!(ISO_8859_6.decode_to(&[65,99,109,101,169], DecoderTrap::Replace, &mut chars).is_ok());

assert_eq!(bytes, [65,99,109,101]);
assert_eq!(chars, "Acme\u{fffd}");

A practical example of custom encoder traps:

use encoding::{Encoding, ByteWriter, EncoderTrap, DecoderTrap};
use encoding::types::RawEncoder;
use encoding::all::ASCII;

// hexadecimal numeric character reference replacement
fn hex_ncr_escape(_encoder: &mut RawEncoder, input: &str, output: &mut ByteWriter) -> bool {
    let escapes: Vec<String> =
        input.chars().map(|ch| format!("&#x{:x};", ch as isize)).collect();
    let escapes = escapes.concat();
    output.write_bytes(escapes.as_bytes());
    true
}
static HEX_NCR_ESCAPE: EncoderTrap = EncoderTrap::Call(hex_ncr_escape);

let orig = "Hello, 世界!".to_string();
let encoded = ASCII.encode(&orig, HEX_NCR_ESCAPE).unwrap();
assert_eq!(ASCII.decode(&encoded, DecoderTrap::Strict),
           Ok("Hello, &#x4e16;&#x754c;!".to_string()));

Getting the encoding from the string label, as specified in WHATWG Encoding standard:

use encoding::{Encoding, DecoderTrap};
use encoding::label::encoding_from_whatwg_label;
use encoding::all::WINDOWS_949;

let euckr = encoding_from_whatwg_label("euc-kr").unwrap();
assert_eq!(euckr.name(), "windows-949");
assert_eq!(euckr.whatwg_name(), Some("euc-kr")); // for the sake of compatibility
let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
assert_eq!(euckr.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

// corresponding Encoding native API:
assert_eq!(WINDOWS_949.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

Types and Stuffs

There are three main entry points to Encoding.

Encoding is a single character encoding. It contains encode and decode methods for converting a String to a Vec<u8> and vice versa. For error handling, they receive traps (EncoderTrap and DecoderTrap respectively) that replace any error with a given string (e.g. U+FFFD) or sequence (e.g. ?). You can also use the EncoderTrap::Strict and DecoderTrap::Strict traps to stop on the first error.

There are two ways to get Encoding:

  • encoding::all has static items for every supported encoding. Use them when the encoding will not change or only a handful of them are required. Combined with link-time optimization, any unused encodings are discarded from the binary.
  • encoding::label has functions to dynamically get an encoding from a given string ("label"). They return a static reference to the encoding, whose type is also known as EncodingRef. This is useful when the list of required encodings is not available in advance, but it results in a larger binary and missed optimization opportunities.

RawEncoder is an experimental incremental encoder. At each step of raw_feed, it receives a string slice and emits any encoded bytes to a generic ByteWriter (normally Vec<u8>). It stops at the first error, if any, and returns a CodecError struct in that case. The caller is responsible for calling raw_finish at the end of the encoding process.

RawDecoder is an experimental incremental decoder. At each step of raw_feed, it receives a byte slice and emits any decoded characters to a generic StringWriter (normally String). Otherwise it is identical to RawEncoder.

One should prefer Encoding::{encode,decode} as the primary interface. RawEncoder and RawDecoder are experimental and can change substantially. See the additional documentation on the encoding::types module for more information on them.

Supported Encodings

Encoding covers all encodings specified by WHATWG Encoding Standard and some more:

  • 7-bit strict ASCII (ascii)
  • UTF-8 (utf-8)
  • UTF-16 in little endian (utf-16 or utf-16le) and big endian (utf-16be)
  • All single-byte encodings in the WHATWG Encoding Standard:
    • IBM code page 866
    • ISO 8859-{2,3,4,5,6,7,8,10,13,14,15,16}
    • KOI8-R, KOI8-U
    • MacRoman (macintosh), Macintosh Cyrillic encoding (x-mac-cyrillic)
    • Windows code pages 874, 1250, 1251, 1252 (instead of ISO 8859-1), 1253, 1254 (instead of ISO 8859-9), 1255, 1256, 1257, 1258
  • All multi-byte encodings in the WHATWG Encoding Standard:
    • Windows code page 949 (euc-kr, since the strict EUC-KR is hardly used)
    • EUC-JP and Windows code page 932 (shift_jis, since it's the most widespread extension to Shift_JIS)
    • ISO-2022-JP with asymmetric JIS X 0212 support (Note: this is not yet up to date to the current standard)
    • GBK
    • GB 18030
    • Big5-2003 with HKSCS-2008 extensions
  • Encodings that were originally specified by WHATWG Encoding Standard:
    • HZ
  • ISO 8859-1 (distinct from Windows code page 1252)

Parenthesized names refer to the encoding's primary name assigned by WHATWG Encoding Standard.

Many legacy character encodings lack a proper specification, and even those that have one are highly dependent on the actual implementation. Consequently, one should be careful when picking a character encoding. The only reliable standards in this regard are the WHATWG Encoding Standard and the vendor-provided mappings from the Unicode Consortium. Whenever in doubt, look at the source code and specifications for detailed explanations.

Comments
  • (fix) removed deprecated syntax for lifetime in traits


    I don't really understand what's going on, but removing the 'static lifetime allows this library to compile and the tests to pass. However, 104 tests were ignored.

    This fixes the error:

        src/encoding/types.rs:105:25: 105:32 error: expected ident, found 'static
        src/encoding/types.rs:105 pub trait StringWriter: 'static {

    opened by brycefisher 6
  • "Replace" vs. WHATWG error handling

    Hi,

    Quoting from the README:

    use encoding::whatwg;
    let mut euckr = whatwg::TextDecoder::new(Some(~"euc-kr")).unwrap();
    euckr.encoding(); // => ~"euc-kr"
    let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
    euckr.decode_buffer(Some(broken)); // => Ok(~"\uc6b0\uc640\ufffd\uc559")
    
    // this is different from rust-encoding's default behavior:
    let decoded = all::WINDOWS_949.decode(broken, Replace); // => Ok(~"\uc6b0\uc640\ufffd\ufffd")
    

    Is there a reason for this difference? Could the built-in Replace trap be aligned with the spec?

    opened by SimonSapin 6
  • Fix hyphens on target names error


    As of the latest nightly, the rules in https://github.com/rust-lang/rfcs/blob/master/text/0940-hyphens-considered-harmful.md are now fully implemented and in use. This patch fixes the errors on cargo build that arise when attempting to build projects that depend on this library:

    Unable to get packages from source
    
    Caused by:
      failed to parse manifest at `/home/filipe/.cargo/registry/src/github.com-1ecc6299db9ec823/encoding-index-simpchinese-1.20141219.1/Cargo.toml`
    
    Caused by:
      target names cannot contain hyphens: encoding-index-simpchinese
    
    opened by filipegoncalves 5
  • Charset request: ArmSCII-8


    Would it be possible to add support for the ArmSCII-8 encoding? Ref: https://manned.org/armscii-8 and https://en.wikipedia.org/wiki/ArmSCII

    I had a quick look to see if I could add this myself, as it's just a single-byte encoding; But seeing how all current codecs are autogenerated from the whatwg specs, I'm a bit lost as to the best approach to implement a custom codec. I'd be happy to provide a PR if I have some guidance on the next steps to take.

    opened by 17dec 4
  • `all::encodings()` returns an erroneous list (and should be sorted alphabetically).


    The bug is in src/all.rs:

       const ENCODINGS: &'static [EncodingRef] = &[ ....
    

    This is the way I collect the list (please confirm that it should be done this way):

       let list = all::encodings().iter().map(|&e|format!(" {}\n",e.name())).collect::<String>();
    

    the following names do not work:

     error mac-roman mac-roman mac-cyrillic hz big5-2003 pua-mapped-binary encoder-only-utf-8
    

    These two do work but are not listed:

      x-user-defined macintosh
    

    PLEASE return them alphabetically sorted!

    opened by getreu 4
  • hz-gb-2312 encoding and WHATWG compatibility


    The WHATWG Encoding Spec lists hz-gb-2312 as mapping to the replacement encoding, which uses the UTF-8 encoder and throws a special replacement encoding error for its decoder. However, it looks like this crate implements the actual HZ encoding. For WHATWG compatibility, this would have to get folded in with the rest of the replacement encodings, but I don't know if that's acceptable considering other people may be using the current implementation.

    Would you prefer to maintain strict WHATWG compatibility or keep the current implementation? If the current implementation is kept, this deviation needs to be well documented - it isn't too hard to work around, but is a bit annoying and could catch someone unaware because the rest of the crate is compatible.

    opened by aneeshusa 4
  • Incrementally parsed invalid sequences spanning multiple chunks write data


        #[test]
        fn test_invalid_multibyte_span() {
            use std::mem;
            let mut d = UTF8Encoding.decoder();
            // "ef bf be" is an invalid sequence.
            assert_feed_ok!(d, [], [0xef, 0xbf], "");
            let input: [u8, ..1] = [ 0xbe ];
            let (_, _, buf) = unsafe { d.test_feed(mem::transmute(input.as_slice())) };
            // Make sure no data was written to the buffer.
            assert_eq!(buf, String::new());
            // task 'codec::utf_8::tests::test_invalid_multibyte_span' failed at 'assertion failed: `(left == right) && (right == left)` (left: `￾`, right: ``)', /Users/cgaebel/code/rust-encoding/src/codec/utf_8.rs:529
        }
    

    This test successfully reports an error, but when it does it writes an invalid code sequence into the buffer.

    (Side note: GitHub markup is eating the invalid UTF-8 char in `left`. Rest assured SOMETHING is in there.)

    opened by cgaebel 4
  • Encoding.name() vs. WHATWG encoding name


    whatwg::encoding_from_label returns a tuple of an Encoding object and the encoding name as a string, while the object also has a .name() that returns the same as a string. This seems redundant.

    I would like to remove the former and only keep the latter, which should use names from the spec. The required changes are:

    1. Rename shift-jis to shift_jis.
    2. Add iso-8859-8-i, identical to iso-8859-8 but with a different name.
    3. Rename windows-949 to euc-kr

    1 and 2 are harmless, but 3 seems to have been deliberate. Is there a difference between windows-949 and euc-kr, or a reason to prefer the first name?

    opened by SimonSapin 4
  • Split encoding::types into encoding-types crate


    This allows for creation of alternative non-WHATWG encodings that use the same interface as encodings defined in this crate without pulling in all the tables and encodings.

    This commit does not introduce any breaking changes; all the types previously defined in encoding::types are reexported.

    Fixes #81.


    I tried to avoid breaking changes for now, but IMO fn decode being in encoding::types makes little sense; I’d move it to encoding at some point later.

    opened by nagisa 3
  • Change hyphen to _


    I'm getting errors when building crates which depend on "encoding-index-*" crates because of the hyphen.

    cargo build
    Unable to get packages from source

    Caused by: failed to parse manifest at .../.cargo/registry/src/github.com-1ecc6299db9ec823/encoding-index-tradchinese-1.20141219.2/Cargo.toml

    Caused by: library target names cannot contain hyphens: encoding-index-tradchinese

    opened by marvelm 3
  • Fix building with current rust


    By now, #10683 has been fixed, so the temporary can be dropped, but with the DST changes, we now get &[char, ..5], which doesn't coerce to &[char]. And since only the latter implements CharEq, we have a problem. Using as_slice() instead of prefixing the vector with &, we get a &[char] and all is good.

    opened by dotdash 3
  • Decoding as GBK and as UTF-8 does not work right


    I have a GBK string; GBK.decode(rst_raw, DecoderTrap::Strict).is_err() and UTF_8.decode(rst_raw, DecoderTrap::Strict).is_err() cannot judge the right result. I don't know why, so I wrote my own "is this a UTF-8 str" check:

        fn is_utf8(data: &[u8]) -> bool {
            let mut i = 0;
            while i < data.len() {
                let num = preNUm(data[i]);
                if data[i] & 0x80 == 0x00 {
                    i += 1;
                    continue;
                } else if num > 2 {
                    i += 1;
                    let mut j = 0;
                    while j < num - 1 {
                        if data[i] & 0xc0 != 0x80 {
                            return false;
                        }
                        j += 1;
                        i += 1;
                    }
                } else {
                    return false;
                }
            }
            return true;
        }

        fn preNUm(data: u8) -> i32 {
            let rst = format!("{:b}", data);
            let mut i = 0;
            for j in rst.chars() {
                if j != '1' {
                    break;
                }
                i += 1;
            }
            return i;
        }
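As an aside, the hand-rolled check above can be replaced by the standard library's strict UTF-8 validation; a minimal std-only sketch:

```rust
// std::str::from_utf8 performs full UTF-8 validation.
fn is_utf8(data: &[u8]) -> bool {
    std::str::from_utf8(data).is_ok()
}

fn main() {
    assert!(is_utf8("café".as_bytes())); // valid UTF-8
    assert!(!is_utf8(&[0xff, 0xfe]));    // 0xff never occurs in UTF-8
}
```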

    opened by whereisyou 0
  • Abandoned?


    What is the status of the project? It seems to have seen no updates in the last 5 years, is it abandoned? And if so what is the "official" replacement?

    If the project is to be considered abandoned, maybe that could be indicated in the readme and the project archived?

    opened by masklinn 2
  • Performance: Consider replacing lookup tables with match statements or binary search in single byte index


    The current technique for building the single-byte "forward" and "backward" functions is to generate lookup tables using gen_index.py.

    Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs

    There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.

    So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.

    /// Some Latin-1 text to test
    //
    // the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
    // https://ga.wikipedia.org/wiki/An_Ghaeilge
    pub static IRISH_TEXT: &'static str =
        "Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
         uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
         teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
         Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
         fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
         bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
         eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
         Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
         Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";
    
    pub static RUSSIAN_TEXT: &'static str =
        "Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
         национальный язык русского народа. Является одним из наиболее распространённых языков мира \
         шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
         владеющих им как родным[9]. Русский является также самым распространённым славянским \
         языком[10] и самым распространённым языком в Европе – географически и по числу носителей \
         языка как родного[7]. Русский язык – государственный язык Российской Федерации, один из \
         двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
         некоторых других стран, основной язык международного общения в Центральной Евразии, в \
         Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
         ЮНЕСКО и других международных организаций[11][12][13].";
    
    
    #[bench]
    fn bench_encode_irish(bencher: &mut test::Bencher) {
        bencher.bytes = IRISH_TEXT.len() as u64;
        bencher.iter(|| {
            test::black_box(
                WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict)
            )
        })
    }
    
    #[bench]
    fn bench_decode_irish(bencher: &mut test::Bencher) {
        let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();
        
        bencher.bytes = bytes.len() as u64;
        bencher.iter(|| {
            test::black_box(
                WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
            )
        })
    }
    
    #[bench]
    fn bench_encode_russian(bencher: &mut test::Bencher) {
        bencher.bytes = RUSSIAN_TEXT.len() as u64;
        bencher.iter(|| {
            test::black_box(
                ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
            )
        })
    }
    
    #[bench]
    fn bench_decode_russian(bencher: &mut test::Bencher) {
        let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();
        
        bencher.bytes = bytes.len() as u64;
        bencher.iter(|| {
            test::black_box(
                ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
            )
        })
    }
    

    I picked the windows-1252 encoding because it's similar to the old latin-1 standard and can encode the special characters in the Irish text I grabbed, and iso-8859-5 for similar reasons for the Russian test.

    I rewrote gen_index.py to create match statements instead of building a lookup table. You get something like this:

    // AUTOGENERATED FROM index-windows-1252.txt, ORIGINAL COMMENT FOLLOWS:
    //
    // For details on index index-windows-1252.txt see the Encoding Standard
    // https://encoding.spec.whatwg.org/
    //
    // Identifier: e56d49d9176e9a412283cf29ac9bd613f5620462f2a080a84eceaf974cfa18b7
    // Date: 2018-01-06
    #[inline]
    pub fn forward(code: u8) -> Option<u16> {
        match code {
            128 => Some(8364),
            129 => Some(129),
            130 => Some(8218),
            131 => Some(402),
            132 => Some(8222),
            133 => Some(8230),
            134 => Some(8224),
            135 => Some(8225),
            136 => Some(710),
            137 => Some(8240),
            //  a bunch more items
            250 => Some(250),
            251 => Some(251),
            252 => Some(252),
            253 => Some(253),
            254 => Some(254),
            255 => Some(255),
            _ => None
        }
    }
    
    #[inline]
    pub fn backward(code: u32) -> Option<u8> {
        match code {
            8364 => Some(128),
            129 => Some(129),
            8218 => Some(130),
            402 => Some(131),
            8222 => Some(132),
            8230 => Some(133),
            8224 => Some(134),
            8225 => Some(135),
            710 => Some(136),
            8240 => Some(137),
            352 => Some(138),
            8249 => Some(139),
            338 => Some(140),
            141 => Some(141),
            381 => Some(142),
            //  a bunch more items
            251 => Some(251),
            252 => Some(252),
            253 => Some(253),
            254 => Some(254),
            255 => Some(255),
            _ => None
        }
    }
    

    Note that I changed the function signature to return an Option instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.

    I also generated a version that uses a binary search. It's pretty simple.

    const BACKWARD_KEYS: &'static [u32] = &[
        128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
        147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
        167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
        188, 189, 190, 215, 247, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499,
        1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 8206,
        8207, 8215
    ];
    
    const BACKWARD_VALUES: &'static [u8] = &[
        128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
        147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
        167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
        188, 189, 190, 170, 186, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
        238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 253, 254, 223
    ];
    
    #[inline]
    pub fn backward(code: u32) -> u8 {
        if let Ok(index) = BACKWARD_KEYS.binary_search(&code) {
            BACKWARD_VALUES[index]
        } else {
            0
        }
    }
    

    Here's a table comparing the three techniques:

    | test | master | match | change vs master | binary search | change vs master |
    | --- | --- | --- | --- | --- | --- |
    | codec::singlebyte::tests::bench_decode_irish | 3246 ns/iter (240 MB/s) | 3171 ns/iter (245 MB/s) | +2.08% | | |
    | codec::singlebyte::tests::bench_decode_russian | 8508 ns/iter (98 MB/s) | 8890 ns/iter (94 MB/s) | -4.08% | | |
    | codec::singlebyte::tests::bench_encode_irish | 2622 ns/iter (310 MB/s) | 1688 ns/iter (482 MB/s) | +55.48% | 2243 ns/iter (363 MB/s) | +17.10% |
    | codec::singlebyte::tests::bench_encode_russian | 6692 ns/iter (228 MB/s) | 10578 ns/iter (144 MB/s) | -36.84% | 10019 ns/iter (152 MB/s) | -33.33% |

    Obviously the Irish / Windows-1252 case is improved with both alternative techniques, but the Russian case is degraded.

    It looks like the decode method isn't changed much, and that makes sense: the match arms cover contiguous integers, so I bet LLVM is optimizing them down to a lookup table anyway.
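    That equivalence is easy to sanity-check in isolation. The sketch below (std-only; function names are hypothetical, values taken from the windows-1252 subset quoted above) shows a contiguous-arm match and an explicit table agreeing on every input:

```rust
// A few windows-1252 forward mappings (byte -> Unicode scalar value).
fn forward_match(code: u8) -> Option<u16> {
    match code {
        0x00..=0x7F => Some(code as u16), // ASCII maps to itself
        128 => Some(8364), // €
        130 => Some(8218), // ‚
        131 => Some(402),  // ƒ
        132 => Some(8222), // „
        133 => Some(8230), // …
        _ => None, // remaining arms elided for brevity
    }
}

// The same subset as a lookup table indexed by `code - 128`,
// with 0 as the "unmapped" sentinel (as in the generated indices).
const FORWARD: [u16; 6] = [8364, 0, 8218, 402, 8222, 8230];

fn forward_table(code: u8) -> Option<u16> {
    if code < 0x80 {
        return Some(code as u16);
    }
    match FORWARD.get((code - 128) as usize) {
        Some(&0) | None => None,
        Some(&v) => Some(v),
    }
}

fn main() {
    // Both strategies must agree on every possible byte.
    for code in 0..=255u8 {
        assert_eq!(forward_match(code), forward_table(code));
    }
}
```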

    I'll try running some more tests.

    opened by john-parton 0
  • Bugfix/warnings


    Fix for https://github.com/lifthrasiir/rust-encoding/issues/123

    Most of the fixes were generated by running cargo fix and cargo fix --edition on the current nightly toolchain.

    Unresolved

    If you rebuild the .rs files using the gen_index.py script, it will generate code that produces warnings. I can resolve that in this PR or in another PR.

    Let me know if you have any questions.

    Thanks!

    opened by john-parton 0
  • Warnings emitted when building


    On master, cargo build emits 237 warnings.

    Here are the different kinds of warnings:

    • [ ] warning: trait objects without an explicit dyn are deprecated
    • [ ] ... range patterns are deprecated
    • [ ] unreachable pattern (this one has to do with the fact that Rust wasn't able to tell when a match pattern was exhaustive if you used all of the scalar values for a type, but now it appears to handle that correctly)
    • [ ] use of deprecated item 'try': use the ? operator instead

    You can run cargo build -v 2>&1 | grep warning | sort | uniq to get a summary.

    opened by john-parton 0
Owner
Kang Seonghoon
Yet another procrastinating programmer