A Gecko-oriented implementation of the Encoding Standard in Rust

Overview

encoding_rs

Build Status crates.io docs.rs Apache 2 / MIT dual-licensed

encoding_rs is an implementation of the (non-JavaScript parts of the) Encoding Standard written in Rust and used in Gecko (starting with Firefox 56).

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t).
  • Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
  • Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
  • Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
  • Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
  • Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
  • Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
  • Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the standard library; hopefully will get upstreamed some day) and ASCII.

Additionally, encoding_rs::mem does the following:

  • Checks if a byte buffer contains only ASCII.
  • Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
  • Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer contains only Latin1 code points (below U+0100).
  • Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior (suitable for checking if the Unicode Bidirectional Algorithm can be optimized out).
  • Combined versions of the above two checks.
  • Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
  • Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
  • Converts UTF-8 and UTF-16 to Latin1 (if in range).
  • Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
  • Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
  • Copies ASCII from one buffer to another up to the first non-ASCII byte.
  • Converts ASCII to UTF-16 up to the first non-ASCII byte.
  • Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.

Integration with std::io

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.

no_std Environment

The crate works in a no_std environment assuming that alloc is present. The alloc-using parts are on the outer edge of the crate, so if there is interest in using the crate in environments without alloc, it would be feasible to add a way to turn off the parts of the API that use Vec/String/Cow.

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the codepage crate.

DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by the Web Platform, but the oem_cp crate does.

Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into a legacy encoding minimizes unmappable characters. Text can be normalized to Unicode Normalization Form C using the unic-normal crate.

The exception is windows-1258, which after normalizing to Unicode Normalization Form C requires tone marks to be decomposed in order to minimize unmappable characters. Vietnamese tone marks can be decomposed using the detone crate.

Licensing

Please see the file named COPYRIGHT.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and internals of the crate.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

The bindings for the mem module are in the encoding_c_mem crate.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

There's a write-up about the C++ wrappers.

Sample programs

Optional features

There are currently these optional cargo features:
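For illustration, these features are enabled the usual Cargo way; a sketch of a Cargo.toml fragment (the version requirement is only an example):

```toml
[dependencies]
encoding_rs = { version = "0.8", features = ["fast-legacy-encode"] }
```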

simd-accel

Enables SIMD acceleration using the nightly-dependent packed_simd_2 crate.

This is an opt-in feature, because enabling this feature opts out of Rust's guarantees of future compilers compiling old code (aka. "stability story").

Currently, this has not been tested to be an improvement except for these targets:

  • x86_64
  • i686
  • aarch64
  • thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.

Note! If you are compiling for a target that does not have 128-bit SIMD enabled as part of the target definition, and you are enabling 128-bit SIMD using -C target_feature, you need to enable the core_arch Cargo feature for packed_simd_2 so that it compiles a crates.io snapshot of core_arch instead of using the standard-library copy of core::arch, because the core::arch module of the pre-compiled standard library has been compiled with the assumption that the CPU doesn't have 128-bit SIMD. At present, this applies mainly to 32-bit ARM targets whose first component does not include the substring neon.

The encoding_rs side of things has not been properly set up for POWER, PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow the advice from the previous paragraph, you probably shouldn't use the simd-accel option on the less mainstream architectures at this time.

Used by Firefox.

serde

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

fast-legacy-encode

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

  • fast-hangul-encode
  • fast-hanja-encode
  • fast-kanji-encode
  • fast-gb-hanzi-encode
  • fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

fast-hangul-encode

Changes encoding of precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index, making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-hanja-encode

Changes encoding of Hanja into EUC-KR from linear search over the decode-optimized table to lookup by index. Since Hanja is practically absent in modern Korean text, this option doesn't affect performance in the common case and mainly makes sense if you want to make your application resilient against denial of service by someone intentionally feeding it a lot of Hanja to encode into EUC-KR.

Adds 40 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-kanji-encode

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear search over the decode-optimized tables to lookup by index, making Japanese plain-text encode to legacy encodings 30 to 50 times as fast as without this option (about 2 times as fast as with less-slow-kanji-encode).

Takes precedence over less-slow-kanji-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-kanji-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-kanji-encode

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) making Japanese plain-text encode to legacy encodings 14 to 23 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-gb-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and gb18030 from linear search over a part of the decode-optimized tables followed by a binary search over another part of the decode-optimized tables to lookup by index, making Simplified Chinese plain-text encode to the legacy encodings 100 to 110 times as fast as without this option (about 2.5 times as fast as with less-slow-gb-hanzi-encode).

Takes precedence over less-slow-gb-hanzi-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-gb-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-gb-hanzi-encode

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) making Simplified Chinese plain-text encode to the legacy encodings about 40 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-big5-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from linear search over a part of the decode-optimized tables to lookup by index, making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast as without this option (about 3 times as fast as with less-slow-big5-hanzi-encode).

Takes precedence over less-slow-big5-hanzi-encode.

Adds 40 KB to the binary size (20 KB compared to less-slow-big5-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-big5-hanzi-encode

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) making Traditional Chinese plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have encode-specific data tables beyond 32 bits of encode-specific data for each single-byte encoding. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, by default, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is available separately.

Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly.

At this time, there is no firm commitment to support a version older than what's required by Firefox, and there is no commitment to treat MSRV changes as semver-breaking, because this crate depends on cfg-if, which doesn't appear to treat MSRV changes as semver-breaking, so it would be useless for this crate to treat MSRV changes as semver-breaking.

As of 2021-02-04, MSRV appears to be Rust 1.36.0 for using the crate and 1.42.0 for doc tests to pass without errors about the global allocator.

Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.

Regenerating Generated Code

To regenerate the generated code:

Roadmap

  • Design the low-level API.
  • Provide Rust-only convenience features.
  • Provide an stl/gsl-flavored C++ API.
  • Implement all decoders and encoders.
  • Add unit tests for all decoders and encoders.
  • Finish BOM sniffing variants in Rust-only convenience features.
  • Document the API.
  • Publish the crate on crates.io.
  • Create a solution for measuring performance.
  • Accelerate ASCII conversions using SSE2 on x86.
  • Accelerate ASCII conversions using ALU register-sized operations on non-x86 architectures (process a usize instead of a u8 at a time).
  • Split FFI into a separate crate so that the FFI doesn't interfere with LTO in pure-Rust usage.
  • Compress CJK indices by making use of sequential code points as well as Unicode-ordered parts of indices.
  • Make lookups by label or name use binary search that searches from the end of the label/name to the start.
  • Make labels with non-ASCII bytes fail fast.
  • Parallelize UTF-8 validation using Rayon. (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
  • Provide an XPCOM/MFBT-flavored C++ API.
  • Investigate accelerating single-byte encode with a single fast-tracked range per encoding.
  • Replace uconv with encoding_rs in Gecko.
  • Implement the rust-encoding API in terms of encoding_rs.
  • Add SIMD acceleration for Aarch64.
  • Investigate the use of NEON on 32-bit ARM.
  • Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as adapted to Rust in rust-encoding.
  • Add actually fast CJK encode options.
  • Investigate Bob Steagall's lookup table acceleration for UTF-8.
  • Provide a build mode that works without alloc (with lesser API surface).
  • Migrate to std::simd once it is stable and declare 1.0.

Release Notes

0.8.28

  • Fix error in Serde support introduced as part of no_std support.

0.8.27

  • Make the crate work in a no_std environment (with alloc).

0.8.26

  • Fix oversights in edition 2018 migration that broke the simd-accel feature.

0.8.25

  • Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
  • Update the packed_simd dependency to packed_simd_2.
  • Update the cfg-if dependency to 1.0.
  • Address warnings that have been introduced by newer Rust versions along the way.
  • Update to edition 2018, since even prior to 1.0 cfg-if updated to edition 2018 without a semver break.

0.8.24

  • Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.

0.8.23

  • Remove year from copyright notices. (No features or bug fixes.)

0.8.22

  • Formatting fix and new unit test. (No features or bug fixes.)

0.8.21

  • Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.

0.8.20

  • Make Decoder::latin1_byte_compatible_up_to return None in more cases to make the method actually useful. While this could be argued to be a breaking change due to the bug fix changing semantics, it does not break callers that had to handle the None case in a reasonable way anyway.

0.8.19

  • Removed a bunch of bound checks in convert_str_to_utf16.
  • Added mem::convert_utf8_to_utf16_without_replacement.

0.8.18

  • Added mem::utf8_latin1_up_to and mem::str_latin1_up_to.
  • Added Decoder::latin1_byte_compatible_up_to.

0.8.17

  • Update bincode (dev dependency) version requirement to 1.0.

0.8.16

  • Switch from the simd crate to packed_simd.

0.8.15

  • Adjust documentation for simd-accel (README-only release).

0.8.14

  • Made UTF-16 to UTF-8 encode conversion fill the output buffer as closely as possible.

0.8.13

  • Made the UTF-8 to UTF-16 decoder compare the number of code units written with the length of the right slice (the output slice) to fix a panic introduced in 0.8.11.

0.8.12

  • Removed the clippy:: prefix from clippy lint names.

0.8.11

  • Changed minimum Rust requirement to 1.29.0 (for the ability to refer to the interior of a static when defining another static).
  • Explicitly aligned the lookup tables for single-byte encodings and UTF-8 to cache lines in the hope of freeing up one cache line for other data. (Perhaps the tables were already aligned and this is placebo.)
  • Added 32 bits of encode-oriented data for each single-byte encoding. The change was performance-neutral for non-Latin1-ish Latin legacy encodings, improved Latin1-ish and Arabic legacy encode speed somewhat (new speed is 2.4x the old speed for German, 2.3x for Arabic, 1.7x for Portuguese and 1.4x for French) and improved non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
  • Added compile-time options for fast CJK legacy encode options (at the cost of binary size (up to 176 KB) and run-time memory usage). These options still retain the overall code structure instead of rewriting the CJK encoders totally, so the speed isn't as good as what could be achieved by using even more memory / making the binary even larger.
  • Made UTF-8 decode and validation faster.
  • Added method is_single_byte() on Encoding.
  • Added mem::decode_latin1() and mem::encode_latin1_lossy().

0.8.10

  • Disabled a unit test that tests a panic condition when the assertion being tested is disabled.

0.8.9

  • Made --features simd-accel work with stable-channel compiler to simplify the Firefox build system.

0.8.8

  • Made the is_foo_bidi() not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE aka. BYTE ORDER MARK) as right-to-left.
  • Made the is_foo_bidi() functions report true if the input contains Hebrew presentation forms (which are right-to-left but not in a right-to-left-roadmapped block).

0.8.7

  • Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.

0.8.6

  • Temporarily removed the debug assertion added in version 0.8.5 from convert_utf16_to_latin1_lossy.

0.8.5

  • If debug assertions are enabled but fuzzing isn't enabled, lossy conversions to Latin1 in the mem module assert that the input is in the range U+0000...U+00FF (inclusive).
  • In the mem module provide conversions from Latin1 and UTF-16 to UTF-8 that can deal with insufficient output space. The idea is to use them first with an allocation rounded up to jemalloc bucket size and do the worst-case allocation only if the jemalloc rounding up was insufficient as the first guess.

0.8.4

  • Fix SSE2-specific, simd-accel-specific memory corruption introduced in version 0.8.1 in conversions between UTF-16 and Latin1 in the mem module.

0.8.3

  • Removed an #[inline(never)] annotation that was not meant for release.

0.8.2

  • Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound checks and manually adding branch prediction annotations.

0.8.1

  • Tweaked loop unrolling and memory alignment for SSE2 conversions between UTF-16 and Latin1 in the mem module to increase the performance when converting long buffers.

0.8.0

  • Changed the minimum supported version of Rust to 1.21.0 (semver breaking change).
  • Flipped around the defaults vs. optional features for controlling the size vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking change).
  • Added NEON support on ARMv7.
  • SIMD-accelerated x-user-defined to UTF-16 decode.
  • Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD acceleration).

0.7.2

  • Add the mem module.
  • Refactor SIMD code which can affect performance outside the mem module.

0.7.1

  • When encoding from invalid UTF-16, correctly handle U+DC00 followed by another low surrogate.

0.7.0

  • Make replacement a label of the replacement encoding. (Spec change.)
  • Remove Encoding::for_name(). (Encoding::for_label(foo).unwrap() is now close enough after the above label change.)
  • Remove the parallel-utf8 cargo feature.
  • Add optional Serde support for &'static Encoding.
  • Performance tweaks for ASCII handling.
  • Performance tweaks for UTF-8 validation.
  • SIMD support on aarch64.

0.6.11

  • Make Encoder::has_pending_state() public.
  • Update the simd crate dependency to 0.2.0.

0.6.10

  • Reserve enough space for NCRs when encoding to ISO-2022-JP.
  • Correct max length calculations for multibyte decoders.
  • Correct max length calculations before BOM sniffing has been performed.
  • Correctly calculate max length when encoding from UTF-16 to GBK.

0.6.9

0.6.8

  • Correctly handle the case where the first buffer contains a potentially partial BOM and the next buffer is the last buffer.
  • Decode byte 7F correctly in ISO-2022-JP.
  • Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
  • Implement Hash for Encoding.

0.6.7

0.6.6

  • Correct max length calculation when a partial BOM prefix is part of the decoder's state.

0.6.5

  • Correct max length calculation in various encoders.
  • Correct max length calculation in the UTF-16 decoder.
  • Derive PartialEq and Eq for the CoderResult, DecoderResult and EncoderResult types.

0.6.4

  • Avoid panic when encoding with replacement and the destination buffer is too short to hold one numeric character reference.

0.6.3

  • Add support for 32-bit big-endian hosts. (For real this time.)

0.6.2

  • Fix a panic from subslicing with bad indices in Encoder::encode_from_utf16. (Due to an oversight, it lacked the fix that Encoder::encode_from_utf8 already had.)
  • Micro-optimize error status accumulation in non-streaming case.

0.6.1

  • Avoid panic near integer overflow in a case that's unlikely to actually happen.
  • Address Clippy lints.

0.6.0

  • Make the methods for computing worst-case buffer size requirements check for integer overflow.
  • Upgrade rayon to 0.7.0.

0.5.1

  • Reorder methods for better documentation readability.
  • Add support for big-endian hosts. (Only 64-bit case actually tested.)
  • Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.

0.5.0

  • Avoid allocating excessively long buffers in non-streaming decode.
  • Fix the behavior of ISO-2022-JP and replacement decoders near the end of the output buffer.
  • Annotate the result structs with #[must_use].

0.4.0

  • Split FFI into a separate crate.
  • Performance tweaks.
  • CJK binary size and encoding performance changes.
  • Parallelize UTF-8 validation in the case of long buffers (with optional feature parallel-utf8).
  • Borrow even with ISO-2022-JP when possible.

0.3.2

  • Fix moving pointers to alignment in ALU-based ASCII acceleration.
  • Fix errors in documentation and improve documentation.

0.3.1

  • Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
  • Make UTF-8 to UTF-8 decode SSE2-accelerated when feature simd-accel is used.
  • When decoding and encoding ASCII-only input from or to an ASCII-compatible encoding using the non-streaming API, return a borrow of the input.
  • Make encode from UTF-16 to UTF-8 faster.

0.3

  • Change the references to the instances of Encoding from const to static to make the referents unique across crates that use the references.
  • Introduce non-reference-typed FOO_INIT instances of Encoding to allow foreign crates to initialize static arrays with references to Encoding instances even under Rust's constraints that prohibit the initialization of &'static Encoding-typed array items with &'static Encoding-typed statics.
  • Document that the above two points will be reverted if Rust changes const to work so that cross-crate usage keeps the referents unique.
  • Return Cows from Rust-only non-streaming methods for encode and decode.
  • Encoding::for_bom() returns the length of the BOM.
  • ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE, ISO-2022-JP and x-user-defined.
  • Add SSE2 acceleration behind the simd-accel feature flag. (Requires nightly Rust.)
  • Fix panic with long bogus labels.
  • Map 0xCA to U+05BA in windows-1255. (Spec change.)
  • Correct the end of the Shift_JIS EUDC range. (Spec change.)

0.2.4

  • Polish FFI documentation.

0.2.3

  • Fix UTF-16 to UTF-8 encode.

0.2.2

  • Add Encoder.encode_from_utf8_to_vec_without_replacement().

0.2.1

  • Add Encoding.is_ascii_compatible().

  • Add Encoding::for_bom().

  • Make == for Encoding use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constant cannot be turned into statics without breaking other things.

0.2.0

The initial release.

Comments
  • implementation for io::Read/io::Write

    What are your thoughts on providing implementations of the io::Read/io::Write traits as a convenience for handling stream encoding/decoding?

    Here is the specific problem I'd like to solve. Simplifying, I have a function that looks like the following:

    fn search<R: io::Read>(rdr: R) -> io::Result<SearchResults> { ... }
    

    Internally, the search function limits itself to the methods of io::Read to execute a search on its contents. The search is exhaustive, but is guaranteed to use a constant amount of heap space. The search routine expects the buffer to be UTF-8 encoded (and will handle invalid UTF-8 gracefully). I'd like to use this same search routine even if the contents of rdr are, say, UTF-16. I claim that this is possible if I wrap rdr in something that satisfies io::Read but uses an encoding_rs::Decoder internally to convert UTF-16 to UTF-8. I would expect the callers of search to do that wrapping. If there's invalid UTF-16, then inserting replacement characters is OK.

    Does this sound like something you'd be willing to maintain? I would be happy to take an initial crack at an implementation if so. (In fact, I must do this. The point of this issue is asking whether I should try to upstream it or not.) However, I think there are some interesting points worth mentioning. (There may be more!)

    1. Is this type of API useful in the context of the web? If not, then maybe it shouldn't live in this crate.
    2. The io::Read interface feels not-quite-right in some respects. For example, the io::Read primarily operates on a &[u8]. But if encoding_rs is used to provide an io::Read implementation, then it necessarily guarantees that all consumers of that implementation will read valid UTF-8, which means converting the &[u8] bytes to &str safely will incur an unnecessary cost. I'm not sure what to make of this and how much one might care, but it seems worth pointing out. (This particular issue isn't a problem for me, since the search routine itself handles UTF-8 implicitly.)
    opened by BurntSushi 17
  • Non-streaming decode() appears to remove the BOM?

    https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L2974-L2980

    Functionally, decode() and decode_with_bom_removal() seem pretty much the same? That doesn't seem correct? If there's a variant called "decode_with_bom_removal" then I would expect the standard variant not to remove the BOM.

    Compare to:

    https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L3019-L3030

    It's totally valid to decode the BOM, the BOM is a unicode character like any other character. Decoding a UTF-16 document with a BOM should yield a UTF-8 document with a BOM. Otherwise, you would just use the BOM-removing version...

    use encoding_rs::*;
    
    fn main() {
        // Two characters, '1' and then BOM character
        println!("{:?}", UTF_16LE.decode(&[0x31, 0x00, 0xFF, 0xFE]).0.as_bytes()); 
        // Nothing - BOM removed
        println!("{:?}", UTF_16LE.decode(&[0xFF, 0xFE]).0.as_bytes());
    }
    
    [49, 239, 187, 191]
    []
    
    opened by dralley 15
  • Re-add license field to Cargo.toml

    This was removed in https://github.com/hsivonen/encoding_rs/commit/3a4033e67b6b9d1c1e9514bcb5c20ae05bf8391d#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542 and causes automated tooling like cargo deny to fail detecting the license.

    It should probably be something like (Apache-2.0 OR MIT) AND BSD-3 but I'm not sure the expression syntax allows parenthesis. If it doesn't then we have a problem and you might want to reconsider if dual-licensing warrants the increased license complexity here. Having to worry about 3 different licenses for a single crate is a bit suboptimal, even if MIT and BSD-3 are approximately the same.

    opened by sdroege 10
  • Compilation issues under 1.43.0 nightly

    I've encountered a compilation issue when building under 1.43.0 nightly of the rust toolchain. I noticed the problem when building the dependent orjson which uses the nightly toolchain for compilation.

    I don't know much about rust, but it seems that an error occurs within a macro and the rust compiler subsequently panics.

    I was able to reproduce the issue with the following commands, the features are the ones used by orjson. I'm filing this issue here, as I don't quite understand what is happening with regards to macros, user code and compiler code.

    $ docker run --rm -it --entrypoint /bin/bash konstin2/maturin:master
    (docker) $ git clone https://github.com/hsivonen/encoding_rs.git
    (docker) $ cd encoding_rs/
    (docker) $ git checkout v0.8.22
    (docker) $ echo nightly > rust-toolchain
    (docker) $ cargo --version
    cargo 1.43.0-nightly (e02974078 2020-02-18)
    (docker) $ rustc --version
    rustc 1.43.0-nightly (7760cd0fb 2020-02-19)
    (docker) $ RUST_BACKTRACE=full cargo build --features simd-accel --no-default-features
    ...
    

    --verbose does not give much more information. -Z macro-backtrace does not seem to be a valid flag.

    cargo build output
    info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'
    info: latest update on 2020-02-20, rust version 1.43.0-nightly (7760cd0fb 2020-02-19)
    info: downloading component 'cargo'
    info: downloading component 'clippy'
    info: downloading component 'rust-docs'
    info: downloading component 'rust-std'
    info: downloading component 'rustc'
    info: downloading component 'rustfmt'
    info: installing component 'cargo'
    info: installing component 'clippy'
    info: installing component 'rust-docs'
    info: installing component 'rust-std'
    info: installing component 'rustc'
    info: installing component 'rustfmt'
        Updating crates.io index
      Downloaded packed_simd v0.3.3
       Compiling packed_simd v0.3.3
       Compiling encoding_rs v0.8.22 (/io/encoding_rs)
       Compiling cfg-if v0.1.10
    warning: unused label
       --> src/macros.rs:878:41
        |
    878 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/euc_jp.rs:77:5
        |
    77  | /     euc_jp_decoder_functions!(
    78  | |         {
    79  | |             let trail_minus_offset = byte.wrapping_sub(0xA1);
    80  | |             // Fast-track Hiragana (60% according to Lunde)
    ...   |
    220 | |         handle
    221 | |     );
        | |______- in this macro invocation
        |
        = note: `#[warn(unused_labels)]` on by default
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/macros.rs:878:41
        |
    878 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/euc_jp.rs:77:5
        |
    77  | /     euc_jp_decoder_functions!(
    78  | |         {
    79  | |             let trail_minus_offset = byte.wrapping_sub(0xA1);
    80  | |             // Fast-track Hiragana (60% according to Lunde)
    ...   |
    220 | |         handle
    221 | |     );
        | |______- in this macro invocation
        |
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/macros.rs:574:41
        |
    574 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/gb18030.rs:111:5
        |
    111 | /     gb18030_decoder_functions!(
    112 | |         {
    113 | |             // If first is between 0x81 and 0xFE, inclusive,
    114 | |             // subtract offset 0x81.
    ...   |
    294 | |         handle,
    295 | |         'outermost);
        | |____________________- in this macro invocation
        |
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/macros.rs:574:41
        |
    574 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/gb18030.rs:111:5
        |
    111 | /     gb18030_decoder_functions!(
    112 | |         {
    113 | |             // If first is between 0x81 and 0xFE, inclusive,
    114 | |             // subtract offset 0x81.
    ...   |
    294 | |         handle,
    295 | |         'outermost);
        | |____________________- in this macro invocation
        |
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/mem.rs:279:17
        |
    279 |                 'inner: loop {
        |                 ^^^^^^
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:743:26
        |
    743 |                         0...0x7F => {
        |                          ^^^ help: use `..=` for an inclusive range
        |
        = note: `#[warn(ellipsis_inclusive_range_patterns)]` on by default
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:749:29
        |
    749 |                         0xC2...0xD5 => {
        |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:770:36
        |
    770 |                         0xE1 | 0xE3...0xEC | 0xEE => {
        |                                    ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:879:29
        |
    879 |                         0xF1...0xF4 => {
        |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:942:18
        |
    942 |                 0...0x7F => {
        |                  ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:948:21
        |
    948 |                 0xC2...0xD5 => {
        |                     ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:985:28
        |
    985 |                 0xE1 | 0xE3...0xEC | 0xEE => {
        |                            ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2686:29
         |
    2686 |                         b'A'...b'Z' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2691:29
         |
    2691 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2691:43
         |
    2691 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                                           ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2714:29
         |
    2714 |                         b'A'...b'Z' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2723:29
         |
    2723 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2723:43
         |
    2723 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                                           ^^^ help: use `..=` for an inclusive range
    
    warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
      --> src/simd_funcs.rs:19:20
       |
    19 |     let mut simd = ::std::mem::uninitialized();
       |                    ^^^^^^^^^^^^^^^^^^^^^^^^^
       |
       = note: `#[warn(deprecated)]` on by default
    
    warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
      --> src/simd_funcs.rs:43:20
       |
    43 |     let mut simd = ::std::mem::uninitialized();
       |                    ^^^^^^^^^^^^^^^^^^^^^^^^^
    
    warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
       --> src/handles.rs:113:30
        |
    113 |             let mut u: u16 = ::std::mem::uninitialized();
        |                              ^^^^^^^^^^^^^^^^^^^^^^^^^
    
    warning: unnecessary `unsafe` block
      --> src/utf_8.rs:91:12
       |
    91 |         if unsafe { likely(read + 4 <= src.len()) } {
       |            ^^^^^^ unnecessary `unsafe` block
       |
       = note: `#[warn(unused_unsafe)]` on by default
    
    warning: unnecessary `unsafe` block
      --> src/utf_8.rs:98:20
       |
    98 |                 if unsafe { likely(in_inclusive_range8(byte, 0xC2, 0xDF)) } {
       |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:107:24
        |
    107 |                     if unsafe { likely(read + 4 <= src.len()) } {
        |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:117:20
        |
    117 |                 if unsafe { likely(byte < 0xF0) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:132:28
        |
    132 |                         if unsafe { likely(read + 4 <= src.len()) } {
        |                            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:137:32
        |
    137 | ...                   if unsafe { likely(byte < 0x80) } {
        |                          ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:162:20
        |
    162 |                 if unsafe { likely(read + 4 <= src.len()) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:261:12
        |
    261 |         if unsafe { likely(read + 4 <= src.len()) } {
        |            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:271:20
        |
    271 |                 if unsafe { likely(in_inclusive_range8(byte, 0xC2, 0xDF)) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:288:24
        |
    288 |                     if unsafe { likely(read + 4 <= src.len()) } {
        |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:300:20
        |
    300 |                 if unsafe { likely(byte < 0xF0) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:323:28
        |
    323 |                         if unsafe { likely(read + 4 <= src.len()) } {
        |                            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:328:32
        |
    328 | ...                   if unsafe { likely(byte < 0x80) } {
        |                          ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:370:20
        |
    370 |                 if unsafe { likely(read + 4 <= src.len()) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:657:20
        |
    657 |                 if unsafe { likely(unit_minus_surrogate_start > (0xDFFF - 0xD800)) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:668:20
        |
    668 |                 if unsafe { likely(unit_minus_surrogate_start <= (0xDBFF - 0xD800)) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:687:24
        |
    687 |                     if unsafe { likely(second_minus_low_surrogate_start <= (0xDFFF - 0xDC00)) } {
        |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:729:16
        |
    729 |             if unsafe { unlikely(unit < 0x80) } {
        |                ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/mem.rs:913:32
        |
    913 | ...                   if unsafe { unlikely(second == 0x90 || second == 0x9E) } {
        |                          ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1171:28
         |
    1171 |                         if unsafe { unlikely(byte >= 0xD6) } {
         |                            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1195:24
         |
    1195 |                     if unsafe { unlikely(!in_inclusive_range8(byte, 0xE3, 0xEE) && byte != 0xE1) } {
         |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1244:24
         |
    1244 |                     if unsafe { unlikely(byte == 0xF0 && (second == 0x90 || second == 0x9E)) } {
         |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1658:8
         |
    1658 |     if unsafe { likely(read == src.len()) } {
         |        ^^^^^^ unnecessary `unsafe` block
    
    error: internal compiler error: src/librustc_codegen_ssa/mir/block.rs:622: shuffle indices must be constant
       --> src/simd_funcs.rs:289:28
        |
    289 |           let first: u8x16 = shuffle!(
        |  ____________________________^
    290 | |             s,
    291 | |             u8x16::splat(0),
    292 | |             [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23]
    293 | |         );
        | |_________^
        |
        = note: this error: internal compiler error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    thread 'rustc' panicked at 'Box<Any>', <::std::macros::panic macros>:2:4
    stack backtrace:
       0:     0x7fce48a8a634 - backtrace::backtrace::libunwind::trace::h0743ecf0c905ca1e
                                   at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/libunwind.rs:86
       1:     0x7fce48a8a634 - backtrace::backtrace::trace_unsynchronized::h0e046f0811b0ae4d
                                   at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/mod.rs:66
       2:     0x7fce48a8a634 - std::sys_common::backtrace::_print_fmt::h5fcd1fd3d0e5d79e
                                   at src/libstd/sys_common/backtrace.rs:78
       3:     0x7fce48a8a634 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h85ffb53d56efd098
                                   at src/libstd/sys_common/backtrace.rs:59
       4:     0x7fce48ac37dc - core::fmt::write::h231e5515e704e96b
                                   at src/libcore/fmt/mod.rs:1052
       5:     0x7fce48a7bf97 - std::io::Write::write_fmt::h56f503f924d6c255
                                   at src/libstd/io/mod.rs:1428
       6:     0x7fce48a8f425 - std::sys_common::backtrace::_print::hf64c641be26866a9
                                   at src/libstd/sys_common/backtrace.rs:62
       7:     0x7fce48a8f425 - std::sys_common::backtrace::print::h16b5d561563c7498
                                   at src/libstd/sys_common/backtrace.rs:49
       8:     0x7fce48a8f425 - std::panicking::default_hook::{{closure}}::h8363003bce1deb1a
                                   at src/libstd/panicking.rs:204
       9:     0x7fce48a8f166 - std::panicking::default_hook::hb365b24076d7b200
                                   at src/libstd/panicking.rs:224
      10:     0x7fce490f9c39 - rustc_driver::report_ice::h2624db039b9cfba9
      11:     0x7fce48a8fb55 - std::panicking::rust_panic_with_hook::h2adc1d4c38cb25af
                                   at src/libstd/panicking.rs:474
      12:     0x7fce494cf363 - std::panicking::begin_panic::h6fca9fdb6d23f676
      13:     0x7fce493e488c - rustc_errors::HandlerInner::span_bug::h6840991938d37012
      14:     0x7fce493e4c40 - rustc_errors::Handler::span_bug::h107187c882152f33
      15:     0x7fce49478c69 - rustc::util::bug::opt_span_bug_fmt::{{closure}}::hf73fd7e05df26a89
      16:     0x7fce4947715b - rustc::ty::context::tls::with_opt::{{closure}}::h0c4fdf5a849e88e3
      17:     0x7fce49477106 - rustc::ty::context::tls::with_opt::h92cfac8e0dd8f2c9
      18:     0x7fce49478b58 - rustc::util::bug::opt_span_bug_fmt::haf8b4183f62d8df3
      19:     0x7fce49478b0a - rustc::util::bug::span_bug_fmt::h0be341af60d13d91
      20:     0x7fce49573f1a - <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold::h333c620e944c2a61
      21:     0x7fce495532cc - rustc_codegen_ssa::mir::block::<impl rustc_codegen_ssa::mir::FunctionCx<Bx>>::codegen_call_terminator::h61b66235d798dc9e
      22:     0x7fce4954e212 - rustc_codegen_ssa::mir::block::<impl rustc_codegen_ssa::mir::FunctionCx<Bx>>::codegen_block::h977ed6f45937d617
      23:     0x7fce4956055e - rustc_codegen_ssa::base::codegen_instance::h1faa821de1d9e487
      24:     0x7fce4947f6b5 - <rustc::mir::mono::MonoItem as rustc_codegen_ssa::mono_item::MonoItemExt>::define::h0b6bdfededc22107
      25:     0x7fce4940668a - rustc_codegen_llvm::base::compile_codegen_unit::module_codegen::h469c76d782c84352
      26:     0x7fce494b3227 - rustc::dep_graph::graph::DepGraph::with_task::h29956dbbd3cd6e7c
      27:     0x7fce49406254 - rustc_codegen_llvm::base::compile_codegen_unit::hc09ab7897a17060a
      28:     0x7fce4955d55a - rustc_codegen_ssa::base::codegen_crate::h80e90e6d82f0580d
      29:     0x7fce494f1715 - <rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_utils::codegen_backend::CodegenBackend>::codegen_crate::hbcef469c00126974
      30:     0x7fce492e0710 - rustc_session::utils::<impl rustc_session::session::Session>::time::h101a151e306dd79b
      31:     0x7fce4938b2ef - rustc_interface::passes::QueryContext::enter::hc499d446e1b9ab96
      32:     0x7fce492bbf4b - rustc_interface::queries::Queries::ongoing_codegen::h201d0ed995ada5da
      33:     0x7fce491632be - rustc_interface::interface::run_compiler_in_existing_thread_pool::hdde65f8eb6e34231
      34:     0x7fce4911d29d - scoped_tls::ScopedKey<T>::set::h774e12e87074d2a2
      35:     0x7fce49104d82 - syntax::attr::with_globals::hd6f4e6fb8aaadb66
      36:     0x7fce4911e963 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2e517a7b74830ac8
      37:     0x7fce48aa1447 - __rust_maybe_catch_panic
                                   at src/libpanic_unwind/lib.rs:86
      38:     0x7fce49164ef6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h3814fa1c62419cc0
      39:     0x7fce48a6c31f - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h8e917a822ffc0592
                                   at /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a/src/liballoc/boxed.rs:1017
      40:     0x7fce48a9fd50 - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h8aa486ee72f31ff1
                                   at /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a/src/liballoc/boxed.rs:1017
      41:     0x7fce48a9fd50 - std::sys_common::thread::start_thread::h8407e13fad90fc7e
                                   at src/libstd/sys_common/thread.rs:13
      42:     0x7fce48a9fd50 - std::sys::unix::thread::Thread::new::thread_start::h55e6429cb8ed2e9f
                                   at src/libstd/sys/unix/thread.rs:80
      43:     0x7fce4880883d - start_thread
      44:     0x7fce48170fdd - clone
    
    note: the compiler unexpectedly panicked. this is a bug.
    
    note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports
    
    note: rustc 1.43.0-nightly (7760cd0fb 2020-02-19) running on x86_64-unknown-linux-gnu
    
    note: compiler flags: -C debuginfo=2 -C incremental --crate-type lib
    
    note: some of the compiler flags provided by cargo are hidden
    
    query stack during panic:
    end of query stack
    error: aborting due to previous error
    
    error: could not compile `encoding_rs`.
    
    To learn more, run the command again with --verbose.
    
    
    opened by pskopnik 10
  • minimum Rust version?

    Is there any policy for this crate with respect to the minimum Rust version supported? In particular, it looks like CI always runs against whatever the current stable/beta/nightly releases are. So if a change gets merged that requires a newer version of Rust, you might not even realize that it happens.

    (N.B. As an ecosystem, the "right" policy here is terribly unclear. I personally have been operating under a conservative policy whereby bumping the minimum Rust version requires a semver bump, but I fear this won't always be tenable.)

    opened by BurntSushi 9
  • Potential Unsound: 1 out-of-bound read and 5 unaligned memory access.

    Hello.

    I'm Yoshiki, a PhD student at CMU.

    We are testing a tool to automatically generate test cases from API data and existing tests.

    A few of our generated test cases were reported as "unsound" by Miri, mostly due to unaligned or out-of-bounds memory accesses. I've attached a tarball that contains the test cases that induce this behavior.

    Please note that, because the framework leverages existing tests as templates, some of the test cases overlap with existing test cases for the library. In particular,

    decode(BIG5, b"", &"");//LAYER:0
    

    also shows up in the manually written test cases.

    In case this is intended behavior, or you would prefer if I focused on other parts of the code, please let me know.

    Thanks. ~Yoshiki

    opened by YoshikiTakashima 8
  • Enhancement: get read access to the decoder's inner state

    This is also about Stringsext, a GNU Strings Alternative with Multi-Byte-Encoding Support which I migrated from rust-encoding to encoding_rs.

    In order to keep anchors between the input and the output stream, I would need to know, once the decoder has finished, whether it still has some bytes stored in its inner state. Knowing how many bytes are held back would be best, but even knowing that there are any would already help.

    Is there a way to access this information?
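    The crate does not expose this today, but when the input encoding is UTF-8 the bookkeeping can be approximated externally. Below is a pure-std sketch (`incomplete_utf8_tail` is an invented helper, not an encoding_rs API) that counts how many trailing bytes of a chunk a streaming decoder would have to hold back:

```rust
// Pure-std sketch (NOT an encoding_rs API): count how many bytes at the
// end of `buf` form an incomplete UTF-8 sequence that a streaming decoder
// would have to keep in its internal state.
fn incomplete_utf8_tail(buf: &[u8]) -> usize {
    let len = buf.len();
    // A UTF-8 sequence is at most 4 bytes, so at most the last 3 bytes
    // can be a held-back prefix.
    for i in (len.saturating_sub(3)..len).rev() {
        let b = buf[i];
        if b < 0x80 {
            return 0; // ASCII byte: nothing pending
        }
        if b >= 0xC0 {
            // Lead byte: how many bytes does this sequence need in total?
            let need = if b >= 0xF0 { 4 } else if b >= 0xE0 { 3 } else { 2 };
            return if len - i < need { len - i } else { 0 };
        }
        // Continuation byte: keep walking backwards to find the lead byte.
    }
    0
}

fn main() {
    // "€" (E2 82 AC) with its last byte missing: 2 bytes are held back.
    assert_eq!(incomplete_utf8_tail(b"abc\xE2\x82"), 2);
    // Complete input: nothing pending.
    assert_eq!(incomplete_utf8_tail("€".as_bytes()), 0);
}
```

    For other input encodings the same idea applies, but the lead/continuation logic would have to follow that encoding's byte structure instead.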

    opened by getreu 8
  • Allocating three times the size of the input seems excessive.

    The Encoding::decode_* methods need in some cases to allocate a String and decide how much capacity to give it. Other than *_without_replacement (https://github.com/hsivonen/encoding_rs/commit/2984a8b0a310b52fe7112671c5fb94446a7f78f8#commitcomment-20990260), this is based on Encoding::max_utf8_buffer_length, which assumes the worst case. For many encodings, that's when every byte of the input is an error that emits a three-byte U+FFFD code point.

    In short, as soon as there's an error, these methods allocate three times the size of the (remaining) input. Assuming the worst case simplifies the code, which only needs to allocate once, but it seems excessive that a single bit flip near the beginning of the input could triple memory usage.

    So a more adaptive allocation scheme might be desirable, but admittedly there is no obvious answer as to what it should be.
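    For concreteness, the worst case assumed by the current scheme is that every remaining byte decodes to a three-byte U+FFFD, so the sizing arithmetic amounts to the following sketch (`worst_case_utf8_len` is a made-up name, and the real Encoding::max_utf8_buffer_length formula may add a small constant on top):

```rust
// Worst-case sizing sketch: every remaining input byte decodes to U+FFFD,
// which occupies 3 bytes in UTF-8. checked_mul mirrors overflow-safe
// capacity computation (the real API also returns Option for this reason).
fn worst_case_utf8_len(remaining_input_bytes: usize) -> Option<usize> {
    remaining_input_bytes.checked_mul(3)
}

fn main() {
    // A 1 MiB input with an error near the start could reserve ~3 MiB.
    assert_eq!(worst_case_utf8_len(1 << 20), Some(3 << 20));
    // Overflow is reported rather than wrapping.
    assert_eq!(worst_case_utf8_len(usize::MAX), None);
}
```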

    opened by SimonSapin 7
  • UTF_16LE.encode does not encode string to UTF-16 LE correctly?

    Environment

    rustc --version output:

    rustc 1.27.0-nightly (0b72d48f8 2018-04-10)
    

    and my encoding_rs version is 0.7.2.

    Steps to reproduce

    run the following program

    extern crate encoding_rs;
    
    use encoding_rs::UTF_16LE;
    
    fn main() {
        let s = "aa";
        let (bytes, enc, unmappable) = UTF_16LE.encode(s);
        let (dec, enc, unmappable) = UTF_16LE.decode(&bytes);
        for i in dec.chars() {
            println!("{}", i as i32)
        }
        println!("{}", dec);
    }
    

    Expected

    outputs the following text

    97
    0
    97
    0
    aa
    

    Actual

    outputs the following text (24929 = 0x6161)

    24929
    慡
    
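    The observed output is consistent with the Encoding Standard, which defines no encoder for UTF-16: encode() falls back to the output encoding, UTF-8, so the bytes b"aa" come back, and decoding those two bytes as UTF-16LE yields the single unit 0x6161. If actual UTF-16LE bytes are wanted, plain std suffices; a minimal sketch (`to_utf16le_bytes` is an ad-hoc helper, not a crate API):

```rust
// Produce real UTF-16LE bytes using only the standard library.
fn to_utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()                         // iterate UTF-16 code units
        .flat_map(|unit| unit.to_le_bytes()) // split each u16 into LE bytes
        .collect()
}

fn main() {
    // "aa" is U+0061 U+0061, i.e. 61 00 61 00 in UTF-16LE.
    assert_eq!(to_utf16le_bytes("aa"), [0x61, 0x00, 0x61, 0x00]);
}
```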
    opened by itn3000 6
  • Add Encoding::label()

    Motivation: Get a standard representation of an encoding that can be used between different encoding libraries, like encoding_rs and rust-encoding.

    Helpful for https://github.com/servo/servo/issues/13238

    opened by talklittle 6
  • Add Read and Write wrappers

    Implement the wrapper types described in https://github.com/hsivonen/encoding_rs/issues/8#issuecomment-285057121.

    This is not ready for merging, but I'm opening it for discussion. It needs at least some docs and tests, but more importantly, while this demonstrates that all four types and five impls included here are possible, I don't know if they're all useful or which, if any, belong in this repository.

    In https://github.com/hsivonen/encoding_rs/issues/8#issuecomment-285057121 I mentioned a fifth possible wrapper type, but that one can be another impl for one of the other four. (See second commit of this PR.)

    In each case a buffer is needed for temporary space. In the *Write case, a &mut [u8] of that buffer is passed to the underlying stream, which is only expected to write to it, but nothing stops an unusual impl from reading from the buffer. This means that an uninitialized buffer should not be used unless the stream is known not to read from it. (This is the case of std::io::Read for std::fs::File, for example.) On the other hand, initializing a buffer (e.g. with zeros) has a cost that some users may want to avoid. This is why the buffer is also generic, to leave it up to users to decide. (The buffers could also be taken as &mut [u8], but that would add a mandatory lifetime parameter to the wrapper types.)

    Buffers of fewer than 4 bytes (or can that number be higher for encodings other than UTF-8?) can cause infinite loops, with a stream unable to make progress. Larger sizes are probably better for performance. For example, [u8; 1024] on the stack seems nice, though I totally just pulled that number out of thin air.

    Currently WriteDecoder and WriteEncoder signal the end of the stream when dropped, but errors (such as I/O errors) from the underlying stream that occur at that time are ignored since Drop::drop cannot return a Result, and panicking in a destructor is generally avoided. (Other destructors are still run after one panic, but panicking while panicking causes the process to abort.) Perhaps an fn end(&mut self) -> Result method could be added to each of them to allow users to signal the end of the stream and handle errors. @hsivonen, is it OK to call encoder.encode_from_utf8("", buffer, /* last = */ true) or decoder.decode_to_utf8(b"", buffer, /* last = */ true) twice?
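    The buffer-then-transform shape under discussion can be sketched in pure std with a stand-in transform. `TransformReader` below is invented for illustration; a real wrapper would call Decoder::decode_to_utf8 where the uppercasing step is, and would have to handle carry-over between calls:

```rust
use std::io::{self, Read};

// Sketch of a Read adapter: pull bytes from an inner stream into a
// temporary buffer, transform them, and copy into the caller's buffer.
struct TransformReader<R: Read> {
    inner: R,
    buf: [u8; 1024], // temporary space; 1024 is the size mused about above
}

impl<R: Read> TransformReader<R> {
    fn new(inner: R) -> Self {
        TransformReader { inner, buf: [0; 1024] }
    }
}

impl<R: Read> Read for TransformReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Never read more than we can hand back in this call.
        let want = out.len().min(self.buf.len());
        let n = self.inner.read(&mut self.buf[..want])?;
        for i in 0..n {
            // Stand-in 1:1 transform; a decoder would go here instead.
            out[i] = self.buf[i].to_ascii_uppercase();
        }
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut r = TransformReader::new(&b"hello"[..]);
    let mut s = String::new();
    r.read_to_string(&mut s)?;
    assert_eq!(s, "HELLO");
    Ok(())
}
```

    Note the zero-initialized fixed array sidesteps the uninitialized-buffer question raised above, at the cost the text describes.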

    CC @BurntSushi

    opened by SimonSapin 6
  • Integration with oss-fuzz fuzzing service

    Hi @hsivonen, I would like to help integrate this project into OSS-Fuzz.

    • As an initial step for the integration I have created this PR: https://github.com/google/oss-fuzz/pull/8652; it contains the necessary logic from an OSS-Fuzz perspective to integrate encoding_rs.

    • OSS-Fuzz is a free service run by Google that performs continuous fuzzing of important open source projects.

    • As encoding_rs already has cargo-fuzz based fuzzing implemented, it is easily compatible with OSS-Fuzz out of the box.

    • If you would like to integrate, the only thing I need is a list of email(s); each must be associated with a Google account such as Gmail (why?). The provided email(s) will then get access to the data produced by OSS-Fuzz, such as bug reports, coverage reports and more stats.

    • As an alternative, if you don't have a Google/Gmail id but still wish to integrate, I can add my email id for the time being and monitor bugs/crashes.

    • Notice the email(s) affiliated with the project will be public in the OSS-Fuzz repo, as they will be part of a configuration file.

    opened by manunio 1
  • Migrate ASCII acceleration code to align_to/align_to_mut

    Currently, the ASCII acceleration code manually reinterprets slice memory as wider SIMD or ALU types. This code predates the align_to and align_to_mut methods on slices.

    This code should be rewritten to use these methods, with the middle slice being a SIMD type or a wider ALU type in the aligned case, or a fixed-length array that can be read unaligned as a SIMD type in the unaligned SIMD case.

    opened by hsivonen 0
  • Broken links in `EUC_JP` encoding doc

    The EUC_JP doc refers to euc-jp.html and euc-jp-bmp.html, which do not exist. According to https://encoding.spec.whatwg.org/#indexes I assume that the correct links are https://encoding.spec.whatwg.org/jis0212.html and https://encoding.spec.whatwg.org/jis0212-bmp.html

    opened by Mingun 1
  • set_len on a Vec<u8> of uninit is UB

    encoding_rs currently has UB in the form of creating uninitialized u8s via set_len. Here are two examples where the UB is crystal clear:

    https://github.com/hsivonen/encoding_rs/blob/dd9d99bb185f93d4fe5071291cdc54278e193955/src/mem.rs#L2007-L2010

    https://github.com/hsivonen/encoding_rs/blob/dd9d99bb185f93d4fe5071291cdc54278e193955/src/mem.rs#L2044-L2047

    set_len is also used in 7 functions in lib.rs, but I haven't looked at them very closely.

    The docs for set_len explicitly say https://doc.rust-lang.org/std/vec/struct.Vec.html#method.set_len :

    The elements at old_len..new_len must be initialized.

    Some relevant discussion can be found here https://github.com/rust-lang/unsafe-code-guidelines/issues/71

    rustc itself has a lint specifically for this kind of thing: https://github.com/rust-lang/rust/issues/75968

    Using MaybeUninit::uninit().assume_init() is instant UB unless the target type is itself composed entirely of MaybeUninit.

    My understanding is that this is currently considered UB, but this rule may be relaxed in the future to allow types where all bit patterns are valid to store uninitialized data, provided it is not read from.
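    A pure-std sketch of the defined-behavior alternative: make every byte up to len() initialized before it can be read, either with resize() (shown) or by writing through spare_capacity_mut() and only then calling set_len():

```rust
// Safe pattern: never expose uninitialized bytes through len().
// resize() zero-initializes the new elements, so every index up to
// len() is defined to read.
fn make_output_buffer(n: usize) -> Vec<u8> {
    let mut v = Vec::with_capacity(n);
    v.resize(n, 0); // initialize before the length becomes observable
    v
}

fn main() {
    let buf = make_output_buffer(8);
    assert_eq!(buf.len(), 8);
    assert!(buf.iter().all(|&b| b == 0));
}
```

    The zeroing has a cost the set_len trick was presumably avoiding, which is the trade-off under discussion.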

    opened by nico-abram 7
  • Fix clippy lints

    This PR fixes clippy lints that showed up on clippy 1.56.1

    Lints that showed up multiple times were:

    • Add clippy:: prefix to lint allow()s
    • Use matches!
    • Replace range check with (a..b).contains()
    • Remove unnecessary 'static from statics
    opened by nico-abram 0
  • Allow passing `String` to `Encoding::encode`

    In the common case, when converting from UTF-8 to UTF-8 or when the string is all ASCII, this avoids an extra heap allocation for the caller if they only have a String available. Previously, they would have to call encoding.encode(&string).into_owned() to avoid lifetime errors.

    This change is backwards compatible.
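    The saving can be illustrated in pure std (`encode_owned` stands in for the proposed by-value path and is not the crate's actual signature):

```rust
use std::borrow::Cow;

// The ASCII / UTF-8 pass-through case today: the output borrows the input.
fn encode_borrowed(input: &str) -> Cow<'_, [u8]> {
    Cow::Borrowed(input.as_bytes())
}

// Stand-in for the proposed by-value API: reuses the String's heap
// buffer directly, so no copy and no new allocation.
fn encode_owned(input: String) -> Vec<u8> {
    input.into_bytes()
}

fn main() {
    let s = String::from("ascii only");

    // Caller with only a String today: borrow, then copy via into_owned().
    let copied: Vec<u8> = encode_borrowed(&s).into_owned();
    assert_eq!(copied, s.as_bytes());

    // With a by-value API, the original buffer is handed back untouched.
    let ptr_before = s.as_ptr();
    let reused = encode_owned(s);
    assert_eq!(reused.as_ptr(), ptr_before);
}
```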

    opened by jyn514 0

crfs-rs Pure Rust port of CRFsuite: a fast implementation of Conditional Random Fields (CRFs) Currently only support prediction, model training is not

messense 24 Nov 23, 2022
A binary encoder / decoder implementation in Rust.

Bincode A compact encoder / decoder pair that uses a binary zero-fluff encoding scheme. The size of the encoded object will be the same or smaller tha

Bincode 1.9k Dec 29, 2022
rust-jsonnet - The Google Jsonnet( operation data template language) for rust

rust-jsonnet ==== Crate rust-jsonnet - The Google Jsonnet( operation data template language) for rust Google jsonnet documet: (http://google.github.io

Qihoo 360 24 Dec 1, 2022
A Rust ASN.1 (DER) serializer.

rust-asn1 This is a Rust library for parsing and generating ASN.1 data (DER only). Installation Add asn1 to the [dependencies] section of your Cargo.t

Alex Gaynor 85 Dec 16, 2022
Rust library for reading/writing numbers in big-endian and little-endian.

byteorder This crate provides convenience methods for encoding and decoding numbers in either big-endian or little-endian order. Dual-licensed under M

Andrew Gallant 811 Jan 1, 2023