A Gecko-oriented implementation of the Encoding Standard in Rust

Overview

encoding_rs

Build Status crates.io docs.rs Apache 2 / MIT dual-licensed

encoding_rs is an implementation of the (non-JavaScript parts of the) Encoding Standard written in Rust and used in Gecko (starting with Firefox 56).

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t).
  • Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
  • Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
  • Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
  • Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
  • Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
  • Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
  • Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the standard library; hopefully will get upstreamed some day) and ASCII.

Additionally, encoding_rs::mem does the following:

  • Checks if a byte buffer contains only ASCII.
  • Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
  • Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer contains only Latin1 code points (below U+0100).
  • Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior (suitable for checking if the Unicode Bidirectional Algorithm can be optimized out).
  • Combined versions of the above two checks.
  • Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
  • Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
  • Converts UTF-8 and UTF-16 to Latin1 (if in range).
  • Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
  • Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
  • Copies ASCII from one buffer to another up to the first non-ASCII byte.
  • Converts ASCII to UTF-16 up to the first non-ASCII byte.
  • Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.

Integration with std::io

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.

no_std Environment

The crate works in a no_std environment assuming that alloc is present. The alloc-using parts are on the outer edge of the crate, so if there is interest in using the crate in environments without alloc, it would be feasible to add a way to turn off the parts of the API that use Vec/String/Cow.

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the codepage crate.

DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by the Web Platform, but the oem_cp crate does.

Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into a legacy encoding minimizes unmappable characters. Text can be normalized to Unicode Normalization Form C using the unic-normal crate.

The exception is windows-1258, which after normalizing to Unicode Normalization Form C requires tone marks to be decomposed in order to minimize unmappable characters. Vietnamese tone marks can be decomposed using the detone crate.

Licensing

Please see the file named COPYRIGHT.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and internals of the crate.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

The bindings for the mem module are in the encoding_c_mem crate.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

There's a write-up about the C++ wrappers.

Sample programs

Optional features

There are currently these optional cargo features:
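For illustration, these features are enabled the usual Cargo way; a sketch of a Cargo.toml fragment (the version requirement is only an example):

```toml
[dependencies]
encoding_rs = { version = "0.8", features = ["fast-legacy-encode"] }
```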

simd-accel

Enables SIMD acceleration using the nightly-dependent packed_simd_2 crate.

This is an opt-in feature, because enabling this feature opts out of Rust's guarantees of future compilers compiling old code (aka. "stability story").

Currently, this has not been tested to be an improvement except for these targets:

  • x86_64
  • i686
  • aarch64
  • thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.

Note! If you are compiling for a target that does not have 128-bit SIMD enabled as part of the target definition, and you are enabling 128-bit SIMD using -C target_feature, you need to enable the core_arch Cargo feature for packed_simd_2 so that it compiles a crates.io snapshot of core_arch instead of using the standard-library copy of core::arch, because the core::arch module of the pre-compiled standard library has been compiled with the assumption that the CPU doesn't have 128-bit SIMD. At present, this applies mainly to 32-bit ARM targets whose first component does not include the substring neon.

The encoding_rs side of things has not been properly set up for POWER, PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow the advice from the previous paragraph, you probably shouldn't use the simd-accel option on the less mainstream architectures at this time.

Used by Firefox.

serde

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

fast-legacy-encode

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

  • fast-hangul-encode
  • fast-hanja-encode
  • fast-kanji-encode
  • fast-gb-hanzi-encode
  • fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

fast-hangul-encode

Changes encoding of precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index, making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-hanja-encode

Changes encoding of Hanja into EUC-KR from linear search over the decode-optimized table to lookup by index. Since Hanja is practically absent in modern Korean text, this option doesn't affect performance in the common case and mainly makes sense if you want to make your application resilient against denial of service by someone intentionally feeding it a lot of Hanja to encode into EUC-KR.

Adds 40 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-kanji-encode

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear search over the decode-optimized tables to lookup by index, making Japanese plain-text encode to legacy encodings 30 to 50 times as fast as without this option (about 2 times as fast as with less-slow-kanji-encode).

Takes precedence over less-slow-kanji-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-kanji-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-kanji-encode

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) making Japanese plain-text encode to legacy encodings 14 to 23 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-gb-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and gb18030 from linear search over a part of the decode-optimized tables followed by a binary search over another part of the decode-optimized tables to lookup by index, making Simplified Chinese plain-text encode to the legacy encodings 100 to 110 times as fast as without this option (about 2.5 times as fast as with less-slow-gb-hanzi-encode).

Takes precedence over less-slow-gb-hanzi-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-gb-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-gb-hanzi-encode

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) making Simplified Chinese plain-text encode to the legacy encodings about 40 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-big5-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from linear search over a part of the decode-optimized tables to lookup by index, making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast as without this option (about 3 times as fast as with less-slow-big5-hanzi-encode).

Takes precedence over less-slow-big5-hanzi-encode.

Adds 40 KB to the binary size (20 KB compared to less-slow-big5-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-big5-hanzi-encode

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) making Traditional Chinese plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have encode-specific data tables beyond 32 bits of encode-specific data for each single-byte encoding. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, by default, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is available separately.

Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly.

At this time, there is no firm commitment to support a version older than what's required by Firefox, and there is no commitment to treat MSRV changes as semver-breaking, because this crate depends on cfg-if, which doesn't appear to treat MSRV changes as semver-breaking, so it would be useless for this crate to treat MSRV changes as semver-breaking.

As of 2021-02-04, MSRV appears to be Rust 1.36.0 for using the crate and 1.42.0 for doc tests to pass without errors about the global allocator.

Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.

Regenerating Generated Code

To regenerate the generated code:

Roadmap

  • Design the low-level API.
  • Provide Rust-only convenience features.
  • Provide an stl/gsl-flavored C++ API.
  • Implement all decoders and encoders.
  • Add unit tests for all decoders and encoders.
  • Finish BOM sniffing variants in Rust-only convenience features.
  • Document the API.
  • Publish the crate on crates.io.
  • Create a solution for measuring performance.
  • Accelerate ASCII conversions using SSE2 on x86.
  • Accelerate ASCII conversions using ALU register-sized operations on non-x86 architectures (process a usize instead of a u8 at a time).
  • Split FFI into a separate crate so that the FFI doesn't interfere with LTO in pure-Rust usage.
  • Compress CJK indices by making use of sequential code points as well as Unicode-ordered parts of indices.
  • Make lookups by label or name use binary search that searches from the end of the label/name to the start.
  • Make labels with non-ASCII bytes fail fast.
  • Parallelize UTF-8 validation using Rayon. (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
  • Provide an XPCOM/MFBT-flavored C++ API.
  • Investigate accelerating single-byte encode with a single fast-tracked range per encoding.
  • Replace uconv with encoding_rs in Gecko.
  • Implement the rust-encoding API in terms of encoding_rs.
  • Add SIMD acceleration for Aarch64.
  • Investigate the use of NEON on 32-bit ARM.
  • Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as adapted to Rust in rust-encoding.
  • Add actually fast CJK encode options.
  • Investigate Bob Steagall's lookup table acceleration for UTF-8.
  • Provide a build mode that works without alloc (with lesser API surface).
  • Migrate to std::simd once it is stable and declare 1.0.

Release Notes

0.8.28

  • Fix error in Serde support introduced as part of no_std support.

0.8.27

  • Make the crate work in a no_std environment (with alloc).

0.8.26

  • Fix oversights in edition 2018 migration that broke the simd-accel feature.

0.8.25

  • Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
  • Update the packed_simd dependency to packed_simd_2.
  • Update the cfg-if dependency to 1.0.
  • Address warnings that have been introduced by newer Rust versions along the way.
  • Update to edition 2018, since even prior to 1.0 cfg-if updated to edition 2018 without a semver break.

0.8.24

  • Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.

0.8.23

  • Remove year from copyright notices. (No features or bug fixes.)

0.8.22

  • Formatting fix and new unit test. (No features or bug fixes.)

0.8.21

  • Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.

0.8.20

  • Make Decoder::latin1_byte_compatible_up_to return None in more cases to make the method actually useful. While this could be argued to be a breaking change due to the bug fix changing semantics, it does not break callers that had to handle the None case in a reasonable way anyway.

0.8.19

  • Removed a bunch of bound checks in convert_str_to_utf16.
  • Added mem::convert_utf8_to_utf16_without_replacement.

0.8.18

  • Added mem::utf8_latin1_up_to and mem::str_latin1_up_to.
  • Added Decoder::latin1_byte_compatible_up_to.

0.8.17

  • Update bincode (dev dependency) version requirement to 1.0.

0.8.16

  • Switch from the simd crate to packed_simd.

0.8.15

  • Adjust documentation for simd-accel (README-only release).

0.8.14

  • Made UTF-16 to UTF-8 encode conversion fill the output buffer as closely as possible.

0.8.13

  • Made the UTF-8 to UTF-16 decoder compare the number of code units written with the length of the right slice (the output slice) to fix a panic introduced in 0.8.11.

0.8.12

  • Removed the clippy:: prefix from clippy lint names.

0.8.11

  • Changed minimum Rust requirement to 1.29.0 (for the ability to refer to the interior of a static when defining another static).
  • Explicitly aligned the lookup tables for single-byte encodings and UTF-8 to cache lines in the hope of freeing up one cache line for other data. (Perhaps the tables were already aligned and this is placebo.)
  • Added 32 bits of encode-oriented data for each single-byte encoding. The change was performance-neutral for non-Latin1-ish Latin legacy encodings, improved Latin1-ish and Arabic legacy encode speed somewhat (new speed is 2.4x the old speed for German, 2.3x for Arabic, 1.7x for Portuguese and 1.4x for French) and improved non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
  • Added compile-time options for fast CJK legacy encode options (at the cost of binary size (up to 176 KB) and run-time memory usage). These options still retain the overall code structure instead of rewriting the CJK encoders totally, so the speed isn't as good as what could be achieved by using even more memory / making the binary even larger.
  • Made UTF-8 decode and validation faster.
  • Added method is_single_byte() on Encoding.
  • Added mem::decode_latin1() and mem::encode_latin1_lossy().

0.8.10

  • Disabled a unit test that tests a panic condition when the assertion being tested is disabled.

0.8.9

  • Made --features simd-accel work with stable-channel compiler to simplify the Firefox build system.

0.8.8

  • Made the is_foo_bidi() not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE aka. BYTE ORDER MARK) as right-to-left.
  • Made the is_foo_bidi() functions report true if the input contains Hebrew presentation forms (which are right-to-left but not in a right-to-left-roadmapped block).

0.8.7

  • Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.

0.8.6

  • Temporarily removed the debug assertion added in version 0.8.5 from convert_utf16_to_latin1_lossy.

0.8.5

  • If debug assertions are enabled but fuzzing isn't enabled, lossy conversions to Latin1 in the mem module assert that the input is in the range U+0000...U+00FF (inclusive).
  • In the mem module provide conversions from Latin1 and UTF-16 to UTF-8 that can deal with insufficient output space. The idea is to use them first with an allocation rounded up to jemalloc bucket size and do the worst-case allocation only if the jemalloc rounding up was insufficient as the first guess.

0.8.4

  • Fix SSE2-specific, simd-accel-specific memory corruption introduced in version 0.8.1 in conversions between UTF-16 and Latin1 in the mem module.

0.8.3

  • Removed an #[inline(never)] annotation that was not meant for release.

0.8.2

  • Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound checks and manually adding branch prediction annotations.

0.8.1

  • Tweaked loop unrolling and memory alignment for SSE2 conversions between UTF-16 and Latin1 in the mem module to increase the performance when converting long buffers.

0.8.0

  • Changed the minimum supported version of Rust to 1.21.0 (semver breaking change).
  • Flipped around the defaults vs. optional features for controlling the size vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking change).
  • Added NEON support on ARMv7.
  • SIMD-accelerated x-user-defined to UTF-16 decode.
  • Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD acceleration).

0.7.2

  • Add the mem module.
  • Refactor SIMD code which can affect performance outside the mem module.

0.7.1

  • When encoding from invalid UTF-16, correctly handle U+DC00 followed by another low surrogate.

0.7.0

  • Make replacement a label of the replacement encoding. (Spec change.)
  • Remove Encoding::for_name(). (Encoding::for_label(foo).unwrap() is now close enough after the above label change.)
  • Remove the parallel-utf8 cargo feature.
  • Add optional Serde support for &'static Encoding.
  • Performance tweaks for ASCII handling.
  • Performance tweaks for UTF-8 validation.
  • SIMD support on aarch64.

0.6.11

  • Make Encoder::has_pending_state() public.
  • Update the simd crate dependency to 0.2.0.

0.6.10

  • Reserve enough space for NCRs when encoding to ISO-2022-JP.
  • Correct max length calculations for multibyte decoders.
  • Correct max length calculations before BOM sniffing has been performed.
  • Correctly calculate max length when encoding from UTF-16 to GBK.

0.6.9

0.6.8

  • Correctly handle the case where the first buffer contains a potentially partial BOM and the next buffer is the last buffer.
  • Decode byte 7F correctly in ISO-2022-JP.
  • Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
  • Implement Hash for Encoding.

0.6.7

0.6.6

  • Correct max length calculation when a partial BOM prefix is part of the decoder's state.

0.6.5

  • Correct max length calculation in various encoders.
  • Correct max length calculation in the UTF-16 decoder.
  • Derive PartialEq and Eq for the CoderResult, DecoderResult and EncoderResult types.

0.6.4

  • Avoid panic when encoding with replacement and the destination buffer is too short to hold one numeric character reference.

0.6.3

  • Add support for 32-bit big-endian hosts. (For real this time.)

0.6.2

  • Fix a panic from subslicing with bad indices in Encoder::encode_from_utf16. (Due to an oversight, it lacked the fix that Encoder::encode_from_utf8 already had.)
  • Micro-optimize error status accumulation in non-streaming case.

0.6.1

  • Avoid panic near integer overflow in a case that's unlikely to actually happen.
  • Address Clippy lints.

0.6.0

  • Make the methods for computing worst-case buffer size requirements check for integer overflow.
  • Upgrade rayon to 0.7.0.

0.5.1

  • Reorder methods for better documentation readability.
  • Add support for big-endian hosts. (Only 64-bit case actually tested.)
  • Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.

0.5.0

  • Avoid allocating excessively long buffers in non-streaming decode.
  • Fix the behavior of ISO-2022-JP and replacement decoders near the end of the output buffer.
  • Annotate the result structs with #[must_use].

0.4.0

  • Split FFI into a separate crate.
  • Performance tweaks.
  • CJK binary size and encoding performance changes.
  • Parallelize UTF-8 validation in the case of long buffers (with optional feature parallel-utf8).
  • Borrow even with ISO-2022-JP when possible.

0.3.2

  • Fix moving pointers to alignment in ALU-based ASCII acceleration.
  • Fix errors in documentation and improve documentation.

0.3.1

  • Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
  • Make UTF-8 to UTF-8 decode SSE2-accelerated when feature simd-accel is used.
  • When decoding and encoding ASCII-only input from or to an ASCII-compatible encoding using the non-streaming API, return a borrow of the input.
  • Make encode from UTF-16 to UTF-8 faster.

0.3

  • Change the references to the instances of Encoding from const to static to make the referents unique across crates that use the references.
  • Introduce non-reference-typed FOO_INIT instances of Encoding to allow foreign crates to initialize static arrays with references to Encoding instances even under Rust's constraints that prohibit the initialization of &'static Encoding-typed array items with &'static Encoding-typed statics.
  • Document that the above two points will be reverted if Rust changes const to work so that cross-crate usage keeps the referents unique.
  • Return Cows from Rust-only non-streaming methods for encode and decode.
  • Encoding::for_bom() returns the length of the BOM.
  • ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE, ISO-2022-JP and x-user-defined.
  • Add SSE2 acceleration behind the simd-accel feature flag. (Requires nightly Rust.)
  • Fix panic with long bogus labels.
  • Map 0xCA to U+05BA in windows-1255. (Spec change.)
  • Correct the end of the Shift_JIS EUDC range. (Spec change.)

0.2.4

  • Polish FFI documentation.

0.2.3

  • Fix UTF-16 to UTF-8 encode.

0.2.2

  • Add Encoder.encode_from_utf8_to_vec_without_replacement().

0.2.1

  • Add Encoding.is_ascii_compatible().

  • Add Encoding::for_bom().

  • Make == for Encoding use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constant cannot be turned into statics without breaking other things.

0.2.0

The initial release.

Comments
  • implementation for io::Read/io::Write

    What are your thoughts on providing implementations of the io::Read/io::Write traits as a convenience for handling stream encoding/decoding?

    Here is the specific problem I'd like to solve. Simplifying, I have a function that looks like the following:

    fn search<R: io::Read>(rdr: R) -> io::Result<SearchResults> { ... }
    

    Internally, the search function limits itself to the methods of io::Read to execute a search on its contents. The search is exhaustive, but is guaranteed to use a constant amount of heap space. The search routine expects the buffer to be UTF-8 encoded (and will handle invalid UTF-8 gracefully). I'd like to use this same search routine even if the contents of rdr are, say, UTF-16. I claim that this is possible if I wrap rdr in something that satisfies io::Read but uses an encoding_rs::Decoder internally to convert UTF-16 to UTF-8. I would expect the callers of search to do that wrapping. If there's invalid UTF-16, then inserting replacement characters is OK.

    Does this sound like something you'd be willing to maintain? I would be happy to take an initial crack at an implementation if so. (In fact, I must do this. The point of this issue is asking whether I should try to upstream it or not.) However, I think there are some interesting points worth mentioning. (There may be more!)

    1. Is this type of API useful in the context of the web? If not, then maybe it shouldn't live in this crate.
    2. The io::Read interface feels not-quite-right in some respects. For example, the io::Read primarily operates on a &[u8]. But if encoding_rs is used to provide an io::Read implementation, then it necessarily guarantees that all consumers of that implementation will read valid UTF-8, which means converting the &[u8] bytes to &str safely will incur an unnecessary cost. I'm not sure what to make of this and how much one might care, but it seems worth pointing out. (This particular issue isn't a problem for me, since the search routine itself handles UTF-8 implicitly.)
    opened by BurntSushi 17
  • Non-streaming decode() appears to remove the BOM?

    https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L2974-L2980

    Functionally, decode() and decode_with_bom_removal() seem pretty much the same? That doesn't seem correct? If there's a variant called "decode_with_bom_removal" then I would expect the standard variant not to remove the BOM.

    Compare to:

    https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L3019-L3030

    It's totally valid to decode the BOM, the BOM is a unicode character like any other character. Decoding a UTF-16 document with a BOM should yield a UTF-8 document with a BOM. Otherwise, you would just use the BOM-removing version...

    use encoding_rs::*;
    
    fn main() {
        // Two characters, '1' and then BOM character
        println!("{:?}", UTF_16LE.decode(&[0x31, 0x00, 0xFF, 0xFE]).0.as_bytes()); 
        // Nothing - BOM removed
        println!("{:?}", UTF_16LE.decode(&[0xFF, 0xFE]).0.as_bytes());
    }
    
    [49, 239, 187, 191]
    []
    
    opened by dralley 15
  • Re-add license field to Cargo.toml

    This was removed in https://github.com/hsivonen/encoding_rs/commit/3a4033e67b6b9d1c1e9514bcb5c20ae05bf8391d#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542 and causes automated tooling like cargo deny to fail detecting the license.

    It should probably be something like (Apache-2.0 OR MIT) AND BSD-3 but I'm not sure the expression syntax allows parenthesis. If it doesn't then we have a problem and you might want to reconsider if dual-licensing warrants the increased license complexity here. Having to worry about 3 different licenses for a single crate is a bit suboptimal, even if MIT and BSD-3 are approximately the same.

    opened by sdroege 10
  • Compilation issues under 1.43.0 nightly

    I've encountered a compilation issue when building under 1.43.0 nightly of the rust toolchain. I noticed the problem when building the dependent orjson which uses the nightly toolchain for compilation.

    I don't know much about rust, but it seems that an error occurs within a macro and the rust compiler subsequently panics.

    I was able to reproduce the issue with the following commands, the features are the ones used by orjson. I'm filing this issue here, as I don't quite understand what is happening with regards to macros, user code and compiler code.

    $ docker run --rm -it --entrypoint /bin/bash konstin2/maturin:master
    (docker) $ git clone https://github.com/hsivonen/encoding_rs.git
    (docker) $ cd encoding_rs/
    (docker) $ git checkout v0.8.22
    (docker) $ echo nightly > rust-toolchain
    (docker) $ cargo --version
    cargo 1.43.0-nightly (e02974078 2020-02-18)
    (docker) $ rustc --version
    rustc 1.43.0-nightly (7760cd0fb 2020-02-19)
    (docker) $ RUST_BACKTRACE=full cargo build --features simd-accel --no-default-features
    ...
    

    --verbose does not give much more information. -Z macro-backtrace does not seem to be a valid flag.

    cargo build output
    info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'
    info: latest update on 2020-02-20, rust version 1.43.0-nightly (7760cd0fb 2020-02-19)
    info: downloading component 'cargo'
    info: downloading component 'clippy'
    info: downloading component 'rust-docs'
    info: downloading component 'rust-std'
    info: downloading component 'rustc'
    info: downloading component 'rustfmt'
    info: installing component 'cargo'
    info: installing component 'clippy'
    info: installing component 'rust-docs'
    info: installing component 'rust-std'
    info: installing component 'rustc'
    info: installing component 'rustfmt'
        Updating crates.io index
      Downloaded packed_simd v0.3.3
       Compiling packed_simd v0.3.3
       Compiling encoding_rs v0.8.22 (/io/encoding_rs)
       Compiling cfg-if v0.1.10
    warning: unused label
       --> src/macros.rs:878:41
        |
    878 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/euc_jp.rs:77:5
        |
    77  | /     euc_jp_decoder_functions!(
    78  | |         {
    79  | |             let trail_minus_offset = byte.wrapping_sub(0xA1);
    80  | |             // Fast-track Hiragana (60% according to Lunde)
    ...   |
    220 | |         handle
    221 | |     );
        | |______- in this macro invocation
        |
        = note: `#[warn(unused_labels)]` on by default
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/macros.rs:878:41
        |
    878 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/euc_jp.rs:77:5
        |
    77  | /     euc_jp_decoder_functions!(
    78  | |         {
    79  | |             let trail_minus_offset = byte.wrapping_sub(0xA1);
    80  | |             // Fast-track Hiragana (60% according to Lunde)
    ...   |
    220 | |         handle
    221 | |     );
        | |______- in this macro invocation
        |
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/macros.rs:574:41
        |
    574 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/gb18030.rs:111:5
        |
    111 | /     gb18030_decoder_functions!(
    112 | |         {
    113 | |             // If first is between 0x81 and 0xFE, inclusive,
    114 | |             // subtract offset 0x81.
    ...   |
    294 | |         handle,
    295 | |         'outermost);
        | |____________________- in this macro invocation
        |
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/macros.rs:574:41
        |
    574 |   ...                   'innermost: loop {
        |                         ^^^^^^^^^^
        | 
       ::: src/gb18030.rs:111:5
        |
    111 | /     gb18030_decoder_functions!(
    112 | |         {
    113 | |             // If first is between 0x81 and 0xFE, inclusive,
    114 | |             // subtract offset 0x81.
    ...   |
    294 | |         handle,
    295 | |         'outermost);
        | |____________________- in this macro invocation
        |
        = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    warning: unused label
       --> src/mem.rs:279:17
        |
    279 |                 'inner: loop {
        |                 ^^^^^^
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:743:26
        |
    743 |                         0...0x7F => {
        |                          ^^^ help: use `..=` for an inclusive range
        |
        = note: `#[warn(ellipsis_inclusive_range_patterns)]` on by default
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:749:29
        |
    749 |                         0xC2...0xD5 => {
        |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:770:36
        |
    770 |                         0xE1 | 0xE3...0xEC | 0xEE => {
        |                                    ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:879:29
        |
    879 |                         0xF1...0xF4 => {
        |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:942:18
        |
    942 |                 0...0x7F => {
        |                  ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:948:21
        |
    948 |                 0xC2...0xD5 => {
        |                     ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
       --> src/mem.rs:985:28
        |
    985 |                 0xE1 | 0xE3...0xEC | 0xEE => {
        |                            ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2686:29
         |
    2686 |                         b'A'...b'Z' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2691:29
         |
    2691 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2691:43
         |
    2691 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                                           ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2714:29
         |
    2714 |                         b'A'...b'Z' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2723:29
         |
    2723 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                             ^^^ help: use `..=` for an inclusive range
    
    warning: `...` range patterns are deprecated
        --> src/lib.rs:2723:43
         |
    2723 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
         |                                           ^^^ help: use `..=` for an inclusive range
    
    warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
      --> src/simd_funcs.rs:19:20
       |
    19 |     let mut simd = ::std::mem::uninitialized();
       |                    ^^^^^^^^^^^^^^^^^^^^^^^^^
       |
       = note: `#[warn(deprecated)]` on by default
    
    warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
      --> src/simd_funcs.rs:43:20
       |
    43 |     let mut simd = ::std::mem::uninitialized();
       |                    ^^^^^^^^^^^^^^^^^^^^^^^^^
    
    warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
       --> src/handles.rs:113:30
        |
    113 |             let mut u: u16 = ::std::mem::uninitialized();
        |                              ^^^^^^^^^^^^^^^^^^^^^^^^^
    
    warning: unnecessary `unsafe` block
      --> src/utf_8.rs:91:12
       |
    91 |         if unsafe { likely(read + 4 <= src.len()) } {
       |            ^^^^^^ unnecessary `unsafe` block
       |
       = note: `#[warn(unused_unsafe)]` on by default
    
    warning: unnecessary `unsafe` block
      --> src/utf_8.rs:98:20
       |
    98 |                 if unsafe { likely(in_inclusive_range8(byte, 0xC2, 0xDF)) } {
       |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:107:24
        |
    107 |                     if unsafe { likely(read + 4 <= src.len()) } {
        |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:117:20
        |
    117 |                 if unsafe { likely(byte < 0xF0) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:132:28
        |
    132 |                         if unsafe { likely(read + 4 <= src.len()) } {
        |                            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:137:32
        |
    137 | ...                   if unsafe { likely(byte < 0x80) } {
        |                          ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:162:20
        |
    162 |                 if unsafe { likely(read + 4 <= src.len()) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:261:12
        |
    261 |         if unsafe { likely(read + 4 <= src.len()) } {
        |            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:271:20
        |
    271 |                 if unsafe { likely(in_inclusive_range8(byte, 0xC2, 0xDF)) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:288:24
        |
    288 |                     if unsafe { likely(read + 4 <= src.len()) } {
        |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:300:20
        |
    300 |                 if unsafe { likely(byte < 0xF0) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:323:28
        |
    323 |                         if unsafe { likely(read + 4 <= src.len()) } {
        |                            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:328:32
        |
    328 | ...                   if unsafe { likely(byte < 0x80) } {
        |                          ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:370:20
        |
    370 |                 if unsafe { likely(read + 4 <= src.len()) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:657:20
        |
    657 |                 if unsafe { likely(unit_minus_surrogate_start > (0xDFFF - 0xD800)) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:668:20
        |
    668 |                 if unsafe { likely(unit_minus_surrogate_start <= (0xDBFF - 0xD800)) } {
        |                    ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:687:24
        |
    687 |                     if unsafe { likely(second_minus_low_surrogate_start <= (0xDFFF - 0xDC00)) } {
        |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/utf_8.rs:729:16
        |
    729 |             if unsafe { unlikely(unit < 0x80) } {
        |                ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
       --> src/mem.rs:913:32
        |
    913 | ...                   if unsafe { unlikely(second == 0x90 || second == 0x9E) } {
        |                          ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1171:28
         |
    1171 |                         if unsafe { unlikely(byte >= 0xD6) } {
         |                            ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1195:24
         |
    1195 |                     if unsafe { unlikely(!in_inclusive_range8(byte, 0xE3, 0xEE) && byte != 0xE1) } {
         |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1244:24
         |
    1244 |                     if unsafe { unlikely(byte == 0xF0 && (second == 0x90 || second == 0x9E)) } {
         |                        ^^^^^^ unnecessary `unsafe` block
    
    warning: unnecessary `unsafe` block
        --> src/mem.rs:1658:8
         |
    1658 |     if unsafe { likely(read == src.len()) } {
         |        ^^^^^^ unnecessary `unsafe` block
    
    error: internal compiler error: src/librustc_codegen_ssa/mir/block.rs:622: shuffle indices must be constant
       --> src/simd_funcs.rs:289:28
        |
    289 |           let first: u8x16 = shuffle!(
        |  ____________________________^
    290 | |             s,
    291 | |             u8x16::splat(0),
    292 | |             [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23]
    293 | |         );
        | |_________^
        |
        = note: this error: internal compiler error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    thread 'rustc' panicked at 'Box<Any>', <::std::macros::panic macros>:2:4
    stack backtrace:
       0:     0x7fce48a8a634 - backtrace::backtrace::libunwind::trace::h0743ecf0c905ca1e
                                   at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/libunwind.rs:86
       1:     0x7fce48a8a634 - backtrace::backtrace::trace_unsynchronized::h0e046f0811b0ae4d
                                   at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/mod.rs:66
       2:     0x7fce48a8a634 - std::sys_common::backtrace::_print_fmt::h5fcd1fd3d0e5d79e
                                   at src/libstd/sys_common/backtrace.rs:78
       3:     0x7fce48a8a634 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h85ffb53d56efd098
                                   at src/libstd/sys_common/backtrace.rs:59
       4:     0x7fce48ac37dc - core::fmt::write::h231e5515e704e96b
                                   at src/libcore/fmt/mod.rs:1052
       5:     0x7fce48a7bf97 - std::io::Write::write_fmt::h56f503f924d6c255
                                   at src/libstd/io/mod.rs:1428
       6:     0x7fce48a8f425 - std::sys_common::backtrace::_print::hf64c641be26866a9
                                   at src/libstd/sys_common/backtrace.rs:62
       7:     0x7fce48a8f425 - std::sys_common::backtrace::print::h16b5d561563c7498
                                   at src/libstd/sys_common/backtrace.rs:49
       8:     0x7fce48a8f425 - std::panicking::default_hook::{{closure}}::h8363003bce1deb1a
                                   at src/libstd/panicking.rs:204
       9:     0x7fce48a8f166 - std::panicking::default_hook::hb365b24076d7b200
                                   at src/libstd/panicking.rs:224
      10:     0x7fce490f9c39 - rustc_driver::report_ice::h2624db039b9cfba9
      11:     0x7fce48a8fb55 - std::panicking::rust_panic_with_hook::h2adc1d4c38cb25af
                                   at src/libstd/panicking.rs:474
      12:     0x7fce494cf363 - std::panicking::begin_panic::h6fca9fdb6d23f676
      13:     0x7fce493e488c - rustc_errors::HandlerInner::span_bug::h6840991938d37012
      14:     0x7fce493e4c40 - rustc_errors::Handler::span_bug::h107187c882152f33
      15:     0x7fce49478c69 - rustc::util::bug::opt_span_bug_fmt::{{closure}}::hf73fd7e05df26a89
      16:     0x7fce4947715b - rustc::ty::context::tls::with_opt::{{closure}}::h0c4fdf5a849e88e3
      17:     0x7fce49477106 - rustc::ty::context::tls::with_opt::h92cfac8e0dd8f2c9
      18:     0x7fce49478b58 - rustc::util::bug::opt_span_bug_fmt::haf8b4183f62d8df3
      19:     0x7fce49478b0a - rustc::util::bug::span_bug_fmt::h0be341af60d13d91
      20:     0x7fce49573f1a - <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold::h333c620e944c2a61
      21:     0x7fce495532cc - rustc_codegen_ssa::mir::block::<impl rustc_codegen_ssa::mir::FunctionCx<Bx>>::codegen_call_terminator::h61b66235d798dc9e
      22:     0x7fce4954e212 - rustc_codegen_ssa::mir::block::<impl rustc_codegen_ssa::mir::FunctionCx<Bx>>::codegen_block::h977ed6f45937d617
      23:     0x7fce4956055e - rustc_codegen_ssa::base::codegen_instance::h1faa821de1d9e487
      24:     0x7fce4947f6b5 - <rustc::mir::mono::MonoItem as rustc_codegen_ssa::mono_item::MonoItemExt>::define::h0b6bdfededc22107
      25:     0x7fce4940668a - rustc_codegen_llvm::base::compile_codegen_unit::module_codegen::h469c76d782c84352
      26:     0x7fce494b3227 - rustc::dep_graph::graph::DepGraph::with_task::h29956dbbd3cd6e7c
      27:     0x7fce49406254 - rustc_codegen_llvm::base::compile_codegen_unit::hc09ab7897a17060a
      28:     0x7fce4955d55a - rustc_codegen_ssa::base::codegen_crate::h80e90e6d82f0580d
      29:     0x7fce494f1715 - <rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_utils::codegen_backend::CodegenBackend>::codegen_crate::hbcef469c00126974
      30:     0x7fce492e0710 - rustc_session::utils::<impl rustc_session::session::Session>::time::h101a151e306dd79b
      31:     0x7fce4938b2ef - rustc_interface::passes::QueryContext::enter::hc499d446e1b9ab96
      32:     0x7fce492bbf4b - rustc_interface::queries::Queries::ongoing_codegen::h201d0ed995ada5da
      33:     0x7fce491632be - rustc_interface::interface::run_compiler_in_existing_thread_pool::hdde65f8eb6e34231
      34:     0x7fce4911d29d - scoped_tls::ScopedKey<T>::set::h774e12e87074d2a2
      35:     0x7fce49104d82 - syntax::attr::with_globals::hd6f4e6fb8aaadb66
      36:     0x7fce4911e963 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2e517a7b74830ac8
      37:     0x7fce48aa1447 - __rust_maybe_catch_panic
                                   at src/libpanic_unwind/lib.rs:86
      38:     0x7fce49164ef6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h3814fa1c62419cc0
      39:     0x7fce48a6c31f - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h8e917a822ffc0592
                                   at /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a/src/liballoc/boxed.rs:1017
      40:     0x7fce48a9fd50 - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h8aa486ee72f31ff1
                                   at /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a/src/liballoc/boxed.rs:1017
      41:     0x7fce48a9fd50 - std::sys_common::thread::start_thread::h8407e13fad90fc7e
                                   at src/libstd/sys_common/thread.rs:13
      42:     0x7fce48a9fd50 - std::sys::unix::thread::Thread::new::thread_start::h55e6429cb8ed2e9f
                                   at src/libstd/sys/unix/thread.rs:80
      43:     0x7fce4880883d - start_thread
      44:     0x7fce48170fdd - clone
    
    note: the compiler unexpectedly panicked. this is a bug.
    
    note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports
    
    note: rustc 1.43.0-nightly (7760cd0fb 2020-02-19) running on x86_64-unknown-linux-gnu
    
    note: compiler flags: -C debuginfo=2 -C incremental --crate-type lib
    
    note: some of the compiler flags provided by cargo are hidden
    
    query stack during panic:
    end of query stack
    error: aborting due to previous error
    
    error: could not compile `encoding_rs`.
    
    To learn more, run the command again with --verbose.
    
    
    opened by pskopnik 10
  • minimum Rust version?

    Is there any policy for this crate with respect to the minimum Rust version supported? In particular, it looks like CI always runs against whatever the current stable/beta/nightly releases are. So if a change gets merged that requires a newer version of Rust, you might not even realize that it happens.

    (N.B. As an ecosystem, the "right" policy here is terribly unclear. I personally have been operating under a conservative policy whereby bumping the minimum Rust version requires a semver bump, but I fear this won't always be tenable.)

    opened by BurntSushi 9
  • Potential Unsound: 1 out-of-bound read and 5 unaligned memory access.

    Hello.

    I'm Yoshiki, a PhD student at CMU.

    We are testing a tool to automatically generate test cases from API data and existing tests.

    A few of our generated test cases were reported as "unsound" by Miri, mostly due to unaligned or out-of-bounds memory accesses. I've attached a tarball that contains the test cases that induce this behavior.

    Please note that, because the framework leverages existing tests as templates, some of the test cases overlap with existing test cases for the library. In particular,

    decode(BIG5, b"", &"");//LAYER:0
    

    also shows up in the manually written test cases.

    In case this is intended behavior, or you would prefer if I focused on other parts of the code, please let me know.

    Thanks. ~Yoshiki

    opened by YoshikiTakashima 8
  • Enhancement: get read access to the decoder's inner state

    This is also about Stringsext, a GNU Strings Alternative with Multi-Byte-Encoding Support which I migrated from rust-encoding to encoding_rs.

    In order to keep anchors between the input and the output stream, I would need to know, once the decoder has finished, whether it still has some bytes stored in its inner state. Knowing how many bytes are held back would be best, but even knowing that there are any would already help.

    Is there a way to access this information?
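    The crate does not expose this today, but when the input encoding is UTF-8 the bookkeeping can be approximated externally. Below is a pure-std sketch (`incomplete_utf8_tail` is an invented helper, not an encoding_rs API) that counts how many trailing bytes of a chunk a streaming decoder would have to hold back:

```rust
// Pure-std sketch (NOT an encoding_rs API): count how many bytes at the
// end of `buf` form an incomplete UTF-8 sequence that a streaming decoder
// would have to keep in its internal state.
fn incomplete_utf8_tail(buf: &[u8]) -> usize {
    let len = buf.len();
    // A UTF-8 sequence is at most 4 bytes, so at most the last 3 bytes
    // can be a held-back prefix.
    for i in (len.saturating_sub(3)..len).rev() {
        let b = buf[i];
        if b < 0x80 {
            return 0; // ASCII byte: nothing pending
        }
        if b >= 0xC0 {
            // Lead byte: how many bytes does this sequence need in total?
            let need = if b >= 0xF0 { 4 } else if b >= 0xE0 { 3 } else { 2 };
            return if len - i < need { len - i } else { 0 };
        }
        // Continuation byte: keep walking backwards to find the lead byte.
    }
    0
}

fn main() {
    // "€" (E2 82 AC) with its last byte missing: 2 bytes are held back.
    assert_eq!(incomplete_utf8_tail(b"abc\xE2\x82"), 2);
    // Complete input: nothing pending.
    assert_eq!(incomplete_utf8_tail("€".as_bytes()), 0);
}
```

    For other input encodings the same idea applies, but the lead/continuation logic would have to follow that encoding's byte structure instead.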

    opened by getreu 8
  • Allocating three times the size of the input seems excessive.

    The Encoding::decode_* methods need in some cases to allocate a String and decide how much capacity to give it. Other than *_without_replacement (https://github.com/hsivonen/encoding_rs/commit/2984a8b0a310b52fe7112671c5fb94446a7f78f8#commitcomment-20990260), this is based on Encoding::max_utf8_buffer_length, which assumes the worst case. For many encodings, that's when every byte of the input is an error that emits a three-byte U+FFFD code point.

    In short, as soon as there's an error, these methods allocate three times the size of the (remaining) input. Assuming the worst case simplifies the code, which only needs to allocate once, but it seems excessive that a single bit flip near the beginning of the input could triple memory usage.

    So a more adaptive allocation scheme might be desirable, but admittedly there is no obvious answer as to what it should be.
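    For concreteness, the worst case assumed by the current scheme is that every remaining byte decodes to a three-byte U+FFFD, so the sizing arithmetic amounts to the following sketch (`worst_case_utf8_len` is a made-up name, and the real Encoding::max_utf8_buffer_length formula may add a small constant on top):

```rust
// Worst-case sizing sketch: every remaining input byte decodes to U+FFFD,
// which occupies 3 bytes in UTF-8. checked_mul mirrors overflow-safe
// capacity computation (the real API also returns Option for this reason).
fn worst_case_utf8_len(remaining_input_bytes: usize) -> Option<usize> {
    remaining_input_bytes.checked_mul(3)
}

fn main() {
    // A 1 MiB input with an error near the start could reserve ~3 MiB.
    assert_eq!(worst_case_utf8_len(1 << 20), Some(3 << 20));
    // Overflow is reported rather than wrapping.
    assert_eq!(worst_case_utf8_len(usize::MAX), None);
}
```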

    opened by SimonSapin 7
  • UTF_16LE.encode does not encode string to UTF-16 LE correctly?

    Environment

    rustc --version output:

    rustc 1.27.0-nightly (0b72d48f8 2018-04-10)
    

    and my encoding_rs version is 0.7.2.

    Steps to reproduce

    run the following program

    extern crate encoding_rs;
    
    use encoding_rs::UTF_16LE;
    
    fn main() {
        let s = "aa";
        let (bytes, enc, unmappable) = UTF_16LE.encode(s);
        let (dec, enc, unmappable) = UTF_16LE.decode(&bytes);
        for i in dec.chars() {
            println!("{}", i as i32)
        }
        println!("{}", dec);
    }
    

    Expected

    outputs the following text

    97
    0
    97
    0
    aa
    

    Actual

    outputs the following text (24929 = 0x6161)

    24929
    慡
    
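    The observed output is consistent with the Encoding Standard, which defines no encoder for UTF-16: encode() falls back to the output encoding, UTF-8, so the bytes b"aa" come back, and decoding those two bytes as UTF-16LE yields the single unit 0x6161. If actual UTF-16LE bytes are wanted, plain std suffices; a minimal sketch (`to_utf16le_bytes` is an ad-hoc helper, not a crate API):

```rust
// Produce real UTF-16LE bytes using only the standard library.
fn to_utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()                         // iterate UTF-16 code units
        .flat_map(|unit| unit.to_le_bytes()) // split each u16 into LE bytes
        .collect()
}

fn main() {
    // "aa" is U+0061 U+0061, i.e. 61 00 61 00 in UTF-16LE.
    assert_eq!(to_utf16le_bytes("aa"), [0x61, 0x00, 0x61, 0x00]);
}
```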
    opened by itn3000 6
  • Add Encoding::label()

    Motivation: Get a standard representation of an encoding that can be used between different encoding libraries, like encoding_rs and rust-encoding.

    Helpful for https://github.com/servo/servo/issues/13238

    opened by talklittle 6
  • Add Read and Write wrappers

    Implement the wrapper types described in https://github.com/hsivonen/encoding_rs/issues/8#issuecomment-285057121.

    This is not ready for merging, but I'm opening it for discussion. It needs at least some docs and tests, but more importantly, while this demonstrates that all four types and five impls included here are possible, I don't know if they're all useful or which, if any, belong in this repository.

    In https://github.com/hsivonen/encoding_rs/issues/8#issuecomment-285057121 I mentioned a fifth possible wrapper type, but that one can be another impl for one of the other four. (See second commit of this PR.)

    In each case a buffer is needed for temporary space. In the *Write case, a &mut [u8] of that buffer is passed to the underlying stream, which is only expected to write to it, but nothing stops an unusual impl from reading from the buffer. This means that an uninitialized buffer should not be used unless the stream is known not to read from it. (This is the case of std::io::Read for std::fs::File, for example.) On the other hand, initializing a buffer (e.g. with zeros) has a cost that some users may want to avoid. This is why the buffer is also generic, to leave it up to users to decide. (The buffers could also be taken as &mut [u8], but that would add a mandatory lifetime parameter to the wrapper types.)

    Buffers of fewer than 4 bytes (or can that number be higher for encodings other than UTF-8?) can cause infinite loops, with a stream unable to make progress. Larger sizes are probably better for performance. For example, [u8; 1024] on the stack seems nice, though I totally just pulled that number out of thin air.

    Currently WriteDecoder and WriteEncoder signal the end of the stream when dropped, but errors (such as I/O errors) from the underlying stream that occur at that time are ignored since Drop::drop cannot return a Result, and panicking in a destructor is generally avoided. (Other destructors are still run after one panic, but panicking while panicking causes the process to abort.) Perhaps an fn end(&mut self) -> Result method could be added to each of them to allow users to signal the end of the stream and handle errors. @hsivonen, is it OK to call encoder.encode_from_utf8("", buffer, /* last = */ true) or decoder.decode_to_utf8(b"", buffer, /* last = */ true) twice?
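    The buffer-then-transform shape under discussion can be sketched in pure std with a stand-in transform. `TransformReader` below is invented for illustration; a real wrapper would call Decoder::decode_to_utf8 where the uppercasing step is, and would have to handle carry-over between calls:

```rust
use std::io::{self, Read};

// Sketch of a Read adapter: pull bytes from an inner stream into a
// temporary buffer, transform them, and copy into the caller's buffer.
struct TransformReader<R: Read> {
    inner: R,
    buf: [u8; 1024], // temporary space; 1024 is the size mused about above
}

impl<R: Read> TransformReader<R> {
    fn new(inner: R) -> Self {
        TransformReader { inner, buf: [0; 1024] }
    }
}

impl<R: Read> Read for TransformReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Never read more than we can hand back in this call.
        let want = out.len().min(self.buf.len());
        let n = self.inner.read(&mut self.buf[..want])?;
        for i in 0..n {
            // Stand-in 1:1 transform; a decoder would go here instead.
            out[i] = self.buf[i].to_ascii_uppercase();
        }
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut r = TransformReader::new(&b"hello"[..]);
    let mut s = String::new();
    r.read_to_string(&mut s)?;
    assert_eq!(s, "HELLO");
    Ok(())
}
```

    Note the zero-initialized fixed array sidesteps the uninitialized-buffer question raised above, at the cost the text describes.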

    CC @BurntSushi

    opened by SimonSapin 6
  • Integration with oss-fuzz fuzzing service

    Hi @hsivonen, I would like to help integrate this project into OSS-Fuzz.

    • As an initial step for the integration I have created this PR: https://github.com/google/oss-fuzz/pull/8652; it contains the necessary logic from an OSS-Fuzz perspective to integrate encoding_rs.

    • OSS-Fuzz is a free service run by Google that performs continuous fuzzing of important open source projects.

    • As encoding_rs already has cargo-fuzz based fuzzing implemented, it is easily compatible with OSS-Fuzz out of the box.

    • If you would like to integrate, the only thing I need is a list of email(s); each must be associated with a Google account such as Gmail (why?). The provided email(s) will then get access to the data produced by OSS-Fuzz, such as bug reports, coverage reports and more stats.

    • As an alternative, if you don't have a Google/Gmail id but still wish to integrate, I can add my email id for the time being and monitor bugs/crashes.

    • Notice the email(s) affiliated with the project will be public in the OSS-Fuzz repo, as they will be part of a configuration file.

    opened by manunio 1
  • Migrate ASCII acceleration code to align_to/align_to_mut

    Currently, the ASCII acceleration code manually reinterprets slice memory as wider SIMD or ALU types. This code predates the align_to and align_to_mut methods on slices.

    This code should be rewritten to use these methods, with the middle slice being a SIMD type or a wider ALU type in the aligned case, or a fixed-length array that can be read unaligned as a SIMD type in the unaligned SIMD case.

    opened by hsivonen 0
  • Broken links in `EUC_JP` encoding doc

    The EUC_JP doc refers to euc-jp.html and euc-jp-bmp.html, which do not exist. According to https://encoding.spec.whatwg.org/#indexes I assume that the correct links are https://encoding.spec.whatwg.org/jis0212.html and https://encoding.spec.whatwg.org/jis0212-bmp.html

    opened by Mingun 1
  • set_len on a Vec<u8> of uninit is UB

    encoding_rs currently has UB in the form of creating uninitialized u8s via set_len. Here are two examples where the UB is crystal clear:

    https://github.com/hsivonen/encoding_rs/blob/dd9d99bb185f93d4fe5071291cdc54278e193955/src/mem.rs#L2007-L2010

    https://github.com/hsivonen/encoding_rs/blob/dd9d99bb185f93d4fe5071291cdc54278e193955/src/mem.rs#L2044-L2047

    set_len is also used in 7 functions in lib.rs, but I haven't looked at them very closely.

    The docs for set_len explicitly say https://doc.rust-lang.org/std/vec/struct.Vec.html#method.set_len :

    The elements at old_len..new_len must be initialized.

    Some relevant discussion can be found here https://github.com/rust-lang/unsafe-code-guidelines/issues/71

    rustc itself has a lint specifically for this kind of thing: https://github.com/rust-lang/rust/issues/75968

    Using MaybeUninit::uninit().assume_init() is instant UB unless the target type is itself composed entirely of MaybeUninit.

    My understanding is that this is currently considered UB, but this rule may be relaxed in the future to allow types where all bit patterns are valid to store uninitialized data, provided it is not read from.
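    A pure-std sketch of the defined-behavior alternative: make every byte up to len() initialized before it can be read, either with resize() (shown) or by writing through spare_capacity_mut() and only then calling set_len():

```rust
// Safe pattern: never expose uninitialized bytes through len().
// resize() zero-initializes the new elements, so every index up to
// len() is defined to read.
fn make_output_buffer(n: usize) -> Vec<u8> {
    let mut v = Vec::with_capacity(n);
    v.resize(n, 0); // initialize before the length becomes observable
    v
}

fn main() {
    let buf = make_output_buffer(8);
    assert_eq!(buf.len(), 8);
    assert!(buf.iter().all(|&b| b == 0));
}
```

    The zeroing has a cost the set_len trick was presumably avoiding, which is the trade-off under discussion.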

    opened by nico-abram 7
  • Fix clippy lints

    This PR fixes clippy lints that showed up on clippy 1.56.1

    Lints that showed up multiple times were:

    • Add clippy:: prefix to lint allow()s
    • Use matches!
    • Replace range check with (a..b).contains()
    • Remove unnecessary 'static from statics
    opened by nico-abram 0
  • Allow passing `String` to `Encoding::encode`

    In the common case, when converting from UTF-8 to UTF-8 or when the string is all ASCII, this avoids an extra heap allocation for the caller if they only have a String available. Previously, they would have to call encoding.encode(&string).into_owned() to avoid lifetime errors.

    This change is backwards compatible.
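    The saving can be illustrated in pure std (`encode_owned` stands in for the proposed by-value path and is not the crate's actual signature):

```rust
use std::borrow::Cow;

// The ASCII / UTF-8 pass-through case today: the output borrows the input.
fn encode_borrowed(input: &str) -> Cow<'_, [u8]> {
    Cow::Borrowed(input.as_bytes())
}

// Stand-in for the proposed by-value API: reuses the String's heap
// buffer directly, so no copy and no new allocation.
fn encode_owned(input: String) -> Vec<u8> {
    input.into_bytes()
}

fn main() {
    let s = String::from("ascii only");

    // Caller with only a String today: borrow, then copy via into_owned().
    let copied: Vec<u8> = encode_borrowed(&s).into_owned();
    assert_eq!(copied, s.as_bytes());

    // With a by-value API, the original buffer is handed back untouched.
    let ptr_before = s.as_ptr();
    let reused = encode_owned(s);
    assert_eq!(reused.as_ptr(), ptr_before);
}
```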

    opened by jyn514 0

crfs-rs Pure Rust port of CRFsuite: a fast implementation of Conditional Random Fields (CRFs) Currently only support prediction, model training is not

messense 24 Nov 23, 2022
A binary encoder / decoder implementation in Rust.

Bincode A compact encoder / decoder pair that uses a binary zero-fluff encoding scheme. The size of the encoded object will be the same or smaller tha

Bincode 1.9k Dec 29, 2022
rust-jsonnet - The Google Jsonnet( operation data template language) for rust

rust-jsonnet ==== Crate rust-jsonnet - The Google Jsonnet( operation data template language) for rust Google jsonnet documet: (http://google.github.io

Qihoo 360 24 Dec 1, 2022
A Rust ASN.1 (DER) serializer.

rust-asn1 This is a Rust library for parsing and generating ASN.1 data (DER only). Installation Add asn1 to the [dependencies] section of your Cargo.t

Alex Gaynor 85 Dec 16, 2022
Rust library for reading/writing numbers in big-endian and little-endian.

byteorder This crate provides convenience methods for encoding and decoding numbers in either big-endian or little-endian order. Dual-licensed under M

Andrew Gallant 811 Jan 1, 2023