Standalone JIT-style runtime for WebAssembly, using Cranelift

Overview

wasmtime

A standalone runtime for WebAssembly

A Bytecode Alliance project


Guide | Contributing | Website | Chat

Installation

The Wasmtime CLI can be installed on Linux and macOS with a small install script:

$ curl https://wasmtime.dev/install.sh -sSf | bash

Users on Windows, or anyone who prefers not to use the install script, can download installers and binaries directly from the GitHub Releases page.

Example

If you've got the Rust compiler installed then you can take some Rust source code:

fn main() {
    println!("Hello, world!");
}

and compile/run it with:

$ rustup target add wasm32-wasi
$ rustc hello.rs --target wasm32-wasi
$ wasmtime hello.wasm
Hello, world!

Features

  • Lightweight. Wasmtime is a standalone runtime for WebAssembly that scales with your needs. It fits on tiny chips as well as makes use of huge servers. Wasmtime can be embedded into almost any application too.

  • Fast. Wasmtime is built on the optimizing Cranelift code generator to quickly generate high-quality machine code at runtime.

  • Configurable. Whether you need to precompile your wasm ahead of time, or interpret it at runtime, Wasmtime has you covered for all your wasm-executing needs.

  • WASI. Wasmtime supports a rich set of APIs for interacting with the host environment through the WASI standard.

  • Standards Compliant. Wasmtime passes the official WebAssembly test suite, implements the official C API of wasm, and implements future proposals to WebAssembly as well. Wasmtime developers are intimately engaged with the WebAssembly standards process all along the way too.

Language Support

You can use Wasmtime from a variety of different languages through embeddings of the implementation, including Rust, C, Python, .NET, and Go.
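
For example, here is a minimal sketch of the Rust embedding (the wasmtime crate); the module text, export name, and use of the anyhow crate are illustrative only, and exact APIs can differ between Wasmtime versions:

use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // Compile a module; WebAssembly text works here when the `wat` feature is enabled.
    let module = Module::new(&engine, r#"(module (func (export "run")))"#)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    // Look up the export with a typed signature and call it.
    let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;
    run.call(&mut store, ())?;
    Ok(())
}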

Documentation

📚 Read the Wasmtime guide here! 📚

The wasmtime guide is the best starting point to learn about what Wasmtime can do for you or to help answer your questions about Wasmtime. If you're curious about contributing to Wasmtime, it can also help you do that!


It's Wasmtime.

Comments
  • `wasmtime`: Implement fast Wasm stack walking

    Why do we want Wasm stack walking to be fast? Because we capture stacks whenever there is a trap and traps actually happen fairly frequently with short-lived programs and WASI's exit.

    Previously, we would rely on generating the system unwind info (e.g. .eh_frame) and using the system unwinder (via the backtrace crate) to walk the full stack and filter out any non-Wasm stack frames. This can, unfortunately, be slow for two primary reasons:

    1. The system unwinder is doing O(all-kinds-of-frames) work rather than O(wasm-frames) work.

    2. System unwind info and the system unwinder need to be much more general than a purpose-built stack walker for Wasm needs to be. They have to handle any kind of stack frame that any compiler might emit, whereas our Wasm frames are emitted by Cranelift and always have frame pointers. This translates into implementation complexity and general overhead. There can also be unnecessary-for-our-use-cases global synchronization and locks involved, further slowing down stack walking in the presence of multiple threads trying to capture stacks in parallel.

    This commit introduces a purpose-built stack walker for traversing just our Wasm frames. To find all the sequences of Wasm-to-Wasm stack frames, and ignore non-Wasm stack frames, we keep a linked list of (entry stack pointer, exit frame pointer) pairs. This linked list is maintained via Wasm-to-host and host-to-Wasm trampolines. Within a sequence of Wasm-to-Wasm calls, we can use frame pointers (which Cranelift preserves) to find the next older Wasm frame on the stack, and we keep doing this until we reach the entry stack pointer, meaning that the next older frame will be a host frame.
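
    As a rough illustration of that walk (not Wasmtime's actual code), the loop below follows saved frame pointers until it crosses the recorded entry stack pointer; the frame layout and names here are assumptions for the sketch:

    unsafe fn walk_wasm_frames(entry_sp: usize, exit_fp: usize, mut visit: impl FnMut(usize)) {
        let mut fp = exit_fp;
        // Keep following saved frame pointers until the next older frame would be a
        // host frame, i.e. until we reach this sequence's entry stack pointer.
        while fp != entry_sp {
            // With frame pointers preserved, the return address sits one word above
            // the saved frame pointer (the exact layout is architecture-dependent).
            let return_addr = *((fp + std::mem::size_of::<usize>()) as *const usize);
            visit(return_addr);
            // The saved caller frame pointer is stored at the current frame pointer.
            fp = *(fp as *const usize);
        }
    }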

    The trampolines need to avoid a couple stumbling blocks. First, they need to be compiled ahead of time, since we may not have access to a compiler at runtime (e.g. if the cranelift feature is disabled) but still want to be able to call functions that have already been compiled and get stack traces for those functions. Usually this means we would compile the appropriate trampolines inside Module::new and the compiled module object would hold the trampolines. However, we also need to support calling host functions that are wrapped into wasmtime::Funcs and there doesn't exist any ahead-of-time compiled module object to hold the appropriate trampolines:

    // Define a host function.
    let func_type = wasmtime::FuncType::new(
        vec![wasmtime::ValType::I32],
        vec![wasmtime::ValType::I32],
    );
    let func = Func::new(&mut store, func_type, |_, params, results| {
        // ...
        Ok(())
    });
    
    // Call that host function.
    let mut results = vec![wasmtime::Val::I32(0)];
    func.call(&mut store, &[wasmtime::Val::I32(0)], &mut results)?;
    

    Therefore, we define one host-to-Wasm trampoline and one Wasm-to-host trampoline in assembly that work for all Wasm and host function signatures. These trampolines are careful to only use volatile registers, avoid touching any register that is an argument in the calling convention ABI, and tail call to the target callee function. This allows forwarding any set of arguments and any returns to and from the callee, while also allowing us to maintain our linked list of Wasm stack and frame pointers before transferring control to the callee. These trampolines are not used in Wasm-to-Wasm calls, only when crossing the host-Wasm boundary, so they do not impose overhead on regular calls. (And if using one trampoline for all host-Wasm boundary crossing ever breaks branch prediction enough in the CPU to become any kind of bottleneck, we can do fun things like have multiple copies of the same trampoline and choose a random copy for each function, sharding the functions across branch predictor entries.)

    Finally, this commit also ends the use of a synthetic Module and allocating a stubbed out VMContext for host functions. Instead, we define a VMHostFuncContext with its own magic value, similar to VMComponentContext, specifically for host functions.

    Benchmarks

    Traps and Stack Traces

    Large improvements to taking stack traces on traps, ranging from shaving off 64% to 99.95% of the time it used to take.

    multi-threaded-traps/0  time:   [2.5686 us 2.5808 us 2.5934 us]
                            thrpt:  [0.0000  elem/s 0.0000  elem/s 0.0000  elem/s]
                     change:
                            time:   [-85.419% -85.153% -84.869%] (p = 0.00 < 0.05)
                            thrpt:  [+560.90% +573.56% +585.84%]
                            Performance has improved.
    Found 8 outliers among 100 measurements (8.00%)
      4 (4.00%) high mild
      4 (4.00%) high severe
    multi-threaded-traps/1  time:   [2.9021 us 2.9167 us 2.9322 us]
                            thrpt:  [341.04 Kelem/s 342.86 Kelem/s 344.58 Kelem/s]
                     change:
                            time:   [-91.455% -91.294% -91.096%] (p = 0.00 < 0.05)
                            thrpt:  [+1023.1% +1048.6% +1070.3%]
                            Performance has improved.
    Found 6 outliers among 100 measurements (6.00%)
      1 (1.00%) high mild
      5 (5.00%) high severe
    multi-threaded-traps/2  time:   [2.9996 us 3.0145 us 3.0295 us]
                            thrpt:  [660.18 Kelem/s 663.47 Kelem/s 666.76 Kelem/s]
                     change:
                            time:   [-94.040% -93.910% -93.762%] (p = 0.00 < 0.05)
                            thrpt:  [+1503.1% +1542.0% +1578.0%]
                            Performance has improved.
    Found 5 outliers among 100 measurements (5.00%)
      5 (5.00%) high severe
    multi-threaded-traps/4  time:   [5.5768 us 5.6052 us 5.6364 us]
                            thrpt:  [709.68 Kelem/s 713.63 Kelem/s 717.25 Kelem/s]
                     change:
                            time:   [-93.193% -93.121% -93.052%] (p = 0.00 < 0.05)
                            thrpt:  [+1339.2% +1353.6% +1369.1%]
                            Performance has improved.
    multi-threaded-traps/8  time:   [8.6408 us 9.1212 us 9.5438 us]
                            thrpt:  [838.24 Kelem/s 877.08 Kelem/s 925.84 Kelem/s]
                     change:
                            time:   [-94.754% -94.473% -94.202%] (p = 0.00 < 0.05)
                            thrpt:  [+1624.7% +1709.2% +1806.1%]
                            Performance has improved.
    multi-threaded-traps/16 time:   [10.152 us 10.840 us 11.545 us]
                            thrpt:  [1.3858 Melem/s 1.4760 Melem/s 1.5761 Melem/s]
                     change:
                            time:   [-97.042% -96.823% -96.577%] (p = 0.00 < 0.05)
                            thrpt:  [+2821.5% +3048.1% +3281.1%]
                            Performance has improved.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    
    many-modules-registered-traps/1
                            time:   [2.6278 us 2.6361 us 2.6447 us]
                            thrpt:  [378.11 Kelem/s 379.35 Kelem/s 380.55 Kelem/s]
                     change:
                            time:   [-85.311% -85.108% -84.909%] (p = 0.00 < 0.05)
                            thrpt:  [+562.65% +571.51% +580.76%]
                            Performance has improved.
    Found 9 outliers among 100 measurements (9.00%)
      3 (3.00%) high mild
      6 (6.00%) high severe
    many-modules-registered-traps/8
                            time:   [2.6294 us 2.6460 us 2.6623 us]
                            thrpt:  [3.0049 Melem/s 3.0235 Melem/s 3.0425 Melem/s]
                     change:
                            time:   [-85.895% -85.485% -85.022%] (p = 0.00 < 0.05)
                            thrpt:  [+567.63% +588.95% +608.95%]
                            Performance has improved.
    Found 8 outliers among 100 measurements (8.00%)
      3 (3.00%) high mild
      5 (5.00%) high severe
    many-modules-registered-traps/64
                            time:   [2.6218 us 2.6329 us 2.6452 us]
                            thrpt:  [24.195 Melem/s 24.308 Melem/s 24.411 Melem/s]
                     change:
                            time:   [-93.629% -93.551% -93.470%] (p = 0.00 < 0.05)
                            thrpt:  [+1431.4% +1450.6% +1469.5%]
                            Performance has improved.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high mild
    many-modules-registered-traps/512
                            time:   [2.6569 us 2.6737 us 2.6923 us]
                            thrpt:  [190.17 Melem/s 191.50 Melem/s 192.71 Melem/s]
                     change:
                            time:   [-99.277% -99.268% -99.260%] (p = 0.00 < 0.05)
                            thrpt:  [+13417% +13566% +13731%]
                            Performance has improved.
    Found 4 outliers among 100 measurements (4.00%)
      4 (4.00%) high mild
    many-modules-registered-traps/4096
                            time:   [2.7258 us 2.7390 us 2.7535 us]
                            thrpt:  [1.4876 Gelem/s 1.4955 Gelem/s 1.5027 Gelem/s]
                     change:
                            time:   [-99.956% -99.955% -99.955%] (p = 0.00 < 0.05)
                            thrpt:  [+221417% +223380% +224881%]
                            Performance has improved.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe
    
    many-stack-frames-traps/1
                            time:   [1.4658 us 1.4719 us 1.4784 us]
                            thrpt:  [676.39 Kelem/s 679.38 Kelem/s 682.21 Kelem/s]
                     change:
                            time:   [-90.368% -89.947% -89.586%] (p = 0.00 < 0.05)
                            thrpt:  [+860.23% +894.72% +938.21%]
                            Performance has improved.
    Found 8 outliers among 100 measurements (8.00%)
      5 (5.00%) high mild
      3 (3.00%) high severe
    many-stack-frames-traps/8
                            time:   [2.4772 us 2.4870 us 2.4973 us]
                            thrpt:  [3.2034 Melem/s 3.2167 Melem/s 3.2294 Melem/s]
                     change:
                            time:   [-85.550% -85.370% -85.199%] (p = 0.00 < 0.05)
                            thrpt:  [+575.65% +583.51% +592.03%]
                            Performance has improved.
    Found 8 outliers among 100 measurements (8.00%)
      4 (4.00%) high mild
      4 (4.00%) high severe
    many-stack-frames-traps/64
                            time:   [10.109 us 10.171 us 10.236 us]
                            thrpt:  [6.2525 Melem/s 6.2925 Melem/s 6.3309 Melem/s]
                     change:
                            time:   [-78.144% -77.797% -77.336%] (p = 0.00 < 0.05)
                            thrpt:  [+341.22% +350.38% +357.55%]
                            Performance has improved.
    Found 7 outliers among 100 measurements (7.00%)
      5 (5.00%) high mild
      2 (2.00%) high severe
    many-stack-frames-traps/512
                            time:   [126.16 us 126.54 us 126.96 us]
                            thrpt:  [4.0329 Melem/s 4.0461 Melem/s 4.0583 Melem/s]
                     change:
                            time:   [-65.364% -64.933% -64.453%] (p = 0.00 < 0.05)
                            thrpt:  [+181.32% +185.17% +188.71%]
                            Performance has improved.
    Found 4 outliers among 100 measurements (4.00%)
      4 (4.00%) high severe
    

    Calls

    There is, however, a small regression in raw Wasm-to-host and host-to-Wasm call performance due to the new trampolines. It seems to be on the order of about 2-10 nanoseconds per call, depending on the benchmark.

    I believe this regression is ultimately acceptable because

    1. this overhead will be vastly dominated by whatever work a non-nop callee actually does,

    2. we will need these trampolines, or something like them, when implementing the Wasm exceptions proposal to do things like translate Wasm's exceptions into Rust's Results,

    3. and because the performance improvements to trapping and capturing stack traces are of a much larger magnitude than these call regressions.

    sync/no-hook/host-to-wasm - typed - nop
                            time:   [28.683 ns 28.757 ns 28.844 ns]
                            change: [+16.472% +17.183% +17.904%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      1 (1.00%) low mild
      4 (4.00%) high mild
      5 (5.00%) high severe
    sync/no-hook/host-to-wasm - untyped - nop
                            time:   [42.515 ns 42.652 ns 42.841 ns]
                            change: [+12.371% +14.614% +17.462%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      1 (1.00%) high mild
      10 (10.00%) high severe
    sync/no-hook/host-to-wasm - unchecked - nop
                            time:   [33.936 ns 34.052 ns 34.179 ns]
                            change: [+25.478% +26.938% +28.369%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      7 (7.00%) high mild
      2 (2.00%) high severe
    sync/no-hook/host-to-wasm - typed - nop-params-and-results
                            time:   [34.290 ns 34.388 ns 34.502 ns]
                            change: [+40.802% +42.706% +44.526%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 13 outliers among 100 measurements (13.00%)
      5 (5.00%) high mild
      8 (8.00%) high severe
    sync/no-hook/host-to-wasm - untyped - nop-params-and-results
                            time:   [62.546 ns 62.721 ns 62.919 ns]
                            change: [+2.5014% +3.6319% +4.8078%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 12 outliers among 100 measurements (12.00%)
      2 (2.00%) high mild
      10 (10.00%) high severe
    sync/no-hook/host-to-wasm - unchecked - nop-params-and-results
                            time:   [42.609 ns 42.710 ns 42.831 ns]
                            change: [+20.966% +22.282% +23.475%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      4 (4.00%) high mild
      7 (7.00%) high severe
    
    sync/hook-sync/host-to-wasm - typed - nop
                            time:   [29.546 ns 29.675 ns 29.818 ns]
                            change: [+20.693% +21.794% +22.836%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) high mild
      2 (2.00%) high severe
    sync/hook-sync/host-to-wasm - untyped - nop
                            time:   [45.448 ns 45.699 ns 45.961 ns]
                            change: [+17.204% +18.514% +19.590%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 14 outliers among 100 measurements (14.00%)
      4 (4.00%) high mild
      10 (10.00%) high severe
    sync/hook-sync/host-to-wasm - unchecked - nop
                            time:   [34.334 ns 34.437 ns 34.558 ns]
                            change: [+23.225% +24.477% +25.886%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 12 outliers among 100 measurements (12.00%)
      5 (5.00%) high mild
      7 (7.00%) high severe
    sync/hook-sync/host-to-wasm - typed - nop-params-and-results
                            time:   [36.594 ns 36.763 ns 36.974 ns]
                            change: [+41.967% +47.261% +52.086%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 12 outliers among 100 measurements (12.00%)
      3 (3.00%) high mild
      9 (9.00%) high severe
    sync/hook-sync/host-to-wasm - untyped - nop-params-and-results
                            time:   [63.541 ns 63.831 ns 64.194 ns]
                            change: [-4.4337% -0.6855% +2.7134%] (p = 0.73 > 0.05)
                            No change in performance detected.
    Found 8 outliers among 100 measurements (8.00%)
      6 (6.00%) high mild
      2 (2.00%) high severe
    sync/hook-sync/host-to-wasm - unchecked - nop-params-and-results
                            time:   [43.968 ns 44.169 ns 44.437 ns]
                            change: [+18.772% +21.802% +24.623%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 15 outliers among 100 measurements (15.00%)
      3 (3.00%) high mild
      12 (12.00%) high severe
    
    async/no-hook/host-to-wasm - typed - nop
                            time:   [4.9612 us 4.9743 us 4.9889 us]
                            change: [+9.9493% +11.911% +13.502%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      6 (6.00%) high mild
      4 (4.00%) high severe
    async/no-hook/host-to-wasm - untyped - nop
                            time:   [5.0030 us 5.0211 us 5.0439 us]
                            change: [+10.841% +11.873% +12.977%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      3 (3.00%) high mild
      7 (7.00%) high severe
    async/no-hook/host-to-wasm - typed - nop-params-and-results
                            time:   [4.9273 us 4.9468 us 4.9700 us]
                            change: [+4.7381% +6.8445% +8.8238%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 14 outliers among 100 measurements (14.00%)
      5 (5.00%) high mild
      9 (9.00%) high severe
    async/no-hook/host-to-wasm - untyped - nop-params-and-results
                            time:   [5.1151 us 5.1338 us 5.1555 us]
                            change: [+9.5335% +11.290% +13.044%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      3 (3.00%) high mild
      13 (13.00%) high severe
    
    async/hook-sync/host-to-wasm - typed - nop
                            time:   [4.9330 us 4.9394 us 4.9467 us]
                            change: [+10.046% +11.038% +12.035%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 12 outliers among 100 measurements (12.00%)
      5 (5.00%) high mild
      7 (7.00%) high severe
    async/hook-sync/host-to-wasm - untyped - nop
                            time:   [5.0073 us 5.0183 us 5.0310 us]
                            change: [+9.3828% +10.565% +11.752%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 8 outliers among 100 measurements (8.00%)
      3 (3.00%) high mild
      5 (5.00%) high severe
    async/hook-sync/host-to-wasm - typed - nop-params-and-results
                            time:   [4.9610 us 4.9839 us 5.0097 us]
                            change: [+9.0857% +11.513% +14.359%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 13 outliers among 100 measurements (13.00%)
      7 (7.00%) high mild
      6 (6.00%) high severe
    async/hook-sync/host-to-wasm - untyped - nop-params-and-results
                            time:   [5.0995 us 5.1272 us 5.1617 us]
                            change: [+9.3600% +11.506% +13.809%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      6 (6.00%) high mild
      4 (4.00%) high severe
    
    async-pool/no-hook/host-to-wasm - typed - nop
                            time:   [2.4242 us 2.4316 us 2.4396 us]
                            change: [+7.8756% +8.8803% +9.8346%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 8 outliers among 100 measurements (8.00%)
      5 (5.00%) high mild
      3 (3.00%) high severe
    async-pool/no-hook/host-to-wasm - untyped - nop
                            time:   [2.5102 us 2.5155 us 2.5210 us]
                            change: [+12.130% +13.194% +14.270%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 12 outliers among 100 measurements (12.00%)
      4 (4.00%) high mild
      8 (8.00%) high severe
    async-pool/no-hook/host-to-wasm - typed - nop-params-and-results
                            time:   [2.4203 us 2.4310 us 2.4440 us]
                            change: [+4.0380% +6.3623% +8.7534%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 14 outliers among 100 measurements (14.00%)
      5 (5.00%) high mild
      9 (9.00%) high severe
    async-pool/no-hook/host-to-wasm - untyped - nop-params-and-results
                            time:   [2.5501 us 2.5593 us 2.5700 us]
                            change: [+8.8802% +10.976% +12.937%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      5 (5.00%) high mild
      11 (11.00%) high severe
    
    async-pool/hook-sync/host-to-wasm - typed - nop
                            time:   [2.4135 us 2.4190 us 2.4254 us]
                            change: [+8.3640% +9.3774% +10.435%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      6 (6.00%) high mild
      5 (5.00%) high severe
    async-pool/hook-sync/host-to-wasm - untyped - nop
                            time:   [2.5172 us 2.5248 us 2.5357 us]
                            change: [+11.543% +12.750% +13.982%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 8 outliers among 100 measurements (8.00%)
      1 (1.00%) high mild
      7 (7.00%) high severe
    async-pool/hook-sync/host-to-wasm - typed - nop-params-and-results
                            time:   [2.4214 us 2.4353 us 2.4532 us]
                            change: [+1.5158% +5.0872% +8.6765%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 15 outliers among 100 measurements (15.00%)
      2 (2.00%) high mild
      13 (13.00%) high severe
    async-pool/hook-sync/host-to-wasm - untyped - nop-params-and-results
                            time:   [2.5499 us 2.5607 us 2.5748 us]
                            change: [+10.146% +12.459% +14.919%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 18 outliers among 100 measurements (18.00%)
      3 (3.00%) high mild
      15 (15.00%) high severe
    
    sync/no-hook/wasm-to-host - nop - typed
                            time:   [6.6135 ns 6.6288 ns 6.6452 ns]
                            change: [+37.927% +38.837% +39.869%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 7 outliers among 100 measurements (7.00%)
      2 (2.00%) high mild
      5 (5.00%) high severe
    sync/no-hook/wasm-to-host - nop-params-and-results - typed
                            time:   [15.930 ns 15.993 ns 16.067 ns]
                            change: [+3.9583% +5.6286% +7.2430%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 12 outliers among 100 measurements (12.00%)
      11 (11.00%) high mild
      1 (1.00%) high severe
    sync/no-hook/wasm-to-host - nop - untyped
                            time:   [20.596 ns 20.640 ns 20.690 ns]
                            change: [+4.3293% +5.2047% +6.0935%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      5 (5.00%) high mild
      5 (5.00%) high severe
    sync/no-hook/wasm-to-host - nop-params-and-results - untyped
                            time:   [42.659 ns 42.882 ns 43.159 ns]
                            change: [-2.1466% -0.5079% +1.2554%] (p = 0.58 > 0.05)
                            No change in performance detected.
    Found 15 outliers among 100 measurements (15.00%)
      1 (1.00%) high mild
      14 (14.00%) high severe
    sync/no-hook/wasm-to-host - nop - unchecked
                            time:   [10.671 ns 10.691 ns 10.713 ns]
                            change: [+83.911% +87.620% +92.062%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      2 (2.00%) high mild
      7 (7.00%) high severe
    sync/no-hook/wasm-to-host - nop-params-and-results - unchecked
                            time:   [11.136 ns 11.190 ns 11.263 ns]
                            change: [-29.719% -28.446% -27.029%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 14 outliers among 100 measurements (14.00%)
      4 (4.00%) high mild
      10 (10.00%) high severe
    
    sync/hook-sync/wasm-to-host - nop - typed
                            time:   [6.7964 ns 6.8087 ns 6.8226 ns]
                            change: [+21.531% +24.206% +27.331%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 14 outliers among 100 measurements (14.00%)
      4 (4.00%) high mild
      10 (10.00%) high severe
    sync/hook-sync/wasm-to-host - nop-params-and-results - typed
                            time:   [15.865 ns 15.921 ns 15.985 ns]
                            change: [+4.8466% +6.3330% +7.8317%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      3 (3.00%) high mild
      13 (13.00%) high severe
    sync/hook-sync/wasm-to-host - nop - untyped
                            time:   [21.505 ns 21.587 ns 21.677 ns]
                            change: [+8.0908% +9.1943% +10.254%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 8 outliers among 100 measurements (8.00%)
      4 (4.00%) high mild
      4 (4.00%) high severe
    sync/hook-sync/wasm-to-host - nop-params-and-results - untyped
                            time:   [44.018 ns 44.128 ns 44.261 ns]
                            change: [-1.4671% -0.0458% +1.2443%] (p = 0.94 > 0.05)
                            No change in performance detected.
    Found 14 outliers among 100 measurements (14.00%)
      5 (5.00%) high mild
      9 (9.00%) high severe
    sync/hook-sync/wasm-to-host - nop - unchecked
                            time:   [11.264 ns 11.326 ns 11.387 ns]
                            change: [+80.225% +81.659% +83.068%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      3 (3.00%) high mild
      3 (3.00%) high severe
    sync/hook-sync/wasm-to-host - nop-params-and-results - unchecked
                            time:   [11.816 ns 11.865 ns 11.920 ns]
                            change: [-29.152% -28.040% -26.957%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 14 outliers among 100 measurements (14.00%)
      8 (8.00%) high mild
      6 (6.00%) high severe
    
    async/no-hook/wasm-to-host - nop - typed
                            time:   [6.6221 ns 6.6385 ns 6.6569 ns]
                            change: [+43.618% +44.755% +45.965%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 13 outliers among 100 measurements (13.00%)
      6 (6.00%) high mild
      7 (7.00%) high severe
    async/no-hook/wasm-to-host - nop-params-and-results - typed
                            time:   [15.884 ns 15.929 ns 15.983 ns]
                            change: [+3.5987% +5.2053% +6.7846%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      3 (3.00%) high mild
      13 (13.00%) high severe
    async/no-hook/wasm-to-host - nop - untyped
                            time:   [20.615 ns 20.702 ns 20.821 ns]
                            change: [+6.9799% +8.1212% +9.2819%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      2 (2.00%) high mild
      8 (8.00%) high severe
    async/no-hook/wasm-to-host - nop-params-and-results - untyped
                            time:   [41.956 ns 42.207 ns 42.521 ns]
                            change: [-4.3057% -2.7730% -1.2428%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 14 outliers among 100 measurements (14.00%)
      3 (3.00%) high mild
      11 (11.00%) high severe
    async/no-hook/wasm-to-host - nop - unchecked
                            time:   [10.440 ns 10.474 ns 10.513 ns]
                            change: [+83.959% +85.826% +87.541%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      5 (5.00%) high mild
      6 (6.00%) high severe
    async/no-hook/wasm-to-host - nop-params-and-results - unchecked
                            time:   [11.476 ns 11.512 ns 11.554 ns]
                            change: [-29.857% -28.383% -26.978%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 12 outliers among 100 measurements (12.00%)
      1 (1.00%) low mild
      6 (6.00%) high mild
      5 (5.00%) high severe
    async/no-hook/wasm-to-host - nop - async-typed
                            time:   [26.427 ns 26.478 ns 26.532 ns]
                            change: [+6.5730% +7.4676% +8.3983%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      2 (2.00%) high mild
      7 (7.00%) high severe
    async/no-hook/wasm-to-host - nop-params-and-results - async-typed
                            time:   [28.557 ns 28.693 ns 28.880 ns]
                            change: [+1.9099% +3.7332% +5.9731%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 15 outliers among 100 measurements (15.00%)
      1 (1.00%) high mild
      14 (14.00%) high severe
    
    async/hook-sync/wasm-to-host - nop - typed
                            time:   [6.7488 ns 6.7630 ns 6.7784 ns]
                            change: [+19.935% +22.080% +23.683%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      4 (4.00%) high mild
      5 (5.00%) high severe
    async/hook-sync/wasm-to-host - nop-params-and-results - typed
                            time:   [15.928 ns 16.031 ns 16.149 ns]
                            change: [+5.5188% +6.9567% +8.3839%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      9 (9.00%) high mild
      2 (2.00%) high severe
    async/hook-sync/wasm-to-host - nop - untyped
                            time:   [21.930 ns 22.114 ns 22.296 ns]
                            change: [+4.6674% +7.7588% +10.375%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 4 outliers among 100 measurements (4.00%)
      3 (3.00%) high mild
      1 (1.00%) high severe
    async/hook-sync/wasm-to-host - nop-params-and-results - untyped
                            time:   [42.684 ns 42.858 ns 43.081 ns]
                            change: [-5.2957% -3.4693% -1.6217%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 14 outliers among 100 measurements (14.00%)
      2 (2.00%) high mild
      12 (12.00%) high severe
    async/hook-sync/wasm-to-host - nop - unchecked
                            time:   [11.026 ns 11.053 ns 11.086 ns]
                            change: [+70.751% +72.378% +73.961%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 10 outliers among 100 measurements (10.00%)
      5 (5.00%) high mild
      5 (5.00%) high severe
    async/hook-sync/wasm-to-host - nop-params-and-results - unchecked
                            time:   [11.840 ns 11.900 ns 11.982 ns]
                            change: [-27.977% -26.584% -24.887%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 18 outliers among 100 measurements (18.00%)
      3 (3.00%) high mild
      15 (15.00%) high severe
    async/hook-sync/wasm-to-host - nop - async-typed
                            time:   [27.601 ns 27.709 ns 27.882 ns]
                            change: [+8.1781% +9.1102% +10.030%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      2 (2.00%) low mild
      3 (3.00%) high mild
      6 (6.00%) high severe
    async/hook-sync/wasm-to-host - nop-params-and-results - async-typed
                            time:   [28.955 ns 29.174 ns 29.413 ns]
                            change: [+1.1226% +3.0366% +5.1126%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 13 outliers among 100 measurements (13.00%)
      7 (7.00%) high mild
      6 (6.00%) high severe
    
    async-pool/no-hook/wasm-to-host - nop - typed
                            time:   [6.5626 ns 6.5733 ns 6.5851 ns]
                            change: [+40.561% +42.307% +44.514%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      5 (5.00%) high mild
      4 (4.00%) high severe
    async-pool/no-hook/wasm-to-host - nop-params-and-results - typed
                            time:   [15.820 ns 15.886 ns 15.969 ns]
                            change: [+4.1044% +5.7928% +7.7122%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 17 outliers among 100 measurements (17.00%)
      4 (4.00%) high mild
      13 (13.00%) high severe
    async-pool/no-hook/wasm-to-host - nop - untyped
                            time:   [20.481 ns 20.521 ns 20.566 ns]
                            change: [+6.7962% +7.6950% +8.7612%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      6 (6.00%) high mild
      5 (5.00%) high severe
    async-pool/no-hook/wasm-to-host - nop-params-and-results - untyped
                            time:   [41.834 ns 41.998 ns 42.189 ns]
                            change: [-3.8185% -2.2687% -0.7541%] (p = 0.01 < 0.05)
                            Change within noise threshold.
    Found 13 outliers among 100 measurements (13.00%)
      3 (3.00%) high mild
      10 (10.00%) high severe
    async-pool/no-hook/wasm-to-host - nop - unchecked
                            time:   [10.353 ns 10.380 ns 10.414 ns]
                            change: [+82.042% +84.591% +87.205%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 7 outliers among 100 measurements (7.00%)
      4 (4.00%) high mild
      3 (3.00%) high severe
    async-pool/no-hook/wasm-to-host - nop-params-and-results - unchecked
                            time:   [11.123 ns 11.168 ns 11.228 ns]
                            change: [-30.813% -29.285% -27.874%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 12 outliers among 100 measurements (12.00%)
      11 (11.00%) high mild
      1 (1.00%) high severe
    async-pool/no-hook/wasm-to-host - nop - async-typed
                            time:   [27.442 ns 27.528 ns 27.638 ns]
                            change: [+7.5215% +9.9795% +12.266%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 18 outliers among 100 measurements (18.00%)
      3 (3.00%) high mild
      15 (15.00%) high severe
    async-pool/no-hook/wasm-to-host - nop-params-and-results - async-typed
                            time:   [29.014 ns 29.148 ns 29.312 ns]
                            change: [+2.0227% +3.4722% +4.9047%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 7 outliers among 100 measurements (7.00%)
      6 (6.00%) high mild
      1 (1.00%) high severe
    
    async-pool/hook-sync/wasm-to-host - nop - typed
                            time:   [6.7916 ns 6.8116 ns 6.8325 ns]
                            change: [+20.937% +22.050% +23.281%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 11 outliers among 100 measurements (11.00%)
      5 (5.00%) high mild
      6 (6.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop-params-and-results - typed
                            time:   [15.917 ns 15.975 ns 16.051 ns]
                            change: [+4.6404% +6.4217% +8.3075%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      5 (5.00%) high mild
      11 (11.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop - untyped
                            time:   [21.558 ns 21.612 ns 21.679 ns]
                            change: [+8.1158% +9.1409% +10.217%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      2 (2.00%) high mild
      7 (7.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop-params-and-results - untyped
                            time:   [42.475 ns 42.614 ns 42.775 ns]
                            change: [-6.3613% -4.4709% -2.7647%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 18 outliers among 100 measurements (18.00%)
      3 (3.00%) high mild
      15 (15.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop - unchecked
                            time:   [11.150 ns 11.195 ns 11.247 ns]
                            change: [+74.424% +77.056% +79.811%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 14 outliers among 100 measurements (14.00%)
      3 (3.00%) high mild
      11 (11.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop-params-and-results - unchecked
                            time:   [11.639 ns 11.695 ns 11.760 ns]
                            change: [-30.212% -29.023% -27.954%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 15 outliers among 100 measurements (15.00%)
      7 (7.00%) high mild
      8 (8.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop - async-typed
                            time:   [27.480 ns 27.712 ns 27.984 ns]
                            change: [+2.9764% +6.5061% +9.8914%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 8 outliers among 100 measurements (8.00%)
      6 (6.00%) high mild
      2 (2.00%) high severe
    async-pool/hook-sync/wasm-to-host - nop-params-and-results - async-typed
                            time:   [29.218 ns 29.380 ns 29.600 ns]
                            change: [+5.2283% +7.7247% +10.822%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      2 (2.00%) high mild
      14 (14.00%) high severe
    
    wasmtime:api cranelift cranelift:area:machinst cranelift:meta fuzzing cranelift:area:aarch64 cranelift:area:x64 wasmtime:ref-types wasmtime:config 
    opened by fitzgen 40
  • wasm64 support

    We should consider supporting wasm64 modules, not just wasm32; people will want to run with large linear address spaces, both to process large amounts of data and to provide address space for shared mappings or file mappings.

    Opening this issue to start discussing what that support should look like, and how we can do that with minimal complexity or duplication.

    opened by joshtriplett 40
  • externref: implement stack map-based garbage collection

    For host VM code, we use plain reference counting, where cloning increments the reference count, and dropping decrements it. We can avoid many of the on-stack increment/decrement operations that typically plague the performance of reference counting via Rust's ownership and borrowing system. Moving a VMExternRef avoids mutating its reference count, and borrowing it either avoids the reference count increment or delays it until if/when the VMExternRef is cloned.

    When passing a VMExternRef into compiled Wasm code, we don't want to do reference count mutations for every compiled local.{get,set}, nor for every function call. Therefore, we use a variation of deferred reference counting, where we only mutate reference counts when storing VMExternRefs somewhere that outlives the activation: into a global or table. Simultaneously, we over-approximate the set of VMExternRefs that are inside Wasm function activations. Periodically, we walk the stack at GC safe points, and use stack map information to precisely identify the set of VMExternRefs inside Wasm activations. Then we take the difference between this precise set and our over-approximation, and decrement the reference count for each of the VMExternRefs that are in our over-approximation but not in the precise set. Finally, the over-approximation is replaced with the precise set.
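
    A simplified sketch of that GC step, with hypothetical names and references modeled as opaque words rather than real VMExternRefs:

    use std::collections::HashSet;

    fn gc(over_approximation: &mut HashSet<usize>, precise_roots: HashSet<usize>) {
        // Anything we conservatively assumed was live in a Wasm activation, but that
        // the stack maps did not actually find, gets its deferred decrement now.
        for stale in over_approximation.difference(&precise_roots) {
            decrement_ref_count(*stale);
        }
        // The precise set becomes the new over-approximation going forward.
        *over_approximation = precise_roots;
    }

    // Stand-in for dropping the table's clone of the reference, which decrements
    // the underlying reference count.
    fn decrement_ref_count(_raw_externref: usize) {}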

    The VMExternRefActivationsTable implements the over-approximated set of VMExternRefs referenced by Wasm activations. Calling a Wasm function and passing it a VMExternRef moves the VMExternRef into the table, and the compiled Wasm function logically "borrows" the VMExternRef from the table. Similarly, global.get and table.get operations clone the gotten VMExternRef into the VMExternRefActivationsTable and then "borrow" the reference out of the table.

    When a VMExternRef is returned to host code from a Wasm function, the host increments the reference count (because the reference is logically "borrowed" from the VMExternRefActivationsTable and the reference count from the table will be dropped at the next GC).

    For more general information on deferred reference counting, see An Examination of Deferred Reference Counting and Cycle Detection by Quinane: https://openresearch-repository.anu.edu.au/bitstream/1885/42030/2/hon-thesis.pdf

    cc #929

    Fixes #1804

    Depends on https://github.com/rust-lang/backtrace-rs/pull/341

    wasmtime:api cranelift 
    opened by fitzgen 33
  • memfd/madvise-based CoW pooling allocator

    Add a pooling allocator mode based on copy-on-write mappings of memfds.

    As first suggested by Jan on the Zulip here [1], a cheap and effective way to obtain copy-on-write semantics of a "backing image" for a Wasm memory is to mmap a file with MAP_PRIVATE. The memfd mechanism provided by the Linux kernel allows us to create anonymous, in-memory-only files that we can use for this mapping, so we can construct the image contents on-the-fly then effectively create a CoW overlay. Furthermore, and importantly, madvise(MADV_DONTNEED, ...) will discard the CoW overlay, returning the mapping to its original state.

    By itself this is almost enough for a very fast instantiation-termination loop of the same image over and over, without changing the address space mapping at all (which is expensive). The only missing bit is how to implement heap growth. But here memfds can help us again: if we create another anonymous file and map it where the extended parts of the heap would go, we can take advantage of the fact that a mmap() mapping can be larger than the file itself, with accesses beyond the end generating a SIGBUS, and the fact that we can cheaply resize the file with ftruncate, even after a mapping exists. So we can map the "heap extension" file once with the maximum memory-slot size and grow the memfd itself as memory.grow operations occur.

    The above CoW technique and heap-growth technique together allow us a fastpath of madvise() and ftruncate() only when we re-instantiate the same module over and over, as long as we can reuse the same slot. This fastpath avoids all whole-process address-space locks in the Linux kernel, which should mean it is highly scalable. It also avoids the cost of copying data on read, as the uffd heap backend does when servicing pagefaults; the kernel's own optimized CoW logic (same as used by all file mmaps) is used instead.
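
    A sketch of the primitives behind this fastpath, using the libc crate on Linux; the function names and omitted error handling are illustrative rather than Wasmtime's actual implementation:

    // MAP_PRIVATE gives a copy-on-write view of the memfd-backed image.
    unsafe fn map_image(image_fd: libc::c_int, image_len: usize) -> *mut u8 {
        libc::mmap(
            std::ptr::null_mut(),
            image_len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE,
            image_fd,
            0,
        )
        .cast()
    }

    // Instantiate-terminate fastpath: discard the dirty CoW pages so the mapping
    // reads the pristine image again, without changing the address space at all.
    unsafe fn reset_slot(addr: *mut u8, image_len: usize) {
        libc::madvise(addr.cast(), image_len, libc::MADV_DONTNEED);
    }

    // memory.grow on the separate extension mapping: just extend the file; the
    // existing larger-than-file mapping picks the new pages up.
    unsafe fn grow_heap(extension_fd: libc::c_int, new_len: libc::off_t) {
        libc::ftruncate(extension_fd, new_len);
    }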

    There are still a few loose ends in this PR, which I intend to tie up before merging:

    • There is no InstanceAllocationStrategy yet that attempts to actually reuse instance slots; that should be added ASAP. For testing so far, I have just instantiated the same one module repeatedly (so reuse naturally occurs).

    • The guard-page strategy is slightly wrong; I need to implement the pre-heap guard region as well. This will be done by performing another mapping once, to reserve the whole address range, then mmap'ing the image and extension file on top at appropriate offsets (2GiB, 2GiB plus image size).

    Thanks to Jan on Zulip (are you also @koute from #3691?) for the initial idea/inspiration! This PR is meant to demonstrate my thoughts on how to build the feature and spawn discussion; now that we see both approaches hopefully we can work out a way to meet the needs of both of our use-cases.

    [1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772

    wasmtime:api 
    opened by cfallin 32
  • support a few DWARF-5 only features

    See #932.

    • accept and pass DebugAddrIndex, DebugStrOffsetsIndex attributes
    • skip DebugAddrBase, DebugStrOffsetsBase attribute when transforming, these are managed by the compilation unit elsewhere
    • accept and resolve DebugLineStrRef in line programs
    • read .debug_addr
    • read .debug_rnglists
    • read .debug_loclists
    • read .debug_line_str
    • read .debug_str_offsets
    • perform the DebugAddrIndex and DebugStrOffsetsIndex indirections

    TODO:

    • [x] tests (added DWARF-5 test, but it needs a refresh, lldb test also needed).
    opened by ggreif 31
  • Implement path_link for Windows.

    This is probably the last missing syscall for Windows!

    This PR implements path_link for Windows and adds a non-strict version of the path_link integration test.

    I'm unsure about the error handling in path_link. MSDN doesn't say much about possible error codes for either CreateHardLinkA or CreateSymbolicLinkA. I mostly copied over the error conversion from path_symlink, but I'm not sure if it's correct. In particular, it's unclear to me what the purpose of strip_trailing_slashes_and_concatenate is.

    path_symlink will now also detect an attempt to create a dangling symlink and return ENOTSUP (is this the correct return code?).

    Currently the non-strictness of the test consists of:

    1. we use separate subdirectories subdir, subdir2, subdir3 for each test stage. This is due to the fact that Windows will not remove the directory and won't allow creating a directory with the same name until the previous one has been deleted. I don't see any way of circumventing this, because the application may still try to access the directory through the unclosed file descriptor.
    2. path_link will return EACCES instead of EPERM when trying to create a link to a subdirectory. This violates the POSIX spec. We could manually check if the source path is a directory in case of ERROR_ACCESS_DENIED but this would cost us an extra syscall.
    3. Tests for dangling symlinks or symlink loops have been disabled. Alternatively, we could check if the attempt to create a dangling symlink returns ENOTSUP, but this doesn't make much sense while 1&2 are an issue.

    Let me know what you think.

    Btw. @kubkon, according to this stackoverflow post Mac OS X 10.5+ permits hard links to directories, which our tests expect to fail.


    Notes about links and symlinks under Windows:

    • creating a symlink requires administrative privileges (SeCreateSymbolicLinkPrivilege). On Windows 10 this requirement may be removed, but this requires enabling developer mode
    • Windows distinguishes between file and directory symlinks
    • It's possible to create a dangling symlink, but the type (file/directory) has to be specified upon creation. The behavior in case of type mismatch is inconsistent. Precisely, suppose that a dangling file symlink is created foo -> bar and later, a directory bar is created. Then:
      • under msys64 bash, cd foo succeeds and the directory view is the same when accessed either directly or through the symlink
      • under cmd (both windowed and as a child process from msys64 bash), cd foo fails with The directory name is invalid
    wasi:impl wasi:tests wasi 
    opened by marmistrz 31
  • Implement lazy funcref table and anyfunc initialization.

    During instance initialization, we build two sorts of arrays eagerly:

    • We create an "anyfunc" (a VMCallerCheckedAnyfunc) for every function in an instance.

    • We initialize every element of a funcref table with an initializer to a pointer to one of these anyfuncs.

    Most instances will not touch (via call_indirect or table.get) all funcref table elements. And most anyfuncs will never be referenced, because most functions are never placed in tables or used with ref.func. Thus, both of these initialization tasks are quite wasteful. Profiling shows that a significant fraction of the remaining instance-initialization time after our other recent optimizations is going into these two tasks.

    This PR implements two basic ideas:

    • The anyfunc array can be lazily initialized as long as we retain the information needed to do so. A zero in the func-ptr part of the tuple means "uninitialized"; a null-check and slowpath does the initialization whenever we take a pointer to an anyfunc.

    • A funcref table can be lazily initialized as long as we retain a link to its corresponding instance and function index for each element. A zero in a table element means "uninitialized", and a slowpath does the initialization.

    The use of all-zeroes to mean "uninitialized" means that we can use fast memory clearing techniques, like madvise(DONTNEED) on Linux or just freshly-mmap'd anonymous memory, to get to the initial state without a lot of memory writes.

    Funcref tables are a little tricky because funcrefs can be null. We need to distinguish "element was initially non-null, but user stored explicit null later" from "element never touched" (ie the lazy init should not blow away an explicitly stored null). We solve this by stealing the LSB from every funcref (anyfunc pointer): when the LSB is set, the funcref is initialized and we don't hit the lazy-init slowpath. We insert the bit on storing to the table and mask it off after loading.
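
    A sketch of that tagging scheme (illustrative only, not the exact code Cranelift emits):

    // Table elements are pointer-sized words; all zeroes means "never initialized".
    const FUNCREF_INIT_BIT: usize = 1;

    // On table store: mark the element as initialized, even if the funcref is null.
    fn tag_for_store(anyfunc_ptr: usize) -> usize {
        anyfunc_ptr | FUNCREF_INIT_BIT
    }

    // On table load: zero routes to the lazy-init slowpath; otherwise mask the bit
    // off to recover the real anyfunc pointer (which may be an explicitly stored null).
    fn load(elem: usize) -> Option<usize> {
        if elem == 0 {
            None // lazy-init slowpath
        } else {
            Some(elem & !FUNCREF_INIT_BIT)
        }
    }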

    Performance effect on instantiation in the on-demand allocator (pooling allocator effect should be similar as the table-init path is the same):

    sequential/default/spidermonkey.wasm
                            time:   [71.886 us 72.012 us 72.133 us]
    
    sequential/default/spidermonkey.wasm
                            time:   [22.243 us 22.256 us 22.270 us]
                            change: [-69.117% -69.060% -69.000%] (p = 0.00 < 0.05)
                            Performance has improved.
    

    So, 72µs to 22µs, or a 69% reduction.

    wasmtime:api cranelift cranelift:area:machinst cranelift:wasm cranelift:area:x64 
    opened by cfallin 28
  • Debug a wasm application with reasonable amount of RAM

    I'm trying to run a large (~15 MB) wasm application that crashes on ud2. Is there any way to determine which function the crash address corresponds to?

    opened by whitequark 28
  • Support records, variants, enums, unions, and flags in the component model

    I'm splitting this issue out of https://github.com/bytecodealliance/wasmtime/issues/4185 to write up some thoughts on how this can be done. Specifically, today the current Wasmtime support for the component model has mappings from many component model types to native Rust types, but not all of them. For example, integers, strings, lists, tuples, etc. are all mapped directly to Rust types. Basically, if a component model type's Rust equivalent is in the Rust standard library, that's already implemented. What that leaves to implement, however, is Rust-defined mappings for component model types that are "structural", like records.

    This issue is intended to document the current thinking of how we're going to expose this. The general idea is that we'll create a proc-macro crate, probably named something like wasmtime-component-macro, which is an internal dependency of the wasmtime crate. The various macros would then get reexported at the wasmtime::component::* namespace.

    Currently the bindings for host types are navigated through three traits: ComponentValue, Lift, and Lower. We'll want a custom derive for all three of these traits. Deriving Lift and Lower require a ComponentValue derive as well, but users should be able to pick one of Lift and Lower without the other one.

    record

    Records in the component model correspond to structs in Rust. The rough shape of this will be:

    use wasmtime::component::{ComponentValue, Lift, Lower};
    
    #[derive(ComponentValue, Lift, Lower)]
    #[component(record)]
    struct Foo {
        #[component(name = "foo-bar-baz")]
        a: i32,
        b: u32,
    }
    

    For now, to typecheck correctly, the record type must list its fields in the same order as the fields in the Rust code. Field reordering may be implemented at a later date, but for now we'll do strict matching. Fields must have both matching names and matching types.

    The #[component(record)] here may seem redundant but it's somewhat required below for variants/enums.

    The #[component(name = "...")] is intended to rename the field from the component model's perspective. The type-checking will test against the name specified.

    Using this derive on a tuple or empty struct will result in a compile-time error.

    variant

    Variants roughly correspond to Rust enums:

    use wasmtime::component::{ComponentValue, Lift, Lower};
    
    #[derive(ComponentValue, Lift, Lower)]
    #[component(variant)]
    enum Foo {
        #[component(name = "foo-bar-baz")]
        A(u32),
        B,
    }
    

    Typechecking, as with records, checks cases in order, and all cases must match in both name and payload. A missing payload in Rust is automatically interpreted as the unit payload in the component model.

    Variants with named fields (B { bar: u32 }) will be disallowed. Variants with multiple payloads (B(u32, u32)) will also be disallowed.

    Note that #[component(variant)] here distinguishes it from...

    enum

    use wasmtime::component::{ComponentValue, Lift, Lower};
    
    #[derive(ComponentValue, Lift, Lower)]
    #[component(enum)]
    enum Foo {
        #[component(name = "foo-bar-baz")]
        A,
        B,
    }
    

    Typechecking is similar to variants where the number/names of cases must all match.

    Variants with any payload are disallowed in this derive mode.

    union

    This will, perhaps surprisingly, still map to an enum in Rust since this is still a tagged union, not a literal C union:

    use wasmtime::component::{ComponentValue, Lift, Lower};
    
    #[derive(ComponentValue, Lift, Lower)]
    #[component(union)]
    enum Foo {
        A(u32),
        B(f32),
    }
    

    The number of cases and the types of each case must match a union definition to correctly typecheck. Union cases don't have names so renaming here isn't needed.

    A payload on each enum case in Rust is required, and as with variant it must be a tuple-variant with exactly one element. All other forms of payloads are disallowed. Note that the case names in Rust are purely informative; they don't affect the ABI or type-checking.

    flags

    These will be a bit "funkier" than the above since there's not something obvious to attach a #[derive] to:

    wasmtime::component::flags! {
        #[derive(Lift, Lower)]
        flags Foo {
            #[component(name = "...")]
            const A;
            const B;
            const C;
        }
    }
    

    The general idea here is to roughly take inspiration from the bitflags crate in terms of what the generated code does. Ideally this should have a convenient Debug implementation along with various constants to OR together and such in Rust. The exact syntax here is up for debate; this is just a strawman.

    Implementation Details

    One caveat is that the ComponentValue/Lift/Lower traits mention internal types in the wasmtime crate which aren't intended to be part of the public API. To solve this the macro will reference items in a path such as:

    wasmtime::component::__internal::the_name
    

    The __internal module will be #[doc(hidden)] and will only exist to reexport dependencies needed by the proc-macro. This may end up being a blanket pub use wasmtime_environ or reexports of individual items, whatever works best.
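
    A minimal sketch of what that hidden module could look like (the exact reexports here are assumptions for illustration, not a final list):

    // Inside the `wasmtime` crate's `component` module: the proc-macro emits
    // paths like `wasmtime::component::__internal::the_name`, so internal
    // types never have to appear in the documented public API.
    #[doc(hidden)]
    pub mod __internal {
        // Hypothetical reexport; the real set is whatever the derives need.
        pub use wasmtime_environ;
    }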

    The actual generated trait impls will probably look very similar to the implementations that already exist for tuples and Result<T, E> in typed.rs.

    Alternatives

    One alternative to the above is to have #[derive(ComponentRecord)] instead of #[derive(ComponentValue)] #[component(record)], or something like that (sketched below). While historically some discussions have leaned in this direction with the introduction of the Lift and Lower traits, I personally feel that the balance now tips slightly the other way: it would be nice to keep derive targeted at the specific traits, with configuration for the derive happening afterwards.
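
    For concreteness, that alternative shape would look roughly like this (purely illustrative; ComponentRecord is not a trait that exists today):

    #[derive(ComponentRecord)]
    struct Foo {
        a: i32,
        b: u32,
    }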

    wasm-proposal:component-model 
    opened by alexcrichton 27
  • Small wasm modules taking excessive amounts of time to compile

    Small wasm modules taking excessive amounts of time to compile

    In reviewing some fuzz bugs we've got a good number of test cases that end up timing out unfortunately. I believe that cranelift has a number of known issues about the speed of its compilation, particularly around register allocation. I figure it'd be good to collect a few concrete wasm files (discovered from fuzzing) which take an abnormally long amount of time to compile compared to the size of the input.

    It's worth noting that the timeout on the fuzzers is relatively high, I think something like 30 or 60 seconds. When fuzzing, though, binaries can be up to 50x slower, which means our time budget for passing the fuzzers is pretty small, generally less than 3 seconds I think (ish). It's also worth noting that fuzzers are compiled with debug assertions enabled, which enables, well, debug assertions, but also the cranelift verifier pass. I've seen the verifier pass be quite expensive on some of these modules below, but in general the modules still take an abnormally long time to compile even without the verifier pass.

    For the files below I'm testing with:

    $ cargo build --release && time ./target/release/wasmtime --disable-cache ./foo2.wasm 
    

    Most of them fail to instantiate or run, but it's the compilation that largely matters here, all of which happens as part of time. Which is to say, you can generally ignore the errors from the CLI.

    • file1.wasm[1].gz - takes 1s locally. Profiling shows lots of time in the register allocator.
    • file2.wasm[1].gz - this takes 300ms locally, but with the verifier and debug assertions enabled takes about 1s. The time looks to be largely in the verifier/register allocator like before.
    • file3.wasm[1].gz - same as previous
    • file4.wasm[1].gz - same as previous

    I'm assuming that there's generally not a huge amount we can do about this. When cranelift is looking to benchmark new register allocator implementations, however, we can perhaps use the files here as test beds to see how things are improving?

    In any case I figure it's good to start tracking files as we come across them if we can.

    cranelift cranelift:goal:compile-time performance 
    opened by alexcrichton 27
  • Draft: I128 support (partial) on x64.

    Draft: I128 support (partial) on x64.

    This PR generalizes all of the MachInst framework to reason about SSA Values as being located in multiple registers (one, two or four, currently, in an efficient packed form). This is necessary in order to handle CLIF with values wider than the machine width (I128 on 64/32-bit machines and I64 on 32-bit machines), unless we legalize it beforehand.

    It also adds support for some basic 128-bit ALU ops to the x64 backend, lowering these directly to open-coded instruction sequences (add/adc, sub/sbb, etc.).
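
    As a rough illustration of what "open-coded" means for the add/adc case, the decomposition of a 128-bit add into 64-bit halves looks like this (plain Rust to show the idea, not the actual Cranelift lowering code):

    // Add the low halves first, then add the high halves plus the carry out
    // of the low add; this is exactly the add/adc pair on x64.
    fn add128(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
        let (a_lo, a_hi) = a;
        let (b_lo, b_hi) = b;
        let (lo, carry) = a_lo.overflowing_add(b_lo);
        let hi = a_hi.wrapping_add(b_hi).wrapping_add(carry as u64);
        (lo, hi)
    }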

    @julian-seward1 and @bnjbvr: this is the approach we had discussed a long time ago (in January, I think!). It follows from the "every backend accepts the same IR" philosophy, with the idea that maybe we will get to the point where legalization is largely not necessary.

    However: I must say that I'm not super-happy with the level of complexity this has added to the framework. The fact that the work we do in the x64 backend to support this will have to be repeated on aarch64 is kind of unfortunate; and this all feels somewhat silly given that we still have the legalization framework's narrowing support, and could use that instead. Philosophically, I think that legalization is actually the right approach here: we should be able to factor out "general machine-independent algorithm for 128-bit multiply with 64-bit pieces" from the specific machine backends.

    So I'm inclined not to go in this direction, but (i) want to see what thoughts anyone might have, and (ii) save this for the record, in case we wire up the old legalizations for now but reconsider or need multi-reg values in the future.

    cranelift cranelift:area:machinst cranelift:area:aarch64 cranelift:area:x64 
    opened by cfallin 26
  • WIP: egraph-based midend: draw the rest of the owl (productionized).

    WIP: egraph-based midend: draw the rest of the owl (productionized).

    This PR is a draft of an updated version of the egraph patch (and thus supersedes #4249) with the two parts already merged (multi-etors and the egraph crate proper) removed; it includes the Cranelift integration, the egraph build (CLIF to egraph) and elaboration (egraph to CLIF) algorithms, and rule application engine, as well as a set of rewrite rules that replaces the existing mid-end optimizations.

    It still needs a bit more productionizing:

    • removal of recursion in elaboration;
    • removal of recursion in rule application (This one is trickier! Immediate rule application on the sub-nodes created from constructors means more than one ISLE invocation can be on the stack, in a reentrant way. My thought is to use a sort of workqueue to "unstack" it; see the sketch after this list.);
    • generalization of the several ad-hoc egraph analyses (loop depth, etc) into a framework.
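
    A minimal, generic sketch of the "unstack via workqueue" idea from the second bullet above (the names and types here are made up for illustration; they are not the actual egraph or ISLE-generated APIs):

    // Instead of applying rules to newly created sub-nodes immediately
    // (which re-enters ISLE and grows the call stack), push them onto an
    // explicit worklist and drain it in a loop.
    struct NodeId(u32);

    fn apply_rules_iteratively(
        roots: Vec<NodeId>,
        mut apply_once: impl FnMut(NodeId) -> Vec<NodeId>,
    ) {
        let mut worklist = roots;
        while let Some(node) = worklist.pop() {
            // One non-reentrant round of rewrites; any sub-nodes it creates
            // go back on the worklist rather than being processed recursively.
            worklist.extend(apply_once(node));
        }
    }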

    The purpose of this draft PR is to be a place to do this work on a rebased and up-to-date basis. (Lots happened since the original egraph work branched off in May, including incremental compilation and a good number of smaller changes.)

    While patch-wrangling this week, I tried pulling this apart into smaller pieces, but the remaining bits are pretty cyclically entangled, and/or some of the intermediate points that might make sense (e.g. egraph build and elaboration without rule application) require re-synthesizing some scaffolding that would then disappear in the final state, so that seems a bit counterproductive. Once we have a polished state I can try pulling it apart into separate logical commits at least.

    cranelift cranelift:meta cranelift:area:aarch64 cranelift:area:x64 isle 
    opened by cfallin 1
  • wasi-parallel: implement CPU parallelism

    wasi-parallel: implement CPU parallelism

    This change implements wasi-parallel in Wasmtime. It addresses only CPU parallelism (not GPU), since the GPU side would introduce additional complexity to an already difficult review (if you're interested in the GPU progress, talk to @egalli). Each commit implements a separate piece of the puzzle. Though most of the changed lines are not the core implementation (tests, LICENSE, etc.), due to the size of the PR it may be convenient to review this commit by commit.

    I see several issues highlighted by this change:

    • lacking toolchain support: in the absence of any other feasible path, the WebAssembly bytes of the kernel code to be executed in parallel are embedded in the host WebAssembly module itself. This is problematic for many reasons, not least of which is the ability for the host WebAssembly module to tamper with the kernel (on the flip side: an intra-module JIT compiler feature!). But that is not the end of it: toolchain support is also lacking simply to produce modules that use atomics and shared memory. For C programs: https://github.com/WebAssembly/wasi-libc/issues/326. For Rust programs: https://github.com/rust-lang/rust/issues/102157. Due to all this, most tests and examples in this crate are meticulously hand-crafted WAT files — once toolchain support improves these difficult-to-maintain files should be replaced.
    • uncomfortable Wasmtime/Wiggle APIs: this implementation of wasi-parallel works by creating new instances of the kernel in each thread of a thread pool — the host module exports a shared memory and the kernel imports it. Getting access to the shared memory from within a wasi-parallel invocation is difficult with Wasmtime/Wiggle: the best way I could think of to resolve this is to teach Wiggle to skip a WITX function and then manually implement add_to_linker for that function (see the sketch after this list). This is tedious and error-prone, so I would be excited to discuss better solutions. Note that an almost identical version of this problem will arise when I try to implement wasi-threads.
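
    In the most generic terms, that manual registration looks something like the following sketch (the module name, function name, and signature here are placeholders, not the actual wasi-parallel interface):

    use wasmtime::{Caller, Engine, Linker};

    fn register_host_func() -> anyhow::Result<()> {
        let engine = Engine::default();
        let mut linker: Linker<()> = Linker::new(&engine);
        // Skip the generated binding for this one function and register it by
        // hand, so the host callback can reach state the generated glue can't
        // express (e.g. the exported shared memory).
        linker.func_wrap(
            "wasi_parallel",   // placeholder module name
            "parallel_exec",   // placeholder function name
            |_caller: Caller<'_, ()>, x: i32| -> i32 { x },
        )?;
        Ok(())
    }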

    Why now? The rationale behind merging this code behind the wasi-parallel feature flag is to enable users to try this out locally and to iterate on this in-tree (e.g., toolchain, GPU parts).

    opened by abrown 0
  • Implement x64 vector const with aligned loads

    Implement x64 vector const with aligned loads

    One other thought that came back to me while looking at this was that I really wanted to implement vconst (and all other constant accesses for vectors) with the aligned version (MOVDQA, MOVAPS, MOVAPD) instead of the unaligned version we currently use. At the time, it was impossible to ensure that the vector constants were actually aligned so these aligned loads would have trapped--that's why I abandoned that thought, IIRC. But now that may be more possible (?). Now, I'm not sure if this will result in any great latency improvement on most CPUs we use today, but I did wonder if it might improve the cache line story. Just a thought...

    Originally posted by @abrown in https://github.com/bytecodealliance/wasmtime/issues/2399#issuecomment-1235016953

    enhancement cranelift:goal:optimize-speed cranelift:area:x64 
    opened by jameysharp 0
  • fix: check filetype in `path_open`

    fix: check filetype in `path_open`

    Using the directory open flag to determine whether the entity being opened is a directory or a file is undefined behavior, since that means that the only way to acquire a usable directory is to specify the flag.

    Instead of relying on the open flag value to determine the filetype, issue get_path_filestat first to determine it, and then perform the appropriate operation depending on the value.

    ~Note that this also slightly changes the behavior, where create open flag is not dropped anymore, but rather passed through to the open_dir call.~

    @pchickey seems to be the last person making any significant changes in this method, so assigning him for review

    wasi 
    opened by rvolosatovs 2
  • Port branches to ISLE (AArch64)

    Port branches to ISLE (AArch64)

    Ported the existing implementations of the following opcodes for AArch64 to ISLE:

    • Brz
    • Brnz
    • Brif
    • Brff
    • BrIcmp
    • Jump
    • BrTable

    Copyright (c) 2022 Arm Limited

    cranelift cranelift:area:machinst cranelift:area:aarch64 
    opened by dheaton-arm 1
  • ISLE: Resolve overlap in prelude.isle and x64/inst.isle

    ISLE: Resolve overlap in prelude.isle and x64/inst.isle

    Resolve overlap in the ISLE prelude and the x64 inst module by introducing new types that allow better sharing of extractor results, or falling back on priorities.

    This PR makes the following changes in overlap counts for the different backends:

    | branch | x64 | aarch64 | s390x |
    | ------ | --- | ------- | ----- |
    | main   | 168 | 214     | 446   |
    | this   | 138 | 212     | 440   |

    cranelift cranelift:area:machinst cranelift:area:x64 isle 
    opened by elliottt 1
Releases(dev)
Owner
Bytecode Alliance
A (very experimental) WebAssembly backend for Cranelift.

cranelift_codegen_wasm Experimental code generation for WebAssembly from Cranelift IR. note: not ready for usage yet Setup Contains an item called Was

Teymour Aldridge 5 Aug 17, 2022
A standalone Forth interpreter/compiler for WebAssembly.

ForSM A standalone Forth interpreter/compiler for WebAssembly. Bootstrapped from a Rust program, but the ultimate goal for it is to be self-hosting. A

Simon Gellis 5 Jun 15, 2022
🚀Wasmer is a fast and secure WebAssembly runtime that enables super lightweight containers to run anywhere

Wasmer is a fast and secure WebAssembly runtime that enables super lightweight containers to run anywhere: from Desktop to the Cloud, Edge and IoT devices.

Wasmer 13k Sep 23, 2022
WebAssembly to Lua translator, with runtime

This is a WIP (read: absolutely not ready for serious work) tool for translating WebAssembly into Lua. Support is specifically for LuaJIT, with the se

null 36 Sep 20, 2022
Lunatic is an Erlang-inspired runtime for WebAssembly

Lunatic is a universal runtime for fast, robust and scalable server-side applications. It's inspired by Erlang and can be used from any language that

Lunatic 3.3k Sep 19, 2022
A prototype WebAssembly linker using module linking.

WebAssembly Module Linker Please note: this is an experimental project. wasmlink is a prototype WebAssembly module linker that can link together a mod

Peter Huene 18 Aug 18, 2022
Zaplib is an open-source library for speeding up web applications using Rust and WebAssembly.

⚡ Zaplib Zaplib is an open-source library for speeding up web applications using Rust and WebAssembly. It lets you write high-performance code in Rust

Zaplib 1.2k Sep 26, 2022
A template for kick starting a Rust and WebAssembly project using wasm-pack.

A template for kick starting a Rust and WebAssembly project using wasm-pack.

Haoxi Tan 1 Feb 14, 2022
Client for integrating private analytics in fast and reliable libraries and apps using Rust and WebAssembly

TelemetryDeck Client Client for integrating private analytics in fast and reliable libraries and apps using Rust and WebAssembly The library provides

Konstantin 2 Apr 20, 2022
Lumen - A new compiler and runtime for BEAM languages

An alternative BEAM implementation, designed for WebAssembly

Lumen 3k Sep 19, 2022
A high-performance, secure, extensible, and OCI-compliant JavaScript runtime for WasmEdge.

Run JavaScript in WebAssembly Now supporting wasmedge socket for HTTP requests and Tensorflow in JavaScript programs! Prerequisites Install Rust and w

Second State 166 Sep 27, 2022
Wasm runtime written in Rust

Wasm runtime written in Rust

Teppei Fukuda 1 Oct 29, 2021
Sealed boxes implementation for Rust/WebAssembly.

Sealed boxes for Rust/WebAssembly This Rust crate provides libsodium sealed boxes for WebAssembly. Usage: // Recipient: create a new key pair let reci

Frank Denis 16 Aug 28, 2022
WebAssembly on Rust is a bright future in making application runs at the Edge or on the Serverless technologies.

WebAssembly Tour WebAssembly on Rust is a bright future in making application runs at the Edge or on the Serverless technologies. We spend a lot of ti

Thang Chung 117 Sep 5, 2022
WebAssembly modules that use Azure services

This is an experimental repository containing WebAssembly modules running on top of WAGI (WebAssembly Gateway Interface, which allows you to run WebAssembly WASI binaries as HTTP handlers) and using Azure services.

null 7 Apr 18, 2022
WebAssembly Service Porter

WebAssembly Service Porter.

henrylee2cn 11 Sep 12, 2022
WAGI: WebAssembly Gateway Interface

Write HTTP handlers in WebAssembly with a minimal amount of work

null 633 Sep 30, 2022
A console and web-based Gomoku written in Rust and WebAssembly

rust-gomoku A console and web-based Gomoku written in Rust and WebAssembly Getting started with cargo & npm Install required program, run # install

namkyu1999 2 Jan 4, 2022
WebAssembly development with Trunk & Vite.js

Trunk & Vite.js Demo Trunk is a WASM web application bundler for Rust, and Vite.js is next Generation Frontend Tooling. Ok, they are together now for

Libing Chen 6 Nov 24, 2021