Proposal: HW-Specialized WebAssembly #1528

Open
abrown opened this issue Sep 12, 2024 · 6 comments
@abrown

abrown commented Sep 12, 2024

This proposal describes a mechanism for extending WebAssembly with access to hardware primitives for
improving WebAssembly’s performance. Its structure begins with the motivation and relevant
background (“why?”) and proceeds to discuss options for implementing this mechanism (“how?”). Some
issues are still up for debate and are collected in the open questions section at the end.

Why?

WebAssembly “aims to execute at native speed by taking advantage of common hardware capabilities”
(https://webassembly.org). The “common hardware capabilities” referred to
have, in many cases, been sufficiently “common” for this to be true. But WebAssembly is now used in
domains where native primitives are no longer “common” hardware: e.g., native execution of machine
learning code enjoys support from matrix multiplication primitives like AMX, wider vector sizes like
AVX-512, uncommon types like FP16, and even non-CPU accelerators, like GPUs and NPUs.

Problem

The WebAssembly specification cannot extend itself to support all hardware primitives:

  • First, there is less and less common ground: the “common hardware capabilities” WebAssembly
    was initially built on relied on half a century of CPU development. More recent HW features can be
    quite distinct between HW vendors; it becomes difficult or impossible to paper over the semantic
    differences (e.g., see relaxed SIMD).
  • Second, the rate of development is too fast: HW companies experiment with new features and the
    WebAssembly specification cannot change at the same rate. And, it could be argued, it should
    not: some new HW features here today may not be here tomorrow and WebAssembly would,
    unfortunately, be stuck with them.
  • Third, the complexity increase is undesirable: WebAssembly is used in many environments (e.g.,
    embedded) in which additional performance may not always be worth the additional implementation
    effort (e.g., some engines refuse to support SIMD, GC, etc.). Each new feature complicates the
    WebAssembly specification itself, raising the burden to maintain it.

Use Cases

An extension to allow HW-specialized WebAssembly could:

  • Allow experimentation with new features (e.g., FP16) before they are engraved in the
    specification; this would allow for soliciting user feedback and realistic performance numbers
    without the overhead of the CG process.
  • Optimize cryptographic functions by allowing engines to use native HW features (e.g., compiling
    libssl with AESNI support).
  • Optimize machine learning algorithms (e.g., XNNPACK kernels using FP16, AVX512, AMX).
  • Retain program intent: e.g., MLIR expresses “multi-level” interactions within a program; ML
    compilers (Triton, IREE, ONNX, etc.) that target WebAssembly lose critical information about
    tensor manipulation when squeezed into current WebAssembly.

Objections

The primary objection is fragmentation. If WebAssembly can be extended in any conceivable way,
what is “standard” WebAssembly then? A HW-specialized module relying on, e.g., a specific ARM
instruction cannot be executed on every engine; it is no longer portable.

This proposal suggests a mechanism to solve this — built-in functions with WebAssembly
fallbacks. We argue that fragmentation is both inevitable and a present reality, but that it can
be addressed in a portable way that doesn’t burden the core specification with the union of all
possible instruction sets. We see hardware ISA fragmentation as inevitable in the sense that
businesses relying on WebAssembly will want to use the HW they purchased — efficiency is a
steady force towards environment-specific code.

Secondly, we see software ecosystem fragmentation as a different but related problem. Presently, a
WebAssembly module compiled to run on Amazon’s embedded engine will surely not instantiate in
Emscripten’s browser environment, Fastly’s Compute@Edge, DFINITY, etc. — the imports simply do
not line up! Software fragmentation already exists. We, however, claim that solutions for SW
fragmentation do not solve the HW fragmentation problem. E.g., WASI requires a canonical ABI for
guest-host boundary calls; what we propose here could even allow inlining single HW instructions
in JIT-compiled machine code. And though V8’s builtins are a step in this direction, they do not go
far enough, as we will see.

How?

Our proposal is an annotation mechanism for marking WebAssembly functions and types as built-ins
which engines use to emit semantically equivalent HW instructions. It mainly relies on conventions
(e.g., tool-conventions) but may require slight changes to the WebAssembly specification. Beyond
these tweaks, it requires no spec involvement when creating a new built-in.

Built-in functions

A cryptographic example:

(func (@builtin "libssl" "vaes_gcm_setiv")
  (param $ctx i32) (param $iv i32) (param $ivlen i32)
  (result i32)
  <function body>)

We propose that toolchains use the custom annotations proposal to emit @builtin functions;
programmers express this intent in source code, e.g., with C attributes. Engines may use this
annotation to replace the WebAssembly function body with a HW-specialized body, optimizing it in a
semantics-preserving way (more on this later). For portability, engines that choose not to implement
built-ins simply execute the fallback function body as before.

We expect each built-in function group (e.g., libssl above) to provide an is_available function
returning 0 or 1 to query whether the engine will optimize the built-in (i.e., inline the
HW-specialized machine code) or not (i.e., run the slower fallback code). The fallback code for this
must always return 0.

(func (@builtin "libssl" "is_available")
  (result i32)
  (i32.const 0))
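
From the application side, the availability check could drive a one-time dispatch. The following is a minimal sketch in C, not part of any toolchain: `libssl_is_available`, `setiv_via_builtin`, `setiv_portable`, and `select_setiv` are hypothetical names, and the two placeholder bodies only report which path ran.

```c
#include <stdint.h>

/* Hypothetical C view of (@builtin "libssl" "is_available"). This is the
 * fallback body, so it returns 0; an engine that implements the "libssl"
 * group would substitute code that returns 1. */
static int32_t libssl_is_available(void) { return 0; }

/* Signature mirrors the vaes_gcm_setiv example: (ctx, iv, ivlen) -> i32. */
typedef int32_t (*setiv_fn)(int32_t ctx, int32_t iv, int32_t ivlen);

/* Placeholder bodies: a real module would call the @builtin function in the
 * first and an alternate portable algorithm in the second. The return values
 * only identify which path ran. */
static int32_t setiv_via_builtin(int32_t ctx, int32_t iv, int32_t ivlen) {
    (void)ctx; (void)iv; (void)ivlen;
    return 1;
}

static int32_t setiv_portable(int32_t ctx, int32_t iv, int32_t ivlen) {
    (void)ctx; (void)iv; (void)ivlen;
    return 2;
}

/* Query once, then route every call through the chosen implementation. */
static setiv_fn select_setiv(void) {
    return libssl_is_available() ? setiv_via_builtin : setiv_portable;
}
```

Because the fallback always returns 0, a module running on an engine without the built-in group takes the portable path without any further checks.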

Built-in types

We additionally consider HW representations that have no equivalent Wasm representation today. The
problem with many HW features is that they require types unknown to WebAssembly. Examples include
smaller-precision floats (fp16), different-sized integers (i1, i8, i128), wider vectors (v256,
v512), tensors (!), masks, flags, etc. We propose a mechanism to introduce built-in types. E.g.,
certain operations may need a wider SIMD type:

(type $zmm (@builtin "avx512" "zmm")
  (tuple i64 i64 i64 i64 i64 i64 i64 i64))

As with built-in functions, built-in types are annotated with a @builtin annotation and include a
fallback mechanism for portability. Because (a) fallback WebAssembly code must construct a value of
the type in WebAssembly and (b) to avoid misusing struct in a non-GC sense, we propose a new way to
organize types: a tuple. It would collect several known WebAssembly types under one type index,
including a tuple.new for creation and tuple.get|set for field access. It would have no GC
semantics (each tuple slot occupies a Wasm stack slot) and could even be considered an alias (e.g.,
returning a $zmm could be considered sugar for a multi-value result). But it has an advantage: JIT
compilers can now reason about the built-in type and emit specialized code; e.g., a tuple.get of
$zmm could be lowered to a lane extraction instruction.
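
To make the value semantics concrete, here is a rough C model of the $zmm tuple. It is only an analogy under stated assumptions: `zmm`, `zmm_new`, `zmm_get`, and `zmm_set` are hypothetical names, and modeling tuple.set as returning an updated copy is a guess, since the text above does not say whether set works in place.

```c
#include <stdint.h>

/* Models (tuple i64 i64 i64 i64 i64 i64 i64 i64): eight plain slots, passed
 * and returned by value, with no heap allocation or GC identity. */
typedef struct { uint64_t slot[8]; } zmm;

/* tuple.new: build a value from its eight fields. */
static zmm zmm_new(const uint64_t fields[8]) {
    zmm z;
    for (int i = 0; i < 8; i++) z.slot[i] = fields[i];
    return z;
}

/* tuple.get: read field i (assumed 0 <= i < 8, as a validator would enforce). */
static uint64_t zmm_get(zmm z, int i) { return z.slot[i]; }

/* tuple.set, modeled here as returning an updated copy (an assumption; the
 * proposal does not spell out whether set is in-place). */
static zmm zmm_set(zmm z, int i, uint64_t v) {
    z.slot[i] = v;
    return z;
}
```

An engine that recognizes the built-in type would be free to keep such a value in a single 512-bit register and turn `zmm_get` into a lane extraction.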

This is quite similar to the
type-imports
proposal in that both enforce a representation, but that proposal only allows existing heap types
which won’t work here.

Semantic equivalence

For an engine to correctly replace built-ins with HW-specialized code, it must preserve the
semantics of the fallback code (i.e., the WebAssembly function body and tuple structure). We expect
engines to verify this, but do not mandate a mechanism since this engine-specific decision is open
for experimentation.

One way engines could check that built-in functions and types are equivalent is by “bit by bit”
equivalence. The engine would maintain a map of the fallback code to its HW-specialized equivalent
and, for each @builtin function, would check that the guest’s fallback code matches the engine’s
exactly. An optimization of this idea is to use collision-resistant hashes, though this entails some
discussion of risk.
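
A “bit by bit” check can be as small as a table lookup with an exact byte comparison. The sketch below is illustrative only: `builtin_entry`, `ENGINE_TABLE`, and `match_builtin` are hypothetical names, the table holds a single entry, and the lowering is represented by a label rather than real machine code.

```c
#include <stddef.h>
#include <string.h>

/* One engine-side entry: the exact fallback bytes the engine expects and a
 * label for the lowering it would emit instead (both hypothetical). */
typedef struct {
    const unsigned char *fallback;
    size_t len;
    const char *lowering;
} builtin_entry;

/* Instruction bytes for the is_available fallback: i32.const 0, end. */
static const unsigned char IS_AVAILABLE_FALLBACK[] = {0x41, 0x00, 0x0b};

static const builtin_entry ENGINE_TABLE[] = {
    {IS_AVAILABLE_FALLBACK, sizeof IS_AVAILABLE_FALLBACK, "i32.const 1"},
};

/* Return the lowering if the guest bytes match an entry exactly, else NULL,
 * in which case the engine compiles the guest's fallback body as usual. */
static const char *match_builtin(const unsigned char *guest, size_t len) {
    size_t n = sizeof ENGINE_TABLE / sizeof ENGINE_TABLE[0];
    for (size_t i = 0; i < n; i++) {
        const builtin_entry *e = &ENGINE_TABLE[i];
        if (e->len == len && memcmp(e->fallback, guest, len) == 0)
            return e->lowering;
    }
    return NULL;
}
```

Any single-byte difference in the guest body causes a miss, which is safe (the fallback still runs) but also shows why toolchain changes break this kind of matching.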

We expect “bit by bit” equivalence to be a good starting point, but encourage research along a
different path: checking semantic equivalence. Changes to the toolchain and optimization
settings will break “bit by bit” equivalence, but analyzing the data flow trees into each
WebAssembly effect (global, table, and memory access, along with function calls) should provide a
more robust check.

Yet another solution here is to mix “bit by bit” equivalence with golden fallback libraries. If
the fallback code is kept separate (i.e., in a library), used only via import, and linked together
at instantiation in the engine, it is more likely that the fallback bits are never changed.

Built-in databases

The addition of built-in functions and types could be problematic for engines with JIT compilers: not
all engines have V8-like infrastructure for translating specific function imports to an internal IR.
We envision creating a “database” of HW-specialized templates that engines could integrate into
their JIT-compilation pipeline, e.g., distributed as a library engines could depend on. Each entry
would contain: the fallback WebAssembly code, a HW-specialized machine code template, the
HW-specific features required (e.g., CPUID checks), and a description of how to emit each HW
instruction. The HW-specialized template might look like:

<mem.load> size=16 $addr, $a
VCVTPS2PH $a, $b
PSHUFD $b, $b, $imm
<global.set> $b, $g

The template is not true assembly code: it has (a) meta-instructions for any WebAssembly operations
that only the engine knows how to emit and (b) holes for engine-specific register allocation. During
compilation, a JIT compiler would “fill out” and emit the template for a built-in function rather
than JIT compiling the fallback code. (This approach is inspired by the “Copy and Patch” paper.)
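
Filling the register holes can be pictured as whole-token text substitution. The sketch below works on the textual template only, which is a simplification: a real engine would patch encoded machine code. `fill_hole` is a hypothetical helper, and the register name in the usage note is an arbitrary choice.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Replace every whole-token occurrence of `hole` (e.g. "$a") in `tmpl` with
 * `reg`, writing a NUL-terminated result to `out`. A match only counts when
 * the next character cannot continue an identifier, so "$a" does not clobber
 * "$addr". Returns 0 on success, -1 if `out` (capacity `cap`) is too small. */
static int fill_hole(const char *tmpl, const char *hole, const char *reg,
                     char *out, size_t cap) {
    size_t hlen = strlen(hole), rlen = strlen(reg), n = 0;
    if (cap == 0) return -1;
    while (*tmpl) {
        if (strncmp(tmpl, hole, hlen) == 0 &&
            !(isalnum((unsigned char)tmpl[hlen]) || tmpl[hlen] == '_')) {
            if (n + rlen >= cap) return -1;
            memcpy(out + n, reg, rlen);
            n += rlen;
            tmpl += hlen;
        } else {
            if (n + 1 >= cap) return -1;
            out[n++] = *tmpl++;
        }
    }
    out[n] = '\0';
    return 0;
}
```

For example, filling `$a` with `xmm0` in `<mem.load> size=16 $addr, $a` leaves `$addr` untouched, since the meta-instruction's address operand is a different hole that the engine resolves separately.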

Of course, accepting and emitting machine code from a template database is risky. We encourage
research in developing an automated equivalence checker between the WebAssembly fallback code and
these HW-specialized templates. This would necessarily build on the formal semantics of WebAssembly
and any machine languages (e.g., ARM, x86). This research direction does not block the general idea,
though: engines can (and may even prefer to) manually vet which database entries they choose to
emit.

Versioning

In certain cases, checking the built-in fallback code for semantic equivalence is not enough; in
these cases applications need different code versions. For example, relaxed SIMD added a small,
fixed set of semantic choices; these could have been expressed by versions. Another motivating
example: if WebAssembly’s popcount instruction did not exist, a natural way to express it would be
via different versions, each version representing the different HW semantics. This section proposes
two alternate ways to express different semantics; note that neither necessarily depends on the
built-in infrastructure described above, though we use built-ins as easy examples.

Function-based Versioning

One way to conditionally define semantics is at the function level:

(version "foo" "deterministic" (func $foo (@builtin ...) ...))
(version "foo" "relaxed_x86" (func $foo (@builtin ...) ...))
(version "foo" "relaxed_aarch64" (func $foo (@builtin ...) ...))
(version "foo" "relaxed_..." (func $foo (@builtin ...) ...))

Because the function ID remains the same, call sites (e.g., call $foo) are unchanged by the decision
of which version to use. For WebAssembly modules containing versions, applications may specify which
version to use:

let module = WebAssembly.compile(source, {"foo": "deterministic"})

If no version is chosen, the engine must choose the first one listed in the module.
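
The selection rule can be sketched in a few lines of C. `select_version` is a hypothetical helper, and returning -1 for a requested name that the module does not define is an assumption of this sketch; the text above only covers the case where no version is chosen.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the version to compile, given the version names in
 * module order. `requested` is the application's choice, or NULL for none.
 * Returns -1 when the requested name is absent; the proposal leaves that case
 * unspecified, so treating it as an error is an assumption of this sketch. */
static int select_version(const char *const names[], size_t count,
                          const char *requested) {
    if (count == 0) return -1;
    if (requested == NULL) return 0; /* no choice: first version listed */
    for (size_t i = 0; i < count; i++)
        if (strcmp(names[i], requested) == 0) return (int)i;
    return -1;
}
```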

Block-based Versioning

Another way to conditionally specify semantics is at the block level:

(choice-block "deterministic" (param i32 i32) (result i32) ...
  (elseif "relaxed_x86" ...)
  (elseif "relaxed_aarch64" ...)
  (else "relaxed_..." ...)
)

Toolchains emit choice-block much like they would a regular WebAssembly if-else-end construct.
The difference is that, at compile time, the engine decides to use the first block that matches a
set of version strings. Users pass version strings as boolean flags:

let module = WebAssembly.compile(source, ["deterministic"]);

As before, if no version is selected (e.g., []), the engine must choose the first block.
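
For the block form, the engine walks the blocks in order and takes the first one whose label is among the flags. A minimal sketch follows, with `select_block` as a hypothetical name; falling back to the first block when flags are given but none match is an assumption, since only the empty case is specified above.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the first block whose label appears in `flags`.
 * With no flags, or no matching label, fall back to block 0; the proposal
 * only specifies the empty case, so the no-match behavior is an assumption. */
static size_t select_block(const char *const labels[], size_t nlabels,
                           const char *const flags[], size_t nflags) {
    for (size_t b = 0; b < nlabels; b++)
        for (size_t f = 0; f < nflags; f++)
            if (strcmp(labels[b], flags[f]) == 0) return b;
    return 0;
}
```

Note that block order, not flag order, decides the winner when several flags match.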

Example: half-precision

To understand the concepts above, let’s consider a recent WebAssembly use case that could benefit
from this proposal. The half-precision proposal
adding FP16 SIMD operations to WebAssembly has met resistance in the WebAssembly CG due to the
problem of HW availability: not all users have CPUs that support the optimal lowering of the new
instructions. Nevertheless, FP16 is clearly a valuable feature in certain domains, such as ML
kernels, and would be useful now. This proposal could resolve that tension:

  1. Define a new f16x8 type: the performance benefit of this proposal comes from understanding
    that certain HW has vector registers supporting f16. Since the proposal only allows accessing
    lanes as f32, we could define:

    (type $f16x8 (@builtin "f16x8" "f16x8")
      (tuple f32 f32 f32 f32 f32 f32 f32 f32))
  2. Define f16x8 instructions as built-ins: e.g.,

    (func (@builtin "f16x8" "mul")
      (param (type $f16x8))
      (param (type $f16x8))
      (result (type $f16x8))
      <fallback code>)
  3. Compile code to use built-ins: e.g., port kernels from XNNPACK’s fp16_gemm to use the f16x8
    built-in functions and types. This presupposes that the toolchain now supports C/C++ attributes
    that emit the WebAssembly @builtin annotations.

  4. Implement engine support: in a WebAssembly engine, add the HW lowerings of f16x8 built-ins. In
    V8/Chrome’s case, these new built-ins would naturally be placed behind an
    origin trial flag. At this point, users could experiment with the new built-ins and compare
    against the fallback path in other engines.

  5. Upstream built-in definitions: several commenters have requested a centralized process to
    reach community consensus. While not the main thrust of this proposal, one such process might be
    to add them to the tool-conventions repository, maintained by a subgroup of the WebAssembly CG
    (alternately: a separate builtins repository). Remember that built-in optimizations are optional
    for engines, so engine maintainers would still be free to choose which built-ins to implement and
    when — the fallback code ensures compatibility in the meantime. Proposers would submit an
    entry much like the lowerings FP16 already describes, e.g.:

    group: f16x8
    name: mul
    fallback: |
      (func (@builtin "f16x8" "mul")
        (param $lhs (type $f16x8))
        (param $rhs (type $f16x8))
        (result $dst (type $f16x8))
        <fallback code>)
    specializations:
    - check: x86_64 && avx && f16c
      emit: |
        VCVTPH2PS $dst, $lhs
        VCVTPH2PS $tmp, $rhs
        VMULPS $dst, $dst, $tmp
        VCVTPS2PH $dst, $dst, 0
    - check: aarch64 && ...
  6. Feature detection: during the period where support for f16x8.mul is not available in all
    engines, developers can provide an alternate code path. While the fallback code guarantees
    semantic compatibility, it may be slow and an alternate algorithm may be preferable. To choose a
    different algorithm, a developer could write:

    (version "xnnpack" "f16x8"
      (func $f16_gemm <code using @builtins>))
    (version "xnnpack" "default"
      (func $f16_gemm <code using an alternate algorithm>))

    By checking the f16x8.is_available built-in, developers could select which version to compile.

  7. Adopt into WebAssembly specification (or not): if the f16x8 built-ins are successfully
    adopted by multiple engines and applications, this is a strong indication for addition to the
    WebAssembly specification. If any HW concerns are alleviated and the proposal is adopted, the
    original fallback code can eventually be replaced by the new f16x8 instructions. If it is not
    adopted, the built-ins continue to work in engines that continue to support them. If usage of a
    built-in declines, engines can safely drop support, relying on the fallback code.
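
Step 2 above leaves the fallback body elided. As one possible reading, and only a sketch, the fallback for f16x8 mul could multiply each f32 lane and round the product to half precision, mirroring the convert, multiply, convert sequence in the x86 specialization. The C below assumes each input lane already holds an f16-representable value; `f16_to_f32`, `f32_to_f16`, `f16x8`, and `f16x8_mul` are hypothetical names, and the rounding mode (nearest, ties to even) is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Decode IEEE 754 binary16 bits to f32 (exact: every f16 value fits in f32). */
static float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1fu;
    uint32_t mant = h & 0x3ffu;
    uint32_t bits;
    if (exp == 0x1fu) {                 /* infinity or NaN */
        bits = sign | 0x7f800000u | (mant << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {             /* signed zero */
        bits = sign;
    } else {                            /* subnormal: normalize first */
        exp = 113u;
        while ((mant & 0x400u) == 0) { mant <<= 1; exp--; }
        bits = sign | (exp << 23) | ((mant & 0x3ffu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Encode f32 as binary16 bits, rounding to nearest, ties to even. */
static uint16_t f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp = (x >> 23) & 0xffu;
    uint32_t mant = x & 0x7fffffu;
    if (exp == 0xffu)                   /* infinity or NaN */
        return (uint16_t)(sign | 0x7c00u | (mant ? 0x200u : 0u));
    int32_t e = (int32_t)exp - 127 + 15;
    if (e >= 31) return (uint16_t)(sign | 0x7c00u);  /* overflow to infinity */
    if (e < -10) return (uint16_t)sign;              /* underflow to zero */
    uint32_t shift, half;
    if (e <= 0) {                       /* result is an f16 subnormal */
        mant |= 0x800000u;              /* make the implicit 1 explicit */
        shift = (uint32_t)(14 - e);
        half = mant >> shift;
    } else {                            /* result is an f16 normal */
        shift = 13u;
        half = ((uint32_t)e << 10) | (mant >> 13);
    }
    uint32_t rem = mant & ((1u << shift) - 1u);
    uint32_t tie = 1u << (shift - 1u);
    if (rem > tie || (rem == tie && (half & 1u))) half++;
    return (uint16_t)(sign | half);
}

/* Mirrors the tuple of eight f32 lanes used for $f16x8. */
typedef struct { float lane[8]; } f16x8;

/* Lane-wise multiply, with each product rounded to f16 precision. */
static f16x8 f16x8_mul(f16x8 a, f16x8 b) {
    f16x8 r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = f16_to_f32(f32_to_f16(a.lane[i] * b.lane[i]));
    return r;
}
```

The product of two f16 values has at most 22 significant bits, so it is exact in f32 and a single rounding to f16 gives the correctly rounded result. That property is what would let such a fallback agree with a hardware convert, multiply, convert lowering.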

Clearly the process side of this proposal is up for discussion; this example explains one concrete
elaboration of the built-in idea. Similar examples could be crafted for cryptographic primitives, JS
strings, and 128-bit arithmetic, each with its own motivation and details.

Open Questions

  • Should we restrict the size or shape of built-in functions and types? While some of us advocated
    for built-ins that directly corresponded to HW instructions, others considered a higher-level
    approach more interesting (replace multi3, XNNPACK kernels, entire libssl functions, etc.). It is
    not clear which approach is best so we have left this open. One approach might be to limit
    functions to well-defined instruction sets, such as x86, ARM, or domain-specific IRs. Examples of
    such domains are machine learning, security, etc.
  • Won’t this encourage custom extensions to WebAssembly that undermine the standard? WebAssembly’s
    wide adoption in disparate environments would hint that this is already happening; our hope is
    that this brings those extensions under a common umbrella. We admit, however, that this would
    result in different performance between engines based on what built-ins they support.
  • Could this be used to compile MLIR? E.g., emit MLIR operations as Wasm built-ins? Indeed, one of
    the motivations behind this is MLIR’s popularity as an intermediate format for various
    high-performance compiler tools, including machine learning model compilers (Triton, IREE,
    etc.). What MLIR has done well by allowing custom instructions,
    types, attributes, and regions is to retain the original program intent through various compiler
    transformations. When MLIR is emitted as WebAssembly, though, it meets an impedance mismatch:
    e.g., high-level tensor operations are squeezed into non-performant i32.load|store operations. The
    hope is that this proposal could bridge MLIR to WebAssembly in a different way, via built-ins.
    Then, one might imagine WebAssembly code that interleaves MLIR-based GPU invocations with
    tightly-compiled vector kernels with run-of-the-mill WebAssembly, etc.
  • What about the JS string built-ins proposal?
    The JS string proposal, now at phase 4, is a proof point that engine-provided built-ins
    are in fact necessary for performance. One difference between that proposal and this one is in
    scope: this proposal would allow the use of engine built-ins far beyond JS strings. One can
    imagine implementing the "wasm:js_string" imports from that proposal in terms of this one:
    (@builtin "wasm:js_string" "..."). If this were the case, this would result in improved
    compatibility: using this proposal’s built-in fallback code, the JS string built-ins would “come
    with” their sub-optimal WebAssembly-only implementation, ensuring modules are at least executable
    on any engine — not just browsers — albeit less optimally.
  • What about the type imports proposal?
    The type imports proposal, now at phase 1, is similar in spirit to this one: both
    intend to extend a WebAssembly module with additional type information. But, whereas type imports
    are concerned with types coming from outside (e.g., a browser reference), this proposal has to
    “lay out” (i.e., provide a representation for) new value types for HW-specialized built-ins. We
    expect the particular layout of these new types to be critical for performance but, at the same
    time, the type must be transparent to WebAssembly to be useful. This led to the built-in tuple
    syntax, but we are open to better syntax suggestions. One possible future is for the type import
    proposal to be extended with this or another aliasing syntax which this proposal could then depend
    on when marking new types as built-in ones.

Written collaboratively by @abrown, @titzer, @woodsmc, @ppenzin. Thanks to all those who provided
feedback, including @dtig, @tlively, @ajklein, @mingqiusun, @alexcrichton, @dicej, @cfallin.

@dtig
Member

dtig commented Sep 12, 2024

Thanks for filing this issue! Adding some notes from offline discussion that aren't fully reflected here:

  • First, there is less and less common ground: the “common hardware capabilities” WebAssembly
    was initially built on relied on half a century of CPU development. More recent HW features can be
    quite distinct between HW vendors; it becomes difficult or impossible to paper over the semantic
    differences (e.g., see relaxed SIMD).

I would argue that this isn't strictly true on the CPU: while older ISA extensions are more fragmented (which is what relaxed-simd was targeting), there's definitely a movement towards convergence on the newer extensions. FP16, as used in this context, is also an extension where hardware is expected to converge soon, as seen in the instruction lowerings for FP16.

An extension to allow HW-specialized WebAssembly could:

  • Allow experimentation with new features (e.g., FP16) before they are engraved in the
    specification; this would allow for soliciting user feedback and realistic performance numbers
    without the overhead of the CG process

This is the use case that we are currently focused on, though arguably a simple opcode prefix reserved for experimentation could also suffice, with a potential path towards standardization.

  • Optimize cryptographic functions by allowing engines to use native HW features (e.g., compiling
    libssl with AESNI support).

In theory, I like this approach, but there is a practical usability concern that this sidesteps. In the web context, a big value add for Wasm is the access to lower level primitives which complement existing Web APIs. For example, providing optionality for codecs or storage APIs (SQLite on the web through Wasm) that benefit from lower level primitives, or ease of use for existing native libraries that can then use Wasm intrinsics as a drop in replacement instead of higher level primitives/functions that applications need to rewrite their program for. I'm not an expert in the cryptographic domain, but usually the requests for AES instructions are motivated by applications wanting to use a Wasm intrinsics header as a drop-in replacement for native intrinsic header files without actually having to rewrite much of their program.

  • Optimize machine learning algorithms (e.g., XNNPACK kernels using FP16, AVX512, AMX).

I suspect engine authors may not want to sign up for maintaining or updating kernels, especially at the rate they're changing. The other side of this is also that engines may be left with supporting multiple versions of the kernels because it is quite hard to deprecate something once there are actual users. My argument is that the complexity of handling this belongs in the library/application space and not in the Wasm space.

  • Retain program intent: e.g., MLIR expresses “multi-level” interactions within a program; ML
    compilers (Triton, IREE, ONNX, etc.) that target WebAssembly lose critical information about
    tensor manipulation when squeezed into current WebAssembly.

This raises an interesting point, should tensor manipulation be squeezed into current WebAssembly? CPUs (and by extension Wasm) are great for general purpose compute, but probably aren't very well suited for special purpose compute. There should be a graceful fallback path to the CPU when special purpose compute isn't available, but that seems more like an implementation consideration for runtimes. Again, far from an expert, but please correct me if I'm misunderstanding the intent of this use case.

Semantic equivalence

For an engine to correctly replace built-ins with HW-specialized code, it must preserve the
semantics of the fallback code (i.e., the WebAssembly function body and tuple structure). We expect
engines to verify this, but do not mandate a mechanism since this engine-specific decision is open
for experimentation.

One way engines could check that built-in functions and types are equivalent is by “bit by bit”
equivalence. The engine would maintain a map of the fallback code to its HW-specialized equivalent
and, for each @builtin function, would check that the guest’s fallback code matches the engine’s
exactly. An optimization of this idea is to use collision-resistant hashes, though this entails some
discussion of risk.

I think the point that is tripping me up a bit is that the fallback path is the default path, i.e., the default builtin code may not be performant and it relies on the engine checking semantic equivalence and having the right hardware-optimized equivalents to generate performant code. The fallback code is also likely the one that isn't quite interesting in a user-facing way, because what applications want to do is detect and use the performant code path if HW support exists, or ship a different kernel/binary instead of using the slow path. The point below on the brittleness of this type of equivalence to optimizations could be challenging from a maintenance perspective.

Open Questions

  • Should we restrict the size or shape of built-in functions and types? While some of us advocated
    for built-ins that directly corresponded to HW instructions, others considered a higher-level
    approach more interesting (replace multi3, XNNPACK kernels, entire libssl functions, etc.). It is
    not clear which approach is best so we have left this open. One approach might be to limit
    functions to well-defined instruction sets, such as x86, ARM, or domain-specific IRs. Examples of
    such domains are machine learning, security, etc.
  • Won’t this encourage custom extensions to WebAssembly that undermine the standard? WebAssembly’s
    wide adoption in disparate environments would hint that this is already happening; our hope is
    that this brings those extensions under a common umbrella. We admit, however, that this would
    result in different performance between engines based on what built-ins they support.

In the web environment, standardization has been critical for adoption and for consistent behavior across engines. While it is true that different engines adopt features at different points in time, they have largely conformed to the standard, which I see as a strength of the Wasm process. From the perspective of an application developer, the user experience of having significant differences in performance profiles is quite challenging to navigate. That said, I'm sure the web ecosystem is different in this regard, so I understand that some guardrails should be there to make sure that non-web engines don't bear the implementation burden for features that aren't relevant to them.

@rossberg
Member

Thanks for writing this up. It is a good time to explore directions for a generic extension mechanism. My first reaction after reading this proposal is that it is adding a lot of complex and cross-cutting machinery to the language, some of which (conditional compilation) we tried before. What I am missing is an argument why the much simpler approach taken for JS string builtins, plus the addition of type imports to allow for stricter static typing, is not sufficient for other use cases.

@titzer

titzer commented Sep 15, 2024

@rossberg I think the main argument here is that a builtin comes with its own specification via a standard lowering (i.e. the fallback) which is guaranteed to be semantically equivalent to the desired HW instruction(s). While this new mechanism is more spec burden than, e.g. the string built-ins (which AFAICT can be spec'd in an embedding) it seems like a more general mechanism that could enable many use cases with a single mechanism.

@rossberg
Member

@titzer, the built-ins approach would also specify a precise semantics for those built-ins, so what's the practical difference? And it is not just more spec burden — it is multiple difficult new features with non-obvious design.

@eqrion

eqrion commented Oct 8, 2024

A couple thoughts here, thanks for writing this up!

Semantic equivalence

I think that true semantic equivalence here will be very difficult. It's already the case that engines could pattern match whole functions and replace them with equivalent native instructions. We don't do that for various reasons, but one major one is that it's really hard to prove that any non-trivial wasm function is equivalent to a single machine instruction for all inputs. And also maintaining a deterministic trapping order across mutations to the store is hard.

That leads me to wonder, does anyone actually care about running the fallback code? It seems like the reason for having an 'is_available' predicate was so that users could avoid calling the fallback code altogether (as it's likely much slower). If that's the case, we could avoid the whole semantic equivalence issue by not specifying a fallback and instead specifying a deterministic trap if a builtin is called but not available.

Another question here, why not use the approach from js-string-builtins of importing the builtin function? This would avoid the need to change core wasm. If the platform supports the builtin, you let it satisfy the import. If you want to instead use a fallback, you provide the import yourself. For example with the 'half-precision' proposal, nearly every instruction there could be an imported function. The only two that would need some care would be the load/store instructions.

I do support the idea of using 'builtins' to express things that aren't a fit for all platforms. This can lower the bar for new instructions from e.g. 'good fit for all platforms' to 'good fit for the web platform'.

However, when it comes to the web platform we do still have constraints that will make adoption of 'hardware experimental features' (as one of the use-cases) very difficult. The biggest issue is that we can't really unship things from the web. If we ship a builtin for some experimental hardware feature and it gets adopted in a major ML framework as a critical optimization, it's very difficult to remove it. Firefox still supports asm.js even though its usage is very low. The kinds of builtins that we'd support for the Web would need to meet a certain amount of stability that may be lower than the core spec, but still pretty high. There are also concerns about fingerprinting that I'm not an expert in, but know it can cause issues for shipping things.

@mmcloughlin

I explored some aspects of this proposal for a final project in @titzer's Virtual Machines class, so I thought I'd provide an experience report and share the full write-up.

Summary

Proof of concept demonstrates that SHA-1 with dedicated AArch64 C intrinsics can be executed via Wasm intrinsics in Wasmtime at 1.3x native performance.

Potential issues revealed by this experiment:

  • Semantic mismatches between C, Wasm and target instructions can eliminate performance gains if not handled carefully.
  • Peak performance may be limited by relatively simple optimizers in JIT engines.
  • Supporting large sets of intrinsics in Wasm JITs would require careful engineering.

Application

The experiment provides a proof-of-concept for a representative use case, namely the SHA-1 hash algorithm using the Cryptographic Extension on AArch64. The prototype demonstrates how C code written against ARM's C intrinsics API can be executed both natively and via Wasm. Wasm execution is achieved with a Wasm AArch64 intrinsics C API layer that serves as a "drop-in replacement for native intrinsic header files", as @dtig mentioned earlier. In addition, I have a fork of Wasmtime with support for intrinsic calls for a select group of AArch64 instructions. The end result is SHA-1 execution via Wasm with intrinsics at 1.3x native AArch64 performance.

To give a feel for the implementation, four rounds of the SHA-1 compression function in C with AArch64 intrinsics are:

// Rounds 28-31
e0 = vsha1h_u32(vgetq_lane_u32(abcd, 0));
abcd = vsha1pq_u32(abcd, e1, t1);
t1 = vaddq_u32(m1, vdupq_n_u32(K1));
m2 = vsha1su1q_u32(m2, m1);
m3 = vsha1su0q_u32(m3, m0, m1);

These intrinsics are defined in arm_neon.h. The proof of concept provides an alternate wasm_arm_neon.h that the C code can be compiled against unchanged, and pure Wasm fallbacks that would work on any platform. However, when executed under the modified Wasmtime, calls to intrinsic functions such as vsha1h_u32 are recognized and compiled directly to the corresponding hardware instructions like SHA1H.
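
As an illustration of what a pure fallback could look like, here is a minimal sketch for the simplest of these operations. It assumes only the architectural behavior of SHA1H, a fixed rotate-left by 30 bits of a 32-bit value; the function name `vsha1h_u32_fallback` is hypothetical and is not taken from the proof of concept's wasm_arm_neon.h.

```c
#include <stdint.h>

/* Portable stand-in for vsha1h_u32: SHA1H is a fixed rotate-left by 30,
   which is the same as a rotate-right by 2 on a 32-bit value. */
static inline uint32_t vsha1h_u32_fallback(uint32_t hash_e) {
    return (hash_e << 30) | (hash_e >> 2);
}
```

A fallback like this compiles to ordinary Wasm shifts and an OR, so it runs on any engine, while an intrinsic-aware engine could replace the call with the single hardware instruction.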

Lessons

Some lessons from this proof-of-concept, with the caveat that they may not generalize to other intrinsics domains.

Challenge of Semantics Mismatches. Compilation via intrinsics passes through many layers: C intrinsics API, engine intrinsics API, Wasm operators, CLIF IR and machine code representation. Each of these has its own semantics and value representations. Earlier stages of this project showed that, if not handled correctly, semantics mismatches can eliminate any performance you might hope to gain from the intrinsics calls. Specifically, in this case some of the special SHA-1 instructions have the oddity that they accept s<n> registers, which are scalar 32-bit values in the low bits of vector registers. Wasm engines naturally want to store 32-bit integers in general purpose registers. Without careful handling, the intrinsics calls are therefore surrounded by redundant register moves between the vector and general purpose register files, with a significant performance penalty. These details are important when the entire goal of hardware intrinsics in Wasm is reaching near-native performance. We might hope that the idiosyncrasies of the SHA-1 instruction set are not widespread. However, it seems possible that this broader problem of semantic mismatches could rear its head in other cases, for example when attempting to use wide vector types (e.g. Intel AVX-512) that do not have Wasm equivalents.

Significance of the Intrinsics API. The design of the C API layer was critical in achieving near native performance. Specifically, it should be designed to limit the number of intrinsics required in the engine, and intrinsics offered by the engine should be as close as possible to the machine instructions. Therefore:

  • Implement C layer intrinsics as existing Wasm operators wherever possible. For example, the AArch64 intrinsic vdupq_n_u32 can just be implemented as i32x4.splat (or wasm_u32x4_splat in C) without the need to add an intrinsic to the engine. Importantly, this also improves the ability of the AOT compiler to optimize code around the intrinsics.
  • Engine intrinsics API should be as close as possible to machine instructions. This makes the engine work essentially a passthrough, and limits the optimizations required from the JIT. As a concrete example, the vsha1h_u32 intrinsic takes and returns a uint32_t. However, it is best if the C layer maps to an internal __intrinsic_vsha1h_u32 version that takes and returns a v128, since these match the underlying SHA1H instruction more closely.
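
The two guidelines above can be sketched in plain C as follows. This is an illustration under stated assumptions, not the code from the proof of concept: `v128_sketch` is a stand-in struct for Wasm's v128 so the example builds anywhere, `engine_intrinsic_sha1h` is a hypothetical name for the v128-in, v128-out engine intrinsic, and its body models SHA1H in software (rotate-left by 30 on lane 0) instead of emitting the instruction.

```c
#include <stdint.h>

/* Stand-in for a Wasm v128 value viewed as four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v128_sketch;

/* Models i32x4.splat: an existing Wasm operator, so no engine intrinsic
   is needed for the C-level vdupq_n_u32. */
static v128_sketch sketch_vdupq_n_u32(uint32_t x) {
    v128_sketch v = {{x, x, x, x}};
    return v;
}

/* Hypothetical engine intrinsic: takes and returns a vector, matching the
   instruction's use of the low 32 bits of a vector register. Only lane 0
   is meaningful; the upper lanes of the result are zero. */
static v128_sketch engine_intrinsic_sha1h(v128_sketch v) {
    v128_sketch r = {{0, 0, 0, 0}};
    r.lane[0] = (v.lane[0] << 30) | (v.lane[0] >> 2);
    return r;
}

/* C-layer wrapper keeping the scalar signature that C code expects. The
   scalar/vector conversions live here, where the AOT compiler can see and
   optimize them, not inside the engine. */
static uint32_t sketch_vsha1h_u32(uint32_t hash_e) {
    v128_sketch in = {{hash_e, 0, 0, 0}};
    return engine_intrinsic_sha1h(in).lane[0];
}
```

With this split, the engine only has to lower one vector-typed call to one instruction, and the lane insert and extract around it are ordinary Wasm operations.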

Importance of Accompanying Optimizations. The first version of SHA-1 via Wasm intrinsics had poor performance (3.2x native), showing that merely mapping to the right machine instructions is not enough. Supporting optimization passes are critical. In the SHA-1 case, it was crucial to eliminate redundant moves between register classes, but it is reasonable to expect instances of this problem for other classes of intrinsics. Optimizing JITs are designed for compile speed and therefore have a much more limited set of optimizations than a full AOT compiler. In this case we were able to work around missing Cranelift JIT optimizations by moving the problem to the AOT compilation layer; however, it is not clear that this would always be possible. Indeed, the remaining overhead of approximately 30% over native execution may be a difficult gap to close, given the lack of optimizations such as instruction scheduling in JIT compilers. Overall, we might expect that Wasm intrinsics performance would be limited by JIT compiler optimization capabilities.

Fallback Performance. When the intrinsics implementation is executed under Wasm with the fallback implementations, the performance is very poor (over 9x native intrinsics). In fact, it's even worse than a generic version of SHA-1 compiled to Wasm. The function call overhead is likely a major problem, so inlining of fallbacks would likely be necessary for tolerable performance. Alternatively we could accept that fallback performance is not a goal, to @eqrion's point, and the is_available functions are there to allow users to provide an alternative.

Engineering Aspects. The fork of Wasmtime for this project was modified with this proof-of-concept in mind. While the engineering was reasonable, the approach taken is not one that would scale to adding hundreds or thousands of intrinsic calls. At the time of writing, the ARM intrinsics database contains 12,855 function calls, with 4,344 in the Neon instruction set extension. A full production-grade version of the Wasm intrinsic header library and accompanying engine support would be a substantial undertaking. You would almost certainly want automation and code generation involved, but certain parts of the engine integration would still not scale well. The current hand-written assembler would need to support many more instructions. You also probably would not want to extend the engine's IR to support every intrinsic, but instead perhaps support an explicit passthrough or intrinsic IR node that would effectively perform a trivial lowering to a wrapped machine instruction. None of these engineering challenges are intractable, but they would need careful thought.
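
One way to picture the table-driven direction is the following minimal C sketch, which generates a lookup table from a single list of intrinsics with an X-macro. The struct, macro and function names are hypothetical, and the table maps intrinsic names only to instruction mnemonics; a real engine would carry encodings, operand classes and more. The four entries are the SHA-1 intrinsics used in the snippet earlier in this comment.

```c
#include <stddef.h>
#include <string.h>

/* Single source of truth: one line per passthrough intrinsic. In a fuller
   system this list would be generated from the vendor's intrinsics data. */
#define SHA1_INTRINSICS(X)          \
    X(vsha1h_u32,    "SHA1H")       \
    X(vsha1pq_u32,   "SHA1P")       \
    X(vsha1su0q_u32, "SHA1SU0")     \
    X(vsha1su1q_u32, "SHA1SU1")

typedef struct {
    const char *intrinsic; /* name seen at the Wasm call site */
    const char *mnemonic;  /* machine instruction it wraps */
} passthrough_entry;

/* Expand the list into table rows. */
#define AS_ENTRY(name, mnemonic) { #name, mnemonic },
static const passthrough_entry passthrough_table[] = {
    SHA1_INTRINSICS(AS_ENTRY)
};
#undef AS_ENTRY

/* Return the wrapped instruction for an intrinsic, or NULL when the call
   is not a passthrough and should be compiled like any other function. */
static const char *passthrough_mnemonic(const char *intrinsic) {
    size_t n = sizeof passthrough_table / sizeof passthrough_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(passthrough_table[i].intrinsic, intrinsic) == 0) {
            return passthrough_table[i].mnemonic;
        }
    }
    return NULL;
}
```

Intrinsics such as vaddq_u32 are deliberately absent from the table, since they can be expressed with existing Wasm operators and need no engine support.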

Reference
