High-order Virtual Machine (HVM) is a pure functional compile target that is lazy, non-garbage-collected and massively parallel

Overview

High-order Virtual Machine (HVM)

High-order Virtual Machine (HVM) is a pure functional compile target that is lazy, non-garbage-collected and massively parallel. It is also beta-optimal, meaning that, in several cases, it can be exponentially faster than most functional runtimes, including Haskell's GHC.

That is possible due to a new model of computation, the Interaction Net, which combines the Turing Machine with the Lambda Calculus. Previous implementations of this model have been inefficient in practice, however, a recent breakthrough has drastically improved its efficiency, giving birth to the HVM. Despite being a prototype, it already beats mature compilers in many cases, and is set to scale towards uncharted levels of performance.

Welcome to the inevitable parallel, functional future of computers!

Usage

1. Install it

First, install Rust. Then, type:

cargo install hvm

2. Create an HVM file

HVM files look like untyped Haskell. Save the file below as main.hvm:

// Creates a tree with `2^n` elements
(Gen 0) = (Leaf 1)
(Gen n) = (Node (Gen(- n 1)) (Gen(- n 1)))

// Adds all elements of a tree
(Sum (Leaf x))   = x
(Sum (Node a b)) = (+ (Sum a) (Sum b))

// Performs 2^n additions in parallel
(Main n) = (Sum (Gen n))

The program above creates a perfect binary tree with 2^n elements and adds them up. Since it is recursive, HVM will parallelize it automatically.

3. Run and compile

hvm r main 10                      # runs it with n=10
hvm c main                         # compiles HVM to C
clang -O2 main.c -o main -lpthread # compiles C to BIN
./main 30                          # runs it with n=30

The program above runs in about 6.4 seconds in a modern 8-core processor, while the identical Haskell code takes about 19.2 seconds in the same machine with GHC. This is HVM: write a functional program, get a parallel C runtime. And that's just the tip of iceberg!

For Nix users

See Nix usage documentation here.

Benchmarks

HVM has two main advantages over GHC: automatic parallelism and beta-optimality. I've selected 5 common micro-benchmarks to compare them. Keep in mind that HVM is still an early prototype, so it obviously won't beat GHC in general, but it does quite well already and should improve steadily as optimizations are implemented. Tests were compiled with ghc -O2 for Haskell and clang -O2 for HVM, on an 8-core M1 Max processor. The complete files to replicate these results are in the /bench directory.

List Fold (Sequential)

main.hvm main.hs
// Folds over a list
(Fold Nil         c n) = n
(Fold (Cons x xs) c n) = (c x (Fold xs c n))

// A list from 0 to n
(Range 0 xs) = xs
(Range n xs) =
  let m = (- n 1)
  (Range m (Cons m xs))

// Sums a big list with fold
(Main n) =
  let size = (* n 1000000)
  let list = (Range size Nil)
  (Fold list λaλb(+ a b) 0)
-- Folds over a list
fold Nil         c n = n
fold (Cons x xs) c n = c x (fold xs c n)

-- A list from 0 to n
range 0 xs = xs
range n xs =
  let m = n - 1
  in range m (Cons m xs)

-- Sums a big list with fold
main = do
  n <- read.head <$> getArgs :: IO Word32
  let size = 1000000 * n
  let list = range size Nil
  print $ fold list (+) 0

*the lower the better

In this micro-benchmark, we just build a huge list of numbers, and fold over it to sum them. Since lists are sequential, and since there are no higher-order lambdas, HVM doesn't have any technical advantage over GHC. As such, both runtimes perform very similarly.

Tree Sum (Parallel)

main.hvm main.hs
// Creates a tree with `2^n` elements
(Gen 0) = (Leaf 1)
(Gen n) = (Node (Gen(- n 1)) (Gen(- n 1)))

// Adds all elemements of a tree
(Sum (Leaf x))   = x
(Sum (Node a b)) = (+ (Sum a) (Sum b))

// Performs 2^n additions
(Main n) = (Sum (Gen n))
-- Creates a tree with 2^n elements
gen 0 = Leaf 1
gen n = Node (gen(n - 1)) (gen(n - 1))

-- Adds all elements of a tree
sun (Leaf x)   = 1
sun (Node a b) = sun a + sun b

-- Performs 2^n additions
main = do
  n <- read.head <$> getArgs :: IO Word32
  print $ sun (gen n)

TreeSum recursively builds and sums all elements of a perfect binary tree. HVM outperforms Haskell by a wide margin because this algorithm is embarassingly parallel, allowing it to fully use the available cores.

QuickSort (Parallel)

main.hvm main.hs
// QuickSort
(QSort p s Nil)          = Empty
(QSort p s (Cons x Nil)) = (Single x)
(QSort p s (Cons x xs))  =
  (Split p s (Cons x xs) Nil Nil)

// Splits list in two partitions
(Split p s Nil min max) =
  let s   = (>> s 1)
  let min = (QSort (- p s) s min)
  let max = (QSort (+ p s) s max)
  (Concat min max)
(Split p s (Cons x xs) min max) =
  (Place p s (< p x) x xs min max)

// Sorts and sums n random numbers
(Main n) =
  let list = (Randoms 1 (* 100000 n))
  (Sum (QSort Pivot Pivot list))
-- QuickSort
qsort p s Nil          = Empty
qsort p s (Cons x Nil) = Single x
qsort p s (Cons x xs)  =
  split p s (Cons x xs) Nil Nil

-- Splits list in two partitions
split p s Nil min max =
  let s'   = shiftR s 1
      min' = qsort (p - s') s' min
      max' = qsort (p + s') s' max
  in  Concat min' max'
split p s (Cons x xs) min max =
  place p s (p < x) x xs min max

-- Sorts and sums n random numbers
main = do
  n <- read.head <$> getArgs :: IO Word32
  let list = randoms 1 (100000 * n)
  print $ sun $ qsort pivot pivot $ list 

This test modifies QuickSort to return a concatenation tree instead of a flat list. This makes it embarassingly parallel, allowing HVM to outperform GHC by a wide margin again. It even beats Haskell's sort from Data.List! Note that flattening the tree will make the algorithm sequential. That's why we didn't chose MergeSort, as merge operates on lists. In general, trees should be favoured over lists on HVM.

Composition (Optimal)

main.hvm main.hs
// Computes f^(2^n)
(Comp 0 f x) = (f x)
(Comp n f x) = (Comp (- n 1) λk(f (f k)) x)

// Performs 2^n compositions
(Main n) = (Comp n λx(x) 0)
-- Computes f^(2^n)
comp 0 f x = f x
comp n f x = comp (n - 1) (\x -> f (f x)) x

-- Performs 2^n compositions
main = do
  n <- read.head <$> getArgs :: IO Int
  print $ comp n (\x -> x) (0 :: Int)

This chart isn't wrong: HVM is exponentially faster for function composition, due to optimality, depending on the target function. There is no parallelism involved here. In general, if the composition of a function f has a constant-size normal form, then f^(2^N)(x) is linear-time (O(N)) on HVM, and exponential-time (O(2^N)) on GHC. This can be taken advantage of to design novel functional algorithms. I highly encourage you to try composing different functions and watching how their complexity behaves. Can you tell if it will be linear or exponential? Or how recursion will affect it? That's a very insightful experience!

Lambda Arithmetic (Optimal)

main.hvm main.hs
// Increments a Bits by 1
(Inc xs) = λex λox λix
  let e = ex
  let o = ix
  let i = λp (ox (Inc p))
  (xs e o i)

// Adds two Bits
(Add xs ys) = (App xs λx(Inc x) ys)

// Multiplies two Bits
(Mul xs ys) = 
  let e = End
  let o = λp (B0 (Mul p ys))
  let i = λp (Add ys (B0 (Mul p ys)))
  (xs e o i)

// Squares (n * 100k)
(Main n) =
  let a = (FromU32 32 (* 100000 n))
  let b = (FromU32 32 (* 100000 n))
  (ToU32 (Mul a b))
-- Increments a Bits by 1
inc xs = Bits $ \ex -> \ox -> \ix ->
  let e = ex
      o = ix
      i = \p -> ox (inc p)
  in get xs e o i

-- Adds two Bits
add xs ys = app xs (\x -> inc x) ys

-- Multiplies two Bits
mul xs ys = 
  let e = end
      o = \p -> b0 (mul p ys)
      i = \p -> add ys (b1 (mul p ys))
  in get xs e o i

-- Squares (n * 100k)
main = do
  n <- read.head <$> getArgs :: IO Word32
  let a = fromU32 32 (100000 * n)
  let b = fromU32 32 (100000 * n)
  print $ toU32 (mul a b)

This example takes advantage of beta-optimality to implement multiplication using lambda-encoded bitstrings. Once again, HVM halts instantly, while GHC struggles to deal with all these lambdas. Lambda encodings have wide practical applications. For example, Haskell's Lists are optimized by converting them to lambdas (foldr/build), its Free Monads library has a faster version based on lambdas, and so on. HVM's optimality open doors for an entire unexplored field of lambda-encoded algorithms that were simply impossible before.

Charts made on plotly.com.

How is that possible?

Check HOW.md.

How can I help?

Most importantly, if you appreciate our work, help spreading the project! Posting on Reddit, communities, etc. helps more than you think.

Second, I'm actually looking for partners! I'm confident HVM's current design is ready to scale and become the fastest runtime in the world. There are many cool things we'd like to implement:

If you'd like to be part of any of these, please email me, or just send me a personal message on Twitter.

Community

To just follow the project, join our Telegram Chat, the Kindelia community on Discord or Matrix!

Comments
  • Incorrect result on some inputs

    Incorrect result on some inputs

    HVM produces incorrect results on some inputs, one of which being:

    (Main arg) = ((λa (a (λb ((λc (λd (b (c d)))) a)))) (λe (e e)))
    

    On this input, HVM produces the result λx1 λx2 ((x1 x2) (x1 x2)) while the correct result is λx1 λx2 ((x1 x1) (x2 x2)) (as verified by https://crypto.stanford.edu/~blynn/lambda/, with the corresponding input syntax by adding dots after the variable of each lambda: ((λa. (a (λb. ((λc. (λd. (b (c d)))) a)))) (λe. (e e)))).

    Besides, on other inputs such as:

    ((λa λb ((b a) λc λd (λe (d a) a)) λf (f f)) λz z)
    

    HVM produces the result λx1 (x1 λx2 (x2 _)) which is not even well-formed.

    discussion 
    opened by Ekdohibs 41
  • Improvements needed, and planned features

    Improvements needed, and planned features

    Better allocator

    Currently, the C allocator just reserves 8 GB of memory when the program starts, and, if there are multiple threads, that memory is split between then. So, for example, if there are 8 threads, each one gets 1 GB to work with. A worker can read from other worker's memories (which may happen for shared data, i.e., when they pass through a DP0/DP1 pointer), but only the owner of the slice can allocate memory from it. Each worker also keeps its own freelist.

    As such, a huge improvement would be thread-safe, global, possibly arena-based, alloc and free.

    Better scheduler

    Right now, the task scheduler will just partition threads evenly among the normal form of the result. So, for example, if the result is (Pair A B), and there are 8 threads, 4 will be assigned to reduce A and B. This is obviously very limited: if A reduces quickly, it won't re-assign the threads to help reducing B, so the program will just use half of the cores from that point. And so on. This is why algorithms that return lists are often slower, they aren't using threads at all. In the worst case, it will just fallback to being single threaded.

    A huge improvement would be a proper scheduler, that adds potential redexes to a task pool. When I attempted to do that, the cost of synchronization added too much overhead, ruining the performance. Perhaps a heuristic to consider would be to limit the depth of the redexes for which global tasks are emitted; anything below, say, depth=8, would just be reduced by the thread that reaches it. Many improvements can be done, though.

    I32, I64, U64, F32, F64

    Right now, HVM only supports U32. The numeric types above should be supported too. I32 and F32 should be easy to add, since they are unboxed, like U32. I64, U64 and F64 require implementing boxed numbers, since they don't fit inside a 64-bit Lnk (due to the 4-bit pointer), but shouldn't be hard. Discussion on whether we should have unboxed 60-bit variants is valid.

    On #81.

    Improve the left-hand side flattener

    On #54.

    A nice API to use as a Rust library

    Using HVM as a pure Rust lib inside other projects should be very easy, specially considering the interpreter has a fairly good performance (only about 50% slower than the single-thread compiled version). We must think of the right API to expose these features in a Rust-idiomatic, future-proof way. Input from the Rust community is appreciated.

    A JIT compiler

    Right now, HVM compiles to C. That means a C compiler like clang or gcc is necessary to produce executables, which means that it isn't portable when used as a lib. Of course, libs can just use the interpreter, but it is ~3x slower than the compiler, and is not parallel. Ideally, we'd instead JIT-compile HVM files using WASM or perhaps Cranelift, allowing lib users to enjoy the full speed of HVM in a portable way.

    IO

    Adding IO should be easy: just write an alternative version of normalize that, instead of just outputting the normal form, pattern-matches it against the expected IO type, and performs IO operations as demanded.

    Windows support

    The compiler doesn't support Windows yet. The use of -lpthreads may be an issue. The interpreter should work fine, though.

    On #52.

    GPU compilers

    Compiling to the GPU would be amazing, and I see no reason this shouldn't be possible. The only complications I can think of are:

    1. How do we alloc inside a GPU work unit?

    2. Reduce() is highly branching, so this may destroy the performance. But it can be altered to group tasks by categories, and merge all branches in one go. The reduce function essentially does this: branch -> match -> alloc -> subst -> free. The last 4 steps are very similar independent of the branch it falls into. So, instead, the branch part can just prepare some data, and the match -> alloc -> subst -> free pass are all moved to the same branch, just with different indices and parameters. This will avoid thread divergence.

    3. GPGPU is a laborious mess. I'm under OSX, so I can't use CUDA; OpenCL is terrible; WebGPU is too new (and from what I could see, doesn't have the ability to read/write from the same buffer in the same pass!?). Regardless, we'd probably need different versions, Metal for OSX, CUDA and OpenCL for Windows depending on the GPU, etc. So, that's a lot of work that I can't do myself, it would require expensive engineers.

    Compile Kind to HVM

    HVM is meant to be a "low-level target", which is kinda confusing because it is actually very high-level, but in the sense that users shouldn't be writing it directly. Ideally, Kind will compile to HVM, but it needs a lot of adjustments before that is possible.

    Add a checker for the "derivatives of (λx(x x) λx(x x)) not allowed" rule

    See this issue for more information, as well as the superposed duplication section on HOW.md.

    On #61.

    Tests

    This project has basically no test. Write tests!


    Will edit the thread as I think of more things. Feel free to add to this thread.

    enhancement 
    opened by VictorTaelin 35
  • Windows support

    Windows support

    Right now the generated C code can't be compiled on Windows due to the use of <pthreads.h> etc.

    One way it could be done would be to compile it using a compatibility layer like Cygwin, but I never did this before.

    Help from some Windows programmer would be appreciated. :3

    enhancement 
    opened by steinerkelvin 19
  • Modified example program outputs expression instead of result

    Modified example program outputs expression instead of result

    I modified the example program:

    // Creates a tree with `2^n` elements
    (Gen 0) = (Leaf 1)
    (Gen n) = (Node (Gen(- n 1)) (Gen(- n 1)))
    
    // Adds all elements of a tree
    (Sum (Leaf x))   = x
    (Sum (Node a b)) = (+ ((+ (Sum a) (Sum b))) ((- (Sum a) (Sum b))))
    
    // Performs 2^n additions in parallel
    (Main n) = (Sum (Gen n))
    
    

    It often outputs 256, but sometimes outputs an expression. The expressions seem to be different:

    $ time ./main 8
    Rewrites: 304812 (2.60 MR/s).
    Mem.Size: 198923 words.
    
    ((((64+((32+((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))+(32-((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))))+(64-((32+((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))+(32-((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))))))))))+128)+0)
    
    real	0m0.146s
    user	0m0.052s
    sys	0m0.111s
    
    Rewrites: 305163 (2.14 MR/s).
    Mem.Size: 200042 words.
    
    (256+(((64+((((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))+32)+(((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))-32)))+(64-((((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))+32)+(((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))-32))))-128))
    
    real	0m0.164s
    user	0m0.045s
    sys	0m0.130s
    
    bug 
    opened by bjm5 18
  • `cargo install hvm` considered harmful

    `cargo install hvm` considered harmful

    ToT (61a331ffdc7c04567245f2cae969ab4643e91f20) doesn't appear to compile with gcc version 11.2.0 (from Ubuntu 21.10)

    $ hvm c bench/Fibonacci/main.hvm
    Compiled to 'bench/Fibonacci/main.c'.
    $ gcc -O2 bench/Fibonacci/main.c
    bench/Fibonacci/main.c:932:5: error: variably modified ‘normal_seen_data’ at file scope
      932 | u64 normal_seen_data[NORMAL_SEEN_MCAP];
          |     ^~~~~~~~~~~~~~~~
    

    Digging in I see

    #define NORMAL_SEEN_MCAP (HEAP_SIZE/sizeof(u64)/(sizeof(u64)*8))
    const u64 HEAP_SIZE = 8 * U64_PER_GB * sizeof(u64);
    
    discussion 
    opened by tommythorn 16
  • Generated C code runs forever

    Generated C code runs forever

    @VictorTaelin

    Just to be sure, all works fine there?

    Hmm, I'm actually not sure. I tried one of the benchmarks which ended up compiling ok, with only the following warning:

    bash-5.1$ clang -O2 compose_id.hovm.out.c -o main -lpthread
    compose_id.hovm.out.c:974:23: warning: incompatible pointer to integer conversion assigning to 'Thd' (aka 'unsigned long') from 'void *'
          [-Wint-conversion]
        workers[t].thread = NULL;
                          ^ ~~~~
    1 warning generated.
    

    But when I try to run the resulting binary, it doesn't seem to finish running, instead seemingly getting stuck:

    bash-5.1$ ./main
    Reducing.
    
    

    It hasn't printed anything after that new line for > than 30 minutes (I ended up killing it). Not sure if it's actually not working properly or if it's just my slow CPU though. When I looked at my process manager, the process didn't seem to be using any CPU cycles either.

    What is your CPU

    $ lscpu
    Architecture:            x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Address sizes:         40 bits physical, 48 bits virtual
      Byte Order:            Little Endian
    CPU(s):                  4
      On-line CPU(s) list:   0-3
    Vendor ID:               AuthenticAMD
      Model name:            HP Hexa-Core 2.0GHz
        CPU family:          22
        Model:               48
        Thread(s) per core:  1
        Core(s) per socket:  4
        Socket(s):           1
        Stepping:            1
        Frequency boost:     enabled
        CPU max MHz:         2000.0000
        CPU min MHz:         1000.0000
        BogoMIPS:            3992.48
        Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_op
                             t pdpe1gb rdtscp lm constant_tsc rep_good acc_power nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3
                             cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefet
                             ch osvw ibs skinit wdt topoext perfctr_nb bpext ptsc perfctr_llc cpb hw_pstate ssbd vmmcall bmi1 xsaveopt arat npt lbrv svm_lo
                             ck nrip_save tsc_scale flushbyasid decodeassists pausefilter pfthreshold overflow_recov
    Virtualization features:
      Virtualization:        AMD-V
    Caches (sum of all):
      L1d:                   128 KiB (4 instances)
      L1i:                   128 KiB (4 instances)
      L2:                    2 MiB (1 instance)
    NUMA:
      NUMA node(s):          1
      NUMA node0 CPU(s):     0-3
    Vulnerabilities:
      Itlb multihit:         Not affected
      L1tf:                  Not affected
      Mds:                   Not affected
      Meltdown:              Not affected
      Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
      Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
      Spectre v2:            Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
      Srbds:                 Not affected
      Tsx async abort:       Not affected
    

    Originally posted by @nothingnesses in https://github.com/Kindelia/HOVM/issues/15#issuecomment-1021494829

    opened by nothingnesses 15
  • Improve Nix usage instructions. Regenerate Cargo.nix.

    Improve Nix usage instructions. Regenerate Cargo.nix.

    Since Cargo.lock was changed with commit 6975488b0c253ee598981aaaf3260d7d9cfab74c. (Note: when Cargo.lock changes, crate2nix generate needs to be ran to update Cargo.nix)

    opened by nothingnesses 14
  • C Runtime rationale

    C Runtime rationale

    Hi there, this is a really interesting project. I was wondering why the target runtime is obtained transpiling to a C template, instead of directly using rust itself (eg. using Rayon); another option could be targeting something like LLVM.

    Thank you

    discussion 
    opened by csuriano23 13
  • Code crashes with segfault

    Code crashes with segfault

    First of all, thanks for this very interesting project!

    I tried some of the examples, and the last one about lambda arithmetic:

    // Increments a Bits by 1
    (Inc xs) = λex λox λix
      let e = ex
      let o = ix
      let i = λp (ox (Inc p))
      (xs e o i)
    
    // Adds two Bits
    (Add xs ys) = (App xs λx(Inc x) ys)
    
    // Multiplies two Bits
    (Mul xs ys) = 
      let e = End
      let o = λp (B0 (Mul p ys))
      let i = λp (Add ys (B0 (Mul p ys)))
      (xs e o i)
    
    // Squares (n * 100k)
    (Main n) =
      let a = (FromU32 32 (* 100000 n))
      let b = (FromU32 32 (* 100000 n))
      (ToU32 (Mul a b))
    

    just gives me a segfault:

    $ hvm c main && clang -O2 main.c -o main -lpthread && ./main 25
    Compiled to 'main.c'.
    Segmentation fault (core dumped)
    

    Tested with commit 1da28755.

    Furthermore, the QuickSort example just outputs:

    (Sum (QSort (Pivot) (Pivot) (Randoms 1 2500000)))
    

    Probably because neither Randoms, Sum, nor Pivot are defined ...

    Now back to studying HOW.md ... ;)

    bug 
    opened by 01mf02 12
  • Checker for the derivatives of `(λx(x x) λx(x x))`

    Checker for the derivatives of `(λx(x x) λx(x x))`

    If a lambda that clones its argument is itself cloned, then its clones aren't allowed to clone each-other.

    See https://github.com/Kindelia/HVM/issues/44 for more information, as well as the superposed duplication section on HOW.md.

    Example of what is not allowed:

    let f = λs λz (s (s z)) # a lambda that clones its argument
    let g = f               # a clone of that lambda
    (g f)                   # the clone will clone the original
    
    enhancement 
    opened by steinerkelvin 11
  • Pattern matching on a lambda?

    Pattern matching on a lambda?

    Awesome project, got to it via HN I thought I'd try to port microKanren to HVM (why not), but it requires matching on a lambda, or a slight change in the semantics

    Here's what I came up with until now, it's likely I may have missed a few things or got them wrong

    (Car (Cons x y)) = x
    (Cdr (Cons x y)) = y
    (Vareq (Var v1) (Var v2)) = (== v1 v2)
    
    (ExtS x v s) = (Cons (Cons x v) s)
    
    (EgInner sc NIL) = MZERO
    (EgInner sc s) = (Unit (Cons s (Cdr sc)))
    
    (Eg u v) =
        λsc(EgInner sc (Unify u v (Car sc)))
    
    (Unit sc) = (Cons sc MZERO)
    MZERO = NIL
    
    (UnifyInner u v NIL) = NIL
    (UnifyInner (Var v) (Var v) s) = s
    (UnifyInner (Var u) v s) = (ExtS (Var u) v s)
    (UnifyInner u (Var v) s) = (ExtS (Var v) u s)
    (UnifyInner (Cons hu tu) (Cons hv tv) s) =
        (let s = (Unify hu hv s);
         (Unify tu tv s))
    (UnifyInner u u s) = s
    
    (Unify u v s) =
        (UnifyInner (Walk u s) (Walk v s) s)
    
    (Walk u NIL) = u
    (Walk (Var u) (Cons (Cons (Var u) v) s)) = (Walk v s)
    (Walk (Var u) .) = (Var u)
    
    (CallFresh f) = @sc(let c = (Cdr sc);
                        ((f (Var c)) (Cons (Car sc) (+ c 1))))
    (Disj g h) = @sc(Mplus (g sc) (h sc))
    (Conj g h) = @sc(Bind (g sc) h)
    
    (Mplus NIL y) = y
    (Mplus (Cons h t) y) = (Cons h (Mplus t) y)
    (Mplus fn x) = @(Mplus x (fn))
    
    (Bind NIL g) = MZERO
    (Bind (Cons h t) g) = (MPlus (g h) (Bind t g))
    (Bind fn g) = @(Bind (fn) g)
    
    EMPTY_STATE = (Cons NIL 0)
    
    // uK ends here
    
    (Zzz g) = @sc(@(g sc))
    
    (ConjL (Cons g NIL)) = (Zzz g)
    (ConjL (Cons g gs)) = (Conj (Zzz g) (ConjL gs))
    
    (DisjL (Cons g NIL)) = (Zzz g)
    (DisjL (Cons g gs)) = (Disj (Zzz g) (DisjL gs))
    
    (Conde (Cons l alts)) = (DisjL (Cons (ConjL l) (Conde alts)))
    
    (Fresh NIL goals) = (ConjL goals)
    (Fresh (Cons v lvars) goals) = @v(Fresh lvars goals)
    
    (Pull NIL) = NIL
    (Pull (Cons x y)) = (Cons x y)
    (Pull f) = (Pull (f))
    
    (Take 0 f) = NIL
    (Take n NIL) = NIL
    (Take n (Cons x y)) = (Cons x (Take (- n 1) y))
    (Take n f)  = (Take n (Pull f))
    
    //
    
    (Fives x) = (Disj (Eg x 5) (Fives x))
    
    (Main) = (let f = (CallFresh Fives);
              (f EMPTY_STATE))
    
    opened by bsless 11
  • Compiled HVM programs need a better CLI interface

    Compiled HVM programs need a better CLI interface

    Running a program compiled with hvm compile somefile.hvm and then cargo build --release, I am greeted with this message

    $ ./target/release/somefile 
    
    somefile 1.0.17-beta
    A massively parallel functional runtime.
    
    USAGE:
        somefile <SUBCOMMAND>
    
    OPTIONS:
        -h, --help       Print help information
        -V, --version    Print version information
    
    SUBCOMMANDS:
        compile    Compile a file to Rust
        help       Print this message or the help of the given subcommand(s)
        run        Load a file and run an expression
    

    That means that this program is not carrying only the interpreted and precompiled functions, but also the whole compiler.

    That is unintuitive, what I expected was to just be able to call my program directly (which would correspond to calling it with somefile run "(Main arg)").

    We should probably create a separate CLI for the compiled programs to make this simpler to use

    opened by developedby 0
  • Speed up default_heap_size

    Speed up default_heap_size

    System::new_all does a lot more than just measuring available mem, so let's cut those extra syscalls to make startup faster.

    Bench (examples/hello):

    Old:

    $ hyperfine "target/release/main run '(Main 0)'"
    Benchmark 1: target/release/main run '(Main 0)'
      Time (mean ± σ):     513.4 ms ±  36.6 ms    [User: 285.6 ms, System: 455.2 ms]
      Range (min … max):   469.5 ms … 587.2 ms    10 runs
    

    New:

    $ hyperfine "target/release/main run '(Main 0)'"
    Benchmark 1: target/release/main run '(Main 0)'
      Time (mean ± σ):      77.5 ms ±   4.9 ms    [User: 18.0 ms, System: 59.9 ms]
      Range (min … max):    67.2 ms …  85.6 ms    33 runs
    
    opened by funbringer 0
  • Multithreaded sequential programs spend most their active time waiting

    Multithreaded sequential programs spend most their active time waiting

    For programs that are dominated by expressions that must be reduced in sequential order, there is a very large performance drop proportional to the number of threads used.

    Take this example program:

    (Func 0 b) = b
    (Func a b) = (Func (- a 1) (^ (<< b 1) b))
    
    (Main) = (Func 100000 1)
    

    Running with one thread, I get ~30MR/s, while with 2 threads I get ~20MR/s and with 16 threads ~10MR/s.

    Analizing what the program spends its time on using perf, it looks like most of the time the threads are trying to steal a job from another task, failing (because this program is inherently sequential and there is not enough parallel work for all the threads), and the waiting to try again.

    While it's true that we should try to write parallelizable programs for the HVM, for tasks that are mostly sequential we should have better behaviour.

    opened by developedby 2
  • Programs with a lot of dups get stuck

    Programs with a lot of dups get stuck

    (Func 0 b) = b
    (Func a b) = (Func (- a 1) (+ b b))
    
    (Main) = (Func 100000000 2)
    

    Running this code with a very large a, makes it seemingly never end (expected 10s, waited 10m). Other programs that create lots and lots of dups also seem to cause this.

    bug 
    opened by developedby 5
  • Questions about Lambda operator and parallel evaluation.

    Questions about Lambda operator and parallel evaluation.

    Are the lambda operator and let syntax necessary? E.g. aren't these rewrite rules sufficient to emulate them:

    (Replace (Var var) value (Var var)) = value
    (Replace (Var var) value (Lambda (Var arg) body) = 
        (Lambda (Var arg) (Replace (Var var) value body) -- where var != arg for proper scoping (idk how to encode this as a rewrite rule)
    (Apply (Lambda (Var x) body) y) = (Replace (Var x) y body)
    

    Example: (Apply (Lambda (Var X) (Var X)) (Var Y)) should evaluate to (Var Y). Is there a reason these exit as primitives (convenience? efficiency?) What about dup?

    (Dup x) = (Dup' x x)
    -- ... extra rules for lambda/replace that propagate the dup appropriately
    -- can be used in code by pattern matching against Dup' e.g. (F (Dup' x y)) = Whatever
    

    Basically, I am asking if rewrite rules are sufficient to encode all these computations (according to the interaction net paper I've read it seems so)? So what's the point of these primitives if this is intended as a bytecode/IR and not human-writable?

    Secondly, I am still curious about how HVM handles parallel computation. I could probably dig around the code more but a succinct explanation would be helpful I think.

    So suppose the following rules:

    (And True x)  = x
    (And False x) = False
    

    And suppose I want to evaluate an expression in the form (And True (And False (And ...))). For each And, we can use the rewrite rules above to rewrite each nesting in parallel (assuming infinite cores). Is this done in practice and how? Like what struct in C do each of these evaluate to?

    opened by vladov3000 1
Owner
null
An experimental, well-documented and expansion-ready virtual machine written in Rust.

Notice ivm is undergoing a complete refactor. ivm ivm is a rich, well-documented virtual machine written in Rust. Pros Documentation everywhere. Every

Imajin 4 Jul 9, 2022
A fast and secure RISC-V based virtual machine

PolkaVM PolkaVM is a general purpose user-level RISC-V based virtual machine. This project is still unfinished and is a very heavy work-in-progress! D

Koute 31 Sep 4, 2023
🔈 Elegant print for lazy devs

leg ?? Elegant print for lazy devs Make your CLIs nicer with minimal effort. Simple wrapper on top of: async-std printing macros. Prints to stderr to

Jesús Rubio 202 Nov 6, 2022
Embeddable tree-walk interpreter for a "mostly lazy" Lisp-like scripting language.

ceceio Embeddable tree-walk interpreter for a "mostly lazy" Lisp-like scripting language. Just a work-in-progress testbed for now. Sample usage us

Vinícius Miguel 7 Aug 18, 2022
Too lazy to read the full article? Skim it

SkimGPT When you're too lazy to either read the article or ask AI questions, you can use SkimGPT to help you. Install Clone this repo: git clone https

Huy 9 Apr 22, 2023
Lazy Sieve of Eratosthenes for infinitely generating primes lazily in Rust.

lazy-prime-sieve Lazy Sieve of Eratosthenes for infinitely generating primes lazily in Rust. Usage lazy-prime-sieve is a library crate. You may add it

Arindam Das 2 Jul 13, 2023
gomicollector is a simple mark-sweep garbage collector in Rust.

gomicollector gomicollector is a simple mark-sweep garbage collector in Rust. gomicollector collects garbage when the heap is full. Since pointer oper

speed 7 Feb 10, 2023
Wait, another virtual machine ?

WAVM WAVM, Wait, another virtual machine ?, is a register based 64 bits virtual machine written in Rust. It relies on 32 registers and 31 opcodes that

Wafelack 61 May 2, 2022
A stack based, virtual machine language written in Rust

Stackyy A stack based, virtual machine language written in Rust Description: Stackyy is a stack based, virtual machine language inspired by Forth and

FlawlessCode 2 May 2, 2022
Functional Reactive Programming library for Rust

Carboxyl is a library for functional reactive programming in Rust, a functional and composable approach to handle events in interactive applications.

Emilia Bopp 379 Dec 25, 2022
👌 A smol functional language that targets other languages

ditto A small, pure functional language that targets other languages. Syntax highlighting coming soon Elevator pitch ⏱️ Ditto is a mashup of my favour

ditto 45 Dec 17, 2022
A simplistic functional programming language based around Lisp syntax.

Orchid A simplistic functional programming language based around Lisp syntax. Short taste # function to return the larger list (fn larger-list (as bs)

rem 3 May 7, 2022
ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators

ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators This repository contains the artifact for the following pape

KAIST Concurrency & Parallelism Laboratory 36 Feb 16, 2023
Parallel pipelined map over iterators.

plmap Parallel pipelined map over iterators for rust. Documentation docs.rs/plmap Example Parallel pipelined mapping: // Import the iterator extension

null 3 Sep 15, 2022
Functions for mapping plaintexts to a u64 while preserving sort order

ore_encoding.rs This is a companion package to ore.rs that can generate and manipulate u64 plaintexts before they are encrypted by ore.rs. Being able

CipherStash 2 Dec 14, 2022
A small utility for modifying ELF shared library loading order.

elfpromote A small utility for modifying ELF shared library loading order. Usage $ cargo install elfpromote $ ldd blueboat_server linux-vdso.s

Heyang Zhou 5 Jul 21, 2022
ARM TrustZone-M example application in Rust, both secure world side and non-secure world side

ARM TrustZone-M example application in Rust, both secure world side and non-secure world side; projects are modified from generated result of cortex-m-quickstart.

null 44 Dec 4, 2022
An asynchronous runtime compatible with WebAssembly and non-WebAssembly targets.

Promise x Tokio = Prokio An asynchronous runtime compatible with WebAssembly and non-WebAssembly targets. Rationale When designing components and libr

Yew Stack 29 Feb 6, 2023
Compile-time stuff and other goodies for rustaceans 🦀

?? bagel: Always baked, never fried bagel is a collection of macros and other things that we frequently use at Skytable, primarily to get work done at

Skytable 3 Jul 4, 2022