A massively parallel, optimal functional runtime in Rust

Overview

High-order Virtual Machine (HVM)

High-order Virtual Machine (HVM) is a pure functional compile target that is lazy, non-garbage-collected and massively parallel. It is also beta-optimal, meaning that, in several cases, it can be exponentially faster than most functional runtimes, including Haskell's GHC.

That is possible due to a new model of computation, the Interaction Net, which combines the Turing Machine with the Lambda Calculus. Previous implementations of this model have been inefficient in practice, however, a recent breakthrough has drastically improved its efficiency, giving birth to the HVM. Despite being a prototype, it already beats mature compilers in many cases, and is set to scale towards uncharted levels of performance.

Welcome to the inevitable parallel, functional future of computers!

Usage

1. Install it

First, install Rust. Then, type:

cargo install hvm

2. Create an HVM file

HVM files look like untyped Haskell. Save the file below as main.hvm:

// Creates a tree with `2^n` elements
(Gen 0) = (Leaf 1)
(Gen n) = (Node (Gen (- n 1)) (Gen (- n 1)))

// Adds all elements of a tree
(Sum (Leaf x))   = x
(Sum (Node a b)) = (+ (Sum a) (Sum b))

// Performs 2^n additions in parallel
(Main n) = (Sum (Gen n))

The program above creates a perfect binary tree with 2^n elements and adds them up. Since it is recursive, HVM will parallelize it automatically.

3. Run and compile

hvm r main 10                      # runs it with n=10
hvm c main                         # compiles HVM to C
clang -O2 main.c -o main -lpthread # compiles C to a binary
./main 30                          # runs it with n=30

The program above runs in about 6.4 seconds on a modern 8-core processor, while the identical Haskell code takes about 19.2 seconds on the same machine with GHC. This is HVM: write a functional program, get a parallel C runtime. And that's just the tip of the iceberg!

See Nix usage documentation here.

Benchmarks

HVM has two main advantages over GHC: automatic parallelism and beta-optimality. I've selected 5 common micro-benchmarks to compare them. Keep in mind that HVM is still an early prototype, so it obviously won't beat GHC in general, but it does quite well already and should improve steadily as optimizations are implemented. Tests were compiled with ghc -O2 for Haskell and clang -O2 for HVM, on an 8-core M1 Max processor. The complete files to replicate these results are in the /bench directory.

List Fold (Sequential)

main.hvm:
// Folds over a list
(Fold Nil         c n) = n
(Fold (Cons x xs) c n) = (c x (Fold xs c n))

// A list from 0 to n
(Range 0 xs) = xs
(Range n xs) =
  let m = (- n 1)
  (Range m (Cons m xs))

// Sums a big list with fold
(Main n) =
  let size = (* n 1000000)
  let list = (Range size Nil)
  (Fold list λaλb(+ a b) 0)

main.hs:
-- Folds over a list
fold Nil         c n = n
fold (Cons x xs) c n = c x (fold xs c n)

-- A list from 0 to n
range 0 xs = xs
range n xs =
  let m = n - 1
  in range m (Cons m xs)

-- Sums a big list with fold
main = do
  n <- read.head <$> getArgs :: IO Word32
  let size = 1000000 * n
  let list = range size Nil
  print $ fold list (+) 0

(Benchmark chart omitted; the lower the better.)

In this micro-benchmark, we just build a huge list of numbers, and fold over it to sum them. Since lists are sequential, and since there are no higher-order lambdas, HVM doesn't have any technical advantage over GHC. As such, both runtimes perform very similarly.

Tree Sum (Parallel)

main.hvm:
// Creates a tree with `2^n` elements
(Gen 0) = (Leaf 1)
(Gen n) = (Node (Gen (- n 1)) (Gen (- n 1)))

// Adds all elements of a tree
(Sum (Leaf x))   = x
(Sum (Node a b)) = (+ (Sum a) (Sum b))

// Performs 2^n additions
(Main n) = (Sum (Gen n))

main.hs:
-- Creates a tree with 2^n elements
gen 0 = Leaf 1
gen n = Node (gen (n - 1)) (gen (n - 1))

-- Adds all elements of a tree
sun (Leaf x)   = x
sun (Node a b) = sun a + sun b

-- Performs 2^n additions
main = do
  n <- read.head <$> getArgs :: IO Word32
  print $ sun (gen n)

TreeSum recursively builds and sums all elements of a perfect binary tree. HVM outperforms Haskell by a wide margin because this algorithm is embarrassingly parallel, allowing it to fully use the available cores.

QuickSort (Parallel)

main.hvm:
// QuickSort
(QSort p s Nil)          = Empty
(QSort p s (Cons x Nil)) = (Single x)
(QSort p s (Cons x xs))  =
  (Split p s (Cons x xs) Nil Nil)

// Splits list in two partitions
(Split p s Nil min max) =
  let s   = (>> s 1)
  let min = (QSort (- p s) s min)
  let max = (QSort (+ p s) s max)
  (Concat min max)
(Split p s (Cons x xs) min max) =
  (Place p s (< p x) x xs min max)

// Sorts and sums n random numbers
(Main n) =
  let list = (Randoms 1 (* 100000 n))
  (Sum (QSort Pivot Pivot list))

main.hs:
-- QuickSort
qsort p s Nil          = Empty
qsort p s (Cons x Nil) = Single x
qsort p s (Cons x xs)  =
  split p s (Cons x xs) Nil Nil

-- Splits list in two partitions
split p s Nil min max =
  let s'   = shiftR s 1
      min' = qsort (p - s') s' min
      max' = qsort (p + s') s' max
  in  Concat min' max'
split p s (Cons x xs) min max =
  place p s (p < x) x xs min max

-- Sorts and sums n random numbers
main = do
  n <- read.head <$> getArgs :: IO Word32
  let list = randoms 1 (100000 * n)
  print $ sun $ qsort pivot pivot list

This test modifies QuickSort to return a concatenation tree instead of a flat list. This makes it embarrassingly parallel, allowing HVM to outperform GHC by a wide margin again. It even beats Haskell's sort from Data.List! Note that flattening the tree would make the algorithm sequential. That's why we didn't choose MergeSort: merge operates on lists. In general, trees should be favoured over lists on HVM.

Composition (Optimal)

main.hvm:
// Computes f^(2^n)
(Comp 0 f x) = (f x)
(Comp n f x) = (Comp (- n 1) λk(f (f k)) x)

// Performs 2^n compositions
(Main n) = (Comp n λx(x) 0)

main.hs:
-- Computes f^(2^n)
comp 0 f x = f x
comp n f x = comp (n - 1) (\x -> f (f x)) x

-- Performs 2^n compositions
main = do
  n <- read.head <$> getArgs :: IO Int
  print $ comp n (\x -> x) (0 :: Int)

This chart isn't wrong: HVM is exponentially faster for function composition, due to optimality, depending on the target function. There is no parallelism involved here. In general, if the composition of a function f has a constant-size normal form, then f^(2^N)(x) is linear-time (O(N)) on HVM, and exponential-time (O(2^N)) on GHC. This can be taken advantage of to design novel functional algorithms. I highly encourage you to try composing different functions and watching how their complexity behaves. Can you tell if it will be linear or exponential? Or how recursion will affect it? That's a very insightful experience!

Lambda Arithmetic (Optimal)

main.hvm:
// Increments a Bits by 1
(Inc xs) = λex λox λix
  let e = ex
  let o = ix
  let i = λp (ox (Inc p))
  (xs e o i)

// Adds two Bits
(Add xs ys) = (App xs λx(Inc x) ys)

// Multiplies two Bits
(Mul xs ys) = 
  let e = End
  let o = λp (B0 (Mul p ys))
  let i = λp (Add ys (B0 (Mul p ys)))
  (xs e o i)

// Squares (n * 100k)
(Main n) =
  let a = (FromU32 32 (* 100000 n))
  let b = (FromU32 32 (* 100000 n))
  (ToU32 (Mul a b))

main.hs:
-- Increments a Bits by 1
inc xs = Bits $ \ex -> \ox -> \ix ->
  let e = ex
      o = ix
      i = \p -> ox (inc p)
  in get xs e o i

-- Adds two Bits
add xs ys = app xs (\x -> inc x) ys

-- Multiplies two Bits
mul xs ys = 
  let e = end
      o = \p -> b0 (mul p ys)
      i = \p -> add ys (b0 (mul p ys))
  in get xs e o i

-- Squares (n * 100k)
main = do
  n <- read.head <$> getArgs :: IO Word32
  let a = fromU32 32 (100000 * n)
  let b = fromU32 32 (100000 * n)
  print $ toU32 (mul a b)

This example takes advantage of beta-optimality to implement multiplication using lambda-encoded bitstrings. Once again, HVM halts instantly, while GHC struggles to deal with all these lambdas. Lambda encodings have wide practical applications. For example, Haskell's Lists are optimized by converting them to lambdas (foldr/build), its Free Monads library has a faster version based on lambdas, and so on. HVM's optimality opens doors for an entire unexplored field of lambda-encoded algorithms that were simply impossible before.

Charts made on plotly.com.

How is that possible?

Check HOW.md.

How can I help?

Most importantly, if you appreciate our work, help spread the project! Posting on Reddit, in communities, etc. helps more than you think.

Second, I'm looking for partners! I believe HVM's current design is ready to scale and become the fastest runtime in the world, but much needs to be done to get there. We're also building interesting products on top of it. If you'd like to get involved, please email me, or just send me a personal message on Twitter.

Community

To just follow the project, join our Telegram Chat, the Kindelia community on Discord or Matrix!

Comments
  • Incorrect result on some inputs


    HVM produces incorrect results on some inputs, one of which being:

    (Main arg) = ((λa (a (λb ((λc (λd (b (c d)))) a)))) (λe (e e)))
    

    On this input, HVM produces the result λx1 λx2 ((x1 x2) (x1 x2)) while the correct result is λx1 λx2 ((x1 x1) (x2 x2)) (as verified by https://crypto.stanford.edu/~blynn/lambda/, with the corresponding input syntax by adding dots after the variable of each lambda: ((λa. (a (λb. ((λc. (λd. (b (c d)))) a)))) (λe. (e e)))).

    Besides, on other inputs such as:

    ((λa λb ((b a) λc λd (λe (d a) a)) λf (f f)) λz z)
    

    HVM produces the result λx1 (x1 λx2 (x2 _)) which is not even well-formed.

    discussion 
    opened by Ekdohibs 41
  • Improvements needed, and planned features


    Better allocator

    Currently, the C allocator just reserves 8 GB of memory when the program starts, and, if there are multiple threads, that memory is split between them. So, for example, if there are 8 threads, each one gets 1 GB to work with. A worker can read from other workers' memory (which may happen for shared data, i.e., when they pass through a DP0/DP1 pointer), but only the owner of the slice can allocate memory from it. Each worker also keeps its own freelist.

    As such, a huge improvement would be a thread-safe, global, possibly arena-based alloc and free.

    Better scheduler

    Right now, the task scheduler just partitions threads evenly over the normal form of the result. So, for example, if the result is (Pair A B), and there are 8 threads, 4 will be assigned to reduce A and 4 to reduce B. This is obviously very limited: if A reduces quickly, the scheduler won't re-assign its threads to help reduce B, so the program will only use half of the cores from that point on. This is why algorithms that return lists are often slower: they barely use the extra threads. In the worst case, execution falls back to being single-threaded.

    A huge improvement would be a proper scheduler that adds potential redexes to a task pool. When I attempted to do that, the cost of synchronization added too much overhead, ruining the performance. A heuristic worth considering would be to limit the depth of the redexes for which global tasks are emitted; anything below, say, depth=8, would just be reduced by the thread that reaches it. Many improvements can be made, though.

    I32, I64, U64, F32, F64

    Right now, HVM only supports U32. The numeric types above should be supported too. I32 and F32 should be easy to add, since they are unboxed, like U32. I64, U64 and F64 require implementing boxed numbers, since they don't fit inside a 64-bit Lnk (due to the 4-bit pointer), but shouldn't be hard. Discussion on whether we should have unboxed 60-bit variants is valid.

    On #81.

    Improve the left-hand side flattener

    On #54.

    A nice API to use as a Rust library

    Using HVM as a pure Rust lib inside other projects should be very easy, especially considering the interpreter has fairly good performance (only about 50% slower than the single-threaded compiled version). We must think of the right API to expose these features in a Rust-idiomatic, future-proof way. Input from the Rust community is appreciated.

    A JIT compiler

    Right now, HVM compiles to C. That means a C compiler like clang or gcc is necessary to produce executables, which means that it isn't portable when used as a lib. Of course, libs can just use the interpreter, but it is ~3x slower than the compiler, and is not parallel. Ideally, we'd instead JIT-compile HVM files using WASM or perhaps Cranelift, allowing lib users to enjoy the full speed of HVM in a portable way.

    IO

    Adding IO should be easy: just write an alternative version of normalize that, instead of just outputting the normal form, pattern-matches it against the expected IO type, and performs IO operations as demanded.

    Windows support

    The compiler doesn't support Windows yet. The use of -lpthread may be an issue. The interpreter should work fine, though.

    On #52.

    GPU compilers

    Compiling to the GPU would be amazing, and I see no reason this shouldn't be possible. The only complications I can think of are:

    1. How do we alloc inside a GPU work unit?

    2. Reduce() is highly branching, which may destroy performance on a GPU. But it can be altered to group tasks by category and merge all branches in one go. The reduce function essentially does this: branch -> match -> alloc -> subst -> free. The last 4 steps are very similar regardless of which branch is taken. So, instead, the branch part can just prepare some data, and the match -> alloc -> subst -> free passes can all be moved to the same code path, just with different indices and parameters. This would avoid thread divergence.

    3. GPGPU is a laborious mess. I'm under OSX, so I can't use CUDA; OpenCL is terrible; WebGPU is too new (and from what I could see, doesn't have the ability to read/write from the same buffer in the same pass!?). Regardless, we'd probably need different versions, Metal for OSX, CUDA and OpenCL for Windows depending on the GPU, etc. So, that's a lot of work that I can't do myself, it would require expensive engineers.

    Compile Kind to HVM

    HVM is meant to be a "low-level target", which is kinda confusing because it is actually very high-level, but in the sense that users shouldn't be writing it directly. Ideally, Kind will compile to HVM, but it needs a lot of adjustments before that is possible.

    Add a checker for the "derivatives of (λx(x x) λx(x x)) not allowed" rule

    See this issue for more information, as well as the superposed duplication section on HOW.md.

    On #61.

    Tests

    This project has basically no tests. Write tests!


    Will edit the thread as I think of more things. Feel free to add to this thread.

    enhancement 
    opened by VictorTaelin 35
  • Windows support


    Right now the generated C code can't be compiled on Windows due to the use of <pthread.h> etc.

    One way it could be done would be to compile it using a compatibility layer like Cygwin, but I never did this before.

    Help from some Windows programmer would be appreciated. :3

    enhancement 
    opened by steinerkelvin 19
  • Modified example program outputs expression instead of result


    I modified the example program:

    // Creates a tree with `2^n` elements
    (Gen 0) = (Leaf 1)
    (Gen n) = (Node (Gen(- n 1)) (Gen(- n 1)))
    
    // Adds all elements of a tree
    (Sum (Leaf x))   = x
    (Sum (Node a b)) = (+ ((+ (Sum a) (Sum b))) ((- (Sum a) (Sum b))))
    
    // Performs 2^n additions in parallel
    (Main n) = (Sum (Gen n))
    
    

    It often outputs 256, but sometimes outputs an expression. The expressions seem to be different:

    $ time ./main 8
    Rewrites: 304812 (2.60 MR/s).
    Mem.Size: 198923 words.
    
    ((((64+((32+((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))+(32-((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))))+(64-((32+((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))))))+(32-((16+((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 
1))+(1-(Sum 1)))-2)))))))+(16-((8+((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))))+(8-((4+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(4-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))))))))))+128)+0)
    
    real	0m0.146s
    user	0m0.052s
    sys	0m0.111s
    
    Rewrites: 305163 (2.14 MR/s).
    Mem.Size: 200042 words.
    
    (256+(((64+((((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))+32)+(((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 
1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))-32)))+(64-((((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))+32)+(((16+((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8)))+(16-((((((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))+8)+(((((((1+(Sum 1))+(1-(Sum 
1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))+((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2)))+(((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))-((((1+(Sum 1))+(1-(Sum 1)))+2)+(((1+(Sum 1))+(1-(Sum 1)))-2))))-8))))-32))))-128))
    
    real	0m0.164s
    user	0m0.045s
    sys	0m0.130s
    
    bug 
    opened by bjm5 18
  • `cargo install hvm` considered harmful


    ToT (61a331ffdc7c04567245f2cae969ab4643e91f20) doesn't appear to compile with gcc version 11.2.0 (from Ubuntu 21.10)

    $ hvm c bench/Fibonacci/main.hvm
    Compiled to 'bench/Fibonacci/main.c'.
    $ gcc -O2 bench/Fibonacci/main.c
    bench/Fibonacci/main.c:932:5: error: variably modified ‘normal_seen_data’ at file scope
      932 | u64 normal_seen_data[NORMAL_SEEN_MCAP];
          |     ^~~~~~~~~~~~~~~~
    

    Digging in I see

    #define NORMAL_SEEN_MCAP (HEAP_SIZE/sizeof(u64)/(sizeof(u64)*8))
    const u64 HEAP_SIZE = 8 * U64_PER_GB * sizeof(u64);
    
    discussion 
    opened by tommythorn 16
  • Generated C code runs forever


    @VictorTaelin

    Just to be sure, all works fine there?

    Hmm, I'm actually not sure. I tried one of the benchmarks which ended up compiling ok, with only the following warning:

    bash-5.1$ clang -O2 compose_id.hovm.out.c -o main -lpthread
    compose_id.hovm.out.c:974:23: warning: incompatible pointer to integer conversion assigning to 'Thd' (aka 'unsigned long') from 'void *'
          [-Wint-conversion]
        workers[t].thread = NULL;
                          ^ ~~~~
    1 warning generated.
    

    But when I try to run the resulting binary, it doesn't seem to finish running, instead seemingly getting stuck:

    bash-5.1$ ./main
    Reducing.
    
    

    It hasn't printed anything after that new line for more than 30 minutes (I ended up killing it). Not sure if it's actually not working properly or if it's just my slow CPU, though. When I looked at my process manager, the process didn't seem to be using any CPU cycles either.

    What is your CPU

    $ lscpu
    Architecture:            x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Address sizes:         40 bits physical, 48 bits virtual
      Byte Order:            Little Endian
    CPU(s):                  4
      On-line CPU(s) list:   0-3
    Vendor ID:               AuthenticAMD
      Model name:            HP Hexa-Core 2.0GHz
        CPU family:          22
        Model:               48
        Thread(s) per core:  1
        Core(s) per socket:  4
        Socket(s):           1
        Stepping:            1
        Frequency boost:     enabled
        CPU max MHz:         2000.0000
        CPU min MHz:         1000.0000
        BogoMIPS:            3992.48
        Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_op
                             t pdpe1gb rdtscp lm constant_tsc rep_good acc_power nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3
                             cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefet
                             ch osvw ibs skinit wdt topoext perfctr_nb bpext ptsc perfctr_llc cpb hw_pstate ssbd vmmcall bmi1 xsaveopt arat npt lbrv svm_lo
                             ck nrip_save tsc_scale flushbyasid decodeassists pausefilter pfthreshold overflow_recov
    Virtualization features:
      Virtualization:        AMD-V
    Caches (sum of all):
      L1d:                   128 KiB (4 instances)
      L1i:                   128 KiB (4 instances)
      L2:                    2 MiB (1 instance)
    NUMA:
      NUMA node(s):          1
      NUMA node0 CPU(s):     0-3
    Vulnerabilities:
      Itlb multihit:         Not affected
      L1tf:                  Not affected
      Mds:                   Not affected
      Meltdown:              Not affected
      Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
      Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
      Spectre v2:            Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
      Srbds:                 Not affected
      Tsx async abort:       Not affected
    

    Originally posted by @nothingnesses in https://github.com/Kindelia/HOVM/issues/15#issuecomment-1021494829

    opened by nothingnesses 15
  • Improve Nix usage instructions. Regenerate Cargo.nix.


    Since Cargo.lock was changed with commit 6975488b0c253ee598981aaaf3260d7d9cfab74c. (Note: when Cargo.lock changes, crate2nix generate needs to be ran to update Cargo.nix)

    opened by nothingnesses 14
  • C Runtime rationale


    Hi there, this is a really interesting project. I was wondering why the target runtime is obtained transpiling to a C template, instead of directly using rust itself (eg. using Rayon); another option could be targeting something like LLVM.

    Thank you

    discussion 
    opened by csuriano23 13
  • Code crashes with segfault


    First of all, thanks for this very interesting project!

    I tried some of the examples, and the last one about lambda arithmetic:

    // Increments a Bits by 1
    (Inc xs) = λex λox λix
      let e = ex
      let o = ix
      let i = λp (ox (Inc p))
      (xs e o i)
    
    // Adds two Bits
    (Add xs ys) = (App xs λx(Inc x) ys)
    
    // Multiplies two Bits
    (Mul xs ys) = 
      let e = End
      let o = λp (B0 (Mul p ys))
      let i = λp (Add ys (B0 (Mul p ys)))
      (xs e o i)
    
    // Squares (n * 100k)
    (Main n) =
      let a = (FromU32 32 (* 100000 n))
      let b = (FromU32 32 (* 100000 n))
      (ToU32 (Mul a b))
    

    just gives me a segfault:

    $ hvm c main && clang -O2 main.c -o main -lpthread && ./main 25
    Compiled to 'main.c'.
    Segmentation fault (core dumped)
    

    Tested with commit 1da28755.

    Furthermore, the QuickSort example just outputs:

    (Sum (QSort (Pivot) (Pivot) (Randoms 1 2500000)))
    

    Probably because neither Randoms, Sum, nor Pivot are defined ...

    Now back to studying HOW.md ... ;)

    bug 
    opened by 01mf02 12
  • Checker for the derivatives of `(λx(x x) λx(x x))`


    If a lambda that clones its argument is itself cloned, then its clones aren't allowed to clone each other.

    See https://github.com/Kindelia/HVM/issues/44 for more information, as well as the superposed duplication section on HOW.md.

    Example of what is not allowed:

    let f = λs λz (s (s z)) # a lambda that clones its argument
    let g = f               # a clone of that lambda
    (g f)                   # the clone will clone the original
    
    enhancement 
    opened by steinerkelvin 11
  • Pattern matching on a lambda?


    Awesome project, got to it via HN. I thought I'd try to port microKanren to HVM (why not), but it requires matching on a lambda, or a slight change in the semantics.

    Here's what I came up with so far; it's likely I missed a few things or got them wrong.

    (Car (Cons x y)) = x
    (Cdr (Cons x y)) = y
    (Vareq (Var v1) (Var v2)) = (== v1 v2)
    
    (ExtS x v s) = (Cons (Cons x v) s)
    
    (EgInner sc NIL) = MZERO
    (EgInner sc s) = (Unit (Cons s (Cdr sc)))
    
    (Eg u v) =
        λsc(EgInner sc (Unify u v (Car sc)))
    
    (Unit sc) = (Cons sc MZERO)
    MZERO = NIL
    
    (UnifyInner u v NIL) = NIL
    (UnifyInner (Var v) (Var v) s) = s
    (UnifyInner (Var u) v s) = (ExtS (Var u) v s)
    (UnifyInner u (Var v) s) = (ExtS (Var v) u s)
    (UnifyInner (Cons hu tu) (Cons hv tv) s) =
        (let s = (Unify hu hv s);
         (Unify tu tv s))
    (UnifyInner u u s) = s
    
    (Unify u v s) =
        (UnifyInner (Walk u s) (Walk v s) s)
    
    (Walk u NIL) = u
    (Walk (Var u) (Cons (Cons (Var u) v) s)) = (Walk v s)
    (Walk (Var u) .) = (Var u)
    
    (CallFresh f) = @sc(let c = (Cdr sc);
                        ((f (Var c)) (Cons (Car sc) (+ c 1))))
    (Disj g h) = @sc(Mplus (g sc) (h sc))
    (Conj g h) = @sc(Bind (g sc) h)
    
    (Mplus NIL y) = y
    (Mplus (Cons h t) y) = (Cons h (Mplus t) y)
    (Mplus fn x) = @(Mplus x (fn))
    
    (Bind NIL g) = MZERO
    (Bind (Cons h t) g) = (MPlus (g h) (Bind t g))
    (Bind fn g) = @(Bind (fn) g)
    
    EMPTY_STATE = (Cons NIL 0)
    
    // uK ends here
    
    (Zzz g) = @sc(@(g sc))
    
    (ConjL (Cons g NIL)) = (Zzz g)
    (ConjL (Cons g gs)) = (Conj (Zzz g) (ConjL gs))
    
    (DisjL (Cons g NIL)) = (Zzz g)
    (DisjL (Cons g gs)) = (Disj (Zzz g) (DisjL gs))
    
    (Conde (Cons l alts)) = (DisjL (Cons (ConjL l) (Conde alts)))
    
    (Fresh NIL goals) = (ConjL goals)
    (Fresh (Cons v lvars) goals) = @v(Fresh lvars goals)
    
    (Pull NIL) = NIL
    (Pull (Cons x y)) = (Cons x y)
    (Pull f) = (Pull (f))
    
    (Take 0 f) = NIL
    (Take n NIL) = NIL
    (Take n (Cons x y)) = (Cons x (Take (- n 1) y))
    (Take n f)  = (Take n (Pull f))
    
    //
    
    (Fives x) = (Disj (Eg x 5) (Fives x))
    
    (Main) = (let f = (CallFresh Fives);
              (f EMPTY_STATE))
    
    opened by bsless 11
  • Compiled HVM programs need a better CLI interface

    Compiled HVM programs need a better CLI interface

    Running a program compiled with hvm compile somefile.hvm and then cargo build --release, I am greeted with this message

    $ ./target/release/somefile 
    
    somefile 1.0.17-beta
    A massively parallel functional runtime.
    
    USAGE:
        somefile <SUBCOMMAND>
    
    OPTIONS:
        -h, --help       Print help information
        -V, --version    Print version information
    
    SUBCOMMANDS:
        compile    Compile a file to Rust
        help       Print this message or the help of the given subcommand(s)
        run        Load a file and run an expression
    

    That means that this program is carrying not only the interpreted and precompiled functions, but also the whole compiler.

    That is unintuitive; what I expected was to be able to call my program directly (which would correspond to calling it with somefile run "(Main arg)").

    We should probably create a separate CLI for the compiled programs to make this simpler to use.
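A minimal entry point for a compiled program could look roughly like the sketch below: treat the first CLI argument as the argument to `(Main ...)` instead of requiring a `run` subcommand. Here `run_main` is a hypothetical stand-in for the generated `(Main n)` body, not a real HVM API.

```rust
use std::env;

// Hypothetical stand-in for the precompiled (Main n) reduction; the
// real generated code would call into the compiled runtime instead.
fn run_main(n: u64) -> u64 {
    n // placeholder body
}

// `./somefile 30` instead of `./somefile run "(Main 30)"`.
fn main() {
    let n: u64 = env::args()
        .nth(1)
        .map(|s| s.parse().expect("expected a numeric argument"))
        .unwrap_or(0);
    println!("{}", run_main(n));
}
```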

    opened by developedby 0
  • Speed up default_heap_size

    Speed up default_heap_size

    System::new_all does a lot more than just measuring available memory, so let's cut those extra syscalls to make startup faster.

    Bench (examples/hello):

    Old:

    $ hyperfine "target/release/main run '(Main 0)'"
    Benchmark 1: target/release/main run '(Main 0)'
      Time (mean ± σ):     513.4 ms ±  36.6 ms    [User: 285.6 ms, System: 455.2 ms]
      Range (min … max):   469.5 ms … 587.2 ms    10 runs
    

    New:

    $ hyperfine "target/release/main run '(Main 0)'"
    Benchmark 1: target/release/main run '(Main 0)'
      Time (mean ± σ):      77.5 ms ±   4.9 ms    [User: 18.0 ms, System: 59.9 ms]
      Range (min … max):    67.2 ms …  85.6 ms    33 runs
    
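For intuition, one way to obtain only the available-memory figure without `sysinfo`'s full `System::new_all()` scan is to read it directly; below is a Linux-only, std-only sketch (the actual PR presumably just narrows sysinfo's refresh scope instead):

```rust
use std::fs;

// Read only the MemAvailable figure from /proc/meminfo, skipping the
// process/CPU/disk enumeration that a full system scan performs.
fn available_memory_kib() -> Option<u64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    meminfo
        .lines()
        .find(|l| l.starts_with("MemAvailable:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|kib| kib.parse().ok())
}

fn main() {
    // A default heap size could then be derived as a fraction of this.
    if let Some(kib) = available_memory_kib() {
        println!("available: {} KiB", kib);
    }
}
```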
    opened by funbringer 0
  • Multithreaded sequential programs spend most their active time waiting

    Multithreaded sequential programs spend most their active time waiting

    For programs that are dominated by expressions that must be reduced in sequential order, there is a very large performance drop proportional to the number of threads used.

    Take this example program:

    (Func 0 b) = b
    (Func a b) = (Func (- a 1) (^ (<< b 1) b))
    
    (Main) = (Func 100000 1)
    

    Running with one thread, I get ~30MR/s, while with 2 threads I get ~20MR/s and with 16 threads ~10MR/s.

    Analyzing what the program spends its time on using perf, it looks like most of the time the threads are trying to steal a job from another task, failing (because this program is inherently sequential and there is not enough parallel work for all the threads), and then waiting to try again.

    While it's true that we should try to write parallelizable programs for the HVM, for tasks that are mostly sequential we should have better behaviour.
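A common mitigation for this pattern is exponential backoff on failed steal attempts, so idle workers stop hammering the queues of busy ones. A std-only sketch of the idea (illustrative only, not HVM's actual scheduler):

```rust
use std::{thread, time::Duration};

// Backoff for a worker that repeatedly fails to steal work: spin
// briefly first (cheap, low latency), then yield, then sleep with
// growing pauses once work looks genuinely scarce.
struct Backoff {
    step: u32,
}

impl Backoff {
    fn new() -> Self {
        Backoff { step: 0 }
    }

    // Called after each failed steal attempt.
    fn snooze(&mut self) {
        if self.step < 6 {
            for _ in 0..(1u32 << self.step) {
                std::hint::spin_loop();
            }
        } else if self.step < 10 {
            thread::yield_now();
        } else {
            thread::sleep(Duration::from_micros(1u64 << self.step.min(14)));
        }
        self.step += 1;
    }

    // Called after a successful steal, to restore low-latency spinning.
    fn reset(&mut self) {
        self.step = 0;
    }
}
```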

    opened by developedby 2
  • Programs with a lot of dups get stuck

    Programs with a lot of dups get stuck

    (Func 0 b) = b
    (Func a b) = (Func (- a 1) (+ b b))
    
    (Main) = (Func 100000000 2)
    

    Running this code with a very large a makes it seemingly never end (expected ~10s, waited 10m). Other programs that create lots and lots of dups also seem to cause this.

    bug 
    opened by developedby 5
  • Questions about Lambda operator and parallel evaluation.

    Questions about Lambda operator and parallel evaluation.

    Are the lambda operator and let syntax necessary? E.g. aren't these rewrite rules sufficient to emulate them:

    (Replace (Var var) value (Var var)) = value
    (Replace (Var var) value (Lambda (Var arg) body)) = 
        (Lambda (Var arg) (Replace (Var var) value body)) -- where var != arg for proper scoping (idk how to encode this as a rewrite rule)
    (Apply (Lambda (Var x) body) y) = (Replace (Var x) y body)
    

    Example: (Apply (Lambda (Var X) (Var X)) (Var Y)) should evaluate to (Var Y). Is there a reason these exist as primitives (convenience? efficiency?)? What about dup?

    (Dup x) = (Dup' x x)
    -- ... extra rules for lambda/replace that propagate the dup appropriately
    -- can be used in code by pattern matching against Dup' e.g. (F (Dup' x y)) = Whatever
    

    Basically, I am asking whether rewrite rules are sufficient to encode all these computations (according to the interaction net paper I've read, it seems so). So what's the point of these primitives if this is intended as a bytecode/IR and not human-writable?
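For concreteness, the Replace/Apply rules above can be transcribed as a plain function over a small term type. A Rust sketch (the names `Term`, `replace`, `apply` are illustrative; note that the `var != arg` side condition only handles shadowing, and full capture avoidance would additionally require renaming bound variables):

```rust
// A minimal lambda-term type mirroring the Var/Lambda/Apply constructors
// from the rewrite rules above.
#[derive(Clone, Debug, PartialEq)]
enum Term {
    Var(String),
    Lam(String, Box<Term>),
    App(Box<Term>, Box<Term>),
}

// (Replace (Var var) value body): substitute `value` for `var` in `body`.
fn replace(var: &str, value: &Term, body: &Term) -> Term {
    match body {
        Term::Var(v) if v == var => value.clone(),
        Term::Var(_) => body.clone(),
        // The `var != arg` side condition: only descend if not shadowed.
        Term::Lam(arg, b) if arg != var => {
            Term::Lam(arg.clone(), Box::new(replace(var, value, b)))
        }
        Term::Lam(_, _) => body.clone(),
        Term::App(f, a) => Term::App(
            Box::new(replace(var, value, f)),
            Box::new(replace(var, value, a)),
        ),
    }
}

// (Apply (Lambda (Var x) body) y) = (Replace (Var x) y body)
fn apply(f: &Term, arg: &Term) -> Term {
    match f {
        Term::Lam(x, body) => replace(x, arg, body),
        _ => panic!("not a lambda"),
    }
}
```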

    Secondly, I am still curious about how HVM handles parallel computation. I could probably dig around the code more but a succinct explanation would be helpful I think.

    So suppose the following rules:

    (And True x)  = x
    (And False x) = False
    

    And suppose I want to evaluate an expression in the form (And True (And False (And ...))). For each And, we can use the rewrite rules above to rewrite each nesting in parallel (assuming infinite cores). Is this done in practice, and how? What struct in C does each of these evaluate to?
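For intuition only: HVM's actual runtime packs nodes into tagged 64-bit words in a flat heap (and the generated C mirrors that), but the shape of a constructor cell and of one `And` rewrite step can be sketched with a simplified model. The point to notice is that a redex only touches its own cells, which is why independent redexes can be claimed and reduced by different threads.

```rust
// Simplified, illustrative node layout: a tag plus child indices into a
// flat heap. HVM's real cells are denser (tag bits packed into u64s).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tag { True, False, And }

#[derive(Clone, Copy, Debug)]
struct Cell {
    tag: Tag,
    args: [usize; 2], // heap indices of the children (unused slots = 0)
}

// One rewrite step for the two `And` rules. Each call only reads and
// writes cells reachable from `at`, so disjoint redexes are independent.
fn reduce_and(heap: &mut Vec<Cell>, at: usize) {
    let node = heap[at];
    if node.tag == Tag::And {
        match heap[node.args[0]].tag {
            Tag::True => heap[at] = heap[node.args[1]], // (And True x) = x
            Tag::False => heap[at] = Cell { tag: Tag::False, args: [0, 0] },
            Tag::And => {} // first argument not yet reduced
        }
    }
}
```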

    opened by vladov3000 1
Rholang runtime in rust

Rholang Runtime A rholang runtime written in Rust.

Jerry.Wang 17 Sep 23, 2022
JIT compiler and runtime for a toy language, using Cranelift

Hello! This is a simple demo that JIT-compiles a toy language, using Cranelift. It uses the new JIT interface in development here. JIT takes care of m

Bytecode Alliance 468 Jan 7, 2023
Equilibrium substrate-based node implementation and runtime

Equilibrium & Genshiro Equilibrium is a DeFi 2.0 cross-chain money market protocol with high leverage. With Equilibrium users can: Lend: lend out asse

null 6 Apr 18, 2023
First Git on Rust is a reimplementation of Git in Rust, in order to learn about Rust, C and Git.

First Git on Rust First Git on Rust is a reimplementation of Git in Rust, in order to learn about Rust, C and Git. Reference project This project refer to the

Nobkz 1 Jan 28, 2022
A stupid macro that compiles and executes Rust and spits the output directly into your Rust code

inline-rust This is a stupid macro inspired by inline-python that compiles and executes Rust and spits the output directly into your Rust code. There

William 19 Nov 29, 2022
Learn-rust - An in-depth resource to learn Rust 🦀

Learning Rust Hello friend! Welcome to my "Learning Rust" repo, a home for my notes as I'm learning Rust. I'm structuring everything into lesson

Lazar Nikolov 7 Jan 28, 2022
A highly modular Bitcoin Lightning library written in Rust. Its Rust-Lightning, not Rusty's Lightning!

Rust-Lightning is a Bitcoin Lightning library written in Rust. The main crate, lightning, does not handle networking, persistence, or any other I/O. Thus, it is runtime-agnostic, but users must implement basic networking logic, chain interactions, and disk storage. More information is available in the About section.

Lightning Dev Kit 850 Jan 3, 2023
Telegram bot that helps you run Rust code in Telegram via the Rust playground

RPG_BOT (Rust Playground Bot) A Telegram bot that helps you run Rust code in Telegram via the Rust playground. Bot interface The bot supports 3 straightforward

TheAwiteb 8 Dec 6, 2022
`Debug` in rust, but only supports valid rust syntax and outputs nicely formatted using pretty-please

dbg-pls A Debug-like trait for rust that outputs properly formatted code Showcase Take the following code: let code = r#" [ "Hello, World!

Conrad Ludgate 12 Dec 22, 2022
Playing with web dev in Rust. This is a sample Rust microservice that can be deployed on Kubernetes.

Playing with web dev in Rust. This is a sample Rust microservice that can be deployed on Kubernetes.

André Gomes 10 Nov 17, 2022
🐀 Building a federated alternative to reddit in rust

Lemmy A link aggregator / Reddit clone for the fediverse. Join Lemmy · Documentation · Report Bug · Request Feature · Releases · Code of Conduct About

LemmyNet 7.2k Jan 3, 2023
Applied offensive security with Rust

Black Hat Rust - Early Access Deep dive into offensive security with the Rust programming language Buy the book now! Summary Whether in movies or main

Sylvain Kerkour 2.2k Jan 2, 2023
Easy-to-use optional function arguments for Rust

OptArgs uses const generics to ensure compile-time correctness. I've taken the liberty of expanding and humanizing the macros in the reference examples.

Jonathan Kelley 37 Nov 18, 2022
A language server for lua written in rust

lua-analyzer lua-analyzer is a lsp server for lua. This is mostly for me to learn the lsp protocol and language analysis so suggestions are helpful. T

null 61 Dec 11, 2022
Rust library that can be reset if you think it's slow

GoodbyeKT Rust library that can be reset if you think it's slow

null 39 Jun 16, 2022
Cargo - The Rust package manager

Cargo downloads your Rust project’s dependencies and compiles your project.

The Rust Programming Language 9.5k Jan 4, 2023
A copypastable guide to implementing simple derive macros in Rust.

A copypastable guide to implementing simple derive macros in Rust. The goal Let's say we have a trait with a getter trait MyTrait {

Imbolc 131 Dec 27, 2022
Rust ABI safe code generator

CGlue offers an easy way to ABI (application binary interface) safety. Just a few annotations and your trait is ready to go!

Auri 142 Jan 2, 2023
An example project demonstrating integration with Rust for the ESP32-S2 and ESP32-C3 microcontrollers.

Rust ESP32 Example An example project demonstrating integration with Rust for the ESP32-S2 and ESP32-C3 microcontrollers.

Espressif Systems 303 Jan 4, 2023