
evalPrimOp is a bottle-neck due to linear search #8

Open
sgraf812 opened this issue Apr 12, 2022 · 3 comments
@sgraf812

sgraf812 commented Apr 12, 2022

While the implementation of, e.g., ...ByteArray.evalPrimOp is rather direct and elegant at the moment

type Name = BS8.ByteString

evalPrimOp :: PrimOpEval -> Name -> [Atom] -> Type -> Maybe TyCon -> M [Atom]
evalPrimOp fallback op args t tc = case (op, args) of

  -- newByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 False
    pure [MutableByteArray baIdx]


  -- newPinnedByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newPinnedByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 True
    pure [MutableByteArray baIdx]

it yields rather slow code. That is because GHC doesn't perform the "obvious" trie-based match optimisation, so we end up with a linear chain of comparisons

if op == "newByteArray#" then ...
else if op == "newPinnedByteArray#" then ...
...

in Core.

I took a profile (on nofib's bernoulli, if that matters) and ...ByteArray.evalPrimOp takes about 10% of time and allocation. Here is an excerpt of the profile

COST CENTRE                          MODULE                            SRC                                                       %time %alloc

evalPrimOp                           Stg.Interpreter.PrimOp.ByteArray  lib/Stg/Interpreter/PrimOp/ByteArray.hs:(71,1)-(884,28)     9.4   10.9
evalPrimOp                           Stg.Interpreter.PrimOp.Addr       lib/Stg/Interpreter/PrimOp/Addr.hs:(23,1)-(357,28)          6.5    8.2
evalStackContinuation.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(355,74)-(394,35)                    4.4    5.8
lookupEnvSO                          Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(630,1)-(648,21)                4.2    2.2
builtinStgEval                       Stg.Interpreter                   lib/Stg/Interpreter.hs:(154,1)-(201,103)                    3.2    2.9
evalPrimOp                           Stg.Interpreter.PrimOp.Float      lib/Stg/Interpreter/PrimOp/Float.hs:(17,1)-(120,28)         2.8    2.9
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(497,45)-(502,23)                    2.8    3.4
evalPrimOp                           Stg.Interpreter.PrimOp.Double     lib/Stg/Interpreter/PrimOp/Double.hs:(17,1)-(127,28)        2.7    3.0
evalExpr                             Stg.Interpreter                   lib/Stg/Interpreter.hs:(423,1)-(533,93)                     2.5    0.6
compare                              Stg.Syntax                        lib/Stg/Syntax.hs:(30,3)-(32,12)                            2.4    0.0
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(504,37)-(510,27)                    2.4    1.2
setInsert                            Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(776,1)-(778,36)                2.1    0.0
evalPrimOp                           Stg.Interpreter.PrimOp.Int        lib/Stg/Interpreter/PrimOp/Int.hs:(25,1)-(149,28)           2.0    2.1
evalPrimOp                           Stg.Interpreter.PrimOp.SmallArray lib/Stg/Interpreter/PrimOp/SmallArray.hs:(27,1)-(157,28)    1.9    1.9

I think a bit of focus on optimising evalPrimOp may well speed up the interpreter by 50%. One way to do so would be to use a HashMap or a trie for the lookup.
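To illustrate the suggestion, here is a minimal sketch of table-based primop dispatch. The names `primOpTable` and the simplified `Atom`/`Name` types are hypothetical stand-ins for the interpreter's real types; `Data.Map.Strict` is used for self-containedness, but `Data.HashMap.Strict` from unordered-containers would give the O(1) lookup suggested above.

```haskell
import qualified Data.Map.Strict as Map

-- Simplified stand-ins (hypothetical): the real interpreter uses
-- BS8.ByteString names and a richer Atom type inside the M monad.
type Name = String
data Atom = IntV Int deriving (Eq, Show)

-- Build the dispatch table once; each entry maps a primop name to its
-- handler. A lookup is then O(log n) -- or O(1) with a HashMap --
-- instead of a linear chain of string comparisons.
primOpTable :: Map.Map Name ([Atom] -> Maybe [Atom])
primOpTable = Map.fromList
  [ ("newByteArray#",       \args -> case args of [IntV n] -> Just [IntV n]; _ -> Nothing)
  , ("newPinnedByteArray#", \args -> case args of [IntV n] -> Just [IntV n]; _ -> Nothing)
  ]

evalPrimOp :: Name -> [Atom] -> Maybe [Atom]
evalPrimOp op args = Map.lookup op primOpTable >>= \handler -> handler args
```

The key point is that the table is shared across calls, so the per-call cost is a single map lookup rather than comparing `op` against every primop name in turn.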

Why optimise anyway? Because at the moment a single run of NoFib's bernoulli benchmark takes about half an hour, whereas the compiled program takes just 0.1s. That's quite a deal-breaker for an exhaustive benchmark run of all 11* benchmarks.

@csabahruska
Member

csabahruska commented Apr 13, 2022

Hi,
I've refactored the primop handling to use hashmap lookup.
Try out the interpreter from this branch:
https://github.com/grin-compiler/ghc-whole-program-compiler-project/tree/primop-map/external-stg-interpreter

see:
f5e6806

@sgraf812
Author

Wow, that was fast!

And it also resulted in a great speedup, now my reduced benchmark case takes 35s instead of 55s. Great work!

Another, perhaps superior, suggestion would be to intern all primop names and use an IntMap/Array, but the profile suggests that *.evalPrimOp is no longer a bottleneck, so why bother.
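For completeness, the interning idea could look roughly like the sketch below. All names here (`intern`, `handlers`, the dummy handler bodies) are hypothetical: the point is only that each primop name is resolved to a dense `Int` once, up front, so the hot loop dispatches through an array index and never touches strings.

```haskell
import qualified Data.Map.Strict as Map
import Data.Array (Array, listArray, (!))

-- A tiny subset of primop names, for illustration only.
primOpNames :: [String]
primOpNames = ["newByteArray#", "newPinnedByteArray#"]

-- Intern once, e.g. while loading the STG program.
internTable :: Map.Map String Int
internTable = Map.fromList (zip primOpNames [0 ..])

intern :: String -> Maybe Int
intern name = Map.lookup name internTable

-- Handlers indexed by interned id; the bodies are dummies that just
-- tag their argument so the dispatch is observable.
handlers :: Array Int (Int -> Int)
handlers = listArray (0, length primOpNames - 1)
  [ \size -> size          -- newByteArray#
  , \size -> negate size   -- newPinnedByteArray# (dummy behaviour)
  ]

evalInterned :: Int -> Int -> Int
evalInterned opId arg = (handlers ! opId) arg
```

Interning pays off when the same op is evaluated many times: the string lookup happens once per occurrence in the program, not once per execution.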

@sgraf812
Author

I think I closed prematurely; after all, the changes aren't on master yet. But I think f5e6806 fixes the issue.
