
evalPrimOp is a bottle-neck due to linear search #8

Open
sgraf812 opened this issue Apr 12, 2022 · 3 comments
@sgraf812

sgraf812 commented Apr 12, 2022

While the implementation of, e.g., ...ByteArray.evalPrimOp is rather direct and elegant at the moment

type Name = BS8.ByteString

evalPrimOp :: PrimOpEval -> Name -> [Atom] -> Type -> Maybe TyCon -> M [Atom]
evalPrimOp fallback op args t tc = case (op, args) of

  -- newByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 False
    pure [MutableByteArray baIdx]


  -- newPinnedByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newPinnedByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 True
    pure [MutableByteArray baIdx]

it yields rather slow code. That is because GHC doesn't perform the "obvious" trie-based match optimisation, so we end up with a linear chain of comparisons

if op == "newByteArray#" then ...
else if op == "newPinnedByteArray#" then ...
...

in Core.

I took a profile (on nofib's bernoulli, if that matters) and ...ByteArray.evalPrimOp takes about 10% of time and allocation. Here is an excerpt of the profile

COST CENTRE                          MODULE                            SRC                                                       %time %alloc

evalPrimOp                           Stg.Interpreter.PrimOp.ByteArray  lib/Stg/Interpreter/PrimOp/ByteArray.hs:(71,1)-(884,28)     9.4   10.9
evalPrimOp                           Stg.Interpreter.PrimOp.Addr       lib/Stg/Interpreter/PrimOp/Addr.hs:(23,1)-(357,28)          6.5    8.2
evalStackContinuation.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(355,74)-(394,35)                    4.4    5.8
lookupEnvSO                          Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(630,1)-(648,21)                4.2    2.2
builtinStgEval                       Stg.Interpreter                   lib/Stg/Interpreter.hs:(154,1)-(201,103)                    3.2    2.9
evalPrimOp                           Stg.Interpreter.PrimOp.Float      lib/Stg/Interpreter/PrimOp/Float.hs:(17,1)-(120,28)         2.8    2.9
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(497,45)-(502,23)                    2.8    3.4
evalPrimOp                           Stg.Interpreter.PrimOp.Double     lib/Stg/Interpreter/PrimOp/Double.hs:(17,1)-(127,28)        2.7    3.0
evalExpr                             Stg.Interpreter                   lib/Stg/Interpreter.hs:(423,1)-(533,93)                     2.5    0.6
compare                              Stg.Syntax                        lib/Stg/Syntax.hs:(30,3)-(32,12)                            2.4    0.0
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(504,37)-(510,27)                    2.4    1.2
setInsert                            Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(776,1)-(778,36)                2.1    0.0
evalPrimOp                           Stg.Interpreter.PrimOp.Int        lib/Stg/Interpreter/PrimOp/Int.hs:(25,1)-(149,28)           2.0    2.1
evalPrimOp                           Stg.Interpreter.PrimOp.SmallArray lib/Stg/Interpreter/PrimOp/SmallArray.hs:(27,1)-(157,28)    1.9    1.9

I think a bit of focus on optimising evalPrimOp may well speed up the interpreter by 50%. One way to do so would be to use a HashMap or a trie for the lookup.
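To illustrate the suggestion, here is a minimal sketch of table-based primop dispatch. The names `primOpTable` and the simplified `Atom`/`Name` types are hypothetical stand-ins for the interpreter's real types; `Data.Map.Strict` is used for self-containedness, but `Data.HashMap.Strict` from unordered-containers would give the O(1) lookup suggested above.

```haskell
import qualified Data.Map.Strict as Map

-- Simplified stand-ins (hypothetical): the real interpreter uses
-- BS8.ByteString names and a richer Atom type inside the M monad.
type Name = String
data Atom = IntV Int deriving (Eq, Show)

-- Build the dispatch table once; each entry maps a primop name to its
-- handler. A lookup is then O(log n) -- or O(1) with a HashMap --
-- instead of a linear chain of string comparisons.
primOpTable :: Map.Map Name ([Atom] -> Maybe [Atom])
primOpTable = Map.fromList
  [ ("newByteArray#",       \args -> case args of [IntV n] -> Just [IntV n]; _ -> Nothing)
  , ("newPinnedByteArray#", \args -> case args of [IntV n] -> Just [IntV n]; _ -> Nothing)
  ]

evalPrimOp :: Name -> [Atom] -> Maybe [Atom]
evalPrimOp op args = Map.lookup op primOpTable >>= \handler -> handler args
```

The key point is that the table is shared across calls, so the per-call cost is a single map lookup rather than comparing `op` against every primop name in turn.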

Why optimise anyway? Because at the moment a single run of NoFib's bernoulli benchmark takes about half an hour, whereas the compiled program takes just 0.1s. That's quite a deal-breaker for an exhaustive benchmark run of all 11* benchmarks.

@csabahruska
Member

csabahruska commented Apr 13, 2022

Hi,
I've refactored the primop handling to use hashmap lookup.
Try out the interpreter from this branch:
https://github.com/grin-compiler/ghc-whole-program-compiler-project/tree/primop-map/external-stg-interpreter

see:
f5e6806

@sgraf812
Author

Wow, that was fast!

And it also resulted in a great speedup, now my reduced benchmark case takes 35s instead of 55s. Great work!

Another, perhaps superior, suggestion would be to intern all primop names and use an IntMap/Array, but the profile suggests that *.evalPrimOp is no longer a bottleneck, so why bother.
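For completeness, the interning idea could look roughly like the sketch below. All names here (`intern`, `handlers`, the dummy handler bodies) are hypothetical: the point is only that each primop name is resolved to a dense `Int` once, up front, so the hot loop dispatches through an array index and never touches strings.

```haskell
import qualified Data.Map.Strict as Map
import Data.Array (Array, listArray, (!))

-- A tiny subset of primop names, for illustration only.
primOpNames :: [String]
primOpNames = ["newByteArray#", "newPinnedByteArray#"]

-- Intern once, e.g. while loading the STG program.
internTable :: Map.Map String Int
internTable = Map.fromList (zip primOpNames [0 ..])

intern :: String -> Maybe Int
intern name = Map.lookup name internTable

-- Handlers indexed by interned id; the bodies are dummies that just
-- tag their argument so the dispatch is observable.
handlers :: Array Int (Int -> Int)
handlers = listArray (0, length primOpNames - 1)
  [ \size -> size          -- newByteArray#
  , \size -> negate size   -- newPinnedByteArray# (dummy behaviour)
  ]

evalInterned :: Int -> Int -> Int
evalInterned opId arg = (handlers ! opId) arg
```

Interning pays off when the same op is evaluated many times: the string lookup happens once per occurrence in the program, not once per execution.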

@sgraf812
Author

I think I closed prematurely; after all, the changes aren't on master yet. But I think f5e6806 fixes the issue.
