Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase the performance of parsing and evaluation speed #200

Open
jwiegley opened this issue Apr 26, 2018 · 35 comments
Open

Increase the performance of parsing and evaluation speed #200

jwiegley opened this issue Apr 26, 2018 · 35 comments
Labels
help wanted When someone explicitly requests for help

Comments

@jwiegley
Copy link
Member

Here's the bar:

[nix-shell:~/src/hnix]$ time nix-instantiate -I nix=$PWD/data/nix/corepkgs --eval --expr "import $HOME/src/nix/nixpkgs {}" > /dev/null

real	0m0.089s
user	0m0.077s
sys	0m0.010s

[nix-shell:~/src/hnix]$ time ./dist/build/hnix/hnix -I nix=$PWD/data/nix/corepkgs --eval --expr "import $HOME/src/nix/nixpkgs {}" > /dev/null

real	0m1.125s
user	0m1.047s
sys	0m0.070s
@domenkozar
Copy link
Collaborator

domenkozar commented Nov 17, 2018

Using branch vs-megaparsec-7:

$ \time ./result/bin/hnix -I nix=$PWD/data/nix/corepkgs --eval --expr "import $HOME/dev/nixpkgs {}" >/dev/null
1.26user 0.06system 0:01.33elapsed 99%CPU (0avgtext+0avgdata 210552maxresident)k
0inputs+0outputs (0major+49601minor)pagefaults 0swaps

$ \time nix-instantiate -I nix=$PWD/data/nix/corepkgs --eval --expr "import $HOME/dev/nixpkgs {}" >/dev/null
0.16user 0.01system 0:00.17elapsed 100%CPU (0avgtext+0avgdata 30660maxresident)k
0inputs+0outputs (0major+5412minor)pagefaults 0swaps

@domenkozar
Copy link
Collaborator

domenkozar commented Nov 17, 2018

Here is the flamegraph of the invocation: https://static.domenkozar.com/hnix.svg

To reproduce:

diff --git a/default.nix b/default.nix
index f135e1d..be7aa0c 100644
--- a/default.nix
+++ b/default.nix
@@ -70,7 +70,8 @@ drv = haskellPackages.developPackage {
       pkgs.haskell.packages.${compiler}.cabal-install
     ];
 
-    enableLibraryProfiling = false;
+    enableExecutableProfiling = true;
+    doCheck = false;
 
     testHaskellDepends = attrs.testHaskellDepends ++
       [ pkgs.nix

Then invoke:

nix-build
./result/bin/hnix -I nix=$PWD/data/nix/corepkgs --eval --expr "import $HOME/dev/nixpkgs {}" +RTS -pj
nix-shell -p flamegraph haskellPackages.ghc-prof-aeson-flamegraph --run "cat hnix.prof | ghc-prof-aeson-flamegraph | flamegraph.pl > hnix.svg"

@domenkozar
Copy link
Collaborator

maybe @mrkkrp could chip in here :)

@mrkkrp
Copy link

mrkkrp commented Nov 18, 2018

@domenkozar Can you give me TLDR? So it got slower, if so, how much slower? There is only one thing that can make it slower, that's getSourcePos. It shouldn't be called too often (for example, it's a bad idea to call it on every token). But in general if you call it not-that-often, the performance should stay the same. At least it was so for MMark when I checked this before releasing Megaparsec 7.

@domenkozar
Copy link
Collaborator

@mrkkrp Megaparsec improved performance by about 10% on this particular case. We are wondering how to bridge the gap between original Nix implementation (0.16s) and HNix (1.26s) and majority of time is spend in Megaparsec as observed in the flamegraph linked above.

If you have any tips for what could be the underlying causes we'd be very grateful :)

@mrkkrp
Copy link

mrkkrp commented Nov 18, 2018

Megaparsec improved performance by about 10% on this particular case.

Ah good! I was fearing that I screwed up with version 7 :-P

how to bridge the gap between original Nix implementation (0.16s) and HNix (1.26s)

I think my suggestions are already applied in the code... Would be good to make it as fast as C, but I don't think I know how to do it.

@domenkozar
Copy link
Collaborator

domenkozar commented Nov 23, 2018

hnix-hp

Since productivity is around 50%, running profiling with -hd shows us what is allocated that makes GC slow.

@mpickering
Copy link
Contributor

I did some profiling and it seems that the majority of allocated memory is never used. Digging deeper, it seemed that most of the wasted space is from things to do with parse errors where ParseError objects are allocated but then never used because the parse succeeds.

@domenkozar
Copy link
Collaborator

It seems currently the slowdown is about 10x even for bigger files (running on 2e4e5e0 with --arg doOptimize):

$ \time nix-instantiate --parse /home/ielectric/dev/nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix >/dev/null
0.41user 0.04system 0:00.45elapsed 100%CPU (0avgtext+0avgdata 115912maxresident)k
0inputs+0outputs (0major+30692minor)pagefaults 0swaps

$ \time ./result/bin/hnix /home/ielectric/dev/nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix +RTS -s >/dev/null
  19,764,470,152 bytes allocated in the heap
   2,079,365,984 bytes copied during GC
     273,960,304 bytes maximum residency (9 sample(s))
         916,584 bytes maximum slop
             755 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19093 colls,     0 par    1.314s   1.314s     0.0001s    0.0005s
  Gen  1         9 colls,     0 par    0.678s   0.678s     0.0753s    0.2529s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    3.190s  (  3.191s elapsed)
  GC      time    1.991s  (  1.992s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    5.182s  (  5.182s elapsed)

  %GC     time      38.4%  (38.4% elapsed)

  Alloc rate    6,195,508,791 bytes per MUT second

  Productivity  61.6% of total user, 61.6% of total elapsed

4.96user 0.24system 0:05.21elapsed 99%CPU (0avgtext+0avgdata 787480maxresident)k
0inputs+0outputs (0major+193728minor)pagefaults 0swaps

@domenkozar
Copy link
Collaborator

domenkozar commented Nov 24, 2018

hnix

+RTS -hd for the previous comment (using additional flag --doProfiling true)

@domenkozar
Copy link
Collaborator

hnix

@mrkkrp
Copy link

mrkkrp commented Nov 25, 2018

I did some profiling and it seems that the majority of allocated memory is never used. Digging deeper, it seemed that most of the wasted space is from things to do with parse errors where ParseError objects are allocated but then never used because the parse succeeds.

@mpickering Can you tell me how to reproduce whatever you did to come to this conclusion?

@mpickering
Copy link
Contributor

I ran three profiles.

  1. -hb showed that most of the heap objects allocated were never used.
  2. -hc -hbvoid which shows the cost centre where memory is allocated but not used.
  3. -hd -hcnixPath which shows the description of the closure on the top cost centre from the previous profile.

I then looked at the code and found this closure by grepping the dumped STG. The closure was to do with error messages which led me to my hypothesis.

@mrkkrp
Copy link

mrkkrp commented Dec 4, 2018

I’ll get to playing with this but not in December. I think I have an idea how to improve things a bit.

@mrkkrp
Copy link

mrkkrp commented Jan 1, 2019

I tried to play with this but I think it doesn't really use my local version of megaparsec. I changed default.nix like so:

    megaparsec = import "/home/mark/projects/programs/haskell/megaparsec/default.nix"
      { mkDerivation=self.mkDerivation; inherit (super) base bytestring case-insensitive containers criterion deepseq hspec hspec-expectations mtl parser-combinators QuickCheck scientific stdenv text transformers weigh; };
    # super.megaparsec_7_0_4;

And generated this in my megaparsec directory:

{ mkDerivation, base, bytestring, case-insensitive, containers
, criterion, deepseq, hspec, hspec-expectations, mtl
, parser-combinators, QuickCheck, scientific, stdenv, text
, transformers, weigh
}:
mkDerivation {
  pname = "megaparsec";
  version = "7.0.4";
  src = ./.;
  libraryHaskellDepends = [
    base bytestring case-insensitive containers deepseq mtl
    parser-combinators scientific text transformers
  ];
  testHaskellDepends = [
    base bytestring case-insensitive containers hspec
    hspec-expectations mtl parser-combinators QuickCheck scientific
    text transformers
  ];
  benchmarkHaskellDepends = [
    base containers criterion deepseq text weigh
  ];
  enableLibraryProfiling = true;
  enableExecutableProfiling = true;
  doCheck = false;
  homepage = "https://github.com/mrkkrp/megaparsec";
  description = "Monadic parser combinators";
  license = stdenv.lib.licenses.bsd2;
}

This is how I tested it:

nix-shell --run "cabal configure --enable-profiling --flags=profiling && cabal build && ./dist/build/hnix/hnix --parse-only \"/home/mark/projects/tests/nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix\" +RTS -hd && hp2ps -s -c hnix.hp"

But no changes in local version of megaparsec seem to affect how hnix works, I even inserted tracing in the code. Nothing. And it takes a while to recompile too. Any hints about the workflow?

@domenkozar
Copy link
Collaborator

I'd suggest skipping Nix, as you'll always have to recompile the whole megaparsec and it will make your feedback loop really slow.

I've upstreamed most of hnix dependencies to stackage LTS 13, except hashing, monadlist and ref-tf for which you need mainland/ref-tf#2

@mrkkrp
Copy link

mrkkrp commented Jan 1, 2019

OK, I guess I’ll return to this later.

@mrkkrp
Copy link

mrkkrp commented Jan 1, 2019

Still it is not clear to me why my local version of Megaparsec was not used. I tried with both new cabal commands and the old ones.

@domenkozar
Copy link
Collaborator

I'd use something like

$ git diff
diff --git a/default.nix b/default.nix
index f5ded4b..e6984a3 100644
--- a/default.nix
+++ b/default.nix
@@ -31,7 +31,7 @@ drv = haskellPackages.developPackage {

   overrides = with pkgs.haskell.lib; self: super: {
     mono-traversable = dontCheck super.mono-traversable;
-    megaparsec = super.megaparsec_7_0_4;
+    megaparsec = self.callPackage /home/ielectric/dev/megaparsec {};
   };

   source-overrides = {};

where I bumped megaparsec version just to be sure I see 7.0.5 in:

$ nix-shell --run "ghc-pkg list | grep megaparsec"          
    megaparsec-7.0.5

@mrkkrp
Copy link

mrkkrp commented Jan 2, 2019

Ah OK, it looks the optimization I had in mind is just not an optimization :-D

@layus
Copy link
Collaborator

layus commented Dec 10, 2020

A short update on this one now that we can actually evaluate derivations. Hnix is about 100x slower than nix on evaluating nixpkgs.hello.

Here is a flamechart https://www.speedscope.app/#profileURL=https://gist.githubusercontent.com/layus/2dfb3f18b12c1daec65a69d34ee29639/raw/10d33407cbd78a4a2ea3ffe89ac5db3b91c31de7/hnix.eventlog.json

Parsing does not seem to be our bottleneck (3s) but it will need to be reduced if we want to compare in perf with nix.

@Anton-Latukha Anton-Latukha changed the title Close the performance gap for parsing and evaulating Increase the performance of parsing and evaluation speed Jan 21, 2021
@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 21, 2021

Renamed topic to be calmer and more discoverable ⬆️, because people would keep showing up for it. And I would keep pointing to this discussion until it would be more effective to reopen the conversation anew.


Probably we need to make flame graphs formation an occurring procedure.
I never knew the HNix spends that much in string operations.


Current processes:

Other news:
Nikita Volkov recently created the: https://hackage.haskell.org/package/acc sequenced monoid data type for accumulation, traversals, and folds which promises x2 gains comparing to DList or Seq, well as it is written there.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 26, 2021

@layus

No matter how I try to replicate the case of #200 (comment) flamegraph - mine report is >20x faster at evaluating mentioned nixpkgs.hello. Mine:

cabal v2-run hnix --enable-profiling --flags=profiling -- --eval --expr '(import <nixpkgs> {}).hello' +RTS -V'0.005' -Pj

takes under 1s, while the one you posted shows 30s. And mine in the timeline shows the project essentially does a one-pass eval.

What I do wrong 🤣.

@layus
Copy link
Collaborator

layus commented Jan 26, 2021

Possibly I have a slow machine :-). Or I missed something else, who knows ?

Thanks for trying. Did you manage to produce your own speedscope ? https://mpickering.github.io/posts/2019-11-07-hs-speedscope.html

@layus
Copy link
Collaborator

layus commented Jan 26, 2021

I am not too confident about what I am doing, but it seems that I recompiled some dependencies with profiling enabled. Does that make any sense ?

@sorki
Copy link
Member

sorki commented Jan 26, 2021

@layus

No matter how I try to replicate the case of #200 (comment) flamegraph - mine report is >20x faster at evaluating mentioned nixpkgs.hello. Mine:

cabal v2-run hnix --enable-profiling --flags=profiling -- --eval --expr '(import <nixpkgs> {}).hello' +RTS -V'0.005' -Pj

takes under 1s, while the one you posted shows 30s. And mine in the timeline shows the project essentially does a one-pass eval.

What I do wrong rofl.

Try hello.outPath 😺

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 26, 2021

I am not too confident about what I am doing, but it seems that I recompiled some dependencies with profiling enabled. Does that make any sense ?

Yep. You probably punting me that I need to recompile the deps. I know that. With Cabal I already recompiled most of that stack for profiling yesterday. But Cabal still keeps threatening me that it would rebuild a list of additional packages for profiling, but does not rebuild them however I try to force it into it. Since Cabal is a sort of reproducible tool, Cabal already built the stack and since I can not force rebuild - I decided that is some left-over detection message bug.

So I also used perf/dwarf debugger to look-up more/get more into about the state. And hotspot reported a bunch of debug symbols (2xxx of 3xxx) missing from running the executable.

So that is why I tried nix-build to force the closure into rebuild of debug symbols. And got the same (2xxx of 3xxx) missing from running the executable.

Well, Cabal I used was also 3.4 because of previous Cabal bug reasons. Now Cabal also on every configuration keeps backuping & regenerating the bit-perfect cabal.project.local files ... switched back to 3.2, nevertheless keeps threatening me that it would rebuild a list of packages for it. ... so... I would look up forward. Wanted to learn & automate time-based profiling...

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 26, 2021

Well, the main difference between your & mine flamegraps seems that for me the GC, profile overhead, and system interrupts are grouped - that is probably why graphs on the time grid seem different.

So: current evaluation result

@layus
Copy link
Collaborator

layus commented Jan 26, 2021

Nice ! I am no more the only one to have produced such a graph ;-).

Are you working on something like https://arewefastyet.com ? :-D

Well, the main difference between your & mine flamegraps seems that for me the GC, profile overhead, and system interrupts are grouped - that is probably why graphs on the time grid seem different.

Well, a lot of things could explain the difference in the graphs. hs-speedscope update, ghc debugging updates, or even a slight difference in the profiling options.

PS: Thanks @sorki for spotting the missing outPath that would definitely explain the difference in the execution time.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 26, 2021

Are you working on something like https://arewefastyet.com ? :-D

No, just tried to debug the things. I also found a setup-specific bug and situation in the RTS GHC options, but refactoring the RTS options in the GHC is a bit too big yak shave.

Debug symbols report was probably due to Cabal by default probably includes only exported symbols in the module.

@Anton-Latukha
Copy link
Collaborator

Wanted also to generate the reversed flamegraph, to point to the function that are the most time-consuming.

But found a better way. Speedscope has "Sandwich" mode that allows ordering by "Self" - which shows the sum of the function implementation running time.
The Cabal has: --profiling-detail='all-functions'.

haskell-packages.nix has pregenerated closure, hello seems synthetic abstract test, something like firefox, thou it fails to evaluate fully and throws error - but that is the topic not really related to profiling.

So the summary resulting command is:

cabal v2-run --enable-profiling --flags=profiling --enable-library-profiling --enable-executable-profiling --profiling-detail='all-functions' hnix -- --eval --expr '(import <nixpkgs> {}).firefox.outPath' +RTS -V'0.005' -Pj

The .perf become ~500mb, JSON (even speedscope exported one) become ~100mb, which is too much to "just upload" to Gist.
'n seems pointless to throw flamegraphs around, especially currently. Everybody can choose and generate one for now interests & work interested in.

The list of the most time-consuming functions by themself:

Screenshot-2021-01-26-17:51:16

But this list is also not needed, people can do better then I did here.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 26, 2021

Overall I am interested in demand. Maybe I am very biased about it, since I see it superfluous and do not like it from the start:
Type.Infer:

instance Monad m => MonadValue (Judgment s) (InferT s m) where
  defer  = id
  demand = flip ($)
  inform j f = f (pure j)

Standard:

instance ( MonadAtomicRef m
         , MonadCatch m
         , Typeable m
         , MonadReader (Context m (StdValue m)) m
         , MonadThunkId m
         )
  => MonadValue (StdValue m) m where
  defer = fmap Pure . thunk

  demand (Pure v) f = force v (flip demand f)
  demand (Free v) f = f (Free v)

  inform (Pure t) f = Pure <$> further t f
  inform (Free v) f = Free <$> bindNValue' id (flip inform f) v

Lint:

instance (MonadThunkId m, MonadAtomicRef m, MonadCatch m)
  => MonadValue (Symbolic m) m where
  defer = fmap ST . thunk
  demand (ST v) f = force v (flip demand f)
  demand (SV v) f = f (SV v)

And the use of it in the source code is mostly superfluous. It looks just like an argument flipping function, and a lot of its use of the flip demand anyway. So code does flips the flip the flip of the flip flip flip...
demand also complicated code reading for me at the start. I wonder if it just should be removed. That demand on Free is costly - no wonder. I am curious would we ever would prune the Free by using Monad instead or fused-effects. Currently, I just lazily wait on what #804 redo of Standard would resolve.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Feb 3, 2021

I did some profiling and it seems that the majority of allocated memory is never used. Digging deeper, it seemed that most of the wasted space is from things to do with parse errors where ParseError objects are allocated but then never used because the parse succeeds.

True:
Screenshot-2021-02-03-17:29:04
Screenshot-2021-02-03-17:40:20

⬆️ Look for *.catch.

The deal is roughly:

tl;dr: The thunk forcing strictifies the Context stack that gets stored in catches, contexts arrive there through throwError. And so strict data types not get deduplicated in memory.

The longer story:

  1. Every exception to be useful - stores in itself a whole Nix code evaluation context where the exception happened.
  2. Exceptions, use String a lot, which famously uses 40b*Char.
  3. We do a lot of thunk forcing on operations during recursions of evaluations. Which on forces makes the context and the Strings strict, == no deduplication and use of a lot of memory to store them.
    So that is probably the main memory footprint of "memory not ever used, because evaluation succeeds".
    (Frame (context gathering/messaging system) is not even fully integrated yet, there is a lot of direct error throws still around)

The proper errors get thrown in the project with throwError which uses the Frame system implementation, which preserves context. And so catch caught them and preserved the whole shebang of context of frames.

throwError
  :: forall s e m a . (Framed e m, Exception s, MonadThrow m) => s -> m a
throwError err = do
  context <- asks (view hasLens)
  traceM "Throwing error..."
  throwM $ NixException (NixFrame Error (toException err) : context)

There is 40 uses of throwError in src/Nix/Builtins.hs.

forceThunk
  :: forall m v a
   . (MonadVar m, MonadThrow m, MonadCatch m, Show (ThunkId m))
  => NThunkF m v
  -> (v -> m a)
  -> m a
forceThunk (Thunk n active ref) k = do
  eres <- readVar ref
  case eres of
    Computed v      -> k v
    Deferred action -> do
      nowActive <- atomicModifyVar active (True, )
      if nowActive
        then throwM $ ThunkLoop $ show n
        else do
          v <- catch action $ \(e :: SomeException) -> do
            _ <- atomicModifyVar active (False, )
            throwM e
          _ <- atomicModifyVar active (False, )
          writeVar ref (Computed v)
          k v
traceM :: Monad m => String -> m ()
traceM = const (pure ())

So all exception info got stored as a String.

The move from up-down seems:

    1. If force succeded - make so a) all exceptions get reduced/garbage collected, b) delete them.
  • Move to the use of Lazy/Short Text for the system.
  • Maybe think to keep the context lazy - it is pretty big and gets passed around literally everywhere during evaluation (Expressions are Nix strings, Nix strings go together with context), so TopLevelExpression and the tiniest Expr (which are not Atoms) - bear the context. Context gets a lot of tiny modifications everywhere, during every move. For the context text we are essentially cellular automata, laziness may be useful here to deduplicate it vertically just as COW system deduplicates between a lot of gradual tiny modifications in a data. Or maybe more experienced people would have a better suggestion to the optimization of a context.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Feb 11, 2021

So, I looked into the demand and force (forceThunk) which (as mentioned in #200 (comment), are the most computationally demanding chain and also probably the one of memory creating causes) - seems it is true - there are huge improvements possible to it, the details are in the #850.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Feb 11, 2021

#200 (comment)

In the end I talked about properly packing the contexts. If I understand what is context correctly (and I do not, because there are several different contexts in the project), the evaluation contexts form a 1 single tree, where the top node is TopLevelExpression context, and the context tree grows from it. HNix being cellular automata that grows the context tree during evaluation. idk is it sound to have 1 context tree for the whole evaluation and pass the tree position, to have ContextPos to the SourcePos. And when error gets reified - to go and gather the context from the tree. Maybe even the context tree is somewhat related to AST (and so transitionally to SourcePos).

Maybe context tree expands from AST, and so context tree position gives (the internal to it) AST tree position, abstract syntax tree expands from the source code, and so abstract syntax tree position gives (the internal to it) source code position (... and the HNix source code is the automata not on the source code, but on the context which allows attaching the abstract syntax objects to it and on them the program computations grow new context, and so on those abstract syntax appends context-abstract-source tree only grows through grafting, as a Free monad, which HNix already uses to express Nix runtime, and computations in the runtime are what grows on the contexts and grow the context, and because computations are very volatile things - they are fully strictified, to be properly garbage collected volatile on their end, so all they togather form a 4 layered "sphere"(multidimensional tree), but because computations can not be lazy (too much overhead) and need to be strict, maybe people tend to mix-up everything and strictify not only the 4-th layer, but - everything, so there is no longer single source-abstract-context tree to rely on, but anyway the chunks of it get passed around and get copied over in functions, because it is still needed). But as I said - it is just thinking and I mix together at least two-three types of contexts.


Do basic polymorphisation in the code, goes well, so at least would be easier to switch from where Strings are used to the Text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted When someone explicitly requests for help
Projects
None yet
Development

No branches or pull requests

7 participants