Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

vsbogd · 2019-12-27T18:25:21Z

I profiled ForwardChainer to find out how performance can be improved and found that it spends significant amount of time in the Rule::unify_source method. It depends on rules complexity. The more complex rules are used the bigger this amount of time is. The following benchmark demonstrates this: opencog/benchmark#27

The problem is that unify_source does almost nothing. First part converts rule's BindLink into another one with random variable naming (RewriteLink::alpha_convert). And second one constructs new BindLink with some variables grounded.

Proper fix probably should be raised to opencog/atomspace, but I am raising the issue here as main concern is URE performance.

Flame graph of the code benchmarked (better opening in web-browser):
perf.svg.gz

Following steps require getting perf and FlameGraph tool (https://github.com/brendangregg/FlameGraph). Intel CPU with support of Intel Processor Trace is required as well.

Steps to reproduce:
Run benchmark in the background:

./micro/benchmark --benchmark_filter=ForwardChainer_Basic --benchmark_min_time=600 &

Collect profile using Intel Processor Trace. You need only 1 second of profile it will be too large otherwise:

perf record -e intel_pt//u -p `pidof benchmark` sleep 1

Build flame graph of the do_step call (it can take a lot of time):

perf script --itrace=i100nsg | ./FlameGraph/stackcollapse-perf.pl | fgrep 'benchmark;BM_ForwardChainer_Basic;opencog::ForwardChainer::do_chain;opencog::ForwardChainer::do_step_rec;opencog::ForwardChainer::do_step;' | ./FlameGraph/flamegraph.pl > perf.svg

The text was updated successfully, but these errors were encountered:

linas · 2019-12-28T15:24:32Z

Not to hijack the conversation, but the flame-graph also points at opencog::SourceSet which looses huge amounts of time in boost which calls operator== which calls pthread stuff. Over the last decade, I've repeatedly discovered that boost is a performance nightmare -- rip it out of your code, everything will run 2x or 3x faster. I mean, it seems odd to me that pthread-anything should be more than 1% of the total performance; that indicates thundering herds in operator==

Also: what's __GI__exp and __ieee7? These seem to be eating lots of time...

Also: as a general rule, you want to run any/all atomspace things with HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so.0 as this will in general give you a 50% performance boost.

linas · 2019-12-28T15:28:50Z

If you haven't configed hugepages, see below; its cut-n-paste from atomspace-postgres; you have to adjust these instructions for your app:

Tuning HugePages

Using 2MB-sized HugePages can also offer a large performance boost,
both for Postgres, and for the atomspace, especially when working with a
large atomspace. The procedure to set these up is a bit complicated.
First, add a hugepages user-group, and add postgres to it:

   sudo groupadd hugepages
   sudo gpasswd -a postgres hugepages

Then you need to find out what the group id (gid) was:

   id postgres

Suppose that this shows group id 1234 for hugepages. This needs to
be added to /etc/sysctl.conf as well. So edit, and add:

   vm.nr_hugepages = 16384       # 32GB of hugepages, 25% of RAM.
   vm.hugetlb_shm_group=1234

Don't forget to sudo sysctl -p /etc/sysctl.conf again.
Finally, the ability to use those pages. Add to /etc/security/limits.conf:

    @hugepages      soft    memlock         unlimited
    @hugepages      hard    memlock         unlimited

linas · 2019-12-28T15:31:35Z

actually, you may as well set hugepages to 80% of RAM, or whatever the typical working-set size for the atomspace is.

vsbogd · 2019-12-28T16:03:09Z

the flame-graph also points at opencog::SourceSet which looses huge amounts of time in boost which calls operator== which calls pthread stuff

Linas thank you for spotting this. I believe pthread stuff is a mutex lock at:

ure/opencog/ure/forwardchainer/SourceSet.cc

Lines 37 to 41 in 0a353dd

    
           bool Source::operator==(const Source& other) const 
        
           { 
        
           	std::lock_guard<std::mutex> lock(_whole_mutex); 
        
           	return body == other.body && vardecl == other.vardecl; 
        
           }

so root cause is not boost but intensive locking.

Also: what's __GI__exp and __ieee7? These seem to be eating lots of time...

I believe it is exponent calculation here:

ure/opencog/ure/forwardchainer/SourceSet.cc

Lines 175 to 188 in 0a353dd

    
           double SourceSet::get_weight(const Source& src) const 
        
           { 
        
           	// TODO: we could take into account some sort of fitness. For 
        
           	// instance if the fitness is maximize confidence, then we could 
        
           	// factor in the confidence of the source, as the higher the 
        
           	// confidence of the source, the higher the confidence of the 
        
           	// conclusion. 
        
           	return src.exhausted ? 0.0 : complexity_factor(src); 
        
           } 
        
           double SourceSet::complexity_factor(const Source& src) const 
        
           { 
        
           	return exp(-_config.get_complexity_penalty() * src.complexity); 
        
           }

linas · 2019-12-28T19:33:07Z

Hmm. Given that exp is fast (should be less than 50-100 clock cycles) the fact that its showing up a lot means it's being called a lot. Probably too much. Similar remarks for the mutex -- it should be no more than dozens of clocks, unless there's a huge amount of lock contention...

ngeiswei · 2020-01-07T08:12:31Z

Thanks. Great findings!

Regarding the locks, these are just crude half-way steps towards parallel URE. Once we move to C++17 I'll use more refined (unique/shared) mutexes. Maybe some thread pool, etc, not sure yet.

Regarding exp, it is used to calculate the probability distribution over sources, obviously this should be cached.

Calculate once at insertion rather than at each iteration. Address the exp part of opencog#54

vsbogd added the enhancement New feature or request label Dec 27, 2019

vsbogd mentioned this issue Dec 27, 2019

Add ForwardChainer benchmark opencog/benchmark#27

Merged

ngeiswei added a commit to ngeiswei/ure that referenced this issue Feb 14, 2020

Optimize complexity_factor calculation

906ed2d

Calculate once at insertion rather than at each iteration. Address the exp part of opencog#54

ngeiswei added a commit to ngeiswei/ure that referenced this issue Feb 21, 2020

Optimize complexity_factor calculation

ec3da1a

Calculate once at insertion rather than at each iteration. Address the exp part of opencog#54

ngeiswei mentioned this issue Feb 21, 2020

Improve URE API, unit tests + Optimize #64

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

vsbogd commented Dec 27, 2019 •

edited by ngeiswei

Loading

linas commented Dec 28, 2019 •

edited

Loading

linas commented Dec 28, 2019

linas commented Dec 28, 2019

vsbogd commented Dec 28, 2019

linas commented Dec 28, 2019

ngeiswei commented Jan 7, 2020

Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

Comments

vsbogd commented Dec 27, 2019 • edited by ngeiswei Loading

linas commented Dec 28, 2019 • edited Loading

linas commented Dec 28, 2019

If you haven't configed hugepages, see below; its cut-n-paste from atomspace-postgres; you have to adjust these instructions for your app:

Tuning HugePages

linas commented Dec 28, 2019

vsbogd commented Dec 28, 2019

linas commented Dec 28, 2019

ngeiswei commented Jan 7, 2020

vsbogd commented Dec 27, 2019 •

edited by ngeiswei

Loading

linas commented Dec 28, 2019 •

edited

Loading