
Stale wiki pages


Aspen

In all subsequent experiments, Aspen was built according to its developers' specifications (compiled with g++-7 -fcilkplus and the Cilk Plus runtime libraries) and given access to all 48 cores and the full memory of the new lab machine.

Performance evaluation with update batches of size 1000, unlimited memory

| Input Stream | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| kron_13_fourth | 1.64828 | 10.597 |
| kron_15_fourth | 1.27607 | 218.918 |
| kron_16_fourth | 1.22838 | 909.619 |
| kron_17_fourth | 1.20785 | 3700.27 |

Performance evaluation with update batches of size 1000, restricted to 16GB memory

| Input Stream | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| kron_13_fourth | 1.47915 | 11.8092 |
| kron_15_fourth | 1.12393 | 248.551 |
| kron_16_fourth | 1.12532 | 992.933 |
| kron_17_fourth | 1.07073 | 4174.12 |

How does Aspen's performance respond to changes in batch size?

On the kron_13_fourth stream:

| Batch Size (# updates) | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| 10^3 | 1.64084 | 10.645 |
| 10^4 | 2.68386 | 6.51 |
| 10^5 | 3.07802 | 5.676 |
| 10^6 | 3.4112 | 5.122 |
| 10^7 | 3.42795 | 5.099 |

Benchmarking Against Low Memory Conditions (No Write Optimization)

Ran against naive Kruskal's, NetworkX, and our streaming implementation of connected components on graphs with density n^2 * 0.95. Time is measured in wall-clock seconds; values in parentheses are memory usage.
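For context, a minimal sketch of what the naive in-memory baseline does: keep the current edge set in memory as the stream of insertions and deletions arrives, then run a Kruskal-style union-find pass to count components. The stream format and names below are illustrative assumptions, not the actual benchmark code.

```cpp
// Illustrative only: the naive baseline keeps the whole current edge set in
// memory (this is what drives the memory figures in the tables below) and
// answers the CC query with a Kruskal-style union-find pass at the end.
// The stream format ("i u v" / "d u v" after a node count) is an assumption.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <unordered_set>
#include <vector>

static std::vector<int> parent;
static int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }

int main() {
  int n;
  if (std::scanf("%d", &n) != 1) return 1;
  parent.resize(n);
  std::iota(parent.begin(), parent.end(), 0);

  std::unordered_set<uint64_t> edges;   // O(current #edges) memory
  char op; int u, v;
  while (std::scanf(" %c %d %d", &op, &u, &v) == 3) {
    uint64_t key = ((uint64_t)std::min(u, v) << 32) | (uint64_t)std::max(u, v);
    if (op == 'i') edges.insert(key);   // edge insertion
    else           edges.erase(key);    // edge deletion
  }
  for (uint64_t e : edges) {            // union-find over the surviving edges
    int a = find((int)(e >> 32)), b = find((int)(e & 0xffffffffu));
    if (a != b) parent[a] = b;
  }
  int cc = 0;
  for (int i = 0; i < n; ++i) cc += (find(i) == i);
  std::printf("connected components: %d\n", cc);
  return 0;
}
```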

No memory restrictions

| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 10^3 | 1s | | 183 (100K) |
| 5*10^3 | 61 (1.5G) | | |
| 6*10^3 | 1m 31s (2.9G) | | |
| 7*10^3 | 2m 6s (3.6G) | | |
| 10^4 | 4m 58s (6.1G) | | 30000+ (2.3G) |
| 1.1*10^4 | 6m 4s (7.1G) | | |
| 1.2*10^4 | 7m 24s (11.8G) | | |
| 1.3*10^4 | 8m 43s (13.1G) | | |
| 1.4*10^4 | 10m 9s (14.7G) | | |
| 1.5*10^4 | 11m 50s (16.2G) | | |
| 1.6*10^4 | 13m 29s (17.8G) | | |
| 2*10^4 | (21.9G) | | |

1GB restriction

| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 10^3 | | | |
| 5*10^3 | 26m 1s | | |
| 6*10^3 | 51m 53s | | |
| 7*10^3 | 12h+ (^C) | | |
| 10^4 | 28h+ (^C) | | |
| 10^5 | | | |

8GB restriction

| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 1.2*10^4 | 1h 15m 44s | | |
| 1.3*10^4 | 1h 27m 22s | | |
| 1.4*10^4 | 2h 10m 20s | | |
| 1.5*10^4 | 5h 4m 12s | | |
| 1.6*10^4 | 10h 46m 47s* | | |
| 10^5 | | | |

*: Dubious, since the program did not terminate on its own, but still gave output that signaled the end of execution.

Bipartiteness Reduction

Values in the tables are averaged over 10 trials (a different random graph per trial). In each experiment we profile both stream ingestion and algorithm execution (this is necessary given the way the bipartiteness-to-CC reduction must be implemented). For random graph generation, the probability of an edge appearing was set to 0.03, the geometric insertion/removal parameter was set to 0.5, and there is no limit on how many times an edge may appear in the stream. Experiments were conducted on the lab machine.
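The reduction itself is not spelled out on this page; below is a sketch of the textbook bipartite double-cover reduction from bipartiteness to CC, which may differ in detail from the implementation profiled here. The `Update` type and the `num_components` oracle are illustrative assumptions.

```cpp
// Sketch of the standard double-cover reduction from bipartiteness to CC.
// Each update (u, v) on an n-vertex graph is mapped to two updates on a
// 2n-vertex graph; G is bipartite iff the double cover has exactly twice as
// many connected components as G. Illustrative only.
#include <utility>
#include <vector>

using Update = std::pair<int, int>;  // (u, v) edge insertion/deletion

// Map one update on G to the two corresponding updates on the double cover.
// Vertex (x, side) of the cover is encoded as x + side * n.
std::vector<Update> double_cover_updates(const Update& e, int n) {
  auto [u, v] = e;
  return { {u,     v + n},    // (u, left)  -- (v, right)
           {u + n, v} };      // (u, right) -- (v, left)
}

// Usage (hypothetical CC oracle `num_components`):
//   bool bipartite = num_components(cover_stream, 2 * n)
//                    == 2 * num_components(stream, n);
```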

Time Profiling

| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 50 | 264 | 1.2 |
| 750 | 173 | 779.8 | 2.9 |
| 1000 | 332.5 | 1391.7 | 5.5 |
| 1250 | 667.9 | 2584.7 | 8 |
| 1500 | 1041.6 | 3935.9 | 13.9 |
| 1750 | 1439.6 | 5341.5 | 17 |
| 2000 | 1895.9 | 6971.4 | 28.3 |
| 2250 | 2812.5 | 10074 | 34.4 |
| 2500 | 3492.5 | 12402.5 | 41.1 |
| 2750 | 4248.3 | 15011.1 | 49.1 |
| 3000 | 5392 | 18781.4 | 59.8 |
| 3250 | 6334.3 | 21999.4 | 71.8 |
| 3500 | 7357 | 25484.6 | 87.7 |
| 3750 | 8453.8 | 29244 | 104.9 |
| 4000 | 9622.4 | 33257.5 | 117.4 |
| 4250 | 12440.7 | 42065.6 | 143.4 |
| 4500 | 13948.8 | 47178.1 | 169.5 |
| 4750 | 15566.2 | 52632.4 | 195.4 |
| 5000 | 17235.3 | 58344 | 223.2 |

Time Graph

Memory Profiling

Each entry in the table is the average peak number of bytes. Data were recorded using the valgrind tool massif with the flags --depth=1 and --ignore-fn=main enabled.

Requested Memory

| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 22269440.0 | 83413488.0 | 413521.0 |
| 750 | 45939440.0 | 168471488.0 | 811405.8 |
| 1000 | 61225440.0 | 224601488.0 | 1373709.0 |
| 1250 | 93391440.0 | 338251488.0 | 2198597.8 |
| 1500 | 122613440.0 | 441789488.0 | 2982961.8 |
| 1750 | 143035440.0 | 515407488.0 | 3832357.8 |
| 2000 | 163457440.0 | 589025488.0 | 5116566.6 |
| 2250 | 220455440.0 | 785907488.0 | 6926131.4 |
| 2500 | 244941440.0 | 873221488.0 | 8719897.8 |
| 2750 | 269427440.0 | 960535488.0 | 10233761.0 |
| 3000 | 319257440.0 | 1133097488.0 | 11682636.2 |
| 3250 | 345855440.0 | 1227515488.0 | 13144881.8 |
| 3500 | 372453440.0 | 1321933488.0 | 14737891.4 |
| 3750 | 399051440.0 | 1416351488.0 | 16751365.0 |
| 4000 | 425649440.0 | 1510769488.0 | 19643921.8 |
| 4250 | 534119440.0 | 1878547488.0 | 23607701.8 |
| 4500 | 565533440.0 | 1989045488.0 | 28013207.4 |
| 4750 | 596947440.0 | 2099543488.0 | 31976669.8 |
| 5000 | 628361440.0 | 2210041488.0 | 35195520.2 |

Requested Memory Graph

Total Memory

| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 22381520.0 | 83781592.0 | 485070.4 |
| 750 | 46131520.0 | 169071608.0 | 964816.0 |
| 1000 | 61481536.0 | 225401640.0 | 1638487.2 |
| 1250 | 93731584.0 | 339351816.0 | 2604870.4 |
| 1500 | 123021552.0 | 443110040.0 | 3557333.6 |
| 1750 | 143511536.0 | 516948104.0 | 4610290.4 |
| 2000 | 164001536.0 | 590786152.0 | 6132981.6 |
| 2250 | 221139664.0 | 788031800.0 | 8215536.8 |
| 2500 | 245701680.0 | 875581832.0 | 10311092.0 |
| 2750 | 270263712.0 | 963131880.0 | 12146176.0 |
| 3000 | 320170000.0 | 1135931224.0 | 13953377.6 |
| 3250 | 346844016.0 | 1230585288.0 | 15796678.4 |
| 3500 | 373518000.0 | 1325239384.0 | 17802533.6 |
| 3750 | 400191984.0 | 1419893608.0 | 20267727.2 |
| 4000 | 426866064.0 | 1514547688.0 | 23637584.8 |
| 4250 | 535479584.0 | 1882900600.0 | 28105636.8 |
| 4500 | 566973616.0 | 1993654728.0 | 33043212.8 |
| 4750 | 598467600.0 | 2104408152.0 | 37566742.4 |
| 5000 | 629961616.0 | 2215162056.0 | 41363148.0 |

Total Memory Graph

Buffer Tree Performance

Performance of graph streaming with Buffer-Tree

05/08/2021 Performance numbers

Performance numbers using pthreads for the graph workers and without any sketch-level parallelism.
Runtime is given in seconds (s) or minutes (m). The number in brackets is the average update rate in updates per second (k = thousand, m = million).

| Nodes | 1000 | 5000 | 10000 | 15000 |
|---|---|---|---|---|
| Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
| Datastruct Size | 63MB | 590MB | 1.5GB | 2.5GB |
| Runtime Standard | 13.5s [74k] | 11.6m [34k] | 57.4m [28k] | 138m [25k] |
| Runtime FBT +1 | 5.2s [191k] | 3.5m [112k] | 16.6m [95k] | 40.1m [87k] |
| Runtime FBT +2 | 2.9s [342k] | 1.9m [206k] | 9.2m [172k] | 22.6m [154k] |
| Runtime FBT +4 | 1.7s [578k] | 1.1m [353k] | 5.4m [292k] | 13.8m [253k] |
| Runtime FBT +8 | 1.1s [903k] | 41s [584k] | 3.3m [467k] | 9.1m [386k] |
| Runtime FBT +16 | 0.9s [1.1m] | 31s [773k] | 2.6m [614k] | 7.4m [470k] |

Buffer-Tree Ingestion Rates

These numbers are from raw ingestion into the buffer tree, without any graph streaming work being done.

05/08/2021 Fully Synchronous

| Nodes | Branch Factor | Buffer Size | Number of Updates | Runtime [rate /s] |
|---|---|---|---|---|
| 512 | 4 | 1 MB | 33.5 million | 2.6s [12.9m] |
| 512 | 16 | 1 MB | 33.5 million | 1.7s [19.5m] |
| 1024 | 16 | 2 MB | 268 million | 15.6s [17.2m] |

Cluster experiment 1

Testing the cluster on a simple word count Apache Spark program.
Input size: 600MB

4 worker nodes:
--- 166.97850894927979 seconds ---
--- 163.22055530548096 seconds ---
--- 164.37560272216797 seconds ---
--- 171.68829035758972 seconds ---

3 worker nodes:
--- 187.09018564224243 seconds ---
--- 184.9761152267456 seconds ---
--- 188.4610195159912 seconds ---

2 worker nodes:
--- 361.66504549980164 seconds ---
--- 362.92039012908936 seconds ---
--- 361.94735503196716 seconds ---

1 worker node:
--- 1067.6949362754822 seconds ---
--- 1059.5218110084534 seconds ---
--- 1053.6695115566254 seconds ---

0 worker nodes:
--- 2056.108319759369 seconds ---
--- 2004.2446208000183 seconds ---
--- 2048.037629365921 seconds ---

34 worker nodes:
--- 84.35914134979248 seconds ---
--- 82.6325478553772 seconds ---
--- 80.98088264465332 seconds ---

GraphX by default works for Scala only. There is a Python wrapper for GraphFrames, which is what I'm using. I was able to run their connected components algorithm on the Facebook friends sample graph that they provide, but this was done locally (not on the cluster).

  1. Use spark-submit instead of pyspark so I can run it on the cluster. (Could use advice from Tyler but can do this myself if he's busy)
  2. Set up a Hadoop file system on the cluster. (Need some advice from Tyler)
  3. Modify the ansible playbook so the workers can downgrade scala/spark version themselves. Almost done with this.
  4. Figure out how to load our own input graphs into the algorithm. (Evan might have some suggestions on this).
  5. Get the final datasets from Ahmed onto the cluster and test GraphX on it. How big are the datasets? The master node of the cluster only has 350GB of storage and each worker has 30GB. I do not know if HDFS allows us to "spread" a large dataset across all the nodes.
  6. Make it so we can vary the number of nodes. There should be a way to do this efficiently/elegantly with spark-submit, but if not we can just use ansible.

Competitor Speed and Memory Experiments

16 GB RAM restriction experiments using Tyler's read buffering (buffer size 10^6) and batch size 10^6.

Aspen

| Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
|---|---|---|---|---|---|
| kron13 | 2.4 | 0 | 3.41272e+06 | 6.43542 | 6.45255 |
| kron15 | 2.8 | 0 | 3.19362e+06 | 92.6225 | 92.7011 |
| kron16 | 4.1 | 0 | 2.99069e+06 | 372.126 | 372.44 |
| kron17 | 7.3 | 0 | 2.91467e+06 | 1532.25 | 1533.1 |
| kron18 | 16.9 | 13.5 | 303291 | 58916.7 | 59369.1 |

Terrace

| Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
|---|---|---|---|---|---|
| kron13 | 0.577 | 0 | 141012 | 155.748 | 155.807 |
| kron15 | 6.0 | 0 | 135892 | 2176.73 | 2177.15 |
| kron16 | | | | | |
| kron17 | | | | | |
| kron18 | | | | | |

Current performance estimates

Given the latest sketch update speed test results from Kenny, we can now estimate the input sizes at which our data structure becomes smaller than those of various alternatives (Boost, NetworkX). We can also estimate how long our algorithm will take to run on such input sizes, giving us a sense of how close we are to outperforming the current state of the art for in-memory computation.

The above plot displays the space curves for 3 competitors (NetworkX's edge list data structure, and Boost's adjacency list and adjacency matrix data structures). All 3 have O(n^2) space complexity, but the constants differ. We also plot space curves for 3 variants of our sketching data structure: the currently implemented version, which uses 16B per bucket; a trimmed version, which reduces the size of the error-checking variable c for 12B per bucket; and a further trimmed version, which uses half the typical number of buckets (reducing the number of repetitions used to avoid failure whp) for an effective 6B per bucket. We mark the input sizes at which each of our sketching data structures requires no more memory than the most compact alternative (the Boost adjacency matrix), and note the data structure sizes at these break-even points.

We can now estimate how long it will take for our algorithm to run on the input sizes specified by these break-even points. We can see that the ~62% reduction in space from our current implementation to my proposed trimmed-bucket implementation makes an enormous difference in the length of time required to "beat" the Boost adjacency matrix algorithm - 30 hours as opposed to hundreds. This suggests that space savings are crucially valuable and that we are nearing the point at which several more iterations of runtime optimization (vectorizing sketch updates, distributing work via Spark, multithreading, buffering via WODS) are all that is required to make our algorithm feasible on inputs that alternative tools cannot match.

Further areas to consider:

  • What do we need to know to make similar comparisons for existing external-memory graph solutions?
  • How does our algorithm work on more "typical" inputs (larger node set, much sparser graphs) that are present in graph benchmarks such as the GAP benchmark suite? It's likely that on such graphs, we need a lot of space (which we hope can simply be on disk) but suffer little to no penalty in stream processing time per edge (and since the graphs are sparse, the edge stream should be much shorter). Ex: On GAP's Kron benchmark graph with 135M nodes and 2B edges, we only expect about a 5x slowdown in sketch update speed compared to a graph on 100K nodes (assuming that l_0 sketching on longer vectors doesn't matter much - Kenny will verify). However, our data structure size for this graph will be about 60 TB, forcing us to store this structure on disk and use WODS techniques to make sure that stream ingestion is I/O efficient. However, each sketch will only be about 0.5 MB, so they are still easy to distribute.
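As a rough consistency check of the 60 TB figure in the last bullet, taking the quoted per-sketch size of roughly 0.5 MB and the 135M-node count as given:

```latex
\text{total sketch space} \approx n \cdot s_{\text{sketch}}
  \approx 1.35 \times 10^{8} \times 0.5~\text{MB}
  \approx 6.8 \times 10^{7}~\text{MB}
  \approx 67~\text{TB},
```

which is in the same ballpark as the "about 60 TB" estimate above.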

Distributed Framework notes

We need to decide which distributed framework to target since there may be issues with us using Spark. Add details about different options (including what we know about Spark) here.

Notes on integrating C++ code into PySpark https://www.slideshare.net/databricks/integrating-existing-c-libraries-into-pyspark-with-esther-kundin

  • People have done it before
  • This guide might be helpful
  • There are some performance considerations - they recommend not keeping state in the C++ code? Am I misinterpreting that?
  • Seems like a fair bit of added complexity, though I'm not sure how much.

Another tutorial on using C++ with Spark. At a quick glance this one seems simpler. https://medium.com/@thomaspt748/integrate-c-c-libraries-dll-so-into-apache-spark-scala-in-hadoop-cluster-56cdb26e15ea
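To get a feel for what such an integration needs on the C++ side, here is a minimal hedged sketch of a C-ABI shim that a PySpark job could load via ctypes (or wrap for JNI). The function name apply_updates and its stateless signature are illustrative assumptions, not our actual API.

```cpp
// sketch_shim.cpp -- illustrative only; compile with: g++ -shared -fPIC ...
// Exposes a flat C interface so the library can be loaded from Python
// (ctypes/cffi) or wrapped for JNI. Function name and signature are assumed
// for illustration, not our real API.
#include <cstdint>

extern "C" {

// Apply a batch of (src, dst, is_delete) updates and return how many were
// applied. Keeping the entry point free of hidden state sidesteps the
// "don't keep state in the C++ code" concern raised in the slides above.
int64_t apply_updates(const int64_t* src, const int64_t* dst,
                      const uint8_t* is_delete, int64_t count) {
  int64_t applied = 0;
  for (int64_t i = 0; i < count; ++i) {
    // ... hand each update to the real sketching code here ...
    (void)src[i]; (void)dst[i]; (void)is_delete[i];
    ++applied;
  }
  return applied;
}

}  // extern "C"
```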

Thrill - a C++ native distributed big data framework https://github.com/thrill/thrill

  • C++ native should make integration much easier
  • far less used than Spark, not an industry standard
  • EXPERIMENTAL

Apache Spark

  • Built on top of Hadoop, requires knowledge and understanding of Hadoop
  • Widely used in industry and lots of functionality/features
  • Only available in Java/Scala/Python

Open MPI

  • Available in C/C++
  • API forces user to handle inter-process communication

MapReduce framework for C/C++ on Hadoop

External DB requirements for Batched Operations

Naive impl: insert, periodically query all updates for each sketch, batch all of them, delete from DS

  • Put
  • Get
  • Delete

Better impl: insert, and modify the flushing algorithm to batch updates according to some heuristic (e.g. all updates are for the same sketch, depth in the tree, etc.); see the interface sketch after the list below.

  • Put
  • Low level access to flush (e.g. custom flush, flush hook)
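
A rough sketch of the two interface shapes described above; the type and method names are illustrative, not any particular database's API.

```cpp
// Illustrative interfaces only; names do not correspond to a real DB API.
#include <cstdint>
#include <functional>
#include <vector>

struct Update { uint64_t node; uint64_t edge_update; };

// Naive impl: the graph code drives everything through generic point operations.
struct NaiveStore {
  virtual void put(const Update& u) = 0;
  virtual std::vector<Update> get_all(uint64_t node) = 0;  // query a sketch's updates
  virtual void remove_all(uint64_t node) = 0;              // then delete them
  virtual ~NaiveStore() = default;
};

// Better impl: the store batches internally and hands ready batches to a flush
// hook (e.g. when all buffered updates target the same sketch, or a buffer at
// some depth of the tree fills up).
struct BatchingStore {
  using FlushHook = std::function<void(uint64_t node, const std::vector<Update>&)>;
  virtual void put(const Update& u) = 0;
  virtual void set_flush_hook(FlushHook hook) = 0;         // low-level flush access
  virtual ~BatchingStore() = default;
};
```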

Given a buffer tree with B^e = 16, buffer size 4096, n = 10^10 for naive, n = 10^5 for better, naive runs roughly 10-100x slower.

External memory performance comparison

Improvements in External Memory Connected Components Performance

graph

Each node in the above tree represents a fully external memory (EM) graph processing model (GPM) in which connected components (CC) is implemented. Each directed edge indicates that the destination GPM demonstrates a CC algorithm that is faster than the CC algorithm implemented in the source GPM. The requirement for such a demonstration is an experiment comparing the performance of the two algorithms across graphs of varying scale on identical hardware in the same environment. The sources for the studies supporting each edge are provided below in the references section.

Things to note

  • Although X-Stream is technically not a fully EM GPM, it is included because it is used as a performance milestone for many other fully EM GPMs.
  • NXGraph demonstrates that it is faster than TurboGraph for real-world graphs up to the scale of the Twitter graph (1.47 billion edges); however, TurboGraph eventually overtakes NXGraph on larger graphs. This scale-dependent relationship is indicated by the two directed edges between NXGraph and TurboGraph.
  • NXGraph claims speedups over GridGraph and MMap; however, these experiments only run a single iteration of PageRank on a single graph.
  • Mosaic and Graphene are the first fully EM GPMs which support computations on graphs with 1 trillion edges.
  • MOSAIC is the first major GPM to begin exploiting NVMe architecture. GraFBoost follows suit, while CLIP extends optional leveraging of NVMe.
  • Graphene excludes preprocessing time from its performance comparison, simply claiming that competitor preprocessing time is "similar or longer".
  • CLIP compares itself to MOSAIC on a commodity PC, without the massively parallel Xeon Phi processors MOSAIC was intended to leverage for its graph computation. The MOSAIC paper contains CPU-only tests which still show it is superior to GridGraph; however, it should be understood that CLIP has only been shown to be superior to this CPU-only version of MOSAIC, and that for massive graphs (>256 GB representation size) the full MOSAIC framework may or may not outperform CLIP.
  • Though LUMOS does not explicitly implement a CC algorithm, the original paper proves that it performs label propagation faster than GridGraph. Label Propagation is the basis for most CC algorithms implemented by GPMs, including GridGraph.
  • Similarly, V-Part compares BFS instead of CC against GraFBoost, however these similar "traversal" algorithms are essentially just different versions of the aforementioned label propagation and thus exhibit similar runtimes across GPMs.

Test Graphs

Below are the graph data sets used by some of the most popular GPMs. Most other GPMs evaluate themselves on some subset of these graphs as well.

GraphChi (2012)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| live-journal | 4.8M | 69M |
| netflix | 0.5M | 99M |
| domain | 26M | 0.37B |
| twitter-2010 | 42M | 1.5B |
| uk-2007-05 | 106M | 3.7B |
| uk-union | 133M | 5.4B |
| yahoo-web | 1.4B | 6.6B |

X-Stream (2013)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| netflix | 0.5M | 0.1B |
| Twitter | 41.7M | 1.4B |
| Friendster | 65.6M | 1.8B |
| sk-2005 | 50.6M | 1.9B |
| yahoo-web | 1.4B | 6.6B |

The netflix graph is bipartite.

GridGraph (2015)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| LiveJournal | 4.85M | 69M |
| Twitter | 61.6M | 1.47B |
| UK | 106M | 3.74B |
| Yahoo | 1.41B | 6.64B |

MOSAIC (2016)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| rmat24 | 16.8M | 0.3B |
| twitter | 41.6M | 1.5B |
| rmat27 | 134.2M | 2.1B |
| uk2007-05 | 105.8M | 3.7B |
| hyperlink14 | 1.7246B | 64.4B |
| rmat-trillion | 4.2949B | 1T |

Note: rmat data sets are synthetic graphs.

References

The experimental evidence supporting (almost) each edge in the graph may be found in the original publication of its destination node GPM.

Exceptional edges

  • The edge from DS-GraphChi to X-Stream is supported by the DS-GraphChi paper.
  • The edge from LigraChi-g to GridGraph is supported by the Wonderland paper.

LaTeX Source Code

The above "imporvement tree" is incomplete. The LaTeX source code used to generate the above improvement tree using the tikz package is provided here for inclusion in other documents or expansion of the tree to include more GPMs and experiments.

% Node style
\tikzstyle{textcontainer}=[fill=none, draw=black, shape=rectangle]
 
% Edge style
\tikzstyle{diredge}=[->]


\begin{tikzpicture}
	\begin{pgfonlayer}{nodelayer}
		\node [style=textcontainer] (0) at (0, 0) {GraphChi};
		\node [style=textcontainer] (1) at (24, -12) {TurboGraph};
		\node [style=textcontainer] (2) at (0, -12) {X-Stream};
		\node [style=textcontainer] (4) at (-12, -12) {LigraChi-g};
		\node [style=textcontainer] (5) at (24, -18) {MMap};
		\node [style=textcontainer] (6) at (17.25, -4.25) {BPP};
		\node [style=textcontainer] (7) at (-24, -12) {PrefEdge};
		\node [style=textcontainer] (8) at (0, -24) {GridGraph};
		\node [style=textcontainer] (9) at (-12, -24) {GPSA};
		\node [style=textcontainer] (10) at (-24, -24) {PathGraph};
		\node [style=textcontainer] (11) at (-3, -6) {DS-GraphChi};
		\node [style=textcontainer] (12) at (12, -24) {VENUS};
		\node [style=textcontainer] (13) at (12, -12) {NXGraph};
		\node [style=textcontainer] (14) at (12, -36) {AsyncStripe};
		\node [style=textcontainer] (16) at (0, -36) {MOSAIC};
		\node [style=textcontainer] (17) at (-24, -36) {Graphene};
		\node [style=textcontainer] (18) at (24, -24) {GraFBoost (Soft)};
		\node [style=textcontainer] (19) at (-12, -36) {Wonderland};
		\node [style=textcontainer] (20) at (0, -48) {CLIP};
		\node [style=textcontainer] (21) at (24, -30) {V-Part};
		\node [style=textcontainer] (22) at (24, -36) {LUMOS};
	\end{pgfonlayer}
	\begin{pgfonlayer}{edgelayer}
		\draw [style=diredge] (0) to (1);
		\draw [style=diredge] (0) to (2);
		\draw [style=diredge] (0) to (4);
		\draw [style=diredge] (1) to (5);
		\draw [style=diredge] (0) to (6);
		\draw [style=diredge] (0) to (7);
		\draw [style=diredge] (2) to (8);
		\draw [style=diredge] (2) to (9);
		\draw [style=diredge] (2) to (10);
		\draw [style=diredge] (0) to (11);
		\draw [style=diredge] (11) to (2);
		\draw [style=diredge] (2) to (12);
		\draw [style=diredge, bend right=15] (1) to (13);
		\draw [style=diredge, bend right=15] (13) to (1);
		\draw [style=diredge] (0) to (13);
		\draw [style=diredge] (8) to (14);
		\draw [style=diredge] (8) to (16);
		\draw [style=diredge] (8) to (17);
		\draw [style=diredge] (2) to (18);
		\draw [style=diredge] (4) to (8);
		\draw [style=diredge] (8) to (19);
		\draw [style=diredge] (16) to (20);
		\draw [style=diredge] (18) to (21);
		\draw [style=diredge] (8) to (22);
	\end{pgfonlayer}
\end{tikzpicture}

Flame graphs

Flame Graphs with Full Memory

Open the image in a new tab to click around in the graph (not working right now for some reason... you can click around locally, though).

Standard Experiment

TokuDB

Flame Graphs Website

My Source

Specifically, the commands I used to monitor CPU activity on the system were:
perf record -F 99 -a -g -- [command (e.g. ./experiment or sleep)]
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-kernel.svg

This records activity across the entire system while the command is running, so we can record for a fixed time window, e.g. 60 seconds using sleep.

Multithreading

Super-node level Parallelism

Description

Jobs are placed into a queue by TokuDB and taken out of the queue by GraphWorkers. These workers query Toku for the associated graph updates and then call batch_update with those updates as input. The only locking is around the queue. We attempt to assert that a node is never in the queue more than once (this may not be successful; still investigating, as there are discrepancies between the update counts).
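A simplified sketch of this structure, for reference; the queueing and locking follow the description above, while toku_get_updates and Graph::batch_update are stand-ins for the real calls rather than actual APIs.

```cpp
// Simplified sketch of the super-node-level parallelism described above:
// a single mutex-protected queue of node ids, GraphWorkers popping from it,
// querying Toku, and calling batch_update.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <unordered_set>

std::mutex queue_mutex;                  // the only lock in the design
std::condition_variable queue_cv;
std::queue<uint64_t> work_queue;         // node ids with pending updates
std::unordered_set<uint64_t> queued;     // try to keep each node queued at most once
bool done = false;

void enqueue_node(uint64_t node) {       // called by the TokuDB/insertion side
  std::lock_guard<std::mutex> lk(queue_mutex);
  if (queued.insert(node).second) {      // skip nodes that are already queued
    work_queue.push(node);
    queue_cv.notify_one();
  }
}

void finish() {                          // signal workers to exit once drained
  { std::lock_guard<std::mutex> lk(queue_mutex); done = true; }
  queue_cv.notify_all();
}

void graph_worker(/* Graph& g, DB& toku */) {
  for (;;) {
    uint64_t node;
    {
      std::unique_lock<std::mutex> lk(queue_mutex);
      queue_cv.wait(lk, [] { return done || !work_queue.empty(); });
      if (work_queue.empty()) return;    // done and fully drained
      node = work_queue.front();
      work_queue.pop();
      queued.erase(node);
    }
    // auto updates = toku_get_updates(node);   // query (and delete) its updates
    // g.batch_update(node, updates);           // apply them to the node's sketches
  }
}
```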

Experimental Results

Data Size vs Threads

Following the 0.95 edge-density, 0.5 geometric-parameter scheme. Runtime is given in seconds (s) or minutes (m). The number in brackets is the average update rate in updates per second (k = thousand).

| Nodes | 1000 | 5000 | 10000 | 15000 |
|---|---|---|---|---|
| Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
| Datastruct Size | 63MB | 590MB | 1.5GB | 2.5GB |
| Runtime Standard | 14.2s [67k] | 11.7m [33k] | 57.7m [27k] | 139m [25k] |
| Runtime TokuDB | 18.2s [52k] | | | 128m [27k] |
| Runtime 1 thread | 18.9s [50k] | 8.6m [46k] | 74.2m [21k] | 167m [21k] |
| Runtime 2 threads | 26.6s [30k] | 6.6m [60k] | 48.0m [33k] | 117m [30k] |
| Runtime 4 threads | 114s [8k] | 7.3m [54k] | 21.3m [74k] | 89.0m [39k] |
| Runtime 8 threads | 134s [7k] | | 21.3m [74k] | 72.5m [48k] |
| Runtime 16 threads | 154s [6k] | | | 85.7m [41k] |

Threshold Size vs Threads

15000 nodes, 2.1x10^8 edge updates, with data structure size of 2.5GB

| Threads | tau=5000 | tau=10000 | tau=20000 |
|---|---|---|---|
| 1 | 160m [22k] | 168m [21k] | 183m [19k] |
| 2 | 91m [38k] | 114m [31k] | 137m [25k] |
| 4 | 45m [78k] | 77m [45k] | 123m [28k] |
| 8 | 45m [78k] | 71m [49k] | 141m [25k] |
| 16 | 49m [71k] | 83m [42k] | 113m [31k] |

This is interesting: I expected it to perform better with a larger threshold, not a smaller one. My reasoning was that a larger threshold means each thread can spend longer working in batch_update and relatively less time querying from Toku. However, perhaps this makes sense: a more aggressive threshold keeps the tree generally smaller, so we suffer less from the deletes/queries and their locking issues, since each of these operations has a lower amortized cost on a smaller tree.

Issues

  • When inserting with multiple threads occasionally some updates are processed twice. Not yet sure if this is a toku issue or a GraphWorker issue.
  • In the last half or quarter of the run the insertion threads are very stagnant (less than 50% utilization), and this gets worse as the number of threads increases. This shouldn't be happening unless my code is very wrong or there is something weird happening with Toku. Not sure yet.
  • This workload is bad for TokuDB because we are querying for everything in the tree by the time the process ends, which is not what Toku is optimized for. This is a problem even without multithreading, but perhaps multithreading exacerbates it.

Investigation Via Flamegraphs

Flame graphs generated from a run on 5000 nodes with 8 threads, to determine what was going on with the performance.

Without Multithreading

Full Runtime Analysis for Multithreaded Setup

60 Seconds: GraphWorkers First Start Working

60 Seconds: Near the End, GraphWorkers Only 50% CPU Utilization

What this Tells Us

We can see that the issue with performance, especially with a large number of threads, is that there is a lot of waiting for locks or other events. This is represented by the swapper task in linux and is visible as an increasing portion of the runtime in the above flame-graphs.

Other Ideas

  • Mutexes vs spinlocks vs condition variables. Some online resources say mutex will handle high contention better and will be more performant even on small critical regions.
  • Two threads for insertion so we can move results of queries to a buffer (insertion to sketch is long part) (nah)
  • Sketch level parallelism

Sketch-level parallelism

OpenMP results

Data are presented as OpenMP time as a percentage of single-threaded time (lower is better). Tested on the lab machine with 20 available cores.
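For reference, sketch-level parallelism here amounts to parallelizing the per-bucket work of a sketch update across OpenMP threads. The sketch below is a minimal illustration under that assumption; the Bucket layout and the hash/checksum arithmetic are placeholders, not the project's actual update routine.

```cpp
// Minimal illustration of sketch-level parallelism with OpenMP (compile with
// -fopenmp): the buckets of a sketch are updated in parallel for each batch
// of stream updates.
#include <cstdint>
#include <vector>

struct Bucket { uint64_t a = 0, c = 0; };   // placeholder fields

void apply_batch(std::vector<Bucket>& buckets,
                 const std::vector<uint64_t>& updates) {
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)buckets.size(); ++i) {
    for (uint64_t u : updates) {
      // placeholder "does this update land in bucket i" test, then XOR update
      if (((u * 0x9e3779b97f4a7c15ULL) >> (i % 64)) & 1) {
        buckets[i].a ^= u;
        buckets[i].c ^= u * (uint64_t)(i + 1);
      }
    }
  }
}
```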

| logn | 10e3 | 10e4 | 10e5 | 10e6 | 10e7 |
|---|---|---|---|---|---|
| 2 | 2828 | 48.4 | 49.5 | 71.6 | 69.9 |
| 4 | 29.1 | 28.6 | 61.5 | 41.3 | 44.0 |
| 6 | 21.0 | 17.1 | 48.1 | 26.7 | 26.1 |
| 8 | 43.1 | 22.6 | 38.7 | 26.8 | 22.0 |
| 10 | 18.5 | 68.7 | 30.3 | 22.5 | 19.4 |
| 12 | 8.9 | 26.6 | 30.2 | 21.8 | 21.2 |
| 14 | 36.0 | 29.7 | 25.6 | 20.4 | 18.7 |
| 16 | 25.3 | 19.8 | 25.7 | 18.8 | 17.2 |
| 18 | 21.5 | 19.1 | 23.3 | 16.6 | 15.9 |
| 20 | 33.6 | 29.4 | 20.4 | 15.2 | 14.6 |
| 22 | 17.2 | 32.7 | 24.2 | 22.8 | 22.5 |
| 24 | 8.9 | 34.1 | 25.6 | 22.4 | 21.5 |
| 26 | 17.2 | 38.7 | 24.4 | 21.1 | 20.1 |
| 28 | 45.2 | 38.4 | 24.8 | 19.4 | 18.9 |
| 30 | 43.2 | 40.5 | 22.6 | 18.1 | 17.8 |

OpenMP experiment

| logn | 10e3 | 10e4 | 10e5 | 10e6 | 10e7 |
|---|---|---|---|---|---|
| 2 | 5065 | 48.9 | 60.3 | 75.3 | 56.6 |
| 4 | 138. | 19.2 | 61.9 | 38.2 | 29.1 |
| 6 | 35.8 | 22.0 | 39.1 | 25.8 | 20.9 |
| 8 | 18.1 | 20.4 | 31.8 | 20.2 | 17.5 |
| 10 | 29.7 | 29.8 | 24.8 | 15.9 | 14.6 |
| 12 | 28.0 | 28.9 | 23.6 | 15.1 | 14.4 |
| 14 | 19.7 | 32.6 | 16.4 | 13.5 | 12.5 |
| 16 | 7.2 | 32.3 | 14.8 | 11.8 | 11.4 |
| 18 | 7.7 | 27.0 | 12.7 | 10.6 | 10.2 |
| 20 | 7.8 | 23.1 | 11.5 | 9.4 | 9.2 |
| 22 | 20.2 | 33.5 | 16.8 | 15.0 | 15.1 |
| 24 | 13.3 | 27.8 | 15.3 | 14.1 | 14.1 |
| 26 | 25.7 | 25.0 | 14.1 | 13.0 | 12.8 |
| 28 | 17.9 | 23.3 | 13.3 | 12.3 | 11.9 |
| 30 | 18.0 | 20.7 | 13.6 | 11.5 | 11.2 |

64-bit openMP experiment

spooky k skeletons and haunted spanning forests

Experimental conditions are the same as for the bipartiteness profiling, except that for memory profiling we only have a single trial for each algorithm and each number of vertices instead of 10. Additionally, we only profile total memory consumption and omit requested memory data.

Time Profiling

time graph

| # vertices | connected components | spanning forest | 2-skeleton | 4-skeleton | 8-skeleton |
|---|---|---|---|---|---|
| 1000 | 1170 | 1168 | 2604.3 | 5606.9 | 12006.2 |
| 2000 | 3331.3 | 3331.3 | 7048.6 | 14737.6 | 31157.1 |
| 3000 | 6490.4 | 6488.1 | 13463.5 | 27876.3 | 58593 |
| 4000 | 8767.3 | 8773.9 | 18092.3 | 37301.8 | 78294.4 |
| 5000 | 12849.2 | 12862.6 | 26330.7 | 54198.7 | 113219 |
| 6000 | 16350.1 | 16354.1 | 33432.3 | 68659.8 | 143259 |
| 7000 | 19130.7 | 19138.7 | 39047.7 | 80098.2 | 166926 |
| 8000 | 21890.7 | 21908.1 | 44619.9 | 91580.7 | 190750 |
| 9000 | 28106.6 | 28108.1 | 57135.5 | 117060 | 244311 |
| 10000 | 31273 | 31301 | 63558.1 | 130038 | 271351 |
| 11000 | 34421.5 | 34437.1 | 69893.5 | 142946 | 298450 |
| 12000 | 39307.8 | 39332.4 | 79787.1 | 163276 | 340598 |
| 13000 | 42623.2 | 42627.8 | 86525.5 | 176931 | 369849 |
| 14000 | 45922.5 | 45943.9 | 93248.5 | 190565 | 398690 |
| 15000 | 49227.8 | 49237.4 | 99827.5 | 204221 | 427022 |

Memory Profiling

memory graph

| # vertices | connected components | spanning forest | 2-skeleton | 4-skeleton | 8-skeleton |
|---|---|---|---|---|---|
| 1000 | 61481536.0 | 61505568.0 | 122929768.0 | 245729992.0 | 491330472.0 |
| 2000 | 164001536.0 | 164049568.0 | 328017768.0 | 655858040.0 | 1311538664.0 |
| 3000 | 320170000.0 | 320242032.0 | 640402632.0 | 1280579640.0 | 2560933784.0 |
| 4000 | 426866064.0 | 426962096.0 | 853842904.0 | 1707412360.0 | 3414550952.0 |
| 5000 | 629961616.0 | 630081648.0 | 1260081880.0 | 2519842296.0 | 5039363032.0 |
| 6000 | 815842656.0 | 815986688.0 | 1631892040.0 | 3263414600.0 | 6526459640.0 |
| 7000 | 951802896.0 | 951970928.0 | 1903860360.0 | 3807303288.0 | 7614189448.0 |
| 8000 | 1087763104.0 | 1087955136.0 | 2175828776.0 | 4351192152.0 | 8701918712.0 |
| 9000 | 1426762592.0 | 1426978624.0 | 2853875800.0 | 5707237304.0 | 11413961160.0 |
| 10000 | 1585282592.0 | 1585522624.0 | 3170964024.0 | 6341364152.0 | 12682171496.0 |
| 11000 | 1743801920.0 | 1744065952.0 | 3488052248.0 | 6975494328.0 | 13950380520.0 |
| 12000 | 2042098976.0 | 2042387008.0 | 4084693480.0 | 8168728600.0 | 16336800392.0 |
| 13000 | 2212267120.0 | 2212579152.0 | 4425077448.0 | 8849448792.0 | 17698192376.0 |
| 14000 | 2382435360.0 | 2382771392.0 | 4765461784.0 | 9530169816.0 | 19059585848.0 |
| 15000 | 2552604048.0 | 2552964080.0 | 5105845800.0 | 10210890888.0 | 20420981064.0 |

Unified Host and Network Dataset findings

So far I've looked at a sample of the whole dataset. The whole dataset is very large; I'm not sure exactly how big, but it seems like it could be on the order of terabytes. https://datasets.trovares.com/cyber/LANL/index.html

The sample I've looked at is about 10GB. This is what I've gathered from this sample:

  • There are 115949436 total edges.
  • Found 115600796 duplicate edges.
  • Found 348640 unique edges.
  • There are 20027 nodes.

Results for the whole dataset:

  • There are 17882795024 total edges.
  • Found 17880111216 duplicate edges.
  • Found 2683808 unique edges.
  • There are 41584 nodes.

The original dataset is given as a directed graph where (u,v) means u sends some packets to v, but we can discard the direction of each edge to get an undirected graph where (u,v) means u and v send packets to each other.

A duplicate edge is an edge between (u,v) such that there is at least one other edge in the stream that is also between (u,v); a unique edge is an edge that is not a duplicate.
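A minimal sketch of the counting pass implied by these definitions; the input format (one "u v" pair per line) and 32-bit node ids are assumptions.

```cpp
// Illustrative counting pass for the statistics above: direction is dropped
// by normalizing (u, v) to (min, max); an edge occurrence is "unique" if its
// normalized pair appears exactly once in the stream, and a "duplicate"
// otherwise, matching the definitions given above.
#include <algorithm>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

int main() {
  unsigned long long u, v, total = 0;
  std::unordered_map<unsigned long long, unsigned long long> edge_count;
  std::unordered_set<unsigned long long> nodes;
  while (std::scanf("%llu %llu", &u, &v) == 2) {
    ++total;
    nodes.insert(u);
    nodes.insert(v);
    // drop direction: (u,v) and (v,u) map to the same key (assumes ids < 2^32)
    ++edge_count[(std::min(u, v) << 32) | std::max(u, v)];
  }
  unsigned long long unique = 0;
  for (const auto& kv : edge_count)
    if (kv.second == 1) ++unique;        // pairs that occur exactly once
  std::printf("total edges:     %llu\n", total);
  std::printf("duplicate edges: %llu\n", total - unique);
  std::printf("unique edges:    %llu\n", unique);
  std::printf("nodes:           %zu\n", nodes.size());
}
```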

Clone this wiki locally