Stale wiki pages
In all subsequent experiments, Aspen was built according to its developers' specifications (compiled with g++-7 -fcilkplus and the Cilk Plus runtime libraries) and given access to all 48 cores and the full memory of the new lab machine.
Input Stream | Ingestion Speed (# 10^6 updates / sec) | Total Runtime (sec) |
---|---|---|
kron_13_fourth | 1.64828 | 10.597 |
kron_15_fourth | 1.27607 | 218.918 |
kron_16_fourth | 1.22838 | 909.619 |
kron_17_fourth | 1.20785 | 3700.27 |
Input Stream | Ingestion Speed (# 10^6 updates / sec) | Total Runtime (sec) |
---|---|---|
kron_13_fourth | 1.47915 | 11.8092 |
kron_15_fourth | 1.12393 | 248.551 |
kron_16_fourth | 1.12532 | 992.933 |
kron_17_fourth | 1.07073 | 4174.12 |
On the kron_13_fourth stream:
Batch Size (# updates) | Ingestion Speed (# 10^6 updates / sec) | Total Runtime (sec) |
---|---|---|
10^3 | 1.64084 | 10.645 |
10^4 | 2.68386 | 6.51 |
10^5 | 3.07802 | 5.676 |
10^6 | 3.4112 | 5.122 |
10^7 | 3.42795 | 5.099 |
Ran against naive Kruskal's, NetworkX, and our streaming implementation of connected components on graphs with density n^2 * 0.95. Time is measured in wall-clock seconds.
n | Naive | NetworkX | Streaming |
---|---|---|---|
10^3 | 1s | 183 (100K) | |
5*10^3 | 61 (1.5G) | ||
6*10^3 | 1m 31s (2.9G) | ||
7*10^3 | 2m 6s (3.6G) | ||
10^4 | 4m 58s (6.1G) | 30000+ (2.3G) | |
1.1*10^4 | 6m 4s (7.1G) | ||
1.2*10^4 | 7m 24s (11.8G) | ||
1.3*10^4 | 8m 43s (13.1G) | ||
1.4*10^4 | 10m 9s (14.7G) | ||
1.5*10^4 | 11m 50s (16.2G) | ||
1.6*10^4 | 13m 29s (17.8G) | ||
2*10^4 | (21.9G) |
n | Naive | NetworkX | Streaming |
---|---|---|---|
10^3 | |||
5*10^3 | 26m 1s | ||
6*10^3 | 51m 53s | ||
7*10^3 | 12h+ (^C) | ||
10^4 | 28h+ (^C) | ||
10^5 |
n | Naive | NetworkX | Streaming |
---|---|---|---|
1.2*10^4 | 1h 15m 44s | ||
1.3*10^4 | 1h 27m 22s | ||
1.4*10^4 | 2h 10m 20s | ||
1.5*10^4 | 5h 4m 12s | ||
1.6*10^4 | 10h 46m 47s* | ||
10^5 |
*: Dubious, since the program did not terminate on its own, but still gave output that signaled the end of execution.
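For reference, the "naive Kruskal's" baseline is essentially union-find over the full edge list; below is a minimal sketch of that kind of baseline (not necessarily identical to the code used in these runs).

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Kruskal-style connected components: union-find over the full edge list.
struct UnionFind {
  std::vector<uint32_t> parent;
  explicit UnionFind(uint32_t n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);
  }
  uint32_t find(uint32_t x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
    return x;
  }
  bool unite(uint32_t a, uint32_t b) {
    a = find(a); b = find(b);
    if (a == b) return false;
    parent[a] = b;
    return true;
  }
};

// Count connected components of an n-vertex graph given its edge list.
uint32_t count_components(uint32_t n,
                          const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
  UnionFind uf(n);
  uint32_t components = n;
  for (const auto& e : edges)
    if (uf.unite(e.first, e.second)) --components;
  return components;
}
```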
Values in the tables are averages over 10 trials (different random graphs). In each experiment we profile both ingestion and algorithm execution (this is necessary given the way the bipartiteness test's reduction to CC must be implemented). For random graph generation, the probability of an edge appearing was set to 0.03, the geometric insertion/removal parameter was set to 0.5, and there is no limit on how many times an edge may appear in the stream. Experiments were conducted on the lab machine.
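For context, below is a minimal sketch of one plausible reading of this generation scheme. The 0.03 inclusion probability and 0.5 geometric parameter come from the text above, but exactly how our generator turns them into an insert/delete stream is an assumption here.

```cpp
#include <cstdint>
#include <random>
#include <vector>

struct Update {
  uint32_t u, v;
  bool is_delete;
};

// Hypothetical generator: p is the probability an edge appears at all, and
// geo is the parameter of a geometric distribution governing how many extra
// delete/re-insert rounds each present edge goes through. The real
// generator's exact semantics may differ.
std::vector<Update> generate_stream(uint32_t n, double p = 0.03,
                                    double geo = 0.5, uint64_t seed = 0) {
  std::mt19937_64 rng(seed);
  std::bernoulli_distribution include(p);
  std::geometric_distribution<int> rounds(geo);
  std::vector<Update> stream;
  for (uint32_t u = 0; u < n; ++u) {
    for (uint32_t v = u + 1; v < n; ++v) {
      if (!include(rng)) continue;
      stream.push_back({u, v, false});       // edge is inserted
      int r = rounds(rng);                   // extra churn on this edge
      for (int i = 0; i < r; ++i) {
        stream.push_back({u, v, true});      // delete ...
        stream.push_back({u, v, false});     // ... and re-insert
      }
    }
  }
  // A real generator would interleave updates across edges while keeping each
  // edge's own insert/delete order intact.
  return stream;
}
```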
n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
---|---|---|---|
500 | 50 | 264 | 1.2 |
750 | 173 | 779.8 | 2.9 |
1000 | 332.5 | 1391.7 | 5.5 |
1250 | 667.9 | 2584.7 | 8 |
1500 | 1041.6 | 3935.9 | 13.9 |
1750 | 1439.6 | 5341.5 | 17 |
2000 | 1895.9 | 6971.4 | 28.3 |
2250 | 2812.5 | 10074 | 34.4 |
2500 | 3492.5 | 12402.5 | 41.1 |
2750 | 4248.3 | 15011.1 | 49.1 |
3000 | 5392 | 18781.4 | 59.8 |
3250 | 6334.3 | 21999.4 | 71.8 |
3500 | 7357 | 25484.6 | 87.7 |
3750 | 8453.8 | 29244 | 104.9 |
4000 | 9622.4 | 33257.5 | 117.4 |
4250 | 12440.7 | 42065.6 | 143.4 |
4500 | 13948.8 | 47178.1 | 169.5 |
4750 | 15566.2 | 52632.4 | 195.4 |
5000 | 17235.3 | 58344 | 223.2 |
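The bipartiteness-to-CC reduction referenced above is, in its standard form, a reduction onto a doubled vertex set: each vertex gets two copies, each edge (u,v) becomes two edges between opposite copies, and the graph is bipartite iff every component splits in two, i.e. iff cc(G') = 2·cc(G). Below is a minimal sketch of that standard reduction; our actual implementation may differ in its details.

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;

// Any connected-components routine works here; a tiny union-find for concreteness.
static uint32_t count_components(uint32_t n, const std::vector<Edge>& edges) {
  std::vector<uint32_t> parent(n);
  std::iota(parent.begin(), parent.end(), 0);
  auto find = [&](uint32_t x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
    return x;
  };
  uint32_t components = n;
  for (const auto& [u, v] : edges) {
    uint32_t a = find(u), b = find(v);
    if (a != b) { parent[a] = b; --components; }
  }
  return components;
}

// Standard reduction: each vertex v of G gets two copies, v and v + n; each
// edge (u, v) becomes (u, v + n) and (u + n, v). A connected component of G
// splits into two components of G' exactly when it is bipartite, so
// G is bipartite iff cc(G') == 2 * cc(G).
bool is_bipartite_via_cc(uint32_t n, const std::vector<Edge>& edges) {
  std::vector<Edge> doubled;
  doubled.reserve(2 * edges.size());
  for (const auto& [u, v] : edges) {
    doubled.emplace_back(u, v + n);
    doubled.emplace_back(u + n, v);
  }
  return count_components(2 * n, doubled) == 2 * count_components(n, edges);
}
```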
Each entry in the table is the average peak number of bytes. Data was recorded using the valgrind tool `massif` with the flags `--depth=1` and `--ignore-fn=main` enabled.
n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
---|---|---|---|
500 | 22269440.0 | 83413488.0 | 413521.0 |
750 | 45939440.0 | 168471488.0 | 811405.8 |
1000 | 61225440.0 | 224601488.0 | 1373709.0 |
1250 | 93391440.0 | 338251488.0 | 2198597.8 |
1500 | 122613440.0 | 441789488.0 | 2982961.8 |
1750 | 143035440.0 | 515407488.0 | 3832357.8 |
2000 | 163457440.0 | 589025488.0 | 5116566.6 |
2250 | 220455440.0 | 785907488.0 | 6926131.4 |
2500 | 244941440.0 | 873221488.0 | 8719897.8 |
2750 | 269427440.0 | 960535488.0 | 10233761.0 |
3000 | 319257440.0 | 1133097488.0 | 11682636.2 |
3250 | 345855440.0 | 1227515488.0 | 13144881.8 |
3500 | 372453440.0 | 1321933488.0 | 14737891.4 |
3750 | 399051440.0 | 1416351488.0 | 16751365.0 |
4000 | 425649440.0 | 1510769488.0 | 19643921.8 |
4250 | 534119440.0 | 1878547488.0 | 23607701.8 |
4500 | 565533440.0 | 1989045488.0 | 28013207.4 |
4750 | 596947440.0 | 2099543488.0 | 31976669.8 |
5000 | 628361440.0 | 2210041488.0 | 35195520.2 |
n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
---|---|---|---|
500 | 22381520.0 | 83781592.0 | 485070.4 |
750 | 46131520.0 | 169071608.0 | 964816.0 |
1000 | 61481536.0 | 225401640.0 | 1638487.2 |
1250 | 93731584.0 | 339351816.0 | 2604870.4 |
1500 | 123021552.0 | 443110040.0 | 3557333.6 |
1750 | 143511536.0 | 516948104.0 | 4610290.4 |
2000 | 164001536.0 | 590786152.0 | 6132981.6 |
2250 | 221139664.0 | 788031800.0 | 8215536.8 |
2500 | 245701680.0 | 875581832.0 | 10311092.0 |
2750 | 270263712.0 | 963131880.0 | 12146176.0 |
3000 | 320170000.0 | 1135931224.0 | 13953377.6 |
3250 | 346844016.0 | 1230585288.0 | 15796678.4 |
3500 | 373518000.0 | 1325239384.0 | 17802533.6 |
3750 | 400191984.0 | 1419893608.0 | 20267727.2 |
4000 | 426866064.0 | 1514547688.0 | 23637584.8 |
4250 | 535479584.0 | 1882900600.0 | 28105636.8 |
4500 | 566973616.0 | 1993654728.0 | 33043212.8 |
4750 | 598467600.0 | 2104408152.0 | 37566742.4 |
5000 | 629961616.0 | 2215162056.0 | 41363148.0 |
Performance numbers using pthreads for the graph workers and without any sketch-level parallelism.
Runtime is given in seconds (s) or minutes (m). The number in brackets is the average updates per second (k = thousand, m = million).
Nodes | 1000 | 5000 | 10000 | 15000 |
---|---|---|---|---|
Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
Datastruct Size | 63MB | 590MB | 1.5GB | 2.5 GB |
Runtime Standard | 13.5s [74k] | 11.6m [34k] | 57.4m [28k] | 138m [25k] |
Runtime FBT +1 | 5.2s [191k] | 3.5m [112k] | 16.6m [95k] | 40.1m [87k] |
Runtime FBT +2 | 2.9s [342k] | 1.9m [206k] | 9.2m [172k] | 22.6m [154k] |
Runtime FBT +4 | 1.7s [578k] | 1.1m [353k] | 5.4m [292k] | 13.8m [253k] |
Runtime FBT +8 | 1.1s [903k] | 41s [584k] | 3.3m [467k] | 9.1m [386k] |
Runtime FBT +16 | 0.9s [1.1m] | 31s [773k] | 2.6m [614k] | 7.4m [470k] |
These numbers are from raw ingestion into the buffer tree, without any graph streaming work being done.
Nodes | Branch Factor | Buffer Size | Number of Updates | Runtime [rate /s] |
---|---|---|---|---|
512 | 4 | 1 MB | 33.5 million | 2.6s [12.9m] |
512 | 16 | 1 MB | 33.5 million | 1.7s [19.5m] |
1024 | 16 | 2 MB | 268 million | 15.6s [17.2m] |
Testing the cluster on a simple word count Apache Spark program.
Input size: 600MB
4 worker nodes:
--- 166.97850894927979 seconds ---
--- 163.22055530548096 seconds ---
--- 164.37560272216797 seconds ---
--- 171.68829035758972 seconds ---
3 worker nodes:
--- 187.09018564224243 seconds ---
--- 184.9761152267456 seconds ---
--- 188.4610195159912 seconds ---
2 worker nodes:
--- 361.66504549980164 seconds ---
--- 362.92039012908936 seconds ---
--- 361.94735503196716 seconds ---
1 worker node:
--- 1067.6949362754822 seconds ---
--- 1059.5218110084534 seconds ---
--- 1053.6695115566254 seconds ---
0 worker nodes:
--- 2056.108319759369 seconds ---
--- 2004.2446208000183 seconds ---
--- 2048.037629365921 seconds ---
34 worker nodes:
--- 84.35914134979248 seconds ---
--- 82.6325478553772 seconds ---
--- 80.98088264465332 seconds ---
GraphX by default works for Scala only. There is a Python wrapper for GraphFrames which is what I'm using. I was able to run their connected components algorithm on the facebook friends graph sample that they have, but this was done locally (not on the cluster).
- Use spark-submit instead of pyspark so I can run it on the cluster. (Could use advice from Tyler but can do this myself if he's busy)
- Set up a Hadoop file system on the cluster. (Need some advice from Tyler)
- Modify the ansible playbook so the workers can downgrade scala/spark version themselves. Almost done with this.
- Figure out how to load our own input graphs into the algorithm. (Evan might have some suggestions on this).
- Get the final datasets from Ahmed onto the cluster and test GraphX on it. How big are the datasets? The master node of the cluster only has 350GB of storage and each worker has 30GB. I do not know if HDFS allows us to "spread" a large dataset across all the nodes.
- Make it so we can vary the number of nodes. There should be a way to do this efficiently/elegantly with spark-submit, but if not we can just use ansible.
16 GB RAM restriction experiments using Tyler's read buffering (buffer size 10^6) and batch size 10^6.
Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
---|---|---|---|---|---|
kron13 | 2.4 | 0 | 3.41272e+06 | 6.43542 | 6.45255 |
kron15 | 2.8 | 0 | 3.19362e+06 | 92.6225 | 92.7011 |
kron16 | 4.1 | 0 | 2.99069e+06 | 372.126 | 372.44 |
kron17 | 7.3 | 0 | 2.91467e+06 | 1532.25 | 1533.1 |
kron18 | 16.9 | 13.5 | 303291 | 58916.7 | 59369.1 |
Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
---|---|---|---|---|---|
kron13 | 0.577 | 0 | 141012 | 155.748 | 155.807 |
kron15 | 6.0 | 0 | 135892 | 2176.73 | 2177.15 |
kron16 | |||||
kron17 | |||||
kron18 |
Given the latest sketch update speed test results per Kenny, we can now estimate the input sizes at which our data structure becomes smaller than that of various alternatives (Boost, Networkx). We can also estimate how long our algorithm will take to run on such input sizes, giving us a sense of how close we are to outperforming the current state of the art for in-memory computation.
The above plot displays the space curves for 3 competitors (Networkx's edge list data structure, and Boost's adjacency list and adjacency matrix data structures). All 3 have O(n^2) space complexity but the constants differ. We also plot space curves for 3 variants of our sketching data structure: the currently implemented version which uses 16B per bucket, a trimmed version which reduces the size of error-checking variable c resulting in 12B per bucket, and a further trimmed version which uses half the typical number of buckets (reducing the number of repetitions used to avoid failure whp) for an effective 6B per bucket. We denote the input sizes at which each of our sketching data structures requires no more memory than the most compact alternative (Boost adjacency matrix), and note the data structure sizes of these break-even points.
We can now estimate how long it will take for our algorithm to run on the input sizes specified by these break-even points. We can see that the ~62% reduction in space from our current implementation to my proposed trimmed-bucket implementation makes an enormous difference in the length of time required to "beat" the Boost adjacency matrix algorithm - 30 hours as opposed to hundreds. This suggests that space savings are crucially valuable and that we are nearing the point at which several more iterations of runtime optimization (vectorizing sketch updates, distributing work via Spark, multithreading, buffering via WODS) are all that is required to make our algorithm feasible on inputs that alternative tools cannot match.
Further areas to consider:
- What do we need to know to make similar comparisons for existing external-memory graph solutions?
- How does our algorithm work on more "typical" inputs (larger node set, much sparser graphs) that are present in graph benchmarks such as the GAP benchmark suite? It's likely that on such graphs, we need a lot of space (which we hope can simply be on disk) but suffer little to no penalty in stream processing time per edge (and since the graphs are sparse, the edge stream should be much shorter). Ex: On GAP's Kron benchmark graph with 135M nodes and 2B edges, we only expect about a 5x slowdown in sketch update speed compared to a graph on 100K nodes (assuming that l_0 sketching on longer vectors doesn't matter much - Kenny will verify). However, our data structure size for this graph will be about 60 TB, forcing us to store this structure on disk and use WODS techniques to make sure that stream ingestion is I/O efficient. However, each sketch will only be about 0.5 MB, so they are still easy to distribute.
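As a quick arithmetic check on that last figure, dividing the estimated total structure size by the node count recovers roughly the quoted per-sketch size:

\[ \frac{60\ \text{TB}}{1.35 \times 10^{8}\ \text{nodes}} \approx 4.4 \times 10^{5}\ \text{bytes} \approx 0.44\ \text{MB per sketch} \]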
We need to decide which distributed framework to target since there may be issues with us using Spark. Add details about different options (including what we know about Spark) here.
Notes on integrating C++ code into PySpark https://www.slideshare.net/databricks/integrating-existing-c-libraries-into-pyspark-with-esther-kundin
- People have done it before
- This guide might be helpful
- There are some performance considerations - they recommend not keeping state in the C++ code? Am I misinterpreting that?
- seems like a fair bit of added complexity. Though I'm not sure how much.
Another tutorial on using C++ with Spark. At a quick glance this one seems simpler. https://medium.com/@thomaspt748/integrate-c-c-libraries-dll-so-into-apache-spark-scala-in-hadoop-cluster-56cdb26e15ea
Thrill - a C++ native distributed big data framework https://github.com/thrill/thrill
- C++ native should make integration much easier
- far less used than Spark, not an industry standard
- EXPERIMENTAL
Apache Spark
- Built on top of Hadoop, requires knowledge and understanding of Hadoop
- Widely used in industry and lots of functionality/features
- Only available in Java/Scala/Python
Open MPI
- Available in C/C++
- API forces user to handle inter-process communication
Map Reduce for C/C++ framework on Hadoop
- GitHub repo for the code: https://github.com/google/mr4c
- Article written about it: https://gigaom.com/2015/02/18/google-open-sources-a-mapreduce-framework-for-c/
- It hasn't been updated in a while. Could be complete, could be deprecated.
- Spark is known to run faster than MapReduce?
Naive impl: insert, periodically query all updates for each sketch, batch all of them, delete from DS
- Put
- Get
- Delete
Better impl: insert, and modify the flushing algorithm to batch updates according to some heuristic (e.g. all updates are for the same sketch, depth in the tree, etc.)
- Put
- Low-level access to `flush` (e.g. a custom flush or a flush hook)
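Below is a minimal sketch of the naive scheme above, written against a hypothetical Put/Get/Delete buffer-tree interface; the class and method names here are placeholders standing in for the real buffer tree and sketch APIs.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Update { uint32_t dst; bool is_delete; };

// Stand-in for the real buffer tree: an in-memory map keyed by node.
// The real structure batches and flushes to disk; this only mimics the API.
class BufferTree {
 public:
  void Put(uint32_t node, const Update& upd) { buf_[node].push_back(upd); }
  std::vector<Update> Get(uint32_t node) const {
    auto it = buf_.find(node);
    return it == buf_.end() ? std::vector<Update>{} : it->second;
  }
  void Delete(uint32_t node) { buf_.erase(node); }
 private:
  std::unordered_map<uint32_t, std::vector<Update>> buf_;
};

// Naive scheme from above: periodically drain every node's buffered updates
// as one batch into its sketch, then delete them from the data structure.
template <typename Sketch>
void drain_all(BufferTree& bt, std::vector<Sketch>& sketches) {
  for (uint32_t node = 0; node < sketches.size(); ++node) {
    std::vector<Update> upds = bt.Get(node);
    if (upds.empty()) continue;
    sketches[node].batch_update(upds);   // apply the whole batch at once
    bt.Delete(node);
  }
}
```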
Given a buffer tree with B^e = 16 and buffer size 4096, with n = 10^10 for naive and n = 10^5 for better, naive runs roughly 10-100x slower.
Each node in the above tree represents a fully external memory (EM) graph processing model (GPM) in which connected components (CC) is implemented. Each directed edge indicates that the destination node's GPM demonstrates a CC algorithm that is faster than the CC algorithm implemented in the source node's GPM. The requirement for such a demonstration is an experiment comparing the performance of the two algorithms across graphs of varying scale on identical hardware in the same environment. The sources for the studies supporting each edge are provided below in the references section.
- Although X-Stream is technically not a fully EM GPM, it is included because it is used as a performance milestone for many other fully EM GPMs.
- NXGraph demonstrates that it is faster than TurboGraph for real-world graphs up to the scale of the Twitter graph (1.47 billion edges); however, NXGraph's runtime eventually exceeds TurboGraph's for larger graphs. This performance superiority on differently sized graphs is indicated by the two directed edges between NXGraph and TurboGraph.
- NXGraph claims speedups over GridGraph and MMap, however these experiments only run a single iteration of PageRank for a single graph.
- MOSAIC and Graphene are the first fully EM GPMs to support computations on graphs with 1 trillion edges.
- MOSAIC is the first major GPM to begin exploiting NVMe architecture. GraFBoost follows suit, while CLIP extends optional leveraging of NVMe.
- Graphene excludes preprocessing time from its performance comparison, simply claiming that competitor preprocessing time is "similar or longer".
- CLIP compares itself to MOSAIC on a commodity PC, without the massively parallel Xeon Phi processors MOSAIC was intended to leverage for its graph computation. The MOSAIC paper contains CPU-only tests which still prove it is superior to GridGraph; however, it should be understood that CLIP has only been shown to be superior to this CPU-only version of MOSAIC, and that for massive graphs (>256 GB representation size) the full MOSAIC framework may or may not outperform CLIP.
- Though LUMOS does not explicitly implement a CC algorithm, the original paper proves that it performs label propagation faster than GridGraph. Label Propagation is the basis for most CC algorithms implemented by GPMs, including GridGraph.
- Similarly, V-Part compares BFS instead of CC against GraFBoost, however these similar "traversal" algorithms are essentially just different versions of the aforementioned label propagation and thus exhibit similar runtimes across GPMs.
Below are the graph data sets used by some of the most popular GPMs. Most other GPMs evaluate themselves on some subset of these graphs as well.
Graph Name | # Vertices | # Edges |
---|---|---|
live-journal | 4.8M | 69M |
netflix | 0.5M | 99M |
domain | 26M | 0.37B |
twitter-2010 | 42M | 1.5B |
uk-2007-05 | 106M | 3.7B |
uk-union | 133M | 5.4B |
yahoo-web | 1.4B | 6.6B |
Graph Name | # Vertices | # Edges |
---|---|---|
netflix | 0.5M | 0.1B |
41.7M | 1.4B | |
Friendster | 65.6M | 1.8B |
sk-2005 | 50.6M | 1.9B |
yahoo-web | 1.4B | 6.6B |
The netflix graph is bipartite.
Graph Name | # Vertices | # Edges |
---|---|---|
LiveJournal | 4.85M | 69M |
61.6M | 1.47B | |
UK | 106M | 3.74B |
Yahoo | 1.41B | 6.64B |
Graph Name | # Vertices | # Edges |
---|---|---|
rmat24 | 16.8M | 0.3B |
41.6M | 1.5B | |
rmat27 | 134.2M | 2.1B |
uk2007-05 | 105.8M | 3.7B |
hyperlink14 | 1.7246B | 64.4B |
rmat-trillion | 4.2949B | 1T |
Note: rmat data sets are synthetic graphs.
The experimental evidence supporting (almost) each edge in the graph may be found in the original publication of its destination node GPM.
- GraphChi
- X-Stream
- TurboGraph
- LigraChi-g
- MMap
- PrefEdge
- BPP
- GridGraph
- GPSA
- PathGraph
- DS-GraphChi (GraphChi with dynamic shards)
- VENUS
- NXGraph
- AsyncStripe
- Graphene
- MOSAIC
- GraFBoost (Soft)
- Wonderland
- CLIP
- V-Part
- LUMOS
- The edge from DS-GraphChi to X-Stream is supported by the DS-GraphChi paper.
- The edge from LigraChi-g to GridGraph is supported by the Wonderland paper.
The above "imporvement tree" is incomplete. The LaTeX source code used to generate the above improvement tree using the tikz
package is provided here for inclusion in other documents or expansion of the tree to include more GPMs and experiments.
% Node style
\tikzstyle{textcontainer}=[fill=none, draw=black, shape=rectangle]
% Edge style
\tikzstyle{diredge}=[->]
\begin{tikzpicture}
\begin{pgfonlayer}{nodelayer}
\node [style=textcontainer] (0) at (0, 0) {GraphChi};
\node [style=textcontainer] (1) at (24, -12) {TurboGraph};
\node [style=textcontainer] (2) at (0, -12) {X-Stream};
\node [style=textcontainer] (4) at (-12, -12) {LigraChi-g};
\node [style=textcontainer] (5) at (24, -18) {MMap};
\node [style=textcontainer] (6) at (17.25, -4.25) {BPP};
\node [style=textcontainer] (7) at (-24, -12) {PrefEdge};
\node [style=textcontainer] (8) at (0, -24) {GridGraph};
\node [style=textcontainer] (9) at (-12, -24) {GPSA};
\node [style=textcontainer] (10) at (-24, -24) {PathGraph};
\node [style=textcontainer] (11) at (-3, -6) {DS-GraphChi};
\node [style=textcontainer] (12) at (12, -24) {VENUS};
\node [style=textcontainer] (13) at (12, -12) {NXGraph};
\node [style=textcontainer] (14) at (12, -36) {AsyncStripe};
\node [style=textcontainer] (16) at (0, -36) {MOSAIC};
\node [style=textcontainer] (17) at (-24, -36) {Graphene};
\node [style=textcontainer] (18) at (24, -24) {GraFBoost (Soft)};
\node [style=textcontainer] (19) at (-12, -36) {Wonderland};
\node [style=textcontainer] (20) at (0, -48) {CLIP};
\node [style=textcontainer] (21) at (24, -30) {V-Part};
\node [style=textcontainer] (22) at (24, -36) {LUMOS};
\end{pgfonlayer}
\begin{pgfonlayer}{edgelayer}
\draw [style=diredge] (0) to (1);
\draw [style=diredge] (0) to (2);
\draw [style=diredge] (0) to (4);
\draw [style=diredge] (1) to (5);
\draw [style=diredge] (0) to (6);
\draw [style=diredge] (0) to (7);
\draw [style=diredge] (2) to (8);
\draw [style=diredge] (2) to (9);
\draw [style=diredge] (2) to (10);
\draw [style=diredge] (0) to (11);
\draw [style=diredge] (11) to (2);
\draw [style=diredge] (2) to (12);
\draw [style=diredge, bend right=15] (1) to (13);
\draw [style=diredge, bend right=15] (13) to (1);
\draw [style=diredge] (0) to (13);
\draw [style=diredge] (8) to (14);
\draw [style=diredge] (8) to (16);
\draw [style=diredge] (8) to (17);
\draw [style=diredge] (2) to (18);
\draw [style=diredge] (4) to (8);
\draw [style=diredge] (8) to (19);
\draw [style=diredge] (16) to (20);
\draw [style=diredge] (18) to (21);
\draw [style=diredge] (8) to (22);
\end{pgfonlayer}
\end{tikzpicture}
open image in new tab to click around in the graph -- (not working right now for some reason... you can click locally though)
Specifically, the commands I used to monitor CPU activity on the system were:
perf record -F 99 -a -g -- [command (ie ./experiment or sleep)]
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-kernel.svg
This records activity across the entire system while the command is running, so we can record over a fixed time window, e.g. 60 seconds using `sleep 60`.
Jobs are placed into a queue by TokuDB and taken out of the queue by GraphWorkers. These workers query Toku for the associated graph updates and then call batch_update with those updates as input. The only locking is around the queue. We attempt to assert that a node is never in the queue more than once (this may not be successful; still investigating, as there are discrepancies between the update counts).
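Below is a minimal sketch of this structure, with the queue as the only locked state and a membership set that tries to keep a node from being queued twice; the update-store query and batch_update calls are templated stand-ins for the real Toku query and sketch code.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <unordered_set>

// Work queue guarded by a single mutex; in_queue_ tries to enforce that a
// node id is never queued more than once at a time.
class WorkQueue {
 public:
  void push(uint64_t node) {
    std::lock_guard<std::mutex> lk(mtx_);
    if (in_queue_.insert(node).second) {   // only enqueue if not already there
      q_.push(node);
      cv_.notify_one();
    }
  }
  bool pop(uint64_t& node) {               // blocks until work or shutdown
    std::unique_lock<std::mutex> lk(mtx_);
    cv_.wait(lk, [&] { return !q_.empty() || done_; });
    if (q_.empty()) return false;
    node = q_.front(); q_.pop();
    in_queue_.erase(node);
    return true;
  }
  void shutdown() {
    std::lock_guard<std::mutex> lk(mtx_);
    done_ = true;
    cv_.notify_all();
  }
 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  std::queue<uint64_t> q_;
  std::unordered_set<uint64_t> in_queue_;
  bool done_ = false;
};

// Each GraphWorker repeatedly pops a node, queries its buffered updates from
// the key-value store (placeholder), and applies them as one batch.
template <typename Store, typename Graph>
void graph_worker(WorkQueue& q, Store& store, Graph& g) {
  uint64_t node;
  while (q.pop(node)) {
    auto updates = store.query_updates(node);  // placeholder for the Toku query
    g.batch_update(node, updates);
  }
}
```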
Following the 0.95, 0.5 scheme. Runtime is given in seconds (s) or minutes (m). The number in brackets is the average updates per second (k = thousand).
Nodes | 1000 | 5000 | 10000 | 15000 |
---|---|---|---|---|
Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
Datastruct Size | 63MB | 590MB | 1.5GB | 2.5 GB |
Runtime Standard | 14.2s [67k] | 11.7m [33k] | 57.7m [27k] | 139m [25k] |
Runtime TokuDB | 18.2s [52k] | 128m [27k] | ||
Runtime 1 thread | 18.9s [50k] | 8.6m [46k] | 74.2m [21k] | 167m [21k] |
Runtime 2 threads | 26.6s [30k] | 6.6m [60k] | 48.0m [33k] | 117m [30k] |
Runtime 4 threads | 114s [8k] | 7.3m [54k] | 21.3m [74k] | 89.0m [39k] |
Runtime 8 threads | 134s [7k] | 21.3m [74k] | 72.5m [48k] | |
Runtime 16 threads | 154s [6k] | 85.7m [41k] |
15000 nodes, 2.1x10^8 edge updates, with data structure size of 2.5GB
Threads | tau=5000 | tau=10000 | tau=20000 |
---|---|---|---|
1 | 160m [22k] | 168m [21k] | 183m [19k] |
2 | 91m [38k] | 114m [31k] | 137m [25k] |
4 | 45m [78k] | 77m [45k] | 123m [28k] |
8 | 45m [78k] | 71m [49k] | 141m [25k] |
16 | 49m [71k] | 83m [42k] | 113m [31k] |
This is interesting. I expected it to perform better with a larger threshold, not a smaller one. My reasoning was that a larger threshold means each thread can spend longer working on batch_update and relatively less time querying from Toku. However, perhaps this makes sense: a more aggressive threshold keeps the tree generally smaller, so we suffer less from the deletes/queries and their locking issues, since each of those operations has a lower amortized cost with a smaller tree.
- When inserting with multiple threads occasionally some updates are processed twice. Not yet sure if this is a toku issue or a GraphWorker issue.
- In the last half or quarter of the process the insertion threads are largely idle (less than 50% utilization), and this gets worse as the number of threads increases. This shouldn't be happening unless my code is very wrong or there is something weird happening with Toku. Not sure yet.
- This workload is bad for TokuDB because by the time the process ends we have queried for everything in the tree, which is not what Toku is optimized for. This is a problem even without multithreading, but perhaps multithreading exacerbates it.
FlameGraphs generated running upon 5000 nodes with 8 threads to determine what was going on with the performance.
We can see that the performance issue, especially with a large number of threads, is that there is a lot of waiting for locks or other events. This shows up as the swapper task in Linux and is visible as an increasing portion of the runtime in the above flame graphs.
- Mutexes vs spinlocks vs condition variables. Some online resources say mutex will handle high contention better and will be more performant even on small critical regions.
- Two threads for insertion so we can move results of queries to a buffer (insertion to sketch is long part) (nah)
- Sketch level parallelism
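On that last point, here is a minimal sketch of what sketch-level parallelism looks like: the buckets of a single sketch are independent, so a batch of updates can be applied across them with an OpenMP parallel for. The bucket contents and update rule below are simplified stand-ins for the real l_0-sketch bucket logic.

```cpp
#include <cstdint>
#include <vector>

// Compile with -fopenmp. Bucket contents and the update rule are simplified
// stand-ins for the real l_0-sketch bucket logic (hash test + checksums).
struct Bucket {
  uint64_t a = 0;  // accumulated (XORed) update identifiers
  uint64_t c = 0;  // simplified checksum
};

inline void bucket_update(Bucket& b, uint64_t update, uint64_t bucket_seed) {
  // The real code first hashes (update, bucket_seed) to decide whether this
  // bucket samples the update; here we accumulate unconditionally.
  b.a ^= update;
  b.c ^= update * bucket_seed;
}

// Sketch-level parallelism: buckets are independent, so each OpenMP thread
// can own a disjoint range of buckets with no locking.
void batch_update(std::vector<Bucket>& buckets,
                  const std::vector<uint64_t>& updates) {
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < static_cast<long>(buckets.size()); ++i)
    for (uint64_t u : updates)
      bucket_update(buckets[i], u, static_cast<uint64_t>(i) + 1);
}
```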
Data is presented as the OpenMP time as a percentage of the single-threaded time. Tested on the lab machine with 20 available cores.
logn | 10e3 | 10e4 | 10e5 | 10e6 | 10e7 |
---|---|---|---|---|---|
2 | 2828 | 48.4 | 49.5 | 71.6 | 69.9 |
4 | 29.1 | 28.6 | 61.5 | 41.3 | 44.0 |
6 | 21.0 | 17.1 | 48.1 | 26.7 | 26.1 |
8 | 43.1 | 22.6 | 38.7 | 26.8 | 22.0 |
10 | 18.5 | 68.7 | 30.3 | 22.5 | 19.4 |
12 | 8.9 | 26.6 | 30.2 | 21.8 | 21.2 |
14 | 36.0 | 29.7 | 25.6 | 20.4 | 18.7 |
16 | 25.3 | 19.8 | 25.7 | 18.8 | 17.2 |
18 | 21.5 | 19.1 | 23.3 | 16.6 | 15.9 |
20 | 33.6 | 29.4 | 20.4 | 15.2 | 14.6 |
22 | 17.2 | 32.7 | 24.2 | 22.8 | 22.5 |
24 | 8.9 | 34.1 | 25.6 | 22.4 | 21.5 |
26 | 17.2 | 38.7 | 24.4 | 21.1 | 20.1 |
28 | 45.2 | 38.4 | 24.8 | 19.4 | 18.9 |
30 | 43.2 | 40.5 | 22.6 | 18.1 | 17.8 |
logn | 10e3 | 10e4 | 10e5 | 10e6 | 10e7 |
---|---|---|---|---|---|
2 | 5065 | 48.9 | 60.3 | 75.3 | 56.6 |
4 | 138. | 19.2 | 61.9 | 38.2 | 29.1 |
6 | 35.8 | 22.0 | 39.1 | 25.8 | 20.9 |
8 | 18.1 | 20.4 | 31.8 | 20.2 | 17.5 |
10 | 29.7 | 29.8 | 24.8 | 15.9 | 14.6 |
12 | 28.0 | 28.9 | 23.6 | 15.1 | 14.4 |
14 | 19.7 | 32.6 | 16.4 | 13.5 | 12.5 |
16 | 7.2 | 32.3 | 14.8 | 11.8 | 11.4 |
18 | 7.7 | 27.0 | 12.7 | 10.6 | 10.2 |
20 | 7.8 | 23.1 | 11.5 | 9.4 | 9.2 |
22 | 20.2 | 33.5 | 16.8 | 15.0 | 15.1 |
24 | 13.3 | 27.8 | 15.3 | 14.1 | 14.1 |
26 | 25.7 | 25.0 | 14.1 | 13.0 | 12.8 |
28 | 17.9 | 23.3 | 13.3 | 12.3 | 11.9 |
30 | 18.0 | 20.7 | 13.6 | 11.5 | 11.2 |
Experimental conditions are the same as for the bipartiteness profiling, except that for memory profiling we have only a single trial for each algorithm and each number of vertices instead of 10. Additionally, we only profile total memory consumption and omit requested memory data.
# vertices | connected components | spanning forest | 2-skeleton | 4-skeleton | 8-skeleton |
---|---|---|---|---|---|
1000 | 1170 | 1168 | 2604.3 | 5606.9 | 12006.2 |
2000 | 3331.3 | 3331.3 | 7048.6 | 14737.6 | 31157.1 |
3000 | 6490.4 | 6488.1 | 13463.5 | 27876.3 | 58593 |
4000 | 8767.3 | 8773.9 | 18092.3 | 37301.8 | 78294.4 |
5000 | 12849.2 | 12862.6 | 26330.7 | 54198.7 | 113219 |
6000 | 16350.1 | 16354.1 | 33432.3 | 68659.8 | 143259 |
7000 | 19130.7 | 19138.7 | 39047.7 | 80098.2 | 166926 |
8000 | 21890.7 | 21908.1 | 44619.9 | 91580.7 | 190750 |
9000 | 28106.6 | 28108.1 | 57135.5 | 117060 | 244311 |
10000 | 31273 | 31301 | 63558.1 | 130038 | 271351 |
11000 | 34421.5 | 34437.1 | 69893.5 | 142946 | 298450 |
12000 | 39307.8 | 39332.4 | 79787.1 | 163276 | 340598 |
13000 | 42623.2 | 42627.8 | 86525.5 | 176931 | 369849 |
14000 | 45922.5 | 45943.9 | 93248.5 | 190565 | 398690 |
15000 | 49227.8 | 49237.4 | 99827.5 | 204221 | 427022 |
# vertices | connected components | spanning forest | 2-skeleton | 4-skeleton | 8-skeleton |
---|---|---|---|---|---|
1000.0 | 61481536.0 | 61505568.0 | 122929768.0 | 245729992.0 | 491330472.0 |
2000.0 | 164001536.0 | 164049568.0 | 328017768.0 | 655858040.0 | 1311538664.0 |
3000.0 | 320170000.0 | 320242032.0 | 640402632.0 | 1280579640.0 | 2560933784.0 |
4000.0 | 426866064.0 | 426962096.0 | 853842904.0 | 1707412360.0 | 3414550952.0 |
5000.0 | 629961616.0 | 630081648.0 | 1260081880.0 | 2519842296.0 | 5039363032.0 |
6000.0 | 815842656.0 | 815986688.0 | 1631892040.0 | 3263414600.0 | 6526459640.0 |
7000.0 | 951802896.0 | 951970928.0 | 1903860360.0 | 3807303288.0 | 7614189448.0 |
8000.0 | 1087763104.0 | 1087955136.0 | 2175828776.0 | 4351192152.0 | 8701918712.0 |
9000.0 | 1426762592.0 | 1426978624.0 | 2853875800.0 | 5707237304.0 | 11413961160.0 |
10000.0 | 1585282592.0 | 1585522624.0 | 3170964024.0 | 6341364152.0 | 12682171496.0 |
11000.0 | 1743801920.0 | 1744065952.0 | 3488052248.0 | 6975494328.0 | 13950380520.0 |
12000.0 | 2042098976.0 | 2042387008.0 | 4084693480.0 | 8168728600.0 | 16336800392.0 |
13000.0 | 2212267120.0 | 2212579152.0 | 4425077448.0 | 8849448792.0 | 17698192376.0 |
14000.0 | 2382435360.0 | 2382771392.0 | 4765461784.0 | 9530169816.0 | 19059585848.0 |
15000.0 | 2552604048.0 | 2552964080.0 | 5105845800.0 | 10210890888.0 | 20420981064.0 |
So far I've looked at a sample of the whole dataset. The whole dataset is very big; I'm not sure exactly how big, but it seems like it could be on the order of terabytes. https://datasets.trovares.com/cyber/LANL/index.html
The sample I've looked at is about 10GB.
This is what I've gathered from this sample:
There are 115949436 total edges.
Found 115600796 duplicate edges.
Found 348640 unique edges.
There are 20027 nodes.
Results for the whole dataset:
There are 17882795024 total edges.
Found 17880111216 duplicate edges.
Found 2683808 unique edges.
There are 41584 nodes.
The original dataset is given as a directed graph where (u,v) means u sends some packets to v, but we can discard the direction of each edge to get an undirected graph where (u,v) means u and v send packets to each other.
A duplicate edge is an edge between (u,v) such that there is another edge in the dataset that is also between (u,v); a unique edge is an edge that is not a duplicate.
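For reproducibility, here is a minimal sketch of how these counts can be computed in one pass; input parsing is simplified to "u v" pairs on stdin, whereas the real dataset format differs.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

int main() {
  // Assumes one directed edge "u v" per line on stdin; direction is discarded
  // by normalizing each pair to (min, max).
  std::unordered_map<uint64_t, uint64_t> edge_count;
  std::unordered_set<uint32_t> nodes;
  uint64_t total = 0;
  uint32_t u, v;
  while (std::scanf("%u %u", &u, &v) == 2) {
    ++total;
    nodes.insert(u);
    nodes.insert(v);
    uint64_t key = (static_cast<uint64_t>(std::min(u, v)) << 32) | std::max(u, v);
    ++edge_count[key];
  }
  // Per the definition above: an occurrence is "unique" only if its pair
  // appears exactly once in the whole stream; everything else is a duplicate.
  uint64_t unique = 0;
  for (const auto& kv : edge_count)
    if (kv.second == 1) ++unique;
  std::printf("There are %llu total edges.\n", (unsigned long long)total);
  std::printf("Found %llu duplicate edges.\n", (unsigned long long)(total - unique));
  std::printf("Found %llu unique edges.\n", (unsigned long long)unique);
  std::printf("There are %zu nodes.\n", nodes.size());
  return 0;
}
```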