
Stale wiki pages


Aspen

In all subsequent experiments, Aspen was built according to its developers' specifications (compiled with g++-7 -fcilkplus and the Cilk Plus runtime libraries) and given access to all 48 cores and the full memory of the new lab machine.

Performance evaluation with update batches of size 1000, unlimited memory

| Input Stream | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| kron_13_fourth | 1.64828 | 10.597 |
| kron_15_fourth | 1.27607 | 218.918 |
| kron_16_fourth | 1.22838 | 909.619 |
| kron_17_fourth | 1.20785 | 3700.27 |

Performance evaluation with update batches of size 1000, restricted to 16GB memory

| Input Stream | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| kron_13_fourth | 1.47915 | 11.8092 |
| kron_15_fourth | 1.12393 | 248.551 |
| kron_16_fourth | 1.12532 | 992.933 |
| kron_17_fourth | 1.07073 | 4174.12 |

How does Aspen's performance respond to changes in batch size?

On the kron_13_fourth stream:

| Batch Size (# updates) | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| 10^3 | 1.64084 | 10.645 |
| 10^4 | 2.68386 | 6.51 |
| 10^5 | 3.07802 | 5.676 |
| 10^6 | 3.4112 | 5.122 |
| 10^7 | 3.42795 | 5.099 |

Benchmarking Against Low Memory Conditions (No Write Optimization)

Ran against naive Kruskal's, NetworkX, and our streaming implementation of connected components on graphs with density n^2 * 0.95. Time is measured in wall-clock seconds; values in parentheses are memory usage.
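For context, a minimal sketch of what the naive in-memory baseline does: keep the current edge set in memory as the stream of insertions and deletions arrives, then run a Kruskal-style union-find pass to count components. The stream format and names below are illustrative assumptions, not the actual benchmark code.

```cpp
// Illustrative only: the naive baseline keeps the whole current edge set in
// memory (this is what drives the memory figures in the tables below) and
// answers the CC query with a Kruskal-style union-find pass at the end.
// The stream format ("i u v" / "d u v" after a node count) is an assumption.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <unordered_set>
#include <vector>

static std::vector<int> parent;
static int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }

int main() {
  int n;
  if (std::scanf("%d", &n) != 1) return 1;
  parent.resize(n);
  std::iota(parent.begin(), parent.end(), 0);

  std::unordered_set<uint64_t> edges;   // O(current #edges) memory
  char op; int u, v;
  while (std::scanf(" %c %d %d", &op, &u, &v) == 3) {
    uint64_t key = ((uint64_t)std::min(u, v) << 32) | (uint64_t)std::max(u, v);
    if (op == 'i') edges.insert(key);   // edge insertion
    else           edges.erase(key);    // edge deletion
  }
  for (uint64_t e : edges) {            // union-find over the surviving edges
    int a = find((int)(e >> 32)), b = find((int)(e & 0xffffffffu));
    if (a != b) parent[a] = b;
  }
  int cc = 0;
  for (int i = 0; i < n; ++i) cc += (find(i) == i);
  std::printf("connected components: %d\n", cc);
  return 0;
}
```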

No memory restrictions

| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 10^3 | 1s | | 183 (100K) |
| 5*10^3 | 61 (1.5G) | | |
| 6*10^3 | 1m 31s (2.9G) | | |
| 7*10^3 | 2m 6s (3.6G) | | |
| 10^4 | 4m 58s (6.1G) | | 30000+ (2.3G) |
| 1.1*10^4 | 6m 4s (7.1G) | | |
| 1.2*10^4 | 7m 24s (11.8G) | | |
| 1.3*10^4 | 8m 43s (13.1G) | | |
| 1.4*10^4 | 10m 9s (14.7G) | | |
| 1.5*10^4 | 11m 50s (16.2G) | | |
| 1.6*10^4 | 13m 29s (17.8G) | | |
| 2*10^4 | (21.9G) | | |

1GB restriction

| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 10^3 | | | |
| 5*10^3 | 26m 1s | | |
| 6*10^3 | 51m 53s | | |
| 7*10^3 | 12h+ (^C) | | |
| 10^4 | 28h+ (^C) | | |
| 10^5 | | | |

8GB restriction

| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 1.2*10^4 | 1h 15m 44s | | |
| 1.3*10^4 | 1h 27m 22s | | |
| 1.4*10^4 | 2h 10m 20s | | |
| 1.5*10^4 | 5h 4m 12s | | |
| 1.6*10^4 | 10h 46m 47s* | | |
| 10^5 | | | |

*: Dubious, since the program did not terminate on its own, but still gave output that signaled the end of execution.

Bipartiteness Reduction

Values in the tables are averaged over 10 trials (a different random graph per trial). In each experiment we profile both stream ingestion and algorithm execution (this is necessary given the way the bipartiteness-to-CC reduction must be implemented). For random graph generation, the probability of an edge appearing was set to 0.03, the geometric insertion/removal parameter was set to 0.5, and there is no limit on how many times an edge may appear in the stream. Experiments were conducted on the lab machine.
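The reduction itself is not spelled out on this page; below is a sketch of the textbook bipartite double-cover reduction from bipartiteness to CC, which may differ in detail from the implementation profiled here. The `Update` type and the `num_components` oracle are illustrative assumptions.

```cpp
// Sketch of the standard double-cover reduction from bipartiteness to CC.
// Each update (u, v) on an n-vertex graph is mapped to two updates on a
// 2n-vertex graph; G is bipartite iff the double cover has exactly twice as
// many connected components as G. Illustrative only.
#include <utility>
#include <vector>

using Update = std::pair<int, int>;  // (u, v) edge insertion/deletion

// Map one update on G to the two corresponding updates on the double cover.
// Vertex (x, side) of the cover is encoded as x + side * n.
std::vector<Update> double_cover_updates(const Update& e, int n) {
  auto [u, v] = e;
  return { {u,     v + n},    // (u, left)  -- (v, right)
           {u + n, v} };      // (u, right) -- (v, left)
}

// Usage (hypothetical CC oracle `num_components`):
//   bool bipartite = num_components(cover_stream, 2 * n)
//                    == 2 * num_components(stream, n);
```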

Time Profiling

| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 50 | 264 | 1.2 |
| 750 | 173 | 779.8 | 2.9 |
| 1000 | 332.5 | 1391.7 | 5.5 |
| 1250 | 667.9 | 2584.7 | 8 |
| 1500 | 1041.6 | 3935.9 | 13.9 |
| 1750 | 1439.6 | 5341.5 | 17 |
| 2000 | 1895.9 | 6971.4 | 28.3 |
| 2250 | 2812.5 | 10074 | 34.4 |
| 2500 | 3492.5 | 12402.5 | 41.1 |
| 2750 | 4248.3 | 15011.1 | 49.1 |
| 3000 | 5392 | 18781.4 | 59.8 |
| 3250 | 6334.3 | 21999.4 | 71.8 |
| 3500 | 7357 | 25484.6 | 87.7 |
| 3750 | 8453.8 | 29244 | 104.9 |
| 4000 | 9622.4 | 33257.5 | 117.4 |
| 4250 | 12440.7 | 42065.6 | 143.4 |
| 4500 | 13948.8 | 47178.1 | 169.5 |
| 4750 | 15566.2 | 52632.4 | 195.4 |
| 5000 | 17235.3 | 58344 | 223.2 |

Time Graph

Memory Profiling

Each entry in the table is the average peak number of bytes. Data were recorded using the valgrind tool massif with the flags --depth=1 and --ignore-fn=main enabled.

Requested Memory

| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 22269440.0 | 83413488.0 | 413521.0 |
| 750 | 45939440.0 | 168471488.0 | 811405.8 |
| 1000 | 61225440.0 | 224601488.0 | 1373709.0 |
| 1250 | 93391440.0 | 338251488.0 | 2198597.8 |
| 1500 | 122613440.0 | 441789488.0 | 2982961.8 |
| 1750 | 143035440.0 | 515407488.0 | 3832357.8 |
| 2000 | 163457440.0 | 589025488.0 | 5116566.6 |
| 2250 | 220455440.0 | 785907488.0 | 6926131.4 |
| 2500 | 244941440.0 | 873221488.0 | 8719897.8 |
| 2750 | 269427440.0 | 960535488.0 | 10233761.0 |
| 3000 | 319257440.0 | 1133097488.0 | 11682636.2 |
| 3250 | 345855440.0 | 1227515488.0 | 13144881.8 |
| 3500 | 372453440.0 | 1321933488.0 | 14737891.4 |
| 3750 | 399051440.0 | 1416351488.0 | 16751365.0 |
| 4000 | 425649440.0 | 1510769488.0 | 19643921.8 |
| 4250 | 534119440.0 | 1878547488.0 | 23607701.8 |
| 4500 | 565533440.0 | 1989045488.0 | 28013207.4 |
| 4750 | 596947440.0 | 2099543488.0 | 31976669.8 |
| 5000 | 628361440.0 | 2210041488.0 | 35195520.2 |

Requested Memory Graph

Total Memory

| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 22381520.0 | 83781592.0 | 485070.4 |
| 750 | 46131520.0 | 169071608.0 | 964816.0 |
| 1000 | 61481536.0 | 225401640.0 | 1638487.2 |
| 1250 | 93731584.0 | 339351816.0 | 2604870.4 |
| 1500 | 123021552.0 | 443110040.0 | 3557333.6 |
| 1750 | 143511536.0 | 516948104.0 | 4610290.4 |
| 2000 | 164001536.0 | 590786152.0 | 6132981.6 |
| 2250 | 221139664.0 | 788031800.0 | 8215536.8 |
| 2500 | 245701680.0 | 875581832.0 | 10311092.0 |
| 2750 | 270263712.0 | 963131880.0 | 12146176.0 |
| 3000 | 320170000.0 | 1135931224.0 | 13953377.6 |
| 3250 | 346844016.0 | 1230585288.0 | 15796678.4 |
| 3500 | 373518000.0 | 1325239384.0 | 17802533.6 |
| 3750 | 400191984.0 | 1419893608.0 | 20267727.2 |
| 4000 | 426866064.0 | 1514547688.0 | 23637584.8 |
| 4250 | 535479584.0 | 1882900600.0 | 28105636.8 |
| 4500 | 566973616.0 | 1993654728.0 | 33043212.8 |
| 4750 | 598467600.0 | 2104408152.0 | 37566742.4 |
| 5000 | 629961616.0 | 2215162056.0 | 41363148.0 |

Total Memory Graph

Buffer Tree Performance

Performance of graph streaming with Buffer-Tree

05/08/2021 Performance numbers

Performance numbers using pthreads for the graph workers and without any sketch-level parallelism.
Runtime is given in seconds (s) or minutes (m). The number in brackets is the average update rate in updates per second (k = thousand, m = million).

| Nodes | 1000 | 5000 | 10000 | 15000 |
|---|---|---|---|---|
| Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
| Datastruct Size | 63MB | 590MB | 1.5GB | 2.5GB |
| Runtime Standard | 13.5s [74k] | 11.6m [34k] | 57.4m [28k] | 138m [25k] |
| Runtime FBT +1 | 5.2s [191k] | 3.5m [112k] | 16.6m [95k] | 40.1m [87k] |
| Runtime FBT +2 | 2.9s [342k] | 1.9m [206k] | 9.2m [172k] | 22.6m [154k] |
| Runtime FBT +4 | 1.7s [578k] | 1.1m [353k] | 5.4m [292k] | 13.8m [253k] |
| Runtime FBT +8 | 1.1s [903k] | 41s [584k] | 3.3m [467k] | 9.1m [386k] |
| Runtime FBT +16 | 0.9s [1.1m] | 31s [773k] | 2.6m [614k] | 7.4m [470k] |

Buffer-Tree Ingestion Rates

These numbers are from raw ingestion into the buffer tree, without any graph streaming work being done.

05/08/2021 Fully Synchronous

| Nodes | Branch Factor | Buffer Size | Number of Updates | Runtime [rate /s] |
|---|---|---|---|---|
| 512 | 4 | 1 MB | 33.5 million | 2.6s [12.9m] |
| 512 | 16 | 1 MB | 33.5 million | 1.7s [19.5m] |
| 1024 | 16 | 2 MB | 268 million | 15.6s [17.2m] |

Cluster experiment 1

Testing the cluster on a simple word count Apache Spark program.
Input size: 600MB

4 worker nodes:
--- 166.97850894927979 seconds ---
--- 163.22055530548096 seconds ---
--- 164.37560272216797 seconds ---
--- 171.68829035758972 seconds ---

3 worker nodes:
--- 187.09018564224243 seconds ---
--- 184.9761152267456 seconds ---
--- 188.4610195159912 seconds ---

2 worker nodes:
--- 361.66504549980164 seconds ---
--- 362.92039012908936 seconds ---
--- 361.94735503196716 seconds ---

1 worker node:
--- 1067.6949362754822 seconds ---
--- 1059.5218110084534 seconds ---
--- 1053.6695115566254 seconds ---

0 worker nodes:
--- 2056.108319759369 seconds ---
--- 2004.2446208000183 seconds ---
--- 2048.037629365921 seconds ---

34 worker nodes:
--- 84.35914134979248 seconds ---
--- 82.6325478553772 seconds ---
--- 80.98088264465332 seconds ---

GraphX by default works for Scala only. There is a Python wrapper for GraphFrames, which is what I'm using. I was able to run their connected components algorithm on the Facebook friends sample graph that they provide, but this was done locally (not on the cluster).

  1. Use spark-submit instead of pyspark so I can run it on the cluster. (Could use advice from Tyler but can do this myself if he's busy)
  2. Set up a Hadoop file system on the cluster. (Need some advice from Tyler)
  3. Modify the ansible playbook so the workers can downgrade scala/spark version themselves. Almost done with this.
  4. Figure out how to load our own input graphs into the algorithm. (Evan might have some suggestions on this).
  5. Get the final datasets from Ahmed onto the cluster and test GraphX on it. How big are the datasets? The master node of the cluster only has 350GB of storage and each worker has 30GB. I do not know if HDFS allows us to "spread" a large dataset across all the nodes.
  6. Make it so we can vary the number of nodes. There should be a way to do this efficiently/elegantly with spark-submit, but if not we can just use ansible.

Competitor Speed and Memory Experiments

16 GB RAM restriction experiments using Tyler's read buffering (buffer size 10^6) and batch size 10^6.

Aspen

| Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
|---|---|---|---|---|---|
| kron13 | 2.4 | 0 | 3.41272e+06 | 6.43542 | 6.45255 |
| kron15 | 2.8 | 0 | 3.19362e+06 | 92.6225 | 92.7011 |
| kron16 | 4.1 | 0 | 2.99069e+06 | 372.126 | 372.44 |
| kron17 | 7.3 | 0 | 2.91467e+06 | 1532.25 | 1533.1 |
| kron18 | 16.9 | 13.5 | 303291 | 58916.7 | 59369.1 |

Terrace

| Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
|---|---|---|---|---|---|
| kron13 | 0.577 | 0 | 141012 | 155.748 | 155.807 |
| kron15 | 6.0 | 0 | 135892 | 2176.73 | 2177.15 |
| kron16 | | | | | |
| kron17 | | | | | |
| kron18 | | | | | |

Current performance estimates

Given the latest sketch update speed test results from Kenny, we can now estimate the input sizes at which our data structure becomes smaller than those of various alternatives (Boost, NetworkX). We can also estimate how long our algorithm will take to run on such input sizes, giving us a sense of how close we are to outperforming the current state of the art for in-memory computation.

The above plot displays the space curves for 3 competitors (NetworkX's edge list data structure, and Boost's adjacency list and adjacency matrix data structures). All 3 have O(n^2) space complexity, but the constants differ. We also plot space curves for 3 variants of our sketching data structure: the currently implemented version, which uses 16B per bucket; a trimmed version, which reduces the size of the error-checking variable c for 12B per bucket; and a further trimmed version, which uses half the typical number of buckets (reducing the number of repetitions used to avoid failure whp) for an effective 6B per bucket. We mark the input sizes at which each of our sketching data structures requires no more memory than the most compact alternative (the Boost adjacency matrix), and note the data structure sizes at these break-even points.

We can now estimate how long it will take for our algorithm to run on the input sizes specified by these break-even points. We can see that the ~62% reduction in space from our current implementation to my proposed trimmed-bucket implementation makes an enormous difference in the length of time required to "beat" the Boost adjacency matrix algorithm - 30 hours as opposed to hundreds. This suggests that space savings are crucially valuable and that we are nearing the point at which several more iterations of runtime optimization (vectorizing sketch updates, distributing work via Spark, multithreading, buffering via WODS) are all that is required to make our algorithm feasible on inputs that alternative tools cannot match.

Further areas to consider:

  • What do we need to know to make similar comparisons for existing external-memory graph solutions?
  • How does our algorithm work on more "typical" inputs (larger node set, much sparser graphs) that are present in graph benchmarks such as the GAP benchmark suite? It's likely that on such graphs, we need a lot of space (which we hope can simply be on disk) but suffer little to no penalty in stream processing time per edge (and since the graphs are sparse, the edge stream should be much shorter). Ex: On GAP's Kron benchmark graph with 135M nodes and 2B edges, we only expect about a 5x slowdown in sketch update speed compared to a graph on 100K nodes (assuming that l_0 sketching on longer vectors doesn't matter much - Kenny will verify). However, our data structure size for this graph will be about 60 TB, forcing us to store this structure on disk and use WODS techniques to make sure that stream ingestion is I/O efficient. However, each sketch will only be about 0.5 MB, so they are still easy to distribute.
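As a rough consistency check of the 60 TB figure in the last bullet, taking the quoted per-sketch size of roughly 0.5 MB and the 135M-node count as given:

```latex
\text{total sketch space} \approx n \cdot s_{\text{sketch}}
  \approx 1.35 \times 10^{8} \times 0.5~\text{MB}
  \approx 6.8 \times 10^{7}~\text{MB}
  \approx 67~\text{TB},
```

which is in the same ballpark as the "about 60 TB" estimate above.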

Distributed Framework notes

We need to decide which distributed framework to target since there may be issues with us using Spark. Add details about different options (including what we know about Spark) here.

Notes on integrating C++ code into PySpark https://www.slideshare.net/databricks/integrating-existing-c-libraries-into-pyspark-with-esther-kundin

  • People have done it before
  • This guide might be helpful
  • There are some performance considerations - they recommend not keeping state in the C++ code? Am I misinterpreting that?
  • Seems like a fair bit of added complexity, though I'm not sure how much.

Another tutorial on using C++ with Spark. At a quick glance this one seems simpler. https://medium.com/@thomaspt748/integrate-c-c-libraries-dll-so-into-apache-spark-scala-in-hadoop-cluster-56cdb26e15ea
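To get a feel for what such an integration needs on the C++ side, here is a minimal hedged sketch of a C-ABI shim that a PySpark job could load via ctypes (or wrap for JNI). The function name apply_updates and its stateless signature are illustrative assumptions, not our actual API.

```cpp
// sketch_shim.cpp -- illustrative only; compile with: g++ -shared -fPIC ...
// Exposes a flat C interface so the library can be loaded from Python
// (ctypes/cffi) or wrapped for JNI. Function name and signature are assumed
// for illustration, not our real API.
#include <cstdint>

extern "C" {

// Apply a batch of (src, dst, is_delete) updates and return how many were
// applied. Keeping the entry point free of hidden state sidesteps the
// "don't keep state in the C++ code" concern raised in the slides above.
int64_t apply_updates(const int64_t* src, const int64_t* dst,
                      const uint8_t* is_delete, int64_t count) {
  int64_t applied = 0;
  for (int64_t i = 0; i < count; ++i) {
    // ... hand each update to the real sketching code here ...
    (void)src[i]; (void)dst[i]; (void)is_delete[i];
    ++applied;
  }
  return applied;
}

}  // extern "C"
```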

Thrill - a C++ native distributed big data framework https://github.com/thrill/thrill

  • C++ native should make integration much easier
  • far less used than Spark, not an industry standard
  • EXPERIMENTAL

Apache Spark

  • Built on top of Hadoop, requires knowledge and understanding of Hadoop
  • Widely used in industry and lots of functionality/features
  • Only available in Java/Scala/Python

Open MPI

  • Available in C/C++
  • API forces user to handle inter-process communication

MapReduce framework for C/C++ on Hadoop

External DB requirements for Batched Operations

Naive impl: insert, periodically query all updates for each sketch, batch all of them, delete from DS

  • Put
  • Get
  • Delete

Better impl: insert, and modify the flushing algorithm to batch updates according to some heuristic (e.g. all updates are for the same sketch, depth in the tree, etc.); see the interface sketch after the list below.

  • Put
  • Low level access to flush (e.g. custom flush, flush hook)
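
A rough sketch of the two interface shapes described above; the type and method names are illustrative, not any particular database's API.

```cpp
// Illustrative interfaces only; names do not correspond to a real DB API.
#include <cstdint>
#include <functional>
#include <vector>

struct Update { uint64_t node; uint64_t edge_update; };

// Naive impl: the graph code drives everything through generic point operations.
struct NaiveStore {
  virtual void put(const Update& u) = 0;
  virtual std::vector<Update> get_all(uint64_t node) = 0;  // query a sketch's updates
  virtual void remove_all(uint64_t node) = 0;              // then delete them
  virtual ~NaiveStore() = default;
};

// Better impl: the store batches internally and hands ready batches to a flush
// hook (e.g. when all buffered updates target the same sketch, or a buffer at
// some depth of the tree fills up).
struct BatchingStore {
  using FlushHook = std::function<void(uint64_t node, const std::vector<Update>&)>;
  virtual void put(const Update& u) = 0;
  virtual void set_flush_hook(FlushHook hook) = 0;         // low-level flush access
  virtual ~BatchingStore() = default;
};
```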

Given a buffer tree with B^e = 16, buffer size 4096, n = 10^10 for naive, n = 10^5 for better, naive runs roughly 10-100x slower.

External memory performance comparison

Improvements in External Memory Connected Components Performance

graph

Each node in the above tree represents a fully external memory (EM) graph processing model (GPM) in which connected components (CC) is implemented. Each directed edge indicates that the destination GPM demonstrates a CC algorithm that is faster than the CC algorithm implemented in the source GPM. The requirement for such a demonstration is an experiment comparing the performance of the two algorithms across graphs of varying scale on identical hardware in the same environment. The sources for the studies supporting each edge are provided below in the references section.

Things to note

  • Although X-Stream is technically not a fully EM GPM, it is included because it is used as a performance milestone for many other fully EM GPMs.
  • NXGraph demonstrates that it is faster than TurboGraph for real-world graphs up to the scale of the Twitter graph (1.47 billion edges); however, TurboGraph eventually overtakes NXGraph on larger graphs. This scale-dependent relationship is indicated by the two directed edges between NXGraph and TurboGraph.
  • NXGraph claims speedups over GridGraph and MMap; however, these experiments only run a single iteration of PageRank on a single graph.
  • Mosaic and Graphene are the first fully EM GPMs which support computations on graphs with 1 trillion edges.
  • MOSAIC is the first major GPM to begin exploiting NVMe architecture. GraFBoost follows suit, while CLIP extends optional leveraging of NVMe.
  • Graphene excludes preprocessing time from its performance comparison, simply claiming that competitor preprocessing time is "similar or longer".
  • CLIP compares itself to MOSAIC on a commodity PC, without the massively parallel Xeon Phi processors MOSAIC was intended to leverage for its graph computation. The MOSAIC paper contains CPU-only tests which still show it is superior to GridGraph; however, it should be understood that CLIP has only been shown to be superior to this CPU-only version of MOSAIC, and that for massive graphs (>256 GB representation size) the full MOSAIC framework may or may not outperform CLIP.
  • Though LUMOS does not explicitly implement a CC algorithm, the original paper proves that it performs label propagation faster than GridGraph. Label Propagation is the basis for most CC algorithms implemented by GPMs, including GridGraph.
  • Similarly, V-Part compares BFS instead of CC against GraFBoost, however these similar "traversal" algorithms are essentially just different versions of the aforementioned label propagation and thus exhibit similar runtimes across GPMs.

Test Graphs

Below are the graph data sets used by some of the most popular GPMs. Most other GPMs evaluate themselves on some subset of these graphs as well.

GraphChi (2012)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| live-journal | 4.8M | 69M |
| netflix | 0.5M | 99M |
| domain | 26M | 0.37B |
| twitter-2010 | 42M | 1.5B |
| uk-2007-05 | 106M | 3.7B |
| uk-union | 133M | 5.4B |
| yahoo-web | 1.4B | 6.6B |

X-Stream (2013)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| netflix | 0.5M | 0.1B |
| Twitter | 41.7M | 1.4B |
| Friendster | 65.6M | 1.8B |
| sk-2005 | 50.6M | 1.9B |
| yahoo-web | 1.4B | 6.6B |

The netflix graph is bipartite.

GridGraph (2015)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| LiveJournal | 4.85M | 69M |
| Twitter | 61.6M | 1.47B |
| UK | 106M | 3.74B |
| Yahoo | 1.41B | 6.64B |

MOSAIC (2016)

| Graph Name | # Vertices | # Edges |
|---|---|---|
| rmat24 | 16.8M | 0.3B |
| twitter | 41.6M | 1.5B |
| rmat27 | 134.2M | 2.1B |
| uk2007-05 | 105.8M | 3.7B |
| hyperlink14 | 1.7246B | 64.4B |
| rmat-trillion | 4.2949B | 1T |

Note: rmat data sets are synthetic graphs.

References

The experimental evidence supporting (almost) each edge in the graph may be found in the original publication of its destination node GPM.

Exceptional edges

  • The edge from DS-GraphChi to X-Stream is supported by the DS-GraphChi paper.
  • The edge from LigraChi-g to GridGraph is supported by the Wonderland paper.

LaTeX Source Code

The above "imporvement tree" is incomplete. The LaTeX source code used to generate the above improvement tree using the tikz package is provided here for inclusion in other documents or expansion of the tree to include more GPMs and experiments.

% Node style
\tikzstyle{textcontainer}=[fill=none, draw=black, shape=rectangle]
 
% Edge style
\tikzstyle{diredge}=[->]


\begin{tikzpicture}
	\begin{pgfonlayer}{nodelayer}
		\node [style=textcontainer] (0) at (0, 0) {GraphChi};
		\node [style=textcontainer] (1) at (24, -12) {TurboGraph};
		\node [style=textcontainer] (2) at (0, -12) {X-Stream};
		\node [style=textcontainer] (4) at (-12, -12) {LigraChi-g};
		\node [style=textcontainer] (5) at (24, -18) {MMap};
		\node [style=textcontainer] (6) at (17.25, -4.25) {BPP};
		\node [style=textcontainer] (7) at (-24, -12) {PrefEdge};
		\node [style=textcontainer] (8) at (0, -24) {GridGraph};
		\node [style=textcontainer] (9) at (-12, -24) {GPSA};
		\node [style=textcontainer] (10) at (-24, -24) {PathGraph};
		\node [style=textcontainer] (11) at (-3, -6) {DS-GraphChi};
		\node [style=textcontainer] (12) at (12, -24) {VENUS};
		\node [style=textcontainer] (13) at (12, -12) {NXGraph};
		\node [style=textcontainer] (14) at (12, -36) {AsyncStripe};
		\node [style=textcontainer] (16) at (0, -36) {MOSAIC};
		\node [style=textcontainer] (17) at (-24, -36) {Graphene};
		\node [style=textcontainer] (18) at (24, -24) {GraFBoost (Soft)};
		\node [style=textcontainer] (19) at (-12, -36) {Wonderland};
		\node [style=textcontainer] (20) at (0, -48) {CLIP};
		\node [style=textcontainer] (21) at (24, -30) {V-Part};
		\node [style=textcontainer] (22) at (24, -36) {LUMOS};
	\end{pgfonlayer}
	\begin{pgfonlayer}{edgelayer}
		\draw [style=diredge] (0) to (1);
		\draw [style=diredge] (0) to (2);
		\draw [style=diredge] (0) to (4);
		\draw [style=diredge] (1) to (5);
		\draw [style=diredge] (0) to (6);
		\draw [style=diredge] (0) to (7);
		\draw [style=diredge] (2) to (8);
		\draw [style=diredge] (2) to (9);
		\draw [style=diredge] (2) to (10);
		\draw [style=diredge] (0) to (11);
		\draw [style=diredge] (11) to (2);
		\draw [style=diredge] (2) to (12);
		\draw [style=diredge, bend right=15] (1) to (13);
		\draw [style=diredge, bend right=15] (13) to (1);
		\draw [style=diredge] (0) to (13);
		\draw [style=diredge] (8) to (14);
		\draw [style=diredge] (8) to (16);
		\draw [style=diredge] (8) to (17);
		\draw [style=diredge] (2) to (18);
		\draw [style=diredge] (4) to (8);
		\draw [style=diredge] (8) to (19);
		\draw [style=diredge] (16) to (20);
		\draw [style=diredge] (18) to (21);
		\draw [style=diredge] (8) to (22);
	\end{pgfonlayer}
\end{tikzpicture}

Flame graphs

Flame Graphs with Full Memory

Open the image in a new tab to click around in the graph (not working right now for some reason... you can click around locally, though).

Standard Experiment

TokuDB

Flame Graphs Website

My Source

Specifically, the commands I used to monitor CPU activity on the system were:
perf record -F 99 -a -g -- [command (e.g. ./experiment or sleep)]
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-kernel.svg

This records activity across the entire system while the command is running, so we can record for a fixed time window, e.g. 60 seconds using sleep.

Multithreading

Super-node level Parallelism

Description

Jobs are placed into a queue by TokuDB and taken out of the queue by GraphWorkers. These workers query Toku for the associated graph updates and then call batch_update with those updates as input. The only locking is around the queue. We attempt to assert that a node is never in the queue more than once (this may not be successful; still investigating, as there are discrepancies between the update counts).
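A simplified sketch of this structure, for reference; the queueing and locking follow the description above, while toku_get_updates and Graph::batch_update are stand-ins for the real calls rather than actual APIs.

```cpp
// Simplified sketch of the super-node-level parallelism described above:
// a single mutex-protected queue of node ids, GraphWorkers popping from it,
// querying Toku, and calling batch_update.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <unordered_set>

std::mutex queue_mutex;                  // the only lock in the design
std::condition_variable queue_cv;
std::queue<uint64_t> work_queue;         // node ids with pending updates
std::unordered_set<uint64_t> queued;     // try to keep each node queued at most once
bool done = false;

void enqueue_node(uint64_t node) {       // called by the TokuDB/insertion side
  std::lock_guard<std::mutex> lk(queue_mutex);
  if (queued.insert(node).second) {      // skip nodes that are already queued
    work_queue.push(node);
    queue_cv.notify_one();
  }
}

void finish() {                          // signal workers to exit once drained
  { std::lock_guard<std::mutex> lk(queue_mutex); done = true; }
  queue_cv.notify_all();
}

void graph_worker(/* Graph& g, DB& toku */) {
  for (;;) {
    uint64_t node;
    {
      std::unique_lock<std::mutex> lk(queue_mutex);
      queue_cv.wait(lk, [] { return done || !work_queue.empty(); });
      if (work_queue.empty()) return;    // done and fully drained
      node = work_queue.front();
      work_queue.pop();
      queued.erase(node);
    }
    // auto updates = toku_get_updates(node);   // query (and delete) its updates
    // g.batch_update(node, updates);           // apply them to the node's sketches
  }
}
```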

Experimental Results

Data Size vs Threads

Following the 0.95 edge-density, 0.5 geometric-parameter scheme. Runtime is given in seconds (s) or minutes (m). The number in brackets is the average update rate in updates per second (k = thousand).

| Nodes | 1000 | 5000 | 10000 | 15000 |
|---|---|---|---|---|
| Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
| Datastruct Size | 63MB | 590MB | 1.5GB | 2.5GB |
| Runtime Standard | 14.2s [67k] | 11.7m [33k] | 57.7m [27k] | 139m [25k] |
| Runtime TokuDB | 18.2s [52k] | | | 128m [27k] |
| Runtime 1 thread | 18.9s [50k] | 8.6m [46k] | 74.2m [21k] | 167m [21k] |
| Runtime 2 threads | 26.6s [30k] | 6.6m [60k] | 48.0m [33k] | 117m [30k] |
| Runtime 4 threads | 114s [8k] | 7.3m [54k] | 21.3m [74k] | 89.0m [39k] |
| Runtime 8 threads | 134s [7k] | | 21.3m [74k] | 72.5m [48k] |
| Runtime 16 threads | 154s [6k] | | | 85.7m [41k] |

Threshold Size vs Threads

15000 nodes, 2.1x10^8 edge updates, with data structure size of 2.5GB

| Threads | tau=5000 | tau=10000 | tau=20000 |
|---|---|---|---|
| 1 | 160m [22k] | 168m [21k] | 183m [19k] |
| 2 | 91m [38k] | 114m [31k] | 137m [25k] |
| 4 | 45m [78k] | 77m [45k] | 123m [28k] |
| 8 | 45m [78k] | 71m [49k] | 141m [25k] |
| 16 | 49m [71k] | 83m [42k] | 113m [31k] |

This is interesting: I expected it to perform better with a larger threshold, not a smaller one. My reasoning was that a larger threshold means each thread can spend longer working in batch_update and relatively less time querying from Toku. However, perhaps this makes sense: a more aggressive threshold keeps the tree generally smaller, so we suffer less from the deletes/queries and their locking issues, since each of these operations has a lower amortized cost on a smaller tree.

Issues

  • When inserting with multiple threads occasionally some updates are processed twice. Not yet sure if this is a toku issue or a GraphWorker issue.
  • In the last half or quarter of the run the insertion threads are very stagnant (less than 50% utilization), and this gets worse as the number of threads increases. This shouldn't be happening unless my code is very wrong or there is something weird happening with Toku. Not sure yet.
  • This workload is bad for TokuDB because we are querying for everything in the tree by the time the process ends, which is not what Toku is optimized for. This is a problem even without multithreading, but perhaps multithreading exacerbates it.

Investigation Via Flamegraphs

Flame graphs generated from a run on 5000 nodes with 8 threads, to determine what was going on with the performance.

Without Multithreading

Full Runtime Analysis for Multithreaded Setup

60 Seconds: GraphWorkers First Start Working

60 Seconds: Near the End, GraphWorkers Only 50% CPU Utilization

What this Tells Us

We can see that the issue with performance, especially with a large number of threads, is that there is a lot of waiting for locks or other events. This is represented by the swapper task in linux and is visible as an increasing portion of the runtime in the above flame-graphs.

Other Ideas

  • Mutexes vs spinlocks vs condition variables. Some online resources say mutex will handle high contention better and will be more performant even on small critical regions.
  • Two threads for insertion so we can move results of queries to a buffer (insertion to sketch is long part) (nah)
  • Sketch level parallelism

Sketch-level parallelism

OpenMP results

Data are presented as OpenMP time as a percentage of single-threaded time (lower is better). Tested on the lab machine with 20 available cores.
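For reference, sketch-level parallelism here amounts to parallelizing the per-bucket work of a sketch update across OpenMP threads. The sketch below is a minimal illustration under that assumption; the Bucket layout and the hash/checksum arithmetic are placeholders, not the project's actual update routine.

```cpp
// Minimal illustration of sketch-level parallelism with OpenMP (compile with
// -fopenmp): the buckets of a sketch are updated in parallel for each batch
// of stream updates.
#include <cstdint>
#include <vector>

struct Bucket { uint64_t a = 0, c = 0; };   // placeholder fields

void apply_batch(std::vector<Bucket>& buckets,
                 const std::vector<uint64_t>& updates) {
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)buckets.size(); ++i) {
    for (uint64_t u : updates) {
      // placeholder "does this update land in bucket i" test, then XOR update
      if (((u * 0x9e3779b97f4a7c15ULL) >> (i % 64)) & 1) {
        buckets[i].a ^= u;
        buckets[i].c ^= u * (uint64_t)(i + 1);
      }
    }
  }
}
```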

| logn | 10e3 | 10e4 | 10e5 | 10e6 | 10e7 |
|---|---|---|---|---|---|
| 2 | 2828 | 48.4 | 49.5 | 71.6 | 69.9 |
| 4 | 29.1 | 28.6 | 61.5 | 41.3 | 44.0 |
| 6 | 21.0 | 17.1 | 48.1 | 26.7 | 26.1 |
| 8 | 43.1 | 22.6 | 38.7 | 26.8 | 22.0 |
| 10 | 18.5 | 68.7 | 30.3 | 22.5 | 19.4 |
| 12 | 8.9 | 26.6 | 30.2 | 21.8 | 21.2 |
| 14 | 36.0 | 29.7 | 25.6 | 20.4 | 18.7 |
| 16 | 25.3 | 19.8 | 25.7 | 18.8 | 17.2 |
| 18 | 21.5 | 19.1 | 23.3 | 16.6 | 15.9 |
| 20 | 33.6 | 29.4 | 20.4 | 15.2 | 14.6 |
| 22 | 17.2 | 32.7 | 24.2 | 22.8 | 22.5 |
| 24 | 8.9 | 34.1 | 25.6 | 22.4 | 21.5 |
| 26 | 17.2 | 38.7 | 24.4 | 21.1 | 20.1 |
| 28 | 45.2 | 38.4 | 24.8 | 19.4 | 18.9 |
| 30 | 43.2 | 40.5 | 22.6 | 18.1 | 17.8 |

OpenMP experiment

| logn | 10e3 | 10e4 | 10e5 | 10e6 | 10e7 |
|---|---|---|---|---|---|
| 2 | 5065 | 48.9 | 60.3 | 75.3 | 56.6 |
| 4 | 138. | 19.2 | 61.9 | 38.2 | 29.1 |
| 6 | 35.8 | 22.0 | 39.1 | 25.8 | 20.9 |
| 8 | 18.1 | 20.4 | 31.8 | 20.2 | 17.5 |
| 10 | 29.7 | 29.8 | 24.8 | 15.9 | 14.6 |
| 12 | 28.0 | 28.9 | 23.6 | 15.1 | 14.4 |
| 14 | 19.7 | 32.6 | 16.4 | 13.5 | 12.5 |
| 16 | 7.2 | 32.3 | 14.8 | 11.8 | 11.4 |
| 18 | 7.7 | 27.0 | 12.7 | 10.6 | 10.2 |
| 20 | 7.8 | 23.1 | 11.5 | 9.4 | 9.2 |
| 22 | 20.2 | 33.5 | 16.8 | 15.0 | 15.1 |
| 24 | 13.3 | 27.8 | 15.3 | 14.1 | 14.1 |
| 26 | 25.7 | 25.0 | 14.1 | 13.0 | 12.8 |
| 28 | 17.9 | 23.3 | 13.3 | 12.3 | 11.9 |
| 30 | 18.0 | 20.7 | 13.6 | 11.5 | 11.2 |

64-bit openMP experiment

spooky k skeletons and haunted spanning forests

Experimental conditions are the same as for the bipartiteness profiling, except that for memory profiling we only have a single trial for each algorithm and each number of vertices instead of 10. Additionally, we only profile total memory consumption and omit requested memory data.

Time Profiling

time graph

| # vertices | connected components | spanning forest | 2-skeleton | 4-skeleton | 8-skeleton |
|---|---|---|---|---|---|
| 1000 | 1170 | 1168 | 2604.3 | 5606.9 | 12006.2 |
| 2000 | 3331.3 | 3331.3 | 7048.6 | 14737.6 | 31157.1 |
| 3000 | 6490.4 | 6488.1 | 13463.5 | 27876.3 | 58593 |
| 4000 | 8767.3 | 8773.9 | 18092.3 | 37301.8 | 78294.4 |
| 5000 | 12849.2 | 12862.6 | 26330.7 | 54198.7 | 113219 |
| 6000 | 16350.1 | 16354.1 | 33432.3 | 68659.8 | 143259 |
| 7000 | 19130.7 | 19138.7 | 39047.7 | 80098.2 | 166926 |
| 8000 | 21890.7 | 21908.1 | 44619.9 | 91580.7 | 190750 |
| 9000 | 28106.6 | 28108.1 | 57135.5 | 117060 | 244311 |
| 10000 | 31273 | 31301 | 63558.1 | 130038 | 271351 |
| 11000 | 34421.5 | 34437.1 | 69893.5 | 142946 | 298450 |
| 12000 | 39307.8 | 39332.4 | 79787.1 | 163276 | 340598 |
| 13000 | 42623.2 | 42627.8 | 86525.5 | 176931 | 369849 |
| 14000 | 45922.5 | 45943.9 | 93248.5 | 190565 | 398690 |
| 15000 | 49227.8 | 49237.4 | 99827.5 | 204221 | 427022 |

Memory Profiling

memory graph

| # vertices | connected components | spanning forest | 2-skeleton | 4-skeleton | 8-skeleton |
|---|---|---|---|---|---|
| 1000 | 61481536.0 | 61505568.0 | 122929768.0 | 245729992.0 | 491330472.0 |
| 2000 | 164001536.0 | 164049568.0 | 328017768.0 | 655858040.0 | 1311538664.0 |
| 3000 | 320170000.0 | 320242032.0 | 640402632.0 | 1280579640.0 | 2560933784.0 |
| 4000 | 426866064.0 | 426962096.0 | 853842904.0 | 1707412360.0 | 3414550952.0 |
| 5000 | 629961616.0 | 630081648.0 | 1260081880.0 | 2519842296.0 | 5039363032.0 |
| 6000 | 815842656.0 | 815986688.0 | 1631892040.0 | 3263414600.0 | 6526459640.0 |
| 7000 | 951802896.0 | 951970928.0 | 1903860360.0 | 3807303288.0 | 7614189448.0 |
| 8000 | 1087763104.0 | 1087955136.0 | 2175828776.0 | 4351192152.0 | 8701918712.0 |
| 9000 | 1426762592.0 | 1426978624.0 | 2853875800.0 | 5707237304.0 | 11413961160.0 |
| 10000 | 1585282592.0 | 1585522624.0 | 3170964024.0 | 6341364152.0 | 12682171496.0 |
| 11000 | 1743801920.0 | 1744065952.0 | 3488052248.0 | 6975494328.0 | 13950380520.0 |
| 12000 | 2042098976.0 | 2042387008.0 | 4084693480.0 | 8168728600.0 | 16336800392.0 |
| 13000 | 2212267120.0 | 2212579152.0 | 4425077448.0 | 8849448792.0 | 17698192376.0 |
| 14000 | 2382435360.0 | 2382771392.0 | 4765461784.0 | 9530169816.0 | 19059585848.0 |
| 15000 | 2552604048.0 | 2552964080.0 | 5105845800.0 | 10210890888.0 | 20420981064.0 |

Unified Host and Network Dataset findings

So far I've looked at a sample of the whole dataset. The whole dataset is very large; I'm not sure exactly how big, but it seems like it could be on the order of terabytes. https://datasets.trovares.com/cyber/LANL/index.html

The sample I've looked at is about 10GB. This is what I've gathered from this sample:

  • There are 115949436 total edges.
  • Found 115600796 duplicate edges.
  • Found 348640 unique edges.
  • There are 20027 nodes.

Results for the whole dataset:

  • There are 17882795024 total edges.
  • Found 17880111216 duplicate edges.
  • Found 2683808 unique edges.
  • There are 41584 nodes.

The original dataset is given as a directed graph where (u,v) means u sends some packets to v, but we can discard the direction of each edge to get an undirected graph where (u,v) means u and v send packets to each other.

A duplicate edge is an edge between (u,v) such that there is at least one other edge in the stream that is also between (u,v); a unique edge is an edge that is not a duplicate.
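A minimal sketch of the counting pass implied by these definitions; the input format (one "u v" pair per line) and 32-bit node ids are assumptions.

```cpp
// Illustrative counting pass for the statistics above: direction is dropped
// by normalizing (u, v) to (min, max); an edge occurrence is "unique" if its
// normalized pair appears exactly once in the stream, and a "duplicate"
// otherwise, matching the definitions given above.
#include <algorithm>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

int main() {
  unsigned long long u, v, total = 0;
  std::unordered_map<unsigned long long, unsigned long long> edge_count;
  std::unordered_set<unsigned long long> nodes;
  while (std::scanf("%llu %llu", &u, &v) == 2) {
    ++total;
    nodes.insert(u);
    nodes.insert(v);
    // drop direction: (u,v) and (v,u) map to the same key (assumes ids < 2^32)
    ++edge_count[(std::min(u, v) << 32) | std::max(u, v)];
  }
  unsigned long long unique = 0;
  for (const auto& kv : edge_count)
    if (kv.second == 1) ++unique;        // pairs that occur exactly once
  std::printf("total edges:     %llu\n", total);
  std::printf("duplicate edges: %llu\n", total - unique);
  std::printf("unique edges:    %llu\n", unique);
  std::printf("nodes:           %zu\n", nodes.size());
}
```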

Clone this wiki locally