Fig. 1: Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth (from Chat-GPT).
Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
Authors: Xinyu Yuan, Yan Qiao*, Meng Li et al.
Paper: http://arxiv.org/abs/2412.03611
This work addresses the challenge of frequency estimation in unending data streams, which is vital across domains such as networking, social media, and finance. Traditional sketch-based approaches, using compact counters and hash functions, are computationally efficient but often sacrifice accuracy. Recent efforts to enhance sketches with deep learning have shown promise in improving performance but face limitations: reliance on labeled data, frequent retraining requirements, and high time and space costs in streaming scenarios.
To overcomethese limitations, the study introduces UCL-sketch (Unsupervised Compressive Learning Sketch), a novel framework combining the strengths of equation-based and learned sketches. Unlike prior approaches, UCL-sketch is ground-truth-free, relying on self-supervised strategy called equivalent learning using only sketch counters for online training. This enables real-time adaptation to distribution shifts in streaming data. The framework also incorporates logical buckets, allowing scalable and efficient handling of large-scale streams by splitting and learning multiple mappings with shared parameters.
Fig. 2: Overview of Our Stream Data Sketching Framework.
This repository contains:
- 🪐 A simple PyTorch implementation of UCL-sketch.
- ⚡️ Pre-processed 13-byte long 5-tuple network packet data slices.
- 💥 A self-contained jupyter notebook for running and evaluating all sketching algorithms: CM-sketch, C-sketch, Ideally Learned CM-sketch, Ideally Learned C-sketch, Univmon, Elastic Sketch, NitroSketch, SeqSketch and our UCL-sketch for sure.
- 🛸 Other useful functions and documents, such as metrics like Weighted Mean Relative Difference (WMRD).
First, download and set up the repo:
git clone https://github.com/Y-debug-sys/UCL-sketch.git
cd UCL-sketch
We provide an environment.yml
file that can be used to create a Conda environment.
conda env create -f environment.yml
conda activate UCL-sketch
We provide a running script for UCL-sketch in main.py
. This script can be used to train UCL-sketch on the provided IP traces, but it can be easily modified to support other streaming datasets: For both Kosarak
and Retail
, they can be downloaded from http://fmi.uantwerpen.be/data. After downloading, extract the .dat format file into data
directory. Then the usage is given by:
python main.py --config_path ./configs/{your_config_name}.yaml --data_path ./data/{your_dataset_name}.dat --ckpt ./checkpoints --data network
Moreover, you can also conduct experiments on synthetic zipfian datasets by running the following code:
python main.py --config_path ./configs/{your_config_name}.yaml --skewness {your_skew_value} --ckpt ./checkpoints --data synthetic
Regarding comparisons with baselines and evaluation (AAE, ARE, WMRD etc.), see our Jupyter demo
for details.
The implementation of baselines in this codebase mainly borrows from a C++ repo called BitSense. We thank the authors for their helpful open-source contributions.
If you use this codebase, or otherwise find our work valuable, please cite UCL-sketch:
@article{yuan2024learning,
title={Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth},
author={Yuan, Xinyu and Qiao, Yan and Li, Meng and Wei, Zhenchun and Feng, Cuiying},
journal={arXiv preprint arXiv:2412.03611},
year={2024}
}