Skip to content

Official Implementation of "Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth"

License

Notifications You must be signed in to change notification settings

Y-debug-sys/UCL-sketch

Repository files navigation

Unsupervised Compressive Learning Sketch (UCL-sketch)
Official PyTorch Implementation


Fig. 1: Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth (from Chat-GPT).

Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
Authors: Xinyu Yuan, Yan Qiao*, Meng Li et al.
Paper: http://arxiv.org/abs/2412.03611

About

This work addresses the challenge of frequency estimation in unending data streams, which is vital across domains such as networking, social media, and finance. Traditional sketch-based approaches, using compact counters and hash functions, are computationally efficient but often sacrifice accuracy. Recent efforts to enhance sketches with deep learning have shown promise in improving performance but face limitations: reliance on labeled data, frequent retraining requirements, and high time and space costs in streaming scenarios.

To overcomethese limitations, the study introduces UCL-sketch (Unsupervised Compressive Learning Sketch), a novel framework combining the strengths of equation-based and learned sketches. Unlike prior approaches, UCL-sketch is ground-truth-free, relying on self-supervised strategy called equivalent learning using only sketch counters for online training. This enables real-time adaptation to distribution shifts in streaming data. The framework also incorporates logical buckets, allowing scalable and efficient handling of large-scale streams by splitting and learning multiple mappings with shared parameters.


Fig. 2: Overview of Our Stream Data Sketching Framework.

This repository contains:

  • 🪐 A simple PyTorch implementation of UCL-sketch.
  • ⚡️ Pre-processed 13-byte long 5-tuple network packet data slices.
  • 💥 A self-contained jupyter notebook for running and evaluating all sketching algorithms: CM-sketch, C-sketch, Ideally Learned CM-sketch, Ideally Learned C-sketch, Univmon, Elastic Sketch, NitroSketch, SeqSketch and our UCL-sketch for sure.
  • 🛸 Other useful functions and documents, such as metrics like Weighted Mean Relative Difference (WMRD).

Setup

First, download and set up the repo:

git clone https://github.com/Y-debug-sys/UCL-sketch.git
cd UCL-sketch

We provide an environment.yml file that can be used to create a Conda environment.

conda env create -f environment.yml
conda activate UCL-sketch

Running

We provide a running script for UCL-sketch in main.py. This script can be used to train UCL-sketch on the provided IP traces, but it can be easily modified to support other streaming datasets: For both Kosarak and Retail, they can be downloaded from http://fmi.uantwerpen.be/data. After downloading, extract the .dat format file into data directory. Then the usage is given by:

python main.py --config_path ./configs/{your_config_name}.yaml --data_path ./data/{your_dataset_name}.dat --ckpt ./checkpoints --data network

Moreover, you can also conduct experiments on synthetic zipfian datasets by running the following code:

python main.py --config_path ./configs/{your_config_name}.yaml --skewness {your_skew_value} --ckpt ./checkpoints --data synthetic

Regarding comparisons with baselines and evaluation (AAE, ARE, WMRD etc.), see our Jupyter demo for details.

Acknowledgments

The implementation of baselines in this codebase mainly borrows from a C++ repo called BitSense. We thank the authors for their helpful open-source contributions.

Citation

If you use this codebase, or otherwise find our work valuable, please cite UCL-sketch:

@article{yuan2024learning,
  title={Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth},
  author={Yuan, Xinyu and Qiao, Yan and Li, Meng and Wei, Zhenchun and Feng, Cuiying},
  journal={arXiv preprint arXiv:2412.03611},
  year={2024}
}

About

Official Implementation of "Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth"

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published