The repository contains the replication package for the paper "Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection".
The workflows of our tool (Auto-RIOLU and Guided-RIOLU) are shown in the following figure:
1. Column Sampling: Sample a subset of data from the column to generate the patterns.
2. Coverage Rate (r_cov) Estimation: Estimate the proportion of records in the column that the generated patterns should cover.
3. Constrained Template Generation: Generate raw templates for each record, constrained by an exact matching rate (r_EM); the template abstraction is illustrated in the sketch after this list.
4. Pattern Generation: Generate pattern constraints for each template according to the coverage rate (r_cov).
5. Pattern Selection: Select patterns based on some heuristics (e.g., their generalizability).
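For intuition about step 3, a raw template abstracts each record into character classes. Below is a minimal sketch of that idea (illustrative only; RIOLU's actual template grammar lives in `pattern_generator.py`):

```python
import re

def raw_template(record: str) -> str:
    """Illustrative sketch: abstract each character into a coarse class
    (digit, lowercase, uppercase), escaping punctuation literally."""
    out = []
    for ch in record:
        if ch.isdigit():
            out.append(r"\d")
        elif ch.islower():
            out.append("[a-z]")
        elif ch.isupper():
            out.append("[A-Z]")
        else:
            out.append(re.escape(ch))
    return "".join(out)

print(raw_template("2021-05-17"))  # \d\d\d\d\-\d\d\-\d\d
```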
- Python >= 3.8
- Pandas == 1.5.3
- Numpy == 1.24.3
- Scikit-learn == 1.3.2
- Matplotlib == 3.7.5
We use the DOMAINS dataset documented in FlashProfile for the data profiling evaluation. The original data is shared through https://github.com/SaswatPadhi/FlashProfileDemo/tree/master/tests. Following FlashProfile's instructions, we removed the extreme-size files (containing <40 or >100k records) and the duplicates. The files for the DOMAINS dataset can be found in the `test_data_profiling` folder.
We use modified versions of the Hospital, Flights, and Movies datasets (https://github.com/visenger/clean-and-dirty-data, https://github.com/BigDaMa/raha/tree/master/datasets) for the data anomaly detection task. The modified versions are available in the `test_anomaly_detection` folder, and the ground truths for these data are in the `ground_truth_anomaly_detection` folder (-1 for null values, 1 for anomalies, 0 for non-anomalies).
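For reference, here is a minimal way to load one of the ground-truth files and split cells by label (the file name below is hypothetical; use any CSV from the folder):

```python
import pandas as pd

# Labels follow the encoding above: -1 = null, 1 = anomaly, 0 = non-anomaly.
gt = pd.read_csv("ground_truth_anomaly_detection/hospital.csv")  # hypothetical name

anomaly_cells = (gt == 1)   # boolean mask of anomalous cells
null_cells = (gt == -1)     # boolean mask of null cells
clean_cells = (gt == 0)     # boolean mask of non-anomalous cells
```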
The procedure for replicating our experiments is as follows:
Run `test_profiling.py` to get the data profiling result produced by RIOLU's pattern generation module. The patterns for each file, the overall precision, and the overall recall will be printed in the output, and a graphic result will be automatically stored in the folder.
An example of the stored graphic result:
The two anomaly detection tools read the datasets in the `test_anomaly_detection` folder. Change the `dataset` variable in the code to specify the desired dataset (valid values in our folder: hosp_1k, hosp_10k, hosp_100k, movies); the code will automatically read the ground truth to fetch the columns to be tested (not all columns contain pattern anomalies).
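For example, assuming `dataset` is a plain string variable near the top of each script (check the script for the exact location):

```python
# Hypothetical: any of "hosp_1k", "hosp_10k", "hosp_100k", "movies"
dataset = "hosp_1k"
```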
Run `Auto-RIOLU.py` to get the result of the unsupervised version of RIOLU; the predicted CSV file will be stored.
Run `Guided-RIOLU.py` to get the result of the supervised version of RIOLU; the predicted CSV file will be stored. We use the cleaned version of each dataset as a support set to estimate the coverage rate (r_cov).
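To score a stored prediction against the ground truth, the following sketch can be used (the prediction file name is an assumption; both files are expected to use the same -1/0/1 encoding, and the -1 null cells are dropped before scoring):

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical file names; adjust to the CSVs the scripts actually write.
pred = pd.read_csv("predicted_hosp_1k.csv")
gt = pd.read_csv("ground_truth_anomaly_detection/hosp_1k.csv")

mask = gt.values.flatten() != -1           # ignore null cells (-1)
y_true = gt.values.flatten()[mask] == 1    # true anomalies
y_pred = pred.values.flatten()[mask] == 1  # predicted anomalies
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```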
Our pattern generator (`pattern_generator.py`) can generate a pattern pool with a water-flow structure. Run the following code to create patterns and get the actual coverage of each pattern (i.e., how frequently each pattern matches across the whole dataset):

```python
from pattern_generator import PatternGenerator

generator = PatternGenerator(sampled_data, coverage_rate)
generator.pattern_coverage_statictics()
pattern_statistics = generator.pattern_coverage
```
Workflow example with a DateTime column:
Our pattern selector (`pattern_selector.py`) can select patterns dynamically using 2-class K-Means clustering. To select patterns, pass in a dictionary mapping patterns to their frequencies (e.g., {"[0-9]+": 0.80980980980981, "[0-9]+[a-z][0-9]+$": 0.135, ...}). The output is a list of selected patterns.

```python
from pattern_selector import PatternSelector

selector = PatternSelector(pattern_coverage)
selector.select_patterns()
pattern_pool = selector.pattern_pool
```
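For intuition about the 2-class K-Means idea, here is a self-contained sketch (not the selector's actual code): cluster the pattern frequencies into two groups and keep the patterns in the higher-frequency cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

pattern_coverage = {
    "[0-9]+": 0.80980980980981,
    "[0-9]+[a-z][0-9]+$": 0.135,
    "[a-z]+": 0.002,  # hypothetical rare (likely anomalous) pattern
}
freqs = np.array(list(pattern_coverage.values())).reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(freqs)

# Keep the cluster whose mean frequency is higher.
keep = max(set(labels), key=lambda c: freqs[labels == c].mean())
selected = [p for p, l in zip(pattern_coverage, labels) if l == keep]
print(selected)
```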
To validate the effectiveness of RIOLU, we also prompted ChatGPT to generate regexes and flag anomalies as a comparison. We used the GPT-3.5 Turbo API with all default settings. The experiments with ChatGPT were run in a single trial because extracting regexes from each response requires manual work and careful validation, which is very time-consuming.
```
Hi ChatGPT, given the following data (in a column), please provide me with regular expressions for data profiling that represent the column:

*** Your Column Goes Here ***
```
The responses and regexes for pattern-based data profiling can be found in the `test_data_profiling` folder.
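For completeness, here is a minimal sketch of how such a prompt can be sent through the API (this uses the current `openai` Python SDK and a hypothetical column; it is not the exact script we used):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

column = "\n".join(["12345", "67890", "1a234"])  # hypothetical column values
prompt = (
    "Hi ChatGPT, given the following data (in a column), please provide me "
    "with regular expressions for data profiling that represent the column:\n"
    + column
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],  # all other settings default
)
print(response.choices[0].message.content)
```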
```
Hi ChatGPT, given the following data (in a column), please accomplish the following two tasks:
1. Provide me with regular expressions that represent the column.
2. Provide me with anomalous cases in the column.

*** Your Column Goes Here ***
```
The regexes for pattern violation detection can be found in the `test_anomaly_detection` folder.
```
├── ablation_study
│   ├── 20_subsampling.py
│   ├── no_selection.py
│   ├── static_rEM_1.py
│   ├── static_rcov_95.py
│   ├── static_selection_threshold.py
│   └── readme.md
├── ground_truth_anomaly_detection (Data ground truths)
├── images
├── java_repo_exploration
│   ├── java_names
│   ├── java_naming_anomalies
│   └── readme.md
├── sensitivity_analysis
│   ├── Auto_RIOLU_alt_inircov.py
│   ├── Auto_RIOLU_alt_nsubset.py
│   └── readme.md
├── test_anomaly_detection
│   ├── chatgpt_sampled (Data sampled for ChatGPT & the extracted regexes)
│   ├── flights
│   ├── hosp_1k
│   ├── hosp_10k
│   ├── hosp_100k
│   ├── movies
│   └── readme.md
├── test_data_profiling
│   ├── hetero
│   ├── homo.simple
│   ├── homo
│   ├── GPT_responses.csv (ChatGPT profiling responses & the extracted regexes)
│   └── readme.md
├── Auto-RIOLU.py (Auto-RIOLU for anomaly detection)
├── Guided-RIOLU.py (Guided-RIOLU for anomaly detection)
├── pattern_generator.py
├── pattern_selector.py
├── pattern_summarizer.py
├── test_profiling.py (RIOLU for data profiling)
├── utils.py
├── LICENSE
└── readme.md
```