This project implements an anomaly detection system using KMeans clustering on the KDD Cup 1999 Network Traffic Dataset. Originally implemented in Jupyter notebooks, the code has been refactored into a modular Python project with separate classes for data processing, clustering, and visualization. The project is designed to detect network intrusions and anomalies in network traffic data by clustering similar data points together and identifying outliers.
## Table of Contents

- Project Overview
- Dataset
- Installation
- Project Structure
- Usage
- Google Cloud Dataproc Setup
- Results
- References
## Project Overview

The primary goal of this project is to use unsupervised learning to detect anomalies in network traffic data that could represent potential intrusions. KMeans clustering was chosen because it groups similar connections together, enabling the detection of outliers (anomalies) without labeled data. The project uses Apache Spark with PySpark to handle large datasets efficiently and leverages Google Cloud Dataproc to scale processing.
## Dataset

We use the KDD Cup 1999 Data Set:
- kddcup.data: Full dataset with approximately 5 million rows and 41 features.
- kddcup.data_10_percent: A reduced version (10%) for local development and testing.
The dataset includes both normal connections and network attacks (e.g., `smurf`, `neptune`, `teardrop`).
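For local experiments, the reduced file can be loaded with PySpark along these lines. This is a minimal sketch, not the project's `data_processing.py`; the path and the renamed columns are assumptions based on the official KDD Cup 1999 feature list.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kdd-load").getOrCreate()

# The file is a headerless CSV; the last column is the connection label
# (e.g. "normal.", "smurf.", "neptune.").
raw = spark.read.csv("data/kddcup.data_10_percent", inferSchema=True, header=False)

# Rename a few well-known columns for readability (positions follow the
# official feature list: duration, protocol_type, service, flag, ...).
raw = (raw.withColumnRenamed("_c0", "duration")
          .withColumnRenamed("_c1", "protocol_type")
          .withColumnRenamed("_c2", "service")
          .withColumnRenamed("_c3", "flag")
          .withColumnRenamed("_c41", "label"))

# Quick sanity check: the label distribution is dominated by a few attack types.
raw.groupBy("label").count().orderBy("count", ascending=False).show(5)
```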
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/kmeans_anomaly_detection.git
  cd kmeans_anomaly_detection
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure the Google Cloud SDK is installed if you plan to use Google Cloud Dataproc.
## Project Structure

```
kmeans_anomaly_detection/
├── data/
│   ├── kddcup.data
│   └── kddcup.data_10_percent
├── original_notebooks/
│   ├── kmeans_cloud.ipynb
│   ├── kmeans_local.ipynb
│   └── kmeans_real.ipynb
├── src/
│   ├── clustering.py        # KMeans clustering implementation
│   ├── data_processing.py   # Data loading, preprocessing, and scaling
│   ├── main.py              # Main script to run end-to-end clustering
│   ├── plots/               # Directory to save generated plots
│   └── visualization.py     # Visualization of clustering results
├── .gitignore
├── LICENSE
└── README.md
```
- data_processing.py: Handles data loading, normalization, scaling, and one-hot encoding for categorical features.
- clustering.py: Implements KMeans clustering and evaluation with silhouette scores.
- visualization.py: Provides visualizations of clustering results, including an elbow-style plot of silhouette scores across multiple values of k.
- main.py: Coordinates data processing, clustering, and visualization, running experiments across a range of k values to identify the optimal cluster count (a condensed pipeline sketch follows this list).
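The sketch below condenses what these modules do into a single PySpark pipeline: one-hot encode the categorical columns, scale the assembled features, fit KMeans, and evaluate the clustering with the silhouette score. It continues from the loading sketch in the Dataset section; the actual class and function names in `src/` may differ.

```python
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler

categorical = ["protocol_type", "service", "flag"]
numeric = [c for c in raw.columns if c not in categorical + ["label"]]

# Index and one-hot encode the categorical features, assemble everything into
# a single vector, and standardize it before clustering.
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx") for c in categorical]
encoder = OneHotEncoder(inputCols=[f"{c}_idx" for c in categorical],
                        outputCols=[f"{c}_vec" for c in categorical])
assembler = VectorAssembler(inputCols=numeric + [f"{c}_vec" for c in categorical],
                            outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
kmeans = KMeans(featuresCol="features", k=56, seed=42)

pipeline = Pipeline(stages=indexers + [encoder, assembler, scaler, kmeans])
model = pipeline.fit(raw)
clustered = model.transform(raw)

silhouette = ClusteringEvaluator(featuresCol="features").evaluate(clustered)
print(f"Silhouette score for k=56: {silhouette:.3f}")
```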
## Usage

In `main.py`, you can adjust parameters such as:
- `k_range`: Range of cluster counts to test when searching for the optimal k via the silhouette score.
- `save_plots`: Set to `True` to save generated plots in the `plots/` directory, or `False` to display them interactively (see the sketch after this list).
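The loop below is a hypothetical illustration of how these options might be used; the real `main.py` may structure this differently. It assumes a DataFrame `scaled_df` that already has a scaled `features` vector column (e.g., the output of the preprocessing steps sketched above).

```python
import matplotlib.pyplot as plt
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

k_range = range(20, 101, 10)   # cluster counts to evaluate (illustrative values)
save_plots = True              # save to plots/ instead of displaying interactively

evaluator = ClusteringEvaluator(featuresCol="features")
scores = []
for k in k_range:
    model_k = KMeans(featuresCol="features", k=k, seed=42).fit(scaled_df)
    scores.append(evaluator.evaluate(model_k.transform(scaled_df)))

# Plot silhouette score against k to pick the best cluster count.
plt.plot(list(k_range), scores, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette score")
if save_plots:
    plt.savefig("src/plots/silhouette_vs_k.png")  # filename is illustrative
else:
    plt.show()
```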
## Google Cloud Dataproc Setup

For large-scale data processing, this project supports running on Google Cloud Dataproc.
- Upload the data to Google Cloud Storage (GCS): use `gsutil` to copy the dataset into a GCS bucket.

  ```bash
  gsutil cp data/kddcup.data gs://your-bucket-name/kddcup.data
  ```
- Submit a Dataproc job: use the following `gcloud` command to submit the job to a Dataproc cluster:

  ```bash
  gcloud dataproc jobs submit pyspark src/main.py --cluster=your-cluster-name --region=your-region
  ```
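When the job runs on Dataproc, Spark can read the uploaded file directly from GCS, so the input path should point at the bucket object rather than a local file. The variable name and path below are illustrative:

```python
# Hypothetical: on Dataproc, load the dataset straight from the GCS bucket
# instead of the local data/ directory.
DATA_PATH = "gs://your-bucket-name/kddcup.data"
raw = spark.read.csv(DATA_PATH, inferSchema=True, header=False)
```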
## Results

Using the elbow method and silhouette scores, we found that an optimal value for k (the number of clusters) on the reduced dataset was around k=56. This configuration allowed the model to differentiate effectively between normal and anomalous connections, achieving a notable improvement in accuracy while reducing computation time.

### Example Plots
- Silhouette Score vs. Number of Clusters (k): Helps identify the optimal k value.
- Cluster Visualizations: Shows the distribution of clustered network connections.
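One common way to turn the fitted model into an anomaly detector is to measure each connection's distance to its assigned cluster center and flag the most distant connections. The sketch below illustrates that idea (it is not necessarily how `clustering.py` scores points) and reuses `model` and `clustered` from the pipeline sketch above.

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

kmeans_model = model.stages[-1]             # last pipeline stage is the fitted KMeansModel
centers = kmeans_model.clusterCenters()     # list of cluster centroids (NumPy arrays)

@F.udf(DoubleType())
def dist_to_center(features, cluster):
    # Euclidean distance from a connection's feature vector to its assigned centroid.
    return float(np.linalg.norm(features.toArray() - centers[int(cluster)]))

scored = clustered.withColumn("dist", dist_to_center("features", "prediction"))

# Flag, say, the 0.1% most distant connections as anomalies and inspect their labels.
threshold = scored.approxQuantile("dist", [0.999], 0.001)[0]
anomalies = scored.filter(F.col("dist") > threshold)
anomalies.groupBy("label").count().show()
```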
## References

This project builds on concepts discussed in the Scientific Data Management course at the University of Vienna, utilizing KMeans++ clustering and Apache Spark. Thanks to the KDD Cup organizers for providing the dataset, and to Google Cloud for facilitating large-scale processing with Dataproc.