Malicious URL Detection using Machine Learning

Project Overview

Objective:
Formulate malicious URL detection as a binary classification task: each URL is predicted to be either malicious or benign

Detection Technique:
Extract lexical features (static analysis) from malicious and benign URLs and use them to train a machine learning model that identifies unseen malicious URLs

Data Preprocessing

Total number of URLs: 594,806

Malicious URLs = 158,081
Benign URLs = 436,725

Steps:

Combine malicious and benign URLs into one dataset (.csv) using Pandas, then shuffle it randomly
Tokenize URLs using a custom tokenizer function
Extract features using the TF-IDF vectorizer (scikit-learn feature_extraction.text module)
Convert the URL dataset to a matrix of TF-IDF features
- Number of features extracted = 1,418,106
Split dataset into 2 subsets:
- Training data: 80%
- Test data: 20%
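
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `tokenize_url` function, its delimiter set, the sample URLs, the column names `url` and `label`, and the `random_state` values are all assumptions made for the example. It follows the order described above (vectorize first, then split), which means the IDF weights also see the test URLs.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def tokenize_url(url):
    """Hypothetical tokenizer: split a URL on common delimiters, drop empty tokens."""
    tokens = re.split(r"[/\-.?=&_:]+", url)
    return [t for t in tokens if t]


# Tiny made-up samples standing in for the labeled CSV files (1 = malicious, 0 = benign).
malicious = pd.DataFrame({"url": [
    "http://free-prizes.example/login.php?id=1",
    "http://secure-update.example/verify/account",
    "http://bank-alert.example/confirm.exe",
    "http://win-cash.example/claim?user=42",
    "http://account-check.example/signin.html",
], "label": 1})
benign = pd.DataFrame({"url": [
    "https://docs.example.org/guide/intro",
    "https://news.example.com/world/today",
    "https://shop.example.net/cart",
    "https://blog.example.org/2020/01/post",
    "https://www.example.com/about",
], "label": 0})

# Combine both classes into one dataset and shuffle the rows.
data = pd.concat([malicious, benign], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Convert the URLs to a sparse matrix of TF-IDF features using the custom tokenizer.
vectorizer = TfidfVectorizer(tokenizer=tokenize_url, token_pattern=None)
X = vectorizer.fit_transform(data["url"])
y = data["label"]

# Hold out 20% of the rows as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.shape, X_train.shape, X_test.shape)
```

The number of columns in `X` equals the number of distinct tokens seen during fitting, which is why a dataset of several hundred thousand URLs can yield more than a million features.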

Model Training

Model: Logistic Regression

Create logistic regression classifier
Train model using training data
Test model using test data
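
A minimal sketch of these three steps, assuming scikit-learn's `LogisticRegression` with `max_iter=1000` (the hyperparameters actually used are not stated here). The URLs and labels are made-up placeholders, and the default TF-IDF tokenizer is used only to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up placeholder data (1 = malicious, 0 = benign).
urls = [
    "http://free-prizes.example/login.php?id=1",
    "http://secure-update.example/verify/account",
    "http://bank-alert.example/confirm.exe",
    "http://win-cash.example/claim?user=42",
    "http://account-check.example/signin.html",
    "https://docs.example.org/guide/intro",
    "https://news.example.com/world/today",
    "https://shop.example.net/cart",
    "https://blog.example.org/2020/01/post",
    "https://www.example.com/about",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(urls)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

# Create the classifier, train it on the training split, then score both splits.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # mean accuracy on the training data
test_acc = clf.score(X_test, y_test)     # mean accuracy on the held-out data
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```

With so few samples the scores themselves are meaningless; the point is only the create, fit, score sequence.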

Performance Evaluation

Accuracy

Training: 0.9686
Test: 0.9567

Precision: 0.9666
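
For reference, accuracy and precision can be computed with `sklearn.metrics` as sketched below. The label vectors are invented for illustration, and the example assumes the malicious class is encoded as 1 and treated as the positive class. Accuracy is (TP + TN) / total, and precision is TP / (TP + FP).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Invented labels for illustration (1 = malicious, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 6 / 8 = 0.75
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2 / 3

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f}")
```

Precision measures how many URLs flagged as malicious really are malicious; it says nothing about malicious URLs that were missed, which is what recall would capture.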

Prediction

Additional data

Source: GitHub
Total number of URLs = 420,464
- Malicious URLs = 75,643
- Benign URLs = 344,821

Model Performance

The dataset is preprocessed and fed into the trained model as input.
- Model predicts labels of URLs (malicious/benign)
Prediction results:
- Accuracy: 0.8327
- Precision: 0.9037
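
A sketch of how an additional dataset might be scored with an already trained model. The key detail is that the new URLs are passed through the vectorizer's `transform` (not `fit_transform`), so they are mapped onto the feature columns learned during training. All URLs, labels, and variable names are placeholders assumed for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score

# Placeholder training data (1 = malicious, 0 = benign).
train_urls = [
    "http://free-prizes.example/login.php?id=1",
    "http://secure-update.example/verify/account",
    "http://win-cash.example/claim?user=42",
    "https://docs.example.org/guide/intro",
    "https://news.example.com/world/today",
    "https://www.example.com/about",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Placeholder "additional" data the model has never seen.
new_urls = [
    "http://free-cash.example/claim/login.php",
    "https://docs.example.org/about",
    "https://news.example.com/guide",
]
new_labels = [1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_urls)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Reuse the fitted vocabulary; tokens never seen in training are ignored.
X_new = vectorizer.transform(new_urls)
predicted = clf.predict(X_new)

print("accuracy:", accuracy_score(new_labels, predicted))
print("precision:", precision_score(new_labels, predicted, zero_division=0))
```

Because unseen tokens are dropped at transform time, a new dataset whose URLs share few tokens with the training data gives the model less to work with, which is one possible reason for lower scores on external data.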

On this unseen dataset the model keeps a precision of 0.9037, while accuracy falls to 0.8327, down from 0.9567 on the original test split.
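
One way to put the accuracy on the additional dataset in context is to compare it with a trivial baseline that labels every URL as benign. The short calculation below uses only the class counts listed above; the baseline itself is an illustration added here, not something the project evaluated.

```python
# Class counts of the additional dataset, as listed above.
malicious = 75_643
benign = 344_821
total = malicious + benign

# Share of each class in the dataset.
malicious_share = malicious / total
benign_share = benign / total

# Accuracy of a classifier that always predicts "benign".
baseline_accuracy = benign_share

reported_accuracy = 0.8327
print(f"total URLs: {total}")
print(f"malicious share: {malicious_share:.4f}")
print(f"always-benign baseline accuracy: {baseline_accuracy:.4f}")
print(f"reported model accuracy: {reported_accuracy:.4f}")
```

On imbalanced data, accuracy alone can look strong even for a weak classifier, so precision (and ideally recall) is the more informative figure here.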

Conclusion

Project Outcome:

Designed a practical malicious URL detection algorithm with fast detection speed that distinguishes benign from malicious URLs
Performed static analysis by extracting lexical features from URLs, which are then used to train the machine learning model
Produced a logistic regression classifier with a test accuracy of 0.9567 and a precision of 0.9666

ganshuyi/mal-url-detection
