Internet Security: Using ML to Identify Phishing

Lead Researcher (Machine Learning)

Published research conducted at the Faculty of Engineering, University of Western Ontario. Evaluated the efficacy of Logistic Regression and Multinomial Naive Bayes in identifying malicious URLs, achieving 96.6% accuracy on a dataset of 500k+ entries.

96.6% Accuracy

507k URLs Analyzed

NLTK/Regex Feature Extraction

The Challenge

Phishing attacks use social engineering in emails and links to steal sensitive information such as passwords and financial details. Recognizing a phishing site by eye requires close attention to subtle URL patterns. The objective was to build a machine learning program that learns the lexical features phishing URLs have in common and filters those URLs out with accuracy high enough for practical use.

My Approach

Developed a comparative framework using NLTK's RegexpTokenizer and Snowball Stemmer for linguistic feature extraction. I designed a custom preprocessing pipeline, run locally, to vectorize 507,195 URLs from a Kaggle dataset, then compared Logistic Regression against Multinomial Naive Bayes (MNB) to find the more robust classifier.

Core Technical Accomplishments

Linguistic Preprocessing (NLTK)

Implemented RegexpTokenizer with the pattern r'[A-Za-z]+' to isolate the alphabetic components of each URL, then applied the English Snowball Stemmer to reduce each token to its root form, creating a rich feature set for the classifiers.
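A minimal sketch of this preprocessing step, assuming NLTK is installed. The sample URL and the helper name url_to_stems are hypothetical and used only for illustration; the pattern and the stemmer language are the ones named above.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

# Keep only runs of ASCII letters; digits, dots, slashes and hyphens act as separators.
tokenizer = RegexpTokenizer(r"[A-Za-z]+")
stemmer = SnowballStemmer("english")


def url_to_stems(url: str) -> list[str]:
    """Split a URL into alphabetic tokens and reduce each token to its stem."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(url)]


# Hypothetical URL (not taken from the dataset).
sample_url = "secure-login.paypa1.example.com/account/verify"

tokens = tokenizer.tokenize(sample_url)
stems = url_to_stems(sample_url)

print(tokens)           # raw alphabetic tokens
print(" ".join(stems))  # space-joined stems, ready for a bag-of-words vectorizer
```

Because the pattern discards digits, a lookalike such as "paypa1" is tokenized as "paypa", so character-substitution tricks still leave a distinctive token for the classifier to learn from.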

Comparative Model Analysis

Benchmarked Logistic Regression against Multinomial Naive Bayes. Logistic Regression outperformed MNB, correctly identifying 36,065 phishing sites and 97,148 safe sites, which suggests it captured more nuanced patterns in the URL structure.
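The comparison can be sketched roughly as follows, assuming scikit-learn and a bag-of-words representation. The twelve URLs are hand-written placeholders rather than dataset entries, and the split ratio, random seed, and max_iter value are assumptions, so the scores this prints say nothing about the reported results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy, hand-written URLs (hypothetical); the study used a Kaggle dataset instead.
urls = [
    "example.com/about/team",
    "docs.example.org/guide/install",
    "news.example.com/sports/scores",
    "shop.example.com/cart/checkout",
    "blog.example.org/python/tutorial",
    "university.example.edu/engineering/research",
    "secure-login.example.net/verify/account/update",
    "paypa1-confirm.example.net/signin/verify",
    "account-update.example.biz/login/confirm/password",
    "free-prize.example.biz/claim/verify/login",
    "bank-alert.example.net/secure/signin/confirm",
    "webmail-verify.example.biz/password/reset/login",
]
labels = ["good"] * 6 + ["bad"] * 6

# Split first, then fit the vectorizer on the training URLs only.
train_urls, test_urls, y_train, y_test = train_test_split(
    urls, labels, test_size=0.25, random_state=42, stratify=labels
)
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z]+")
X_train = vectorizer.fit_transform(train_urls)
X_test = vectorizer.transform(test_urls)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

# Train both classifiers on identical features and score them on the same held-out URLs.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Feeding both models the same vectorized split keeps the comparison fair: any difference in score comes from the classifier rather than from the features.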

Production Pipelining

Migrated the winning Logistic Regression model into a Scikit-Learn pipeline. This allowed cross-validation across different parameters and streamlined the path from raw URL input to the final 'good' or 'bad' classification.
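One way such a pipeline might look, assuming scikit-learn's make_pipeline and GridSearchCV. The toy URLs, the grid of C values, and the three-fold split are assumptions for illustration; the original parameters are not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy, hand-written URLs (hypothetical), six per class.
urls = [
    "example.com/about/team",
    "docs.example.org/guide/install",
    "news.example.com/sports/scores",
    "shop.example.com/cart/checkout",
    "blog.example.org/python/tutorial",
    "university.example.edu/engineering/research",
    "secure-login.example.net/verify/account/update",
    "paypa1-confirm.example.net/signin/verify",
    "account-update.example.biz/login/confirm/password",
    "free-prize.example.biz/claim/verify/login",
    "bank-alert.example.net/secure/signin/confirm",
    "webmail-verify.example.biz/password/reset/login",
]
labels = ["good"] * 6 + ["bad"] * 6

# Vectorizer and classifier chained together, so raw URL strings go in
# and 'good'/'bad' labels come out.
pipeline = make_pipeline(
    CountVectorizer(token_pattern=r"[A-Za-z]+"),
    LogisticRegression(max_iter=1000),
)

# Cross-validate over a small, assumed grid of regularization strengths.
# make_pipeline names each step after its lowercased class name.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(urls, labels)

# The refitted best pipeline classifies a raw, unseen URL directly.
prediction = search.best_estimator_.predict(
    ["paypal-verify.example.net/login/confirm"]
)[0]
print(search.best_params_, prediction)
```

Wrapping the vectorizer inside the pipeline means each cross-validation fold fits its vocabulary on that fold's training URLs only, which avoids leaking information from the held-out fold.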

Statistical Validation & Confusion Matrices

Visualized model performance using heatmapped confusion matrices. Analyzed the 3.8% error rate (misclassified sites) to understand the limits of ML in high-volume security environments (e.g., 10M+ sites), where even a small error rate translates into a large absolute number of misclassified sites.
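A small sketch of the confusion-matrix arithmetic, assuming scikit-learn. The eight label pairs are made up for illustration, and projected_errors is a hypothetical helper that scales an error rate to a given volume.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (hypothetical, not study data).
y_true = ["bad", "bad", "bad", "good", "good", "good", "good", "good"]
y_pred = ["bad", "bad", "good", "good", "good", "good", "good", "bad"]

# Rows are true classes, columns are predicted classes, in this label order.
label_order = ["bad", "good"]
cm = confusion_matrix(y_true, y_pred, labels=label_order)

correct = int(cm.trace())  # diagonal entries: correctly classified sites
total = int(cm.sum())
error_rate = 1 - correct / total


def projected_errors(rate: float, n_sites: int) -> int:
    """Expected number of misclassified sites at a given error rate and volume."""
    return round(rate * n_sites)


print(cm)
print(f"toy error rate: {error_rate:.3f}")
# Scaling the 3.8% error rate quoted above to 10 million sites.
print(projected_errors(0.038, 10_000_000))
```

The resulting array can be passed to a plotting call such as seaborn.heatmap(cm, annot=True, fmt="d") to draw the heatmap; the plotting step is left out here to keep the sketch dependency-light.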

Results & Impact

  • Achieved a peak testing accuracy of 96.6% using the pipelined Logistic Regression model
  • Demonstrated that phishing websites follow a common, identifiable lexical pattern across a 500,000+ entry dataset
  • Published research findings through the University of Western Ontario Faculty of Engineering
  • Successfully identified 35,988 phishing sites with high precision, outperforming standard MNB baselines
  • Provided a blueprint for a browser-integrated filtration system to protect users from high-volume phishing attempts

Tech Stack

Python, Scikit-Learn, NLTK, Logistic Regression, Naive Bayes, Linguistic Analysis, Pandas