Evaluating Feature Selection in Machine Learning Models for Malicious URL Detection

Hanung Febrianto – Poster

Phishing attacks that exploit malicious URLs are rapidly increasing and often bypass traditional blacklist-based filtering methods. To address this challenge, this research introduces a machine learning model enhanced with Pearson correlation-based feature selection to improve both accuracy and computational efficiency.

The study used the PhiUSIIL dataset, which consists of 235,795 URLs with 51 engineered features. Feature selection reduced the feature set to 22 at the best-performing correlation threshold of 0.3. This reduction minimized redundancy without compromising predictive performance. Some models, such as k-nearest neighbors (KNN) and Naïve Bayes, even showed improved accuracy after feature selection.
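As a rough illustration of threshold-based Pearson selection, the sketch below keeps every feature whose absolute correlation with the class label is at least 0.3. This is only one possible reading: the abstract does not say whether the threshold applies to feature-label or feature-feature correlation, so that choice is an assumption. The function names, column names, and toy values are likewise hypothetical and are not taken from PhiUSIIL.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, labels, threshold=0.3):
    """Keep features whose |r| with the label is at least `threshold`.

    Assumption: the threshold is applied to feature-label correlation.
    """
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]

# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "url_length":  [12, 15, 14, 60, 72, 65],  # strongly related to the label
    "has_https":   [1, 1, 0, 0, 0, 1],        # moderately related (r = 1/3)
    "digit_count": [3, 1, 2, 2, 3, 1],        # unrelated (r = 0)
}

selected = select_features(columns, labels)
print(selected)  # ['url_length', 'has_https']
```

On a real dataset the same filter would more likely be written with a dataframe library's correlation routine. The pure-Python version is used here only to keep the sketch self-contained.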

Feature selection also produced substantial computational savings. For example, the Random Forest classifier's training time fell by about 40% (from ~14.5 s to ~8.6 s), and its RAM consumption fell by approximately 30% (from ~1162 MB to ~806 MB). These improvements make the model not only accurate but also practical for real-time use in resource-constrained environments.
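The reported percentages can be checked directly from the before and after figures, and a comparison of this kind can be sketched with the Python standard library. The abstract does not describe how time and memory were measured, so the `measure` helper and the `build_table` stand-in workload below are hypothetical. Note also that `tracemalloc` tracks Python-level allocations only, not total process RAM.

```python
import time
import tracemalloc

def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak

def build_table(n_rows, n_cols):
    """Hypothetical stand-in for a training step whose cost grows with n_cols."""
    return [[float(r * c) for c in range(n_cols)] for r in range(n_rows)]

# Figures reported in the abstract for Random Forest.
time_saving = percent_reduction(14.5, 8.6)   # about 40.7
ram_saving = percent_reduction(1162, 806)    # about 30.6

# Toy comparison: 51 columns versus 22 columns.
_, t_full, mem_full = measure(build_table, 2000, 51)
_, t_sel, mem_sel = measure(build_table, 2000, 22)
print(f"time saving {time_saving:.1f}%, RAM saving {ram_saving:.1f}%")
print(f"toy peak memory: {mem_full} bytes (51 cols) vs {mem_sel} bytes (22 cols)")
```

Timings from a single run are noisy. A more careful comparison would repeat each measurement several times and report the median.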

Overall, this research demonstrates that Pearson correlation-based feature selection provides a scalable, efficient, and reliable solution for real-time phishing URL detection. Future work will focus on integrating this model into operational security systems, such as institutional firewalls, and exploring hybrid approaches with deep learning to handle more complex phishing threats.