CYBER THREAT DETECTION IN OPEN-SOURCE DATA USING SELECTED MACHINE LEARNING ALGORITHMS

AJAYI, MARY ODUNAYO (2022) CYBER THREAT DETECTION IN OPEN-SOURCE DATA USING SELECTED MACHINE LEARNING ALGORITHMS. Masters thesis, Landmark University, Omu Aran, Kwara State.

Abstract

Threat actors continually develop and evolve their tools to quickly identify loopholes and vulnerabilities in devices and in the security of organizations. These malicious actors frequently use open sources to exchange the Tactics, Techniques, and Procedures (TTPs) they use to attack devices. The amount of threat data available on these open sources is so large that cybersecurity professionals find it difficult to utilize and share. Humans can readily distinguish useful and relevant information, but the task becomes daunting when the data is large and time is limited, hence the need to automate the process. This thesis presents a comparative analysis of the performance of four machine learning algorithms (Decision Tree, Logistic Regression, Random Forest and Naïve Bayes) to help cybersecurity professionals decide on the most suitable algorithm for analyzing a cyber threat intelligence dataset. The dataset used in this study is the cyber threat dataset generated by Kim & Kim (2019), which was automatically obtained from reports on freely available platforms and malware repository databases. The dataset is in Extensible Markup Language (XML) format and contains roughly 640,000 records gathered from various security reports produced between January 2008 and June 2019. 70% of the dataset was used for training to construct the machine learning models, and the remaining 30% was used for testing. Experimental results show that the Random Forest algorithm has the best performance with an accuracy of 97.16%, followed by Decision Tree with 97.08% and Naïve Bayes with 93.92%, while the Logistic Regression classifier has the lowest accuracy of the four algorithms at 80.15%. The other evaluation metrics used for the comparative analysis are the precision, recall and F1 score of the algorithms. Precision for Logistic Regression is 72.51%, Naïve Bayes is 75.17%, Decision Tree is 95.49% and Random Forest is 95.6%. Recall for Logistic Regression is 71.11%, Naïve Bayes is 83.13%, Decision Tree is 95.06% and Random Forest is 95.07%. Lastly, the F1 score for Logistic Regression is 67.27%, Naïve Bayes is 78.04%, Decision Tree is 95.11% and Random Forest is 95.13%. Logistic Regression had the lowest scores on all four metrics, which suggests that the algorithm is not well suited to the dataset used in this thesis. Future work can investigate how to improve its performance. Prospective researchers can build on the findings of this work to develop newer and enhanced algorithms that support decision making for cybersecurity experts.
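
For readers who want to see how the evaluation described in the abstract fits together, the following is a minimal sketch, not the thesis's implementation. It shows a 70/30 train/test split and the four reported metrics computed from true and predicted labels. The abstract does not say how precision, recall and F1 were averaged across classes or which tooling was used, so macro averaging and the function names `train_test_split` and `macro_metrics` are assumptions made for illustration only.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of records and return (train, test).

    Hypothetical helper: the 30% test fraction mirrors the split
    described in the abstract; the seed is an arbitrary choice.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Macro averaging is an assumption; the abstract does not state
    which averaging scheme the thesis used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / predicted if predicted else 0.0
        rec = tp[label] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy labels only; these are not data from the thesis.
    train, test = train_test_split(range(10))
    print(len(train), len(test))
    print(macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

In practice, each of the four classifiers would be fitted on the training portion and its predictions on the test portion passed to a metric function like this one, so that all algorithms are compared on the same held-out records.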

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Engineering, Science and Mathematics > School of Electronics and Computer Science
Depositing User: Mr DIGITAL CONTENT CREATOR LMU
Date Deposited: 26 Mar 2025 15:45
Last Modified: 26 Mar 2025 15:45
URI: https://eprints.lmu.edu.ng/id/eprint/5616