This project is an attempt at using classification algorithms to detect Windows malware in PE files. While most of the code could be applied to any PE-based malware dataset, it is intended to be trained on the Practical Security Analytics - PE Malware Machine Learning Dataset. Be sure to give them some love, as the benign PE samples were a huge help. Should this link die someday, get in touch and I'll gladly provide the dataset.
Aside from a full classification algorithm implementation for dataset analytics, I will be including the ability to train on any labeled set of PE samples and to classify any unlabeled test samples using a wide range of extracted features. This is NOT a malware sandbox: all features are extracted statically.
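To give a feel for what "statically extracted" means, here is a minimal sketch that computes two features from the raw bytes of a file without ever running it. The feature choice (file size and byte entropy) and the names `byte_entropy` and `static_features` are assumptions made for illustration; they are not necessarily the features this project uses.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(path):
    """Read a file from disk and return illustrative static features.

    The file is only read, never executed.
    """
    with open(path, "rb") as f:
        data = f.read()
    return {"size": len(data), "entropy": byte_entropy(data)}
```

High entropy is often associated with packed or encrypted content, which is one reason it is a common static feature, but on its own it says nothing definitive about whether a sample is malicious.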
The pre-processing scripts need to know where the samples are located. This path is currently hard-coded, but it will become a command-line argument as I refine the project. Before we create the training set we need to account for bad samples, meaning samples that are so heavily obfuscated or packed that they are no longer recognizable as PE files. These files can still be run, but they need manual intervention to untangle them enough for libraries like pefile to parse them properly. Files that aren't fixed will be eliminated from the set. For training, note that samples MUST be located in pe-dataset/black/ or pe-dataset/white/, where black is malware and white is safe. This is needed for the database to be populated correctly.
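As a rough sketch of the filtering and labeling described above, the following walks the two folders, labels each file by its folder, and sets aside anything that does not even have the basic PE signatures. The function names and the 1/0 label encoding are hypothetical, and the header check is an assumption standing in for a full parse: it is much weaker than what pefile does, so a file that passes here may still fail to parse later.

```python
import struct
from pathlib import Path


def looks_like_pe(path):
    """Cheap check: 'MZ' DOS header plus 'PE\\0\\0' at the offset stored in e_lfanew."""
    with open(path, "rb") as f:
        dos_header = f.read(64)
        if len(dos_header) < 64 or dos_header[:2] != b"MZ":
            return False
        # e_lfanew (offset of the PE header) is a little-endian uint32 at 0x3C.
        (e_lfanew,) = struct.unpack_from("<I", dos_header, 0x3C)
        f.seek(e_lfanew)
        return f.read(4) == b"PE\0\0"


def collect_samples(root):
    """Return (kept, rejected): kept is a list of (path, label), rejected a list of paths.

    Label encoding is an assumption: black (malware) = 1, white (safe) = 0.
    """
    labels = {"black": 1, "white": 0}
    kept, rejected = [], []
    for folder, label in labels.items():
        directory = Path(root) / folder
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            if not path.is_file():
                continue
            if looks_like_pe(path):
                kept.append((path, label))
            else:
                rejected.append(path)
    return kept, rejected
```

Calling `collect_samples("pe-dataset")` would then give the labeled training candidates plus a list of files that need manual attention or removal.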
Well, very simple: I like a challenge! This project came out of my graduate AIT664 class at George Mason University, where I took a simple data science and machine learning assignment and turned it into an attempt at recreating some of the published work on PE malware detection and classification through various static analysis methods. I'll also include our presentation later this year, should we be permitted to record it. So far this project has been very educational and tons of fun. I'll include steps to easily recreate what we have done, along with places to get malware samples of your own!