Data

Data sets of Botnet Traffic (Updated 12 July, 2018)

We collected real botnet traffic and built our own data set. It covers five representative botnets, as shown in Table 1: Mirai, Zeus, Ares, Athena and BlackEnergy. Using Docker, which makes it easy to simulate a large number of machines on a single physical machine, we deployed each of the five botnets in our laboratory and simulated a large number of bots for it. We then captured the network traffic of each botnet for a couple of days, as shown in Table 1. In addition, we captured background traffic from the ISP’s gateways for 10 hours.

As no process other than the botnet client was running on the simulated hosts, each simulated host produced only malicious traffic and no normal traffic. A typical infected host, however, produces both normal and malicious traffic, and the volume of its normal traffic is much larger. To simulate infected hosts, we mixed the malicious traffic with the background traffic: we randomly selected the same number of IPs as simulated bots from the 200,000 IPs in the background traffic. The result is a large data set that contains a lot of background traffic and relatively little botnet traffic. The data is described in Table 1.
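The mixing step can be illustrated with a short Python sketch. It assumes that each bot IP is paired with one randomly chosen background IP and that the bot's packets are rewritten to carry that address; the page does not spell out the exact merging procedure, so the function name, the packet record layout and the IP-rewriting step are all hypothetical. Time alignment between the two captures is not handled here.

```python
import random


def mix_traffic(bot_packets, bot_ips, background_packets, background_ips,
                bot_label=1, seed=0):
    """Merge bot packets into background traffic (illustrative sketch only).

    Packets are dicts with at least 'timestamp', 'ip_src' and 'ip_dst'.
    Assumption: every bot IP is mapped to one distinct, randomly chosen
    background IP, so the chosen host carries both kinds of traffic.
    """
    rng = random.Random(seed)
    # Pick as many background hosts as there are bots, without repetition.
    chosen = rng.sample(sorted(background_ips), len(bot_ips))
    mapping = dict(zip(sorted(bot_ips), chosen))

    # Background packets keep their addresses and get the "normal" label.
    mixed = [dict(p, isbot=0) for p in background_packets]

    # Bot packets are relabeled and their bot-side address is rewritten.
    for p in bot_packets:
        q = dict(p, isbot=bot_label)
        q["ip_src"] = mapping.get(p["ip_src"], p["ip_src"])
        q["ip_dst"] = mapping.get(p["ip_dst"], p["ip_dst"])
        mixed.append(q)

    mixed.sort(key=lambda p: p["timestamp"])
    return mixed, mapping
```

The returned mapping records which background host each bot was merged into, which is what a later labeling step would need.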

We used the Python package Scapy to extract information about every packet in the pcap files and saved it into a MySQL database. The database contains a table named packets, with the following structure:

CREATE TABLE `packets` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `TIMESTAMP` double(16,6) NOT NULL,  -- timestamp of the packet
  `LENGTH` int(11) NOT NULL,          -- length of the packet (bytes)
  `IP_SRC` char(15) NOT NULL,         -- source IP
  `IP_DST` char(15) NOT NULL,         -- destination IP
  `PORT_SRC` int(11) NOT NULL,        -- source port
  `PORT_DST` int(11) NOT NULL,        -- destination port
  `FLAG` char(5) DEFAULT NULL,        -- packet flags such as ACK or SYN
  `ISBOT` int(11) DEFAULT '0',        -- 0 is normal; 1, 2, 3, 4, 5 mark bot traffic
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB;
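As a rough illustration of how a parsed packet could be mapped onto this table, the Python sketch below validates one record and builds the parameter tuple for a parameterized INSERT. PacketRow, INSERT_SQL and to_params are hypothetical names, not part of the published tooling; the %s placeholders follow the style used by common MySQL drivers for Python.

```python
import ipaddress
from dataclasses import astuple, dataclass
from typing import Optional

# Parameterized statement; ID is omitted because it is AUTO_INCREMENT.
INSERT_SQL = (
    "INSERT INTO packets "
    "(`TIMESTAMP`, `LENGTH`, IP_SRC, IP_DST, PORT_SRC, PORT_DST, FLAG, ISBOT) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
)


@dataclass
class PacketRow:
    """One row of the packets table (field order matches INSERT_SQL)."""
    timestamp: float
    length: int
    ip_src: str
    ip_dst: str
    port_src: int
    port_dst: int
    flag: Optional[str] = None
    isbot: int = 0

    def __post_init__(self):
        # char(15) holds a dotted-quad IPv4 address; reject anything else.
        ipaddress.IPv4Address(self.ip_src)
        ipaddress.IPv4Address(self.ip_dst)
        for port in (self.port_src, self.port_dst):
            if not 0 <= port <= 65535:
                raise ValueError(f"port out of range: {port}")
        if not 0 <= self.isbot <= 5:
            raise ValueError(f"ISBOT must be 0..5, got {self.isbot}")


def to_params(row: PacketRow) -> tuple:
    """Return the values in the order expected by INSERT_SQL."""
    return astuple(row)
```

With a database cursor from a MySQL driver, the pair would typically be used as cursor.execute(INSERT_SQL, to_params(row)).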

You can download this dataset (Zipped 7GB)

The file is a MySQL backup (dump) file. You can restore it with this command (the botdata database must already exist):

mysql -u root -p botdata < botdata.sql
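Once the data is loaded, a per-label summary is a natural first query. The Python sketch below runs one against an in-memory SQLite table as a stand-in for the restored MySQL database, so it can be tried without a server. The sample rows are made up; only the column names and the ISBOT convention come from the packets table described on this page.

```python
import sqlite3

# Packet and byte counts per label (0 = normal, 1..5 = bot traffic).
SUMMARY_SQL = """
SELECT ISBOT, COUNT(*) AS n_packets, SUM(`LENGTH`) AS n_bytes
FROM packets
GROUP BY ISBOT
ORDER BY ISBOT
"""


def demo_connection():
    """Build a tiny in-memory stand-in for the packets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packets ("
        "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
        "`TIMESTAMP` REAL NOT NULL, `LENGTH` INTEGER NOT NULL, "
        "IP_SRC TEXT NOT NULL, IP_DST TEXT NOT NULL, "
        "PORT_SRC INTEGER NOT NULL, PORT_DST INTEGER NOT NULL, "
        "FLAG TEXT, ISBOT INTEGER DEFAULT 0)"
    )
    rows = [  # fabricated sample rows, for illustration only
        (1.0, 60, "192.168.1.5", "8.8.8.8", 40000, 53, None, 0),
        (2.0, 1500, "192.168.1.5", "93.184.216.34", 40001, 80, "A", 0),
        (3.0, 74, "192.168.1.5", "10.0.0.9", 40002, 23, "S", 1),
    ]
    conn.executemany(
        "INSERT INTO packets (`TIMESTAMP`, `LENGTH`, IP_SRC, IP_DST, "
        "PORT_SRC, PORT_DST, FLAG, ISBOT) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def summarize(conn):
    """Return (ISBOT, packet count, byte count) tuples, one per label."""
    return conn.execute(SUMMARY_SQL).fetchall()
```

The same SELECT statement should also run unchanged against the MySQL table, since it uses only standard aggregation.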

The data was collected by our research group (main contributors: Yang Wang and Yue Xu).

Data sets of Android apps’ features extracted from APK files (Updated 10 Nov, 2015)

In order to discover the discriminatory and persistent features for automated Android malicious app (malapp) detection at a large scale, we collect very large app sets and extract static features from APK files. The app sets consist of four parts:

  • benign_2014: 166,365 benign apps (labeled with VirusTotal) downloaded from six app markets (AnZhi, AppChina, LenovoMM, MyApp, GFan and NDuoa) from November 2013 through January 2014.
  • malapp_2014: 1,260 samples from the Android Malware Genome Project (AMGP), 3,417 samples downloaded from VirusShare.com, 5,560 samples from the Drebin data set, and 401 samples provided by two antivirus companies. After removing duplicate samples, this data set contains 8,701 malicious apps.
  • benign_2015: 46,891 samples downloaded from the AnZhi market from January to March 2015.
  • malapp_2015: 9,662 malicious apps from VirusShare.
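Merging samples from several sources requires removing duplicates. One common way to do that, sketched below in Python, is to key each APK on a cryptographic hash of its bytes. The page does not say which criterion was actually used, so the SHA-256 choice and the function name are assumptions.

```python
import hashlib


def deduplicate(samples):
    """Keep the first occurrence of each distinct file.

    samples: iterable of (source_name, apk_bytes) pairs.
    Returns a dict mapping SHA-256 hex digest -> first source seen.
    Assumption: two samples are duplicates when their bytes are identical.
    """
    unique = {}
    for source, data in samples:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault leaves the entry untouched if the digest is known.
        unique.setdefault(digest, source)
    return unique
```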

We extract 11 feature sets from our app sets:

  • FS1 Component Names
  • FS2 Requested Permissions
  • FS3 Hardware Features
  • FS4 Filtered Intents
  • FS5 Restricted API calls
  • FS6 Used Permissions
  • FS7 Certification Information
  • FS8 Strings (URL, HTTP address, file path, numbers)
  • FS9 Payload Information
  • FS10 Code Patterns
  • FS11 Suspicious API calls

These features are categorized into platform-defined as well as app-specific features.

The data contains feature names of all the features we used and the vectors for benign_2014, benign_2015, malapp_2014, and malapp_2015 samples:

  • Names of all the features
  • Names of features in each feature set
  • Names of platform-defined features and app-specific features
  • benign_2014_matrix
  • benign_2015_matrix
  • malapp_2014_matrix
  • malapp_2015_matrix

The data set can be downloaded here. (http://infosec.bjtu.edu.cn/wangxing/android/dataset.tar.gz)

We treat malapp detection as a binary classification problem in which each app is represented by a feature vector. We employ four classifiers, namely Logistic Regression (LR), linear Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF), to compare the discriminative power of the different feature sets and the performance of the different classifiers. The source code of our methods is also available; each source file is described in the README file contained in the compressed file.

The source code can be downloaded here. (http://infosec.bjtu.edu.cn/wangxing/android/source.tar.gz)
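To make the binary-classification setup concrete, here is a minimal Python sketch that trains a logistic regression on Boolean feature vectors with plain gradient descent. It is not the released source code: the hand-written trainer, the toy feature vectors and all function names are assumptions chosen only to show the shape of the problem (label 1 for malapp, 0 for benign).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Linear score of one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the mean logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi  # prediction error
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (malapp) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(score(w, b, x)) >= 0.5 else 0


# Toy data: feature 0 appears only in malapps, feature 2 only in benign apps.
X_TOY = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
Y_TOY = [1, 1, 0, 0]
```

Comparing feature sets would then amount to training on the columns of one feature set at a time and comparing the resulting detection performance.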

Permission data sets of Android Applications (Updated June 17, 2014)

The permission vectors were constructed from 310,926 benign apps and 4,868 malapps. The benign apps were obtained from Google Play and have been labeled. Although a great number of malicious app samples have been reported, collecting malapps is still a challenging task for research. Two antivirus companies provided us with two malicious app sets (named Mal_Com1 and Mal_Com2). We also obtained the malicious apps discovered by Zhou et al. and named this set Mal_Zhou. In addition, we downloaded a total of 3,417 malicious apps from VirusShare, a repository of malware samples, and named this set Mal_VS. All the malapps in Mal_VS were confirmed by VirusTotal. After removing duplicate samples, Mal_VS contains 3,207 malapps.

We consider only the permissions provided by the Android system, although an app can also define its own permissions. To analyze the permission usage of apps, we extract the list of Android permissions from the Manifest file of each app. The total number of distinct permissions requested by all the apps (benign and malicious) in our data sets is 135. However, the permissions requested by an app may be over-privileged, since 47 of the 135 permissions (e.g., INSTALL_PACKAGES) are not for use by third-party applications. We therefore remove these 47 permissions, which leaves 88 distinct permissions. Each app can thus be represented by an 88-dimensional Boolean vector, in which 1 denotes that the app requests the permission and 0 that it does not.
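The vector construction can be sketched in Python as follows. The permission names are real Android permissions, but the four-entry ordering is only a stand-in for the 88-entry mapping shipped with the data set, the exclusion set lists just one of the 47 removed permissions, and the function name is hypothetical.

```python
# Stand-in for the real 88-entry permission mapping (order defines positions).
PERMISSION_INDEX = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
    "android.permission.ACCESS_FINE_LOCATION",
]

# Example of a permission that is not for use by third-party applications.
NOT_FOR_THIRD_PARTY = {"android.permission.INSTALL_PACKAGES"}


def permission_vector(requested, index=PERMISSION_INDEX):
    """Return a Boolean (0/1) vector, one position per indexed permission.

    Permissions outside the index, such as app-defined ones, are ignored,
    as are permissions not meant for third-party applications.
    """
    wanted = set(requested) - NOT_FOR_THIRD_PARTY
    return [1 if perm in wanted else 0 for perm in index]
```

For example, an app requesting INTERNET, SEND_SMS, INSTALL_PACKAGES and a custom permission maps to [1, 0, 1, 0] under this toy index.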

The data contains the mapping between permission names and vector positions, and the vectors for benign apps, malicious apps (Zhou), malicious apps (Com1), malicious apps (Com2) and malicious apps (VS).

permission_mapping    
permmission_matrix_benign_google_apps  
permission_matrix_malicious_zhou
permmission_matrix_malicious_com1
permmission_matrix_malicious_com2
permmission_matrix_malicious_VS

Decision_Rules
More information can be found in our paper (please cite):
Wei Wang, Xing Wang, Dawei Feng, Jiqiang Liu, Zhen Han, Xiangliang Zhang: Exploring Permission-Induced Risk in Android Applications for Malicious Application Detection. IEEE Transactions on Information Forensics and Security 9(11): 1869-1882 (2014)

JavaScript Attack data (Updated April 23, 2015)

We collected a large amount of JavaScript attack data for the detection of obfuscated malicious JavaScript code.

JavaScript_Attack_Data

Autonomic IDS code and data

We proposed an autonomic intrusion detection framework and employed the Affinity Propagation clustering algorithm within it to dynamically detect web attacks based on HTTP traffic.

Autonomic_IDS (Matlab code and data)