FloCon 2017 has ended
Thursday, January 12 • 1:00pm - 1:30pm
Detecting Threats, Not Sandboxes: Characterizing Network Environments to Improve Malware Classification

Applying supervised machine learning to network data features is increasingly common; it is well suited for tasks such as the detection of malicious flows and application identification. In these applications, it is essential to avoid the biases that arise when different training datasets are obtained in different network environments. Unfortunately, it is not straightforward to understand how these environments introduce biases, and many previous studies have not attempted to do so. In this work, we focus on the important case of training data obtained from malware sandboxes and its use in detecting malware communications on enterprise networks. We present techniques to identify data features derived from the TCP/IP, TLS, DNS, and HTTP protocols that are artifacts of network environments, and we show which data features are invariant across those environments.
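One simple way to look for candidate environment artifacts is to compare how often each feature appears in each dataset. The sketch below is only an illustration and not the authors' technique: the flow representation (a dict of HTTP header names to values), the function names, and the `gap` threshold are all assumptions. A large prevalence gap only flags a feature for review; deciding whether it reflects the environment or the application still takes domain knowledge.

```python
from collections import Counter

def header_prevalence(flows):
    """Return the fraction of flows in which each header name appears.

    `flows` is a list of dicts mapping HTTP header names to values
    (a hypothetical representation chosen for this sketch).
    """
    counts = Counter()
    for headers in flows:
        # Header names are case-insensitive, so normalize before counting.
        counts.update({name.lower() for name in headers})
    total = len(flows)
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}

def candidate_artifacts(sandbox_flows, enterprise_flows, gap=0.8):
    """List header names whose prevalence differs by at least `gap`.

    The 0.8 default is an arbitrary illustrative threshold.
    """
    sandbox = header_prevalence(sandbox_flows)
    enterprise = header_prevalence(enterprise_flows)
    names = set(sandbox) | set(enterprise)
    return sorted(
        name for name in names
        if abs(sandbox.get(name, 0.0) - enterprise.get(name, 0.0)) >= gap
    )

# Toy data, invented for illustration only.
sandbox_flows = [
    {"User-Agent": "agent-a", "Host": "a.example"},
    {"User-Agent": "agent-b", "Host": "b.example"},
]
enterprise_flows = [
    {"User-Agent": "agent-a", "Host": "a.example", "Via": "1.1 proxy.example"},
    {"User-Agent": "agent-c", "Host": "c.example", "Via": "1.1 proxy.example"},
]
flagged = candidate_artifacts(sandbox_flows, enterprise_flows)
```

In this toy data only the proxy-added header is flagged, because the other headers appear equally often in both datasets.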

HTTP headers provide a good example. The User-Agent header is often, but not always, invariant across environments. The Via header, on the other hand, indicates that a flow has passed through a proxy, so it reflects the network environment rather than the application's type or intention. In our datasets, nearly 100% of the enterprise HTTP flows contained the Via header, but it was uncommon in the malware sandbox dataset. A naïve application of machine learning would use this fact to achieve low error in cross-validation tests, but it would fail to capture the concept of maliciousness, and its efficacy on real network traffic would suffer.

A similar situation holds for TLS, which contains a complex set of data features. Most Windows sandboxes run Windows XP to maximize the probability that a submitted malware sample executes. TLS flows that rely on the underlying operating system's TLS library would therefore use an outdated version of SChannel. When a malware sample uses SChannel, offering obsolete TLS ciphersuites is not an inherent feature of the malware, but rather a feature of the sandbox environment. Understanding and accounting for these biases is necessary to create machine learning models that accurately distinguish malicious traffic from benign enterprise traffic, rather than simply learning to classify different network environments.

In addition to highlighting these pitfalls, we offer solutions to the problems and demonstrate their results. By understanding the target network environment and creating training datasets composed of synthetic samples, we can systematically avoid a sandbox bias. For example, when monitoring a network with a web proxy enabled and where Windows 10 is the most prevalent operating system, we create synthetic HTTP flows by modifying the existing malware HTTP flows to include the appropriate Via header. Similarly, we modify the TLS ciphersuite offer vector and extensions to resemble the appropriate version of SChannel. Finally, we use the synthetic malware dataset and baseline benign data collected from the enterprise network to create robust machine learning classifiers that can be deployed on the enterprise network.
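The synthetic-sample idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the flow dictionaries, the field names (`http_headers`, `uses_os_tls_library`, `tls_ciphersuites`, `tls_extensions`), and the function names are hypothetical, and the Via value, ciphersuite names, and extension names are placeholders rather than the real offer of any SChannel version.

```python
import copy

def add_via_header(flow, via_value):
    """Return a copy of `flow` with a Via header added to its HTTP headers.

    Flows without HTTP headers, or that already carry a Via header,
    are returned unchanged (as copies).
    """
    synthetic = copy.deepcopy(flow)
    headers = synthetic.get("http_headers")
    if headers is not None and not any(n.lower() == "via" for n in headers):
        headers["Via"] = via_value
    return synthetic

def adapt_tls_offer(flow, ciphersuites, extensions):
    """Return a copy of `flow` whose TLS offer resembles the target OS library.

    Only flows marked as using the operating system's TLS library are
    rewritten; a sample that ships its own TLS stack keeps its offer,
    since that offer does not depend on the sandbox OS.
    """
    synthetic = copy.deepcopy(flow)
    if synthetic.get("uses_os_tls_library"):
        synthetic["tls_ciphersuites"] = list(ciphersuites)
        synthetic["tls_extensions"] = list(extensions)
    return synthetic

def make_synthetic(flows, via_value, ciphersuites, extensions):
    """Apply both adaptations to every sandbox flow."""
    return [
        adapt_tls_offer(add_via_header(f, via_value), ciphersuites, extensions)
        for f in flows
    ]

# Toy sandbox flows and placeholder target-environment values (all invented).
http_flow = {"http_headers": {"User-Agent": "agent-a"}}
os_tls_flow = {"uses_os_tls_library": True,
               "tls_ciphersuites": ["OLD_SUITE"], "tls_extensions": []}
custom_tls_flow = {"uses_os_tls_library": False,
                   "tls_ciphersuites": ["CUSTOM_SUITE"], "tls_extensions": []}

synthetic_flows = make_synthetic(
    [http_flow, os_tls_flow, custom_tls_flow],
    via_value="1.1 proxy.example",
    ciphersuites=["NEW_SUITE_A", "NEW_SUITE_B"],
    extensions=["placeholder_extension"],
)
```

The originals are deep-copied so the sandbox dataset stays intact, which makes it possible to generate a different synthetic set for each target network.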

Blake Anderson

Cisco Systems Inc.
Blake received his PhD from the University of New Mexico. In his dissertation, he developed novel machine learning techniques and applied these techniques to classify, cluster, and find phylogenetic relationships on malware data. Blake spent time performing security research at Los...

David McGrew

Cisco Systems, Inc.
David McGrew is a Fellow in the Advanced Security Research Group at Cisco, where he works to improve network and system security through applied research, standards, and product engineering. His current interests are the detection of threats using network technologies and the development...

Thursday January 12, 2017 1:00pm - 1:30pm PST
Great Room V-VIII 7450 Hazard Center Dr.