OWASP Machine Learning Resources
This page is intended to provide machine learning security resources for security researchers who are new to the field. It includes a few introductory resources for the basics of machine learning as well as examples of machine learning applied to security problems on different platforms. While there are many resources and conference talks available on the web, this page focuses on open-source projects and information that would be of interest to security researchers as a place to start learning. Rather than reinvent existing information, the approach of this page is to maintain context and accuracy by linking to original source materials.
Security Aspects of Machine Learning
Machine Learning as it relates to security can be broken down into three broad categories.
- Adversarial Machine Learning - These techniques feed an ML model deliberately crafted inputs so that it returns a result chosen by the attacker instead of the expected response.
- Machine learning for improved security analysis - Machine learning can be used in security log analysis in order to better detect or prevent attacks. Typical applications include detecting anomalies, predicting values, or classifying data into two or more categories.
- Attacks against machine learning software - These attacks target the software used for machine learning, in the hope of causing denial of service or arbitrary code execution. They typically result in a software patch for the impacted library or service.
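As a minimal sketch of the anomaly-detection use case named in the second category, the following Python flags unusual values in a series of hourly failed-login counts using a median-based modified z-score. The sample data, the function name `robust_anomalies`, and the 3.5 threshold are illustrative assumptions and are not taken from any tool listed on this page.

```python
from statistics import median


def robust_anomalies(counts, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single large outlier does not distort the baseline.
    """
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # No spread at all: anything different from the median is unusual.
        return [i for i, c in enumerate(counts) if c != med]
    return [
        i for i, c in enumerate(counts)
        if 0.6745 * abs(c - med) / mad > threshold
    ]


# Hypothetical failed-login counts per hour, e.g. aggregated from auth logs.
failed_logins_per_hour = [3, 5, 4, 6, 2, 5, 4, 97, 3, 5, 4, 6]
anomalous_hours = robust_anomalies(failed_logins_per_hour)
print(anomalous_hours)  # prints [7]
```

A real deployment would likely use richer features and a trained model, but the shape is the same: establish a baseline, then score each observation against it.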
Getting Started with the Basics of Machine Learning
There are many free videos and interactive demos for machine learning available on the web. These are just a few options:
- Machine Learning Recipes with Josh Gordon - This Google Developer YouTube series focuses on TensorFlow and scikit-learn. The target audience is people who are new to machine learning and are able to read Python. - Josh Gordon, Google Developer YouTube channel, 2017.
Hands On Learning
- TensorFlow Playground - This web page allows you to play with neural networks from your web browser. The goal of the exercises is to design neural networks to match different data sets.
- Azure Binary Classification: Network Intrusion Detection - This lab demonstrates intrusion detection security analysis using Azure's Machine Learning environment. The lab exercise can be performed using a free account. In addition, the Cortana Intelligence Gallery has other types of anomaly detection examples from Microsoft and other contributors available from this site.
- Microsoft Azure Machine Learning: Algorithm Cheat Sheet - "This cheat sheet helps you choose the best Azure Machine Learning Studio algorithm for your predictive analytics solution. Your decision is driven by both the nature of your data and the question you’re trying to answer." Microsoft provides this documentation on how to interpret the cheat sheet: How to choose algorithms for Microsoft Azure Machine Learning. While this reference is specific to Microsoft, many of the concepts are universal.
- Python for data science cheat sheet - A cheat sheet for the scikit-learn machine learning modules in Python.
- Scikit Learn: Choosing the right estimator - "The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data."
ML Security Resources
Consolidated Resource Sites
- MLSec Project - "MLSec Project is a select community of like-minded individuals that want to work together on using machine learning and data science in information security. It features open source projects, blog posts and community content to help further the understanding of how this technology can help defenders handle the growing complexity and verbosity of their environments and tools."
- Awesome Adversarial Machine Learning - A GitHub project described as "A curated list of awesome adversarial machine learning resources", including blogs, papers, and talks.
- Advances in Cloud-Scale Machine Learning for Cyberdefense - "Picking an attacker's signals out of billions of log events in near real time from petabyte scale storage is a daunting task, but Microsoft has been using security data science at cloud scale to successfully disrupt attacker. This session presents the latest frameworks, techniques and the unconventional machine learning algorithms that Microsoft uses to protect its infrastructure and customers." - Mark Russinovich, BlueHat 2017
- Machine Learning and the Cloud: Disrupting Threat Detection and Prevention - "Machine learning with large data sets gives unprecedented insights and anomaly detection capability. Learn how Microsoft uses the agility and scale of the cloud to protect its infrastructure and customers by applying data mining and machine learning algorithms and security domain learnings to the vast amounts of data and telemetry gathered by its many different systems and services." - Mark Russinovich, RSA USA 2016.
- Predictive Security: Using Big Data to Fortify Your Defenses - "This session shows you how to build a predictive analytics stack on AWS, which harnesses the power of Amazon Machine Learning in conjunction with Amazon Elasticsearch Service, AWS CloudTrail, and VPC Flow Logs to perform tasks such as anomaly detection and log analysis. We also demonstrate how you can use AWS Lambda to act on this information in an automated fashion, such as performing updates to AWS WAF and security groups, leading to an improved security posture and alleviating operational burden on your security teams." - Michael Capicotto & Matt Nowina, AWS re:Invent 2016
- Attacking Machine Learning with Adversarial Examples - "In this post we’ll show how adversarial examples work across different mediums, and will discuss why securing systems against them can be difficult." - February 24, 2017
- Weka - "Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes."
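To illustrate the idea behind the adversarial examples entry in the list above, here is a toy sketch against a hand-written linear classifier. Each feature is nudged by a small amount in the direction that lowers the model's score, in the spirit of the fast gradient sign method. The weights, bias, input values, and function names are all hypothetical, and real attacks target far larger models.

```python
def score(weights, bias, x):
    """Linear decision score: positive means the input is flagged."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    return "malicious" if score(weights, bias, x) > 0 else "benign"


def sign(v):
    return (v > 0) - (v < 0)


def adversarial_example(weights, x, epsilon):
    """Shift every feature by at most epsilon so that the score decreases.

    For a linear model the gradient of the score with respect to the input
    is the weight vector, so stepping against sign(w) lowers the score by
    epsilon * sum(|w|).
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical model and input (assumptions for illustration only).
weights = [2.0, -1.0, 0.5]
bias = -1.0
x = [1.0, 0.5, 1.0]

x_adv = adversarial_example(weights, x, epsilon=0.3)
print(classify(weights, bias, x))      # prints malicious
print(classify(weights, bias, x_adv))  # prints benign
```

The perturbed input differs from the original by no more than 0.3 in any feature, yet the classification flips, which is the core property that makes adversarial examples difficult to defend against.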
Languages and Libraries
- Scikit-Learn - Data mining and data analysis in Python that is built on NumPy, SciPy, and matplotlib.
- TensorFlow - TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
- R Project - The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
- Anaconda - "The open source version of Anaconda is a high performance distribution of Python and R and includes over 100 of the most popular Python, R and Scala packages for data science. ... If you don't have time or disk space for the entire distribution, try Miniconda which contains only conda and Python. Then install just the individual packages you want through the conda command."
- Keras - "Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow, CNTK or Theano. It was developed with a focus on enabling fast experimentation."
- UCI ML Data Sets - The University of California, Irvine maintains a collection of machine learning data sets across many fields. Within the collection, there are security-relevant data sets if you search for topics such as spam, phishing, etc.
- HTTP DATASET CSIC 2010 - "The HTTP dataset CSIC 2010 contains thousands of web requests automatically generated. It can be used for the testing of web attack protection systems. It was developed at the “Information Security Institute” of CSIC (Spanish Research National Council)."
- eXpose Deep Neural Network - This is an open-source deep neural network project that, given proper training, attempts to detect malicious URLs, file paths, and registry keys. Data sets can be found in the data/models directory, in the sample_scores.json files.
- KDD Cup 1999: Computer Network Intrusion Detection - The goal of the KDD Cup competition in 1999 was to learn a predictive model (i.e., a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network. This is a link to the large data set used for that competition. The other tabs on the page provide additional context on the data.
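The classification task described for the KDD Cup entry can be sketched with a very small nearest-centroid classifier. This is only an illustration under stated assumptions: the three features, the sample rows, and the labels "normal" and "attack" are synthetic and hypothetical, not rows from the actual competition data, and competitive approaches use many more features and stronger models.

```python
from math import dist


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train(samples):
    """Build one centroid per label from (features, label) pairs."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: dist(model[label], features))


# Synthetic connection records (assumed features):
# [duration in seconds, failed logins, connections to same host in last 2 s]
training_data = [
    ([12.0, 0, 2], "normal"),
    ([30.0, 1, 1], "normal"),
    ([8.0, 0, 3], "normal"),
    ([0.0, 0, 250], "attack"),
    ([1.0, 5, 180], "attack"),
    ([0.0, 4, 300], "attack"),
]

model = train(training_data)
print(predict(model, [0.0, 3, 200]))  # prints attack
print(predict(model, [20.0, 0, 2]))   # prints normal
```

In practice the features would be scaled before computing distances, since a feature with a large numeric range (here the connection count) otherwise dominates the result.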