Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries
The use of machine learning (ML) libraries has increased tremendously in many domains, including autonomous driving systems, medical applications, and critical industries. Vulnerabilities in such libraries could have irreparable consequences. However, the characteristics of software security vulnerabilities in these libraries have not been well studied. In this paper, to bridge this gap, we take the first step towards characterizing and understanding the security vulnerabilities of seven well-known ML libraries: TensorFlow, PyTorch, Scikit-learn, Mlpack, Pandas, Numpy, and Scipy. To do so, we collected 683 security vulnerabilities and explored four major factors: 1) vulnerability types, 2) root causes, 3) symptoms, and 4) fixing patterns of security vulnerabilities in ML libraries. The findings of this study can help developers and researchers understand the characteristics of security vulnerabilities across different ML libraries.
Empirical Study
In this project, we provide the first comprehensive empirical study of the characteristics of software security vulnerabilities in seven ML libraries: TensorFlow, PyTorch, Scikit-learn, Pandas, Scipy, Mlpack, and Numpy. In total, we collected 683 vulnerability-related commits for our analysis. Based on them, we provide multiple findings and guidelines to help developers and contributors of these libraries better understand the software security vulnerabilities specific to this context and increase the reliability of the libraries. We address the following research questions:
- What types of vulnerabilities exist in ML libraries?
- What are the root causes for vulnerabilities in ML libraries?
- What are the symptoms of vulnerabilities in ML libraries?
- What are the fixing patterns for vulnerabilities in ML libraries?
The benchmark data we used for our manual analysis is available here.
In order to extract commits, please use this script.
The steps to reproduce the commits for each ML library are as follows:
- Put fetch_commits.py in any directory on your machine; the commits will be generated in the same directory. Please note that you need Python 3.x to run the script.
- Once you run the script, you will see several parameters. The main ones are the username and repository name of the target ML library, as well as a personal access token.
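For readers who want to see what a commit request with those three parameters might look like, here is a minimal sketch. It is not the contents of fetch_commits.py. It assumes the commits are listed through the public GitHub REST API endpoint `/repos/{owner}/{repo}/commits`, and the function names `build_commits_request` and `fetch_commit_page` are hypothetical names chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_commits_request(owner, repo, token, page=1, per_page=100):
    """Build a request for one page of commits (hypothetical helper).

    owner/repo identify the target ML library repository and token is a
    GitHub personal access token.
    """
    query = urllib.parse.urlencode({"per_page": per_page, "page": page})
    url = f"{API_ROOT}/repos/{owner}/{repo}/commits?{query}"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_commit_page(owner, repo, token, page=1):
    """Download and decode one page of commits (performs a network call)."""
    request = build_commits_request(owner, repo, token, page=page)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_commit_page("tensorflow", "tensorflow", token)` with a valid token should return a list of commit objects; looping over `page` until an empty list comes back would collect the full history.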
Or you can access all 5609 commits for the libraries here:
- TensorFlow
- PyTorch
- scikit-learn
- Pandas
- Mlpack
- Numpy
- Scipy
To reproduce the figures and tables in the paper, please use our Colab notebook.
To generate Figure 6 (the parallel plot), we used the Python library plotly, and you can use the same scripts to regenerate the figure. Put the scripts anywhere in your home directory and run them (please note that you need Python 3.x). The figures will be generated as web pages at your localhost address.
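As a rough illustration of how a parallel plot over the four studied factors could be built with plotly, consider the sketch below. It is not the script used for Figure 6. It assumes a parallel-categories trace (`plotly.graph_objects.Parcats`) is an acceptable stand-in for the paper's plot, the helper names are hypothetical, and the records are placeholder values, not data from the study.

```python
# The four factors analysed in the study, used here as plot dimensions.
FACTORS = ("type", "root_cause", "symptom", "fix_pattern")

# Placeholder rows only; these are NOT findings from the paper.
PLACEHOLDER_RECORDS = [
    {"type": "type-1", "root_cause": "cause-1", "symptom": "symptom-1", "fix_pattern": "fix-1"},
    {"type": "type-1", "root_cause": "cause-2", "symptom": "symptom-1", "fix_pattern": "fix-2"},
    {"type": "type-2", "root_cause": "cause-2", "symptom": "symptom-2", "fix_pattern": "fix-2"},
]


def build_dimensions(records, factors=FACTORS):
    """Turn per-vulnerability records into one dimension per factor."""
    return [
        {"label": factor, "values": [record[factor] for record in records]}
        for factor in factors
    ]


def make_parallel_figure(records):
    """Build a parallel-categories figure (requires plotly to be installed)."""
    import plotly.graph_objects as go  # imported lazily; third-party dependency

    return go.Figure(go.Parcats(dimensions=build_dimensions(records)))
```

Calling `make_parallel_figure(PLACEHOLDER_RECORDS).show()` opens the figure in a browser tab, which matches the web-page output described above.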