tuhh-softsec/LLMSecEval

LLMSecEval: Dataset of NL Prompts for Code Generation

This repository contains a dataset of natural language (NL) prompts covering different security scenarios, which can be used to generate code with Large Language Models (LLMs). The prompts cover 18 of the 2021 Top 25 CWE (Common Weakness Enumeration) scenarios. The primary use of this dataset is to evaluate the security of code generated by different LLMs. It can also be used to explore prompt engineering for generating secure code with LLMs. The dataset is largely language-agnostic, so it can be used to generate code in any programming language.

Some of the information included in the prompts file is shown below:

  • Prompt ID: The unique ID of the prompt
  • CWE: The full name of the CWE.
  • LLM-generated NL Prompt: The NL prompt created by the LLM from the code of the specified CWE scenario.
  • Source Code Filepath: The relative path of the code sample in the dataset from the work of Pearce et al. This is the code from which the NL prompt was generated.
  • Vulnerable: Indicates if the NL prompt was generated from vulnerable code or not. Note: The prompts are cleaned to remove any direct mentions of vulnerabilities.
  • Language: Indicates the programming language of the code from which the prompt was generated.
  • Language-related Metrics: Scores assigned to the NL prompts for language fluency. Scores range from 1 to 5. The meaning of the scores can be found in the work of Hu et al.
    • Naturalness: measures grammatical fluency of the NL prompts.
    • Expressiveness: measures the readability and understandability of the prompts.
  • Content-related Metrics: Scores assigned to the NL prompts for how well they capture the information in the code. Scores range from 1 to 5. The meaning of the scores can be found in the work of Hu et al.
    • Content Adequacy: measures how well the prompts represent the code from which they are generated.
    • Conciseness: measures whether the prompts contain unnecessary or irrelevant information.
  • Manually-fixed Prompts: The manually fixed versions of LLM-generated NL prompts with low scores.
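
The fields above can be combined when selecting prompts for an evaluation run. The following is a minimal sketch only: it assumes the prompts are available as CSV text, and the column headers, the sample rows, and the helper names (`load_prompts`, `effective_prompt`, `min_score`) are all hypothetical, chosen to mirror the field list rather than taken from the dataset. Check the files in the Dataset folder for the actual format and headers.

```python
import csv
import io

# Hypothetical column names modeled on the fields listed above; the actual
# headers in the dataset file may differ. The rows are placeholder data.
SAMPLE_CSV = """Prompt ID,CWE,NL Prompt,Vulnerable,Language,Naturalness,Expressiveness,Content Adequacy,Conciseness,Manually-fixed Prompt
P-1,CWE-A,Placeholder prompt one.,True,Python,5,5,4,5,
P-2,CWE-B,Placeholder prompt two.,False,C,3,4,2,3,Placeholder fixed prompt two.
"""

# The four 1-to-5 quality scores described in the field list.
SCORE_COLUMNS = ("Naturalness", "Expressiveness", "Content Adequacy", "Conciseness")


def load_prompts(text):
    """Parse CSV text into a list of dicts, converting the scores to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in SCORE_COLUMNS:
            row[col] = int(row[col])
        rows.append(row)
    return rows


def effective_prompt(row):
    """Prefer the manually fixed prompt when one exists."""
    return row["Manually-fixed Prompt"] or row["NL Prompt"]


def min_score(row):
    """Return the lowest of the four quality scores for a prompt."""
    return min(row[col] for col in SCORE_COLUMNS)


prompts = load_prompts(SAMPLE_CSV)

# IDs of prompts whose four scores are all at least 4.
well_rated = [p["Prompt ID"] for p in prompts if min_score(p) >= 4]

# Text to send to an LLM for each prompt, using the fixed version if present.
texts = {p["Prompt ID"]: effective_prompt(p) for p in prompts}
```

Falling back to the manually fixed prompt is a design choice of this sketch: it keeps low-scoring prompts usable instead of discarding them.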

More details of the dataset and its usage instructions can be found in the folder: Dataset.

In addition to the dataset, we have created a web application that can be used to generate code from NL prompts using GPT-3 and Codex. It can also generate NL descriptions from code snippets, which makes it possible to extend the NL prompts dataset. The code and usage instructions for this application can be found in the folder: Code Generation.

Finally, an interface for detecting vulnerabilities associated with 18 of the Top 25 CWEs using CodeQL queries is available in the folder: Security Analysis - CodeQL.

This repository is part of our publication at the 20th International Conference on Mining Software Repositories (MSR 2023). If you use LLMSecEval in an academic context, please cite it as follows:

@inproceedings{llmseceval2023,
  title     = {LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations},  
  author    = {Tony, Catherine and Mutas, Markus and Díaz Ferreyra, Nicolas and Scandariato, Riccardo},  
  booktitle = {2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR)},   
  year      = {2023},
  doi       = {10.5281/zenodo.7565965}
}