TY - GEN
T1 - Detecting Code Vulnerabilities using LLMs
AU - Huynh, Larry
AU - Zhang, Yinghao
AU - Jayasundera, Djimon
AU - Jeon, Woojin
AU - Kim, Hyoungshick
AU - Bi, Tingting
AU - Hong, Jin B.
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Large language models (LLMs) have emerged as a promising tool for detecting code vulnerabilities, potentially offering advantages over traditional rule-based methods. This paper proposes an enhanced framework for vulnerability detection using LLMs, incorporating various prompt engineering strategies to improve performance. We evaluate several techniques, including role-based prompting, zero-shot chain-of-thought, and structured prompting approaches, on the DiverseVul dataset of C/C++ vulnerabilities. Our experiments assess the framework's performance across different code structures, contextual information levels, and LLM capabilities. Our results show that our dynamic prompt engineering technique improves the F1 score by up to 100% with GPT-3.5, a widely used LLM. We also observe that GPT-4o, Gemini 2.0 Flash, and Meta Llama 3.1 generally outperform GPT-3.5, yet all models perform poorly at correctly identifying the type of vulnerability in the code, with a best observed F1 score of only 0.16. However, our follow-up experiments on LLM-based vulnerability correction (i.e., patching) show a 45.77% success rate using GPT-4o, demonstrating promising results in leveraging LLMs to enhance software security and providing insights into optimizing prompt engineering for vulnerability detection tasks.
AB - Large language models (LLMs) have emerged as a promising tool for detecting code vulnerabilities, potentially offering advantages over traditional rule-based methods. This paper proposes an enhanced framework for vulnerability detection using LLMs, incorporating various prompt engineering strategies to improve performance. We evaluate several techniques, including role-based prompting, zero-shot chain-of-thought, and structured prompting approaches, on the DiverseVul dataset of C/C++ vulnerabilities. Our experiments assess the framework's performance across different code structures, contextual information levels, and LLM capabilities. Our results show that our dynamic prompt engineering technique improves the F1 score by up to 100% with GPT-3.5, a widely used LLM. We also observe that GPT-4o, Gemini 2.0 Flash, and Meta Llama 3.1 generally outperform GPT-3.5, yet all models perform poorly at correctly identifying the type of vulnerability in the code, with a best observed F1 score of only 0.16. However, our follow-up experiments on LLM-based vulnerability correction (i.e., patching) show a 45.77% success rate using GPT-4o, demonstrating promising results in leveraging LLMs to enhance software security and providing insights into optimizing prompt engineering for vulnerability detection tasks.
KW - Code Vulnerability
KW - CWE
KW - Large Language Models
KW - Prompt Engineering
UR - https://www.scopus.com/pages/publications/105011816351
U2 - 10.1109/DSN64029.2025.00047
DO - 10.1109/DSN64029.2025.00047
M3 - Conference paper
AN - SCOPUS:105011816351
T3 - Proceedings - 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2025
SP - 401
EP - 414
BT - Proceedings - 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2025
A2 - Cinque, Marcello
A2 - Cotroneo, Domenico
A2 - De Simone, Luigi
A2 - Eckhart, Matthias
A2 - Lee, Patrick P. C.
A2 - Zonouz, Saman
PB - IEEE (Institute of Electrical and Electronics Engineers)
CY - USA
T2 - 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2025
Y2 - 23 June 2025 through 26 June 2025
ER -