A Study of Large Language Models in Detecting Python Code Violations

Authors

Hekar A. Mohammed Salih and Qusay I. Sarhan

DOI:

https://doi.org/10.14500/aro.12395

Keywords:

Large language models, Software metrics, Software quality, Static code analysis

Abstract

Adhering to good coding practices is critical for enhancing software readability, maintainability, and reliability. Common static code analysis tools for Python, such as Pylint and Flake8, are widely used to enforce code quality by detecting coding violations without executing the code. Yet, they often lack deeper semantic understanding and contextual reasoning. This study investigates the effectiveness of large language models (LLMs) compared with traditional static code analysis tools in detecting Python coding violations. Six state-of-the-art LLMs (ChatGPT, Gemini, Claude Sonnet, DeepSeek, Kimi, and Qwen) are evaluated against the Pylint and Flake8 tools. To do so, a curated dataset of 75 Python code snippets, annotated with 27 common code violations, is used. In addition, three common prompting strategies (structural, chain-of-thought, and role-based) are used to instruct the selected LLMs. The experimental results reveal that Claude Sonnet achieved the highest F1-score (0.81), outperforming Flake8 (0.79), with a precision of 0.99 and a recall of 0.69. However, the LLMs differ in performance, with Qwen and DeepSeek underperforming relative to the others. Moreover, the LLMs perform better at identifying documentation and design violations (such as type hints and nested method structures) than at detecting stylistic inconsistencies and violations that require complex semantic reasoning. The results are heavily influenced by the prompting approach, with structural prompts yielding the most balanced performance in the majority of cases. This research contributes to the empirical work on employing LLMs for code quality assurance and demonstrates their potential role as complementary static code analysis tools for Python, with a methodology that may extend to other languages.
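To make the evaluation described above concrete, the following minimal Python sketch (not the authors' published pipeline) illustrates how a static analysis tool such as Flake8 can be run on a single snippet and how precision, recall, and F1-score can be computed by comparing the reported violation codes against an annotated ground truth. The file name snippet.py, the example violation codes, and the helper function names are illustrative assumptions, and Flake8 must be installed for the subprocess call to succeed.

import subprocess

def flake8_codes(path):
    """Run Flake8 on one file and collect the violation codes it reports."""
    result = subprocess.run(
        ["flake8", "--format=%(code)s", path],
        capture_output=True,
        text=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}

def precision_recall_f1(detected, annotated):
    """Score a tool's detections against annotated ground-truth violation codes."""
    true_positives = len(detected & annotated)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    annotated = {"E501", "F401"}           # assumed ground-truth codes for the snippet
    detected = flake8_codes("snippet.py")  # assumed path to the snippet under analysis
    print(precision_recall_f1(detected, annotated))

The same scoring function can be applied to the violation labels returned by an LLM, which is what allows the tools and the models to be compared on equal terms.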

Author Biographies

Hekar A. Mohammed Salih, Department of Computer Science, College of Science, University of Duhok, Duhok, Kurdistan Region – F.R. Iraq

Hekar A. Mohammed Salih is an M.Sc. student at the Department of Computer Science, College of Science, University of Duhok. He holds a B.Sc. degree in Computer Science and Information Technology. His research interests include software engineering, large language models, and AI/ML.

Qusay I. Sarhan, Department of Computer Science, College of Science, University of Duhok, Duhok, Kurdistan Region – F.R. Iraq

Qusay I. Sarhan is an Assistant Professor at the Department of Computer Science, College of Science, University of Duhok. He holds a B.Sc. degree in Software Engineering, an M.Tech. degree in Software Engineering, and a Ph.D. degree in Software Engineering. His research interests include software engineering, the Internet of Things, and AI/ML.

Published

2025-10-01

How to Cite

Mohammed Salih, H. A. and Sarhan, Q. I. (2025) “A Study of Large Language Models in Detecting Python Code Violations”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 13(2), pp. 215–225. doi: 10.14500/aro.12395.
Received 2025-07-01
Accepted 2025-09-14
Published 2025-10-01
