A Study of Large Language Models in Detecting Python Code Violations
DOI: https://doi.org/10.14500/aro.12395
Keywords: Large language models, Software metrics, Software quality, Static code analysis
Abstract
Adhering to good coding practices is critical for enhancing software readability, maintainability, and reliability. Common static code analysis tools for Python, such as Pylint and Flake8, are widely used to enforce code quality by detecting coding violations without executing the code. Yet, they often lack deeper semantic understanding and contextual reasoning. This study investigates the effectiveness of large language models (LLMs) compared to traditional static code analysis tools in detecting Python coding violations. Six state-of-the-art LLMs (ChatGPT, Gemini, Claude Sonnet, DeepSeek, Kimi, and Qwen) are evaluated against Pylint and Flake8 on a curated dataset of 75 Python code snippets annotated with 27 common code violations. In addition, three common prompting strategies (structural, chain-of-thought, and role-based) are used to instruct the selected LLMs. The experimental results reveal that Claude Sonnet achieves the highest F1-score (0.81), outperforming Flake8 (0.79) with strong precision (0.99) and recall (0.69). However, the LLMs differ in performance, with Qwen and DeepSeek underperforming relative to the others. Moreover, the LLMs perform better at identifying documentation and design violations (such as those involving type hints and nested method structures) than at detecting stylistic inconsistencies or violations that require complex semantic reasoning. The results are also heavily influenced by the prompting approach, with structural prompts yielding the most balanced performance in the majority of cases. This research contributes to the empirical work on employing LLMs for code quality assurance and demonstrates their potential role as complementary tools to static code analysis for Python, with methodologies that may extend to other languages.
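To make the setup concrete, the sketch below pairs a hypothetical Python snippet (the file name, function, and violations are illustrative and not drawn from the paper's 75-snippet dataset) with the precision/recall/F1 scoring used to compare detectors; the identifiers in the comments are the standard Flake8/Pylint codes for those issues, and the closing check simply reproduces the figures reported in the abstract.

# example_violations.py -- hypothetical illustration only; this snippet and the
# linter codes below are not taken from the paper's dataset.

import os  # unused import: Flake8 reports F401, Pylint reports W0611


def process(data):  # no docstring (Pylint C0116) and no type hints
    result = []
    for i in range(len(data)):  # Pylint C0200 suggests enumerate instead of range(len(...))
        result.append(data[i] * 2)
    return result


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Score a violation detector against annotated ground truth."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Consistency check on the abstract's reported figures for Claude Sonnet:
# a precision of 0.99 and a recall of 0.69 give F1 = 2*0.99*0.69 / (0.99+0.69) ≈ 0.81.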
License
Copyright (c) 2025 Hekar A. Mohammed Salih, Qusay I. Sarhan

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
- Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License [CC BY-NC-SA 4.0]. This license allows others to share the work with an acknowledgement of the work's authorship and its initial publication in this journal.
- Authors have the freedom to enter into separate agreements for the non-exclusive distribution of the journal's published version of the work. This includes options such as posting it to an institutional repository or publishing it in a book, as long as proper acknowledgement is given to its initial publication in this journal.
- Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process. This practice can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.
Accepted 2025-09-14
Published 2025-10-01