A Study of Large Language Models in Detecting Python Code Violations
DOI: https://doi.org/10.14500/aro.12395
Keywords: Large language models, Software metrics, Software quality, Static code analysis
Abstract
Adhering to good coding practices is critical for enhancing software readability, maintainability, and reliability. Common static code analysis tools for Python, such as Pylint and Flake8, are widely used to enforce code quality by detecting coding violations without executing the code. Yet, they often lack deeper semantic understanding and contextual reasoning. This study investigates the effectiveness of large language models (LLMs) compared to traditional static code analysis tools in detecting Python coding violations. Six state-of-the-art LLMs (ChatGPT, Gemini, Claude Sonnet, DeepSeek, Kimi, and Qwen) are evaluated against Pylint and Flake8 on a curated dataset of 75 Python code snippets annotated with 27 common code violations. In addition, three common prompting strategies (structural, chain-of-thought, and role-based) are used to instruct the selected LLMs. The experimental results reveal that Claude Sonnet achieves the highest F1-score (0.81), outperforming Flake8 (0.79) with strong precision (0.99) and recall (0.69). However, the LLMs differ in performance, with Qwen and DeepSeek underperforming relative to the others. Moreover, the LLMs perform better at identifying documentation and design violations (such as those involving type hints and nested method structures) than at detecting stylistic inconsistencies or violations that require complex semantic reasoning. The results are also heavily influenced by the prompting approach, with structural prompts yielding the most balanced performance in the majority of cases. This research contributes to the empirical work on employing LLMs for code quality assurance and demonstrates their potential role as complementary tools to static code analysis for Python, with methodologies that may extend to other languages.
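To make the setup concrete, the sketch below pairs a hypothetical Python snippet (the file name, function, and violations are illustrative and not drawn from the paper's 75-snippet dataset) with the precision/recall/F1 scoring used to compare detectors; the identifiers in the comments are the standard Flake8/Pylint codes for those issues, and the closing check simply reproduces the figures reported in the abstract.

# example_violations.py -- hypothetical illustration only; this snippet and the
# linter codes below are not taken from the paper's dataset.

import os  # unused import: Flake8 reports F401, Pylint reports W0611


def process(data):  # no docstring (Pylint C0116) and no type hints
    result = []
    for i in range(len(data)):  # Pylint C0200 suggests enumerate instead of range(len(...))
        result.append(data[i] * 2)
    return result


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Score a violation detector against annotated ground truth."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Consistency check on the abstract's reported figures for Claude Sonnet:
# a precision of 0.99 and a recall of 0.69 give F1 = 2*0.99*0.69 / (0.99+0.69) ≈ 0.81.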
License
Copyright (c) 2025 Hekar A. Mohammed Salih, Qusay I. Sarhan

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
- Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License [CC BY-NC-SA 4.0]. This license allows others to share the work with an acknowledgement of the work's authorship and its initial publication in this journal.
- Authors have the freedom to enter into separate agreements for the non-exclusive distribution of the journal's published version of the work. This includes options such as posting it to an institutional repository or publishing it in a book, as long as proper acknowledgement is given to its initial publication in this journal.
- Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process. This practice can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.
Accepted 2025-09-14
Published 2025-10-01