Evaluating Large Language Models for Arduino Code Generation

Authors

DOI:

https://doi.org/10.14500/aro.12344

Keywords:

Large Language Models, Arduino, Code Generation, Internet of Things, Code Performance

Abstract

Large language models (LLMs), also known as generative AI, have transformed code generation by translating natural language prompts into executable code. Yet, their capabilities in generating code for resource-constrained devices such as Arduino, which are widely used in the Internet of Things and embedded systems, remain underexplored. This study evaluates six state-of-the-art LLMs for generating correct, efficient, and high-quality Arduino code. The evaluation was performed across six dimensions, namely functional correctness, runtime efficiency, memory usage, code quality, similarity to human-written code, and multi-round error correction. The results reveal that ChatGPT-4o achieves the highest zero-shot functional correctness and aligns most closely with human code in readability and similarity. On the other hand, Gemini 2.0 Flash generates faster-executing code but at the cost of higher code complexity and lower similarity. DeepSeek-V3 balances correctness with superior flash memory optimization, whereas Claude 3.5 Sonnet struggles with prompt adherence. Finally, multi-round error correction improves correctness across all six models. Overall, the findings underscore that no single evaluated LLM leads across all evaluation criteria. Hence, model choice must align with project priorities: as shown, ChatGPT-4o excels in functional correctness, Gemini 2.0 Flash in execution time, and DeepSeek-V3 in memory efficiency. This study provides a systematic evaluation of LLM-generated code for Arduino, which, to the best of our knowledge, has not been previously studied across multiple models and performance metrics, thereby establishing a foundation for future research and contributing to the trustworthiness and effectiveness of LLM-generated code.
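The similarity dimension above compares LLM output against human-written reference sketches. As a toy illustration only (not the paper's metric, which relies on richer measures such as CodeBLEU), the following Python sketch scores two hypothetical Arduino blink programs with a simple token-level ratio; both sketch bodies and the helper names are invented for demonstration:

```python
# Crude similarity proxy for LLM-generated vs. human-written Arduino code.
# Illustrative only: the study's similarity dimension uses richer metrics
# (e.g., CodeBLEU); this sketch just compares token sequences with difflib.
import difflib
import re


def tokenize(code: str) -> list[str]:
    # Split C/C++-style source into identifiers, numbers, and single operators.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def similarity(generated: str, reference: str) -> float:
    # Ratio in [0, 1]; 1.0 means the token sequences are identical.
    return difflib.SequenceMatcher(
        None, tokenize(generated), tokenize(reference)
    ).ratio()


# Hypothetical human-written reference sketch (classic blink).
HUMAN_BLINK = """
void setup() { pinMode(LED_BUILTIN, OUTPUT); }
void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  delay(1000);
  digitalWrite(LED_BUILTIN, LOW);
  delay(1000);
}
"""

# Hypothetical LLM output: same structure, different delay constants.
LLM_BLINK = """
void setup() { pinMode(LED_BUILTIN, OUTPUT); }
void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  delay(500);
  digitalWrite(LED_BUILTIN, LOW);
  delay(500);
}
"""

if __name__ == "__main__":
    print(f"similarity: {similarity(LLM_BLINK, HUMAN_BLINK):.3f}")
```

Because the two sketches differ only in their delay constants, the score lands close to 1.0; structurally different code would score far lower.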


Author Biographies

Sardar K. Jabrw, Department of Computer Science, College of Science, University of Duhok, Duhok, Kurdistan Region – F.R. Iraq

Sardar K. Jabrw is an M.Sc. student at the Department of Computer Science, College of Science, Duhok University. He received the B.Sc. degree in Computer Science. His research interests are in Software Engineering, LLMs, and AI/ML.

Qusay I. Sarhan, Department of Computer Science, College of Science, University of Duhok, Duhok, Kurdistan Region – F.R. Iraq

Qusay I. Sarhan is an Assistant Professor at the Department of Computer Science, College of Science, Duhok University. He received the B.Sc. degree in Software Engineering, the M.Tech. degree in Software Engineering, and the Ph.D. degree in Software Engineering. His research interests are in Software Engineering, Internet of Things, and AI/ML.

References

Abdullah, A.A., Mohammed, N.S., Khanzadi, M., Asaad, S.M., Abdul, Z.K., and Maghdid, H.S., 2025. In-depth analysis on machine learning approaches: Techniques, applications, and trends. The Scientific Journal of Koya University, 13(1), pp.190-202.

Beurer-Kellner, L., Vechev, M., and Fischer, M., 2023. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 7, pp.1946-1969.

Bucaioni, A., Ekedahl, H., Helander, V., and Nguyen, P.T., 2024. Programming with ChatGPT: How far can we go? Machine Learning with Applications, 15, p.100526.

Clark, A., Igbokwe, D., Ross, S., and Zibran, M.F., 2024. A Quantitative Analysis of Quality and Consistency in AI-Generated Code. In: Proceedings - 2024 7th International Conference on Software and System Engineering, ICoSSE 2024. Institute of Electrical and Electronics Engineers Inc., pp.37-41.

Coello, C.E.A., Alimam, M.N., and Kouatly, R., 2024. Effectiveness of ChatGPT in coding: A comparative analysis of popular large language models. Digital, 4(1), pp.114-125.

DeLorenzo, M., Gohil, V., and Rajendran, J., 2024. CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation. In: 2024 IEEE LLM Aided Design Workshop (LAD). pp.1-5.

Ebert, C., Cain, J., Antoniol, G., Counsell, S., and Laplante, P., 2016. Cyclomatic complexity. IEEE Software, 33(6), pp.27-29.

Evtikhiev, M., Bogomolov, E., Sokolov, Y., and Bryksin, T., 2023. Out of the BLEU: How should we assess quality of the Code Generation models? Journal of Systems and Software, 203, p.111741.

Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., and Wang, H., 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8), pp.1-79.

Jiang, J., Wang, F., Shen, J., Kim, S. and Kim, S., 2024. A survey on large language models for code generation. arXiv, arXiv:2406.00515. [Last accessed on 2025 Apr 25].

Kim, S.M., Choi, Y., and Suh, J., 2020. Applications of the open-source hardware Arduino platform in the mining industry: A review. Applied Sciences, 10, 5018.

Kok, I., Demirci, O., and Ozdemir, S., 2024. When IoT Meet LLMs: Applications and Challenges. In: 2024 IEEE International Conference on Big Data (BigData). Los Alamitos, CA, USA: IEEE Computer Society. pp.7075-7084.

Koubaa, A., Qureshi, B., Ammar, A., Khan, Z., Boulila, W., and Ghouti, L., 2023. Humans are still better than ChatGPT: Case of the IEEEXtreme competition. Heliyon, 9(11), p.e21624.

Li, J., Li, G., Li, Y., and Jin, Z., 2024. Structured Chain-of-Thought Prompting for Code Generation. ACM Transactions on Software Engineering and Methodology, 34, pp.1-23.

Liu, C., Bao, X., Zhang, H., Zhang, N., Hu, H., Zhang, X., and Yan, M., 2024. Guiding ChatGPT for Better Code Generation: An Empirical Study. In: Proceedings - 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024. Institute of Electrical and Electronics Engineers Inc. pp.102-113.

Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, D., Jiang, D., Tang, D. and Li, G., 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv, arXiv:2102.04664. [Last accessed on 2025 Apr 25].

Miah, T., and Zhu, H., 2024. User Centric Evaluation of Code Generation Tools (Invited Paper). In: 2024 IEEE International Conference on Artificial Intelligence Testing (AITest). Los Alamitos, CA, USA: IEEE Computer Society. pp.109-119.

Mirjalili, S., Abdulla, A.A., Hassan, B.A., and Rashid, T.A., 2025. LLaMAAdapter + MRP: Integrating Meta-Reasoning Prompting with LLaMA-Adapter for Efficient Multi-Modal and Task-Adaptive Reasoning. TechRxiv, June 18.

Moradi Dakhel, A., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M.C., and Jiang, Z.M. (Jack), 2023. GitHub Copilot AI pair programmer: Asset or Liability? The Journal of Systems and Software, 203(C), p.111734.

Nayyar, A., and Puri, V., 2016. A review of Arduino board’s, Lilypad’s & Arduino shields. In: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom). pp.1485-1492.

Nazir, A., and Wang, Z., 2023. A comprehensive survey of ChatGPT: Advancements, applications, prospects, and challenges. Meta-Radiology, 1, p.100022.

Niu, C., Zhang, T., Li, C., Luo, B., and Ng, V., 2024. On Evaluating the Efficiency of Source Code Generated by LLMs. In: Proceedings - 2024 IEEE/ACM 1st International Conference on AI Foundation Models and Software Engineering, FORGE 2024. Association for Computing Machinery, Inc. pp.103-107.

Nuñez-Varela, A.S., Pérez-Gonzalez, H.G., Martínez-Perez, F.E., and Soubervielle-Montalvo, C., 2017. Source code metrics: A systematic mapping study. Journal of Systems and Software, 128, pp.164-197.

Palla, D., and Slaby, A., 2025. Evaluation of generative AI models in python code generation: A comparative study. IEEE Access, 13, pp.65334-65347.

Paul, D.G., Zhu, H., and Bayley, I., 2024. ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation. In: 2024 IEEE International Conference on Artificial Intelligence Testing (AITest). IEEE. pp.55-63.

Petrovic, N., Konicanin, S., and Suljovic, S., 2023. ChatGPT in IoT Systems: Arduino Case Studies. In: 2023 IEEE 33rd International Conference on Microelectronics, MIEL 2023. Institute of Electrical and Electronics Engineers Inc., pp.1-4.

Rai, L., Khatiwada, S., Deng, C., and Liu, F., 2024. Cross-Language Code Development with Generative AI: A Source-to-Source Translation Perspective. In: 2024 IEEE 7th International Conference on Electronic Information and Communication Technology, ICEICT 2024. Institute of Electrical and Electronics Engineers Inc., pp.562-565.

Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., Sundaresan, N., Zhou, M., Blanco, A. and Ma, S., 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv, arXiv:2009.10297. [Last accessed on 2025 Apr 25].

Sharma, T., 2024. LLMs for Code: The Potential, Prospects, and Problems. In: Proceedings - IEEE 21st International Conference on Software Architecture Companion, ICSA-C 2024. Institute of Electrical and Electronics Engineers Inc. pp.373-374.

Shuvo, U.A., Dip, S.A., Vaskar, N.R., and Al Islam, A.B.M.A., 2025. Assessing ChatGPT’s Code Generation Capabilities with Short vs Long Context Programming Problems. In: Proceedings of the 2024 11th International Conference on Networking, Systems and Security, NSysS 2024. Association for Computing Machinery, Inc. pp.32-40.

Su, H., Ai, J., Yu, D., and Zhang, H., 2023. An Evaluation Method for Large Language Models’ Code Generation Capability. In: Proceedings - 2023 10th International Conference on Dependable Systems and Their Applications, DSA 2023. Institute of Electrical and Electronics Engineers Inc. pp.831-838.

Tashtoush, Y., Abu-El-Rub, N., Darwish, O., Al-Eidi, S., Darweesh, D., and Karajeh, O., 2023. A notional understanding of the relationship between code readability and software complexity. Information (Switzerland), 14(2), 81.

Yin, T., 2024. Lizard: A Simple Code Complexity Analyser without Caring about the C/C++ Header Files or Java Imports, Supports Most of the Popular Languages. Available from: https://github.com/terryyin/lizard [Last accessed on 2025 Apr 25].

Yusro, M., Guntoro, N., and Rikawarastuti, R., 2021. Utilization of microcontroller technology using Arduino board for Internet of Things (a systematic review). AIP Conference Proceedings, 2331, p.060004.

Published

2025-01-05

How to Cite

Sardar K. Jabrw and Sarhan, Q. I. (2025) “Evaluating Large Language Models for Arduino Code Generation”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 14(1), pp. 75–85. doi: 10.14500/aro.12344.
Received 2025-06-11
Accepted 2025-11-16
Published 2025-01-05
