AraFashion

A Novel Fashion Captioning Dataset Leveraging Attention-Based EfficientNet and xLSTM

Authors

S.A. Ahmed and A.T. Abdulameer

DOI:

https://doi.org/10.14500/aro.12335

Keywords:

Arabic Image Captioning, AraFashion, Dataset, EfficientNetB4, xLSTM

Abstract

The importance of models that can generate precise textual descriptions of images has become apparent, particularly in specialized domains such as fashion. In contrast to the wealth of datasets and studies available for English, Arabic suffers from a severe shortage of publicly available resources, particularly fashion image datasets. This shortage restricts the development of Arabic language models and impedes scholarly research in the area. Our study seeks to close this gap by building a hybrid model that automatically generates Arabic descriptions of fashion images. The model is based on the EfficientNet-B4 architecture, incorporates an attention mechanism to extract visual features, and, for the first time in this field, couples this encoder with an xLSTM module for text generation. The study also produced a new Arabic-captioned dataset, AraFashion; its Arabic descriptions were translated into English using Google Translate. Using real Arabic data improves the model's accuracy and realism, as reflected in its top BLEU-1 score of 0.7335 for Arabic descriptions. The study recommends expanding Arabic datasets in the fashion domain and highlights the need to support the Arabic language in AI technologies.
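
To make the described pipeline concrete, the minimal sketch below (Python, TensorFlow/Keras) wires a frozen EfficientNet-B4 encoder to an additive-attention bridge and a recurrent caption decoder. It is an illustrative approximation rather than the authors' implementation: a standard LSTM stands in for the xLSTM module (which has no built-in Keras layer), and the vocabulary size, caption length, and layer widths are assumed values.

import tensorflow as tf

VOCAB_SIZE = 20_000    # assumed Arabic caption vocabulary size (illustrative)
MAX_LEN = 30           # assumed maximum caption length (illustrative)
EMBED_DIM = 256        # assumed word-embedding width
DECODER_DIM = 512      # assumed recurrent/attention width

# Encoder: frozen EfficientNet-B4 backbone; for a 380x380 input it yields a
# 12x12 grid of 1792-dimensional region features.
cnn = tf.keras.applications.EfficientNetB4(include_top=False, weights="imagenet")
cnn.trainable = False

# Raw RGB pixels in [0, 255]; preprocessing is built into Keras EfficientNet.
image_in = tf.keras.Input(shape=(380, 380, 3))               # B4's native resolution
feature_map = cnn(image_in)                                   # (batch, 12, 12, 1792)
regions = tf.keras.layers.Reshape((-1, 1792))(feature_map)    # (batch, 144, 1792)
regions = tf.keras.layers.Dense(DECODER_DIM)(regions)         # project to decoder width

# Decoder: embed the caption prefix and run it through a recurrent layer.
# A standard LSTM stands in here for the paper's xLSTM text module.
caption_in = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
states = tf.keras.layers.LSTM(DECODER_DIM, return_sequences=True)(embedded)

# Additive (Bahdanau-style) attention: each decoder step attends over image regions.
context = tf.keras.layers.AdditiveAttention()([states, regions])
merged = tf.keras.layers.Concatenate()([states, context])
logits = tf.keras.layers.Dense(VOCAB_SIZE)(merged)            # next-token scores

model = tf.keras.Model(inputs=[image_in, caption_in], outputs=logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.summary()

For reference, the quoted BLEU-1 corresponds to unigram BLEU, which can be computed with NLTK's nltk.translate.bleu_score.corpus_bleu using weights=(1, 0, 0, 0) against the reference Arabic captions.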

References

Al-Malki, R.S., and Al-Aama, A.Y., 2023. Arabic captioning for images of clothing using deep learning. Sensors, 23(8), p.3783.

Al-Malla, M.A., Jafar, A., and Ghneim, N., 2022. Pre-trained CNNs as feature extraction modules for image captioning: An experimental study. ELCVIA Electronic Letters on Computer Vision and Image Analysis, 21(1), pp.1-16.

Anderson, P., Fernando, B., Johnson, M., and Gould, S., 2016. SPICE: Semantic Propositional Image Caption Evaluation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, Berlin.

Banerjee, S., and Lavie, A., 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

Beck, M., Pöppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M., Klambauer, G., Brandstetter, J., and Hochreiter, S., 2024. xLSTM: Extended Long Short-Term Memory. [arXiv Preprint].

Cai, C., Yap, K.H., and Wang, S., 2025. Toward attribute-controlled fashion image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 20, p.280.

Ibrahim, H.S., Shati, N.M., and Alsewari, A.A., 2024. A transfer learning approach for Arabic image captions. Al-Mustansiriyah Journal of Science, 35, pp.81-90.

Lasheen, M.T., and Barakat, N.H., 2022. Arabic image captioning: The effect of text pre-processing on the attention weights and the BLEU-N scores. International Journal of Advanced Computer Science and Applications, 13, pp.413-423.

Lin, C.Y., 2004. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. Association for Computational Linguistics, Pennsylvania.

Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X., 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Moratelli, N., Barraco, M., Morelli, D., Cornia, M., Baraldi, L., and Cucchiara, R., 2023. Fashion-oriented image captioning with external knowledge retrieval and fully attentive gates. Sensors (Basel), 23(3), p.1286.

Pan, Y., Yao, T., Li, Y., and Mei, T., 2020. X-Linear Attention Networks for Image Captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.J., 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Rawate, S., Vayadande, K., Chaudhary, S., Manmode, S., Suryavanshi, R., and Chanda, K., 2022. Fashion Classification Model. In: Techno-Societal 2016: International Conference on Advanced Technologies for Societal Applications. Springer.

Rostamzadeh, N., Hosseini, S., Boquet, T., Stokowiec, W., Zhang, Y., Jauvin, C., and Pal, C., 2018. Fashion-Gen: The Generative Fashion Dataset and Challenge. [arXiv Preprint].

Ruan, T., and Zhang, S., 2024. Towards Understanding How Attention Mechanism Works in Deep Learning [arXiv Preprint].

Sabri, S.M., 2021. Arabic Image Captioning Using Deep Learning with Attention. University of Georgia, Georgia.

Sameer, M., Talib, A., Hussein, A., and Husni, H., 2023. Arabic speech recognition based on encoder-decoder architecture of transformer. Journal of Techniques, 5, pp.176-183.

Shams, 2025. AraFashion: A New Dataset for Fashion Caption. Kaggle, San Francisco.

Tan, M., and Le, Q., 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: International Conference on Machine Learning. PMLR.

Vedantam, R., Lawrence Zitnick, C., and Parikh, D., 2015. CIDEr: Consensus-Based Image Description Evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Xiao, H., Rasul, K., and Vollgraf, R., 2017. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. [arXiv Preprint].

Yang, X., Zhang, H., Jin, D., Liu, Y., Wu, C.H., Tan, J., Xie, D., Wang, J., and Wang, X., 2020. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIII 16. Springer.

Published

2026-03-15

How to Cite

Ahmed, S. A. and Abdulameer, A. T. (2026) “AraFashion: A Novel Fashion Captioning Dataset Leveraging Attention-Based EfficientNet and xLSTM”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 14(1), pp. 100–106. doi: 10.14500/aro.12335.
Received 2025-06-05
Accepted 2025-12-11
Published 2026-03-15