Time Series-Based Spoof Speech Detection Using Long Short-Term Memory and Bidirectional Long Short-Term Memory
Abstract
Detecting spoofed speech is crucial for the reliability of voice-based authentication systems. Traditional methods often struggle because they cannot model the complex temporal patterns in speech. This study introduces a deep learning approach based on Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) models, tailored to identifying spoofed speech from its temporal characteristics. The models learn these patterns directly from speech signals represented by Mel-frequency cepstral coefficients (MFCC), constant Q cepstral coefficients (CQCC), and features extracted with the open-source Speech and Music Interpretation by Large-space Extraction (OpenSMILE) toolkit. We evaluate on the ASVspoof 2019 Logical Access dataset using the minimum tandem detection cost function (min-tDCF), Equal Error Rate (EER), Recall, Precision, and F1-score. The results show that LSTM and BiLSTM models significantly enhance the reliability of spoof speech detection systems.
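As an illustration of one of the metrics named above, the following is a minimal sketch of how an Equal Error Rate could be computed from detector scores. It is not the authors' evaluation code. It assumes that a higher score means "more likely bona fide", and the function names (`error_rates`, `equal_error_rate`) and the toy scores are hypothetical. The sketch picks the candidate threshold where the false acceptance and false rejection rates are closest and reports their average; min-tDCF is not covered here.

```python
from typing import List, Tuple


def error_rates(bonafide: List[float], spoof: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) at a threshold.

    Assumption: a trial is accepted as bona fide when score >= threshold.
    FAR = fraction of spoof trials accepted.
    FRR = fraction of bona fide trials rejected.
    """
    far = sum(s >= threshold for s in spoof) / len(spoof)
    frr = sum(s < threshold for s in bonafide) / len(bonafide)
    return far, frr


def equal_error_rate(bonafide: List[float],
                     spoof: List[float]) -> Tuple[float, float]:
    """Approximate the EER by scanning every observed score as a threshold.

    Returns (eer, threshold), where eer is the mean of FAR and FRR at the
    threshold that minimises |FAR - FRR|.
    """
    thresholds = sorted(set(bonafide) | set(spoof))
    # Extra threshold above all scores, so "reject everything" is considered.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, best_eer, best_thr = None, None, None
    for thr in thresholds:
        far, frr = error_rates(bonafide, spoof, thr)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    # Hypothetical toy scores, not results from the paper.
    bonafide_scores = [0.9, 0.8, 0.7, 0.3]
    spoof_scores = [0.1, 0.2, 0.4, 0.6]
    eer, thr = equal_error_rate(bonafide_scores, spoof_scores)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With only a handful of trials the EER is coarse; on a full evaluation set, interpolating between adjacent thresholds would give a smoother estimate.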
Copyright (c) 2024 Arsalan R. Mirza, Abdulbasit K. Al-Talabani
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
- Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License [CC BY-NC-SA 4.0]. This license allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors have the freedom to enter into separate agreements for the non-exclusive distribution of the journal's published version of the work. This includes options such as posting it to an institutional repository or publishing it in a book, as long as proper acknowledgement is given to its initial publication in this journal.
- Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process. This practice can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.