Evaluation of Word2Vec and FastText models for text similarity measurement assessment

Authors

  • Tukino Tukino, Buana Perjuangan Karawang University, Karawang, Indonesia
  • Eko Sediyono, Satya Wacana Christian University, Salatiga, Indonesia
  • Hendry Hendry, Satya Wacana Christian University, Salatiga, Indonesia
  • Agustia Hananto, Buana Perjuangan Karawang University, Karawang, Indonesia
  • Elfina Novalia, Buana Perjuangan Karawang University, Karawang, Indonesia
  • Fitria Nurapriani, Buana Perjuangan Karawang University, Karawang, Indonesia

Keywords:

Word2Vec, FastText, Cosine similarity, Text similarity, TF-IDF, RMSE, MAE

Abstract

Measuring text similarity is important in digital-age education, where it supports tasks such as automated question evaluation, content alignment, and mapping of learning outcomes. However, accurately estimating semantic similarity for short, domain-specific texts remains challenging, and existing approaches often lack systematic comparisons across embedding models and weighting schemes. We evaluated Word2Vec and FastText embeddings (CBOW and Skip-gram) combined with TF-IDF, POS weighting, and BM25. Cosine similarity was computed for 112 sentence pairs, and the model scores were compared with expert scores using Pearson and Spearman correlations as well as RMSE and MAE. The best-performing configurations were FT+CBOW+TF-IDF for semantic agreement with experts (highest Pearson's r = 0.7493) and W2V+CBOW+TF-IDF for prediction accuracy (lowest error: MAE = 0.91, RMSE = 1.10, overall error = 1.01). The BM25-based variant produced significantly higher errors. These findings indicate that CBOW with TF-IDF provides the most stable similarity estimates for short educational texts, which supports automated evaluation tools in learning environments.
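
As a rough illustration of the pipeline summarised in the abstract, the following minimal sketch builds a TF-IDF-weighted average of word vectors for each sentence, compares two sentences with cosine similarity, and defines the MAE, RMSE, and Pearson measures used to compare model scores with expert scores. It is not the authors' implementation. The toy embedding table, the smoothed IDF formula, the example sentence pair, and all function names are assumptions made for illustration; a real experiment would load trained Word2Vec or FastText vectors and would also compute Spearman correlation, which is omitted here.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors, standing in for trained
# Word2Vec or FastText embeddings (values invented for illustration).
EMBEDDINGS = {
    "students": [0.9, 0.1, 0.0],
    "learners": [0.8, 0.2, 0.1],
    "solve": [0.1, 0.9, 0.2],
    "answer": [0.2, 0.8, 0.3],
    "equations": [0.0, 0.3, 0.9],
    "problems": [0.1, 0.4, 0.8],
}


def idf_weights(corpus):
    """Smoothed inverse document frequency per token (assumed formula)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def sentence_vector(tokens, idf):
    """TF-IDF-weighted average of the vectors of in-vocabulary tokens."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in EMBEDDINGS:
            continue  # skip out-of-vocabulary tokens
        weight = (count / len(tokens)) * idf.get(tok, 1.0)
        for i, value in enumerate(EMBEDDINGS[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [v / weight_sum for v in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mae(pred, gold):
    """Mean absolute error between model scores and expert scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def rmse(pred, gold):
    """Root mean squared error between model scores and expert scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def pearson(x, y):
    """Pearson correlation coefficient r (assumes non-constant inputs)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# One hypothetical sentence pair, already tokenised and lower-cased.
pair = (["students", "solve", "equations"], ["learners", "answer", "problems"])
idf = idf_weights(list(pair))
similarity = cosine(sentence_vector(pair[0], idf), sentence_vector(pair[1], idf))
print(f"cosine similarity: {similarity:.3f}")
```

In a full evaluation, the similarity of each of the 112 pairs would be collected into a list and passed, together with the expert scores, to `pearson`, `mae`, and `rmse`. Cosine similarity and expert ratings may be on different scales, so some rescaling step would likely be needed before computing the error measures; the abstract does not specify one.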

Author Biography

Tukino Tukino, Buana Perjuangan Karawang University, Karawang, Indonesia

He is also affiliated with Satya Wacana Christian University, Salatiga, Indonesia.

Published

2026-05-04

How to Cite

Evaluation of Word2Vec and FastText models for text similarity measurement assessment. (2026). BIS Information Technology and Computer Science, 3, V326007. https://doi.org/10.31603/bistycs.485