FROM TF-IDF TO TRANSFORMER-BASED MODELS: CLUSTERING METHODS FOR E-COMMERCE PRODUCT ORGANIZATION

DOI: https://doi.org/10.58902/nckhpt.e-v2i1.366

Lê Nhật Tùng

Trường Đại học Công nghệ Đồng Nai

Bùi Minh Huy

Trường Đại học Công nghệ Thành phố Hồ Chí Minh

Trần Lê Vân

Trường Đại học Công nghệ Thành phố Hồ Chí Minh

Nguyễn Thị Thanh Tâm

Trường Đại học Công nghệ Thành phố Hồ Chí Minh

Nguyễn Thị Liệu

Trường Đại học Công nghệ Đồng Nai

Product catalog management is essential for e-commerce platforms to optimize operations, particularly in Vietnam's multilingual setting, where product descriptions often mix Vietnamese content with English terms. This study systematically evaluates automatic clustering methods by comparing nine combinations of three text representation methods (TF-IDF, PhoBERT, E5-multilingual) and three clustering algorithms (K-Means, DBSCAN, Gaussian Mixture Models (GMM)) on a dataset of 36,644 products from Lazada Vietnam. Experimental results show that E5-multilingual combined with GMM performs best, with a normalized mutual information (NMI) of 91.28% and a purity of 79.22%, outperforming traditional methods and monolingual models. Notably, the study identifies a paradox: some methods achieve high technical scores (Silhouette 0.986) yet deliver no business classification value (ARI 0.003), which underscores the importance of validation against actual product categories. These findings provide an empirical basis for e-commerce businesses to develop solutions that automate the classification process, reduce manual effort, and improve the customer experience.

Keywords: TF-IDF; Clustering algorithms; E-commerce; Transformer.
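
The external validation metrics named in the abstract (Purity, NMI, ARI) all compare a clustering against known product categories. The following is a minimal standard-library sketch of their textbook definitions, not the authors' implementation. The toy category and cluster labels are hypothetical, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the abstract does not state which variant was used.

```python
from collections import Counter
from math import comb, log


def purity(true_labels, cluster_labels):
    """Share of items that fall in the majority true category of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(cnt.values()) for cnt in by_cluster.values())
    return majority_total / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def nmi(true_labels, cluster_labels):
    """Mutual information divided by the mean of the two entropies (assumed variant)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    t_cnt, c_cnt = Counter(true_labels), Counter(cluster_labels)
    mi = sum(
        (k / n) * log(k * n / (t_cnt[t] * c_cnt[c]))
        for (t, c), k in joint.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_labels)) / 2
    return mi / denom if denom > 0 else 1.0


def ari(true_labels, cluster_labels):
    """Adjusted Rand Index: pair-counting agreement corrected for chance."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    sum_joint = sum(comb(k, 2) for k in joint.values())
    sum_t = sum(comb(k, 2) for k in Counter(true_labels).values())
    sum_c = sum(comb(k, 2) for k in Counter(cluster_labels).values())
    expected = sum_t * sum_c / comb(n, 2)
    max_index = (sum_t + sum_c) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


# Hypothetical toy data: six products in two true categories.
categories = ["phone", "phone", "phone", "laptop", "laptop", "laptop"]
aligned = [0, 0, 0, 1, 1, 1]  # clusters match the categories exactly
mixed = [0, 1, 0, 1, 0, 1]    # clusters cut across the categories

for name, clusters in [("aligned", aligned), ("mixed", mixed)]:
    print(
        name,
        round(purity(categories, clusters), 3),
        round(nmi(categories, clusters), 3),
        round(ari(categories, clusters), 3),
    )
```

In this toy case the mixed clustering still reaches a purity of 4/6 while its ARI falls slightly below zero, because ARI subtracts the agreement expected by chance. Internal scores such as Silhouette never look at the category labels at all, which is why a clustering can score well on them and still fail against real categories.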
