Authors: Ayu Pertiwi, Azhari Azhari, Sri Mulyana
Combines semantic embeddings and subword modeling to track dynamic topic evolution (branded as a hybrid AI-enhanced topic model)
Problem and Challenge
Semantic topic modeling struggles to capture nuanced word meanings, particularly in cases of negation, rare words, or synonymy. Standard models such as LDA and DTM struggle with semantic coherence, while embedding models such as Word2Vec and FastText are limited by out-of-vocabulary (OOV) words or context insensitivity, as shown in Figure 1.
Goal of Experimentation
To develop Fast2Vec, a hybrid word embedding model that integrates Word2Vec and FastText, aiming to enhance semantic accuracy in dynamic topic modeling. The objective is to track topic trends and evolution patterns using improved word representations.
Methods
Fast2Vec combines Word2Vec and FastText embeddings through weighted summation (weight = 0.5). DTM is used to model topics over time, while UMAP and Affinity Propagation support semantic clustering. Semantic similarity is evaluated using cosine similarity and the Spearman and Pearson correlations.
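The weighted summation can be illustrated with a minimal sketch. It assumes the two vectors share the same dimensionality and that the weight forms a convex combination (the weight on Word2Vec, one minus the weight on FastText); the fallback to the FastText vector for words missing from Word2Vec is also an assumption of this sketch, not a detail stated above. The function names are hypothetical.

```python
import numpy as np

def combine_embeddings(w2v_vec, ft_vec, weight=0.5):
    """Weighted sum of a Word2Vec vector and a FastText vector.

    Assumption: both vectors have the same dimensionality.
    Assumption: if the word is missing from Word2Vec (w2v_vec is None),
    the FastText subword vector is used on its own.
    """
    ft_vec = np.asarray(ft_vec, dtype=float)
    if w2v_vec is None:
        return ft_vec
    w2v_vec = np.asarray(w2v_vec, dtype=float)
    return weight * w2v_vec + (1.0 - weight) * ft_vec

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Toy vectors standing in for real model outputs.
hybrid = combine_embeddings([1.0, 0.0], [0.0, 1.0], weight=0.5)
oov_hybrid = combine_embeddings(None, [0.0, 1.0])
```

With a weight of 0.5 both models contribute equally, so the combined vector is simply the average of the two inputs.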
System Architecture
The system workflow, shown in Figure 2, includes data preprocessing, Fast2Vec embedding generation, DTM-based topic extraction, dimensionality reduction (UMAP), semantic clustering (Affinity Propagation, AP), and evolution tracking via entropy analysis. This pipeline enables interpretable and adaptive topic modeling.
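As a rough illustration of the entropy-based tracking step, the sketch below computes the Shannon entropy of a topic's word distribution in each time slice. Which distribution the entropy is computed over is not specified above, so using per-slice topic-word weights is an assumption, and the function names are hypothetical.

```python
import numpy as np

def shannon_entropy(weights):
    """Shannon entropy (in bits) of a non-negative weight vector.

    The weights are normalized to sum to 1 and zero entries are ignored.
    """
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_trend(topic_word_dists):
    """Entropy of one topic per time slice.

    Assumption: topic_word_dists holds one word-weight vector per time
    slice for the same topic, for example as produced by a DTM run.
    """
    return [shannon_entropy(d) for d in topic_word_dists]

# Toy example: a topic that spreads from one word to two equally weighted words.
trend = entropy_trend([[1.0, 0.0], [0.5, 0.5]])
```

Rising entropy across slices suggests a topic whose vocabulary is spreading out, while flat entropy suggests a stable one; any thresholds for labeling a pattern would need to come from the study itself.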
Results and Discussion
Fast2Vec improves similarity by 39.64% over Word2Vec in OOV settings and outperforms FastText by 6.18%. It performs best on 7 of the 12 benchmark datasets. The model (Fig. 3) also successfully categorizes topic evolution patterns (diffusion, stability, shift, and moderate fluctuation), validated through entropy-based trend analysis.
Value Proposition
Fast2Vec offers robust word representations that support fine-grained topic evolution tracking. Its integration of context and subword modeling makes it ideal for applications in NLP research, scientometrics, and semantic analysis over time. It bridges the gap between statistical modeling and semantic precision.