Authors: Jasson Prestiliano, Azhari Azhari, and Arif Nurwidyantoro (2025)
Problem and Challenge
Online games such as Roblox and Minecraft are generally considered child-friendly because their developers do not build content harmful to children into the games. Much of the popularity of both games comes from user-generated content, which lets each player create content as they wish. That same freedom can be abused: some parents banned Roblox from their homes after an incident in which multiple players sexually assaulted their seven-year-old daughter’s Roblox avatar. An avatar is the in-game character that represents a player. Although the game developers apply strict filtering to English-language Roblox chat, irresponsible users can still display inappropriate or violent content that can significantly affect the psychological well-being of the children who play.
Goal of Experimentation
To propose and develop a deep-learning-based multimodal approach to violence detection that combines a visual modality and a verbal modality.
Methods
The visual modality of this study is a violence detection model that combines a 3D Convolutional Neural Network (3DCNN), a BiLSTM, and an attention mechanism. The 3DCNN extracts spatiotemporal features from the preprocessed video frames. The BiLSTM then processes these features in both the forward and backward directions to capture long-term temporal dependencies, producing a sequence of hidden states that serve as input to the attention mechanism. The attention mechanism assigns greater weight to the key parts of the feature sequence, allowing the model to focus on the information most relevant to the task.
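The attention step at the end of this pipeline can be illustrated with a minimal sketch. It assumes simple dot-product scoring against a single vector; the source does not specify the scoring function, so `attention_pool`, `query`, and the numbers below are hypothetical and are not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each hidden state against `query` (dot product), normalize the
    scores with softmax, and return (context_vector, attention_weights)."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # The context vector is the attention-weighted sum of the hidden states.
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights

# Assumption: made-up hidden states for three time steps of a clip.
hidden_states = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
query = [1.0, 1.0]  # assumption: stands in for a learned parameter vector
context, weights = attention_pool(hidden_states, query)
```

In this toy input the second time step has the largest score, so it receives the largest weight and dominates the context vector that a classifier would consume.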
The proposed chat violence detection model combines BERT with a BiLSTM. BERT is a deep learning model built on the Transformer architecture, which dynamically calculates the weighting between every input and output element based on their connections. BERT-based models have been applied to a wide range of language tasks, including abstractive summarization, sentence prediction, question answering, and conversational response generation. BERT embeddings are contextualized: they capture the meaning of a word in relation to its context within a sentence. Consequently, BERT is well suited to classifying chat messages, where the meaning of a word depends on the words around it. The BiLSTM complements BERT by reading the sequence with LSTM layers in both directions, which helps the model capture the context of a sentence and produce more meaningful outputs.
System Architecture
A multimodal approach is required because violence in a video capture can appear in more than one modality. This study uses video and chat: the video shows violence visually, while the chat can show violence verbally. Audio is excluded because child-friendly online games usually use funny or ordinary sounds that mask the violence. The video and the chat are each processed with a unimodal model, and the results of the two modalities are then combined using a late fusion technique. Late fusion is the most straightforward and most commonly used fusion technique. It integrates the data only after each unimodal stream has been fully processed on its own, so each modality can be handled with a robust method tailored to its characteristics. After this unimodal processing, usually after label prediction in a recognition task, the outcomes are combined, often by summing or averaging. The main limitation of late fusion is its restricted ability to exploit cross-correlations between the unimodal data streams.
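Late fusion by averaging can be sketched as follows. This is an illustrative example only: the function name `late_fusion_average`, the two-class label set, and the probabilities are assumptions made for the sketch, not values from the study.

```python
def late_fusion_average(probs_by_modality):
    """Average per-class probabilities across independently processed modalities."""
    n = len(probs_by_modality)
    classes = probs_by_modality[0].keys()
    return {c: sum(p[c] for p in probs_by_modality) / n for c in classes}

# Assumption: each unimodal model has already produced class probabilities.
video_probs = {"violent": 0.8, "non_violent": 0.2}  # hypothetical visual output
chat_probs = {"violent": 0.4, "non_violent": 0.6}   # hypothetical chat output

fused = late_fusion_average([video_probs, chat_probs])
label = max(fused, key=fused.get)  # the class with the highest fused probability
```

Because the two streams only meet at the probability level, neither model can use the other's intermediate features, which is the cross-correlation limitation noted above.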

Figure 1. Multimodal Architecture System
Result and Discussion
On the visual side, the proposed model, which combines the 3DCNN architecture, a BiLSTM, and an attention mechanism, performed well at distinguishing violence from non-violence in video captures of online gaming sessions. The model was validated on three datasets: the Hockey Dataset, the Violent Movies Dataset, and the Online Game Violence Dataset developed specifically for this study. Its average training and validation accuracy on these three datasets surpassed that of the several other models it was compared against. These results indicate that the visual approach can accurately capture spatiotemporal patterns of violence in child-friendly online games.
On the verbal side, the combination of BERT and BiLSTM, trained on the Indonesian Chat Dataset, classified each chat message as neutral, violent, racist, or harassing. Its accuracy was higher than that of several other recent models trained on the same dataset, indicating that the proposed verbal violence detection model classified most chat messages correctly.
The multimodal approach used was hybrid late fusion, where each modality was processed separately and then combined using a combination of rule-based and softmax probability techniques. This combination resulted in a more comprehensive detection system, improving classification accuracy compared to relying solely on a single modality. This model can also be applied to online games with a child-friendly rating to detect violence from both modalities, and can serve as an early warning system for parents.
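One way to picture a hybrid of rule-based and softmax-probability fusion is sketched below. The source does not state the actual rules or thresholds, so everything here is an assumption made for illustration: the function `hybrid_late_fusion`, the 0.9 rule threshold, the 0.5 decision threshold, and the choice to average as a fallback.

```python
def hybrid_late_fusion(video_violent_prob, chat_probs,
                       rule_threshold=0.9, decision_threshold=0.5):
    """Return (flagged, reason). All rules and thresholds are illustrative assumptions."""
    chat_label = max(chat_probs, key=chat_probs.get)
    # Probability that the chat message is anything other than neutral.
    chat_harm_prob = 1.0 - chat_probs["neutral"]

    # Rule-based part (assumed): a highly confident single modality decides alone.
    if video_violent_prob >= rule_threshold:
        return True, "rule: confident visual violence"
    if chat_label != "neutral" and chat_probs[chat_label] >= rule_threshold:
        return True, f"rule: confident chat class '{chat_label}'"

    # Probability part (assumed): otherwise average the two softmax-derived scores.
    fused = (video_violent_prob + chat_harm_prob) / 2
    return fused >= decision_threshold, f"fused score {fused:.2f}"

# Hypothetical outputs of the two unimodal models for one gameplay segment.
chat = {"neutral": 0.3, "violent": 0.5, "racist": 0.1, "harassing": 0.1}
flagged, reason = hybrid_late_fusion(0.6, chat)
```

In this example neither modality is confident enough to trigger a rule, so the decision falls back to the averaged score, and the segment is flagged because both modalities lean toward harmful content.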

Figure 2. Visual Violence Detection Result
Value Proposition
This violence detection model is expected to help parents monitor the online games their children play and to ease their anxiety about them. It can also help parents take the necessary action if violence occurs during gameplay, and it can be applied to build more child-friendly online gaming communities.