Enhancing Arabic Offensive Tweet Classification: An Ensemble Approach Integrating AraBERT, Neural Networks, and LSTM Models

The British University in Dubai (BUiD)
This thesis addresses the accurate detection and moderation of offensive language in Arabic text, a task complicated by the language's complex morphology, dialectal variation, orthographic ambiguity and noise, limited linguistic resources, and the need for comprehensive coverage of offensive expressions. The research is organized around four research questions. First, the study identifies research gaps in Arabic Text Classification (ATC) through a systematic literature review, analyzing and synthesizing the relevant academic literature to pinpoint areas of ATC research that remain underexplored or inadequately understood. Second, it investigates the effect of novel pre-processing methods on ATC performance. Third, it seeks the most effective model for improving the accuracy of Arabic offensive text classification, introducing a novel approach that combines the pre-trained AraBERT model with fully connected neural networks (NN) and long short-term memory (LSTM) networks. Finally, it evaluates the proposed model's ability to classify Arabic offensive text effectively. The research methodology consists of two parts: the dataset description and the proposed framework. The dataset description covers the two datasets used, OSACT and SemEval. The framework presents the proposed model, which combines pre-trained models with neural networks to classify Arabic offensive text effectively.
The model's performance is assessed using several evaluation metrics, including accuracy and F1-macro score, and is compared against other classifier models. The proposed model outperforms the baseline AraBERT model, achieving an accuracy of 0.870 against the baseline's 0.820 and an F1-score of 0.853 against the baseline's 0.800. These results indicate that the proposed model identifies offensive content in Arabic text more accurately than the baseline. The implications of this research extend to decision makers, developers, and policy makers, who can use its findings to make informed decisions about integrating Arabic text classification systems into operational settings. By understanding the proposed model's performance, decision makers can assess its potential impact on processes such as information retrieval, content filtering, and sentiment analysis in Arabic text. In conclusion, this thesis contributes to the existing literature by addressing the complexities of offensive language identification in Arabic text and introducing an approach that integrates pre-trained models with deep learning techniques and neural networks. The model's demonstrated performance underscores its potential for practical use in real-world scenarios.
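For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how accuracy and F1-macro are conventionally computed for a two-class offensive/not-offensive task. It is not the thesis's evaluation code: the label names `OFF` and `NOT`, the function names, and the toy predictions are all assumptions made for illustration.

```python
# Illustrative only: the standard definitions of accuracy and macro-averaged F1.
# Label names and the toy data below are hypothetical, not taken from the thesis.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so each class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical gold labels and predictions for six tweets.
gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Because macro averaging weights the offensive and non-offensive classes equally, it is a common choice when offensive examples are the minority class and accuracy alone could look high for a model that rarely predicts them.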
Arabic offensive classification, preprocessing techniques, AraBERT, AraBERT preprocessing, ensemble methodology, deep learning, neural networks, LSTM, natural language processing, emoji interpretation