Integrating rule-based approach and machine learning approach for Arabic named entity recognition

The British University in Dubai (BUiD)
Named Entity Recognition (NER) is a crucial Information Extraction task on which many Natural Language Processing applications rely as an important pre-processing step. NER has been successfully applied to different natural languages such as English, French, German, Chinese and Arabic. Natural Language Processing for Arabic has started receiving attention in the past few years. It is challenging, especially for information extraction, because of the complex nature of the Arabic language, which arises from its complicated syntax and rich morphology. Named Entity Recognition for Arabic is still in its early stages, and opportunities to improve performance remain. Most Arabic NER systems have been developed using one of two approaches: the Rule-based approach or the Machine Learning based approach. In this thesis, the problem of Named Entity Recognition for Arabic is tackled by integrating the Machine Learning based approach with the Rule-based approach to form a hybrid approach, in an attempt to enhance the overall performance of Arabic Named Entity Recognition. The proposed hybrid system is capable of recognizing 11 different types of named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. The system is composed of two main components: a Rule-based component and a Machine Learning based component. The Rule-based component is an enhanced reproduction built from the acquired linguistic knowledge of the NERA system. The Machine Learning based component uses decision trees, support vector machines and logistic regression to generate a model for Arabic NER from an annotated dataset produced by the Rule-based component.
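As a rough illustration of how the output of a rule-based component can be handed to a machine learning component, the following minimal Python sketch turns each token into a feature dictionary that includes the rule-based decision. It is not the thesis implementation: the gazetteer entries, the trigger words, the feature names and the functions `rule_based_tag` and `extract_features` are all hypothetical, and the tokens are transliterated toy examples rather than real Arabic text.

```python
# Toy gazetteers and trigger words (hypothetical, transliterated for readability).
PERSON_GAZETTEER = {"mohammed", "ahmad"}
LOCATION_GAZETTEER = {"dubai", "cairo"}
PERSON_TRIGGERS = {"dr", "mr"}


def rule_based_tag(tokens, i):
    """Tiny stand-in for a rule-based tagger: gazetteer lookup plus one trigger rule."""
    token = tokens[i].lower()
    previous = tokens[i - 1].lower() if i > 0 else ""
    if token in PERSON_GAZETTEER or previous in PERSON_TRIGGERS:
        return "PERSON"
    if token in LOCATION_GAZETTEER:
        return "LOCATION"
    return "O"  # not a named entity


def extract_features(tokens, i):
    """Build the feature dictionary for token i (feature names are assumptions)."""
    token = tokens[i].lower()
    return {
        "word": token,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        "in_person_gazetteer": token in PERSON_GAZETTEER,
        "in_location_gazetteer": token in LOCATION_GAZETTEER,
        # The rule-based decision becomes one more feature for the classifier.
        "rule_based_tag": rule_based_tag(tokens, i),
    }


tokens = ["Dr", "Mohammed", "visited", "Dubai"]
features = [extract_features(tokens, i) for i in range(len(tokens))]
for token, feats in zip(tokens, features):
    print(token, feats["rule_based_tag"])
```

Feature dictionaries of this kind could then be vectorized and passed to any of the classifiers named above, such as a decision tree, a support vector machine or logistic regression.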
The annotated dataset is presented to the Machine Learning based component through a set of features. The feature set is carefully selected to optimize the performance of the Machine Learning component as much as possible. Two types of relevant linguistic resources are collected and acquired: gazetteers and corpora (i.e. datasets). Extensive experiments are conducted along three dimensions, namely the named entity types, the feature set and the machine learning technique, to evaluate the performance of our hybrid Arabic Named Entity Recognition system on different datasets. The experimental results show that the hybrid approach outperforms both the Rule-based approach and the Machine Learning based approach used on their own for Arabic Named Entity Recognition. According to the experimental analysis, the best performance of the proposed system is achieved when all the features of the different types are included in the feature set. Decision trees have proved efficient as a classifier in the proposed hybrid system: the highest overall improvement in performance is achieved when decision trees are used as the classifier. Our hybrid NER system for Arabic outperforms the state of the art in Arabic Named Entity Recognition in terms of precision, recall and f-measure when applied to the ANERcorp dataset, with precision of 94.7%, recall of 94.1% and f-measure of 94.4% for Person named entities, precision of 91.7%, recall of 88.6% and f-measure of 90.1% for Location named entities, and precision of 89.4%, recall of 87% and f-measure of 88.2% for Organization named entities.
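The reported f-measure values are consistent with the standard balanced f-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The short Python check below recomputes it from the precision and recall figures quoted above. It assumes the balanced (F1) form of the measure; the function name `f_measure` and the `reported` dictionary are introduced here for illustration only.

```python
def f_measure(precision, recall):
    """Balanced f-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported f-measure %) on ANERcorp, as quoted in the abstract.
reported = {
    "Person": (94.7, 94.1, 94.4),
    "Location": (91.7, 88.6, 90.1),
    "Organization": (89.4, 87.0, 88.2),
}

for entity_type, (precision, recall, reported_f) in reported.items():
    computed = f_measure(precision, recall)
    print(f"{entity_type}: computed {computed:.1f}, reported {reported_f}")
```

For all three entity types the recomputed value, rounded to one decimal place, agrees with the reported one.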
machine learning, named entity recognition, natural language processing