Arabic Question Answering from diverse data sources

The British University in Dubai (BUiD)
Arabic users still have to extract accurate answers to their questions manually, a difficult task given the vast amount of information available on the Internet. Existing Arabic Question Answering (QA) systems do not meet users' needs in terms of performance, nor in terms of scope, since they do not cover all types of questions. This research is motivated by the need for new approaches that handle all types of questions and answer them beyond factoid questions. We therefore present in this paper a new design of the linguistic approach for developing a reliable Arabic QA system and data source that addresses the following challenges: (i) handling both factoid and complex questions in Arabic, (ii) extracting the precise answer from available resources, (iii) evaluating the proposed QA system against a gold-standard data set, and (iv) providing the Arabic Corpus of Occupations (ACO), which has been made freely and publicly available for research purposes. Our QA system is a web application that retrieves an answer to a posed question from different data sources. We conducted experiments on a set of 230 questions from previously published resources: TREC, CLEF, and the ACO corpus. The system answered 72 questions, with an average precision of 36%, a recall of 78%, and an F-measure of 51%. We built the ACO corpus because of the lack of free, annotated, large-scale Arabic resources that can be used for training and testing Arabic QA systems. In this paper, we provide the ACO corpus of one million words written in Modern Standard Arabic (MSA). The corpus contains 700 occupations, which were carefully analyzed and manually annotated. We use Cohen's Kappa coefficient to evaluate the reliability of the tagged content.
The corpus content was tagged and assessed by two different groups of taggers. The resulting inter-annotator agreement indicates that the reliability of the ACO corpus is at the level of almost perfect agreement, and the corpus content is highly reliable according to the achieved result of 90%.
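
As an illustration of the reliability measure mentioned above, the following is a minimal sketch of Cohen's Kappa for two annotators who label the same items. It is not the code used to evaluate the ACO corpus: the function name `cohen_kappa` and the tag names `JOB` and `ORG` are hypothetical and chosen only for this example. The coefficient compares the observed agreement p_o with the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items (illustrative)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each tag, the product of the two annotators'
    # marginal proportions, summed over all tags.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag throughout;
        # kappa is undefined here, so agreement is reported as 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical tags for ten items; the annotators disagree on the last item only.
annotator_1 = ["JOB", "JOB", "ORG", "JOB", "ORG", "ORG", "JOB", "ORG", "JOB", "JOB"]
annotator_2 = ["JOB", "JOB", "ORG", "JOB", "ORG", "ORG", "JOB", "ORG", "JOB", "ORG"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In this toy example the annotators agree on 9 of 10 items, yet the coefficient is lower than 0.9 because part of that agreement would be expected by chance. On the commonly used Landis and Koch scale, values above 0.80 are read as almost perfect agreement.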
Arabic Question Answering (QA) systems, data sources, Arabic users, Arabic language