Positive Unlabelled Learning to Recognize Dishes as Named Entity
The British University in Dubai (BUiD)
With the growth of social media, there is a need to analyse user-generated content, especially text reviews. Online text reviews offer considerable potential and opportunities for both users and business owners. Much research targets the analysis of text reviews to extract useful information, especially through Named Entity Recognition. In this research, I focus on extracting food and dish names as named entities. Given the lack of labelled data, I try to overcome the cold-start problem and avoid manual labelling by building a lookup table from a dictionary. I work with the Yelp dataset: I go through each text review, treat each noun as a candidate, label the positive samples using the aforementioned lookup table, and then use Positive Unlabelled learning techniques to recognise more entities within the unlabelled data by predicting a probability for each candidate. In building the model, I consider the surrounding words, both preceding and following, as well as the Part of Speech tag of each. To eliminate duplicates caused by the same candidate appearing in different reviews or sentences, I average the predicted probabilities so that each candidate entity is represented only once. The results show that the entity recognition process can be automated using dictionaries and machine learning techniques, achieving an acceptable accuracy of 67% and increasing the number of newly discovered entities by around 15% when Positive Unlabelled learning is applied on top of the automatically built lookup table. This research has the potential to be extended to topics other than food and dish names, and it acts as a framework that is independent of the learning algorithm.
social media, user-generated content, named entity recognition
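As an illustration of two steps described in the abstract (labelling noun candidates from a lookup table, then averaging the predicted probabilities of repeated candidates), the following is a minimal sketch and not the implementation used in this research. The function names, the lowercasing used to match candidates, and all sample data and probabilities are hypothetical assumptions introduced only for this example.

```python
from collections import defaultdict


def label_candidates(candidates, lookup):
    """Return 1 (positive) for candidates found in the lookup table, else 0 (unlabelled).

    Assumption: matching is case-insensitive; the source does not specify this.
    """
    return [1 if c.lower() in lookup else 0 for c in candidates]


def average_by_candidate(occurrences):
    """Collapse repeated candidates into one entry holding their mean probability.

    `occurrences` is a list of (candidate, predicted_probability) pairs, one per
    appearance of the candidate in a review or sentence.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, prob in occurrences:
        key = name.lower()  # assumption: same normalisation as in labelling
        sums[key] += prob
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical lookup table and noun candidates (not taken from the Yelp dataset).
lookup = {"pizza", "ramen"}
candidates = ["Pizza", "waiter", "ramen", "tiramisu"]
labels = label_candidates(candidates, lookup)

# Hypothetical per-occurrence probabilities, as a PU classifier might output them.
occurrences = [("tiramisu", 0.75), ("Tiramisu", 0.25), ("waiter", 0.125)]
scores = average_by_candidate(occurrences)
```

Here `labels` marks "Pizza" and "ramen" as positive and leaves the other two candidates unlabelled, while `scores` holds a single averaged probability per distinct candidate, which can then be thresholded to decide which unlabelled candidates to accept as new entities.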