Extracting information from a database, and especially from a body of text, has fascinated me ever since I started working with DBpedia in the Semantic Web community. Today I am writing about Entity Linking and how it facilitates NLP.
I am sure most of you have come across Named Entity Recognition (NER). NER is a fundamental Natural Language Processing (NLP) task with a wide range of use cases. This article is not about NER, but about an NLP task that is closely related to it.
Do you know what Named Entity Linking (NEL) is, and how it helps in Information Extraction, the Semantic Web, and many other tasks? If not, don’t worry. This article will answer those questions, along with a basic implementation of NEL.
Before looking into NEL, let us first understand information extraction. According to Wikipedia,
“Information extraction is a task of automatically extracting structured information from unstructured and/or semi-structured documents. In most of the cases, this activity concerns processing human language texts by means of NLP.”
In a typical information extraction example, unstructured text data is converted into a structured semantic graph. A broad goal of information extraction is to extract knowledge from unstructured data and use that knowledge for various other tasks.
What is Named Entity Linking?
Information extraction comprises multiple sub-tasks. In most cases, we will have the following sub-tasks, and they are performed in sequence to extract information from unstructured data:
· Named Entity Recognition (NER)
· Named Entity Linking (NEL)
· Relation Extraction
A named entity is a real-world object, such as a person, location, or organization. NER identifies named entity occurrences in text and classifies them into pre-defined categories. NER is modeled as the task of assigning a tag to each word in a sentence. Below is an example result from an NER system.
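For illustration, here is a minimal NER sketch using spaCy (spaCy and its en_core_web_sm model are my choice for this example, not a requirement; any NER system would do):

```python
# Minimal NER example with spaCy; assumes en_core_web_sm has been downloaded
# via `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars at Google "
          "in 2007, few people outside of the company took him seriously.")

# Print each detected entity with its predicted type.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Typical output:
#   Sebastian Thrun PERSON
#   Google ORG
#   2007 DATE
```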
NER tells us which words are entities and what their types are. In the above example, NER locates “Sebastian Thrun” as a person, but we still don’t know exactly which “Sebastian Thrun” the text is talking about. NEL is the next sub-task, and it answers this question.
NEL assigns a unique identity to the entities mentioned in the text. In other words, NEL is the task of linking entity mentions in text to their corresponding entities in a knowledge base [1]. The target knowledge base depends on the application, but for open-domain text we can use knowledge bases derived from Wikipedia. In our example, we can find out exactly which “Sebastian Thrun” is meant by linking the entities to DBpedia, a structured knowledge base extracted from Wikipedia. This process of linking entities to Wikipedia is also called Wikification.
NEL is also referred to as Entity Linking, Named Entity Disambiguation (NED), Named Entity Recognition and Disambiguation (NERD), or Named Entity Normalization (NEN). NEL has a wide range of applications beyond Information Extraction: it is used in Information Retrieval, Content Analysis, Intelligent Tagging, Question Answering Systems, Recommender Systems, etc.
NEL also plays a significant role in the Semantic Web, a term coined by Tim Berners-Lee for a web of data that can be processed by machines [5]. A vital issue in the Semantic Web is to automatically populate and enrich existing knowledge bases with newly extracted facts, and NEL is considered an essential subtask of knowledge base population [1].
General Approach
NEL is not a trivial task because of the name variation and ambiguity problems. Name variation means that an entity can be mentioned in different ways; for example, the entity Michael Jeffrey Jordan can be referred to by numerous names, such as Michael Jordan, MJ, and Jordan. The ambiguity problem refers to the fact that a name may refer to different entities depending on the context; for example, the name Bulls can apply to more than one entity in Wikipedia, such as the NBA team Chicago Bulls and the football team Belfast Bulls [4].
In general, a typical entity linking system consists of three modules: Candidate Entity Generation, Candidate Entity Ranking, and Unlinkable Mention Prediction [1]. A brief description of each module is given below, followed by a toy sketch that ties the three together.
· Candidate Entity Generation — In this module, the NEL system aims to retrieve a set of candidate entities by filtering out irrelevant entities in the knowledge base. The retrieved set contains the possible entities that the mention may refer to.
· Candidate Entity Ranking — Here, different kinds of evidence are leveraged to rank the candidate entities to find the most likely entity for the mention.
· Unlinkable Mention Prediction — This module validates whether the top-ranked entity identified in the previous module is the target entity for the given mention. If not, it returns NIL for the mention. In short, this module deals with unlinkable mentions.
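To make these modules concrete, here is a toy sketch that strings the three together using a hand-built alias table and simple word-overlap ranking. All names and data in it are invented for illustration; real systems rely on large alias dictionaries mined from Wikipedia and learned ranking models.

```python
# Toy NEL pipeline: candidate generation, candidate ranking, and NIL prediction.
# The "knowledge base" and alias table below are made up purely for illustration.

# Tiny knowledge base: entity id -> short description used as context.
KB = {
    "Chicago_Bulls": "NBA basketball team based in Chicago",
    "Belfast_Bulls": "American football team based in Belfast",
}

# Alias table for candidate entity generation: mention string -> candidate ids.
ALIASES = {"bulls": ["Chicago_Bulls", "Belfast_Bulls"]}


def generate_candidates(mention):
    # Candidate Entity Generation: look the mention up in the alias table.
    return ALIASES.get(mention.lower(), [])


def rank_candidates(candidates, context):
    # Candidate Entity Ranking: score each candidate by word overlap between
    # its KB description and the mention's surrounding context.
    context_words = set(context.lower().split())
    scored = [(len(context_words & set(KB[c].lower().split())), c) for c in candidates]
    return sorted(scored, reverse=True)


def link(mention, context, min_score=1):
    candidates = generate_candidates(mention)
    if not candidates:
        return "NIL"  # Unlinkable Mention Prediction: no candidates at all
    score, best = rank_candidates(candidates, context)[0]
    # Unlinkable Mention Prediction: reject a weak top-ranked candidate.
    return best if score >= min_score else "NIL"


print(link("Bulls", "The Bulls won another NBA title in Chicago"))
# -> Chicago_Bulls
```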
To know more about each module in detail, please read [1].
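For a basic, ready-to-use NEL implementation, we can turn to DBpedia Spotlight, a tool that annotates mentions of DBpedia resources in text. The sketch below queries Spotlight’s public annotation endpoint; the endpoint URL, parameters, and response fields shown here follow the public demo service as I know it and may change, so treat them as assumptions and check the Spotlight documentation.

```python
# Minimal NEL example against the public DBpedia Spotlight demo endpoint.
# Endpoint URL and JSON field names are assumptions based on the public service.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

text = ("When Sebastian Thrun started working on self-driving cars at Google "
        "in 2007, few people outside of the company took him seriously.")

response = requests.get(
    SPOTLIGHT_URL,
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
response.raise_for_status()

# Each resource links a surface form in the text to a DBpedia entity URI.
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
# e.g. Sebastian Thrun -> http://dbpedia.org/resource/Sebastian_Thrun
```

Each returned resource maps a surface form such as “Sebastian Thrun” to a DBpedia URI, which is exactly the unique identity NEL is after.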
DBpedia Spotlight uses Apache OpenNLP to identify entity mentions, and disambiguation in Spotlight is performed using the generative probabilistic model from [4]. Please read [2] and [3] to learn more about DBpedia Spotlight’s implementation.
NEL is an essential NLP task that deserves more attention. Recently, researchers have started using deep learning techniques to improve the performance of NEL systems on standard datasets [6][7]. I believe the massive amount of Linked Open Data available today provides an incredible opportunity for tomorrow’s Artificial Intelligence. Given NEL’s role in Information Extraction and the Semantic Web, we need to work more on topics like these.
References
[1] Wei Shen, Jianyong Wang, and Jiawei Han, Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions (2014), IEEE Transactions on Knowledge and Data Engineering.
[2] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes, Improving Efficiency and Accuracy in Multilingual Entity Extraction (2013), 9th International Conference on Semantic Systems.
[3] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer, DBpedia spotlight: shedding light on the web of documents (2011), 7th International Conference on Semantic Systems.
[4] Xianpei Han, and Le Sun, A Generative Entity-Mention Model for Linking Entities with Knowledge Base (2011), 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
[6] Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann, End-to-End Neural Entity Linking (2018), CoNLL.
[7] Jonathan Raiman, and Olivier Raiman, DeepType: Multilingual Entity Linking by Neural Type System Evolution (2018), AAAI.