Named Entity Recognition and Disambiguation
Entity extraction and disambiguation is the task of identifying the entities mentioned in a text and resolving them against a knowledge base. Identifying and resolving named entities, such as person names or locations, has many practical applications. For instance, users can extract lists of people, map different texts, generate timelines, and offer enhanced search. This matters not only for research but also for the publishing process.
OAPEN is testing the integration of the NERD API into the workflow of publishing platforms to enhance the discoverability and usage of open access monographs. The data for several thousand monographs and chapters are currently available as a CSV (comma-separated values) text file here:
Entity-fishing, the NERD implementation developed by INRIA, is a service available within the DARIAH-EU infrastructure. The partners of the OPERAS HIRMEOS project use it to enrich open access digital monographs published on five digital platforms.
Description of the data
The data is divided into the following columns:
| Column | Description |
| OAPEN_ID | Unique ID of the publication in the OAPEN Library |
| rawName | The entity as it appears in the text |
| nerd_score | Disambiguation confidence score |
| nerd_selection_score | Selection confidence score; indicates how certain it is that the disambiguated entity is actually valid for the text mention |
| wikipediaExternalRef | ID of the Wikipedia page |
| wiki_URL | URL of the Wikipedia page |
| type | NER class of the entity |
| domains | Description of the subject domain |
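As a minimal sketch of how the columns described above might be read, the following Python snippet parses the data with the standard `csv` module and keeps only the mentions whose `nerd_selection_score` reaches a chosen threshold. It assumes that the CSV has a header row with exactly the column names listed above and that both score fields always contain decimal numbers. The sample rows, the function name `load_mentions`, and the threshold are invented for illustration and are not taken from the OAPEN data.

```python
import csv
import io

# Invented sample rows; only the column names follow the description above.
SAMPLE_CSV = """OAPEN_ID,rawName,nerd_score,nerd_selection_score,wikipediaExternalRef,wiki_URL,type,domains
1000001,Paris,0.91,0.85,111,https://example.org/wiki/Paris,LOCATION,Geography
1000001,paris,0.40,0.30,111,https://example.org/wiki/Paris,LOCATION,Geography
"""


def load_mentions(handle, min_selection_score=0.0):
    """Read entity mentions and keep those at or above the score threshold.

    Assumes a header row and numeric score columns (hypothetical layout).
    """
    mentions = []
    for row in csv.DictReader(handle):
        # Convert both confidence scores from text to floats.
        row["nerd_score"] = float(row["nerd_score"])
        row["nerd_selection_score"] = float(row["nerd_selection_score"])
        if row["nerd_selection_score"] >= min_selection_score:
            mentions.append(row)
    return mentions


confident = load_mentions(io.StringIO(SAMPLE_CSV), min_selection_score=0.5)
for mention in confident:
    print(mention["OAPEN_ID"], mention["rawName"], mention["wiki_URL"])
```

With the real file, the same function could be called on an opened file handle instead of the in-memory sample.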
Each book may contain more than one occurrence of the same entity, and the nerd_score and nerd_selection_score may vary between occurrences. Researchers can count the occurrences of an entity and use the count as an additional way to assess the contents of a book. The OAPEN_ID is the identifier of the title in the OAPEN Library. A full description of all books in the OAPEN Library is available here.
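Counting occurrences per book could look like the following sketch. It assumes that the CSV holds one row per occurrence, that it has a header row with the column names described above, and that `wikipediaExternalRef` is a suitable key for "the same entity". The sample rows and the function name `count_entities` are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only the column names follow the data description.
SAMPLE_CSV = """OAPEN_ID,rawName,nerd_score,nerd_selection_score,wikipediaExternalRef,wiki_URL,type,domains
1000001,Paris,0.91,0.85,111,https://example.org/wiki/Paris,LOCATION,Geography
1000001,Paris,0.88,0.80,111,https://example.org/wiki/Paris,LOCATION,Geography
1000001,Kant,0.95,0.90,222,https://example.org/wiki/Kant,PERSON,Philosophy
1000002,Paris,0.70,0.60,111,https://example.org/wiki/Paris,LOCATION,Geography
"""


def count_entities(handle):
    """Count rows per (book, entity) pair.

    Assumes one CSV row per occurrence of an entity in a book.
    """
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["OAPEN_ID"], row["wikipediaExternalRef"])] += 1
    return counts


counts = count_entities(io.StringIO(SAMPLE_CSV))
# List the (book, entity) pairs from most to least frequent.
for (book_id, entity_id), n in counts.most_common():
    print(book_id, entity_id, n)
```

Keying on the Wikipedia page ID rather than on `rawName` groups different spellings of one entity together, which is one possible design choice; counting by `rawName` would instead measure surface forms.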
For more information about the entity-fishing query processing service, see https://nerd.readthedocs.io/en/latest/restAPI.html#response.