SpanCat: Named Entity Recognition in spaCy and beyond
Named entity recognition (NER) is the task of finding spans of interest, such as the names of people, companies, drugs and diseases, for use in downstream applications.
The extracted spans are often used as input for further processing, such as mapping entities to entries in a knowledge base or identifying relevant relations between entities.
The EntityRecognizer component in spaCy is efficient and tends to perform well on real-world datasets by exploiting typical characteristics of NER data. One such characteristic is that the spans do not overlap. In many applications, however, other span configurations are useful, such as entities that naturally appear within other entities in a nested structure.
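To make the non-overlap assumption concrete, here is a minimal sketch in plain Python. The `(start, end, label)` tuple layout, the helper names and the example offsets are illustrative assumptions, not part of spaCy's API; `end` is treated as exclusive.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical span representation: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def overlapping_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that share at least one token."""
    return [
        (a, b)
        for a, b in combinations(sorted(spans), 2)
        if a[0] < b[1] and b[0] < a[1]
    ]


def is_flat(spans: List[Span]) -> bool:
    """True if no two spans overlap, i.e. the typical NER setting."""
    return not overlapping_pairs(spans)


# Flat annotation: two entities that do not touch each other.
flat = [(0, 2, "PERSON"), (5, 6, "ORG")]

# Nested annotation: a location name inside an organisation name.
nested = [(1, 4, "ORG"), (3, 4, "GPE")]

print(is_flat(flat))    # no overlapping pairs
print(is_flat(nested))  # the GPE span sits inside the ORG span
```

A flat annotation scheme can keep only one of the two nested spans, which is the limitation that motivates a separate span categorization component.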
We’ll introduce spaCy’s new component, the SpanCategorizer, along with our tooling for annotating and categorizing nested and other kinds of irregular, overlapping spans. We’ll also explain the machine learning approach we’ve taken so far and our current research efforts to improve SpanCat.
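As a rough illustration of the difference in data structures, the sketch below stores a nested pair of spans in a spaCy `Doc`. It assumes spaCy 3.1 or later is installed; the example sentence, the labels and the token offsets are made up for illustration, and the key `"sc"` is used on the assumption that it matches the span categorizer's default spans key. No model is trained here; the spans are set by hand.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline is enough: only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("The Bank of England raised rates.")

# Two hand-written spans; the second is nested inside the first.
org = Span(doc, 1, 4, label="ORG")  # "Bank of England"
gpe = Span(doc, 3, 4, label="GPE")  # "England"

# doc.ents holds flat entities, so overlapping spans are rejected.
try:
    doc.ents = [org, gpe]
    flat_ents_rejected = False
except ValueError:
    flat_ents_rejected = True

# doc.spans holds named span groups, which may overlap freely.
doc.spans["sc"] = [org, gpe]

print(flat_ents_rejected)
print([(span.text, span.label_) for span in doc.spans["sc"]])
```

Span groups are keyed by name, so several independent, possibly overlapping annotation layers can coexist on the same `Doc`.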
ML Engineer, spaCy Core Developer, Explosion.AI
Ákos Kádár is a Machine Learning Engineer at Explosion and a core developer of the Natural Language Processing library spaCy and of thinc, the Machine Learning library powering it. He has a research background and obtained a PhD from Tilburg University, where he worked on interpretability and on learning multi-modal and multi-lingual representations. During his research career he also worked at Microsoft Research Montreal on reinforcement learning, at Samsung AI Toronto on semantic and dependency parsing, and at Borealis on SQL parsing and causal inference.