Named Entity Recognition (NER) is a common natural language processing (NLP) task that automatically identifies and categorizes predefined entities in a given text. Entities like person names, organizations, dates and times, and locations are valuable information to extract from unstructured and unlabeled raw text.
By the end of this tutorial, you will be able to perform named entity recognition on any given English text with HuggingFace Transformers and spaCy in Python; here's an example of the resulting NER:
spaCy is an open-source Python library for advanced NLP. It is built on the latest research and designed to be used in real-world products. We'll be using two NER models from spaCy, namely the regular `en_core_web_sm` and the transformer-based `en_core_web_trf`. We'll also use spaCy's amazing NER visualizer.
To get started, let's install the required libraries for this tutorial. First, let's install `transformers`:
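```bash
$ pip install transformers
```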
Next, we need to install `spacy` and `spacy-transformers`. To do that, I've grabbed the latest `.whl` file from the spacy-models releases for installation:
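The version number in the wheel URL below is only illustrative; substitute whatever release is current:

```bash
$ pip install spacy spacy-transformers
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.1/en_core_web_trf-3.4.1-py3-none-any.whl
```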
Of course, if you're reading this tutorial in the future, make sure to get the latest release from this page if you encounter any problems regarding the above command.
Next, we have to download spaCy's `en_core_web_sm` regular model:
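```bash
$ python -m spacy download en_core_web_sm
```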
`en_core_web_sm` is an English model pipeline optimized for CPU. It is small, only 13MB in size, and released under the MIT license. For larger models, you can use `en_core_web_md` for the medium-sized one and `en_core_web_lg` for the large one.
Once done with the installation, let's get started with the code:
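We'll need the following imports throughout the tutorial (the variable and helper names used in the snippets below are our own choices):

```python
import spacy
from spacy import displacy  # spaCy's built-in visualizer
from transformers import pipeline  # HuggingFace Transformers' pipeline API
```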
For this tutorial, we'll be performing NER on this text that I've grabbed from Wikipedia:
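The snippet below uses a trimmed excerpt from the opening of the Wikipedia article on Albert Einstein; any similar passage works:

```python
# a trimmed excerpt from the Wikipedia article on Albert Einstein
text = """Albert Einstein was a German-born theoretical physicist, widely
acknowledged to be one of the greatest physicists of all time. Einstein is
best known for developing the theory of relativity, but he also made important
contributions to the development of the theory of quantum mechanics.
Einstein was born in the German Empire, but moved to Switzerland in 1895,
forsaking his German citizenship (as a subject of the Kingdom of Württemberg)
the following year. In 1897, at the age of 17, he enrolled in the mathematics
and physics teaching diploma program at the Swiss Federal polytechnic school
in Zürich, graduating in 1900."""
```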
We'll be using the HuggingFace Transformers `pipeline` API for loading the models:
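```python
# load the token-classification (NER) pipeline with a BERT model
# fine-tuned on the CoNLL-2003 dataset
ner = pipeline("ner", model="dslim/bert-base-NER")
```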
We're using a BERT model (`dslim/bert-base-NER`, fine-tuned from `bert-base-cased`) that was trained on the CoNLL-2003 Named Entity Recognition dataset. You can use `dslim/bert-large-NER` for a larger version of this one.
Let's extract the entities for our text using this model:
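```python
# run inference on our text
doc_ner = ner(text)
print(doc_ner)
```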
Output:
As you can see, the output is a list of dictionaries containing the start and end positions of the entity in the text, the prediction score, the word itself, its index, and the entity name.
The named entity tags of this dataset are:

- `O`: Outside of a named entity.
- `B-MIS`: Beginning of a miscellaneous entity right after another miscellaneous entity.
- `I-MIS`: Miscellaneous entity.
- `B-PER`: Beginning of a person's name right after another person's name.
- `I-PER`: Person's name.
- `B-ORG`: Beginning of an organization right after another organization.
- `I-ORG`: Organization.
- `B-LOC`: Beginning of a location right after another location.
- `I-LOC`: Location.

Next, let's make a function that uses spaCy to visualize these predictions:
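Here's a sketch of such a function; the helper's name, the one-character merge tolerance, and stripping the `B-`/`I-` prefixes are our own choices:

```python
def get_entities_html(text, ner_result, title=None):
    """Visualize the output of a transformers NER pipeline with displaCy."""
    ents = []
    for ent in ner_result:
        e = {}
        # copy the start and end positions of the entity in the text
        e["start"] = ent["start"]
        e["end"] = ent["end"]
        # drop the B-/I- prefix so consecutive tokens share one label
        e["label"] = ent["entity"].split("-")[-1]
        if ents and e["label"] == ents[-1]["label"] and e["start"] - ents[-1]["end"] <= 1:
            # same entity continued right after the previous one: merge them
            ents[-1]["end"] = e["end"]
            continue
        ents.append(e)
    # build the data structure displacy.render() expects in manual mode
    render_data = [{"text": text, "ents": ents, "title": title}]
    displacy.render(render_data, style="ent", manual=True, jupyter=True)
```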
The above function uses the `displacy.render()` function to render the text with its extracted named entities. We pass `manual=True` to indicate that this is a manual visualization and not a spaCy document. We also set `jupyter` to `True`, as we're currently in a Jupyter notebook or Colab.
The whole purpose of the `for` loop is to construct a list of dictionaries with the `start` and `end` positions and the entity's label. We also check whether adjacent entities share the same label, and if so, we combine them into a single entity.
Let's call it:
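```python
# visualize the entities predicted by the BERT model
get_entities_html(text, doc_ner)
```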
Next, let's load another relatively larger and better model that is based on `roberta-large`:
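One such checkpoint is the XLM-RoBERTa-large model fine-tuned on the same CoNLL-2003 dataset (the exact checkpoint name here is an assumption):

```python
# assumed checkpoint: a RoBERTa-large-based model fine-tuned on CoNLL-2003
ner2 = pipeline("ner", model="xlm-roberta-large-finetuned-conll03-english")
```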
Performing inference:
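```python
doc_ner2 = ner2(text)
```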
Visualizing:
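```python
get_entities_html(text, doc_ner2)
```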
As you can see, the results have improved: Albert Einstein is now recognized as a single entity, and so is the Kingdom of Württemberg.
There are a lot of other models that were fine-tuned on the same dataset. Here's yet another one:
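For instance (this checkpoint name is again an assumption, a `roberta-large` fine-tuned on CoNLL-2003):

```python
ner3 = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english")
doc_ner3 = ner3(text)
get_entities_html(text, doc_ner3)
```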
This model, however, only has `PER`, `MISC`, `LOC`, and `ORG` entities. spaCy automatically colors the familiar entities.
To perform NER using spaCy, we must first load the model using the `spacy.load()` function:
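```python
# load the small English model we downloaded earlier
nlp = spacy.load("en_core_web_sm")
```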
We're loading the model we've downloaded. Make sure you download the model you want to use before loading it here. Next, let's generate our document:
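```python
# run the whole spaCy pipeline on our text
doc = nlp(text)
```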
And then visualizing it:
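```python
# displaCy accepts spaCy Doc objects directly, so no manual mode is needed
displacy.render(doc, style="ent", jupyter=True)
```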
This one looks much better, and there are a lot more entity types (18) than the previous ones, namely `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, and `WORK_OF_ART`.
However, quantum mechanics was mistakenly labeled as an organization, so let's try the transformer model that spaCy offers:
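```python
# load the transformer-based English pipeline (requires spacy-transformers)
nlp_trf = spacy.load("en_core_web_trf")
```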
Let's perform inference and visualize the text:
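```python
doc_trf = nlp_trf(text)
displacy.render(doc_trf, style="ent", jupyter=True)
```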
This time, Swiss Federal was labeled as an organization, even though the span wasn't complete (it should be Swiss Federal polytechnic school), and quantum mechanics is no longer labeled as an organization.
The `en_core_web_trf` model performs much better than the previous ones. Check this table, which shows each English model offered by spaCy along with its size and NER evaluation metrics:
| Model Name | Model Size | Precision | Recall | F-Score |
| --- | --- | --- | --- | --- |
| `en_core_web_sm` | 13MB | 0.85 | 0.84 | 0.84 |
| `en_core_web_md` | 43MB | 0.85 | 0.84 | 0.85 |
| `en_core_web_lg` | 741MB | 0.86 | 0.85 | 0.85 |
| `en_core_web_trf` | 438MB | 0.90 | 0.90 | 0.90 |
Make sure you try other types of texts and see for yourself whether your results confirm the above table! You can check this page on spaCy to see the details of each model.
For other languages, spaCy strives to make models available for every language globally. You can check this page to see the available models for each language.
Here are some related NLP tutorials that you may find useful:
Happy learning ♥