ML Beginner's Guide To Build NER Model For News Articles

Named Entity Recognition in Articles and News


Introduction


Imagine reading the news with all the important information, such as names, places, and events, automatically highlighted.

That’s the magic of Named Entity Recognition (NER). In simple terms, NER helps us identify and classify specific pieces of information in a given text, such as people’s names, locations, dates, and more.

NER Definition with Example

Named Entity Recognition (NER) is a process used in Natural Language Processing (NLP) to automatically identify and classify entities in text.

NER looks for particular spans of text known as named entities. These entities include:

People: Names of people, like "Will Smith" or "Emma Watson".
Locations: Place names, like "New York City" or "Mount Everest".
Organizations: Names of companies and institutions, like "Google" or "United Nations".
Dates and Times: Different times or dates, like "January 1, 2024" or "the 21st century".
Quantities: Numeric measurements, like "10 kilograms" or "100 dollars".

How Does Named Entity Recognition Actually Work?

A classic NER pipeline has three main steps:

  1. Tokenization: The input text is broken down into individual tokens, which can be words or punctuation marks.
  2. Part-of-speech (POS) tagging: Each token is assigned a part-of-speech tag that indicates its grammatical category, such as noun or verb. These tags provide patterns and context that help identify named entities.
  3. Named Entity Classification: The tokens are analyzed to determine which entity type, if any, they belong to (e.g. person, organization, location, or date).
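
The tokenization step can be sketched in a few lines of plain Python. This is a simplified illustration and not how spaCy tokenizes text: the regular expression and the `tokenize` helper are assumptions made for this example, and real tokenizers add extra rules for abbreviations, contractions, and URLs.

```python
import re

# Toy tokenizer: a run of word characters is one token, and every other
# non-space character (such as punctuation) becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Smith went to Paris last summer.")
print(tokens)
# ['Smith', 'went', 'to', 'Paris', 'last', 'summer', '.']
```

Note that this naive pattern would split "U.S." into four tokens, which is one reason production tokenizers rely on exception lists.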

Let’s understand NER with the help of an example:

Sentence:
“Smith went to Paris last summer to attend a conference by Google. There, he met John, a software engineer from Microsoft, during the event.”

In the above example, NER would classify the entities as follows:
Smith: PERSON entity
Paris: LOCATION entity
last summer: DATE entity, referring to a specific time period
Google: ORGANIZATION entity
John: PERSON entity
Microsoft: ORGANIZATION entity

An NER model analyzes each token in the text and labels it with an entity type, based on patterns learned during training.
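
To make the labeling concrete, here is a toy lookup over the example sentence. It is only a sketch: the `GAZETTEER` dictionary and the `find_entities` helper are hand-written assumptions for this illustration, whereas a trained NER model learns its patterns from annotated data and uses the surrounding context.

```python
import re

# Hypothetical lookup table (gazetteer) written by hand for this example.
GAZETTEER = {
    "Smith": "PERSON",
    "John": "PERSON",
    "Paris": "LOCATION",
    "Google": "ORGANIZATION",
    "Microsoft": "ORGANIZATION",
    "last summer": "DATE",
}


def find_entities(text, gazetteer):
    """Return (start, phrase, label) tuples in order of appearance."""
    found = []
    for phrase, label in gazetteer.items():
        # \b keeps "Paris" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((match.start(), phrase, label))
    return sorted(found)


sentence = (
    "Smith went to Paris last summer to attend a conference by Google. "
    "There, he met John, a software engineer from Microsoft, during the event."
)

for start, phrase, label in find_entities(sentence, GAZETTEER):
    print(f"{phrase}: {label}")
```

A fixed lookup like this cannot tell whether "Paris" is a city or a person's name, which is exactly the kind of decision a trained model makes from context.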

NER in News and Articles

Figure: NER example

NER plays an important role in understanding the who, what, when, where, and why of news and other articles: it extracts the key information and helps readers and systems understand the context of the text.

Here are some common use cases of NER in news and articles:

1. Content Discovery: Suppose you want to find articles about an athlete or a specific event. NER lets a platform categorize articles by the entities they mention.

Example: You want to stay updated about Elon Musk. An NER-powered news platform can recommend articles mentioning "Elon Musk" or "Tesla", even if those words aren't in the headlines.
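
One way to support this kind of search is an index from each entity to the articles that mention it. The sketch below assumes the entities were already extracted by an NER model; the `articles` data and the `build_entity_index` helper are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical articles with entities already extracted by an NER model.
articles = [
    {"headline": "Carmaker reports quarterly results",
     "entities": {"Tesla", "Elon Musk"}},
    {"headline": "Chief executive speaks at tech conference",
     "entities": {"Elon Musk"}},
    {"headline": "City marathon draws record crowd",
     "entities": {"New York City"}},
]


def build_entity_index(articles):
    """Map each entity to the headlines of the articles that mention it."""
    index = defaultdict(list)
    for article in articles:
        for entity in article["entities"]:
            index[entity].append(article["headline"])
    return index


index = build_entity_index(articles)
print(index["Elon Musk"])
# ['Carmaker reports quarterly results', 'Chief executive speaks at tech conference']
```

Neither headline contains the name itself, yet both articles are found because the lookup is keyed on entities from the article body.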

2. Content Summarization: NER can extract key facts from text, which helps produce concise, informative summaries.

Example: A news summary of a natural disaster such as an earthquake might highlight key details like "magnitude 7.8" and "killing 1000 people".

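3. Sentiment Analysis: NER does not score sentiment by itself, but by identifying who and what a text is about, it lets a sentiment model attach opinions to specific entities and track how those opinions trend over time.

Example: In customer reviews, NER can identify the products or brands mentioned, so that a sentiment classifier can tell us whether the opinions about each one are positive, negative, or neutral.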

4. Personalized Recommendations: Based on the entities in the content you previously interacted with, a system can use NER to generate personalized news recommendations or responses.

Example: If you read articles about "Climate Change", you may see recommendations for other articles related to climate change.
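
A simple way to rank such recommendations is to measure how many entities a candidate article shares with the reader's history. The sketch below uses Jaccard similarity for that; the scoring choice, the `recommend` helper, and the sample data are assumptions for illustration, not a description of how any particular news platform works.

```python
def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(read_entities, candidates, top_n=2):
    """Rank candidate articles by entity overlap with the reader's history."""
    scored = [
        (jaccard(read_entities, entities), title)
        for title, entities in candidates.items()
    ]
    # Keep only articles sharing at least one entity, best match first.
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:top_n]]


# Hypothetical reading history and candidate pool.
history = {"Climate Change", "United Nations"}
candidates = {
    "Summit agrees new emissions targets": {"Climate Change", "United Nations", "Paris"},
    "Heatwave breaks records": {"Climate Change"},
    "Transfer window roundup": {"Madrid"},
}

print(recommend(history, candidates))
# ['Summit agrees new emissions targets', 'Heatwave breaks records']
```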

NER Model Building

Let’s build a simple NER model using spaCy:

In this tutorial we will use one of spaCy's pre-trained models to perform NER instead of training a model from scratch. Many other libraries can be used as well.

1. Install spaCy using the following command:

pip install spacy

2. Download spaCy's pre-trained model: The spaCy library provides pre-trained models that include NER capabilities. For this example we are using the small English model `en_core_web_sm`.

python -m spacy download en_core_web_sm

3. Import spaCy and load the pre-trained model:

import spacy
nlp = spacy.load("en_core_web_sm")

4. Now let’s list the entity labels that the pre-trained model's NER component can predict:

nlp.pipe_labels['ner']

Figure: Pre-trained entity labels in spaCy

5. Now let's test the model:


text=""" U.S. President Joe Biden on Monday launched a task force aimed at addressing the "systemic" problem of mishandling classified information during presidential transitions, days after a Justice Department
special counsel's sharply critical report said he had done just that. The Presidential Records Transition Task Force will study past transitions
to determine best practices for safeguarding classified information from an outgoing administration, the White House said.
It will also assess the need for changes to existing policies and procedures to prevent the removal of sensitive information that by law should be kept with the National Archives and Records Administration."""

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))
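
For longer articles, a flat list of entities gets hard to read, so it can help to group them by label and count repeats. The helper below is a hypothetical sketch in plain Python; the `sample_pairs` list is hand-written stand-in data, not the actual output of the model on the text above.

```python
from collections import Counter, defaultdict


def group_by_label(pairs):
    """Group entity texts under their labels, counting repeats."""
    grouped = defaultdict(Counter)
    for text, label in pairs:
        grouped[label][text] += 1
    return grouped


# Hand-written (text, label) pairs standing in for model output.
sample_pairs = [
    ("Joe Biden", "PERSON"),
    ("Monday", "DATE"),
    ("the White House", "ORG"),
    ("Joe Biden", "PERSON"),
]

grouped = group_by_label(sample_pairs)
for label, counts in grouped.items():
    print(label, counts.most_common())
```

With the pipeline loaded earlier, the same helper could be called as `group_by_label((ent.text, ent.label_) for ent in doc.ents)`.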

Figure: Entities present in the text

6. Let’s visualize the entities highlighted in the given text. In a Jupyter notebook, displacy renders the result inline.

from spacy import displacy
displacy.render(doc, style="ent")

Figure: NER output highlighted in the text

Challenges in NER

NER is a powerful tool in natural language processing, but it comes with its own set of challenges:

  1. Ambiguity: Entities can be ambiguous, especially when the context is unclear. For example, “Apple” could refer to the fruit or the tech company.
  2. Named Entity Variations: Entities can be written in different ways, including nicknames and abbreviations. For example, “Barack Obama” may also appear as “Obama” or “Mr. President”. NER models need to be trained and tested on large datasets to handle these variations effectively.
  3. Limited Training Data: Training NER models requires high-quality, annotated data specific to the domain. Annotation is time consuming, and resources may be limited for some domains. As a result, model accuracy may drop on text from those domains.
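
The variation problem is often handled after NER by mapping each detected mention to a canonical name. The sketch below shows the idea with a hand-written table; `ALIASES` and `normalize` are hypothetical names for this example, and a real system would more likely use a knowledge base or an entity-linking component.

```python
# Hypothetical alias table written by hand for this example.
# A title like "Mr. President" only maps to one person in a given context,
# so a fixed table like this also runs into the ambiguity problem.
ALIASES = {
    "obama": "Barack Obama",
    "barack obama": "Barack Obama",
    "mr. president": "Barack Obama",
}


def normalize(mention, aliases=ALIASES):
    """Map a mention to its canonical name, or return it unchanged."""
    return aliases.get(mention.strip().lower(), mention)


for mention in ["Obama", "Mr. President", "Barack Obama", "Emma Watson"]:
    print(mention, "->", normalize(mention))
```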

Conclusion

Named Entity Recognition plays an important role in the analysis of news and articles, making it easier to extract information from large volumes of text.

By correctly identifying and classifying named entities such as persons, organizations, locations, and dates, NER allows people to gain valuable insights into the context of different articles.

As NER technology matures and is applied to news aggregation, financial news analysis, and many other domains, it is expected to improve our ability to understand and interpret news and articles.

Frequently Asked Questions

Q1. What is the use of NER in NLP?

Named entity recognition is a subtask of natural language processing that focuses on identifying entities in large volumes of text and classifying them into categories such as persons, organizations, and locations.

Q2. Can models like BERT be used for NER?

Yes, BERT can be used for named entity recognition. By fine-tuning BERT on labeled NER datasets, it can identify and classify named entities such as person names, organization names, locations, and dates in a given text.
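
Labeled NER datasets for this kind of fine-tuning are commonly stored as one tag per token, often in the BIO scheme (B for the beginning of an entity, I for inside, O for outside). The `to_bio` helper below is a hypothetical sketch of that conversion; it works on whole-word tokens and does not cover the extra alignment step needed for BERT's subword tokens.

```python
def to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end, label) token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # remaining tokens of the entity
    return tags


tokens = ["Will", "Smith", "visited", "New", "York", "City", "."]
spans = [(0, 2, "PERSON"), (3, 6, "LOCATION")]

for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(token, tag)
```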

Q3. What is an example of NER?

Statement: "President Biden met with President Macron in Paris yesterday." NER identifies "Biden" as PERSON, "Macron" as PERSON, and "Paris" as LOCATION.
