Named Entity Recognition To Enrich Text
Named Entity Recognition (NER) to Enrich Text
Named Entity Recognition (NER) is a Natural Language Processing task that identifies and classifies named entities (NE) into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.
This notebook demonstrates how to carry out NER with chat completion and functions-calling to enrich a text with links to a knowledge base such as Wikipedia:
Text:
In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.
Text enriched with Wikipedia links:
In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.
Inference Costs: The notebook also illustrates how to estimate OpenAI API costs.
1. Setup
1.1 Install/Upgrade Python packages
Note: you may need to restart the kernel to use updated packages. Note: you may need to restart the kernel to use updated packages. Note: you may need to restart the kernel to use updated packages.
1.2 Load packages and OPENAI_API_KEY
You can generate an API key in the OpenAI web interface. See https://platform.openai.com/account/api-keys for details.
This notebook works with the latest OpeanAI models gpt-3.5-turbo-0613 and gpt-4-0613.
2. Define the NER labels to be Identified
We define a standard set of NER labels to showcase a wide range of use cases. However, for our specific task of enriching text with knowledge base links, only a subset is practically required.
3. Prepare messages
The chat completions API takes a list of messages as input and delivers a model-generated message as an output. While the chat format is primarily designed for facilitating multi-turn conversations, it is equally efficient for single-turn tasks without any preceding conversation. For our purposes, we will specify a message for the system, assistant, and user roles.
3.1 System Message
The system message (prompt) sets the assistant's behavior by defining its desired persona and task. We also delineate the specific set of entity labels we aim to identify.
Although one can instruct the model to format its response, it has to be noted that both gpt-3.5-turbo-0613 and gpt-4-0613 have been fine-tuned to discern when a function should be invoked, and to reply with JSON formatted according to the function's signature. This capability streamlines our prompt and enables us to receive structured data directly from the model.
3.2 Assistant Message
Assistant messages usually store previous assistant responses. However, as in our scenario, they can also be crafted to provide examples of the desired behavior. While OpenAI is able to execute zero-shot Named Entity Recognition, we have found that a one-shot approach produces more precise results.
3.3 User Message
The user message provides the specific text for the assistant task:
4. OpenAI Functions (and Utils)
In an OpenAI API call, we can describe functions to gpt-3.5-turbo-0613 and gpt-4-0613 and have the model intelligently choose to output a JSON object containing arguments to call those functions. It's important to note that the chat completions API doesn't actually execute the function. Instead, it provides the JSON output, which can then be used to call the function in our code. For more details, refer to the OpenAI Function Calling Guide.
Our function, enrich_entities(text, label_entities) gets a block of text and a dictionary containing identified labels and entities as parameters. It then associates the recognized entities with their corresponding links to the Wikipedia articles.
4. ChatCompletion
As previously highlighted, gpt-3.5-turbo-0613 and gpt-4-0613 have been fine-tuned to detect when a function should to be called. Moreover, they can produce a JSON response that conforms to the function signature. Here's the sequence we follow:
- Define our
functionand its associatedJSONSchema. - Invoke the model using the
messages,toolsandtool_choiceparameters. - Convert the output into a
JSONobject, and then call thefunctionwith theargumentsprovided by the model.
In practice, one might want to re-invoke the model again by appending the function response as a new message, and let the model summarize the results back to the user. Nevertheless, for our purposes, this step is not needed.
Note that in a real-case scenario it is strongly recommended to build in user confirmation flows before taking actions.
4.1 Define our Function and JSON schema
Since we want the model to output a dictionary of labels and recognized entities:
{
"gpe": ["Germany", "Europe"],
"date": ["1440"],
"person": ["Johannes Gutenberg"],
"product": ["movable-type printing press"],
"event": ["Renaissance"],
"quantity": ["3,600 pages"],
"time": ["workday"]
}
we need to define the corresponding JSON schema to be passed to the tools parameter:
4.2 Chat Completion
Now, we invoke the model. It's important to note that we direct the API to use a specific function by setting the tool_choice parameter to {"type": "function", "function" : {"name": "enrich_entities"}}.
5. Let's Enrich a Text with Wikipedia links
5.1 Run OpenAI Task
2023-10-20 18:05:51,729 - INFO - function_to_call: <function enrich_entities at 0x0000021D30C462A0>
2023-10-20 18:05:51,730 - INFO - function_args: {'person': ['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr'], 'org': ['The Beatles'], 'gpe': ['Liverpool'], 'date': ['1960']}
2023-10-20 18:06:09,858 - INFO - entity_link_dict: {'John Lennon': 'https://en.wikipedia.org/wiki/John_Lennon', 'Paul McCartney': 'https://en.wikipedia.org/wiki/Paul_McCartney', 'George Harrison': 'https://en.wikipedia.org/wiki/George_Harrison', 'Ringo Starr': 'https://en.wikipedia.org/wiki/Ringo_Starr', 'The Beatles': 'https://en.wikipedia.org/wiki/The_Beatles', 'Liverpool': 'https://en.wikipedia.org/wiki/Liverpool'}
5.2 Function Response
5.3 Token Usage
To estimate the inference costs, we can parse the response's "usage" field. Detailed token costs per model are available in the OpenAI Pricing Guide:
Token Usage
Prompt: 331 tokens
Completion: 47 tokens
Cost estimation: $0.00059