An Empirical Study on Extracting Named Entities from Repetitive Texts
An overview of the paper I presented at the WEBIST 2024 conference.
Large Language Models (LLMs) like GPT-3.5 Turbo and GPT-4 are transforming how we process and analyze text. From summarizing data to extracting structured information, their capabilities extend across fields such as healthcare, finance, and historical research.
In this post, I explore a study that tested the effectiveness of LLMs in extracting named entities from repetitive texts — specifically, historical birth registries. I’ll dive into the results, challenges, ethical considerations, and broader applications.
Whether you’re a researcher, developer, or business professional, this post offers practical takeaways on leveraging LLMs for real-world tasks.
The Challenge of Named Entity Recognition (NER)
Named Entity Recognition (NER) involves identifying and categorizing specific information (e.g., names, dates, and locations) from text. While modern methods excel with structured or annotated datasets, repetitive texts present unique challenges due to inconsistent formatting and context-specific terms.
Case Study Example: I used historical birth records from Pisa’s Jewish community (1749–1809), whose entries are structured but linguistically complex.
The goal was to extract:
- Child Name: Ribqa
- Father Name: Salamon Sezzi
- Sex: Female (F)
- Date of Birth: October 12, 1749
Once the named entities were extracted, the objective was to organize them into a CSV file.
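As a minimal sketch (my own illustration, not the paper's code), the four fields extracted from the example entry above could be written out as a CSV row like this:

```python
import csv
import io

# Hypothetical record built from the example fields listed above.
record = {
    "child_name": "Ribqa",
    "father_name": "Salamon Sezzi",
    "sex": "F",
    "date_of_birth": "1749-10-12",
}

# Write a header plus one data row to an in-memory CSV buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

In the real pipeline, one such row would be appended per registry entry.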
Why Use Large Language Models?
Traditional approaches to NER rely on rule-based systems or annotated datasets. Both require extensive manual effort and struggle with new text formats. LLMs, however, offer:
- Adaptability: They handle diverse text formats with minimal customization.
- Efficiency: Processing large datasets quickly without the need for extensive pre-training.
- Flexibility: They generalize patterns from training data, making them ideal for historical or niche applications.
The study evaluated whether GPT-3.5 Turbo and GPT-4 could replace rule-based systems for extracting named entities from repetitive texts, using only carefully crafted prompts.
Experiment Design
I built a system that submits each registry entry to the model along with an instruction template and collects the structured output. I defined three types of instruction templates: Simple, Medium, and Detailed.
I tested six configurations, crossing two models with three instruction levels:
- Models: GPT-3.5 Turbo and GPT-4.
- Instruction Levels: Simple, Medium, and Detailed.
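The exact prompts are given in the paper; as a hedged illustration of how the three instruction levels might differ (the wording below is my own, not the study's actual prompts), a template builder could look like:

```python
# Illustrative instruction templates at three levels of detail.
# These approximate the idea of graded prompts, not the paper's exact text.
TEMPLATES = {
    "simple": (
        "Extract the child name, father name, sex, and date of birth "
        "from the record below. Output one CSV row."
    ),
    "medium": (
        "Extract the child name, father name, sex (M/F), and date of birth "
        "(YYYY-MM-DD) from the birth record below. "
        "Output exactly one CSV row: child_name,father_name,sex,date_of_birth."
    ),
    "detailed": (
        "You will read an 18th-century Italian birth record. "
        "Extract: child name, father name, sex (M/F), date of birth (YYYY-MM-DD). "
        "Ignore ancestry markers such as 'del fu [Name]' when identifying the father. "
        "If a field is missing, write 'N/A'. "
        "Output exactly one CSV row: child_name,father_name,sex,date_of_birth."
    ),
}

def build_prompt(level: str, record_text: str) -> str:
    """Combine an instruction template with one registry entry."""
    return f"{TEMPLATES[level]}\n\nRecord: {record_text}"
```

Each of the six configurations then amounts to one model paired with one of these templates.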
I used the following evaluation metrics:
- Father Ratio: Correct father name extraction.
- Child Ratio: Correct child name extraction.
- Sex Ratio: Accurate gender identification.
- Date Ratio: Accurate birth date extraction.
- Total Ratio: Aggregate performance across all metrics.
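A minimal sketch of how such per-field ratios could be computed (my own illustration; the paper defines the metrics, not this code):

```python
def field_ratios(predictions: list[dict], gold: list[dict]) -> dict:
    """Per-field accuracy ratios plus an aggregate total ratio.

    Assumes predictions and gold are aligned lists of records sharing
    the keys: father_name, child_name, sex, date_of_birth.
    """
    fields = ["father_name", "child_name", "sex", "date_of_birth"]
    n = len(gold)
    # Fraction of records where each field exactly matches the gold value.
    ratios = {
        f: sum(p[f] == g[f] for p, g in zip(predictions, gold)) / n
        for f in fields
    }
    # Total ratio: average of the per-field ratios.
    ratios["total"] = sum(ratios[f] for f in fields) / len(fields)
    return ratios
```

Exact string matching is the simplest choice here; fuzzier matching (e.g., normalizing spelling variants) would be a natural refinement.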
The full results table is reported in the paper; the key findings are summarized below.
Key Findings
The results highlighted the importance of balancing cost, accuracy, and clarity:
- Accuracy vs. Cost:
- GPT-4 delivered the highest accuracy, especially with detailed prompts, but at a significantly higher cost.
- GPT-3.5 Turbo offered strong performance with medium or detailed prompts, providing a cost-effective alternative.
- Instruction Quality:
- Detailed prompts significantly improved accuracy, even with the less expensive GPT-3.5 Turbo.
- Medium instructions performed well in most cases, offering a balance of simplicity and accuracy.
- Common Errors:
- Naming conventions like “del fu [Name]” (denoting ancestry) confused the models.
- Simpler instructions occasionally led to formatting errors, such as inconsistent CSV outputs.
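One way to mitigate the “del fu” confusion is a deterministic post-processing pass on the extracted names. This is a sketch of my own, not part of the paper's pipeline:

```python
import re

# "del fu [Name]" marks a deceased ancestor ("son of the late [Name]").
# Stripping it keeps the ancestor's name from being merged into the father's.
DEL_FU = re.compile(r"\s+del fu\s+[A-Z][\w']*(?:\s+[A-Z][\w']*)*")

def strip_ancestry(name: str) -> str:
    """Remove a trailing 'del fu <Name>' ancestry reference from a name."""
    return DEL_FU.sub("", name).strip()
```

For example, `strip_ancestry("Salamon Sezzi del fu Moise")` returns `"Salamon Sezzi"`, while names without the marker pass through unchanged.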
Lessons Learned and Broader Applications
Lessons Learned:
1. Clarity in Instructions is Key: Adding examples and addressing edge cases reduced errors significantly.
2. Choose the Right Model for the Task:
- GPT-3.5 Turbo is ideal for cost-sensitive tasks.
- GPT-4 is best for tasks requiring high accuracy and complex reasoning.
Applications Beyond Historical Texts:
This methodology has wide-ranging potential:
- Healthcare: Extracting patient data from clinical notes.
- Finance: Parsing information from invoices or contracts.
- Customer Support: Categorizing ticket details for faster resolutions.
Addressing Ethical Considerations:
LLMs are trained on vast datasets, which may include biases. To mitigate these risks:
- Use outputs cautiously, particularly in sensitive tasks.
- Implement human oversight for high-stakes decisions.
- Transparently document instructions and outputs to ensure accountability.
Challenges and Future Directions
Key Challenges:
- Complex Naming Structures: Handling ancestry references (“del fu [Name]”) requires further prompt optimization.
- Formatting Errors: Simpler instructions occasionally led to inconsistencies in structured outputs.
- Scalability: Performance on unstructured datasets remains untested.
Future Directions:
- Explore Alternative Models: Test LLMs from other providers like Google or Meta.
- Instruction Optimization: Refine prompts to handle incomplete or ambiguous data effectively.
- Broader Testing: Adapt the approach for unstructured text, such as freeform notes or email conversations.
Additional information about the paper is available here: