How to Choose an Entity Extraction Product

Entity Extraction Is Essential for Organizations to Maximize the Value of Their Unstructured Data

Entity Extraction (also known as Named Entity Recognition) is advanced AI technology that recognizes key concepts in unstructured text and converts them into semantically structured data. It was pioneered in the 1990s, funded primarily by the US Government, which needed a technology that could make sense of the very large amounts of unstructured data it increasingly had to deal with. As one of the Government managers behind the technology once said:

“We sure knew how to collect all sorts of unstructured data, but we didn’t have any tools for finding the critical things in it and converting them into useful intelligence.”

Entity Extraction Has Gone Mainstream

Fast forward to today: Entity Extraction has escaped the R&D lab, is the focus of a large community of developers, and is offered as a product by many companies. It has matured to the point where Government and business organizations increasingly see a compelling need to include it in their digital strategy for realizing maximum value from unstructured data.

Organizations that are using Entity Extraction are seeing business impact in many areas, including:

    • Enterprise search
    • Business intelligence
    • Intelligence analysis
    • E-discovery
    • Risk management
    • Health care
    • Legal research
    • Redaction

Not All Entity Extraction Products Are Created Equal

There are certain things to keep in mind when deciding which vendor’s extraction product to choose. As part of the business case for acquiring entity extraction software, an organization needs to consider the following factors:

1. Ontology Coverage

The word “ontology” refers to the set of concepts that an entity extraction product can identify in text. Most extraction products on the market cover basic “named entity” types, such as:

    • Person
    • Organization
    • Place
    • Numeric (e.g., money amounts and percentages)
    • Dates and times

One thing to consider is how fine-grained an ontology your application requires. For example, most entity extraction products will only identify that something is an organization name and not provide any further characterization. They do not distinguish among companies, government agencies, educational institutions, non-profits, etc. Similarly, some products extract place names but do not distinguish countries, states, cities, etc. If such distinctions are critical to your planned application, then you will want to look elsewhere.
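
One way to check ontology fit is to write down the types your application needs and compare them with the types a candidate product emits. The sketch below is illustrative only: the type names, the two-level ontology, and both helper functions are assumptions made for this example, not any vendor's actual type inventory.

```python
# Hypothetical two-level ontology: fine-grained subtype -> coarse entity type.
ONTOLOGY = {
    "COMPANY": "ORGANIZATION",
    "GOVERNMENT_AGENCY": "ORGANIZATION",
    "EDUCATIONAL_INSTITUTION": "ORGANIZATION",
    "NON_PROFIT": "ORGANIZATION",
    "COUNTRY": "PLACE",
    "STATE": "PLACE",
    "CITY": "PLACE",
}


def uncovered_types(required_types, product_types):
    """Return the required types that the product cannot emit, sorted."""
    return sorted(set(required_types) - set(product_types))


def coarsen(entity_type):
    """Map a fine-grained type to its coarse parent (or return it unchanged)."""
    return ONTOLOGY.get(entity_type, entity_type)


# Types the application needs vs. types a coarse-grained product provides.
required = ["PERSON", "COMPANY", "CITY"]
product = ["PERSON", "ORGANIZATION", "PLACE"]

gaps = uncovered_types(required, product)
print(gaps)                          # fine-grained types with no direct match
print([coarsen(t) for t in gaps])    # the coarse types the product would return instead
```

Here the product covers the coarse parents of the missing types, so it would find the names but could not tell a company from a government agency or a city from a country.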

2. Accuracy

Another critical factor is an entity extraction product’s accuracy. Extraction has an inevitable error rate, and it’s important to find a product that minimizes both false positives and false negatives.

A false positive occurs when the product extracts something incorrectly, e.g., as a person name when it isn’t one. For example, “Joshua Tree” could be taken to be a person name because of the apparent first name, but it obviously is not one. A false negative occurs when the product misses a name that should have been extracted.

There needs to be a good balance between precision (the share of extracted items that are correct) and recall (the share of items that should have been extracted that actually were). If you extract just one name correctly from a document that contains hundreds of names, you have perfect precision (not one wrong name was extracted!) but almost zero recall. Conversely, you can extract pretty much everything from a document and have excellent recall but dreadful precision. Typically, a metric called F-Measure, which combines precision and recall into a single score, is used to measure the accuracy of entity extraction, taking into account both false positives and false negatives.
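
The trade-off can be made concrete with a small calculation. The sketch below assumes the balanced F1 variant of F-Measure (the harmonic mean of precision and recall); the function name and the counts are illustrative and are not taken from any product or benchmark.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) computed from raw extraction counts."""
    extracted = true_positives + false_positives   # everything the product returned
    expected = true_positives + false_negatives    # everything it should have returned
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One name extracted correctly from a document that contains 200 names:
# nothing wrong was extracted, but 199 names were missed.
p, r, f1 = precision_recall_f1(true_positives=1, false_positives=0, false_negatives=199)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, so the perfect precision in this example does not rescue the score.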

3. Noise Tolerance

Ideally, you’d like an entity extraction product to be able to process a wide variety of unstructured text data with high accuracy (i.e., high F-Measure) — from well-edited news reports to all-lowercase email to wildly ungrammatical text messages or social media posts. If your input texts include not-so-well-edited documents, but your candidate entity extraction product is hampered by grammatical irregularities or is dependent on correct capitalization of names, then it will likely not meet your requirements.

Accuracy metrics provided by vendors deserve careful scrutiny. Vendors sometimes report precision, recall, and F-Measure scores, but these metrics must be evaluated in context. What dataset was used? Does it resemble your data in language, structure, and noise level? A product that performs well on news articles or Wikipedia data may struggle with email, scanned texts, or informal chat logs.

Requesting a proof of concept (PoC) using your own sample data is one of the most reliable ways to assess real-world performance. Have your team select documents that are representative of your data and manually mark the items that should be extracted. Then align the two lists of named entities, one produced by the entity extraction product and the other by your human annotators, and examine the differences. This will give you a good picture of the accuracy of the entity extraction product.
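
The comparison step of a PoC can be sketched in a few lines. This is a simplified illustration: it assumes each annotation is an (entity text, entity type) pair and counts only exact matches, whereas a fuller evaluation would usually also compare positions in the text. The function name and sample annotations are made up for the example.

```python
def compare_annotations(human, product):
    """Compare two collections of (entity_text, entity_type) pairs for one document."""
    human_set, product_set = set(human), set(product)
    return {
        "matched": sorted(human_set & product_set),
        "missed": sorted(human_set - product_set),      # false negatives
        "spurious": sorted(product_set - human_set),    # false positives
    }


# Made-up annotations for a single document.
human = [
    ("Jane Smith", "PERSON"),
    ("Acme Corp", "ORGANIZATION"),
    ("Joshua Tree", "PLACE"),
]
product = [
    ("Jane Smith", "PERSON"),
    ("Joshua Tree", "PERSON"),   # right text, wrong type
]

report = compare_annotations(human, product)
for category, items in report.items():
    print(category, items)
```

Reading through the "missed" and "spurious" lists is often more informative than a single score, because it shows which kinds of entities and which kinds of text cause trouble.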

A PoC is also useful for revealing any gaps in the ontology coverage between what the entity extraction product is designed to extract and what you need extracted. That leads to the next consideration in choosing an entity extraction product: customizability.

4. Customizability

For named entities, desirable extraction targets can range from the ordinary, such as person and organization names, to the very arcane, such as those found in scientific or technical domains (e.g., widget names in a specialized area). No entity extraction product on the market covers all the bases. An organization needs to determine the following:

    • Can the entity extraction product extract what is needed out-of-the-box?
    • If not, can the extraction product be customized to handle what is needed?
    • How easy and quick is it to customize?

5. Foreign Language Coverage

Do you need extraction from texts in languages other than English? This is important to know since entity extraction products differ widely in their language coverage.

6. Speed and Scalability

If you have a very large amount of text, ask whether an entity extraction product can process that volume of unstructured data within your time frame. In addition, given the constant growth of unstructured data on social media, the web, and other sources, you may want an entity extraction product that is highly scalable. In other words, could it easily handle many more input documents, both for future growth and for sudden time-critical needs?
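
A rough capacity estimate can help frame this question before talking to vendors. The sketch below assumes throughput scales linearly with the number of parallel instances, which real deployments may not achieve; the function name and all of the numbers are hypothetical.

```python
import math


def instances_needed(total_documents, deadline_hours, docs_per_hour_per_instance):
    """Estimate how many parallel instances are needed to meet a deadline.

    Assumes throughput scales linearly with the number of instances.
    """
    required_rate = total_documents / deadline_hours   # documents per hour overall
    return math.ceil(required_rate / docs_per_hour_per_instance)


# Hypothetical workload: 2 million documents in a 24-hour window, with a
# measured throughput of 15,000 documents per hour for a single instance.
print(instances_needed(2_000_000, 24, 15_000))
```

The per-instance throughput is best measured on your own documents during a PoC, since document length and noise level can change it considerably.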

7. Advanced Extraction Capabilities

Few entity extraction products have the ability to go beyond basic named entities and extract more advanced semantic concepts such as relationships and events. Relationship extraction identifies a semantic association between two entities, such as an affiliation relationship between a person and an organization or a familial relationship between two people.

Event extraction is more complex, generally involving an activity in which entities participate along with a date and location of the event. For example, for the sentence “Salesforce acquired Informatica in 2025,” an entity extraction product should identify the acquiring company (Salesforce), the acquired company (Informatica), and the purchase date (2025). The output will be structured data resembling a database record:

    EVENT_TYPE: Acquisition
    ACQUIRING_ENTITY: Salesforce
    ACQUIRED_COMPANY: Informatica
    DATE: 2025

Event extraction should extract the same information even when it is expressed in linguistically very different ways, for example: “Salesforce’s acquisition of Informatica in 2025 was the result of many years of discussion between the two companies.” Notice that what was expressed by a verb in the first example (“acquired”) is expressed by a noun (“acquisition”) in the second.

Event extraction for this sentence will produce structured output identical to the structured output above. This represents the true power of event extraction: it removes the variability inherent in natural language (here using a noun or a verb to express the same concept). It captures the meaning of the sentence in a uniform structured format that is suitable for further processing by downstream applications such as analytic or visualization tools that require structured data.
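
The idea of mapping different phrasings onto one record can be illustrated with a toy example. The sketch below handles only the two sentences discussed above, using a pair of hand-written regular expressions; it is not how a production event extractor works. The class, pattern, and function names are assumptions made for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquisitionEvent:
    acquiring_entity: str
    acquired_company: str
    date: str
    event_type: str = "Acquisition"


# Toy patterns for the two phrasings; real products rely on far richer
# linguistic analysis than a pair of regular expressions.
VERB_PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) in (?P<year>\d{4})"
)
NOUN_PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+)['’]s acquisition of (?P<target>[A-Z]\w+) in (?P<year>\d{4})"
)


def extract_acquisition(sentence):
    """Return an AcquisitionEvent if either toy pattern matches, else None."""
    for pattern in (VERB_PATTERN, NOUN_PATTERN):
        match = pattern.search(sentence)
        if match:
            return AcquisitionEvent(match["buyer"], match["target"], match["year"])
    return None


verb_form = extract_acquisition("Salesforce acquired Informatica in 2025.")
noun_form = extract_acquisition(
    "Salesforce’s acquisition of Informatica in 2025 was the result of many "
    "years of discussion between the two companies."
)
print(verb_form)
print(verb_form == noun_form)   # both phrasings yield the same record
```

Downstream tools only ever see the AcquisitionEvent record, so they do not need to know which phrasing appeared in the source text.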

To be sure, some use cases may only require named entities. For example, let’s say you are planning to develop a faceted search application where named entities are used as search filters. Faceted search is a natural use case for named entity extraction and is now widespread.
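
A minimal sketch of entity-based facets follows. It assumes each document has already been run through an extraction product and reduced to a list of (entity text, entity type) pairs; the document ids, entities, and function names are invented for the example, and a real search platform would provide its own faceting mechanism.

```python
from collections import defaultdict


def build_facet_index(documents):
    """Map each (entity_type, entity_text) facet value to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, entities in documents.items():
        for entity_text, entity_type in entities:
            index[(entity_type, entity_text)].add(doc_id)
    return index


def filter_documents(index, selected_facets):
    """Return the ids of documents that contain every selected facet value."""
    matching = [index.get(facet, set()) for facet in selected_facets]
    return set.intersection(*matching) if matching else set()


# Made-up extraction output for three documents.
documents = {
    "doc1": [("Salesforce", "ORGANIZATION"), ("Informatica", "ORGANIZATION")],
    "doc2": [("Salesforce", "ORGANIZATION"), ("Jane Smith", "PERSON")],
    "doc3": [("Jane Smith", "PERSON")],
}

index = build_facet_index(documents)
selected = [("ORGANIZATION", "Salesforce"), ("PERSON", "Jane Smith")]
print(sorted(filter_documents(index, selected)))
```

Each additional filter the user selects narrows the result set, which is why extraction errors show up directly as missing or irrelevant search results.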

However, other applications might need advanced extraction such as relationship and event extraction. For instance, in performing due diligence on a potential customer, employee, or contractor, you need to monitor a large amount of news and other text data in order to discover any adverse events (e.g., arrest, lawsuit, indictment) that they may have been involved in.

An organization must think carefully about whether it needs basic or more advanced extraction.

8. Ease of Integration

Entity extraction rarely lives in isolation. It is integrated with other systems that feed it text data or that receive the structured data it outputs. You need a product that integrates easily with your existing stack, including:

    • Data pipelines
    • Search and analytics platforms
    • Workflow tools
    • Databases
    • Etc.

Straightforward APIs, good documentation, and responsive technical support can significantly reduce implementation time.

9. Vendor Stability

Finally, vendor reputation should not be overlooked. Established providers often have a larger customer base, have incorporated many “must haves,” and have addressed many “gotchas” in their products based on a variety of customer feedback. Reviewing case studies and independent benchmarks can provide valuable insight.

“Future proofing” is also essential. Entity extraction evolves rapidly. A product that is state-of-the-art today may become outdated if the vendor does not actively invest in research and development. Assessing things like release frequency and commitment to innovation can help ensure that your entity extraction capabilities remain competitive.

Summary

In sum, we have listed a variety of factors that a potential user of entity extraction may want to consider before making a decision. The technology is no longer just a niche product. It is being used in many large-scale production environments.

As we have emphasized, organizations must evaluate the following for a candidate entity extraction product:

    • Alignment with the intended use case
    • Ontology coverage
    • Customization capabilities
    • Accuracy
    • Noise tolerance
    • Speed & scalability
    • Foreign-language coverage
    • Ease of integration
    • Vendor stability

By conducting thorough due diligence—including pilot testing with real data—decision-makers can select a solution that transforms unstructured text into valuable structured data at a scale that can keep pace with the tsunami of unstructured data that organizations are faced with.