Tackling Entity Resolution Across Resource Directory Data Sets: Lessons From the ServiceNet Team
Contributors:
David Botos, Head of Engineering – Connect 211
Greg Bloom, Director – Open Referral
The potential benefits of merging and collaborating on resource directory datasets are numerous, including better service coverage, improved analytics, reduced maintenance costs, crowdsourced quality, and more.
To make that possible, we need low-effort methods for two things:
- Identifying and verifying duplicates
- Reconciling groups of duplicates into an improved data product
This article reviews our progress solving the first part: identifying real duplicates. This problem is also known as entity resolution.
Since 2022, we have been prototyping solutions that help resource data managers handle duplicates. In 2025 alone, we executed three major iterations of work, rethinking fundamental assumptions about how resource data from multiple enterprises overlaps in the real world and can be reconciled for specific data applications.
Terminology For Our Data Model
This document uses vocabulary from the Human Services Data Schema (HSDS) to describe resource data. HSDS defines organizations (AKA “agencies”), the services (AKA “programs”) they provide, and the locations where services are delivered.
There is also an important layer of indexing, or taxonomization, primarily for services. All enterprise resource data sets rely heavily on indexing to improve navigation and reporting. Indexes are also very helpful to AI solutions that work with resource data, because they summarize what a service does in compact, partly standardized terms.
The Challenge: Assessing “sameness” is subjective.
We’ve discovered that the question “when are entities the same?” is very much a matter of opinion. Take organizations, for example. It’s easy to think of them as recognizable brands with tax identification numbers that make them unique and easy to distinguish. In practice, it is rarely that simple:
- Data about organizations frequently lacks identifiers.
- Programs or departments from governments, municipalities, and other large organizations overlap unevenly.
- When they span large regions, organizations may be treated as separate entities by local data stewards, where much of the data originates.
These are just a few reasons there may not be a single right answer to the question of “what’s the same?”.
We predict that effective entity-resolution and data-collaboration tools will facilitate the gradual convergence of definitions of “what is the same”, but they will never completely converge. Legitimate differences of opinion about the boundaries around entities will remain. Our goal is not to remove opinions, but to improve our capacity to manage them cooperatively.
The role of real-world facts in entity resolution.
Real-world data points such as phone numbers, URLs, email addresses, physical addresses, and external organization identifiers form the basis for identifying “same” entities. Formatting inconsistencies can make these data points difficult to compare, but such inconsistencies are largely resolvable with AI.
We also use AI to determine semantic similarity. In other words, we make an educated guess as to whether two pieces of text, such as names or descriptions, describe the same thing. In this way, we can assess whether two organizations (for example) are the same, even if their names are different.
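To illustrate why formatting matters, here is a minimal sketch of normalizing a few real-world facts so they can be compared across sources. It is a rule-based stand-in written with the Python standard library, not the AI-based approach described in this article. The function names, the sample records, and the assumption of 10-digit North American phone numbers are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Keep digits only; prepend a country code to 10-digit numbers (assumption)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country + digits
    return digits


def normalize_url(raw: str) -> str:
    """Reduce a URL to host + path, dropping scheme, 'www.' and trailing slash."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate  # urlsplit needs a scheme to find the host
    parts = urlsplit(candidate)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def normalize_email(raw: str) -> str:
    """Trim whitespace and lowercase the address."""
    return raw.strip().lower()


def fact_keys(record: dict) -> set:
    """Turn one record into a set of comparable (kind, value) pairs."""
    return {
        ("phone", normalize_phone(record["phone"])),
        ("url", normalize_url(record["url"])),
        ("email", normalize_email(record["email"])),
    }


# Two hypothetical records that describe the same contact details differently.
record_a = {"phone": "(206) 555-0100", "url": "https://www.Example.org/food/", "email": "Info@Example.org "}
record_b = {"phone": "206.555.0100", "url": "example.org/food", "email": "info@example.org"}

shared = fact_keys(record_a) & fact_keys(record_b)
print(sorted(shared))
# [('email', 'info@example.org'), ('phone', '12065550100'), ('url', 'example.org/food')]
```

Simple rules like these break down on extensions, international numbers, and free-text addresses, which is where a more flexible approach becomes useful.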
Apart from this short list of facts, we haven’t found many other data points that are consistent enough across sources to be useful. However, thanks to recent advances in AI, this is enough.
Resolving organizations.
Earlier, we listed a few legitimate reasons why organizations may be defined differently depending on the data steward’s context, and we don’t want to disrupt stewards’ internal organization data structures. However, deduplicating organizations is important for resolving services, our ultimate goal.
One way to address this dilemma is to establish “parent” or “global” organization records that require legal identifiers such as FEINs (where possible), and then associate source organizations as subsidiaries beneath parent records, like an umbrella. This approach preserves the fidelity of individual data sources while also aligning them globally.
It can also create super-clusters of services for large or complex providers, which presents its own challenges – but we think this method is the best way to balance the benefits of aggregation while preventing the collapse of localized distinctions.
We have more to learn here, but this approach appears adequate as we begin to address a higher priority problem: resolving services.
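The parent-record idea described above can be sketched as a small data structure. This is an illustration under stated assumptions, not our production schema: the class names, field names, sample records, and the placeholder FEIN are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceOrganization:
    """An organization record exactly as one data steward defines it."""
    source: str      # which directory the record came from
    source_id: str   # the steward's own identifier, left untouched
    name: str


@dataclass
class ParentOrganization:
    """A 'global' record that source organizations are grouped beneath."""
    name: str
    fein: Optional[str] = None  # legal identifier, where one is available
    members: list = field(default_factory=list)

    def attach(self, org: SourceOrganization) -> None:
        # Source records are linked, never merged or rewritten.
        self.members.append(org)

    def sources(self) -> set:
        return {member.source for member in self.members}


# Hypothetical example: two stewards describe the same provider differently.
parent = ParentOrganization(name="Example Community Services", fein="00-0000000")
parent.attach(SourceOrganization("directory_a", "A-17", "Example Community Services, North Office"))
parent.attach(SourceOrganization("directory_b", "B-204", "Example Comm. Svcs."))

print(sorted(parent.sources()))  # ['directory_a', 'directory_b']
```

Because each source record keeps its own identifier and name, a steward's local structure survives intact while the parent record provides the global alignment.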
Resolving services.
There are many opinions about how to define “service”, as well as “which services are the same.”
For example, some directories define services in a very granular way, while others bundle related concepts together in longer descriptions. This results in the same cluster of activities defined as a single record in one system and five records in another.
To resolve services, we look for three categories of things they have in common:
- A parent organization
- Real-world data like phone numbers, addresses, etc.
- Taxonomies, also known as Service Terms for many 211s
Services have at least one important advantage over organizations: they are usually indexed by one or more taxonomy systems. Taxonomies provide semantically rich information that is less subjective than free-text descriptions. They are also somewhat, if not perfectly, standardized. When two services from different sources share related taxonomy terms, then there is a stronger basis for comparison.
Given these signals, we can reliably create high-likelihood duplicate groups at scale using AI.
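As a rough illustration of how the three signals can combine, the sketch below scores pairs of services that share a parent organization by the overlap of their real-world facts and taxonomy terms. It is a hand-weighted stand-in for the AI-based matching described here, not our actual method. The weights, the threshold, the names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Service:
    service_id: str
    parent_org: str    # identifier of the resolved parent organization
    facts: frozenset   # normalized phones, URLs, addresses, ...
    terms: frozenset   # taxonomy terms the service is indexed under


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of items two sets have in common (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_score(x: Service, y: Service) -> float:
    """Hypothetical score: same parent is required, then facts and terms count equally."""
    if x.parent_org != y.parent_org:
        return 0.0
    return 0.5 * jaccard(x.facts, y.facts) + 0.5 * jaccard(x.terms, y.terms)


def candidate_pairs(services, threshold: float = 0.4):
    """Return (id, id, score) for every pair at or above the threshold."""
    pairs = []
    for x, y in combinations(services, 2):
        score = candidate_score(x, y)
        if score >= threshold:
            pairs.append((x.service_id, y.service_id, round(score, 2)))
    return pairs


services = [
    Service("a-1", "org-1", frozenset({"12065550100", "example.org/food"}), frozenset({"food pantry"})),
    Service("b-7", "org-1", frozenset({"12065550100"}), frozenset({"food pantry", "food bank"})),
    Service("b-9", "org-1", frozenset({"12065550199"}), frozenset({"rent assistance"})),
]

print(candidate_pairs(services))  # [('a-1', 'b-7', 0.5)]
```

Pairs that pass the threshold are only candidates. Comparing every pair also grows quadratically with the number of services, so a larger data set would need some way to narrow the comparisons first.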
Resolving taxonomies.
Unlike other entities studied, taxonomies (also known as “indexing systems”) do not have internal duplication. Within a single taxonomy, each term is (at least in principle) distinct. This makes resolving related taxonomies across different sources more of a mapping or crosswalk process.
Consensus on how terms should be mapped varies, and we must design for multiple concurrent answers. That said, broad adoption of standard taxonomic mappings across major indexing systems is one of the best ways to improve entity resolution.
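One way to picture “multiple concurrent answers” is a crosswalk that records who asserted each mapping instead of storing a single correct target per term. The sketch below is a minimal illustration. The class names, term labels, and directory names are hypothetical and do not come from any real taxonomy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    source_term: str
    target_term: str
    asserted_by: str  # the party that holds this opinion


class Crosswalk:
    """Stores every asserted mapping, so differing opinions can coexist."""

    def __init__(self) -> None:
        self._by_source = defaultdict(list)

    def add(self, mapping: Mapping) -> None:
        self._by_source[mapping.source_term].append(mapping)

    def targets(self, source_term: str, asserted_by: Optional[str] = None) -> set:
        """All target terms for a source term, optionally from one party only."""
        return {
            m.target_term
            for m in self._by_source.get(source_term, [])
            if asserted_by is None or m.asserted_by == asserted_by
        }


crosswalk = Crosswalk()
crosswalk.add(Mapping("local:food-pantry", "standard:food-pantries", "directory_a"))
crosswalk.add(Mapping("local:food-pantry", "standard:food-banks", "directory_b"))

print(sorted(crosswalk.targets("local:food-pantry")))
# ['standard:food-banks', 'standard:food-pantries']
print(crosswalk.targets("local:food-pantry", asserted_by="directory_a"))
# {'standard:food-pantries'}
```

A consumer can then choose whose mappings to trust for a given application, and a widely adopted standard mapping simply becomes one more well-regarded source.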
We Need Human Oversight
Humans remain in the loop for critical steps:
- Verify and correct the AI’s results. AI does get it wrong sometimes, but in our pilots, verifying its results required far less effort than doing the comparison manually.
- Prioritize data in the final, reconciled result based on the subjective requirements of a target audience
Humans are also irreplaceable (it turns out) at the initial step of gathering resource information, but that’s for another article.
What Comes Next?
After identifying which entities are the same (and verifying the results), the next step is to reconcile groups of the same entities so that the resulting outputs, such as a resource directory database, don’t display duplicates to users.
Reconciliation is a complicated topic with many potential solutions that are outside the scope of this article. If you would like to learn more, we describe a solution that we are currently developing in “A Minimal Approach to Deduplicating Services in Resource Directories”.
The State Of This Project
In pilots for Washington State and Illinois, we have built notebooks, iterated on interfaces, launched API tools, trained our own AI models, and generally progressed at a breakneck pace since 2022.
All of this coalesced into a major milestone by the start of 2026: A much-improved user interface for deduplication that is integrated into the enterprise-grade software powering Connect 211’s data orchestration pipelines. Any data in our expansive data system can be quickly earmarked for comparison and deduplication.
Although work is ongoing, this transition marks a shift from experimentation toward large-scale operational impact. Our years of pilots have informed a clearer understanding of how to turn multiple overlapping data sources into a scalable system.
As work continues, we are actively looking for:
- Pilot opportunities
- Funders
- Collaborators
We are unlocking the ability to create supply chains for resource data and reducing the number of overlapping data silos that currently dot our resource directory landscape. The ripple effect is huge.
Please reach out to sky@connect211.com to participate.