
How Identity Graphs Work (And Why They Matter)

Learn how identity graphs power visitor identification, why match rates vary from 5% to 40%, and how to evaluate tools based on the data behind them.

George Gogidze · 16 min read

Every visitor identification tool promises to reveal who’s on your website. Match rates of “up to 30%!” or “identify 20% of traffic!” get thrown around in sales decks like they’re interchangeable numbers.

But they’re not. Match rates across tools range from 5% to 40% — a massive spread that directly impacts how many leads you get from the same traffic. The difference between a 10% match rate and a 35% match rate on 10,000 monthly visitors is 2,500 additional identified leads per month.

So why the gap? The answer isn’t in the pixel, the dashboard, or the sales pitch. It’s in the identity graph.


What Is an Identity Graph?

An identity graph is a database that maps digital signals to real people. Think of it as a massive lookup table that connects the dots between fragmented data points — IP addresses, browser cookies, device fingerprints, email hashes, mobile ad IDs — and resolves them into verified person records.

Here’s a simplified view of how the resolution works:

DIGITAL SIGNALS              IDENTITY RESOLUTION         VERIFIED RECORD
─────────────────           ───────────────────         ─────────────────
IP address: 74.125.x.x  ─┐                             Name: Sarah Chen
Cookie ID: abc123def     ─┤  Cross-reference +          Email: sarah@acme.com
Device ID: D-8827134     ─┼─ validate against  ───────► Company: Acme Corp
Browser fingerprint      ─┤  identity graph              Title: VP Marketing
Email hash: f7c3b...     ─┘                             LinkedIn: /in/sarachen
                                                        Phone: (415) 555-0142

The identity graph doesn’t just store one signal-to-person mapping. It maintains a web of connections across signals. The same person might browse from their work laptop, their phone, and their home desktop. A good identity graph links all three sessions to one person, even though the IP addresses, cookies, and device IDs are all different.

The quality of this graph — how many signals it contains, how accurately they’re linked, and how recently they were verified — determines everything downstream. A visitor identification tool is only as good as the identity graph behind it.
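One way to picture the linking described above is a union-find structure in which every signal is a node and every observed co-occurrence merges two clusters. This is a minimal sketch, not how any particular vendor stores its graph; the class name `IdentityGraph`, its methods, and the sample signal values are all hypothetical and chosen only to mirror the diagram.

```python
class IdentityGraph:
    """Toy identity graph: signals are nodes, observed co-occurrences are links."""

    def __init__(self):
        self._parent = {}  # signal -> parent signal (union-find forest)
        self._person = {}  # cluster root -> person record

    def _find(self, signal):
        # Register unseen signals as their own cluster, then walk to the root.
        self._parent.setdefault(signal, signal)
        while self._parent[signal] != signal:
            # Path halving keeps later lookups short.
            self._parent[signal] = self._parent[self._parent[signal]]
            signal = self._parent[signal]
        return signal

    def link(self, a, b):
        """Record that two signals were observed to belong to the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a == root_b:
            return
        self._parent[root_b] = root_a
        # Carry a person record over to the surviving root.
        # If both clusters already had one, this toy version keeps root_a's.
        record = self._person.pop(root_b, None)
        if record is not None:
            self._person.setdefault(root_a, record)

    def attach_person(self, signal, record):
        self._person[self._find(signal)] = record

    def resolve(self, signal):
        """Return the person record for a signal, or None if it is unknown."""
        if signal not in self._parent:
            return None
        return self._person.get(self._find(signal))


graph = IdentityGraph()
email = ("email_hash", "f7c3b...")
graph.attach_person(email, {"name": "Sarah Chen", "company": "Acme Corp"})
graph.link(email, ("cookie", "abc123def"))   # seen together on the work laptop
graph.link(email, ("device", "D-8827134"))   # seen together on the phone

print(graph.resolve(("device", "D-8827134")))
```

A lookup on any linked signal lands on the same record, which is the property that lets three different sessions resolve to one person. A production graph would also need timestamps, link confidence, and a way to split clusters that were merged by mistake.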


How Visitor Identification Uses Identity Graphs

When someone visits your website, here’s what happens behind the scenes:

Step 1: Signal Collection. The identification pixel fires and captures available signals from the browser session. This includes the visitor’s IP address, user agent string, screen resolution, timezone, language settings, installed fonts, and any existing cookies or device identifiers.

Step 2: Signal Processing. The raw signals are cleaned and normalized. IP addresses are checked against known VPN and proxy ranges. Bot traffic is filtered out. The remaining signals are packaged into a query.

Step 3: Identity Graph Lookup. The query is matched against the identity graph. This is the step where tools diverge most in quality, because the result depends entirely on what the graph contains. The graph searches for matching signal patterns and returns any associated person records.

Step 4: Confidence Scoring. Matches are scored based on how many signals aligned, how recently those signals were verified, and whether the match is deterministic (exact) or probabilistic (inferred). High-confidence matches get returned. Low-confidence matches get filtered out — or, in tools with lower standards, passed through anyway.

Step 5: Record Delivery. The identified visitor record — name, email, company, title, and behavioral data — is delivered to your dashboard, CRM, or webhook in real time.

The entire process takes milliseconds. But the years of data collection, verification, and graph maintenance behind that millisecond lookup are what separate accurate tools from noisy ones.
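Steps 2 through 4 can be sketched in a few functions. Everything here is an illustrative assumption: the in-memory `GRAPH`, the `PROXY_RANGES` list (which uses reserved documentation IP ranges), the bot check, and the `MIN_CONFIDENCE` threshold are hypothetical, and the score is simply the share of submitted signals that agree on one person. Real scoring would also weigh recency and match type, as Step 4 describes.

```python
import ipaddress

# Hypothetical data, for illustration only.
PROXY_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # stand-in for known VPN/proxy ranges
GRAPH = {
    ("cookie", "abc123def"): "person-1",
    ("device", "D-8827134"): "person-1",
    ("ip", "198.51.100.7"): "person-2",
}
MIN_CONFIDENCE = 0.5


def process(raw):
    """Step 2: drop bot traffic, normalize signals, discard proxy IPs."""
    if "bot" in raw.get("user_agent", "").lower():
        return None
    signals = []
    for kind in ("cookie", "device"):
        if raw.get(kind):
            signals.append((kind, raw[kind].strip()))
    ip = raw.get("ip")
    if ip and not any(ipaddress.ip_address(ip) in net for net in PROXY_RANGES):
        signals.append(("ip", ip))
    return signals


def lookup(signals):
    """Step 3: count how many submitted signals point to each known person."""
    votes = {}
    for signal in signals:
        person = GRAPH.get(signal)
        if person is not None:
            votes[person] = votes.get(person, 0) + 1
    return votes


def score(votes, signals):
    """Step 4: keep the best candidate only if enough signals agree on it."""
    if not votes:
        return None
    person, hits = max(votes.items(), key=lambda item: item[1])
    confidence = hits / len(signals)
    return (person, confidence) if confidence >= MIN_CONFIDENCE else None


def identify(raw):
    signals = process(raw)
    if not signals:
        return None
    return score(lookup(signals), signals)


visit = {"user_agent": "Mozilla/5.0", "cookie": "abc123def",
         "device": "D-8827134", "ip": "192.0.2.10"}
print(identify(visit))
```

In this example two of the three submitted signals point to `person-1`, so the match clears the threshold. Lowering `MIN_CONFIDENCE` is the code-level equivalent of a tool passing low-confidence matches through anyway.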


Build vs Buy: The Reason Match Rates Differ

Here’s the insight most vendors don’t want you to know: the majority of visitor identification tools don’t build their own identity graph. They license access to third-party data providers.

The most common third-party identity graphs come from companies like LiveRamp, Tapad, Oracle Data Cloud, and a handful of smaller data cooperatives. These are legitimate, well-maintained graphs — but they’re shared infrastructure. Any tool can license them.

This creates an obvious problem. If five different visitor ID tools are all querying the same underlying graph, they’re going to return roughly the same matches. The pixel implementation might look different. The dashboard might be prettier. The pricing might vary. But the core identification capability is nearly identical.

This is why so many tools cluster around 10-20% match rates. They’re not bad tools — they’re just constrained by the same data source.

What happens when a tool builds its own graph

Building a proprietary identity graph is a fundamentally different approach. Instead of licensing someone else’s data, you invest in:

  • Direct data partnerships with publishers, apps, and platforms that generate first-party identity data
  • Proprietary data collection across consented touchpoints
  • Custom matching algorithms optimized for your specific use case (in this case, website visitor identification)
  • Continuous verification to keep records fresh and accurate

This is harder, slower, and more expensive than licensing. It requires years of relationship-building with data partners and constant engineering investment to maintain data quality. But the payoff is significant: match rates of 30-40% compared to 10-20% on licensed graphs.

At Leadpipe, we build and maintain our own identity graph. It’s the single biggest investment we make as a company, and it’s the reason our match rates consistently outperform tools that rely on third-party data. We’re not smarter at writing pixels — we have a better graph.


Deterministic vs Probabilistic Matching

Not all identity resolution is created equal. The two primary approaches — deterministic and probabilistic — produce very different results.

Deterministic Matching

Deterministic matching requires a verified, exact link between a digital signal and a person. For example:

  • A user logged into a site with their email, and that email is linked to their name and company
  • A hashed email from an ad platform matches a hashed email in the identity graph
  • A device ID was registered during an app signup that included verified contact information

Deterministic matches are high-confidence. When a tool tells you “this visitor is Sarah Chen from Acme Corp” based on deterministic data, there’s strong evidence backing that claim. The trade-off is that deterministic graphs are harder to build. Every record needs a verified connection — you can’t just infer it.
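The hashed-email case is the easiest of these to sketch. The example below assumes SHA-256 over a trimmed, lowercased address, which is a common convention but not one the article specifies; the `graph` dictionary and function names are hypothetical.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize, then SHA-256 hash, so both sides hash identical input."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical graph keyed by hashed email.
graph = {
    hash_email("sarah@acme.com"): {"name": "Sarah Chen", "company": "Acme Corp"},
}


def deterministic_match(incoming_hash: str):
    """Exact lookup: the hash is either present or there is no match."""
    return graph.get(incoming_hash)


print(deterministic_match(hash_email("  Sarah@Acme.com ")))  # same address, different formatting
print(deterministic_match(hash_email("sara@acme.com")))      # one character off: no match
```

There is no "close enough" here. A one-character difference yields a completely different hash, which is why deterministic matches are reliable and also why they cover less traffic.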

Probabilistic Matching

Probabilistic matching uses statistical inference to predict identity based on behavioral patterns. Instead of a verified link, the system looks at signals like:

  • This IP address is associated with a company office in San Francisco
  • The browsing behavior matches patterns from a known user segment
  • The device fingerprint is statistically similar to a previously identified device
  • The combination of timezone, language, and screen resolution narrows the candidate pool

Probabilistic matching can cover more traffic because it doesn’t need exact matches. But it introduces false positives. The system might say “this is probably Sarah Chen” when it’s actually her coworker using a similar device on the same network.
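The coworker scenario can be reproduced with a toy similarity score. This is a sketch under simple assumptions: similarity is the share of fingerprint attributes that agree, and the attribute names, values, and 0.7 threshold are all made up for illustration. Real systems use weighted, statistically trained models.

```python
def similarity(a: dict, b: dict) -> float:
    """Share of shared fingerprint attributes on which two sessions agree."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


# Hypothetical fingerprints for two people on the same office network.
known_devices = {
    "Sarah Chen": {"timezone": "America/Los_Angeles", "language": "en-US",
                   "screen": "2560x1440", "fonts": "set-A"},
    "Coworker": {"timezone": "America/Los_Angeles", "language": "en-US",
                 "screen": "2560x1440", "fonts": "set-B"},
}


def candidates(session: dict, threshold: float = 0.7):
    """Every known device that clears the threshold, best match first."""
    scored = [(similarity(session, fp), name) for name, fp in known_devices.items()]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# A session that actually belongs to the coworker.
session = {"timezone": "America/Los_Angeles", "language": "en-US",
           "screen": "2560x1440", "fonts": "set-B"}
print(candidates(session))
```

Both people clear the threshold, with Sarah Chen at 0.75. If the coworker were missing from the graph, Sarah Chen would be the top candidate for a session that is not hers, which is exactly the false positive described above.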

Why This Distinction Matters

Some visitor identification tools rely heavily on probabilistic matching to inflate their match rates. A tool claiming a 25% match rate sounds impressive until you realize that half of those matches are probabilistic guesses with a 60% confidence score.

When evaluating tools, always ask: what percentage of your matches are deterministic vs probabilistic? A tool with a 15% match rate based on 90% deterministic data will give you more actionable leads than a tool with a 25% match rate based on 50% probabilistic data.
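To make that comparison concrete, here is the arithmetic on a hypothetical 10,000 monthly visitors (the same figure used in the introduction). The function name and integer-percent arguments are illustrative choices.

```python
def deterministic_leads(visitors: int, match_pct: int, deterministic_pct: int) -> int:
    """Matches backed by deterministic data, using whole-number percentages."""
    return visitors * match_pct * deterministic_pct // 10_000


tool_a = deterministic_leads(10_000, match_pct=15, deterministic_pct=90)
tool_b = deterministic_leads(10_000, match_pct=25, deterministic_pct=50)

print(tool_a)  # 1350
print(tool_b)  # 1250
```

The tool with the lower headline match rate delivers more deterministic matches (1,350 versus 1,250), and far fewer guesses mixed in with them.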

Leadpipe prioritizes deterministic matching. We’d rather give you 100 verified leads than 200 “maybe” leads. Our identity graph is built on verified linkages, not statistical guesses.


Why “Match Rate” Numbers Can Be Misleading

Match rate is the most-cited metric in visitor identification, but it’s also the most manipulated. Here’s why you should be skeptical of any vendor’s self-reported number.

Different tools measure differently

There’s no industry standard for how match rate is calculated. Some tools measure:

  • Unique visitors matched — the percentage of distinct people identified out of total unique visitors
  • Total sessions matched — the percentage of total page views or sessions where a match was found (this inflates the number because returning visitors get counted multiple times)
  • Company-level matches — identifying the company, not the person (much easier, much less useful)
  • Person-level matches — identifying the actual individual (harder, more valuable)

A tool reporting a “30% match rate” on company-level identification is not comparable to a tool reporting a “15% match rate” on person-level identification. The 15% tool may actually be delivering more value because you’re getting actionable contact records, not just company names.
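The gap between session-based and unique-visitor match rates is easy to see on a small log. The data below is invented for illustration: one returning visitor who is matched on each of four sessions, and four one-time visitors who are never matched.

```python
# Hypothetical session log: (visitor_id, was_matched)
log = [
    ("v1", True), ("v1", True), ("v1", True), ("v1", True),
    ("v2", False), ("v3", False), ("v4", False), ("v5", False),
]


def session_match_rate(log):
    """Share of sessions with a match; returning visitors count every time."""
    return sum(matched for _, matched in log) / len(log)


def unique_match_rate(log):
    """Share of distinct visitors matched at least once."""
    visitors = {visitor for visitor, _ in log}
    matched = {visitor for visitor, was_matched in log if was_matched}
    return len(matched) / len(visitors)


print(f"{session_match_rate(log):.0%}")  # 50%
print(f"{unique_match_rate(log):.0%}")   # 20%
```

Same tool, same traffic, same single identified person, yet one calculation reports 50% and the other 20%.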

Traffic composition changes everything

Match rates are heavily influenced by your traffic mix:

Factor           Higher Match Rate     Lower Match Rate
──────────────   ───────────────────   ─────────────────────────
Geography        US traffic            International traffic
Audience type    B2B (office IPs)      B2C (residential/mobile)
Device type      Desktop               Mobile
Network type     Corporate networks    VPNs, residential ISPs
Traffic source   Direct, organic       Social, paid display

A tool might legitimately match 35% of your US desktop traffic while only matching 5% of your international mobile traffic. If your traffic is 80% US desktop, your overall match rate looks great. If it’s 50% international mobile, the same tool looks terrible — even though nothing changed on the tool’s end.
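The overall rate is just a weighted average of the segment rates. The sketch below uses the 35% and 5% figures from the paragraph above and assumes, for simplicity, that traffic consists of only those two segments.

```python
def blended_rate(mix):
    """mix: list of (traffic_share, segment_match_rate); shares should sum to 1."""
    return sum(share * rate for share, rate in mix)


US_DESKTOP, INTL_MOBILE = 0.35, 0.05  # segment match rates from the example

mostly_us_desktop = [(0.80, US_DESKTOP), (0.20, INTL_MOBILE)]
half_international = [(0.50, US_DESKTOP), (0.50, INTL_MOBILE)]

print(f"{blended_rate(mostly_us_desktop):.0%}")   # 29%
print(f"{blended_rate(half_international):.0%}")  # 20%
```

The tool's per-segment performance is identical in both cases. Only the traffic mix changed, and the headline number moved by nine points.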

The RB2B problem

Some tools have structural limitations that are rarely disclosed upfront. RB2B, for example, only matches visitors against LinkedIn profiles. If a visitor doesn’t have a LinkedIn account — or their LinkedIn data doesn’t appear in RB2B’s dataset — they’re invisible. This means entire segments of your traffic (people without LinkedIn, users with privacy settings enabled, international visitors with lower LinkedIn penetration) simply can’t be matched, regardless of how good your traffic quality is.


How to Actually Compare Visitor ID Tools

Skip the vendor claims. Here’s a framework for evaluating identity graphs and match rates based on your actual data.

1. Run a Side-by-Side Test on Your Traffic

Install two or three tools simultaneously on your site. Run them for at least two weeks, and ideally four, so the sample is large enough to compare. Compare:

  • Total unique visitors identified (person-level, not company-level)
  • Match rate as a percentage of unique visitors
  • Data completeness per match (do you get email, phone, title, or just a name?)
  • Accuracy of returned records (spot-check a sample against LinkedIn, company websites)

2. Ask the Right Questions

Before you commit to a tool, ask:

  • Do you build your own identity graph, or do you license third-party data? If they license, ask which providers. If they won’t answer, that tells you something.
  • What percentage of matches are deterministic vs probabilistic? If they can’t answer this question clearly, their graph likely leans probabilistic.
  • How often is your identity graph refreshed? Data decays fast. People change jobs, companies, and email addresses. A graph that’s refreshed monthly is significantly better than one refreshed quarterly.
  • What’s your coverage outside the US? If you have international traffic, this matters enormously. Most identity graphs are heavily US-weighted.

3. Calculate Cost Per Verified Lead

Match rate alone doesn’t tell you the full story. Calculate the cost per verified, actionable lead:

Cost per verified lead = Monthly tool cost / Number of person-level matches with valid email

A cheaper tool with a lower match rate might cost more per lead than an expensive tool with a higher match rate. A $500/month tool that identifies 200 people costs $2.50 per lead. A $200/month tool that identifies 50 people costs $4.00 per lead. The “expensive” tool is actually cheaper per lead.
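The formula is simple enough to drop into a spreadsheet or a few lines of code. The function name is illustrative, and the zero-lead case returning infinity is an assumption about how you might want to handle it.

```python
def cost_per_verified_lead(monthly_cost: float, verified_leads: int) -> float:
    """Monthly tool cost divided by person-level matches with a valid email."""
    if verified_leads == 0:
        return float("inf")  # no usable leads: cost per lead is unbounded
    return monthly_cost / verified_leads


print(cost_per_verified_lead(500, 200))  # 2.5
print(cost_per_verified_lead(200, 50))   # 4.0
```

Be strict about the denominator: count only person-level matches with a valid email, not every row the tool returns.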

4. Check Data Freshness

Ask for a sample export and cross-reference the job titles, companies, and emails against current LinkedIn profiles. If 30% of the records show outdated information (people who’ve changed jobs, defunct email addresses), the identity graph isn’t being maintained. Freshness is a proxy for data quality.


The Bottom Line

The identity graph is the engine behind every visitor identification tool. Everything else — the pixel, the dashboard, the integrations — is just the car body. A beautiful car with a weak engine won’t get you far.

When evaluating visitor identification tools, look past the surface features and ask about the data. Who built the graph? How is it maintained? Is the matching deterministic or probabilistic? How does it perform on traffic that looks like yours?

The tools that invest in building and maintaining their own identity graphs will consistently outperform the ones that license shared data. It’s not glamorous work, but it’s the work that determines whether you identify 10% or 35% of your traffic — and that difference compounds into thousands of additional leads every month.


Frequently Asked Questions

What is an identity graph in visitor identification?
An identity graph is a database that maps digital signals (such as IP addresses, browser cookies, device fingerprints, and email hashes) to verified person records. It serves as the lookup table that visitor identification tools query to determine who is visiting your website. The quality and size of the identity graph directly determine the tool's match rate and accuracy.
Why do different visitor ID tools have such different match rates?
Most visitor identification tools license the same third-party identity graphs from providers like LiveRamp or Tapad, which limits them to 10-20% match rates. Tools that build their own proprietary identity graphs — investing in direct data partnerships and custom matching algorithms — can achieve 30-40% match rates because they have access to more signals and can optimize specifically for visitor identification.
What is the difference between deterministic and probabilistic matching?
Deterministic matching relies on verified, exact linkages between digital signals and real people — such as a hashed email matched to a known contact record. It produces high-confidence results but covers less traffic. Probabilistic matching uses statistical inference to predict identity based on behavioral patterns and signal combinations. It covers more traffic but introduces false positives. The best tools prioritize deterministic matching for accuracy.
How can I tell if a visitor ID tool builds its own identity graph?
Ask the vendor directly: do you build your own identity graph, or do you license third-party data? If they license, ask which providers they use. You can also ask what percentage of their matches are deterministic vs probabilistic, and how frequently their graph is refreshed. Tools that build their own graphs can usually answer these questions in detail, while resellers tend to be vague.
What is a good match rate for visitor identification software?
Match rate depends heavily on your traffic composition — US desktop traffic matches at higher rates than international mobile traffic. For person-level identification (not just company-level), tools using third-party identity graphs typically achieve 10-20%. Tools with proprietary graphs can reach 30-40% on US traffic. Always compare match rates by running side-by-side tests on your own traffic rather than relying on vendor-reported numbers.
