Identity Graph APIs: How They Work (Technical Guide)

How identity graphs store, match, and resolve person data. Graph structure, matching algorithms, confidence scoring, and API patterns explained for developers.

Nicolas Canal · 13 min read

Every visitor identification tool, every data enrichment API, every AI sales agent that knows who’s on your website - they all run on the same underlying data structure: an identity graph.

But what actually IS an identity graph? How does it store data? How does matching work? And why do some graphs produce 82% accuracy while others hit 40%?

If you’re evaluating identity APIs, building on top of one, or just want to understand why your enrichment results sometimes disagree with each other, this is the guide. We’re going deep on the data structures, matching algorithms, confidence scoring, and API patterns that make identity resolution work.


What Is an Identity Graph (Technical Definition)

An identity graph is a data structure that maps fragmented identifiers - email addresses, phone numbers, device IDs, IP addresses, cookies, behavioral signals - to unified person profiles.

That’s the textbook definition. Here’s what it means in practice.

Every person leaves a trail of identifiers across the internet. They use a work email on LinkedIn, a personal email on Amazon, a phone number at checkout, and a different browser on their tablet than their laptop. These are all separate data points, and without something connecting them, they look like different people.

An identity graph connects them. It says: “This email, this phone number, this device fingerprint, and this cookie all belong to the same human being.”

The graph itself is a network data structure (technically a labeled multigraph in most implementations). Each identifier is a node. Each verified or inferred connection between identifiers is an edge. At the center, collapsed from all connected nodes, sits a unified person profile.

This is the engine behind identity resolution. It’s what transforms an anonymous website visit into a name, email, company, and job title.


Graph Structure: Nodes, Edges, and Profiles

Let’s get into the actual architecture. Identity graphs have three layers.

Layer 1: Nodes (Identifiers)

Nodes represent individual data points. Every graph stores some combination of these:

  • Email addresses - hashed and unhashed, personal and professional
  • Phone numbers - mobile, work, home
  • Device IDs - mobile advertising IDs (MAID), IDFA, GAID
  • IP addresses - with ISP and geolocation metadata
  • Cookies - first-party and third-party (where still available)
  • Browser fingerprints - composite identifiers from user agent, screen resolution, installed fonts, timezone
  • Authentication tokens - hashed login credentials from SSO, OAuth
  • Social handles - LinkedIn URLs, Twitter handles
  • Physical addresses - mailing addresses linked to name/phone from public records

Each node has metadata attached: when it was first seen, when it was last verified, what source contributed it, and a quality score.
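As a rough illustration of a node and its metadata, here is a minimal Python sketch. The class name, field names, and the 0.0-1.0 quality scale are assumptions made for this example, not any provider's actual schema. The `hash_email` helper shows the common practice of normalizing an email (trim, lowercase) before hashing it.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def hash_email(email: str, algorithm: str = "sha256") -> str:
    """Normalize an email (trim, lowercase) and return its hex digest."""
    normalized = email.strip().lower()
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class IdentifierNode:
    kind: str             # e.g. "hashed_email", "phone", "device_id", "ip"
    value: str            # raw or hashed identifier value
    first_seen: date      # when the identifier first appeared
    last_verified: date   # when it was last confirmed
    source: str           # which data source contributed it
    quality_score: float  # hypothetical 0.0-1.0 scale


node = IdentifierNode(
    kind="hashed_email",
    value=hash_email("  Jane.Doe@Example.com "),
    first_seen=date(2023, 1, 15),
    last_verified=date(2024, 6, 1),
    source="partner_login",
    quality_score=0.9,
)
```

Normalizing before hashing matters because two spellings of the same address would otherwise produce two unrelated nodes.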

Layer 2: Edges (Connections)

Edges are the connections between nodes. This is where the magic - and the complexity - lives.

An edge says: “We have evidence that Node A and Node B belong to the same person.” The strength of that evidence determines the edge weight (more on this in the confidence scoring section).

Strong edges come from verified, deterministic data:

  • Email + phone from the same user registration
  • Email + device from an authenticated login
  • Name + address from public records filings

Weak edges come from statistical inference:

  • IP address + browser fingerprint (could be a shared office)
  • Cookie + device (could be a shared computer)
  • Behavioral patterns (similar browsing, same times, same locations)

The critical design decision in any identity graph is how many weak edges you’re willing to trust. Too few, and your match rate drops. Too many, and you start merging separate people into the same profile. That’s where deterministic vs probabilistic matching comes in.
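To make the strong-edge / weak-edge distinction concrete, here is a small sketch of a weighted multigraph in Python. The `Edge` and `IdentityGraph` names, the node key format, and the 0.0-1.0 weight scale are all assumptions for illustration. Real systems usually sit on a graph database rather than an in-memory dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    a: str         # node key, e.g. "email:jane@example.com"
    b: str
    evidence: str  # label describing where the link came from
    weight: float  # assumed scale: 0.0 (weak inference) to 1.0 (verified)


class IdentityGraph:
    def __init__(self) -> None:
        # One list per node allows parallel edges between the same two
        # nodes (a multigraph), each with its own evidence label.
        self.adjacency = defaultdict(list)

    def add_edge(self, edge: Edge) -> None:
        self.adjacency[edge.a].append(edge)
        self.adjacency[edge.b].append(edge)

    def neighbors(self, node: str, min_weight: float = 0.0) -> set:
        """Nodes linked to `node` by at least one edge >= min_weight."""
        linked = set()
        for edge in self.adjacency.get(node, []):
            if edge.weight >= min_weight:
                linked.add(edge.b if edge.a == node else edge.a)
        return linked


graph = IdentityGraph()
graph.add_edge(Edge("email:jane@example.com", "phone:+15550100", "registration", 0.95))
graph.add_edge(Edge("email:jane@example.com", "device:abc123", "authenticated_login", 0.90))
graph.add_edge(Edge("device:abc123", "ip:203.0.113.7", "ip_fingerprint", 0.40))

strong_only = graph.neighbors("device:abc123", min_weight=0.8)
everything = graph.neighbors("device:abc123")
```

Raising `min_weight` is the code-level version of trusting fewer weak edges: `strong_only` drops the IP link that `everything` keeps.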

Layer 3: Profiles (Unified Records)

A profile is the output - the unified person record created by traversing the graph from a starting node and collecting all connected data.

When you query an identity graph API with a device fingerprint, the system finds that node, walks the edges, and assembles the profile. A typical resolved profile might include:

  • Full name
  • Work email and personal email
  • Direct phone number
  • Job title and department
  • Company name, size, industry, revenue
  • LinkedIn URL
  • Physical address
  • Last seen timestamp
  • Confidence score
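The "find the node, walk the edges, assemble the profile" step can be sketched as a breadth-first traversal. The toy graph, the node key format, and the 0.8 weight cutoff below are assumptions for this example. A production system would also merge attributes and attach a confidence score.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (neighbor, edge_weight).
EDGES = {
    "fingerprint:f1": [("email:jane@acme.example", 0.92)],
    "email:jane@acme.example": [
        ("fingerprint:f1", 0.92),
        ("phone:+15550100", 0.95),
        ("ip:203.0.113.7", 0.35),
    ],
    "phone:+15550100": [("email:jane@acme.example", 0.95)],
    "ip:203.0.113.7": [("email:jane@acme.example", 0.35)],
}


def resolve_profile(start: str, min_weight: float = 0.8) -> dict:
    """Breadth-first walk from `start`, following only edges >= min_weight."""
    if start not in EDGES:
        return {}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, weight in EDGES.get(node, []):
            if weight >= min_weight and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Group the connected identifiers by kind ("email", "phone", ...).
    profile = {}
    for node in sorted(seen):
        kind, _, value = node.partition(":")
        profile.setdefault(kind, []).append(value)
    return profile


profile = resolve_profile("fingerprint:f1")
```

With the default cutoff, the weakly linked IP address stays out of the profile. Lowering `min_weight` pulls it in.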

Leadpipe returns 100+ data points per person - the most comprehensive profile available from any self-serve identity API. This includes personal demographics (age, gender, income, net worth), social profiles (LinkedIn, Facebook), professional history, digital identifiers (HEMs, cookies, device IDs), and intent data. That combination of “who they are,” “how to reach them across channels,” and “what they care about” is what makes the data actionable for sales teams.


How Data Enters the Graph

An identity graph is only as good as its data sources. There are five primary ways data gets into a graph, and each has trade-offs.

1. First-Party Pixels and Tags

A JavaScript pixel installed on websites captures visitor signals in real time. When a user interacts with a site - browses pages, fills a form, clicks a link - the pixel collects device data, behavioral signals, and any authentication events.

This is the foundation of most visitor identification tools. The pixel provides the “ask” (who is this visitor?), and the graph provides the “answer.”

2. Data Partnerships

Identity graph providers build partnerships with data aggregators, publishers, and platforms that contribute identity data. These partnerships are typically structured as licensing agreements where the provider pays for access to verified identity linkages.

For example, a provider might partner with a consumer data aggregator that has verified name-email-phone connections from loyalty programs, warranty registrations, or account signups.

3. Public Records

Public records filings - voter registration, property records, business filings, professional licensing - provide verified connections between names, addresses, and phone numbers. This data is particularly valuable because it’s independently verified by government agencies.

Enterprise providers like Experian and TransUnion have deep access to public records data, which is one reason their graphs tend to be the largest (and most expensive).

4. User-Contributed Data

Some systems collect data directly from user interactions. When users register, log in, or authenticate across partner sites, those events create verified linkages in the graph.

5. Co-op Models

Companies like 5x5 Data use a cooperative model where members contribute their own first-party data to a shared graph and, in return, get access to the full dataset. The advantage is network effects - more members means more data. The downside is that data quality depends on what members contribute, and you’re required to share your own data to participate.


Matching Algorithms Explained

This is the core of how identity graphs work. When a query comes in - “who is this visitor?” - the graph runs a matching algorithm to find the best answer. There are four main approaches.

Matching Method Comparison

Criteria            | Deterministic      | Probabilistic | Hybrid               | ML-Powered
Accuracy            | 95%+               | 70-80%        | 85-90%               | Varies
Coverage            | 25-40%             | 50-70%        | 50-70%               | Varies
False Positives     | <5%                | 10-30%        | 5-15%                | 5-15%
Safe for Automation | Yes                | No            | Partial              | Partial
Used By             | Leadpipe, LiveRamp | Lotame, 5x5   | Experian, TransUnion | Amperity

Deterministic Matching

Deterministic matching requires an exact match against a verified identifier. No guessing. No inference. The system either has a verified connection or it returns nothing.

How it works:

  1. Visitor hits your website. The pixel captures signals (IP, device fingerprint, cookies).
  2. These signals are compared against the graph’s verified linkages.
  3. If the visitor’s device fingerprint matches a device that’s been authenticated (e.g., logged into a partner site with a verified email), the system returns the identity.
  4. If no verified match exists, no result is returned.
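The four steps above reduce to an exact lookup that is allowed to fail. A minimal sketch, assuming a hypothetical table of verified device-to-person links (the IDs and field names are made up for illustration):

```python
from typing import Optional

# Hypothetical verified linkages: device fingerprint -> person ID, recorded
# only when the device was seen in an authenticated event.
VERIFIED_DEVICE_LINKS = {
    "fp_9c1d": "person_001",
    "fp_77ab": "person_002",
}


def deterministic_match(signals: dict) -> Optional[str]:
    """Return a person ID only on an exact match against a verified link."""
    fingerprint = signals.get("device_fingerprint")
    if fingerprint is None:
        return None
    # Exact dictionary lookup: no scoring and no nearest-neighbor guess.
    return VERIFIED_DEVICE_LINKS.get(fingerprint)


known = deterministic_match({"device_fingerprint": "fp_9c1d", "ip": "203.0.113.7"})
unknown = deterministic_match({"device_fingerprint": "fp_0000", "ip": "203.0.113.7"})
```

The important property is the `None` path: an unrecognized fingerprint yields no result rather than a best guess.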

Pros:

  • High accuracy - false positive rates below 5%
  • Reliable contact data - the email/phone returned are verified
  • Safe for automation - you can feed results directly to an AI SDR or email sequence

Cons:

  • Lower coverage - typically resolves 25-40% of traffic
  • Requires a large graph of verified connections, which is expensive to build and maintain

Who uses it: Leadpipe uses deterministic matching with a 4.44B profile graph, which delivers high accuracy with broader coverage than most deterministic-only systems.

Probabilistic Matching

Probabilistic matching uses statistical inference from weak signals to estimate identity. Instead of requiring a verified connection, it calculates the probability that a combination of signals belongs to a specific person.

How it works:

  1. The system collects available signals: IP address, device fingerprint, browser configuration, behavioral patterns, time-of-day data.
  2. Each signal is weighted based on its uniqueness and reliability.
  3. A statistical model calculates the probability that this combination of signals belongs to each candidate identity in the graph.
  4. If the probability exceeds a threshold (typically 70-85%), the system returns the match.
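As a simplified sketch of these steps, the example below scores each candidate by the weighted share of signals that agree and returns the best candidate only if it clears a threshold. The weights, signal names, and the 0.75 threshold are assumptions. Real systems use proper statistical models rather than a weighted overlap, so treat the score as a stand-in for a probability.

```python
from typing import Optional

# Hypothetical weights: more unique signals count for more.
SIGNAL_WEIGHTS = {"ip": 2, "device_fingerprint": 5, "timezone": 1, "behavior": 2}


def match_score(observed: dict, candidate: dict) -> float:
    """Weighted share of signals on which visitor and candidate agree."""
    total = sum(SIGNAL_WEIGHTS.values())
    agreed = sum(
        weight
        for name, weight in SIGNAL_WEIGHTS.items()
        if name in observed and observed[name] == candidate.get(name)
    )
    return agreed / total


def probabilistic_match(observed: dict, candidates: dict,
                        threshold: float = 0.75) -> Optional[str]:
    """Return the best-scoring candidate ID if it clears the threshold."""
    best_id, best_score = None, 0.0
    for person_id, known_signals in candidates.items():
        score = match_score(observed, known_signals)
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id if best_score >= threshold else None


visitor = {"ip": "203.0.113.7", "device_fingerprint": "f1",
           "timezone": "UTC-5", "behavior": "b1"}
candidates = {
    "person_001": {"ip": "203.0.113.7", "device_fingerprint": "f1",
                   "timezone": "UTC-5", "behavior": "b2"},
    "person_002": {"ip": "203.0.113.7", "device_fingerprint": "f2",
                   "timezone": "UTC-5", "behavior": "b3"},
}
result = probabilistic_match(visitor, candidates)
```

Note that `person_002` shares an IP and timezone with the visitor, which is exactly the shared-office case that produces false positives when the threshold is set too low.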

Pros:

  • Higher coverage - can resolve 50-70% of traffic
  • Works even without authenticated events
  • Catches visitors that deterministic methods miss

Cons:

  • Higher false positive rates (10-30%)
  • Contact data may be stale or incorrect
  • Risky to automate against - sending personalized email to the wrong person is worse than sending no email

Who uses it: Many mid-market tools rely heavily on probabilistic matching to inflate their match rate numbers. This is why independent accuracy testing is so important - a 60% match rate with 40% accuracy means only 24% of your total visitors are correctly identified, and 60% of the records you were handed point to the wrong person.

Hybrid Matching

Hybrid approaches run deterministic first, then fall back to probabilistic for unmatched visitors. The idea is to get the best of both worlds: high-confidence matches where possible, statistical inference for the rest.

How it works:

  1. First pass: deterministic matching against verified linkages.
  2. For unmatched visitors, second pass: probabilistic scoring.
  3. Results are tagged with the matching method used, so downstream systems can handle each differently (e.g., auto-outreach for deterministic, manual review for probabilistic).
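A minimal sketch of the two-pass flow, with the matching method tagged on every result so downstream code can branch on it. The matcher functions here are toy stand-ins, and the action names are assumptions for illustration.

```python
from typing import Callable, Optional

Matcher = Callable[[dict], Optional[str]]


def hybrid_match(signals: dict, deterministic: Matcher,
                 probabilistic: Matcher) -> Optional[dict]:
    """Try the deterministic pass first, then fall back to probabilistic."""
    person_id = deterministic(signals)
    if person_id is not None:
        return {"person_id": person_id, "method": "deterministic"}
    person_id = probabilistic(signals)
    if person_id is not None:
        return {"person_id": person_id, "method": "probabilistic"}
    return None


def next_action(match: Optional[dict]) -> str:
    """Route each result according to how it was matched."""
    if match is None:
        return "no_action"
    if match["method"] == "deterministic":
        return "auto_outreach"
    return "manual_review"


# Toy matchers standing in for the real passes.
def toy_deterministic(signals: dict) -> Optional[str]:
    return {"fp_1": "person_001"}.get(signals.get("device_fingerprint"))


def toy_probabilistic(signals: dict) -> Optional[str]:
    return "person_042" if signals.get("ip") == "203.0.113.7" else None


verified = hybrid_match({"device_fingerprint": "fp_1"}, toy_deterministic, toy_probabilistic)
inferred = hybrid_match({"ip": "203.0.113.7"}, toy_deterministic, toy_probabilistic)
```

Keeping the `method` tag on the record is what makes it possible to segment actions by match confidence later.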

Who uses it: LiveRamp’s AbiliTec system and Experian’s identity solutions both use hybrid approaches. The challenge is that the probabilistic layer still produces false positives, and most buyers don’t segment their actions based on match confidence.

ML-Powered Matching

Some providers use machine learning models trained on labeled identity data to improve matching accuracy over time.

How it works:

  1. A supervised model is trained on millions of known identity linkages (ground truth data).
  2. The model learns which combinations of weak signals are most predictive of identity.
  3. Over time, the model improves as it processes more matches and receives feedback.

Amperity claims to use 45 different algorithms in their matching pipeline. The ML approach can outperform simple probabilistic scoring, but it requires massive amounts of training data and constant model maintenance.
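To show the supervised idea at toy scale, the sketch below trains a tiny logistic regression with plain gradient descent on hand-made labeled pairs. It is a stand-in, not what any provider actually runs: the features, labels, learning rate, and epoch count are all assumptions, and real pipelines use far richer features and models.

```python
import math

# Hypothetical labeled pairs. Features: (same_ip, same_fingerprint,
# same_timezone). Label: 1 if the two records are the same person.
TRAINING = [
    ((1, 1, 1), 1), ((0, 1, 1), 1), ((1, 1, 0), 1),
    ((1, 0, 1), 0), ((0, 0, 1), 0), ((1, 0, 0), 0), ((0, 0, 0), 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list, bias: float, features: tuple) -> float:
    """Estimated probability that the pair is the same person."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(data: list, epochs: int = 2000, lr: float = 0.5) -> tuple:
    """Fit weights with stochastic gradient descent on the log loss."""
    weights = [0.0, 0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


weights, bias = train(TRAINING)
```

In this toy data the fingerprint feature separates the labels perfectly, so the model learns to weight it most heavily. That is the "which weak signals are most predictive" step in miniature.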


Confidence Scoring

Every identity graph assigns a confidence score to its matches. Understanding how scoring works is critical for building reliable systems on top of identity APIs.

How Scores Are Calculated

Confidence scores typically range from 0 to 100 (or 0.0 to 1.0). They’re calculated based on:

  • Signal strength - authenticated login > cookie match > IP + fingerprint > IP alone
  • Number of corroborating signals - multiple independent signals pointing to the same identity increase confidence
  • Recency - a match verified last week is more reliable than one verified two years ago
  • Source quality - data from authenticated events outweighs data from inference

A typical scoring model might look like this:

Signal Combination                | Confidence
Authenticated email match         | 95-100
Cookie + device fingerprint match | 85-95
IP + device + behavioral pattern  | 70-85
IP + device only                  | 50-70
IP only                           | 20-40
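One way to combine the four factors is a base score per signal combination, a small bonus for corroboration, and a penalty for age. In the sketch below the base scores are rough midpoints of the ranges in the table above, and the bonus and penalty sizes are invented for illustration. Actual scoring formulas are provider-specific and generally not published.

```python
from datetime import date

# Assumed base scores, loosely the midpoints of the table's ranges.
BASE_SCORE = {
    "authenticated_email": 97,
    "cookie_fingerprint": 90,
    "ip_device_behavior": 77,
    "ip_device": 60,
    "ip_only": 30,
}


def confidence(signal_combo: str, corroborating: int,
               last_verified: date, today: date) -> float:
    """Return a 0-100 confidence score for a match."""
    score = float(BASE_SCORE[signal_combo])
    # Corroboration: +2 per extra independent signal, capped at +5 (assumed).
    score += min(corroborating * 2, 5)
    # Recency: -1 per 30 days since verification, capped at -20 (assumed).
    age_days = (today - last_verified).days
    score -= min(age_days // 30, 20)
    return max(0.0, min(100.0, score))


fresh = confidence("authenticated_email", 2, date(2024, 6, 1), date(2024, 6, 8))
stale = confidence("ip_device", 0, date(2022, 6, 8), date(2024, 6, 8))
```

The clamp at the end keeps the result inside the 0-100 range no matter how the bonuses and penalties stack.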

Threshold Tuning

The confidence threshold determines the minimum score required to return a match. This is the most important configuration parameter in any identity resolution system.

High threshold (90+): Very few false positives, but you’ll miss a lot of real visitors. Good for automated outreach where a wrong match has consequences.

Medium threshold (70-85): Balanced. Most production systems operate here. Some false positives, but coverage is reasonable.

Low threshold (50-70): High coverage, but expect significant false positives. Only appropriate if you’re routing to manual review rather than automation.

The right threshold depends on your use case. If you’re feeding data to an AI sales agent that sends personalized emails, you want high confidence. If you’re building aggregate analytics dashboards, you can tolerate more noise.
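In code, threshold tuning often ends up as a small routing function. The cutoffs below follow the bands described above (90+ for automation, 50+ for manual review), but the exact values and action names are assumptions to adjust for your own use case.

```python
def route_match(confidence: float) -> str:
    """Map a 0-100 confidence score to a downstream action."""
    if confidence >= 90:
        return "automated_outreach"  # wrong matches are costly here
    if confidence >= 50:
        return "manual_review"       # a human checks before any outreach
    return "discard"                 # too noisy to act on


actions = [route_match(score) for score in (95, 78, 30)]
```

Keeping the cutoffs in one place makes it easy to tighten them after an accuracy test.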

The False Positive Tradeoff

Here’s the math that most vendors don’t show you:

A provider claims a 65% match rate. Sounds great compared to a competitor’s 35%. But if the 65% provider has a 25% false positive rate, only 49% of your traffic is correctly identified. The 35% provider with a 3% false positive rate correctly identifies 34% of your traffic.

The difference: 49% vs 34% correct matches. But the 65% provider also gave you wrong matches for about 16% of your traffic (versus roughly 1% for the 35% provider) - people who’ll receive emails meant for someone else.

This is why accuracy testing matters more than match rate.
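The arithmetic behind that comparison is short enough to run yourself. This sketch treats the false positive rate as the share of returned matches that are wrong, which is how the example above uses it. The function name is just for illustration.

```python
def breakdown(match_rate: float, false_positive_rate: float) -> tuple:
    """Split a match rate into correct and wrong shares of total traffic."""
    correct = match_rate * (1 - false_positive_rate)
    wrong = match_rate * false_positive_rate
    return correct, wrong


for name, match_rate, fp_rate in [("Provider A", 0.65, 0.25),
                                  ("Provider B", 0.35, 0.03)]:
    correct, wrong = breakdown(match_rate, fp_rate)
    print(f"{name}: {correct:.0%} of traffic correct, {wrong:.0%} wrong")
```

Plug in a vendor's claimed match rate and your own measured false positive rate to get the number that actually matters.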


Scale Matters: Graph Size Comparison

The size of an identity graph directly impacts its coverage. A larger graph means more potential matches for any given set of signals.

Here’s how the major players compare:

Provider         | Graph Size                     | Type                          | Access Model
Experian / Tapad | Undisclosed (believed largest) | Deterministic + probabilistic | Enterprise ($50K+/yr)
LiveRamp         | Undisclosed (massive)          | Hybrid (AbiliTec)             | Enterprise ($50K+/yr)
Leadpipe         | 4.44B profiles                 | Deterministic                 | Self-serve ($147-$8K/mo)
FullContact      | 1B profiles                    | Deterministic-heavy           | Sales-assisted ($99/mo advertised)
5x5 Data         | ~250M contacts                 | Co-op / probabilistic         | Membership
Lotame           | Undisclosed                    | Probabilistic                 | Partnership-based (contact for access)

[Figure: Identity Graph Scale Comparison (bar chart, bars scaled to 4.44B). Leadpipe: 4.44B profiles; FullContact: 1B profiles; 5x5 Data: 250M+ contacts; Experian / LiveRamp: undisclosed (believed largest). Pricing: Leadpipe from $147/mo (Starter) to ~$8K/mo (1M+ IDs); Experian/LiveRamp at $50K+/yr.]

Why graph size isn’t everything. A graph with 4B profiles and strong verified connections will outperform a graph with 10B profiles built mostly from probabilistic inference. The quality of the edges - not just the number of nodes - determines accuracy.

That said, below a certain size, coverage suffers. If your graph only has 250M contacts and a visitor isn’t in it, no amount of algorithmic sophistication will produce a match. Scale is necessary but not sufficient.


Freshness and Decay

Identity data decays faster than most people realize. People change jobs (average tenure: 4.2 years, but turnover is accelerating), switch phone numbers, move to new addresses, and cycle through devices.

Decay Rates by Data Type

Data Type          | Average Decay Rate | Refresh Frequency Needed
Work email         | 20-30% per year    | Monthly
Phone number       | 15-20% per year    | Quarterly
Job title          | 25-35% per year    | Monthly
Physical address   | 10-15% per year    | Quarterly
Device fingerprint | 50-70% per year    | Weekly
IP address         | 60-80% per year    | Daily

IP addresses and device fingerprints change so frequently that any graph not refreshing these signals daily is working with stale data. Work emails and job titles require at least monthly updates to maintain accuracy.
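To get a feel for what these annual rates mean between refreshes, the sketch below estimates the share of records still valid after a given number of months. It assumes decay compounds evenly through the year, which is a simplification, and uses the midpoint of each range in the table above.

```python
def valid_fraction(annual_decay: float, months: float) -> float:
    """Share of records still accurate after `months`, assuming the
    annual decay rate compounds evenly through the year."""
    return (1.0 - annual_decay) ** (months / 12.0)


# Midpoints of the decay ranges in the table above.
ANNUAL_DECAY = {
    "work_email": 0.25,
    "phone_number": 0.175,
    "job_title": 0.30,
    "physical_address": 0.125,
    "device_fingerprint": 0.60,
    "ip_address": 0.70,
}

for data_type, rate in ANNUAL_DECAY.items():
    share = valid_fraction(rate, months=1)
    print(f"{data_type}: {share:.1%} still valid one month after refresh")
```

Even under this smooth-decay assumption, the fast-moving identifiers lose several times more accuracy per month than the slow ones, which is why their refresh cadence is so much tighter.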

Why Daily Refresh Matters

Consider a scenario: a prospect visits your pricing page on Monday. Your identity graph last updated its IP-to-person mappings two weeks ago. The prospect’s company rotated its IP pool last Friday. The graph returns a match - but it’s the person who HAD that IP two weeks ago, not the person who has it now.

Stale data produces confident wrong answers. That’s worse than no answer at all.

This is one reason enterprise graphs like LiveRamp maintain their edge - they have the resources for continuous data refresh across billions of records. It’s also why smaller providers that rely on monthly data dumps from third parties tend to underperform on accuracy benchmarks.
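If you are consuming identity data yourself, one defensive option is to reject any mapping older than a maximum age for its data type, so a stale link yields no answer instead of a confident wrong one. The age limits below loosely follow the refresh frequencies discussed earlier and are assumptions, as are the function and field names.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed maximum age before a mapping is treated as stale.
MAX_AGE = {
    "ip": timedelta(days=1),
    "device_fingerprint": timedelta(days=7),
    "work_email": timedelta(days=30),
    "phone": timedelta(days=90),
}


def fresh_match(person_id: str, data_type: str,
                last_verified: date, today: date) -> Optional[str]:
    """Return the match only if the underlying mapping is recent enough."""
    if today - last_verified > MAX_AGE[data_type]:
        return None  # stale: better no answer than a wrong one
    return person_id


monday = date(2024, 6, 17)
two_weeks_old = fresh_match("person_001", "ip", monday - timedelta(days=14), monday)
same_day = fresh_match("person_001", "ip", monday, monday)
```

In the Monday scenario above, the two-week-old IP mapping is dropped rather than returned.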


Build vs Buy Analysis

If you’re a platform or SaaS company considering adding identity resolution as a feature, you’ve got three paths: build your own graph, buy enterprise access, or integrate an existing API.

Build Your Own Identity Graph

Cost: $500K - $2M+ initial investment, $200K+/yr ongoing
Timeline: 12-18 months to MVP, 2-3 years to competitive parity
Team required: 3-5 data engineers, 1-2 ML engineers, 1 data partnerships lead, 1 compliance/legal

You’ll need to:

  • Source and license identity data from multiple providers
  • Build the graph data structure (typically Neo4j, JanusGraph, or a custom graph DB)
  • Implement matching algorithms
  • Build confidence scoring
  • Handle data refresh (daily for IP/device, monthly for PII)
  • Manage privacy compliance (CCPA, GDPR, state privacy laws)
  • Build and maintain the API layer

When this makes sense: You’re a data company and identity resolution IS your product. You have the budget, the timeline, and the regulatory expertise. Companies like LiveRamp, Experian, and Leadpipe chose this path because identity is their core competency.

Buy Enterprise Access

Cost: $50K - $200K+/yr
Timeline: 3-6 months including procurement, legal review, implementation
Options: LiveRamp, Experian, TransUnion, Oracle Data Cloud

You get access to a proven graph with massive scale. The downsides: enterprise pricing, long sales cycles, minimum commitments, and limited API flexibility.

When this makes sense: You’re a large company with budget, you need the largest possible graph, and you can tolerate a 6-month implementation timeline. See the detailed breakdown in build vs buy vs embed.

Integrate an Existing API

Cost: $147 - $500/mo
Timeline: Minutes to first API call, days to production integration
Options: Leadpipe, FullContact, Lotame

You get an API key, documentation, and start building immediately. No procurement cycle, no minimum commitment, no implementation consultants.

When this makes sense: You want identity resolution as a feature in your product, not your core product. You need to ship fast. Your budget doesn’t support enterprise licensing. This is the path most SaaS platforms take.

If you’re leaning toward the API integration path, you can sign up for Leadpipe’s free trial and make your first API call in under 5 minutes. No commitment, no credit card - just 500 free identified leads to test the data quality against your own traffic.


How Leadpipe’s Identity Graph Works

Leadpipe builds and maintains its own proprietary identity graph. Here’s what that means in practice.

  • 4.44B person profiles
  • 20,735 intent topics tracked
  • 100+ data points per person
  • Daily IP/device refresh

Graph size: 4.44 billion profiles - the largest self-serve identity graph available. That’s roughly 4x FullContact’s graph and significantly more accessible than LiveRamp’s or Experian’s enterprise-gated offerings.

Matching approach: Deterministic. Leadpipe requires verified connections before returning a match. No probabilistic fallback inflating match rates with questionable data.

Data freshness: Daily refresh on IP and device mappings. Regular updates on PII fields (email, phone, job title).

Fields returned: 100+ data points per resolved person, including:

  • Contact data (name, emails, phone numbers, photo, address)
  • Personal demographics (age, gender, income, net worth, marital status, children, homeowner status)
  • Professional data (title, seniority, department, headline, full work history, LinkedIn)
  • Social profiles (LinkedIn, Facebook, other social platforms)
  • Digital identifiers (HEMs - SHA256, SHA1, MD5 hashed emails; cookies; device IDs)
  • Firmographics (company name, domain, size, industry, revenue, location, LinkedIn URL, NAICS/SIC codes)
  • Intent data (20,735 topics tracked, with topic labels, confidence scores, and daily refresh)

That last point is what differentiates Leadpipe from pure identity graph providers. Most graphs tell you WHO someone is. Leadpipe also tells you WHAT they care about. When you combine identity with intent data, you can route leads not just based on who they are, but what they’re actively researching.

API access: 23 endpoints, self-serve starting at $147/mo. Scales to ~$8,000/mo for high-volume (1M+ identifications at ~$0.008/ID). No sales call required. Multiple delivery methods: REST API, real-time webhooks, flat file/CSV exports, TypeScript SDK (npm install @leadpipe/client), and MCP server (npx -y @leadpipe/mcp). Full API documentation available publicly.

Webhook support: Real-time delivery with First Match and Every Update triggers. Your system gets identity data within seconds of a visitor being resolved - no polling required. Full details in the webhook payload reference.

If you want to see how this compares to other options, the identity resolution API comparison breaks down every major provider side by side.

Try it free. Leadpipe includes 500 identified leads on the free trial - no credit card required. Sign up here and make your first API call in under 5 minutes.


FAQ

What’s the difference between an identity graph and a customer data platform (CDP)?

An identity graph is a data structure that maps identifiers to people. A CDP is a software platform that collects, unifies, and activates customer data. CDPs typically use an identity graph as one component, but also include data storage, segmentation, audience building, and activation features. Think of the identity graph as the engine and the CDP as the car.

How big does an identity graph need to be?

It depends on your market. For US B2B, you need at least 200-300M person records to achieve reasonable coverage. Below that, you’ll miss too many visitors. Leadpipe’s 4.44B profiles provide broad coverage across geographies and both B2B and B2C use cases. Enterprise providers like LiveRamp and Experian maintain even larger graphs but charge far more for access ($50K+/yr versus self-serve plans starting at $147/mo).

Can I build my own identity graph?

Technically yes. Practically, it costs $500K-2M upfront and takes 12-18 months - and that gets you to MVP, not competitive parity. You also need ongoing data partnerships, a compliance team, and continuous engineering to maintain freshness. Unless identity is your core product, integrating an existing API is almost always the better path.

Why do different identity tools return different results for the same visitor?

Different tools use different graphs, different matching algorithms, and different confidence thresholds. A probabilistic tool might return a match with 60% confidence that a deterministic tool would reject. Neither is “wrong” - they just have different accuracy/coverage tradeoffs. This is why running your own accuracy tests matters.

What’s the difference between deterministic and probabilistic matching?

Deterministic matching requires an exact verified connection (like a hashed email match). Probabilistic matching uses statistical inference from weak signals (like IP + device fingerprint). Deterministic is more accurate but has lower coverage. Probabilistic has higher coverage but more false positives. Full breakdown in our deterministic vs probabilistic guide.

How often should identity graph data be refreshed?

IP addresses and device fingerprints should be refreshed daily. Work emails and job titles need monthly updates. Phone numbers and physical addresses can be refreshed quarterly. Any provider not refreshing IP data daily is giving you stale matches.

What does an identity graph API call look like?

Most identity graph APIs accept a query with one or more identifiers (email, phone, device ID, IP) and return a resolved profile. For example, Leadpipe’s API accepts visitor session data and returns 100+ data points including contact info, demographics, social profiles, HEMs, firmographics, and intent data. Response times are typically under 200ms. See the complete developer guide for request/response examples.

How is identity data different from enrichment data?

Identity data answers “who is this person?” - it resolves an anonymous signal to a known identity. Enrichment data answers “what else do we know about them?” - it adds firmographics, technographics, social profiles, and other attributes to a known record. Many providers (including Leadpipe) combine both: identity resolution + enrichment in a single API call.