Week 4: EHR + Claims Data Integration: The Technical Reality
What 18 months of hell at Vanderbilt taught me about data integration (so you can do it in 3)
Welcome back to Week 4.
If you’ve been following along, you now understand:
Week 1: Why healthcare analytics is broken
Week 2: The modern data pipeline architecture
Week 3: Building a healthcare data warehouse from scratch
This week, we tackle the hardest problem in healthcare analytics:
Integrating EHR (Electronic Health Record) data with Claims data.
This is where most projects die. I know because I’ve killed several of them.
Let me save you 18 months of pain.
The Problem Nobody Talks About
Scene: Vanderbilt University Medical Center, March 2023.
Executive mandate: “We need to integrate EHR and Medicaid claims for 1.5 million Tennessee members to support value-based care reporting. CMS deadline is 6 months.”
Sounds simple, right? We have:
✅ Epic EHR system (clinical data)
✅ Tennessee Medicaid claims (billing data)
✅ Both contain patient identifiers
✅ Just join them together
Reality check: It took us 18 months to get it working.
Not because we were incompetent. Because EHR and Claims data fundamentally don’t want to be together.
Here’s why:
The Seven Circles of Integration Hell
Circle 1: They Don’t Speak the Same Language
EHR data:
ICD-10 codes: “E11.9 - Type 2 diabetes mellitus without complications”
Detailed clinical notes
Lab results with units: “HbA1c: 7.2%”
Medications with dosages: “Metformin 500mg BID”
Structured around clinical encounters
Claims data:
ICD-10 codes: “E119” (no description)
No clinical context
No lab results
Medications as NDC codes: “00093-7303-01”
Structured around billing events
Example of the chaos:
Same patient, same diabetes diagnosis:
EHR (Epic):
{
  "diagnosis": "Type 2 diabetes mellitus without complications",
  "icd10": "E11.9",
  "date_diagnosed": "2020-03-15",
  "hba1c": 7.2,
  "provider": "Dr. Jane Smith, Endocrinology"
}
Claims (TennCare):
{
  "diagnosis_code": "E119",
  "service_date": "2020-03-15",
  "allowed_amount": 245.00,
  "provider_npi": "1234567890"
}
They’re talking about the same thing, but you’d never know it from the data structure.
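The first fix is mechanical: normalize codes to a common form before comparing anything. A minimal sketch (the helper names are mine, and note that full NDC normalization also involves 10-to-11-digit padding that depends on the segment layout, which this skips):

def normalize_icd10(code):
    """Normalize an ICD-10 code to dotted form, e.g. 'E119' -> 'E11.9'."""
    code = code.strip().upper().replace('.', '')
    # ICD-10: 3-character category, then a dot before any subcategory characters
    return f"{code[:3]}.{code[3:]}" if len(code) > 3 else code

def normalize_ndc(ndc):
    """Strip hyphens so '00093-7303-01' and '00093730301' compare equal."""
    return ndc.replace('-', '').strip()

assert normalize_icd10('E119') == 'E11.9'
assert normalize_icd10('E11.9') == 'E11.9'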
Circle 2: The Identifier Nightmare
The theory: Use patient identifiers to link records.
The reality: There are no universal patient identifiers in the US.
Here’s what we had to work with at VUMC:
Epic EHR identifiers:
MRN (Medical Record Number): Internal to VUMC
Social Security Number: Sometimes present, often missing
Name + DOB: Subject to typos and variations
TennCare Claims identifiers:
Medicaid ID: Changes when patients lose/regain eligibility
SSN: Present in ~60% of records
Name + DOB: Different formatting conventions
Real example of matching hell:
EHR Record:
- Name: “SMITH, JOHN ANDREW”
- DOB: 1985-03-15
- SSN: XXX-XX-1234
- MRN: V12345678
Claims Record A:
- Name: “SMITH,JOHN A”
- DOB: 03/15/1985
- SSN: [BLANK]
- Medicaid ID: TN0098765
Claims Record B:
- Name: “SMITH, JOHN”
- DOB: 1985-03-15
- SSN: XXX-XX-1234
- Medicaid ID: TN0012345
Claims Record C:
- Name: “SMITH, J A”
- DOB: 1985-03-15
- SSN: XXX-XX-1234
- Medicaid ID: TN0098765
Question: Are A, B, and C the same person? How sure are you?
Our solution (which took 4 months to build):
import re
from datetime import datetime

# pip install jellyfish -- one common source of Jaro-Winkler similarity
from jellyfish import jaro_winkler_similarity

def probabilistic_match(ehr_record, claims_record):
    """
    Multi-stage matching algorithm with confidence scores.
    Returns a match tier based on a 0-100 score.
    """
    score = 0

    # Stage 1: Exact SSN match (if both present)
    if ehr_record.ssn and claims_record.ssn:
        if ehr_record.ssn == claims_record.ssn:
            score += 50  # Strong match
        else:
            return 'no_match'  # Definite non-match

    # Stage 2: DOB match
    if normalize_date(ehr_record.dob) == normalize_date(claims_record.dob):
        score += 30
    else:
        score -= 20  # Different DOB is a red flag

    # Stage 3: Name matching (fuzzy)
    name_similarity = jaro_winkler_similarity(
        normalize_name(ehr_record.name),
        normalize_name(claims_record.name)
    )
    score += name_similarity * 20

    # Stage 4: Address matching (if available)
    if ehr_record.zip and claims_record.zip:
        if ehr_record.zip == claims_record.zip:
            score += 10

    # Confidence thresholds
    if score >= 80:
        return 'definite_match'
    elif score >= 60:
        return 'probable_match'
    elif score >= 40:
        return 'possible_match'
    else:
        return 'no_match'

def normalize_name(name):
    """Remove common variations"""
    name = name.upper().strip()
    name = re.sub(r'[,.]', '', name)
    name = re.sub(r'\s+', ' ', name)
    # Handle Jr, Sr, II, III
    name = re.sub(r'\b(JR|SR|II|III|IV)\b', '', name)
    return name.strip()

def normalize_date(date_str):
    """Handle different date formats"""
    formats = ['%Y-%m-%d', '%m/%d/%Y', '%Y%m%d', '%m-%d-%Y']
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None
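To make the tiers concrete, here is a quick sanity check against Claims Record B from the example above (SimpleNamespace stands in for your real record type, and the zip values are invented, since the example records don't list one):

from types import SimpleNamespace

ehr = SimpleNamespace(name='SMITH, JOHN ANDREW', dob='1985-03-15',
                      ssn='XXX-XX-1234', zip='37203')
claims_b = SimpleNamespace(name='SMITH, JOHN', dob='1985-03-15',
                           ssn='XXX-XX-1234', zip='37203')

print(probabilistic_match(ehr, claims_b))
# SSN (+50) + DOB (+30) + name similarity (~0.92 * 20) + zip (+10)
# lands well past the 80 threshold -> 'definite_match'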
Results:
Definite matches: 87% of records
Probable matches: 9% (manual review required)
Possible matches: 3% (flagged for data quality team)
No matches: 1% (orphaned records)
Time to build: 4 months
Manual review hours: 200+ for the 9% probable matches
Circle 3: The Timeline Paradox
The problem: EHR and Claims record events at different times.
Example scenario:
Patient visits ER on April 15, 2024 with chest pain.
EHR records:
- April 15, 2024 11:30 AM: Patient check-in
- April 15, 2024 11:45 AM: Triage note
- April 15, 2024 12:00 PM: Physician assessment
- April 15, 2024 2:30 PM: Discharge summary
- Diagnosis: “Acute chest pain, rule out MI”
- Final: “Costochondritis” (chest wall inflammation)
Claims records:
- April 30, 2024: Claim submitted
- May 15, 2024: Claim processed
- June 1, 2024: Claim adjusted (coding error)
- June 15, 2024: Final claim paid
- Diagnosis code: “I20.9” (Angina, unspecified) [wrong!]
Questions this creates:
What date do you use for analysis? Service date (Apr 15) or paid date (Jun 15)?
Which diagnosis is correct? EHR (costochondritis) or Claims (angina)?
How do you handle the 2-month lag in claims data?
What about retroactive adjustments?
Our solution:
-- Create a unified timeline view
CREATE VIEW patient_timeline AS
SELECT
    patient_id,
    'ehr' AS source,
    encounter_date AS event_date,
    encounter_date AS recorded_date,
    diagnosis_code,
    'clinical' AS data_quality
FROM ehr_encounters
UNION ALL
SELECT
    patient_id,
    'claims' AS source,
    service_date AS event_date,
    paid_date AS recorded_date,
    diagnosis_code,
    CASE
        WHEN DATEDIFF(day, service_date, paid_date) <= 30 THEN 'good'
        WHEN DATEDIFF(day, service_date, paid_date) <= 60 THEN 'delayed'
        ELSE 'very_delayed'
    END AS data_quality
FROM claims;
-- (Sort in queries against the view; most engines disallow a bare ORDER BY inside a view.)
Design decision: We track both service date (when it happened) and recorded date (when we learned about it).
This is critical for:
Predictive modeling (only use data available at prediction time; see the sketch after this list)
Reporting (explain data lags to stakeholders)
Data quality monitoring (flag suspiciously delayed claims)
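For the predictive-modeling point, the discipline is an "as of" filter: a model run only sees rows whose recorded date precedes the run date. A minimal pandas sketch (the toy data mirrors the ER scenario above; diagnosis codes are illustrative):

import pandas as pd

# Toy version of the patient_timeline view above
timeline = pd.DataFrame({
    'patient_id': [1, 1],
    'source': ['ehr', 'claims'],
    'event_date': ['2024-04-15', '2024-04-15'],
    'recorded_date': ['2024-04-15', '2024-06-15'],  # claim paid two months late
    'diagnosis_code': ['R07.9', 'I20.9'],  # chest pain vs. the miscoded angina claim
})

def timeline_as_of(df, prediction_date):
    """Return only the events we had actually received by prediction_date."""
    known = df[pd.to_datetime(df['recorded_date']) <= pd.Timestamp(prediction_date)]
    return known.sort_values(['patient_id', 'event_date'])

# A May 1 model run sees the ER encounter but not the (wrong) claim diagnosis
print(timeline_as_of(timeline, '2024-05-01'))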
Circle 4: The HIPAA Minefield
The scene: Week 3 of the VUMC project.
Me: “We need to send EHR data to Azure for integration with claims.”
Legal: “What’s your HIPAA compliance plan?”
Me: “Azure is HIPAA compliant with a BAA.”
Legal: “Show me the data flow diagram. What PHI is moving where?”
Three months later...
Here’s what we had to build:
┌─────────────────────────────────────────────────────────┐
│ HIPAA Compliance Layers │
└─────────────────────────────────────────────────────────┘
Layer 1: Data Extraction
├─ EHR (Epic) → Export to SFTP (encrypted)
├─ Claims → Export to SFTP (encrypted)
└─ Both use AES-256 encryption at rest
Layer 2: De-identification (HIPAA Safe Harbor)
├─ Remove 18 identifiers:
│ ├─ Names
│ ├─ Addresses (except state)
│ ├─ Dates (except year)
│ ├─ Phone/Fax
│ ├─ Email
│ ├─ SSN
│ ├─ MRN
│ ├─ Device IDs
│ └─ ... (see full list in code below)
│
└─ Generate surrogate keys:
├─ patient_id → hash(patient_id + salt)
├─ provider_npi → hash(provider_npi + salt)
└─ Keep salt in separate key management system
Layer 3: Data Transfer
├─ SFTP → Azure Data Lake (encrypted in transit)
├─ TLS 1.3 required
└─ Audit logging enabled
Layer 4: Processing
├─ Azure Databricks (HIPAA-compliant workspace)
├─ Network isolation (private endpoints)
├─ No data egress to internet
└─ All access logged
Layer 5: Access Control
├─ Role-based access (RBAC)
├─ Multi-factor authentication
├─ VPN required for remote access
└─ Annual security training required
The de-identification code:
import hashlib
import os
from datetime import datetime

class HIPAADeIdentifier:
    """
    HIPAA Safe Harbor de-identification
    Based on 45 CFR § 164.514(b)(2)
    """

    def __init__(self, salt_key):
        self.salt = salt_key

    def deidentify_record(self, record):
        """Remove the 18 HIPAA identifiers"""
        safe_record = {}

        # 1. Names - removed entirely (never copied into safe_record)

        # 2. Addresses - keep only state
        if 'address' in record:
            safe_record['state'] = record['address'].get('state')

        # 3. Dates - keep only year
        if 'dob' in record:
            safe_record['birth_year'] = self._extract_year(record['dob'])
        if 'service_dates' in record:
            safe_record['service_years'] = [
                self._extract_year(d) for d in record['service_dates']
            ]

        # 4-8. Contact info - removed
        # (phone, fax, email, ssn, mrn are simply never copied over)

        # 9. Generate surrogate key
        if 'patient_id' in record:
            safe_record['patient_key'] = self._generate_surrogate(
                record['patient_id']
            )

        # 10. Device IDs - removed
        # 11. URLs - removed
        # 12. IP addresses - removed
        # ... etc.

        # Keep clinical data (diagnoses, procedures, etc.)
        safe_record['diagnoses'] = record.get('diagnoses', [])
        safe_record['procedures'] = record.get('procedures', [])
        safe_record['medications'] = record.get('medications', [])

        # Age buckets (not exact age)
        if 'age' in record:
            safe_record['age_bucket'] = self._age_bucket(record['age'])

        # Zip code - only first 3 digits
        zip_code = record.get('zip') or record.get('address', {}).get('zip')
        if zip_code:
            safe_record['zip3'] = zip_code[:3]

        return safe_record

    def _generate_surrogate(self, original_id):
        """Create an irreversible hash (the salt lives in a separate key store)"""
        return hashlib.sha256(
            f"{original_id}{self.salt}".encode()
        ).hexdigest()

    def _extract_year(self, date_str):
        """Extract year from an ISO date string"""
        try:
            return datetime.strptime(date_str, '%Y-%m-%d').year
        except (ValueError, TypeError):
            return None

    def _age_bucket(self, age):
        """Group ages into buckets"""
        if age < 18:
            return '<18'
        elif age < 30:
            return '18-29'
        elif age < 40:
            return '30-39'
        elif age < 50:
            return '40-49'
        elif age < 60:
            return '50-59'
        elif age < 70:
            return '60-69'
        elif age < 80:
            return '70-79'
        else:
            return '80+'

# Usage
deidentifier = HIPAADeIdentifier(salt_key=os.environ['SALT_KEY'])

# Before
ehr_record = {
    'patient_id': 'V12345678',
    'name': 'John Smith',
    'dob': '1985-03-15',
    'ssn': '123-45-6789',
    'phone': '615-555-1234',
    'address': {
        'street': '123 Main St',
        'city': 'Nashville',
        'state': 'TN',
        'zip': '37203'
    },
    'diagnoses': ['E11.9', 'I10'],
    'age': 39
}

# After
safe_record = deidentifier.deidentify_record(ehr_record)
# {
#   'patient_key': 'a3f5b8c9...',  # irreversible hash
#   'state': 'TN',
#   'birth_year': 1985,
#   'diagnoses': ['E11.9', 'I10'],
#   'procedures': [],
#   'medications': [],
#   'age_bucket': '30-39',
#   'zip3': '372'
# }
Legal review time: 3 months
Documentation pages: 47
Meetings with compliance: 12
Revisions to data flow: 8
But here’s the important part: Once approved, we could reuse this framework for every future project. The 3 months paid off.
Circle 5: The Data Volume Problem
At VUMC, we’re integrating:
EHR: 1.5M patients, 50M+ encounters, 500M+ clinical events
Claims: Same 1.5M patients, 200M+ claims over 5 years
Naive approach:
-- This query will run until the heat death of the universe
SELECT
    e.patient_id,
    e.encounter_date,
    e.diagnosis_code AS ehr_diagnosis,
    c.service_date,
    c.diagnosis_code AS claims_diagnosis
FROM ehr_encounters e
LEFT JOIN claims c
    ON e.patient_id = c.patient_id
    AND ABS(DATEDIFF(day, e.encounter_date, c.service_date)) <= 7
WHERE e.encounter_date >= '2019-01-01'
Problem: In the worst case this join has to consider 50M × 200M = 10 quadrillion row pairs, and even keyed on patient_id, the fuzzy date window forces a huge fan-out per patient.
Our optimized approach:
-- Step 1: Create patient-level spine (partition key)
CREATE TABLE patient_spine AS
SELECT
    patient_key,
    MIN(birth_year)     AS birth_year,
    MIN(state)          AS state,
    MIN(encounter_date) AS first_encounter_date,
    MAX(encounter_date) AS last_encounter_date
FROM ehr_encounters
WHERE encounter_date >= '2019-01-01'
GROUP BY patient_key;
-- Step 2: Partition EHR data by patient + year
CREATE TABLE ehr_encounters_partitioned
PARTITIONED BY (patient_key, encounter_year)
AS
SELECT
    *,
    YEAR(encounter_date) AS encounter_year
FROM ehr_encounters;

-- Step 3: Partition Claims data by patient + year
CREATE TABLE claims_partitioned
PARTITIONED BY (patient_key, service_year)
AS
SELECT
    *,
    YEAR(service_date) AS service_year
FROM claims;

-- Step 4: Join within partitions (MUCH faster)
-- Note: the year-equality condition can miss a ±7-day window that
-- straddles Dec 31; also join on adjacent years if that matters.
SELECT
    e.patient_key,
    e.encounter_date,
    e.diagnosis_code AS ehr_diagnosis,
    c.service_date,
    c.diagnosis_code AS claims_diagnosis
FROM ehr_encounters_partitioned e
LEFT JOIN claims_partitioned c
    ON e.patient_key = c.patient_key
    AND e.encounter_year = c.service_year
    AND ABS(DATEDIFF(day, e.encounter_date, c.service_date)) <= 7
WHERE e.encounter_year >= 2019;
Performance:
Naive query: Never completed (>24 hours)
Optimized query: 12 minutes on Databricks cluster (64 cores)
Cost:
Naive: $∞ (never finishes)
Optimized: $38 per run
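The optimized run above was SQL on Databricks; the same shape in PySpark looks roughly like this. Treat it as a sketch (table names follow the SQL above, and Spark's optimizer may choose its own shuffle plan regardless):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ehr = spark.table("ehr_encounters_partitioned").repartition("patient_key")
claims = spark.table("claims_partitioned").repartition("patient_key")

# Pre-shuffling both sides on patient_key keeps the windowed join partition-local
joined = (
    ehr.alias("e")
    .join(
        claims.alias("c"),
        (F.col("e.patient_key") == F.col("c.patient_key"))
        & (F.col("e.encounter_year") == F.col("c.service_year"))
        & (F.abs(F.datediff(F.col("e.encounter_date"), F.col("c.service_date"))) <= 7),
        "left",
    )
    .select(
        "e.patient_key",
        "e.encounter_date",
        F.col("e.diagnosis_code").alias("ehr_diagnosis"),
        "c.service_date",
        F.col("c.diagnosis_code").alias("claims_diagnosis"),
    )
)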
Circle 6: The Data Quality Quagmire
Truth bomb: Both EHR and Claims data are garbage. Just in different ways.
EHR data problems:
Coding inconsistency (10 doctors, 10 ways to document diabetes)
Missing data (labs not entered, notes incomplete)
Copy-paste errors (residents copy previous notes)
Temporal issues (medication list not updated)
Claims data problems:
Upcoding (billing for more expensive procedure)
Bundling/unbundling games
Delayed submission (6+ months for some claims)
Denials/adjustments create duplicates
Real example from TennCare:
Patient X has diabetes. Here’s what we found:
EHR (Epic):
- Encounter 1: E11.9 (T2DM without complications)
- Encounter 2: E11.65 (T2DM with hyperglycemia)
- Encounter 3: E11.9
- Lab: HbA1c 7.2% (confirms diabetes)
- Medication: Metformin 500mg BID (confirms treatment)
Claims (TennCare):
- Claim 1: E11.9 (T2DM)
- Claim 2: E10.9 (T1DM) ← WRONG TYPE!
- Claim 3: E119 (missing decimal)
- Claim 4: 250.00 (old ICD-9 code) ← OUTDATED!
- Claim 5: [No diagnosis] (denied claim)
What’s the truth?
Our data quality framework:
from datetime import datetime

class DataQualityScorer:
    """
    Assign quality scores to each data element.
    Use for weighted analysis.
    """

    def score_diagnosis(self, record):
        """Score diagnosis reliability (0-100)"""
        score = 50  # baseline

        # Source credibility
        if record['source'] == 'ehr':
            if record['record_type'] == 'physician_note':
                score += 30  # High trust
            elif record['record_type'] == 'automated_problem_list':
                score += 10  # Medium trust
        elif record['source'] == 'claims':
            if record['claim_status'] == 'paid':
                score += 20  # Verified by payer
            elif record['claim_status'] == 'denied':
                score -= 30  # Suspect coding

        # Temporal recency
        days_old = (datetime.now() - record['recorded_date']).days
        if days_old <= 90:
            score += 10
        elif days_old > 365:
            score -= 10

        # Consistency across sources
        if record.get('confirmed_by_multiple_sources'):
            score += 20

        # Specificity (more specific = better quality)
        if '.' in record['icd_code']:  # has decimal
            score += 5

        # Clinical coherence (does it make sense?)
        if self.check_clinical_coherence(record):
            score += 10

        return max(0, min(100, score))  # clamp to 0-100

    def check_clinical_coherence(self, record):
        """Check if a diagnosis makes sense given context"""
        # Example: new Type 1 diabetes diagnosis after age 50 is suspect
        if record['icd_code'].startswith('E10'):  # T1DM
            if record.get('age') and record['age'] > 50:
                if not record.get('previous_t1dm_diagnosis'):
                    return False  # Likely coding error

        # Example: pregnancy diagnosis for a male patient
        if record['icd_code'].startswith('O'):  # Pregnancy codes
            if record.get('gender') == 'M':
                return False  # Definite error

        return True

# Usage: weight analyses by quality scores
quality_scorer = DataQualityScorer()

for record in patient_records:
    record['quality_score'] = quality_scorer.score_diagnosis(record)

# Then in analysis:
# Don't just count diagnoses - weight by quality
weighted_prevalence = sum(
    r['quality_score'] / 100.0
    for r in records
    if r['icd_code'] == 'E11.9'
) / len(records)
Key insight: Don’t try to make data perfect. Instead:
Score data quality
Use scores as weights in analysis
Flag low-quality data for manual review
Report confidence intervals based on quality (see the sketch below)
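For that last item, one way (not the only way) to put an interval around the quality-weighted prevalence is a simple bootstrap. A hedged sketch, where records is the scored list from the snippet above and numpy is assumed:

import numpy as np

def weighted_prevalence_ci(records, icd_code, n_boot=2000, seed=0):
    """Bootstrap a 95% CI for the quality-weighted prevalence of one code."""
    rng = np.random.default_rng(seed)
    weights = np.array([
        r['quality_score'] / 100.0 if r['icd_code'] == icd_code else 0.0
        for r in records
    ])
    estimates = [
        rng.choice(weights, size=len(weights), replace=True).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(estimates, [2.5, 97.5])

low, high = weighted_prevalence_ci(records, 'E11.9')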
Circle 7: The “Now What?” Problem
The scene: Month 18 of the VUMC project.
We finally had integrated data. Clean, validated, HIPAA-compliant.
Executive: “Great! Now show me... something.”
Me: “What do you want to see?”
Executive: “I don’t know. You’re the data person. Show me insights!”
The brutal truth: Integration is only 20% of the value. The other 80% is knowing what questions to ask.
What we built (the actual product):
# Population Health Dashboard
# Azure Function that runs daily

def generate_population_insights(patient_cohort):
    """Combine EHR + Claims to find actionable insights"""
    insights = []

    # Insight 1: High-risk patients (EHR + Claims)
    high_risk = find_high_risk_patients(
        ehr_data=get_recent_encounters(patient_cohort),
        claims_data=get_recent_utilization(patient_cohort)
    )
    insights.append({'type': 'high_risk', 'patients': high_risk})

    # Insight 2: Care gaps (EHR shows diagnosis, Claims show no treatment)
    care_gaps = find_care_gaps(
        ehr_diagnoses=get_chronic_conditions(patient_cohort),
        claims_prescriptions=get_filled_medications(patient_cohort)
    )
    insights.append({'type': 'care_gaps', 'gaps': care_gaps})

    # Insight 3: Unnecessary utilization (Claims show pattern, EHR shows alternative)
    unnecessary_er = find_unnecessary_er_visits(
        claims_er_visits=get_er_visits(patient_cohort),
        ehr_pcp_availability=get_pcp_schedule(patient_cohort)
    )
    insights.append({'type': 'unnecessary_er', 'visits': unnecessary_er})

    # Use GPT-4 to generate a plain-English summary
    summary = generate_executive_summary(
        high_risk_count=len(high_risk),
        care_gap_details=care_gaps,
        er_cost=sum(v['cost'] for v in unnecessary_er)
    )

    return {
        'insights': insights,
        'summary': summary,
        'actionable_patients': high_risk + care_gaps,
        'estimated_savings': calculate_roi(insights)
    }

def find_care_gaps(ehr_diagnoses, claims_prescriptions):
    """Example: patient has a diabetes diagnosis but no filled meds"""
    gaps = []
    for patient in ehr_diagnoses:
        if any(code.startswith('E11') for code in patient['icd_codes']):  # Has T2DM
            # Check if taking diabetes meds
            meds = [m for m in claims_prescriptions
                    if m['patient_key'] == patient['patient_key']
                    and m['drug_class'] in ['metformin', 'insulin', 'sglt2']]
            if not meds:
                gaps.append({
                    'patient_key': patient['patient_key'],
                    'condition': 'Type 2 Diabetes',
                    'gap': 'No diabetes medication filled in 90 days',
                    'risk': 'high',
                    'action': 'Contact patient to discuss medication adherence',
                    'estimated_cost': 5000  # Avg cost of a diabetes complication
                })
    return gaps
This is what executives wanted:
✅ “234 patients with diabetes not taking meds” (actionable list)
✅ “Estimated $1.2M in preventable complications” (ROI)
✅ “Auto-generated outreach list” (next steps)
Not:
❌ “Here’s a dashboard of integrated data” (so what?)
❌ “95% of records successfully linked” (who cares?)
The Honest Timeline (What Actually Happened)
Month 1-3: Legal & Compliance
BAA with Azure
Data governance policies
HIPAA compliance framework
Security architecture review
Month 4-7: Data Extraction
Epic data export (took 6 weeks to get IT approval)
Claims data export (relatively easy)
Building de-identification pipeline
Testing with synthetic data
Month 8-11: Matching Algorithm
Probabilistic matching logic
Manual review workflow
Validation with known matches
False positive/negative analysis
Month 12-15: Data Quality
Cleaning and standardization
Quality scoring framework
Reconciliation rules
Exception handling
Month 16-18: Product Development
Population health dashboard
Care gap identification
Risk stratification
Executive reporting
Total time: 18 months
Total cost: ~$800K (4 FTE + infrastructure)
The 3-Month Playbook (How to Do It Faster)
After doing this twice (TennCare and VUMC), here’s the optimized path:
Month 1: Foundation
Week 1:
Get legal approval (use my BAA template - see appendix)
Set up Azure environment with HIPAA controls
Deploy de-identification pipeline (use my code)
Week 2:
Extract EHR data to SFTP
Extract Claims data to SFTP
Validate data formats
Week 3:
Load to Azure Data Lake
Run de-identification
Create partitioned tables
Week 4:
Build patient spine
Implement matching algorithm (use my probabilistic matcher)
Validate on sample (1000 patients)
Month 2: Integration
Week 5-6:
Scale matching to full population
Manual review of probable matches
Build quality scoring framework
Week 7-8:
Create unified data model
Build integration pipelines
Automated testing
Month 3: Product
Week 9-10:
Build population health dashboard
Care gap analysis
Risk stratification
Week 11-12:
Executive reporting
User acceptance testing
Production deployment
Keys to success:
Don’t reinvent the wheel - use existing frameworks
Parallel workstreams - don’t do everything sequentially
Start with small cohort - validate before scaling
Focus on business value - build insights, not just pipelines
Your Assignment This Week
Assignment 1: Map Your Integration Requirements (2 hours)
Create a document with:
Data sources you need to integrate
Identifiers available in each source
Matching strategy (exact, probabilistic, manual?)
HIPAA compliance gaps
Expected record volumes
Business questions you’ll answer
Template:
Data Source 1: [EHR Name]
- Records: [X million]
- Identifiers: [MRN, SSN, Name, DOB]
- Refresh frequency: [Daily/Weekly]
- HIPAA status: [Compliant/Needs review]
Data Source 2: [Claims Source]
- Records: [X million]
- Identifiers: [Member ID, SSN, Name, DOB]
- Refresh frequency: [Monthly]
- HIPAA status: [Compliant]
Matching Strategy:
- Primary: SSN + DOB (if both present)
- Secondary: Probabilistic (Name + DOB + Zip)
- Manual review threshold: 60-80% confidence
Business Questions:
1. [Specific question requiring both sources]
2. [Another question]
Assignment 2: Test the De-identification Code (1 hour)
Take my HIPAADeIdentifier class (in the code section above)
Run it on sample data (use synthetic data if you don’t have real data)
Verify all 18 HIPAA identifiers are removed
Calculate what % of your data would be retained
Email me your results. I’m collecting data on de-identification effectiveness across different data types.
Assignment 3: Calculate Your Integration Cost (30 minutes)
Use this spreadsheet formula:
Labor Cost:
- Data Engineers: [X FTE] × [Months] × $15K/month
- Legal Review: [Y hours] × $300/hour
- Manual Review: [Z hours] × $50/hour
Infrastructure Cost:
- Azure Databricks: [Cluster size] × [Hours] × $0.50/DBU
- Storage: [TB] × $25/month
- Data Transfer: [GB] × $0.05/GB
Total: $[X]
Expected Savings: $[Y]
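If a spreadsheet is overkill, here is the same back-of-the-envelope math in Python. The rates are the placeholder figures from the template above, and every default argument is a made-up example; swap in your own numbers:

def integration_cost(
    data_engineer_fte=4, months=3,
    legal_hours=40, manual_review_hours=200,
    dbu_hours=500, storage_tb=10, transfer_gb=1000,
):
    """Rough integration cost estimate using the template's placeholder rates."""
    labor = (
        data_engineer_fte * months * 15_000   # Data engineers at $15K/month
        + legal_hours * 300                   # Legal review at $300/hour
        + manual_review_hours * 50            # Manual review at $50/hour
    )
    infrastructure = (
        dbu_hours * 0.50                      # Azure Databricks at $0.50/DBU
        + storage_tb * 25 * months            # Storage at $25/TB-month
        + transfer_gb * 0.05                  # Data transfer at $0.05/GB
    )
    return labor + infrastructure

print(f"Total: ${integration_cost():,.0f}")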
