Week 4: EHR + Claims Data Integration: The Technical Reality
What 18 months of hell at Vanderbilt taught me about data integration (so you can do it in 3)
Welcome back to Week 4.
If you’ve been following along, you now understand:
Week 1: Why healthcare analytics is broken
Week 2: The modern data pipeline architecture
Week 3: Building a healthcare data warehouse from scratch
This week, we tackle the hardest problem in healthcare analytics:
Integrating EHR (Electronic Health Record) data with Claims data.
This is where most projects die. I know because I’ve killed several of them.
Let me save you 18 months of pain.
The Problem Nobody Talks About
Scene: Vanderbilt University Medical Center, March 2023.
Executive mandate: “We need to integrate EHR and Medicaid claims for 1.5 million Tennessee members to support value-based care reporting. CMS deadline is 6 months.”
Sounds simple, right? We have:
✅ Epic EHR system (clinical data)
✅ Tennessee Medicaid claims (billing data)
✅ Both contain patient identifiers
✅ Just join them together
Reality check: It took us 18 months to get it working.
Not because we were incompetent. Because EHR and Claims data fundamentally don’t want to be together.
Here’s why:
The Seven Circles of Integration Hell
Circle 1: They Don’t Speak the Same Language
EHR data:
ICD-10 codes: “E11.9 - Type 2 diabetes mellitus without complications”
Detailed clinical notes
Lab results with units: “HbA1c: 7.2%”
Medications with dosages: “Metformin 500mg BID”
Structured around clinical encounters
Claims data:
ICD-10 codes: “E119” (no description)
No clinical context
No lab results
Medications as NDC codes: “00093-7303-01”
Structured around billing events
Example of the chaos:
Same patient, same diabetes diagnosis:
EHR (Epic):
{
  "diagnosis": "Type 2 diabetes mellitus without complications",
  "icd10": "E11.9",
  "date_diagnosed": "2020-03-15",
  "hba1c": 7.2,
  "provider": "Dr. Jane Smith, Endocrinology"
}
Claims (TennCare):
{
  "diagnosis_code": "E119",
  "service_date": "2020-03-15",
  "allowed_amount": 245.00,
  "provider_npi": "1234567890"
}
They’re talking about the same thing, but you’d never know it from the data structure.
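The first fix is mechanical: normalize codes to a common form before comparing anything. A minimal sketch (the helper names are mine, and note that full NDC normalization also involves 10-to-11-digit padding that depends on the segment layout, which this skips):

def normalize_icd10(code):
    """Normalize an ICD-10 code to dotted form, e.g. 'E119' -> 'E11.9'."""
    code = code.strip().upper().replace('.', '')
    # ICD-10: 3-character category, then a dot before any subcategory characters
    return f"{code[:3]}.{code[3:]}" if len(code) > 3 else code

def normalize_ndc(ndc):
    """Strip hyphens so '00093-7303-01' and '00093730301' compare equal."""
    return ndc.replace('-', '').strip()

assert normalize_icd10('E119') == 'E11.9'
assert normalize_icd10('E11.9') == 'E11.9'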
Circle 2: The Identifier Nightmare
The theory: Use patient identifiers to link records.
The reality: There are no universal patient identifiers in the US.
Here’s what we had to work with at VUMC:
Epic EHR identifiers:
MRN (Medical Record Number): Internal to VUMC
Social Security Number: Sometimes present, often missing
Name + DOB: Subject to typos and variations
TennCare Claims identifiers:
Medicaid ID: Changes when patients lose/regain eligibility
SSN: Present in ~60% of records
Name + DOB: Different formatting conventions
Real example of matching hell:
EHR Record:
- Name: “SMITH, JOHN ANDREW”
- DOB: 1985-03-15
- SSN: XXX-XX-1234
- MRN: V12345678
Claims Record A:
- Name: “SMITH,JOHN A”
- DOB: 03/15/1985
- SSN: [BLANK]
- Medicaid ID: TN0098765
Claims Record B:
- Name: “SMITH, JOHN”
- DOB: 1985-03-15
- SSN: XXX-XX-1234
- Medicaid ID: TN0012345
Claims Record C:
- Name: “SMITH, J A”
- DOB: 1985-03-15
- SSN: XXX-XX-1234
- Medicaid ID: TN0098765
Question: Are A, B, and C the same person? How sure are you?
Our solution (which took 4 months to build):
import re
from datetime import datetime

# pip install jellyfish -- one common source of Jaro-Winkler similarity
from jellyfish import jaro_winkler_similarity

def probabilistic_match(ehr_record, claims_record):
    """
    Multi-stage matching algorithm with confidence scores.
    Returns a match tier based on a 0-100 score.
    """
    score = 0

    # Stage 1: Exact SSN match (if both present)
    if ehr_record.ssn and claims_record.ssn:
        if ehr_record.ssn == claims_record.ssn:
            score += 50  # Strong match
        else:
            return 'no_match'  # Definite non-match

    # Stage 2: DOB match
    if normalize_date(ehr_record.dob) == normalize_date(claims_record.dob):
        score += 30
    else:
        score -= 20  # Different DOB is a red flag

    # Stage 3: Name matching (fuzzy)
    name_similarity = jaro_winkler_similarity(
        normalize_name(ehr_record.name),
        normalize_name(claims_record.name)
    )
    score += name_similarity * 20

    # Stage 4: Address matching (if available)
    if ehr_record.zip and claims_record.zip:
        if ehr_record.zip == claims_record.zip:
            score += 10

    # Confidence thresholds
    if score >= 80:
        return 'definite_match'
    elif score >= 60:
        return 'probable_match'
    elif score >= 40:
        return 'possible_match'
    else:
        return 'no_match'

def normalize_name(name):
    """Remove common variations"""
    name = name.upper().strip()
    name = re.sub(r'[,.]', '', name)
    name = re.sub(r'\s+', ' ', name)
    # Handle Jr, Sr, II, III
    name = re.sub(r'\b(JR|SR|II|III|IV)\b', '', name)
    return name.strip()

def normalize_date(date_str):
    """Handle different date formats"""
    formats = ['%Y-%m-%d', '%m/%d/%Y', '%Y%m%d', '%m-%d-%Y']
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None
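To make the tiers concrete, here is a quick sanity check against Claims Record B from the example above (SimpleNamespace stands in for your real record type, and the zip values are invented, since the example records don't list one):

from types import SimpleNamespace

ehr = SimpleNamespace(name='SMITH, JOHN ANDREW', dob='1985-03-15',
                      ssn='XXX-XX-1234', zip='37203')
claims_b = SimpleNamespace(name='SMITH, JOHN', dob='1985-03-15',
                           ssn='XXX-XX-1234', zip='37203')

print(probabilistic_match(ehr, claims_b))
# SSN (+50) + DOB (+30) + name similarity (~0.92 * 20) + zip (+10)
# lands well past the 80 threshold -> 'definite_match'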
Results:
Definite matches: 87% of records
Probable matches: 9% (manual review required)
Possible matches: 3% (flagged for data quality team)
No matches: 1% (orphaned records)
Time to build: 4 months
Manual review hours: 200+ for the 9% probable matches
Circle 3: The Timeline Paradox
The problem: EHR and Claims record events at different times.
Example scenario:
Patient visits ER on April 15, 2024 with chest pain.
EHR records:
- April 15, 2024 11:30 AM: Patient check-in
- April 15, 2024 11:45 AM: Triage note
- April 15, 2024 12:00 PM: Physician assessment
- April 15, 2024 2:30 PM: Discharge summary
- Diagnosis: “Acute chest pain, rule out MI”
- Final: “Costochondritis” (chest wall inflammation)
Claims records:
- April 30, 2024: Claim submitted
- May 15, 2024: Claim processed
- June 1, 2024: Claim adjusted (coding error)
- June 15, 2024: Final claim paid
- Diagnosis code: “I20.9” (Angina, unspecified) [wrong!]
Questions this creates:
What date do you use for analysis? Service date (Apr 15) or paid date (Jun 15)?
Which diagnosis is correct? EHR (costochondritis) or Claims (angina)?
How do you handle the 2-month lag in claims data?
What about retroactive adjustments?
Our solution:
-- Create a unified timeline view
CREATE VIEW patient_timeline AS
SELECT
    patient_id,
    'ehr' AS source,
    encounter_date AS event_date,
    encounter_date AS recorded_date,
    diagnosis_code,
    'clinical' AS data_quality
FROM ehr_encounters
UNION ALL
SELECT
    patient_id,
    'claims' AS source,
    service_date AS event_date,
    paid_date AS recorded_date,
    diagnosis_code,
    CASE
        WHEN DATEDIFF(day, service_date, paid_date) <= 30 THEN 'good'
        WHEN DATEDIFF(day, service_date, paid_date) <= 60 THEN 'delayed'
        ELSE 'very_delayed'
    END AS data_quality
FROM claims;
-- (Sort in queries against the view; most engines disallow a bare ORDER BY inside a view.)
Design decision: We track both service date (when it happened) and recorded date (when we learned about it).
This is critical for:
Predictive modeling (only use data available at prediction time; see the sketch after this list)
Reporting (explain data lags to stakeholders)
Data quality monitoring (flag suspiciously delayed claims)
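For the predictive-modeling point, the discipline is an "as of" filter: a model run only sees rows whose recorded date precedes the run date. A minimal pandas sketch (the toy data mirrors the ER scenario above; diagnosis codes are illustrative):

import pandas as pd

# Toy version of the patient_timeline view above
timeline = pd.DataFrame({
    'patient_id': [1, 1],
    'source': ['ehr', 'claims'],
    'event_date': ['2024-04-15', '2024-04-15'],
    'recorded_date': ['2024-04-15', '2024-06-15'],  # claim paid two months late
    'diagnosis_code': ['R07.9', 'I20.9'],  # chest pain vs. the miscoded angina claim
})

def timeline_as_of(df, prediction_date):
    """Return only the events we had actually received by prediction_date."""
    known = df[pd.to_datetime(df['recorded_date']) <= pd.Timestamp(prediction_date)]
    return known.sort_values(['patient_id', 'event_date'])

# A May 1 model run sees the ER encounter but not the (wrong) claim diagnosis
print(timeline_as_of(timeline, '2024-05-01'))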
Circle 4: The HIPAA Minefield
The scene: Week 3 of the VUMC project.
Me: “We need to send EHR data to Azure for integration with claims.”
Legal: “What’s your HIPAA compliance plan?”
Me: “Azure is HIPAA compliant with a BAA.”
Legal: “Show me the data flow diagram. What PHI is moving where?”
Three months later...
Here’s what we had to build:
┌─────────────────────────────────────────────────────────┐
│ HIPAA Compliance Layers │
└─────────────────────────────────────────────────────────┘
Layer 1: Data Extraction
├─ EHR (Epic) → Export to SFTP (encrypted)
├─ Claims → Export to SFTP (encrypted)
└─ Both use AES-256 encryption at rest
Layer 2: De-identification (HIPAA Safe Harbor)
├─ Remove 18 identifiers:
│ ├─ Names
│ ├─ Addresses (except state)
│ ├─ Dates (except year)
│ ├─ Phone/Fax
│ ├─ Email
│ ├─ SSN
│ ├─ MRN
│ ├─ Device IDs
│ └─ ... (see full list in code below)
│
└─ Generate surrogate keys:
├─ patient_id → hash(patient_id + salt)
├─ provider_npi → hash(provider_npi + salt)
└─ Keep salt in separate key management system
Layer 3: Data Transfer
├─ SFTP → Azure Data Lake (encrypted in transit)
├─ TLS 1.3 required
└─ Audit logging enabled
Layer 4: Processing
├─ Azure Databricks (HIPAA-compliant workspace)
├─ Network isolation (private endpoints)
├─ No data egress to internet
└─ All access logged
Layer 5: Access Control
├─ Role-based access (RBAC)
├─ Multi-factor authentication
├─ VPN required for remote access
└─ Annual security training required
The de-identification code:
import hashlib
import os
from datetime import datetime

class HIPAADeIdentifier:
    """
    HIPAA Safe Harbor de-identification
    Based on 45 CFR § 164.514(b)(2)
    """

    def __init__(self, salt_key):
        self.salt = salt_key

    def deidentify_record(self, record):
        """Remove the 18 HIPAA identifiers"""
        safe_record = {}

        # 1. Names - removed entirely (never copied into safe_record)

        # 2. Addresses - keep only state
        if 'address' in record:
            safe_record['state'] = record['address'].get('state')

        # 3. Dates - keep only year
        if 'dob' in record:
            safe_record['birth_year'] = self._extract_year(record['dob'])
        if 'service_dates' in record:
            safe_record['service_years'] = [
                self._extract_year(d) for d in record['service_dates']
            ]

        # 4-8. Contact info - removed
        # (phone, fax, email, ssn, mrn are simply never copied over)

        # 9. Generate surrogate key
        if 'patient_id' in record:
            safe_record['patient_key'] = self._generate_surrogate(
                record['patient_id']
            )

        # 10. Device IDs - removed
        # 11. URLs - removed
        # 12. IP addresses - removed
        # ... etc.

        # Keep clinical data (diagnoses, procedures, etc.)
        safe_record['diagnoses'] = record.get('diagnoses', [])
        safe_record['procedures'] = record.get('procedures', [])
        safe_record['medications'] = record.get('medications', [])

        # Age buckets (not exact age)
        if 'age' in record:
            safe_record['age_bucket'] = self._age_bucket(record['age'])

        # Zip code - only first 3 digits
        zip_code = record.get('zip') or record.get('address', {}).get('zip')
        if zip_code:
            safe_record['zip3'] = zip_code[:3]

        return safe_record

    def _generate_surrogate(self, original_id):
        """Create an irreversible hash (the salt lives in a separate key store)"""
        return hashlib.sha256(
            f"{original_id}{self.salt}".encode()
        ).hexdigest()

    def _extract_year(self, date_str):
        """Extract year from an ISO date string"""
        try:
            return datetime.strptime(date_str, '%Y-%m-%d').year
        except (ValueError, TypeError):
            return None

    def _age_bucket(self, age):
        """Group ages into buckets"""
        if age < 18:
            return '<18'
        elif age < 30:
            return '18-29'
        elif age < 40:
            return '30-39'
        elif age < 50:
            return '40-49'
        elif age < 60:
            return '50-59'
        elif age < 70:
            return '60-69'
        elif age < 80:
            return '70-79'
        else:
            return '80+'

# Usage
deidentifier = HIPAADeIdentifier(salt_key=os.environ['SALT_KEY'])

# Before
ehr_record = {
    'patient_id': 'V12345678',
    'name': 'John Smith',
    'dob': '1985-03-15',
    'ssn': '123-45-6789',
    'phone': '615-555-1234',
    'address': {
        'street': '123 Main St',
        'city': 'Nashville',
        'state': 'TN',
        'zip': '37203'
    },
    'diagnoses': ['E11.9', 'I10'],
    'age': 39
}

# After
safe_record = deidentifier.deidentify_record(ehr_record)
# {
#   'patient_key': 'a3f5b8c9...',  # irreversible hash
#   'state': 'TN',
#   'birth_year': 1985,
#   'diagnoses': ['E11.9', 'I10'],
#   'procedures': [],
#   'medications': [],
#   'age_bucket': '30-39',
#   'zip3': '372'
# }
Legal review time: 3 months
Documentation pages: 47
Meetings with compliance: 12
Revisions to data flow: 8
But here’s the important part: Once approved, we could reuse this framework for every future project. The 3 months paid off.
Circle 5: The Data Volume Problem
At VUMC, we’re integrating:
EHR: 1.5M patients, 50M+ encounters, 500M+ clinical events
Claims: Same 1.5M patients, 200M+ claims over 5 years
Naive approach:
-- This query will run until the heat death of the universe
SELECT
    e.patient_id,
    e.encounter_date,
    e.diagnosis_code AS ehr_diagnosis,
    c.service_date,
    c.diagnosis_code AS claims_diagnosis
FROM ehr_encounters e
LEFT JOIN claims c
    ON e.patient_id = c.patient_id
    AND ABS(DATEDIFF(day, e.encounter_date, c.service_date)) <= 7
WHERE e.encounter_date >= '2019-01-01'
Problem: In the worst case this join has to consider 50M × 200M = 10 quadrillion row pairs, and even keyed on patient_id, the fuzzy date window forces a huge fan-out per patient.
Our optimized approach:
-- Step 1: Create patient-level spine (partition key)
CREATE TABLE patient_spine AS
SELECT
    patient_key,
    MIN(birth_year)     AS birth_year,
    MIN(state)          AS state,
    MIN(encounter_date) AS first_encounter_date,
    MAX(encounter_date) AS last_encounter_date
FROM ehr_encounters
WHERE encounter_date >= '2019-01-01'
GROUP BY patient_key;
-- Step 2: Partition EHR data by patient + year
CREATE TABLE ehr_encounters_partitioned
PARTITIONED BY (patient_key, encounter_year)
AS
SELECT
    *,
    YEAR(encounter_date) AS encounter_year
FROM ehr_encounters;

-- Step 3: Partition Claims data by patient + year
CREATE TABLE claims_partitioned
PARTITIONED BY (patient_key, service_year)
AS
SELECT
    *,
    YEAR(service_date) AS service_year
FROM claims;

-- Step 4: Join within partitions (MUCH faster)
-- Note: the year-equality condition can miss a ±7-day window that
-- straddles Dec 31; also join on adjacent years if that matters.
SELECT
    e.patient_key,
    e.encounter_date,
    e.diagnosis_code AS ehr_diagnosis,
    c.service_date,
    c.diagnosis_code AS claims_diagnosis
FROM ehr_encounters_partitioned e
LEFT JOIN claims_partitioned c
    ON e.patient_key = c.patient_key
    AND e.encounter_year = c.service_year
    AND ABS(DATEDIFF(day, e.encounter_date, c.service_date)) <= 7
WHERE e.encounter_year >= 2019;
Performance:
Naive query: Never completed (>24 hours)
Optimized query: 12 minutes on Databricks cluster (64 cores)
Cost:
Naive: $∞ (never finishes)
Optimized: $38 per run
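The optimized run above was SQL on Databricks; the same shape in PySpark looks roughly like this. Treat it as a sketch (table names follow the SQL above, and Spark's optimizer may choose its own shuffle plan regardless):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ehr = spark.table("ehr_encounters_partitioned").repartition("patient_key")
claims = spark.table("claims_partitioned").repartition("patient_key")

# Pre-shuffling both sides on patient_key keeps the windowed join partition-local
joined = (
    ehr.alias("e")
    .join(
        claims.alias("c"),
        (F.col("e.patient_key") == F.col("c.patient_key"))
        & (F.col("e.encounter_year") == F.col("c.service_year"))
        & (F.abs(F.datediff(F.col("e.encounter_date"), F.col("c.service_date"))) <= 7),
        "left",
    )
    .select(
        "e.patient_key",
        "e.encounter_date",
        F.col("e.diagnosis_code").alias("ehr_diagnosis"),
        "c.service_date",
        F.col("c.diagnosis_code").alias("claims_diagnosis"),
    )
)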
Circle 6: The Data Quality Quagmire
Truth bomb: Both EHR and Claims data are garbage. Just in different ways.
EHR data problems:
Coding inconsistency (10 doctors, 10 ways to document diabetes)
Missing data (labs not entered, notes incomplete)
Copy-paste errors (residents copy previous notes)
Temporal issues (medication list not updated)
Claims data problems:
Upcoding (billing for more expensive procedure)
Bundling/unbundling games
Delayed submission (6+ months for some claims)
Denials/adjustments create duplicates
Real example from TennCare:
Patient X has diabetes. Here’s what we found:
EHR (Epic):
- Encounter 1: E11.9 (T2DM without complications)
- Encounter 2: E11.65 (T2DM with hyperglycemia)
- Encounter 3: E11.9
- Lab: HbA1c 7.2% (confirms diabetes)
- Medication: Metformin 500mg BID (confirms treatment)
Claims (TennCare):
- Claim 1: E11.9 (T2DM)
- Claim 2: E10.9 (T1DM) ← WRONG TYPE!
- Claim 3: E119 (missing decimal)
- Claim 4: 250.00 (old ICD-9 code) ← OUTDATED!
- Claim 5: [No diagnosis] (denied claim)
What’s the truth?
Our data quality framework:
from datetime import datetime

class DataQualityScorer:
    """
    Assign quality scores to each data element.
    Use for weighted analysis.
    """

    def score_diagnosis(self, record):
        """Score diagnosis reliability (0-100)"""
        score = 50  # baseline

        # Source credibility
        if record['source'] == 'ehr':
            if record['record_type'] == 'physician_note':
                score += 30  # High trust
            elif record['record_type'] == 'automated_problem_list':
                score += 10  # Medium trust
        elif record['source'] == 'claims':
            if record['claim_status'] == 'paid':
                score += 20  # Verified by payer
            elif record['claim_status'] == 'denied':
                score -= 30  # Suspect coding

        # Temporal recency
        days_old = (datetime.now() - record['recorded_date']).days
        if days_old <= 90:
            score += 10
        elif days_old > 365:
            score -= 10

        # Consistency across sources
        if record.get('confirmed_by_multiple_sources'):
            score += 20

        # Specificity (more specific = better quality)
        if '.' in record['icd_code']:  # has decimal
            score += 5

        # Clinical coherence (does it make sense?)
        if self.check_clinical_coherence(record):
            score += 10

        return max(0, min(100, score))  # clamp to 0-100

    def check_clinical_coherence(self, record):
        """Check if a diagnosis makes sense given context"""
        # Example: new Type 1 diabetes diagnosis after age 50 is suspect
        if record['icd_code'].startswith('E10'):  # T1DM
            if record.get('age') and record['age'] > 50:
                if not record.get('previous_t1dm_diagnosis'):
                    return False  # Likely coding error

        # Example: pregnancy diagnosis for a male patient
        if record['icd_code'].startswith('O'):  # Pregnancy codes
            if record.get('gender') == 'M':
                return False  # Definite error

        return True

# Usage: weight analyses by quality scores
quality_scorer = DataQualityScorer()

for record in patient_records:
    record['quality_score'] = quality_scorer.score_diagnosis(record)

# Then in analysis:
# Don't just count diagnoses - weight by quality
weighted_prevalence = sum(
    r['quality_score'] / 100.0
    for r in records
    if r['icd_code'] == 'E11.9'
) / len(records)
Key insight: Don’t try to make data perfect. Instead:
Score data quality
Use scores as weights in analysis
Flag low-quality data for manual review
Report confidence intervals based on quality (see the sketch below)
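For that last item, one way (not the only way) to put an interval around the quality-weighted prevalence is a simple bootstrap. A hedged sketch, where records is the scored list from the snippet above and numpy is assumed:

import numpy as np

def weighted_prevalence_ci(records, icd_code, n_boot=2000, seed=0):
    """Bootstrap a 95% CI for the quality-weighted prevalence of one code."""
    rng = np.random.default_rng(seed)
    weights = np.array([
        r['quality_score'] / 100.0 if r['icd_code'] == icd_code else 0.0
        for r in records
    ])
    estimates = [
        rng.choice(weights, size=len(weights), replace=True).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(estimates, [2.5, 97.5])

low, high = weighted_prevalence_ci(records, 'E11.9')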
Circle 7: The “Now What?” Problem
The scene: Month 18 of the VUMC project.
We finally had integrated data. Clean, validated, HIPAA-compliant.
Executive: “Great! Now show me... something.”
Me: “What do you want to see?”
Executive: “I don’t know. You’re the data person. Show me insights!”
The brutal truth: Integration is only 20% of the value. The other 80% is knowing what questions to ask.
What we built (the actual product):
# Population Health Dashboard
# Azure Function that runs daily

def generate_population_insights(patient_cohort):
    """Combine EHR + Claims to find actionable insights"""
    insights = []

    # Insight 1: High-risk patients (EHR + Claims)
    high_risk = find_high_risk_patients(
        ehr_data=get_recent_encounters(patient_cohort),
        claims_data=get_recent_utilization(patient_cohort)
    )
    insights.append({'type': 'high_risk', 'patients': high_risk})

    # Insight 2: Care gaps (EHR shows diagnosis, Claims show no treatment)
    care_gaps = find_care_gaps(
        ehr_diagnoses=get_chronic_conditions(patient_cohort),
        claims_prescriptions=get_filled_medications(patient_cohort)
    )
    insights.append({'type': 'care_gaps', 'gaps': care_gaps})

    # Insight 3: Unnecessary utilization (Claims show pattern, EHR shows alternative)
    unnecessary_er = find_unnecessary_er_visits(
        claims_er_visits=get_er_visits(patient_cohort),
        ehr_pcp_availability=get_pcp_schedule(patient_cohort)
    )
    insights.append({'type': 'unnecessary_er', 'visits': unnecessary_er})

    # Use GPT-4 to generate a plain-English summary
    summary = generate_executive_summary(
        high_risk_count=len(high_risk),
        care_gap_details=care_gaps,
        er_cost=sum(v['cost'] for v in unnecessary_er)
    )

    return {
        'insights': insights,
        'summary': summary,
        'actionable_patients': high_risk + care_gaps,
        'estimated_savings': calculate_roi(insights)
    }

def find_care_gaps(ehr_diagnoses, claims_prescriptions):
    """Example: patient has a diabetes diagnosis but no filled meds"""
    gaps = []
    for patient in ehr_diagnoses:
        if any(code.startswith('E11') for code in patient['icd_codes']):  # Has T2DM
            # Check if taking diabetes meds
            meds = [m for m in claims_prescriptions
                    if m['patient_key'] == patient['patient_key']
                    and m['drug_class'] in ['metformin', 'insulin', 'sglt2']]
            if not meds:
                gaps.append({
                    'patient_key': patient['patient_key'],
                    'condition': 'Type 2 Diabetes',
                    'gap': 'No diabetes medication filled in 90 days',
                    'risk': 'high',
                    'action': 'Contact patient to discuss medication adherence',
                    'estimated_cost': 5000  # Avg cost of a diabetes complication
                })
    return gaps
This is what executives wanted:
✅ “234 patients with diabetes not taking meds” (actionable list)
✅ “Estimated $1.2M in preventable complications” (ROI)
✅ “Auto-generated outreach list” (next steps)
Not:
❌ “Here’s a dashboard of integrated data” (so what?)
❌ “95% of records successfully linked” (who cares?)
The Honest Timeline (What Actually Happened)
Month 1-3: Legal & Compliance
BAA with Azure
Data governance policies
HIPAA compliance framework
Security architecture review
Month 4-7: Data Extraction
Epic data export (took 6 weeks to get IT approval)
Claims data export (relatively easy)
Building de-identification pipeline
Testing with synthetic data
Month 8-11: Matching Algorithm
Probabilistic matching logic
Manual review workflow
Validation with known matches
False positive/negative analysis
Month 12-15: Data Quality
Cleaning and standardization
Quality scoring framework
Reconciliation rules
Exception handling
Month 16-18: Product Development
Population health dashboard
Care gap identification
Risk stratification
Executive reporting
Total time: 18 months
Total cost: ~$800K (4 FTE + infrastructure)
The 3-Month Playbook (How to Do It Faster)
After doing this twice (TennCare and VUMC), here’s the optimized path:
Month 1: Foundation
Week 1:
Get legal approval (use my BAA template - see appendix)
Set up Azure environment with HIPAA controls
Deploy de-identification pipeline (use my code)
Week 2:
Extract EHR data to SFTP
Extract Claims data to SFTP
Validate data formats
Week 3:
Load to Azure Data Lake
Run de-identification
Create partitioned tables
Week 4:
Build patient spine
Implement matching algorithm (use my probabilistic matcher)
Validate on sample (1000 patients)
Month 2: Integration
Week 5-6:
Scale matching to full population
Manual review of probable matches
Build quality scoring framework
Week 7-8:
Create unified data model
Build integration pipelines
Automated testing
Month 3: Product
Week 9-10:
Build population health dashboard
Care gap analysis
Risk stratification
Week 11-12:
Executive reporting
User acceptance testing
Production deployment
Keys to success:
Don’t reinvent the wheel - use existing frameworks
Parallel workstreams - don’t do everything sequentially
Start with small cohort - validate before scaling
Focus on business value - build insights, not just pipelines
Your Assignment This Week
Assignment 1: Map Your Integration Requirements (2 hours)
Create a document with:
Data sources you need to integrate
Identifiers available in each source
Matching strategy (exact, probabilistic, manual?)
HIPAA compliance gaps
Expected record volumes
Business questions you’ll answer
Template:
Data Source 1: [EHR Name]
- Records: [X million]
- Identifiers: [MRN, SSN, Name, DOB]
- Refresh frequency: [Daily/Weekly]
- HIPAA status: [Compliant/Needs review]
Data Source 2: [Claims Source]
- Records: [X million]
- Identifiers: [Member ID, SSN, Name, DOB]
- Refresh frequency: [Monthly]
- HIPAA status: [Compliant]
Matching Strategy:
- Primary: SSN + DOB (if both present)
- Secondary: Probabilistic (Name + DOB + Zip)
- Manual review threshold: 60-80% confidence
Business Questions:
1. [Specific question requiring both sources]
2. [Another question]
Assignment 2: Test the De-identification Code (1 hour)
Take my HIPAADeIdentifier class (in the code section above)
Run it on sample data (use synthetic data if you don’t have real data)
Verify all 18 HIPAA identifiers are removed
Calculate what % of your data would be retained
Email me your results. I’m collecting data on de-identification effectiveness across different data types.
Assignment 3: Calculate Your Integration Cost (30 minutes)
Use this spreadsheet formula:
Labor Cost:
- Data Engineers: [X FTE] × [Months] × $15K/month
- Legal Review: [Y hours] × $300/hour
- Manual Review: [Z hours] × $50/hour
Infrastructure Cost:
- Azure Databricks: [Cluster size] × [Hours] × $0.50/DBU
- Storage: [TB] × $25/month
- Data Transfer: [GB] × $0.05/GB
Total: $[X]
Expected Savings: $[Y]
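If a spreadsheet is overkill, here is the same back-of-the-envelope math in Python. The rates are the placeholder figures from the template above, and every default argument is a made-up example; swap in your own numbers:

def integration_cost(
    data_engineer_fte=4, months=3,
    legal_hours=40, manual_review_hours=200,
    dbu_hours=500, storage_tb=10, transfer_gb=1000,
):
    """Rough integration cost estimate using the template's placeholder rates."""
    labor = (
        data_engineer_fte * months * 15_000   # Data engineers at $15K/month
        + legal_hours * 300                   # Legal review at $300/hour
        + manual_review_hours * 50            # Manual review at $50/hour
    )
    infrastructure = (
        dbu_hours * 0.50                      # Azure Databricks at $0.50/DBU
        + storage_tb * 25 * months            # Storage at $25/TB-month
        + transfer_gb * 0.05                  # Data transfer at $0.05/GB
    )
    return labor + infrastructure

print(f"Total: ${integration_cost():,.0f}")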
