Week 6: SQL Optimization for Billion-Row Healthcare Tables
How I turned a 24-hour query into a 3-minute query (without changing the database)
Welcome to Week 6.
Last week, you learned Python for healthcare analytics. This week, we’re diving into SQL optimization.
Why does this matter?
Because the vast majority of healthcare data lives in relational databases. And most SQL queries against healthcare data are criminally slow.
Not because the database is slow. Because the queries are bad.
I’m going to show you how to write SQL that doesn’t crash, doesn’t timeout, and actually finishes before your stakeholders give up and ask someone else.
The 24-Hour Query That Changed Everything
Scene: TennCare, 2013. My third week on the job.
Boss: “Can you pull a report on diabetes patients with ER visits in the last year?”
Me: “Sure, give me an hour.”
The query I wrote:
SELECT
m.member_id,
m.first_name,
m.last_name,
COUNT(c.claim_id) as er_visit_count,
SUM(c.allowed_amount) as total_er_cost
FROM members m
JOIN claims c ON m.member_id = c.member_id
JOIN diagnoses d ON c.claim_id = d.claim_id
WHERE d.diagnosis_code LIKE 'E11%' -- Type 2 Diabetes
AND c.place_of_service = 'ER'
AND c.service_date >= DATEADD(year, -1, GETDATE())
GROUP BY m.member_id, m.first_name, m.last_name
ORDER BY total_er_cost DESC;
Result: Query ran for 24+ hours. Then crashed.
The problem:
members table: 1.4M rows
claims table: 200M rows
diagnoses table: 500M rows
My query made the engine join 200M claims to 500M diagnosis rows before the filters could shrink anything — an intermediate result in the hundreds of millions of rows. No server finishes that politely.
The Optimized Version (3 Minutes)
After two weeks of learning SQL optimization the hard way, here’s what actually works:
-- Step 1: Create filtered claim subset (uses index on service_date)
WITH recent_er_claims AS (
SELECT
claim_id,
member_id,
allowed_amount,
service_date
FROM claims WITH (NOLOCK) -- Read uncommitted for reporting
WHERE place_of_service = 'ER'
AND service_date >= DATEADD(year, -1, GETDATE())
),
-- Step 2: Filter to diabetes diagnoses (uses index on diagnosis_code)
diabetes_claims AS (
SELECT DISTINCT
d.claim_id,
rec.member_id,
rec.allowed_amount
FROM recent_er_claims rec
INNER JOIN diagnoses d WITH (NOLOCK)
ON rec.claim_id = d.claim_id
WHERE d.diagnosis_code >= 'E11' -- More efficient than LIKE
AND d.diagnosis_code < 'E12'
),
-- Step 3: Aggregate (small dataset now)
member_summary AS (
SELECT
member_id,
COUNT(DISTINCT claim_id) as er_visit_count,
SUM(allowed_amount) as total_er_cost
FROM diabetes_claims
GROUP BY member_id
)
-- Step 4: Join to members (only for patients in summary)
SELECT
m.member_id,
m.first_name,
m.last_name,
ms.er_visit_count,
ms.total_er_cost
FROM member_summary ms
INNER JOIN members m WITH (NOLOCK)
ON ms.member_id = m.member_id
ORDER BY ms.total_er_cost DESC;
Result: 3 minutes, 2,847 rows returned.
What changed:
✅ Filter early (reduce dataset before joins)
✅ Use CTEs (Common Table Expressions) for readability
✅ Use range queries instead of LIKE
✅ Add NOLOCK hints (acceptable for read-only reporting — see Rule 7 for the risks)
✅ Join on indexed columns only
The 7 Rules of Healthcare SQL Optimization
Rule 1: Filter Early, Join Late
Bad (joins 200M rows, then filters):
SELECT *
FROM claims c
JOIN diagnoses d ON c.claim_id = d.claim_id
WHERE c.service_date >= '2023-01-01'
AND d.diagnosis_code LIKE 'E11%';
Good (filters first, then joins small datasets):
WITH filtered_claims AS (
SELECT claim_id, member_id
FROM claims
WHERE service_date >= '2023-01-01'
),
filtered_diagnoses AS (
SELECT claim_id
FROM diagnoses
WHERE diagnosis_code >= 'E11' AND diagnosis_code < 'E12'
)
SELECT c.*, d.*
FROM filtered_claims c
JOIN filtered_diagnoses d ON c.claim_id = d.claim_id;
Performance:
Bad query: 45 minutes
Good query: 2 minutes
22.5x speedup
Rule 2: Avoid LIKE with Leading Wildcards
Bad (can’t use index):
WHERE diagnosis_code LIKE '%E11%' -- Full table scan
Less bad (can use index, but still slow):
WHERE diagnosis_code LIKE 'E11%' -- Index range scan
Best (uses index efficiently):
WHERE diagnosis_code >= 'E11'
AND diagnosis_code < 'E12' -- Index seek
Why this matters:
-- Real example from VUMC
-- Query: Find all diabetes diagnoses
-- Version 1: LIKE with wildcard
SELECT COUNT(*)
FROM diagnoses
WHERE diagnosis_code LIKE '%E11%';
-- Execution time: 45 seconds
-- Reads: 500M rows (full table scan)
-- Version 2: LIKE without leading wildcard
SELECT COUNT(*)
FROM diagnoses
WHERE diagnosis_code LIKE 'E11%';
-- Execution time: 8 seconds
-- Reads: 12M rows (index range scan)
-- Version 3: Range query
SELECT COUNT(*)
FROM diagnoses
WHERE diagnosis_code >= 'E11' AND diagnosis_code < 'E12';
-- Execution time: 2 seconds
-- Reads: 12M rows (index seek)
Rule 3: Use Covering Indexes
The problem: Every row lookup requires reading from disk (slow).
The solution: Put all needed columns in the index (read from index only).
Example from TennCare:
-- Original query (slow)
SELECT member_id, service_date, allowed_amount
FROM claims
WHERE service_date >= '2023-01-01'
AND place_of_service = 'ER';
-- Execution plan shows:
-- 1. Index Seek on service_date (fast)
-- 2. Key Lookup for allowed_amount (slow - 200M disk reads)
-- Total: 12 minutes
The fix:
-- Create covering index
CREATE NONCLUSTERED INDEX IX_Claims_ServiceDate_ER
ON claims (service_date, place_of_service)
INCLUDE (member_id, allowed_amount);
-- Same query now:
-- 1. Index Seek on service_date (fast)
-- 2. All columns in index (no lookup needed)
-- Total: 45 seconds
-- 16x speedup
My covering index strategy for healthcare:
-- Claims table (200M rows)
CREATE NONCLUSTERED INDEX IX_Claims_ServiceDate_Coverage
ON claims (service_date, place_of_service)
INCLUDE (member_id, claim_id, allowed_amount, diagnosis_code);
CREATE NONCLUSTERED INDEX IX_Claims_MemberId_Coverage
ON claims (member_id, service_date)
INCLUDE (claim_id, allowed_amount, place_of_service);
-- Diagnoses table (500M rows)
CREATE NONCLUSTERED INDEX IX_Diagnoses_Code_Coverage
ON diagnoses (diagnosis_code)
INCLUDE (claim_id, diagnosis_sequence);
-- Members table (1.4M rows - smaller, less critical)
CREATE NONCLUSTERED INDEX IX_Members_Id
ON members (member_id)
INCLUDE (first_name, last_name, date_of_birth);
Storage cost: ~15GB additional space per index
Query speedup: 10-50x on average
ROI: Absolutely worth it
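Not sure where to start? SQL Server keeps track of indexes it wishes it had. Here's a sketch against the missing-index DMVs — treat the output as hints rather than commands, since the suggestions overlap each other and ignore write cost:
-- Candidate covering indexes, ranked by estimated impact
SELECT
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,   -- what INCLUDE (...) should contain
    migs.user_seeks,
    migs.avg_user_impact    -- estimated % improvement
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
    ON mid.index_handle = mig.index_handle
JOIN sys.dm_db_missing_index_group_stats migs
    ON mig.index_group_handle = migs.group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;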
Rule 4: Partition Large Tables
At VUMC, we partition claims by year:
-- Create partition function
CREATE PARTITION FUNCTION PF_Claims_Year (DATE)
AS RANGE RIGHT FOR VALUES
('2019-01-01', '2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01');
-- Create partition scheme
CREATE PARTITION SCHEME PS_Claims_Year
AS PARTITION PF_Claims_Year
ALL TO ([PRIMARY]);
-- Create partitioned table
CREATE TABLE claims_partitioned (
claim_id BIGINT,
member_id VARCHAR(20),
service_date DATE,
allowed_amount DECIMAL(10,2),
-- other columns...
CONSTRAINT PK_Claims_Partitioned
PRIMARY KEY (claim_id, service_date)
) ON PS_Claims_Year(service_date);
Query optimization with partitions:
-- Without partitioning:
-- Scans all 200M rows
SELECT *
FROM claims
WHERE service_date >= '2024-01-01';
-- Execution time: 45 seconds
-- With partitioning:
-- Only scans 2024 partition (35M rows)
SELECT *
FROM claims_partitioned
WHERE service_date >= '2024-01-01';
-- Execution time: 8 seconds
-- 5.6x speedup
Partition strategy for healthcare:
Claims: Partition by service_date (yearly)
Encounters: Partition by encounter_date (yearly)
Lab results: Partition by result_date (quarterly)
Prescriptions: Partition by fill_date (yearly)
Benefits:
✅ Query pruning (only scan relevant partitions)
✅ Easier maintenance (archive old partitions)
✅ Parallel processing (query multiple partitions simultaneously)
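Two sanity checks I'd run after partitioning (a sketch against the claims_partitioned table above): ask which partition a given date lands in, and confirm the per-partition row counts look right.
-- Which partition does a date fall into?
SELECT $PARTITION.PF_Claims_Year('2024-06-15') AS partition_number;
-- Row counts per partition (heap or clustered index only)
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('claims_partitioned')
    AND index_id IN (0, 1)
ORDER BY partition_number;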
Rule 5: Use Appropriate JOIN Types
Healthcare data is full of optional relationships. Choose wisely.
-- Example: Patient medications
-- INNER JOIN (only patients with prescriptions)
SELECT m.member_id, p.drug_name
FROM members m
INNER JOIN prescriptions p ON m.member_id = p.member_id;
-- Returns: 234,521 patients (who have Rx)
-- LEFT JOIN (all patients, even without prescriptions)
SELECT m.member_id, p.drug_name
FROM members m
LEFT JOIN prescriptions p ON m.member_id = p.member_id;
-- Returns: 1,400,000 patients (many with NULL drug_name)
-- EXISTS (just check if prescription exists)
SELECT m.member_id
FROM members m
WHERE EXISTS (
SELECT 1
FROM prescriptions p
WHERE p.member_id = m.member_id
);
-- Returns: 234,521 patients (same as INNER JOIN, but faster)
Real example from VUMC:
-- Task: Find patients WITHOUT primary care visit in last year
-- BAD: Using LEFT JOIN + NULL check (slow)
SELECT m.member_id
FROM members m
LEFT JOIN encounters e
ON m.member_id = e.member_id
AND e.encounter_type = 'PRIMARY_CARE'
AND e.encounter_date >= DATEADD(year, -1, GETDATE())
WHERE e.encounter_id IS NULL;
-- Execution time: 8 minutes
-- Reason: Must process ALL encounters (50M rows)
-- GOOD: Using NOT EXISTS (fast)
SELECT m.member_id
FROM members m
WHERE NOT EXISTS (
SELECT 1
FROM encounters e
WHERE e.member_id = m.member_id
AND e.encounter_type = 'PRIMARY_CARE'
AND e.encounter_date >= DATEADD(year, -1, GETDATE())
);
-- Execution time: 45 seconds
-- Reason: Stops searching once match found (early exit)
-- 10.7x speedup
Rule 6: Avoid Functions on Indexed Columns
Bad (can’t use index):
WHERE YEAR(service_date) = 2024 -- Function on column
Good (can use index):
WHERE service_date >= '2024-01-01'
AND service_date < '2025-01-01' -- Column in raw form
Real examples from healthcare:
-- Finding patients by age
-- ❌ BAD: Function on date_of_birth
SELECT member_id
FROM members
WHERE DATEDIFF(year, date_of_birth, GETDATE()) >= 65;
-- Index on date_of_birth CANNOT be used
-- Execution time: 5 seconds (1.4M full table scan)
-- ✅ GOOD: Calculate cutoff date
SELECT member_id
FROM members
WHERE date_of_birth <= DATEADD(year, -65, GETDATE());
-- Index on date_of_birth CAN be used
-- Execution time: 0.2 seconds (index seek)
-- 25x speedup
More examples:
-- Finding upper/lowercase matches
-- ❌ BAD
WHERE UPPER(last_name) = 'SMITH' -- Can't use index
-- ✅ GOOD (if using case-insensitive collation)
WHERE last_name = 'Smith' -- Uses index
-- Or create computed column with index
ALTER TABLE members
ADD last_name_upper AS UPPER(last_name) PERSISTED;
CREATE INDEX IX_Members_LastNameUpper
ON members (last_name_upper);
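With the persisted computed column in place, the lookup becomes a plain index seek:
SELECT member_id
FROM members
WHERE last_name_upper = 'SMITH'; -- Index seek on IX_Members_LastNameUpper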
Rule 7: Use Query Hints Strategically
NOLOCK (read uncommitted):
-- For reporting queries (can tolerate dirty reads)
SELECT COUNT(*)
FROM claims WITH (NOLOCK)
WHERE service_date >= '2024-01-01';
-- Benefits:
-- ✅ No locks acquired (doesn't block writers)
-- ✅ Faster (no lock overhead)
-- ✅ Better concurrency
-- Risks:
-- ⚠️ May read uncommitted data
-- ⚠️ May count rows twice or skip rows
-- ⚠️ ONLY use for non-critical reporting
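Aside: if dirty reads make you (or your DBA) nervous, read-committed snapshot isolation gives you non-blocking reads without NOLOCK's anomalies. It's a database-level switch, so it needs buy-in — a sketch, using a placeholder database name:
-- One-time setting; needs a moment with no other active connections
ALTER DATABASE YourReportingDB SET READ_COMMITTED_SNAPSHOT ON;
-- After this, plain SELECTs read a consistent snapshot
-- and don't block writers -- no WITH (NOLOCK) needed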
MAXDOP (limit parallelism):
-- Limit to 4 cores (prevent query from hogging all resources)
SELECT *
FROM claims
WHERE service_date >= '2024-01-01'
OPTION (MAXDOP 4);
-- When to use:
-- ✓ Shared database (other users need resources)
-- ✓ Query is low-priority
-- ✓ Excessive parallelism causing slowdown
RECOMPILE (force new execution plan):
-- For queries with variable parameters
DECLARE @start_date DATE = '2024-01-01';
SELECT *
FROM claims
WHERE service_date >= @start_date
OPTION (RECOMPILE);
-- When to use:
-- ✓ Parameters vary widely (date ranges, etc)
-- ✓ Cached plan is suboptimal
-- ✓ Query runs infrequently (recompile cost is acceptable)
My hint strategy at VUMC:
-- Nightly batch reports (long-running, non-critical)
SELECT ...
WITH (NOLOCK)
OPTION (MAXDOP 4);
-- Ad-hoc analysis (parameter sniffing issues)
SELECT ...
OPTION (RECOMPILE);
-- Critical operational queries (need latest data, need speed)
SELECT ...
-- No hints (use default locking, full parallelism)
Real-World Optimization Case Study
Business question: “What’s the average cost per patient for diabetes care, broken down by provider?”
Naive query (never finishes):
SELECT
p.provider_name,
COUNT(DISTINCT c.member_id) as patient_count,
AVG(c.allowed_amount) as avg_cost_per_claim,
SUM(c.allowed_amount) / COUNT(DISTINCT c.member_id) as avg_cost_per_patient
FROM claims c
JOIN providers p ON c.provider_id = p.provider_id
JOIN diagnoses d ON c.claim_id = d.claim_id
WHERE d.diagnosis_code LIKE 'E11%'
AND c.service_date >= '2023-01-01'
GROUP BY p.provider_name
ORDER BY avg_cost_per_patient DESC;
-- Problems:
-- 1. Joins before filtering (200M × 500M rows)
-- 2. LIKE with wildcard (can't use index efficiently)
-- 3. No covering indexes
-- 4. GROUP BY on provider_name (string comparison is slow)
-- 5. Complex aggregation on massive dataset
-- Estimated execution time: Never completes (killed after 6 hours)
Optimized version (2 minutes):
-- Step 1: Filter diagnoses early
WITH diabetes_claims AS (
SELECT DISTINCT claim_id
FROM diagnoses WITH (NOLOCK)
WHERE diagnosis_code >= 'E11'
AND diagnosis_code < 'E12'
),
-- Step 2: Filter claims early + aggregate
claims_summary AS (
SELECT
c.provider_id,
c.member_id,
SUM(c.allowed_amount) as total_cost
FROM claims c WITH (NOLOCK)
INNER JOIN diabetes_claims dc
ON c.claim_id = dc.claim_id
WHERE c.service_date >= '2023-01-01'
GROUP BY c.provider_id, c.member_id
),
-- Step 3: Provider-level aggregation
provider_stats AS (
SELECT
provider_id,
COUNT(DISTINCT member_id) as patient_count,
SUM(total_cost) as total_cost,
AVG(total_cost) as avg_cost_per_patient
FROM claims_summary
GROUP BY provider_id
HAVING COUNT(DISTINCT member_id) >= 10 -- Filter out low-volume providers
)
-- Step 4: Join to provider names (small dataset now)
SELECT
p.provider_name,
ps.patient_count,
ps.avg_cost_per_patient,
ps.total_cost
FROM provider_stats ps
INNER JOIN providers p WITH (NOLOCK)
ON ps.provider_id = p.provider_id
WHERE ps.patient_count >= 10
ORDER BY ps.avg_cost_per_patient DESC
OPTION (MAXDOP 8);
-- Execution time: 2 minutes
-- Rows returned: 3,847 providers
What I changed:
✅ Filter first - reduced diagnoses from 500M → 12M rows
✅ Use range query - >= 'E11' AND < 'E12' instead of LIKE 'E11%'
✅ Aggregate early - summary by member before provider aggregation
✅ GROUP BY provider_id (integer) instead of provider_name (string)
✅ Filter low-volume - HAVING clause eliminates noise
✅ Join last - only 3,847 providers, not 200M claims
✅ Add NOLOCK - safe for reporting query
✅ Add MAXDOP - limit resource usage
Performance improvement: 180x speedup (6 hours → 2 minutes)
The Execution Plan is Your Friend
How to read execution plans in SQL Server:
-- Enable actual execution plan
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- Run your query
SELECT ...
-- Check output:
-- SQL Server parse and compile time: 32 ms
-- SQL Server Execution Time: 145,234 ms (2.4 minutes)
-- Table 'claims'. Scan count 1, logical reads 12,456,789
What to look for:
🚨 Red Flags:
Table Scan (should be Index Seek)
|--Table Scan
Object: [dbo].[claims]
Rows: 200,000,000 ⚠️ Reading entire table!
High cost operation
|--Hash Match (Aggregate) Cost: 87% ⚠️ This operation dominates
Key Lookup (missing covering index)
|--Nested Loops
|--Index Seek Cost: 2%
|--Key Lookup Cost: 98% ⚠️ Expensive lookups
Sort (should use index for ORDER BY)
|--Sort Cost: 45% ⚠️ Sorting 10M rows in memory
✅ Good Signs:
Index Seek (using index efficiently)
|--Index Seek
Object: [IX_Claims_ServiceDate]
Rows: 234,521 ✓ Small subset
Low-cost joins
|--Nested Loops Cost: 5% ✓ Efficient
Parallelism (using multiple cores)
|--Parallelism (Gather Streams) ✓ Using 8 threads
My Pre-Flight Checklist
Before running any query on production:
-- 1. Estimate row count
SELECT COUNT_BIG(*)
FROM claims WITH (NOLOCK)
WHERE service_date >= '2024-01-01';
-- If >10M rows, optimize before running
-- 2. Check if indexes exist
EXEC sp_helpindex 'claims';
-- Verify indexes on all WHERE/JOIN columns
-- 3. Test on small date range first
SELECT ...
WHERE service_date >= '2024-12-01' -- 1 month only
AND service_date < '2025-01-01';
-- Execution time: 5 seconds
-- Extrapolate: 5 sec × 12 months = 1 minute (acceptable)
-- 4. Check execution plan
SET SHOWPLAN_XML ON;
SELECT ...;
SET SHOWPLAN_XML OFF;
-- Returns the estimated plan without executing the query
-- (run each SET in its own batch)
-- Look for table scans, missing indexes
-- 5. Set timeout
SET LOCK_TIMEOUT 300000; -- 5 minutes
-- Fails fast if the query is stuck waiting on a lock
-- (doesn't cap total runtime -- only lock waits)
Common Healthcare SQL Patterns
Pattern 1: Gap in Care Analysis
-- Find patients with diabetes but no HbA1c test in last year
WITH diabetes_patients AS (
SELECT DISTINCT member_id
FROM diagnoses d
JOIN claims c ON d.claim_id = c.claim_id
WHERE d.diagnosis_code >= 'E11' AND d.diagnosis_code < 'E12'
AND c.service_date >= DATEADD(year, -2, GETDATE())
),
recent_hba1c AS (
SELECT DISTINCT member_id
FROM lab_results
WHERE lab_test_code = 'HBA1C'
AND result_date >= DATEADD(year, -1, GETDATE())
)
SELECT dp.member_id
FROM diabetes_patients dp
WHERE NOT EXISTS (
SELECT 1
FROM recent_hba1c r
WHERE r.member_id = dp.member_id
);
Pattern 2: High Utilizer Identification
-- Top 1% of patients by cost
WITH patient_costs AS (
SELECT
member_id,
SUM(allowed_amount) as total_cost,
COUNT(DISTINCT claim_id) as claim_count
FROM claims WITH (NOLOCK)
WHERE service_date >= DATEADD(year, -1, GETDATE())
GROUP BY member_id
),
percentile_cutoff AS (
SELECT DISTINCT -- the window form returns one value per row; collapse to a single row
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_cost)
OVER () as p99_cost
FROM patient_costs
)
SELECT
pc.member_id,
pc.total_cost,
pc.claim_count
FROM patient_costs pc
CROSS JOIN percentile_cutoff pct
WHERE pc.total_cost >= pct.p99_cost
ORDER BY pc.total_cost DESC;
Pattern 3: Chronic Condition Prevalence
-- Count patients with multiple chronic conditions
WITH chronic_conditions AS (
SELECT DISTINCT
c.member_id,
CASE
WHEN d.diagnosis_code >= 'E11' AND d.diagnosis_code < 'E12' THEN 'Diabetes'
WHEN d.diagnosis_code >= 'I10' AND d.diagnosis_code < 'I16' THEN 'Hypertension'
WHEN d.diagnosis_code >= 'I50' AND d.diagnosis_code < 'I51' THEN 'Heart Failure'
WHEN d.diagnosis_code >= 'J44' AND d.diagnosis_code < 'J45' THEN 'COPD'
WHEN d.diagnosis_code >= 'N18' AND d.diagnosis_code < 'N19' THEN 'CKD'
END as condition
FROM claims c WITH (NOLOCK)
JOIN diagnoses d WITH (NOLOCK) ON c.claim_id = d.claim_id
WHERE c.service_date >= DATEADD(year, -2, GETDATE())
AND d.diagnosis_code IN (
-- Use specific codes for better performance
SELECT diagnosis_code FROM chronic_condition_codes
)
),
condition_counts AS (
SELECT
member_id,
COUNT(DISTINCT condition) as condition_count
FROM chronic_conditions
WHERE condition IS NOT NULL
GROUP BY member_id
)
SELECT
condition_count,
COUNT(*) as patient_count,
CAST(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS DECIMAL(5,2)) as percentage
FROM condition_counts
GROUP BY condition_count
ORDER BY condition_count;
Your Assignment This Week
Assignment 1: Optimize a Slow Query (2 hours)
Take a query that takes >1 minute to run (the plan-cache sketch after this assignment is one way to find candidates).
Capture the execution plan
Identify the bottleneck (table scan? missing index? bad join order?)
Apply one optimization technique from this module
Measure the improvement
Email me:
Original query + execution time
Execution plan screenshot (before)
Optimized query + new execution time
What you changed and why
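Don't have a slow query handy? The plan cache remembers. A sketch against SQL Server's query-stats DMVs (stats reset when plans are evicted, so treat the numbers as rough):
-- Top 10 queries by average elapsed time
SELECT TOP 10
    qs.total_elapsed_time / qs.execution_count / 1000 AS avg_ms,
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS query_start
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_elapsed_time / qs.execution_count DESC;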
Assignment 2: Create a Covering Index (1 hour)
Find a query you run frequently that does Key Lookups.
Identify the columns needed
Create a covering index
Measure before/after performance
Template:
CREATE NONCLUSTERED INDEX IX_[Table]_[Column]_Coverage
ON [table] ([key_columns])
INCLUDE ([non_key_columns]);
Assignment 3: Partition Strategy (30 minutes)
For your largest table:
Decide on partition column (usually date)
Design partition boundaries (yearly? quarterly?)
Estimate storage per partition (see the row-count sketch after this list)
Write partition creation script (don’t execute yet, just plan)
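For the row-count estimate, a rough sketch. This does a full scan, which is fine for a one-off planning query — the YEAR() warning from Rule 6 applies to WHERE clauses, not aggregation:
-- Rows per candidate yearly partition
SELECT
    YEAR(service_date) AS partition_year,
    COUNT_BIG(*) AS row_count
FROM claims WITH (NOLOCK)
GROUP BY YEAR(service_date)
ORDER BY partition_year;
Multiply rows by your average row size (sp_spaceused on the table gives you the totals to derive it) to estimate storage per partition.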
Next Week: Module 7
“Power BI vs Tableau: Medical Dashboards That Executives Actually Use”
You now know how to query data fast. Next week, I’ll show you how to visualize it in ways that drive decisions.
See you Tuesday.
Appendix: Index Strategy Template
-- Claims table optimization (200M rows)
-- Primary scenarios: date range queries, member lookups, cost analysis
-- 1. Service date queries (most common)
CREATE NONCLUSTERED INDEX IX_Claims_ServiceDate_Coverage
ON claims (service_date, place_of_service)
INCLUDE (member_id, claim_id, allowed_amount, provider_id)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- 2. Member history lookups
CREATE NONCLUSTERED INDEX IX_Claims_Member_Coverage
ON claims (member_id, service_date DESC)
INCLUDE (claim_id, allowed_amount, diagnosis_code, place_of_service)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- 3. Provider analysis
CREATE NONCLUSTERED INDEX IX_Claims_Provider_Coverage
ON claims (provider_id, service_date)
INCLUDE (member_id, allowed_amount, claim_id)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- 4. Cost analysis
CREATE NONCLUSTERED INDEX IX_Claims_Cost_Analysis
ON claims (service_date, allowed_amount DESC)
INCLUDE (member_id, provider_id, diagnosis_code)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- Maintenance: Rebuild monthly
ALTER INDEX ALL ON claims REBUILD
WITH (ONLINE = ON, FILLFACTOR = 95);
-- Monitor fragmentation
SELECT
index_name = i.name,
fragmentation_percent = s.avg_fragmentation_in_percent,
page_count = s.page_count
FROM sys.dm_db_index_physical_stats(
DB_ID(),
OBJECT_ID('claims'),
NULL,
NULL,
'LIMITED'
) s
JOIN sys.indexes i ON s.object_id = i.object_id
AND s.index_id = i.index_id
WHERE s.avg_fragmentation_in_percent > 10
ORDER BY s.avg_fragmentation_in_percent DESC;
P.S. — The Most Important Lesson
Perfect optimization doesn’t exist.
At some point, you hit diminishing returns. Going from 10 minutes → 3 minutes is worth it. Going from 3 minutes → 2.5 minutes is not.
Know when to stop:
Query runs in <5 minutes? Good enough.
Query runs hourly in a batch? <30 minutes is fine.
Query is ad-hoc? <10 minutes is acceptable.
Focus optimization effort on:
Queries that run frequently (1000x/day)
Queries that block other users
Queries in user-facing applications
Everything else? Ship it and move on.
Chad
