Week 6: SQL Optimization for Billion-Row Healthcare Tables
How I turned a 24-hour query into a 3-minute query (without changing the database)
Welcome to Week 6.
Last week, you learned Python for healthcare analytics. This week, we’re diving into SQL optimization.
Why does this matter?
Because the vast majority of healthcare data lives in relational databases. And most SQL queries against healthcare data are criminally slow.
Not because the database is slow. Because the queries are bad.
I’m going to show you how to write SQL that doesn’t crash, doesn’t timeout, and actually finishes before your stakeholders give up and ask someone else.
The 24-Hour Query That Changed Everything
Scene: TennCare, 2013. My third week on the job.
Boss: “Can you pull a report on diabetes patients with ER visits in the last year?”
Me: “Sure, give me an hour.”
The query I wrote:
SELECT
m.member_id,
m.first_name,
m.last_name,
COUNT(c.claim_id) as er_visit_count,
SUM(c.allowed_amount) as total_er_cost
FROM members m
JOIN claims c ON m.member_id = c.member_id
JOIN diagnoses d ON c.claim_id = d.claim_id
WHERE d.diagnosis_code LIKE 'E11%' -- Type 2 Diabetes
AND c.place_of_service = 'ER'
AND c.service_date >= DATEADD(year, -1, GETDATE())
GROUP BY m.member_id, m.first_name, m.last_name
ORDER BY total_er_cost DESC;
Result: Query ran for 24+ hours. Then crashed.
The problem:
members table: 1.4M rows
claims table: 200M rows
diagnoses table: 500M rows
My query made the engine join 200M claims to 500M diagnosis rows before the filters could shrink anything — an intermediate result in the hundreds of millions of rows. No server finishes that politely.
The Optimized Version (3 Minutes)
After two weeks of learning SQL optimization the hard way, here’s what actually works:
-- Step 1: Create filtered claim subset (uses index on service_date)
WITH recent_er_claims AS (
SELECT
claim_id,
member_id,
allowed_amount,
service_date
FROM claims WITH (NOLOCK) -- Read uncommitted for reporting
WHERE place_of_service = 'ER'
AND service_date >= DATEADD(year, -1, GETDATE())
),
-- Step 2: Filter to diabetes diagnoses (uses index on diagnosis_code)
diabetes_claims AS (
SELECT DISTINCT
d.claim_id,
rec.member_id,
rec.allowed_amount
FROM recent_er_claims rec
INNER JOIN diagnoses d WITH (NOLOCK)
ON rec.claim_id = d.claim_id
WHERE d.diagnosis_code >= 'E11' -- More efficient than LIKE
AND d.diagnosis_code < 'E12'
),
-- Step 3: Aggregate (small dataset now)
member_summary AS (
SELECT
member_id,
COUNT(DISTINCT claim_id) as er_visit_count,
SUM(allowed_amount) as total_er_cost
FROM diabetes_claims
GROUP BY member_id
)
-- Step 4: Join to members (only for patients in summary)
SELECT
m.member_id,
m.first_name,
m.last_name,
ms.er_visit_count,
ms.total_er_cost
FROM member_summary ms
INNER JOIN members m WITH (NOLOCK)
ON ms.member_id = m.member_id
ORDER BY ms.total_er_cost DESC;
Result: 3 minutes, 2,847 rows returned.
What changed:
✅ Filter early (reduce dataset before joins)
✅ Use CTEs (Common Table Expressions) for readability
✅ Use range queries instead of LIKE
✅ Add NOLOCK hints (acceptable for read-only reporting — see Rule 7 for the risks)
✅ Join on indexed columns only
The 7 Rules of Healthcare SQL Optimization
Rule 1: Filter Early, Join Late
Bad (joins 200M rows, then filters):
SELECT *
FROM claims c
JOIN diagnoses d ON c.claim_id = d.claim_id
WHERE c.service_date >= '2023-01-01'
AND d.diagnosis_code LIKE 'E11%';
Good (filters first, then joins small datasets):
WITH filtered_claims AS (
SELECT claim_id, member_id
FROM claims
WHERE service_date >= '2023-01-01'
),
filtered_diagnoses AS (
SELECT claim_id
FROM diagnoses
WHERE diagnosis_code >= 'E11' AND diagnosis_code < 'E12'
)
SELECT c.*, d.*
FROM filtered_claims c
JOIN filtered_diagnoses d ON c.claim_id = d.claim_id;
Performance:
Bad query: 45 minutes
Good query: 2 minutes
22.5x speedup
Rule 2: Avoid LIKE with Leading Wildcards
Bad (can’t use index):
WHERE diagnosis_code LIKE '%E11%' -- Full table scan
Less bad (can use index, but still slow):
WHERE diagnosis_code LIKE 'E11%' -- Index range scan
Best (uses index efficiently):
WHERE diagnosis_code >= 'E11'
AND diagnosis_code < 'E12' -- Index seek
Why this matters:
-- Real example from VUMC
-- Query: Find all diabetes diagnoses
-- Version 1: LIKE with wildcard
SELECT COUNT(*)
FROM diagnoses
WHERE diagnosis_code LIKE '%E11%';
-- Execution time: 45 seconds
-- Reads: 500M rows (full table scan)
-- Version 2: LIKE without leading wildcard
SELECT COUNT(*)
FROM diagnoses
WHERE diagnosis_code LIKE 'E11%';
-- Execution time: 8 seconds
-- Reads: 12M rows (index range scan)
-- Version 3: Range query
SELECT COUNT(*)
FROM diagnoses
WHERE diagnosis_code >= 'E11' AND diagnosis_code < 'E12';
-- Execution time: 2 seconds
-- Reads: 12M rows (index seek)
Rule 3: Use Covering Indexes
The problem: Every row lookup requires reading from disk (slow).
The solution: Put all needed columns in the index (read from index only).
Example from TennCare:
-- Original query (slow)
SELECT member_id, service_date, allowed_amount
FROM claims
WHERE service_date >= '2023-01-01'
AND place_of_service = 'ER';
-- Execution plan shows:
-- 1. Index Seek on service_date (fast)
-- 2. Key Lookup for allowed_amount (slow - 200M disk reads)
-- Total: 12 minutes
The fix:
-- Create covering index
CREATE NONCLUSTERED INDEX IX_Claims_ServiceDate_ER
ON claims (service_date, place_of_service)
INCLUDE (member_id, allowed_amount);
-- Same query now:
-- 1. Index Seek on service_date (fast)
-- 2. All columns in index (no lookup needed)
-- Total: 45 seconds
-- 16x speedup
My covering index strategy for healthcare:
-- Claims table (200M rows)
CREATE NONCLUSTERED INDEX IX_Claims_ServiceDate_Coverage
ON claims (service_date, place_of_service)
INCLUDE (member_id, claim_id, allowed_amount, diagnosis_code);
CREATE NONCLUSTERED INDEX IX_Claims_MemberId_Coverage
ON claims (member_id, service_date)
INCLUDE (claim_id, allowed_amount, place_of_service);
-- Diagnoses table (500M rows)
CREATE NONCLUSTERED INDEX IX_Diagnoses_Code_Coverage
ON diagnoses (diagnosis_code)
INCLUDE (claim_id, diagnosis_sequence);
-- Members table (1.4M rows - smaller, less critical)
CREATE NONCLUSTERED INDEX IX_Members_Id
ON members (member_id)
INCLUDE (first_name, last_name, date_of_birth);
Storage cost: ~15GB additional space per index
Query speedup: 10-50x on average
ROI: Absolutely worth it
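Not sure where to start? SQL Server keeps track of indexes it wishes it had. Here's a sketch against the missing-index DMVs — treat the output as hints rather than commands, since the suggestions overlap each other and ignore write cost:
-- Candidate covering indexes, ranked by estimated impact
SELECT
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,   -- what INCLUDE (...) should contain
    migs.user_seeks,
    migs.avg_user_impact    -- estimated % improvement
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
    ON mid.index_handle = mig.index_handle
JOIN sys.dm_db_missing_index_group_stats migs
    ON mig.index_group_handle = migs.group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;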
Rule 4: Partition Large Tables
At VUMC, we partition claims by year:
-- Create partition function
CREATE PARTITION FUNCTION PF_Claims_Year (DATE)
AS RANGE RIGHT FOR VALUES
('2019-01-01', '2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01');
-- Create partition scheme
CREATE PARTITION SCHEME PS_Claims_Year
AS PARTITION PF_Claims_Year
ALL TO ([PRIMARY]);
-- Create partitioned table
CREATE TABLE claims_partitioned (
claim_id BIGINT,
member_id VARCHAR(20),
service_date DATE,
allowed_amount DECIMAL(10,2),
-- other columns...
CONSTRAINT PK_Claims_Partitioned
PRIMARY KEY (claim_id, service_date)
) ON PS_Claims_Year(service_date);
Query optimization with partitions:
-- Without partitioning:
-- Scans all 200M rows
SELECT *
FROM claims
WHERE service_date >= '2024-01-01';
-- Execution time: 45 seconds
-- With partitioning:
-- Only scans 2024 partition (35M rows)
SELECT *
FROM claims_partitioned
WHERE service_date >= '2024-01-01';
-- Execution time: 8 seconds
-- 5.6x speedup
Partition strategy for healthcare:
Claims: Partition by service_date (yearly)
Encounters: Partition by encounter_date (yearly)
Lab results: Partition by result_date (quarterly)
Prescriptions: Partition by fill_date (yearly)
Benefits:
✅ Query pruning (only scan relevant partitions)
✅ Easier maintenance (archive old partitions)
✅ Parallel processing (query multiple partitions simultaneously)
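Two sanity checks I'd run after partitioning (a sketch against the claims_partitioned table above): ask which partition a given date lands in, and confirm the per-partition row counts look right.
-- Which partition does a date fall into?
SELECT $PARTITION.PF_Claims_Year('2024-06-15') AS partition_number;
-- Row counts per partition (heap or clustered index only)
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('claims_partitioned')
    AND index_id IN (0, 1)
ORDER BY partition_number;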
Rule 5: Use Appropriate JOIN Types
Healthcare data is full of optional relationships. Choose wisely.
-- Example: Patient medications
-- INNER JOIN (only patients with prescriptions)
SELECT m.member_id, p.drug_name
FROM members m
INNER JOIN prescriptions p ON m.member_id = p.member_id;
-- Returns: 234,521 patients (who have Rx)
-- LEFT JOIN (all patients, even without prescriptions)
SELECT m.member_id, p.drug_name
FROM members m
LEFT JOIN prescriptions p ON m.member_id = p.member_id;
-- Returns: 1,400,000 patients (many with NULL drug_name)
-- EXISTS (just check if prescription exists)
SELECT m.member_id
FROM members m
WHERE EXISTS (
SELECT 1
FROM prescriptions p
WHERE p.member_id = m.member_id
);
-- Returns: 234,521 patients (same as INNER JOIN, but faster)
Real example from VUMC:
-- Task: Find patients WITHOUT primary care visit in last year
-- BAD: Using LEFT JOIN + NULL check (slow)
SELECT m.member_id
FROM members m
LEFT JOIN encounters e
ON m.member_id = e.member_id
AND e.encounter_type = 'PRIMARY_CARE'
AND e.encounter_date >= DATEADD(year, -1, GETDATE())
WHERE e.encounter_id IS NULL;
-- Execution time: 8 minutes
-- Reason: Must process ALL encounters (50M rows)
-- GOOD: Using NOT EXISTS (fast)
SELECT m.member_id
FROM members m
WHERE NOT EXISTS (
SELECT 1
FROM encounters e
WHERE e.member_id = m.member_id
AND e.encounter_type = 'PRIMARY_CARE'
AND e.encounter_date >= DATEADD(year, -1, GETDATE())
);
-- Execution time: 45 seconds
-- Reason: Stops searching once match found (early exit)
-- 10.7x speedup
Rule 6: Avoid Functions on Indexed Columns
Bad (can’t use index):
WHERE YEAR(service_date) = 2024 -- Function on column
Good (can use index):
WHERE service_date >= '2024-01-01'
AND service_date < '2025-01-01' -- Column in raw form
Real examples from healthcare:
-- Finding patients by age
-- ❌ BAD: Function on date_of_birth
SELECT member_id
FROM members
WHERE DATEDIFF(year, date_of_birth, GETDATE()) >= 65;
-- Index on date_of_birth CANNOT be used
-- Execution time: 5 seconds (1.4M full table scan)
-- ✅ GOOD: Calculate cutoff date
SELECT member_id
FROM members
WHERE date_of_birth <= DATEADD(year, -65, GETDATE());
-- Index on date_of_birth CAN be used
-- Execution time: 0.2 seconds (index seek)
-- 25x speedup
More examples:
-- Finding upper/lowercase matches
-- ❌ BAD
WHERE UPPER(last_name) = 'SMITH' -- Can't use index
-- ✅ GOOD (if using case-insensitive collation)
WHERE last_name = 'Smith' -- Uses index
-- Or create computed column with index
ALTER TABLE members
ADD last_name_upper AS UPPER(last_name) PERSISTED;
CREATE INDEX IX_Members_LastNameUpper
ON members (last_name_upper);
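With the persisted computed column in place, the lookup becomes a plain index seek:
SELECT member_id
FROM members
WHERE last_name_upper = 'SMITH'; -- Index seek on IX_Members_LastNameUpper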
Rule 7: Use Query Hints Strategically
NOLOCK (read uncommitted):
-- For reporting queries (can tolerate dirty reads)
SELECT COUNT(*)
FROM claims WITH (NOLOCK)
WHERE service_date >= '2024-01-01';
-- Benefits:
-- ✅ No locks acquired (doesn't block writers)
-- ✅ Faster (no lock overhead)
-- ✅ Better concurrency
-- Risks:
-- ⚠️ May read uncommitted data
-- ⚠️ May count rows twice or skip rows
-- ⚠️ ONLY use for non-critical reporting
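Aside: if dirty reads make you (or your DBA) nervous, read-committed snapshot isolation gives you non-blocking reads without NOLOCK's anomalies. It's a database-level switch, so it needs buy-in — a sketch, using a placeholder database name:
-- One-time setting; needs a moment with no other active connections
ALTER DATABASE YourReportingDB SET READ_COMMITTED_SNAPSHOT ON;
-- After this, plain SELECTs read a consistent snapshot
-- and don't block writers -- no WITH (NOLOCK) needed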
MAXDOP (limit parallelism):
-- Limit to 4 cores (prevent query from hogging all resources)
SELECT *
FROM claims
WHERE service_date >= '2024-01-01'
OPTION (MAXDOP 4);
-- When to use:
-- ✓ Shared database (other users need resources)
-- ✓ Query is low-priority
-- ✓ Excessive parallelism causing slowdown
RECOMPILE (force new execution plan):
-- For queries with variable parameters
DECLARE @start_date DATE = '2024-01-01';
SELECT *
FROM claims
WHERE service_date >= @start_date
OPTION (RECOMPILE);
-- When to use:
-- ✓ Parameters vary widely (date ranges, etc)
-- ✓ Cached plan is suboptimal
-- ✓ Query runs infrequently (recompile cost is acceptable)
My hint strategy at VUMC:
-- Nightly batch reports (long-running, non-critical)
SELECT ...
WITH (NOLOCK)
OPTION (MAXDOP 4);
-- Ad-hoc analysis (parameter sniffing issues)
SELECT ...
OPTION (RECOMPILE);
-- Critical operational queries (need latest data, need speed)
SELECT ...
-- No hints (use default locking, full parallelism)
Real-World Optimization Case Study
Business question: “What’s the average cost per patient for diabetes care, broken down by provider?”
Naive query (never finishes):
SELECT
p.provider_name,
COUNT(DISTINCT c.member_id) as patient_count,
AVG(c.allowed_amount) as avg_cost_per_claim,
SUM(c.allowed_amount) / COUNT(DISTINCT c.member_id) as avg_cost_per_patient
FROM claims c
JOIN providers p ON c.provider_id = p.provider_id
JOIN diagnoses d ON c.claim_id = d.claim_id
WHERE d.diagnosis_code LIKE 'E11%'
AND c.service_date >= '2023-01-01'
GROUP BY p.provider_name
ORDER BY avg_cost_per_patient DESC;
-- Problems:
-- 1. Joins before filtering (200M × 500M rows)
-- 2. LIKE with wildcard (can't use index efficiently)
-- 3. No covering indexes
-- 4. GROUP BY on provider_name (string comparison is slow)
-- 5. Complex aggregation on massive dataset
-- Estimated execution time: Never completes (killed after 6 hours)
Optimized version (2 minutes):
-- Step 1: Filter diagnoses early
WITH diabetes_claims AS (
SELECT DISTINCT claim_id
FROM diagnoses WITH (NOLOCK)
WHERE diagnosis_code >= 'E11'
AND diagnosis_code < 'E12'
),
-- Step 2: Filter claims early + aggregate
claims_summary AS (
SELECT
c.provider_id,
c.member_id,
SUM(c.allowed_amount) as total_cost
FROM claims c WITH (NOLOCK)
INNER JOIN diabetes_claims dc
ON c.claim_id = dc.claim_id
WHERE c.service_date >= '2023-01-01'
GROUP BY c.provider_id, c.member_id
),
-- Step 3: Provider-level aggregation
provider_stats AS (
SELECT
provider_id,
COUNT(DISTINCT member_id) as patient_count,
SUM(total_cost) as total_cost,
AVG(total_cost) as avg_cost_per_patient
FROM claims_summary
GROUP BY provider_id
HAVING COUNT(DISTINCT member_id) >= 10 -- Filter out low-volume providers
)
-- Step 4: Join to provider names (small dataset now)
SELECT
p.provider_name,
ps.patient_count,
ps.avg_cost_per_patient,
ps.total_cost
FROM provider_stats ps
INNER JOIN providers p WITH (NOLOCK)
ON ps.provider_id = p.provider_id
WHERE ps.patient_count >= 10
ORDER BY ps.avg_cost_per_patient DESC
OPTION (MAXDOP 8);
-- Execution time: 2 minutes
-- Rows returned: 3,847 providers
What I changed:
✅ Filter first - reduced diagnoses from 500M → 12M rows
✅ Use range query - >= 'E11' AND < 'E12' instead of LIKE 'E11%'
✅ Aggregate early - summary by member before provider aggregation
✅ GROUP BY provider_id (integer) instead of provider_name (string)
✅ Filter low-volume - HAVING clause eliminates noise
✅ Join last - only 3,847 providers, not 200M claims
✅ Add NOLOCK - safe for reporting query
✅ Add MAXDOP - limit resource usage
Performance improvement: 180x speedup (6 hours → 2 minutes)
The Execution Plan is Your Friend
How to read execution plans in SQL Server:
-- Enable actual execution plan
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- Run your query
SELECT ...
-- Check output:
-- SQL Server parse and compile time: 32 ms
-- SQL Server Execution Time: 145,234 ms (2.4 minutes)
-- Table 'claims'. Scan count 1, logical reads 12,456,789
What to look for:
🚨 Red Flags:
Table Scan (should be Index Seek)
|--Table Scan
Object: [dbo].[claims]
Rows: 200,000,000 ⚠️ Reading entire table!
High cost operation
|--Hash Match (Aggregate) Cost: 87% ⚠️ This operation dominates
Key Lookup (missing covering index)
|--Nested Loops
|--Index Seek Cost: 2%
|--Key Lookup Cost: 98% ⚠️ Expensive lookups
Sort (should use index for ORDER BY)
|--Sort Cost: 45% ⚠️ Sorting 10M rows in memory
✅ Good Signs:
Index Seek (using index efficiently)
|--Index Seek
Object: [IX_Claims_ServiceDate]
Rows: 234,521 ✓ Small subset
Low-cost joins
|--Nested Loops Cost: 5% ✓ Efficient
Parallelism (using multiple cores)
|--Parallelism (Gather Streams) ✓ Using 8 threads
My Pre-Flight Checklist
Before running any query on production:
-- 1. Estimate row count
SELECT COUNT_BIG(*)
FROM claims WITH (NOLOCK)
WHERE service_date >= '2024-01-01';
-- If >10M rows, optimize before running
-- 2. Check if indexes exist
EXEC sp_helpindex 'claims';
-- Verify indexes on all WHERE/JOIN columns
-- 3. Test on small date range first
SELECT ...
WHERE service_date >= '2024-12-01' -- 1 month only
AND service_date < '2025-01-01';
-- Execution time: 5 seconds
-- Extrapolate: 5 sec × 12 months = 1 minute (acceptable)
-- 4. Check execution plan
SET SHOWPLAN_XML ON;
SELECT ...;
SET SHOWPLAN_XML OFF;
-- Returns the estimated plan without executing the query
-- (run each SET in its own batch)
-- Look for table scans, missing indexes
-- 5. Set timeout
SET LOCK_TIMEOUT 300000; -- 5 minutes
-- Fails fast if the query is stuck waiting on a lock
-- (doesn't cap total runtime -- only lock waits)
Common Healthcare SQL Patterns
Pattern 1: Gap in Care Analysis
-- Find patients with diabetes but no HbA1c test in last year
WITH diabetes_patients AS (
SELECT DISTINCT member_id
FROM diagnoses d
JOIN claims c ON d.claim_id = c.claim_id
WHERE d.diagnosis_code >= 'E11' AND d.diagnosis_code < 'E12'
AND c.service_date >= DATEADD(year, -2, GETDATE())
),
recent_hba1c AS (
SELECT DISTINCT member_id
FROM lab_results
WHERE lab_test_code = 'HBA1C'
AND result_date >= DATEADD(year, -1, GETDATE())
)
SELECT dp.member_id
FROM diabetes_patients dp
WHERE NOT EXISTS (
SELECT 1
FROM recent_hba1c r
WHERE r.member_id = dp.member_id
);
Pattern 2: High Utilizer Identification
-- Top 1% of patients by cost
WITH patient_costs AS (
SELECT
member_id,
SUM(allowed_amount) as total_cost,
COUNT(DISTINCT claim_id) as claim_count
FROM claims WITH (NOLOCK)
WHERE service_date >= DATEADD(year, -1, GETDATE())
GROUP BY member_id
),
percentile_cutoff AS (
SELECT DISTINCT -- the window form returns one value per row; collapse to a single row
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_cost)
OVER () as p99_cost
FROM patient_costs
)
SELECT
pc.member_id,
pc.total_cost,
pc.claim_count
FROM patient_costs pc
CROSS JOIN percentile_cutoff pct
WHERE pc.total_cost >= pct.p99_cost
ORDER BY pc.total_cost DESC;
Pattern 3: Chronic Condition Prevalence
-- Count patients with multiple chronic conditions
WITH chronic_conditions AS (
SELECT DISTINCT
c.member_id,
CASE
WHEN d.diagnosis_code >= 'E11' AND d.diagnosis_code < 'E12' THEN 'Diabetes'
WHEN d.diagnosis_code >= 'I10' AND d.diagnosis_code < 'I16' THEN 'Hypertension'
WHEN d.diagnosis_code >= 'I50' AND d.diagnosis_code < 'I51' THEN 'Heart Failure'
WHEN d.diagnosis_code >= 'J44' AND d.diagnosis_code < 'J45' THEN 'COPD'
WHEN d.diagnosis_code >= 'N18' AND d.diagnosis_code < 'N19' THEN 'CKD'
END as condition
FROM claims c WITH (NOLOCK)
JOIN diagnoses d WITH (NOLOCK) ON c.claim_id = d.claim_id
WHERE c.service_date >= DATEADD(year, -2, GETDATE())
AND d.diagnosis_code IN (
-- Use specific codes for better performance
SELECT diagnosis_code FROM chronic_condition_codes
)
),
condition_counts AS (
SELECT
member_id,
COUNT(DISTINCT condition) as condition_count
FROM chronic_conditions
WHERE condition IS NOT NULL
GROUP BY member_id
)
SELECT
condition_count,
COUNT(*) as patient_count,
CAST(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS DECIMAL(5,2)) as percentage
FROM condition_counts
GROUP BY condition_count
ORDER BY condition_count;
Your Assignment This Week
Assignment 1: Optimize a Slow Query (2 hours)
Take a query that takes >1 minute to run (the plan-cache sketch after this assignment is one way to find candidates).
Capture the execution plan
Identify the bottleneck (table scan? missing index? bad join order?)
Apply one optimization technique from this module
Measure the improvement
Email me:
Original query + execution time
Execution plan screenshot (before)
Optimized query + new execution time
What you changed and why
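Don't have a slow query handy? The plan cache remembers. A sketch against SQL Server's query-stats DMVs (stats reset when plans are evicted, so treat the numbers as rough):
-- Top 10 queries by average elapsed time
SELECT TOP 10
    qs.total_elapsed_time / qs.execution_count / 1000 AS avg_ms,
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS query_start
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_elapsed_time / qs.execution_count DESC;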
Assignment 2: Create a Covering Index (1 hour)
Find a query you run frequently that does Key Lookups.
Identify the columns needed
Create a covering index
Measure before/after performance
Template:
CREATE NONCLUSTERED INDEX IX_[Table]_[Column]_Coverage
ON [table] ([key_columns])
INCLUDE ([non_key_columns]);
Assignment 3: Partition Strategy (30 minutes)
For your largest table:
Decide on partition column (usually date)
Design partition boundaries (yearly? quarterly?)
Estimate storage per partition (see the row-count sketch after this list)
Write partition creation script (don’t execute yet, just plan)
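For the row-count estimate, a rough sketch. This does a full scan, which is fine for a one-off planning query — the YEAR() warning from Rule 6 applies to WHERE clauses, not aggregation:
-- Rows per candidate yearly partition
SELECT
    YEAR(service_date) AS partition_year,
    COUNT_BIG(*) AS row_count
FROM claims WITH (NOLOCK)
GROUP BY YEAR(service_date)
ORDER BY partition_year;
Multiply rows by your average row size (sp_spaceused on the table gives you the totals to derive it) to estimate storage per partition.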
Next Week: Module 7
“Power BI vs Tableau: Medical Dashboards That Executives Actually Use”
You now know how to query data fast. Next week, I’ll show you how to visualize it in ways that drive decisions.
See you Tuesday.
Appendix: Index Strategy Template
-- Claims table optimization (200M rows)
-- Primary scenarios: date range queries, member lookups, cost analysis
-- 1. Service date queries (most common)
CREATE NONCLUSTERED INDEX IX_Claims_ServiceDate_Coverage
ON claims (service_date, place_of_service)
INCLUDE (member_id, claim_id, allowed_amount, provider_id)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- 2. Member history lookups
CREATE NONCLUSTERED INDEX IX_Claims_Member_Coverage
ON claims (member_id, service_date DESC)
INCLUDE (claim_id, allowed_amount, diagnosis_code, place_of_service)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- 3. Provider analysis
CREATE NONCLUSTERED INDEX IX_Claims_Provider_Coverage
ON claims (provider_id, service_date)
INCLUDE (member_id, allowed_amount, claim_id)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- 4. Cost analysis
CREATE NONCLUSTERED INDEX IX_Claims_Cost_Analysis
ON claims (service_date, allowed_amount DESC)
INCLUDE (member_id, provider_id, diagnosis_code)
WITH (FILLFACTOR = 95, ONLINE = ON);
-- Maintenance: Rebuild monthly
ALTER INDEX ALL ON claims REBUILD
WITH (ONLINE = ON, FILLFACTOR = 95);
-- Monitor fragmentation
SELECT
index_name = i.name,
fragmentation_percent = s.avg_fragmentation_in_percent,
page_count = s.page_count
FROM sys.dm_db_index_physical_stats(
DB_ID(),
OBJECT_ID('claims'),
NULL,
NULL,
'LIMITED'
) s
JOIN sys.indexes i ON s.object_id = i.object_id
AND s.index_id = i.index_id
WHERE s.avg_fragmentation_in_percent > 10
ORDER BY s.avg_fragmentation_in_percent DESC;
P.S. — The Most Important Lesson
Perfect optimization doesn’t exist.
At some point, you hit diminishing returns. Going from 10 minutes → 3 minutes is worth it. Going from 3 minutes → 2.5 minutes is not.
Know when to stop:
Query runs in <5 minutes? Good enough.
Query runs hourly in a batch? <30 minutes is fine.
Query is ad-hoc? <10 minutes is acceptable.
Focus optimization effort on:
Queries that run frequently (1000x/day)
Queries that block other users
Queries in user-facing applications
Everything else? Ship it and move on.
Chad
