Week 12: My Data Stack 2026: The Tools I Actually Use (And Why)
A quarterly reflection on what’s in my toolbox
Hey friends,
Two months into 2026, and I’m seeing a lot of “ultimate data stack” posts floating around LinkedIn. You know the ones—beautifully designed diagrams with 47 tools that nobody actually uses together.
So I thought I’d do something different: show you what I’m actually using to build healthcare AI products. Not the tools I wish I used. Not the ones that look good on a slide deck. The ones that are open in my browser tabs right now.
Why This Matters
Your data stack isn’t just a collection of tools. It’s a reflection of your philosophy about building products. Every tool you choose is a bet on speed vs. control, flexibility vs. simplicity, cost vs. capability.
And in healthcare AI? These choices matter even more. We’re dealing with PHI, regulatory requirements, and clinical workflows that can’t afford to break.
My Current Stack (Spring 2026)
Let me walk you through each layer and why I picked what I picked.
Data Warehouse: Snowflake
Why I use it: HIPAA compliance out of the box, zero cluster management, and it just works.
I’ve tried BigQuery (great for Google Cloud natives), Redshift (AWS lock-in felt heavy), and Databricks (amazing for ML, but overkill for most analytics). Snowflake hits the sweet spot for healthcare data teams.
The automatic scaling is clutch when our claims processing jobs spike at month-end. And the data sharing features? Game-changer for working with payers without moving PHI around.
Real talk: It’s not the cheapest option. But the time I save not managing infrastructure pays for itself.
Orchestration: Airflow
Why I use it: Because sometimes boring is beautiful.
I know everyone’s hyped about Prefect and Dagster. I’ve played with both. But Airflow is battle-tested in healthcare, has a massive community, and every data engineer I hire already knows it.
When you’re running ETL pipelines that touch patient data at 2 AM, you want boring and reliable. Not the hot new thing that might have bugs in production.
My setup: Astronomer’s managed Airflow. I don’t want to manage Kubernetes clusters for my orchestrator.
Transformation: dbt
Why I use it: This one’s non-negotiable in 2026.
If you’re still writing raw SQL without version control, documentation, and testing—I’m sorry, but you’re building a house of cards.
dbt forces you to think about data lineage, write tests, and document your business logic. When the compliance team asks “how did we calculate this quality metric?”—I can point them to a dbt model with full lineage.
I use dbt Cloud for the scheduling and IDE, but the open-source core is what matters.
Reverse ETL: Hightouch
Why I use it: Because operational analytics is where the value lives.
Here’s the thing: your data warehouse is useless if insights never make it to the people who need them. Hightouch syncs our risk scores into our EHR, pushes patient cohorts to our care management platform, and updates Salesforce with usage analytics.
Could I build custom connectors? Sure. Do I want to maintain them? Hell no.
Alternative I considered: Census (great product, similar capabilities). Went with Hightouch for the healthcare-specific connectors, and eventually consolidated our remaining Census workflows onto it.
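Under the hood, a reverse-ETL sync is basically a diff between warehouse state and destination state. Here’s a minimal sketch of that idea in plain Python—the field names are made up, and Hightouch’s actual engine handles far more (retries, rate limits, field mapping):

```python
def plan_sync(warehouse_rows, destination_rows, key="member_id"):
    """Diff warehouse rows against the destination and plan the sync.

    Returns (inserts, updates): rows missing from the destination,
    and rows whose values have drifted since the last sync.
    """
    dest = {row[key]: row for row in destination_rows}
    inserts, updates = [], []
    for row in warehouse_rows:
        if row[key] not in dest:
            inserts.append(row)      # new member: create the record
        elif row != dest[row[key]]:
            updates.append(row)      # risk score changed: push the update
    return inserts, updates
```

That maintenance burden—plus everything the toy version leaves out—is exactly what I’m paying Hightouch not to own.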
BI Tool: Tableau (with AI Copilot Layer)
Why I use it: Because my stakeholders already know how to use it—and now it’s AI-powered.
I love Looker’s modeling layer. I’m impressed by Sigma’s spreadsheet interface. But here’s what matters: our clinical team has used Tableau for years. They have dashboards they trust. They know how to build calculated fields.
The game-changer in 2026? Tableau’s AI copilot integration. Our care managers can now ask questions in plain English and get visualizations instantly. “Show me high-risk diabetics who haven’t had an A1C test in 6 months” → instant dashboard.
Switching BI tools is like switching EHRs—technically possible, emotionally devastating.
That said, I’m watching Mode’s AI analyst and Hex’s LLM-powered notebooks closely. The line between BI and AI-assisted analytics is blurring fast.
Data Quality: Great Expectations
Why I use it: Because bad data in healthcare isn’t just annoying—it’s dangerous.
We run GE tests on every dataset that touches clinical decision-making:
Is this member ID valid?
Are these diagnosis codes in the right format?
Do these dollar amounts make sense?
It’s saved our ass more times than I can count. Found data quality issues before they became compliance issues.
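For flavor, here’s the kind of check we express as Great Expectations suites, sketched in plain Python. The regex is a simplified ICD-10-CM shape (not the full spec), and the dollar bounds are illustrative:

```python
import re

# Simplified ICD-10-CM shape: letter, digit, alphanumeric, then an
# optional dot plus up to four more characters. Not the full spec.
ICD10_RE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")

def invalid_diagnosis_codes(codes):
    """Return codes that fail the (simplified) ICD-10 format check."""
    return [c for c in codes if not ICD10_RE.match(c)]

def implausible_amounts(amounts, low=0.0, high=1_000_000.0):
    """Flag dollar amounts outside a plausible claims range."""
    return [a for a in amounts if not (low <= a <= high)]
```

The real suites run on every load, and a failed expectation blocks the downstream dbt run before bad data reaches a clinical dashboard.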
ML Platform: Databricks (with Unity Catalog AI)
Why I use it: When you need serious ML and LLM horsepower in one place.
For our risk stratification models and NLP pipelines, Snowflake wasn’t enough. Databricks in 2026 gives us:
Distributed training with MLflow
Feature store for reusable features
Unity Catalog for governance
Native LLM fine-tuning and serving (game-changer)
Built-in RAG pipeline orchestration
The big shift in 2026: we’re running both traditional ML and LLM workloads on the same platform. No more duct-taping together different systems.
Yes, it’s expensive. But compute costs are nothing compared to the cost of a bad clinical model. And the unified governance for both ML and AI? Worth every penny.
Vector Database: Pinecone (with pgvector for some workloads)
Why I use it: RAG is table stakes in 2026.
Our clinical documentation search, our prior auth assistant, our ICD-10 coding helper—they all use RAG architectures. Pinecone handles the vector storage and similarity search without me building infrastructure.
The 2026 twist: We’ve also started using pgvector for smaller, lower-latency workloads. Turns out, for some use cases, keeping vectors in Postgres is simpler and cheaper than a dedicated vector DB.
The hybrid approach:
Pinecone: Large-scale semantic search (millions of clinical notes)
pgvector: Real-time lookups (patient context in our EHR integration)
Why not Weaviate or Qdrant? Both great products. Pinecone won for managed simplicity, pgvector won for “it’s already in our stack.”
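Whichever store you pick, the core operation is the same: rank stored embeddings by similarity to a query vector. A toy version of what pgvector’s cosine-distance queries or Pinecone’s query API compute, in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=3):
    """index: list of (doc_id, embedding). Returns the k nearest doc_ids."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The dedicated stores earn their keep by doing this approximately, at millions-of-vectors scale, with indexes—the brute-force scan above is only viable for small, hot datasets, which is roughly why pgvector works for our real-time lookups.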
LLM Ops: Langfuse + Braintrust
Why I use it: Because “prompt engineering” became “AI product management.”
We track:
Which prompts are performing well
Where our costs are going (this got EXPENSIVE in 2026)
What our latency looks like
How often we’re hitting token limits
Prompt version control and A/B testing (new in 2026)
Multi-model routing (GPT-4 vs Claude vs Llama)
The 2026 reality: LLM costs ballooned as we scaled. We added Braintrust for sophisticated prompt testing and cost optimization. Saved us ~40% on API costs by routing simpler queries to cheaper models.
Every production LLM app needs observability. The combination of Langfuse (monitoring) + Braintrust (optimization) is our secret weapon.
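The routing idea itself is simple enough to sketch. Here’s a hypothetical router—the model names, token heuristic, and keyword list are all placeholders, and our real routing logic lives in Braintrust configs:

```python
def route_model(prompt, cheap_model="llama-3-8b", premium_model="gpt-4",
                max_cheap_tokens=500):
    """Send short, simple prompts to a cheaper model; escalate the rest."""
    est_tokens = len(prompt) // 4  # rough chars-per-token heuristic
    needs_reasoning = any(
        kw in prompt.lower() for kw in ("explain", "compare", "differential")
    )
    if est_tokens <= max_cheap_tokens and not needs_reasoning:
        return cheap_model
    return premium_model
```

Even a crude rule like this moves a big chunk of traffic off the premium model; the hard part is evaluating that quality doesn’t degrade, which is where the prompt testing comes in.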
Data Catalog: Atlan (with AI Auto-Documentation)
Why I use it: For team sanity—now with AI superpowers.
When you have 500+ tables in your warehouse, people need to find stuff. Atlan in 2026 surfaces:
What data exists
Who owns it
How it’s being used
Whether it contains PHI
AI-generated descriptions and column documentation (saves hours)
Automated lineage mapping (shows impact of schema changes)
Started with a spreadsheet. Graduated to a Notion doc. Finally bit the bullet on a real catalog. Should’ve done it sooner.
The 2026 upgrade: Atlan’s AI now automatically documents new tables by reading dbt models, analyzing column distributions, and even suggesting data quality rules. It’s like having a junior data engineer on documentation duty 24/7.
The Glue Layer
A few smaller tools that hold everything together:
Fivetran: For 90% of our data ingestion (EHR, claims, CRM)
Hightouch: For operational workflows (consolidated from Census)
Monte Carlo: For data observability (catching broken pipelines)
Hex: For collaborative notebooks (replaced Jupyter entirely)
dbt Semantic Layer: New in 2026—single source of truth for metrics
Secoda: AI-powered data search (experimental, but promising)
What I Specifically Avoided (And Why)
Palantir Foundry: Still too expensive, still too much lock-in for our stage.
Building custom on Spark: We’re a 12-person data team, not Netflix.
All-in-one AI platforms: Tools promising “no-code AI” sound great but end up limiting us when we need custom workflows.
Self-hosted LLMs: Tried Llama on our own infrastructure. The cost savings weren’t worth the operational headache. We use hosted APIs and fine-tune when needed.
Airbyte (self-hosted): Great product, but Fivetran’s managed service is worth the premium for us.
Vertex AI: Google’s AI platform is solid, but we’re already deep in the Databricks ecosystem. Switching would mean rebuilding too much.
The Philosophy Behind These Choices
Looking at this stack, you’ll notice some themes:
Managed > Self-hosted: Our team is small. We optimize for velocity.
Healthcare-native when possible: PHI handling isn’t something to wing.
Community size matters: Popular tools have better docs, more integrations, easier hiring.
Boring is good for infrastructure, experimental for AI: We take risks on the AI layer, not the data layer.
AI-augmented everything (new in 2026): Every tool in our stack now has some AI feature—documentation, query generation, anomaly detection. We lean into it.
What’s Changing in 2026
A few things I’m actively experimenting with:
Compound AI systems: Chaining multiple specialized AI agents instead of one mega-model
Real-time feature stores: Moving from batch to streaming for risk models
Federated learning infrastructure: For multi-site collaborations without moving PHI
Agentic workflows: Claude/GPT agents that can write and execute dbt models autonomously
Embedded AI in Snowflake: Cortex AI functions are getting good enough to replace some Databricks workloads
The big question for 2026: Do we still need separate tools for ML and LLM workloads, or is everything converging?
My bet: Convergence. By 2027, I think we’ll have one unified platform for all AI/ML work.
The Real Cost
People always ask about pricing. Here’s the honest answer for our scale (processing ~5M patient records, 50TB of data, plus heavy AI workloads):
Snowflake: ~$10K/month (up from $8K, more compute-intensive queries)
Databricks: ~$18K/month (up significantly—LLM fine-tuning is expensive)
LLM API costs: ~$15K/month (new line item in 2026—this grew FAST)
Fivetran: ~$3K/month
dbt Cloud: ~$1.5K/month
LLM Ops (Langfuse + Braintrust): ~$2K/month
Everything else: ~$8K/month
Total: ~$57.5K/month in data + AI infrastructure.
The 2026 reality: AI costs are now our second-largest infrastructure expense after compute. If you’re building AI products, budget for it early.
Sounds like a lot? It’s still a rounding error compared to:
The value we deliver to care teams
What it would cost to build this ourselves
Our healthcare IT budget overall
The reimbursement we’re enabling (that’s the real ROI)
What Would I Change?
If I were starting from scratch in 2026:
Start with MotherDuck for the first 6-12 months. It’s DuckDB in the cloud, serverless, and shockingly capable. Snowflake is overkill until you have real scale.
Use GitHub Actions for orchestration until you actually need Airflow. Don’t over-engineer early.
Invest in LLM cost tracking from day one. We added this too late. The bill shock in month 3 was painful. Set up Braintrust or similar before you scale.
Build AI evals infrastructure early. We’re retrofitting this now. Should’ve had systematic prompt testing from the start.
Consolidate vendors. We have some overlap that we’re cleaning up. Every vendor relationship has overhead.
Budget for AI costs separately. Don’t lump LLM APIs into “infrastructure.” Track it as its own category or you’ll get surprised.
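The LLM cost tracking I wish we’d had on day one doesn’t need to be fancy. A minimal sketch—the prices here are placeholders, so use your provider’s actual rate card:

```python
class LLMCostTracker:
    """Tiny per-model spend tracker. Prices are hypothetical $/1K tokens."""

    PRICES = {"gpt-4": 0.03, "llama-3-8b": 0.0005}

    def __init__(self):
        self.spend = {}

    def record(self, model, tokens):
        """Attribute a request's cost to its model; return that cost."""
        cost = tokens / 1000 * self.PRICES[model]
        self.spend[model] = self.spend.get(model, 0.0) + cost
        return cost

    def total(self):
        return sum(self.spend.values())
```

Even something this crude, logged per feature, would have flagged our month-3 bill shock weeks earlier.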
The Tools Don’t Matter (Except They Do)
Here’s the paradox: your data stack is both critically important and completely unimportant.
Unimportant because: No patient was ever cured because you picked Snowflake over BigQuery.
Important because: The right tools let you move fast, maintain quality, and sleep at night.
The 2026 twist: AI changed the game. Your stack now needs to support:
Traditional analytics
ML pipelines
LLM applications
Agentic workflows
Choose boring technology for infrastructure. Optimize for your team’s skills. Experiment aggressively on AI, conservatively on data.
And remember: the best data stack is the one you actually use—and can afford to scale.
What’s in your stack? Reply and let me know what you’re using—I’m always curious what’s working for other healthcare data teams.
Until next quarter,
Chad
This is part of my 52-week Healthcare AI Insider series. Each week, I share what’s actually working (and what’s not) in healthcare AI.

