Building a Multi-Client CDC Data Pipeline with Kafka and Flink
How I designed a real-time ETL pipeline using Debezium, Kafka, Flink, MongoDB, and ClickHouse.
The Problem
We had multiple clients.
Each client had their own database.
We needed:
- Real-time analytics
- Data isolation per client
- No heavy queries hitting transactional databases
- Near real-time updates for dashboards
The challenge wasn’t just moving data.
It was doing it safely, at scale, without breaking production systems.
The Architecture
The flow looked like this:
Client Databases
↓
Debezium (CDC)
↓
Kafka
↓
Flink (Transformation Layer)
↓
MongoDB (Raw)
↓
ClickHouse (Analytics)
Each layer had a clear responsibility.
Step 1: CDC with Debezium
Instead of modifying application code to emit events,
we used Change Data Capture (CDC).
Debezium reads the database write-ahead log and publishes changes into Kafka.
Benefits:
- No extra logic in services
- Guaranteed alignment with database state
- Easy to add new downstream consumers
But CDC emits row-level changes — not business meaning.
That responsibility shifts to the processing layer.
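To make the CDC step concrete, here is a sketch of what a Debezium Postgres connector registration might look like. Every hostname, credential path, and table name below is a placeholder, not our production config:

```json
{
  "name": "clientdb-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "clientdb.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.dbname": "client_db",
    "topic.prefix": "client1",
    "table.include.list": "public.users,public.orders",
    "plugin.name": "pgoutput"
  }
}
```

Debezium tails the write-ahead log via the `pgoutput` plugin and publishes one topic per table under the configured prefix, with no changes to application code.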
Step 2: Kafka as the Backbone
Kafka became our central event log.
Important design decisions:
- Topics named by domain (e.g. user.updated, order.created)
- Messages include client_id
- Partition strategy carefully chosen to balance ordering and distribution
In multi-client systems, partitioning is critical.
Choose keys too naively, and one noisy client can affect everyone.
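The key property we relied on can be sketched in plain Python. Kafka's default partitioner actually uses murmur2 on the message key; md5 below is just a stand-in with the same deterministic behavior, which is what matters: all events for one client land on one partition, preserving per-client ordering.

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(client_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a client_id to a partition deterministically.

    Kafka's default partitioner hashes the message key with murmur2;
    md5 here is only a stand-in with the same deterministic property.
    """
    digest = hashlib.md5(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same client -> same partition, so per-client ordering is preserved.
assert partition_for("client-42") == partition_for("client-42")
```

The trade-off: a single very busy client is pinned to one partition, which is exactly the "noisy client" problem above. Some teams mitigate it with composite keys (client plus entity id) at the cost of cross-entity ordering.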
Step 3: Flink as the Brain
Raw CDC events are messy.
Flink handled:
- Filtering
- Data normalization
- Deduplication
- Client-level enrichment
- Aggregation (when needed)
We treated Kafka as the event log,
and Flink as the transformation engine.
Flink also gave us:
- Stateful processing
- Checkpointing
- Controlled parallelism
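Deduplication is the piece most worth sketching. In Flink this would be a keyed function holding per-key state (and surviving restarts via checkpoints); the plain-Python class below mimics the idea with an in-memory dict. It is an illustration of the pattern, not our production operator:

```python
from collections import defaultdict
from typing import Optional

class KeyedDeduplicator:
    """Drop events whose (client_id, event_id) pair has been seen before.

    In Flink, this role is played by a keyed function with checkpointed
    state; a dict stands in here. Real deployments also expire old ids
    (state TTL) so state does not grow without bound.
    """

    def __init__(self):
        self._seen = defaultdict(set)  # client_id -> set of event ids

    def process(self, event: dict) -> Optional[dict]:
        client, event_id = event["client_id"], event["event_id"]
        if event_id in self._seen[client]:
            return None  # duplicate delivery: drop it
        self._seen[client].add(event_id)
        return event

dedup = KeyedDeduplicator()
events = [
    {"client_id": "a", "event_id": 1, "op": "insert"},
    {"client_id": "a", "event_id": 1, "op": "insert"},  # retried delivery
    {"client_id": "b", "event_id": 1, "op": "insert"},  # same id, other client
]
unique = [e for e in events if dedup.process(e) is not None]
# the retry is dropped; the two clients do not collide
```

Keying state by client also doubles as an isolation boundary: one client's duplicates can never suppress another client's events.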
Dual Storage Strategy
We didn’t insert directly into ClickHouse.
Instead, we split storage into two layers:
MongoDB (Raw Layer)
- Stores near-original events
- Useful for replay
- Helpful for debugging
- Flexible schema
ClickHouse (Analytics Layer)
- Pre-transformed data
- Columnar storage
- Optimized for dashboards
- High ingestion throughput
This separation made the system more resilient.
If transformation logic changed,
we could replay from raw data safely.
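The replay path is simple enough to sketch. Assume the raw layer is an append-only stream of event dicts and the analytics sink is idempotent; then re-running the current transform is just a loop. The function and names below are illustrative, not our actual code:

```python
def replay(raw_events, transform, sink):
    """Re-run the current transformation over raw-layer events.

    raw_events: iterable of dicts as stored in the raw (MongoDB) layer.
    transform:  the current transformation, as a function; may return
                None to filter an event out.
    sink:       callable that writes one transformed row to analytics.

    Because the raw layer is append-only and the sink is idempotent,
    this is safe to run every time transformation logic changes.
    """
    for event in raw_events:
        row = transform(event)
        if row is not None:
            sink(row)

# Example: a transform change that starts casting amounts to integers.
raw = [{"client_id": "a", "amount": "10"}]
out = []
replay(raw, lambda e: {**e, "amount": int(e["amount"])}, out.append)
```

This is the payoff of the dual-storage split: the transform can be wrong, fixed, and re-applied without ever touching the source databases.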
Multi-Client Isolation
Every event carried a client_id.
We enforced isolation by:
- Including client_id in the partition strategy
- Keying Flink streams by client
- Filtering queries per client in analytics
Without strict tagging,
multi-client pipelines become dangerous quickly.
Failure Scenarios We Designed For
In distributed systems, failure is normal.
We assumed:
- Connectors would restart
- Consumers would crash
- Events would be retried
- Network partitions would happen
So we built:
- Idempotent writes
- Deduplication logic
- Monitoring for consumer lag
- Clear separation between raw and transformed data
Exactly-once is a goal.
At-least-once is reality.
Design accordingly.
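"Design accordingly" mostly means idempotent writes. One common pattern, sketched below with an in-memory store standing in for the database, is to upsert keyed on (client_id, event_id), so a retried delivery overwrites rather than duplicates:

```python
class IdempotentSink:
    """Upsert keyed on (client_id, event_id): retries overwrite, never duplicate.

    A dict stands in for the database; in practice this is an upsert
    or ReplacingMergeTree-style dedup keyed on the same pair.
    """

    def __init__(self):
        self._rows = {}  # (client_id, event_id) -> row

    def write(self, row: dict) -> None:
        key = (row["client_id"], row["event_id"])
        self._rows[key] = row  # second delivery replaces the first

    def count(self) -> int:
        return len(self._rows)

sink = IdempotentSink()
sink.write({"client_id": "a", "event_id": 1, "total": 10})
sink.write({"client_id": "a", "event_id": 1, "total": 10})  # at-least-once retry
# count() is 1: the retry did not create a duplicate row
```

With writes shaped like this, at-least-once delivery upstream becomes a non-event: retries are absorbed at the edge instead of corrupting the analytics layer.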
Lessons Learned
- CDC simplifies event generation but complicates semantics.
- Raw data storage is critical for resilience.
- Analytics databases must be modeled differently from transactional ones.
- Multi-client systems require strict isolation discipline.
- Observability is not optional in streaming systems.
Final Thoughts
This pipeline wasn’t just about moving data.
It was about designing a system that:
- Doesn’t overload transactional databases
- Scales with new clients
- Handles failures gracefully
- Supports real-time analytics
- Remains debuggable under pressure
Event-driven architecture adds complexity.
But when done correctly,
it gives you scalability, isolation, and resilience by design.