The Law You Cannot Break #
In physics, you cannot travel faster than light. In distributed systems, you cannot beat the CAP Theorem. Yet, I see countless founders and Junior Architects promising: “Our system will be 100% consistent and 100% available, even during a network outage.”
No, it won’t. That is mathematically impossible. Understanding CAP is the difference between a system that handles failure gracefully and one that corrupts data during a blackout.
The Triangle: Consistency, Availability, Partition Tolerance #
You get to pick two. But actually, you only get to pick one. Here is the logic why.
- Consistency (C): Every read receives the most recent write or an error. (All nodes see the same data at the same time).
- Availability (A): Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
- Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.
The “Gotcha”: In a distributed system (cloud), P is mandatory. Network cables get cut. AWS Regions go dark. Latency spikes. You cannot choose to avoid Partitions. Therefore, your real choice is only: CP vs. AP.
The Decision: CP vs. AP #
When the network breaks (Partition happens), what does your system do?
Option 1: CP (Consistency over Availability) #
- The Logic: “If I cannot talk to the other database node to verify the data is current, I will shut down and refuse to answer.”
- Use Case:Banking / ATM.
- Scenario: The link between the ATM and the Bank Server is broken.
- Result: The ATM says “Out of Service.” It does not let you withdraw money because it can’t confirm your balance. It chooses to be Unavailable to ensure Consistency.
- Databases: HBase, MongoDB (by default), Redis (standard).
Option 2: AP (Availability over Consistency) #
- The Logic: “I can’t talk to the main server, but I have a copy of the data from 5 minutes ago. I will serve that to the user so the site stays up.”
- Use Case:Social Media / E-Commerce Catalog.
- Scenario: You change your profile picture. The network is slow.
- Result: Your friend sees the old picture for 10 minutes. The site didn’t crash; it just served “stale” data. It chose Availability over Consistency.
- Databases: Cassandra, DynamoDB, CouchDB.
Architecture Diagram: The Split Brain #
This is what happens logically during a network partition.
flowchart TD
%% Define Nodes First to avoid ID confusion
U1[User 1]
U2[User 2]
NA[Node A - USA]
NB[Node B - Europe]
DB1[("DB Shard 1")]
DB2[("DB Shard 2")]
NoteBox["NOTE: CP System = Node B locks. AP System = Node B accepts write."]
%% Top Layer Connections
U1 --> NA
U2 --> NB
%% The Broken Link (Standard dotted line to prevent errors)
NA -. "LINK BROKEN" .- NB
%% The Subgraph (Simple box)
subgraph Scenario [Scenario: Network Partition]
direction TB
NA -- "Writes $100" --> DB1
NB -- "Writes $50" --> DB2
end
%% Connect Note to the bottom
DB1 --- NoteBox
DB2 --- NoteBoxPACELC: The “Real” CAP Theorem #
The CAP theorem only applies when things are broken (during a partition). But what about when things are working normally? That is where PACELC comes in: “If there is a Partition (P), how does the system trade off Availability (A) and Consistency (C)? Else (E), when the system is running normally, how does it trade off Latency (L) and Consistency (C)?”
- The Logic: Even when the network is fine, if you want perfect Consistency (C), you have to pay with Latency (L) because you have to wait for all nodes to acknowledge the write.
- Trade-off: If you want super fast responses (Low Latency), you must accept that data might be stale for a few milliseconds.
Conclusion: Don’t Fight Physics #
As an architect, your job is to classify every feature:
- “Is this feature a Bank?” (Requires CP).
- “Is this feature a Billboard?” (Requires AP).
Never try to build a CP system on an AP database (e.g., trying to do strict financial locking on Cassandra). You will lose data, and you will lose your job.