
Designing WhatsApp – Real-Time at Scale


The “Database” Misconception

When asked to design a chat app, most engineers focus on how to store the messages. They design complex schemas: Messages Table, Conversations Table, UnreadCount Table.

The Logic Check: WhatsApp is not a storage system. It is a routing system. In its early years (before multi-device support), WhatsApp was known for deleting messages from its servers as soon as they were delivered to your phone. This “Store-and-Forward” architecture is a large part of why they could support 450 million users with only 32 engineers when Facebook bought them. They didn’t need massive storage clusters; they mostly paid for bandwidth.

The Core Logic: The “Connection” Problem

The hard part of WhatsApp isn’t sending text; it’s maintaining open connections for billions of devices. Every phone needs a persistent link (TCP/WebSocket) to the server so it can receive a message instantly without draining the battery by “polling” every 5 seconds.

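The Solution: Erlang & FreeBSD

Servers that dedicate an OS thread or process to each connection (the classic Apache model) struggle to hold more than about 10,000 concurrent connections per machine because of per-thread memory overhead. This is the well-known C10K problem. Event-driven servers such as Nginx and Node.js raise that ceiling, but WhatsApp went further and chose Erlang (a language built for telecom switches), running on FreeBSD.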

  • Lightweight Processes: A newly spawned Erlang process uses roughly 300 machine words of memory (a few kilobytes on a 64-bit system).
  • Result: They famously achieved 2 million connections on a single server.
  • The Logic: By packing connections so densely, they minimized the number of servers they had to manage. Fewer servers = less ops complexity = smaller team.
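
To get a feel for why process weight matters, here is a back-of-the-envelope calculation. It uses the ~300 words and 2 million connections quoted above; the 1 MiB thread stack is an assumed, illustrative figure for a thread-per-connection server, and the estimate ignores socket buffers and application state entirely.

```python
WORD_BYTES = 8                        # size of a machine word on a 64-bit system
WORDS_PER_PROCESS = 300               # approximate per-process figure quoted above
CONNECTIONS = 2_000_000               # connections per server quoted above
THREAD_STACK_BYTES = 1024 * 1024      # ASSUMPTION: 1 MiB stack per OS thread


def total_gib(bytes_per_connection: int, connections: int) -> float:
    """Total memory in GiB if every connection costs bytes_per_connection."""
    return bytes_per_connection * connections / 1024**3


erlang_gib = total_gib(WORDS_PER_PROCESS * WORD_BYTES, CONNECTIONS)
thread_gib = total_gib(THREAD_STACK_BYTES, CONNECTIONS)

print(f"Lightweight processes: {erlang_gib:.1f} GiB")   # 4.5 GiB
print(f"Thread per connection: {thread_gib:.0f} GiB")   # 1953 GiB
```

Under these assumptions the baseline cost of 2 million lightweight processes fits comfortably in one machine's RAM, while one thread per connection would need roughly 2 TiB for stacks alone.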

Architecture Diagram: Store-and-Forward

Here is the logical lifecycle of a message. Notice how the database is only a temporary buffer.

```mermaid
graph TD
    UserA["User A (Sender)"] --> Gateway["Chat Gateway (Erlang)"]

    Gateway -- "Is User B Online?" --> Session{"Session Manager"}

    %% ================= SCENARIO 1: ONLINE =================
    subgraph OnlineScenario ["Scenario 1: Online"]
        Session -- "Yes" --> DirectPush["Gateway (Push)"]
        DirectPush -- "Push Message" --> UserB["User B (Receiver)"]
        UserB -- "Ack (Received)" --> DirectPush
        DirectPush -- "Notify (Double Tick)" --> UserA
    end

    %% ================= SCENARIO 2: OFFLINE =================
    subgraph OfflineScenario ["Scenario 2: Offline"]
        Session -- "No" --> DB[("Temporary Storage")]

        %% The delayed flow
        UserB_Later["User B (Comes Online)"] -- "Connects" --> Gateway_Later["Gateway (Sync)"]
        Gateway_Later -- "Fetch Pending" --> DB
        DB -- "Deliver & Delete" --> UserB_Later
    end
```
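
The same lifecycle can be sketched in a few lines of Python. This is a toy, in-memory illustration and not WhatsApp's actual implementation: `StoreAndForwardRouter` and its methods are hypothetical names, the "connection" is just a callback, and the backlog is deleted as soon as it is pushed, whereas a real system would wait for the recipient's ack.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class StoreAndForwardRouter:
    """Hypothetical router: push if the recipient is online, buffer if not."""

    def __init__(self) -> None:
        # user id -> callback that pushes a message down that user's open connection
        self.sessions: Dict[str, Callable[[str], None]] = {}
        # user id -> messages waiting for delivery (the "temporary storage")
        self.pending: Dict[str, Deque[str]] = defaultdict(deque)

    def send(self, recipient: str, message: str) -> str:
        push = self.sessions.get(recipient)
        if push is not None:
            push(message)              # Scenario 1: recipient is online
            return "delivered"
        self.pending[recipient].append(message)   # Scenario 2: recipient is offline
        return "stored"

    def connect(self, user: str, push: Callable[[str], None]) -> int:
        """Register the connection, deliver the backlog, and delete it."""
        self.sessions[user] = push
        backlog = self.pending.pop(user, deque())  # removed from storage here
        for message in backlog:
            push(message)
        return len(backlog)

    def disconnect(self, user: str) -> None:
        self.sessions.pop(user, None)


router = StoreAndForwardRouter()
inbox_b: list = []
print(router.send("B", "Happy New Year!"))   # stored (B is offline)
print(router.connect("B", inbox_b.append))   # 1 (one pending message flushed)
print(router.send("B", "Are you there?"))    # delivered (B is now online)
```

Note that nothing in the router keeps a message once it has been handed to the recipient, which is the whole point of the "router, not storage" design.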

The “End-to-End Encryption” Logic

WhatsApp uses the Signal Protocol. This is not just a security feature; it is an architectural constraint.

  • The Logic: The server cannot read the message.
  • Implication: You cannot implement “Server-Side Search” (e.g., you can’t ask the API “Show me all messages containing ‘Dinner’”).
  • Architecture Shift: Search must be implemented locally on the client (SQLite on Android/iOS). The server is a dumb pipe. This further reduces server cost because the server doesn’t need to index petabytes of text.
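
As a rough illustration of client-side search, the sketch below stores already-decrypted messages in a local SQLite database and searches them on the device. The table layout and the function names (`open_local_store`, `save_decrypted`, `search`) are assumptions made for this example, not the real app's schema.

```python
import sqlite3
from typing import List, Tuple


def open_local_store() -> sqlite3.Connection:
    """Create the on-device store (in memory here so the example is self-contained)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, chat TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def save_decrypted(conn: sqlite3.Connection, chat: str, body: str) -> None:
    """Persist a message after the client has decrypted it locally."""
    conn.execute("INSERT INTO messages (chat, body) VALUES (?, ?)", (chat, body))


def search(conn: sqlite3.Connection, term: str) -> List[Tuple[str, str]]:
    """Substring search; LIKE is case-insensitive for ASCII in SQLite."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT chat, body FROM messages WHERE body LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    )
    return rows.fetchall()


store = open_local_store()
save_decrypted(store, "family", "Dinner at 8?")
save_decrypted(store, "work", "Lunch tomorrow")
print(search(store, "dinner"))   # [('family', 'Dinner at 8?')]
```

The server never sees the plaintext or the query, so the cost of indexing and searching is paid entirely by the phone.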

The Decision Matrix: Connection Protocols

How do you talk to the phone?

| Protocol | The Logic Choice | Why? |
| --- | --- | --- |
| HTTP Short Polling | Never | “Are there messages?” … “No.” (Repeated every 2s.) Kills battery. Floods server. |
| HTTP Long Polling | Legacy | Keep connection open until data arrives. Heavy header overhead. Good for old browsers, bad for mobile. |
| WebSockets | Good | Full duplex. Persistent. Standard for web chat. |
| MQTT (Message Queuing Telemetry Transport) | The Winner | What Facebook Messenger uses; WhatsApp applies the same idea with its own compact binary protocol derived from XMPP. MQTT is an extremely lightweight binary protocol originally designed for monitoring oil pipelines over satellite links. Well suited to unstable mobile networks. |
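
To make "lightweight binary protocol" concrete, here is a minimal sketch that hand-encodes an MQTT 3.1.1 PUBLISH packet at QoS 0 and compares its size with an equivalent HTTP request. The topic name, host, and JSON body are made up for the comparison, and a real client would use an MQTT library rather than building packets by hand.

```python
def encode_remaining_length(n: int) -> bytes:
    """MQTT variable-length integer: 7 data bits per byte, high bit = more bytes follow."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range for MQTT")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80              # continuation bit
        out.append(digit)
        if n == 0:
            return bytes(out)


def encode_publish_qos0(topic: str, payload: bytes) -> bytes:
    """Build a PUBLISH packet with QoS 0, no DUP, no RETAIN (first byte 0x30)."""
    topic_bytes = topic.encode("utf-8")
    # Variable header: 2-byte big-endian topic length, then the topic itself.
    # QoS 0 packets carry no packet identifier.
    body = len(topic_bytes).to_bytes(2, "big") + topic_bytes + payload
    return bytes([0x30]) + encode_remaining_length(len(body)) + body


packet = encode_publish_qos0("chat/b", b"hi")     # hypothetical topic name

json_body = b'{"to":"b","body":"hi"}'             # hypothetical HTTP equivalent
http_request = (
    b"POST /messages HTTP/1.1\r\n"
    b"Host: chat.example\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: " + str(len(json_body)).encode() + b"\r\n\r\n" + json_body
)

print(len(packet))         # 12
print(len(http_request))   # several times larger, even with minimal headers
```

The MQTT framing adds only 4 bytes on top of the topic and payload here, which is why this style of protocol suits slow or flaky mobile links.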

Real-World Lesson: The “Thundering Herd” of New Year’s Eve

The biggest threat to a chat app is a global event (New Year’s, World Cup). Everyone picks up their phone at 12:00:00 AM.

  • The Spike: Traffic can jump by something like 100x within a single second.
  • The Failure: If your load balancer accepts every connection, the servers behind it are overwhelmed and crash.
  • The Fix: Backpressure. The servers are designed to reject new connections once load passes a threshold (80% in this example). The client app (on the phone) has retry logic: “Connection failed. Wait Random(1, 5) seconds and retry.” That random jitter is critical. It spreads the 12:00:00 AM spike across the following five seconds instead of letting every client retry at the same instant, which protects the infrastructure.
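
Both halves of that fix can be sketched briefly. This is a simplified model under stated assumptions: `Gateway` is a hypothetical class that measures load as active connections against a fixed capacity, and the client-side delay is simulated rather than actually slept.

```python
import random
from collections import Counter


class Gateway:
    """Hypothetical gateway that sheds load above a utilisation threshold."""

    def __init__(self, capacity: int, reject_above: float = 0.8) -> None:
        self.capacity = capacity
        self.reject_above = reject_above
        self.active = 0

    def try_accept(self) -> bool:
        """Server side: refuse new connections once load reaches the threshold."""
        if self.active >= self.capacity * self.reject_above:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Call when a connection closes."""
        self.active = max(0, self.active - 1)


def retry_delay(rng: random.Random, low: float = 1.0, high: float = 5.0) -> float:
    """Client side: wait a random time in [low, high] seconds before retrying."""
    return rng.uniform(low, high)


gateway = Gateway(capacity=10)
accepted = [gateway.try_accept() for _ in range(10)]
print(accepted.count(True))    # 8: the last two clients are told to back off

# 10,000 rejected clients each pick their own retry time.
rng = random.Random(42)        # seeded only to make the example repeatable
delays = [retry_delay(rng) for _ in range(10_000)]
retries_per_second = Counter(int(d) for d in delays)
print(sorted(retries_per_second.items()))   # roughly even spread over seconds 1 to 4
```

Without the jitter, all 10,000 retries would land in the same instant and recreate the original spike; with it, each one-second window sees about a quarter of them.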

Conclusion

Building a chat app is easy. Building WhatsApp is hard because it requires mastering Concurrency.

  • Don’t just throw JSON over HTTP.
  • Use a lightweight binary protocol (MQTT/Protobuf).
  • Use a language that handles concurrency natively (Erlang, Go, Elixir).
  • Design the server to be a Router, not a Librarian.