View Categories

Database Internals – B-Trees vs. LSM Trees

3 min read

The Black Box Fallacy

Most developers treat databases like magic black boxes. You send data in, you get data out. When the database gets slow, the “naive” reaction is: “Upgrade the instance size!” or “Switch to NoSQL!”

But “NoSQL” isn’t a magical speed boost. The performance of a database is strictly dictated by the Data Structure it uses to write to the hard disk. If you don’t understand how your database writes to disk, you are designing blind.

The entire database world is essentially a battle between two algorithms:

  1. B-Trees (Used by PostgreSQL, MySQL, Oracle, SQL Server).
  2. LSM Trees (Used by Cassandra, DynamoDB, RocksDB, LevelDB).

The Core Logic: The Phonebook vs. The Journal #

To choose the right database, you must understand the physics of the disk head (or SSD page).

1. The B-Tree (The Phonebook)

  • Logic: A B-Tree is designed for Reads.
  • Analogy: Think of an alphabetically sorted Phonebook.
  • The Write Penalty: To add a new name (“Zack”), you can’t just scribble it on the cover. You have to find the “Z” page, potentialy split the page if it’s full, and shuffle entries around to keep it sorted. This is Random I/O. It is slow for writing.
  • The Read Reward: To find “Alice”, you use a binary search. You find the data in O(log n) time. It is blazing fast.

2. The LSM Tree (The Journal)

  • Logic: A Log-Structured Merge Tree is designed for Writes.
  • Analogy: Think of a daily Journal or a Ledger.
  • The Write Reward: To add “Zack”, you just write “Zack” on the very next blank line at the end of the file. You never go back to edit old pages. This is Sequential I/O. It is incredibly fast (close to max disk throughput).
  • The Read Penalty: To find “Alice”, you have to check the most recent journal entry… then the one before that… then the one before that. You might have to check 10 different files (SSTables) to find the current state of “Alice”.

Architecture Diagram: The Write Path #

Here is how the data flows differently in these two architectures.

flowchart TD
    %% ================= LEFT SIDE: B-TREE =================
    subgraph BTree ["B-Tree <br/> (PostgreSQL/MySQL)"]
        direction TB
        
        %% 1. Invisible Spacer to prevent title overlap
        SpacerB[ ]
        
        %% 2. The Content
        Write1[Insert Query] --> Buffer1[Buffer Pool]
        Buffer1 -- "Random I/O (Seek)" --> Disk1[("Disk Pages")]
        
        Note1["Note: Must find specific page <br/> and overwrite it (Slower Writes)."]
        Disk1 -.-> Note1

        %% 3. Force content below spacer
        SpacerB ~~~ Write1
    end

    %% ================= RIGHT SIDE: LSM TREE =================
    subgraph LSM ["LSM Tree <br/> (Cassandra/RocksDB)"]
        direction TB
        
        %% 1. Invisible Spacer to prevent title overlap
        SpacerLSM[ ]
        
        %% 2. The Content
        Write2[Insert Query] --> MemTable["MemTable (RAM)"]
        MemTable -- "Sequential I/O (Append Only)" --> SSTable1["SSTable File 1"]
        
        SSTable1 -.-> SSTable2["SSTable File 2"]
        
        Note2["Note: Writes are just appended. <br/> Background compaction fixes order later."]
        SSTable2 -.-> Note2

        %% 3. Force content below spacer
        SpacerLSM ~~~ Write2
    end

    %% ================= STYLING =================
    %% Hide the spacers completely but give them height
    style SpacerB height:20px,width:0px,fill:none,stroke:none
    style SpacerLSM height:0px,width:0px,fill:none,stroke:none

The Decision Matrix: The Physics Check #

Do not choose a database based on “Marketing.” Choose it based on your traffic pattern.

Traffic PatternThe Logic ChoiceWhy?
Read Heavy (90/10)B-Tree (SQL)If you read 100x more than you write (e.g., a CMS, User Profile), you want the Reads to be O(1). You can tolerate the slower writes.
Write Heavy (10/90)LSM Tree (NoSQL)If you are ingesting IoT sensor data, Chat Logs, or Event Streams, you need to write millions of rows per second. B-Trees will lock up. LSM Trees soak up writes like a sponge.
Updates (In-Place)B-TreeUpdating a row in a B-Tree is easy (jump to page, change value). Updating in LSM is actually a “new write” (tombstone), which wastes space until compaction runs.
Range ScansB-Tree“Give me users A-F.” B-Trees store them physically next to each other.

Real-World Case Study: Discord’s Chat History #

We discussed this briefly in Blueprint 02, but here is the Internal reason.

  • The Problem: Discord initially used MongoDB (B-Tree).
  • The Physics: People chat randomly. Writing a message into a B-Tree requires “Random I/O” to find the right collection and page. As the data set grew to terabytes, the disk head was jumping around too much (High Latency).
  • The Switch: They moved to Cassandra (LSM Tree).
  • The Result: A chat message is just “appended” to the log. The disk head never has to “seek.” It just writes. This allowed them to handle billions of messages with predictable low latency.

Conclusion #

There is no “Fast” Database. There are only databases optimized for Read Seek (B-Tree) or Write Append (LSM Tree).

  • Building a Bank? Use B-Trees (Consistency & Reads).
  • Building a Log/Chat System? Use LSM Trees (Ingestion Speed).