b+trees

06 Feb, 2025

Not long ago I debugged the on-disk b-tree implementation for the database I'm writing. To help with my own debugging process, I wanted to work through the b-tree insertion algorithm.

B-trees are a tree data structure, that is self-balancing and supports fast look-up, insert, and delete, making them good for databases. They tend to be short trees with a large fan-out (number of child nodes). For databases we want to minimize disk reads, so nodes are usually the size of a disk block. There are two types of nodes internal nodes and leaf nodes. Internal nodes have a list of keys and a list of pointers to children nodes, the keys partition the search space within a node -- marking the boundaries between two sets of data. There will always be one more pointer than key, and we can visualize the pointer key like they're laid out together with the || denoting a pointer.

Internal node

         [|    key a     ||    key b      ||    key c     |]
         /               |                |                \
        /                |                |                 \
    pointer[0]       pointer[1]       pointer[2]          pointer[3]
    (keys < a)       (a <= keys < b)  (b <= keys < c)     (c <= keys)

The first pointer holds values less than the first key. The important thing to keep in mind here is that pointers are to the right of the key you want to search for -- i.e. to access key[i] we want to access pointer[i+1]. (there is an asymmetry on leaf nodes).

The leaf nodes of a b-tree can hold pointers to data, with the last pointer of each leaf pointing to the neighboring leaf (the sibling pointer).

Leaf nodes

    [|   a   ||   b   ||   c   |]---sibling pointer-->[|   x   ||   y   ||   z   |]
    |        |        |                               |        |        |
    |        |        |                               |        |        |
  data a   data b   data c                          data x   data y   data z

Notice the asymmetry, the actual data for key[i] lives at pointer[i].

val insert: btree:btree_node -> key:key_value -> pointer:int -> unit

This is the signature of the main insert function, which recursively inserts the key-pointer pair on subtrees. For internal nodes we find the first index i where our key < key[i] and call insert on the subtree pointer[i] points to.

(* if key is smaller than the first key, insert key and pointer in front *)
(* else find the highest key[i] <= key and insert after *)
val insert_in_leaf: btree_leaf:btree_node -> key:key_value -> pointer:int -> unit

On leaf nodes the leaf can either be under capacity or full. Under capacity is easy, we just find the correct place to insert. If the leaf is full we're gonna have to insert and perform a split, splitting into a left leaf and right leaf (each half containing half of the key-pointers).

(* if left and right are leaf nodes the key is the smallest key from the right node*)
(* otherwise if it's internal nodes, then key is a key_value that partitions left and right *)
val insert_into_parent: btree_left_node:btree_node -> splitting_key:key_value -> btree_right_node:int -> unit

After splitting the leaf, we have to find the smallest key on the right leaf and insert that into the parent. This could subsequently split the internal parent node. One splits of internal nodes, the key to insert is the one that partitions the left and the right. We keep inserting into parent till we're at the make a new root/or internal node is under capacity.

The code for ocaml_db b+tree implementation can be found here