B Trees

B-trees are the most common way of index implementation in almost all relational database systems.

B-trees use fixed size pages of around 4 KB (may vary though) to write the data. Each page has a unique identifier which allows to refer from one page to another page. This gives a tree-like structure like in the following figure:

(Source: Martin Kleppman’s Designing Data Intensive Applications: https://dataintensive.net/)

At the top of the tree there’s a root page. When a key of the index should be looked up, the process starts here.Each key of the root page contains references to child pages, which in turn refer to other pages or hold the final values. If the latter is the case the pages are called leaf pages. Following the references, starting at the root page will eventually lead to a leaf page.

Updating an existing value in a B-tree is quite simple as you simply have to to follow the references for the key that should be updated until it is found in a leaf page. The value for that particular key is changed and written again.

If a new key-value pair should be written, you’ll have to check whether the right leaf page has enough free space to hold the new data. If that’s the case, the data is simply written to the existing file. If not enough space is available, the page needs to be split into n equally sized parts. The parent page holding the reference must be updated accordingly.

A key feature of B-trees are that they are balanced, meaning pages have roughly the same size and all paths from the root to the leafs have the same lenght. The effect of this are predictable lookup times anf efficient traversals through the tree.

A drawback of B-trees are that write operations can be fairly complicated. Fist of all, if a page needs to be split, the parent leaf needs to be updated and re-written too. Second, some writes might be so huge that a single split might not suffice, rather splits of multiple pages are necessary. Thus, write operations can take some time to finish. If the database crashes while a write is happening the data might only partially be written.

To make a database more resilient to such cases, it’s common to add another data structure. This is called a write-ahead log and is written to disk. It is an append-only file and gets updated right before every write operation. If a database crashes while writing a B-tree, the write-ahead log can be used to redo the write operation once the database has recovered from the crash.