mvcc paper
I wanted to read this paper so I could get a better idea of how indexes are managed under MVCC. In my previous post, I was a bit confused about how MVCC handles updating indexes. My understanding after reading most of the paper is that the DBMS doesn't update tuples in place; instead it creates a new version of the tuple and inserts it into the b-tree, which is why we need a garbage collector to clean up parts of the tree and keep it from growing unbounded. I took some notes on MVCC implementations and the trade-offs of each design decision mentioned in the paper. Overall I would say the index management section isn't as detailed as I would like, and the paper has no references for me to dig into further, but it did demystify how versions are managed by indexes (by allowing false positives to turn up).
- goal of the paper is to study how different design decisions impact the overall performance of the database
- looks into concurrency control protocols, version storage, garbage collection, and index management
- and evaluates on oltp workloads (transaction heavy workloads)
- metadata for mvcc
- txn-id, begin-ts, end-ts, pointer
- concurrency control protocol
- timestamp ordering (mvto)
- uses timestamps to determine serializability
- aborts transactions attempting to read or write versions whose write lock is held by another transaction
- dbms uses the txn-id to determine which versions are visible -- i.e. the txn-id is between begin-ts and end-ts and the write lock isn't held by another transaction
- read-ts prevents creating new versions from stale versions of a tuple; on an update we create a new version only if we can acquire the write lock and the read-ts is older than our txn-id
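A minimal sketch of these two MVTO checks in Python (the field names come from the paper's metadata list, but the exact comparisons are my paraphrase, not the paper's pseudocode):

```python
from dataclasses import dataclass

INF = float("inf")  # end-ts of the latest version

@dataclass
class Version:
    txn_id: int    # write lock: 0 means unlocked
    begin_ts: int
    end_ts: int
    read_ts: int   # largest timestamp that has read this version

def is_visible(v: Version, tid: int) -> bool:
    # visible if the version's lifetime [begin-ts, end-ts) contains tid
    # and no other transaction holds the write lock
    unlocked = v.txn_id == 0 or v.txn_id == tid
    return unlocked and v.begin_ts <= tid < v.end_ts

def can_update(v: Version, tid: int) -> bool:
    # only create a new version if the write lock is free and no
    # newer transaction has already read this version (read-ts check)
    return v.txn_id == 0 and v.read_ts <= tid
```

So a transaction with timestamp 10 can read and update a version that was last read at timestamp 7, but not one already read at timestamp 12.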
- optimistic concurrency control (mvocc)
- assumes conflicts are rare so transactions don't need to hold locks
- defers conflict detection till validation-phase
- can waste compute if there's a lot of conflicts
- paper mentions potential for starvation too
- transactions can't read new versions unless another transaction creating it has committed
- a transaction reading old versions won't find out until validation, where it will know to abort
- read phase, we can read and update visible versions
- validation phase, we create a new commit timestamp (commit-ts) and check the transaction's read set to check if any tuples have been updated by another committed transaction (comparing commit-ts with begin-ts)
- write phase, create a new version with a commit timestamp (assigned at the beginning of validation phase, t-commit)
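The validation-phase check can be sketched like this (a toy model under my own assumptions about the bookkeeping, not the paper's algorithm: I track only the begin-ts of each version a transaction read):

```python
def validate(read_set, latest_begin_ts, commit_ts):
    """read_set: tuple id -> begin-ts of the version this txn read.
    latest_begin_ts: tuple id -> begin-ts of the newest committed version.
    Fails if a tuple we read was superseded by a version that committed
    before our commit-ts."""
    for key, seen_ts in read_set.items():
        newest = latest_begin_ts[key]
        if newest != seen_ts and newest < commit_ts:
            return False  # another txn committed a newer version -> abort
    return True
```

If validation returns False, the transaction's work during the read phase is thrown away, which is the wasted compute the paper warns about under high contention.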
- two-phase locking (mv2pl)
- pessimistic approach assuming conflicts are likely, and uses locks to ensure serializability
- uses read-cnt and txn-id as read and write locks
- on reads increment read-cnt if no transactions hold the write lock -- txn-id is 0
- can only update if no transaction is reading or writing to tuple -- read-cnt and txn-id are both 0
- create a new commit timestamp (like mvocc) and create the new version of tuples and release all acquired locks
- transactions that can't acquire locks are aborted (no-wait policy -- can't have deadlocks if you avoid them completely xd)
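A sketch of how read-cnt and txn-id double as shared/exclusive locks with the no-wait policy (class and method names are mine; the counters are the ones from the paper):

```python
class TupleLock:
    # read-cnt acts as a shared lock, txn-id as an exclusive lock (0 = free)
    def __init__(self):
        self.read_cnt = 0
        self.txn_id = 0

    def try_read(self, tid):
        if self.txn_id != 0:           # a writer holds the tuple
            return False               # no-wait: the caller aborts
        self.read_cnt += 1
        return True

    def try_write(self, tid):
        if self.txn_id != 0 or self.read_cnt != 0:
            return False               # readers or a writer present -> abort
        self.txn_id = tid
        return True

    def release_read(self):
        self.read_cnt -= 1

    def release_write(self):
        self.txn_id = 0
```

There's no wait queue anywhere: a transaction that gets False back simply aborts and retries, which is exactly how no-wait sidesteps deadlock detection.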
- serialization certifier
- creates a graph to track dependencies between transactions and checks it for cycles -- an anti-dependency arises when a transaction creates a new version whose older version was read by another transaction
- can reduce aborts, and prevent write skews, but validating a graph comes with performance overhead
- mentions two certifiers, ssi (serializable snapshot isolation) and ssn (serial safety net)
- decent blog post on ssi (i didn't follow the explanation...so maybe not the best)
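The cycle check the certifier runs can be sketched as plain DFS over the dependency graph (my own toy version; real certifiers like SSI use cheaper approximations instead of a full graph walk):

```python
def has_cycle(edges):
    """edges: txn -> set of txns it depends on (e.g. anti-dependencies).
    A cycle in the dependency graph means the schedule isn't serializable."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in edges}

    def dfs(t):
        color[t] = GRAY
        for u in edges.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                return True            # back edge -> cycle
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in edges)
```

The performance overhead the paper mentions is visible here: maintaining `edges` and walking it on every commit is extra work that the timestamp-based protocols avoid.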
- version storage
- when a transaction performs an update, a new version of the tuple is created and inserted into the table.
- each tuple has a pointer field that dbms uses to create linked list of tuple versions, called a version chain
- when a delete happens, we can modify the begin-ts to make it not visible
- append-only storage
- old-to-new (o2n)
- head of the chain is the oldest version, and newer versions at the end
- we don't have to remap indexes, but we may have to traverse the whole chain to reach newer versions
- new-to-old (n2o)
- new version is the head with older versions at the end
- have to update all table indexes to point to the new version
- can use a layer of indirection to avoid this problem -- a single location maps the tuple's latest version to its physical address, and indexes point at that location instead
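A toy n2o version chain with that indirection slot (names are mine; in a real engine the slot would hold a physical address, here it just holds a reference):

```python
class VersionChain:
    """Newest-to-oldest chain. Indexes point at this object (the
    indirection slot), so installing a new head never touches them."""
    def __init__(self):
        self.head = None

    def install(self, payload, begin_ts):
        # new version becomes the head; old head becomes its successor
        version = {"payload": payload, "begin_ts": begin_ts, "next": self.head}
        self.head = version

    def latest(self):
        return self.head["payload"] if self.head else None
```

Without the indirection slot, every index entry for the tuple would need to be repointed at the new head on each update.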
- time-travel storage
- similar to new-to-old append only, except all the older versions are stored on a separate time-travel table
- when updating, copy tuple from the main table to the time-travel table then update the tuple on the main table
- indexes will always point to the tuple on the main table so we never have to update them
- delta storage
- like time-travel, the dbms keeps the main version of the tuple in the main table, and a sequence of delta versions in a separate delta table
- delta versions are the modified attributes of the tuple instead of the entire tuple itself (think of it as just diffing, and storing the old parts)
- overhead from having to access multiple attribute delta chains
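The diffing idea can be shown with plain dicts (my own sketch of the concept, not the paper's format: the delta keeps only the old values of changed attributes, and applying deltas walks backwards in time):

```python
def make_delta(old_tuple, new_tuple):
    # store only the attributes the update changed, keeping old values
    return {k: old_tuple[k] for k in old_tuple if old_tuple[k] != new_tuple.get(k)}

def apply_delta(current, delta):
    # reconstruct the previous version by overlaying the stored old values
    restored = dict(current)
    restored.update(delta)
    return restored
```

For an update that only touches one column of a wide tuple, the delta is a single attribute instead of a full copy, which is where the space savings come from.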
- garbage collection
- the goal of the gc is to detect expired versions, unlink them from their version chains, and reclaim the unused space
- expired versions are those versions created by aborted transactions or those not visible to any active transactions (checking end-ts against txn-id of all active transactions)
- tuple-level garbage collection - there are two primary methods
- background vacuuming (VAC)
- uses a daemon that periodically scans the db for expired versions
- doesn't really scale well for larger dbs as it will require running more vac threads
- transactions can register invalid versions in a latch-free structure
- cooperative cleaning (COOP)
- when dbms traverses version chains to find visible versions, it simultaneously records expired versions
- tuples that never get accessed can still hold expired versions, so usually a background vac is run as well
- transaction-level garbage collection (epoch based)
- assign each transaction an epoch (treat epochs as a segment of time transactions are active for)
- considers a transaction expired if all versions it generated are no longer accessible by currently active transactions
- a big draw is it works with all version storage schemes
- at the end of an epoch we just reclaim all the space the transactions used
- with this we have to now manage read/write sets for each transaction per epoch
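A toy epoch-based reclaimer (class and method names are my own; the rule it implements is the one above: free an epoch's versions once no active transaction from that epoch or earlier remains):

```python
from collections import defaultdict

class EpochGC:
    def __init__(self):
        self.current_epoch = 0
        self.expired = defaultdict(list)   # epoch -> versions to reclaim
        self.active = defaultdict(int)     # epoch -> count of active txns

    def begin_txn(self):
        self.active[self.current_epoch] += 1
        return self.current_epoch

    def end_txn(self, epoch, expired_versions):
        # the txn hands over the versions it expired (its write set)
        self.active[epoch] -= 1
        self.expired[epoch].extend(expired_versions)

    def advance(self):
        self.current_epoch += 1

    def reclaim(self):
        # free epochs older than the current one, stopping at the first
        # epoch that still has active transactions
        freed = []
        for e in sorted(self.expired):
            if e >= self.current_epoch or self.active[e] > 0:
                break
            freed.extend(self.expired.pop(e))
        return freed
```

Note this never inspects individual versions, which is why it works with any version storage scheme: the unit of reclamation is a whole epoch.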
- index management
- most of my questions came back to how updates work with the b-tree data structure, which maps keys to values
- since index doesn't contain the version information, we can get false positives (index points to a version that's not visible)
- how often an index updates its pointers depends a lot on the version storage scheme
- in delta storage schemes index always points to the main version on the main table
- o2n - we'll never have to update the index pointers
- n2o - always update the index to repoint to the new version
- logical pointers are a layer of indirection that maps a tuple's identifier (PKey or a 64-bit tupleId) to the head of its version chain
- physical pointers store the physical address of a version in the index entries (only applicable for append-only storage)
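Putting the false-positive idea together: an index probe returns a chain head with no version info attached, so the lookup has to filter by visibility while walking the chain. A sketch (function names and the dict layout are mine, and `visible` is a deliberately minimal visibility rule):

```python
def visible(v, tid):
    # minimal visibility rule: tid falls inside [begin-ts, end-ts)
    return v["begin_ts"] <= tid < v["end_ts"]

def index_lookup(index, key, tid, is_visible):
    """The index stores no version info, so the probe may surface
    versions that aren't visible to `tid` (false positives); walk the
    version chain and filter them out."""
    v = index.get(key)               # logical pointer: head of the chain
    while v is not None:
        if is_visible(v, tid):
            return v
        v = v["next"]                # keep walking to older versions
    return None                      # every candidate was a false positive
```

This is the demystifying bit from the intro: the index is allowed to be "wrong" about versions because the chain walk always gets the final say.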