mvcc paper
I wanted to read this paper so I could get a better idea of how indexes are managed under MVCC. In my previous post, I was a bit confused about how MVCC handles updating indexes. My understanding after reading most of the paper is that the DBMS doesn't update tuples in place; instead it creates a new version of the tuple and inserts it into the b-tree, which is why we need a garbage collector to clean up parts of the tree and keep it from growing unbounded. I took some notes on MVCC implementations and the trade-offs of each design decision mentioned in the paper. Overall I would say the index management section isn't as detailed as I would like, and the paper has no references for me to dig into further, but it did demystify how versions are managed by indexes (by allowing false positives to turn up).
- goal of the paper is to study how different design decisions impact the overall performance of the database
- looks into concurrency control protocols, version storage, garbage collection, and index management
- and evaluates on oltp workloads (transaction heavy workloads)
- metadata for mvcc
- txn-id, begin-ts, end-ts, pointer
- concurrency control protocol
- timestamp ordering (mvto)
- uses timestamps to determine serializability
- aborts transactions attempting to read or write versions whose write lock is held by another transaction
- dbms uses the txn-id to determine which versions are visible -- i.e. the txn-id is between begin-ts and end-ts and the write lock isn't held by another transaction
- read-ts prevents creating new versions from stale versions of a tuple; on an update we create a new version only if we can acquire the write lock and the read-ts is older than our txn-id
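A minimal sketch of these two MVTO checks in Python (the field names come from the paper's metadata list, but the exact comparisons are my paraphrase, not the paper's pseudocode):

```python
from dataclasses import dataclass

INF = float("inf")  # end-ts of the latest version

@dataclass
class Version:
    txn_id: int    # write lock: 0 means unlocked
    begin_ts: int
    end_ts: int
    read_ts: int   # largest timestamp that has read this version

def is_visible(v: Version, tid: int) -> bool:
    # visible if the version's lifetime [begin-ts, end-ts) contains tid
    # and no other transaction holds the write lock
    unlocked = v.txn_id == 0 or v.txn_id == tid
    return unlocked and v.begin_ts <= tid < v.end_ts

def can_update(v: Version, tid: int) -> bool:
    # only create a new version if the write lock is free and no
    # newer transaction has already read this version (read-ts check)
    return v.txn_id == 0 and v.read_ts <= tid
```

So a transaction with timestamp 10 can read and update a version that was last read at timestamp 7, but not one already read at timestamp 12.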
- optimistic concurrency control (mvocc)
- assumes conflicts are rare so transactions don't need to hold locks
- defers conflict detection till validation-phase
- can waste compute if there's a lot of conflicts
- paper mentions potential for starvation too
- transactions can't read new versions unless another transaction creating it has committed
- a transaction reading old versions won't find out until validation, where it will know to abort
- read phase, we can read and update visible versions
- validation phase, we create a new commit timestamp (commit-ts) and check the transaction's read set to check if any tuples have been updated by another committed transaction (comparing commit-ts with begin-ts)
- write phase, create a new version with a commit timestamp (assigned at the beginning of validation phase, t-commit)
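The validation-phase check can be sketched like this (a toy model under my own assumptions about the bookkeeping, not the paper's algorithm: I track only the begin-ts of each version a transaction read):

```python
def validate(read_set, latest_begin_ts, commit_ts):
    """read_set: tuple id -> begin-ts of the version this txn read.
    latest_begin_ts: tuple id -> begin-ts of the newest committed version.
    Fails if a tuple we read was superseded by a version that committed
    before our commit-ts."""
    for key, seen_ts in read_set.items():
        newest = latest_begin_ts[key]
        if newest != seen_ts and newest < commit_ts:
            return False  # another txn committed a newer version -> abort
    return True
```

If validation returns False, the transaction's work during the read phase is thrown away, which is the wasted compute the paper warns about under high contention.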
- two-phase locking (mv2pl)
- pessimistic approach assuming conflicts are likely, and uses locks to ensure serializability
- uses read-cnt and txn-id as read and write locks
- on reads increment read-cnt if no transactions hold the write lock -- txn-id is 0
- can only update if no transaction is reading or writing to tuple -- read-cnt and txn-id are both 0
- create a new commit timestamp (like mvocc) and create the new version of tuples and release all acquired locks
- transactions that can't acquire locks are aborted (no-wait policy -- can't have deadlocks if you avoid them completely xd)
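A sketch of how read-cnt and txn-id double as shared/exclusive locks with the no-wait policy (class and method names are mine; the counters are the ones from the paper):

```python
class TupleLock:
    # read-cnt acts as a shared lock, txn-id as an exclusive lock (0 = free)
    def __init__(self):
        self.read_cnt = 0
        self.txn_id = 0

    def try_read(self, tid):
        if self.txn_id != 0:           # a writer holds the tuple
            return False               # no-wait: the caller aborts
        self.read_cnt += 1
        return True

    def try_write(self, tid):
        if self.txn_id != 0 or self.read_cnt != 0:
            return False               # readers or a writer present -> abort
        self.txn_id = tid
        return True

    def release_read(self):
        self.read_cnt -= 1

    def release_write(self):
        self.txn_id = 0
```

There's no wait queue anywhere: a transaction that gets False back simply aborts and retries, which is exactly how no-wait sidesteps deadlock detection.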
- serialization certifier
- creates a graph to track dependencies between transactions and checks it for cycles -- an anti-dependency arises when a transaction creates a new version whose older version was read by another transaction
- can reduce aborts, and prevent write skews, but validating a graph comes with performance overhead
- mentions two certifiers, ssi (serializable snapshot isolation) and ssn (serial safety net)
- decent blog post on ssi (i didn't follow the explanation...so maybe not the best)
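The cycle check the certifier runs can be sketched as plain DFS over the dependency graph (my own toy version; real certifiers like SSI use cheaper approximations instead of a full graph walk):

```python
def has_cycle(edges):
    """edges: txn -> set of txns it depends on (e.g. anti-dependencies).
    A cycle in the dependency graph means the schedule isn't serializable."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in edges}

    def dfs(t):
        color[t] = GRAY
        for u in edges.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                return True            # back edge -> cycle
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in edges)
```

The performance overhead the paper mentions is visible here: maintaining `edges` and walking it on every commit is extra work that the timestamp-based protocols avoid.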
- version storage
- when a transaction performs an update, a new version of the tuple is created and inserted into the table.
- each tuple has a pointer field that dbms uses to create linked list of tuple versions, called a version chain
- when a delete happens, we can modify the begin-ts to make it not visible
- append-only storage
- old-to-new (o2n)
- head of the chain is the oldest version, and newer versions at the end
- we don't have to remap indexes, but we may have to traverse the whole chain to reach newer versions
- new-to-old (n2o)
- new version is the head with older versions at the end
- have to update all table indexes to point to the new version
- can use a layer of indirection to avoid this problem -- a single location maps the tuple's latest version to its physical address, and indexes point at that location instead
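A toy n2o version chain with that indirection slot (names are mine; in a real engine the slot would hold a physical address, here it just holds a reference):

```python
class VersionChain:
    """Newest-to-oldest chain. Indexes point at this object (the
    indirection slot), so installing a new head never touches them."""
    def __init__(self):
        self.head = None

    def install(self, payload, begin_ts):
        # new version becomes the head; old head becomes its successor
        version = {"payload": payload, "begin_ts": begin_ts, "next": self.head}
        self.head = version

    def latest(self):
        return self.head["payload"] if self.head else None
```

Without the indirection slot, every index entry for the tuple would need to be repointed at the new head on each update.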
- time-travel storage
- similar to new-to-old append only, except all the older versions are stored on a separate time-travel table
- when updating, copy tuple from the main table to the time-travel table then update the tuple on the main table
- indexes will always point to the tuple on the main table so we never have to update them
- delta storage
- like time-travel, the dbms keeps the main version of the tuple in the main table, and a sequence of delta versions in a separate delta table
- delta versions are the modified attributes of the tuple instead of the entire tuple itself (think of it as just diffing, and storing the old parts)
- overhead from having to access multiple attribute delta chains
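The diffing idea can be shown with plain dicts (my own sketch of the concept, not the paper's format: the delta keeps only the old values of changed attributes, and applying deltas walks backwards in time):

```python
def make_delta(old_tuple, new_tuple):
    # store only the attributes the update changed, keeping old values
    return {k: old_tuple[k] for k in old_tuple if old_tuple[k] != new_tuple.get(k)}

def apply_delta(current, delta):
    # reconstruct the previous version by overlaying the stored old values
    restored = dict(current)
    restored.update(delta)
    return restored
```

For an update that only touches one column of a wide tuple, the delta is a single attribute instead of a full copy, which is where the space savings come from.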
- garbage collection
- the goal of the gc is to detect expired versions, unlink them from their version chains, and reclaim the unused space
- expired versions are those versions created by aborted transactions or those not visible to any active transactions (checking end-ts against txn-id of all active transactions)
- tuple-level garbage collection - there are two primary methods
- background vacuuming (VAC)
- uses a daemon that periodically scans the db for expired versions
- doesn't really scale well for larger dbs as it will require running more vac threads
- transactions can register invalid versions in a latch-free structure
- cooperative cleaning (COOP)
- when dbms traverses version chains to find visible versions, it simultaneously records expired versions
- tuples that never get accessed can still hold expired versions, so usually a background vac is run as well
- transaction-level garbage collection (epoch based)
- assign each transaction an epoch (treat epochs as a segment of time transactions are active for)
- considers a transaction expired if all versions it generated are no longer accessible by currently active transactions
- a big draw is it works with all version storage schemes
- at the end of an epoch we just reclaim all the space the transactions used
- with this we have to now manage read/write sets for each transaction per epoch
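A toy epoch-based reclaimer (class and method names are my own; the rule it implements is the one above: free an epoch's versions once no active transaction from that epoch or earlier remains):

```python
from collections import defaultdict

class EpochGC:
    def __init__(self):
        self.current_epoch = 0
        self.expired = defaultdict(list)   # epoch -> versions to reclaim
        self.active = defaultdict(int)     # epoch -> count of active txns

    def begin_txn(self):
        self.active[self.current_epoch] += 1
        return self.current_epoch

    def end_txn(self, epoch, expired_versions):
        # the txn hands over the versions it expired (its write set)
        self.active[epoch] -= 1
        self.expired[epoch].extend(expired_versions)

    def advance(self):
        self.current_epoch += 1

    def reclaim(self):
        # free epochs older than the current one, stopping at the first
        # epoch that still has active transactions
        freed = []
        for e in sorted(self.expired):
            if e >= self.current_epoch or self.active[e] > 0:
                break
            freed.extend(self.expired.pop(e))
        return freed
```

Note this never inspects individual versions, which is why it works with any version storage scheme: the unit of reclamation is a whole epoch.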
- index management
- most of my questions came back to how updates work with the b-tree data structure, which maps keys to values
- since index doesn't contain the version information, we can get false positives (index points to a version that's not visible)
- how often an index updates its pointers depends a lot on the version storage scheme
- in delta storage schemes index always points to the main version on the main table
- o2n - we'll never have to update the index pointers
- n2o - always update the index to repoint to the new version
- logical pointers are a layer of indirection that maps a tuple's identifier (PKey or a 64-bit tupleId) to the head of its version chain
- physical pointers store the physical address of a version in the index entries (only applicable for append-only storage)
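Putting the false-positive idea together: an index probe returns a chain head with no version info attached, so the lookup has to filter by visibility while walking the chain. A sketch (function names and the dict layout are mine, and `visible` is a deliberately minimal visibility rule):

```python
def visible(v, tid):
    # minimal visibility rule: tid falls inside [begin-ts, end-ts)
    return v["begin_ts"] <= tid < v["end_ts"]

def index_lookup(index, key, tid, is_visible):
    """The index stores no version info, so the probe may surface
    versions that aren't visible to `tid` (false positives); walk the
    version chain and filter them out."""
    v = index.get(key)               # logical pointer: head of the chain
    while v is not None:
        if is_visible(v, tid):
            return v
        v = v["next"]                # keep walking to older versions
    return None                      # every candidate was a false positive
```

This is the demystifying bit from the intro: the index is allowed to be "wrong" about versions because the chain walk always gets the final say.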