1.0 Introduction
Recently I have been involved in creating a solution that guarantees cross-partition uniqueness within a partitioned table consisting multiple child tables. Some people refer to this feature as a global index and there has been some discussion about it in the email thread here. Though the idea is good, the approach sparks a lot of controvercies as the approach changes how partitioned indexes are stored. It basically store all partitioned index together as one and uses TableOid as a key to reference internally. My team and I are experimenting an alternative approach that guarantees cross-partition uniqueness without changing the fundamentals, but before we can do that, we have to understand how uniqueness works in partitioned tables in PG.
2.0 CREATE INDEX Uniqueness Guarantee
PG follows this basic procedure to check uniqueness violations
perform a heap scan on the target child partition table
store the visible tuples into a
BTSpool
structure (spool1) and dead tuples into anotherBTSpool
structure (spool2), so there are 2 BTSpool structures used and spool2 can be NULL if a table has no dead tuples or uniqueness is not required. A spool structure can be understood as a collection of index tuples.perform sort on spool1 and spool2 if available
the sorting algorithm is equipped with duplicate detection, if two of the same tuples are subsequently sorted together, a duplicate is found and will raise Error here if uniqueness is required.
if no duplicate is found from sorting, PG will build the index tree based on spool1 and spool2
Destroy all the spools when index creation is done
This logic is located in btbuild() function within
src/backend/access/nbtree/nbtsort.c
and is invoked from DefineIndex function in indexcmds.c. btbuild will be called multiple times from DefineIndex depending on the number of active child partition tables.
3.0 INSERT And UPDATE Uniqueness Guarantee
At planner and optimizer stage, PG already know which child partition table the new data should be inserted or updated to
PG will first inserts the heap tuple in target heap relation.
then it will attempt to insert new index tuples associated with the heap tuple by calling
_bt_doinsert()
in src/backend/access/nbtree/nbtinsert.cwhen uniqueness check is required by the index, PG will construct a scan key from the new heap tuple and try to fetch an existing matching tuple from the heap partition table by calling
_bt_check_unique()
if no matching heap tuple is found from the current child partition, then there is no conflict
if a matching heap tuple is found from the current child partition, additional checks will be done in below:
if the found tuple is not yet committed, (for example, another backend is still working on this tuple and has not yet committed), the process will wait here until the other process commits or aborts.
when the other process commits or aborts, the process will re-fetch the same tuple again.
if the other backend aborts, the duplicate tuple should fail to be fetched, and therefore no conflicts
if the other backend commits, the duplicate tuple can still be fetched, and therefore a potential conflict.
before raising error, PG will do one more checking, which is to fetch the visibility status of the heap tuple that PG is currently inserting. This is to cover a case where another backend is currently doing CREATE UNIQUE INDEX CONCURRENTLY while the current backend is trying to insert or update the data.
check if the current tuple to be inserted can be fetched from the heap relation
if yes, then the current tuple is still visible and there is definitely a conflict
if no, then the current tuple has become invisible and is not considered a conflict
continue with index tree creation if no conflict is found
This logic above is mostly located in _bt_doinsert()
and _bt_check_unique()
functions within src/backend/access/nbtree/nbtinsert.c
.
4.0 ATTACH Uniqueness Guarantee
The table to be attached can either have unique (or not unique) index already defined or have no index at all. There are 2 potential cases during ATTACH:
Attaching a table with no index to a partition table, PG will automatically create a new index for the attached table following similar index parameters as the original partition table. But currently. This index creation follows the same procedures defined in
2.0
Attaching a table with a unique index defined using the same index key as partition table’s global unique index, PG will not create a new index for the attached table, and will simply attach the table.