How PostgreSQL Handles Sub Transaction Visibility In Streaming Replication Setup?

1. Introduction

As an experienced PostgreSQL user, you may have a lot of experience in setting up streaming replication in your database clusters to make multiple backups of your data. But have you wondered how the standby is able to correctly determine if a tuple sent from the primary should be visible to the user or not. In the case where a transaction contains multiple subtransactions, how does the standby determine the visibility in this case? You might say… well it is PostgreSQL so it will just work… This is true. If you are someone who is curious and interested in knowing how PostgreSQL does certain things internally, then this blog may be interesting for you as we will talk about normally transaction and subtransactions and how both affect the standby’s ability to determine tuple visibility.

2. How is Tuple Visibility Determined?

PostgreSQL determines a tuple’s visibility based on the concept of transaction ID (a.k.a txid) and it is normally stored in a tuple’s header as either xmin or xmax where xmin holds the txid of the transaction that inserted this tuple and t_xmax holds the txid of the transaction that deleted or updated this tuple. If a tuple has not been deleted, its t_xmax is 0. There is also another mechanism called commit log (a.k.a clog) where it has information of currently active transaction IDs. When an inserted tuple has a valid t_xmin and invalid t_xmax, PostgreSQL will have to consult the clog first to determine if the txid stored in t_xmin is visible or not, then the result will be stored in the tuple’s hint_bit field about the visibility information such that it does not always have to consult the clog which could be resource intensive. If a tuple has a valid t_xmax, then there is no need to consult the clog as the tuple must be invisible.

This is just the basic ideas of how visibility is determined but it is enough for me to continue with this blog. For more information about visibility, you could refer to this resource

3. What is a Subtransaction?

As the name suggests, it is a smaller transaction that exists within a regular transaction and it can be created using the SAVEPOINT keyword after your BEGIN statement. Normally, when you have issued a BEGIN statement, all the tuples created from the subsequent DML statements such as INSERT or UPDATE will have a txid associated with these tuples and they will all be the same.

When you issue a SAVEPOINT statement within this transaction, all the tuples created from the subsequent DML statements will have a different txid that is normally larger than the main transaction’s txid. When you issue the SAVEPOINT statement again, the txid changes again.

The advantage of Subtransaction is that in a very large transaction, the entire transaction will not be aborted due to an error. It allows you to rollback to the previously created SAVEPOINT and continue to execute the main transaction.

The biggest difference between a subtransaction and a regular transaction is that there is not a concept of commit associated with subtransactions and it will not be recorded in the WAL file. However, if subtransaction is aborted, it will be recorded in the WAL file and subsequently be sent to the standby.

4. How Does a Standby Determines Visibility Information In a Regular Transaction?

Here’s an example of a regular transaction without any subtransactions:

BEGIN;
INSERT INTO test VALUES(111,111111);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,222222);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,333333);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
COMMIT;   => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby

In a general case, the above queries will send at least 4 WAL records to the standby during streaming replication. First 3 contain the actual tuple information with t_xmin set to the current transaction ID which is 500 in the example. In other words, before the COMMIT is issued, the standby should have already received 3 WAL records from primary and completed the redo, meaning that the 3 tuples already exist in standby’s memory buffer. However, they are not visible yet because transaction ID 500 does not exist in the standby. It is until the COMMIT is issued in the primary that will subsequently generate a XLOG_XACT_COMMIT WAL record to be sent to standby to notify that transaction ID 500 is now committed and valid. When standby receives the XLOG_XACT_COMMIT WAL record, it will eventually update to its clog so the subsequent SELECT queries can do visibility check against.

5. How Does a Standby Process Visibility Information In a Regular Transaction Containing Subtransactions?

Now, let’s use the same example but add some SAVEPOINT to create subtransactions

BEGIN;
INSERT INTO test VALUES(111,111111);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,222222);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,333333);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
SAVEPOINT A;
INSERT INTO test VALUES(111,444444);    => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,555555);    => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,666666);    => a heap_redo WAL record having t_xmin = 501 is sent to standby
SAVEPOINT B;
INSERT INTO test VALUES(111,777777);    => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,888888);    => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,999999);    => a heap_redo WAL record having t_xmin = 502 is sent to standby
COMMIT;   => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby

As you can see, after the creation of SAVEPOINT, the tuple’s t_xmin value will be changed and is no longer equal to the current transaction ID, which is 500 in the example. Before the COMMIT is issued, the standby should have received 9 WAL records, 3 having t_xmin = 500, 3 having t_xmin = 501 and 3 having t_xmin = 502. None of them is visible yet in the standby because these transaction IDs do not exist yet. When the primary issues the COMMIT statement, it will generate a XLOG_XACT_COMMIT WAL record and send to standby containing only the current transaction ID 500 without the 501 and 502.

After the commit, the standby is able to see all 9 tuples that are replicated from the primary. Now, you may be asking…. why? How come the standby knows that txid = 501 and 502 are also visible even though the primary never send a XLOG_XACT_COMMIT WAL record to tell the standby that 501 and 502 are also committed. Keep reading to find out.

Using the similar example, if we rollback to one of the SAVEPOINTs, the primary will send a XLOG_XACT_ABORT WAL record to notify the standby a transaction ID has become invalid.

BEGIN;
INSERT INTO test VALUES(111,111111);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,222222);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,333333);    => a heap_redo WAL record having t_xmin = 500 is sent to standby
SAVEPOINT A;
INSERT INTO test VALUES(111,444444);    => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,555555);    => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,666666);    => a heap_redo WAL record having t_xmin = 501 is sent to standby
SAVEPOINT B;
INSERT INTO test VALUES(111,777777);    => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,888888);    => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,999999);    => a heap_redo WAL record having t_xmin = 502 is sent to standby
ROLLBACK TO SAVEPOINT B;  => a XLOG_XACT_ABORT WAL record having t_xmin = 502 is sent to standby
COMMIT;   => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby

In this example, when doing a ROLLBACK to SAVEPOINT B, a XLOG_XACT_ABORT WAL record will be sent to the standby to invalidate that the transaction ID 502 is invalid. So when primary commits, records having txid=502 will not be visible as primary has notified.

6. The Secret of Standby’s KnownAssignedTransactionIds

In addition to handling the XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records to determine if a txid is valid or not, it actually manages a global txid list called KnownAssignedTransactionIds. Whenever a standby receives a heap_redo WAL record, it will save its xmin to KnownAssignedTransactionIds. Using the same example as above again here:

BEGIN;
INSERT INTO test VALUES(111,111111);    => a heap_redo WAL record having t_xmin = 500 is sent to standby, and standby puts 500 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,222222);    => a heap_redo WAL record having t_xmin = 500 is sent to standby, and standby puts 500 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,333333);    => a heap_redo WAL record having t_xmin = 500 is sent to standby, and standby puts 500 in KnownAssignedTransactionIds
SAVEPOINT A;
INSERT INTO test VALUES(111,444444);    => a heap_redo WAL record having t_xmin = 501 is sent to standby, and standby puts 501 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,555555);    => a heap_redo WAL record having t_xmin = 501 is sent to standby, and standby puts 501 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,666666);    => a heap_redo WAL record having t_xmin = 501 is sent to standby, and standby puts 501 in KnownAssignedTransactionIds
SAVEPOINT B;
INSERT INTO test VALUES(111,777777);    => a heap_redo WAL record having t_xmin = 502 is sent to standby, and standby puts 502 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,888888);    => a heap_redo WAL record having t_xmin = 502 is sent to standby, and standby puts 502 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,999999);    => a heap_redo WAL record having t_xmin = 502 is sent to standby, and standby puts 502 in KnownAssignedTransactionIds
COMMIT;   => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby and standby submits 500, 501 and 502 to clog as valid.

Before the COMMIT is issued on the primary, the standby already has txid 500, 501, and 502 present in the KnownAssignedTransactionIds. When the standby receives XLOG_XACT_COMMIT WAL, it will actually submit txid 500 from the WAL plus the 501 and 502 from KnownAssignedTransactionIds to the clog. So that the subsequent SELECT query consult the clog it will say txid 500, 501, and 502 are visible. If the primary sends XLOG_XACT_ABORT as a result of rolling back to a previous SAVEPOINT, that invalid txid will be removed from the KnownAssignedTransactionIds, so when the transaction commits, the invalid txid will not be submit to clog.

7. Summary

So this is how PostgreSQL determines a tuple’s visibility within subtransactions in a streaming replication setup. Of course, there is more details to what is discussed here in this blog, but I hope today’s discussion can help you get a general understanding of how PostgreSQL determines the visibility and maybe it will help you in your current work on PostgreSQL.