Atomic Commit with 2 Phase Commit in FDW Distributed Setup

1. Introduction

PostgreSQL’s 2 phase commit (2PC) feature allows a database to store the details of a transaction on disk without committing it. This is done by issuing PREPARE TRANSACTION [name] at the end of a transaction block. When the user is ready to commit, they can issue COMMIT PREPARED [name], where [name] must match the [name] given in the PREPARE TRANSACTION command. Because the transaction details are stored on disk with 2PC, the server is able to commit the transaction at a later time, even if it crashes or is out of service for a while. In a single database instance, the use of 2PC is not critical; a plain COMMIT can do the job equally well. However, in a larger setup where the data is distributed across 2 or more database instances (for example, via Foreign Data Wrapper (FDW)), the use of 2PC is absolutely critical to keep every database instance in sync.
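
For example, assuming a made-up table called accounts and a server configured with max_prepared_transactions > 0 (2PC is disabled by default), a minimal 2PC session looks like this:

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    PREPARE TRANSACTION 'demo_txn';  -- transaction state is written to disk
    -- ... even after a crash and restart, the transaction can still be finished ...
    COMMIT PREPARED 'demo_txn';      -- or ROLLBACK PREPARED 'demo_txn'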

2. Atomic Commit Problem with Foreign Data Wrapper (FDW)

The current postgres_fdw does not support the use of 2PC to commit foreign servers. When a commit command is sent to the main server, it sends the same commit command to all of the foreign servers before processing the commit for itself. If one of the foreign nodes fails the commit, the main server goes through an abort process and does not commit itself due to the failure. However, some of the foreign nodes could have already committed successfully, resulting in a partially committed transaction.

Consider this diagram:

where the CN node fails the commit to DN3 and goes through an abort process, but at the same time DN1 and DN2 have already committed successfully and can no longer be rolled back. This scenario creates a partial commit, which may not be desirable.

3. FDW with 2PC Capability

If we were to add 2PC functionality to the current postgres_fdw, then instead of sending the same commit to all foreign servers, we let the main server send PREPARE TRANSACTION instead. The main server should proceed to send COMMIT PREPARED to all foreign servers only when all of the foreign servers have successfully completed the PREPARE TRANSACTION. If one of them fails at the PREPARE stage, the main server is still able to roll back the foreign servers that have successfully prepared.
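
Conceptually, the command sequence issued over each foreign server connection would look like the following sketch (the transaction name 'fdw_txn_100' is made up; a real implementation would generate its own identifiers):

    -- sent to every foreign server at commit time:
    PREPARE TRANSACTION 'fdw_txn_100';

    -- only if every foreign server prepared successfully:
    COMMIT PREPARED 'fdw_txn_100';

    -- if any foreign server failed to prepare, sent to those that did prepare:
    ROLLBACK PREPARED 'fdw_txn_100';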

Consider this diagram:

where the CN node fails the PREPARE TRANSACTION to DN3 and sends ROLLBACK PREPARED to DN1 and DN2 before going through the abort process. With the 2PC method, there will not be any partial commits.

4. Handling COMMIT PREPARED Failure

If a foreign server fails at the PREPARE stage, it is not too late to roll back the rest of the foreign servers that have succeeded at PREPARE, so the main server can still send ROLLBACK PREPARED to those foreign servers. However, if a foreign server fails at the COMMIT PREPARED stage, the other foreign servers that have already committed can no longer be rolled back, potentially causing a partial commit as well.

In our implementation, we still allow the main server to continue with the commit even if a foreign server fails COMMIT PREPARED. In addition, we give the user a warning that one of the foreign servers may not have committed successfully, which leads to a potential partial commit. The foreign server with a failed COMMIT PREPARED is now left with what is called an “orphaned prepared transaction” that has yet to be committed.

Consider this diagram:

where DN3 fails a COMMIT PREPARED command and the CN node continues the commit with a warning.

5. Handling Orphaned Prepared Transactions

Following the above example, if we perform a SELECT query, we will see that DN3 does not have the updated value while DN1 and DN2 have been updated. Also, DN3 still has the transaction prepared and stored on its disk. What’s left to do is to have somebody log in to DN3 and manually run a COMMIT PREPARED command. If that is done successfully, there will no longer be a partial commit.
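
For instance, a DBA logged in to DN3 could locate the leftover transaction in the pg_prepared_xacts system view and finish it by hand (the transaction name shown is made up):

    -- on DN3: list prepared transactions still sitting on disk
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

    -- complete the commit that the CN node failed to deliver
    COMMIT PREPARED 'fdw_txn_100';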

The way we handle this is to add an orphaned prepared transaction detector at each foreign server and introduce an intermediate, external global transaction manager (GTM) node that records all the FDW operations. Again, following the above example, when DN3 detects an orphaned prepared transaction, it makes a connection to the GTM node and checks whether this prepared transaction came from a CN node. If it did, then we simply let DN3 commit the prepared transaction itself, automatically and without any human intervention. If the GTM has no record of it, then this orphaned prepared transaction must have been created manually by another DN3 user, and DN3 should not do anything to it except give a warning in the log file.
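
A minimal sketch of that check, assuming the GTM exposes a lookup table called gtm_fdw_transactions keyed by the prepared transaction name (both the table and the transaction name are our own illustration, not an existing catalog):

    -- run by DN3's detector over a connection to the GTM node:
    SELECT EXISTS (
        SELECT 1 FROM gtm_fdw_transactions WHERE gid = 'fdw_txn_100'
    ) AS from_cn;

    -- from_cn = true  -> the transaction came from a CN; DN3 self-commits:
    --                    COMMIT PREPARED 'fdw_txn_100';
    -- from_cn = false -> created manually by a user; only log a warning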

This is the general concept of how we handle atomic commit and orphaned prepared transactions. There may be better and more complex solutions out there, but for us, having an intermediate GTM node to coordinate all the operations between the CN and DN nodes seems to be the simplest.

Distributed Database With PostgreSQL - Atomic Commit Problems

1. Introduction

If you are into distributed database research, especially with a setup using Foreign Data Wrapper (FDW) + partitioned foreign tables, you have probably heard that there are many potential issues associated with this setup. Atomic commit, atomic visibility and global deadlock detection are among the most common issues that one can encounter in a distributed database environment.

There has been a lot of discussion in the PostgreSQL community about atomic commit, potential ways to solve this problem, and how this issue in turn introduces an atomic visibility issue.

I recently had a chance to scan through several of the community discussions on atomic commit and would like to share what I have learned in this blog.

2. The problem

In a distributed environment, if a query involves modifications to multiple foreign tables (or shards), current PostgreSQL cannot guarantee that either all of them commit or all of them roll back. This means that if one of them fails to commit for some reason but the rest have succeeded, we will see inconsistent results. The desired behavior would be that if one of the foreign tables fails to commit, PostgreSQL should be able to roll back the ones that have been committed successfully.

However, with the current postgres_fdw and transaction manager, it is not possible to “uncommit” a foreign transaction that has already been committed. This is why most of the discussions generally regard PG’s “2 Phase Commit” as the basic building block to achieve atomic commit.

3. 2 Phase Commit (2PC)

2PC basically splits a regular commit action into 2 stages, a PREPARE TRANSACTION stage and a COMMIT PREPARED (or ROLLBACK PREPARED) stage. Doing this allows PG to back out of an in-progress commit if it finds that all of the foreign transactions need to be aborted.

For example, say 1 Coordinator Node (CN) sends an UPDATE query to 3 Data Nodes (DNs) and would like to commit:

  • CN sends a PREPARE TRANSACTION to DN1 and waits for the response
  • CN sends a PREPARE TRANSACTION to DN2 and waits for the response
  • CN sends a PREPARE TRANSACTION to DN3 and waits for the response

If any of these PREPARE TRANSACTION commands (say, the one to DN3) fails to deliver or its response contains an error, CN still has a chance to send ROLLBACK PREPARED to DN1 and DN2 to discard their existing prepared transactions. After that, the CN aborts the local transaction and finishes the transaction with an error, and nothing gets changed.
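
In SQL terms, the abort path amounts to a single command on each node whose PREPARE succeeded (the transaction name 'txn_1' is made up):

    -- sent to DN1 and DN2, whose PREPARE TRANSACTION succeeded:
    ROLLBACK PREPARED 'txn_1';
    -- the prepared transaction is removed from disk; no data is changed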

Only if all 3 DNs have confirmed the PREPARE TRANSACTION success can CN proceed to commit the local transaction. If the local commit succeeds, it continues by sending COMMIT PREPARED to each DN; if the local commit fails, it sends ROLLBACK PREPARED instead:

  • CN sends a COMMIT/ROLLBACK PREPARED to DN1 and waits for the response
  • CN sends a COMMIT/ROLLBACK PREPARED to DN2 and waits for the response
  • CN sends a COMMIT/ROLLBACK PREPARED to DN3 and waits for the response

If any of these COMMIT PREPARED commands fails at this point (say, the one to DN3), CN is not able to uncommit DN1 and DN2, which have already been committed, so at this point CN can only retry indefinitely to deliver COMMIT PREPARED to DN3 to ensure atomicity. If DN3 is unreachable, CN will keep trying forever without closing the transaction, leaving the client waiting.

If the client chooses to cancel the transaction by hitting Ctrl-C, then the transaction will close with a warning message that DN3 has not confirmed a COMMIT PREPARED message and that data may be inconsistent. DN3 at this point may have an orphaned prepared transaction that needs to be handled.

4. Orphaned Prepared Transactions

In the previous example, an orphaned prepared transaction can happen if CN prepares a foreign transaction but never has a chance to commit it (possibly due to a network failure). In this case, the DN node can easily detect an orphaned prepared transaction based on the time the prepared transaction was created. Unfortunately, without a centralized Global Transaction Manager (GTM), the DN node is not able to decide whether this orphaned prepared transaction should be committed or discarded. Therefore, most of the solutions in the community only support the detection of orphaned prepared transactions, without any handling.
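
Detection itself is simple. A DN can, for example, flag anything in the pg_prepared_xacts view that was prepared more than a few minutes ago (the five-minute threshold here is arbitrary):

    -- on a DN: find prepared transactions that look abandoned
    SELECT gid, prepared, owner, database
      FROM pg_prepared_xacts
     WHERE prepared < now() - interval '5 minutes';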

5. How Does the Atomic Visibility Issue Appear?

Using 2PC allows the CN node to roll back a commit during the PREPARE stage, but not during the COMMIT PREPARED stage, where it can only retry forever or exit with a warning. The visibility issue can happen in a case where DN1 and DN2 have already committed their prepared transactions while DN3 has not yet committed. At this point, if a client runs a SELECT statement across DN1, DN2 and DN3, it will see that DN3 still does not have the right data while DN1 and DN2 do. This is not atomic at all.
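
Concretely, with a foreign table partitioned across the three DNs (the table name orders is made up), a query at that moment would return mixed results:

    -- on the CN, while DN3's prepared transaction is still uncommitted:
    SELECT * FROM orders;
    -- rows stored on DN1 and DN2 reflect the update,
    -- rows stored on DN3 still show the old values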

To solve this, several people proposed a snapshot that can be shared globally across all the DN and CN nodes. The standard PG snapshot is purely based on transaction IDs to determine whether a tuple is visible or not, but with multiple different PG servers, each of them has its own transaction ID range, and they could all be very different, so sharing a snapshot based on transaction IDs is impossible.

This is why a Commit Sequence Number (CSN) snapshot is being proposed in the discussion. It is based on time, could be made the same across all of the CN and DN nodes, and would act as a reference point to determine whether a tuple belongs to the past or the future. The problem then shifts to how to ensure that all of the CN and DN nodes consistently have the same time. Even the best clock chipsets are subject to time skew and will slowly go out of sync. The concept of Clock-SI was then added to the discussion, which is designed to handle such clock skew. This makes the discussion somewhat more complicated and out of scope.

But for me, it may not be worth the effort to completely solve the atomic visibility issue with all of this added complexity.

6. References to some of the discussions

Transactions involving multiple servers
Proposal for CSN based snapshot
Global snapshot
