Cross-partition uniqueness guarantee with global unique index

1.0 Introduction

My colleague, David, recently published a post, “Global Index, a different approach”, that describes the work we are doing to implement a global unique index in a way that does not change PostgreSQL’s current partitioning framework while still allowing a cross-partition uniqueness constraint. To implement this, we must first understand how PostgreSQL currently ensures uniqueness on a single table with a unique index, and then expand on that logic to support a cross-partition uniqueness check. My earlier blog gives a rough overview of how a unique index works in PG. In this blog, I would like to describe the approach we take to ensure the cross-partition uniqueness check during index creation in both serial and parallel builds.

2.0 Cross-Partition Uniqueness Check in Serial Global Unique Index Build

As described in that blog, uniqueness is guaranteed by doing a heap scan on a table and sorting the tuples inside one or two BTSpool structures. If 2 tuples with the same scan key are sorted right next to each other, a uniqueness violation is found and the system errors out. For example, when building a global unique index on a partitioned table containing 6 partitions, at least 6 different BTSpool structures will be filled and used to determine uniqueness violations within each partition's index creation. So if a duplicate exists in another partition, PG currently cannot detect that. In theory, if we introduce another BTSpool at a global scale that is visible to all partitions and lives until all partitions have been scanned, we can put the index tuples from all partitions into this global spool and determine cross-partition uniqueness simply by sorting it when the last partition scan is finished.
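To make the idea concrete, here is a minimal conceptual sketch of such a global spool built on PostgreSQL 14's existing tuplesort API (the same machinery a BTSpool wraps in nbtsort.c). The global_spool_* names are made up for illustration and this is not the actual patch; a real implementation also has to handle parallel workers, memory contexts and error cleanup:

/*
 * Conceptual sketch only: a "spool3" that outlives a single partition's
 * index build, based on the standard tuplesort API used by nbtsort.c.
 * The global_spool_* names are hypothetical.
 */
#include "postgres.h"
#include "miscadmin.h"          /* maintenance_work_mem */
#include "storage/itemptr.h"
#include "utils/rel.h"
#include "utils/tuplesort.h"

static Tuplesortstate *spool3 = NULL;   /* lives across all partition builds */

/* called once before the first partition is scanned */
static void
global_spool_init(Relation heap, Relation index)
{
    spool3 = tuplesort_begin_index_btree(heap, index,
                                         true,                  /* enforce uniqueness */
                                         maintenance_work_mem,
                                         NULL,                  /* no parallel coordinator */
                                         false);
}

/* called for every index tuple of every partition during the heap scans */
static void
global_spool_add(Relation index, ItemPointer tid, Datum *values, bool *isnull)
{
    tuplesort_putindextuplevalues(spool3, index, tid, values, isnull);
}

/*
 * called after the last partition has been scanned; the btree-unique
 * comparator reports "could not create unique index" when it sees two
 * equal keys while sorting, which is exactly the cross-partition check
 */
static void
global_spool_check_unique(void)
{
    tuplesort_performsort(spool3);
    tuplesort_end(spool3);
}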

This diagram below illustrates the position of the new global level BTSpool (called spool3) and how it can be used to determine cross-partition uniqueness.

Cross-partition uniqueness check in action:

create table globalidxpart (a int, b int, c text) partition by range (a);
CREATE TABLE

create table globalidxpart1 partition of globalidxpart for values from (0) to (100000);
CREATE TABLE

create table globalidxpart2 partition of globalidxpart for values from (100001) to (200000);
CREATE TABLE

create table globalidxpart3 partition of globalidxpart for values from (200001) to (300000);
CREATE TABLE

create table globalidxpart4 partition of globalidxpart for values from (300001) to (400000);
CREATE TABLE

create table globalidxpart5 partition of globalidxpart for values from (400001) to (500000);
CREATE TABLE

create table globalidxpart6 partition of globalidxpart for values from (500001) to (600000);
CREATE TABLE

insert into globalidxpart (a, b, c) values (42, 572814, 'inserted first on globalidxpart1');
INSERT 0 1

insert into globalidxpart (a, b, c) values (150000, 572814, 'inserted duplicate b on globalidxpart2');
INSERT 0 1

insert into globalidxpart (a, b, c) values (550000, 572814, 'inserted duplicate b on globalidxpart6');
INSERT 0 1

create unique index on globalidxpart (b) global;
ERROR: could not create unique index "globalidxpart1_b_idx"
DETAIL: Key (b)=(572814) is duplicated.

delete from globalidxpart where a = 150000 and b = 572814;
DELETE 1

create unique index on globalidxpart (b) global;
ERROR: could not create unique index "globalidxpart1_b_idx"
DETAIL: Key (b)=(572814) is duplicated.

delete from globalidxpart where a = 42 and b = 572814;
DELETE 1

create unique index on globalidxpart (b) global;
CREATE INDEX

Some Considerations

How many index tuples can this new global BTSpool hold? Since it needs to hold tuples from all partitions, does it need a lot of memory space allocated?

It uses maintenance_work_mem from postgresql.conf, the same as BTSpool1. When it is near capacity, it starts to write tuples to disk as temporary files (also referred to as logical tapes within PostgreSQL, more on this later) instead of keeping them in memory, so the spool can actually hold many more tuples than you might think. If memory is not enough to hold all tuples, PG will have written several logical tapes out to disk, and before determining uniqueness it performs a final merge sort over all of those tapes.

3.0 Cross-Partition Uniqueness Check in Parallel Global Unique Index Build

Cross-partition uniqueness checking using a global-scale spool is very straightforward in the serial index build case.

PG’s current parallel sorting is much more complex, as it uses logical tapes to share and merge intermediate sorted results written to disk as temporary files by each worker. At the final sorting stage, the leader process takes over all logical tapes written out by the workers and performs a final merge sort to determine uniqueness.

For example, if 3 workers (one of them being the leader) are requested to build a single partition’s index, there will be 3 logical tapes (or 3 temporary files) written out on disk, each sorted by its worker before being written. The workers use shared memory to coordinate with each other so that they do not write to the same tape files and overwrite each other. When all workers finish, the leader will “take over all logical tapes”, merge the tapes and perform a final sort. When done, PG will destroy all the parallel workers, which in turn destroys all logical tape files on disk, before moving on to the next partition.

So, to achieve a cross-partition check in parallel, we have to retain those logical tape files when we finish sorting one partition; currently PG destroys them when a partition’s parallel index build is finished. If the number of workers spawned is X and the number of partitions is Y, then when the last partition build finishes we should have X * Y logical tapes on disk to merge sort. We still use a separate spool3 to manage the tapes and persist them until all partitions are finished.

This diagram below illustrates the position of spool3 and how it can be used to determine cross-partition uniqueness in parallel.

How Unique Index Works in PG

1.0 Introduction

Recently I have been involved in creating a solution that guarantees cross-partition uniqueness within a partitioned table consisting of multiple child tables. Some people refer to this feature as a global index and there has been some discussion about it in the email thread here. Though the idea is good, the approach sparked a lot of controversy because it changes how partitioned indexes are stored: it basically stores all partition indexes together as one and uses TableOid as a key to reference them internally. My team and I are experimenting with an alternative approach that guarantees cross-partition uniqueness without changing these fundamentals, but before we can do that, we have to understand how uniqueness works in partitioned tables in PG.

2.0 CREATE INDEX Uniqueness Guarantee

PG follows this basic procedure to check for uniqueness violations:

  1. perform a heap scan on the target child partition table

  2. store the visible tuples into a BTSpool structure (spool1) and dead tuples into another BTSpool structure (spool2), so there are 2 BTSpool structures used and spool2 can be NULL if a table has no dead tuples or uniqueness is not required. A spool structure can be understood as a collection of index tuples.

  3. perform sort on spool1 and spool2 if available

  4. the sorting algorithm is equipped with duplicate detection; if two tuples with the same key end up sorted next to each other, a duplicate is found and an error is raised here if uniqueness is required.

  5. if no duplicate is found from sorting, PG will build the index tree based on spool1 and spool2

  6. Destroy all the spools when index creation is done

  7. This logic is located in btbuild() function within src/backend/access/nbtree/nbtsort.c and is invoked from DefineIndex function in indexcmds.c. btbuild will be called multiple times from DefineIndex depending on the number of active child partition tables.

3.0 INSERT And UPDATE Uniqueness Guarantee

At the planner and optimizer stage, PG already knows which child partition table the new data should be inserted into or updated:

  1. PG will first insert the heap tuple into the target heap relation.

  2. then it will attempt to insert new index tuples associated with the heap tuple by calling _bt_doinsert() in src/backend/access/nbtree/nbtinsert.c

  3. when uniqueness check is required by the index, PG will construct a scan key from the new heap tuple and try to fetch an existing matching tuple from the heap partition table by calling _bt_check_unique()

  4. if no matching heap tuple is found from the current child partition, then there is no conflict

  5. if a matching heap tuple is found in the current child partition, the additional checks below will be done:

  6. if the found tuple is not yet committed, (for example, another backend is still working on this tuple and has not yet committed), the process will wait here until the other process commits or aborts.

  7. when the other process commits or aborts, the process will re-fetch the same tuple again.

  8. if the other backend aborts, the duplicate tuple should fail to be fetched, and therefore no conflicts

  9. if the other backend commits, the duplicate tuple can still be fetched, and therefore a potential conflict.

  10. before raising an error, PG will do one more check, which is to fetch the visibility status of the heap tuple that PG is currently inserting. This is to cover the case where another backend is doing CREATE UNIQUE INDEX CONCURRENTLY while the current backend is trying to insert or update the data.

  11. check if the current tuple to be inserted can be fetched from the heap relation

  12. if yes, then the current tuple is still visible and there is definitely a conflict

  13. if no, then the current tuple has become invisible and is not considered a conflict

  14. continue with index tree creation if no conflict is found

This logic above is mostly located in _bt_doinsert() and _bt_check_unique() functions within src/backend/access/nbtree/nbtinsert.c.

4.0 ATTACH Uniqueness Guarantee

The table to be attached can either have a unique (or non-unique) index already defined or have no index at all. There are 2 potential cases during ATTACH:

  • Attaching a table with no index to a partitioned table: PG will automatically create a new index for the attached table following the same index parameters as the partitioned table's index. This index creation follows the same procedure described in section 2.0.

  • Attaching a table with a unique index already defined on the same index key as the partitioned table's global unique index: PG will not create a new index for the attached table and will simply attach it.

How to Set Up NFS and Run PG on it

Introduction

Network File System (NFS) is a distributed file system protocol that allows a user on a client node to access files residing on a server node over the network much like local storage. In this blog, I will share how to set up both an NFSv4 server and client on CentOS 7 and run PG on it.

NFS Server

First, install the NFS server components by

yum install nfs-utils

This installs the NFS services on the server machine, after which we can enable and start the NFS server:

systemctl enable nfs
systemctl start nfs

Create a directory that will be mounted by NFS clients

mkdir /home/myserveruser/sharedir

In /etc/exports, add a new record like below

/home/myserveruser/sharedir	X.X.X.X(rw,sync,no_subtree_check,no_root_squash)

This line allows a client with IP address X.X.X.X to mount the directory at /home/myserveruser/sharedir with read and write access, as specified by the rw option. For all possible options, refer to the blog here for the definition of each option used. This directory will be used to initialize a PostgreSQL database cluster by an NFS client over the network.

If more than one client/host will mount the same directory, you will need to include them in /etc/exports as well.

/home/myserveruser/sharedir	X.X.X.X(rw,sync,no_subtree_check,no_root_squash)
/home/myserveruser/sharedir	A.A.A.A(rw,sync,no_subtree_check,no_root_squash)
/home/myserveruser/sharedir	B.B.B.B(rw,sync,no_subtree_check,no_root_squash)

Then, we are ready to restart the NFS service to take into account the new changes in /etc/exports:

systemctl restart nfs

Record the userid and groupid of the user associated with the directory to be exported to the clients. For example, userid = 1009, groupid = 1009:

id myserveruser

You may also want to ensure that the firewall on CentOS 7 is either disabled or set to allow NFS traffic to pass through. You can check the firewall status and allow the NFS port with the following commands:

sudo firewall-cmd --zone=public --list-all
firewall-cmd --permanent --zone=public --add-port=2049/tcp
firewall-cmd --permanent --zone=public --add-port=2049/udp
firewall-cmd --reload

NFS Client

First, install the NFS client components by

yum install nfs-utils

Create a directory on the client side that will be used as the mount point to the NFS server

mkdir /home/myclientuser/mountpoint

Since PostgreSQL cannot be run as the root user, we need a way to mount the remote NFS share without root privileges. To do that, we add one more entry to /etc/fstab:

Y.Y.Y.Y:/home/myserveruser/sharedir		/home/myclientuser/mountpoint	nfs 	rw,noauto,user 0 0

where Y.Y.Y.Y is the IP address of the NFS server that client will connect to, nfs is the type of file system we will mount as, rw means read and write, noauto means the mount point will not be mounted automatically at system startup and user means this mount point can be mounted by a non-root user.

With this new line added to /etc/fstab, you could try mounting it using mount -a command or simply do a reboot

Now, we need to match the userid and groupid of myserveruser on the NFS server side and myclientuser on the NFS client side. The user names do not have to be equal, but the userid and groupid must be equal to have the correct permissions to access the mounted directory.

To set the userid and groupid of the myclientuser:

usermod -u 1009 myclientuser
groupmod -g 1009 myclientuser

where 1009 is recorded on the NFS server side by the id command.

Please note that the above commands change the uid and gid of the specified user, and the tools will also recursively update the ownership of all files under /home/myclientuser that are owned by that user. This process may take a long time to complete if you have a huge amount of data there, in which case I would recommend creating a new user instead of changing the userid and groupid. Also note that files outside of the user's home directory will not have their ownership updated, which means the user can no longer access them once the userid and groupid have been changed.

With all the setup in place, we are now ready to mount NFS:

$ mount /home/myclientuser/mountpoint

and the mountpoint folder on the NFS client side will now reflect the sharedir folder on the NFS server.

Finally, we can initialize a new database cluster on the mountpoint on the NFS client side

$ initdb -D /home/myclientuser/mountpoint

and the initial database files should be physically stored on the NFS server side instead of on the NFS client. This may take longer than usual because the file I/O is now done over the network via NFS.

Maintain a custom PG to PG Connection With libpq's COPY protocol

1. Introduction

Recently in my development work, a custom connection needed to be maintained between a PG backend on the primary and another PG backend on a standby node to communicate custom data, in addition to the existing walsender/walreceiver connection that streams WAL data. Of course, I could just create a new standalone backend and maintain a socket connection myself to communicate the custom data. Technically, it could work, but it also creates several problems: the persistence, user security and data encryption of this custom connection all need to be handled as well. So, why not just use libpq to handle all of these for us? In this blog, I will share my experience with using the libpq COPY protocol to maintain a custom connection for custom data, based on PG 14.

2. Create new routines in libpqwalreceiver.c

This file is located in src/backend/replication/libpqwalreceiver and is compiled as a shared library (.so) containing libpq-related routines, which allows a PG backend to use libpq without having it compiled into the backend code. When a PG backend process needs to use libpq, it first needs to load the shared library using a load_file() call.

My requirement is simple: the new routines I need are connect, send and recv, similar to normal socket interactions. I do not define a close function here, because I would like the connection to persist as long as the primary and standby are running. When one of them exits, the connection will be terminated automatically once the other side detects that its peer has disconnected.

3. The Connect Routine

Unlike the libpqrcv_connect routine for replication, my case is much simpler. I just need my standby node to connect to the primary node, so I can simply reuse the standby’s primary_conninfo configuration parameter to connect. This will trigger the primary node to fork a new backend process to serve this connection. A code snippet could look like this:

connect

I also set my libpq socket connection to use a blocking socket and set asyncStatus to PGASYNC_COPY_BOTH to indicate that I will be doing bidirectional data communication.
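Since the snippet itself is not reproduced here, below is a rough sketch of what such a connect routine might look like, modelled after libpqrcv_connect() in libpqwalreceiver.c. The name netbuf_connect is my custom routine, and assigning PGASYNC_COPY_BOTH directly requires libpq's internal header (libpq-int.h), so treat this as an illustration of the idea rather than a drop-in implementation:

/*
 * Sketch only, modelled after libpqrcv_connect() in
 * src/backend/replication/libpqwalreceiver/libpqwalreceiver.c.
 * netbuf_connect is a custom name; PGASYNC_COPY_BOTH comes from
 * libpq's internal header (libpq-int.h).
 */
static WalReceiverConn *
netbuf_connect(const char *conninfo)
{
	WalReceiverConn *conn = palloc0(sizeof(WalReceiverConn));

	/* blocking connect using the supplied conninfo (e.g. primary_conninfo) */
	conn->streamConn = PQconnectdb(conninfo);
	if (PQstatus(conn->streamConn) != CONNECTION_OK)
		ereport(ERROR,
				(errmsg("could not connect to peer: %s",
						PQerrorMessage(conn->streamConn))));

	/*
	 * Keep the socket blocking and mark the connection as COPY BOTH so that
	 * PQputCopyData()/PQgetCopyData() can be used for bidirectional traffic.
	 */
	conn->streamConn->asyncStatus = PGASYNC_COPY_BOTH;

	return conn;
}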

4. The Send Routine

My send routine is exactly the same as the libpqrcv_send routine for replication; both use PQputCopyData to send streams of data out to the primary. I only renamed it for consistency. Snippet below:

send
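A sketch of the send routine, which mirrors libpqrcv_send() almost line for line:

/* Sketch: same logic as libpqrcv_send(), renamed for this custom channel */
static void
netbuf_send(WalReceiverConn *conn, const char *buffer, int nbytes)
{
	if (PQputCopyData(conn->streamConn, buffer, nbytes) <= 0 ||
		PQflush(conn->streamConn))
		ereport(ERROR,
				(errmsg("could not send data to peer: %s",
						PQerrorMessage(conn->streamConn))));
}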

5. The Recv Routine

This is also very similar to the libpqrcv_recv routine for replication and shares almost exactly the same code, except that for my requirement the connection needs to be synchronous. This means that my standby will block while waiting for the primary to respond. In order to make recv synchronous, I had to pass 0 to the third argument of PQgetCopyData. So, if you are okay with an asynchronous connection, this routine could look exactly the same as libpqrcv_recv as well.

recv
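A sketch of the blocking recv routine, adapted from libpqrcv_receive() with the third argument of PQgetCopyData set to 0 (the recvBuf field follows the WalReceiverConn layout in libpqwalreceiver.c):

/* Sketch: blocking version of libpqrcv_receive() */
static int
netbuf_recv(WalReceiverConn *conn, char **buffer, pgsocket *wait_fd)
{
	int			rawlen;

	if (conn->recvBuf != NULL)
		PQfreemem(conn->recvBuf);
	conn->recvBuf = NULL;

	/* 0 = synchronous: block until a complete CopyData message is available */
	rawlen = PQgetCopyData(conn->streamConn, &conn->recvBuf, 0);
	if (rawlen == -1)			/* the COPY stream has ended */
		return -1;
	if (rawlen < -1)
		ereport(ERROR,
				(errmsg("could not receive data from peer: %s",
						PQerrorMessage(conn->streamConn))));

	*buffer = conn->recvBuf;
	*wait_fd = PQsocket(conn->streamConn);
	return rawlen;
}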

6. Having Standby to Send Out Some Custom Data

Now that we have the libpq wrapper routines made for our own purpose, we can have the standby send some custom data to the primary and wait for a response. Note that I am sending the letter 'N' followed by 3 example custom values: 100, 200 and 300. The libpq COPY protocol uses the letter 'd' to indicate a CopyData message, and what we are doing here is wrapping our own commands inside that 'd' message.

StringInfoData buf_blk_request;
WalReceiverConn *wrconn;
char	   *tmp;
pgsocket	fd;
int			len;

/* load the libpq wrapper shared library first */
load_file("libpqwalreceiver", false);

wrconn = netbuf_connect("dbname=postgres host=127.0.0.1 port=5550");

/* build the custom message: 'N' followed by 3 example values */
initStringInfo(&buf_blk_request);
pq_sendbyte(&buf_blk_request, 'N');
pq_sendint32(&buf_blk_request, 100);
pq_sendint32(&buf_blk_request, 200);
pq_sendint32(&buf_blk_request, 300);

/* Send it */
netbuf_send(wrconn, buf_blk_request.data, buf_blk_request.len);

/* Read the response (blocks until the primary replies) */
len = netbuf_recv(wrconn, &tmp, &fd);
if (len > 0)
{
	/*
	 * Something was received from primary
	 */
}

7. Having Primary to Receive the Custom Data

When the standby sends something using the methods above, the main command loop of the backend process serving this connection on the primary (in src/backend/tcop/postgres.c) will receive the data and decide what to do. Because we are using the COPY protocol, the first character is 'd', for which postgres.c already has a handler. So we need to add additional code under the 'd' handler in postgres.c to receive and process the data sent by the standby and provide a response if needed.

case 'd':			/* copy data */
	elog(DEBUG2, "copy data request received");
	{
		int			op;

		op = pq_getmsgbyte(&input_message);
		if (op == 'N')
		{
			StringInfoData buf_blk_reply;
			int			data1, data2, data3;

			/* receive the custom data sent by the standby (100, 200, 300) */
			data1 = pq_getmsgint(&input_message, 4);
			data2 = pq_getmsgint(&input_message, 4);
			data3 = pq_getmsgint(&input_message, 4);
			pq_getmsgend(&input_message);

			/* send some custom data back to the standby */
			pq_beginmessage(&buf_blk_reply, 'd');
			pq_sendint32(&buf_blk_reply, 400);
			pq_sendint32(&buf_blk_reply, 500);
			pq_sendint32(&buf_blk_reply, 600);
			pq_endmessage(&buf_blk_reply);
			pq_flush();
		}
	}
	break;

8. Summary

Based on the libpq COPY protocol, I have created a separate communication channel between a primary and a standby node that can be used to communicate custom data, similar to how you would normally handle a regular socket. All of this is based on the COPY protocol that libpq already supports, and within that protocol we wrap our own data. In the example above, when the standby sends 100, 200, 300 to the primary, the primary receives them and responds with 400, 500, 600. This simple example can be expanded to support other things that you may need in your development. This way of using COPY for my own purpose may not be the cleanest approach, but it is what works for me.

Understand PG's MVCC Visibility Check Rules

1. Introduction

PostgreSQL’s MultiVersion Concurrency Control (MVCC) is an “advanced technique for improving database performance in a multi-user environment”, according to Vadim Mikheev. With this technique, multiple “versions” of the same data tuple can exist in the system, governed by snapshots taken during different time periods. In other words, it is PG’s responsibility to figure out which tuples are visible to the user and which are not, according to parameters such as the snapshots taken, the current transaction ID, etc. This is known as the visibility check rules in PG, and in this blog I will talk about the basic principles of the visibility check so you get an idea of how PG performs this task internally. I hope it will be helpful in your development work.

2. Parameters Involved?

  • The tuple itself, which contains:
    • xmin (the transaction ID that inserts this tuple)
    • xmax (the transaction ID that deletes this tuple if > 0, otherwise it’s not deleted)
    • cid (command ID)
    • hintbit
  • Global top transaction ID if available
  • current snapshot, which also contains xmin, xmax and cid
  • Commit Log Data (CLOG)

3. The Process

3.1 Check the Hintbit

The visibility process starts by checking the hintbit. If the hintbit has a state of COMMITTED, then we can skip most of the other visibility checking rules for better efficiency. With the tuple committed, PG then takes the tuple's xmin value and checks it against the current snapshot to ensure the tuple is not currently in progress (see 3.3 below for the formula used to determine this). This check is required because there is a chance that this tuple, though committed, is still being updated by another backend process at this very moment. If the tuple is not currently in progress in another backend, PG will finally check its xmax value to ensure that it is invalid. Invalid means it has not been deleted. When all of the above are true, the tuple is considered visible.

If the hintbit indicates ABORTED, then the tuple is considered invisible to the user. If the hintbit has no value, then PG continues with the rest of the visibility check rules.

3.2 Check If Tuple’s xmin Equal to Global Top Transaction ID

The next check involves taking the tuple's xmin and comparing it with the global top transaction ID to see if they are equal. This global top transaction ID is only set when the user starts the transaction explicitly with a BEGIN clause. If the user does not start the transaction this way (no BEGIN clause is issued), then this global top transaction ID will not be set and this check is skipped.

If the tuple's xmin equals the global top transaction ID, this means the tuple is currently being worked on by the current backend, not others. This is where the command ID (cid) is used to determine visibility. Within a transaction block, each command issued has an associated command ID to indicate which command comes first and which comes later. For example, if a SELECT comes after an UPDATE within the same transaction, then the SELECT must see the new tuple updated by the previous UPDATE clause, and cid plays a part in this determination. When the UPDATE comes from another, concurrent transaction, the isolation level also matters, which is unfortunately out of scope for this blog: under the default READ COMMITTED level, a SELECT sees data committed by other transactions before the SELECT started, while under REPEATABLE READ the SELECT keeps seeing the snapshot taken at the start of its own transaction and will not see that data. Please keep this in mind.

If the tuple's xmin does not equal the global top transaction ID, then PG continues with the rest of the visibility checks.

3.3 Check the Current Snapshot

The next check determines whether this tuple is currently in progress in another backend process. This is done by comparing the tuple's xmin value against the current snapshot's xmin and xmax values according to this formula:

  • tuple xmin < snapshot xmin = not in progress
  • tuple xmin >= snapshot xmax = in progress
  • tuple xmin in between = the snapshot's list of in-progress transaction IDs decides

If this tuple is considered in progress by the snapshot, then it must not be visible to the current user because another backend is still working on it. If this tuple is NOT considered in progress, then PG continues with the rest of the visibility check rules.
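As a rough illustration, the simplified standalone function below implements just this test; the real code (XidInMVCCSnapshot() in the PostgreSQL source) additionally handles transaction ID wraparound and subtransactions, which are ignored here:

#include <stdbool.h>
#include <stdint.h>

/* a stripped-down snapshot: just the fields used by the formula above */
typedef struct SimpleSnapshot
{
	uint32_t	xmin;		/* every xid < xmin was finished at snapshot time */
	uint32_t	xmax;		/* every xid >= xmax had not started yet */
	uint32_t   *xip;		/* xids that were in progress at snapshot time */
	int			xcnt;
} SimpleSnapshot;

static bool
xid_in_progress(uint32_t tuple_xmin, const SimpleSnapshot *snap)
{
	if (tuple_xmin < snap->xmin)
		return false;			/* finished before the snapshot was taken */
	if (tuple_xmin >= snap->xmax)
		return true;			/* started after the snapshot: treat as in progress */

	/* in between: consult the snapshot's in-progress list */
	for (int i = 0; i < snap->xcnt; i++)
		if (snap->xip[i] == tuple_xmin)
			return true;
	return false;
}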

3.4 Check the Commit Log (CLOG)

This next check takes the tuple's xmin and checks it against the CLOG to see whether that transaction has been committed or aborted. The CLOG is like an array of transaction IDs in which each element stores a commit status. PG has a formula to convert a transaction ID into a CLOG block plus an offset to precisely access the right CLOG element. This CLOG data structure is regularly flushed to disk by the checkpoint process; the CLOG itself lives in the pg_xact directory, while the related pg_multixact and pg_subtrans directories store similar status data for multi-transactions and subtransactions.
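For illustration, the formula boils down to simple integer arithmetic. The small standalone program below mirrors the macros in src/backend/access/transam/clog.c, assuming the default 8 kB block size and 2 status bits per transaction:

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ              8192	/* default PostgreSQL block size */
#define CLOG_BITS_PER_XACT  2		/* a commit status takes 2 bits */
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)	/* 32768 */

int
main(void)
{
	uint32_t	xid = 1234567;

	uint32_t	page   = xid / CLOG_XACTS_PER_PAGE;				/* which CLOG page */
	uint32_t	byteno = (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
	uint32_t	bshift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

	printf("xid %u -> CLOG page %u, byte %u, bit offset %u\n",
		   xid, page, byteno, bshift);
	return 0;
}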

If the CLOG says the given tuple is committed, then PG will continue to check the tuple's xmax value to ensure that it is invalid. Invalid means it has not been deleted, and the tuple is therefore visible to the user. At the same time, PG will also update the hintbit value of this tuple to COMMITTED so that in the next visibility check, PG will not have to access the CLOG again, which is rather expensive.

If the CLOG says the given tuple is aborted or invalid, then this tuple is not visible to the current user. At the same time, PG will also update the hintbit value of this tuple to INVALID.

Summary

The above is the very basic set of visibility rules within PG according to my understanding. Of course there are other, more complex checks not mentioned here, such as the involvement of subtransactions and multi-transactions, but they follow a pattern similar to what is described today. I hope this blog is helpful to you.

Atomic Commit with 2 Phase Commit in FDW Distributed Setup

1. Introduction

PostgreSQL’s 2 phase commit (2PC) feature allows a database to store the details of a transaction on disk without committing it. This is done by issuing PREPARE TRANSACTION [name] at the end of a transaction block. When the user is ready to commit, he/she can issue COMMIT PREPARED [name], where [name] matches the [name] in the PREPARE TRANSACTION command. Because the transaction details are stored on disk with 2PC, the server is able to commit the transaction at a later time even if it crashes or is out of service for some time. In a single database instance, the use of 2PC is not critical; a plain commit can do the job equally well. However, in a larger setup where the data may be distributed over 2 or more database instances (for example, via Foreign Data Wrapper (FDW)), the use of 2PC is absolutely critical to keep every database instance in sync.

2. Atomic Commit Problem with Foreign Data Wrapper (FDW)

The current postgres_fdw does not support the use of 2PC to commit foreign transactions. When a commit command is sent to the main server, it sends the same commit command to all of the foreign servers before processing the commit for itself. If one of the foreign nodes fails the commit, the main server will go through an abort process and will not commit itself due to the failure. However, some of the foreign nodes could already have been committed successfully, resulting in a partially committed transaction.

Consider this diagram:

where the CN node fails the commit to DN3 and goes through an abort process, but at the same time DN1 and DN2 have already been committed successfully and can no longer be rolled back. This scenario creates a partial commit, which is not desirable.

3. FDW with 2PC Capability

If we were to add 2PC functionality to the current postgres_fdw, then instead of sending the same commit to all foreign servers, we would let the main server send PREPARE TRANSACTION first. The main server proceeds to send COMMIT PREPARED to all foreign servers only when all of them have successfully completed the PREPARE TRANSACTION. If one of them fails at the PREPARE stage, the main server is still able to roll back the foreign servers that have successfully prepared.

Consider this diagram:

where the CN node fails the PREPARE TRANSACTION to DN3 and sends ROLLBACK PREPARED to DN1 and DN2 before going through the abort process. With the 2PC method, there will not be any partial commits.

4. Handling COMMIT PREPARED Failure

If a foreign server fails at the PREPARE stage, it is not too late to roll back the rest of the foreign servers that succeeded at PREPARE, so the main server can still send ROLLBACK PREPARED to them. However, if a foreign server fails at the COMMIT PREPARED stage, the other foreign servers that have already succeeded can no longer be rolled back, potentially causing a partial commit as well.

In our implementation, we still allow the main server to continue with the commit even though a foreign server fails COMMIT PREPARED. In addition, we give the user a warning that one of the foreign servers may not have committed successfully, which could lead to a partial commit. The foreign server with a failed COMMIT PREPARED will now have what is called an “orphaned prepared transaction” that has yet to be committed.

Consider this diagram:

where DN3 fails the COMMIT PREPARED command and the CN node continues the commit with a warning.

5. Handling Orphaned Prepared Transaction

Following the above example, if we perform a SELECT query, we will see that DN3 does not have the updated value while DN1 and DN2 do. Also, DN3 still has the transaction prepared and stored on its disk. What's left is for somebody to log in to DN3 and manually run a COMMIT PREPARED command. If that is done successfully, there will no longer be a partial commit.

The way we handle this is to add an orphaned prepared transaction detector at each foreign server and to introduce an intermediate, external global transaction manager (GTM) node that records all the FDW operations. Again, following the above example, when DN3 detects an orphaned prepared transaction, it makes a connection to the GTM node and checks whether this prepared transaction comes from a CN node. If it does, then we simply let DN3 commit the prepared transaction by itself automatically, without any human intervention. If the GTM does not have a record of it, then this orphaned prepared transaction must have been created manually by another DN3 user, and DN3 should not do anything to it except give a warning in the log file.

This is the general concept of how we handle atomic commit and orphaned prepared transactions. There may be better and more complex solutions out there, but for us, having an intermediate GTM node to coordinate all the operations between CN and DN nodes seems to be the simplest.

Distributed Database With PostgreSQL - Atomic Commit Problems

1. Introduction

If you are into distributed database research, especially with a setup using Foreign Data Wrapper (FDW) + partitioned foreign tables, you have probably heard that there are many potential issues associated with this setup. Atomic commit, atomic visibility and global deadlock detection are among the most common issues that one can encounter in a distributed database environment.

There has been a lot of discussion in the PostgreSQL community about atomic commit and potential ways to solve this problem, and also about how this issue leads to an atomic visibility issue.

I recently had a chance to scan through several of the community discussions on atomic commit and would like to share what I have learned in this blog.

2. The problem

In a distributed environment, if a query involves modifications to multiple foreign tables (or shards), current PostgreSQL cannot guarantee that either all of them commit or all of them roll back. This means that if one of them fails to commit for some reason but the rest have succeeded, we will see inconsistent results. The desired behavior would be that if one of the foreign tables fails to commit, PostgreSQL should be able to roll back the ones that have been committed successfully.

However, with the current postgres_fdw and transaction manager, it is not possible to “uncommit” a foreign transaction that has already been committed. This is why most of the discussions generally regard PG's 2 phase commit as the basic building block to achieve atomic commit.

3. 2 Phase Commit (2PC)

2PC basically splits a regular commit action into 2 stages: a PREPARE TRANSACTION stage and a COMMIT PREPARED (or ROLLBACK PREPARED) stage. Doing this allows PG to back out of an in-progress commit in case it finds a condition under which all of the foreign transactions need to be aborted.

For example, say 1 Coordinator Node (CN) sends an UPDATE query to 3 Data Nodes (DNs) and would like to commit:

  • CN sends a PREPARE TRANSACTION to DN1 and waits for a response
  • CN sends a PREPARE TRANSACTION to DN2 and waits for a response
  • CN sends a PREPARE TRANSACTION to DN3 and waits for a response

If any of these PREPARE TRANSACTION commands (say the one to DN3) fails to be delivered or its response contains an error, the CN still has a chance to send ROLLBACK PREPARED to DN1 and DN2 to discard the prepared transactions that already exist there. After that, the CN aborts the local transaction and finishes the transaction with an error, and nothing gets changed.

Only if all 3 DNs have confirmed that PREPARE TRANSACTION succeeded can the CN proceed to commit the local transaction; if the local commit succeeds, it continues by sending COMMIT PREPARED to each DN, or ROLLBACK PREPARED if the local commit fails:

  • CN sends a COMMIT/ROLLBACK PREPARED to DN1 and waits for a response
  • CN sends a COMMIT/ROLLBACK PREPARED to DN2 and waits for a response
  • CN sends a COMMIT/ROLLBACK PREPARED to DN3 and waits for a response

If any of these COMMIT PREPARED commands fails at this point (say on DN3), the CN is not able to uncommit DN1 and DN2, which have already been 2-phase committed, so the CN can only retry indefinitely to deliver COMMIT PREPARED to DN3 to ensure atomicity. If DN3 is unreachable, the CN will keep trying forever without closing the transaction, leaving the client waiting.

If the client chooses to cancel the transaction by hitting ctrl-c, then the transaction will close with a warning message that DN3 has not confirmed the COMMIT PREPARED message and that data may be inconsistent. DN3 at this point may have an orphaned prepared transaction that needs to be handled.
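To make the flow above concrete, here is a sketch of the coordinator-side logic written as a plain libpq client (postgres_fdw manages its connections differently, and the function and GID naming here are made up for the example). Note that the second loop can only warn on failure, which is exactly where an orphaned prepared transaction is left behind:

#include <stdio.h>
#include <libpq-fe.h>

/* run one SQL command and report whether it succeeded */
static int
run(PGconn *conn, const char *sql)
{
	PGresult   *res = PQexec(conn, sql);
	int			ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

	PQclear(res);
	return ok;
}

/* returns 0 if all nodes committed, -1 if everything was rolled back */
int
two_phase_commit(PGconn *dn[], int ndn, const char *gid)
{
	char		sql[256];

	/* phase 1: PREPARE TRANSACTION on every data node */
	for (int i = 0; i < ndn; i++)
	{
		snprintf(sql, sizeof(sql), "PREPARE TRANSACTION '%s'", gid);
		if (!run(dn[i], sql))
		{
			/* a failed PREPARE aborts that node's own transaction;
			 * roll back the nodes that were already prepared */
			for (int j = 0; j < i; j++)
			{
				snprintf(sql, sizeof(sql), "ROLLBACK PREPARED '%s'", gid);
				run(dn[j], sql);
			}
			return -1;
		}
	}

	/* phase 2: COMMIT PREPARED; failures here can no longer be undone,
	 * so a real implementation has to retry or record the orphan */
	for (int i = 0; i < ndn; i++)
	{
		snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
		if (!run(dn[i], sql))
			fprintf(stderr, "warning: DN%d did not confirm COMMIT PREPARED\n", i + 1);
	}
	return 0;
}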

4. Orphaned Prepared Transaction

In the previous example, an orphaned prepared transaction can happen if the CN prepares a foreign transaction but never has a chance to commit it (possibly due to a network failure). In this case, the DN node can quite easily detect an orphaned prepared transaction based on the time the prepared transaction was created. Unfortunately, without a centralized Global Transaction Manager (GTM), the DN node is not able to decide whether this orphaned prepared transaction should be committed or discarded. Therefore, most of the solutions in the community only support the detection of orphaned prepared transactions, without any handling.

5. How the Atomic Visibility Issue Appears

Using 2PC allows the CN node to roll back a commit during the PREPARE stage, but it cannot roll back during the COMMIT PREPARED stage, so it can only retry forever or exit with a warning. The visibility issue can happen in a case where DN1 and DN2 have already committed their prepared transactions while DN3 has not yet committed. At this point, if a client runs a SELECT statement across DN1, DN2 and DN3, it will see that DN3 still does not have the right data while DN1 and DN2 do. This is not atomic at all.

To solve this, several people proposed a snapshot that can be shared globally across all the DN and CN nodes. The standard PG snapshot is purely based on transaction IDs to determine whether a tuple is visible or not, but with multiple different PG servers, each of them has its own transaction ID range, which could be very different, so sharing a snapshot based on transaction IDs is impossible.

This is why a Commit Sequence Number (CSN) snapshot is being proposed in the discussion. It is based on time, could be made the same on all of the CN and DN nodes, and would act as a reference point to determine whether a tuple belongs to the past or the future. The problem then shifts to how to ensure that all of the CN and DN nodes consistently have the same time: even the best clock chipsets are subject to skew and will slowly go out of sync. The concept of Clock-SI was then added to the discussion, which is designed to handle such clock skew. This makes the discussion somewhat more complicated and is out of scope here.

But for me, completely solving the atomic visibility issue with all this complexity may not be worth the effort.

6. References to some of the discussions

Transactions involving multiple servers
Proposal for CSN based snapshot
Global snapshot

The significance of LSN in PostgreSQL System

1. Introduction

Last year I wrote a blog about PostgreSQL's timeline concept, which is essential for executing Point In Time Recovery (PITR) back to a particular timeline and a particular Log Sequence Number (LSN). But we have not yet talked about the idea of the LSN itself, upon which everything else is built. In this blog, I will describe the significance of the LSN.

2. The LSN and REDO

We hear the term REDO a lot, and it can mean a couple of things depending on the context. When used as a REDO point, it indicates the WAL location at which the current (or most recent) checkpoint started; this is the point from which the database should commence recovery.

When REDO is used by itself as a verb, it refers to the action of a standby server, which reads the WAL contents sent from the primary via streaming replication and actually replays the actions. Depending on the database action, different ways to REDO need to be defined. This is why functions like heap_redo and heap2_redo exist in the heapam.c source code.

LSN stands for Log Sequence Number and it is basically a pointer to a location in the WAL. It is represented internally in PG as an unsigned 64-bit value; externally, it is represented as 2 hexadecimal numbers separated by a slash (/), the first being the high 32 bits and the second the low 32 bits of that value (from which the WAL segment file name and the offset within it can be derived).

For example:

1/CF54A048 can also be represented as the single value 0x01CF54A048, which equals 7,773,397,064, and this is the integer used internally in PG to compute or compare different LSN values.
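As a quick illustration, converting between the two representations is just splitting or joining the high and low 32 bits, which is essentially what the server-side pg_lsn type does internally:

#include <stdio.h>
#include <inttypes.h>

int
main(void)
{
	const char *text_lsn = "1/CF54A048";
	unsigned int hi, lo;

	if (sscanf(text_lsn, "%X/%X", &hi, &lo) != 2)
		return 1;

	uint64_t	lsn = ((uint64_t) hi << 32) | lo;

	/* prints: 1/CF54A048 = 7773397064 (0x1CF54A048) */
	printf("%s = %" PRIu64 " (0x%" PRIX64 ")\n", text_lsn, lsn, lsn);
	return 0;
}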

Based on this value, you can tell roughly how much WAL the cluster has generated over its lifetime: the bigger the LSN, the more WAL has been written.

3. The pd_lsn value in data page header

I have been researching the feasibility of having multiple PostgreSQL servers share a single set of data storage: one server is able to read and write, while the rest of the servers are read-only. On the read-only servers, I introduced a temporary storage space in their respective cluster folders to temporarily hold any data blocks that they want to evict from the buffer manager in case the primary server has not yet flushed those blocks to the shared storage.

During my testing with this approach, I came across an issue where a standby may face a situation in which two different versions of the same data page exist, one in the shared storage and one in its own temporary space. Loading the wrong version of the page may result in incorrect output for user queries.

So, how can we solve this issue?

The pd_lsn field inside each page header records the LSN of the last WAL record that modified that page. So by simply comparing the pd_lsn values of the 2 versions of the data page, I can ensure the standby always reads the correct one.
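In backend terms the decision can be as simple as comparing the two page headers with the standard PageGetLSN() accessor; a hypothetical helper might look like this:

#include "postgres.h"
#include "storage/bufpage.h"

/* pick whichever copy of the block was touched by the later WAL record */
static Page
pick_newer_version(Page shared_copy, Page local_copy)
{
	return (PageGetLSN(local_copy) > PageGetLSN(shared_copy)) ?
		local_copy : shared_copy;
}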

The LSN is significant in the sense that almost all of PG's WAL and recovery machinery works around it, and in my own feasibility study, the LSN also plays an important role in making shared-storage standbys more intelligent about loading data blocks from multiple sources.

2021 PG Asia Conference Delivered Another Successful Online Conference Again!

Introduction

On December 14-17, 2021, PostgresConf.CN & PGConf.Asia 2021 (referred to as the 2021 PG Asia Conference) was held online successfully and once again listed as a global gold standard conference on Postgresql.org! The conference was jointly organized by the China PostgreSQL Association, the PostgresConf International Conference Organization, and the PGConf.Asia Asian Community.

About the Conference

Also known as Asia's largest open source relational database ecology conference.

PostgresConf.CN and PGConf.Asia were hosted together as one online conference for the first time back in 2020. In 2021, both conferences joined forces again to deliver another successful conference that attracted a record-high number of PG enthusiasts from around the world!

PostgresConf.CN is an annual conference held by the China PostgreSQL Association for PostgreSQL users and developers. It is also one of the conference series held by the PostgresConf Organization. PostgresConf.CN 2019 took place in Beijing and was very well attended by PostgreSQL users and community members across the globe.

PGConf.Asia is also an annual PostgreSQL event; it took place in Bali, Indonesia in 2019 as a continuation of the PGConf.Asia event held in Tokyo, Japan in 2018. The first PGConf.Asia conference took place in 2016 in Tokyo. This conference acts as a hub of PostgreSQL-related development and technical discussion among PostgreSQL users and developers in the region as well as experts from around the globe.

Learn more about these conferences and the organizers from these resources:

Sponsors

This conference was sponsored by:

Platinum:

Golden:

Silver:

14 Conference Channels over 4 Days!

This conference was broadcast live with a record-high number of viewers streaming the conference events. It consisted of a total of 14 live channels delivering speeches in Mandarin Chinese, English, Korean and Indonesian.

The conference gathered 101 technical presentations and more than 100 well-known experts and scholars from around the world to provide a grand technical feast for all the participating PG enthusiasts, including:

  • Guangnan Ni, Fellow of the Chinese Academy of Engineering
  • Peng Liu, Vice chairman of China Open Source Software Promotion Alliance and researcher of the Chinese Academy of Sciences
  • Xing Chunxiao, deputy director of the Information System Committee of the China Computer Federation
  • Bruce Momjian, co-founder of PostgreSQL international community and vice president of EDB
  • Peter Zaitsev, Founder and CEO of Percona
  • Ding Zhiming, Director of the Spatiotemporal Data Science Research Center of the Chinese Academy of Sciences
  • Experts from HighGo, Alibaba, Tencent, Amazon, Kylin, Inspur, Ping An, Meichuang, SphereEx, Inspur Yunxi database, VMware Greenplum, Huawei, Ali cloud, ZTE, etc.
  • Professors from Peking University, Tsinghua University, Institute of Information Technology, Chinese Academy of Sciences, Shenzhen University, Zhengzhou University, Shandong Normal University, Suzhou University of Science and Technology and other academic and scientific research institutions
  • And many, many more!

With the great support from these PostgreSQL communities, the conference was held with great success, bringing together the Chinese PG community, major Asian PG contributors and many PostgreSQL experts worldwide to build the largest PG ecosystem in Asia.

Conference Highlights

Opening Speech by Fellow Guangnan Ni

Guangnan Ni

Guangnan Ni first expressed his warm congratulations to the holding of the PGConf.Asia 2021 Asia Conference, and greeted sincerely to the representatives of various countries and open source experts participating in the conference. He pointed out that open source is an important trend in the development of today’s world, and open source technology is profoundly changing the global digital economy and information industry pattern. The holding of this PG Asia Conference is of great significance to the development of the open source industry, the promotion of the open source movement and the popularization of open source culture in China, and the strengthening of international exchanges and cooperation.

Bruce Momjian

Bruce Momjian, as a core team member of the PG international community, has always been keen on sharing with the community. This time Bruce gave a talk on the status quo of PG and its project, competition and technical challenges, mainly emphasizing that the development of PG has kept adapting to changes in the environment with continuous breakthroughs and innovations, such as SSD optimization, virtualization, containerization, and functional expansion for cloud environments. For the future evolution of PG, work will continue to optimize and iterate on several aspects such as write amplification, TDE encryption, and horizontal scalability.

Peter Zaitsev

Mr. Peter Zaitsev, CEO of Percona, was a guest at the PG Asia Conference for the second time. This time Peter brought us a speech on the "State of Open Source Databases", covering cloud computing, the PG peripheral ecosystem, DaaS and K8S. It expressed the openness and flexibility of PG, which enables PG and the evolving technical architecture to be better integrated, promote each other's development, and meet the needs of different business scenarios. He said that Percona, which has been engaged in open source database services for many years, loves to embrace PG and make innovations based on PG to bring more value to customers.

Chunxiao Xing

Professor Xing Chunxiao, deputy dean of the Institute of Information Technology of Tsinghua University and deputy director of the Information System Committee of the China Computer Federation, shared with us the popular topics of the "metaverse", blockchain, and data lakes, emphasizing that data management is central to technology development: only the continuous evolution and iteration of database management software can properly support cutting-edge technologies and make our lives more meaningful.

Zhiming Ding

The ivorySQL project is a PG open source derivative project with a broad ecological foundation and Chinese characteristics. Its purpose is to develop community collaboration among Chinese developers and allow engineers to create greater personal and social value.

The session was presented by Professor Ding Zhiming, an early promoter of research practice in the domestic PG field, the new president of the China PG branch, and director of the Center for Space-Time Data Science (STDMS) of the Chinese Academy of Sciences. It systematically expounded the necessity and realization ideas of "PostgreSQL database software + Chinese characteristics", as well as the positioning, characteristics and operation mode of the ivorySQL project and community. Through the launch event, the project was officially introduced to the majority of PG practitioners in China.

The launch of IvorySQL attracted many domestic PG enthusiasts, who have provided suggestions and opinions to make the project better. We are very thankful for their effort on this new project. In the meantime, please check out the project's official website at www.ivorysql.org for the latest developments.

Xiaojun Zheng

Xiaojun Zheng, Chief Scientist of HighGo, presented “Highgo peer-to-peer service cluster database”.

Mr. Zheng first gave a systematic overview of the definition of a database cluster, and then explained the concept and implementation of a "peer-to-peer service cluster" in a more generalized manner. Finally, Mr. Zheng led everyone through a hands-on look at HighGo's peer-to-peer service cluster database environment. This practical demonstration offered a new idea and method for realizing peer-to-peer service clustering, and I believe architects and product managers who pay attention to the PostgreSQL database can gain new product inspiration from it.

Jian Feng

Director Feng first introduced the two major product lines of K1 Power and their respective market opportunities, advantages, and application scenarios. He then demonstrated the business value of K1 Power's advanced virtualization features to the PostgreSQL database in four aspects, and elaborated on the best practices of running the PostgreSQL database under K1 Power Linux and how it differs from x86 CPU platforms. Finally, Director Feng introduced the PowerFans community on GitHub (https://github.com/powerfans/). This community is a feast of technical resources: in addition to the above best practices, it provides open-source software binary packages, useful tools, documentation, and more, giving database enthusiasts great convenience in using the Power server platform. For example, for PostgreSQL under the PowerLinux platform you no longer need to build RPM packages yourself; Inspur commercial machines have them ready for you!

Grant Zhou - The Launch of IvorySQL Open Source Project

This conference specially invited the well-known domestic database solution supplier, HighGo Database, to jointly organize and launch the HighGo database sub-forum. Jichao Ma, the pre-sales director of HighGo Software, presented “Highgo Database New Features”, and introduced some new features and functions of HighGo Database related to high availability.

Grant Zhou, head of the IvorySQL open source community presented “IvorySQL-an Oracle compatible open source PostgreSQL database”, and introduced an Oracle-compatible open source PostgreSQL database ivorySQL.

KaiGai Kohei

KaiGai Kohei has been a contributor to PostgreSQL and the Linux kernel for over 15 years. He presented a speech called "GPU revision of PostGIS" that introduced an extension called "PG-Strom", which pulls in GPU resources for analytic SQL workloads.

Takeshi MISHIMA

Takeshi MISHIMA holds a Ph.D. in engineering and works at SRA OSS, Inc. He presented a speech called "A Middleware to Synchronize Multiple Instances in a PostgreSQL Cluster" that introduced a middleware called "Pangea", which provides better performance in cluster synchronization by load balancing SELECT queries.

Paul Brebner

Paul Brebner is the Technology Evangelist at Instaclustr. He has been learning new scalable open source technologies, solving realistic problems, building applications, and blogging about Apache Cassandra, Apache Kafka and of course PostgreSQL! He presented a speech called "Change Data Capture (CDC) With Kafka Connect and the Debezium PostgreSQL Source Connector" to introduce the Debezium PostgreSQL Connector and explain how it can make the elephant (PostgreSQL) as fast as a cheetah (Kafka), so that the system becomes more change-aware and flexible.

Michael Christofides

Michael Christofides is the co-founder of pgMustard. He presented a speech called "A beginners guide to EXPLAIN ANALYZE" to give a beginner-friendly introduction to EXPLAIN vs EXPLAIN ANALYZE.

Bohan Zhang

Bohan Zhang is the co-founder of OtterTune who presented a speech called “Ottertune: automatic database optimization using machine learning” which talks about how the database group of Carnegie Mellon University uses machine learning to automatically tune and improve database systems.

Ibrar Ahmed

Ibrar Ahmed is a Senior Software Architect at Percona LLC who has vast experience in software design and development with PostgreSQL. He presented a speech called "A Deep Dive into PostgreSQL Indexing" and shared some useful strategies on how to take full advantage of the different index types, which can be extremely useful for systems developers. He also emphasized that not all index types are appropriate for all circumstances; sometimes, improper use of an index can lower database performance.

Karel van der Walt

Karel van der Walt is the principal of MentalArrow. He presented a speech called "Modularizing computation via Common Table Expressions (CTEs)", which discusses adding modularized computation on top of the existing building blocks of CTEs to improve readability and testability.

Yugo Nagata

Yugo Nagata is a software engineer and chief scientist at SRA OSS, Japan. He is a specialist in database engineering with a great interest in PostgreSQL theories and internals. He presented a speech called "PostgreSQL Internals and Incremental View Maintenance Implementation" to share a very interesting technique to update materialized views rapidly (instead of the old REFRESH MATERIALIZED VIEW). He also explained the internal workings of materialized views and shared the status of his current patch proposal of the "Incremental View Maintenance" feature to the PostgreSQL community.

Cary Huang

Cary Huang is a senior software developer at HornetLab Technology in Canada. He presented a speech called "The principles behind TLS and How it protects PG" to discuss the TLS protocol in detail and explain how it is the most commonly used but least understood security protocol today.

Muhammad Usama

Muhammad Usama is a database architect at HornetLab Technology and a major contributor to Pgpool-II. He presented a speech called "PostgreSQL HA with Pgpool-II and whats been happening in Pgpool-II lately…." to discuss the roadmap of core features of Pgpool-II with a strong emphasis on High Availability (HA), which Muhammad believes is the most critical aspect of any enterprise application.

Asif Rehman

Asif Rehman is a senior software engineer at HornetLab Technology and he presented a speech called “PostgreSQL Replication” which explained the purpose of replication and discussed all types of replication methods with their pros and cons in different scenarios.

Seoul Sub-Forum


In the Seoul sub-forum, four Korean PostgreSQL user group members gave speeches, namely Ioseph Kim, Jaegeun Yu, Daniel Lee, and Lee Jiho.

Ioseph Kim is a contributor to PostgreSQL 11 and PostgreSQL 12 and translated the PostgreSQL Korean documentation. The topic of his speech was "Peculiar SQL with PostgreSQL for application developer". In this speech, Ioseph Kim talked about the syntax of PostgreSQL's lateral joins, DISTINCT ON, etc. He also introduced the new features of PostgreSQL 14.

Jaegeun Yu has more than 3 years of experience as a PostgreSQL DBA. The topic of his speech was "Porting from Oracle UDF and Optimization". In this speech, Jaegeun Yu talked about the basic principles of function optimization, showed how to write PostgreSQL functions efficiently while porting, and illustrated the migration process with examples.

Daniel Lee brought you a speech “Citus high-availability environment deployment based on Patroni”. In this speech, Daniel Lee introduced the high availability technology solution of Citus, and actually demonstrated the steps to build a Citus HA environment based on Patroni.

Lee Jiho is the head of the department at insignal in Korea, and has more than 20 years of experience in using PostgreSQL. Lee Jiho brought a speech “Database configuration and migration and index clustering around clusters in PostgreSQL”. In this speech, Lee Jiho introduced the changes and upgrades of Cluster, table clustering … etc.

Jakarta Sub-Forum


In the Jakarta sub-forum, CEO Julyanto Sutandang and CTO Lucky Haryadi from Equnix Business Solutions brought 3 speeches in Indonesian.

Equnix Business Solutions' CEO Julyanto Sutandang brought 2 speeches, "Active active PostgreSQL Do We Really Need It" and "In-memory Database, Is It Really Faster?", while CTO Lucky Haryadi's speech topic was "Does HA Help You Balance Your Load". As a technology company with strong influence in the field of PostgreSQL and Linux in Southeast Asia, the speeches brought by Equnix Business Solutions were a perfect combination of practice and theory.

The successful landing of this Jakarta sub-forum has greatly promoted the dissemination and development of PostgreSQL database technology in Indonesia and even Southeast Asia.

Special Thanks To


A Look Inside PostgreSQL's Extended Query Protocol

1. Introduction

A few weeks ago, I was tasked with taking a detailed look inside PostgreSQL's extended query protocol and studying its internal mechanisms for a project that depends on this particular feature. In this blog, I will explain in my own words how the extended protocol works and how it differs from a simple query.

2. Simple Query

A client is able to initiate 2 types of queries against a PostgreSQL server: simple queries or extended queries. A simple query, as the name suggests, is very simple and straightforward to understand. When we launch the psql client tool to connect to a PostgreSQL server, almost all of the SQL commands sent are simple queries. This includes the use of begin and commit to enclose a transaction, having multiple SQL queries included in one big query separated by semicolons, executing or defining a function, and much more.

Simple query automatically follows the standard query processing routines, which consist of these stages:

  • Parser
  • Analyzer
  • Rewriter
  • Planner
  • Executor

Refer to this blog here for detailed information about query processing.

The communication between the client and the server is also very straightforward. In the case of DML operations such as INSERT, the client sends the query to the server for processing and the server responds with a CommandComplete message such as INSERT 0 1, followed by a ReadyForQuery(IDLE) message to indicate that the server has finished the query and is now idle. The client can send another query and it follows the same pattern.

In the case of a SELECT query, the server sends a row description followed by the actual row data satisfying the query until there are no more rows to return. In the end, the server sends a ReadyForQuery(IDLE) message to indicate that the server has finished the query and is now idle.

3. Extended Query

An extended query is the other way for the client to complete a query, except that it breaks down the standard query processing into separate steps and the client is responsible for ensuring that these steps are followed and executed correctly. The client controls these steps by sending the following wire protocol message types:

  • ‘P’ message (Parse)

    • P message takes a generalized query string with data values replaced by placeholders like $1, $2, which can later be substituted with real values in the Bind step.
    • This generalized query string will be parsed via these query processing routines: Parser -> Analyzer -> Rewriter
    • At the end of a successful Parse, a prepared statement is produced, similar to SQL’s PREPARE clause
    • This prepared statement can be named or unnamed (more on this next).
    • This prepared statement is just a representation of the input query and it cannot be executed yet.
  • ‘B’ message (Bind)

    • B takes the named or unnamed prepared statement produced from the P message and replaces the placeholders ($1, $2) with the user-supplied values.
    • After the values are bound to the prepared statement, we basically have a completed query and it will then be fed into the planner stage to produce the most optimized query plan for the query.
    • At the end of a successful planning, a portal is produced.
    • This portal can also be named or unnamed (more on this next)
    • A portal is basically an object that represents how to execute a particular query
  • ‘E’ message (Execute)

    • E takes the named or unnamed portal produced from the B message and actually launches the executor to execute the query (or portal).
    • resultant rows (if any) are produced and returned to the client
  • ‘S’ message (Sync)

    • The client has to send an S message to the server to indicate the end of the extended query.
    • This message causes the server to end the current transaction and sends ReadyForQuery(IDLE) message back to client.

What is the purpose of separating a simple query into multiple steps? One major benefit of using the extended query protocol is that it can save a lot of unnecessary parsing of the same query structure. Instead, we can have the common structure parsed only once and then bind and execute it with different values multiple times, as in the example below.
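With libpq, for example, this parse-once, execute-many pattern maps onto PQprepare() (the P message) and PQexecPrepared() (Bind + Execute + Sync); the connection string and table name below are placeholders:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}

	/* P message: parse the query once into a named prepared statement */
	PGresult   *res = PQprepare(conn, "get_item",
								"SELECT * FROM items WHERE id = $1", 1, NULL);

	PQclear(res);

	/* B + E + S: bind different values and execute the same statement */
	const char *ids[] = {"1", "2", "3"};

	for (int i = 0; i < 3; i++)
	{
		const char *values[1] = {ids[i]};

		res = PQexecPrepared(conn, "get_item", 1, values, NULL, NULL, 0);
		printf("id=%s returned %d row(s)\n", ids[i], PQntuples(res));
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}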

4. Named and Unnamed Prepared Statement and Portal

In the previous section we mentioned that a prepared statement and a portal can be either named or unnamed during the extended query protocol. What does this mean and what is the significance?

The general rule is that the PostgreSQL server can only keep one unnamed prepared statement and one unnamed portal. A request containing an unnamed prepared statement or portal will replace the existing one.

With a named prepared statement or portal, the server remembers each of them by the name given, and the client is able to invoke them at any time or explicitly destroy them by sending a Close message.

So, it is possible for a client to create multiple prepared statements within a transaction, each having different prepared statement names, and then bind values for them all at the same time by giving them different portal names. Eventually, the client chooses which portal to be executed.

More importantly, a named prepared statement's lifetime lasts for the entire TCP session unless it is explicitly destroyed; a named portal lasts only until the end of the transaction, or until it is executed or explicitly destroyed.

But does using the extended query protocol make the client implementation more complicated?

Yes, but the client has a choice whether to complicate things or not.

Taking libpq as an example: when the extended query protocol is used, libpq requires the client application to provide a prepared statement name to construct the P (Parse) message, but it does not let the client application provide a portal name for the B (Bind) message. This means that with libpq, we can send P messages multiple times with different prepared statement names, but the B message always uses the unnamed portal, so we always have exactly one portal to execute. This avoids having multiple portal names to manage on the client side.
