A Look Inside PostgreSQL's Extended Query Protocol

1. Introduction

A few weeks ago, I was tasked with taking a detailed look inside PostgreSQL’s extended query protocol and studying its internal mechanisms for a project that depends on this particular feature. In this blog, I will explain in my own words how the extended query protocol works and how it differs from a simple query.

2. Simple Query

A client can initiate two types of queries against a PostgreSQL server: simple queries or extended queries. A simple query, as the name suggests, is very simple and straightforward to understand. When we launch the psql client tool to connect to a PostgreSQL server, almost all of the SQL commands sent are treated as simple queries. This includes using BEGIN and COMMIT to enclose a transaction, packing multiple SQL statements into one big query separated by semicolons, executing or defining a function, and much more.

A simple query automatically goes through the standard query processing routines, which consist of these stages:

  • Parser
  • Analyzer
  • Rewriter
  • Planner
  • Executor

Refer to this blog here for detailed information about query processing.

The communication between the client and the server is also very straightforward. In the case of a DML operation such as INSERT, the client sends the query to the server for processing, and the server responds with a CommandComplete message such as INSERT 0 1, followed by a ReadyForQuery(IDLE) message to indicate that the server has finished the query and is now idle. The client can then send another query, which follows the same pattern.

In the case of a SELECT query, the server sends the row description followed by the actual row data satisfying the query until there are no more rows to return. At the end, the server sends a ReadyForQuery(IDLE) message to indicate that it has finished the query and is now idle.

3. Extended Query

An extended query is the other way for the client to complete a query; it breaks the standard query processing down into separate steps, and the client is responsible for ensuring these steps are followed and executed correctly. The client controls these steps by sending the following wire protocol message types:

  • ‘P’ message (Parse)

    • The P message takes a generalized query string with data values replaced by placeholders such as $1 and $2, which can later be substituted with real values in the Bind step.
    • This generalized query string is parsed via these query processing routines: Parser -> Analyzer -> Rewriter.
    • At the end of a successful Parse, a prepared statement is produced, similar to SQL’s PREPARE statement.
    • This prepared statement can be named or unnamed (more on this below).
    • This prepared statement is just a representation of the input query; it cannot be executed yet.
  • ‘B’ message (Bind)

    • The B message takes the named or unnamed prepared statement produced by the P message and replaces the placeholders ($1, $2) with the user-supplied values.
    • After the values are bound to the prepared statement, we basically have a complete query, which is then fed into the planner stage to produce the most optimized query plan.
    • At the end of successful planning, a portal is produced.
    • This portal can also be named or unnamed (more on this below).
    • A portal is basically an object that represents how to execute a particular query.
  • ‘E’ message (Execute)

    • The E message takes the named or unnamed portal produced by the B message and actually launches the executor to execute the query (or portal).
    • Resultant rows (if any) are produced and returned to the client.
  • ‘S’ message (Sync)

    • The client has to send an S message to the server to indicate the end of the extended query.
    • This message causes the server to close the current transaction and send a ReadyForQuery(IDLE) message back to the client.

What is the purpose of separating a simple query into multiple steps? One major benefit of the extended query protocol is that it can save a lot of unnecessary re-parsing of the same query structure. Instead, the common structure is parsed only once and then bound and executed with different values multiple times.

4. Named and Unnamed Prepared Statement and Portal

In the previous section we mentioned that a prepared statement and a portal can be either named or unnamed in the extended query protocol. What does this mean, and what is the significance?

The general rule is that the PostgreSQL server keeps only one unnamed prepared statement and one unnamed portal. A request containing an unnamed prepared statement or portal replaces the existing one.

With named prepared statements or portals, the server remembers each of them by the name given, and the client can invoke them at any time or explicitly destroy them by sending a Close message.

So it is possible for a client to create multiple prepared statements within a transaction, each with a different prepared statement name, and then bind values for them all at the same time by giving them different portal names. Eventually, the client chooses which portal to execute.

More importantly, a named prepared statement’s lifetime lasts for the entire TCP session unless it is explicitly destroyed; a named portal lasts only until the end of the transaction, until it is executed, or until it is explicitly destroyed.

But does using the extended query protocol make the client implementation more complicated?

Yes, but the client has a choice about how much complexity to take on.

Taking libpq as an example: when the extended query protocol is used, it requires the client application to provide a prepared statement name to construct the P (Parse) message, but it does not require the application to provide a portal name to construct the B (Bind) message. This means that with libpq we can send the P message multiple times with different prepared statement names, but the B message always uses the unnamed portal, so there is only ever one portal to execute. This saves the client side from having to manage multiple portal names.
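The snippet below is a minimal sketch of this flow using libpq, assuming the libpq development headers are installed and a table test(a int, b text) already exists (both assumptions, not part of the original example). PQprepare sends the P (Parse) message once for the named prepared statement, and each PQexecPrepared call then sends Bind, Execute and Sync with new parameter values against the unnamed portal:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Parse once: named prepared statement with $1, $2 placeholders */
    res = PQprepare(conn, "ins_test", "INSERT INTO test VALUES ($1, $2)", 2, NULL);
    PQclear(res);

    /* Bind + Execute repeatedly with different values, no re-parsing */
    for (int i = 0; i < 3; i++)
    {
        char        num[16];
        const char *values[2];

        snprintf(num, sizeof(num), "%d", i);
        values[0] = num;
        values[1] = "hello";

        res = PQexecPrepared(conn, "ins_test", 2, values, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}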

The PostgreSQL Timeline Concept

1. Introduction

In my previous blog here, I discussed PostgreSQL’s point-in-time recovery, where PostgreSQL supports the ability to recover your database to a specific time, recovery point or transaction ID in the past, but I did not discuss in detail the concept of the timeline ID, which is also important in database recovery.

2. What is timeline ID and Why it is important?

A timeline ID is basically a point of divergence in the WAL. It represents a point, or to be exact, the LSN of the WAL at which the database starts to diverge. Divergence happens when a user performs a point-in-time recovery or when a standby server is promoted. The timeline ID is encoded in the first 8 hexadecimal digits of the WAL segment file names under the pg_wal/ directory.

For example:
pg_wal/000000010000000000000001, indicates that this WAL segment belongs to timeline ID = 1

and

pg_wal/000000020000000000000001, indicates that this WAL segment belongs to timeline ID = 2

A timeline ID behaves somewhat like a git branch, but without the ability to move forward in parallel or merge back into the master branch. Your development starts from a master branch, and you can create a new branch (A) from the master branch to continue developing a specific feature. Let’s say the feature also involves several implementation approaches, so you create additional branches (B, C and D) to implement each approach.

This is a simple illustration of git branch:

With timeline IDs, your database starts from timeline ID 1 and stays at 1 for all subsequent database operations. Timeline ID 2 is created when the user performs a point-in-time recovery on timeline 1, and all of the subsequent database operations from that point on belong to timeline ID 2. While on timeline 2, the user could perform more PITRs to create timelines 3, 4 and 5, respectively.

In the previous PITR blog, I mentioned that you could do PITR based on a time, a recovery point, an LSN or a transaction ID, but all of these only apply to one particular timeline. In postgresql.conf, you can select the desired recovery timeline with the recovery_target_timeline parameter. This parameter can be 'latest', 'current', or a particular timeline ID value.

With this configuration, a user is able to recover the database to a particular point of a particular timeline in the past.
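As an illustration only (the timeline and LSN values here are hypothetical), recovering onto timeline 2 up to a specific LSN could look like this in postgresql.conf:

recovery_target_timeline = '2'
recovery_target_lsn = '0/3002B08'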

This is a simple illustration of timeline ID:

3. The History File Associated with a Timeline ID

History files are created under the pg_wal/ directory with a .history suffix whenever a new timeline ID is created. Such a file describes all the past divergence points that have to be replayed in order to reach the current timeline. Without this file, it is impossible to tell where a timeline came from, and therefore impossible to do PITR.

For example, a history file 00000003.history may contain the following contents

cat pg_wal/00000003.history
1 0/30000D8 no recovery target specified

2 0/3002B08 no recovery target specified

which means that timeline 3 comes from LSN(0/3002B08) of timeline 2, which comes from LSN(0/30000D8) of timeline 1.

4. Importance of Continuous Archiving

With the concept of timeline ID, it is possible that the same LSN or the same WAL segments exist in multiple timelines.

For example, WAL segments 3, 4, 5 and 6 exist in both timeline 1 and timeline 2, but with different contents. Since the current timeline is 2, the segments in timeline 2 will continue to grow forward.

000000010000000000000001
000000010000000000000002
000000010000000000000003
000000010000000000000004
000000010000000000000005
000000010000000000000006
000000020000000000000003
000000020000000000000004
000000020000000000000005
000000020000000000000006
000000020000000000000007
000000020000000000000008

With more timelines created, the number of WAL segment files also increases. Since PG keeps only a certain number of WAL segment files before deleting them, it is super important to archive all the WAL segments to a separate location, either by enabling the continuous archiving function or by using the pg_receivewal tool. With all WAL segment files archived in a separate location, the user is able to perform a successful point-in-time recovery to any timeline and any LSN.

PostgresConf.CN and PGConf.Asia 2021 (China Branch) Join Forces Again to Bring You the Best!

About the Event

In November 2020, PostgresConf.CN and PGConf.Asia 2020 cooperatively organized the 2020 PG Asia Conference online for the very first time! The conference attracted over 100 participating experts, scholars and speakers from around the world to deliver a grand technical feast for all the participants. The event was broadcast exclusively via the Modb Technology Community platform in China with a record-high number of viewers streaming the conference channels. Each channel accumulated on average over 30,000 active live streams, the official conference blog accumulated over 50,000 views, and news reports and articles from the media exceeded 3,000 entries.

This year, both PostgresConf.CN and PGConf.Asia 2021 (China Branch) will join forces again to bring you the best PostgreSQL conference in Asia!

Official Event Website - 2021

Time and Rough Agenda

The 2021 PG Asia Conference will be a 4-day event from December 14 to 17, 2021.

The rough schedule is below. Please note that the final schedule is subject to change before the date of the meeting. All agenda times are Beijing Time (GMT+8).

  • December 14 - Day 1
    • 09:00-18:00 Main Conference
  • December 15 - Day 2
    • 09:00-17:00 Mandarin Sub-forum - Comprehensive Forum
    • 09:00-17:00 Indonesian Sub-forum
    • 09:00-17:00 Korean Sub-forum
    • 09:00-17:00 English Sub-forum
    • 09:00-12:00 Special Training A
    • 14:00-16:30 Special Training B
  • December 16 - Day 3
    • 09:00-17:00 Mandarin Sub-forum - Application Forum
    • 09:00-17:00 Japanese Sub-forum
    • 09:00-17:00 Thai Sub-forum
    • 09:00-12:00 Special Training C
    • 10:00-11:30 Developer Unconference
    • 14:00-16:30 Special Training D
  • December 17 - Day 4
    • 09:00-12:00 Special Training E
    • 14:00-16:30 Special Training F

About PostgresConf.CN

PostgresConf.CN is an annual event held by the China PostgreSQL Association for PostgreSQL users and developers. PostgreSQL is a leading open-source relational database with an active and vibrant community. PostgresConf.CN 2020 took place as an online conference for the very first time; it was very well attended by PostgreSQL users and community members across the globe. It is also one of the conference series held by the PostgresConf Organization.

Official Event Website - 2020

About PGConf.Asia

PGConf.Asia is shorthand for the Asian PostgreSQL Conference, which was hosted in Tokyo from 2016 until 2018. The conference was held in Bali for the first time in September 2019, and it is intended as the Asia-level international PostgreSQL conference. It co-hosted the 2020 PG Asia Conference together with PostgresConf.CN as an online event, which accumulated an unprecedented number of participants worldwide. The organization is often regarded as the hub and the summit of the PostgreSQL community throughout Asian countries.

Call for Paper

Each speaker is given a 30-minute session (25 minutes of talk + 5 minutes of Q&A). Topics can cover: kernel development, application practice, operations and maintenance management, data migration, distributed databases, high availability, security control, optimization, cloud databases, the open-source upstream and downstream ecosystem, etc.

Please submit your talk topic first. The deadline for topic submission is October 31, 2021. The organizing committee will complete the topic review before November 15, 2021. You will be notified of the result of your topic selection at that time, and you should then start preparing your talk content. The deadline for slide (PPT) submission is November 30, 2021.

Please download the PPT template here.

You will have to create a free Postgres Conference account in order to submit your topic.

Access the account registration here

Contacts

Global Sponsor Interface: grantzhou@postgresconf.org
About Speech submission and other inquiries: meeting@postgresqlchina.com

PostgreSQL 14 Continuous archiving and Point In Time Recovery Tutorial

1. Introduction

Recently I have been studying the internals of PostgreSQL’s continuous archiving and point-in-time recovery features, which are required for my current development work. Today I would like to demonstrate these important features with the recently released PostgreSQL 14 on Ubuntu 18.04.

2. Write Ahead Log?

Before going into the details of continuous archiving, it is important to understand the concept of the Write Ahead Log (WAL). WAL files generated by PG contain all of the operations performed on the database since the beginning; operations such as INSERT, UPDATE, DELETE, VACUUM, etc. are captured in WAL files. With these WAL files, it is possible to recreate the database simply by replaying them, allowing a user to recover the database to a certain state in case of a fault. This is the basis of continuous archiving and point-in-time recovery.

3. What is Continuous Archiving?

The generated WAL files are normally stored in the pg_wal directory within the PG database cluster, but they are not kept forever. The configuration parameters max_wal_size and min_wal_size control how many WAL files can be kept in the pg_wal directory, and the checkpointer process will periodically purge old WAL files, leaving only the recent ones.

So it is important to set up continuous archiving so that all of these WAL files can be archived to a location outside of the PG cluster. Then, when you need the old WAL files for recovery, PG can restore them from the archive.

To enable WAL archiving and restoring, set these parameters in postgresql.conf:

archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'
restore_command = 'cp /path/to/archive/%f %p'

where you should replace /path/to/archive with your own archive path on your system. %p and %f will be replaced with the path to the WAL segment and the WAL segment file name, respectively, to complete the command.
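As a side note, the official PostgreSQL documentation recommends an archive_command that refuses to overwrite a file that already exists in the archive, along these lines (same placeholder path as above):

archive_command = 'test ! -f /path/to/archive/%f && cp %p /path/to/archive/%f'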

When a WAL segment is ready to be archived, PG creates a status file in pg_wal/archive_status to indicate that this particular WAL segment is ready to be archived.

In the example below, the segment 00000001000000000000000E is ready to be archived, as indicated by the .ready suffix, while all of the previous segments have been successfully archived and are therefore marked with the .done suffix.

$ ls pgtest/pg_wal/archive_status/
000000010000000000000002.done 000000010000000000000005.done 00000001000000000000000A.done 00000001000000000000000E.ready
000000010000000000000003.done 000000010000000000000007.done 00000001000000000000000B.done
000000010000000000000004.done 000000010000000000000008.done 00000001000000000000000D.done

PG’s archiver process will then be woken up to perform the archiving by running the configured archive_command.

$ ps -ef | grep postgres

caryh 1487 1 0 11:10 ? 00:00:00 postgres -D /home/caryh/pgtest
caryh 1510 1487 0 11:10 ? 00:00:00 postgres: checkpointer
caryh 1511 1487 0 11:10 ? 00:00:00 postgres: background writer
caryh 1512 1487 0 11:10 ? 00:00:00 postgres: walwriter
caryh 1516 1487 0 11:10 ? 00:00:00 postgres: autovacuum launcher
caryh 1520 1487 0 11:10 ? 00:00:00 postgres: archiver archiving 00000001000000000000000E
caryh 1521 1487 0 11:10 ? 00:00:00 postgres: stats collector
caryh 1522 1487 0 11:10 ? 00:00:00 postgres: logical replication launcher

Note that the archiver (PID=1520) also shows its progress on the ps display.

When archiving completes successfully, the status file in pg_wal/archive_status will be renamed with the .done suffix.

$ ls pgtest/pg_wal/archive_status/
000000010000000000000002.done 000000010000000000000005.done 00000001000000000000000A.done 00000001000000000000000E.done
000000010000000000000003.done 000000010000000000000007.done 00000001000000000000000B.done
000000010000000000000004.done 000000010000000000000008.done 00000001000000000000000D.done

At the next checkpoint, these .done files will be removed, so the status files do not keep accumulating either.

4. What is Point In Time Recovery (PITR)?

With all of the WAL segments backed up in a separate archive, we gain the ability to recover the database up to a certain point in time in the past, or to completely recover the whole database. It depends on your use case: if you made a major mistake and need to start again from a point in time in the past, you can have PG recover to that particular time in recovery mode and continue database operation from that point. This is also referred to as switching to a new timeline ID, and we will discuss it more in the next blog.

Let’s continue from the above example (which already has 1 million rows of data) and do a point in time recovery.

  1. Make a base backup of the current database; we can use pg_basebackup to achieve this:

    $ pg_basebackup -U caryh -h 127.0.0.1 --progress -D pgtest-back
  2. Go back to the database and continue inserting some more data, then use pg_switch_wal to immediately write out the current WAL segment and obtain an LSN. LSN stands for Log Sequence Number, and it logically represents a WAL entry within a WAL segment. Refer to the documentation here for more information.

After obtaining the LSN, we again insert some more rows of data.

insert into test values(generate_series(1,1000000), 'asdas');
insert into test values(generate_series(1,1000000), 'asdas');

select pg_switch_wal();
 pg_switch_wal
---------------
 0/13DAC308

insert into test values(generate_series(1,1000000), 'asdas');
insert into test values(generate_series(1,1000000), 'asdas');

So, in total, this table test should have 5 million rows of data, because it started with 1 million and we just inserted 4 million more in the example above.

The WAL location indicated by LSN 0/13DAC308 corresponds to a time when the table contained only 3 million rows, and this is the point in time that we would like to recover to in our example.

  3. Stop the database server

    pg_ctl -D pgtest stop
  4. Wipe out everything in the database cluster directory pgtest

$ rm -rf pgtest/

I know it sounds crazy, but remember: we made a base backup back in step 1 and we have all the WAL segments in the archive, so technically we still have everything.

  5. Copy everything from our base backup back to pgtest
cp -r pgtest-back/* pgtest/
  6. Edit pgtest/postgresql.conf and set your recovery target

Since we are using an LSN as our target, we can simply put the LSN we captured into the recovery_target_lsn configuration:

recovery_target_lsn = '0/13DAC308'

PG also supports other ways to define the recovery target, based on a timestamp, a named restore point or a transaction ID. Refer to this documentation for the other options.

  7. Signal the database to run in recovery mode by creating a recovery.signal file under the pgtest cluster
$ touch pgtest/recovery.signal
  8. Start the server
    $ pg_ctl -D pgtest start

The server will now start in recovery mode, restore WAL files from the archive and perform the recovery. You may log in with psql and check that the table now contains 3 million rows instead of 5 million.

You may notice that even though the database has been recovered to a point in time in the past, you will encounter a "database is in recovery" or read-only error if you try to insert additional data. This is because the server is still in recovery mode, which is currently paused.

This behavior is configured by the recovery_target_action option, which defaults to pause. This is intentional: it gives you a moment to check your database and confirm that it is indeed the state you wanted to recover to. If it is wrong, you can simply shut down the database and reconfigure recovery_target_lsn until you reach the desired database state.

  9. Exit recovery mode
    Once you confirm the database has been recovered correctly, you can exit recovery mode with this psql command:
    select pg_wal_replay_resume();

This command ends recovery mode, and you should be able to insert additional data into the database. The recovery.signal file will be removed, and future WAL segments will have a new timeline ID.

The timeline ID is also an important aspect of recovery, and we will discuss it more in my next post.

Implement Foreign Scan With FDW Interface API

1. Introduction

Recently I have been tasked with familiarizing myself with the Foreign Data Wrapper (FDW) interface API in order to build a new FDW capable of vertical / columnar sharding, meaning that the FDW can collect column data from multiple sources and combine it into one query result. I will document and blog about vertical sharding in later posts. In this blog, I would like to share some key findings and my understanding of the FDW interface routines related to foreign scan.

The FDW foreign scan API routines have some similarities to the table access method API’s sequential scan routines; you can learn more about those in the blog post here. The most noticeable difference is that the FDW routines require us to take part in the planning stage, so we have to come up with a foreign plan for the executor so that it knows what to scan and how to perform the foreign scan.

In order to complete a foreign scan, the following FDW routines are invoked in order:

  • IsForeignScanParallelSafe
  • GetForeignRelSize
  • GetForeignPaths
  • GetForeignPlan
  • BeginForeignScan
  • IterateForeignScan
  • IterateForeignScan
  • EndForeignScan

2.1 IsForeignScanParallelSafe

In this routine, you have to tell PG whether or not the foreign scan can support parallel scan by returning true to indicate that it is supported, or false otherwise. Parallel scan is invoked by the planner when a scan query involves retrieving a large amount of data. Depending on the foreign server and your implementation, you have to decide whether or not parallel scan is supported.
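A minimal sketch of this callback might look like the following (the my_fdw naming is hypothetical; the callback signature comes from foreign/fdwapi.h):

#include "postgres.h"
#include "foreign/fdwapi.h"

static bool
my_fdw_IsForeignScanParallelSafe(PlannerInfo *root, RelOptInfo *rel,
                                 RangeTblEntry *rte)
{
    /*
     * Return false if per-scan state (e.g. a remote cursor or connection)
     * cannot be shared safely across parallel workers.
     */
    return false;
}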

2.2 GetForeignRelSize

Similar to the table access method’s relation_estimate_size() API, this routine requires you to provide an estimate of the number of tuples/rows that would be involved in this round of scanning. This information is very important because it directly affects the planner’s decisions about how to execute the query. For example, if you say the foreign relation has 1 million tuples to be fetched, then the planner is likely to use a parallel scan for performance reasons, provided you answered “true” in the IsForeignScanParallelSafe() call.
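A sketch of the idea follows; the hard-coded row count is only a placeholder for whatever estimation strategy your FDW uses (querying the remote server, local statistics, etc.):

#include "postgres.h"
#include "foreign/fdwapi.h"

static void
my_fdw_GetForeignRelSize(PlannerInfo *root, RelOptInfo *baserel,
                         Oid foreigntableid)
{
    /* Hand the planner a row-count estimate for this foreign relation. */
    baserel->rows = 1000000;    /* placeholder; estimate from your data source */
}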

2.3 GetForeignPaths

In this routine, we are required to provide a list of possible paths for completing the required foreign scan. A path simply means a way to execute the scan. For example, you can complete a foreign scan sequentially, fetching one tuple at a time, or you can complete the same scan using an index to look up a target tuple, if the table has an index defined and the user gives a specific WHERE clause to pinpoint a tuple. If the user provides a WHERE clause to filter the result, you also have the option of estimating whether the WHERE clause should be executed by the foreign server or by the local server as it fetches each tuple.

All of these possible path nodes are required by the planner to decide on the optimal path to execute the scan. Each path node requires you to provide the startup cost, total cost, number of rows and path keys, if any. The official documentation suggests using the add_path and create_foreignscan_path routines to prepare the list of path nodes, as in the sketch below.
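Here is a minimal sketch that offers the planner a single full-scan path; the cost numbers are made up, and the create_foreignscan_path argument list is as of PostgreSQL 14 (later versions add parameters, so check your version's pathnode.h):

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "optimizer/pathnode.h"

static void
my_fdw_GetForeignPaths(PlannerInfo *root, RelOptInfo *baserel,
                       Oid foreigntableid)
{
    Cost    startup_cost = 10;
    Cost    total_cost = startup_cost + baserel->rows;

    /* One candidate path: fetch everything from the foreign server */
    add_path(baserel,
             (Path *) create_foreignscan_path(root, baserel,
                                              NULL,      /* default pathtarget */
                                              baserel->rows,
                                              startup_cost,
                                              total_cost,
                                              NIL,       /* no pathkeys */
                                              NULL,      /* no required outer rels */
                                              NULL,      /* no fdw_outer_path */
                                              NIL));     /* no fdw_private */
}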

2.4 GetForeignPlan

Having been given all the possible path nodes, the planner will pick the one it considers most optimal, pass that path node to this routine, and ask you to make a foreign scan plan for the executor. This is a very important function, because the plan we provide here directly affects how the executor is going to behave, so be cautious here.

In this function, the planner gives you everything about the query, such as the target attribute list and the restricting clauses (WHERE clauses), and also information about the foreign relation, such as its OID and each column’s data type. The official PostgreSQL documentation suggests using the make_foreignscan() routine with the desired arguments to create the foreign plan.

Here you also have the option of pushing down the restricting clauses or not. In other words, you can choose to send all the WHERE clauses to the foreign server to process, if it can support them; this results in much less data communication. You can also choose to have all the WHERE clauses processed locally, but this requires the FDW to fetch every single row from the foreign server, resulting in more data communication. This decision is controlled by the List *qpqual argument to the make_foreignscan() function: if this list contains your restricting clauses, the local server will perform the WHERE evaluation locally; if this list is empty, then the local server assumes the FDW will do the WHERE evaluation on the foreign server.
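The following sketch keeps all WHERE clause evaluation on the local server by passing the stripped scan_clauses as qpqual (passing NIL there instead would signal that the foreign server handles them). The signatures used are from foreign/fdwapi.h, optimizer/planmain.h and optimizer/restrictinfo.h; treat it as an illustration under those assumptions, not a complete FDW:

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "optimizer/planmain.h"
#include "optimizer/restrictinfo.h"

static ForeignScan *
my_fdw_GetForeignPlan(PlannerInfo *root, RelOptInfo *baserel,
                      Oid foreigntableid, ForeignPath *best_path,
                      List *tlist, List *scan_clauses, Plan *outer_plan)
{
    /* Strip the RestrictInfo wrappers; keep the bare clause expressions */
    scan_clauses = extract_actual_clauses(scan_clauses, false);

    return make_foreignscan(tlist,
                            scan_clauses,   /* qpqual: evaluated locally */
                            baserel->relid,
                            NIL,            /* no expressions for the executor */
                            NIL,            /* no fdw_private data */
                            NIL,            /* no custom scan target list */
                            NIL,            /* no remote quals to recheck */
                            outer_plan);
}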

2.5 BeginForeignScan

This routine is called before starting the foreign scan, giving you a chance to allocate all the necessary control information required for your foreign scan. You can define your own control information, for example the cursor location, remote connection structure, current offset, etc., store it in node->fdw_state, and it will be passed down to IterateForeignScan as the scan is carried out.
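For example, a sketch that allocates a hypothetical per-scan state structure and stashes it in node->fdw_state:

#include "postgres.h"
#include "foreign/fdwapi.h"

/* hypothetical per-scan control information */
typedef struct MyFdwScanState
{
    void       *conn;       /* remote connection handle */
    long        offset;     /* how far the scan has progressed */
} MyFdwScanState;

static void
my_fdw_BeginForeignScan(ForeignScanState *node, int eflags)
{
    MyFdwScanState *state = (MyFdwScanState *) palloc0(sizeof(MyFdwScanState));

    /* open the remote connection / cursor here ... */
    node->fdw_state = state;
}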

2.6 IterateForeignScan

This is the main routine that performs the actual scan. You need to put the proper logic here, depending on your use case, to fetch one or more rows from the foreign server. When you have the row data, you need to convert it into a TupleTableSlot structure that the PG internals can understand; we normally use the ExecStoreHeapTuple() routine to convert a HeapTuple into a TupleTableSlot. This routine will be called repeatedly by the PG executor as long as you still have data to fetch, and it only stops once you return an empty TupleTableSlot.
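A sketch of the per-row loop; fetch_next_remote_tuple() is a hypothetical helper (not a PostgreSQL function) that returns a palloc'd HeapTuple, or NULL when the remote data is exhausted:

#include "postgres.h"
#include "access/htup.h"
#include "foreign/fdwapi.h"
#include "executor/tuptable.h"

extern HeapTuple fetch_next_remote_tuple(void *state);     /* hypothetical */

static TupleTableSlot *
my_fdw_IterateForeignScan(ForeignScanState *node)
{
    TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
    HeapTuple       tuple = fetch_next_remote_tuple(node->fdw_state);

    if (tuple == NULL)
        return ExecClearTuple(slot);        /* empty slot ends the scan */

    /* hand the tuple to the executor; 'true' lets the slot free it later */
    return ExecStoreHeapTuple(tuple, slot, true);
}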

2.7 EndForeignScan

This routine is called at the end of the foreign scan, giving you the opportunity to clean up the control information allocated in BeginForeignScan(). This marks the end of a foreign scan via the FDW.

3. Summary

There is a lot of information and there are many articles out there related to PostgreSQL’s FDW API. This blog is merely a quick summary of what I learned during the FDW evaluation, and there is much more to it than what is discussed here. For more detail, you can refer to the official documentation about the FDW callback functions here.

Understanding the Security Around PostgreSQL

1. What Is Security?

The word “security” covers a very broad concept and can refer to completely different procedures and methodologies. Knowing what security means to your application is very important, so that you can apply the proper security practices and procedures to ensure the safety of your company’s assets. Data compromises often lead to financial loss, reputation damage, loss of consumer confidence, brand erosion, and non-compliance with government and industry regulations.

For this reason, security for infrastructure software such as PostgreSQL is even more important, because a data compromise could have nation-wide or city-wide impacts, which are often difficult to fully recover from.

2. Common Database Compromises

User Compromises:

  • Excessive privileges
  • Privilege abuse
  • Weak user authentication
  • Weak password
  • Default privilege too open

Data Compromises:

  • Unmanaged and unprotected sensitive data
  • Backup data exposure
  • Stolen hard disks
  • Unmanaged encryption keys

Network Compromises:

  • Firewall rules
  • Deep Packet Inspection (DPI)
  • Vulnerability prevention
  • Denial of Service (DoS) attacks

Vulnerability:

  • Software bug
  • Buffer overflow
  • SQL injection
  • Privilege escalation

3. The Big Picture

img

This picture shows different types of “security” around a PostgreSQL server and there are roughly 5 types of security concepts involved here:

3.1 Data Security (Over Network)

This is the security of the communication between the PostgreSQL client and server; in a production environment we almost always want to use TLS to encrypt the data communication. TLS guarantees mutual trust between the client and the server, so each side is sure that it is communicating with the right entity instead of a rogue server. SSH tunneling is also a common option for securing a psql connection when TLS is not fully set up. SSH tunneling is also very secure, as each connection forces the client and server to generate and agree on an encryption key that is valid only for that session. Furthermore, SSH tunneling can be made more secure by setting up a public and private key pair between client and server to ensure the authenticity of the two entities.

3.2 Data Security

This is the security between PostgreSQL and the disk to which it writes data. This type of security is often referred to as “cluster data encryption” or “Transparent Data Encryption”. The current version of PostgreSQL does not support this feature, but there is a handful of talented people working on it right now. This security is designed to prevent data compromises performed directly on the hard disk: by encrypting the data on disk, a stolen hard disk will not yield any useful information.

3.3 Network Security

This is the security that most likely involves a firewall between a connecting client and the server. The purpose of a firewall is to block most malicious connections coming from the public network and prevent unauthorized access to the server. More advanced firewalls such as an IPS can block DoS attacks and perform deep packet inspection against a database of known malicious packets and attacks. There are also systems such as an IDS that perform network monitoring only and raise an alert to the operator should they detect an attack attempt.

3.4 Vulnerability

This category covers security issues mostly caused by software bugs that allow an attacker to take advantage of the server, steal data, or simply put a stop to the server and cause damage. The best way to prevent this is to upgrade your PostgreSQL server to the latest version, which has addressed the known vulnerabilities.

3.5 User Security

This is the security that relates mostly to user management, sometimes called Role-Based Access Control (RBAC). This is where a database administrator manages each database user and sets the right privileges for the right users. Excessive privileges, weak passwords and privilege abuse are very common when this is not done correctly. Making sure the right users get the right privileges, and using a third-party authentication server such as LDAP or Kerberos instead of simple passwords, can significantly increase the security rating of your database infrastructure.

Using GDB To Trace Into a Parallel Worker Spawned By Postmaster During a Large Query

1. Introduction

I am working on a new PostgreSQL feature that redefines the way a tuple’s visibility status is determined. The feature was working very nicely until I started running a large SELECT query, which triggers PostgreSQL to spawn multiple parallel workers to process the request. When this happens, the feature I am working on starts to yield incorrect results: a good portion of the data tuples returned are missing because they are considered invisible, while some portion remains visible. It immediately came to my attention that the new feature does not work in parallel worker mode, and somehow I needed to find a way to debug a spawned parallel worker to examine how it computes visibility and what is missing inside.

In this blog, I would like to share with you how I use GDB to debug and trace into a new parallel worker spawned by the postmaster in order to fix the visibility issue.

2. GDB Basics

I previously wrote another blog that shows how to use GDB to trace and debug PostgreSQL issues and shares some of the most common commands I use every day to resolve software issues. If you are new to GDB, I suggest giving that blog a read here.

3. How and When does PG Spawn A New Parallel Worker

When you use psql to connect to a PostgreSQL database, the server spawns a new backend process to serve this connecting client. Most of the queries you issue, including SELECT, UPDATE, INSERT, etc., are processed by this backend process. By default, if your SELECT query requires a sequential scan over more than 8MB of data, the backend will try to use parallel workers to help speed up the processing. This 8MB threshold can be configured with the min_parallel_table_scan_size parameter in postgresql.conf. There is another configuration parameter, max_parallel_workers, that controls the maximum number of parallel workers allowed to be spawned; the default is 8.

Technically, I could avoid my visibility issues simply by either setting min_parallel_table_scan_size to a huge number or setting max_parallel_workers to 0. But this is really not my style; I would like to keep all the goodies that PG provides while still solving the problem.

To spawn a parallel worker, the psql backend initializes a parallel worker context in the global process table, along with a message queue based on shared memory for communication with the backend. Then it sends a signal to the postmaster to notify it that the global process table has been updated.

When the postmaster receives the signal, it reads the global process table and finds that it needs to spawn a new parallel worker. It proceeds to fork a new parallel worker according to the context information supplied; this information determines the entry point for the parallel worker and what it should do once spawned. During processing, the parallel worker and the psql backend use the message queue to communicate tuples back and forth, and finally the psql backend gathers all the data tuples together and produces the final result for the user.

4. Can We Use GDB to Attach to This Parallel Worker’s PID When Spawned?

Technically yes, but the lifetime of this parallel worker may be very short; by the time you see its PID in the ps -ef output, the worker may have already done its job and exited. That means it would be too late for me to start GDB and attach to its PID.

Instead, the technique I am going to show you today will trace the parallel worker from the moment it starts.

5. Tracing the Parallel Worker

I will be using this instance of PostgreSQL server (version 12.5) as an example where PID 11976 is the psql backend process serving the psql client.

img

Pre-Condition:

Connect psql to the PostgreSQL server, create an example table and insert about 2.5 million rows of data. This will definitely trigger parallel workers.

$ psql -d postgres -U postgres -p 6660
psql (12.5)
Type "help" for help.

postgres=# create table test(a int, b int);
CREATE TABLE
postgres=# insert into test values(generate_series(1,500000),1);
INSERT 0 500000
postgres=# insert into test values(generate_series(1,500000),1);
INSERT 0 500000
postgres=# insert into test values(generate_series(1,500000),1);
INSERT 0 500000
postgres=# insert into test values(generate_series(1,500000),1);
INSERT 0 500000
postgres=# insert into test values(generate_series(1,500000),1);
INSERT 0 500000
postgres=#

Step 1: Attach GDB to the psql backend having PID = 11976 and Set a Break Point

I am setting a breakpoint at the function RegisterDynamicBackgroundWorker. This is called when a parallel worker is required to complete the query. Setting a breakpoint gives us more control over when to proceed with spawning the parallel worker.

gdb postgres
(gdb) attach 11976
(gdb) b RegisterDynamicBackgroundWorker

Step 2: Attach GDB to the Postmaster having PID = 11959 and Set 2 Break Points

We use a second GDB session to attach to the postmaster and set 2 breakpoints there. fork_process is the function the postmaster calls right before it actually spawns a new parallel worker using the system fork() call. ParallelWorkerMain is the main function the parallel worker runs after it has been spawned.

gdb postgres
(gdb) attach 11959
(gdb) b fork_process
(gdb) b ParallelWorkerMain

Step 3: Execute a Large SELECT Query On psql To Trigger the Break Points

postgres=# select count(*) from test;

The RegisterDynamicBackgroundWorker breakpoint will be hit in the first GDB session, the one attached to PID = 11976.

Use the continue or c GDB command to continue to spawn the worker

Breakpoint 1, RegisterDynamicBackgroundWorker (worker=0x7ffd867f3c80, handle=0x55a009b77388) at bgworker.c:1002
1002 bool success = false;
(gdb) c
Continuing.

As you continue the first GDB session, the second GDB session will pause due to the receipt of a SIGUSR1 signal. This signal tells the postmaster to reload the global process table and then spawn a parallel worker. Using the continue command will hit the first breakpoint at fork_process.

Program received signal SIGUSR1, User defined signal 1.
0x00007f301b97d0f7 in __GI___select (nfds=5, readfds=0x7ffd867f47d0, writefds=0x0, exceptfds=0x0, timeout=0x7ffd867f4740)
at ../sysdeps/unix/sysv/linux/select.c:41
41 in ../sysdeps/unix/sysv/linux/select.c
(gdb) c
Continuing.

Breakpoint 1, fork_process () at fork_process.c:47
47 fflush(stdout);

(gdb)

Step 4: Tell Postmaster GDB To Follow Child Process Instead Of Parent

At this point, the postmaster GDB session is waiting at the fork_process call, which happens right before a parallel worker is spawned. This is a good time to tell GDB to follow the child process instead of staying with the parent when the process calls fork(). The reason we want to set this as late as this moment is that the postmaster occasionally spawns other processes such as the walsender and walreceiver. Setting GDB to follow the child process too early may cause it to follow another process that we are not interested in.

(gdb) set follow-fork-mode child

You may use the continue command after setting it to follow the child. GDB will then immediately switch to the new child process, with PID = 12198 below, and hit our second breakpoint at ParallelWorkerMain. So now GDB is debugging the parallel worker process instead of the original postmaster.

(gdb) c
Continuing.
[New process 12198]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Switching to Thread 0x7f301ca79740 (LWP 12198)]

Thread 2.1 "postgres" hit Breakpoint 2, ParallelWorkerMain (main_arg=1544458234) at parallel.c:1207
1207 {

Step 5: Continue To Debug The Parallel Process

Using the ps -ef | grep postgres command, we can see a new parallel worker being spawned having PID = 12198

img

At this point, you are free to explore the process flow of the parallel worker. In my case, I am debugging the visibility issues, so I will set additional breakpoints at HeapTupleSatisfiesMVCC and TransactionIdIsCurrentTransactionId. In your case, you may be debugging some other functionality.

Being able to debug a parallel worker with GDB allowed me to see the problems I was having and to fix them quickly.

If you are having trouble tracing into parallel workers spawned by PostgreSQL at run time, I hope this blog will be helpful to you.

How PostgreSQL Handles Subtransaction Visibility in a Streaming Replication Setup

1. Introduction

As an experienced PostgreSQL user, you may have a lot of experience setting up streaming replication in your database clusters to make multiple backups of your data. But have you ever wondered how the standby is able to correctly determine whether a tuple sent from the primary should be visible to the user or not? In the case where a transaction contains multiple subtransactions, how does the standby determine visibility? You might say… well, it is PostgreSQL, so it will just work… This is true. But if you are curious and interested in knowing how PostgreSQL does certain things internally, then this blog may be interesting for you, as we will talk about normal transactions and subtransactions and how both affect the standby’s ability to determine tuple visibility.

2. How is Tuple Visibility Determined?

PostgreSQL determines a tuple’s visibility based on the concept of the transaction ID (a.k.a. txid), which is normally stored in a tuple’s header as t_xmin or t_xmax, where t_xmin holds the txid of the transaction that inserted the tuple and t_xmax holds the txid of the transaction that deleted or updated it. If a tuple has not been deleted, its t_xmax is 0. There is also another mechanism called the commit log (a.k.a. clog), which records the status of transaction IDs. When an inserted tuple has a valid t_xmin and an invalid t_xmax, PostgreSQL has to consult the clog first to determine whether the txid stored in t_xmin is committed and therefore visible; the result is then stored in the tuple’s hint bits so that it does not always have to consult the clog, which could be resource intensive. If a tuple has a valid t_xmax, then there is no need to consult the clog, as the tuple must be invisible.

These are just the basic ideas of how visibility is determined, but they are enough for the purposes of this blog. For more information about visibility, you can refer to this resource.

3. What is a Subtransaction?

As the name suggests, a subtransaction is a smaller transaction that exists within a regular transaction, and it can be created using the SAVEPOINT keyword after your BEGIN statement. Normally, once you have issued a BEGIN statement, all the tuples created by the subsequent DML statements such as INSERT or UPDATE have a txid associated with them, and it is the same for all of them.

When you issue a SAVEPOINT statement within this transaction, all the tuples created from the subsequent DML statements will have a different txid that is normally larger than the main transaction’s txid. When you issue the SAVEPOINT statement again, the txid changes again.

The advantage of a subtransaction is that, in a very large transaction, an error will not abort the entire transaction; it allows you to roll back to a previously created SAVEPOINT and continue executing the main transaction.

The biggest difference between a subtransaction and a regular transaction is that there is no concept of commit associated with a subtransaction, so a subtransaction commit is not recorded in the WAL. However, if a subtransaction is aborted, the abort is recorded in the WAL and subsequently sent to the standby.

4. How Does a Standby Determine Visibility Information In a Regular Transaction?

Here’s an example of a regular transaction without any subtransactions:

BEGIN;
INSERT INTO test VALUES(111,111111); => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,222222); => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,333333); => a heap_redo WAL record having t_xmin = 500 is sent to standby
COMMIT; => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby

In the general case, the above queries send at least 4 WAL records to the standby during streaming replication. The first 3 contain the actual tuple data with t_xmin set to the current transaction ID, which is 500 in this example. In other words, before the COMMIT is issued, the standby should have already received 3 WAL records from the primary and completed the redo, meaning that the 3 tuples already exist in the standby’s memory buffers. However, they are not visible yet, because transaction ID 500 is not yet known to be committed on the standby. It is not until the COMMIT is issued on the primary, which generates an XLOG_XACT_COMMIT WAL record that is sent to the standby, that the standby learns that transaction ID 500 is now committed and valid. When the standby receives the XLOG_XACT_COMMIT WAL record, it eventually updates its clog so that subsequent SELECT queries can do their visibility checks against it.

5. How Does a Standby Process Visibility Information In a Regular Transaction Containing Subtransactions?

Now, let’s use the same example but add some SAVEPOINTs to create subtransactions:

BEGIN;
INSERT INTO test VALUES(111,111111); => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,222222); => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,333333); => a heap_redo WAL record having t_xmin = 500 is sent to standby
SAVEPOINT A;
INSERT INTO test VALUES(111,444444); => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,555555); => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,666666); => a heap_redo WAL record having t_xmin = 501 is sent to standby
SAVEPOINT B;
INSERT INTO test VALUES(111,777777); => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,888888); => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,999999); => a heap_redo WAL record having t_xmin = 502 is sent to standby
COMMIT; => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby

As you can see, after the creation of a SAVEPOINT, the tuples’ t_xmin value changes and is no longer equal to the main transaction ID, which is 500 in the example. Before the COMMIT is issued, the standby should have received 9 WAL records: 3 having t_xmin = 500, 3 having t_xmin = 501 and 3 having t_xmin = 502. None of them is visible yet on the standby, because these transaction IDs are not yet known to be committed there. When the primary issues the COMMIT statement, it generates an XLOG_XACT_COMMIT WAL record and sends it to the standby containing only the main transaction ID 500, without 501 and 502.

After the commit, the standby is able to see all 9 tuples that were replicated from the primary. Now, you may be asking… why? How does the standby know that txids 501 and 502 are also visible even though the primary never sent an XLOG_XACT_COMMIT WAL record to tell the standby that 501 and 502 are also committed? Keep reading to find out.

Using a similar example, if we roll back to one of the SAVEPOINTs, the primary will send an XLOG_XACT_ABORT WAL record to notify the standby that a transaction ID has become invalid.

BEGIN;
INSERT INTO test VALUES(111,111111); => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,222222); => a heap_redo WAL record having t_xmin = 500 is sent to standby
INSERT INTO test VALUES(111,333333); => a heap_redo WAL record having t_xmin = 500 is sent to standby
SAVEPOINT A;
INSERT INTO test VALUES(111,444444); => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,555555); => a heap_redo WAL record having t_xmin = 501 is sent to standby
INSERT INTO test VALUES(111,666666); => a heap_redo WAL record having t_xmin = 501 is sent to standby
SAVEPOINT B;
INSERT INTO test VALUES(111,777777); => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,888888); => a heap_redo WAL record having t_xmin = 502 is sent to standby
INSERT INTO test VALUES(111,999999); => a heap_redo WAL record having t_xmin = 502 is sent to standby
ROLLBACK TO SAVEPOINT B; => a XLOG_XACT_ABORT WAL record having t_xmin = 502 is sent to standby
COMMIT; => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby

In this example, when we do a ROLLBACK TO SAVEPOINT B, an XLOG_XACT_ABORT WAL record is sent to the standby to mark transaction ID 502 as invalid. So when the primary commits, records having txid = 502 will not be visible, as the primary has already notified the standby.

6. The Secret of Standby’s KnownAssignedTransactionIds

In addition to handling the XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records to determine whether a txid is valid or not, the standby actually maintains a global txid list called KnownAssignedTransactionIds. Whenever the standby receives a heap_redo WAL record, it saves that record’s xmin into KnownAssignedTransactionIds. Using the same example as above:

BEGIN;
INSERT INTO test VALUES(111,111111); => a heap_redo WAL record having t_xmin = 500 is sent to standby, and standby puts 500 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,222222); => a heap_redo WAL record having t_xmin = 500 is sent to standby, and standby puts 500 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,333333); => a heap_redo WAL record having t_xmin = 500 is sent to standby, and standby puts 500 in KnownAssignedTransactionIds
SAVEPOINT A;
INSERT INTO test VALUES(111,444444); => a heap_redo WAL record having t_xmin = 501 is sent to standby, and standby puts 501 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,555555); => a heap_redo WAL record having t_xmin = 501 is sent to standby, and standby puts 501 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,666666); => a heap_redo WAL record having t_xmin = 501 is sent to standby, and standby puts 501 in KnownAssignedTransactionIds
SAVEPOINT B;
INSERT INTO test VALUES(111,777777); => a heap_redo WAL record having t_xmin = 502 is sent to standby, and standby puts 502 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,888888); => a heap_redo WAL record having t_xmin = 502 is sent to standby, and standby puts 502 in KnownAssignedTransactionIds
INSERT INTO test VALUES(111,999999); => a heap_redo WAL record having t_xmin = 502 is sent to standby, and standby puts 502 in KnownAssignedTransactionIds
COMMIT; => a XLOG_XACT_COMMIT WAL record having t_xmin = 500 is sent to standby and standby submits 500, 501 and 502 to clog as valid.

Before the COMMIT is issued on the primary, the standby already has txids 500, 501 and 502 present in KnownAssignedTransactionIds. When the standby receives the XLOG_XACT_COMMIT WAL record, it actually submits txid 500 from the WAL record, plus 501 and 502 from KnownAssignedTransactionIds, to the clog, so that when a subsequent SELECT query consults the clog, it will report txids 500, 501 and 502 as committed and therefore visible. If the primary sends an XLOG_XACT_ABORT as a result of rolling back to a previous SAVEPOINT, that invalid txid is removed from KnownAssignedTransactionIds, so when the transaction commits, the invalid txid is not submitted to the clog.

7. Summary

So this is how PostgreSQL determines a tuple’s visibility within subtransactions in a streaming replication setup. Of course, there are more details than what is discussed here, but I hope today’s discussion helps you get a general understanding of how PostgreSQL determines visibility, and maybe it will help you in your current work on PostgreSQL.

How PostgreSQL Inserts A New Record With The help of Table Access Method API and Buffer Manager

1. Introduction

This blog gives a high-level description of the mechanism PostgreSQL uses to execute an INSERT query. The process involves many steps before the data is put in the right place: it normally involves several catalog cache lookups to determine whether the destination table exists, several checks for constraint violations, etc. This blog will mainly focus on the part where processing is handed over to PostgreSQL’s table access method API and its interaction with the buffer manager and WAL routines. This is also the core area where an INSERT query is actually executed. If you are a developer looking to understand how PostgreSQL works internally, this blog may be helpful to you.

2. Table Access Method APIs in PostgreSQL

The pluggable table access method API has been available since PostgreSQL 12, and it allows a developer to redefine how PostgreSQL stores and retrieves table data. This API contains a total of 42 routines that need to be implemented in order to complete the implementation, and honestly it is no easy task to understand and implement all of them. The API structure is defined in tableam.h under the name typedef struct TableAmRoutine.

Today I will describe the routines related to INSERT.

3. INSERT Query Overall Call Flow

A few of the 42 routines are called by the executor just to complete an INSERT query. This section describes these routines in the order in which they are called.

3.1 slot_callbacks

const TupleTableSlotOps *(*slot_callbacks) (Relation rel);

The executor needs to find out which set of tuple table slot (TTS) callback operations this table access method is compatible with. TTS is a set of routines that ensures the tuple storage is compatible between the executor and your access method. The executor will use the TTS callbacks to translate your tuple structure into the TupleTableSlot format that the executor understands. The default heap access method uses TTSOpsBufferHeapTuple, defined in execTuples.c, to handle this operation.
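For reference, the built-in heap access method’s implementation of this callback is essentially a one-liner (as found in heapam_handler.c, shown here with explicit includes for illustration):

#include "postgres.h"
#include "access/tableam.h"
#include "executor/tuptable.h"

static const TupleTableSlotOps *
heapam_slot_callbacks(Relation relation)
{
    /* heap tuples live in shared buffers, so use the buffer-heap slot ops */
    return &TTSOpsBufferHeapTuple;
}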

3.2 heap_insert

void
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)

heap_insert is the entry point for performing the actual data insertion, and it goes through several other routines provided by the buffer manager and the WAL module in order to complete the insertion.
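For context, the executor does not call heap_insert directly; it goes through the table access method’s tuple_insert callback. The built-in heap implementation (heapam_tuple_insert in heapam_handler.c) looks roughly like this simplified sketch, which materializes the slot into a HeapTuple, calls heap_insert, and copies the resulting tuple location back into the slot (see the PostgreSQL source for the authoritative version):

#include "postgres.h"
#include "access/heapam.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

static void
heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
                    int options, BulkInsertState bistate)
{
    bool        shouldFree = true;
    HeapTuple   tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);

    /* stamp the tuple with the table OID */
    slot->tts_tableOid = RelationGetRelid(relation);
    tuple->t_tableOid = slot->tts_tableOid;

    /* perform the insertion and copy the resulting item pointer (CTID) */
    heap_insert(relation, tuple, cid, options, bistate);
    ItemPointerCopy(&tuple->t_self, &slot->tts_tid);

    if (shouldFree)
        pfree(tuple);
}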

heap_prepare_insert
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);

This is a subroutine of heap_insert that initializes the tuple header contents, such as the relation OID, the infomask bits, and the xmin and xmax values. It also determines whether the tuple is so big that TOAST is required to complete the insertion. These terms and parameters are very technical in PostgreSQL; if you are not sure what exactly they are, you can refer to the resources here and here.

RelationGetBufferForTuple
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
Buffer *vmbuffer, Buffer *vmbuffer_other);

This is an entry function for accessing buffer manager resources, and all it does is ask the buffer manager to return a buffer ID that can be used to store the target tuple. This may sound very straightforward, but there is quite a lot of processing on the buffer manager side to properly determine the desired buffer location.

First, it does a quick size check: if the input tuple is larger than a buffer block, it returns immediately with an error, as TOAST has to be used in that case. Then it tries to put the tuple on the page the system last inserted into, to see if it fits there. If not, it uses the free space map to find another page that can fit the tuple. If that does not work out, the buffer manager allocates a new data page (also referred to as extending the relation) to hold the new tuple. As soon as the desired buffer page is determined, the buffer manager caches the page reference in the relation structure, so that the next time the same relation visits the buffer manager, it immediately knows the last inserted block.

RelationPutHeapTuple
extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                                 HeapTuple tuple, bool token);

Once we have identified the buffer that will store the tuple, the insert routine calls RelationPutHeapTuple to actually put the tuple at the specified buffer location. This routine again asks the buffer manager for a pointer to the data page, using the buffer ID we obtained from RelationGetBufferForTuple, and then adds the tuple data using the PageAddItem() routine. Internally, the buffer manager maintains the relationship between the buffer ID, the buffer descriptor, and the actual pointer to the data page to help us correctly identify and write to a data page. After a successful write, the routine saves the CTID of the inserted tuple. This ID is the location of the tuple and consists of the data page number and an offset. For more information about how the buffer manager works, you can refer to the resource here.
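A simplified sketch of this step (my_put_heap_tuple is a made-up name; the real routine also handles speculative-insertion tokens):

static void
my_put_heap_tuple(Relation relation, Buffer buffer, HeapTuple tuple)
{
    Page         page = BufferGetPage(buffer);
    OffsetNumber offnum;

    /* Append the tuple to the page behind this buffer ID */
    offnum = PageAddItem(page, (Item) tuple->t_data, tuple->t_len,
                         InvalidOffsetNumber, false, true);
    if (offnum == InvalidOffsetNumber)
        elog(PANIC, "failed to add tuple to page");

    /* The CTID is simply (block number, line pointer offset) */
    ItemPointerSet(&tuple->t_self, BufferGetBlockNumber(buffer), offnum);
}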

Mark buffer dirty
extern void MarkBufferDirty(Buffer buffer);

At this point the tuple data is stored in the buffer manager, referenced by a particular data page plus an offset, but it has not been flushed to disk yet. We almost always have to call the MarkBufferDirty function to signal the buffer manager that there are tuples on the page that have not been flushed to disk, so that the next checkpoint will ensure the new tuples are written out.

[Insert WAL Record]

Last but not least, after going through all the hassle of finding a buffer location for our tuple and marking the buffer dirty, it is time for the heap_insert routine to populate a WAL record. This part is not the focus of this blog, so I will skip the high-level details of WAL writing.
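To tie the last three steps together, here is a simplified sketch of the sequence inside heap_insert once a buffer has been chosen. The page modification, MarkBufferDirty and the WAL record are all produced inside a critical section; the real code also registers an xl_heap_insert struct, handles the visibility map and more, so treat this only as an outline:

START_CRIT_SECTION();

RelationPutHeapTuple(relation, buffer, heaptup, false);  /* write the tuple */
MarkBufferDirty(buffer);                                 /* page must be flushed later */

if (RelationNeedsWAL(relation))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    /* register the modified buffer and the tuple data with the WAL record */
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
    XLogRegisterBufData(0, (char *) heaptup->t_data, heaptup->t_len);

    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT);
    PageSetLSN(BufferGetPage(buffer), recptr);
}

END_CRIT_SECTION();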

3.3 End of the insertion

At this point the insertion of the new tuple has finished and the proper WAL record has been written. The routine once again takes the CTID value derived during the data insertion and saves it into the TTS structure so that the executor also gets a copy of the tuple’s location. Then it cleans up the local resources before returning.

4. Summary

What we have discussed here is the basic call flow of a simple INSERT query. If we were to visualize the process, it should look something like this:
[Figure: tuple insert call flow]

How PostgreSQL Executes Sequential Scans with the Help of Table Access Methods APIs

1. Introduction

There are many approaches PostgreSQL can take to retrieve data and return it to the user. Depending on the user’s input query, the planner module is responsible for selecting the optimal approach to retrieve the requested data. Sequential scan is one of these approaches, and it is mostly selected when the user requests a large volume of data (for example, “SELECT * FROM tablename;”) or when a table has no index declared. Sequential scans are mostly handled by the table access method API within PostgreSQL, and the heap access method is the default one PostgreSQL uses today. In this short post, I will show you how a sequential scan is done in the table access method API.

2. Table Access Method APIs in PostgreSQL

The pluggable table access method API has been available since PostgreSQL 12, and it allows a developer to redefine how PostgreSQL stores and retrieves table data. The API contains a total of 42 routines that need to be implemented to complete a table access method, and honestly it is no easy task to understand and implement them all. The API structure is defined in tableam.h under the name typedef struct TableAmRoutine.

Today I will describe the routines related to sequential scan, and I hope this helps if you are looking to create your own table access method.

3. Sequential Scan Overall Call Flow

A few of the 42 routines will be called by the executor just to complete a sequential scan request. This section describes these routines in the order they are called.

3.1 relation_size

uint64		(*relation_size) (Relation rel, ForkNumber forkNumber);

relation_size is the first routine to be called, and it is relatively simple. The caller expects the routine to return the total size of the relation described by rel and forkNumber. The default heap access method simply invokes the storage manager (smgr) to find the number of data blocks this particular relation physically occupies on disk and multiplies that number by the size of each block, BLCKSZ (8 kB by default). If you are not sure about the relationship between a relation and its fork numbers, you could refer to this blog to get more information.

The size returned by this routine basically sets the boundary of our sequential scan.
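A minimal sketch of such a callback in the style of the heap access method (my_relation_size is a made-up name; RelationGetSmgr() is assumed to be available, as in recent PostgreSQL versions, and the case where all forks are requested is not handled here):

static uint64
my_relation_size(Relation rel, ForkNumber forkNumber)
{
    /* number of blocks on disk, times the block size */
    BlockNumber nblocks = smgrnblocks(RelationGetSmgr(rel), forkNumber);

    return (uint64) nblocks * (uint64) BLCKSZ;
}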

3.2 slot_callbacks

const TupleTableSlotOps *(*slot_callbacks) (Relation rel);

Next, the executor needs to find out which set of tuple table slot (TTS) callback operations this table access method is compatible with. TTS is a set of routines that ensures the tuple storage is compatible between the executor and your access method. The executor will invoke the TTS callbacks to translate your tuple structure into the TupleTableSlot format that the executor understands. The default heap access method uses TTSOpsBufferHeapTuple, defined in execTuples.c, to handle this operation.

3.3 scan_begin

TableScanDesc (*scan_begin) (Relation rel,
                             Snapshot snapshot,
                             int nkeys, struct ScanKeyData *key,
                             ParallelTableScanDesc pscan,
                             uint32 flags);

Now the scan can officially begin. This is the sequential scan’s initialization routine, in which it allocates a new scan descriptor using the parameters passed in by the executor. The purpose of the scan descriptor structure is to keep track of the sequential scan while it is being executed: for example, where the scan should begin, which block was scanned last, which block the scan should resume from, and how many blocks have been scanned so far. The scan descriptor is destroyed once the sequential scan has completed.

The executor expects the routine to return a fully allocated and initialized pointer to a TableScanDesc struct.
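Here is a minimal sketch of what such an initialization routine could look like for a hypothetical access method. MyScanDescData and my_scan_begin are made-up names; the important point is that the AM-independent TableScanDescData must come first so the pointer can be cast back and forth:

/* hypothetical scan descriptor for "my_am" */
typedef struct MyScanDescData
{
    TableScanDescData rs_base;   /* AM-independent part, must be first */
    BlockNumber nblocks;         /* total number of blocks to scan */
    BlockNumber cur_block;       /* block we are currently scanning */
    OffsetNumber cur_offset;     /* next line pointer to return on the page */
} MyScanDescData;

static TableScanDesc
my_scan_begin(Relation rel, Snapshot snapshot, int nkeys,
              struct ScanKeyData *key, ParallelTableScanDesc pscan,
              uint32 flags)
{
    MyScanDescData *scan = palloc0(sizeof(MyScanDescData));

    /* remember the parameters the executor passed in */
    scan->rs_base.rs_rd = rel;
    scan->rs_base.rs_snapshot = snapshot;
    scan->rs_base.rs_nkeys = nkeys;
    scan->rs_base.rs_key = key;
    scan->rs_base.rs_flags = flags;
    scan->rs_base.rs_parallel = pscan;

    /* set up the scan boundary and starting position */
    scan->nblocks = RelationGetNumberOfBlocks(rel);
    scan->cur_block = 0;
    scan->cur_offset = FirstOffsetNumber;

    return (TableScanDesc) scan;
}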

3.4 scan_getnextslot

bool		(*scan_getnextslot) (TableScanDesc scan,
					 ScanDirection direction,
					 TupleTableSlot *slot);

This is the main routine of the sequential scan: the caller expects it to fetch one tuple from the buffer manager, convert it to the TTS format that the executor understands, and save it in the input pointer called slot. Each call to this routine results in one tuple being returned. If a table contains 1000 tuples, this function will be called 1000 times. The boolean return code indicates whether the routine has more tuples to return; as soon as false is returned, it signals the executor that we have exhausted all the tuples and it should stop calling this function.

In the normal sequential scan case, this routine works in per-page mode. This means it reads one full block from the buffer manager and scans it to record all the tuple addresses and offsets in the scan descriptor, so subsequent calls to the same function do not have to load the full page from the buffer manager every time; the next block is only loaded once all the tuples on the current block have been scanned and returned.

As you can see, the scan descriptor plays an important role here, as most of the control information is saved there and is regularly updated whenever a call to scan_getnextslot is made.
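To see how these callbacks are driven from the caller’s side, here is a simplified sketch using the tableam wrapper functions (this mirrors what the executor’s sequential scan node does, greatly condensed; rel and snapshot are assumed to be an already-opened relation and an active snapshot):

TableScanDesc scan;
TupleTableSlot *slot;

slot = table_slot_create(rel, NULL);            /* slot using the AM's TTS ops */
scan = table_beginscan(rel, snapshot, 0, NULL); /* ends up in scan_begin */

while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
    /* one tuple per call, already converted into the executor's slot format;
     * hand the slot to the next executor node or apply quals here */
}

table_endscan(scan);                            /* ends up in scan_end */
ExecDropSingleTupleTableSlot(slot);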

3.5 scan_end

void		(*scan_end) (TableScanDesc scan);

This is the last routine to be called to basically clean up the table scan descriptor, which was used heavily during the sequential scan. At this point, the executor should already have all the tuple information from the sequential scan methods.

4. Prepare the Data to Return

Now the executor is finished with the table access method and has access to all the tuples of the relation. It then needs to go through another round of filtering to determine which of these tuples satisfy the condition set by the user (for example, when the user gives a WHERE clause to limit the scan results). This is done in another loop in execScan.c, which performs ExecQual on each TTS. Finally, the end result is sent to the user.
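A heavily simplified sketch in the spirit of the ExecScan() loop (projection handling and EvalPlanQual rechecks are omitted; accessMtd stands for the per-node fetch callback, such as SeqNext, which ultimately calls scan_getnextslot):

for (;;)
{
    /* fetch the next tuple from the access method into a slot */
    TupleTableSlot *slot = accessMtd(node);

    if (TupIsNull(slot))
        return slot;                    /* scan exhausted */

    econtext->ecxt_scantuple = slot;

    /* keep the tuple only if it satisfies the WHERE-clause expressions */
    if (qual == NULL || ExecQual(qual, econtext))
        return slot;

    ResetExprContext(econtext);         /* discard it and try the next tuple */
}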

5. Summary

What we have discussed here is the basic call flow of a simple sequential scan. If we were to visualize the process, it should look something like this:
[Figure: sequential scan call flow]
