Building a Highly-Available PostgreSQL System with Tessell

Bakul Banthia, August 4, 2024

Ensuring high availability (HA) for databases is crucial for business continuity and performance. PostgreSQL, one of the most popular open-source relational databases, offers robust HA solutions, and Tessell PostgreSQL is a standout example. This guide walks through how the Tessell PostgreSQL HA system is constructed, how it behaves during failover and switchover, and how it handles maintenance windows such as patching.

Setup

To illustrate the setup of a Tessell PostgreSQL HA system, we’ll consider a three-node configuration designed for seamless operation and minimal downtime. Each node plays a specific role in enhancing the system’s resilience and efficiency, and the setup ensures that the database remains operational even in the event of a node failure, providing a robust solution for mission-critical applications.

Configuration

Node 1: The primary database handles all write operations. It is the main point of interaction for any application requiring data modifications. By centralizing write operations, Node 1 ensures data consistency and integrity.

Node 2: This serves as a synchronous standby database, accepting read-only operations. It mirrors the primary database in real time, immediately replicating any changes. This setup allows for load balancing of read operations and provides a backup that can be quickly promoted if the primary node fails.

Node 3: This node functions as an observer or arbiter node within the underlying etcd cluster. While it does not directly participate in the replication process, its role is crucial for maintaining quorum and coordinating the cluster’s state, adding an extra layer of reliability to the HA system.
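
A quick way to confirm these roles from PostgreSQL itself is sketched below. It assumes standard streaming replication as managed by Patroni; the rejected write shown for the standby is purely illustrative.

    -- Run on each node to check its role: returns false on the primary
    -- (Node 1) and true on the synchronous standby (Node 2).
    SELECT pg_is_in_recovery();

    -- On the standby, any write is rejected because the node is read-only:
    --   INSERT INTO t VALUES (1);
    --   ERROR:  cannot execute INSERT in a read-only transaction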

Provisioning

Provisioning involves several steps to establish a functional and reliable HA cluster. It starts with launching the three nodes and configuring them to work together seamlessly.

  • Launch Nodes: Start Node 1, Node 2, and Node 3. This step ensures that each node is correctly initialized and capable of communicating with the others.
  • Form Quorum: Establish a quorum for the three-node etcd cluster. Quorum is essential for maintaining consensus within the cluster, allowing it to make coordinated decisions even if one of the nodes fails.
  • Configure Patroni:
    • On Node 1, set up Patroni as the primary node. Patroni is a vital component that automates failover, ensuring the primary node can be seamlessly replaced if it becomes unavailable.
    • On Node 2, configure Patroni as a synchronous standby. This setup ensures real-time replication and readiness to take over as the primary if needed.
  • Patroni API Callbacks: Set up necessary API callbacks for automated management. These callbacks help manage failover and switchover processes, keeping the system responsive and reducing manual intervention.
  • Launch HA Cluster: Start the Patroni HA cluster to ensure high availability. This final step brings the entire system online, ready to handle database operations with built-in redundancy and failover capabilities; a quick verification sketch follows this list.
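
As a sanity check once the cluster is online, the primary should report the standby as synchronous. The following is a minimal sketch; the exact values of these settings are managed by Patroni and depend on the cluster configuration.

    -- On the primary (Node 1): Patroni populates this setting when
    -- synchronous mode is enabled, so it should name the standby.
    SHOW synchronous_standby_names;

    -- The standby should appear with sync_state = 'sync'.
    SELECT application_name, state, sync_state
    FROM pg_stat_replication;

If the standby is missing from the view, or its sync_state reports 'async', replication is not yet configured as expected.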

Topology

The topology of the Tessell PostgreSQL HA system can be visualized in the diagram below, depicting the interactions and roles of each node within the cluster. The diagram clearly shows how data flows between the nodes, illustrating the primary node’s connection to the standby and observer nodes. Understanding this topology is crucial for managing and troubleshooting the HA system, as it highlights the paths data takes during normal operations and failovers.

Failover Scenarios

Tessell PostgreSQL is engineered to handle failover scenarios efficiently, minimizing application impact. The system’s design focuses on maintaining data integrity and availability, even in the face of hardware or software failures. Here are some common failover scenarios:

Scenario 1: Node 1 Failure

If Node 1, the primary database, fails, the Patroni HA system detects the failure and promotes Node 2 to become the new primary database. This automated process ensures minimal downtime and, thanks to synchronous replication, zero data loss: every commit on the primary is acknowledged by the standby before it completes, so the standby is always ready to take over immediately. The application continues running against Node 2, which handles write operations and maintains business continuity without significant interruption.
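
To make the zero-data-loss guarantee concrete, the sketch below shows what synchronous commit means on the primary; the orders table is a hypothetical example, and the exact setting value depends on the cluster configuration.

    -- With a synchronous standby attached, COMMIT on the primary does not
    -- return until the standby confirms the WAL write, so every committed
    -- transaction already exists on Node 2 at the moment of promotion.
    SHOW synchronous_commit;   -- typically 'on' (or stricter, e.g. 'remote_apply')

    BEGIN;
    INSERT INTO orders (id, amount) VALUES (42, 99.50);  -- hypothetical table
    COMMIT;  -- blocks until the synchronous standby acknowledges the WAL record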

Scenario 2: Node 2 Failure

In the event of a Node 2 failure, the Patroni HA system relaxes synchronous replication so the primary is not blocked waiting for the unavailable standby. When Node 2 becomes available again, Patroni automatically reconfigures it by synchronizing the delta from the primary node: only the data changes that occurred during the downtime are copied, making the re-sync quick and efficient. Once Node 2 is back in sync, it resumes its role as the synchronous standby. This automatic reconfiguration keeps the HA system resilient, recovering from failures without manual intervention.
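
One way to watch the returning standby catch up is to query replication lag from the primary; this sketch assumes PostgreSQL 10 or newer, where the lag columns are available.

    -- On the primary: the lag columns show how far the re-attached standby
    -- is behind. Once they drop to zero and sync_state returns to 'sync',
    -- Node 2 has fully resumed its synchronous-standby role.
    SELECT application_name,
           state,
           sync_state,
           write_lag,
           flush_lag,
           replay_lag
    FROM pg_stat_replication;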

Scenario 3: Node 3 Failure

Node 3 serves as an observer and does not participate in replication, so its failure does not impact the primary and standby nodes. The underlying etcd cluster continues to function, since the two remaining members still form the quorum needed for decision-making. This design ensures that the system tolerates the loss of the observer node without affecting the operational capabilities of the primary and standby. Once Node 3 is restored, it rejoins the quorum and resumes its role in maintaining cluster state and coordination.

Patching

Maintenance, such as patching, is crucial for security and performance. Tessell PostgreSQL allows patching with minimal downtime through the following steps:

  • Parallel Shutdown: Shut down the primary and replica nodes at the same time. Taking both nodes offline together reduces the risk of inconsistencies between them.
  • Patch Application: Apply the patch to both the standby and primary nodes. Patching them in parallel keeps them in sync and avoids potential version conflicts.

Tessell ensures atomicity in patching operations. If the patch fails on the replica, the changes on the primary node are also reversed. Similarly, if the patch fails on the primary node, it reverts the patch on the replica, ensuring consistency and minimizing downtime. This approach to patching maintains system stability, allowing for necessary updates without compromising data integrity or availability.
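
A simple post-patch consistency check, sketched below, is to confirm that both nodes report the same PostgreSQL version; the exact version strings will of course vary by environment.

    -- Run on both Node 1 and Node 2 after the maintenance window;
    -- the reported versions should match.
    SHOW server_version;

    -- Or, for the full build string:
    SELECT version();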

Summary

The Tessell PostgreSQL HA system exemplifies a robust, high-availability solution for PostgreSQL databases. Careful provisioning, efficient failover mechanisms, and minimal downtime during maintenance ensure that applications remain resilient and performant. Following the guidelines outlined in this guide, you can set up and maintain a reliable HA system for your PostgreSQL databases, ensuring continuous operation and data integrity. This high level of availability and robustness is essential for modern applications that demand consistent performance and reliability.
