High-Availability Ansible Automation Platform with EDB PostgreSQL Active-Passive DR - Solution Guide

Overview

When Ansible Automation Platform becomes mission-critical infrastructure – orchestrating network changes, managing security compliance, or coordinating multi-cloud deployments – downtime is not an option. A database failure, datacenter outage, or unplanned maintenance window can halt automation across the enterprise, blocking change tickets, delaying deployments, and leaving teams unable to respond to incidents.

This guide demonstrates how to deploy Red Hat Ansible Automation Platform 2.6 with a multi-datacenter Active-Passive disaster recovery architecture using EDB Postgres Advanced Server and EDB Failover Manager (EFM). The result is a resilient automation platform capable of surviving datacenter failures with Recovery Time Objective (RTO) under 5 minutes and Recovery Point Objective (RPO) under 5 seconds – ensuring automation continuity for mission-critical operations.

Implementation approach: This solution leverages Red Hat’s AAP 2.6 Container Enterprise Topology deployed via a single unified installation across both datacenters, with DC2 AAP services intentionally stopped post-installation to maintain the Active-Passive configuration. The database layer and HAProxy connection routing are fully supported by EDB, while the AAP platform follows Red Hat’s tested enterprise topology.

Business value: Guaranteed automation availability for mission-critical workflows. Reduced risk of extended outages blocking change management, compliance enforcement, or incident response. Automated failover eliminates manual intervention during datacenter failures, reducing downtime from hours (manual DR procedures) to minutes (automated database promotion and AAP activation).

Technical value: Proven enterprise topology for production AAP deployments. Streaming replication with sub-5-second RPO protects against data loss. EFM-managed automated failover orchestrates database promotion and AAP service activation without operator intervention. Full implementation roadmap from infrastructure provisioning through testing and production cutover.


Background

Why Disaster Recovery Matters for Automation Platforms

Ansible Automation Platform has evolved from a convenient automation tool to mission-critical enterprise infrastructure. Organizations use AAP to orchestrate network changes across thousands of devices, enforce compliance policies on production servers, coordinate multi-cloud deployments, and automate incident response. When AAP is unavailable, these critical workflows stop.

A database failure in a single-datacenter AAP deployment creates cascading impact: running jobs fail mid-execution, scheduled workflows stop launching, and event-driven automation goes dark until the database is restored.

Traditional backup-and-restore approaches offer Recovery Time Objectives measured in hours, not minutes. Restoring a multi-database AAP deployment from backup requires coordinating database recovery, verifying data consistency across four separate databases (controller, hub, EDA, gateway), and validating service health before resuming automation. This is unacceptable for platforms supporting mission-critical workflows.

Active-Passive Multi-Datacenter DR Architecture

Active-Passive disaster recovery deploys a complete AAP stack in two datacenters:

  • DC1 (Active): the full AAP component stack runs and serves all user traffic; the PostgreSQL primary accepts all reads and writes
  • DC2 (Standby): an identical AAP stack is installed but stopped; PostgreSQL standbys are kept current through cross-datacenter streaming replication

When DC1 fails, EDB Failover Manager (EFM) automatically promotes the DC2 database to primary and triggers a post-promotion script that starts AAP services in DC2. The global load balancer detects DC2 health checks passing and redirects traffic. Total failover time: under 5 minutes.

This architecture extends Red Hat’s tested Container Enterprise Topology – an 8-VM, multi-component design for high-scale AAP deployments – into a multi-datacenter configuration. Each datacenter follows Red Hat’s validated single-datacenter model, with cross-datacenter database replication and automated failover orchestration providing disaster recovery capability.

EDB Postgres: Enterprise-Grade Database Platform

EDB Postgres Advanced Server extends PostgreSQL with the enterprise capabilities this architecture depends on – notably EDB Failover Manager for automated promotion and VIP management, and Barman integration for WAL archiving and point-in-time recovery.

EDB is a trusted PostgreSQL partner with deep integration into Red Hat’s ecosystem. EFM’s post-promotion script capability enables coordinated database-and-application failover, ensuring AAP services start only after the database is ready to accept connections.


Solution

Components

Ansible Automation Platform – the automation layer:

  • Platform Gateway, Automation Controller, Automation Hub, and Event-Driven Ansible, deployed as Podman containers per the AAP 2.6 Container Enterprise Topology
  • Redis colocated on the gateway, hub, and EDA nodes

EDB PostgreSQL – the database layer:

  • EDB Postgres Advanced Server 16 in three-node clusters per datacenter
  • EDB Failover Manager (EFM) 4.7 for health monitoring, VIP management, and automated promotion
  • Barman / barman-cloud for backups and WAL archiving

Infrastructure:

  • RHEL 9.4+ virtual machines in both datacenters
  • HAProxy database connection routers (one per datacenter)
  • Global load balancer (F5 BIG-IP, HAProxy, AWS Route53, or Azure Traffic Manager)
  • Site-to-site WAN connectivity and WAL archive storage (S3, Azure Blob, or NFS)

Who Benefits

IT Ops Engineer / SRE
  Challenge: AAP downtime blocks critical automation; manual DR procedures require coordinating database restore, service startup, and health validation across 16+ VMs
  What they gain: Automated failover orchestration via EFM; tested recovery procedures with concrete validation commands; complete operational runbook for daily health checks and emergency failover

Automation Architect
  Challenge: Uncertainty about how to design production-ready AAP for mission-critical use cases; existing single-datacenter deployments lack disaster recovery capability
  What they gain: Production-validated reference architecture based on Red Hat’s Container Enterprise Topology; detailed component specifications, network topology, and firewall rules; complete implementation guidance for this 26-VM enterprise design

IT Manager / Director
  Challenge: Business justification required for DR investment; inability to commit to SLA targets without proven recovery procedures
  What they gain: Measurable RTO/RPO targets (5 minutes / 5 seconds); infrastructure scale and resource planning guidance; implementation roadmap with clear phase gates from planning through production cutover

Prerequisites

Red Hat Ansible Automation Platform

Ansible Automation Platform 2.6+ – required for Container Enterprise Topology support and unified containerized installer.

Operating System and Runtime

  • Red Hat Enterprise Linux 9.4 or later on all nodes
  • Podman as the container runtime on all AAP component VMs

External Systems

System                          Required   Examples
EDB Postgres Advanced Server    Yes        EDB subscription or trial license
EDB Failover Manager (EFM)      Yes        Bundled with EDB Postgres Advanced Server
Global Load Balancer            Yes        F5 BIG-IP, HAProxy, AWS Route53, Azure Traffic Manager
WAL Archive Storage             Yes        S3, Azure Blob, NFS
Site-to-site connectivity       Yes        VPN or Direct Connect with < 100ms latency

Operational Impact: High during implementation phases; Medium for ongoing operations; High during failover/failback procedures.

Infrastructure Requirements

Total Resource Footprint:

  • 26 VMs total (13 per datacenter): 8 AAP component VMs, 1 HAProxy database router, 3 PostgreSQL nodes, and 1 Barman backup server per datacenter
  • 68 vCPU / 272GB RAM per datacenter (see Component Specifications for the per-VM breakdown)

Cost and Resource Notes

  • The passive datacenter doubles the infrastructure footprint: DC2 runs the full VM set with AAP services stopped, so plan compute, storage, and subscription costs for both sites
  • Budget WAN bandwidth (100 Mbps minimum, 1 Gbps recommended) for streaming replication, plus object or NFS storage for the WAL archive in each datacenter


Multi-Datacenter DR Architecture

High-Level Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                        GLOBAL LOAD BALANCER                            │
│                      (F5 / HAProxy / Route53)                          │
│                    https://aap.example.com                             │
│                                                                        │
│  Health Checks: /api/v2/ping/ every 10s                                │
│  Active-Passive Routing: DC1 (Priority 100) → DC2 (Priority 50)        │
└──────────────┬────────────────────────────────┬────────────────────────┘
               │ (Active - 100% traffic)        │ (Passive - 0% traffic)
               │                                │
┌──────────────▼─────────────────┐   ┌──────────▼──────────────────────┐
│      DATACENTER 1 (Active)     │   │    DATACENTER 2 (Standby)       │
│                                │   │                                 │
│  ┌──────────────────────────┐  │   │  ┌──────────────────────────┐   │
│  │  AAP Component Layer     │  │   │  │  AAP Component Layer     │   │
│  │  (8 VMs - Active)        │  │   │  │  (8 VMs - STOPPED)       │   │
│  │                          │  │   │  │                          │   │
│  │  gateway1-dc1            │  │   │  │  gateway1-dc2            │   │
│  │  gateway2-dc1            │  │   │  │  gateway2-dc2            │   │
│  │    + Redis colocated     │  │   │  │    + Redis (stopped)     │   │
│  │                          │  │   │  │                          │   │
│  │  controller1-dc1         │  │   │  │  controller1-dc2         │   │
│  │  controller2-dc1         │  │   │  │  controller2-dc2         │   │
│  │    (dedicated VMs)       │  │   │  │    (stopped)             │   │
│  │                          │  │   │  │                          │   │
│  │  hub1-dc1                │  │   │  │  hub1-dc2                │   │
│  │  hub2-dc1                │  │   │  │  hub2-dc2                │   │
│  │    + Redis colocated     │  │   │  │    + Redis (stopped)     │   │
│  │                          │  │   │  │                          │   │
│  │  eda1-dc1                │  │   │  │  eda1-dc2                │   │
│  │  eda2-dc1                │  │   │  │  eda2-dc2                │   │
│  │    + Redis colocated     │  │   │  │    + Redis (stopped)     │   │
│  └──────────────────────────┘  │   │  └──────────────────────────┘   │
│  ┌───────────────────────────┐ │   │  ┌───────────────────────────┐  │
│  │  HAProxy Load Balancer    │ │   │  │  HAProxy Load Balancer    │  │
│  │  vip-dc1.example.com      │ │   │  │  vip-dc2.example.com      │  │
│  └────────┬──────────────────┘ │   │  └────────┬──────────────────┘  │
│           │                    │   │           │                     │         
│  ┌────────▼───────────────────┐│   │  ┌────────▼───────────────────┐ │
│  │ PostgreSQL Cluster (3)     ││   │  │ PostgreSQL Cluster (3)     │ │
│  │ (EDB Postgres Advanced 16) ││   │  │ (EDB Postgres Advanced 16) │ │
│  │                            ││   │  │                            │ │
│  │ pg-dc1-1 (PRIMARY)         ││   │  │ pg-dc2-1 (STANDBY/DP)      │ │
│  │   - awx                    ││   │  │   - awx (replica)          │ │
│  │   - automationhub          ││   │  │   - automationhub          │ │
│  │   - automationedacontroller││   │  │   - automationedacontroller│ │
│  │   - automationgateway      ││   │  │   - automationgateway      │ │
│  │                            ││   │  │                            │ │
│  │ pg-dc1-2 (STANDBY)         ││   │  │ pg-dc2-2 (STANDBY)         │ │
│  │ pg-dc1-3 (STANDBY)         ││   │  │ pg-dc2-3 (STANDBY)         │ │
│  │                            ││   │  │                            │ │
│  │ VIP: 10.1.2.100 (EFM)      ││   │  │ VIP: 10.2.2.100 (EFM)      │ │
│  └────────┬───────────────────┘│   │  └────────┬───────────────────┘ │
│           │                    │   │           │                     │
│  ┌────────▼──────────────────┐ │   │  ┌────────▼───────────────────┐ │
│  │ Barman Backup Server      │ │   │  │ Barman Backup Server       │ │
│  │ + WAL Archive (NFS/S3)    │ │   │  │ + WAL Archive (NFS/S3)     │ │
│  └───────────────────────────┘ │   │  └────────────────────────────┘ │
└───────────┬────────────────────┘   └────────────┬────────────────────┘
            │                                     │
            │      Streaming Replication (SSL)    │
            │      5432 (direct or VPN tunnel)    │
            └─────────────────────────────────────┘
                     (Asynchronous)

Data Flow During Normal Operations (DC1 Active)

User → GLB → HAProxy(DC1) → AAP Containers(DC1) → VIP(DC1) → PostgreSQL PRIMARY(DC1)
                                                                      │
                                    ┌─────────────────────────────────┼───────────────┐
                                    │                                 │               │
                                    ▼                                 ▼               ▼
                            PG Standby DC1-2                  PG Standby DC1-3    S3/Barman
                                                                      │
                                                        Streaming Replication (WAN)
                                                                      │
                                                                      ▼
                                                          PG Designated Primary DC2-1
                                                                      │
                                            ┌─────────────────────────┼──────────────┐
                                            │                         │              │
                                            ▼                         ▼              ▼
                                    PG Standby DC2-2          PG Standby DC2-3   S3/Barman

Automated Failover Sequence (DC1 Failure)

1. EFM Detects Primary Failure (pg-dc1-1)
   Time: T+0s

2. EFM Promotes DC2 Designated Primary (pg-dc2-1)
   Command: pg_ctl promote
   Time: T+15s

3. EFM Updates VIP (DC2)
   VIP 10.2.2.100 moved to pg-dc2-1
   Time: T+20s

4. EFM Executes Post-Promotion Script
   Script starts AAP containers in DC2
   Time: T+25s to T+180s

5. Global Load Balancer Detects DC2 Healthy
   Health checks to DC2: PASSING
   Route traffic to DC2
   Time: T+200s

6. DNS TTL Expiration
   Clients resolve aap.example.com to DC2
   Time: T+200s to T+260s (assuming 60s TTL)

7. Failover Complete
   Platform RTO: ~240s (4 minutes)
   User-facing RTO: ~300s (5 minutes including DNS TTL)

Important operational impacts:

  • Jobs running in DC1 at the moment of failure do not resume automatically and must be relaunched after failover
  • EDA rulebook activations do not restart automatically – they require a manual restart after failover (see the Operational Runbook)
  • Clients may continue to resolve aap.example.com to DC1 until the DNS TTL expires, extending user-facing RTO by up to the TTL value

Component Specifications

Quick reference: 26 VMs total (13 per datacenter), 68 vCPU / 272GB RAM per datacenter


AAP Component VMs (Per Datacenter)

Component                 Specification               Count   Resource per VM               Total Resources
Platform Gateway          RHEL 9.4+, Podman + Redis   2       4 vCPU, 16GB RAM, 60GB disk   8 vCPU, 32GB RAM
Automation Controller     RHEL 9.4+, Podman           2       4 vCPU, 16GB RAM, 60GB disk   8 vCPU, 32GB RAM
Automation Hub            RHEL 9.4+, Podman + Redis   2       4 vCPU, 16GB RAM, 60GB disk   8 vCPU, 32GB RAM
Event-Driven Ansible      RHEL 9.4+, Podman + Redis   2       4 vCPU, 16GB RAM, 60GB disk   8 vCPU, 32GB RAM
HAProxy DB Router         RHEL 9.4+, HAProxy          1       2 vCPU, 8GB RAM, 40GB disk    2 vCPU, 8GB RAM
Total AAP Infrastructure  –                           9 VMs   –                             34 vCPU, 136GB RAM

PostgreSQL Database Cluster (Per Datacenter)

Role                                    Count   Specification
Primary + 2 Standby (DC1)               3       8 vCPU, 32GB RAM, 500GB SSD
Designated Primary + 2 Standby (DC2)    3       8 vCPU, 32GB RAM, 500GB SSD

EFM Quorum Requirements:

  • Minimum cluster size: 3 nodes (1 primary + 2 standbys) to achieve quorum for automated failover
  • Witness node: Optional dedicated witness node recommended for production deployments to prevent split-brain scenarios during network partitions
  • Quorum calculation: Majority of configured nodes must be reachable for failover to proceed (e.g., 2 of 3 nodes, or 3 of 5 with witness)
  • Split-brain prevention: EFM requires majority consensus before promoting a standby to prevent multiple primaries

Production recommendation: Deploy a lightweight witness node (2 vCPU, 4GB RAM) in a third location or availability zone to maintain quorum during single-datacenter failures. This guide uses 3-node clusters per DC for simplicity; add witness nodes for production deployments requiring maximum availability.

AAP Databases (4 databases per PostgreSQL instance)

-- Database Layout (AAP 2.6 official database names)
CREATE DATABASE awx OWNER aap;                          -- 50GB (main controller database)
CREATE DATABASE automationhub OWNER aap;                -- 20GB (content/collections)
CREATE DATABASE automationedacontroller OWNER aap;      -- 10GB (event-driven automation)
CREATE DATABASE automationgateway OWNER aap;            -- 5GB (platform gateway)

Network Topology

DC1 Network:
  - AAP Subnet:       10.1.1.0/24
    - gateway1-dc1:     10.1.1.11    gateway2-dc1:     10.1.1.12
    - controller1-dc1:  10.1.1.13    controller2-dc1:  10.1.1.14
    - hub1-dc1:         10.1.1.15    hub2-dc1:         10.1.1.16
    - eda1-dc1:         10.1.1.17    eda2-dc1:         10.1.1.18
    - haproxy-db-dc1:   10.1.1.20

  - Database Subnet:  10.1.2.0/24
    - pg-dc1-1:         10.1.2.21    pg-dc1-2:         10.1.2.22
    - pg-dc1-3:         10.1.2.23
    - Database VIP:     10.1.2.100 (EFM managed)

DC2 Network:
  - AAP Subnet:       10.2.1.0/24
    - gateway1-dc2:     10.2.1.11    gateway2-dc2:     10.2.1.12
    - controller1-dc2:  10.2.1.13    controller2-dc2:  10.2.1.14
    - hub1-dc2:         10.2.1.15    hub2-dc2:         10.2.1.16
    - eda1-dc2:         10.2.1.17    eda2-dc2:         10.2.1.18
    - haproxy-db-dc2:   10.2.1.20

  - Database Subnet:  10.2.2.0/24
    - pg-dc2-1:         10.2.2.21    pg-dc2-2:         10.2.2.22
    - pg-dc2-3:         10.2.2.23
    - Database VIP:     10.2.2.100 (EFM managed)

WAN Connectivity:
  - Type: Site-to-Site VPN or Direct Connect
  - Bandwidth: 100 Mbps minimum, 1 Gbps recommended
  - Latency: < 100ms required for streaming replication
  - Encryption: IPsec or TLS

Important notes on hostname conventions:

  • DC-specific hostnames are intentional: Hostnames like gateway1-dc1 and pg-dc2-1 explicitly identify which datacenter hosts each VM
  • User-facing endpoint is datacenter-agnostic: Users access https://aap.example.com (GLB manages routing)
  • During failover, DC2 hostnames remain unchanged: When DC2 becomes active, nodes retain their -dc2 suffix – this is expected and correct
  • Internal references use VIPs or HAProxy: AAP components connect to database via HAProxy (10.1.1.20 or 10.2.1.20), which routes to the datacenter-local PostgreSQL VIP
  • Why not use datacenter-agnostic names? Explicit DC identifiers in hostnames aid troubleshooting, capacity planning, and operational awareness of which datacenter is serving traffic

Production consideration: Some organizations prefer datacenter-agnostic hostnames (e.g., gateway1-a, gateway1-b) to avoid confusion. This guide uses explicit DC identifiers for operational clarity, but either approach works as long as the GLB provides the user-facing abstraction.

Why HAProxy instead of pgBouncer? AAP 2.6 components manage their own database connections, so an additional pooling layer can interfere with session state; HAProxy is therefore used as a plain TCP router, forwarding connections from the AAP containers to the EFM-managed PostgreSQL VIP without pooling them. See the source architecture documentation’s “HAProxy vs pgBouncer Architectural Analysis” for the complete design rationale.


Solution Walkthrough

Phase 1: Infrastructure Preparation

Operational Impact: None – provisioning and configuration only

Duration: Week 1-2

Tasks:

  1. Provision VMs (26 total)
    • DC1: 8 AAP VMs + 3 PostgreSQL + 1 HAProxy + 1 Barman
    • DC2: 8 AAP VMs + 3 PostgreSQL + 1 HAProxy + 1 Barman
  2. Install RHEL 9.4+ on all nodes
  3. Configure network (VLANs, firewall rules, VPN between DCs)
  4. Install Podman on AAP component VMs
  5. Configure storage (SSD for databases, ensure 3000 IOPS minimum)
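
To confirm the storage requirement in task 5, a quick fio benchmark can be run on each database volume before installation. This is a minimal sketch; the test file path is a placeholder for wherever your database volume is mounted.

# Quick random-write IOPS check on the database volume
# NOTE: /mnt/pgdata is a placeholder mount point – adjust to your storage layout
sudo dnf -y install fio
sudo fio --name=iops-check --filename=/mnt/pgdata/fio-test --size=1G \
  --rw=randwrite --bs=8k --iodepth=32 --ioengine=libaio --direct=1 \
  --runtime=60 --time_based --group_reporting
sudo rm -f /mnt/pgdata/fio-test
# Compare the reported IOPS against the 3000 IOPS minimum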

Network firewall rules required:

# User Access (GLB → HAProxy)
Source: 0.0.0.0/0
Dest: 10.1.1.100, 10.2.1.100
Port: 443/tcp

# HAProxy → Platform Gateway
Source: 10.1.1.10, 10.2.1.10
Dest: 10.1.1.11-12, 10.2.1.11-12
Port: 80/443

# Platform Gateway → AAP Components
Source: 10.1.1.11-12, 10.2.1.11-12
Dest: 10.1.1.13-18, 10.2.1.13-18
Port: 8080/8443 (Controller), 8081/8444 (Hub), 8082/8445 (EDA)

# AAP Components → PostgreSQL (via HAProxy)
Source: 10.1.1.0/24, 10.2.1.0/24
Dest: 10.1.1.20, 10.2.1.20
Port: 5432/tcp

# HAProxy → PostgreSQL VIP
Source: 10.1.1.20, 10.2.1.20
Dest: 10.1.2.100, 10.2.2.100
Port: 5432/tcp

# PostgreSQL Replication (DC1 → DC2)
Source: 10.1.2.21-23
Dest: 10.2.2.21-23
Port: 5432/tcp

# EFM Cluster Communication
Source: 10.1.2.0/24, 10.2.2.0/24
Dest: 10.1.2.0/24, 10.2.2.0/24
Port: 7800-7810/tcp
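
How these rules are realized depends on your firewall tooling. As one hedged example, the cross-DC replication and EFM rules could be expressed with firewalld rich rules on the DC2 database nodes; zones and source addresses should be adapted to your environment.

# On pg-dc2-1: allow streaming replication from the DC1 database nodes
for src in 10.1.2.21 10.1.2.22 10.1.2.23; do
  sudo firewall-cmd --permanent \
    --add-rich-rule="rule family='ipv4' source address='${src}/32' port port='5432' protocol='tcp' accept"
done

# Allow EFM cluster communication between the database subnets
sudo firewall-cmd --permanent \
  --add-rich-rule="rule family='ipv4' source address='10.1.2.0/24' port port='7800-7810' protocol='tcp' accept"
sudo firewall-cmd --permanent \
  --add-rich-rule="rule family='ipv4' source address='10.2.2.0/24' port port='7800-7810' protocol='tcp' accept"

sudo firewall-cmd --reload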

Phase 2: Database Cluster Setup

Operational Impact: Low – database installation and replication setup, no production traffic

Duration: Week 3-4

Key tasks: Install EDB Postgres Advanced Server, configure primary database with streaming replication, initialize AAP databases, set up local and cross-datacenter standbys, install and configure EDB Failover Manager


Step 1: Install EDB Postgres Advanced Server

# On all PostgreSQL nodes (pg-dc1-1, pg-dc1-2, pg-dc1-3, pg-dc2-1, pg-dc2-2, pg-dc2-3)

# Add EDB repository
sudo dnf -y install https://dl.enterprisedb.com/default/release/get/16/rpm

# Install EDB Postgres Advanced Server
sudo dnf -y install edb-as16-server edb-as16-contrib

# Initialize database cluster
sudo /usr/edb/as16/bin/edb-as-16-setup initdb

# Start and enable service
sudo systemctl start edb-as-16
sudo systemctl enable edb-as-16

Step 2: Configure primary database (DC1)

postgresql.conf (pg-dc1-1):

listen_addresses = '*'
port = 5432
max_connections = 1500
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 2GB

# Replication Settings
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1GB
hot_standby = on
hot_standby_feedback = on
synchronous_standby_names = 'pg-dc1-2'  # Local sync standby
synchronous_commit = on

# Archive Settings
archive_mode = on
archive_command = 'barman-cloud-wal-archive --cloud-provider aws-s3 --endpoint-url https://s3.us-east-1.amazonaws.com s3://aap-wal-dc1 edb-cluster %p'
archive_timeout = 60

# Performance Tuning
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
random_page_cost = 1.1  # For SSD
effective_io_concurrency = 200

pg_hba.conf additions (pg-dc1-1):

# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     replicator      10.1.2.22/32            scram-sha-256  # pg-dc1-2
host    replication     replicator      10.1.2.23/32            scram-sha-256  # pg-dc1-3
host    replication     replicator      10.2.2.21/32            scram-sha-256  # pg-dc2-1 (cross-DC)

# AAP database access
host    awx                     aap     10.1.1.0/24             scram-sha-256
host    automationhub           aap     10.1.1.0/24             scram-sha-256
host    automationedacontroller aap     10.1.1.0/24             scram-sha-256
host    automationgateway       aap     10.1.1.0/24             scram-sha-256
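
Changes to postgresql.conf parameters such as listen_addresses, wal_level, and archive_mode require a service restart, while pg_hba.conf edits only need a reload; a minimal sketch:

# Apply postgresql.conf changes (restart required for listen_addresses,
# shared_buffers, max_connections, wal_level, max_wal_senders, archive_mode)
sudo systemctl restart edb-as-16

# Later pg_hba.conf-only changes can be applied with a reload
sudo -u enterprisedb /usr/edb/as16/bin/psql -p 5432 -d postgres -c "SELECT pg_reload_conf();"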

Step 3: Initialize AAP databases

-- On pg-dc1-1
CREATE ROLE aap LOGIN ENCRYPTED PASSWORD 'SCRAM-SHA-256$...';
CREATE ROLE replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'SCRAM-SHA-256$...';

CREATE DATABASE awx OWNER aap;
CREATE DATABASE automationhub OWNER aap;
CREATE DATABASE automationedacontroller OWNER aap;
CREATE DATABASE automationgateway OWNER aap;

-- automationhub requires the hstore extension
\c automationhub
CREATE EXTENSION IF NOT EXISTS hstore;

Step 4: Set up local standbys (DC1-2, DC1-3)

# On pg-dc1-2
sudo systemctl stop edb-as-16
sudo -u enterprisedb rm -rf /var/lib/edb/as16/data/*
sudo -u enterprisedb pg_basebackup -h pg-dc1-1 -U replicator \
  -D /var/lib/edb/as16/data -P -Xs -R --slot=pg_dc1_2_slot -C

# Start standby
sudo systemctl start edb-as-16

# Verify replication
psql -h pg-dc1-1 -U postgres -c "SELECT * FROM pg_stat_replication;"

Repeat for pg-dc1-3 with slot pg_dc1_3_slot.

Step 5: Set up cross-datacenter standby (DC2-1)

# On pg-dc2-1
sudo systemctl stop edb-as-16
sudo -u enterprisedb rm -rf /var/lib/edb/as16/data/*
sudo -u enterprisedb pg_basebackup -h pg-dc1-1 -U replicator \
  -D /var/lib/edb/as16/data -P -Xs -R --slot=pg_dc2_1_slot -C

# Start designated primary (currently in standby role)
sudo systemctl start edb-as-16

# Verify cross-DC replication
psql -h pg-dc1-1 -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication WHERE application_name='pg-dc2-1';"

Set up pg-dc2-2 and pg-dc2-3 as standbys of pg-dc2-1.
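
A minimal sketch for pg-dc2-2, mirroring Steps 4 and 5 (the slot name is illustrative, and pg-dc2-1's pg_hba.conf must allow replication connections from 10.2.2.22 and 10.2.2.23):

# On pg-dc2-2 (repeat on pg-dc2-3 with slot pg_dc2_3_slot)
sudo systemctl stop edb-as-16
sudo -u enterprisedb rm -rf /var/lib/edb/as16/data/*
sudo -u enterprisedb pg_basebackup -h pg-dc2-1 -U replicator \
  -D /var/lib/edb/as16/data -P -Xs -R --slot=pg_dc2_2_slot -C
sudo systemctl start edb-as-16

# Verify the cascading standbys on pg-dc2-1
psql -h pg-dc2-1 -U postgres -c "SELECT application_name, state FROM pg_stat_replication;"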

Step 6: Install and configure EDB Failover Manager (EFM)

# On all PostgreSQL nodes
sudo dnf -y install edb-efm47

# Configure EFM properties
sudo vi /etc/edb/efm-4.7/efm.properties

/etc/edb/efm-4.7/efm.properties (pg-dc1-1):

# Database Configuration
db.user=efm
db.password.encrypted=<encrypted_password>
db.port=5432
db.database=postgres

# Node Configuration
bind.address=10.1.2.21:7800
is.witness=false
db.service.owner=enterprisedb
db.service.name=edb-as-16
db.bin=/usr/edb/as16/bin

# Membership (all nodes in DC1 cluster)
nodes=10.1.2.21:7800 10.1.2.22:7800 10.1.2.23:7800

# Auto-failover Settings
auto.failover=true
auto.reconfigure=true
failover.timeout=60
node.timeout=60

# Virtual IP (for AAP connection)
virtual.ip=10.1.2.100
virtual.ip.interface=eth0
virtual.ip.prefix=24
virtual.ip.single=true

# Post-promotion Script (AAP integration)
script.post.promotion=/usr/edb/efm-4.7/bin/efm-orchestrated-failover.sh %h %s %a %v
enable.custom.scripts=true
script.timeout=600

# Notification
notification.level=WARNING
user.email=ops@example.com
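
Before starting EFM, the database role referenced by db.user must exist and db.password.encrypted must be generated. A minimal sketch, using a superuser role for simplicity – consult the EFM documentation for the minimal privilege set, and substitute your own password:

# Create the EFM database user on the current primary
# (pg_hba.conf must also allow this role to connect from each node's bind address)
psql -h 10.1.2.21 -U postgres -d postgres -c \
  "CREATE ROLE efm LOGIN SUPERUSER PASSWORD 'ChangeMeEFM123!';"

# Generate the value for db.password.encrypted (prompts for the password)
/usr/edb/efm-4.7/bin/efm encrypt efm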

Start EFM:

sudo systemctl start edb-efm-4.7
sudo systemctl enable edb-efm-4.7

# Verify cluster status
/usr/edb/efm-4.7/bin/efm cluster-status efm

Step 7: Configure WAL archiving to S3

# Install barman-cli on all PostgreSQL nodes
sudo dnf -y install barman-cli

# Configure S3 credentials (for the OS user that runs archive_command)
mkdir -p ~/.aws && chmod 700 ~/.aws
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF
chmod 600 ~/.aws/credentials

# Test WAL archiving
barman-cloud-wal-archive --cloud-provider aws-s3 \
  --endpoint-url https://s3.us-east-1.amazonaws.com \
  s3://aap-wal-dc1 edb-cluster /tmp/test.wal

Phase 3: AAP Installation

Operational Impact: Medium – installs AAP services, requires database connectivity

Duration: Week 5-6

Key tasks: Download AAP containerized installer, configure inventory for 16 AAP VMs across both datacenters, run installer, verify installation, stop DC2 services for standby mode


Step 8: Download AAP containerized installer

# On a controller node in DC1
cd /opt
tar -xzf ansible-automation-platform-containerized-setup-2.6-1.tar.gz
cd ansible-automation-platform-containerized-setup-2.6-1

Step 9: Create unified inventory file

/opt/aap/inventory:

# Red Hat Ansible Automation Platform 2.6 - Container Enterprise Topology
# Multi-Datacenter Active/Passive Configuration

# Platform Gateway (4 VMs - 2 per DC with colocated Redis)
[automationgateway]
gateway1-dc1.example.com
gateway2-dc1.example.com
gateway1-dc2.example.com
gateway2-dc2.example.com

# Automation Controller (4 VMs - 2 per DC, dedicated)
[automationcontroller]
controller1-dc1.example.com
controller2-dc1.example.com
controller1-dc2.example.com
controller2-dc2.example.com

# Automation Hub (4 VMs - 2 per DC with colocated Redis)
[automationhub]
hub1-dc1.example.com
hub2-dc1.example.com
hub1-dc2.example.com
hub2-dc2.example.com

# Event-Driven Ansible (4 VMs - 2 per DC with colocated Redis)
[automationeda]
eda1-dc1.example.com
eda2-dc1.example.com
eda1-dc2.example.com
eda2-dc2.example.com

# Redis (colocated on gateway, hub, and EDA nodes)
[redis]
gateway1-dc1.example.com
gateway2-dc1.example.com
hub1-dc1.example.com
hub2-dc1.example.com
eda1-dc1.example.com
eda2-dc1.example.com
gateway1-dc2.example.com
gateway2-dc2.example.com
hub1-dc2.example.com
hub2-dc2.example.com
eda1-dc2.example.com
eda2-dc2.example.com

[all:vars]
# Common variables
postgresql_admin_username=postgres
postgresql_admin_password='<set your own>'

# Red Hat Registry Credentials
registry_username='<your RHN username>'
registry_password='<your RHN password>'

# Redis Configuration
redis_mode='standalone'

# Platform Gateway Configuration
gateway_admin_password='<set your own>'
gateway_pg_database='automationgateway'
gateway_pg_username='aap'
gateway_pg_password='<set your own>'
gateway_main_url='https://aap.example.com'

# Automation Controller Configuration
controller_admin_password='<set your own>'
controller_pg_database='awx'
controller_pg_username='aap'
controller_pg_password='<set your own>'

# Automation Hub Configuration
hub_admin_password='<set your own>'
hub_pg_database='automationhub'
hub_pg_username='aap'
hub_pg_password='<set your own>'

# Event-Driven Ansible Configuration
eda_admin_password='<set your own>'
eda_pg_database='automationedacontroller'
eda_pg_username='aap'
eda_pg_password='<set your own>'

# Per-host database connection variables.
# DC1 hosts connect through the DC1 HAProxy database router (10.1.1.20);
# DC2 hosts connect through the DC2 HAProxy database router (10.2.1.20).
# Hosts may appear in more than one group; this extra group exists only to
# attach host-specific variables in valid INI inventory syntax.
[db_routing]
gateway1-dc1.example.com gateway_pg_host='10.1.1.20' gateway_pg_port='5432'
gateway2-dc1.example.com gateway_pg_host='10.1.1.20' gateway_pg_port='5432'
controller1-dc1.example.com controller_pg_host='10.1.1.20' controller_pg_port='5432'
controller2-dc1.example.com controller_pg_host='10.1.1.20' controller_pg_port='5432'
hub1-dc1.example.com hub_pg_host='10.1.1.20' hub_pg_port='5432'
hub2-dc1.example.com hub_pg_host='10.1.1.20' hub_pg_port='5432'
eda1-dc1.example.com eda_pg_host='10.1.1.20' eda_pg_port='5432'
eda2-dc1.example.com eda_pg_host='10.1.1.20' eda_pg_port='5432'
gateway1-dc2.example.com gateway_pg_host='10.2.1.20' gateway_pg_port='5432'
gateway2-dc2.example.com gateway_pg_host='10.2.1.20' gateway_pg_port='5432'
controller1-dc2.example.com controller_pg_host='10.2.1.20' controller_pg_port='5432'
controller2-dc2.example.com controller_pg_host='10.2.1.20' controller_pg_port='5432'
hub1-dc2.example.com hub_pg_host='10.2.1.20' hub_pg_port='5432'
hub2-dc2.example.com hub_pg_host='10.2.1.20' hub_pg_port='5432'
eda1-dc2.example.com eda_pg_host='10.2.1.20' eda_pg_port='5432'
eda2-dc2.example.com eda_pg_host='10.2.1.20' eda_pg_port='5432'

Critical: DC2 nodes will be STOPPED after installation. All admin passwords and database credentials must match between DC1 and DC2 for seamless failover.

Step 10: Run the unified installer from DC1 (deploys both datacenters)

cd /opt/ansible-automation-platform-containerized-setup-2.6-1

# Run installer
./setup.sh

# Verify installation
podman ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Enable systemd services
systemctl enable --now automation-controller-web
systemctl enable --now automation-controller-task
systemctl enable --now automation-gateway
systemctl enable --now automation-hub
systemctl enable --now eda-activation-worker
systemctl enable --now redis

Step 11: Place DC2 in standby mode (stop and disable services)

# On each DC2 AAP node, after the unified install completes

# CRITICAL: Verify the controller SECRET_KEY (e.g. /etc/tower/SECRET_KEY) is
# identical between DC1 and DC2 controller nodes – matching secrets are
# required for seamless failover

# Stop all AAP containers immediately (standby mode)
systemctl stop automation-controller-web automation-controller-task
systemctl stop automation-gateway automation-hub eda-activation-worker redis

# Disable auto-start
systemctl disable automation-controller-web automation-controller-task
systemctl disable automation-gateway automation-hub eda-activation-worker redis
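
A quick sanity check that DC2 is genuinely cold, matching the validation checklist expectation of no running containers:

# Verify no AAP containers remain running on any DC2 node
for node in gateway1-dc2 gateway2-dc2 controller1-dc2 controller2-dc2 \
            hub1-dc2 hub2-dc2 eda1-dc2 eda2-dc2; do
    echo "== ${node} =="
    ssh "$node" "podman ps -q | wc -l"   # expected output: 0
done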

Step 12: Configure HAProxy for database connection routing

/etc/haproxy/haproxy.cfg (DC1 and DC2):

global
    log /dev/log local0 info
    chroot /var/lib/haproxy
    stats socket /var/lib/haproxy/stats mode 600 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    maxconn 4000

defaults
    log     global
    mode    tcp
    option  tcplog
    option  dontlognull
    timeout connect 10s
    timeout client  1h
    timeout server  1h
    timeout check   5s
    retries 3

# Backend - PostgreSQL VIP (EFM-managed)
backend postgresql_backend
    mode tcp
    balance roundrobin
    
    # External health check validates writable node
    option external-check
    external-check path "/usr/bin:/bin"
    external-check command /usr/local/bin/check-postgres-writable.sh
    
    # Single backend: EFM-managed VIP always points to PRIMARY
    server postgresql-vip 10.1.2.100:5432 check inter 5s rise 2 fall 3 maxconn 500

# Frontend - AAP Database Connections
frontend postgresql_frontend
    bind *:5432
    mode tcp
    default_backend postgresql_backend

# Stats interface
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 10s
    stats auth admin:ChangeMeStats123!

External health check script:

#!/bin/bash
# /usr/local/bin/check-postgres-writable.sh

PGHOST="${1:-10.1.2.100}"
PGPORT="${2:-5432}"
PGUSER="haproxy_healthcheck"
PGDATABASE="postgres"
TIMEOUT=3

# Check 1: PostgreSQL is reachable
if ! timeout "${TIMEOUT}" pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -q; then
    logger -t haproxy-healthcheck "PostgreSQL unreachable: ${PGHOST}:${PGPORT}"
    exit 1
fi

# Check 2: PostgreSQL is NOT in recovery (writable PRIMARY)
IS_RECOVERY=$(timeout "${TIMEOUT}" psql \
    -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" \
    -t -c "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')

if [[ "${IS_RECOVERY}" == "f" ]]; then
    exit 0  # Writable PRIMARY
else
    logger -t haproxy-healthcheck "PostgreSQL is read-only: ${PGHOST}:${PGPORT}"
    exit 1  # Read-only STANDBY
fi

Create health check user:

-- On primary database
CREATE USER haproxy_healthcheck WITH PASSWORD 'HealthCheckPassword123!';
GRANT CONNECT ON DATABASE postgres TO haproxy_healthcheck;

-- Add to pg_hba.conf
# TYPE  DATABASE        USER                    ADDRESS         METHOD
host    postgres        haproxy_healthcheck     10.1.1.0/24     scram-sha-256
host    postgres        haproxy_healthcheck     10.2.1.0/24     scram-sha-256

Start HAProxy:

sudo systemctl enable --now haproxy
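
To confirm the routing path end to end, validate the configuration syntax and make a test connection through HAProxy (the aap role's password is prompted or supplied via ~/.pgpass):

# Syntax-check the configuration
sudo haproxy -c -f /etc/haproxy/haproxy.cfg

# Connect through HAProxy to the EFM-managed VIP and confirm a writable primary
psql -h 10.1.1.20 -p 5432 -U aap -d awx -c "SELECT pg_is_in_recovery();"   # expect: f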

Phase 4: Integration and Automation

Operational Impact: Medium – failover automation configuration

Duration: Week 7-8

Key tasks: Create EFM post-promotion script for AAP activation, configure global load balancer, set up monitoring and alerting, create operational runbooks


Step 13: Create EFM post-promotion script for AAP activation

/usr/edb/efm-4.7/bin/efm-orchestrated-failover.sh:

#!/bin/bash
# EFM Post-Promotion Script: Start AAP containers in failover datacenter

set -e

CLUSTER_NAME="$1"
NODE_TYPE="$2"
NODE_ADDRESS="$3"
VIP_ADDRESS="$4"

# Determine datacenter
if [[ "$NODE_ADDRESS" == *"dc2"* ]] || [[ "$NODE_ADDRESS" == "10.2"* ]]; then
    DATACENTER="DC2"
    GATEWAY_NODES=("gateway1-dc2" "gateway2-dc2")
    CONTROLLER_NODES=("controller1-dc2" "controller2-dc2")
    HUB_NODES=("hub1-dc2" "hub2-dc2")
    EDA_NODES=("eda1-dc2" "eda2-dc2")
else
    # Promotion happened within DC1 (local failover) – AAP is already running there
    echo "Local DC1 failover detected; no AAP service action required"
    exit 0
fi

# Start AAP containers by component type
echo "Starting Platform Gateway nodes in $DATACENTER..."
for node in "${GATEWAY_NODES[@]}"; do
    ssh "$node" "systemctl start automation-gateway redis"
done

echo "Starting Automation Controller nodes in $DATACENTER..."
for node in "${CONTROLLER_NODES[@]}"; do
    ssh "$node" "systemctl start automation-controller-web automation-controller-task"
done

echo "Starting Automation Hub nodes in $DATACENTER..."
for node in "${HUB_NODES[@]}"; do
    ssh "$node" "systemctl start automation-hub redis"
done

echo "Starting Event-Driven Ansible nodes in $DATACENTER..."
for node in "${EDA_NODES[@]}"; do
    ssh "$node" "systemctl start eda-activation-worker redis"
done

# Wait for AAP API
MAX_WAIT=300
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    HTTP_CODE=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 https://10.2.1.11/api/v2/ping/)
    if [ "$HTTP_CODE" = "200" ]; then
        echo "AAP is ready in $DATACENTER"
        break
    fi
    sleep 10
    ELAPSED=$((ELAPSED + 10))
done

# Send notifications
logger -t efm-failover "AAP activated in $DATACENTER"
echo "WARNING: EDA rulebook activations require MANUAL restart via AAP UI or API" | logger -t efm-failover

Make executable and configure SSH keys:

chmod +x /usr/edb/efm-4.7/bin/efm-orchestrated-failover.sh

# Configure passwordless SSH from all PostgreSQL nodes to all AAP nodes
# (Required for post-promotion script to start services)
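
One way to set this up is with an ed25519 key distributed via ssh-copy-id – a sketch, assuming the account that EFM uses to execute custom scripts can SSH as a user permitted to manage the AAP systemd units:

# On each PostgreSQL node, as the OS account that runs EFM custom scripts
# (verify which account this is in your EFM installation before distributing keys)
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
for node in gateway1-dc2 gateway2-dc2 controller1-dc2 controller2-dc2 \
            hub1-dc2 hub2-dc2 eda1-dc2 eda2-dc2; do
    ssh-copy-id -i ~/.ssh/id_ed25519.pub "$node"
done

# Confirm non-interactive access
ssh -o BatchMode=yes gateway1-dc2 "hostname"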

Step 14: Configure Global Load Balancer

Critical DNS Configuration:

  • TTL recommendation: 30-60 seconds for the GLB record
    • Lower TTL (30s) = faster client failover, higher DNS query load
    • Higher TTL (60s) = reduced DNS queries, slower client convergence after failover
  • User-facing RTO calculation: Platform failover time (4 min) + DNS TTL (30-60s) = ~5 minutes total
  • Health check interval: 10 seconds with 2 consecutive failures before marking unhealthy
  • Health check timeout: 5 seconds

Example Route53 configuration:

{
  "HostedZoneId": "Z1234567890ABC",
  "ChangeBatch": {
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "aap.example.com",
          "Type": "A",
          "TTL": 60,
          "SetIdentifier": "DC1-Primary",
          "Failover": "PRIMARY",
          "HealthCheckId": "health-check-dc1",
          "ResourceRecords": [{"Value": "10.1.1.100"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "aap.example.com",
          "Type": "A",
          "TTL": 60,
          "SetIdentifier": "DC2-Secondary",
          "Failover": "SECONDARY",
          "HealthCheckId": "health-check-dc2",
          "ResourceRecords": [{"Value": "10.2.1.100"}]
        }
      }
    ]
  }
}

Health check configuration:

{
  "Type": "HTTPS",
  "ResourcePath": "/api/v2/ping/",
  "FullyQualifiedDomainName": "aap-dc1.example.com",
  "Port": 443,
  "RequestInterval": 10,
  "FailureThreshold": 2,
  "MeasureLatency": true
}
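
If Route53 is the chosen GLB, the record set and health checks above can be applied with the AWS CLI; the JSON file names here are illustrative:

# Create the health checks first (capture the returned IDs for HealthCheckId)
aws route53 create-health-check \
  --caller-reference "aap-dc1-$(date +%s)" \
  --health-check-config file://healthcheck-dc1.json

# Apply the failover record set
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://aap-failover-records.json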

Important: DNS TTL directly impacts user-facing RTO. A 60-second TTL means clients may continue attempting to reach DC1 for up to 60 seconds after GLB has switched to DC2. For mission-critical deployments, consider 30-second TTL or client-side connection retry logic.

Step 15: Set up monitoring and alerting

Prometheus alert rules:

# /etc/prometheus/alert-rules.yml

groups:
  - name: aap_alerts
    interval: 30s
    rules:
      - alert: AAPAPIDown
        expr: probe_success{job="aap-api"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "AAP API is down on {{ $labels.instance }}"

      - alert: PostgreSQLReplicationLagHigh
        expr: pg_replication_lag_seconds > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag on {{ $labels.instance }}"

      - alert: PostgreSQLReplicationStopped
        expr: pg_replication_is_replica == 1 and pg_replication_lag_seconds == -1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication stopped on {{ $labels.instance }}"

Phase 5: Testing and Validation

Operational Impact: High during failover tests – production traffic should be drained before testing

Duration: Week 9-10

Key tasks: Test local database failover, test cross-datacenter failover, test AAP failover activation, test failback procedure, measure and validate RTO/RPO targets


Step 16: Test local database failover (within DC1)

# Simulate primary failure
ssh pg-dc1-1 "sudo systemctl stop edb-as-16"

# EFM should automatically promote pg-dc1-2
# Monitor with:
/usr/edb/efm-4.7/bin/efm cluster-status efm

# Verify VIP moved
ping 10.1.2.100

# Verify new primary
psql -h 10.1.2.100 -U postgres -c "SELECT pg_is_in_recovery();"
# Expected: f (false)

Step 17: Test cross-datacenter failover (DC1 → DC2)

# Stop AAP in DC1
for node in gateway1-dc1 gateway2-dc1; do
    ssh "$node" "systemctl stop automation-gateway redis"
done
for node in controller1-dc1 controller2-dc1; do
    ssh "$node" "systemctl stop automation-controller-web automation-controller-task"
done
for node in hub1-dc1 hub2-dc1; do
    ssh "$node" "systemctl stop automation-hub redis"
done
for node in eda1-dc1 eda2-dc1; do
    ssh "$node" "systemctl stop eda-activation-worker redis"
done

# Promote DC2 database to primary
ssh pg-dc2-1 "sudo -u enterprisedb /usr/edb/as16/bin/pg_ctl promote -D /var/lib/edb/as16/data"

# Post-promotion script should start AAP in DC2
# Monitor logs:
tail -f /var/log/messages | grep efm-failover

# Verify AAP API in DC2
curl -k https://10.2.1.11/api/v2/ping/

Step 18: Test failback procedure (DC2 → DC1)

See Validation section for complete failback procedure.

Step 19: Measure RTO/RPO

Document actual failover times:

  • Database promotion: ~15 seconds
  • AAP startup: ~3 minutes
  • GLB detection: ~30 seconds
  • Total RTO: ~240 seconds (under 5-minute target)
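
A simple way to capture the user-facing number during a drill is to timestamp how long the public endpoint stays unreachable – a rough sketch:

# Run from a client network at the moment failover is triggered
START=$(date +%s)
while ! curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 \
        https://aap.example.com/api/v2/ping/ | grep -q '^200$'; do
    sleep 5
done
echo "AAP reachable again after $(( $(date +%s) - START )) seconds"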

Phase 6: Production Cutover

Operational Impact: High – production migration

Duration: Week 11-12

Key tasks: Final configuration review, production data migration, user acceptance testing, go-live planning

Step 20: Final configuration review

Step 21: Production data migration

Step 22: User acceptance testing

Step 23: Go-live


Validation

Per-Stage Validation Checklist

Infrastructure – all VMs provisioned and accessible
  Success: SSH connectivity from provisioning host
Network – firewall rules configured
  Success: nc -zv <ip> <port> succeeds for all required ports
PostgreSQL Primary – database service running
  Success: systemctl status edb-as-16 shows active
AAP Databases – four databases created
  Success: psql -l shows awx, automationhub, automationedacontroller, automationgateway
Local Replication – standbys replicating from primary
  Success: SELECT * FROM pg_stat_replication; shows pg-dc1-2, pg-dc1-3
Cross-DC Replication – DC2 replicating from DC1
  Success: SELECT * FROM pg_stat_replication; shows pg-dc2-1
EFM Cluster – cluster status healthy
  Success: /usr/edb/efm-4.7/bin/efm cluster-status efm shows all nodes
VIP – virtual IP assigned to primary
  Success: ping 10.1.2.100 succeeds and the address is held by pg-dc1-1
AAP DC1 – services running
  Success: podman ps shows all containers on all DC1 AAP nodes
AAP DC2 – services stopped
  Success: podman ps shows no containers on DC2 AAP nodes
HAProxy – routing to PostgreSQL VIP
  Success: psql -h 10.1.1.20 -U aap -d awx connects successfully
AAP API – controller API responding
  Success: curl -k https://aap.example.com/api/v2/ping/ returns 200
Local Failover – EFM promotes local standby
  Success: stop pg-dc1-1; pg-dc1-2 becomes primary within 60s
Cross-DC Failover – EFM promotes DC2 and starts AAP
  Success: promote pg-dc2-1; AAP starts in DC2 within 5 minutes
GLB Routing – traffic redirects to DC2
  Success: curl https://aap.example.com resolves to DC2 after failover

Health Check Commands

Database health check:

#!/bin/bash
# /usr/local/bin/check-postgres-health.sh

PG_HOST="${1:-localhost}"
PG_PORT="${2:-5432}"

if ! pg_isready -h "$PG_HOST" -p "$PG_PORT" -U postgres; then
    echo "CRITICAL: PostgreSQL not accepting connections"
    exit 2
fi

IS_REPLICA=$(psql -h "$PG_HOST" -p "$PG_PORT" -U postgres -t -A -c "SELECT pg_is_in_recovery();")
if [ "$IS_REPLICA" = "t" ]; then
    LAG=$(psql -h "$PG_HOST" -p "$PG_PORT" -U postgres -t -c \
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));")
    if (( $(echo "$LAG > 60" | bc -l) )); then
        echo "CRITICAL: Replication lag is ${LAG}s"
        exit 2
    fi
fi

echo "OK: PostgreSQL healthy"
exit 0

AAP health check:

#!/bin/bash
# /usr/local/bin/check-aap-health.sh

AAP_URL="${1:-https://localhost}"
HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" --max-time 10 "$AAP_URL/api/v2/ping/")

if [ "$HTTP_CODE" = "200" ]; then
    echo "OK: AAP API responding"
    exit 0
else
    echo "CRITICAL: AAP API returned HTTP $HTTP_CODE"
    exit 2
fi

Failback Procedure (DC2 → DC1)

When to use: After DC1 infrastructure is restored and you want to return to normal Active-Passive configuration (DC1 active, DC2 standby)

# 1. Rebuild DC1 as standby of DC2
ssh pg-dc1-1 "sudo systemctl stop edb-as-16"
ssh pg-dc1-1 "sudo -u enterprisedb rm -rf /var/lib/edb/as16/data/*"
ssh pg-dc1-1 "sudo -u enterprisedb pg_basebackup -h pg-dc2-1 -U replicator \
    -D /var/lib/edb/as16/data -P -Xs -R --slot=pg_dc1_1_slot -C"

# 2. Start DC1 as standby
ssh pg-dc1-1 "sudo systemctl start edb-as-16"

# 3. Verify replication DC2→DC1
ssh pg-dc2-1 "psql -U postgres -c \"SELECT * FROM pg_stat_replication;\""

# 4. Wait for minimal replication lag (< 5 seconds)

# 5. Stop AAP in DC2
for node in gateway1-dc2 gateway2-dc2; do
    ssh "$node" "systemctl stop automation-gateway redis"
done
for node in controller1-dc2 controller2-dc2; do
    ssh "$node" "systemctl stop automation-controller-web automation-controller-task"
done
for node in hub1-dc2 hub2-dc2; do
    ssh "$node" "systemctl stop automation-hub redis"
done
for node in eda1-dc2 eda2-dc2; do
    ssh "$node" "systemctl stop eda-activation-worker redis"
done

# 6. Promote DC1 back to primary
ssh pg-dc1-1 "sudo -u enterprisedb /usr/edb/as16/bin/pg_ctl promote -D /var/lib/edb/as16/data"

# 7. Configure DC2 as standby again
ssh pg-dc2-1 "sudo systemctl stop edb-as-16"
ssh pg-dc2-1 "sudo -u enterprisedb rm -rf /var/lib/edb/as16/data/*"
ssh pg-dc2-1 "sudo -u enterprisedb pg_basebackup -h pg-dc1-1 -U replicator \
    -D /var/lib/edb/as16/data -P -Xs -R --slot=pg_dc2_1_slot -C"
ssh pg-dc2-1 "sudo systemctl start edb-as-16"

# 8. Start AAP in DC1
for node in gateway1-dc1 gateway2-dc1; do
    ssh "$node" "systemctl start automation-gateway redis"
done
for node in controller1-dc1 controller2-dc1; do
    ssh "$node" "systemctl start automation-controller-web automation-controller-task"
done
for node in hub1-dc1 hub2-dc1; do
    ssh "$node" "systemctl start automation-hub redis"
done
for node in eda1-dc1 eda2-dc1; do
    ssh "$node" "systemctl start eda-activation-worker redis"
done

# 9. Update Global Load Balancer back to DC1

# 10. Verify normal operations
curl -k https://aap.example.com/api/v2/ping/

Troubleshooting

Symptom: Replication lag increasing
  Likely cause: WAN bandwidth saturated or high latency
  Fix: Verify network connectivity; check pg_stat_replication for write_lag, flush_lag, replay_lag

Symptom: EFM failover does not trigger
  Likely cause: auto.failover=false or insufficient quorum
  Fix: Verify EFM properties; ensure a majority of nodes can communicate

Symptom: AAP containers fail to start in DC2
  Likely cause: Database not ready or incorrect connection string
  Fix: Verify the PostgreSQL VIP is accessible from AAP nodes; check HAProxy backend status

Symptom: HAProxy health check fails
  Likely cause: Database in recovery mode (read-only)
  Fix: Verify pg_is_in_recovery() returns f; check EFM VIP assignment

Symptom: Cross-DC replication stopped
  Likely cause: Replication slot removed or network partition
  Fix: Recreate the replication slot; verify VPN/Direct Connect connectivity

Symptom: Post-promotion script timeout
  Likely cause: SSH keys not configured or AAP startup slow
  Fix: Verify passwordless SSH from PostgreSQL nodes to AAP nodes; increase script.timeout in EFM properties

Operational Runbook

Daily Health Check

#!/bin/bash
# /usr/local/bin/daily-health-check.sh

echo "Checking AAP DC1..."
/usr/local/bin/check-aap-health.sh https://10.1.1.11

echo "Checking PostgreSQL DC1..."
/usr/local/bin/check-postgres-health.sh 10.1.2.100

echo "Checking PostgreSQL DC2 replication..."
ssh pg-dc1-1 "psql -U postgres -c \"SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication WHERE application_name='pg-dc2-1';\""

echo "Checking EFM cluster status..."
/usr/edb/efm-4.7/bin/efm cluster-status efm
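
To run this check unattended, a cron entry can mail the output to the operations team each morning (assumes local mail delivery is configured; the recipient address is illustrative):

# Install a system cron entry for the daily check
echo '0 8 * * * root /usr/local/bin/daily-health-check.sh 2>&1 | mail -s "AAP DR daily health check" ops@example.com' \
  | sudo tee /etc/cron.d/aap-dr-healthcheck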

Emergency Failover to DC2

# /usr/local/bin/manual-failover-dc2.sh

# Stop AAP in DC1 (|| true tolerates units that are not installed on a given node)
for node in gateway1-dc1 gateway2-dc1 controller1-dc1 controller2-dc1 hub1-dc1 hub2-dc1 eda1-dc1 eda2-dc1; do
    ssh "$node" "systemctl stop automation-gateway automation-hub automation-controller-web automation-controller-task eda-activation-worker redis 2>/dev/null || true"
done

# Promote DC2 database
ssh pg-dc2-1 "sudo -u enterprisedb /usr/edb/as16/bin/pg_ctl promote -D /var/lib/edb/as16/data"

# Start AAP in DC2 (each node starts only the units installed on it)
for node in gateway1-dc2 gateway2-dc2; do ssh "$node" "systemctl start automation-gateway redis"; done
for node in controller1-dc2 controller2-dc2; do ssh "$node" "systemctl start automation-controller-web automation-controller-task"; done
for node in hub1-dc2 hub2-dc2; do ssh "$node" "systemctl start automation-hub redis"; done
for node in eda1-dc2 eda2-dc2; do ssh "$node" "systemctl start eda-activation-worker redis"; done

# Update Global Load Balancer
echo "Update GLB to route to DC2"

# CRITICAL: Manually restart EDA activations
echo "MANUAL INTERVENTION REQUIRED: Restart EDA rulebook activations via AAP UI"

Post-Failover EDA Activation Restart

EDA rulebook activations do NOT automatically restart after failover. You must manually restart each activation:

# Option 1: Via AAP UI
# 1. Log in to https://aap.example.com
# 2. Navigate to Event-Driven Ansible → Rulebook Activations
# 3. For each activation in "Stopped" state, click "Restart"

# Option 2: Via API
AAP_TOKEN="<your_token>"
AAP_URL="https://aap.example.com"

# List all activations
curl -k -H "Authorization: Bearer ${AAP_TOKEN}" \
  "${AAP_URL}/api/eda/v1/activations/" | jq '.results[] | {id, name, is_enabled}'

# Restart specific activation by ID
ACTIVATION_ID=42
curl -k -X POST -H "Authorization: Bearer ${AAP_TOKEN}" \
  "${AAP_URL}/api/eda/v1/activations/${ACTIVATION_ID}/restart/"

Why manual restart is required: EDA activations maintain stateful connections to event sources (webhooks, Kafka topics, etc.). Automated restart during failover could cause duplicate event processing or message loss. Manual verification ensures event source connectivity and state consistency before resuming.

Rolling Restart of AAP Component

# Example: restart controller1-dc1 for patching

# 1. Drain jobs
# (Manual via AAP UI or API)

# 2. Stop services
ssh controller1-dc1 "systemctl stop automation-controller-web automation-controller-task"

# 3. Perform maintenance
ssh controller1-dc1 "dnf update -y && reboot"

# 4. Start services
ssh controller1-dc1 "systemctl start automation-controller-web automation-controller-task"

# 5. Verify health
curl -k https://controller1-dc1/api/v2/ping/

Maturity Path

Crawl – Single-datacenter AAP with local HA via EFM; manual failover procedures documented and tested quarterly
  What to build: Deploy the AAP 2.6 Container Enterprise Topology with a 3-node PostgreSQL cluster and EFM in one datacenter; document manual backup/restore procedures; test restore from Barman quarterly

Walk – Multi-datacenter with manual failover procedures; cross-DC replication active; DR drills every 6 months
  What to build: Deploy this full architecture; disable auto.failover in EFM; execute the manual failover procedure during scheduled maintenance windows; measure and document actual RTO/RPO

Run – Automated failover with < 5-minute RTO; EFM orchestrates database promotion and AAP activation; continuous monitoring with alerting on replication lag and cluster health
  What to build: Enable auto.failover=true in EFM; deploy Prometheus/Grafana with alert rules; integrate the post-promotion script with ITSM/Slack for automated incident creation; run quarterly DR drills with business stakeholders observing


Summary

By implementing this multi-datacenter Active-Passive DR architecture, you have deployed mission-critical Ansible Automation Platform with guaranteed automation continuity:

Infrastructure scale:

  • 26 VMs across two datacenters – per datacenter: 8 AAP component VMs, 1 HAProxy database router, 3 EDB Postgres Advanced Server nodes, and 1 Barman backup server
  • A global load balancer providing the single user-facing endpoint (https://aap.example.com)

This architecture ensures automation availability for workflows that cannot tolerate downtime – network orchestration, security compliance enforcement, multi-cloud deployments, and automated incident response.


Document Version: 1.0
Last Review: 2026-04-20
Based On: AAP Containerized Multi-Datacenter DR Architecture v2.0 (2026-03-31)