TPA (Trusted PostgreSQL Architecture) is a set of reference architectures. They are 2ndQuadrant's detailed recommendations on how to deploy Postgres, Postgres-BDR, and Postgres-XL in production.
TPA architectures are deployed using TPAexec; this article assumes that it has been correctly installed and tested. For customers, installation and testing are usually carried out during an initial "TPAexec Walkthrough" screen-sharing session.
This article shows how to prepare automated test files so that they can be placed in a TPAexec cluster directory and run with TPAexec.
This is particularly useful for developing new tests, or for maintaining customer-specific tests that must not become part of the TPAexec source code.
You must use TPAexec version 7.6.2 or later.
The tpaexec test command accepts the name of a test as an optional argument after the cluster name, as in the following example:
tpaexec test $ClusterDir mytest
where $ClusterDir is the directory of an already-provisioned, deployed, and running TPAexec cluster.
If there is a file $ClusterDir/tests/mytest.yml then TPAexec will run it, and its output will be logged in the usual TPAexec way.
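For example, assuming $ClusterDir points at your cluster directory, the test file can be created like this (the choice of editor is, of course, yours):
$ mkdir -p $ClusterDir/tests
$ $EDITOR $ClusterDir/tests/mytest.yml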
The test file must be a valid Ansible playbook, which can use predefined TPAexec roles and tasks, as in this basic example:
- name: My development test playbook
  any_errors_fatal: True
  max_fail_percentage: 0
  hosts: all
  tasks:
  - include_role: name=test tasks_from=prereqs.yml
    tags: always
  - name: Execute a test command
    shell: "lsblk"
    become: yes
    become_user: root
    register: testreg
  - name: Write the output of the test command to a file
    include_role: name=test tasks_from=output.yml
    vars:
      output_file: mytest.txt
      content: |
        {{ testreg.stdout }}
    tags: always
This playbook consists of a header followed by a list of tasks, in this case three.
The first task includes the predefined prereqs.yml tasks from the test TPAexec role:
tasks:
- include_role: name=test tasks_from=prereqs.yml
  tags: always
These tasks carry out some initial actions, such as creating the subdirectories where the test will place its output files.
The second task executes a test command:
- name: Execute a test command
  shell: "lsblk"
  become: yes
  become_user: root
  register: testreg
Note that we change user to root, and that we save the output of the command in an Ansible register called testreg.
In the third and final task, the output collected in the testreg register is written to an output file:
- name: Write the output of the test command to a file
  include_role: name=test tasks_from=output.yml
  vars:
    output_file: mytest.txt
    content: |
      {{ testreg.stdout }}
  tags: always
Note that we only need to specify the file name, because we are using the predefined output.yml tasks from the test TPAexec role to write the file; these tasks place the file in a separate output directory for each test, with a subdirectory for each node.
When running the test we can see the usual TPAexec output, which ends with a test recap and test timing information:
$ tpaexec test mycluster mytest
(...)
PLAY RECAP *********************************************************************
c1 : ok=34 changed=3 unreachable=0 failed=0
c2 : ok=27 changed=3 unreachable=0 failed=0
c3 : ok=27 changed=3 unreachable=0 failed=0
cb : ok=22 changed=3 unreachable=0 failed=0
real 0m10.238s
user 0m15.328s
sys 0m2.832s
The output displayed on screen is also copied to $ClusterDir/ansible.log, so you do not need to redirect it to a file.
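For example, you can follow the log from another terminal while a test is running:
$ tail -f $ClusterDir/ansible.log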
The output collected from each cluster node is placed in a directory $ClusterDir/test/$Epoch, where $Epoch is the time when the test was started, in Unix epoch format. In our example the cluster has four nodes c1, c2, c3 and cb, so we find four output files:
$ find test -type f
test/1555594823/c3/mytest.txt
test/1555594823/c1/mytest.txt
test/1555594823/c2/mytest.txt
test/1555594823/cb/mytest.txt
We can verify that each file has the expected output:
$ cat test/1555594823/c1/mytest.txt
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 9.9G 0 disk
├─sda1 8:1 0 8.9G 0 part /
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 1022M 0 part [SWAP]
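If eyeballing the output is not enough, the registered result can also be checked programmatically by adding a task to the playbook. Here is a minimal sketch using Ansible's standard assert module (the condition below is purely illustrative, not something TPAexec requires):
- name: Check that the test command produced the expected output
  assert:
    that:
      - "'disk' in testreg.stdout"
    fail_msg: "lsblk output does not mention any disk device"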
By default, a test should not write to the cluster, or perform any activity that could cause problems for the cluster.
However, some tests need to do exactly that, in order to verify the property they are supposed to test.
So you should ask yourself: "can I run this test on production?"
If the answer is "no", then you must mark the test as destructive by adding an appropriate variable to the task that includes prerequisites, as in the following example:
- include_role: name=test tasks_from=prereqs.yml
  vars:
    destructive: yes
  tags: always
This variable is designed to protect users from accidentally running the wrong test on a production cluster.
This is how it works. If the test is marked as destructive, TPAexec will refuse to run it unless you add the command line option --destroy-this-cluster.
In this example we try to run a destructive test without adding --destroy-this-cluster, and TPAexec stops us:
$ tpaexec test mycluster mytest
(...)
TASK [test : Check if destructive tests should be run] *************************
fatal: [cb]: FAILED! => {
"assertion": "destroy_cluster|default(False)",
"changed": false,
"evaluated_to": false,
"msg": "You must specify --destroy-this-cluster to run destructive tests"
}
NO MORE HOSTS LEFT *************************************************************
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
c1 : ok=31 changed=1 unreachable=0 failed=0
c2 : ok=24 changed=1 unreachable=0 failed=0
c3 : ok=24 changed=1 unreachable=0 failed=0
cb : ok=19 changed=1 unreachable=0 failed=1
real 0m8.793s
user 0m13.160s
sys 0m2.388s
After we add the --destroy-this-cluster option, the test is carried out normally:
$ tpaexec test mycluster mytest --destroy-this-cluster
(...)
PLAY RECAP *********************************************************************
c1 : ok=40 changed=5 unreachable=0 failed=0
c2 : ok=32 changed=5 unreachable=0 failed=0
c3 : ok=32 changed=5 unreachable=0 failed=0
cb : ok=27 changed=5 unreachable=0 failed=0
real 0m11.978s
user 0m16.832s
sys 0m3.260s
One of the problems of destructive tests is that they destroy the cluster (no big surprise). This inconvenience can be mitigated with snapshots, if the platform supports them to a sufficient extent.
The example described in this section uses the vagrant platform on VirtualBox.
When taking a snapshot of a cluster, you need to take a separate snapshot of each node. To avoid inconsistencies while keeping things simple, it is sufficient to shut down all the nodes and then take a snapshot of each node:
vagrant halt
vagrant snapshot save -f node1 snapshot_node1
vagrant snapshot save -f node2 snapshot_node2
vagrant snapshot save -f node3 snapshot_node3
At this point we have taken three single-node snapshots.
If we have any "state query" (i.e. a query that displays the state of the cluster), we can run it now and remember its output for later.
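As a sketch, on a streaming-replication cluster the state query could simply list the attached standbys, saving the output for later comparison (this assumes node1 is the primary and that psql can connect locally as the postgres user; adapt it to your cluster):
$ vagrant ssh node1 -c "sudo -u postgres psql -Atc 'SELECT application_name, state FROM pg_stat_replication'" > state_before.txt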
We can now perform any destructive action, and verify that the output of the state query has changed.
Then we can restore the snapshots with the following commands:
vagrant halt
vagrant snapshot restore --no-provision node1 snapshot_node1
vagrant snapshot restore --no-provision node2 snapshot_node2
vagrant snapshot restore --no-provision node3 snapshot_node3
vagrant up
First, we shut down all the nodes, to avoid problems like the post-restore version of node 1 talking with the pre-restore version of node 3. Then we restore each node snapshot, specifying each target node, and finally we start the restored instances.
At this point, the state query should return the initial state, proving that we have restored the state of the cluster that was previously saved.
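Continuing the sketch above, rerunning the same query after the restore and comparing the two outputs should show no difference:
$ vagrant ssh node1 -c "sudo -u postgres psql -Atc 'SELECT application_name, state FROM pg_stat_replication'" > state_after.txt
$ diff state_before.txt state_after.txt && echo "cluster state restored"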