`etcd` is a strongly consistent, distributed key-value store that uses the Raft consensus protocol for group consensus between nodes, much like PGD. It is most commonly used as a Distributed Configuration Store, aka DCS, hence the name: `etcd` derives from the standard *nix `/etc` directory that holds system configuration files, and the `d` is for its distributed, multi-node architecture, which provides both high availability and multiple endpoints for clients to connect to. Thus, `etcd`'s primary and most common use case is as a highly available store for configuration data.
At EnterpriseDB, `etcd` is a supported DCS component option for PGD 3.7's and PGD 4's Harp (although using BDR's own Raft component instead is recommended) and is currently the only DCS component that EnterpriseDB supports for Patroni clusters. Deployment and configuration of `etcd` for both PGD and Patroni is typically done via EnterpriseDB's Trusted Postgres Architect (TPA) cluster orchestration tool. This section will detail how TPA deploys and configures `etcd`.
The `etcd` process on each TPA-managed node with `etcd` in its roles list is run via a simple, custom `systemd` service unit:
[root@bdr1 /]# cat /etc/systemd/system/etcd.service
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network-online.target local-fs.target remote-fs.target time-sync.target
Wants=network-online.target local-fs.target remote-fs.target time-sync.target
[Service]
User=etcd
Group=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
EnvironmentFile=/etc/etcd/etcd.conf
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
The important items from that:

- The `%m` for `ETCD_NAME` is filled in by `systemd` with the system's machine ID (which can be seen directly in `/etc/machine-id`).
- The `etcd` data directory is at `/var/lib/etcd`.
- The main configuration file is `/etc/etcd/etcd.conf`.
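If you want to confirm how the unit resolves on a given host, stock `systemd` tooling is enough; nothing here is etcd-specific:

```
# View the unit file systemd is actually using and check the service state
systemctl cat etcd
systemctl status etcd

# The machine ID that %m would expand to if ETCD_NAME were not overridden
cat /etc/machine-id
```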
Here is what the `etcd.conf` file set up by TPA looks like:
[root@zippy /]# cat /etc/etcd/etcd.conf
ETCD_NAME="zippy"
ETCD_ENABLE_V2="true"
ETCD_DATA_DIR="/var/lib/etcd/pge14_patroni_small"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://zippy:2380"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://zippy:2379"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER="quiver=http://quiver:2380,uniform=http://uniform:2380,zippy=http://zippy:2380,"
ETCD_AUTO_COMPACTION_MODE="revision"
ETCD_AUTO_COMPACTION_RETENTION="10"
Here is a breakdown of these configuration settings:
- `ETCD_NAME` is the name of the local node in the `etcd` cluster. Here it is "zippy".
- `ETCD_ENABLE_V2` determines whether or not the cluster should accept `etcd` v2 protocol client requests. More on that later in this article, but note for now that we set this to "true".
- `ETCD_DATA_DIR` is the same setting as in the `systemd` unit file, but this value overrides it. We set it to a subdirectory of `/var/lib/etcd` named after the TPA cluster, with any hyphens replaced by underscores.
- `ETCD_INITIAL_ADVERTISE_PEER_URLS` is a list of URLs that peer nodes can use to reach this node. Here we set it to the local node/host name with port 2380, which is the official `etcd` port for peer traffic.
- `ETCD_LISTEN_PEER_URLS` is the list of URLs that the local node should listen on to accept requests from peer nodes. By using `0.0.0.0` we tell it to listen on all local IPv4 addresses on port 2380.
- `ETCD_LISTEN_CLIENT_URLS` is likewise the list of URLs that the local node should listen on to accept client requests, again on all local IPv4 addresses but on port 2379.
- `ETCD_ADVERTISE_CLIENT_URLS` is a list of URLs that client nodes can use to reach this node. Here we set it to the local node/host name with port 2379, which is the official `etcd` port for client traffic.
- `ETCD_INITIAL_CLUSTER_STATE` should be set to "new" for all nodes present during a given cluster's initialization and "existing" when joining a member node to an existing cluster. You will typically always see this still at "new", but it can be changed in cases where a node needs to be added after the initial TPA cluster deployment or when a failed node needs to be removed and then rejoined.
- `ETCD_INITIAL_CLUSTER` is the list of "name=peer URL" endpoints for the nodes present during initial deployment of the `etcd` cluster.
- `ETCD_AUTO_COMPACTION_MODE` and `ETCD_AUTO_COMPACTION_RETENTION` determine when `etcd` should perform compaction (somewhat like `vacuum`) of its key space. Here `ETCD_AUTO_COMPACTION_MODE=revision` plus `ETCD_AUTO_COMPACTION_RETENTION=10` results in a compaction every five minutes to `"latest revision" - 10`, i.e. it keeps the last 10 revisions.
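As a hypothetical sketch of that last point about `ETCD_INITIAL_CLUSTER_STATE`: if a fourth node named `xray` (a made-up name, not part of the cluster above) were added after the initial deployment, it would first be registered from an existing member with `etcdctl member add xray --peer-urls=http://xray:2380`, and its own `etcd.conf` would then differ from the others roughly like this before its `etcd` service is started:

```
# Hypothetical fragment for a node joining the already-running cluster
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_CLUSTER="quiver=http://quiver:2380,uniform=http://uniform:2380,zippy=http://zippy:2380,xray=http://xray:2380"
```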
If you ever need to enable debug logging, you can add a line with `ETCD_LOG_LEVEL=debug` to this file and restart the `etcd` service. The default log level is 'info'.
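For example, a minimal sketch of doing that on one node, assuming the TPA-managed paths shown above:

```
# Append the debug log level setting and restart the service
echo 'ETCD_LOG_LEVEL="debug"' >> /etc/etcd/etcd.conf
systemctl restart etcd

# Remember to remove the line and restart again once done debugging
```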
Directly interacting with the `etcd` cluster is primarily done with the `etcdctl` CLI utility. There are two API versions that it can use, v2 or v3, with the two versions accepting different sets of commands and options and displaying their outputs differently. TPA deployments default to v3:
[root@zippy /]# etcdctl version
etcdctl version: 3.5.15
API version: 3.5
The API version to use can be overridden with the `ETCDCTL_API` environment variable. Here we set it to version 2, which requires the version to be requested via the `--version` option rather than the `version` command:
[root@bdr1 /]# ETCDCTL_API=2 etcdctl --version
etcdctl version: 3.5.1
API version: 2
All further `etcdctl` commands and outputs in this article use v3.
- `etcdctl member list` - List current member details for the cluster.
[root@zippy /]# etcdctl member list
52efd34a4faeff2a, started, zippy, http://zippy:2380, http://zippy:2379, false
7122c78d3eda3152, started, uniform, http://uniform:2380, http://uniform:2379, false
f3726e6209d84720, started, quiver, http://quiver:2380, http://quiver:2379, false
Each line above contains the member's hexadecimal ID, its started status, name, peer URL, client URL, and whether or not it is a learner.
"Learners" are often confused with "followers". Followers are standard, full-member etcd
nodes that are not currently the Raft leader node whereas learner is a special node state where the given node does not participate in etcd
raft leader elections. This is useful state to join new nodes with in certain cases for etcd
clusters with large data set to prevent new nodes from being elected leader before they have fully synchronized their data set from the rest of the cluster. TPA-managed etcd
clusters do not use learners.
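For reference only (TPA does not do this), here is a minimal sketch of how a learner could be added and later promoted with stock `etcdctl`, using a hypothetical node name and URL:

```
# Register the new node as a learner, then start etcd on it with
# ETCD_INITIAL_CLUSTER_STATE="existing" as described earlier
etcdctl member add newnode --peer-urls=http://newnode:2380 --learner

# Once it has caught up, promote it to a full voting member by its member ID
etcdctl member promote <member_id>
```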
Use the `-w table` option for better formatting:
[root@zippy /]# etcdctl member list -w table
+------------------+---------+---------+---------------------+---------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------+---------------------+---------------------+------------+
| 52efd34a4faeff2a | started | zippy | http://zippy:2380 | http://zippy:2379 | false |
| 7122c78d3eda3152 | started | uniform | http://uniform:2380 | http://uniform:2379 | false |
| f3726e6209d84720 | started | quiver | http://quiver:2380 | http://quiver:2379 | false |
+------------------+---------+---------+---------------------+---------------------+------------+
Where relevant and available, the commands below use the `-w table` option.
- `etcdctl endpoint status -w table` - Show detailed status info for the current node.
[root@zippy /]# etcdctl endpoint status -w table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 52efd34a4faeff2a | 3.5.15 | 70 kB | false | false | 4 | 107 | 107 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- `etcdctl endpoint status --cluster -w table` - Show detailed status info for all nodes in the cluster.
[root@zippy /]# etcdctl endpoint status --cluster -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://zippy:2379 | 52efd34a4faeff2a | 3.5.15 | 70 kB | false | false | 4 | 107 | 107 | |
| http://uniform:2379 | 7122c78d3eda3152 | 3.5.15 | 70 kB | true | false | 4 | 107 | 107 | |
| http://quiver:2379 | f3726e6209d84720 | 3.5.15 | 74 kB | false | false | 4 | 107 | 107 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- `etcdctl move-leader --endpoints="<url_endpoints_list>" <member_id>` - Move Raft leader status to a specific node, i.e. promote that node to Raft leader.

The `<url_endpoints_list>` must contain at least the current leader's client URL. So, starting from the cluster status just shown, where `uniform` was the leader, to make `zippy` the leader:
[root@zippy /]# etcdctl move-leader --endpoints='http://uniform:2379' 52efd34a4faeff2a
Leadership transferred from 7122c78d3eda3152 to 52efd34a4faeff2a
[root@zippy /]# etcdctl endpoint status --cluster -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://zippy:2379 | 52efd34a4faeff2a | 3.5.15 | 70 kB | true | false | 5 | 113 | 113 | |
| http://uniform:2379 | 7122c78d3eda3152 | 3.5.15 | 70 kB | false | false | 5 | 113 | 113 | |
| http://quiver:2379 | f3726e6209d84720 | 3.5.15 | 74 kB | false | false | 5 | 113 | 113 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- `etcdctl put <key> <value>` and `etcdctl get <key>` - Write and read data from the `etcd` cluster.

While it's rare to need to write or read data directly from an `etcd` cluster, it is fairly straightforward to do:
[root@zippy /]# etcdctl put mykey1 'this is my value'
OK
[root@zippy /]# etcdctl get mykey1
mykey1
this is my value
The above is only a sampling of what can be done when manually writing to and reading from an `etcd` cluster, as you can also read prior revision values of a given key, watch as values for a given key change, and more. See the etcd documentation for more details on that if needed.
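As a quick illustration of those last two points, continuing with the `mykey1` key from above (the flags are stock `etcdctl` v3; the revision number is just illustrative):

```
# Read the value of the key as of an earlier revision number
etcdctl get mykey1 --rev=2

# Watch the key and print changes as they happen (Ctrl+C to stop)
etcdctl watch mykey1
```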
As with most processes and services, the logs are the primary record of what happened and when. Since TPA's `etcd` is run via a `systemd` service it logs to `syslog`, and TPA does not configure a dedicated location for `syslog` to send `etcd` log data to, so it goes only to `/var/log/messages`.
If you wish to have your `etcd` logging go to dedicated log files under `/var/log/etcd`, you can run the following on each `etcd` host:
cat << 'ETCD_RSYSLOG' > /etc/rsyslog.d/25-etcd.conf
if $programname == 'etcd' then {
action(
type="omfile"
DirOwner="etcd"
DirGroup="etcd"
DirCreateMode="0750"
FileOwner="etcd"
FileGroup="etcd"
FileCreateMode="0640"
File="/var/log/etcd/etcd.log"
)
}
ETCD_RSYSLOG
systemctl restart rsyslog
After that `rsyslog` service restart, new `etcd` log messages will appear in `/var/log/etcd` in addition to `/var/log/messages`. Further, since that new `/etc/rsyslog.d/25-etcd.conf` file is not managed by TPA, it will not be edited or removed by future `tpaexec deploy` runs.
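A quick way to confirm the new rule is routing messages as expected is to emit a test message tagged as `etcd` (the message text here is made up):

```
# logger's -t sets the syslog tag, which rsyslog matches as $programname
logger -t etcd "rsyslog routing test"
tail -n 1 /var/log/etcd/etcd.log
```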
As noted earlier in this article, debug logging can be enabled for a given node by adding a line with `ETCD_LOG_LEVEL=debug` to `/etc/etcd/etcd.conf` and restarting the `etcd` service with `systemctl restart etcd`.
The following slow-operation messages are typically logged as warnings, as they do not indicate outright failures, but frequent occurrences of them are indicative of storage or networking issues.

Here the message is "waiting for ReadIndex response took too long, retrying":
Nov 10 01:55:34 <hostname> etcd[3704281]:
{"level":"warn",
"ts":"2024-11-10T01:55:34.599Z",
"caller":"etcdserver/v3_server.go:815",
"msg":"waiting for ReadIndex response took too long, retrying",
"sent-request-id":10006596413400380916,"retry-timeout":"500ms"}
And here it is "apply request took too long":
Nov 10 01:55:37 <hostname> etcd[3704281]:
{"level":"warn",
"ts":"2024-11-10T01:55:37.173Z",
"caller":"etcdserver/util.go:166",
"msg": "apply request took too long",
"took":"2.291176116s",
"expected-duration":"100ms",
"prefix":"read-only range ",
"request":"key:\"/warp/bdrgroup/locations/JNB/leader\" ","
response":"range_response_count:0 size:7"}
Other `etcd` log messages indicate networking, disk, or CPU issues causing slow operations or timeouts.

This message calls out a Raft (networking) round trip i/o timeout. This could be the network link itself or slowness on either end due to slow disk or high CPU:
Nov 09 12:26:03 <hostname> etcd[170401]:
{"level":"warn",
"ts":"2024-11-09T12:26:03.016Z",
"caller":"rafthttp/probing_status.go:68",
"msg":"prober detected unhealthy status",
"round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE",
"remote-peer-id":"74a639e0cde1ae9b",
"rtt":"1.883253ms",
"error":"dial tcp <ip_address>:2380: i/o timeout"}
This message specifically calls out slow disk, but that is just the `etcd` developers trying to be helpful, as it could still be due to a networking or CPU issue:
Nov 10 23:51:55 <hostname> etcd[1424768]:
{"level":"warn",
"ts":"2024-11-10T23:51:55.203Z",
"caller":"etcdserver/raft.go:369",
"msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk",
"to":"ab41c09566fa11bc",
"heartbeat-interval":"100ms",
"expected-duration":"200ms",
"exceeded-duration":"210.864776ms"}
Errors from `etcd` client applications, such as harp-manager here, will typically show up similar to this; note the references to "etcd-client" and "read: connection timed out":
Nov 10 01:55:33 <hostname> harp-manager[170352]:
{"level":"warn",
"ts":"2024-11-10T01:55:33.795883Z",
"logger":"etcd-client",
"caller":"v3@v3.5.12/retry_interceptor.go:62",
"msg":"retrying of unary invoker failed",
"target":"etcd-endpoints://0xc00015c000/<hostname>:2379","attempt":0,
"error":"rpc error: code = Unavailable desc = error reading from server: read tcp <ip_address>:41396-><ip_address>:2379: read: connection timed out"}
`etcd` tracks and provides Prometheus metrics for its operations, available via the `http://<address>:2379/metrics` endpoint on each node. A Prometheus server/app is not needed to view them, though, as `curl` can be used. Note that that URL outputs a lot of metrics data, covering not only `etcd` itself but also Go (which `etcd` is written in) and gRPC (which `etcd` uses for its inter-node communication), so you'll almost always want to pipe the output of the `curl` command through `grep` to get the metrics you want.
Here are some important metrics to look at when diagnosing performance issues with `etcd` nodes:

- `etcd_disk_wal_fsync_duration_seconds` - Tracks disk latency for `etcd`'s `fsync` calls when it persists its own WAL entries to disk, much like Postgres does with its own WAL when transactions commit.
- `etcd_network_peer_round_trip_time_seconds` - Tracks request/response times from the current node to each of the other nodes in the cluster.
- See the etcd documentation for the full list of available metrics.
The output for each of these metrics is a histogram:
[root@zippy /]# curl -s -L http://127.0.0.1:2379/metrics | egrep etcd_disk_wal_fsync_duration_seconds
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 4
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 5
etcd_disk_wal_fsync_duration_seconds_sum 0.0022687000000000002
etcd_disk_wal_fsync_duration_seconds_count 5
As the above values were taken from an inactive cluster, explaining how to read them would be a bit confusing, so here is actual output seen on a production deployment:
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 369809
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 408574
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 410173
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 410687
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 410936
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 410952
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 410961
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 410967
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 410971
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 410978
etcd_disk_wal_fsync_duration_seconds_sum 248.82545950300377
etcd_disk_wal_fsync_duration_seconds_count 410978
Each bucket's upper bound is shown in the braces as the `le="<seconds>"` label, and each bucket contains the count of all requests that took less than or equal to that value. So here we can see:
- This tracked a total of 410978 `fsync`s, as given by the `etcd_disk_wal_fsync_duration_seconds_count` value.
- The total time for those 410978 `fsync`s was 248.82545950300377 seconds, as given by the `etcd_disk_wal_fsync_duration_seconds_sum` value.
- `etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"}` shows how many requests took less than infinity, which is just a different way of counting all of them.
- Since all requests took less than 0.512 seconds, that bucket and all of the buckets for larger values show the full count.
- The 0.256 bucket has a lower value, though, so from that we can calculate that 7 `fsync`s (410978 - 410971) took between 0.256 and 0.512 seconds.
- The number of `fsync`s that took longer than 0.064 seconds was 410978 (total) - 410961 (at most 0.064 seconds) = 17.
- So this isn't too bad, as the vast majority of these `fsync`s took well under the 100ms threshold at which `etcd` would start complaining.
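If you want to do that arithmetic without reading the buckets by hand, a small sketch along these lines works against the local metrics endpoint (the 0.064 threshold and the local client URL are assumptions to adjust as needed):

```
# Report how many WAL fsyncs exceeded the chosen bucket's upper bound
curl -s -L http://127.0.0.1:2379/metrics \
  | awk '/etcd_disk_wal_fsync_duration_seconds_bucket.*le="0.064"/ {fast=$2}
         /^etcd_disk_wal_fsync_duration_seconds_count/ {total=$2}
         END {if (total > 0)
                printf "%d of %d fsyncs took longer than 64ms (%.3f%%)\n",
                       total - fast, total, 100 * (total - fast) / total}'
```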
Here, though, are some `etcd_network_peer_round_trip_time_seconds` values for that same cluster:
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0004"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0008"} 6
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0016"} 27
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0032"} 92
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0064"} 674
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0128"} 1183
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0256"} 1875
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0512"} 2873
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.1024"} 4455
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.2048"} 6876
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.4096"} 8692
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.8192"} 9366
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="1.6384"} 9455
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="3.2768"} 9462
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="+Inf"} 9462
etcd_network_peer_round_trip_time_seconds_sum{To="3742a00670c8e1fd"} 1529.2726292220068
etcd_network_peer_round_trip_time_seconds_count{To="3742a00670c8e1fd"} 9462
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0002"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0004"} 2
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0008"} 15
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0016"} 60
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0032"} 530
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0064"} 855
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0128"} 1289
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0256"} 1919
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0512"} 2854
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.1024"} 4475
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.2048"} 6857
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.4096"} 8672
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.8192"} 9339
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="1.6384"} 9527
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="3.2768"} 9533
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="+Inf"} 9533
etcd_network_peer_round_trip_time_seconds_sum{To="9e77415325aab7d"} 1602.121938323003
etcd_network_peer_round_trip_time_seconds_count{To="9e77415325aab7d"} 9533
Here we have two sets of round trip time metrics, separated above by a blank line, since this was a three node cluster and each set covers one of the two peer nodes as seen from this node. We can see that this node was having serious network issues: more than half of the round trips to both of its peers took longer than 100ms, and roughly a quarter took longer than 200ms.
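For completeness, the above data can be pulled from a node the same way as the fsync histogram, e.g.:

```
# Round trip time histograms, one set per peer member ID
curl -s -L http://127.0.0.1:2379/metrics | grep etcd_network_peer_round_trip_time_seconds
```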