How to use and troubleshoot etcd

Erik Jones

etcd is a strongly consistent, distributed key-value store that uses the Raft consensus protocol for group consensus between nodes, much like PGD. It is most commonly used as a Distributed Configuration Store (DCS). The name derives from the standard *nix /etc directory that holds system configuration files, with the "d" standing for its distributed, multi-node architecture, which provides both high availability and multiple endpoints that clients may connect to. Thus, etcd's primary and most common use case is as a highly available store for configuration data.

At EnterpriseDB, etcd is a supported DCS component option for Harp in PGD 3.7 and PGD 4 (although using BDR's own Raft component instead is recommended) and is currently the only DCS component that EnterpriseDB supports for Patroni clusters.

etcd Deployment and Configuration

Deployment and configuration of etcd for both PGD and Patroni is typically done via EnterpriseDB's Trusted Postgres Architect (TPA) cluster orchestration tool. This section details how TPA deploys and configures etcd.

The etcd process on each TPA-managed node with etcd in its roles list is run via a simple, custom systemd service unit:

[root@bdr1 /]# cat /etc/systemd/system/etcd.service
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network-online.target local-fs.target remote-fs.target time-sync.target
Wants=network-online.target local-fs.target remote-fs.target time-sync.target

[Service]
User=etcd
Group=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
EnvironmentFile=/etc/etcd/etcd.conf
Restart=always
RestartSec=10s
LimitNOFILE=40000

[Install]
WantedBy=multi-user.target

The important items from that unit file:

  • The %m for ETCD_NAME is filled in by systemd with the system's ID (which can be seen directly in /etc/machine-id; see the quick check after this list).
  • The etcd data directory is at /var/lib/etcd.
  • The main configuration file is /etc/etcd/etcd.conf.

Here is what the etcd.conf file set up by TPA looks like:

[root@zippy /]# cat /etc/etcd/etcd.conf
ETCD_NAME="zippy"
ETCD_ENABLE_V2="true"
ETCD_DATA_DIR="/var/lib/etcd/pge14_patroni_small"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://zippy:2380"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://zippy:2379"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER="quiver=http://quiver:2380,uniform=http://uniform:2380,zippy=http://zippy:2380,"
ETCD_AUTO_COMPACTION_MODE="revision"
ETCD_AUTO_COMPACTION_RETENTION="10"

Here is a breakdown of these configuration settings:

  • ETCD_NAME is the name of the local node in the etcd cluster. Here it is "zippy".
  • ETCD_ENABLE_V2 determines whether or not the cluster should accept etcd v2 protocol client requests. More on that later in this article, but note for now that TPA sets this to "true".
  • ETCD_DATA_DIR is the same setting as in the systemd unit file, but this value overrides it. TPA sets it to a subdirectory of /var/lib/etcd named after the TPA cluster, with any hyphens replaced by underscores.
  • ETCD_INITIAL_ADVERTISE_PEER_URLS is a list of URLs that peer nodes can use to reach this node. Here it is set to the local node/host name with port 2380, which is the official etcd port for peer traffic.
  • ETCD_LISTEN_PEER_URLS is the list of URLs that the local node should listen on to accept requests from peer nodes. Using 0.0.0.0 tells it to listen on all local IPv4 addresses on port 2380.
  • ETCD_LISTEN_CLIENT_URLS is the equivalent for client requests; again 0.0.0.0 listens on all local IPv4 addresses, this time on port 2379.
  • ETCD_ADVERTISE_CLIENT_URLS is a list of URLs that client nodes can use to reach this node. Here it is set to the local node/host name with port 2379, which is the official etcd port for client traffic.
  • ETCD_INITIAL_CLUSTER_STATE should be set to "new" for all nodes present during a given cluster's initialization and "existing" when joining a member node to an existing cluster. You will typically see this left at "new", but it can be changed when a node needs to be added after the initial TPA cluster deployment or when a failed node must be removed and then rejoined (a sketch of that case follows this list).
  • ETCD_INITIAL_CLUSTER is the list of "ETCDNAME=URL" endpoints for the nodes present during initial deployment of the etcd cluster.
  • ETCD_AUTO_COMPACTION_MODE and ETCD_AUTO_COMPACTION_RETENTION determine when etcd should perform compaction (somewhat like vacuum) of its key space. Here ETCD_AUTO_COMPACTION_MODE=revision + ETCD_AUTO_COMPACTION_RETENTION=10 results in a compaction every five minutes to "latest revision" - 10, i.e. it keeps the last 10 revisions.
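
As a hedged sketch of the add-a-node case mentioned for ETCD_INITIAL_CLUSTER_STATE (TPA normally handles this for you, and "newnode" is a hypothetical host name): before starting etcd on the new host with ETCD_INITIAL_CLUSTER_STATE="existing", the new member must first be registered with the running cluster:

etcdctl member add newnode --peer-urls=http://newnode:2380

etcdctl then prints the ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE values to place in the new node's /etc/etcd/etcd.conf before starting its etcd service.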

If you ever need to enable debug logging, you can add a line with ETCD_LOG_LEVEL=debug to this file and restart the etcd service. The default log level is 'info'.
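
For example, a minimal sketch of enabling debug logging on one node (note that /etc/etcd/etcd.conf is managed by TPA, so a later tpaexec deploy may revert this change):

echo 'ETCD_LOG_LEVEL="debug"' >> /etc/etcd/etcd.conf
systemctl restart etcd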

Interacting with etcd

Directly interacting with the etcd cluster is primarily done with the etcdctl CLI utility. It can use either of two API versions, v2 or v3; the two versions accept different sets of commands and command options, and display their output differently. TPA deployments will default to v3:

[root@zippy /]# etcdctl version
etcdctl version: 3.5.15
API version: 3.5

The API version to use can be overridden with the ETCDCTL_API environment variable. Here we set it to version 2, which requires the version command to be given as an option:

[root@bdr1 /]# ETCDCTL_API=2 etcdctl --version
etcdctl version: 3.5.1
API version: 2

All further etcdctl commands and outputs in this article use v3.

  • etcdctl member list - List current member details for the cluster.
[root@zippy /]# etcdctl member list
52efd34a4faeff2a, started, zippy, http://zippy:2380, http://zippy:2379, false
7122c78d3eda3152, started, uniform, http://uniform:2380, http://uniform:2379, false
f3726e6209d84720, started, quiver, http://quiver:2380, http://quiver:2379, false

Each line above contains the member's hexadecimal ID, its started status, name, peer URL, client URL, and whether or not it is a learner.

"Learners" are often confused with "followers". Followers are standard, full-member etcd nodes that are not currently the Raft leader node whereas learner is a special node state where the given node does not participate in etcd raft leader elections. This is useful state to join new nodes with in certain cases for etcd clusters with large data set to prevent new nodes from being elected leader before they have fully synchronized their data set from the rest of the cluster. TPA-managed etcd clusters do not use learners.

Use the -w table option for better formatting:

[root@zippy /]# etcdctl member list -w table
+------------------+---------+---------+---------------------+---------------------+------------+
|        ID        | STATUS  |  NAME   |     PEER ADDRS      |    CLIENT ADDRS     | IS LEARNER |
+------------------+---------+---------+---------------------+---------------------+------------+
| 52efd34a4faeff2a | started |   zippy |   http://zippy:2380 |   http://zippy:2379 |      false |
| 7122c78d3eda3152 | started | uniform | http://uniform:2380 | http://uniform:2379 |      false |
| f3726e6209d84720 | started |  quiver |  http://quiver:2380 |  http://quiver:2379 |      false |
+------------------+---------+---------+---------------------+---------------------+------------+

Where relevant/available, all commands below will use the -w table option.

  • etcdctl endpoint status -w table - Show detailed status info for the current node
[root@zippy /]# etcdctl endpoint status  -w table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 52efd34a4faeff2a |  3.5.15 |   70 kB |     false |      false |         4 |        107 |                107 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
  • etcdctl endpoint status --cluster -w table - Show detailed status info for all nodes in the cluster.
[root@zippy /]# etcdctl endpoint status --cluster -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|   http://zippy:2379 | 52efd34a4faeff2a |  3.5.15 |   70 kB |     false |      false |         4 |        107 |                107 |        |
| http://uniform:2379 | 7122c78d3eda3152 |  3.5.15 |   70 kB |      true |      false |         4 |        107 |                107 |        |
|  http://quiver:2379 | f3726e6209d84720 |  3.5.15 |   74 kB |     false |      false |         4 |        107 |                107 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
  • etcdctl move-leader --endpoints="<url_endpoints_list>" <member_id> - Move raft leader status to a specific node, i.e. promote that node to raft leader.

The <url_endpoints_list> must contain at least the current leader's client URL. So, starting from the cluster endpoints just listed, where uniform was the leader, this makes zippy the leader:

[root@zippy /]# etcdctl move-leader --endpoints='http://uniform:2379' 52efd34a4faeff2a
Leadership transferred from 7122c78d3eda3152 to 52efd34a4faeff2a

[root@zippy /]# etcdctl endpoint status --cluster -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|   http://zippy:2379 | 52efd34a4faeff2a |  3.5.15 |   70 kB |      true |      false |         5 |        113 |                113 |        |
| http://uniform:2379 | 7122c78d3eda3152 |  3.5.15 |   70 kB |     false |      false |         5 |        113 |                113 |        |
|  http://quiver:2379 | f3726e6209d84720 |  3.5.15 |   74 kB |     false |      false |         5 |        113 |                113 |        |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
  • etcdctl put <key> <value> and etcdctl get <key> - Write and read data from the etcd cluster.

While it's rare to need to write or read data directly from an etcd cluster, it is fairly straightforward to do:

[root@zippy /]# etcdctl put mykey1 'this is my value'
OK

[root@zippy /]# etcdctl get mykey1
mykey1
this is my value

The above is only a sampling of what can be done when manually writing to and reading from an etcd cluster: you can also read prior revision values of a given key, watch as the values of a given key change, and more. See the etcd documentation for more details if needed.
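
For example, two more hedged sketches using the key written above (the revision number here is arbitrary and must not have been compacted away yet):

etcdctl get mykey1 --rev=2    # read the value mykey1 had as of store revision 2
etcdctl watch mykey1          # print each change to mykey1 as it happens, until interrupted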

Monitoring and Debugging etcd issues

etcd logging

As with most processes and services, the logs are the primary record of what happened and when. Since TPA's etcd is run via a systemd service, it logs to syslog, and because TPA does not configure a dedicated location for syslog to send etcd log data to, it goes only to /var/log/messages.

If you wish to have your etcd logging go to dedicated log files under /var/log/etcd you can run the following on each etcd host:

cat << 'ETCD_RSYSLOG' > /etc/rsyslog.d/25-etcd.conf
if $programname == 'etcd' then {
    action(
        type="omfile"
        DirOwner="etcd"
        DirGroup="etcd"
        DirCreateMode="0750"
        FileOwner="etcd"
        FileGroup="etcd"
        FileCreateMode="0640"
        File="/var/log/etcd/etcd.log"
    )
}
ETCD_RSYSLOG

systemctl restart rsyslog

After that rsyslog service restart, new etcd log messages will appear in /var/log/etcd in addition to /var/log/messages. Further, since the new /etc/rsyslog.d/25-etcd.conf file is not managed by TPA, it will not be edited or removed by future tpaexec deploy runs.

Things to look for when examining etcd's logs for issues:

Search for standard keywords like "warning" and "error".
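
For example, a quick (hedged) scan of the syslog that relies on the JSON log format shown in the samples below:

grep 'etcd\[' /var/log/messages | grep -E '"level":"(warn|error)"'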

As noted earlier in this article, debug logging can be enabled for a given node by adding a line with ETCD_LOG_LEVEL=debug to /etc/etcd/etcd.conf and restarting the etcd service with systemctl restart etcd.

Messages indicating that etcd operations took too long, with that "too long" threshold being 100ms.

These will typically be logged as warnings as they do not indicate outright failures. Frequent occurrences of these messages are indicative of storage or networking issues.

Here the message is "waiting for ReadIndex response took too long, retrying":

Nov 10 01:55:34 <hostname> etcd[3704281]:
  {"level":"warn",
   "ts":"2024-11-10T01:55:34.599Z",
   "caller":"etcdserver/v3_server.go:815",
   "msg":"waiting for ReadIndex response took too long, retrying",
   "sent-request-id":10006596413400380916,"retry-timeout":"500ms"}

And here it is "apply request took too long":

Nov 10 01:55:37 <hostname> etcd[3704281]:
  {"level":"warn",
   "ts":"2024-11-10T01:55:37.173Z",
   "caller":"etcdserver/util.go:166",
   "msg": "apply request took too long",
   "took":"2.291176116s",
   "expected-duration":"100ms",
   "prefix":"read-only range ",
   "request":"key:\"/warp/bdrgroup/locations/JNB/leader\" ","
   response":"range_response_count:0 size:7"}

Other etcd log messages indicating networking, disk, or CPU issues causing slow operations or timeouts.

This message calls out a Raft (networking) round trip i/o timeout. This could be the network link or some other slowness issue on either end due to slow disk or high CPU:

Nov 09 12:26:03 <hostname> etcd[170401]:
  {"level":"warn",
   "ts":"2024-11-09T12:26:03.016Z",
   "caller":"rafthttp/probing_status.go:68",
   "msg":"prober detected unhealthy status",
   "round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE",
   "remote-peer-id":"74a639e0cde1ae9b",
   "rtt":"1.883253ms",
   "error":"dial tcp <ip_address>:2380: i/o timeout"}

This message specifically calls out slow disk but that is just the etcd developers trying to be helpful as it could still be due to a networking or CPU issue:

Nov 10 23:51:55 <hostname> etcd[1424768]:
    {"level":"warn",
     "ts":"2024-11-10T23:51:55.203Z",
     "caller":"etcdserver/raft.go:369",
     "msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk",
     "to":"ab41c09566fa11bc",
     "heartbeat-interval":"100ms",
     "expected-duration":"200ms",
     "exceeded-duration":"210.864776ms"}

harp-manager or harp-proxy issues talking to etcd nodes

These will typically show up similar to the following; note the references to "etcd-client" and "read: connection timed out":

Nov 10 01:55:33 <hostname> harp-manager[170352]:
  {"level":"warn",
   "ts":"2024-11-10T01:55:33.795883Z",
   "logger":"etcd-client",
   "caller":"v3@v3.5.12/retry_interceptor.go:62",
   "msg":"retrying of unary invoker failed",
   "target":"etcd-endpoints://0xc00015c000/<hostname>:2379","attempt":0,
   "error":"rpc error: code = Unavailable desc = error reading from server: read tcp <ip_address>:41396-><ip_address>:2379: read: connection timed out"}

etcd metrics endpoints

etcd tracks and provides Prometheus metrics for its operations, available via the http://<address>:2379/metrics endpoint on each node. A Prometheus server/app is not needed to view them, though, as curl can be used. Note that this URL outputs a lot of metrics data, not only for etcd itself but also for Go (which etcd is written in) and gRPC (which etcd uses for its inter-node communication), so you'll want to pipe the output of the curl command through grep to get the metrics you want.

Here are some important metrics to look at when diagnosing performance issues with etcd nodes:

  • etcd_disk_wal_fsync_duration_seconds - Tracks disk fsync latency times for etcd's fsync calls when it persists its own WAL entries to disk, much like Postgres does with its own WAL when transactions commit.
  • etcd_network_peer_round_trip_time_seconds - Tracks requests/response times from the current node to each of the other nodes in the cluster.
  • See the etcd documentation for the full list of available metrics.

The output for each metric is a histogram:

[root@zippy /]# curl -s -L http://127.0.0.1:2379/metrics | egrep etcd_disk_wal_fsync_duration_seconds
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 4
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 5
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 5
etcd_disk_wal_fsync_duration_seconds_sum 0.0022687000000000002
etcd_disk_wal_fsync_duration_seconds_count 5

As the above values were taken from an inactive cluster, explaining how to read them would be a bit confusing, so here is actual output seen on a production deployment:

etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 369809
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 408574
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 410173
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 410687
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 410936
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 410952
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 410961
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 410967
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 410971
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 410978
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 410978
etcd_disk_wal_fsync_duration_seconds_sum 248.82545950300377
etcd_disk_wal_fsync_duration_seconds_count 410978

Each bucket's upper bound in seconds is shown in the braces as the le="<seconds>" label, and each bucket contains the count of all requests that took no longer than that value. So here we can see (a sketch for doing this arithmetic against the live metrics endpoint follows this list):

  • This tracked a total of 410978 fsyncs as detailed by the etcd_disk_wal_fsync_duration_seconds_count value.
  • The total time for those 410978 fsyncs was 248.82545950300377 seconds as detailed by the etcd_disk_wal_fsync_duration_seconds_sum value.
  • The etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} value shows how many requests took less than infinity, which is just another way of counting all of them.
  • Since all requests took less than 0.512 seconds, that bucket and all of the buckets for larger values show the full count.
  • The 0.256 bucket has a lower value, though, so from that we can calculate that 7 fsyncs (410978 - 410971) took between 0.256 and 0.512 seconds.
  • The number of fsyncs that took longer than 0.064 seconds was 410978 (total) - 410961 (at most 0.064) = 17.
  • So this isn't too bad as the vast majority of these fsyncs took well under that 100ms threshold when etcd would start complaining.
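
As an illustration, here is a hedged sketch that performs the 0.064-second calculation above directly against the live metrics endpoint (metric names and bucket boundary exactly as shown above):

curl -s -L http://127.0.0.1:2379/metrics | awk '
  $1 == "etcd_disk_wal_fsync_duration_seconds_bucket{le=\"0.064\"}" { bucket = $2 }
  $1 == "etcd_disk_wal_fsync_duration_seconds_count"                { total  = $2 }
  END { print total - bucket, "fsyncs took longer than 0.064s" }'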

Here, though, are some etcd_network_peer_round_trip_time_seconds values for that same cluster:

etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0004"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0008"} 6
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0016"} 27
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0032"} 92
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0064"} 674
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0128"} 1183
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0256"} 1875
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.0512"} 2873
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.1024"} 4455
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.2048"} 6876
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.4096"} 8692
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="0.8192"} 9366
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="1.6384"} 9455
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="3.2768"} 9462
etcd_network_peer_round_trip_time_seconds_bucket{To="3742a00670c8e1fd",le="+Inf"} 9462
etcd_network_peer_round_trip_time_seconds_sum{To="3742a00670c8e1fd"} 1529.2726292220068
etcd_network_peer_round_trip_time_seconds_count{To="3742a00670c8e1fd"} 9462

etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0002"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0004"} 2
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0008"} 15
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0016"} 60
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0032"} 530
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0064"} 855
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0128"} 1289
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0256"} 1919
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.0512"} 2854
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.1024"} 4475
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.2048"} 6857
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.4096"} 8672
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="0.8192"} 9339
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="1.6384"} 9527
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="3.2768"} 9533
etcd_network_peer_round_trip_time_seconds_bucket{To="9e77415325aab7d",le="+Inf"} 9533
etcd_network_peer_round_trip_time_seconds_sum{To="9e77415325aab7d"} 1602.121938323003
etcd_network_peer_round_trip_time_seconds_count{To="9e77415325aab7d"} 9533

Here we have two sets of round trip time metrics, separated with a blank line, since this was a three-node cluster. We can see that this node was having serious network issues: more than half of the round trips to each of its peers took longer than roughly 100ms (for the first peer, 9462 - 4455 = 5007 of 9462 round trips took longer than 0.1024 seconds), with a long tail stretching out past 0.8 seconds.
