-
RTO - recovery time objective, target time to recover normal activities after a failure
-
RPO - recovery point objective, amount of data that can be lost
-
node - member of a cluster
-
ccm - consensus cluster membership, determination of cluster members and sharing this information
-
quorum - majority (eg. 50% + 1) defines a cluster partition has quorum (is "quorate")
-
epoch - version of cluste metadata
-
split brain - competing cluster "groups" that do not know about each other
-
fencing - prevention of access to shared resource, eg. via STONITH
-
resources, anything managed by cluster (eg. IP, service, filesystems...), defined in CIB, use RA scripts, active/passive or active/active
-
primitive resource - single instance on one node
-
group resource - group of one or more privitive resources as a single entity, managed as whole, order is important!
-
clone resource - resources running simultaneously on multiple nodes at the same time
- anonymous clone - exactly the same primitive runs on each node
- globally unique clone - distinct primitives run on each node, unique identity (eg. unique IP)
-
multi-state resource - special clone resource, active or passive, promote/demote for active/passive
-
promote resource action - promotes a resource from a slave resource to a master one
-
demote resource action - demotes a resource from a master resource to a slave one
- corosync - messaging and membership layer (can replicate data across cluster?)
- CRM -
crmd/pacemaker-controld, cluster resource manager, CRM, part of resource allocation layer,crmdis main process; maintains a consistent view of the cluster membership and orchestrates all the other components - CIB -
cib/pacemaker-based, cluster information base, configuration, current status, synchronized the CIB across the cluster and handles requests to modify it pacemaker, part of resource allocation layer; shared copy of state, versioned - DC - designated controller, in-memory state, member managing the master copy of the CIB, so-called master node, communicate changes of the CIB copy to other nodes via CRM
- PE -
pegnine/pacemaker-schedulerd, policy engine, running on DC, the brain of the cluster; the scheduler determines which actions are necessary to achieve the desired state of the cluster; the input is a snapshot of the CIB monitors CIB and calculates changes required to align with desired state, informs CRM - LRM -
lrm/pacemaker-exec, local resource manager, instructed from CRM what to do, local executor - RA - resource agent, logic to start/stop/monitor a resource, called from LRM and return values are passed to the CRM, ideally OCF, LSB, systemd service units or STONITH
- OCF - open cluster framework, standardized resource agents
- STONITH - "shoot the other node in the head", fencing resource agent, eg. via IPMI…
- DLM - distributed lock manager, cluster wide locking (
ocf:pacemaker:controld) - CLVM - cluster logical volume manager,
lvmlockd, protects LVM metadata on shared storage - pacemaker-attrd - attribute manager, maintains a database of attributes for all the cluster nodes; the attributes are synchronized across the cluster; the attributes are usually recorded in the CIB (ie. not all!)
-
maintenance/standby does not make corosync ring detection ineffective! That is, node can be fenced even if it is in maintenance!
-
OS shutdown/reboot can cause a resource to be killed if it runs in user slice (eg. SAP or old Oracle DB)!
-
maintenances do NOT run monitor operation thus
crm_monoutput does not need to show reality! -
maintenance mode - global cluster property, no resource monitoring, no action on resource state change
-
node maintenance - monitoring operations will cease for the node, no action on node state change
-
resource maintenace - no monitoring of a resource, useful for changes in resource without cluster interference
-
is-managed mode - like resource maintenace mode except cluster still monitors resource, reports any failures but does not do any action
-
standby node mode - a node cannot run resources but still participates in quorum decisions
Best practice:
- standby
- stop cluster services (this includes corosync)
stonith_admin -L # list registered fencing devices
stonith_admin -I # list available fencing devicesstonith device helpers are either programs available as
/usr/sbin/fence_<device>, scripts in
/usr/lib64/stonith/plugins/external/ or libs in
/usr/lib64/stonith/plugins/stonith2.
stonith_admin -M -a <fence_device_agent> # show docs# testing fence_virsh for libvirt VMs
fence_virsh -x -a <libvirt_host> -l <username> \
-k <ssh_private_key> -o status -v -n <vm/domain>crm node fence
fence_virsh -a host -l -x -k <ssh_private_key> -o -v -n
stonith_admin -L- multicast, when used check that switch is configured correctly
(see IGMP snooping -
forwarding only to valid ports instead of broadcasting)
It usually does NOT work in clouds.
# for virtual bridge (not openvswitch!) grep -H '' /sys/class/net/<bridge_name>/bridge/multicast_snooping /sys/class/net/docker0/bridge/multicast_snooping:1 # for openvswitch ovs-vsctl list bridge | grep mcast_snooping_enable - unicast, usually better; clouds needs higher token (eg. 30000) and consensus (eg. 36000); see Corosync Communication Failure
NOTE:
- corosync time values is in miliseconds!
token: 5000 (ms) = 5s timeouttoken_retransmits_before_loss_consts: 10 - means how many instances of token to send in token timeout interval- corosync ports note, see also a general ports as defined in RH
docs
or see firewalld
high-availability.xml(note mostly RH specific!)$ man corosync.conf | col -b | sed -n '/^ *mcastport/,/^ *$/{/^ *$/q; p}' | fmt -w72 mcastport This specifies the UDP port number. It is possible to use the same multicast address on a network with the corosync services configured for different UDP ports. Please note corosync uses two UDP ports mcastport (for mcast receives) and mcastport - 1 (for mcast sends). If you have multiple clusters on the same network using the same mcastaddr please configure the mcastports with a gap.
How corosync communication work:
-
communication is oneway to establish stable communication ring via "protocol" corosync_totemnet
a node sends instances of token based on token_retransmits_before_loss_consts value to next node and expects return from the last node in token timeout value
a three node scenario:
-
node1 sends instances of token to node2 and expects at least one token from node3 to return
-
once communication is stable - token passed from the source back to it via last node - the node which sent the original token can send messages to all other nodes in the already established stable token ring
-
if ring is broken, eg. node1 -> node2 communication error, node1 is the first one who detects split brain (timeout value)
-
-
once communiation ring is stable - token passed from the source back to it, a node with token can send messages (messages value which should fit to UDP datagram) to communicate directly to all nodes in the stable token ring
corosync-cmapctl nodelist.node # list corosync nodes
corosync-cmapctl runtime.totem.pg.mrp.srp.members # list members and state
corosync-cmapctl runtime.votequorum # runtime info about quorum
corosync-quorumtool -l # list nodes
corosync-quorumtool -s # show quorum status of corosync ring
corosync-quorumtool -e <number> # change number of extected votes
corosync-cfgtool -R # tell all nodes to reload corosync configcorosync logs after start:
Feb 01 11:16:53 s153cl01 systemd[1]: Starting Corosync Cluster Engine...
Feb 01 11:16:53 s153cl01 corosync[8725]: [MAIN ] Corosync Cluster Engine ('2.4.5'): started and ready to provide service.
Feb 01 11:16:53 s153cl01 corosync[8725]: [MAIN ] Corosync built-in features: testagents systemd qdevices qnetd pie relro bindnow
Feb 01 11:16:53 s153cl01 corosync[8730]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 01 11:16:53 s153cl01 corosync[8730]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 01 11:16:53 s153cl01 corosync[8730]: [TOTEM ] The network interface [192.168.123.189] is now up.
Feb 01 11:16:53 s153cl01 corosync[8730]: [SERV ] Service engine loaded: corosync configuration map access [0]
Feb 01 11:16:53 s153cl01 corosync[8730]: [QB ] server name: cmap
Feb 01 11:16:53 s153cl01 corosync[8730]: [SERV ] Service engine loaded: corosync configuration service [1]
Feb 01 11:16:53 s153cl01 corosync[8730]: [QB ] server name: cfg
Feb 01 11:16:53 s153cl01 corosync[8730]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 01 11:16:53 s153cl01 corosync[8730]: [QB ] server name: cpg
Feb 01 11:16:53 s153cl01 corosync[8730]: [SERV ] Service engine loaded: corosync profile loading service [4]
Feb 01 11:16:53 s153cl01 corosync[8730]: [QUORUM] Using quorum provider corosync_votequorum
Feb 01 11:16:53 s153cl01 corosync[8730]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 01 11:16:53 s153cl01 corosync[8730]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 01 11:16:53 s153cl01 corosync[8730]: [QB ] server name: votequorum
Feb 01 11:16:53 s153cl01 corosync[8730]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 01 11:16:53 s153cl01 corosync[8730]: [QB ] server name: quorum
Feb 01 11:16:53 s153cl01 corosync[8730]: [TOTEM ] adding new UDPU member {192.168.123.189}
Feb 01 11:16:53 s153cl01 corosync[8730]: [TOTEM ] adding new UDPU member {192.168.123.192}
Feb 01 11:16:53 s153cl01 corosync[8730]: [TOTEM ] A new membership (192.168.123.189:76) was formed. Members joined: 1084783549
Feb 01 11:16:53 s153cl01 corosync[8730]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 01 11:16:53 s153cl01 corosync[8730]: [CPG ] downlist left_list: 0 received
Feb 01 11:16:53 s153cl01 corosync[8730]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 01 11:16:53 s153cl01 corosync[8730]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Feb 01 11:16:53 s153cl01 corosync[8730]: [QUORUM] Members[1]: 1084783549
Feb 01 11:16:53 s153cl01 corosync[8730]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 01 11:16:53 s153cl01 corosync[8715]: Starting Corosync Cluster Engine (corosync): [ OK ]corosync knows about two members but only nodeid 1084783549 joins for now.
This corresponds to (see there's no quorum in this two node cluster!):
$ corosync-cmapctl | grep member
runtime.totem.pg.mrp.srp.members.1084783549.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1084783549.ip (str) = r(0) ip(192.168.123.189)
runtime.totem.pg.mrp.srp.members.1084783549.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1084783549.status (str) = joined
$ corosync-cpgtool -e
Group Name PID Node ID
crmd
9126 1084783549 (192.168.123.189)
attrd
9124 1084783549 (192.168.123.189)
stonith-ng
9122 1084783549 (192.168.123.189)
cib
9121 1084783549 (192.168.123.189)
sbd:cluster
9100 1084783549 (192.168.123.189)
$ corosync-quorumtool -s
Quorum information
------------------
Date: Tue Feb 1 11:43:29 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 1084783549
Ring ID: 1084783549/112
Quorate: No
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 1 Activity blocked
Flags: 2Node WaitForAll
Membership information
----------------------
Nodeid Votes Name
1084783549 1 s153cl01.cl0.example.com (local)When other corosync node joins the following is logged (see quorum was reached in this two node cluster!):
Feb 01 11:24:05 s153cl01 corosync[8730]: [TOTEM ] A new membership (192.168.123.189:84) was formed. Members joined: 1084783552
Feb 01 11:24:05 s153cl01 corosync[8730]: [CPG ] downlist left_list: 0 received
Feb 01 11:24:05 s153cl01 corosync[8730]: [CPG ] downlist left_list: 0 received
Feb 01 11:24:05 s153cl01 corosync[8730]: [QUORUM] This node is within the primary component and will provide service.
Feb 01 11:24:05 s153cl01 corosync[8730]: [QUORUM] Members[2]: 1084783549 1084783552
Feb 01 11:24:05 s153cl01 corosync[8730]: [MAIN ] Completed service synchronization, ready to provide service.
And corosync-cmapctl would show:
$ corosync-cmapctl | grep member
runtime.totem.pg.mrp.srp.members.1084783549.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1084783549.ip (str) = r(0) ip(192.168.123.189)
runtime.totem.pg.mrp.srp.members.1084783549.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1084783549.status (str) = joined
runtime.totem.pg.mrp.srp.members.1084783552.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1084783552.ip (str) = r(0) ip(192.168.123.192)
runtime.totem.pg.mrp.srp.members.1084783552.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1084783552.status (str) = joined
$ corosync-cpgtool -e
Group Name PID Node ID
crmd
9126 1084783549 (192.168.123.189)
4320 1084783552 (192.168.123.192)
attrd
9124 1084783549 (192.168.123.189)
4318 1084783552 (192.168.123.192)
stonith-ng
9122 1084783549 (192.168.123.189)
4316 1084783552 (192.168.123.192)
cib
9121 1084783549 (192.168.123.189)
4315 1084783552 (192.168.123.192)
sbd:cluster
9100 1084783549 (192.168.123.189)
4292 1084783552 (192.168.123.192)
$ corosync-quorumtool -s
Quorum information
------------------
Date: Tue Feb 1 11:45:15 2022
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1084783549
Ring ID: 1084783549/116
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 1
Flags: 2Node Quorate WaitForAll
Membership information
----------------------
Nodeid Votes Name
1084783549 1 s153cl01.cl0.example.com (local)
1084783552 1 s153cl02.cl0.example.comWhen a node leaves... (note "leaves", ie. not disappears!)
Feb 01 11:35:06 s153cl01 corosync[9101]: [TOTEM ] A new membership (192.168.123.189:100) was formed. Members left: 1084783552
Feb 01 11:35:06 s153cl01 corosync[9101]: [CPG ] downlist left_list: 1 received
Feb 01 11:35:06 s153cl01 corosync[9101]: [QUORUM] Members[1]: 1084783549
Feb 01 11:35:06 s153cl01 corosync[9101]: [MAIN ] Completed service synchronization, ready to provide service.
And corosync-cmapctl would show:
$ corosync-cmapctl | grep member
runtime.totem.pg.mrp.srp.members.1084783549.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1084783549.ip (str) = r(0) ip(192.168.123.189)
runtime.totem.pg.mrp.srp.members.1084783549.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1084783549.status (str) = joined
runtime.totem.pg.mrp.srp.members.1084783552.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1084783552.ip (str) = r(0) ip(192.168.123.192)
runtime.totem.pg.mrp.srp.members.1084783552.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1084783552.status (str) = leftAnd when a node or nodes disappear...
2022-04-04T19:51:10.069944+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A processor failed, forming new configuration.
2022-04-04T19:51:16.081236+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2016) was formed. Members left: 1 2
2022-04-04T19:51:16.081489+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] Failed to receive the leave message. failed: 1 2
2022-04-04T19:51:16.081685+01:00 T3PRPDB011 corosync[28003]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
2022-04-04T19:51:16.081846+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:51:35.842885+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2020) was formed. Members
2022-04-04T19:51:35.843290+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:51:42.061348+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2024) was formed. Members
2022-04-04T19:51:42.061638+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:51:52.384522+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2028) was formed. Members
2022-04-04T19:51:52.385047+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:51:59.892504+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2032) was formed. Members
2022-04-04T19:51:59.892889+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:52:06.783170+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2036) was formed. Members
2022-04-04T19:52:06.783665+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:52:18.403755+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2040) was formed. Members
2022-04-04T19:52:18.404132+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3
2022-04-04T19:52:24.515971+01:00 T3PRPDB011 corosync[28003]: [TOTEM ] A new membership (10.121.239.29:2044) was formed. Members
2022-04-04T19:52:24.516287+01:00 T3PRPDB011 corosync[28003]: [QUORUM] Members[1]: 3A non-tuned corosync in virtualized environment could be detected this way:
2023-04-11T16:07:28.761850+02:00 node2 corosync[38932]: [MAIN ] Corosync main process was not scheduled (@1681222048760) for 7917.4106 ms (threshold is 800.0000 ms). Consider token timeout increase.
The above line demonstrates:
- that token is: 1000ms
- 80 % of the token timeout reached should be max scheduling timeout
During Live partition migration (Power) or vMotion (VMware) a short pause of the LPAR/VM occurs, so final memory changes could be migrated; this may have an impact of the LPAR/VM applications, namely HA stack. There is no other way how the "migration" could work considering both hosts do not share memory as one big hardware system.
See or how
corosync.conf
looks like in Azure documentation.
corosync can be also observed on network layer (although there's probably and issue):
$ tshark -r corosync-totemsrp--noencypted--2nodes.pcap \
-O corosync_totemnet,corosync_totemsrp \
-Y 'corosync_totemsrp.message_header.type==3' | \
sed -n '/^Frame/,/^ *$/{/^ *$/q;p}'
Frame 540: 200 bytes on wire (1600 bits), 200 bytes captured (1600 bits)
Linux cooked capture v1
Internet Protocol Version 4, Src: 192.168.0.101, Dst: 239.192.104.1
User Datagram Protocol, Src Port: 5149, Dst Port: 5405
Totem Single Ring Protocol implemented in Corosync Cluster Engine
Type: join message (3)
Encapsulated: not mcast message (0)
Endian detector: 0xff22
Node ID: 2
Membership join message (nprocs: 2 nfailed: 0)
Single Ring Protocol Address (node: 2)
Node IP address (interface: 0; node: 2)
Node ID: 2
Address family: AF_INET (2)
Address: 192.168.0.101
Address padding: 08000200c0a8006508000400
Node IP address (interface: 1; node: 0)
Node ID: 0
Address family: Unknown (0)
Address: 00000000000000000000000000000000
The number of processor list entries: 2
Single Ring Protocol Address (node: 2)
Node IP address (interface: 0; node: 2)
Node ID: 2
Address family: AF_INET (2)
Address: 192.168.0.101
Address padding: 08000200c0a8006508000400
Node IP address (interface: 1; node: 0)
Node ID: 0
Address family: Unknown (0)
Address: 00000000000000000000000000000000
Single Ring Protocol Address (node: 1)
Node IP address (interface: 0; node: 1)
Node ID: 1
Address family: AF_INET (2)
Address: 192.168.0.102
Address padding: 08000200c0a8006608000400
Node IP address (interface: 1; node: 0)
Node ID: 0
Address family: Unknown (0)
Address: 00000000000000000000000000000000
The number of failed list entries: 0
Ring sequence number: 56A "client" of corosync-qnetd; even its configuration is in /etc/corosync/corosync.conf,
it runs as a separate daemon, thus it has to be started/enabled:
$ sed -n '/^quorum/,$p' /etc/corosync/corosync.conf
quorum {
provider: corosync_votequorum
#expected_votes: 2
#two_node: 1
device {
votes: 1
model: net
net {
tls: off
host: 192.168.252.1
port: 5403
algorithm: ffsplit
tie_breaker: lowest
}
}
}$ grep -Pv '^\s*(#|$)' /etc/sysconfig/corosync-qdevice
COROSYNC_QDEVICE_OPTIONS="-q -d"When, for example, corosync-qdevice connects to corosync-qnetd, the corosync
will report:
Apr 30 13:42:31 debug [VOTEQ ] Received qdevice op 1 req from node 1 [Qdevice]
Apr 30 13:42:31 debug [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: Yes Qdevice: Yes QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Apr 30 13:42:31 debug [VOTEQ ] got nodeinfo message from cluster node 1
Apr 30 13:42:31 debug [VOTEQ ] nodeinfo message[0]: votes: 1, expected: 0 flags: 0
Apr 30 13:42:31 debug [VOTEQ ] got nodeinfo message from cluster node 1
Apr 30 13:42:31 debug [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 24
Apr 30 13:42:31 debug [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: Yes Qdevice: Yes QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Apr 30 13:42:31 debug [VOTEQ ] total_votes=2, expected_votes=2
Apr 30 13:42:31 debug [VOTEQ ] node 1 state=1, votes=1, expected=2
Apr 30 13:42:31 debug [VOTEQ ] got getinfo request on 0x56030111bcf0 for node 0
Apr 30 13:42:31 debug [VOTEQ ] getinfo response error: 1
Apr 30 13:42:31 debug [VOTEQ ] sending initial status to 0x56030111bcf0
Apr 30 13:42:31 debug [VOTEQ ] Sending nodelist callback. ring_id = 1/175
Apr 30 13:42:31 debug [VOTEQ ] Sending quorum callback, quorate = 0
Apr 30 13:42:31 debug [VOTEQ ] got getinfo request on 0x56030111bcf0 for node 1
Apr 30 13:42:31 debug [VOTEQ ] getinfo response error: 1
Apr 30 13:42:31 debug [VOTEQ ] got getinfo request on 0x56030111bcf0 for node 2
Apr 30 13:42:31 debug [VOTEQ ] getinfo response error: 12
Apr 30 13:42:31 debug [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: Yes Qdevice: Yes QdeviceAlive: Yes QdeviceCastVote: Yes QdeviceMasterWins: No
Apr 30 13:42:31 debug [VOTEQ ] got nodeinfo message from cluster node 1
Apr 30 13:42:31 debug [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 120
Apr 30 13:42:31 debug [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: Yes Qdevice: Yes QdeviceAlive: Yes QdeviceCastVote: Yes QdeviceMasterWins: No
Apr 30 13:42:31 debug [VOTEQ ] total_votes=2, expected_votes=2
Apr 30 13:42:31 debug [VOTEQ ] node 1 state=1, votes=1, expected=2
Apr 30 13:42:31 debug [VOTEQ ] node 0 state=1, votes=1
Apr 30 13:42:31 debug [VOTEQ ] lowest node id: 1 us: 1
Apr 30 13:42:31 debug [VOTEQ ] highest node id: 1 us: 1
Apr 30 13:42:31 debug [VOTEQ ] quorum regained, resuming activity
Apr 30 13:42:31 notice [QUORUM] This node is within the primary component and will provide service.
Apr 30 13:42:31 notice [QUORUM] Members[1]: 1
Apr 30 13:42:31 debug [QUORUM] sending quorum notification to (nil), length = 52
Apr 30 13:42:31 debug [VOTEQ ] Sending quorum callback, quorate = 1Thus, a vote is added and the quorum is obtained.
And, when, it leaves:
Apr 30 13:44:36 debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: Yes Qdevice: Yes QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 30 13:44:36 debug [VOTEQ ] got nodeinfo message from cluster node 1
Apr 30 13:44:36 debug [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 57
Apr 30 13:44:36 debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: Yes Qdevice: Yes QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 30 13:44:36 debug [VOTEQ ] total_votes=2, expected_votes=2
Apr 30 13:44:36 debug [VOTEQ ] node 1 state=1, votes=1, expected=2
Apr 30 13:44:36 debug [VOTEQ ] quorum lost, blocking activity
Apr 30 13:44:36 notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 30 13:44:36 notice [QUORUM] Members[1]: 1
Apr 30 13:44:36 debug [QUORUM] sending quorum notification to (nil), length = 52
Apr 30 13:44:36 debug [VOTEQ ] Sending quorum callback, quorate = 0
Apr 30 13:44:36 debug [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: Yes Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Apr 30 13:44:36 debug [QB ] HUP conn (/dev/shm/qb-3778-3780-19-WXUubU/qb)
Apr 30 13:44:36 debug [QB ] qb_ipcs_disconnect(/dev/shm/qb-3778-3780-19-WXUubU/qb) state:2
Apr 30 13:44:36 debug [MAIN ] cs_ipcs_connection_closed()
Apr 30 13:44:36 debug [MAIN ] cs_ipcs_connection_destroyed()
Apr 30 13:44:36 debug [QB ] Free'ing ringbuffer: /dev/shm/qb-3778-3780-19-WXUubU/qb-response-votequorum-header
Apr 30 13:44:36 debug [QB ] Free'ing ringbuffer: /dev/shm/qb-3778-3780-19-WXUubU/qb-event-votequorum-header
Apr 30 13:44:36 debug [QB ] Free'ing ringbuffer: /dev/shm/qb-3778-3780-19-WXUubU/qb-request-votequorum-header
Apr 30 13:44:36 debug [QB ] HUP conn (/dev/shm/qb-3778-3780-18-uSN3US/qb)
Apr 30 13:44:36 debug [QB ] qb_ipcs_disconnect(/dev/shm/qb-3778-3780-18-uSN3US/qb) state:2
Apr 30 13:44:36 debug [MAIN ] cs_ipcs_connection_closed()
Apr 30 13:44:36 debug [CMAP ] exit_fn for conn=0x560301124f90
Apr 30 13:44:36 debug [MAIN ] cs_ipcs_connection_destroyed()
Apr 30 13:44:36 debug [QB ] Free'ing ringbuffer: /dev/shm/qb-3778-3780-18-uSN3US/qb-response-cmap-header
Apr 30 13:44:36 debug [QB ] Free'ing ringbuffer: /dev/shm/qb-3778-3780-18-uSN3US/qb-event-cmap-header
Apr 30 13:44:36 debug [QB ] Free'ing ringbuffer: /dev/shm/qb-3778-3780-18-uSN3US/qb-request-cmap-header
Apr 30 13:44:36 debug [VOTEQ ] got nodeinfo message from cluster node 1
Apr 30 13:44:36 debug [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 8
Apr 30 13:44:36 debug [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: Yes Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Apr 30 13:44:36 debug [VOTEQ ] total_votes=2, expected_votes=2
Apr 30 13:44:36 debug [VOTEQ ] node 1 state=1, votes=1, expected=2
Apr 30 13:44:36 debug [VOTEQ ] Received qdevice op 0 req from node 1 [Qdevice]That is, losing the quorum.
Most distros setup TLS DB store in postinstall package scripts; an exaple from Debian:
# https://fedoraproject.org/wiki/Changes/NSSDefaultFileFormatSql
if ! [ -f "$db/cert9.db" ]; then
if [ -f "$dir/nssdb/cert8.db" ]; then
# password file should have an empty line to be accepted
[ -f "$pwdfile" -a ! -s "$pwdfile" ] && echo > "$pwdfile"
# upgrade to SQLite database
certutil -N -d "sql:$db" -f "$pwdfile" -@ "$pwdfile"
chmod g+r "$db/cert9.db" "$db/key4.db"
else
corosync-qnetd-certutil -i -G
fi
chgrp "$user" "$db" "$db/cert9.db" "$db/key4.db"
fiHowever, for testing purposes, this can be turned off:
$ grep -Pv '^\s*(#|$)' /etc/sysconfig/corosync-qnetd
COROSYNC_QNETD_OPTIONS="-s off -c off -d"
COROSYNC_QNETD_RUNAS=""Below, just summary:
$ corosync-qnetd-tool -s
QNetd address: *:5403
TLS: Unsupported
Connected clients: 0
Connected clusters: 0
Maximum send/receive size: 32768/32768 bytesWhen a client (corosync-qdevice) connects:
$ corosync-qnetd-tool -lv
Cluster "jb155sapqe":
Algorithm: Fifty-Fifty split
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:192.168.252.100:47222
HB interval: 8000ms
Configured node list: 1, 2
Ring ID: 1.a0
Membership node list: 1
Heuristics: Undefined (membership: Undefined, regular: Undefined)
TLS active: No
Vote: No change (ACK)$ man dlm_controld | sed -n '/^DESCRIPTION/,/^$/{/^$/q;p}' | fmt -w80
DESCRIPTION
The kernel dlm requires a user daemon to manage lockspace membership.
dlm_controld manages lockspace membership using corosync cpg groups,
and translates membership changes into dlm kernel recovery events.
dlm_controld also manages posix locks for cluster file systems using
the dlm.# see corosync CPG (control process group) exists for DLM
$ corosync-cpgtool -e
Group Name PID Node ID
dlm:controld
3114 1 (192.168.253.100)
crmd
3034 1 (192.168.253.100)
1859 2 (192.168.253.101)
attrd
3032 1 (192.168.253.100)
1857 2 (192.168.253.101)
stonith-ng
3030 1 (192.168.253.100)
1855 2 (192.168.253.101)
cib
3029 1 (192.168.253.100)
1854 2 (192.168.253.101)
sbd:cluster
3018 1 (192.168.253.100)
1842 2 (192.168.253.101)$ corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.253.100
status = ring 0 active with no faults
$ dlm_tool status
cluster nodeid 1 quorate 1 ring seq 270 270
daemon now 5844 fence_pid 0
node 1 M add 5674 rem 0 fail 0 fence 0 at 0 0
node 2 M add 5815 rem 0 fail 0 fence 0 at 0 0If there are lockspace members (for example lvmlockd RA), one should see:
# first our RA using the lockspace
$ crm configure show lvmlockd
primitive lvmlockd lvmlockd \
op start timeout=90s interval=0s \
op stop timeout=90s interval=0s \
op monitor timeout=90s interval=30s
# note: 'clustered' was for old `clvmd'
$ vgs -o vg_name,vg_clustered,vg_lock_type,vg_lock_args
VG Clustered LockType VLockArgs
clvg0 dlm 1.0.0:jb155sapqe
# listing DLM internal lockspace
$ dlm_tool ls
dlm lockspaces
name lvm_clvg0
id 0x45d1d4f1
flags 0x00000000
change member 1 joined 1 remove 0 failed 0 seq 1,1
members 1
name lvm_global
id 0x12aabd2d
flags 0x00000000
change member 1 joined 1 remove 0 failed 0 seq 1,1
members 1
# and corosync CPG for lockspace memberships
$ corosync-cpgtool -e | head -n9
Group Name PID Node ID
dlm:ls:lvm_clvg0
3702 1 (192.168.253.100)
dlm:ls:lvm_global
3702 1 (192.168.253.100)
2368 2 (192.168.253.101)
dlm:controld
3702 1 (192.168.253.100)
2368 2 (192.168.253.101)
So, what happens on network level when eg. vgs is typed?
$ tshark -n -i eth0 -f 'not (udp or stp ) and not (port 22 or port 3260)'
...
1 0.000000000 192.168.253.101 → 192.168.253.100 DLM3 206 options: message: conversion message
2 0.000115947 192.168.253.100 → 192.168.253.101 DLM3 222 acknowledge
3 0.000344159 192.168.253.101 → 192.168.253.100 DLM3 78 acknowledge
4 0.000361827 192.168.253.100 → 192.168.253.101 SCTP 62 SACK (Ack=1, Arwnd=4194288)
5 0.008363511 192.168.253.101 → 192.168.253.100 DLM3 206 options: message: conversion message
6 0.008479394 192.168.253.100 → 192.168.253.101 DLM3 78 acknowledge
7 0.008581370 192.168.253.101 → 192.168.253.100 SCTP 62 SACK (Ack=1, Arwnd=4194288)
8 0.209133597 192.168.253.100 → 192.168.253.101 SCTP 62 SACK (Ack=2, Arwnd=4194304)
9 6.161107176 192.168.253.100 → 192.168.253.101 SCTP 106 HEARTBEAT
10 6.161481187 192.168.253.101 → 192.168.253.100 SCTP 106 HEARTBEAT_ACK
$ ss -Sna | col -b # `-S' is SCTP, see below
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 5 192.168.253.100:21064 0.0.0.0:*
ESTAB 0 0 192.168.253.100:21064 192.168.253.101:53872
`- ESTAB 0 0 192.168.253.100%eth0:21064 192.168.253.101:53872
ESTAB 0 0 192.168.253.100:42013 192.168.253.101:21064
`- ESTAB 0 0 192.168.253.100%eth0:42013 192.168.253.101:21064SCTP??? See Protocols for DLM communication.
$ dlm_tool dump | grep -P '(rrp_mode|protocol)'
4244 cmap totem.rrp_mode = 'passive'
4244 set protocol 1
4244 receive_protocol 2 max 3.1.1.0 run 0.0.0.0
4244 receive_protocol 1 max 3.1.1.0 run 3.1.1.0
4244 run protocol from nodeid 1
4244 receive_protocol 2 max 3.1.1.0 run 3.1.1.0
$ lsmod | grep ^sctp
sctp 434176 10
$ modinfo sctp | head
filename: /lib/modules/5.14.21-150500.55.19-default/kernel/net/sctp/sctp.ko.zst
license: GPL
description: Support for the SCTP protocol (RFC2960)
author: Linux Kernel SCTP developers <linux-sctp@vger.kernel.org>
alias: net-pf-10-proto-132
alias: net-pf-2-proto-132
suserelease: SLE15-SP5
srcversion: FC2AFAA5AE6D0A503192391
depends: udp_tunnel,libcrc32c,ip6_udp_tunnel
supported: yesFor now only lvmlockd RA was added, thus 'lvm_clvg0' has only _one_member:
$ dlm_tool ls
dlm lockspaces
name lvm_clvg0
id 0x45d1d4f1
flags 0x00000000
change member 1 joined 1 remove 0 failed 0 seq 1,1
members 1
name lvm_global
id 0x12aabd2d
flags 0x00000000
change member 2 joined 1 remove 0 failed 0 seq 2,2
members 1 2After adding "shared" VG, that is one which is activated on both nodes (eg. for OCFS2), this happens:
$ crm configure show clvg0
primitive clvg0 LVM-activate \
params vgname=clvg0 vg_access_mode=lvmlockd activation_mode=shared \
op start timeout=90s interval=0 \
op stop timeout=90s interval=0 \
op monitor interval=90s timeout=90s
# see lvm_clvg0 has two member now!
$ dlm_tool ls
dlm lockspaces
name lvm_clvg0
id 0x45d1d4f1
flags 0x00000000
change member 2 joined 1 remove 0 failed 0 seq 1,1
members 1 2
name lvm_global
id 0x12aabd2d
flags 0x00000000
change member 2 joined 1 remove 0 failed 0 seq 1,1
members 1 2# see that we have 'rrp_mode = "passive"', and 'protocol'
$ dlm_tool dump
5674 config file log_debug = 1 cli_set 0 use 1
5674 dlm_controld 4.1.0 started
5674 our_nodeid 1
5674 node_config 1
5674 node_config 2
5674 found /dev/misc/dlm-control minor 124
5674 found /dev/misc/dlm-monitor minor 123
5674 found /dev/misc/dlm_plock minor 122
5674 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
5674 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
5674 set log_debug 1
5674 set mark 0
5674 cmap totem.rrp_mode = 'passive'
5674 set protocol 1
5674 set /proc/sys/net/core/rmem_default 4194304
5674 set /proc/sys/net/core/rmem_max 4194304
5674 set recover_callbacks 1
5674 cmap totem.cluster_name = 'jb155sapqe'
5674 set cluster_name jb155sapqe
5674 /dev/misc/dlm-monitor fd 13
5674 cluster quorum 1 seq 270 nodes 2
5674 cluster node 1 added seq 270
5674 set_configfs_node 1 192.168.253.100 local 1 mark 0
5674 cluster node 2 added seq 270
5674 set_configfs_node 2 192.168.253.101 local 0 mark 0
5674 cpg_join dlm:controld ...
5674 setup_cpg_daemon 15
5674 dlm:controld conf 1 1 0 memb 1 join 1 left 0
5674 daemon joined 1
5674 dlm:controld ring 1:270 2 memb 1 2
5674 receive_protocol 1 max 3.1.1.0 run 0.0.0.0
5674 daemon node 1 prot max 0.0.0.0 run 0.0.0.0
5674 daemon node 1 save max 3.1.1.0 run 0.0.0.0
5674 set_protocol member_count 1 propose daemon 3.1.1 kernel 1.1.1
5674 receive_protocol 1 max 3.1.1.0 run 3.1.1.0
5674 daemon node 1 prot max 3.1.1.0 run 0.0.0.0
5674 daemon node 1 save max 3.1.1.0 run 3.1.1.0
5674 run protocol from nodeid 1
5674 daemon run 3.1.1 max 3.1.1 kernel run 1.1.1 max 1.1.1
5674 plocks 16
5674 receive_protocol 1 max 3.1.1.0 run 3.1.1.0
5674 send_fence_clear 1 fipu
5674 receive_fence_clear from 1 for 1 result -61 flags 1
5674 fence_in_progress_unknown 0 all_fipu
5815 dlm:controld conf 2 1 0 memb 1 2 join 2 left 0
5815 daemon joined 2
5815 receive_protocol 2 max 3.1.1.0 run 0.0.0.0
5815 daemon node 2 prot max 0.0.0.0 run 0.0.0.0
5815 daemon node 2 save max 3.1.1.0 run 0.0.0.0
5815 receive_protocol 1 max 3.1.1.0 run 3.1.1.0
5815 receive_fence_clear from 1 for 2 result 0 flags 6
5815 receive_protocol 2 max 3.1.1.0 run 3.1.1.0
5815 daemon node 2 prot max 3.1.1.0 run 0.0.0.0
5815 daemon node 2 save max 3.1.1.0 run 3.1.1.0pacemaker is an advanced, scalable High-Availability cluster resource manager.
systemctl start pacemaker # on all nodes (or use 'crm cluster start' instead)
corosync-cpgtool # see if pacemaker is known to corosync,
# these are symlinks to pacemaker daemons,
# see `ls -l /usr/lib/pacemaker/'There was a rename of pacemaker components but there are still old names visible:
$ ls -l /usr/lib/pacemaker/
total 832
lrwxrwxrwx 1 root root 15 Oct 14 2021 attrd -> pacemaker-attrd
lrwxrwxrwx 1 root root 15 Oct 14 2021 cib -> pacemaker-based
-rwxr-xr-x 1 root root 14936 Oct 14 2021 cibmon
lrwxrwxrwx 1 root root 18 Oct 14 2021 crmd -> pacemaker-controld
-rwxr-xr-x 1 root root 24296 Oct 14 2021 cts-exec-helper
-rwxr-xr-x 1 root root 31552 Oct 14 2021 cts-fence-helper
lrwxrwxrwx 1 root root 15 Oct 14 2021 lrmd -> pacemaker-execd
-rwxr-xr-x 1 root root 56560 Oct 14 2021 pacemaker-attrd
-rwxr-xr-x 1 root root 107600 Oct 14 2021 pacemaker-based
-rwxr-xr-x 1 root root 370360 Oct 14 2021 pacemaker-controld
-rwxr-xr-x 1 root root 48464 Oct 14 2021 pacemaker-execd
-rwxr-xr-x 1 root root 143224 Oct 14 2021 pacemaker-fenced
-rwxr-xr-x 1 root root 19464 Oct 14 2021 pacemaker-schedulerd
lrwxrwxrwx 1 root root 20 Oct 14 2021 pengine -> pacemaker-schedulerd
lrwxrwxrwx 1 root root 16 Oct 14 2021 stonithd -> pacemaker-fencedcluster configuration:
- in-memory representation
/var/lib/pacemaker/cib
/var/lib/pacemaker
├── cib
│ ├── cib-X.raw # cluster configuration history
│ ├── cib-X.raw.sig
│ └── cib.xml # latest cluster configuration saved to the disk
└── pengine # snapshot of a moment of the cluster life
├── pe-input-0.bz2 # cluster state of a moment
└── pe-warn-0.bz2 # something went wrong (fence, reboot), state of what
# cluster wants to do about it
WARNING: this directory is not intended for editing when the cluster is online!
2022-04-19T13:56:34.055004+02:00 oldhanaa1 pacemaker-based[20832]: error: Digest comparison failed: expected 4010ded1087db5173bd9912cda6e302d, calculated abecc1d59c0b2293b57158cf745280d5
2022-04-19T13:56:34.055299+02:00 oldhanaa1 pacemaker-based[20832]: error: /var/lib/pacemaker/cib/cib.xml was manually modified while the cluster was active!
$ crmadmin -N # show member nodes
member node: oldhanad2 (178438534)
member node: oldhanad1 (178438533)
$ crmadmin -D # show designated coordinator (DC)
Designated Controller is: s153cl01Ongoing activities on the cluster?
$ crmadmin -qD # get DC name
s153cl1
$ crmadmin -qS s153cl1
Status of crmd@s153cl1: S_IDLE (ok)
S_IDLE$ crm_mon -r -1 # show cluster status
Cluster Summary:
* Stack: corosync
* Current DC: consap02 (version 2.0.4+20200616.2deceaa3a-3.9.1-2.0.4+20200616.2deceaa3a) - partition with quorum
* Last updated: Wed Apr 20 14:30:59 2022
* Last change: Wed Apr 20 12:12:57 2022 by root via cibadmin on consap01
* 2 nodes configured
* 9 resource instances configured (8 DISABLED)
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Online: [ consap01 consap02 ]
Full List of Resources:
* stonith-sbd (stonith:external/sbd): Stopped (unmanaged)
* Clone Set: cln_SAPHanaTopology_SLE_HDB00 [rsc_SAPHanaTopology_SLE_HDB00] (unmanaged):
* Stopped (disabled): [ consap01 consap02 ]
* Clone Set: msl_SAPHana_SLE_HDB00 [rsc_SAPHana_SLE_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_SLE_HDB00 (ocf::suse:SAPHana): FAILED consap02 (disabled, unmanaged)
* rsc_SAPHana_SLE_HDB00 (ocf::suse:SAPHana): Slave consap01 (disabled, unmanaged)
* rsc_ip_SLE_HDB00 (ocf::heartbeat:IPaddr2): Stopped (disabled, unmanaged)
* rsc_mail (ocf::heartbeat:MailTo): Stopped (disabled, unmanaged)
* Clone Set: cln_diskfull_threshold [sysinfo] (unmanaged):
* Stopped (disabled): [ consap01 consap02 ]
Failed Resource Actions:
* rsc_SAPHana_SLE_HDB00_monitor_0 on consap02 'error' (1): call=43, status='complete', exitreason='', last-rc-change='2022-04-20 14:28:02 +01:00', queued=0ms, exec=2476msdisabled above means resources were stopped before the cluster was put into maintenance.
Some crm_mon details...
- offline does not necessary mean the node is down, it inherits this value from corosync, which means the ring/communication is broken
- UNCLEAN means one node does not know what is going on on other node
$ cibadmin -Q -o nodes # list nodes in pacemaker
<nodes>
<node id="1084783552" uname="s153cl02"/>
<node id="1084783549" uname="s153cl01"/>
</nodes>
$ cibadmin -Q -o crm_config # list cluster options configuration in pacemaker
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="true"/>
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.0.5+20201202.ba59be712-4.13.1-2.0.5+20201202.ba59be712"/>
<nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
<nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="s153cl0"/>
<nvpair name="stonith-timeout" value="12" id="cib-bootstrap-options-stonith-timeout"/>
<nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
</cluster_property_set>
</crm_config>
$ cibadmin -Q --xpath '//*/primitive[@type="external/sbd"]' # query with xpath
<primitive id="stonith-sbd" class="stonith" type="external/sbd">
<instance_attributes id="stonith-sbd-instance_attributes">
<nvpair name="pcmk_delay_max" value="30" id="stonith-sbd-instance_attributes-pcmk_delay_max"/>
</instance_attributes>
</primitive>$ crm_verify -LV # check configuration used by cluster, verbose
# can show important infogeneral cluster mgmt
$ crm_mon # general overview, part of pacemaker
$ crm_mon [-n | --group-by-node ]
$ crm_mon -nforA # incl. fail, counts, operations...
$ cibadmin [-Q | --query] # expert xml administration, part of pacemaker
$ crm_attribute --type <scope> --name <attribute> --query # another query solutionResources failures and what would happen:
- monitor failure -> stop -> start ("did you try to stop and start it again?")
- start failure -> blocked to start locally via fail-count
<nvpair id="status-2-fail-count-PlanetX.start_0" name="fail-count-PlanetX#start_0" value="INFINITY"/> - stop failure -> fence (we cannot be sure what a resource won't mess with data thus STONITH)
Note, one can define on-fail resource operation action, see resource operations. For example, start failure leads to block.
- by default root and haclient group members can manage cluster
- some
crmactions require SSH working between nodes, either passwordless root or via a user configured withcrm options user <user>(then it requires passwordlesssudoersrule)
crm
crm help <topic> # generic syntax for help
crm status # similar to *crm_mon*, part of crmsh- same users and userids on all nodes
- users must be in haclient user group
- users need to have rights to run
/usr/bin/crm
crm configure show > <backup_file> # remove node specific stuff
# something like this
crm configure show | \
perl -pe 's/\\\n/ /' | \
perl -ne 'print unless m/^property cib-bootstrap-options:/' | \
perl -pe 's/ {2,}/ \\\n /g'
cibadmin -E --force # remove cluster configuration
crm configure < <backup_file>Reads from right to left, if last right resource runs somewhere, do action defined with next-to-last and all previous resources.
colocation <id> <score>: <resource> <resource>
# an example
colocation c1 inf: p-goodservice p-badservice
The above reads: if p-badservice runs somehwere always run p-goodservice
together.
Reads from left to right, order is respected, this influence state of the resources, that is states is respected in order, that is if a resource on the left in the list is stopped, then resources on the right side are stopped too.
group <name> <res> <res>...
# an example
group g-grp1 p-goodservice p-badservice
The above reads: start p-goodservice and then p-badservice.
crm ra classes # list RA classes
crm ra list ocf # list ocf RA class resources
crm ra list ocf <providers> # list ocf:<provider> RAs
crm ra info [<class>:[<provider>:]]<type> # show RA info and optionscrm resource status # show status of resources
crm resource # interactive shell for resources
crm configure [edit] # configuration edit via editor
# do not forget commit changes!
crm move # careful, creates constraints
crm resource constraints <resource> # show resource constraintscrm resource [trace | untrace] <resource>tracing logs in /var/lib/heartbeat/trace_ra (SUSE), filenames as
<resource>.<action>.<date>.
$ crm configure show related:Dummy
primitive dummy Dummy \
op monitor timeout=20s interval=10s \
op_params depth=0 \
meta target-role=Stopped
# ex script for batch editing
$ cat /tmp/ex-script
/dummy/
s/dummy/dummy-test/
a
params state=/run/resource-agents/foobar.state fake=fake \
$ function myeditor() { ex '+source /tmp/ex-script' -sc '%wq!' $@; }
$ export -f myeditor
$ EDITOR=myeditor crm configure edit
$ crm configure show related:Dummy
primitive dummy-test Dummy \
params state="/run/resource-agents/foobar.state" fake=fake \
op monitor timeout=20s interval=10s \
op_params depth=0 \
meta target-role=StoppedOf course, crm configure show dump, edit and then crm configure load update <file> would be probably better ;)
$ diff --label current --label new -u0 \
<(printf 'cib use cib.xml\nconfigure show\n' | crm -f -) \
<(printf 'cib use temp\nconfigure show\n' | crm -f -)
--- current
+++ new
@@ -885,0 +886 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-CJP-ERS CLO-clvm GRP-CJP-ERS
@@ -886,0 +888,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-CJP-SAP CLO-clvm GRP-CJP-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-CRP-ERS CLO-clvm GRP-CRP-ERS
@@ -887,0 +891,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-CRP-SAP CLO-clvm GRP-CRP-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-M1P-ERS CLO-clvm GRP-M1P-ERS
@@ -888,0 +894,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-M1P-SAP CLO-clvm GRP-M1P-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-PR1-ERS CLO-clvm GRP-PR1-ERS
@@ -889,0 +897,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-PR1-SAP CLO-clvm GRP-PR1-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-PRX-ERS CLO-clvm GRP-PRX-ERS
@@ -890,0 +900,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-PRX-SAP CLO-clvm GRP-PRX-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-S1P-ERS CLO-clvm GRP-S1P-ERS
@@ -891,0 +903,16 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-S1P-SAP CLO-clvm GRP-S1P-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W1E-SAP CLO-clvm GRP-W1E-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W1I-SAP CLO-clvm GRP-W1I-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W7E-SAP CLO-clvm GRP-W7E-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W7I-SAP CLO-clvm GRP-W7I-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W8E-SAP CLO-clvm GRP-W8E-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W8I-SAP CLO-clvm GRP-W8I-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W9E-SAP CLO-clvm GRP-W9E-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-W9I-SAP CLO-clvm GRP-W9I-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WD0-SAP CLO-clvm GRP-WD0-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WD1-SAP CLO-clvm GRP-WD1-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WP0-SAP CLO-clvm GRP-WP0-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WP1-SAP CLO-clvm GRP-WP1-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WP2-SAP CLO-clvm GRP-WP2-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WP3-SAP CLO-clvm GRP-WP3-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WSP-ERS CLO-clvm GRP-WSP-ERS
@@ -892,0 +920,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-WSP-SAP CLO-clvm GRP-WSP-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-X7P-ERS CLO-clvm GRP-X7P-ERS
@@ -893,0 +923,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-X7P-SAP CLO-clvm GRP-X7P-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-X8P-ERS CLO-clvm GRP-X8P-ERS
@@ -894,0 +926,2 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-X8P-SAP CLO-clvm GRP-X8P-SAP
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-X9P-ERS CLO-clvm GRP-X9P-ERS
@@ -895,0 +929 @@
+order ORD-GRP-DLM-CLVM-BEFORE-GRP-X9P-SAP CLO-clvm GRP-X9P-SAP
@@ -917,0 +952 @@
+colocation col-vg-with-dlm inf: ( GRP-CJP-ERS GRP-CJP-NFS GRP-CJP-SAP GRP-CRP-ERS GRP-CRP-NFS GRP-CRP-SAP GRP-M1P-ERS GRP-M1P-NFS GRP-M1P-SAP GRP-PR1-ERS GRP-PR1-NFS GRP-PR1-SAP GRP-PRX-ERS GRP-PRX-NFS GRP-PRX-SAP GRP-S1P-ERS GRP-S1P-NFS GRP-S1P-SAP GRP-W1E-SAP GRP-W1I-SAP GRP-W7E-SAP GRP-W7I-SAP GRP-W8E-SAP GRP-W8I-SAP GRP-W9E-SAP GRP-W9I-SAP GRP-WD0-SAP GRP-WD1-SAP GRP-WP0-SAP GRP-WP1-SAP GRP-WP2-SAP GRP-WP3-SAP GRP-WSP-ERS GRP-WSP-NFS GRP-WSP-SAP GRP-X7P-ERS GRP-X7P-NFS GRP-X7P-SAP GRP-X8P-ERS GRP-X8P-NFS GRP-X8P-SAP GRP-X9P-ERS GRP-X9P-NFS GRP-X9P-SAP ) CLO-clvmThe patch might be applied to crm configure show dumped output, and
reapplied via crm configure load update <file>.
crm configure property maintenance-mode=<true|false> # global maintenance
crm node maintenance <node> # node maintenance start
crm node ready <node> # node maintenance stop
crm resource maintenance <on|off> # (un)sets meta maintenance attribute
crm resource <manage|unmanage> <resource> # set/unsets is-managed mode, ie. *unmanaged*
crm node standby <node> # put node into standby mode (moving away resources)
crm node online <node> # put node online to allow hosting resourcescrm resource ban <service_resource> <node> # prevent resource from running on the node
# where service resouce is going to be updated,
# moves resource out of node
...
crm resource clear <service_resource> # ...
...node in standby mode
crm -w node standby # on the node to be rebooted
crm status # search for 'Node <node>: standby'
crm cluster stop # stopping cluster services
reboot
crm cluster status # check cluster services have started
crm cluster start
crm cluster statuscrm configure edit
< order <id> Mandatory: <resource>:<status> <resource>:<action>
# an example
crm configure edit
< order o-mariadb_before_webserver Mandatory: g-mariadb:start g-webserver:start
crm configure edit
< location <id> <resource> <infinity>: [<node> | <resource>:<state>]
# an example
crm configure edit
< location l-mariadb_pref_node1 g-mariadb 100: node1
# an example of never collocate
crm configure edit
< location l-mariadb_never_with_webserver -inf: g-mariadb:Started g-webserver:Started
$ crm
crm(live/jb154sapqe01)# cib import /tmp/pe-input-325.bz2
crm(pe-input-325/jb154sapqe01)# cibstatus origin
shadow:pe-input-325$ crm
crm(pe-input-325/jb154sapqe01)# configure show related:grp_QSA_ASCS16
group grp_QSA_ASCS16 rsc_ip_QSA_ASCS16 rsc_ip_QSA_ECC_CI rsc_lvm_QSA_ASCS16 rsc_fs_QSA_ASCS16 rsc_sap_QSA_ASCS16 \
meta target-role=Started \
meta resource-stickiness=3000
colocation col_TWS_with_ASCS16 inf: grp_QSA_ASCS16 grp_QSA_TWS
colocation col_W00_with_ASCS16 inf: grp_QSA_ASCS16 grp_QSW_W00
colocation col_sap_QSA_not_both -5000: grp_QSA_ERS02 grp_QSA_ASCS16
order ord_cl-storage_before_grp_QSA_ASCS16 Mandatory: cl-storage grp_QSA_ASCS161. see transitions which trigger an action
Basically there's LogAction lines following by generated transition,
thus the next awk stuff gets only relevant transitions.
$ awk 'BEGIN { start=0; } /(LogAction.*\*|Calculated)/ { if($0 ~ /pe-input/ && start != 1) { next; }; print; if($0 ~ /LogAction/) { start=1; } else { start=0; }; }' pacemaker.log | head -n 8
Sep 02 13:18:02 [27794] example2 pengine: notice: LogAction: * Recover rsc_azure-events:1 ( example1 )
Sep 02 13:18:02 [27794] example2 pengine: notice: LogAction: * Recover rsc_SAPHana_UP3_HDB00:1 ( Master example1 )
Sep 02 13:18:02 [27794] example2 pengine: notice: LogAction: * Recover rsc_SAPHanaTopology_UP3_HDB00:1 ( example1 )
Sep 02 13:18:02 [27794] example2 pengine: notice: process_pe_message: Calculated transition 219198, saving inputs in /var/lib/pacemaker/pengine/pe-input-3141.bz2
Sep 02 13:18:02 [27794] example2 pengine: notice: LogAction: * Recover rsc_azure-events:1 ( example1 )
Sep 02 13:18:02 [27794] example2 pengine: notice: LogAction: * Recover rsc_SAPHana_UP3_HDB00:1 ( Master example1 )
Sep 02 13:18:02 [27794] example2 pengine: notice: LogAction: * Recover rsc_SAPHanaTopology_UP3_HDB00:1 ( example1 )
Sep 02 13:18:02 [27794] example2 pengine: notice: process_pe_message: Calculated transition 219199, saving inputs in /var/lib/pacemaker/pengine/pe-input-3142.bz2Sorting PE files is not so straightforward...
$ ls hb_report-Wed-11-Jan-2023/*/pengine/*.bz2 | while read f ; do date=$(bzcat $f | grep -Po 'execution-date="\K(\d+)(?=.*)'); echo $f $(date -d @${date}); done | sort -V -k4
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-340.bz2 Wed Jan 11 08:55:26 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-341.bz2 Wed Jan 11 08:56:01 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-342.bz2 Wed Jan 11 08:56:52 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-343.bz2 Wed Jan 11 08:57:35 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-344.bz2 Wed Jan 11 08:58:20 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-345.bz2 Wed Jan 11 08:58:22 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-warn-15.bz2 Wed Jan 11 09:00:54 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-157.bz2 Wed Jan 11 09:01:07 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-158.bz2 Wed Jan 11 09:04:31 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-159.bz2 Wed Jan 11 09:05:05 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-160.bz2 Wed Jan 11 09:06:55 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-161.bz2 Wed Jan 11 09:21:57 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-162.bz2 Wed Jan 11 09:30:44 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-163.bz2 Wed Jan 11 09:30:44 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-164.bz2 Wed Jan 11 09:47:42 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-165.bz2 Wed Jan 11 11:59:25 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-166.bz2 Wed Jan 11 11:59:34 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-167.bz2 Wed Jan 11 11:59:38 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-168.bz2 Wed Jan 11 11:59:42 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-169.bz2 Wed Jan 11 11:59:46 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-170.bz2 Wed Jan 11 11:59:48 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-171.bz2 Wed Jan 11 11:59:51 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-input-172.bz2 Wed Jan 11 11:59:54 CET 2023
hb_report-Wed-11-Jan-2023/example02/pengine/pe-warn-16.bz2 Wed Jan 11 12:09:11 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-346.bz2 Wed Jan 11 12:09:26 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-347.bz2 Wed Jan 11 12:09:54 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-348.bz2 Wed Jan 11 12:12:46 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-349.bz2 Wed Jan 11 12:13:14 CET 2023
hb_report-Wed-11-Jan-2023/example01/pengine/pe-input-350.bz2 Wed Jan 11 12:14:52 CET 2023A node is "back" online:
May 17 15:51:01 [4234] node1 crmd: info: peer_update_callback: Client node1/peer now has status [online] (DC=true, changed=4000000)
What is going to be executed for a transition in details?
$ crm_simulate -S -x /tmp/pe-input-3902.bz2 | sed -n '/^Transition Summary/,/^Using the/p'
Transition Summary:
* Start rsc_ip_db2ptr_EWP ( node2 ) blocked
* Start rsc_nc_db2ptr_EWP ( node2 ) blocked
* Promote rsc_Db2_db2ptr_EWP:0 ( Slave -> Master node2 )
Executing Cluster Transition:
* Pseudo action: msl_Db2_db2ptr_EWP_pre_notify_promote_0
* Resource action: rsc_Db2_db2ptr_EWP notify on node2
* Pseudo action: msl_Db2_db2ptr_EWP_confirmed-pre_notify_promote_0
* Pseudo action: msl_Db2_db2ptr_EWP_promote_0
* Resource action: rsc_Db2_db2ptr_EWP promote on node2
* Pseudo action: msl_Db2_db2ptr_EWP_promoted_0
* Pseudo action: msl_Db2_db2ptr_EWP_post_notify_promoted_0
* Resource action: rsc_Db2_db2ptr_EWP notify on node2
* Pseudo action: msl_Db2_db2ptr_EWP_confirmed-post_notify_promoted_0
* Pseudo action: g_ip_db2ptr_EWP_start_0
* Resource action: rsc_Db2_db2ptr_EWP monitor=31000 on node2
Using the original execution date of: 2023-05-17 07:50:58Z
$ crm_simulate -VVVVVV -S -x /tmp/pe-input-3902.bz2 2>&1 | grep log_synapse_action
(log_synapse_action) debug: [Action 19]: Pending pseudo op g_ip_db2ptr_EWP_start_0 (priority: 0, waiting: 44)
(log_synapse_action) debug: [Action 58]: Pending resource op rsc_Db2_db2ptr_EWP_post_notify_promote_0 on node2 (priority: 1000000, waiting: 43)
(log_synapse_action) debug: [Action 57]: Pending resource op rsc_Db2_db2ptr_EWP_pre_notify_promote_0 on node2 (priority: 0, waiting: 41)
(log_synapse_action) debug: [Action 26]: Pending resource op rsc_Db2_db2ptr_EWP_monitor_31000 on node2 (priority: 0, waiting: 25 44)
(log_synapse_action) debug: [Action 25]: Pending resource op rsc_Db2_db2ptr_EWP_promote_0 on node2 (priority: 0, waiting: 39)
(log_synapse_action) debug: [Action 44]: Pending pseudo op msl_Db2_db2ptr_EWP_confirmed-post_notify_promoted_0 (priority: 1000000, waiting: 43 58)
(log_synapse_action) debug: [Action 43]: Pending pseudo op msl_Db2_db2ptr_EWP_post_notify_promoted_0 (priority: 1000000, waiting: 40 42)
(log_synapse_action) debug: [Action 42]: Pending pseudo op msl_Db2_db2ptr_EWP_confirmed-pre_notify_promote_0 (priority: 0, waiting: 41 57)
(log_synapse_action) debug: [Action 41]: Pending pseudo op msl_Db2_db2ptr_EWP_pre_notify_promote_0 (priority: 0, waiting: none)
(log_synapse_action) debug: [Action 40]: Pending pseudo op msl_Db2_db2ptr_EWP_promoted_0 (priority: 1000000, waiting: 25)
(log_synapse_action) debug: [Action 39]: Pending pseudo op msl_Db2_db2ptr_EWP_promote_0 (priority: 0, waiting: 42)logs must be gathered from all nodes
hb_report -f <start_time> <filename> # tarball with information,
# run on each node separately!
hb_report -f $(date --rfc-3339=date) # ./YYYY-MM-DD.tar.bz2crm cluster health | tee outputresource action failure increases failcount on a node
crm resource failcount <resource> show <node>
crm resource failcount <resource> delete <node> # reset
crm resource cleanup <resource> <node> # same stuffcrm_simulate -x <pe_file> -S # what was going on during life of clustersimulating a cluster network failure via iptables: https://www.suse.com/support/kb/doc/?id=000018699
# pacemaker 2.x
# filter cluter related events and search for 'pe-input' string which shows
# what pengine/scheduler decided how transition configuration should look like
$ grep -P \
'(SAPHana|sap|corosync|pacemaker-(attrd|based|controld|execd|schedulerd|fenced)|stonith|systemd)\[\d+\]' \
/var/log/pacemaker/pacemaker.log | lessA hack to print ocf-based RA required paramenters and other stuff
/usr/lib/ocf/resource.d/linbit/drbd meta-data | \
xmllint --xpath '//*/parameter[@required="1"]' -
<parameter name="drbd_resource" unique="1" required="1">
<longdesc lang="en">
The name of the drbd resource from the drbd.conf file.
</longdesc>
<shortdesc lang="en">drbd resource name</shortdesc>
<content type="string"/>
</parameter>
/usr/lib/ocf/resource.d/linbit/drbd meta-data | \
xmllint --xpath '//*/actions/action[@name="monitor"]' -
<action name="monitor" timeout="20" interval="20" role="Slave"/>
<action name="monitor" timeout="20" interval="10" role="Master"/>
oldhana2 was rebooted, cca around 13:20.
$ grep 'Linux version' oldhanad2/messages
2022-04-21T13:23:03.881400+02:00 oldhanad2 kernel: [ 0.000000] Linux version 5.3.18-150300.59.60-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Fri Mar 18 18:37:08 UTC 2022 (79e1683)Let's see if it was fenced...
$ sed -n '1,/^2022-04-21T13:23:03/p' ha-log.txt | \
grep -P \
'(SAPHana|sap|corosync|pacemaker-(attrd|based|controld|execd|schedulerd|fenced)|stonith|systemd)\[\d+\]' \
| grep -ni reboot
354307:2022-04-21T13:21:38.866101+02:00 oldhanad1 pacemaker-schedulerd[26189]: notice: * Fence (reboot) oldhanad2 'peer is no longer part of the cluster'
354312:2022-04-21T13:21:38.867641+02:00 oldhanad1 pacemaker-controld[26190]: notice: Requesting fencing (reboot) of node oldhanad2
354315:2022-04-21T13:21:38.868521+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Client pacemaker-controld.26190.b64beaf2 wants to fence (reboot) 'oldhanad2' with device '(any)'
354316:2022-04-21T13:21:38.868607+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Requesting peer fencing (reboot) targeting oldhanad2
354321:2022-04-21T13:21:38.989800+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Requesting that oldhanad1 perform 'reboot' action targeting oldhanad2
354322:2022-04-21T13:21:38.990116+02:00 oldhanad1 pacemaker-fenced[26186]: notice: killer is eligible to fence (reboot) oldhanad2: dynamic-list
354421:2022-04-21T13:21:51.257505+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Operation 'reboot' [1951] (call 2 from pacemaker-controld.26190) for host 'oldhanad2' with device 'killer' returned: 0 (OK)
354422:2022-04-21T13:21:51.257803+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Operation 'reboot' targeting oldhanad2 on oldhanad1 for pacemaker-controld.26190@oldhanad1.a127b270: OK
354424:2022-04-21T13:21:51.259400+02:00 oldhanad1 pacemaker-controld[26190]: notice: Peer oldhanad2 was terminated (reboot) by oldhanad1 on behalf of pacemaker-controld.26190: OKLet's see what was corosync ring before the fence/reboot happened...
$ sed -n '1,/^2022-04-21T13:23:03/p' ha-log.txt | \
grep -P 'corosync.*(TOTEM|QUORUM|CPG)' | \
grep -Pv '(ignoring|Invalid packet data|Digest does not match)'
2022-04-21T13:21:31.828859+02:00 oldhanad1 corosync[26152]: [TOTEM ] A processor failed, forming new configuration.
2022-04-21T13:21:37.830133+02:00 oldhanad1 corosync[26152]: [TOTEM ] A new membership (10.162.193.133:116) was formed. Members left: 178438534
2022-04-21T13:21:37.830212+02:00 oldhanad1 corosync[26152]: [TOTEM ] Failed to receive the leave message. failed: 178438534
2022-04-21T13:21:37.830267+02:00 oldhanad1 corosync[26152]: [CPG ] downlist left_list: 1 received
2022-04-21T13:21:37.831069+02:00 oldhanad1 corosync[26152]: [QUORUM] Members[1]: 178438533
2022-04-21T13:21:38.743694+02:00 oldhanad1 corosync[26152]: [TOTEM ] Automatically recovered ring 0The above shows ungraceful disappearance of the node from corosync ring. In this
case kill -9 was used but the same could be if whole network communication
would stop working between nodes.
$ sed -n '1,/^2022-04-21T13:23:03/p' ha-log.txt | \
grep -P '(pacemaker-(attrd|based|controld|execd|schedulerd|fenced)|stonith)\[\d+\]'
2022-04-21T13:21:37.831919+02:00 oldhanad1 pacemaker-controld[26190]: notice: Our peer on the DC (oldhanad2) is dead
2022-04-21T13:21:37.832181+02:00 oldhanad1 pacemaker-based[26185]: notice: Node oldhanad2 state is now lost
2022-04-21T13:21:37.832419+02:00 oldhanad1 pacemaker-attrd[26188]: notice: Lost attribute writer oldhanad2
2022-04-21T13:21:37.832478+02:00 oldhanad1 pacemaker-controld[26190]: notice: State transition S_NOT_DC -> S_ELECTION
2022-04-21T13:21:37.832525+02:00 oldhanad1 pacemaker-based[26185]: notice: Purged 1 peer with id=178438534 and/or uname=oldhanad2 from the membership cache
2022-04-21T13:21:37.832567+02:00 oldhanad1 pacemaker-controld[26190]: notice: Node oldhanad2 state is now lost
2022-04-21T13:21:37.832621+02:00 oldhanad1 pacemaker-attrd[26188]: notice: Node oldhanad2 state is now lost
2022-04-21T13:21:37.832667+02:00 oldhanad1 pacemaker-attrd[26188]: notice: Removing all oldhanad2 attributes for peer loss
2022-04-21T13:21:37.832707+02:00 oldhanad1 pacemaker-attrd[26188]: notice: Purged 1 peer with id=178438534 and/or uname=oldhanad2 from the membership cache
2022-04-21T13:21:37.832746+02:00 oldhanad1 pacemaker-attrd[26188]: notice: Recorded local node as attribute writer (was unset)
2022-04-21T13:21:37.832786+02:00 oldhanad1 pacemaker-controld[26190]: notice: State transition S_ELECTION -> S_INTEGRATION
2022-04-21T13:21:37.833037+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Node oldhanad2 state is now lost
2022-04-21T13:21:37.833125+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Purged 1 peer with id=178438534 and/or uname=oldhanad2 from the membership cache
2022-04-21T13:21:37.859192+02:00 oldhanad1 pacemaker-controld[26190]: notice: Updating quorum status to true (call=128)
2022-04-21T13:21:38.865044+02:00 oldhanad1 pacemaker-schedulerd[26189]: warning: Cluster node oldhanad2 will be fenced: peer is no longer part of the cluster
2022-04-21T13:21:38.865201+02:00 oldhanad1 pacemaker-schedulerd[26189]: warning: Node oldhanad2 is unclean
2022-04-21T13:21:38.865701+02:00 oldhanad1 pacemaker-schedulerd[26189]: warning: killer_stop_0 on oldhanad2 is unrunnable (node is offline)
2022-04-21T13:21:38.865825+02:00 oldhanad1 pacemaker-schedulerd[26189]: warning: p-IP1_stop_0 on oldhanad2 is unrunnable (node is offline)
2022-04-21T13:21:38.865907+02:00 oldhanad1 pacemaker-schedulerd[26189]: warning: Scheduling Node oldhanad2 for STONITH
2022-04-21T13:21:38.866101+02:00 oldhanad1 pacemaker-schedulerd[26189]: notice: * Fence (reboot) oldhanad2 'peer is no longer part of the cluster'
2022-04-21T13:21:38.866214+02:00 oldhanad1 pacemaker-schedulerd[26189]: notice: * Move killer ( oldhanad2 -> oldhanad1 )
2022-04-21T13:21:38.866304+02:00 oldhanad1 pacemaker-schedulerd[26189]: notice: * Move p-IP1 ( oldhanad2 -> oldhanad1 )
2022-04-21T13:21:38.867223+02:00 oldhanad1 pacemaker-schedulerd[26189]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-6.bz2
2022-04-21T13:21:38.867566+02:00 oldhanad1 pacemaker-controld[26190]: notice: Processing graph 0 (ref=pe_calc-dc-1650540098-21) derived from /var/lib/pacemaker/pengine/pe-warn-6.bz2
2022-04-21T13:21:38.867641+02:00 oldhanad1 pacemaker-controld[26190]: notice: Requesting fencing (reboot) of node oldhanad2
2022-04-21T13:21:38.867755+02:00 oldhanad1 pacemaker-controld[26190]: notice: Initiating start operation killer_start_0 locally on oldhanad1
2022-04-21T13:21:38.867855+02:00 oldhanad1 pacemaker-controld[26190]: notice: Requesting local execution of start operation for killer on oldhanad1
2022-04-21T13:21:38.868521+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Client pacemaker-controld.26190.b64beaf2 wants to fence (reboot) 'oldhanad2' with device '(any)'
2022-04-21T13:21:38.868607+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Requesting peer fencing (reboot) targeting oldhanad2
2022-04-21T13:21:38.868833+02:00 oldhanad1 pacemaker-execd[26187]: notice: executing - rsc:killer action:start call_id:70
2022-04-21T13:21:38.989800+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Requesting that oldhanad1 perform 'reboot' action targeting oldhanad2
2022-04-21T13:21:38.990116+02:00 oldhanad1 pacemaker-fenced[26186]: notice: killer is eligible to fence (reboot) oldhanad2: dynamic-list
2022-04-21T13:21:40.078030+02:00 oldhanad1 pacemaker-execd[26187]: notice: killer start (call 70) exited with status 0 (execution time 1208ms, queue time 0ms)
2022-04-21T13:21:40.078369+02:00 oldhanad1 pacemaker-controld[26190]: notice: Result of start operation for killer on oldhanad1: ok
2022-04-21T13:21:51.257505+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Operation 'reboot' [1951] (call 2 from pacemaker-controld.26190) for host 'oldhanad2' with device 'killer' returned: 0 (OK)
2022-04-21T13:21:51.257803+02:00 oldhanad1 pacemaker-fenced[26186]: notice: Operation 'reboot' targeting oldhanad2 on oldhanad1 for pacemaker-controld.26190@oldhanad1.a127b270: OK
2022-04-21T13:21:51.259260+02:00 oldhanad1 pacemaker-controld[26190]: notice: Stonith operation 2/1:0:0:455cd14f-a928-4de5-a0df-2e579a28b160: OK (0)
2022-04-21T13:21:51.259400+02:00 oldhanad1 pacemaker-controld[26190]: notice: Peer oldhanad2 was terminated (reboot) by oldhanad1 on behalf of pacemaker-controld.26190: OK
2022-04-21T13:21:51.259532+02:00 oldhanad1 pacemaker-controld[26190]: notice: Initiating start operation p-IP1_start_0 locally on oldhanad1
2022-04-21T13:21:51.259653+02:00 oldhanad1 pacemaker-controld[26190]: notice: Requesting local execution of start operation for p-IP1 on oldhanad1
2022-04-21T13:21:51.260111+02:00 oldhanad1 pacemaker-execd[26187]: notice: executing - rsc:p-IP1 action:start call_id:71
2022-04-21T13:21:51.752904+02:00 oldhanad1 pacemaker-execd[26187]: notice: p-IP1 start (call 71, PID 1998) exited with status 0 (execution time 493ms, queue time 0ms)
2022-04-21T13:21:51.753448+02:00 oldhanad1 pacemaker-controld[26190]: notice: Result of start operation for p-IP1 on oldhanad1: ok
2022-04-21T13:21:51.754685+02:00 oldhanad1 pacemaker-controld[26190]: notice: Transition 0 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-6.bz2): Complete
2022-04-21T13:21:51.754792+02:00 oldhanad1 pacemaker-controld[26190]: notice: State transition S_TRANSITION_ENGINE -> S_IDLEWhat happened? controld (cluster resource manager) is informed node is dead,
finally schedulerd (policy engine) decided to fence the node because it is
unclean (it does not have an idea what is going on with this node), fenced
is asked to prepare fencing the node, execd (local resource manager) in
practice runs fence agent to STONITH the node.
Summary:
- an unexpected node left
corosync[26152]: [TOTEM ] Failed to receive the leave message. failed: 178438534 - because of unclean status the node is going to be fenced
pacemaker-schedulerd[26189]: notice: * Fence (reboot) oldhanad2 'peer is no longer part of the cluster' - fence actually happends
hawk needs that a user is in haclient group; it uses PAM (passwd service):
and web-based hawk (suse) 7630/tcp
# from a login attempt
$ strace -e status=successful -s 256 -f -e trace=file $(systemd-cgls -u hawk-backend.service | tail -n +2 | awk '{ ORS=" "; printf("-p %d ", $2) }') 2>&1 | grep /etc
[pid 6849] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6849] stat("/etc/crypto-policies/back-ends/gnutls.config", {st_mode=S_IFREG|0644, st_size=1413, ...}) = 0
[pid 6849] openat(AT_FDCWD, "/etc/crypto-policies/back-ends/gnutls.config", O_RDONLY|O_CLOEXEC) = 3
[pid 6855] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6855] stat("/etc/crypto-policies/back-ends/gnutls.config", {st_mode=S_IFREG|0644, st_size=1413, ...}) = 0
[pid 6855] openat(AT_FDCWD, "/etc/crypto-policies/back-ends/gnutls.config", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] access("/etc/pam.d/passwd", R_OK) = 0
[pid 6856] openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/group", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 6856] openat(AT_FDCWD, "/etc/pam.d/passwd", O_RDONLY) = 3
[pid 6856] openat(AT_FDCWD, "/etc/pam.d/common-auth", O_RDONLY) = 4
[pid 6856] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
[pid 6856] openat(AT_FDCWD, "/etc/pam.d/common-account", O_RDONLY) = 4
[pid 6856] openat(AT_FDCWD, "/etc/pam.d/common-password", O_RDONLY) = 4
[pid 6856] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
[pid 6856] openat(AT_FDCWD, "/etc/pam.d/common-session", O_RDONLY) = 4
[pid 6856] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
[pid 6856] openat(AT_FDCWD, "/etc/pam.d/other", O_RDONLY) = 3
[pid 6856] openat(AT_FDCWD, "/etc/login.defs", O_RDONLY) = 3
[pid 6856] openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/security/pam_env.conf", O_RDONLY) = 3
[pid 6856] openat(AT_FDCWD, "/etc/environment", O_RDONLY) = 3
[pid 6856] openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[pid 6856] openat(AT_FDCWD, "/etc/login.defs", O_RDONLY) = 3
[pid 6857] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6857] stat("/etc/crypto-policies/back-ends/gnutls.config", {st_mode=S_IFREG|0644, st_size=1413, ...}) = 0
[pid 6857] openat(AT_FDCWD, "/etc/crypto-policies/back-ends/gnutls.config", O_RDONLY|O_CLOEXEC) = 3
[pid 6858] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6858] stat("/etc/crypto-policies/back-ends/gnutls.config", {st_mode=S_IFREG|0644, st_size=1413, ...}) = 0
[pid 6858] openat(AT_FDCWD, "/etc/crypto-policies/back-ends/gnutls.config", O_RDONLY|O_CLOEXEC) = 3
[pid 6864] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 6864] stat("/etc/crypto-policies/back-ends/gnutls.config", {st_mode=S_IFREG|0644, st_size=1413, ...}) = 0
[pid 6864] openat(AT_FDCWD, "/etc/crypto-policies/back-ends/gnutls.config", O_RDONLY|O_CLOEXEC) = 3What is purpose of SBD?
- fence agent/device (STONITH)
- self-fence if SBD device can't be read for some time (Shoot myself in the head, SMITH) ?? https://www.suse.com/support/kb/doc/?id=000017950
- monitor SBD device (if any)
- monitor Pacemaker CIB
- monitor corosync health
SBD_STARTMODE=clean in /etc/sysconfig/sdb (SUSE) to prevent
starting cluster if non-clean state exists on SBD
SBD sets a used watchdog timeout based on SBD_WATCHDOG_TIMEOUT on its start.
SBD_WATCHDOG_TIMEOUT (e.g. in /etc/sysconfig/sbd) is already the timeout the hardware watchdog is configured to by sbd-daemon. sbd-daemon is triggering faster - timeout_loop defaults to 1s but is configurable. Cf. https://lists.clusterlabs.org/pipermail/users/2016-December/021051.html
$ grep -H '' /sys/class/watchdog/watchdog0/{timeout,identity,state} 2>/dev/null
/sys/class/watchdog/watchdog0/timeout:5
/sys/class/watchdog/watchdog0/identity:iTCO_wdt
/sys/class/watchdog/watchdog0/state:active
$ grep -Pv '^\s*(#|$)' /etc/sysconfig/sbd | grep WATCHDOG_TIMEOUT
SBD_WATCHDOG_TIMEOUT=5RH does not support softdog based SBDs!
sbd query-watchdog # check if sbd finds watcdog devices
sbd -w <watchdog_device> test-watchdog # test if reset via watchdog works,
# this RESETS node!sbd watches both corosync and pacemaker; as for Pacemaker:
Pacemaker is setting the node unclean which pacemaker-watcher (one of sbd daemons) sees as it is connected to the cib. This is why the mechanism is working (sort of - see the discussion in my pull request in the sbd-repo) on nodes without stonithd as well (remote-nodes). If you are running sbd with a block-device there is of course this way of communication as well between pacemaker and sbd. (e.g. via fence_sbd fence-agent) Cf. https://lists.clusterlabs.org/pipermail/users/2016-December/021074.html
SBD watchers:
$ systemd-cgls -u sbd.service
Unit sbd.service (/system.slice/sbd.service):
├─2703 sbd: inquisitor
├─2704 sbd: watcher: /dev/disk/by-id/scsi-36001405714d7f9602b045ee82274b815 - slot: 5 - uuid: 41e0e03c-5618-459e-b3ea-73ddb98d442a
├─2705 sbd: watcher: Pacemaker
└─2706 sbd: watcher: Clusterinquisitor, a kind of dead-men switchwatcher: /dev/disk/by-id/scsi-36001405714d7f9602b045ee82274b815 - slot: 5 - uuid: 41e0e03c-5618-459e-b3ea-73ddb98d442a, monitors shared disk devicewatcher: Pacemaker, monitors if the cluster partition the node is in is still quorate according to Pacemaker CIB, and the node itself is still considered online and healthy by Pacemakerwatcher: Cluster, monitors if the cluster is still quorate according to Corosync's node count
As for corosync watcher, it seems it is "registred" into corosync:
$ corosync-cpgtool | grep -A1 sbd
sbd:cluster\x00
2706 178438533 (10.162.193.133)
$ ps auxww | grep '[2]706'
root 2706 0.0 0.0 135268 39364 ? SL Apr21 3:59 sbd: watcher: Cluster
sbd -d /dev/<shared_lun> create # prepares shared lun for SBD
sbd -d /dev/<shared_lun> dump # info about SBD# . /etc/sysconfig/sbd;export SBD_DEVICE;sbd dump;sbd list
==Dumping header on disk /dev/loop0
Header version : 2.1
UUID : 8233d0a1-1f94-4d88-9e22-485d4dfc1080
Number of slots : 255
Sector size : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 10
# dd if=/dev/loop0 bs=32 count=2 2>/dev/null | xxd
00000000: 5342 445f 5342 445f 02ff 0000 0000 0200 SBD_SBD_........
00000010: 0000 0005 0000 0002 0000 0001 0000 000a ................
00000020: 0182 33d0 a11f 944d 889e 2248 5d4d fc10 ..3....M.."H]M..
00000030: 8000 0000 0000 0000 0000 0000 0000 0000 ................systemctl enable sbd # MUST be enabled, creates dependency
# on cluster stack services
systemctl list-dependencies sbd --reverse --all # check sbd is part of cluster stack
sbd -d /dev/<shared_lun> list # list nodes slots and messages
# to make following tests work, sbd has to be running (as part of corosync)
sbd -d /dev/<shared_lun> message <node> test # node's sbd would log the test
sbd -d <block_dev> message <node> clear # clear sbd state for a node, restart pacemaker!Two disks SBD:
TODO: ...
int quorum_read(int good_servants)
{
if (disk_count > 2)
return (good_servants > disk_count/2);
else
return (good_servants > 0);
}
Diskless SBD:
Usually three nodes, a kind of self-fence feature.
- inquisitor
- watcher: Pacemaker
- watcher: Cluster
It's not visible on crm configure show as resource, only property
cib-bootstrap-options stonith options need to be set. It will
self-fence if cannot see other nodes.
In diskless SBD, this is what stonith_admin thinks about it:
$ crm configure show type:property
property cib-bootstrap-options: \
have-watchdog=true \
no-quorum-policy=freeze \
dc-version="2.1.5+20221208.a3f44794f-150500.6.14.4-2.1.5+20221208.a3f44794f" \
cluster-infrastructure=corosync \
cluster-name=jb155sapqe \
stonith-watchdog-timeout=-1
$ crm configure show related:external/sbd | grep -c '' # that is, no 'external/sbd' primitive
0
$ stonith_admin -L
watchdog
1 fence device found
$ stonith_admin -l $(hostname)
watchdog
1 fence device foundIn pacemaker.log the above query logs...
May 10 08:28:15.633 jb155sapqe02 pacemaker-fenced [1865] (can_fence_host_with_device) info: watchdog is eligible to fence (off) jb155sapqe02: static-list
Thus, it is a hack, IMO.
Clusterfile syncronization, /etc/csync2/csync2.cfg, 30865/tcp, key-based authentication
systemctl cat csync2.socket # at suse
[Socket]
ListenStream=30865
Accept=yes
[Install]
WantedBy=sockets.targetcsync2 -xvWrong peer certificate!
jb155sapqe02:~ # csync2 -xv
Connecting to host jb155sapqe01 (SSL) ...
Connect to 192.168.252.100:30865 (jb155sapqe01).
Peer did provide a wrong SSL X509 cetrificate.
jb155sapqe02:~ # openssl x509 \
-in /etc/csync2/csync2_ssl_cert.pem -inform PEM -outform DER | xxd -p
308202cf30820256a00302010202140b22b7456312e2ad6410c924d2f55d
bf0bace40d300a06082a8648ce3d04030230819e310b3009060355040613
022d2d3112301006035504080c09536f6d6553746174653111300f060355
04070c08536f6d654369747931193017060355040a0c10536f6d654f7267
616e697a6174696f6e31193017060355040b0c10536f6d654f7267616e69
7a6174696f6e3111300f06035504030c08536f6d654e616d65311f301d06
092a864886f70d01090116106e616d65406578616d706c652e636f6d301e
170d3234313132383133343732335a170d3333303231343133343732335a
30819e310b3009060355040613022d2d3112301006035504080c09536f6d
6553746174653111300f06035504070c08536f6d65436974793119301706
0355040a0c10536f6d654f7267616e697a6174696f6e3119301706035504
0b0c10536f6d654f7267616e697a6174696f6e3111300f06035504030c08
536f6d654e616d65311f301d06092a864886f70d01090116106e616d6540
6578616d706c652e636f6d3076301006072a8648ce3d020106052b810400
2203620004af32f54fd831a468c78bd4bd4c271fad9d19fd2e1ec6cf18c4
c6ca8edaa8529cca22f811e979bbb5fbc5eb53a3c07308c9c755671196fc
70f6345294cd5422c73a7a592406869028d5fdd5bf85421708e230c6a6eb
d752cc9e9429d17c5adf34a3533051301d0603551d0e04160414c8757186
c1a1b3705ffdef78131998a961c9849f301f0603551d23041830168014c8
757186c1a1b3705ffdef78131998a961c9849f300f0603551d130101ff04
0530030101ff300a06082a8648ce3d04030203670030640230425117a116
e284b9c5bc01862c91e21e233f57044b4597cda2f775caa770427a9bf118
00c3e0fa17cb28f535be3657ee02300ae2ea87439066cc793b9d640b44b0
23f34d33f8ee65615a2d31a8e94657ad5bda7cab220345bcecfeb16ade26
8341f7
jb155sapqe02:~ # ssh jb155sapqe01 \
"sqlite3 /var/lib/csync2/jb155sapqe01.db3 \"SELECT certdata from x509_cert where peername = 'jb155sapqe02';\"" | fold -w60
308202CF30820256A0030201020214666D487F078E301754D06DB028FF5E
C4B5F8E782300A06082A8648CE3D04030230819E310B3009060355040613
022D2D3112301006035504080C09536F6D6553746174653111300F060355
04070C08536F6D654369747931193017060355040A0C10536F6D654F7267
616E697A6174696F6E31193017060355040B0C10536F6D654F7267616E69
7A6174696F6E3111300F06035504030C08536F6D654E616D65311F301D06
092A864886F70D01090116106E616D65406578616D706C652E636F6D301E
170D3233303931323037333632305A170D3331313132393037333632305A
30819E310B3009060355040613022D2D3112301006035504080C09536F6D
6553746174653111300F06035504070C08536F6D65436974793119301706
0355040A0C10536F6D654F7267616E697A6174696F6E3119301706035504
0B0C10536F6D654F7267616E697A6174696F6E3111300F06035504030C08
536F6D654E616D65311F301D06092A864886F70D01090116106E616D6540
6578616D706C652E636F6D3076301006072A8648CE3D020106052B810400
2203620004D395908B7DC38DF493366BB8FF92DD99ABBA3C8F8423CAEF0A
CEB1A7C46A3EC04DB83E82BDF61C43A53716FCC4F01C9BE8D664E62BE3DD
590F0E5AAC262E173EDE1ECC6853AEB403ED45D096C8C4CA2A649DD9EBEA
71BF1195F57B87E890E91AA3533051301D0603551D0E04160414C5997188
D0FB5CC99344780BF729203DDBA4608C301F0603551D23041830168014C5
997188D0FB5CC99344780BF729203DDBA4608C300F0603551D130101FF04
0530030101FF300A06082A8648CE3D040302036700306402304A3837F9CE
7FE76E5CA1A344861D1B00118AF2A39D700D87A1A128A2946085509F2B3C
47B1E886DB37A25561835152A302307270994CB73C1AD05B40FE4DE42C11
3C9FC10BDA5E7D771C1301439CF958409C62B973060F7A795F08F75F88C5
EB3B33
# so they really differ!Some notes about drbd design:
- primary/secondary: primary has r/w, secondary NOT even r/o !!!
- single primary mode: ONLY ONE node has r/w (a fail-over scenario)
- dual-primary mode: more nodes has r/w (this would require a filesystem which implements locking, eg. ocfs2)
- async replication: aka 'Protocol A', write is considered completed on primary node(s) if written to local disk and the replication packet placed in local TCP send buffer (thus, when this 'local' machine crashed, no updates on the other node !!!)
- sync replicaiton: aka 'Protocol C', write is considered completed on primary node(s) if local and remote disk writes are confirmed
- DRBD >= 9 allows multiple nodes replication without 'stacking'
- inconsistent data state: data are partly obsolete, partly updated
- suspended replication: when a replication link is congested, drbd can temporarily suspend replication
- online device verification: an intergrity check, a good candicate for cron job (think about it as mdraid sync_action-like action)
- split brain: a situation when both nodes were switched to primary while being previosly disconnected (ie. likely two diverging sets of data exist); do NOT confuse with a cluster partition !!!
Configs /etc/drbd.{conf,d/*.res}
drbdadm [create | up | status] <resource>
drbdadm new-current-uuid --clear-bitmap <resource>/0DRBD can be used under Pacemaker/Corosync, the RA is in drbd-utils package on SUSE.
primitive p-drbd_<resource> ocf:linbit:drbd \
params drbd_resource=<drbd_resource> \
op monitor interval=15 role=Master \
op monitor interval=30 role=Slave
ms ms-drbd_<resource> \
meta master-max=1 \
master-node-max=1 \
clone-max=2 \
clone-node-max=1 \
notify=true
Handlers, fence-peer, fences other node and puts location constraint
so other node cannot be used if not synced:
$ grep crm-fence /var/log/messages
2023-05-22T12:47:51.528164+02:00 node02 crm-fence-peer.9.sh[18335]: DRBD_BACKING_DEV_0=/dev/sdb DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.40.40.42 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=1 DRBD_NODE_ID_0=node01 DRBD_NODE_ID_1=node02 DRBD_PEER_ADDRESS=10.40.40.41 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=0 DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000002 /usr/lib/drbd/crm-fence-peer.9.sh
2023-05-22T12:47:51.930204+02:00 node02 crm-fence-peer.9.sh[18335]: /
2023-05-22T12:47:51.931920+02:00 node02 crm-fence-peer.9.sh[18335]: INFO peers are (node-level) fenced, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r0-ms-p-drbd-r0'After fence, the location constraint is created:
$ crm configure show type:location
location drbd-fence-by-handler-r0-ms-p-drbd-r0 ms-p-drbd-r0 \
rule $role=Master -inf: #uname ne node02
after-resync-target is a handler which removes location constraint
when the node is in sync:
$ grep -P '(Linux version|unfence)' /var/log/messages | tail -n3
2023-05-22T13:48:19.311670+02:00 node01 kernel: [ 0.000000][ T0] Linux version 5.14.21-150400.24.21-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Wed Sep 7 06:51:18 UTC 2022 (974d0aa)
2023-05-22T13:52:46.024106+02:00 node01 crm-unfence-peer.9.sh[3681]: DRBD_BACKING_DEV=/dev/sdb DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connected DRBD_LL_DISK=/dev/sdb DRBD_MINOR=0 DRBD_MY_ADDRESS=10.40.40.41 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=node01 DRBD_NODE_ID_1=node02 DRBD_PEER_ADDRESS=10.40.40.42 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES='' /usr/lib/drbd/crm-unfence-peer.9.sh
2023-05-22T13:52:46.179836+02:00 node01 crm-unfence-peer.9.sh[3681]: INFO Removed constraint 'drbd-fence-by-handler-r0-ms-p-drbd-r0'
On first node, drbdadm up <res> is executed (drbd01 is the resource name here):
2023-02-06T16:43:30.921702+01:00 jb154sapqe01 kernel: [18310.608989][T25362] drbd drbd01: Starting worker thread (from drbdsetup [25362])
2023-02-06T16:43:30.926543+01:00 jb154sapqe01 kernel: [18310.615090][T25368] drbd drbd01 jb154sapqe02: Starting sender thread (from drbdsetup [25368])
2023-02-06T16:43:31.002626+01:00 jb154sapqe01 kernel: [18310.688562][T25381] drbd drbd01/0 drbd1: meta-data IO uses: blk-bio
2023-02-06T16:43:31.002637+01:00 jb154sapqe01 kernel: [18310.689931][T25381] drbd drbd01/0 drbd1: disk( Diskless -> Attaching )
2023-02-06T16:43:31.002638+01:00 jb154sapqe01 kernel: [18310.691079][T25381] drbd drbd01/0 drbd1: Maximum number of peer devices = 1
2023-02-06T16:43:31.002639+01:00 jb154sapqe01 kernel: [18310.692308][T25381] drbd drbd01: Method to ensure write ordering: flush
2023-02-06T16:43:31.006611+01:00 jb154sapqe01 kernel: [18310.693393][T25381] drbd drbd01/0 drbd1: drbd_bm_resize called with capacity == 2097016
2023-02-06T16:43:31.006621+01:00 jb154sapqe01 kernel: [18310.694739][T25381] drbd drbd01/0 drbd1: resync bitmap: bits=262127 words=4096 pages=8
2023-02-06T16:43:31.006622+01:00 jb154sapqe01 kernel: [18310.695950][T25381] drbd1: detected capacity change from 0 to 2097016
2023-02-06T16:43:31.006623+01:00 jb154sapqe01 kernel: [18310.696925][T25381] drbd drbd01/0 drbd1: size = 1024 MB (1048508 KB)
2023-02-06T16:43:31.011445+01:00 jb154sapqe01 kernel: [18310.699910][T25381] drbd drbd01/0 drbd1: recounting of set bits took additional 0ms
2023-02-06T16:43:31.011454+01:00 jb154sapqe01 kernel: [18310.701195][T25381] drbd drbd01/0 drbd1: disk( Attaching -> Inconsistent )
2023-02-06T16:43:31.014416+01:00 jb154sapqe01 kernel: [18310.702390][T25381] drbd drbd01/0 drbd1 jb154sapqe02: pdsk( DUnknown -> Outdated )
2023-02-06T16:43:31.014421+01:00 jb154sapqe01 kernel: [18310.703587][T25381] drbd drbd01/0 drbd1: attached to current UUID: 0000000000000004
2023-02-06T16:43:31.022032+01:00 jb154sapqe01 kernel: [18310.710893][T25384] drbd drbd01 jb154sapqe02: conn( StandAlone -> Unconnected )
2023-02-06T16:43:31.025714+01:00 jb154sapqe01 kernel: [18310.713251][T25363] drbd drbd01 jb154sapqe02: Starting receiver thread (from drbd_w_drbd01 [25363])
2023-02-06T16:43:31.025723+01:00 jb154sapqe01 kernel: [18310.714925][T25388] drbd drbd01 jb154sapqe02: conn( Unconnected -> Connecting )
Then on the second node, drbdadm up <res> was executed:
2023-02-06T16:44:25.097928+01:00 jb154sapqe02 kernel: [ 4133.662936][T23971] drbd drbd01: Starting worker thread (from drbdsetup [23971])
2023-02-06T16:44:25.105031+01:00 jb154sapqe02 kernel: [ 4133.668966][T23977] drbd drbd01 jb154sapqe01: Starting sender thread (from drbdsetup [23977])
2023-02-06T16:44:25.121096+01:00 jb154sapqe02 systemd[1]: Started Disk encryption utility (cryptctl) - contact key server to unlock disk sys-devices-virtual-block-drbd1 and keep the server informed.
2023-02-06T16:44:25.157486+01:00 jb154sapqe02 kernel: [ 4133.712273][T23985] drbd drbd01/0 drbd1: meta-data IO uses: blk-bio
2023-02-06T16:44:25.157517+01:00 jb154sapqe02 kernel: [ 4133.714282][T23985] drbd drbd01/0 drbd1: disk( Diskless -> Attaching )
2023-02-06T16:44:25.157520+01:00 jb154sapqe02 kernel: [ 4133.716103][T23985] drbd drbd01/0 drbd1: Maximum number of peer devices = 1
2023-02-06T16:44:25.157522+01:00 jb154sapqe02 kernel: [ 4133.717721][T23985] drbd drbd01: Method to ensure write ordering: flush
2023-02-06T16:44:25.157522+01:00 jb154sapqe02 kernel: [ 4133.719474][T23985] drbd drbd01/0 drbd1: drbd_bm_resize called with capacity == 2097016
2023-02-06T16:44:25.157524+01:00 jb154sapqe02 kernel: [ 4133.721096][T23985] drbd drbd01/0 drbd1: resync bitmap: bits=262127 words=4096 pages=8
2023-02-06T16:44:25.157527+01:00 jb154sapqe02 kernel: [ 4133.722356][T23985] drbd1: detected capacity change from 0 to 2097016
2023-02-06T16:44:25.157528+01:00 jb154sapqe02 kernel: [ 4133.723414][T23985] drbd drbd01/0 drbd1: size = 1024 MB (1048508 KB)
2023-02-06T16:44:25.179919+01:00 jb154sapqe02 kernel: [ 4133.739954][T23985] drbd drbd01/0 drbd1: recounting of set bits took additional 0ms
2023-02-06T16:44:25.179943+01:00 jb154sapqe02 kernel: [ 4133.741308][T23985] drbd drbd01/0 drbd1: disk( Attaching -> Inconsistent )
2023-02-06T16:44:25.179948+01:00 jb154sapqe02 kernel: [ 4133.742480][T23985] drbd drbd01/0 drbd1 jb154sapqe01: pdsk( DUnknown -> Outdated )
2023-02-06T16:44:25.179949+01:00 jb154sapqe02 kernel: [ 4133.743973][T23985] drbd drbd01/0 drbd1: attached to current UUID: 0000000000000004
2023-02-06T16:44:25.206361+01:00 jb154sapqe02 kernel: [ 4133.771504][T24001] drbd drbd01 jb154sapqe01: conn( StandAlone -> Unconnected )
2023-02-06T16:44:25.209702+01:00 jb154sapqe02 kernel: [ 4133.773775][T23972] drbd drbd01 jb154sapqe01: Starting receiver thread (from drbd_w_drbd01 [23972])
2023-02-06T16:44:25.213091+01:00 jb154sapqe02 kernel: [ 4133.777431][T24004] drbd drbd01 jb154sapqe01: conn( Unconnected -> Connecting )
2023-02-06T16:44:25.753573+01:00 jb154sapqe02 kernel: [ 4134.315811][T24004] drbd drbd01 jb154sapqe01: Handshake to peer 0 successful: Agreed network protocol version 120
2023-02-06T16:44:25.753606+01:00 jb154sapqe02 kernel: [ 4134.317930][T24004] drbd drbd01 jb154sapqe01: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
2023-02-06T16:44:25.753620+01:00 jb154sapqe02 kernel: [ 4134.320172][T24004] drbd drbd01 jb154sapqe01: Starting ack_recv thread (from drbd_r_drbd01 [24004])
2023-02-06T16:44:25.845197+01:00 jb154sapqe02 kernel: [ 4134.406569][T24004] drbd drbd01 jb154sapqe01: Preparing remote state change 639728590
2023-02-06T16:44:25.863084+01:00 jb154sapqe02 kernel: [ 4134.423659][T24004] drbd drbd01/0 drbd1 jb154sapqe01: drbd_sync_handshake:
2023-02-06T16:44:25.863102+01:00 jb154sapqe02 kernel: [ 4134.424900][T24004] drbd drbd01/0 drbd1 jb154sapqe01: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
2023-02-06T16:44:25.863105+01:00 jb154sapqe02 kernel: [ 4134.427164][T24004] drbd drbd01/0 drbd1 jb154sapqe01: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
2023-02-06T16:44:25.863107+01:00 jb154sapqe02 kernel: [ 4134.429441][T24004] drbd drbd01/0 drbd1 jb154sapqe01: uuid_compare()=no-sync by rule=just-created-both
2023-02-06T16:44:25.873019+01:00 jb154sapqe02 kernel: [ 4134.434324][T24004] drbd drbd01 jb154sapqe01: Committing remote state change 639728590 (primary_nodes=0)
2023-02-06T16:44:25.873030+01:00 jb154sapqe02 kernel: [ 4134.436028][T24004] drbd drbd01 jb154sapqe01: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
2023-02-06T16:44:25.873031+01:00 jb154sapqe02 kernel: [ 4134.437557][T24004] drbd drbd01/0 drbd1 jb154sapqe01: pdsk( Outdated -> Inconsistent ) repl( Off -> Established )
And the log on the first node continues...
2023-02-06T16:44:25.754225+01:00 jb154sapqe01 kernel: [18365.439854][T25388] drbd drbd01 jb154sapqe02: Handshake to peer 1 successful: Agreed network protocol version 120
2023-02-06T16:44:25.754257+01:00 jb154sapqe01 kernel: [18365.441974][T25388] drbd drbd01 jb154sapqe02: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
2023-02-06T16:44:25.754263+01:00 jb154sapqe01 kernel: [18365.444465][T25388] drbd drbd01 jb154sapqe02: Starting ack_recv thread (from drbd_r_drbd01 [25388])
2023-02-06T16:44:25.841754+01:00 jb154sapqe01 kernel: [18365.528511][T25370] drbd drbd01: Preparing cluster-wide state change 639728590 (0->1 499/146)
2023-02-06T16:44:25.859864+01:00 jb154sapqe01 kernel: [18365.544541][T25388] drbd drbd01/0 drbd1 jb154sapqe02: drbd_sync_handshake:
2023-02-06T16:44:25.859876+01:00 jb154sapqe01 kernel: [18365.545914][T25388] drbd drbd01/0 drbd1 jb154sapqe02: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
2023-02-06T16:44:25.859879+01:00 jb154sapqe01 kernel: [18365.548222][T25388] drbd drbd01/0 drbd1 jb154sapqe02: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:24
2023-02-06T16:44:25.859880+01:00 jb154sapqe01 kernel: [18365.550515][T25388] drbd drbd01/0 drbd1 jb154sapqe02: uuid_compare()=no-sync by rule=just-created-both
2023-02-06T16:44:25.870707+01:00 jb154sapqe01 kernel: [18365.555374][T25370] drbd drbd01: State change 639728590: primary_nodes=0, weak_nodes=0
2023-02-06T16:44:25.870717+01:00 jb154sapqe01 kernel: [18365.556749][T25370] drbd drbd01: Committing cluster-wide state change 639728590 (28ms)
2023-02-06T16:44:25.870718+01:00 jb154sapqe01 kernel: [18365.558110][T25370] drbd drbd01 jb154sapqe02: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
2023-02-06T16:44:25.870719+01:00 jb154sapqe01 kernel: [18365.559746][T25370] drbd drbd01/0 drbd1 jb154sapqe02: pdsk( Outdated -> Inconsistent ) repl( Off -> Established )
$ mkfs.ocfs2 -L jb155sapqe-shared-lvm-ocfs2-0 /dev/sda
mkfs.ocfs2 1.8.7
Cluster stack: pcmk
Cluster name: jb155sapqe
Stack Flags: 0x0
NOTE: Feature extended slot map may be enabled
Label: jb155sapqe-shared-lvm-ocfs2-0
Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg append-dio
Block size: 4096 (12 bits)
Cluster size: 4096 (12 bits)
Volume size: 1073741824 (262144 clusters) (262144 blocks)
Cluster groups: 9 (tail covers 4096 clusters, rest cover 32256 clusters)
Extent allocator size: 4194304 (1 groups)
Journal size: 67108864
Node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful
$ wipefs /dev/sda
DEVICE OFFSET TYPE UUID LABEL
sda 0x2000 ocfs2 2f8ab0fa-5f5f-486f-9518-5837f0662116 jb155sapqe-shared-lvm-ocfs2-0SLES does not support anymore o2cb cluster stack:
$ /sys/fs/ocfs2/cluster_stack
pcmkPacemaker/corosync stack configuration for OCFS2 with "independent" DLM and OCFS2 primitive:
primitive dlm ocf:pacemaker:controld \
op monitor interval=60 timeout=60 \
op start timeout=90s interval=0s \
op stop timeout=100s interval=0s
primitive ocfs2-0 Filesystem \
params directory="/srv/ocfs2-0" fstype=ocfs2 device="/dev/disk/by-id/scsi-3600140534d0904dc8a24843897c4ad18" \
op monitor interval=20 timeout=40
clone cl-dlm dlm \
meta interleave=true
clone cl-ocfs2-0 ocfs2-0 \
meta interleave=true
colocation co-ocfs2-1-with-dlm inf: cl-ocfs2-0 cl-dlm
order o-dlm-before-ocfs2-0 Mandatory: cl-dlm cl-ocfs2-0Some o2info details:
$ o2info --mkfs /dev/disk/by-id/scsi-3600140534d0904dc8a24843897c4ad18 | fmt -w80
-N 2 -J size=67108864 -b 4096 -C 4096 --fs-features
backup-super,strict-journal-super,sparse,extended-slotmap,userspace-stack,inline-data,xattr,indexed-dirs,refcount,discontig-bg,append-dio,unwritten
-L jb155sapqe-shared-lvm-ocfs2-0See userspace-stack, that refers to:
$ modinfo -d ocfs2_stack_user
ocfs2 driver for userspace cluster stackso2info --volinfo reveals number of cluster nodes too:
$ o2info --volinfo /dev/sda
Label: jb155sapqe-shared-lvm-ocfs2-0
UUID: 80038ED0D6B44AF7B617AE40DC337DF8
Block Size: 4096
Cluster Size: 4096
Node Slots: 2
Features: backup-super strict-journal-super sparse extended-slotmap
Features: userspace-stack inline-data xattr indexed-dirs refcount
Features: discontig-bg append-dio unwritten