diff --git a/docs/plugin_cloud_ha.md b/docs/plugin_cloud_ha.md index e54ac9c50d..9b8809d50f 100644 --- a/docs/plugin_cloud_ha.md +++ b/docs/plugin_cloud_ha.md @@ -6,7 +6,7 @@ sidebar_label: Cloud HA The SSR-cloud-ha plugin provides High Availability (HA) functionality for the SSR Networking Platform deployed in the cloud. HA for SSR routers in a non-cloud environment uses traditional techniques such as VRRP and GARP which both rely on a virtual MAC and virtual IP. In cloud environments such as AWS, Azure, etc., any techniques that rely on broadcast and multicast are not supported. This plugin uses node health metrics sent over SVR, as well as cloud API interactions to perform failovers in these cloud environments. :::note -The instructions for installing and managing the plugin can be found [here](plugin_intro.md#installation-and-management). +For instructions to install and manage the plugin, see [Installation and Management](plugin_intro.md#installation-and-management). ::: ## Supported Solutions @@ -15,14 +15,13 @@ The instructions for installing and managing the plugin can be found [here](plug | ------------------ | --------------- | -------------------- | | Azure VNET | `azure-vnet` | 2.0.0 | | Azure Loadbalancer | `azure-lb` | 2.0.0 | -| Alicloud VPC | `alicloud-vpc` | 3.0.0 | - +| AWS TGW | `aws-tgw` | 5.1.0 | ## Version Restrictions - The router component can only be installed on versions of SSR which support provisional state on _device interfaces_. This is necessary for the plugin to be able to prevent asymmetrical routing. - The versions of SSR that support this feature have a `Provides: 128T-device-interface-api(1.0)`, so it can be checked ahead of time by performing a `rpm -q --whatprovides 128T-device-interface-api(1.0)` to see if the currently installed 128T satisfies this requirement or `dnf list --whatprovides 128T-device-interface-api(1.0)` to see all versions of SSR that satisfy this requirement. +The router component can only be installed on versions of SSR which support provisional state on _device interfaces_. This is necessary for the plugin to be able to prevent asymmetrical routing. +The versions of SSR that support this feature have a `Provides: 128T-device-interface-api(1.0)`, so it can be checked ahead of time by performing a `rpm -q --whatprovides 128T-device-interface-api(1.0)` to see if the currently installed 128T satisfies this requirement or `dnf list --whatprovides 128T-device-interface-api(1.0)` to see all versions of SSR that satisfy this requirement. ## Plugin Behavior @@ -51,12 +50,11 @@ The internal state machines wait for the `peer-reachability-timeout` after every When the HA Agent determines that a node must become active, the first active redundant interface's MAC address and the list of configured prefixes are sent to the API Agent's appropriate REST endpoint. The HA Agent changes the provisional status of the configured `redundant-interface` to **Up**. -### API Agents +## API Agents The job of the API Agent is to perform failover actions specific to the cloud provider and the chosen solution. - -#### Azure Loadbalancer +### Azure Loadbalancer A `solution-type` of `azure-lb` can be used to enable the Azure Loadbalancer API agent. This solution requires an [Azure Loadbalancer](https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview) to be configured using an HTTP probe on the `probe-port` with backend pools pointing towards the redundant interfaces. @@ -68,21 +66,22 @@ Backend Pool example: ![Azure Loadblancer Backend Pool Configuration](/img/cloud-ha-azure-lb-backend-pool-config.png) - The Azure Loadbalancer sends a health probe to the redundant SSR's redundant interfaces. These probes are routed through the `cloud-ha` interface, through a `128T-azure-lb-nginx` instance, and down to the Azure Loadbalancer API Agent. The Azure Loadbalancer API Agent responds to the probes with a `200` status code when the current node is active and a `500` code when its inactive. A probe to the inactive node will not reach the SSR when the redundant interfaces are set provisionally down. -#### Azure VNET +### Azure VNET + +![Azure VNET Architecture](/img/cloud_ha_azure_vnet_architecture.png) A `solution-type` of `azure-vnet` can be used to enable the Azure VNET API agent. It requires an Azure Route Table setup on the same VNET as the redundant interfaces. The Virtual Machines where these members are running must be granted the following permissions in order for the route updates to work correctly: -* Microsoft.Network/routeTables/read -* Microsoft.Network/networkInterfaces/read -* Microsoft.Network/routeTables/routes/*/write -* Microsoft.Network/routeTables/routes/read -* Microsoft.Compute/virtualMachines/read -* Microsoft.Network/virtualNetworks/read +- Microsoft.Network/routeTables/read +- Microsoft.Network/networkInterfaces/read +- Microsoft.Network/routeTables/routes/*/write +- Microsoft.Network/routeTables/routes/read +- Microsoft.Compute/virtualMachines/read +- Microsoft.Network/virtualNetworks/read The agent finds all of the route tables within the VNET using the Azure REST APIs. When a redundant interface becomes active, the agent updates the route tables for all the configured prefixes to point to that interface. The solution is designed to be idempotent, so the peer member's redundant interface will now be inactive. There is no update to the route table needed when becoming inactive. @@ -92,21 +91,81 @@ The agent finds all of the route tables within the VNET using the Azure REST API To prevent routing loops, the solution will not update the Azure Route Tables assigned to a subnet that has an activating node's _network interface_. ::: -#### Alicloud VPC +#### Extra Route Tables -A `solution-type` of `alicloud-vpc` can be used to enable the Alicloud VPC API agent. It requires an Alicloud Route Table setup on the same VNET as the redundant interfaces. +By default, the Azure VNET solution automatically discovers all route tables within the VNET where the redundant interfaces are located. However, you can also explicitly specify additional Azure Route Tables (User Defined Routes or UDRs) to be managed by the plugin using the `extra-route-table` configuration. -Since the deployment of Juniper SDWAN HA requires granting RAM role permissions, you need to add a RAM username for this under your current Alicloud account and grant the following permissions. The Virtual Machines where these members are running must be granted with the RAM role with following permissions in order for the route updates to work correctly: +Each `extra-route-table` entry requires: +* `subscription-id`: The unique identifier of the Azure subscription containing the UDR +* `resource-group`: The name of the Azure resource group where the route table is hosted +* `route-table-name`: The specific name of the Azure Route Table resource -* AliyunECSFullAccess -* AliyunVPCFullAccess +This is particularly useful when you need to manage route tables in different subscriptions or resource groups that are not automatically discovered. -The agent finds all of the route tables within the Alicloud VPC using the Alicloud REST APIs. When a redundant interface becomes active, the agent updates the route tables for all the configured prefixes to point to that interface. The solution is designed to be idempotent, so the peer member's redundant interface will now be inactive. There is no update to the route table needed when becoming inactive. +#### Route Table Discovery -:::warning -To prevent routing loops, the solution will not update the Alicloud Route Tables assigned to a subnet that has an activating node's _network interface_. +The `auto-discover-route-table` setting (default: `true`) enables automatic discovery of Azure Route Tables within the specified VNET. When enabled, the system first scans for tags (see [Tagging](#tagging) below). If none are found, then the system scans for existing UDRs instead of requiring manual entry via `extra-route-table`. Setting this to `false` means only explicitly configured `extra-route-table` entries are managed. + +#### Tagging + +When `auto-discover-route-table` is set to `true` in the `cloud-redundancy-group`, the API Agent looks for the Route Table by looking at the tags assigned to the VM. + +To utilize this feature, the tags must follow the following form: +* `SSR-CLOUD-HA--SUBSCRIPTION-ID` : `` +* `SSR-CLOUD-HA--RESOURCE-GROUP` : `` +* `SSR-CLOUD-HA--ROUTE-TABLE-NAME` : `` + +:::note +The Subscription ID, Resource Group, and Route Table Name must be the same for both nodes in a redundancy group. ::: +The following are some example tags: +* `SSR-CLOUD-HA-0-SUBSCRIPTION-ID` : `88888888-bbbb-4444-aaaa-ffffffffffff` +* `SSR-CLOUD-HA-0-RESOURCE-GROUP` : `myResourceGroup` +* `SSR-CLOUD-HA-0-ROUTE-TABLE-NAME` : `myRouteTable` + +When `auto-discover-route-table` is `true` and the tags are missing or incomplete, the API Agent uses the existing UDRs. + +These tags are queried every time the API Agent tries to grab state or `become-active`, so modifying these tags during runtime is allowed. + +### AWS TGW + +A `solution-type` of `aws-tgw` can be used to enable the AWS TGW API agent. This solution requires a TGW to be created with attachments which are connected to the VPCs that the SSR resides in. The Virtual Machines where these members are running must be granted the following permissions in order for the route updates to work correctly: + +- ec2:DescribeInstances +- ec2:CreateTransitGatewayRoute +- ec2:DescribeTransitGatewayRouteTables +- ec2:ReplaceTransitGatewayRoute + +If you want the state output to report the current routes in the TGW Route Rable, the VMs will also need: +- ec2:SearchTransitGatewayRoutes + +:::note +The [AWS metadata configuration](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html) for the SSR instances must set the Hop Limit to at least 2 for the plugin to be able to make the correct API calls. The BYOL AMI has this setting defaulted to 2, but that value is only used if the instance or account settings are not specified. +::: + +The TGW Route Table is the object which is updated by the plugin, pointing to one of the node's TGW Attachments. Each SSR node must know the TGW Route Table as well as the corresponding attachment that points to itself. How this is configured is explained in the [AWS TGW Config Section](#aws-tgw-specific-configuration) + +#### Tagging + +When `auto-discover-route-table` is set to `true` in the `cloud-redundancy-group`, the API Agent looks for the TGW Route Table and Attachment by looking at the tags assigned to the EC2 instance. + +To utilize this feature, the tags must follow the following form: +* `SSR-CLOUD-HA--TGW-ROUTE-TABLE-ID` : `` +* `SSR-CLOUD-HA--TGW-ATTACHMENT-ID` : `` + +:::note +The Route Table ID must be the same for both nodes in a redundancy group. The Attachment ID must be unique across nodes in a redundancy group. +::: + +The following are some example tags: +* `SSR-CLOUD-HA-0-TGW-ROUTE-TABLE-ID` : `tgw-rtb-000000000000` +* `SSR-CLOUD-HA-0-TGW-ATTACHMENT-ID` : `tgw-attach-000000000000` + +When `auto-discover-route-table` is `true` and the tags are missing or incomplete, the API Agent does not function and errors can be found in the journal. + +These tags are queried every time the API Agent tries to grab state or `become-active`, so modifying these tags during runtime is allowed. + ## Scenarios ### Both Healthy @@ -185,7 +244,7 @@ A group is a collection of nodes which all share the same settings such as the s ``` authority - cloud-redundandy-group group1 + cloud-redundancy-group group1 name group1 enabled true solution-type azure-lb @@ -196,26 +255,55 @@ authority remote-health-network 169.254.180.0/24 health-interval 2 exit - cloud-redundandy-group group2 + cloud-redundancy-group group2 name group2 solution-type azure-vnet + extra-route-table a1b2c3d4-e5f6-4789-a0b1-c2d3e4f5a6b7 + subscription-id a1b2c3d4-e5f6-4789-a0b1-c2d3e4f5a6b7 + resource-group azure-resource-group + route-table-name azure-route-table + exit + auto-discover-route-table false include-peer-vnets false exit exit ``` -| Element | Type | Properties | Description | | -| ------------------------- | --------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | --- | -| name | string | key | The name of the group to be referenced in other places. | | -| enabled | boolean | default: true | Whether the group is enabled. | | -| solution-type | enum | required | The solution to use on member nodes. Value can be one of the values in #Supported Solutions. | | -| additional-branch-prefix | list: ip-prefix | | Additional ip prefixes that the member routers will control. | | -| up-holddown-timeout | int | default: 2 | The number of seconds to wait before declaring a member up. | | -| peer-reachability-timeout | int | default: 10 | The number of seconds to wait before declaring a peer unreachable. This field must be at least twice the value of `health-interval`. | | -| health-interval | int | default: 2 | The interval in seconds for health reports to be collected. | | -| remote-health-network | ip-prefix | default: 169.254.180.0/24 | The ip prefix to use for inter-member health status messages. | | -| include-peer-vnets | boolean | if: solution-type = azure-vnet, default: false | Whether to include peer VNETs as part of the route table discovery algorithm. | | -| probe-port | port | if: solution-type = azure-lb, default: 12801 | The port that the Azure Loadbalancer will be sending the HTTP probes on. | | +#### Common Configuration + +| Element | Type | Properties | Description | +| ---- | --- | --- | --- | +| name | string | key | The name of the group to be referenced. | +| enabled | boolean | default: true | Whether the group is enabled. | +| solution-type | enum | required | The solution to use on member nodes. Value can be one of the values in [Supported Solutions](#supported-solutions. | +| additional-branch-prefix | list: ip-prefix | | Additional IP prefixes that the member routers will control. | +| up-holddown-timeout | int | default: 2 | The number of seconds to wait before declaring a member up. | +| peer-reachability-timeout | int | default: 10 | The number of seconds to wait before declaring a peer unreachable. This field must be at least twice the value of `health-interval`. | +| health-interval | int | default: 2 | The interval in seconds for health reports to be collected. | +| remote-health-network | ip-prefix | default: 169.254.180.0/24 | The ip-prefix to use for inter-member health status messages. | + +#### Azure VNET Specific Configuration + +| Element | Type | Properties | Description | +| --- | --- | --- | --- | +| include-peer-vnets | boolean | default: false | Whether to include peer VNETs as part of the route table discovery algorithm. | +| auto-discover-route-table | boolean | default: true | Enables automatic discovery of Azure Route Tables within the specified VNET. | +| extra-route-table | list | | A list of Azure User Defined Route (UDR) tables where custom routing entries will be injected or modified. | +| subscription-id | string | key (within extra-route-table) | The unique identifier of the Azure subscription containing the UDR. | +| resource-group | string | required (within extra-route-table) | The name of the Azure resource group where the route table is hosted. | +| route-table-name | string | required (within extra-route-table) | The specific name of the Azure Route Table resource. | + +#### Azure Loadbalancer Specific Configuration + +| Element | Type | Properties | Description | +| ---------- | ---- | ---------------- | --- | +| probe-port | port | default: 12801 | The port where the Azure Loadbalancer sends HTTP probes. | + +#### AWS TGW Specific Configuration + +| Element | Type | Properties | Description | +| ---------- | ---- | ---------------- | --- | +| tgw-route-table-id | AWS ID | | The TGW Route Table ID that should be updated. Ex. `tgw-rtb-00000000000000001`. Must be paired with the `tgw-attachment-id` on each member. | ### Membership @@ -250,21 +338,27 @@ node exit ``` -| Element | Type | Properties | Description | -| ------------------------------- | --------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Element | Type | Properties | Description | +| --- | --- | --- | --- | | cloud-redundancy-plugin-network | ip-network | default: 169.254.137.0/30 | The ip network to use for internal networking. This should only be configured when the default value conflicts with a different service in the configuration. | -| cloud-redundancy-group | reference | required | The group that this member belongs to. | -| priority | int | min-value: 1, max-value:2 | The priority of the member where lower priority has higher preference. | -| redundant-interface | list: reference | min-number: 1 | The _device interfaces_ that will be redundant with the `redundant-interfaces` on the peer members. | -| additional-interface | list: reference | | The _device interfaces_ that will be considered for node health, but not considered for redundant operations. | -| log-level/ha-agent | log-level | default: info | The log level for the HA Agent. | -| log-level/api-agent | log-level | default: info | The log level for the active API Agent. | - +| cloud-redundancy-group | reference | required | The group that this member belongs to. | +| enabled | bool | default: true | Whether to enable the HA Agent from becoming active/inactive for the member. | +| priority | int | min-value: 1, max-value:2 | The priority of the member where lower priority has higher preference. | +| redundant-interface | list: reference | min-number: 1 | The _device interfaces_ that will be redundant with the `redundant-interfaces` on the peer members. | +| additional-interface | list: reference | | The _device interfaces_ that will be considered for node health, but not considered for redundant operations. | +| log-level/ha-agent | log-level | default: info | The log level for the HA Agent. | +| log-level/api-agent | log-level | default: info | The log level for the active API Agent. | :::note Nodes can only be members of one group. ::: +#### AWS TGW Specific Configuration + +| Element | Type | Properties | Description | +| ---------- | ---- | ---------------- | --- | +| tgw-attachment-id | AWS ID | | The TGW Attachment ID that should be used for this node when updating the route table entry. Ex. `tgw-attach-00000000000000001` | + ### Address Prefixes Certain prefixes can be configured to be controlled by the Cloud HA SSR by tagging services with groups. The services that are tagged with a certain group will have all of their `address` fields added to the list of configured prefixes. @@ -280,9 +374,341 @@ exit Any additional prefixes that should be included in the controlled prefix list can be configured as `additional-branch-prefix`s under the group. +## Complete Azure VNET Configuration Example + +Below is a complete working example of an Azure VNET High Availability deployment. This configuration demonstrates the use of explicit route table configuration using `extra-route-table` with `auto-discover-route-table` set to `false`. + +:::info +This example is based on a production deployment pattern. Adjust the IP addresses, subscription IDs, and resource identifiers for your specific environment. +::: + +```config {3-18,62-67,108-113} +config + authority + cloud-redundancy-group group1 + enabled true + name group1 + solution-type azure-vnet + + extra-route-table a1b2c3d4-e5f6-4789-a0b1-c2d3e4f5a6b7 + subscription-id a1b2c3d4-e5f6-4789-a0b1-c2d3e4f5a6b7 + resource-group azure-resource-group + route-table-name azure-route-table + exit + auto-discover-route-table false + health-interval 6 + additional-branch-prefix 1.1.1.1/32 + up-holddown-timeout 2 + peer-reachability-timeout 13 + exit + + router azure-ha + name azure-ha + inter-node-security internal + + node node0 + name node0 + asset-id node0-asset-id + role combo + + device-interface ge-0-1 + name ge-0-1 + enabled true + + network-interface ge-0-1-intf + name ge-0-1-intf + tenant lan + source-nat true + + address 172.16.10.4 + ip-address 172.16.10.4 + prefix-length 24 + gateway 172.16.10.1 + exit + exit + exit + + device-interface ge-0-2 + name ge-0-2 + forwarding false + + network-interface ge-0-2-intf + name ge-0-2-intf + type fabric + + address 192.168.10.132 + ip-address 192.168.10.132 + prefix-length 29 + gateway 192.168.10.129 + exit + exit + exit + + cloud-redundancy-membership group1 + cloud-redundancy-group group1 + dns-server 168.63.129.16 + priority 1 + redundant-interface ge-0-1 + exit + exit + + node node1 + name node1 + asset-id node1-asset-id + role combo + + device-interface ge-0-1 + name ge-0-1 + enabled true + + network-interface ge-0-1-intf + name ge-0-1-intf + tenant lan + source-nat true + + address 172.16.10.6 + ip-address 172.16.10.6 + prefix-length 24 + gateway 172.16.10.1 + exit + exit + exit + + device-interface ge-0-2 + name ge-0-2 + forwarding false + + network-interface ge-0-2-intf + name ge-0-2-intf + type fabric + + address 192.168.10.134 + ip-address 192.168.10.134 + prefix-length 29 + gateway 192.168.10.129 + exit + exit + exit + + cloud-redundancy-membership group1 + cloud-redundancy-group group1 + dns-server 168.63.129.16 + priority 2 + redundant-interface ge-0-1 + exit + exit + + service-route internet-from-lan-route + name internet-from-lan-route + service-name internet-from-lan + enable-failover true + + next-hop node1 ge-0-0-intf + node-name node1 + interface ge-0-0-intf + exit + + next-hop node0 ge-0-0-intf + node-name node0 + interface ge-0-0-intf + exit + exit + exit + + tenant lan + name lan + exit + + service internet-from-lan + name internet-from-lan + enabled true + tenant lan + address 0.0.0.0/0 + exit + exit +exit +``` + +## Complete AWS TGW Configuration Example + +Below is a complete working example of an AWS TGW High Availability deployment. This configuration demonstrates the use of explicit route table configuration using `tgw-route-table-id` and `tgw-attachment-id`. + +:::info +This example is based on a production deployment pattern. Adjust the IP addresses, subscription IDs, and resource identifiers for your specific environment. +::: + +```config {3-13,57-63,104-110} +config + authority + cloud-redundancy-group group1 + enabled true + name group1 + solution-type aws-tgw + + tgw-route-table-id tgw-rtb-00000000000000001 + health-interval 6 + additional-branch-prefix 1.1.1.1/32 + up-holddown-timeout 2 + peer-reachability-timeout 13 + exit + + router aws-ha + name aws-ha + inter-node-security internal + + node node0 + name node0 + asset-id node0-asset-id + role combo + + device-interface ge-0-1 + name ge-0-1 + enabled true + + network-interface ge-0-1-intf + name ge-0-1-intf + tenant lan + source-nat true + + address 172.16.10.4 + ip-address 172.16.10.4 + prefix-length 24 + gateway 172.16.10.1 + exit + exit + exit + + device-interface ge-0-2 + name ge-0-2 + forwarding false + + network-interface ge-0-2-intf + name ge-0-2-intf + type fabric + + address 192.168.10.132 + ip-address 192.168.10.132 + prefix-length 29 + gateway 192.168.10.129 + exit + exit + exit + + cloud-redundancy-membership group1 + cloud-redundancy-group group1 + dns-server 8.8.8.8 + priority 1 + redundant-interface ge-0-1 + tgw-attachment-id tgw-attach-00000000000000001 + exit + exit + + node node1 + name node1 + asset-id node1-asset-id + role combo + + device-interface ge-0-1 + name ge-0-1 + enabled true + + network-interface ge-0-1-intf + name ge-0-1-intf + tenant lan + source-nat true + + address 172.16.10.6 + ip-address 172.16.10.6 + prefix-length 24 + gateway 172.16.10.1 + exit + exit + exit + + device-interface ge-0-2 + name ge-0-2 + forwarding false + + network-interface ge-0-2-intf + name ge-0-2-intf + type fabric + + address 192.168.10.134 + ip-address 192.168.10.134 + prefix-length 29 + gateway 192.168.10.129 + exit + exit + exit + + cloud-redundancy-membership group1 + cloud-redundancy-group group1 + dns-server 8.8.8.8 + priority 2 + redundant-interface ge-0-1 + tgw-attachment-id tgw-attach-00000000000000002 + exit + exit + + service-route internet-from-lan-route + name internet-from-lan-route + service-name internet-from-lan + enable-failover true + + next-hop node1 ge-0-0-intf + node-name node1 + interface ge-0-0-intf + exit + + next-hop node0 ge-0-0-intf + node-name node0 + interface ge-0-0-intf + exit + exit + exit + + tenant lan + name lan + exit + + service internet-from-lan + name internet-from-lan + enabled true + tenant lan + address 0.0.0.0/0 + exit + exit +exit +``` + +### Key Configuration Points +1. **AWS TGW Route Table ID**: On the group, specifies which TGW Route Table to update. +2. **AWS TGW Attachment ID**: On each member, specifies the attachment which corresponds to the VPC that the SSR node lives in. This setting needs to be different across nodes. +3. **Redundant Interfaces**: Each node specifies `ge-0-1` as the redundant interface to be monitored for failover. +4. **Priority**: Node0 has priority 1 (primary), Node1 has priority 2 (secondary). +5. **DNS Servers**: AWS's internal DNS server (168.63.129.16) is configured for the cloud HA membership. +6. **Fabric Interface**: A dedicated fabric interface (ge-0-2) is used for inter-node communication. + ## Generated SSR Configuration -To see a full blown configuration and the configuration it generates, look at `Complete Example Configuration` in the `Appendix`. +To see a full configuration and the configuration it generates, refer to the [Complete Example Configuration #1 for azure-lb](#complete-example-configuration-1-for-azure-lb) in the [Appendix](#appendix). + +### Management Traffic Config Generation + +When the plugin is configured on a router which is consisting of two nodes, the plugin will try to generate a service and service route for each node to get the DNS and API calls from the network namespace to the cloud providers. + +This config generation is dependent on if the node has an interface marked `management true`. + +If there are no `management true` interfaces, then the user must configure them manually. The key configuration to include is: +* The higher priority member's tenant is `cloud-ha-0` +* The lower priority member's tenant is `cloud-ha-1` +* The service must allow: + * TCP 80 - for metadata service interactions + * TCP 443 - for API interactions + * UDP 53 - for DNS to resolve the API hostnames +* Each service route must have at least one service route with a next hop which allows it out an external interface + +The service and service-route will be named either `cloud-ha-management-0` or `cloud-ha-management-1` depending on whether it is node0 or node1. ### Validation @@ -334,9 +760,17 @@ The different services on the router all log to the files captured by the glob ` ### PCLI Enhancements + +#### Become Active + +The command `request cloud-ha become-active router node ` will try to force the given node to become active and try to tell the other node to become inactive. If the peer node is not accessible at the time of the command, the command will not fail. + +This command can be useful if the routes have been modified outside of the plugin and they need to be fixed. + +#### State To check the state of the Cloud HA solution running on the router, the plugin adds output to the `show device-interface` command for the `cloud-ha` interface. This state information is also accessible from the SSR's public REST API with a `GET` on `/api/v1/router//node//cloud-ha/state`. -#### State Fields +##### State Fields | Field | Description | | ------------------------ | --------------------------------------------------------------------------------------------------------- | @@ -422,48 +856,6 @@ Example output for the `azure-lb` solution: Completed in 1.72 seconds ``` -Example output for the `alicloud-vpc` solution: -``` -# show device-interface name cloud-ha -Wed 2022-09-21 10:31:57 CST -✔ Retrieving device interface information... - -====================================================== - AlicloudSDWAN-HA-Router01:cloud-ha -====================================================== - Type: host - Forwarding: true - Mode: host - MAC Address: ca:4d:38:ac:9b:5c - - Admin Status: up - Operational Status: up - Provisional Status: up - Redundancy Status: non-redundant - Speed: 1 Gb/s - Duplex: full - - in-octets: 1059977756 - in-unicast-pkts: 10915259 - in-errors: 0 - out-octets: 1053204137 - out-unicast-pkts: 10915347 - out-errors: 0 - - Cloud HA: - is-active: True - last-activity-change:Fri 2022-09-09 10:31:15 CST - first-active-interface:LAN - first-active-mac-address:00:16:3e:10:a9:34 - prefixes: - 10.0.0.0/8 - 172.16.0.0/12 - local-status: healthy - remote-status: healthy - -Completed in 0.04 seconds -``` - ### Systemd Services * `128T-telegraf@cloud_ha_health`: the instance of the monitoring agent that produces the health statuses @@ -481,7 +873,8 @@ If the Cloud HA Agent does not come up cleanly or fails for a prolonged period o Example output when the agent did not come up cleanly: -```admin@combo-west.RTR_WEST_COMBO# show alarms +``` +admin@combo-west.RTR_WEST_COMBO# show alarms Tue 2020-10-09 16:42:50 UTC ============== ===================== ========== ============= =========== ====================================== @@ -505,7 +898,6 @@ Error: The router will need to be upgraded to a compatible version. Compatible versions can be listed with `dnf list --whatprovides 128T-device-interface-api(1.0)`. - ## Appendix ### Health Status Transport @@ -964,137 +1356,21 @@ config exit ``` +## Release Notes -### Part of Example Configuration #2 for alicloud-vpc after auto generation. - -Below is a sample running configuration for a node and a service. - -:::info -The following configuration is an example only - It is not for use on a system. -::: - - - node AlicloudSDWAN-HA-Router01 - name AlicloudSDWAN-HA-Router01 - asset-id iZwz96to20bbnnb - - device-interface WAN - name WAN - pci-address 0000:00:04.0 - - network-interface WAN - name WAN - global-id 17 - - neighborhood ChinaUnicom - name ChinaUnicom - external-nat-address 119.23.200.200 - vector wan1 - - path-mtu-discovery - enabled true - exit - exit - inter-router-security internal - dhcp v4 - exit - exit - - device-interface LAN - name LAN - pci-address 0000:00:05.0 - enabled true - - network-interface LAN - name LAN - global-id 18 - - neighborhood ALICLOUD - name ALICLOUD - exit - inter-router-security internal - dhcp v4 - exit - exit - - device-interface farbic - name farbic - pci-address 0000:00:06.0 - - network-interface fabric - name fabric - global-id 19 - - neighborhood intracloud - name intracloud - topology mesh - - path-mtu-discovery - enabled true - exit - exit - inter-router-security internal - dhcp v4 - exit - exit - - device-interface cloud-ha - name cloud-ha - description "Auto generated device interface cloud-ha" - type host - - network-interface cloud-ha-intf - name cloud-ha-intf - global-id 27 - type external - tenant cloud-ha - source-nat true - - address 169.254.137.1 - ip-address 169.254.137.1 - prefix-length 30 - gateway 169.254.137.2 - exit - exit - exit - exit - - service-route Alicloud-HA-test - name Alicloud-HA-test - service-name Alicloud-HA-test - vector peer1 - - next-hop AlicloudSDWAN-HA-Router01 LAN - node-name AlicloudSDWAN-HA-Router01 - interface LAN - vector peer1 - exit - exit - - service-route AliCloud-HA-test2 - name AliCloud-HA-test2 - service-name Alicloud-HA-test - vector peer2 - peer AlicloudSDWAN-HA-Router02 - exit +### Release 5.1.0 - service-route cloud-ha-service-route-alicloud-1 - name cloud-ha-service-route-alicloud-1 - service-name cloud-ha-service-alicloud-1 - nat-target 169.254.137.2 - exit +#### New Features and Improvements - service-route cloud-ha-peer-service-route-alicloud-2 - name cloud-ha-peer-service-route-alicloud-2 - service-name cloud-ha-service-alicloud-2 - peer AlicloudSDWAN-HA-Router02 - exit - exit +Add support for dual node HA which allows the plugin to manage redundancy across two nodes of the same router. +Azure VNET mode was extended to support updates to additional route tables including those in a different subscription-id. +- See the [Azure VNET Configuration Example](#complete-azure-vnet-configuration-example) for more details. +AWS TGW mode was added. -## Release Notes +- See the [AWS TGW Configuration Example](#complete-aws-tgw-configuration-example) for more details. ### Release 5.0.0 @@ -1106,7 +1382,6 @@ Support for install and upgrade of a customized upstream Linux-based SSR OS dist On conductor, the plugin will auto upgrade to this version when upgrading from 6.x to 7.x version of SSR software. In addition, all routers will also be auto upgraded to their respective Oracle Linux 7 or Oracle Linux 9 plugin version depending on the SSR version running on the device. ::: - **Release Date:** Oct 13, 2025 ### Release 4.0.2 diff --git a/static/img/cloud_ha_azure_vnet_architecture.png b/static/img/cloud_ha_azure_vnet_architecture.png new file mode 100644 index 0000000000..744b4a6b6e Binary files /dev/null and b/static/img/cloud_ha_azure_vnet_architecture.png differ