Kubernetes infrastructure for rriv's IoT cloud
Requirements:
- Proton Pass - for storing credentials to various accounts
- DigitalOcean account access - this is where our cloud lives
- Tailscale VPN access - this is the VPN that gatekeeps some of our infrastructure
- kubectl - k8s CLI tool
- Helm and Helmfile - Kubernetes templating
- Helm Diff Plugin installed: `helm plugin install https://github.com/databus23/helm-diff`
- doctl - CLI tool for DigitalOcean
- Terraform - IaC for our multi-cloud environment
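A quick way to confirm the CLI tooling is in place (binary names here are the standard ones; adjust if your install differs):

```shell
# Check that each required CLI is on PATH; reports ok/missing per tool.
status=""
for tool in kubectl helm helmfile doctl terraform; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="$status $tool:ok"
  else
    status="$status $tool:missing"
  fi
done
echo "$status"
```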
RRIV cloud is set up to use DigitalOcean (DO) for its Chirpstack servers. Its main database is a managed Postgres db, also in DO. There are two k8s clusters for each environment - one for the project, and one for the Vault secrets.
We use helm for configuring the k8s clusters. Additionally, the DO infrastructure is captured in Terraform. The Terraform statefiles live in DO.
AWS is used sparingly. Its main purpose is to act as the guardian of our Vault root token. For this, we have some very basic infrastructure captured in Terraform as well, with the statefiles again stored in DigitalOcean.
This diagram attempts to show where each piece of infrastructure lives, and where various resources retrieve data/access from.
There are a handful of resources that have been set up manually. If you are setting up your own project, you will need to create these yourself in your own cloud provider accounts:
- An infrastructure management AWS account (ours is named `rriv`)
- AWS S3 bucket in the `rriv` account for the initial AWS Terraform statefile (see `./00-tf-bootstrap/account-setup/backend.hcl.example`)
- 4 DO Teams linked to the same organization: `rriv-management`, `rriv-dev`, `rriv-staging`, `rriv-prod`
- DO S3 bucket in each of the dev, staging, and prod team accounts for storing the Terraform statefiles - see the `backend.hcl.example` files in each environment in `./02-tf/environments`
- DO S3 bucket access key with read/write/delete on the above S3 buckets (limited scope, and unable to be created with Terraform)
This section is under construction :)
- `00-tf-bootstrap`: initial terraform that is applied from within this directory to set up AWS resources that then administer the DO resources
  - TODO: resources in `02-tf` need to be moved into the bootstrap directory
- `01-helm`: Vault and rriv k8s clusters setup
  - Unseal the Vault clusters in the various environments
- `02-tf`: final terraform to create the remaining resources, including DNS and Vault secrets. You must begin with `environments/management/`, then do `dev`, etc.
In order to get set up with k8s, you must first be able to run the terraform. The tf maintains the k8s cluster itself and everything around it. Helm maintains everything inside the cluster.
- Open the `.zshrc.example` file and follow the instructions.
- See the section below on "Applying Terraform changes". The secret tfvars and `backend.hcl` must be set up.
- Read below to get authenticated into DigitalOcean. Before you can run kubectl commands, run a `terraform refresh` to generate the kubeconfig files from each environment directory (`./02-tf/environments/...`).
You will need to set up the following files (copy and alter the .example files):
`secrets.auto.tfvars`
- DO token: This is the token that you use to authenticate to DO and apply terraform. See the section "Authenticating into DigitalOcean" on how to create a new PAT. Make sure it has permissions to modify clusters and droplets - you may need to create a new one if it doesn't.
- Tailscale auth key: Auth keys are generated by an admin here: https://login.tailscale.com/admin/settings/keys. You can ask an admin to generate a key for you if you are unable to. Save this in your Proton Pass personal vault.
- ssh fingerprint: Create an ssh key locally: `ssh-keygen -t ed25519 -C "your_email@rriv.org"`. Go to the Security section of the Settings page of the rriv account in DO, and add your key. Copy the fingerprint.
`backend.hcl`
- DO bucket & key: Get these from a project admin.
- DO access key & secret key: Create an access key in the Access Keys section of the Spaces Object Storage in DO. Save this in your Proton Pass personal vault.
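A sketch of what such a file might contain, assuming the s3-compatible Terraform backend pointed at DO Spaces (the bucket, key, and endpoint values below are placeholders, not the real ones — get those from a project admin):

```hcl
# Illustrative backend-dev.hcl; real values come from a project admin and
# the Spaces access key created above.
bucket     = "example-rriv-tf-state"           # DO Spaces bucket (placeholder)
key        = "dev/terraform.tfstate"           # statefile path (placeholder)
region     = "us-east-1"                       # required by the s3 backend, ignored by DO
endpoint   = "https://sfo2.digitaloceanspaces.com"
access_key = "<spaces-access-key>"
secret_key = "<spaces-secret-key>"

# Needed because DO Spaces is S3-compatible but not AWS:
skip_credentials_validation = true
skip_metadata_api_check     = true
skip_region_validation      = true
```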
To apply Terraform changes, cd into the environment directory (`./02-tf/environments/...`) and run your terraform commands from there. It is important that you begin with the management environment, in order to populate DNS from the loadbalancers that helm already created. When you run `terraform init`, you must pass in `-backend-config=backend-$ENV.hcl`.
If you get an error on the kubernetes terraform resources, you may need to re-apply the k8s-cluster module in order to regenerate the kubeconfig: `tf apply -target=module.$ENV_do_sfo2_k8s_rriv_cluster` or `tf apply -target=module.$ENV_do_sfo2_k8s_vault_cluster`
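The per-environment ordering can be sketched as a loop. Here `tf` is a stub that only records each invocation, so the sequence is visible without touching real infrastructure; drop the stub (and uncomment the `cd`) for a real run:

```shell
# Stub: record each terraform invocation instead of executing it.
tf() { applied="$applied[$*] "; }

applied=""
# management must come first so DNS can be populated from the helm loadbalancers.
for ENV in management dev staging prod; do
  # cd ./02-tf/environments/$ENV   # (real run)
  tf init -backend-config=backend-$ENV.hcl
  tf apply
done
echo "$applied"
```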
The first time you authenticate, you will need to set up DigitalOcean's CLI, doctl. See DO's documentation for using the package manager of your choice.
Log in to DO on the web. Go to the API tokens page and create a new Personal Access Token. Give it an expiration of 90 days (for security purposes, we don't want to create forever tokens). Give it full access, as you will need to be able to touch most different kinds of resources in order to develop in DO. Save the PAT in your personal Proton Pass vault.
With the PAT copied, run `doctl auth init --context rriv-dev`. Paste in the token. You will need to come back and do this step with each team (account) you plan to develop in.
Now run `doctl kubernetes cluster list`. You should see the rriv-dev and vault-dev clusters.
You should now be able to run `kubectl get no` and see the DO droplets.
Changes to the k8s yaml configurations must be applied using the `helmfile` command. `helmfile` must be run from the helm directory.
Show all new changes without applying them: `helmfile -e dev diff`
Apply all new changes from all k8s yamls: `helmfile -e dev apply`
To target a specific release by name in the helmfile gotmpl: `helmfile -e dev apply --selector name=rriv-chirpstack`
The k8s files were generated from this chirpstack-docker project.
The Chirpstack Gateway Bridge receives packets from LoRaWAN devices in the field. Currently we are hardcoded to a single channel plan, us915_1. In order to use multiple channel plans, we need to run separate gateway bridges with their own ClusterIp services, and route to them behind an nginx loadbalancer.
Admin credentials are stored in ProtonPass. Have an admin create an account for you.
Hashicorp Vault is installed in a separate k8s cluster. To switch over:
```
kubectl config get-contexts
kubectl config use-context <vault-context-name>
```
This should update the current cluster to be the one containing Vault.
The Vault cluster is still maintained under the same root helmfile, as a separate release. The environments directory takes care of pointing to the correct cluster when updating Vault.
You may need to manually install the dependencies with the CRDs.
After this you may run helmfile -e $ENV apply.
Vault is set to auto-unseal via AWS KMS keys. This means that as vault pods go down, they come back up without needing to be manually unsealed. However, the first time vault is installed, manual unseal is required. To do this:
```
$ kubectl exec -it rriv-vault-0 -n vault -- /bin/sh
/ $ export VAULT_ADDR="https://127.0.0.1:8200"
/ $ export VAULT_SKIP_VERIFY=true  # skip TLS verification
/ $ vault status                   # check your connection
Key                 Value
---                 -----
Seal Type           awskms
Recovery Seal Type  n/a
Initialized         false
Sealed              true
...
/ $ vault operator init
Recovery Key 1: ...
Recovery Key 2: ...
Recovery Key 3: ...
Recovery Key 4: ...
Recovery Key 5: ...
Initial Root Token: ...
Success! Vault is initialized
...
```
(If any of this errors, try adding the flag: -tls-skip-verify)
After this, it is critical that you write down the recovery keys and root token. SAVE these in Proton Pass. Save half of the recovery keys and distribute half to another admin.
You must now exec into the rriv-vault-0 pod and unseal Vault. You only need to do this once.
Now, if the vault pod goes down, it should be able to come up and unseal itself. Test this by deleting the pod and checking its logs.
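One way to run that check, using the pod and namespace names from the transcript above (this assumes your kubectl context points at the Vault cluster, so the commands are guarded for illustration):

```shell
# Delete the pod, wait for its replacement, then look for unseal messages
# in the logs to confirm auto-unseal kicked in.
if command -v kubectl >/dev/null 2>&1; then
  kubectl delete pod rriv-vault-0 -n vault
  kubectl -n vault wait --for=condition=Ready pod/rriv-vault-0 --timeout=180s
  kubectl -n vault logs rriv-vault-0 | grep -i unseal
else
  echo "kubectl not found; run this against the Vault cluster"
fi
checked=1
```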
The Terraform Vault module needs a bit of manual work to set up. Once you have Vault installed, you will need to create a temporary token in order to set up the Vault auth method for k8s. Refer to vault-kubernetes.sh.
The Vault agent injects secrets into pods that are specified via annotations. For an example of this, look at the Chirpstack deployment file. Secrets are injected into /vault/secrets/credentials, then used by the container. This centralizes secrets across cloud resources and keeps them out of plaintext k8s.
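The annotation pattern looks roughly like this (the role name and secret path below are illustrative assumptions; the authoritative version is the Chirpstack deployment file itself):

```yaml
# Sketch of Vault agent injector annotations on a pod template.
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "chirpstack"   # assumed role name
        # Renders the secret to /vault/secrets/credentials inside the pod:
        vault.hashicorp.com/agent-inject-secret-credentials: "secret/data/chirpstack"
```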
The secrets needed by Chirpstack are:
```
POSTGRES_CONN_STRING        # Chirpstack's database connection string
PG_INTEGRATION_CONN_STRING  # The TimescaleDB database that Chirpstack pipes data into
```
Install the CLI tool via the package manager of your choice: https://developer.hashicorp.com/vault/install
With your kubectl context set to the Vault cluster (see above), forward traffic to localhost:
```
kubectl port-forward svc/vault -n vault 8200:8200
```
In another terminal:
```
export VAULT_ADDR=http://localhost:8200
vault secrets list
```
Installed with: `kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml`
The message queue uses NATS.
TBD: Might need to create the PVC manually with the pvc.yml file.
Modify the consumers in jetstream.yaml as desired. For more info: https://github.com/nats-io/k8s/blob/main/helm/charts/nack/README.md.
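For reference, a NACK consumer definition looks roughly like this (stream and consumer names are illustrative; the real definitions live in `jetstream.yaml`, and the nack README linked above documents the full spec):

```yaml
# Illustrative NACK Consumer resource.
apiVersion: jetstream.nats.io/v1beta2
kind: Consumer
metadata:
  name: my-pull-consumer
spec:
  streamName: mystream
  durableName: my-pull-consumer
  ackPolicy: explicit
```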
To test the sending and receiving of messages:
```
# Run nats-box, which includes the NATS management utilities:
kubectl exec -it deploy/nats-box -- /bin/sh -l

# Publish a couple of messages:
$ nats req orders.received "order 1"
$ nats req orders.received "order 2"

# Pull the first message:
$ nats consumer next mystream my-pull-consumer

# Subscribe to the message stream:
$ nats sub my-push-consumer.orders
```
Keycloak is used as an identity provider throughout the project. Its purpose is twofold:
- Authenticate end users who wish to use the rriv monitoring software.
- Act as an IDP for the VPN, which gates access to certain private backend services (such as Vault).
Once you have applied the $ENV_vault_sfo2 module, you are ready to log in to keycloak. You may need to also do a helm apply/sync again, in case the certs aren't working yet:
```
helmfile -e $ENV sync --selector name=rriv-network
```
- To use keycloak, you must first log in with the `keycloak_username`/`keycloak_password` credentials that are stored in Vault in the `<$ENV>-keycloak-creds` secret. Use these to log in and manually create a new Client in the UI with service accounts activated.
- You can leave "OpenID Connect" selected.
- Give the client the name "terraform".
- Turn on "Client Authentication" and make sure that the following are checked: "OAuth 2.0 Device Authorization Grant", "Direct access grants", and "Service accounts roles". Root URL and Home URL should be set: "https://auth.<$ENV>.rriv.org" or "https://auth.rriv.org" for prod.
- Put the credentials in your local secrets file so that terraform can use them to create a "rriv-beta" realm.
- Under the "Service accounts roles" tab, select "Assign role" > "Filter by realm roles" > select "admin" > Save. From here, once you have applied the keycloak module terraform, you can log in and create users.
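Once the "terraform" client exists, the Keycloak Terraform provider can authenticate with it. A minimal sketch (the URL and variable name here are assumptions for illustration):

```hcl
provider "keycloak" {
  client_id     = "terraform"
  client_secret = var.keycloak_client_secret  # from your local secrets file
  url           = "https://auth.dev.rriv.org" # per-environment; https://auth.rriv.org for prod
}
```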
You must authenticate with DO and AWS before running terraform. Use doctl to authenticate to DO.
For AWS, you need an SSO user. After logging in, you will have the option to retrieve access/secret keys. Paste these into your terminal and run aws sts get-caller-identity to verify that you are logged in as a user with an admin role.
To update the node_size variable in the k8s-cluster module, you must NOT allow the cluster to be destroyed. This is a shortcoming of the current setup. Upgrade each node one at a time to the new size, and change the node_size value to reflect the changes when you are done. Vault data is persisted in k8s persistent volumes, but currently, they will be destroyed if the entire cluster is destroyed. Having the nodes come back up one at a time ensures that there is always a leader node that can pass the data along.
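One possible node-at-a-time sequence, since DO node pools cannot be resized in place (cluster, pool, and size names below are placeholders; verify the subcommands against `doctl kubernetes cluster node-pool --help` before running, which is why the sketch is guarded):

```shell
if command -v doctl >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  # 1. Add a pool at the new size (placeholder names/sizes):
  doctl kubernetes cluster node-pool create rriv-dev \
    --name pool-resized --size s-4vcpu-8gb --count 3
  # 2. Drain the old nodes one at a time so Vault always keeps a leader:
  OLD_NODE="old-node-placeholder"
  kubectl drain "$OLD_NODE" --ignore-daemonsets --delete-emptydir-data
  # 3. Remove the old pool once its nodes are drained:
  doctl kubernetes cluster node-pool delete rriv-dev pool-old
else
  echo "doctl/kubectl not found; run this against the real cluster"
fi
sketch_done=1
```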
