10 changes: 10 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,10 @@
## Goal

## Changes

## Testing

## Checklist
- [ ] Clear PR title
- [ ] README updated
- [ ] No secrets or temporary files committed
Binary file added labs/screenshots/alert.png
Binary file added labs/screenshots/browser_check.png
Binary file added labs/screenshots/browser_check_pass.png
Binary file added labs/screenshots/check_success.png
Binary file added labs/screenshots/dashboard.png
Binary file added labs/screenshots/freq_settings.png
309 changes: 309 additions & 0 deletions labs/submission8.md
@@ -0,0 +1,309 @@
# Lab 8 Submission — Site Reliability Engineering (SRE)

## Task 1 — Key Metrics for SRE and System Analysis

### 1.1 Monitor System Resources

`htop` is interactive, and the installed version does not support batch output, so I used `top -b -n 1` instead to capture a readable, non-interactive CPU and memory snapshot for the report.

Command:

```bash
top -b -n 1 | sed -n '1,20p'
```

Output:

```text
top - 20:42:04 up 33 min, 2 users, load average: 1.78, 1.85, 1.86
Tasks: 4 total, 1 running, 3 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 19906.0 total, 10745.9 free, 4549.4 used, 5556.8 buff/cache
MiB Swap: 977.0 total, 977.0 free, 0.0 used. 15356.5 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 nikkimen 20 0 130484 4220 856 S 0.0 0.0 0:00.00 code
2 nikkimen 20 0 7332 3020 2756 S 0.0 0.0 0:00.00 bash
3 nikkimen 20 0 12060 4812 2912 R 0.0 0.0 0:00.00 top
4 nikkimen 20 0 6812 1048 928 S 0.0 0.0 0:00.00 sed
```

Command:

```bash
free -h
```

Output:

```text
total used free shared buff/cache available
Mem: 19Gi 4.4Gi 10Gi 584Mi 5.4Gi 14Gi
Swap: 976Mi 0B 976Mi
```

Command:

```bash
iostat -x 1 5
```

Output:

```text
Linux 6.1.0-44-amd64 (debian) 03/23/26 _x86_64_ (4 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
19.24 0.19 5.77 0.45 0.00 74.34

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 24.93 1114.00 10.87 30.37 0.25 44.69 8.07 158.49 6.49 44.59 2.16 19.64 0.00 0.00 0.00 0.00 0.00 0.00 0.40 2.10 0.02 0.79
sda 43.04 645.07 18.11 29.61 0.44 14.99 7.05 394.10 7.94 52.96 1.51 55.91 0.00 0.00 0.00 0.00 0.00 0.00 1.32 3.28 0.03 1.58


avg-cpu: %user %nice %system %iowait %steal %idle
47.73 0.00 7.32 0.25 0.00 44.70

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 97.00 436.00 0.00 0.00 0.64 4.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 0.80
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00


avg-cpu: %user %nice %system %iowait %steal %idle
36.99 0.00 11.48 0.00 0.00 51.53

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 13.00 76.00 5.00 27.78 3.46 5.85 0.00 0.00 0.00 0.00 0.00 0.00 2.00 4.00 0.05 0.80


avg-cpu: %user %nice %system %iowait %steal %idle
32.74 0.00 5.37 0.00 0.00 61.89

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 4.00 16.00 0.00 0.00 0.50 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00


avg-cpu: %user %nice %system %iowait %steal %idle
16.54 0.00 6.46 0.00 0.00 77.00

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 3.00 44.00 1.00 25.00 0.33 14.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
```

CPU-heavy processes snapshot:

```bash
ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10
```

Output:

```text
PID COMMAND %CPU %MEM
1 code 0.0 0.0
2 bash 0.0 0.0
3 ps 0.0 0.0
4 head 0.0 0.0
```

Memory-heavy processes snapshot:

```bash
ps -eo pid,comm,%mem,%cpu --sort=-%mem | head -n 4
```

Output:

```text
PID COMMAND %MEM %CPU
3 ps 0.0 0.0
1 code 0.0 0.0
2 bash 0.0 100
```

I/O-heavy processes snapshot:

```bash
pidstat -d 1 3
```

Output:

```text
Linux 6.1.0-44-amd64 (debian) 03/23/26 _x86_64_ (4 CPU)

20:41:44 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command

20:41:45 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command

20:41:46 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command

Average: UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
```

Top 3 most consuming applications observed:

1. CPU usage: `code`, `bash`, `ps`
2. Memory usage: `ps`, `code`, `bash`
3. I/O usage: No significant per-process I/O activity was observed during the sampling window, so no clear top 3 I/O consumers were present.
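Single `ps` snapshots report near-zero `%CPU` for short-lived processes, which is why the tables above look so flat. Sampling over an interval gives a truer picture. As a minimal sketch (assuming a Linux `/proc/stat`; the 1-second window is an arbitrary choice), overall utilization can be computed directly from the kernel's tick counters, which are the same counters `top` and `iostat` aggregate:

```shell
#!/bin/sh
# Sketch: overall CPU utilization from /proc/stat over a 1 s window.
# Approximation only: iowait and irq time are left out of "busy".
cpu_util() {
    # First line of /proc/stat: cpu user nice system idle iowait irq ...
    read -r _ u1 n1 s1 i1 _ < /proc/stat
    sleep 1
    read -r _ u2 n2 s2 i2 _ < /proc/stat
    busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
    total=$(( busy + (i2 - i1) ))
    # Integer percentage of non-idle tick time in the window
    echo $(( 100 * busy / total ))
}

cpu_util
```

Running this during one of the bursts seen in the `iostat` samples would confirm whether they are sustained load or brief spikes.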

### 1.2 Disk Space Management

Command:

```bash
df -h
```

Output:

```text
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p7 91G 77G 10G 89% /
tmpfs 9.8G 0 9.8G 0% /dev
tmpfs 2.0G 1.9M 2.0G 1% /run
tmpfs 5.0M 8.0K 5.0M 1% /run/lock
tmpfs 2.0G 88K 2.0G 1% /run/user/1000
/dev/sda3 206G 103G 93G 53% /home
/dev/nvme0n1p5 517M 197M 321M 39% /boot/efi
udev 9.7G 0 9.7G 0% /dev/tty
```

Command:

```bash
du -h /var 2>/dev/null | sort -rh | head -n 10
```

Output:

```text
sort: write failed: 'standard output': Broken pipe
sort: write error
7.7G /var
5.9G /var/lib
5.5G /var/lib/flatpak/repo/objects
5.5G /var/lib/flatpak/repo
5.5G /var/lib/flatpak
1.2G /var/log/journal/b80ea517224743079f1869f508a54efd
1.2G /var/log/journal
1.2G /var/log
607M /var/cache
564M /var/cache/apt
```
The `sort: write failed: 'standard output': Broken pipe` lines here and in the next output are harmless: `head` exits after printing the requested lines and closes the pipe, so `sort` reports a write error even though the listing itself is complete.

Command:

```bash
find /var -type f -exec du -h {} + 2>/dev/null | sort -rh | head -n 3
```

Output:

```text
sort: write failed: 'standard output': Broken pipe
198M /var/lib/flatpak/repo/objects/16/f0a92fc05aa87097e2305db2492361268aa3fb8e0787d486bf338c996cd27b.file
198M /var/lib/flatpak/app/net.ankiweb.Anki/x86_64/stable/3d1fae7d305f5ca3b9b05caa23d15c494450d3d096bbc303d06695594c6f1eff/files/lib/x86_64-linux-gnu/libQt6WebEngineCore.so.6.10.1
176M /var/lib/flatpak/repo/objects/18/046e9be743f0c9fdbf00e07c6b523c42f4ffa4b30133c7d39fdb63981fb85d.file
sort: write error
```

Top 3 largest files in `/var`:

1. `/var/lib/flatpak/repo/objects/16/f0a92fc05aa87097e2305db2492361268aa3fb8e0787d486bf338c996cd27b.file` — 198M
2. `/var/lib/flatpak/app/net.ankiweb.Anki/x86_64/stable/3d1fae7d305f5ca3b9b05caa23d15c494450d3d096bbc303d06695594c6f1eff/files/lib/x86_64-linux-gnu/libQt6WebEngineCore.so.6.10.1` — 198M
3. `/var/lib/flatpak/repo/objects/18/046e9be743f0c9fdbf00e07c6b523c42f4ffa4b30133c7d39fdb63981fb85d.file` — 176M

### Analysis

The main pattern is low real-time pressure despite relatively high persistent disk usage. CPU was mostly idle during the direct `top` snapshot, but the `iostat` samples showed periodic bursts in user and system CPU activity. I/O utilization stayed low overall, which matches the `pidstat` output where no process showed meaningful disk reads or writes during sampling. Disk space is the clearest concern: the root filesystem is already at 89% usage, and most of `/var` is consumed by Flatpak data and systemd journal logs.

### Reflection

I would optimize this system first by reducing disk pressure. The best candidates are cleaning old Flatpak objects, reviewing unused Flatpak apps, and rotating or vacuuming journal logs. I would also keep swap usage at zero, as it is now, by avoiding unnecessary background processes. If CPU spikes became more frequent, I would sample over a longer period and correlate the spikes with specific commands or scheduled jobs before changing anything.
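The cleanup above can be gated on actual disk pressure rather than run blindly. A minimal sketch, assuming GNU `df` with `--output` support; the 85% threshold and the 500M journal cap are illustrative values, not recommendations:

```shell
#!/bin/sh
# Sketch: only suggest cleanup when a mount point crosses a use% threshold.
needs_cleanup() {
    mount="$1" threshold="$2"
    # df --output=pcent prints e.g. " 89%"; strip everything but digits
    pct=$(df --output=pcent "$mount" | tail -n 1 | tr -dc '0-9')
    [ "$pct" -ge "$threshold" ]
}

if needs_cleanup / 85; then
    echo "root filesystem above 85% - consider:"
    echo "  flatpak uninstall --unused          # remove unreferenced runtimes"
    echo "  sudo journalctl --vacuum-size=500M  # cap systemd journal size"
fi
```

On this machine, with `/` at 89%, the sketch would print both suggestions.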

## Task 2 — Practical Website Monitoring Setup

### Chosen Website

Website URL:

```text
https://vk.com/
```

### Check Design

I chose `https://vk.com/` because it is a public site with a clear landing page and stable core content. A good monitoring setup for this site should cover:

1. Availability: an API or HTTP check that confirms the site returns HTTP `200`.
2. Content validation: a browser check that confirms the landing page loads and key visible text such as the sign-in or sign-up area appears.
3. Basic interaction: a browser step that checks that the login form fields or a primary action button are present and interactable.
4. Performance: a threshold on page load or response time to detect slowdowns before the site is fully unavailable.

### Suggested Checkly Configuration

API check:

```text
URL: https://vk.com/
Assertion: status code equals 200
Frequency: every 5 minutes
Scheduling strategy: round-robin
```
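Outside Checkly, the same assertion can be approximated from a shell with `curl`. A minimal sketch; the 10-second timeout is an assumption, and the helper names `status_is_ok` and `probe` are mine, not Checkly's:

```shell
#!/bin/sh
# Sketch of the API check: fetch the URL, keep only the HTTP status code,
# and pass only on an exact 200, mirroring the Checkly assertion.
status_is_ok() {
    [ "$1" = "200" ]
}

probe() {
    url="$1"
    # -s silences progress, -o discards the body, -w prints just the code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || return 1
    status_is_ok "$code"
}

# Example (requires network):
# probe https://vk.com/ && echo "UP" || echo "DOWN"
```

Scheduled from cron every 5 minutes, this would give a rough self-hosted fallback for the availability check.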

Browser check:

```text
Open https://vk.com/
Use a Playwright browser check named "Browser Check #1"
Navigate to the page and verify the response status is below 400
Capture a screenshot during the run
Browser frequency shown in the dashboard: every 10 minutes
```

Alerting:

```text
Notify when a check has failed 1 time
Also support alerting when a check is failing for more than 5 minutes
No reminder fan-out configured beyond the default screen shown
Notification channel: email
```
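The failure rule above amounts to a small piece of state logic. A sketch, where `record_result` is a hypothetical helper and `THRESHOLD=1` mirrors the "failed 1 time" setting:

```shell
#!/bin/sh
# Sketch of the alert rule: one failed run triggers an alert; a passing
# run resets the counter, so flapping checks re-alert on each new failure.
THRESHOLD=1
FAILS=0

record_result() {
    if [ "$1" = "pass" ]; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
    fi
    if [ "$FAILS" -ge "$THRESHOLD" ]; then
        echo "ALERT: check failing ($FAILS consecutive failures)"
    fi
}
```

Raising `THRESHOLD` would trade detection speed for fewer false alarms, which is the same trade-off discussed in the analysis below.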

### Screenshots

Browser check configuration:

![Browser check configuration](screenshots/browser_check.png)

Browser check successful run:

![Browser check passed](screenshots/browser_check_pass.png)

API check successful result:

![API check success](screenshots/check_success.png)

API check frequency settings:

![API check scheduling settings](screenshots/freq_settings.png)

Alert settings:

![Alert settings](screenshots/alert.png)

Dashboard overview:

![Dashboard overview](screenshots/dashboard.png)

### Analysis

These checks were chosen to cover both raw availability and real user experience. The API check is lightweight and runs every 5 minutes, so it can detect hard downtime quickly. The browser check runs less frequently, every 10 minutes, and validates that a real Chromium-based flow can open the site successfully. The screenshots show a passed API result with HTTP `200` in `1.21s` from Singapore and a passed browser run in about `7.9s`, which gives a useful baseline for future comparisons. The alert rule that triggers after one failed run is aggressive, but it makes sense for a public site where fast detection matters.

### Reflection

This monitoring setup improves reliability by covering both infrastructure symptoms and user-visible behavior. The API check confirms reachability and status code correctness, while the browser check proves that the page can actually load in a browser session. The dashboard screenshot also shows both checks at `100%` availability, with the API check averaging `788 ms` and the browser check averaging `7.49 s`, which is exactly the kind of baseline data SRE teams use to spot regressions early. Combined with email alerts, this setup helps surface problems before users have to report them.