Merged
69 changes: 62 additions & 7 deletions .hyf/test.sh
100644 → 100755
@@ -1,13 +1,68 @@
#!/usr/bin/env bash
set -euo pipefail

-# Run your test scripts here.
-# Auto grade tool will execute this file within the .hyf working directory.
-# The result should be stored in score.json file with the format shown below.
-cat << EOF > score.json
# Week 9 is a SQL assignment, graded by teacher review against the rubric.
# This auto-grade is a COMPLETENESS smoke check only: it confirms every required
# deliverable exists, is non-empty, and has had its TODO placeholders filled in.
# It does NOT run SQL against a database, and it is NOT the final grade.
#
# The tool runs this script from the .hyf working directory and reads .hyf/score.json,
# so we resolve the repo root explicitly and write score.json next to this script.

HERE="$(cd "$(dirname "$0")" && pwd)"
ROOT="$(cd "$HERE/.." && pwd)"
score=0

# A deliverable counts as "done" only when it exists, is non-empty, and has no TODO left.
# This is what makes the untouched scaffold score 0: every starter file is full of TODOs.
done_file() {
  local f="$ROOT/$1"
  [ -s "$f" ] && ! grep -qiE "todo" "$f"
}

# Task 1 (20): validation_queries.sql filled, with the expected check patterns.
if done_file validation_queries.sql; then
  score=$((score + 8))
  grep -qiE "having[[:space:]]+count" "$ROOT/validation_queries.sql" && score=$((score + 4))
  grep -qiE "is[[:space:]]+null" "$ROOT/validation_queries.sql" && score=$((score + 4))
  grep -qiE "min\(|max\(" "$ROOT/validation_queries.sql" && score=$((score + 4))
fi

# Task 2 (30): schema_setup.sql creates both views and references fares.
if done_file schema_setup.sql; then
  score=$((score + 6))
  grep -qiE "view[[:space:]]+vw_dim_zones" "$ROOT/schema_setup.sql" && score=$((score + 8))
  grep -qiE "view[[:space:]]+vw_fact_trips" "$ROOT/schema_setup.sql" && score=$((score + 8))
  grep -qiE "fare_amount" "$ROOT/schema_setup.sql" && score=$((score + 8))
fi

# Task 3 (20): data_dictionary.md filled and states a grain.
if done_file data_dictionary.md; then
  score=$((score + 14))
  grep -qiE "grain" "$ROOT/data_dictionary.md" && score=$((score + 6))
fi

# Task 4 (20): verification_results.sql filled + borough screenshot present.
if done_file verification_results.sql; then
  score=$((score + 10))
  grep -qiE "borough" "$ROOT/verification_results.sql" && score=$((score + 5))
fi
[ -f "$ROOT/assets/borough_count.png" ] && score=$((score + 5))

# Task 5 (10): AI_ASSIST.md filled.
if done_file AI_ASSIST.md; then
  score=$((score + 10))
fi

[ "$score" -gt 100 ] && score=100
if [ "$score" -ge 60 ]; then pass=true; else pass=false; fi

cat > "$HERE/score.json" <<EOF
{
-  "score": 0,
-  "pass": true,
-  "passingScore": 0
  "score": ${score},
  "pass": ${pass},
  "passingScore": 60
}
EOF

echo "Completeness score: ${score}/100 (pass=${pass}). Final grade is teacher review against the rubric."
21 changes: 21 additions & 0 deletions AI_ASSIST.md
@@ -0,0 +1,21 @@
# AI Assistance Log

Document one session where you used an LLM to help with a query or a design decision while completing Tasks 1-4. Replace every TODO.

> ⚠️ Never paste real customer data or PII into an LLM. The NYC taxi dataset used here is public, so sample rows are safe to share.

## The problem

TODO: What were you trying to solve? Paste the relevant SQL or schema fragment.

## The prompt

TODO: What did you ask the AI? Include the context you provided.

## The response

TODO: What did it suggest? Did it work first try?

## Reflection

TODO: Did you understand *why* the suggestion worked, or did you accept it blindly?
43 changes: 33 additions & 10 deletions README.md
@@ -1,17 +1,40 @@
-# [Track] week X assignment
-HackYourFuture <Track> week X assignment
-The Week X assignment for the HackYourFuture <TRACK> can be found at the following link: [TODO: Assignment url in the learning platform]
# Data Track Week 9 Assignment: SQL for Analytics

HackYourFuture Data Track, Week 9. The full brief (scenario, tasks, and grading) lives in the curriculum: **Week 9 → Assignment** in the HackYourFuture learning platform. This repo holds the starter files you fill in.

-## Implementation Instructions
You audit the raw NYC taxi data, model it as a star schema of SQL **views**, and document it. Read from the two source tables, `nyc_taxi.raw_trips` (~57K green-taxi trips, January 2024) and `nyc_taxi.raw_zones` (265 location lookups), and create your views in **your own assigned schema** on the shared Azure PostgreSQL instance, not in the shared `public` schema.

-Provide clear instructions on how trainees should implement the tasks.
## What you submit

-### Task 1
-Instructions for Task 1
Fill in these files (starters are provided). Keep them at the repo root and do not rename them.

-### Task 2
-Instructions for Task 2
| File | Task | What it holds |
|---|---|---|
| `validation_queries.sql` | Task 1 | Data-quality audit: duplicates, nulls, range, orphaned keys |
| `schema_setup.sql` | Task 2 | `CREATE OR REPLACE VIEW vw_dim_zones` and `vw_fact_trips` |
| `data_dictionary.md` | Task 3 | Grain, keys, and measures for both views |
| `verification_results.sql` | Task 4 | Verification queries (volume, revenue, geospatial, time patterns) |
| `assets/borough_count.png` | Task 4 | Screenshot of the per-borough row-count result |
| `AI_ASSIST.md` | Task 5 | One documented LLM session |

-...
## Tasks (summary)

1. **Data Quality Audit** (`validation_queries.sql`): find duplicate trips, count NULL pickup/dropoff location IDs, check the `fare_amount` range for negatives, and find `pickup_location_id` values not present in `nyc_taxi.raw_zones`.
2. **Star Schema Views** (`schema_setup.sql`): `vw_dim_zones` (one row per `location_id`, the primary key) and `vw_fact_trips` (one row per trip; exclude `fare_amount < 0`; cast `pickup_datetime` to `TIMESTAMP`; keep the location IDs so it joins to `vw_dim_zones`).
3. **Data Dictionary** (`data_dictionary.md`): state each view's grain in one sentence, identify keys, list measures.
4. **Verification Queries** (`verification_results.sql`): query the views for volume, revenue, geospatial, and time-pattern questions, joining through `vw_dim_zones` for any borough/zone name. Save a screenshot of the per-borough counts to `assets/borough_count.png`.
5. **AI Assistance Log** (`AI_ASSIST.md`): document one LLM session honestly.

## How you are graded

- **Auto-grade (on PR creation):** a **completeness** smoke check confirms every required deliverable exists, is non-empty, has no `TODO` placeholders left, and contains the expected views and checks. A completeness score of 60 or more out of 100 passes the check. It does **not** run SQL against a database and is **not** your final grade.
- **Teacher review:** your teacher grades correctness against the rubric: do the queries run, do findings match the real data, does `vw_fact_trips` filter negatives and join cleanly, is the grain stated precisely.
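
The per-file rule behind the completeness check can be sketched as follows. This is a minimal illustration based on the rule in `.hyf/test.sh` (file exists, is non-empty, contains no "todo" in any letter case), not the grader itself: `is_done`, `check_status`, and the scratch file names are hypothetical, and the sketch prints a label instead of computing a score.

```shell
# Hypothetical scratch directory so the sketch never touches your real repo.
demo_dir="$(mktemp -d)"

# Assumed rule, mirroring the auto-grade comments: a file counts as done only
# when it exists, is non-empty, and contains no "todo" (case-insensitive).
is_done() {
  [ -s "$1" ] && ! grep -qiE "todo" "$1"
}

# Hypothetical helper: print a label for one file.
check_status() {
  if is_done "$1"; then echo "done"; else echo "not done"; fi
}

# Three stand-in files: an untouched starter, a filled-in file, an empty file.
printf '%s\n' '-- TODO: write the duplicate check' > "$demo_dir/scaffold.sql"
printf '%s\n' 'SELECT vendor_id FROM nyc_taxi.raw_trips;' > "$demo_dir/filled.sql"
: > "$demo_dir/empty.sql"

scaffold_status="$(check_status "$demo_dir/scaffold.sql")"
filled_status="$(check_status "$demo_dir/filled.sql")"
empty_status="$(check_status "$demo_dir/empty.sql")"
missing_status="$(check_status "$demo_dir/does_not_exist.sql")"

echo "scaffold: $scaffold_status"
echo "filled:   $filled_status"
echo "empty:    $empty_status"
echo "missing:  $missing_status"

rm -rf "$demo_dir"
```

One consequence of this rule: a single leftover `TODO` anywhere in a file, even inside a comment, keeps that file from counting, so delete the placeholders rather than answering underneath them.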

## Submit

1. Work on a branch in your copy of this repo.
2. Fill in each deliverable file.
3. Commit, push, and open a Pull Request against `main`. The auto-grade runs on PR creation and posts a completeness score.
4. Share the PR URL with your teacher.

> ⚠️ Never paste real customer data or PII into an LLM. The NYC taxi dataset used here is public and safe to share.
1 change: 1 addition & 0 deletions assets/.gitkeep
@@ -0,0 +1 @@
# Save your Task 4 screenshot here as borough_count.png
17 changes: 17 additions & 0 deletions data_dictionary.md
@@ -0,0 +1,17 @@
# Data Dictionary

Document both views. State the grain in one sentence, identify the keys, and list the measures (the columns you can aggregate). Replace every TODO.

## vw_fact_trips

- **Grain:** TODO (one sentence, e.g. "One row per ...")
- **Primary key:** TODO
- **Foreign keys:** TODO
- **Measures:** TODO (columns you would SUM or AVG)

## vw_dim_zones

- **Grain:** TODO
- **Primary key:** TODO
- **Foreign keys:** TODO (or "none")
- **Measures:** TODO (or "none, descriptive attributes only")
26 changes: 26 additions & 0 deletions schema_setup.sql
@@ -0,0 +1,26 @@
-- Task 2: Star Schema Views (create these in YOUR OWN schema, not public).
-- CREATE OR REPLACE VIEW lets you re-run this script while you iterate.

-- Dimension: one row per location_id. Treat location_id as the primary key.
-- TODO: complete the SELECT (location_id, zone, borough).
CREATE OR REPLACE VIEW vw_dim_zones AS
SELECT
-- TODO
FROM nyc_taxi.raw_zones;

-- Fact: one row per taxi trip.
-- - Exclude rows where fare_amount is less than 0.
-- - Cast pickup_datetime to TIMESTAMP.
-- - Keep the location IDs so the view can join to vw_dim_zones.
-- TODO: complete the SELECT and the WHERE.
CREATE OR REPLACE VIEW vw_fact_trips AS
SELECT
-- TODO
FROM nyc_taxi.raw_trips
-- TODO: WHERE fare_amount >= 0
;

-- Join-readiness test (run after creating the views; it must run without error
-- and return a count close to the vw_fact_trips row count):
-- SELECT COUNT(*) FROM vw_fact_trips f
-- JOIN vw_dim_zones d ON f.pickup_location_id = d.location_id;
Empty file removed task-1/task 1 files
Empty file.
Empty file removed task-2/task 2 files
Empty file.
20 changes: 20 additions & 0 deletions validation_queries.sql
@@ -0,0 +1,20 @@
-- Task 1: Data Quality Audit
-- Query nyc_taxi.raw_trips and nyc_taxi.raw_zones directly; anything you create belongs in YOUR OWN schema (not public).
-- The shared pattern is a query that returns the bad rows (or a count of them).
-- Zero rows back (or a count of 0) means the check passed.

-- 1. Duplicate check: are there rows with the same vendor_id, pickup_datetime, dropoff_datetime?
-- TODO: GROUP BY the three columns and keep only groups with HAVING COUNT(*) > 1.


-- 2. Null integrity: how many rows have a NULL pickup_location_id or dropoff_location_id?
-- TODO: count the NULLs (COUNT(*) FILTER (WHERE ... IS NULL) is handy for several columns at once).


-- 3. Range validation: what are the min and max fare_amount? Are there negative values?
-- TODO: SELECT MIN(fare_amount), MAX(fare_amount), and a count of rows where fare_amount < 0.


-- 4. Relationship check: which pickup_location_id values in nyc_taxi.raw_trips do NOT exist in nyc_taxi.raw_zones?
-- TODO: LEFT JOIN nyc_taxi.raw_zones ... WHERE z.location_id IS NULL (or NOT EXISTS).
-- Do NOT use NOT IN: a single NULL in the subquery hides every orphan.
22 changes: 22 additions & 0 deletions verification_results.sql
@@ -0,0 +1,22 @@
-- Task 4: Verification Queries.
-- Query your views and label each query with the question it answers.
-- Borough and zone names live in vw_dim_zones, so join on pickup_location_id = location_id.

-- 1. Volume: how many total rows in vw_fact_trips? How many rows per borough?
-- What is the most common pickup/dropoff location combination?
-- TODO
-- (Take a screenshot of the per-borough counts and save it as assets/borough_count.png.)


-- 2. Revenue: which pickup zone (name, not ID) generated the highest total fare_amount?
-- Which pickup zone collected the highest total fare_amount on any single day?
-- TODO


-- 3. Geospatial: total number of trips and average trip_distance for each borough.
-- TODO


-- 4. Time patterns: which day of the week had the highest total tip_amount?
-- What hour of the day has the highest average tip?
-- TODO