Commit 3babeb4

Update practicals
1 parent c2540a0 commit 3babeb4

10 files changed: 3,863 additions & 1,096 deletions

File tree

en/Project/project.md: 390 additions & 147 deletions (large diff not rendered by default)
en/practical0/practical0.ipynb: 285 additions & 61 deletions (large diff not rendered by default)
en/practical1/practical1.ipynb: 107 additions & 20 deletions
@@ -11,15 +11,23 @@
 "\n",
 "- Parsing and working with CSV, TSV and JSON files\n",
 "- Querying external data sources\n",
-"- Data analyses\n",
+"- Data analysis with pandas\n",
 "\n",
-"### Exercises\n",
+"## Prerequisites\n",
 "\n",
-"1. Parsing and reading CSV/TSV files\n",
-"2. Parsing and reading JSON files\n",
-"3. Querying external data sources (Query endpoints and API)\n",
-"4. Performing classical data analyses\n",
-"\n"
+"- Completion of [Practical 0](../practical0/practical0.ipynb)\n",
+"- Basic understanding of Python data structures\n",
+"\n",
+"## Exercises Overview\n",
+"\n",
+"| Exercise | Difficulty | Topics |\n",
+"|----------|------------|--------|\n",
+"| Exercise 1 | ★ | Setup and package installation |\n",
+"| Exercise 2 | ★ | Parsing CSV/TSV files with NumPy |\n",
+"| Exercise 3 | ★★ | Parsing JSON files with pandas |\n",
+"| Exercise 4 | ★★ | Querying external data sources (Wikidata) |\n",
+"| Exercise 5 | ★★★ | Data grouping and aggregation |\n",
+"| Exercise 6 | ★★★ | Downloading and processing images |"
 ]
 },
 {
@@ -58,7 +66,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"!pip3 install numpy pandas matplotlib sparqlwrapper"
+"!pip3 install numpy pandas matplotlib sparqlwrapper opencv-python"
 ]
 },
 {
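A quick aside on the newly added package: `opencv-python` is the name used for installation, but the library is imported under a different name. A minimal check, assuming the install cell above has been run:

```python
# The opencv-python package is imported as cv2, not as "opencv"
import cv2

print(cv2.__version__)
```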
@@ -108,7 +116,7 @@
 "id": "805f1ac0-711b-42a1-90fd-f45a33b69d50",
 "metadata": {},
 "source": [
-"Practise the exercises given in [practicals 0](../practical0/practical0.ipynb)."
+"Practise the exercises given in [Practical 0](../practical0/practical0.ipynb) before continuing."
 ]
 },
 {
@@ -482,7 +490,27 @@
 "id": "2c698304-8f67-439e-b3ca-b350cfa29430",
 "metadata": {},
 "source": [
-"**Question**: What is the difference between np.loadtxt() and np.genfromtxt() when reading CSV files? When should you prefer one over the other?"
+"**Question 1**: What is the difference between `np.loadtxt()` and `np.genfromtxt()` when reading CSV files? When should you prefer one over the other?\n",
+"\n",
+"**Key Differences:**\n",
+"\n",
+"| Feature | `np.loadtxt()` | `np.genfromtxt()` |\n",
+"|---------|----------------|-------------------|\n",
+"| Missing values | Raises error | Handles gracefully |\n",
+"| Speed | Faster | Slower |\n",
+"| Flexibility | Less flexible | More options |\n",
+"| Use case | Clean, well-formatted data | Data with potential issues |\n",
+"\n",
+"**When to use which:**\n",
+"- Use `loadtxt()` when you're confident your data is clean and complete\n",
+"- Use `genfromtxt()` when dealing with real-world data that may have missing values or inconsistencies\n",
+"\n",
+"**Question 2**: Write a data validation function that checks the loaded dataset for:\n",
+"1. Any year values that are negative or in the future (> current year)\n",
+"2. Any names that are empty or contain only whitespace\n",
+"3. Return a report showing how many invalid entries were found\n",
+"\n",
+"**Hint:** Use a loop to iterate through the dataset and check each entry."
 ]
 },
 {
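For Question 2 in the cell above, a minimal sketch of the kind of validation loop the hint describes. It assumes the data has already been loaded into a pandas DataFrame and that the columns are called `name` and `year`; both names are assumptions and should be adjusted to the actual dataset.

```python
from datetime import datetime

import pandas as pd


def validate_dataset(df, name_col="name", year_col="year"):
    """Count invalid entries; the column names are assumptions, not fixed."""
    current_year = datetime.now().year
    invalid_years = 0
    invalid_names = 0

    for _, row in df.iterrows():
        year, name = row[year_col], row[name_col]
        # 1. Year must not be missing, negative, or in the future
        if pd.isna(year) or year < 0 or year > current_year:
            invalid_years += 1
        # 2. Name must be a non-empty, non-whitespace string
        if not isinstance(name, str) or name.strip() == "":
            invalid_names += 1

    # 3. Report how many invalid entries were found
    return {"invalid years": invalid_years, "invalid names": invalid_names}
```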
@@ -628,6 +656,7 @@
 "outputs": [],
 "source": [
 "# Get some descriptive summary of the dataframe\n",
+"# By default, describe() only shows statistics for numerical columns\n",
 "dataframe.describe()"
 ]
 },
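As a small follow-up to the comment added above: pandas can include the non-numeric columns in the summary as well by passing `include="all"`, for example:

```python
# Summary statistics for numeric and non-numeric columns alike
dataframe.describe(include="all")
```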
@@ -815,12 +844,15 @@
 "dataframe[\"year\"] = pd.to_numeric(dataframe[\"year\"], errors=\"coerce\")\n",
 "\n",
 "# Identify missing values\n",
+"print(\"Missing values per column:\")\n",
 "print(dataframe.isnull().sum())\n",
 "\n",
 "# Fill missing values in \"year\" with the median value\n",
-"dataframe[\"year\"].fillna(dataframe[\"year\"].median(), inplace=True)\n",
+"# Note: Using assignment instead of inplace=True (which is deprecated)\n",
+"dataframe[\"year\"] = dataframe[\"year\"].fillna(dataframe[\"year\"].median())\n",
+"\n",
 "# Fill missing values in \"languageLabel\" with \"Unknown\"\n",
-"dataframe[\"languageLabel\"].fillna(\"Unknown\", inplace=True)"
+"dataframe[\"languageLabel\"] = dataframe[\"languageLabel\"].fillna(\"Unknown\")"
 ]
 },
 {
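A quick sanity check for the cell above, reusing the same `dataframe` and column names, to confirm that the two fills removed every missing value:

```python
# Both columns should report zero missing values after filling
assert dataframe[["year", "languageLabel"]].isnull().sum().sum() == 0
```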
@@ -946,7 +978,24 @@
 "id": "9a53a3ac-d287-4a8b-8b62-4d7cac4063b4",
 "metadata": {},
 "source": [
-"**Question:** What is the difference between `json.load()` and `json.loads()` in Python? "
+"**Question 1:** What is the difference between `json.load()` and `json.loads()` in Python?\n",
+"\n",
+"**Question 2:** Create a new column `decade` that groups programming languages by their decade of creation (1950s, 1960s, etc.):\n",
+"\n",
+"```python\n",
+"# Example: 1962 should become \"1960s\"\n",
+"dataframe['decade'] = (dataframe['year'] // 10 * 10).astype(str) + 's'\n",
+"```\n",
+"\n",
+"Then:\n",
+"1. Count how many languages were created in each decade\n",
+"2. Find the decade with the most programming languages\n",
+"3. Create a bar chart showing the number of languages per decade\n",
+"\n",
+"**Question 3:** Clean and transform the `languageLabel` column:\n",
+"1. Convert all names to title case (first letter capitalized)\n",
+"2. Remove any leading/trailing whitespace\n",
+"3. Identify any duplicate entries (same language, same year)"
 ]
 },
 {
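For Question 1 in the cell above, a short illustration of the difference (the filename is a placeholder): `json.load()` reads JSON from a file-like object, while `json.loads()` parses a string that is already in memory.

```python
import json

# json.load(): parse JSON from an open file object
with open("languages.json") as f:  # placeholder filename
    data_from_file = json.load(f)

# json.loads(): parse JSON from a string
data_from_string = json.loads('{"languageLabel": "Python", "year": 1991}')
```

For Question 2, note that if `year` ended up as a float column (for example after the median fill earlier), casting with `.astype(int)` before `.astype(str)` keeps the labels as `"1960s"` rather than `"1960.0s"`.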
@@ -980,7 +1029,13 @@
 "cell_type": "markdown",
 "id": "c01a458e-d30f-4b75-aa33-0932a183b23f",
 "metadata": {},
-"source": []
+"source": [
+"**What is SPARQL?**\n",
+"\n",
+"SPARQL (SPARQL Protocol and RDF Query Language) is a query language used to retrieve and manipulate data stored in RDF (Resource Description Framework) format. Wikidata uses RDF to store its knowledge graph, and SPARQL allows us to query this data.\n",
+"\n",
+"The URL above contains an encoded SPARQL query that retrieves programming languages and their inception years from Wikidata."
+]
 },
 {
 "cell_type": "code",
@@ -994,8 +1049,16 @@
 "import pandas as pd\n",
 "\n",
 "url = \"https://query.wikidata.org/sparql?query=SELECT%20%3FlanguageLabel%20(YEAR(%3Finception)%20as%20%3Fyear)%0AWHERE%0A%7B%0A%20%20%23instances%20of%20programming%20language%0A%20%20%3Flanguage%20wdt%3AP31%20wd%3AQ9143%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP571%20%3Finception%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FlanguageLabel.%0A%20%20FILTER(lang(%3FlanguageLabel)%20%3D%20%22en%22)%0A%7D%0AORDER%20BY%20%3Fyear%0ALIMIT%20100&format=json\"\n",
-"response = urllib.request.urlopen(url)\n",
-"responsedata = json.loads(response.read().decode(\"utf-8\"))\n",
+"\n",
+"req = urllib.request.Request(\n",
+"    url,\n",
+"    headers={\n",
+"        \"User-Agent\": \"GitHubAction/1.0 (https://github.com/johnsamuelwrites/DataMining)\"\n",
+"    },\n",
+")\n",
+"\n",
+"with urllib.request.urlopen(req) as response:\n",
+"    responsedata = json.loads(response.read().decode(\"utf-8\"))\n",
 "\n",
 "array = []\n",
 "\n",
@@ -1359,15 +1422,39 @@
 " the above query results) since 2010.\n",
 "\n",
 "**Hint**: Take a look at functions groupby, reset_index, head, tail, sort_values, count of Pandas\n",
-"\n"
+"\n",
+"**Question 3: Merging DataFrames**\n",
+"\n",
+"You now have two datasets from this practical:\n",
+"- Programming languages data (language name, year)\n",
+"- Population data (country, year, population)\n",
+"\n",
+"Create a summary DataFrame that shows, for each year:\n",
+"1. Number of programming languages created\n",
+"2. Any population data available\n",
+"\n",
+"Use `pd.merge()` to combine these datasets:\n",
+"```python\n",
+"# Example of merging on year\n",
+"merged = pd.merge(lang_df, pop_df, on='year', how='outer')\n",
+"```\n",
+"\n",
+"**Questions to answer:**\n",
+"- Which years have both programming language and population data?\n",
+"- What type of merge (inner, outer, left, right) is most appropriate here? Why?"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "99eb67df-0773-4a58-be20-6d22d0dfaa3b",
 "metadata": {},
 "source": [
-"**Note**: If you get time-out errors, please change the LIMIT to some lower values (1000, 2000, 5000)."
+"**Troubleshooting:**\n",
+"\n",
+"- If you get timeout errors, reduce the `LIMIT` value in the query (try 1000, 2000, or 3000)\n",
+"- If the query service is unavailable, try again after a few minutes\n",
+"- Make sure you have a stable internet connection\n",
+"- The Wikidata query service may have usage limits during peak hours"
 ]
 },
 {
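For the merging exercise (Question 3) added above, a sketch of one possible approach. It assumes `lang_df` holds the programming-language results with `languageLabel` and `year` columns and `pop_df` holds the population results with a `year` column; the exact variable and column names depend on how the earlier DataFrames were built.

```python
import pandas as pd

# Count the number of programming languages created per year
languages_per_year = (
    lang_df.groupby("year")
    .size()
    .reset_index(name="language_count")
)

# An outer merge keeps years that appear in either dataset;
# an inner merge would keep only years present in both
merged = pd.merge(languages_per_year, pop_df, on="year", how="outer")
```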
@@ -1536,7 +1623,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": ".venv",
 "language": "python",
 "name": "python3"
 },
@@ -1550,7 +1637,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.12"
+"version": "3.13.0"
 }
 },
 "nbformat": 4,
