"Hello World! This is Marc Planas :)"

The aim of this project was:
- To identify the top 10 activities (sports) in which most shark attacks occur and investigate their survivability.
  Hypothesis: Since surfing is a very popular sport, most reports will involve people practicing it. However, since surfers are very aware of their surroundings, surfing might not be the deadliest activity.
- To identify a so-called 'Shark Season' based on the number of reports per month and country, and to investigate whether it has changed over the past years. This might give us insights into the migratory patterns of these animals.
  Hypothesis: Global warming might impact sharks' migratory events in a way that is reflected in the number of attack reports.
An analysis on the sharks.csv dataset was conducted to investigate these issues.
The process described below was performed in the following jupyter notebook: my_project_cleaning_exploration.ipynb
- Dimensions of the original dataset: 25723 rows × 24 columns.
- Almost all categories (columns) contained ~20,000 `NaN` (missing) values.
- The dataset also contained 2392 duplicated reports (rows).
- Reports with `NaN` values in all columns and duplicated rows were dropped, lowering the number of rows with potentially meaningful data to 6311.
After a closer inspection, it was decided to drop some categories that would not be useful for the scope of the analysis and that contained repetitive information. Columns dropped: `pdf`, `href formula`, `href`, `original order`, `Unnamed: 22`, `Unnamed: 23`, `Case Number.1` and `Case Number.2`, lowering the number of columns to 16.
Then, reports in which half of the information was missing (i.e. `NaN` values in 8 out of the 16 columns) were also dropped, since they were ruled unreliable. This resulted in a significantly cleaner and more trustworthy dataset, with 6302 rows. In addition, column headers were cleaned using the `strip()` method.
The resulting (6302 × 16) dataframe was saved as `sharks_clean.csv` and was used for further exploration and analysis.¹
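The cleaning steps above can be sketched in pandas. This is a minimal toy example: the tiny dataframe and its column names are illustrative, not the real dataset.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for sharks.csv (values and columns are illustrative).
df = pd.DataFrame({
    " Date ": ["10-Jun-2018", np.nan, "10-Jun-2018", np.nan],
    "Country": ["USA", np.nan, "USA", np.nan],
    "Activity": ["Surfing", np.nan, "Surfing", np.nan],
    "Fatal (Y/N)": ["N", np.nan, "N", np.nan],
})

df = df.dropna(how="all")      # drop rows where every column is NaN
df = df.drop_duplicates()      # drop duplicated reports
# dropna(thresh=N) keeps rows with at least N non-null values; for the real
# 16-column frame, thresh=9 would drop rows with 8 or more NaN.
df = df.dropna(thresh=len(df.columns) // 2)
df.columns = df.columns.str.strip()  # clean header whitespace
```

Only one complete, deduplicated row survives in this toy case, and the padded `" Date "` header becomes `"Date"`.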
The process described below was performed in the following jupyter notebook: my_project_cleaning_exploration.ipynb
An initial exploration was conducted using the value_counts() method to see the distribution of values within each column.
In addition, after inspecting the number of `NaN` values in each column (the lower the better, i.e. more statistically powerful), the following categories appeared to be good candidates to investigate:
- Date/Year
- Type
- Country/Area/Location
- Activity
- Injury/Fatal (Y/N), a.k.a. survivability
- Gender (Sex)
The process described below was performed in the following jupyter notebook: my_project_analysis.ipynb
Starting dataset: sharks_clean.csv.
Outcome after cleaning and grouping: sharks_clean_activity_fatal.csv.
- Removing rows with `NaN` in the Activity column. Result: `df.shape = (5758, 16)`.
- Cleaning inconsistencies: `lower()`, `strip()`, `replace()`.
- Grouping activities using RegEx:
  - Dictionary with RegEx pattern as value.
  - Iterating over the dictionary and using `replace()` to group similar activities.
```python
dict_activity_regex = {
    "Board surfing"      : r".*(surf).*|.*(boogie board).*|.*(body board).*",
    "Kayaking & similar" : r".*(kayak).*|.*(canoe).*|.*(rowing).*",
    "Diving"             : r".*(diving).*",
    "Paddle boarding"    : r".*(paddle).*",
    "Sailing"            : r".*(boat).*|.*(sail).*|.*(ship).*|.*(overboard).*",
    "Snorkeling"         : r".*(snorkel).*",
    "Swimming"           : r".*(bathing).*|.*(swimming).*|.*(float).*",
    "Spear-fishing"      : r".*(spearfishing).*",
    "Fishing"            : r"[\w\s]+?(fishing).*|^(fishing).*",
    "Wading"             : r".*wad.*|.*(walking).*|.*(standing).*|.*(treading).*"
}

for key, value in dict_activity_regex.items():
    sharks_clean["Activity"] = sharks_clean["Activity"].str.replace(value, key, regex=True)
```

- Removing rows with `NaN` in the Fatal column. Result: `df.shape = (5344, 16)`.
- Cleaning inconsistencies: `strip()`, `upper()`.
- Removing rows where the fatal value is not `Y` or `N`, using filtering conditions.
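The fatal-column cleanup can be sketched like this. The toy series is made up; the column name `Fatal (Y/N)` is an assumption about the dataset's schema.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (illustrative values).
sharks_clean = pd.DataFrame({
    "Fatal (Y/N)": [" y ", "N", "UNKNOWN", "Y"],
})

# Normalize whitespace and case, then keep only Y/N reports.
sharks_clean["Fatal (Y/N)"] = sharks_clean["Fatal (Y/N)"].str.strip().str.upper()
sharks_clean = sharks_clean[sharks_clean["Fatal (Y/N)"].isin(["Y", "N"])]
```

The `isin(["Y", "N"])` mask discards ambiguous values like `UNKNOWN` instead of guessing an outcome for them.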
- Countplot of the top 10 activities/sports, using the fatal category as `hue`.
- Image file: `activity_fatality.jpg`.
- Although board-related sports account for most shark attack reports, fatalities concentrate on swimmers, who have fewer tools to escape an attack or are less aware of their surroundings.
Starting dataset: sharks_clean.csv.
Outcome after extracting month and cleaning both month and country: sharks_clean_month_country.csv.
- The month of the shark attack was extracted from the `Date` column using the RegEx `-(\w{3})-` and the `extract()` method to create a new `Month` column.
- When a row did not contain that pattern, the month was filled with `NaN`. Out of the 6302 initial rows, 910 now contained missing values. These rows were sacrificed, since they either did not contain the month or the annotation was deemed unreliable (e.g. "It happened after August 1800").
- Rows with `NaN` in the `Month` column were removed, resulting in 5392 shark attack reports.
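A minimal sketch of the month extraction on toy dates (the lowercasing step is an assumption, made to match the lowercase month keys used for encoding):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["25-Jun-2018", "Before 1900", "03-Jan-2017"]})

# -(\w{3})- captures the three-letter month between the day and the year;
# rows without that pattern come back as NaN and are then dropped.
df["Month"] = df["Date"].str.extract(r"-(\w{3})-", expand=False).str.lower()
df = df.dropna(subset=["Month"])
```

Here `"Before 1900"` has no `-xxx-` pattern, so its row is removed, mirroring the 910 unreliable rows described above.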
For visualization purposes, months were then encoded using the numbers 1-12; otherwise they would be ordered alphabetically on the x-axis of the plot:
```python
# Creating a dictionary to store the old and new names for months:
dict_months = {"jan" : 1,
               "feb" : 2,
               "mar" : 3,
               "apr" : 4,
               "may" : 5,
               "jun" : 6,
               "jul" : 7,
               "aug" : 8,
               "sep" : 9,
               "oct" : 10,
               "nov" : 11,
               "dec" : 12
}

# Replacing the name by the value
sharks_clean.replace({"Month" : dict_months}, inplace = True)
```

- Rows with `NaN` in Country were removed. Result: `df.shape = (5361, 17)`.
- Unique countries were checked for inconsistencies.
- The top 3 countries (USA, Australia, South Africa) were selected as they represent most of the reports.
- Defined filtering conditions for each Country.
- Report counts were plotted grouped by month.
- Image file: `shark_season_countries.jpg`.
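The per-country monthly counts behind that plot can be sketched as follows (toy data; the uppercase country spellings are an assumption about how the dataset records them):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "USA", "AUSTRALIA", "SPAIN"],
    "Month":   [7, 7, 1, 8],
})

# Keep only the top 3 countries, then count reports per country and month.
top3 = ["USA", "AUSTRALIA", "SOUTH AFRICA"]
df = df[df["Country"].isin(top3)]
counts = df.groupby(["Country", "Month"]).size()
```

`counts` is then ready to feed a countplot or lineplot per country.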
- 'Shark Season' actually reflects summer season in the northern hemisphere (USA) and southern hemisphere (Australia, South Africa). 🙃
- Most reports are observed during summer season due to people practicing sea-related activities.
'Shark Season' / summer season might have looked different in the past due to:
- Climate change / global warming.
- Increased ease to report attacks over the years.
Starting dataframe: sharks_clean_month_country.csv
Outcome after cleaning the year: sharks_clean_month_country_year.csv
- The `Year` column was cleaned after checking its `unique()` values:

```python
try:
    sharks_clean["Year"] = sharks_clean["Year"].astype(int)
except Exception:
    sharks_clean["Year"] = sharks_clean["Year"].fillna(0).astype(int)
```

- Rows in which `Year == 0` were removed. Result: `df.shape = (5352, 17)`.
- Worldwide reports were visualized using a histplot.
- Image file: `shark_reports_year.jpg`.
- Most reports occur after 1900, probably due to the increased ease of documenting attacks. Thus, to have more powerful data (the more reports the better), it makes sense to focus on this last century.
- Created a list of the countries in each hemisphere.
- Applied a function to rename the countries (in a new column) if they belonged to one of the lists (re-checked the `unique()` values). Dataframe stored as: `sharks_clean_month_year_hemisphere.csv`.

```python
south = ["country_1", "country_2", ...]
north = ["country_3", "country_4", ...]

def hemisphere(x):
    if x.lower() in south:
        return "Southern"
    elif x.lower() in north:
        return "Northern"
    else:
        return x
```

- Created one dataframe for each hemisphere and visualized the evolution of reports grouped by month. Each line represents a span of 20 years (1940-2018).
  - `northern.shape = (2931, 18)`. Image file: `month_year_northern.jpg`.
  - `southern.shape = (2365, 18)`. Image file: `month_year_southern.jpg`.
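One way to bucket reports into 20-year spans for those per-span lines is `pd.cut`; this is a sketch on toy data, and the exact bin edges are an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "Year":  [1945, 1975, 2005, 2015],
    "Month": [6, 7, 6, 8],
})

# Bucket years into roughly 20-year spans (edges are illustrative),
# then count reports per span and month; each span becomes one plotted line.
df["Span"] = pd.cut(df["Year"], bins=[1940, 1960, 1980, 2000, 2018])
counts = df.groupby(["Span", "Month"], observed=True).size()
```

`observed=True` keeps the result to spans that actually contain reports, instead of padding every empty span/month combination with zeros.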
- A new dataframe was created (`sharks_clean_month_year_usa.csv`) selecting only the reports belonging to the USA, using filtering conditions. Result: `df.shape = (2049, 17)`.
- Reports were plotted grouped by month, filtering each plot by spans of 20 years, from 1940 to 2018.
- Image file: `month_year_USA.jpg`.
- Over the last 20-40 years, the concentration of attack reports seems to start earlier (from April) and last longer (until October). This could be due to:
- Increased ease to report cases.
- Summer season starts earlier and lasts longer due to global warming, leading to more attacks.
- A similar process was followed for Australia, the country with the second-most reports, which is located in the southern hemisphere. The dataframe was saved as `sharks_clean_month_year_australia.csv`.
- Image file: `month_year_australia.jpg`.
- Similar to USA, an increase in reports during 'winter' months is observed in the last 20 years. Global warming is not a hoax!
- This long summer effect is easily seen in the northern hemisphere.
Further reading: click here! (just don't donate to the Science journal...)
Finally, some demographic investigations were conducted to visualize which age and gender accumulated more fatality reports.
Starting dataframe: sharks_clean_activity_fatal.csv.
- 2180 rows with `NaN` age were removed.
- Cleaned inconsistencies: `.map(str)`, followed by `strip()`.
- RegEx was used to extract the age: `(\d{1,2})`.
- New `NaN` values that did not follow this pattern (26) were removed.
- Ages were transformed back to `int` using `map`.
- Age distribution was plotted using a violin plot, with fatality shown on the y-axis.
- Image file: `age_fatality.jpg`.
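The age-cleaning chain above can be sketched on toy values (the messy entries are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, "18 ", "teen", "30s"]})

# Stringify mixed values, strip whitespace, then pull the first 1-2 digits;
# entries with no digits (e.g. "teen") become NaN and are dropped.
df["Age"] = df["Age"].map(str).str.strip().str.extract(r"(\d{1,2})", expand=False)
df = df.dropna(subset=["Age"])
df["Age"] = df["Age"].map(int)  # back to integers for plotting
```

Note `expand=False`, which makes `extract()` return a Series rather than a one-column DataFrame.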
- People who report shark attacks are young, since they are the ones who mostly practice the above-mentioned sports.
Starting dataframe: sharks_clean_activity_fatal.csv.
- 362 rows with `NaN` gender were removed.
- Cleaned inconsistencies: `strip()`.
- Only rows containing a value of `M` or `F` were kept.
- Gender counts were plotted using a countplot, with fatality shown on the y-axis.
- Image file: `gender_fatality.jpg`.
- Most shark attack reports accumulate in men. So, if you are a man in your 20s, just don't swim in summer 🤷♂️ (you can still surf though!).
Footnotes

1. The dataset was also cleaned during the analysis stage when aiming at specific categories. ↩









