To enrich a Pokémon dataset obtained from Kaggle:
- As of today, the Pokémon games span 8 generations, for a total of 905 creatures, so this dataset is outdated.
- In addition, it is missing some categories that would make for an interesting investigation.
2.1- Using the Poke API
I defined a function that fetches the missing Pokémon by iterating over the API URL with each Pokémon's ID:
import time

import requests
from pandas import json_normalize

def fetchMissing(a, b):
    '''This function receives two pokemon id numbers as integers ('a' and 'b'),
    and returns a dataframe containing the specified pokemon from the ids 'a' to 'b'.
    '''
    missing_pokes = []
    for i in range(a, b + 1):
        time.sleep(1)  # pause between requests so the API is not flooded
        print(f"Fetching pokemon with id: {i}")
        response_api = requests.get(f"https://pokeapi.co/api/v2/pokemon/{i}/").json()
        missing_pokes.append(response_api)
    return json_normalize(missing_pokes)

The request returned loads of information, but it needed to be extracted and cleaned. Luckily, weight and height were already clean; I just needed to convert the units.
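A minimal sketch of this kind of clean-up, working on one raw response dict rather than on the normalized dataframe. It is an alternative to the string-based extraction used in this project, not the code that was actually run. The helper name `flatten_pokemon`, the output column names for height and weight, and the sample values are assumptions made for illustration; the sketch also assumes the API reports height in decimetres and weight in hectograms, and that each `stats` entry carries a `base_stat` and a nested `stat` name.

```python
# Hypothetical helper, not part of the original project.
# Maps the API's stat names to the column names used in the dataset.
STAT_NAMES = {
    "hp": "HP",
    "attack": "Attack",
    "defense": "Defense",
    "special-attack": "Sp. Atk",
    "special-defense": "Sp. Def",
    "speed": "Speed",
}

def flatten_pokemon(record):
    """Flatten one raw API response (a dict) into a single flat row."""
    row = {
        "Id": record["id"],
        "Name": record["name"],
        # Assumed units: decimetres and hectograms, so divide by 10.
        "Height (m)": record["height"] / 10,
        "Weight (kg)": record["weight"] / 10,
    }
    for entry in record["stats"]:
        column = STAT_NAMES.get(entry["stat"]["name"])
        if column is not None:
            row[column] = entry["base_stat"]
    return row

# Illustrative record trimmed to the fields used above (values are made up).
sample = {
    "id": 722,
    "name": "rowlet",
    "height": 3,
    "weight": 15,
    "stats": [
        {"base_stat": 68, "effort": 0, "stat": {"name": "hp"}},
        {"base_stat": 55, "effort": 0, "stat": {"name": "attack"}},
        {"base_stat": 42, "effort": 1, "stat": {"name": "speed"}},
    ],
}
row = flatten_pokemon(sample)
```

Reading the values by key avoids depending on the position of each number inside a string, at the cost of having to keep the raw JSON around.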
Since the 'stats' entries all followed the same format, I could use indexing and split methods to extract each stat:

def extractStats(df):
    '''This function receives a dataframe and extracts the HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed from 'stats'.
    It returns the df with the extracted values of each stat for each pokemon in new columns.
    '''
    # Position of each stat's value once the 'stats' text is split by commas
    positions = {'HP': 0, 'Attack': 4, 'Defense': 8, 'Sp. Atk': 12, 'Sp. Def': 16, 'Speed': 20}
    for column, position in positions.items():
        # str() makes this work whether 'stats' holds the raw list or its string form
        df[column] = [
            int("".join(x for x in str(stats).split(",")[position] if x.isdigit()))
            for stats in df['stats']
        ]
    return df

Once extracted, I added some columns that were missing from the response:
- Total score, based on the other stats:
def calculateTotal(df):
    '''This function receives a dataframe and calculates the Total score of a pokemon
    based on the sum of its stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed).
    It returns the dataframe with the Total column.
    '''
    df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']
    return df

- Generation to which each Pokémon belongs, based on its ID:
def setGeneration(df):
    '''This function receives a dataframe and returns it with a Generation column,
    set from the ID of each pokemon.
    '''
    generation_7 = range(722, 810)  # IDs 722 to 809
    # Every other fetched ID (810 to 905) belongs to generation 8
    df['Generation'] = df['Id'].apply(lambda x: 7 if x in generation_7 else 8)
    return df

- Legendary status, based on its ID:
def setLegendary(df):
    '''This function receives a dataframe and returns it with a Legendary column (True/False),
    set from the ID of each pokemon.
    '''
    gen_7_legendaries = [772, 773, 785, 786, 787, 788, 789, 790, 791, 792, 793, 800]
    gen_8_legendaries = [888, 889, 890, 891, 892, 894, 895, 896, 897, 898, 905]
    legendaries = gen_7_legendaries + gen_8_legendaries
    df['Legendary'] = df['Id'].isin(legendaries)
    return df

I ended up with a much cleaner version of the dataframe.
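The ID-based generation rule can be generalized so that it also covers the Pokémon already present in the Kaggle dataset. The sketch below is not part of the original project: `generation_for` and `GENERATION_LAST_ID` are hypothetical names, and the list assumes the usual National Pokédex boundaries for generations 1 to 8, of which only 722 to 809 and 810 to 905 appear in this README.

```python
from bisect import bisect_left

# Last National Pokédex number of each generation, 1 through 8 (assumed boundaries).
GENERATION_LAST_ID = [151, 251, 386, 493, 649, 721, 809, 905]

def generation_for(pokemon_id):
    """Return the generation (1 to 8) for an ID, or None if it is out of range."""
    if not 1 <= pokemon_id <= GENERATION_LAST_ID[-1]:
        return None
    # bisect_left finds the first generation whose last ID is >= pokemon_id.
    return bisect_left(GENERATION_LAST_ID, pokemon_id) + 1

generations = [generation_for(i) for i in (1, 151, 152, 721, 722, 809, 810, 905, 906)]
```

With a dataframe like the one above, it could be applied as `df['Id'].apply(generation_for)`; unlike a plain `else 8`, an unexpected ID yields None instead of being silently labelled as generation 8.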
From the Pokémon wiki (WikiDex) I extracted the Types for the missing Pokémon (#722 to #905), as well as the Height, Weight and Catch rate for the Pokémon already in the Kaggle dataset:
- Types:
To fetch the types I selected the first paragraph, and I split by "tipo " since all descriptions followed the same format. First I checked that it worked for one:
And then I defined a function to iterate over the url using the name of the pokémon:
import requests
from bs4 import BeautifulSoup

def getType(list_of_pokes):
    '''This function returns a list with the Type of each pokemon.
    If the pokemon is not found in the wiki, it appends "NaN" instead.
    '''
    types_list = []
    for pokemon in list_of_pokes:
        pokemon_url = f"https://www.wikidex.net/wiki/{pokemon}"
        response = requests.get(pokemon_url)
        soup = BeautifulSoup(response.content, "html.parser")
        types = soup.find("p")  # first paragraph of the page
        try:
            # Keep the word that follows "tipo " in the description
            types_list.append(types.getText().split("tipo ")[1].split(" ")[0])
        except (IndexError, AttributeError):
            # No "tipo " in the text, or no paragraph found at all
            types_list.append('NaN')
    return types_list

Then I split the types column in two, to get Type 1 and Type 2.
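A minimal sketch of that split, written as a plain function so it can be checked on a few strings first. It assumes the scraped value joins two types with a slash (for example "planta/veneno") and that missing values are the string "NaN"; both the separator and the name `split_types` are assumptions, not taken from the original notebook.

```python
def split_types(types):
    """Split a scraped type string into a (Type 1, Type 2) tuple."""
    # Treat non-strings, empty strings and the "NaN" placeholder as missing.
    if not isinstance(types, str) or not types or types == "NaN":
        return (None, None)
    # partition always returns three parts, even when there is no slash.
    first, _, second = types.partition("/")
    return (first.strip() or None, second.strip() or None)

examples = [split_types(t) for t in ("planta/veneno", "fuego", "NaN")]
```

Single-type Pokémon end up with None as their second type, which keeps the two resulting columns the same length.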
- Height and Weight:
To fetch both values I used the data from a table: I selected the tr tags by their respective attrs and cleaned the result by getting the text, splitting and indexing. I checked that it worked for one:
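The text-cleaning step can be tried on its own before any request is made. This is a sketch under assumptions: `parse_measure` is a hypothetical helper, and the sample strings only imitate what a table row's text might look like (a label, a line break, then a number with a decimal comma and a unit).

```python
def parse_measure(cell_text):
    """Parse row text such as 'Peso\n6,9 kg' into a float, or NaN if it fails."""
    # The value is expected on the last line, followed by a space and the unit.
    last_line = cell_text.strip().split("\n")[-1]
    number = last_line.split(" ")[0].replace(",", ".")
    try:
        return float(number)
    except ValueError:
        # Return a real NaN so the resulting column stays numeric.
        return float("nan")

weight = parse_measure("Peso\n6,9 kg")
height = parse_measure("Altura\n0,7 m")
missing = parse_measure("sin datos")
```

Returning a float NaN rather than the string "NaN" is a design choice of this sketch; it avoids mixing strings and numbers in the same column.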
And then I defined a function to iterate over the url using the name of the pokémon, again:
def getWeight(list_of_pokes):
    '''This function returns a list with the weight of each pokemon.
    If the weight is not found in the wiki, it appends "NaN" instead.
    '''
    weight_list = []
    for pokemon in list_of_pokes:
        pokemon_url = f"https://www.wikidex.net/wiki/{pokemon}"
        response = requests.get(pokemon_url)
        soup = BeautifulSoup(response.content, "html.parser")
        weight = soup.find_all("tr", attrs={"title": "Peso del Pokémon"})
        try:
            # Take the last line of the row, keep the number and use a decimal point
            weight_float = float(weight[0].getText().strip().split('\n')[-1].split(" ")[0].replace(",", "."))
            weight_list.append(weight_float)
        except (IndexError, ValueError):
            # Row not found, or its text is not a number
            weight_list.append('NaN')
    return weight_list

- Catch rate:
I followed the same procedure as before; in this case the tag was ul, and I used a regex to extract the digits after splitting by "Ratio de captura".
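A minimal sketch of the regex step on a single HTML fragment. The fragment is a made-up example of what such a list item might look like, and `parse_catch_rate` is a hypothetical name; only the "Ratio de captura</a>: " marker comes from the scraper in this README.

```python
import re

# Digits that follow the closing link tag and the colon.
CATCH_RATE_PATTERN = re.compile(r"Ratio de captura</a>:\s*(\d+)")

def parse_catch_rate(html):
    """Return the catch rate found in an HTML string, or None if it is absent."""
    match = CATCH_RATE_PATTERN.search(html)
    return int(match.group(1)) if match else None

# Hypothetical fragment, for illustration only.
fragment = '<ul><li><a href="/wiki/Ratio_de_captura">Ratio de captura</a>: 45</li></ul>'
rate = parse_catch_rate(fragment)
```

Folding the marker into the pattern replaces the split-then-search combination with one search, and a page without the marker gives None instead of raising.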
I defined a function to iterate over the URLs:

import re

def getCatchrate(list_of_pokes):
    '''This function returns a list with the Catch rate of each pokemon.
    If the catch rate is not found in the wiki, it appends "NaN" instead.
    '''
    catch_list = []
    for pokemon in list_of_pokes:
        pokemon_url = f"https://www.wikidex.net/wiki/{pokemon}"
        response = requests.get(pokemon_url)
        soup = BeautifulSoup(response.content, "html.parser")
        catch = soup.find_all("ul")
        try:
            # First run of digits after the "Ratio de captura" link
            catch_list.append(int(re.search(r'\d+', str(catch).split("Ratio de captura</a>: ")[1]).group()))
        except (IndexError, AttributeError):
            # Marker not found, or no digits after it
            catch_list.append('NaN')
    return catch_list

After that I reorganized the two datasets so that their columns matched, and concatenated them. I ended up with a clean Pokédex:
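The reordering and concatenation can be sketched as follows. This assumes both dataframes already contain the same columns under the same names; `combine_pokedex` and the tiny sample frames are hypothetical and only illustrate the step.

```python
import pandas as pd

def combine_pokedex(kaggle_df, fetched_df, columns):
    """Stack two dataframes after giving both the same column order."""
    aligned = [df.reindex(columns=columns) for df in (kaggle_df, fetched_df)]
    # ignore_index rebuilds a clean 0..n-1 index for the combined frame.
    return pd.concat(aligned, ignore_index=True)

# Made-up rows with the same columns in a different order.
kaggle_df = pd.DataFrame({"Id": [1], "Name": ["Bulbasaur"], "Total": [318]})
fetched_df = pd.DataFrame({"Total": [320, 420], "Id": [722, 723], "Name": ["Rowlet", "Dartrix"]})

columns = ["Id", "Name", "Total"]
pokedex = combine_pokedex(kaggle_df, fetched_df, columns)
```

Passing the column list explicitly makes the final order deliberate instead of depending on whichever dataframe comes first.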
- Apparently from Generation 6:
- Water types are the most prevalent pokémon.
- Type 1 and Type 2 follow an opposite distribution.
- Based on total score, you should definitely go for dragon types.
- Overall, the developers did a good job balancing the different types.
- The taller, the heavier. There are exceptions, though.
- The more powerful the pokemon, the more difficult to catch.
- Legendaries are also more difficult to catch.
- Catch rate is inversely proportional to power.
- Legendaries are the most powerful and difficult to catch.
Click here to see an interactive version!
- Height and weight show a positive correlation.
- The amount of health points (HP) correlates pretty well with the size of the pokémon.
- Apparently, the smaller the Pokémon, the easier it is to catch.
Click here to see an interactive version!
The particular case of Cosmoem: 999.9 kg and 0.1 m.
At least these stats make more sense than the Pokémon GO stats:
Well, sometimes...






















