Hello World

Pandas provides commands that show the basic shape of our data and summary information about it. This helps us understand and clean the data so it is ready for machine learning (ML).

Pandas

These examples use the vgsales.csv dataset from https://www.kaggle.com/datasets/gregorut/videogamesales

import pandas as pd
df = pd.read_csv('vgsales.csv')
# shows the shape `(16598, 11)`, so that's roughly 16.6k records and 11 columns
df.shape

(16598, 11)
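Since `shape` is a plain `(rows, columns)` tuple, it can be unpacked directly. A minimal sketch, assuming a small hand-made DataFrame as a stand-in for `vgsales.csv` (the rows are made up for illustration; only the column names mirror the dataset):

```python
import pandas as pd

# Toy stand-in for vgsales.csv: made-up rows, not real records from the dataset.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, 1985.0, None, 2010.0],
    "Global_Sales": [82.74, 40.24, 35.82, 0.01],
})

# shape is a (rows, columns) tuple, so it unpacks into two integers
n_rows, n_cols = df.shape
print(f"{n_rows} records, {n_cols} columns")

# dtypes lists the data type pandas inferred for each column
print(df.dtypes)
```

Checking `dtypes` alongside `shape` is a quick way to see which columns pandas treats as numeric before looking at summary statistics.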
# shows summary statistics for each numeric column in the dataset
# `count` is the number of non-missing values; `Year` has 16327 of 16598, so 271 records have no year and will need cleaning
# `mean` is the average value; for `Rank` it is meaningless, but for `Year` it tells us the average year in the dataset is about 2006, which could be helpful depending on the problem we are trying to solve
# `std` is the standard deviation, which quantifies how much the values in that column vary
# `min` is the minimum value for that column, so here the oldest record is from 1980
# `25%`, `50%` and `75%` are the percentiles, so 25% of the records have a `Year` of 2003 or earlier
# `max` is the maximum value, so the newest record is from 2020
df.describe()

               Rank          Year      NA_Sales      EU_Sales      JP_Sales   Other_Sales  Global_Sales
count  16598.000000  16327.000000  16598.000000  16598.000000  16598.000000  16598.000000  16598.000000
mean    8300.605254   2006.406443      0.264667      0.146652      0.077782      0.048063      0.537441
std     4791.853933      5.828981      0.816683      0.505351      0.309291      0.188588      1.555028
min        1.000000   1980.000000      0.000000      0.000000      0.000000      0.000000      0.010000
25%     4151.250000   2003.000000      0.000000      0.000000      0.000000      0.000000      0.060000
50%     8300.500000   2007.000000      0.080000      0.020000      0.000000      0.010000      0.170000
75%    12449.750000   2010.000000      0.240000      0.110000      0.040000      0.040000      0.470000
max    16600.000000   2020.000000     41.490000     29.020000     10.220000     10.570000     82.740000
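The lower `count` for `Year` is the hint that values are missing. One possible way to confirm and clean this is sketched below, again on a made-up toy DataFrame rather than the real file. Dropping the incomplete rows is only one option and is an assumption here, not necessarily the right choice for every problem:

```python
import pandas as pd

# Toy stand-in for vgsales.csv: made-up rows, with one missing Year on purpose.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, 1985.0, None, 2010.0],
    "Global_Sales": [82.74, 40.24, 35.82, 0.01],
})

# count the missing values in every column
missing_per_column = df.isna().sum()
print(missing_per_column)

# drop the rows that have no Year, then store Year as whole numbers
cleaned = df.dropna(subset=["Year"]).astype({"Year": int})
print(cleaned.shape)
```

On the real dataset the same `isna().sum()` call should report the gap between the total row count and the `count` row of `describe()`.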
# returns the whole DataFrame as a 2-dimensional NumPy array; the printed output is truncated, so only the first and last rows are shown
df.values

array([[1, 'Wii Sports', 'Wii', ..., 3.77, 8.46, 82.74],
       [2, 'Super Mario Bros.', 'NES', ..., 6.81, 0.77, 40.24],
       [3, 'Mario Kart Wii', 'Wii', ..., 3.79, 3.31, 35.82],
       ...,
       [16598, 'SCORE International Baja 1000: The Official Game', 'PS2',
        ..., 0.0, 0.0, 0.01],
       [16599, 'Know How 2', 'DS', ..., 0.0, 0.0, 0.01],
       [16600, 'Spirits & Spells', 'GBA', ..., 0.0, 0.0, 0.01]],
      shape=(16598, 11), dtype=object)
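If the goal is just to look at a sample of rows, `head()` and `tail()` keep the column labels and are easier to read than the raw array. When an array is needed, the pandas documentation recommends `to_numpy()` over `.values`. A small sketch on the same kind of made-up toy DataFrame:

```python
import pandas as pd

# Toy stand-in for vgsales.csv: made-up rows, not real records from the dataset.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, 1985.0, None, 2010.0],
    "Global_Sales": [82.74, 40.24, 35.82, 0.01],
})

# first and last two rows, still labelled with column names
first_rows = df.head(2)
last_rows = df.tail(2)
print(first_rows)
print(last_rows)

# the full data as a 2-dimensional NumPy array
arr = df.to_numpy()
print(arr.shape)
```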