- Separate Incorrect/Missing Yearly Data from Correct Yearly Data
- Data cleaning of incorrect data
- Merge newly cleaned data with original cleaned data
- Separate remaining data into csv for later cleaning
- Prepare Yearly Iowa County Population Data for per capita consumption anaylsis
- Improve markdown clarity
- retesting measures
- Establish connection between Linux environment and local MySQL Desktop
- Import Data into MySQL Database using batch processing
- Extract data subsets as .txt files to for dashboard building in Tableau
- County Per Capita Consumption Visualization Completed
- Other visualizations to be determined
- Publish Dashboards
The goal of this project was to clean and analyze a large publicly available dataset using techniques in Python, SQL, and Tableau. For this endeavor, I chose the Sales Data from Iowa Class “E” liquor licensees dataset from data.iowa.gov, which details all the liquor sales in Iowa from January 2012 through the end of September 2023, since I downloaded this data in early November of 2023.
This dataset was very large at 27489743 values. Upon downloading the data, over 20% of the rows had missing values. Compounding this issue was a large number of rows with mismatched and incorrect data types.Thus, cleaning the data was a massive undertaking. It soon became apparent that cleaning 100% of this data was not going to be possible given time constraints and my sanity as many errors were very granular in nature. After an intensive effort, I was able to reduce the number of missing values down to only 1.03% of the original data. However, some simplifying allowances were used to fill in some of the missing geodata. It is possible that a very tiny fraction of the data will have geodata that is slightly off from from the listed address. In such cases, however, the row address will share the same city, zip code, and county as the geo-point used to fill in said data. Again, I emphasize that achieving perfection here was not possible due to time constraints. However, given this allowance, I advise slight caution when using the data generated by this process to conduct analysis at the neighborhood level. My process is outlined in detail in the noteboks in the data folder.
The first visualization I built was a simple line chart showing sales by month. From this, we can see a general seasonal trend of higher in sales in the summer, but lower sales in the winter.
I also built a map showing average per capita consumption (in liters) across Iowa counties. Interestingly, Dickinson county had by far the highest rate of any county. Having spent a lot of time at the lakes growing up, I'm fairly certain the huge boating/party scene on Okoboji/Spirit Lake has a big impact on this value.
You can also check out this additional map showing the sales volume by locations in Iowa. Each dot represents a different store location with bigger circles indicating a higher number of sales.


