This project demonstrates how to read multiple Excel files from a specific directory, concatenate the data into a single DataFrame, and then plot the distribution of file sizes.
├── README.md
├── requirements.txt
├── src
│ │
│ ├── cached_df
│ │ └── .gitkeep
│ ├── data_visualization
│ │ └── .gitkeep
│ │
│ ├── files_here
│ │ └── .gitkeep
│ ├── get_data.py
│ └── plot_graph.py
└── .gitignore
- get_data.py: Python script that reads Excel files from the src/files_here directory, concatenates them into a single DataFrame, and caches the result.
- plot_graph.py: Python script that plots the distribution of file sizes from the concatenated DataFrame and saves the plot as a PNG file.
- src/files_here/: Directory containing the Excel files to read.
- cached_df.pkl: Pickle file storing the cached DataFrame after concatenation.
- src/cached_df/: Directory containing the pickled DataFrame after concatenation.
- src/data_visualization/: Directory for generated graphs.
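The read, concatenate, and cache flow described for get_data.py can be sketched roughly as follows. This is a minimal illustration, not the script's actual contents: the function name `concat_excel_files`, its parameters, and the `*.xlsx` glob pattern are assumptions made for the example. Reading real workbooks with pandas also typically requires an Excel engine such as openpyxl.

```python
from pathlib import Path

import pandas as pd


def concat_excel_files(folder, cache_path, sheet_name=0, header_row=0):
    """Read every .xlsx file in `folder`, concatenate them, and pickle the result.

    Hypothetical helper for illustration; get_data.py may be organized differently.
    """
    paths = sorted(Path(folder).glob("*.xlsx"))
    if not paths:
        raise FileNotFoundError(f"No .xlsx files found in {folder}")

    # One DataFrame per workbook, all read with the same sheet and header row.
    frames = [
        pd.read_excel(path, sheet_name=sheet_name, header=header_row)
        for path in paths
    ]

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    combined = pd.concat(frames, ignore_index=True)

    # Cache the combined frame so later steps can skip re-reading the workbooks.
    cache_path = Path(cache_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    combined.to_pickle(cache_path)
    return combined
```

Under these assumptions, calling `concat_excel_files("src/files_here", "src/cached_df/cached_df.pkl")` from the project root would produce the cached pickle, which can later be reloaded with `pd.read_pickle`.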
- Clone the repository:
    git clone https://github.com/egekaplan/concat-excel-data.git
- Navigate to the project directory:
    cd concat-excel-data
- Install the required dependencies:
    pip install -r requirements.txt
- Ensure your Excel files are placed in the src/files_here directory.
- Run src/get_data.py to read and concatenate the Excel files:
    python3 get_data.py
  This generates the cached DataFrame at src/cached_df/cached_df.pkl.
- Run src/plot_graph.py to plot the size distribution of the files:
    python3 plot_graph.py
  The resulting histogram charts are saved as file_size_histogram.png and extension_frequency_histogram.
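The plotting step can be sketched along the following lines. This is a hedged illustration using plain matplotlib rather than seaborn, and it is not the actual contents of plot_graph.py: the function name `plot_size_histogram` and the column name `size` are assumptions, since the README does not state how the cached DataFrame's columns are named.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_size_histogram(df, out_path, size_column="size", bins=30):
    """Save a histogram of df[size_column] as an image and return its path.

    Hypothetical helper; the column name "size" is an assumption.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    fig, ax = plt.subplots(figsize=(8, 5))
    # Missing sizes are dropped before binning.
    ax.hist(df[size_column].dropna(), bins=bins)
    ax.set_xlabel(size_column)
    ax.set_ylabel("Number of files")
    ax.set_title("File size distribution")

    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory
    return out_path
```

A typical call under these assumptions would load the cache with `pd.read_pickle("src/cached_df/cached_df.pkl")` and pass the result together with an output path such as `src/data_visualization/file_size_histogram.png`.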
- Ensure that Python is installed on your system.
- Additional libraries such as pandas, matplotlib, and seaborn are required. These dependencies are listed in requirements.txt.
- Modify the sheet_name and header_row variables in get_data.py according to your Excel file structure.
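As an illustration of how these two variables might be set, the sketch below uses made-up values for a workbook whose column names sit on the third row of a sheet called "Sheet1". It assumes that get_data.py passes them to `pandas.read_excel` as the `sheet_name` and `header` arguments, which the README does not confirm; the `read_kwargs` dictionary is hypothetical.

```python
# Hypothetical values; adjust them to match your workbook layout.
sheet_name = "Sheet1"  # worksheet name, or a 0-based position such as 0
header_row = 2         # 0-based index of the row that holds the column names

# Keyword arguments in the form pandas.read_excel accepts (assumed usage).
read_kwargs = {"sheet_name": sheet_name, "header": header_row}
```

Because pandas counts header rows from zero, a header on the third spreadsheet row corresponds to `header_row = 2`.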