A comprehensive web scraping and data analysis project for extracting, processing, and analyzing laptop data from Flipkart
Features • Architecture • Installation • Usage • Database Schema
- Overview
- Features
- Architecture
- Tech Stack
- Project Structure
- Installation
- Usage
- Database Schema
- Data Flow
- Key Components
- Challenges & Solutions
- Future Enhancements
- Contributing
- License
This project demonstrates end-to-end data engineering capabilities by scraping laptop product data from Flipkart, processing it through a robust ETL pipeline, and storing it in a relational database for analysis. The system handles rate limiting, data cleaning, deduplication, and provides a structured dataset ready for analytical insights.
- Multi-Query Scraping: Targets multiple laptop brands (HP, Dell, Apple, ASUS, Lenovo, Acer)
- API Rotation: Implements ScraperAPI with multiple key rotation for reliability
- Data Cleaning: Comprehensive data validation and normalization
- Database Integration: Full MySQL/MariaDB integration with an optimized schema
- Scalable Design: Handles 1000+ products with duplicate detection
- Dynamic XPath Parsing: Robust extraction of product details
- Rate-Limit Handling: Smart delays and cooldown periods
- Multiple User Agents: Rotation to avoid detection
- Error Recovery: Automatic retry mechanisms
- API Key Rotation: Seamless failover between ScraperAPI keys
- Duplicate Detection: ROW_NUMBER()-based deduplication
- Price Normalization: Currency symbol and comma removal
- Null Handling: Intelligent replacement of 'NaN' values
- Data Validation: Type checking and constraint enforcement
- Optimized Schema: Indexed columns for fast queries
- Backup Strategy: Raw data preserved before transformations
- SQL Automation: Bulk insert operations via Python
- Data Export: CSV export functionality
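The user-agent rotation mentioned above can be sketched in a few lines. This is a minimal illustration; the pool below is a placeholder, not the actual list in final_scrapper.py:

```python
import random

# Hypothetical User-Agent pool; the real list in final_scrapper.py may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Picking a fresh User-Agent per request makes the traffic look less uniform to anti-bot systems.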
```mermaid
graph TB
    subgraph "Data Collection Layer"
        A[Flipkart Website] -->|HTTP Requests| B[ScraperAPI Proxy]
        B -->|Rotates Keys| C[Web Scraper]
        C -->|User Agent Rotation| B
    end

    subgraph "Processing Layer"
        C -->|Raw HTML| D[LXML Parser]
        D -->|XPath Extraction| E[Data Validator]
        E -->|Structured Data| F[Pandas DataFrame]
    end

    subgraph "Storage Layer"
        F -->|CSV Export| G[(Local File System)]
        F -->|PyMySQL| H[(MySQL Database)]
        H -->|Backup| I[Raw_Data Table]
        H -->|Cleaned| J[fp Table]
    end

    subgraph "Data Transformation"
        J -->|Deduplicate| K[Remove Duplicates]
        K -->|Normalize| L[Clean Prices]
        L -->|Validate| M[Final Dataset]
    end

    style A fill:#e1f5ff
    style H fill:#ffe1e1
    style M fill:#e1ffe1
```
```mermaid
flowchart LR
    A[Start Scraping] --> B{Select Query}
    B -->|hp laptop| C[Page Loop 1-40]
    B -->|dell laptop| C
    B -->|apple laptop| C
    B -->|asus laptop| C
    B -->|lenovo laptop| C
    B -->|acer laptop| C
    C --> D{API Request}
    D -->|Success| E[Parse HTML]
    D -->|429 Rate Limit| F[Rotate API Key]
    D -->|403 Forbidden| F
    F --> D
    E --> G[Extract Product Data]
    G --> H{Valid Data?}
    H -->|Yes| I[Append to List]
    H -->|No| J[Skip Product]
    I --> K{Page % 10 == 0?}
    K -->|Yes| L[Cooldown 20s]
    K -->|No| M[Delay 2-4s]
    L --> N{More Pages?}
    M --> N
    N -->|Yes| C
    N -->|No| O{More Queries?}
    O -->|Yes| B
    O -->|No| P[Save to CSV]
    P --> Q[Load to Database]
    Q --> R[Data Cleaning]
    R --> S[Remove Duplicates]
    S --> T[End]

    style A fill:#90EE90
    style T fill:#FFB6C1
    style D fill:#FFE4B5
    style R fill:#E0BBE4
```
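The delay/cooldown branch in the flow above amounts to a small helper. Below is a sketch assuming the 2-4 s jitter and 20 s cooldown shown in the diagram; the function name is hypothetical, not taken from final_scrapper.py:

```python
import random
import time

def polite_delay(page):
    """Sleep between page requests: 2-4 s normally, 20 s every 10th page."""
    if page % 10 == 0:
        time.sleep(20)                     # cooldown after a batch of 10 pages
    else:
        time.sleep(random.uniform(2, 4))   # jittered per-page delay
```

Randomizing the delay avoids a fixed request cadence, which is itself a detection signal.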
```mermaid
sequenceDiagram
    participant CSV as CSV File
    participant PY as Python Script
    participant DB as MySQL Database
    participant BK as Backup Tables

    CSV->>PY: Load Dataset
    PY->>PY: Read with Pandas
    PY->>DB: Create Connection
    loop For Each Row
        PY->>DB: INSERT INTO fp
    end
    PY->>DB: COMMIT Transaction
    DB->>BK: CREATE TABLE raw_data AS SELECT * FROM fp
    Note over DB: Data Cleaning Phase
    DB->>DB: Remove 'NaN' → NULL
    DB->>DB: Strip Currency Symbols
    DB->>DB: CAST Prices to INT
    Note over DB: Deduplication Phase
    DB->>DB: ROW_NUMBER() OVER (PARTITION BY product_id)
    DB->>DB: DELETE WHERE rank > 1
    DB->>BK: Backup Cleaned Data
    DB->>CSV: Export to CSV
```
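The per-row INSERT loop in the sequence above needs a parameterized statement for pymysql. A minimal sketch of building one (the helper name and column list are illustrative, not taken from dbConn.py):

```python
def build_insert_sql(table, columns):
    """Build a parameterized INSERT statement for use with
    pymysql's cursor.execute()/executemany()."""
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

# Example: sql = build_insert_sql("fp", ["product_id", "name", "original_price"])
# then cursor.executemany(sql, list_of_row_tuples)
```

Using `%s` placeholders lets the driver handle quoting and escaping, instead of formatting values into the SQL string by hand.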
| Category | Technology | Purpose |
|---|---|---|
| Web Scraping | requests | HTTP requests to Flipkart |
| | lxml | HTML parsing and XPath queries |
| | ScraperAPI | Proxy rotation and anti-bot bypass |
| Data Processing | pandas | Data manipulation and CSV handling |
| | regex | Pattern matching for data extraction |
| Database | pymysql | MySQL database connectivity |
| | MariaDB | Relational database management |
| Utilities | time | Rate limiting and delays |
| | random | User agent rotation |
```
surajakhuli-flipkart_laptop_analysis/
│
├── final_scrapper.py    # Main production scraper
├── scrape.py            # Initial scraper prototype
├── dbConn.py            # Database connection & data loader
├── xpath_tester.py      # XPath query testing utility
├── DB_Code.txt          # SQL commands and database setup
└── output/
    ├── flipkart_laptops_cleaned.csv
    └── removed_duplicates_file.csv
```
- final_scrapper.py: Production-ready scraper with multi-query support, API key rotation, and comprehensive error handling
- scrape.py: Initial prototype for testing XPath selectors and basic scraping logic
- dbConn.py: Handles bulk data insertion from Excel/CSV into the MySQL database
- xpath_tester.py: Standalone utility for testing and debugging XPath expressions
- DB_Code.txt: Complete SQL documentation including table creation, cleaning operations, and backup strategies
- Python 3.8 or higher
- MySQL/MariaDB Server
- ScraperAPI Account (for API keys)
```bash
git clone https://github.com/surajakhuli/flipkart-laptop-analysis.git
cd flipkart-laptop-analysis
pip install -r requirements.txt
```

Required libraries:

```
requests==2.31.0
lxml==4.9.3
pandas==2.1.0
pymysql==1.1.0
openpyxl==3.1.2
```

```sql
-- Create database
CREATE DATABASE project;

-- Create table
CREATE TABLE flipkart_products (
    product_id VARCHAR(20),
    name VARCHAR(255),
    brand_name VARCHAR(20),
    processor_name VARCHAR(120),
    RAM VARCHAR(120),
    storage VARCHAR(20),
    display VARCHAR(100),
    os VARCHAR(150),
    gpu VARCHAR(100),
    warranty_year VARCHAR(100),
    original_price VARCHAR(20),
    mrp VARCHAR(20),
    discount FLOAT,
    rating FLOAT,
    rating_count INT,
    reviews_count INT,
    flipkart_assured VARCHAR(10),
    product_link VARCHAR(500),
    image_url VARCHAR(500),
    PRIMARY KEY (product_id)
);
```

Update final_scrapper.py with your ScraperAPI keys:

```python
api_keys = [
    "YOUR_API_KEY_1",
    "YOUR_API_KEY_2",
    "YOUR_API_KEY_3",
]
```

```bash
python final_scrapper.py
```

Expected output:
```
Scraping query: hp laptop
Scraping page 1 of hp laptop...
Scraping page 2 of hp laptop...
...
Cooling down after 10 pages...
...
Scraping complete. Saved 1267 records to flipkart_laptops_cleaned.csv
```
```bash
python dbConn.py
```

Expected output:
```
Loaded 1267 rows from the dataset.
Connected to database.
Done! Inserted 1267 rows into 'flipkart_products'.
Connection closed.
```
```bash
python xpath_tester.py
```

```mermaid
erDiagram
    fp {
        int id PK
        varchar product_id
        varchar name
        varchar brand_name
        varchar processor_name
        varchar RAM
        varchar storage
        varchar display
        varchar gpu
        varchar os
        varchar warranty_year
        int original_price
        int mrp
        float discount
        float rating
        int rating_count
        int reviews_count
        varchar flipkart_assured
        varchar product_link
        varchar image_url
    }
```
- raw_data: Original scraped data before any transformations
- raw_data_removed_duplicates: Backup taken after the deduplication process
```sql
-- Add indexes for performance
CREATE INDEX idx_brand ON fp(brand_name);
CREATE INDEX idx_price ON fp(original_price);
CREATE INDEX idx_rating ON fp(rating);
```

```mermaid
stateDiagram-v2
    [*] --> Initialize
    Initialize --> SelectQuery
    SelectQuery --> MakeRequest
    MakeRequest --> CheckResponse
    CheckResponse --> ParseHTML: Status 200
    CheckResponse --> RetryAPI: Status 429/403
    CheckResponse --> SkipPage: Other Errors
    RetryAPI --> MakeRequest
    ParseHTML --> ExtractData
    ExtractData --> ValidateData
    ValidateData --> AppendData: Valid
    ValidateData --> SkipProduct: Invalid
    AppendData --> CheckPage
    SkipProduct --> CheckPage
    SkipPage --> CheckPage
    CheckPage --> Cooldown: Page % 10 == 0
    CheckPage --> Delay: Continue
    Cooldown --> NextPage
    Delay --> NextPage
    NextPage --> MakeRequest: More Pages
    NextPage --> NextQuery: Pages Done
    NextQuery --> SelectQuery: More Queries
    NextQuery --> SaveCSV: All Done
    SaveCSV --> [*]
```
- Null Handling: Replace 'NaN' strings with NULL
- Price Normalization: Remove ₹ symbols and commas
- Type Conversion: Cast string prices to integers
- Duplicate Removal: ROW_NUMBER() window function
- Validation: Ensure data integrity constraints
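For reference, the null-handling and price-normalization steps could be expressed in Python like this. It is a sketch only; the project performs these steps in SQL, and the function name here is hypothetical:

```python
import re

def clean_price(raw):
    """Normalize a scraped price like '₹45,990' to an int.

    Returns None for missing values ('NaN', empty string, or None),
    mirroring the 'NaN' -> NULL replacement done in the database.
    """
    if raw is None or str(raw).strip() in ("", "NaN"):
        return None
    digits = re.sub(r"[^0-9]", "", str(raw))  # drop ₹, commas, whitespace
    return int(digits) if digits else None
```

Applying this before the INSERT would let `original_price` and `mrp` be declared as INT columns up front instead of being cast after loading.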
```python
import requests

def get_scraperapi_response(url, headers, api_keys):
    """Fetch a URL through ScraperAPI, rotating keys on rate limits or errors."""
    for key in api_keys:
        scraper_url = f"https://api.scraperapi.com/?api_key={key}&url={url}"
        try:
            response = requests.get(scraper_url, headers=headers, timeout=15)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                continue  # Rate limited -- try the next key
        except requests.RequestException:
            continue  # Network error -- try the next key
    return None
```

```python
# Product name
name = block.xpath('.//div[@class="KzDlHZ"]/text()')

# Rating
rating = block.xpath('string(.//div[@class="XQDdHH"])').strip()

# Specifications
specs = block.xpath('.//ul[@class="G4BRas"]/li/text()')
```

```sql
DELETE FROM fp WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY id) AS ranknum
        FROM fp
    ) AS fulldata
    WHERE ranknum > 1
);
```

| Challenge | Solution |
|---|---|
| Rate Limiting | Implemented ScraperAPI with multiple key rotation |
| Dynamic Content | Used robust XPath with fallback selectors |
| Data Inconsistency | Created comprehensive validation layer |
| Duplicate Records | Window function based deduplication |
| Price Formatting | Regex-based cleaning and type casting |
| File Locking | Automatic filename increment on save |
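The file-locking workaround in the last row (incrementing the filename when the target already exists, e.g. because Excel holds a lock on it) can be sketched like this; the function name is illustrative:

```python
import os

def unique_filename(path):
    """Return path unchanged if it is free; otherwise append _1, _2, ...
    before the extension until an unused name is found."""
    base, ext = os.path.splitext(path)
    candidate = path
    counter = 1
    while os.path.exists(candidate):
        candidate = f"{base}_{counter}{ext}"
        counter += 1
    return candidate
```

This avoids a crash on `to_csv()` when the previous export is still open in another program.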
- Power BI/Tableau dashboard integration
- Machine learning price prediction model
- Email alerts for price drops
- Automated daily scraping with cron jobs
- REST API for data access
- Mobile app for product tracking
- Unit tests and CI/CD pipeline
- Docker containerization
Top 10 highest-rated laptops:

```sql
SELECT name, brand_name, rating, original_price
FROM fp
WHERE rating >= 4.5
ORDER BY rating DESC, rating_count DESC
LIMIT 10;
```

Product counts and average rating by price segment:

```sql
SELECT
    CASE
        WHEN original_price < 30000 THEN 'Budget'
        WHEN original_price BETWEEN 30000 AND 60000 THEN 'Mid-Range'
        ELSE 'Premium'
    END AS price_category,
    COUNT(*) AS count,
    AVG(rating) AS avg_rating
FROM fp
GROUP BY price_category;
```

Brand-level comparison:

```sql
SELECT
    brand_name,
    COUNT(*) AS products,
    AVG(original_price) AS avg_price,
    AVG(discount) AS avg_discount,
    AVG(rating) AS avg_rating
FROM fp
GROUP BY brand_name
ORDER BY products DESC;
```

Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Suraj Akhuli
- GitHub: @Suraj Akhuli
- LinkedIn: Suraj Akhuli
- ScraperAPI for reliable proxy services
- Flipkart for providing accessible product data
- The open-source community for amazing libraries
Star this repository if you found it helpful!

Made with ❤️ by Suraj Akhuli