This project is a web scraper designed to automatically extract titles, dates, links, and article content from various news websites.
The goal is to automate the collection of news articles for purposes such as text analysis, media monitoring, or building datasets for NLP and data science projects.
- Scrapes titles, dates, URLs, and full article content
- Modular structure for easily adding new news sources
- Saves the extracted data to a `.csv` file
- Simple text cleaning utility
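
The text cleaning utility is not shown in this README, so the following is only a minimal sketch of what such a helper could look like, using the standard `re` and `html` modules. The function name `clean_text` and the exact cleaning steps are assumptions for illustration, not the contents of `text_cleaning.py`.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Return `raw` with tags removed, entities decoded, and whitespace collapsed.

    Hypothetical helper: the real utility in this project may differ.
    """
    # Replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode entities such as &amp; or &nbsp; into plain characters.
    text = html.unescape(text)
    # Collapse runs of whitespace (including non-breaking spaces) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "<p>Le  Monde&nbsp;&amp;\n Libération </p>"
    print(clean_text(sample))
```

Stripping tags with a regular expression is only reasonable for leftover fragments. Full pages are better parsed with BeautifulSoup first and cleaned afterwards.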
Technologies used:
- Python 3.x
- requests
- BeautifulSoup (bs4)
- pandas
- re (regex)
Web_Scraper_news_content/
├── data/
│   └── news_data.csv          # Extracted data
├── scrapers/
│   ├── scraper_lemonde.py     # Scraper for Le Monde
│   ├── scraper_liberation.py  # Scraper for Libération
│   └── ...                    # Additional scrapers go here
├── utils/
│   └── text_cleaning.py       # Utility functions for text processing
├── main.py                    # Main script to run all scrapers
├── requirements.txt           # Python dependencies
└── README.md                  # This file
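
One way the modular layout could fit together is a small registry that maps each source name to a scraper function, which the main script loops over before writing one CSV. The sketch below is an assumption about that design, not the actual code in `main.py`: the function names, the `SCRAPERS` registry, the column names, and the placeholder articles are all hypothetical. It uses the standard `csv` module to stay dependency-free. With pandas, `pandas.DataFrame(rows).to_csv(path, index=False)` would do the same job.

```python
import csv
import io
from typing import Callable, Dict, List

Article = Dict[str, str]
FIELDS = ["source", "title", "date", "url", "content"]  # assumed column names


def scrape_lemonde() -> List[Article]:
    # Placeholder data: a real scraper would fetch and parse pages here.
    return [{"source": "lemonde", "title": "Example title A",
             "date": "2024-01-01", "url": "https://example.com/a",
             "content": "Example body A"}]


def scrape_liberation() -> List[Article]:
    # Placeholder data, same shape as above.
    return [{"source": "liberation", "title": "Example title B",
             "date": "2024-01-02", "url": "https://example.com/b",
             "content": "Example body B"}]


# Adding a news source means adding one function and one entry here.
SCRAPERS: Dict[str, Callable[[], List[Article]]] = {
    "lemonde": scrape_lemonde,
    "liberation": scrape_liberation,
}


def run_all(scrapers: Dict[str, Callable[[], List[Article]]]) -> List[Article]:
    """Run every registered scraper and collect the rows it returns."""
    rows: List[Article] = []
    for name, scrape in scrapers.items():
        try:
            rows.extend(scrape())
        except Exception as exc:
            # One failing source should not stop the others.
            print(f"{name} failed: {exc}")
    return rows


def to_csv(rows: List[Article], stream) -> None:
    """Write the rows to any text stream (an open file or a StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    to_csv(run_all(SCRAPERS), buffer)
    print(buffer.getvalue())
```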
Future Improvements
- Add a logging system
- Improve speed with multithreading or async requests
- Build a Streamlit dashboard for article exploration
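
As a rough illustration of the multithreading idea, a thread pool from the standard library can overlap the waiting time of several downloads. This is a sketch under assumptions rather than a planned implementation: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fetch(url: str) -> str:
    """Stand-in for a network call; a real version would download the page."""
    time.sleep(0.1)  # simulate network latency
    return f"<html>{url}</html>"


def fetch_all(urls: List[str], max_workers: int = 4) -> List[str]:
    """Fetch all URLs concurrently, returning pages in input order."""
    # pool.map yields results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/article/{i}" for i in range(4)]
    start = time.perf_counter()
    pages = fetch_all(urls)
    print(len(pages), f"pages in {time.perf_counter() - start:.2f}s")
```

Threads suit this workload because scraping is mostly waiting on the network. When scraping real sites, keep `max_workers` small and respect each site's rate limits.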
Author: Saad Yaqine (saadyaqine91@gmail.com)