saadyaq/Web_Scraper_news_content

📰 Web Scraper - News Content Extractor

This project is a web scraper designed to automatically extract titles, dates, links, and article content from various news websites.

📌 Objective

To automate the collection of news articles for purposes like text analysis, media monitoring, or building a dataset for NLP and data science projects.

🚀 Features

  • 🔎 Scrapes titles, dates, URLs, and full article content
  • 🧠 Modular structure for easily adding new news sources
  • 📂 Saves the extracted data to a .csv file
  • 🕵️ Simple text cleaning utility
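
The contents of the text cleaning utility are not shown in this README. As a rough illustration only, a cleaner built on the standard library might look like the sketch below. The function name `clean_text` and the exact cleaning steps are assumptions, not the project's actual code.

```python
import html
import re
import unicodedata

# Hypothetical patterns; the real utils/text_cleaning.py may differ.
_TAG_RE = re.compile(r"<[^>]+>")   # any leftover HTML tag
_WS_RE = re.compile(r"\s+")        # runs of whitespace, including newlines


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = _TAG_RE.sub(" ", raw)                 # replace tags with a space
    text = html.unescape(text)                   # &amp; -> &, &nbsp; -> U+00A0
    text = unicodedata.normalize("NFKC", text)   # U+00A0 -> regular space
    return _WS_RE.sub(" ", text).strip()         # single spaces, trimmed ends


sample = "<p>Le&nbsp;Monde  &amp;\n Lib\u00e9ration</p>"
cleaned = clean_text(sample)
```

A regex tag stripper like this is only suitable for fragments that have already been extracted from a parsed page. It is not a substitute for an HTML parser.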

🛠️ Technologies Used

  • Python 3.x
  • requests
  • BeautifulSoup (bs4)
  • pandas
  • re (regex)
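
To show how these libraries could fit together, here is a minimal sketch that parses an inline HTML sample with BeautifulSoup and writes the rows to CSV with pandas. The markup, the `parse_articles` helper, and the column names are assumptions made for illustration. Real news sites use different tags and classes, and in practice the HTML would come from something like `requests.get(url, timeout=10).text`.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded listing page.
SAMPLE_HTML = """
<article>
  <h2><a href="https://example.com/a1">First headline</a></h2>
  <time datetime="2024-01-15">15 January 2024</time>
  <p class="body">Body of the first article.</p>
</article>
<article>
  <h2><a href="https://example.com/a2">Second headline</a></h2>
  <time datetime="2024-01-16">16 January 2024</time>
  <p class="body">Body of the second article.</p>
</article>
"""


def parse_articles(page_html: str) -> list:
    """Return one dict per <article> with title, date, url, and content."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        link = article.find("a")
        date = article.find("time")
        body = article.find("p", class_="body")
        rows.append({
            "title": link.get_text(strip=True) if link else None,
            "date": date.get("datetime") if date else None,
            "url": link.get("href") if link else None,
            "content": body.get_text(strip=True) if body else None,
        })
    return rows


rows = parse_articles(SAMPLE_HTML)
df = pd.DataFrame(rows, columns=["title", "date", "url", "content"])

# Written to an in-memory buffer here; a file path such as
# "data/news_data.csv" could be passed to to_csv instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Guarding each lookup with `if ... else None` keeps one missing element from aborting the whole page, at the cost of empty cells in the CSV.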

🗂️ Project Structure

Web_Scraper_news_content/
├── data/
│   └── news_data.csv          # Extracted data
├── scrapers/
│   ├── scraper_lemonde.py     # Scraper for Le Monde
│   ├── scraper_liberation.py  # Scraper for Libération
│   └── ...                    # Additional scrapers go here
├── utils/
│   └── text_cleaning.py       # Utility functions for text processing
├── main.py                    # Main script to run all scrapers
├── requirements.txt           # Python dependencies
└── README.md                  # This file
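
The README does not show how main.py discovers and runs the individual scrapers. One possible arrangement is a small registry that each scraper module adds itself to, sketched below. The names `SCRAPERS`, `register`, and `run_all` are hypothetical, and the two scraper functions return placeholder data instead of fetching real pages.

```python
from typing import Callable, Dict, List

Article = Dict[str, str]
Scraper = Callable[[], List[Article]]

# Hypothetical registry mapping a source name to its scraper function.
SCRAPERS: Dict[str, Scraper] = {}


def register(name: str) -> Callable[[Scraper], Scraper]:
    """Decorator that records a scraper function under the given name."""
    def decorator(func: Scraper) -> Scraper:
        SCRAPERS[name] = func
        return func
    return decorator


@register("lemonde")
def scrape_lemonde() -> List[Article]:
    # Placeholder data; a real scraper would download and parse pages.
    return [{"source": "lemonde", "title": "Sample title 1"}]


@register("liberation")
def scrape_liberation() -> List[Article]:
    # Placeholder data; a real scraper would download and parse pages.
    return [{"source": "liberation", "title": "Sample title 2"}]


def run_all() -> List[Article]:
    """Run every registered scraper and concatenate the results."""
    articles: List[Article] = []
    for scraper in SCRAPERS.values():
        articles.extend(scraper())
    return articles


all_articles = run_all()
```

With this layout, adding a source means writing one decorated function in a new file under scrapers/ and importing that module from main.py. The runner itself does not change.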

📌 Future Improvements

  • Add logging system
  • Improve speed with multithreading or async requests
  • Build a Streamlit dashboard for article exploration
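
The logging and multithreading items could be combined along the lines of the sketch below, which uses only the standard library. The function `run_concurrently`, the logger name, and the stand-in scrapers are assumptions for illustration. None of this exists in the project yet.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("news_scraper")  # hypothetical logger name


def run_concurrently(
    scrapers: Dict[str, Callable[[], List[dict]]],
    max_workers: int = 4,
) -> List[dict]:
    """Run each scraper in a worker thread and log the outcome per source."""
    results: List[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func): name for name, func in scrapers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                articles = future.result()
            except Exception:
                # One failing source should not stop the others.
                logger.exception("Scraper %s failed", name)
                continue
            logger.info("Scraper %s returned %d articles", name, len(articles))
            results.extend(articles)
    return results


# Stand-in scrapers; real ones would perform network requests.
def scraper_a() -> List[dict]:
    return [{"source": "a", "title": "A1"}]


def scraper_b() -> List[dict]:
    return [{"source": "b", "title": "B1"}]


def scraper_broken() -> List[dict]:
    raise RuntimeError("simulated network error")


collected = run_concurrently(
    {"a": scraper_a, "b": scraper_b, "broken": scraper_broken}
)
```

Threads suit this workload because scraping is dominated by waiting on network responses. Results arrive in completion order, so sort them afterwards if a stable order matters.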

👨‍💻 Author

Saad Yaqine
📧 saadyaqine91@gmail.com
