This project automates the process of fetching, analyzing, and visualizing crime-related news articles. It leverages NLP techniques to extract crime types, locations, and trends, and provides interactive dashboards and reports for deeper insights.
- Automated News Fetching: Retrieves recent crime news articles using NewsAPI.
- Text Preprocessing: Cleans and prepares news text for analysis.
- Named Entity Recognition (NER): Extracts locations and crime-related entities using spaCy.
- Crime Classification: Categorizes news articles into crime types using keyword-based classification.
- Geocoding: Maps extracted locations to latitude and longitude.
- Trend & Keyword Analysis: Generates word clouds and trend plots.
- Report Generation: Produces Markdown and PDF reports with visualizations.
- Interactive Dashboard: Lets you explore data and visualizations in a Streamlit web app.
.
├── data/ # Input and output data files
├── src/ # Source code
│ ├── main.py # Main pipeline script
│ ├── pipeline.py # (Alternative) Full pipeline and Streamlit app
│ ├── fetch_news.py # News fetching logic
│ ├── preprocess.py # Text cleaning and preprocessing
│ ├── ner_extractor.py # Named Entity Recognition
│ ├── crime_classifier.py # Crime type classification
│ ├── geocode_locations.py # Geocoding logic
│ ├── visualizer.py # Streamlit dashboard and plotting
│ ├── keyword_cloud.py # Word cloud generation
│ ├── report_generator.py # Report generation (Markdown, PDF)
│ └── ... # Other supporting modules
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Clone the repository:
git clone <repo-url>
cd Crime
- Install dependencies:
pip install -r requirements.txt
- Download the spaCy model and NLTK stopwords:
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('stopwords')"
- Set up the NewsAPI key:
  - The API key is currently hardcoded in src/fetch_news.py and src/pipeline.py as API_KEY = 'b515596519fb4396a7c6aad3ff98ab2b'.
  - For production, replace this with your own key and consider using environment variables for security.
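One way to move the key out of the source is to read it from an environment variable. The sketch below is a minimal illustration under stated assumptions, not the project's actual code: the variable name NEWSAPI_KEY, the helpers load_api_key and build_news_url, and the default query are all hypothetical. The endpoint and parameter names follow NewsAPI's public v2 "everything" endpoint.

```python
import os
from urllib.parse import urlencode

# Assumed environment variable name; the project itself hardcodes API_KEY.
ENV_VAR = "NEWSAPI_KEY"
NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def load_api_key(env=None):
    """Return the NewsAPI key from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set the {ENV_VAR} environment variable first.")
    return key


def build_news_url(api_key, query="crime", page_size=20):
    """Build a request URL for recent articles matching `query` (no request is sent)."""
    params = {
        "q": query,
        "language": "en",
        "sortBy": "publishedAt",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"
```

With something like this in place, the key would be set once in the shell (for example, export NEWSAPI_KEY=<your-key>) and the API_KEY constant in src/fetch_news.py and src/pipeline.py could be replaced by a call to load_api_key().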
To fetch news, preprocess, analyze, and generate reports:
python src/main.py
This will:
- Fetch recent crime news
- Clean and preprocess the text
- Extract entities and classify crime types
- Geocode locations
- Generate word clouds and reports
- Save outputs in the data/ and reports/ directories
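The cleaning step is not spelled out in this README. As a rough sketch of what "clean and preprocess" can look like, the example below lowercases the text, strips URLs and non-letter characters, and removes stopwords. The function name clean_text and the tiny stopword set are hypothetical stand-ins; the project downloads NLTK's stopword list, which is far larger.

```python
import re

# Small stand-in stopword set so the sketch runs without NLTK data.
STOPWORDS = {"a", "an", "the", "in", "on", "at", "of", "and", "to", "was", "is"}

URL_RE = re.compile(r"https?://\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Lowercase, drop URLs and non-letter characters, remove stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("Police arrested 2 suspects in the Mumbai robbery: https://example.com/a1"))
# → police arrested suspects mumbai robbery
```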
To explore the data and visualizations interactively:
streamlit run src/visualizer.py
or, if using the integrated pipeline/dashboard:
streamlit run src/pipeline.py
Features include:
- Crime type and location distribution
- Keyword cloud
- Downloadable reports (PDF, Markdown)
- Crime heatmaps
- File upload and custom analysis
- data/raw_news.json: Raw news articles fetched from NewsAPI
- data/cleaned_news.csv: Preprocessed news data
- data/ner_output.csv: NER-annotated data with locations and crime types
- data/geo_news.csv: Geocoded data with latitude/longitude
- Markdown and PDF reports are generated in the reports/ directory.
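The CSV outputs can be inspected outside the dashboard with the standard library. The sketch below counts articles per crime type; the column name crime_type is an assumption (check the header of your own data/ner_output.csv), and the inline rows are made-up sample data standing in for the real file.

```python
import csv
import io
from collections import Counter


def count_crime_types(csv_file, column="crime_type"):
    """Count rows per crime type; `column` is an assumed header name."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts[value or "unknown"] += 1
    return counts


# Inline sample standing in for data/ner_output.csv (made-up rows).
sample = io.StringIO(
    "title,location,crime_type\n"
    "Story A,Delhi,theft\n"
    "Story B,Mumbai,fraud\n"
    "Story C,Delhi,theft\n"
    "Story D,Pune,\n"
)
print(count_crime_types(sample).most_common())
```

To run it against a real output file, pass an open file handle instead of the sample, for example open("data/ner_output.csv", newline="", encoding="utf-8").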
- Crime Keywords: Edit CRIME_KEYWORDS in src/main.py or src/pipeline.py to adjust classification.
- API Key: Replace the hardcoded NewsAPI key with your own.
- Visualization: Modify src/visualizer.py for custom dashboard features.
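To make the effect of editing CRIME_KEYWORDS concrete, here is a minimal sketch of keyword-based classification. The shape of the dictionary, its entries, and the classify_crime function are assumptions for illustration; the project's own CRIME_KEYWORDS and matching logic may differ.

```python
import re

# Hypothetical keyword map; the project's CRIME_KEYWORDS may differ.
CRIME_KEYWORDS = {
    "theft": ["theft", "robbery", "burglary", "stolen"],
    "fraud": ["fraud", "scam", "embezzlement"],
    "assault": ["assault", "attacked", "stabbing"],
}


def classify_crime(text, keywords=CRIME_KEYWORDS, default="other"):
    """Return the crime type whose keywords match most often, else `default`."""
    lowered = text.lower()
    best_type, best_hits = default, 0
    for crime_type, words in keywords.items():
        # Whole-word matches only, so "scam" does not match "scamper".
        hits = sum(
            len(re.findall(rf"\b{re.escape(word)}\b", lowered)) for word in words
        )
        if hits > best_hits:
            best_type, best_hits = crime_type, hits
    return best_type


print(classify_crime("Two men arrested after jewellery robbery; stolen goods recovered"))
# → theft
```

Under this layout, adding a category means adding one key with its keyword list, and ties go to whichever category appears first in the dictionary.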
We welcome contributions to enhance this project!
If you'd like to contribute:
- Fork the repository.
- Create a new branch:
git checkout -b your-feature-name
- Commit your changes with clear messages.
- Push to your fork and submit a Pull Request (PR).
Please ensure your code follows the existing style and structure. For major changes, feel free to open an issue to discuss your ideas before implementation.



