This project focuses on extracting key themes and trends from financial news articles using natural language processing (NLP) techniques. By applying topic modeling, we aim to uncover insights about prevalent topics in the financial domain.
Financial news articles often reflect the pulse of the market, highlighting trends and concerns. This project leverages topic modeling to analyze financial news and identify key topics such as stock market updates, technological advancements, housing trends, and more.
To ensure a comprehensive dataset, we scraped financial news articles from multiple reputable sources, including:
- Yahoo Finance
- Bloomberg
- Other trusted financial platforms
The data collection process was automated to enable continuous updates, ensuring our analysis remains current and relevant.
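The report does not include the scraper itself, so as a minimal illustration, here is a stdlib-only sketch of the parsing step: extracting headline text from a fetched listing page. The `<h3>` tag and the sample HTML are assumptions for illustration; real source pages will differ.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collects text found inside <h3> tags, a common headline container."""
    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_h3 and data.strip():
            self.headlines.append(data.strip())

# Hypothetical page fragment standing in for a fetched article listing.
sample_html = """
<div><h3>Fed holds rates steady</h3><p>...</p>
<h3>Tech stocks rally on AI optimism</h3></div>
"""
parser = HeadlineParser()
parser.feed(sample_html)
```

In an automated pipeline, this parse step would run on pages fetched on a schedule, feeding new articles into the dataset continuously.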
High-quality text data is crucial for meaningful topic modeling. Our preprocessing steps included:
- Lemmatization: Converted words to their base forms (e.g., "running" → "run") to group similar words together.
- Stop-word Removal: Removed common, uninformative words like "the," "and," and "is."
- Dynamic Part-of-Speech (POS) Tagging: Retained the most relevant parts of speech while filtering out irrelevant terms.
- Fuzzy Matching: Identified and removed duplicate or near-duplicate articles to focus on unique content.
We conducted exploratory analysis to understand the dataset's structure and key features. Below are the findings from different financial domains:
- Stock Market: Frequent terms include Trump, investor, earnings, tariff.
- Morning Brief: Common words include Trump, business, price.
- Economies: Words like inflation, oil, cut, president stand out.
- Earnings: Terms such as growth, last, beat are prominent.
- Tech: Highlights include AI, Trump, gas, tax.
- Housing: Words like rate, mortgage, loan are frequent.
Here’s a visualization of frequent terms across domains:
Using topic modeling techniques like Latent Dirichlet Allocation (LDA) along with Word2Vec (weighted by TF-IDF) and clustering, we identified the following insights:
We applied LDA to uncover the prevalent topics in the financial news dataset. The following are the key topics identified:
- Topic 0: Real Estate & Housing Market
  Keywords: "home_value," "city," "housing_market," "real_estate," "growth"
- Topic 1: AI & Revenue Reports
  Keywords: "ai," "meta," "report_revenue," "chip," "analyst_expectation"
- Topic 2: Trump & Business
  Keywords: "trump," "ai," "yahoo," "finance," "business," "ceo"
- Topic 3: Trump, Tariffs, and Inflation
  Keywords: "trump," "tariff," "inflation," "investor," "policy," "bloomberg"
- Topic 4: Tesla & AI Investments
  Keywords: "tesla," "ai," "investor," "model," "share," "result"
- Topic 5: Inflation & Fed Rates
  Keywords: "rate," "fed," "inflation," "homeowner," "tax," "reward"
- Topic 6: Real Estate & Revenue Growth
  Keywords: "exist_home," "sale_price," "growth," "revenue," "increase," "million"
- Topic 7: Trump & Market Sentiment
  Keywords: "trump," "fed," "investor," "rise," "tariff," "index"
- Topic 8: Mortgage Rates & Loans
  Keywords: "rate," "mortgage," "lender," "bank," "loan," "mortgage_rate"
We used Word2Vec to create embeddings for each word and document. To refine the embeddings, we incorporated TF-IDF weighting to emphasize more important words in the context of each article. The embeddings were then clustered using techniques such as K-Means and DBSCAN to group articles with similar content.
- Word Embeddings: Each word in the article was transformed into a vector representation using Word2Vec.
- TF-IDF Weighting: TF-IDF weighting was applied to adjust the influence of each word in the embedding, giving more importance to unique and relevant terms across the dataset.
- Clustering: Using K-Means, DBSCAN, HDBSCAN, Spectral Clustering, and Gaussian Mixture methods, we grouped similar articles together; however, the resulting clusters were not very informative.
We also visualized the clustering of articles based on their Word2Vec embeddings using techniques like t-SNE to reduce dimensionality:
This gives a visual representation of how similar or different articles are, based on their semantic content.
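A minimal sketch of the t-SNE projection step, using scikit-learn's `TSNE` on placeholder embeddings (standing in for the real document vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder document embeddings standing in for the Word2Vec vectors.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(8, 50))

# Perplexity must be smaller than the number of samples; keep it small here.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(doc_embeddings)
# coords can then be scatter-plotted, colored by cluster label.
```

Note that t-SNE preserves local neighborhoods rather than global distances, so it is useful for spotting clusters but distances between far-apart groups in the plot should not be over-interpreted.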


