This project focuses on extracting key themes and trends from financial news articles using natural language processing (NLP) techniques. By applying topic modeling, we aim to uncover insights about prevalent topics in the financial domain.
Financial news articles often reflect the pulse of the market, highlighting trends and concerns. This project leverages topic modeling to analyze financial news and identify key topics such as stock market updates, technological advancements, housing trends, and more.
To ensure a comprehensive dataset, we scraped financial news articles from multiple reputable sources, including:
- Yahoo Finance
- Bloomberg
- Other trusted financial platforms
The data collection process was automated to enable continuous updates, ensuring our analysis remains current and relevant.
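The report does not include the scraper itself, so as a minimal illustration, here is a stdlib-only sketch of the parsing step: extracting headline text from a fetched listing page. The `<h3>` tag and the sample HTML are assumptions for illustration; real source pages will differ.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collects text found inside <h3> tags, a common headline container."""
    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_h3 and data.strip():
            self.headlines.append(data.strip())

# Hypothetical page fragment standing in for a fetched article listing.
sample_html = """
<div><h3>Fed holds rates steady</h3><p>...</p>
<h3>Tech stocks rally on AI optimism</h3></div>
"""
parser = HeadlineParser()
parser.feed(sample_html)
```

In an automated pipeline, this parse step would run on pages fetched on a schedule, feeding new articles into the dataset continuously.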
High-quality text data is crucial for meaningful topic modeling. Our preprocessing steps included:
- Lemmatization: Converted words to their base forms (e.g., "running" → "run") to group similar words together.
- Stop-word Removal: Removed common, uninformative words like "the," "and," and "is."
- Dynamic Part-of-Speech (POS) Tagging: Retained the most relevant parts of speech while filtering out irrelevant terms.
- Fuzzy Matching: Identified and removed duplicate or near-duplicate articles to focus on unique content.
We conducted exploratory analysis to understand the dataset's structure and key features. Below are the findings from different financial domains:
- Stock Market: Frequent terms include Trump, investor, earnings, tariff.
- Morning Brief: Common words include Trump, business, price.
- Economies: Words like inflation, oil, cut, president stand out.
- Earnings: Terms such as growth, last, beat are prominent.
- Tech: Highlights include AI, Trump, gas, tax.
- Housing: Words like rate, mortgage, loan are frequent.
Here’s a visualization of frequent terms across domains:
Using topic modeling techniques like Latent Dirichlet Allocation (LDA) along with Word2Vec (weighted by TF-IDF) and clustering, we identified the following insights:
We applied LDA to uncover the prevalent topics in the financial news dataset. The following are the key topics identified:
- Topic 0: Real Estate & Housing Market
  Keywords: "home_value," "city," "housing_market," "real_estate," "growth"
- Topic 1: AI & Revenue Reports
  Keywords: "ai," "meta," "report_revenue," "chip," "analyst_expectation"
- Topic 2: Trump & Business
  Keywords: "trump," "ai," "yahoo," "finance," "business," "ceo"
- Topic 3: Trump, Tariffs, and Inflation
  Keywords: "trump," "tariff," "inflation," "investor," "policy," "bloomberg"
- Topic 4: Tesla & AI Investments
  Keywords: "tesla," "ai," "investor," "model," "share," "result"
- Topic 5: Inflation & Fed Rates
  Keywords: "rate," "fed," "inflation," "homeowner," "tax," "reward"
- Topic 6: Real Estate & Revenue Growth
  Keywords: "exist_home," "sale_price," "growth," "revenue," "increase," "million"
- Topic 7: Trump & Market Sentiment
  Keywords: "trump," "fed," "investor," "rise," "tariff," "index"
- Topic 8: Mortgage Rates & Loans
  Keywords: "rate," "mortgage," "lender," "bank," "loan," "mortgage_rate"
We used Word2Vec to create embeddings for each word and document. To refine the embeddings, we incorporated TF-IDF weighting to emphasize more important words in the context of each article. The embeddings were then clustered using techniques such as K-Means and DBSCAN to group articles with similar content.
- Word Embeddings: Each word in the article was transformed into a vector representation using Word2Vec.
- TF-IDF Weighting: TF-IDF weighting was applied to adjust the influence of each word in the embedding, giving more importance to unique and relevant terms across the dataset.
- Clustering: Using K-Means, DBSCAN, HDBSCAN, Spectral Clustering, and Gaussian Mixture methods, we grouped similar articles together; however, the resulting clusters were not very informative.
We also visualized the clustering of articles based on their Word2Vec embeddings using techniques like t-SNE to reduce dimensionality:
This gives a visual representation of how similar or different articles are, based on their semantic content.
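A minimal sketch of the t-SNE projection step, using scikit-learn's `TSNE` on placeholder embeddings (standing in for the real document vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder document embeddings standing in for the Word2Vec vectors.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(8, 50))

# Perplexity must be smaller than the number of samples; keep it small here.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(doc_embeddings)
# coords can then be scatter-plotted, colored by cluster label.
```

Note that t-SNE preserves local neighborhoods rather than global distances, so it is useful for spotting clusters but distances between far-apart groups in the plot should not be over-interpreted.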


