haverstein/Topic-Modeling-for-Financial-News

Topic Modeling for Financial News

This project focuses on extracting key themes and trends from financial news articles using natural language processing (NLP) techniques. By applying topic modeling, we aim to uncover insights about prevalent topics in the financial domain.

Overview

Financial news articles often reflect the pulse of the market, highlighting trends and concerns. This project leverages topic modeling to analyze financial news and identify key topics such as stock market updates, technological advancements, housing trends, and more.


Data Collection

To ensure a comprehensive dataset, we scraped financial news articles from multiple reputable sources, including:

  • Yahoo Finance
  • Bloomberg
  • Other trusted financial platforms

The data collection process was automated to enable continuous updates, ensuring our analysis remains current and relevant.


Data Preprocessing

High-quality text data is crucial for meaningful topic modeling. Our preprocessing steps included:

  1. Lemmatization
    • Converted words to their base forms (e.g., "running" → "run") to group similar words together.
  2. Stop-word Removal
    • Removed common, uninformative words like "the," "and," and "is."
  3. Dynamic Part-of-Speech (POS) Tagging
    • Retained the most relevant parts of speech while filtering out irrelevant terms.
  4. Fuzzy Matching
    • Identified and removed duplicate or near-duplicate articles to focus on unique content.
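
The stop-word removal and fuzzy-matching steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the stop-word list, the `0.9` similarity threshold, and the function names `tokenize` and `drop_near_duplicates` are all assumptions. Lemmatization and POS filtering are left out because they need a trained language model from an NLP library.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "on", "for", "as"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def drop_near_duplicates(articles, threshold=0.9):
    """Keep an article only if it is not too similar to one already kept."""
    kept = []
    for text in articles:
        # SequenceMatcher.ratio() is 1.0 for identical strings, lower otherwise.
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept

# Toy headlines, not taken from the project's dataset.
articles = [
    "Fed holds rates steady as inflation cools",
    "Fed holds rates steady as inflation cools.",  # near-duplicate of the first
    "Mortgage rates rise for the third week",
]
unique_articles = drop_near_duplicates(articles)
tokens = [tokenize(a) for a in unique_articles]
```

Pairwise comparison is quadratic in the number of articles, so a larger corpus would likely need a cheaper pre-filter (for example, comparing only articles with matching titles or dates) before the fuzzy comparison.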

Exploratory Data Analysis (EDA)

We conducted exploratory analysis to understand the dataset's structure and key features. Below are the findings from different financial domains:

  • Stock Market: Frequent terms include Trump, investor, earnings, tariff.
  • Morning Brief: Common words include Trump, business, price.
  • Economies: Words like inflation, oil, cut, president stand out.
  • Earnings: Terms such as growth, last, beat are prominent.
  • Tech: Highlights include AI, Trump, gas, tax.
  • Housing: Words like rate, mortgage, loan are frequent.
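
Per-domain term lists like the ones above can be produced by counting tokens within each domain. The sketch below assumes the articles are already preprocessed into `(domain, tokens)` pairs; the function name `top_terms_by_domain` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def top_terms_by_domain(records, n=3):
    """Return the n most common tokens for each domain.

    `records` is an iterable of (domain, tokens) pairs.
    """
    counts = defaultdict(Counter)
    for domain, tokens in records:
        counts[domain].update(tokens)
    return {d: [t for t, _ in c.most_common(n)] for d, c in counts.items()}

# Toy records, not the project's data.
records = [
    ("Housing", ["rate", "mortgage", "loan", "rate"]),
    ("Housing", ["mortgage", "rate", "lender"]),
    ("Tech", ["ai", "chip", "ai"]),
]
top = top_terms_by_domain(records, n=2)
```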

Visualization

Here’s a visualization of frequent terms across domains:

Word Cloud


Results

Using topic modeling techniques like Latent Dirichlet Allocation (LDA) along with Word2Vec (weighted by TF-IDF) and clustering, we identified the following insights:

Topic Modeling with LDA:

We applied LDA to uncover the prevalent topics in the financial news dataset. The following are the key topics identified:

  1. Topic 0: Real Estate & Housing Market
    Keywords: "home_value," "city," "housing_market," "real_estate," "growth"

  2. Topic 1: AI & Revenue Reports
    Keywords: "ai," "meta," "report_revenue," "chip," "analyst_expectation"

  3. Topic 2: Trump & Business
    Keywords: "trump," "ai," "yahoo," "finance," "business," "ceo"

  4. Topic 3: Trump, Tariffs, and Inflation
    Keywords: "trump," "tariff," "inflation," "investor," "policy," "bloomberg"

  5. Topic 4: Tesla & AI Investments
    Keywords: "tesla," "ai," "investor," "model," "share," "result"

  6. Topic 5: Inflation & Fed Rates
    Keywords: "rate," "fed," "inflation," "homeowner," "tax," "reward"

  7. Topic 6: Real Estate & Revenue Growth
    Keywords: "exist_home," "sale_price," "growth," "revenue," "increase," "million"

  8. Topic 7: Trump & Market Sentiment
    Keywords: "trump," "fed," "investor," "rise," "tariff," "index"

  9. Topic 8: Mortgage Rates & Loans
    Keywords: "rate," "mortgage," "lender," "bank," "loan," "mortgage_rate"
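
To show how LDA arrives at keyword lists like these, here is a dependency-free sketch of LDA fitted by collapsed Gibbs sampling. It is an illustration under stated assumptions, not the code used in this project, which would more plausibly rely on a library implementation. The function names, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                       # n(k)
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(k) is proportional to
                # (n(d,k) + alpha) * (n(k,w) + beta) / (n(k) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

def top_words(topic_word, n=3):
    """List up to n highest-count words per topic, skipping zero counts."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

# Toy tokenized documents, not the project's data.
docs = [
    ["mortgage", "rate", "loan", "lender"],
    ["rate", "mortgage", "bank", "loan"],
    ["ai", "chip", "revenue", "meta"],
    ["chip", "ai", "analyst", "revenue"],
]
topic_word = lda_gibbs(docs, num_topics=2)
topics = top_words(topic_word)
```

Topic labels such as "Mortgage Rates & Loans" are not produced by the model; they are assigned by reading each topic's top keywords.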

LDA Visualization

Word2Vec with TF-IDF Weighting & Clustering:

We used Word2Vec to create embeddings for each word and document. To refine the embeddings, we incorporated TF-IDF weighting to emphasize more important words in the context of each article. The embeddings were then clustered using techniques such as K-Means and DBSCAN to group articles with similar content.

  1. Word Embeddings:
    Each word in the article was transformed into a vector representation using Word2Vec.

  2. TF-IDF Weighting:
    TF-IDF weighting was applied to adjust the influence of each word in the embedding, giving more importance to unique and relevant terms across the dataset.

  3. Clustering:
    Using K-Means, DBSCAN, HDBSCAN, Spectral Clustering, and Gaussian Mixture models, we grouped similar articles together. The clusters produced by these methods were not very informative.
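
The three steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the two-dimensional `vectors` dictionary is a hand-made stand-in for trained Word2Vec embeddings, the TF-IDF formula is one common variant, and the function names `tfidf_weights`, `embed`, and `kmeans` are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            w: (c / len(doc)) * (math.log(n_docs / doc_freq[w]) + 1.0)
            for w, c in tf.items()
        })
    return weights

def embed(doc_weights, vectors, dim):
    """TF-IDF-weighted average of the word vectors in one document."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in doc_weights.items():
        if word in vectors:  # skip out-of-vocabulary words
            total = [t + weight * v for t, v in zip(total, vectors[word])]
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [sum(x) / len(members) for x in zip(*members)]
    return labels

# Hand-made 2-D vectors standing in for Word2Vec output (assumption).
vectors = {
    "mortgage": [1.0, 0.0], "loan": [0.9, 0.1], "rate": [0.8, 0.2],
    "ai": [0.0, 1.0], "chip": [0.1, 0.9], "revenue": [0.2, 0.8],
}
docs = [
    ["mortgage", "rate", "loan"],
    ["loan", "rate"],
    ["ai", "chip"],
    ["chip", "revenue", "ai"],
]
embeddings = [embed(w, vectors, dim=2) for w in tfidf_weights(docs)]
labels = kmeans(embeddings, k=2)
```

The toy vectors are separable by construction, so the clean split here says nothing about real data. Averaging word vectors tends to blur documents that share a general vocabulary, which may be one reason the clusters on the actual news corpus were hard to interpret.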

Word2Vec Embedding Visualization

We also visualized the article clusters by using t-SNE to project their Word2Vec embeddings into a low-dimensional space. The resulting plot shows how similar or different articles are, based on their semantic content.

t-SNE reduction

About

Grouping financial news articles with NLP techniques, including Latent Dirichlet Allocation
