This repository contains the code, data, and analysis for the research article "Beyond Language Barriers: Evaluating the Efficacy of Open-Source LLMs in Analyzing Sentiment in Non-English Textual Data."
This research evaluates the cross-linguistic capabilities of general-purpose large language models (LLMs) in sentiment analysis of French texts. We assess whether open-source foundational models can accurately analyze sentiment in non-English content without task-specific fine-tuning, comparing their performance to traditional dictionary-based methods and examining the impact of prompt language on analytical outcomes.
- Can general-purpose LLMs accurately evaluate sentiment in non-English texts without task-specific fine-tuning?
- Can open-source general-purpose foundational LLMs accurately evaluate sentiment in non-English texts without task-specific fine-tuning?
- How does the language of the prompt affect performance across different languages?
The corpus consists of 2,683 French-language news articles from 13 media sources, collected from the Eureka database and spanning 1991-2025. The articles cover discourse about open-source software and were retrieved with the following search query:
TEXT= ("logiciel libre" | "logiciels libres" | "open source" | "open-source" | "logiciel open source" | "logiciels open source" | "code source ouvert" | "software libre" | "free software" | "code source libre" | "Free Software Foundation" | "Richard Stallman")
The sources include 9 newspapers from Québec, 1 from Ontario, and 3 from France, ensuring a diverse representation of the francophone media landscape.
- /data/: Contains all data files
  - /clean/: Processed data ready for analysis
  - /dict/: Dictionary files for lexicon-based sentiment analysis
  - /eureka_articles/: Raw HTML files of collected news articles
  - /raw/: Unprocessed data files
  - /tmp/: Temporary files and checkpoints
  - /translation_files/: Files for the translation process
- /docs/: Documentation and manuscript files
  - /pub/: Publication-ready files
- /results/: Analysis results and outputs
  - /analysis/: Statistical analysis results
  - /graphs/: Visualizations
  - /tables/: Result tables
- /src/: Source code for data collection, processing, and analysis
Data was collected using a custom Python web scraper (src/00_eureka_scraper.py) that extracts article content from the Eureka database. The scraper uses Selenium WebDriver to navigate the database interface and download articles in HTML format.
HTML documents were preprocessed with custom R functions to extract publication metadata (date, source, title) and full text content.
To facilitate cross-linguistic analysis, French texts were translated into English using Google Translate. The process involved:
- Batching the articles into Word documents using the officer R package
- Adding unique delimiters to maintain document structure
- Translating the documents using Google Translate
- Extracting and realigning the translated text with the original French content
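The batching and realignment steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's R implementation: the delimiter format `[[DOC_0000]]` and all function names are assumptions, and the translation step is replaced by a stand-in. The sketch also assumes that the translation service leaves the delimiters untouched.

```python
import re

# Hypothetical delimiter format; the delimiters actually used in the project
# are not documented in this README.
DELIM = "[[DOC_{:04d}]]"
DELIM_RE = re.compile(r"\[\[DOC_(\d{4})\]\]")


def build_batch(texts):
    """Join documents into one string, each preceded by a unique delimiter."""
    return "\n".join(f"{DELIM.format(i)}\n{text}" for i, text in enumerate(texts))


def split_batch(batch):
    """Recover {document index: text} from a delimited batch."""
    # re.split with one capture group yields [prefix, id, text, id, text, ...]
    parts = DELIM_RE.split(batch)
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts), 2)}


def realign(originals, translated_batch):
    """Pair each original with its translation; None marks a lost delimiter."""
    translated = split_batch(translated_batch)
    return [(text, translated.get(i)) for i, text in enumerate(originals)]


french = ["Le logiciel libre progresse.", "Un projet open source."]
batch = build_batch(french)

# Stand-in for the translated document returned by the translation service.
fake_translations = {
    "Le logiciel libre progresse.": "Free software is advancing.",
    "Un projet open source.": "An open source project.",
}
translated_batch = batch
for fr, en in fake_translations.items():
    translated_batch = translated_batch.replace(fr, en)

pairs = realign(french, translated_batch)
```

Keying the realignment on the delimiter index rather than on position means a dropped or merged paragraph shows up as a `None` translation instead of silently shifting every later document.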
- 200 randomly selected sentences were manually annotated on a scale from -1 to 1
- A custom Shiny application was developed to streamline the annotation process
- This annotated dataset served as the ground truth for evaluation
- French corpus: Analyzed using a French Lexicoder Sentiment Dictionary (frlsd)
- English translations: Analyzed using the standard Lexicoder Sentiment Dictionary
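As a rough illustration of how a dictionary-based score can be computed, the Python sketch below counts positive and negative word matches and normalizes by text length. The word lists are toy placeholders, not Lexicoder entries, and the formula (positive hits minus negative hits, divided by token count) is one common convention assumed here; it is not necessarily the one used in src/20_frlsd.R or src/23_lsd.R. The sketch also ignores the wildcard patterns and negation handling that the Lexicoder dictionaries provide.

```python
import re

# Toy word lists for illustration only; they are NOT the Lexicoder entries.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}


def lexicon_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (positive hits - negative hits) / token count, in [-1, 1]."""
    # Lowercase and keep runs of letters (accented characters included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(token in positive for token in tokens)
    neg = sum(token in negative for token in tokens)
    return (pos - neg) / len(tokens)


score = lexicon_score("A great and excellent tool")
```

Normalizing by token count keeps the score on the same -1 to 1 range as the manual annotations, which makes the two directly comparable.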
The research evaluates 11 different LLMs across three linguistic configurations:
- French text with French prompts (FR→FR)
- French text with English prompts (EN→FR)
- English translations with English prompts (EN→EN)
Models evaluated:
- Open-source models:
  - Llama 3.2 (1B)
  - Llama 3.2 (3B)
  - Gemma 2 (9B)
  - Mistral Saba (24B)
  - QWQ (32B)
  - Llama 3.3 (70B)
  - DeepSeek R1 Basic (671B)
- Closed-source models:
  - Claude 3.5 Haiku
  - Gemini 2.0 Flash
  - DeepSeek Chat
  - GPT-4o
Each model was prompted three times per sentence to ensure robust results, for a total of 19,800 prompts.
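The total follows from the full crossing of models, configurations, sentences, and repetitions: 11 × 3 × 200 × 3 = 19,800. A short Python sketch that enumerates this grid is shown below; the counts come from the text above, while the variable names are illustrative.

```python
from itertools import product

# Counts taken from the description above; names are illustrative.
N_MODELS = 11
CONFIGURATIONS = ["FR→FR", "EN→FR", "EN→EN"]
N_SENTENCES = 200
N_REPETITIONS = 3

# One tuple per prompt: (model index, configuration, sentence index, repetition).
runs = list(
    product(range(N_MODELS), CONFIGURATIONS, range(N_SENTENCES), range(N_REPETITIONS))
)
total_prompts = len(runs)
```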
Three complementary methodologies were used to assess performance:
- Correlation analysis: Pearson correlation coefficients between model predictions and ground truth
- F1 score with 7-category classification: Very negative, negative, somewhat negative, neutral, somewhat positive, positive, very positive
- F1 score with 3-category classification: Negative, neutral, positive
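The Python sketch below shows one way these metrics can be computed from continuous scores on the -1 to 1 scale. It is not the code in src/50_cor.R or src/52_fscore_3.R. The cut points used to turn scores into three classes (±1/3) and the choice of macro-averaged F1 are assumptions made for illustration, and the example scores are invented. The 7-category variant would follow the same pattern with more cut points.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def to_three_classes(score, threshold=1 / 3):
    """Map a score in [-1, 1] to a label. The threshold is an assumption."""
    if score < -threshold:
        return "negative"
    if score > threshold:
        return "positive"
    return "neutral"


def macro_f1(truth, predicted):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = set(truth) | set(predicted)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(truth, predicted))
        fp = sum(t != label and p == label for t, p in zip(truth, predicted))
        fn = sum(t == label and p != label for t, p in zip(truth, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up scores for illustration; not data from the study.
human = [-0.8, -0.5, 0.0, 0.1, 0.6, 0.9]
model = [-0.9, -0.2, 0.1, 0.5, 0.7, 0.8]

r = pearson(human, model)
f1 = macro_f1(
    [to_three_classes(s) for s in human],
    [to_three_classes(s) for s in model],
)
```

Correlation rewards getting the ordering and magnitude right, while F1 only rewards landing in the correct bin, so a model can score well on one metric and less well on the other.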
- General-purpose foundation models can effectively analyze sentiment in non-English texts, with the best-performing models achieving correlation coefficients above 0.70 and F1 scores exceeding 0.70 for 3-category classification.
- Proprietary closed-weight models consistently outperformed their open-weight counterparts, with DeepSeek Chat, GPT-4o, and Gemini 2.0 demonstrating the strongest performance.
- For correlation metrics, French prompts with French text yielded the strongest performance, closely followed by English prompts with French text. English prompts with English-translated text showed marginally lower performance.
- For classification metrics, English prompts with French text yielded the highest average performance, slightly outperforming both French-prompted French text and English-prompted English text.
- All models demonstrated substantially higher performance on the simplified 3-category task compared to the more granular 7-category classification, with models struggling most with "somewhat positive" and "somewhat negative" categories.
This project uses R for data analysis and Python for web scraping. Python dependencies are managed with uv.
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install Python dependencies
uv sync
# Required R packages
# install.packages(c("tidyverse", "quanteda", "caret", "ellmer", "officer", "polyglotR"))
# Run the Eureka scraper (requires login credentials)
uv run python src/00_eureka_scraper.py --start-date 1991-01-01 --end-date 2025-01-01 --output-dir ./data/eureka_articles
The analysis can be run step-by-step following the numbered scripts in the /src/ directory:
- HTML parsing and data extraction: src/01_html_parsing.R
- Data summarization: src/02_data_summary.R
- Data cleaning: src/10_dataframe_cleaning.R
- Dictionary-based sentiment analysis: src/20_frlsd.R, src/22_lsd_prep.R, src/23_lsd.R
- Translation: src/21_translate_to_english.R
- Sample creation and manual annotation: src/30_create_sample.R, src/31_validate_manual_anotation.R
- LLM prompting: src/40_prompt.R, src/41_prompt_cleaning.R
- Performance evaluation: src/50_cor.R, src/51_fscore_7.R, src/52_fscore_3.R
- Visualization: src/60_cor_graph.R, src/61_mae_graph.R, src/62_fscore_graphs.R
- Result tables: src/63_fscore_tables.R, src/64_results_summary.R
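If the numeric prefixes encode the intended execution order (an assumption; the list above is grouped by topic, which places script 21 after 23), the run order can be derived as in the Python sketch below. It only builds `Rscript` command lists and does not execute anything. Some steps, such as manual annotation and the API-dependent prompting, cannot run unattended in any case.

```python
import re
from pathlib import Path

# File names copied from the list above.
SCRIPTS = [
    "src/01_html_parsing.R", "src/02_data_summary.R",
    "src/10_dataframe_cleaning.R",
    "src/20_frlsd.R", "src/22_lsd_prep.R", "src/23_lsd.R",
    "src/21_translate_to_english.R",
    "src/30_create_sample.R", "src/31_validate_manual_anotation.R",
    "src/40_prompt.R", "src/41_prompt_cleaning.R",
    "src/50_cor.R", "src/51_fscore_7.R", "src/52_fscore_3.R",
    "src/60_cor_graph.R", "src/61_mae_graph.R", "src/62_fscore_graphs.R",
    "src/63_fscore_tables.R", "src/64_results_summary.R",
]


def numeric_prefix(path):
    """Return the leading number of a script name; unnumbered files sort last."""
    match = re.match(r"(\d+)_", Path(path).name)
    return int(match.group(1)) if match else float("inf")


# Assumption: a lower prefix means the script runs earlier.
ordered = sorted(SCRIPTS, key=numeric_prefix)

# One command per script, suitable for subprocess.run(cmd, check=True).
commands = [["Rscript", script] for script in ordered]
```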
To replicate the study with your own API keys:
- Set up the required API keys as environment variables:
Sys.setenv(FIREWORKS_API_KEY="your_key_here")
Sys.setenv(OPENAI_API_KEY="your_key_here")
Sys.setenv(ANTHROPIC_API_KEY="your_key_here")
Sys.setenv(GOOGLE_API_KEY="your_key_here")
Sys.setenv(GROQ_API_KEY="your_key_here")
Sys.setenv(DEEPSEEK_API_KEY="your_key_here")
- Run the src/40_prompt.R script to perform sentiment analysis with all models.
- Run the evaluation scripts to analyze the results.
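Before launching a long prompting run, it can help to confirm that every key is actually set. The Python sketch below is an optional pre-flight check, not part of the repository; the key names are the ones listed above and the function name is hypothetical. Note that variables set with `Sys.setenv()` exist only in that R session and its child processes, so this check applies to keys exported in the shell or set in the environment the script runs in.

```python
import os

# Key names copied from the setup step above.
REQUIRED_KEYS = [
    "FIREWORKS_API_KEY",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "GROQ_API_KEY",
    "DEEPSEEK_API_KEY",
]


def missing_keys(environ=os.environ, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```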
If you use this code or data in your research, please cite:
Foisy, L.-O. M., Pelletier, C., Proulx, É., Vincent, S.-J., & Dufresne, Y. (2025).
Beyond Language Barriers: Evaluating the Efficacy of Open-Source LLMs in Analyzing
Sentiment in Non-English Textual Data.
This repository is licensed under the terms included in the LICENSE file.