SimonMcCallum/AI-Thematic-Analysis-code

AI-Enhanced Thematic Analysis Code

Code repository for the paper "AI-Enhanced Abductive Analysis: Integrating Large Language Models and Semantic Clustering for Qualitative Social Research" by Deborah Rhodes and Simon McCallum (Victoria University of Wellington, New Zealand).

Overview

This repository implements an AI-powered thematic analysis pipeline for qualitative social research. It uses OpenAI's GPT models to extract themes from semi-structured interview data, then applies semantic embeddings and hierarchical clustering to validate and organize the resulting themes.

The pipeline follows Laura Nelson's three-stage Computational Grounded Theory (CGT) framework:

  1. Pattern Detection — Multi-framework theme extraction via LLMs
  2. Pattern Refinement — Human-reviewed organization by research dimensions
  3. Pattern Confirmation — Embedding-based clustering and semantic similarity analysis

The empirical application analyses interview data from 29 early-career dairy farm workers in New Zealand, collected via WhatsApp over 15 days across 5 safety dimensions.

Research Dimensions

Dimension | Interview Days | Focus
Employer Risk Expectation | 1, 6, 11 | Management safety priorities vs production pressure
Justice (Safety) | 2, 7, 12 | Fairness in incident handling and investigations
Workers Risk Expectation | 3, 8, 13 | Worker responsibility and peer safety collaboration
Communication & Danger Tolerance | 4, 9, 14 | Risk communication and acceptance of unsafe conditions
Trust (in System) | 5, 10, 15 | Confidence in management, training, and guidance
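
The schedule in the table is a five-day rotation: each dimension recurs every fifth day across the 15 interview days. A minimal sketch of that mapping follows. The dimension names come from the table; the constant and function names are hypothetical and are not taken from the repository's scripts.

```python
# Dimension names are copied from the table above, in day order.
# DIMENSIONS, dimension_for_day and days_for_dimension are illustrative names only.
DIMENSIONS = [
    "Employer Risk Expectation",
    "Justice (Safety)",
    "Workers Risk Expectation",
    "Communication & Danger Tolerance",
    "Trust (in System)",
]

TOTAL_DAYS = 15


def dimension_for_day(day: int) -> str:
    """Return the research dimension covered on an interview day (1 to 15)."""
    if not 1 <= day <= TOTAL_DAYS:
        raise ValueError(f"day must be between 1 and {TOTAL_DAYS}, got {day}")
    # Days 1, 6, 11 map to index 0; days 2, 7, 12 to index 1; and so on.
    return DIMENSIONS[(day - 1) % len(DIMENSIONS)]


def days_for_dimension(name: str) -> list[int]:
    """Return every interview day assigned to the named dimension."""
    index = DIMENSIONS.index(name)
    return [d for d in range(1, TOTAL_DAYS + 1) if (d - 1) % len(DIMENSIONS) == index]
```

For example, `dimension_for_day(7)` returns `"Justice (Safety)"`, matching the second row of the table.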

Repository Structure

Stage 1: Theme Extraction

File | Description
thematic_extraction.py | Primary analysis engine. Processes interview JSON through OpenAI using three frameworks: default, Timmermans & Tavory abductive analysis, and Nelson's CGT.
Theme extraction.py | Prototype for basic theme extraction with structured JSON output.
thematic_view.py | Post-processing to organize and structure extracted themes.

Stage 2: Organization and Pairing

File | Description
pair_themes.py | Pairs scenario and safety themes by interaction index, organized by dimension.
theme_lists_clean.py | Aggregates themes by research dimension and type (scenario vs safety).
extract_theme_data.py | Extracts interview data for specific days from pickle metadata.
thematic_view_excel.py | Exports themes and analysis to Excel with structured columns.
collate_to_word.py | Exports collated interview data to Word documents.
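
To illustrate what pairing "by interaction index, organized by dimension" can look like, here is a small sketch. It assumes each extracted theme is a dict with the keys `dimension`, `interaction`, `type` and `theme`; that record layout, the function name, and the placeholder theme strings are assumptions for illustration and may not match what pair_themes.py actually reads or writes.

```python
from collections import defaultdict


def pair_themes(records):
    """Group scenario and safety themes that share an interaction index.

    Assumed record layout (hypothetical): a dict with 'dimension',
    'interaction' (int), 'type' ('scenario' or 'safety') and 'theme'.
    Any other 'type' value raises KeyError.
    """
    by_interaction = defaultdict(lambda: {"scenario": [], "safety": []})
    for rec in records:
        key = (rec["dimension"], rec["interaction"])
        by_interaction[key][rec["type"]].append(rec["theme"])

    # Emit one entry per interaction, ordered by dimension then interaction index.
    paired = defaultdict(list)
    for (dimension, interaction), themes in sorted(
        by_interaction.items(), key=lambda item: item[0]
    ):
        paired[dimension].append({"interaction": interaction, **themes})
    return dict(paired)


# Placeholder records; the theme strings are not real study data.
records = [
    {"dimension": "Justice (Safety)", "interaction": 2, "type": "safety",
     "theme": "placeholder safety theme B"},
    {"dimension": "Justice (Safety)", "interaction": 1, "type": "scenario",
     "theme": "placeholder scenario theme A"},
    {"dimension": "Justice (Safety)", "interaction": 1, "type": "safety",
     "theme": "placeholder safety theme A"},
]
pairs = pair_themes(records)
```

An interaction with only one theme type still appears in the output, with an empty list on the other side, so unpaired themes stay visible for human review.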

Stage 3: Embedding, Clustering, and Validation

File | Description
thematic_view_heirachy_create.py | Core clustering pipeline: extracts unique themes, generates OpenAI embeddings, builds FAISS index, creates dendrograms and t-SNE visualizations.
thematic_view_heirachy_clustering.py | Hierarchical and agglomerative clustering with configurable cluster counts.
thematic_view_cluster_theme_analyzer.py | Uses OpenAI to generate human-readable cluster names (1–4 words).
thematic_veiw_focus_code.nearest.py | Nearest-neighbour search using FAISS to find semantically similar themes from probe themes.
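
The clustering step can be sketched with SciPy on synthetic vectors. This is an illustration under stated assumptions, not the repository's code: Ward linkage, the L2 normalisation step, the function name `cluster_embeddings`, and the random 8-dimensional stand-in vectors are all choices made for this example. Real theme embeddings would come from the OpenAI embeddings API.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_embeddings(embeddings: np.ndarray, n_clusters: int):
    """Cut a Ward-linkage tree into at most n_clusters flat clusters.

    Returns (labels, tree). Labels are integers starting at 1.
    Ward linkage is an assumption; the scripts may use another method.
    """
    # Normalise rows so Euclidean distance tracks cosine distance.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    tree = linkage(unit, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels, tree


# Synthetic stand-in for theme embeddings: three tight groups of five vectors,
# each group centred on a different axis of an 8-dimensional space.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)
points = np.vstack([c + 0.01 * rng.standard_normal((5, 8)) for c in centers])

labels, tree = cluster_embeddings(points, n_clusters=3)
```

The returned `tree` is a standard SciPy linkage matrix, so it can be passed to `scipy.cluster.hierarchy.dendrogram` to draw a dendrogram of the same hierarchy.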

Utility Scripts (code/)

File | Description
build_index.py | Builds FAISS embeddings index with batching for large datasets.
query_index.py | Query interface for the FAISS index.
kmeansClustering.py | K-means clustering on theme embeddings with visualization.
pca_visualisation.py | PCA dimensionality reduction and visualization.
pca_Kmeans_visualisation.py | Combined PCA and K-means visualization.
ThemeExtractPass1.py | Initial theme extraction pass with validation.
theme_text_validation.py | Validates extracted themes against source text.
day_1_6_11_etraction.py | Extracts data for specific interview days.
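
The build-then-query pattern behind build_index.py and query_index.py can be sketched without FAISS. The example below swaps in a brute-force NumPy search over L2-normalised vectors, which ranks by cosine similarity; it is a stand-in for the FAISS index, not a copy of it. The function names, the batch size, and the letter-frequency `toy_embed` placeholder are all assumptions for illustration; the real scripts would embed with an OpenAI model.

```python
import numpy as np


def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_matrix(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch and return L2-normalised float32 rows.

    embed_batch is any callable mapping a list of strings to a 2-D array.
    Rows with zero norm are not handled in this sketch.
    """
    rows = [embed_batch(batch) for batch in batched(texts, batch_size)]
    matrix = np.vstack(rows).astype("float32")
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def query(matrix, vector, k=3):
    """Return (row index, cosine similarity) for the k most similar rows."""
    vector = vector / np.linalg.norm(vector)
    scores = matrix @ vector
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


def toy_embed(batch):
    """Placeholder embedder: letter-frequency vectors, NOT semantic embeddings."""
    out = np.zeros((len(batch), 26))
    for row, text in enumerate(batch):
        for ch in text.lower():
            if "a" <= ch <= "z":
                out[row, ord(ch) - ord("a")] += 1
    return out


texts = ["aaa", "bbb", "aab"]
matrix = build_matrix(texts, toy_embed, batch_size=2)
hits = query(matrix, toy_embed(["aaaa"])[0], k=2)
```

Batching keeps each embedding request small, which matters when the embedder is a remote API with per-request limits. Brute-force search is adequate for a few hundred themes; an index library becomes useful as the collection grows.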

Dimension-Specific Analysis (theme_pairs_by_dimensions/)

File | Description
faiss_thematic_pipeline.py | Builds unified FAISS index across all dimension files.
name_clusters_openai.py | OpenAI-based cluster naming with batch processing.
theme_pair_docx.py | Generates Word documents with paired theme analysis.

Data Flow

Interview JSON (json/)
        │
        ▼
thematic_extraction.py  ──►  Three-framework GPT analysis
        │
        ▼
outputsTimmermansNoBlanks/  ──►  Individual + combined JSON results
        │
        ▼
thematic_view_heirachy_create.py  ──►  Embeddings + FAISS index
        │
        ▼
thematic_view_heirachy_clustering.py  ──►  Hierarchical clusters
        │
        ▼
thematic_view_cluster_theme_analyzer.py  ──►  Named clusters
        │
        ▼
pair_themes.py + theme_lists_clean.py  ──►  Dimension-organized outputs
        │
        ▼
Excel / Word / JSON / PNG exports

Dependencies

  • openai — GPT-4 API for theme extraction and embeddings
  • faiss — Vector similarity search and indexing
  • numpy — Numerical computation
  • scikit-learn — K-means, agglomerative clustering, t-SNE, PCA
  • scipy — Hierarchical clustering and dendrograms
  • matplotlib — Visualization
  • nltk — Sentence tokenization
  • pandas — Data manipulation and Excel export
  • python-docx — Word document generation

Configuration

An OpenAI API key must be set as the environment variable OPENAI_API_KEY_SAFETY.
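
A minimal sketch of reading that variable with a clear failure message is shown below. Only the variable name comes from this README; the helper name `load_api_key` is hypothetical, and the scripts themselves may read the key differently.

```python
import os

ENV_VAR = "OPENAI_API_KEY_SAFETY"  # variable name taken from this README


def load_api_key(env=os.environ) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {ENV_VAR} environment variable before running the scripts."
        )
    return key
```

Passing the environment mapping as a parameter keeps the helper easy to test without touching the real process environment.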

Key parameters used across scripts:

Parameter | Default | Description
Embedding model | text-embedding-3-large | 3,072-dimension embeddings
Cluster count | 50 | For agglomerative and K-means clustering
t-SNE perplexity | 30 | Controls local vs global structure
LLM temperature | 0 | Near-deterministic output for reproducibility (temperature 0 reduces, but does not eliminate, run-to-run variation)
Analysis frameworks | 3 | Default, Timmermans & Tavory, Nelson CGT
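
One way to keep these defaults in a single place is a small frozen dataclass, sketched below. The values are copied from the table; the class name, field names, framework identifiers, and the `validate` helper are hypothetical and do not appear in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Defaults copied from the parameter table; all names are illustrative."""

    embedding_model: str = "text-embedding-3-large"
    embedding_dim: int = 3072
    n_clusters: int = 50
    tsne_perplexity: float = 30.0
    llm_temperature: float = 0.0
    frameworks: tuple = ("default", "timmermans_tavory", "nelson_cgt")

    def validate(self, n_themes: int) -> None:
        """Check the settings against the number of themes to be clustered."""
        if self.n_clusters > n_themes:
            raise ValueError("n_clusters cannot exceed the number of themes")
        # scikit-learn's t-SNE requires perplexity to be below the sample count.
        if self.tsne_perplexity >= n_themes:
            raise ValueError("tsne_perplexity must be less than the number of themes")


config = PipelineConfig()
config.validate(n_themes=390)  # passes for a theme list of this size
```

The check matters mostly when a script is run on a single dimension's subset, where the theme count can fall below the default cluster count.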

Key Outputs

  • themes_list.json — 390+ unique extracted themes
  • themes_embeddings.faiss / themes_metadata.pkl — FAISS index and metadata
  • theme_clusters.*.txt — Clustered theme groupings
  • theme_dendrogram.*.png — Dendrogram visualizations
  • theme_tsne_*.png — t-SNE cluster visualizations
  • nearest_neighbours_*.json — Semantic similarity results
  • excel_export_by_dimension/ — Excel workbooks per dimension
  • theme_pairs_by_dimensions/ — Dimension-organized theme pairs with cluster analysis

License

See the linked paper repository for licensing and citation information.
