AI-Enhanced Thematic Analysis Code

Code repository for the paper "AI-Enhanced Abductive Analysis: Integrating Large Language Models and Semantic Clustering for Qualitative Social Research" by Deborah Rhodes and Simon McCallum (Victoria University of Wellington, New Zealand).

Overview

This repository implements an AI-powered thematic analysis pipeline for qualitative social research. It uses OpenAI's GPT models to extract themes from semi-structured interview data, then applies semantic embeddings and hierarchical clustering to validate and organize the resulting themes.

The pipeline follows Laura Nelson's three-stage Computational Grounded Theory (CGT) framework:

Pattern Detection — Multi-framework theme extraction via LLMs
Pattern Refinement — Human-reviewed organization by research dimensions
Pattern Confirmation — Embedding-based clustering and semantic similarity analysis

The empirical application analyses interview data from 29 early-career dairy farm workers in New Zealand, collected via WhatsApp over 15 days across 5 safety dimensions.

Research Dimensions

Dimension	Interview Days	Focus
Employer Risk Expectation	1, 6, 11	Management safety priorities vs production pressure
Justice (Safety)	2, 7, 12	Fairness in incident handling and investigations
Workers Risk Expectation	3, 8, 13	Worker responsibility and peer safety collaboration
Communication & Danger Tolerance	4, 9, 14	Risk communication and acceptance of unsafe conditions
Trust (in System)	5, 10, 15	Confidence in management, training, and guidance

Repository Structure

Stage 1: Theme Extraction

File	Description
`thematic_extraction.py`	Primary analysis engine. Processes interview JSON through OpenAI using three frameworks: default, Timmermans & Tavory abductive analysis, and Nelson's CGT.
`Theme extraction.py`	Prototype for basic theme extraction with structured JSON output.
`thematic_view.py`	Post-processing to organize and structure extracted themes.

Stage 2: Organization and Pairing

File	Description
`pair_themes.py`	Pairs scenario and safety themes by interaction index, organized by dimension.
`theme_lists_clean.py`	Aggregates themes by research dimension and type (scenario vs safety).
`extract_theme_data.py`	Extracts interview data for specific days from pickle metadata.
`thematic_view_excel.py`	Exports themes and analysis to Excel with structured columns.
`collate_to_word.py`	Exports collated interview data to Word documents.

Stage 3: Embedding, Clustering, and Validation

File	Description
`thematic_view_heirachy_create.py`	Core clustering pipeline — extracts unique themes, generates OpenAI embeddings, builds FAISS index, creates dendrograms and t-SNE visualizations.
`thematic_view_heirachy_clustering.py`	Hierarchical and agglomerative clustering with configurable cluster counts.
`thematic_view_cluster_theme_analyzer.py`	Uses OpenAI to generate human-readable cluster names (1–4 words).
`thematic_veiw_focus_code.nearest.py`	Nearest-neighbour search using FAISS to find semantically similar themes from probe themes.

Utility Scripts (`code/`)

File	Description
`build_index.py`	Builds FAISS embeddings index with batching for large datasets.
`query_index.py`	Query interface for the FAISS index.
`kmeansClustering.py`	K-means clustering on theme embeddings with visualization.
`pca_visualisation.py`	PCA dimensionality reduction and visualization.
`pca_Kmeans_visualisation.py`	Combined PCA and K-means visualization.
`ThemeExtractPass1.py`	Initial theme extraction pass with validation.
`theme_text_validation.py`	Validates extracted themes against source text.
`day_1_6_11_etraction.py`	Extracts data for specific interview days.

Dimension-Specific Analysis (`theme_pairs_by_dimensions/`)

File	Description
`faiss_thematic_pipeline.py`	Builds unified FAISS index across all dimension files.
`name_clusters_openai.py`	OpenAI-based cluster naming with batch processing.
`theme_pair_docx.py`	Generates Word documents with paired theme analysis.

Data Flow

Interview JSON (json/)
        │
        ▼
thematic_extraction.py  ──►  Three-framework GPT analysis
        │
        ▼
outputsTimmermansNoBlanks/  ──►  Individual + combined JSON results
        │
        ▼
thematic_view_heirachy_create.py  ──►  Embeddings + FAISS index
        │
        ▼
thematic_view_heirachy_clustering.py  ──►  Hierarchical clusters
        │
        ▼
thematic_view_cluster_theme_analyzer.py  ──►  Named clusters
        │
        ▼
pair_themes.py + theme_lists_clean.py  ──►  Dimension-organized outputs
        │
        ▼
Excel / Word / JSON / PNG exports

Dependencies

openai — GPT-4 API for theme extraction and embeddings
faiss — Vector similarity search and indexing
numpy — Numerical computation
scikit-learn — K-means, agglomerative clustering, t-SNE, PCA
scipy — Hierarchical clustering and dendrograms
matplotlib — Visualization
nltk — Sentence tokenization
pandas — Data manipulation and Excel export
python-docx — Word document generation

Configuration

An OpenAI API key must be set as the environment variable OPENAI_API_KEY_SAFETY.

Key parameters used across scripts:

Parameter	Default	Description
Embedding model	`text-embedding-3-large`	3,072-dimension embeddings
Cluster count	50	For agglomerative and K-means clustering
t-SNE perplexity	30	Controls local vs global structure
LLM temperature	0	Deterministic output for reproducibility
Analysis frameworks	3	Default, Timmermans & Tavory, Nelson CGT

Key Outputs

themes_list.json — 390+ unique extracted themes
themes_embeddings.faiss / themes_metadata.pkl — FAISS index and metadata
theme_clusters.*.txt — Clustered theme groupings
theme_dendrogram.*.png — Dendrogram visualizations
theme_tsne_*.png — t-SNE cluster visualizations
nearest_neighbours_*.json — Semantic similarity results
excel_export_by_dimension/ — Excel workbooks per dimension
theme_pairs_by_dimensions/ — Dimension-organized theme pairs with cluster analysis

License

See the linked paper repository for licensing and citation information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI-Enhanced Thematic Analysis Code

Overview

Research Dimensions

Repository Structure

Stage 1: Theme Extraction

Stage 2: Organization and Pairing

Stage 3: Embedding, Clustering, and Validation

Utility Scripts (`code/`)

Dimension-Specific Analysis (`theme_pairs_by_dimensions/`)

Data Flow

Dependencies

Configuration

Key Outputs

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
code		code
excel_export_by_dimension		excel_export_by_dimension
theme_pairs_by_dimensions		theme_pairs_by_dimensions
Items clustered by coding similarity.jpg		Items clustered by coding similarity.jpg
README.md		README.md
Thematic_extractionNelsonextension		Thematic_extractionNelsonextension
Theme extraction.py		Theme extraction.py
collate_to_word.py		collate_to_word.py
embeddings.py		embeddings.py
extract_theme_data.py		extract_theme_data.py
extracted_themes_list.txt		extracted_themes_list.txt
pair_themes.py		pair_themes.py
paired_theme_table.py		paired_theme_table.py
thematic_extraction.py		thematic_extraction.py
thematic_veiw_focus_code.nearest.py		thematic_veiw_focus_code.nearest.py
thematic_view.py		thematic_view.py
thematic_view_cluster_extractor.py		thematic_view_cluster_extractor.py
thematic_view_cluster_theme_analyzer.py		thematic_view_cluster_theme_analyzer.py
thematic_view_excel.py		thematic_view_excel.py
thematic_view_heirachy_clustering.py		thematic_view_heirachy_clustering.py
thematic_view_heirachy_create.py		thematic_view_heirachy_create.py
theme_lists_clean.py		theme_lists_clean.py
themes_list.json		themes_list.json

Folders and files

Latest commit

History

Repository files navigation

AI-Enhanced Thematic Analysis Code

Overview

Research Dimensions

Repository Structure

Stage 1: Theme Extraction

Stage 2: Organization and Pairing

Stage 3: Embedding, Clustering, and Validation

Utility Scripts (code/)

Dimension-Specific Analysis (theme_pairs_by_dimensions/)

Data Flow

Dependencies

Configuration

Key Outputs

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Utility Scripts (`code/`)

Dimension-Specific Analysis (`theme_pairs_by_dimensions/`)

Packages