Code repository for the paper "AI-Enhanced Abductive Analysis: Integrating Large Language Models and Semantic Clustering for Qualitative Social Research" by Deborah Rhodes and Simon McCallum (Victoria University of Wellington, New Zealand).
This repository implements an AI-powered thematic analysis pipeline for qualitative social research. It uses OpenAI's GPT models to extract themes from semi-structured interview data, then applies semantic embeddings and hierarchical clustering to validate and organize the resulting themes.
The pipeline follows Laura Nelson's three-stage Computational Grounded Theory (CGT) framework:
- Pattern Detection — Multi-framework theme extraction via LLMs
- Pattern Refinement — Human-reviewed organization by research dimensions
- Pattern Confirmation — Embedding-based clustering and semantic similarity analysis
The empirical application analyses interview data from 29 early-career dairy farm workers in New Zealand, collected via WhatsApp over 15 days across 5 safety dimensions.
| Dimension | Interview Days | Focus |
|---|---|---|
| Employer Risk Expectation | 1, 6, 11 | Management safety priorities vs production pressure |
| Justice (Safety) | 2, 7, 12 | Fairness in incident handling and investigations |
| Workers Risk Expectation | 3, 8, 13 | Worker responsibility and peer safety collaboration |
| Communication & Danger Tolerance | 4, 9, 14 | Risk communication and acceptance of unsafe conditions |
| Trust (in System) | 5, 10, 15 | Confidence in management, training, and guidance |
| File | Description |
|---|---|
thematic_extraction.py |
Primary analysis engine. Processes interview JSON through OpenAI using three frameworks: default, Timmermans & Tavory abductive analysis, and Nelson's CGT. |
Theme extraction.py |
Prototype for basic theme extraction with structured JSON output. |
thematic_view.py |
Post-processing to organize and structure extracted themes. |
| File | Description |
|---|---|
pair_themes.py |
Pairs scenario and safety themes by interaction index, organized by dimension. |
theme_lists_clean.py |
Aggregates themes by research dimension and type (scenario vs safety). |
extract_theme_data.py |
Extracts interview data for specific days from pickle metadata. |
thematic_view_excel.py |
Exports themes and analysis to Excel with structured columns. |
collate_to_word.py |
Exports collated interview data to Word documents. |
| File | Description |
|---|---|
thematic_view_heirachy_create.py |
Core clustering pipeline — extracts unique themes, generates OpenAI embeddings, builds FAISS index, creates dendrograms and t-SNE visualizations. |
thematic_view_heirachy_clustering.py |
Hierarchical and agglomerative clustering with configurable cluster counts. |
thematic_view_cluster_theme_analyzer.py |
Uses OpenAI to generate human-readable cluster names (1–4 words). |
thematic_veiw_focus_code.nearest.py |
Nearest-neighbour search using FAISS to find semantically similar themes from probe themes. |
| File | Description |
|---|---|
build_index.py |
Builds FAISS embeddings index with batching for large datasets. |
query_index.py |
Query interface for the FAISS index. |
kmeansClustering.py |
K-means clustering on theme embeddings with visualization. |
pca_visualisation.py |
PCA dimensionality reduction and visualization. |
pca_Kmeans_visualisation.py |
Combined PCA and K-means visualization. |
ThemeExtractPass1.py |
Initial theme extraction pass with validation. |
theme_text_validation.py |
Validates extracted themes against source text. |
day_1_6_11_etraction.py |
Extracts data for specific interview days. |
| File | Description |
|---|---|
faiss_thematic_pipeline.py |
Builds unified FAISS index across all dimension files. |
name_clusters_openai.py |
OpenAI-based cluster naming with batch processing. |
theme_pair_docx.py |
Generates Word documents with paired theme analysis. |
Interview JSON (json/)
│
▼
thematic_extraction.py ──► Three-framework GPT analysis
│
▼
outputsTimmermansNoBlanks/ ──► Individual + combined JSON results
│
▼
thematic_view_heirachy_create.py ──► Embeddings + FAISS index
│
▼
thematic_view_heirachy_clustering.py ──► Hierarchical clusters
│
▼
thematic_view_cluster_theme_analyzer.py ──► Named clusters
│
▼
pair_themes.py + theme_lists_clean.py ──► Dimension-organized outputs
│
▼
Excel / Word / JSON / PNG exports
openai— GPT-4 API for theme extraction and embeddingsfaiss— Vector similarity search and indexingnumpy— Numerical computationscikit-learn— K-means, agglomerative clustering, t-SNE, PCAscipy— Hierarchical clustering and dendrogramsmatplotlib— Visualizationnltk— Sentence tokenizationpandas— Data manipulation and Excel exportpython-docx— Word document generation
An OpenAI API key must be set as the environment variable OPENAI_API_KEY_SAFETY.
Key parameters used across scripts:
| Parameter | Default | Description |
|---|---|---|
| Embedding model | text-embedding-3-large |
3,072-dimension embeddings |
| Cluster count | 50 | For agglomerative and K-means clustering |
| t-SNE perplexity | 30 | Controls local vs global structure |
| LLM temperature | 0 | Deterministic output for reproducibility |
| Analysis frameworks | 3 | Default, Timmermans & Tavory, Nelson CGT |
themes_list.json— 390+ unique extracted themesthemes_embeddings.faiss/themes_metadata.pkl— FAISS index and metadatatheme_clusters.*.txt— Clustered theme groupingstheme_dendrogram.*.png— Dendrogram visualizationstheme_tsne_*.png— t-SNE cluster visualizationsnearest_neighbours_*.json— Semantic similarity resultsexcel_export_by_dimension/— Excel workbooks per dimensiontheme_pairs_by_dimensions/— Dimension-organized theme pairs with cluster analysis
See the linked paper repository for licensing and citation information.