This is a personal learning project built to explore linguistic data science and syntactic complexity modeling.
- Load real-world `.txt` samples by index
- Visualize Syntactic Complexity Indices (e.g., MLS, CN_C, VP_T)
- Generate classification results with confidence scores
- Explore SHAP waterfall plots to interpret predictions
The classifier uses L2SCA indices computed by TAASSC.
Predictions are supported by SHAP contribution plots, which show how each feature pushes the outcome toward AI or SLW.
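As a rough illustration of what a SHAP waterfall plot encodes: SHAP values are additive, so the explainer's base value plus the per-feature contributions equals the model output for that sample. The sketch below reproduces that running total in plain Python. The function name `waterfall_steps`, the numbers, and the sign convention are all assumptions made for illustration; they are not results or code from this project, and the unit of the output (probability or log-odds) depends on how the explainer is set up.

```python
from typing import Dict, List, Tuple


def waterfall_steps(
    base_value: float, contributions: Dict[str, float]
) -> List[Tuple[str, float, float]]:
    """Return (feature, contribution, running_total) rows.

    Features are ordered by absolute contribution, largest first, and the
    running total starts from the base value.
    """
    ordered = sorted(
        contributions.items(), key=lambda item: abs(item[1]), reverse=True
    )
    rows = []
    running_total = base_value
    for feature, value in ordered:
        running_total += value
        rows.append((feature, value, running_total))
    return rows


# Made-up values for illustration only (hypothetical, not project output).
example_base = 0.5
example_contributions = {"MLS": 0.25, "CN_C": -0.125, "VP_T": 0.0625}

for feature, value, total in waterfall_steps(example_base, example_contributions):
    print(f"{feature:>5}  {value:+.4f}  ->  {total:.4f}")
```

The last running total is the model output for the sample. Which sign corresponds to AI and which to SLW depends on how the labels were encoded.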
The dataset consists of 300 text samples divided into three categories:
- 1–100: Human-written texts by second language writers (SLW)
- 101–200: AI-generated texts using general prompts
- 201–300: AI-generated texts created by prompting large language models (LLMs) to mimic SLW writing style
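The index ranges above can be turned into a small lookup, sketched here. The function names and label strings are hypothetical, and collapsing both AI groups into a single "AI" class is an assumption based on the AI vs. SLW outcome described earlier, not something confirmed by the project code.

```python
# Index ranges come from the list above; the label strings are hypothetical.
CATEGORY_RANGES = [
    (range(1, 101), "SLW"),
    (range(101, 201), "AI (general prompt)"),
    (range(201, 301), "AI (mimicking SLW)"),
]


def category_for_index(index: int) -> str:
    """Return the dataset category for a 1-based sample index."""
    for index_range, label in CATEGORY_RANGES:
        if index in index_range:
            return label
    raise ValueError(f"sample index must be between 1 and 300, got {index}")


def binary_label(index: int) -> str:
    """Collapse the three categories into AI vs. SLW (assumed labeling)."""
    return "SLW" if category_for_index(index) == "SLW" else "AI"
```

For example, `category_for_index(150)` falls in the second range, and `binary_label(250)` maps a mimicry sample to the AI class under the assumption stated above.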
Data preprocessing was done with TAASSC.
- The `.txt` files in `txt_samples/` are included only for demonstration and learning purposes. They are not licensed for reuse, redistribution, or commercial use. See `txt_samples/LICENSE.txt` for full terms.
- The dataset file `X_binary.csv` is private and is not licensed for reuse, redistribution, or modification. It is shared solely for demonstration purposes and should not be used for any other purpose.
This project's code is licensed under the MIT License.