hassonlab/corpus-language-analysis

Sharedness and Frequency Analysis Scripts

This repository contains Python scripts for the language analysis plots of the paper: "The First 1,000 Days (1kD) Project - Collecting and Analyzing an Ultra-Dense Naturalistic Dataset of Human Baby Development".

The scripts are designed to be run from the command line.

Requirements

Install the required Python packages:

pip install numpy pandas polars matplotlib scipy

Sharedness across datasets

This workflow computes sharedness statistics across datasets and then plots the distribution of k, where k is the number of datasets in which a word/type is considered shared relative to a reference. The reference can be CHILDES or any other dataset. The sharedness script expects Parquet input files for each dataset, with one row per token/word observation.

python compute_sharedness.py \
  --datasets dataset1,dataset2,dataset3 \
  --pattern "data/parquet/{name}/*.parquet" \
  --reference-pattern "data/reference/*.parquet" \
  --reference-name reference \
  --outdir results/sharedness \
  --pos-scope all \
  --vocab-criteria lemma \
  --smooth-alpha 0.5 \
  --ratio-max-list 2.0 \
  --reference-min-count-list 10 \
  --z-list 1.96 \
  --min-count-per-dataset 10

Results can be plotted with the plot script:

python plot_sharedness.py \
  --runs-dir results/sharedness/sharedness_runs_all_lemma \
  --run-tag hybrid_alpha0p5_ratio2_refmin10_cmin10_z1p96_statsFalse
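
To make the quantity k concrete, here is a minimal sketch of how a per-word k could be counted. It is not the logic of compute_sharedness.py: the sharedness test used here (an additively smoothed relative-frequency ratio that must stay within a factor of ratio_max, plus a minimum count in the reference) is an assumption suggested by the flag names, the z-score criterion is left out entirely, and all function names and toy counts are hypothetical.

```python
from collections import Counter


def smoothed_freq(count, total, vocab_size, alpha):
    """Additively smoothed relative frequency (assumed smoothing scheme)."""
    return (count + alpha) / (total + alpha * vocab_size)


def is_shared(word, ds, ref, vocab_size, alpha=0.5, ratio_max=2.0, ref_min_count=10):
    """Assumed criterion: frequency ratio to the reference within a factor of ratio_max."""
    if ref.get(word, 0) < ref_min_count:
        return False
    p_ds = smoothed_freq(ds.get(word, 0), sum(ds.values()), vocab_size, alpha)
    p_ref = smoothed_freq(ref.get(word, 0), sum(ref.values()), vocab_size, alpha)
    ratio = p_ds / p_ref
    return 1.0 / ratio_max <= ratio <= ratio_max


def sharedness_k(datasets, ref, **kwargs):
    """For each word, count the datasets in which it passes the sharedness test."""
    vocab = set(ref).union(*datasets.values())
    return {
        word: sum(is_shared(word, ds, ref, len(vocab), **kwargs) for ds in datasets.values())
        for word in vocab
    }


# Toy word counts, invented for illustration only.
reference = {"ball": 100, "dog": 100, "cat": 100}
datasets = {
    "dataset1": {"ball": 90, "dog": 110, "cat": 100},
    "dataset2": {"ball": 100, "dog": 10, "cat": 190},
}

k_per_word = sharedness_k(datasets, reference)
k_distribution = Counter(k_per_word.values())  # how many words have each value of k
print(k_per_word)
print(k_distribution)
```

In this toy example "dog" is far rarer in dataset2 than in the reference, so it counts as shared in only one of the two datasets.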

Cumulative hybrid sharedness

This workflow asks: as more dates are sampled from a dataset, how close does the subset become to the full dataset?

python compute_cumulative_hybrid_sharedness.py \
  --datasets dataset1,dataset2,dataset3 \
  --pattern "data/parquet/{name}/*.parquet" \
  --outdir results/cumulative_hybrid \
  --pos-scope all \
  --vocab-criteria lemma_pos \
  --min-count-full 10 \
  --min-count-subset 10 \
  --date-step 20 \
  --seed 123 \
  --smoothing-alpha 0.5 \
  --ratio-threshold 2.0 \
  --reference-min-expected-count 10.0 \
  --smoothing-mode beta

Results can be plotted with the plot script:

python plot_cumulative_hybrid_sharedness.py \
  --csv results/cumulative_hybrid/cumulative_hybrid_all_lemma_pos_fullmin10_submin10_step20_refemin10.0_ratio2.0_statsFalse_beta.csv
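
The question this workflow asks can be illustrated with a much simpler stand-in. The sketch below shuffles the recording dates with a fixed seed, adds them one at a time, and every date_step dates records what fraction of the full dataset's vocabulary the subset has seen so far. Vocabulary coverage is only an assumed proxy for closeness; the actual script applies its hybrid sharedness criterion instead, and the function name and toy data are hypothetical.

```python
import random


def cumulative_coverage(tokens_by_date, date_step=20, seed=123):
    """Return (number of dates sampled, vocabulary coverage) checkpoints."""
    dates = sorted(tokens_by_date)
    random.Random(seed).shuffle(dates)  # reproducible random order of dates
    full_vocab = {tok for toks in tokens_by_date.values() for tok in toks}
    seen = set()
    curve = []
    for i, date in enumerate(dates, start=1):
        seen.update(tokens_by_date[date])
        # Record a checkpoint every date_step dates and at the very end.
        if i % date_step == 0 or i == len(dates):
            curve.append((i, len(seen) / len(full_vocab)))
    return curve


# Toy data: five dates, each with one unique word plus the shared word "the".
toy = {f"2021-01-0{i}": [f"w{i}", "the"] for i in range(1, 6)}
curve = cumulative_coverage(toy, date_step=2, seed=123)
print(curve)
```

Coverage can only grow as dates are added, and it reaches 1.0 once every date has been sampled.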

Pairwise normalized frequency comparison

This script compares normalized token frequencies between every pair of datasets.

python plot_compared_normalized_frequencies.py \
  --base-dir data/counts \
  --output-dir results/normalized_frequency_comparisons \
  --datasets dataset1 dataset2 dataset3 \
  --part-of-speech all \
  --tokenization lemma \
  --top-k 50000 \
  --factor-band 2

data/counts should contain one subdirectory per dataset, and each subdirectory should include a count JSON file: a dictionary mapping lemmas or terms to their counts.
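
As a rough illustration of the comparison, the sketch below takes two count dictionaries of the kind described above, converts them to relative frequencies, and reports the fraction of common words whose frequencies agree within a given factor. Treating "normalized" as count divided by total count, and reading --factor-band as a symmetric factor around equality, are both assumptions; the function names and toy counts are hypothetical.

```python
def normalize(counts):
    """Convert raw counts to relative frequencies (assumed normalization)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def fraction_within_band(counts_a, counts_b, factor_band=2.0):
    """Fraction of common words whose frequency ratio lies in [1/factor, factor]."""
    freq_a, freq_b = normalize(counts_a), normalize(counts_b)
    common = freq_a.keys() & freq_b.keys()
    if not common:
        return float("nan")
    within = [
        word for word in common
        if 1.0 / factor_band <= freq_a[word] / freq_b[word] <= factor_band
    ]
    return len(within) / len(common)


# Toy count dictionaries, invented for illustration only.
counts_a = {"ball": 50, "dog": 30, "cat": 20}
counts_b = {"ball": 40, "dog": 10, "cat": 50}

fraction = fraction_within_band(counts_a, counts_b, factor_band=2.0)
print(fraction)
```

Here only "ball" has frequencies within a factor of 2 across the two toy datasets, so one third of the common words fall inside the band.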

Zipf-Mandelbrot plots

This script fits Zipf-Mandelbrot curves to token-count distributions.

python plot_zipf_mandelbrot.py \
  --base-dir data/counts \
  --output-dir results/zipf \
  --corpus-names-file config/corpus_names.json \
  --corpus-colors-file config/corpus_colors.json \
  --datasets dataset1 dataset2 dataset3 \
  --tokenizations lemma \
  --part-of-speech all \
  --min-count 10

The corpus names file maps dataset IDs to display labels:

{
  "dataset1": "Dataset 1",
  "dataset2": "Dataset 2"
}

The corpus colors file maps dataset IDs to plot colors:

{
  "dataset1": "red",
  "dataset2": "blue"
}
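
For readers unfamiliar with the curve being fitted, the Zipf-Mandelbrot law models the frequency of the word at rank r as f(r) = C / (r + b)^a. The sketch below fits it to a list of counts with scipy.optimize.curve_fit. Fitting by least squares in log space, the starting values, and the function names are choices made for this sketch; plot_zipf_mandelbrot.py may fit the curve differently.

```python
import numpy as np
from scipy.optimize import curve_fit


def log_zipf_mandelbrot(rank, log_c, a, b):
    """Log of the Zipf-Mandelbrot law f(r) = C / (r + b)**a."""
    return log_c - a * np.log(rank + b)


def fit_zipf_mandelbrot(counts, min_count=10):
    """Fit C, a, b to rank-ordered counts, dropping counts below min_count."""
    freqs = np.sort(np.asarray([c for c in counts if c >= min_count], dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    params, _ = curve_fit(
        log_zipf_mandelbrot,
        ranks,
        np.log(freqs),
        p0=(np.log(freqs[0]), 1.0, 1.0),
        bounds=([-np.inf, 0.0, 0.0], [np.inf, np.inf, np.inf]),  # keep a and b non-negative
    )
    log_c, a, b = params
    return float(np.exp(log_c)), float(a), float(b)


# Synthetic, noise-free counts generated from known parameters.
true_c, true_a, true_b = 1e5, 1.1, 2.7
synthetic = true_c / (np.arange(1, 501) + true_b) ** true_a

c_hat, a_hat, b_hat = fit_zipf_mandelbrot(synthetic, min_count=10)
print(c_hat, a_hat, b_hat)
```

On real token counts the fit will be noisier, especially in the low-frequency tail, which is one reason a minimum count threshold is useful.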
