This repository contains Python scripts for the language analysis plots of the paper: "The First 1,000 Days (1kD) Project - Collecting and Analyzing an Ultra-Dense Naturalistic Dataset of Human Baby Development".
The scripts are designed to be run from the command line.
Install the required Python packages:
pip install numpy pandas polars matplotlib scipy

This workflow computes sharedness statistics across datasets and then plots the distribution of k, where k is the number of datasets in which a word/type is considered shared relative to a reference. The sharedness script expects Parquet input files for each dataset, with one row per token/word observation. The reference can be CHILDES or any other dataset.
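To make the meaning of k concrete, here is a minimal sketch of how a distribution of k could be tallied once each dataset has a set of words judged shared with the reference. The function name `k_distribution` and the toy word sets are hypothetical and are not taken from `compute_sharedness.py`, which derives sharedness from its own criteria.

```python
from collections import Counter


def k_distribution(shared_by_dataset):
    """Count, for each k, how many words are shared in exactly k datasets.

    shared_by_dataset maps a dataset name to the set of words judged shared
    with the reference for that dataset (hypothetical input for illustration).
    """
    k_per_word = Counter()
    for words in shared_by_dataset.values():
        # Each dataset contributes at most 1 to a word's k.
        k_per_word.update(set(words))
    # Histogram over k: key is k, value is the number of words with that k.
    return Counter(k_per_word.values())


# Toy example with three datasets (made-up words, not real results).
shared = {
    "dataset1": {"ball", "dog", "milk"},
    "dataset2": {"ball", "dog"},
    "dataset3": {"ball", "car"},
}
dist = k_distribution(shared)
for k in sorted(dist):
    print(k, dist[k])
```

Words that are shared in no dataset (k = 0) do not appear in this tally, because only words present in at least one shared set are counted.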
python compute_sharedness.py \
--datasets dataset1,dataset2,dataset3 \
--pattern "data/parquet/{name}/*.parquet" \
--reference-pattern "data/reference/*.parquet" \
--reference-name reference \
--outdir results/sharedness \
--pos-scope all \
--vocab-criteria lemma \
--smooth-alpha 0.5 \
--ratio-max-list 2.0 \
--reference-min-count-list 10 \
--z-list 1.96 \
--min-count-per-dataset 10

Results can be plotted using the plot script:
python plot_sharedness.py \
--runs-dir results/sharedness/sharedness_runs_all_lemma \
--run-tag hybrid_alpha0p5_ratio2_refmin10_cmin10_z1p96_statsFalse

This workflow asks: as more dates are sampled from a dataset, how close does the subset become to the full dataset?
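The idea of cumulative date sampling can be illustrated with a simplified sketch. It shuffles dates with a fixed seed, adds them in steps, and reports how much of the full-dataset vocabulary the subset has covered so far. This uses plain vocabulary coverage as a stand-in measure; the actual script applies a hybrid sharedness criterion (smoothing, ratio threshold, minimum counts) that is not reproduced here. The function name `cumulative_coverage` and the toy data are hypothetical.

```python
import random
from collections import Counter


def cumulative_coverage(tokens_by_date, date_step, seed, min_count=1):
    """Return (n_dates, coverage) pairs as dates are added in random order.

    coverage is the fraction of the full-dataset vocabulary (types with at
    least min_count tokens overall) that has appeared in the sampled subset.
    """
    full_counts = Counter()
    for tokens in tokens_by_date.values():
        full_counts.update(tokens)
    full_vocab = {w for w, c in full_counts.items() if c >= min_count}
    if not full_vocab:
        return []

    dates = sorted(tokens_by_date)
    random.Random(seed).shuffle(dates)  # reproducible sampling order

    seen = set()
    curve = []
    for i, date in enumerate(dates, start=1):
        seen.update(tokens_by_date[date])
        # Record a point every date_step dates, and once more at the end.
        if i % date_step == 0 or i == len(dates):
            curve.append((i, len(seen & full_vocab) / len(full_vocab)))
    return curve


# Toy data: one token list per recording date (made-up values).
toy = {
    "2021-01-01": ["ball", "dog"],
    "2021-01-02": ["ball", "milk"],
    "2021-01-03": ["car", "dog"],
    "2021-01-04": ["ball", "ball"],
}
curve = cumulative_coverage(toy, date_step=2, seed=123)
for n_dates, coverage in curve:
    print(n_dates, round(coverage, 3))
```

Because dates are only ever added, the coverage curve is non-decreasing and reaches 1.0 once all dates are included.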
python compute_cumulative_hybrid_sharedness.py \
--datasets dataset1,dataset2,dataset3 \
--pattern "data/parquet/{name}/*.parquet" \
--outdir results/cumulative_hybrid \
--pos-scope all \
--vocab-criteria lemma_pos \
--min-count-full 10 \
--min-count-subset 10 \
--date-step 20 \
--seed 123 \
--smoothing-alpha 0.5 \
--ratio-threshold 2.0 \
--reference-min-expected-count 10.0 \
--smoothing-mode beta

Results can be plotted using the plot script:
python plot_cumulative_hybrid_sharedness.py \
--csv results/cumulative_hybrid/cumulative_hybrid_all_lemma_pos_fullmin10_submin10_step20_refemin10.0_ratio2.0_statsFalse_beta.csv

This script compares normalized token frequencies between every pair of datasets.
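As a rough illustration of a pairwise comparison, the sketch below converts raw counts to relative frequencies and, for every pair of datasets, reports the fraction of common words whose frequencies agree within a multiplicative factor. Reading `--factor-band 2` as "within a factor of 2" is an assumption, and the helper names and toy counts are hypothetical rather than taken from `plot_compared_normalized_frequencies.py`.

```python
from itertools import combinations


def normalize(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def within_factor_band(freq_a, freq_b, factor):
    """Fraction of words common to both datasets whose frequencies differ
    by at most a multiplicative `factor`."""
    common = freq_a.keys() & freq_b.keys()
    if not common:
        return 0.0
    inside = sum(
        1
        for w in common
        if max(freq_a[w], freq_b[w]) <= factor * min(freq_a[w], freq_b[w])
    )
    return inside / len(common)


# Toy counts (made-up values, not real data).
raw = {
    "dataset1": {"ball": 50, "dog": 30, "milk": 20},
    "dataset2": {"ball": 40, "dog": 10, "car": 50},
    "dataset3": {"ball": 10, "dog": 45, "milk": 45},
}
freqs = {name: normalize(counts) for name, counts in raw.items()}

# Compare every unordered pair of datasets.
results = {
    (a, b): within_factor_band(freqs[a], freqs[b], factor=2)
    for a, b in combinations(sorted(freqs), 2)
}
for pair, share in results.items():
    print(pair, round(share, 3))
```

Normalizing first matters because datasets differ in size; comparing raw counts would mostly reflect corpus length rather than word usage.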
python plot_compared_normalized_frequencies.py \
--base-dir data/counts \
--output-dir results/normalized_frequency_comparisons \
--datasets dataset1 dataset2 dataset3 \
--part-of-speech all \
--tokenization lemma \
--top-k 50000 \
--factor-band 2

data/counts should contain one subdirectory per dataset, and each subdirectory should include a count JSON file: a dictionary mapping lemmas or terms to their counts.
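The expected layout can be sketched with a small loader. The file name `counts.json` and the function `load_counts` are placeholders for illustration only; the actual file naming used by the scripts is not specified here, so the sketch simply picks the first JSON file it finds in the dataset's subdirectory.

```python
import json
import tempfile
from pathlib import Path


def load_counts(base_dir, dataset):
    """Load the first JSON file in base_dir/<dataset> as a word -> count dict."""
    dataset_dir = Path(base_dir, dataset)
    json_files = sorted(dataset_dir.glob("*.json"))
    if not json_files:
        raise FileNotFoundError(f"no JSON count file under {dataset_dir}")
    with json_files[0].open(encoding="utf-8") as handle:
        counts = json.load(handle)
    # Coerce to plain str -> int so downstream code can rely on the types.
    return {str(word): int(n) for word, n in counts.items()}


# Build a throwaway directory that mimics the layout, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp, "dataset1")
    dataset_dir.mkdir()
    # "counts.json" is a hypothetical file name.
    (dataset_dir / "counts.json").write_text(
        json.dumps({"ball": 12, "dog": 7}), encoding="utf-8"
    )
    loaded = load_counts(tmp, "dataset1")

print(loaded)
```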
This script fits Zipf-Mandelbrot curves to token-count distributions.
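For orientation, the Zipf-Mandelbrot law models the frequency of the word at rank r as f(r) = C / (r + b)^a. The sketch below fits it with the standard library only: for each candidate b on a grid it runs a least-squares line fit in log space and keeps the best one. This is an illustrative approach under stated assumptions, not necessarily the optimizer used in `plot_zipf_mandelbrot.py`; the function name, grid, and synthetic data are hypothetical.

```python
import math


def fit_zipf_mandelbrot(counts, b_grid):
    """Fit log f(r) = log C - a * log(r + b) by least squares for each b.

    counts: positive token counts in any order (at least two values).
    Returns (a, b, C) for the b in b_grid with the smallest squared error
    in log space.
    """
    freqs = sorted(counts, reverse=True)  # rank 1 is the most frequent item
    y = [math.log(f) for f in freqs]
    n = len(y)
    mean_y = sum(y) / n
    best = None
    for b in b_grid:
        x = [math.log(rank + b) for rank in range(1, n + 1)]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, -slope, b, math.exp(intercept))
    _, a, b, c = best
    return a, b, c


# Synthetic counts generated from known parameters (a=1.1, b=2.7, C=1000).
synthetic = [1000.0 / (rank + 2.7) ** 1.1 for rank in range(1, 201)]
b_grid = [i / 10 for i in range(0, 101)]  # candidate shifts 0.0, 0.1, ..., 10.0
a_hat, b_hat, c_hat = fit_zipf_mandelbrot(synthetic, b_grid)
print(a_hat, b_hat, c_hat)
```

Fitting in log space weights all ranks more evenly than fitting raw counts, which would be dominated by the few most frequent words. On real data, a continuous optimizer such as one from `scipy.optimize` may give a finer estimate of b than a fixed grid.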
python plot_zipf_mandelbrot.py \
--base-dir data/counts \
--output-dir results/zipf \
--corpus-names-file config/corpus_names.json \
--corpus-colors-file config/corpus_colors.json \
--datasets dataset1 dataset2 dataset3 \
--tokenizations lemma \
--part-of-speech all \
--min-count 10

The corpus names file maps dataset IDs to display labels:
{
"dataset1": "Dataset 1",
"dataset2": "Dataset 2"
}

The corpus colors file maps dataset IDs to plot colors:
{
"dataset1": "red",
"dataset2": "blue"
}