Sharedness and Frequency Analysis Scripts

This repository contains Python scripts for the language analysis plots of the paper: "The First 1,000 Days (1kD) Project - Collecting and Analyzing an Ultra-Dense Naturalistic Dataset of Human Baby Development".

The scripts are designed to be run from the command line.

Requirements

Install the required Python packages:

pip install numpy pandas polars matplotlib scipy

Sharedness across datasets

This workflow computes sharedness statistics across datasets and then plots the distribution of k, where k is the number of datasets in which a word/type is considered shared relative to a reference. The sharedness script expects Parquet input files for each dataset, with one row per token/word observation. The reference can be CHILDES or any other dataset.

python compute_sharedness.py \
  --datasets dataset1,dataset2,dataset3 \
  --pattern "data/parquet/{name}/*.parquet" \
  --reference-pattern "data/reference/*.parquet" \
  --reference-name reference \
  --outdir results/sharedness \
  --pos-scope all \
  --vocab-criteria lemma \
  --smooth-alpha 0.5 \
  --ratio-max-list 2.0 \
  --reference-min-count-list 10 \
  --z-list 1.96 \
  --min-count-per-dataset 10
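
To make the meaning of k concrete, here is a minimal sketch. It is not the code in compute_sharedness.py. It assumes that a word counts as shared when its additively smoothed relative frequency lies within a factor ratio_max of the reference frequency and both counts pass the minimum thresholds; the z-based criterion is left out. The function names and the smoothing formula are illustrative assumptions.

```python
from collections import Counter


def smoothed_rel_freq(counts, word, vocab_size, alpha):
    """Additively smoothed relative frequency (assumed smoothing scheme)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)


def sharedness_k(datasets, reference, alpha=0.5, ratio_max=2.0,
                 ref_min_count=10, min_count=10):
    """Return {word: k}, where k is the number of datasets sharing the word."""
    vocab = set(reference)
    for counts in datasets.values():
        vocab.update(counts)
    vocab_size = len(vocab)

    k = Counter()
    for word, ref_count in reference.items():
        if ref_count < ref_min_count:
            continue  # too rare in the reference to judge
        k[word] = 0
        ref_f = smoothed_rel_freq(reference, word, vocab_size, alpha)
        for counts in datasets.values():
            if counts.get(word, 0) < min_count:
                continue  # too rare in this dataset
            f = smoothed_rel_freq(counts, word, vocab_size, alpha)
            # Symmetric ratio: how many times larger is the bigger frequency?
            if max(f / ref_f, ref_f / f) <= ratio_max:
                k[word] += 1
    return dict(k)


# Toy counts, invented for illustration only.
reference = {"ball": 100, "dog": 50, "the": 850}
datasets = {
    "dataset1": {"ball": 12, "dog": 5, "the": 83},
    "dataset2": {"ball": 40, "dog": 10, "the": 50},
}
k_values = sharedness_k(datasets, reference)
```

In this toy example "the" is shared by both datasets (k = 2), "ball" by one (k = 1), and "dog" by none (k = 0). A histogram of such k values is the kind of distribution the plot script visualizes.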

Results can be plotted using the plot script:

python plot_sharedness.py \
  --runs-dir results/sharedness/sharedness_runs_all_lemma \
  --run-tag hybrid_alpha0p5_ratio2_refmin10_cmin10_z1p96_statsFalse

Cumulative hybrid sharedness

This workflow asks: as more dates are sampled from a dataset, how close does the subset become to the full dataset?

python compute_cumulative_hybrid_sharedness.py \
  --datasets dataset1,dataset2,dataset3 \
  --pattern "data/parquet/{name}/*.parquet" \
  --outdir results/cumulative_hybrid \
  --pos-scope all \
  --vocab-criteria lemma_pos \
  --min-count-full 10 \
  --min-count-subset 10 \
  --date-step 20 \
  --seed 123 \
  --smoothing-alpha 0.5 \
  --ratio-threshold 2.0 \
  --reference-min-expected-count 10.0 \
  --smoothing-mode beta
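
The sampling idea can be sketched as follows. This is a simplified illustration, not the script's method: it shuffles the dates with a fixed seed, accumulates them in steps of date_step, and measures closeness as the fraction of the full-dataset vocabulary that the subset recovers. The actual script applies a hybrid sharedness criterion with ratio thresholds and smoothing, which is omitted here. The function name and the input layout (a dict from date to token list) are assumptions.

```python
import random
from collections import Counter


def cumulative_coverage(tokens_by_date, date_step=20, seed=123,
                        min_count_full=10, min_count_subset=10):
    """Return [(n_dates, coverage)] as randomly ordered dates accumulate."""
    full = Counter()
    for tokens in tokens_by_date.values():
        full.update(tokens)
    full_vocab = {w for w, c in full.items() if c >= min_count_full}

    # Sort first so the shuffle is reproducible for a given seed.
    dates = sorted(tokens_by_date)
    random.Random(seed).shuffle(dates)

    results = []
    running = Counter()
    for i, date in enumerate(dates, start=1):
        running.update(tokens_by_date[date])
        # Evaluate every date_step dates, and once more at the very end.
        if i % date_step == 0 or i == len(dates):
            subset_vocab = {w for w, c in running.items()
                            if c >= min_count_subset}
            if full_vocab:
                coverage = len(subset_vocab & full_vocab) / len(full_vocab)
            else:
                coverage = 0.0
            results.append((i, coverage))
    return results


# Toy data, invented for illustration only.
toy = {
    "2024-01-01": ["a", "a", "b"],
    "2024-01-02": ["a", "c"],
    "2024-01-03": ["b", "c"],
    "2024-01-04": ["d"],
}
curve = cumulative_coverage(toy, date_step=2, min_count_full=2,
                            min_count_subset=2)
```

When both minimum counts are equal, the coverage reaches 1.0 once every date has been added, and it never decreases along the way.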

Results can be plotted using the plot script:

python plot_cummulative_hybrid_shardness.py \
  --csv results/cumulative_hybrid/cumulative_hybrid_all_lemma_pos_fullmin10_submin10_step20_refemin10.0_ratio2.0_statsFalse_beta.csv

Pairwise normalized frequency comparison

This script compares normalized token frequencies between every pair of datasets.

python plot_compared_normalized_frequencies.py \
  --base-dir data/counts \
  --output-dir results/normalized_frequency_comparisons \
  --datasets dataset1 dataset2 dataset3 \
  --part-of-speech all \
  --tokenization lemma \
  --top-k 50000 \
  --factor-band 2

data/counts should contain one subdirectory per dataset, and each subdirectory should include a count JSON file: a dictionary mapping lemmas or terms to their counts.
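
As a rough illustration of the comparison, the sketch below normalizes two count dictionaries to relative frequencies and flags, for each word present in both, whether the two frequencies agree within a factor band. It assumes that "normalized" means count divided by the total count and that --factor-band 2 means a ratio between 1/2 and 2; neither is confirmed by the script itself. The JSON strings stand in for the contents of two count files, and the function names are hypothetical.

```python
import json


def normalize(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def within_factor_band(freq_a, freq_b, factor=2.0):
    """For words in both inputs, report whether the ratio lies in the band."""
    result = {}
    for word in freq_a.keys() & freq_b.keys():
        ratio = freq_a[word] / freq_b[word]
        result[word] = (1.0 / factor) <= ratio <= factor
    return result


# Inline JSON standing in for two count files (toy values).
counts_a = json.loads('{"ball": 30, "dog": 10, "milk": 60}')
counts_b = json.loads('{"ball": 25, "dog": 50, "milk": 25}')

band = within_factor_band(normalize(counts_a), normalize(counts_b))
```

Here only "ball" falls inside the band (0.30 against 0.25), while "dog" and "milk" differ by more than a factor of 2.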

Zipf-Mandelbrot plots

This script fits Zipf-Mandelbrot curves to token-count distributions.

python plot_zipf_mandelbrot.py \
  --base-dir data/counts \
  --output-dir results/zipf \
  --corpus-names-file config/corpus_names.json \
  --corpus-colors-file config/corpus_colors.json \
  --datasets dataset1 dataset2 dataset3 \
  --tokenizations lemma \
  --part-of-speech all \
  --min-count 10

The corpus names file maps dataset IDs to display labels:

{
  "dataset1": "Dataset 1",
  "dataset2": "Dataset 2"
}

The corpus colors file maps dataset IDs to plot colors:

{
  "dataset1": "red",
  "dataset2": "blue"
}
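
For readers unfamiliar with the Zipf-Mandelbrot law, it models the frequency of the word at rank r as f(r) = C / (r + b)^a. The sketch below fits a, b, and C using only the standard library. It is not the fitting routine in plot_zipf_mandelbrot.py, which may well use scipy. The approach is a grid search over b: for a fixed b, log f is linear in log(r + b), so a and log C follow from ordinary least squares. The function name and the grid are assumptions.

```python
import math


def fit_zipf_mandelbrot(counts, b_grid=None):
    """Fit f(r) = C / (r + b)**a to positive counts; return (a, b, C)."""
    freqs = sorted(counts, reverse=True)  # rank 1 is the most frequent
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    mean_y = sum(ys) / n
    if b_grid is None:
        b_grid = [i * 0.1 for i in range(101)]  # candidate b in [0, 10]

    best = None
    for b in b_grid:
        xs = [math.log(r + b) for r in range(1, n + 1)]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        # Squared error in log space for this candidate b.
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, -slope, b, math.exp(intercept))

    _, a, b, c = best
    return a, b, c


# Synthetic counts generated from known parameters, to check recovery.
synthetic = [1000.0 / (r + 2.7) ** 1.1 for r in range(1, 201)]
a_hat, b_hat, c_hat = fit_zipf_mandelbrot(synthetic)
```

On noise-free synthetic data the fit recovers the generating parameters (a = 1.1, b = 2.7, C = 1000) up to floating-point error. On real counts, fitting in log space gives the low-frequency tail more weight than a fit on raw counts would.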
