6. Input Data Guidelines
This page describes recommended data formatting and preprocessing steps for gcPCA.
Following these guidelines helps ensure stable and interpretable results.
gcPCA requires two datasets:
- Ra — Condition A
- Rb — Condition B
Both datasets must have the following format:
- Rows = samples
- Columns = features
Matrix shapes:
- Ra: (ma × p)
- Rb: (mb × p)
Where:
- ma, mb = number of samples
- p = number of shared features
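As an illustrative sketch (NumPy with random data; the dimensions are made up), the expected shapes look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

ma, mb, p = 120, 80, 30          # samples in A, samples in B, shared features
Ra = rng.normal(size=(ma, p))    # condition A: rows = samples, columns = features
Rb = rng.normal(size=(mb, p))    # condition B: same features, different sample count

# Both matrices share the feature dimension; sample counts may differ
assert Ra.shape[1] == Rb.shape[1]
```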
Neuroscience example:
- Rows → trials or time points
- Columns → neurons
Ra = trials × neurons (task condition)
Rb = trials × neurons (baseline condition)
Genomics example:
- Rows → cells
- Columns → genes
Ra = cells × genes (disease)
Rb = cells × genes (control)
Mismatched features between Ra and Rb are one of the most common sources of errors.
Requirements:
- Same number of features
- Same feature order
- Same preprocessing pipeline
Incorrect examples:
- Different neuron ordering
- Missing neurons in one dataset
- Different gene sets
gcPCA assumes each column corresponds to the same feature in both datasets. Having a different number of samples across datasets is fine.
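A minimal sketch of aligning features before analysis, assuming each dataset comes with its own list of feature labels (the `gene_*` names and tiny matrices are hypothetical):

```python
import numpy as np

# Hypothetical feature labels recorded for each dataset
features_a = ["gene_a", "gene_b", "gene_c"]
features_b = ["gene_b", "gene_a"]            # different set and order

Ra = np.arange(6.0).reshape(2, 3)            # 2 samples x 3 features
Rb = np.arange(4.0).reshape(2, 2)            # 2 samples x 2 features

# Keep only shared features, reordered identically in both matrices
shared = [f for f in features_a if f in features_b]
Ra_aligned = Ra[:, [features_a.index(f) for f in shared]]
Rb_aligned = Rb[:, [features_b.index(f) for f in shared]]

# Columns now correspond one-to-one across conditions
assert Ra_aligned.shape[1] == Rb_aligned.shape[1]
```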
gcPCA operates on covariance structure, so preprocessing can affect results. Recommended steps:
- Mean-center features
- Z-score features
The Python and R implementations normalize by default. In MATLAB, normalization is controlled with optional parameters.
Users may want to disable normalization when:
- Data already normalized
- Working with firing rates or standardized signals
- Using PCA-reduced data as input
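If you need to normalize manually, a simple z-scoring helper might look like this (a sketch, not part of the gcPCA package):

```python
import numpy as np

def zscore_columns(X):
    """Mean-center each feature (column) and scale it to unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0            # guard against constant features
    return (X - mu) / sd

rng = np.random.default_rng(1)
Ra_raw = rng.normal(loc=5.0, scale=3.0, size=(100, 10))
Ra = zscore_columns(Ra_raw)      # each column now has mean 0, variance 1
```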
Applying PCA before gcPCA is not required, but may be helpful when:
- Feature dimensionality is extremely high
- Small variance dimensions dominate
- Numerical stability is a concern
In this case:
1. Apply PCA
2. Use PCA scores as input to gcPCA
This reduces dimensionality while preserving covariance structure.
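The two steps above can be sketched with a plain SVD-based PCA (NumPy only; the 50-component cutoff and the data dimensions are arbitrary examples). Fitting one basis on the pooled data keeps both conditions in the same subspace:

```python
import numpy as np

rng = np.random.default_rng(2)
Ra = rng.normal(size=(60, 500))            # condition A, p >> n
Rb = rng.normal(size=(40, 500))            # condition B

# Fit a single PCA basis on the pooled, mean-centered data
pooled = np.vstack([Ra, Rb])
mu = pooled.mean(axis=0)
_, _, Vt = np.linalg.svd(pooled - mu, full_matrices=False)
V = Vt[:50].T                              # top 50 principal directions

Ra_scores = (Ra - mu) @ V                  # (60, 50): use as gcPCA input
Rb_scores = (Rb - mu) @ V                  # (40, 50)
```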
gcPCA works well with:
- Moderate to large sample sizes
- High-dimensional datasets
gcPCA can handle:
- Different sample sizes between conditions
- p >> n settings (more features than samples)
Balancing sample sizes between conditions is good practice, but not required.
gcPCA does not support missing values.
Before running gcPCA:
- Remove samples with missing values
- Impute missing values
- Interpolate if appropriate
No NaN values are allowed in Ra or Rb.
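Two of the options above, sketched in NumPy on a tiny made-up matrix:

```python
import numpy as np

Ra = np.array([[1.0, 2.0],
               [np.nan, 4.0],
               [5.0, 6.0]])

# Option 1: drop any sample (row) containing a NaN
Ra_drop = Ra[~np.isnan(Ra).any(axis=1)]

# Option 2: impute NaNs with the feature (column) mean
col_means = np.nanmean(Ra, axis=0)
Ra_imputed = np.where(np.isnan(Ra), col_means, Ra)

assert not np.isnan(Ra_drop).any()
assert not np.isnan(Ra_imputed).any()
```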
Because gcPCA analyzes covariance:
- Feature scaling affects results
- Large-scale features dominate
Examples:
- firing rates vs normalized activity
- gene counts vs log-transformed counts
Normalization helps ensure balanced contributions.
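A small illustration of scale dominance, using made-up firing-rate and normalized-activity features:

```python
import numpy as np

rng = np.random.default_rng(3)
rates = rng.normal(loc=50.0, scale=20.0, size=200)   # raw firing rates
activ = rng.normal(loc=0.5, scale=0.1, size=200)     # normalized activity

X = np.column_stack([rates, activ])
var_raw = X.var(axis=0)          # the large-scale feature dominates variance

Xz = (X - X.mean(axis=0)) / X.std(axis=0)
# after z-scoring, every feature has unit variance, so contributions balance
```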
Do not use different features between conditions, such as:
- different neurons
- different genes
- different channels
This produces incorrect results, because columns no longer correspond across datasets.
Inconsistent preprocessing also causes problems. Example:
- Ra normalized
- Rb not normalized
This introduces artificial differences between the conditions.
Small sample sizes can produce:
- noisy components
- unstable loadings
If noise dominates, gcPCs become harder to interpret; consider smoothing or additional preprocessing.
Checklist before running gcPCA:
- Same features in Ra and Rb
- Samples in rows
- Features in columns
- No missing values
- Normalized or appropriately scaled
- Sufficient sample size
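The checklist can be automated with a small validation helper (a hypothetical sketch, not part of the gcPCA package):

```python
import numpy as np

def check_gcpca_inputs(Ra, Rb):
    """Hypothetical pre-flight check mirroring the checklist above."""
    if Ra.ndim != 2 or Rb.ndim != 2:
        raise ValueError("inputs must be 2-D (samples x features)")
    if Ra.shape[1] != Rb.shape[1]:
        raise ValueError("Ra and Rb must share the same features")
    if np.isnan(Ra).any() or np.isnan(Rb).any():
        raise ValueError("missing values are not supported")
    return True

rng = np.random.default_rng(4)
ok = check_gcpca_inputs(rng.normal(size=(50, 10)), rng.normal(size=(30, 10)))
```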
gcPCA works best when datasets:
- Share the same features
- Are consistently preprocessed
- Contain sufficient samples
- Are properly normalized
Following these guidelines improves interpretability and numerical stability.