6. Input Data Guidelines

Eliezyer edited this page Mar 29, 2026 · 2 revisions

This page describes recommended data formatting and preprocessing steps for gcPCA.
Following these guidelines helps ensure stable and interpretable results.


Data Format Requirements

gcPCA requires two datasets:

  • Ra — Condition A
  • Rb — Condition B

Both datasets must have the following format:

  • Rows = samples
  • Columns = features

Matrix shapes:

  • Ra: (ma × p)
  • Rb: (mb × p)

Where:

  • ma, mb = number of samples
  • p = number of shared features
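The shape requirements above can be sketched with numpy; the dimensions here (120 and 80 samples, 50 features) are arbitrary illustrative values:

```python
import numpy as np

# Hypothetical dimensions for illustration.
ma, mb, p = 120, 80, 50   # samples in A, samples in B, shared features

rng = np.random.default_rng(0)
Ra = rng.standard_normal((ma, p))  # condition A: samples x features
Rb = rng.standard_normal((mb, p))  # condition B: samples x features

# The feature dimension (columns) must match; sample counts may differ.
assert Ra.shape[1] == Rb.shape[1]
```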

Example

Neuroscience example:

  • Rows → trials or time points
  • Columns → neurons

Ra = trials × neurons (task condition)
Rb = trials × neurons (baseline condition)

Genomics example:

  • Rows → cells
  • Columns → genes

Ra = cells × genes (disease)
Rb = cells × genes (control)

Matching Features Between Conditions

Mismatched features are one of the most common sources of error.

Requirements:

  • Same number of features
  • Same feature order
  • Same preprocessing pipeline

Incorrect examples:

  • Different neuron ordering
  • Missing neurons in one dataset
  • Different gene sets

gcPCA assumes each column corresponds to the same feature in both datasets. It is fine for the two datasets to have a different number of samples.
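One way to guard against feature mismatch is to align both datasets to a shared feature list before running gcPCA. The identifiers and arrays below are made up for illustration:

```python
import numpy as np

# Hypothetical feature identifiers (e.g., neuron or gene IDs) per dataset.
feats_a = ["n1", "n2", "n3", "n4"]
feats_b = ["n4", "n2", "n1", "n5"]   # different order, partly different set

Ra = np.arange(12).reshape(3, 4).astype(float)  # 3 samples x 4 features
Rb = np.arange(8).reshape(2, 4).astype(float)   # 2 samples x 4 features

# Keep only shared features, reordered into one common column order.
shared = [f for f in feats_a if f in set(feats_b)]
ia = [feats_a.index(f) for f in shared]
ib = [feats_b.index(f) for f in shared]

Ra_aligned = Ra[:, ia]
Rb_aligned = Rb[:, ib]
assert Ra_aligned.shape[1] == Rb_aligned.shape[1]
```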


Normalization and Preprocessing

gcPCA operates on covariance structure, so preprocessing can affect results.

Recommended

  • Mean-center features
  • Z-score features

Python and R implementations perform normalization by default.

MATLAB normalization can be controlled with optional parameters.
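If you preprocess manually, apply the same per-feature z-scoring to both datasets. A minimal sketch (this helper is illustrative, not the library's built-in routine):

```python
import numpy as np

def zscore_features(X):
    """Mean-center each column (feature) and scale it to unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # guard against constant features
    return (X - mu) / sd

rng = np.random.default_rng(1)
Ra = zscore_features(rng.standard_normal((100, 20)) * 5 + 3)
Rb = zscore_features(rng.standard_normal((60, 20)) * 5 + 3)

# Each feature is now centered with unit variance in both datasets.
assert np.allclose(Ra.mean(axis=0), 0) and np.allclose(Ra.std(axis=0), 1)
```

Applying the same function to both `Ra` and `Rb` is what keeps the preprocessing pipeline identical across conditions.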


When to Use Custom Normalization

Users may want to disable normalization when:

  • The data are already normalized
  • The data are firing rates or other standardized signals
  • PCA-reduced data are used as input

Using PCA Before gcPCA

Applying PCA before gcPCA is not required, but may be helpful when:

  • Feature dimensionality is extremely high
  • Small variance dimensions dominate
  • Numerical stability is a concern

In this case:

1. Fit a single PCA basis, so that columns remain matched between the two datasets
2. Use the PCA scores as input to gcPCA

This reduces dimensionality while preserving covariance structure.
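One way to do this reduction (a sketch, not the library's internal procedure) is to fit one SVD basis on the pooled, centered data and project both datasets onto it, so the reduced columns still correspond across conditions. The dimensions and `k = 50` cutoff are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
Ra = rng.standard_normal((100, 500))   # p >> n example
Rb = rng.standard_normal((80, 500))

# Center with the pooled mean and fit ONE basis on the combined data,
# so both datasets are projected into the same reduced space.
mu = np.vstack([Ra, Rb]).mean(axis=0)
Xc = np.vstack([Ra - mu, Rb - mu])
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 50                  # number of components to keep (assumption)
basis = Vt[:k].T        # p x k projection matrix

Ra_red = (Ra - mu) @ basis
Rb_red = (Rb - mu) @ basis
```

Fitting the basis separately per dataset would break the column correspondence that gcPCA assumes.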


Sample Size Considerations

gcPCA works well with:

  • Moderate to large sample sizes
  • High-dimensional datasets

gcPCA can handle:

  • Different sample sizes between conditions
  • p >> n settings (more features than samples)

Balancing sample sizes between conditions is good practice, but not required.


Handling Missing Data

gcPCA does not support missing values.

Before running gcPCA, handle missing values in one of these ways:

  • Remove samples (rows) with missing values
  • Impute missing values
  • Interpolate, if appropriate for the data

No NaN values are allowed in Ra or Rb.
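The first two options can be sketched with numpy (the small matrix here is made up for illustration):

```python
import numpy as np

Ra = np.array([[1.0, 2.0],
               [np.nan, 4.0],
               [5.0, 6.0]])

# Option 1: drop samples (rows) containing any NaN.
Ra_drop = Ra[~np.isnan(Ra).any(axis=1)]

# Option 2: impute with the per-feature mean of the observed values.
col_mean = np.nanmean(Ra, axis=0)
Ra_imp = np.where(np.isnan(Ra), col_mean, Ra)

assert not np.isnan(Ra_drop).any() and not np.isnan(Ra_imp).any()
```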

Scaling and Units

Because gcPCA analyzes covariance:

  • Feature scaling affects results
  • Features measured on larger scales dominate the components

Examples:

  • firing rates vs normalized activity
  • gene counts vs log-transformed counts

Normalization helps ensure balanced contributions.


Common Pitfalls

Feature Mismatch

Different features between conditions:

  • different neurons
  • different genes
  • different channels

This produces incorrect results.


Unequal Preprocessing

Example:

  • Ra normalized
  • Rb not normalized

This introduces artificial differences.


Too Few Samples

Small sample sizes can produce:

  • noisy components
  • unstable loadings

Highly Noisy Data

If noise dominates:

  • gcPCs become harder to interpret
  • consider smoothing or preprocessing

Quick Checklist

Before running gcPCA:

  • Same features in Ra and Rb
  • Samples in rows
  • Features in columns
  • No missing values
  • Normalized or appropriately scaled
  • Sufficient sample size
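The checklist can be turned into a quick sanity check before calling gcPCA. This helper is illustrative, not part of the library:

```python
import numpy as np

def check_inputs(Ra, Rb):
    """Sanity checks mirroring the checklist above (illustrative helper)."""
    assert Ra.ndim == 2 and Rb.ndim == 2, "inputs must be 2-D (samples x features)"
    assert Ra.shape[1] == Rb.shape[1], "feature counts must match"
    assert not np.isnan(Ra).any() and not np.isnan(Rb).any(), "no NaNs allowed"
    assert min(Ra.shape[0], Rb.shape[0]) > 1, "need more than one sample"
    return True

rng = np.random.default_rng(3)
assert check_inputs(rng.standard_normal((30, 10)), rng.standard_normal((20, 10)))
```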

Summary

gcPCA works best when datasets:

  • Share the same features
  • Are consistently preprocessed
  • Contain sufficient samples
  • Are properly normalized

Following these guidelines improves interpretability and numerical stability.

Links to Other Pages

1. Quickstart Guide
2. Installation
3. Conceptual Overview
4. Mathematical Formulation
5. Code Reference
7. Interpreting Results