Skip to content

generate_corr() produces invalid correlation matrix for k > 2 analyses #48

@keaven

Description

@keaven

Bug description

generate_corr() produces an invalid correlation matrix (with entries > 1 and incorrect 0s) when the number of analyses k > 2. The function works correctly for k = 2.

Root cause

The within-hypothesis across-analysis loop only fills entries relative to analysis 1, and uses the wrong event count for non-adjacent analysis pairs:

# Current code (line ~14-20 of generate_corr):
for (i in 1:n_hypotheses) {
  for (j in 2:n_analyses) {
    count <- as.numeric(event %>% filter(H1 == i & H2 == i & Analysis == j - 1) %>% select(Event))
    D[i, n_hypotheses * (j - 1) + i] <- count
    D[n_hypotheses * (j - 1) + i, i] <- count
  }
}

Problem 1: This loop only fills D[row_at_A1, col_at_Aj] — it never fills entries between non-first analyses (e.g., D[H1_A2, H1_A3] is left as 0).

Problem 2: For j = 3 (analysis 3), Analysis == j - 1 fetches events at analysis 2, but the correct value for D[H1_A1, H1_A3] should be events at min(1, 3) = analysis 1.

Reproducible example

library(wpgsd)
library(tibble)

# 2 hypotheses, 3 analyses
event <- tribble(
  ~H1, ~H2, ~Analysis, ~Event,
  1, 1, 1, 50,   2, 2, 1, 80,   1, 2, 1, 50,
  1, 1, 2, 80,   2, 2, 2, 130,  1, 2, 2, 80,
  1, 1, 3, 100,  2, 2, 3, 170,  1, 2, 3, 100
)

corr <- generate_corr(event)
round(corr, 3)
#>       H1_A1 H2_A1 H1_A2 H2_A2 H1_A3 H2_A3
#> [1,]  1.000 0.791 0.791 0.620 1.131 0.542   # <-- 1.131 > 1 (invalid!)
#> [2,]  0.791 1.000 0.625 0.784 0.559 1.115   # <-- 1.115 > 1 (invalid!)
#> [3,]  0.791 0.625 1.000 0.784 0.000 0.686   # <-- 0.000 (should be ~0.894)
#> [4,]  0.620 0.784 0.784 1.000 0.702 0.000   # <-- 0.000 (should be ~0.875)
#> [5,]  1.131 0.559 0.000 0.702 1.000 0.767
#> [6,]  0.542 1.115 0.686 0.000 0.767 1.000

# Entries > 1 make this invalid as a correlation matrix
# Entries = 0 between H1_A2<->H1_A3 are wrong (should be sqrt(80/100) = 0.894)

Proposed fix

Replace the within-hypothesis loop with one that fills all analysis pairs using events at min(s, t):

for (i in 1:n_hypotheses) {
  for (s in 1:n_analyses) {
    for (t in 1:n_analyses) {
      if (s == t) next
      count <- as.numeric(event %>%
        filter(H1 == i & H2 == i & Analysis == min(s, t)) %>%
        select(Event))
      row_idx <- n_hypotheses * (s - 1) + i
      col_idx <- n_hypotheses * (t - 1) + i
      D[row_idx, col_idx] <- count
    }
  }
}

With this fix, the same example produces a valid positive-definite correlation matrix:

      H1_A1  H2_A1  H1_A2  H2_A2  H1_A3  H2_A3
[1,] 1.0000 0.7906 0.7906 0.6202 0.7071 0.5423
[2,] 0.7906 1.0000 0.6250 0.7845 0.5590 0.6860
[3,] 0.7906 0.6250 1.0000 0.7845 0.8944 0.6860
[4,] 0.6202 0.7845 0.7845 1.0000 0.7016 0.8745
[5,] 0.7071 0.5590 0.8944 0.7016 1.0000 0.7670
[6,] 0.5423 0.6860 0.6860 0.8745 0.7670 1.0000

Impact

This bug affects any use of generate_corr() followed by generate_bounds() for designs with 3 or more analyses. The existing tests only cover k = 2, so the bug was not caught.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions