ldiversity() underestimates distinct l-diversity when quasi-identifiers contain missing values #363

@MuellerRoman

Dear sdcMicro maintainers,

When computing distinct l-diversity with ldiversity() in sdcMicro, groups in which one or more quasi-identifiers contain missing values appear to receive l-diversity values that are too low.

Minimal reproducible example:

library(sdcMicro)

## create test data
data <- data.frame(
    sex = c(
        "female","female","female",   # EC1 (problematic)
        "male","male",                # EC2 (ok)
        "female","female"             # EC3 (ok)
    ),
    occupation = c(
        NA, NA, NA,                   # EC1: missing QI
        "teacher","teacher",          # EC2
        "nurse","nurse"               # EC3
    ),
    ethnicity = c(
        "other","other","other",       # EC1
        "other","other",               # EC2
        "majority","majority"          # EC3
    ),
    sensitive = c(
        1, 1, 0,                       # EC1 → two distinct values
        1, 0,                          # EC2 → two distinct values
        0, 1                           # EC3 → two distinct values
    )
)

# quasi-identifier variables
qi_vars   <- c("sex", "occupation", "ethnicity")

# create sdc object
sdcObj <- createSdcObj(data,
                       keyVars = qi_vars,
                       sensibleVar = "sensitive")

# compute l-diversity
ldiv_res <- ldiversity(sdcObj)

# extract l-diversity values
ldiv_res <- head(ldiv_res@risk$ldiversity, nrow(data))

# join quasi-identifier information
ldiv_res <- cbind(data, ldiv_res)
print(ldiv_res[, 1:5])

     sex occupation ethnicity sensitive sensitive_Distinct_Ldiversity
1 female       <NA>     other         1                             1
2 female       <NA>     other         1                             1
3 female       <NA>     other         0                             1
4   male    teacher     other         1                             2
5   male    teacher     other         0                             2
6 female      nurse  majority         0                             2
7 female      nurse  majority         1                             2

The three individuals with sex = "female", occupation = NA, ethnicity = "other" have sensitive values 1, 1 and 0, i.e. two distinct values, so the expected distinct l-diversity for this group is 2. However, according to the ldiversity() output, the l-diversity for this group is $l = 1$.
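
For comparison, the expected values can be checked by hand in base R without sdcMicro. This is only a sketch of what I would expect, not of how sdcMicro computes it internally. It assumes that NA is treated as its own category when forming equivalence classes. In this data set no other row shares sex = "female" and ethnicity = "other", so treating NA as a wildcard should give the same grouping for EC1. The names qi_keys, ec_id and expected_l are made up for this illustration.

```r
# rebuild the test data from the example above
data <- data.frame(
    sex        = c("female","female","female","male","male","female","female"),
    occupation = c(NA, NA, NA, "teacher","teacher","nurse","nurse"),
    ethnicity  = c("other","other","other","other","other","majority","majority"),
    sensitive  = c(1, 1, 0, 1, 0, 0, 1)
)
qi_vars <- c("sex", "occupation", "ethnicity")

# assumption: NA counts as its own category, so rows with the same
# pattern (including NA) end up in the same equivalence class
qi_keys <- lapply(data[qi_vars], function(x) ifelse(is.na(x), "<NA>", as.character(x)))
ec_id   <- interaction(qi_keys, drop = TRUE)

# distinct l-diversity: number of distinct sensitive values per class
expected_l <- ave(data$sensitive, ec_id, FUN = function(v) length(unique(v)))
print(cbind(data, expected_l))
```

Under that assumption all three equivalence classes, including the one with the missing occupation, contain two distinct sensitive values.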

Suspected location of the problem: Measure_Risk.h, line 577 and following.

Thanks!
