Normalise top dynamic k #69

@vdplasthijs

Description

For the concept caption validation task, we set a threshold theta_k and then find the top-k locations such that each of them has a value greater than (or smaller than) theta_k. This results in different values of k, as intended.

However, the total number of datapoints stays the same. We can easily compute the expected top-k score that a random ordering would give (it's a distribution); I think the expectation is something like k / n_points * 100. (For example, for a top-10 with n = 100, you would expect the random-ordering baseline to score 10%.)
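A minimal sketch of that baseline, assuming the k / n_points * 100 expectation above (the function name is illustrative, not from the codebase):

```python
def random_baseline(k: int, n_points: int) -> float:
    """Expected top-k score (in %) of a random ordering.

    With k relevant positions out of n_points, a random ordering is
    expected to score k / n_points * 100 percent.
    """
    return k / n_points * 100.0


# e.g. top-10 out of 100 points -> 10.0 (% expected by chance)
print(random_baseline(10, 100))
```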

So the random baseline depends on k, which varies across aux columns, and on n_points, which is fixed. To compare the percentage values of two different columns, we therefore need to know their k-dependent baselines.

I would suggest that (additionally) we rescale each top-dynamic-k score with a simple min/max rescaling: top_k_norm = (top_k - top_k_baseline) / (100 - top_k_baseline). Then we get a 0-1 score for each column that can be compared.
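The proposed rescaling could look like this (again a sketch; `top_k` and `top_k_baseline` are percentages in [0, 100], with the baseline computed as above):

```python
def normalise_top_k(top_k: float, top_k_baseline: float) -> float:
    """Min/max rescaling of a dynamic top-k score.

    Maps the raw percentage score so that the random baseline lands at 0
    and a perfect score (100%) lands at 1, making columns with different
    k-dependent baselines comparable.
    """
    return (top_k - top_k_baseline) / (100.0 - top_k_baseline)


# A column scoring 55% against a 10% random baseline -> 0.5
print(normalise_top_k(55.0, 10.0))
```

Note that a score below its baseline comes out negative, which may itself be a useful signal that a column performs worse than chance.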

Metadata

Labels

enhancement (New feature or request)
