There are a few possibilities. Among them: the degradation is unevenly distributed (it affects some accounts/keys and not others), or they are gaming your tests while degrading elsewhere.
What is the solution to this? A randomized, distributed test generated from first principles, rather than a static test set that is publicly available as open source? I haven't looked to see how you are currently handling this adversarial possibility, if at all.
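To make the "generated from first principles" idea concrete, here is a rough sketch of what I have in mind, not a description of what you actually do. Every name in it (`make_item`, `make_suite`, `score`, `answer_fn`) is hypothetical, and the arithmetic task is only a stand-in for whatever you really measure. The point is that each item is built from a seed together with its ground truth, so there is no fixed public test set to special-case.

```python
import random
import secrets

def make_item(rng):
    """Build one fresh question whose correct answer is computed, not stored."""
    a, b, c = (rng.randint(100, 999) for _ in range(3))
    prompt = f"Compute ({a} * {b}) + {c}. Reply with the integer only."
    return prompt, a * b + c

def make_suite(seed, n=20):
    """Same seed gives the same suite, so a run can be reproduced afterwards."""
    rng = random.Random(seed)
    return [make_item(rng) for _ in range(n)]

def score(suite, answer_fn):
    """answer_fn is a hypothetical callable: prompt string in, reply string out."""
    correct = sum(
        1 for prompt, expected in suite
        if answer_fn(prompt).strip() == str(expected)
    )
    return correct / len(suite)

if __name__ == "__main__":
    # Draw a fresh seed per run and publish it only after the results are in.
    seed = secrets.randbits(64)
    suite = make_suite(seed)
    print(f"seed={seed}, first prompt: {suite[0][0]}")
```

Running suites like this from several independent accounts/keys would also speak to the uneven-distribution possibility, since per-key scores could then be compared directly.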