Hi, may I ask why the KL loss is used during validation? This doesn't match Equation 9 in the paper, which is a cross-entropy loss.