Benchmarking

I tried to benchmark this tool on the Maas dataset in the same setting as described in the paper (Distributed Representations of Sentences and Documents). Instead of testing it directly, I created sentence representations for the entire Maas dataset (train + test) and ran cross-validation on them. I am not getting more than 51% accuracy. Has anybody tested and benchmarked this implementation?
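
For anyone who wants to reproduce the evaluation step, here is a rough sketch of cross-validating a classifier on precomputed vectors. It is not the tool's code and not the paper's protocol: the function names (`k_fold_indices`, `cross_validate`), the nearest-centroid classifier, and the synthetic two-cluster data are all assumptions made for illustration. It uses only the Python standard library so it runs anywhere.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cross_validate(vectors, labels, k=5, seed=0):
    """Mean accuracy of a nearest-centroid classifier over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(vectors), k, seed):
        held = set(held_out)
        # Fit one centroid per class on everything outside the held-out fold.
        by_class = {}
        for i, (v, y) in enumerate(zip(vectors, labels)):
            if i not in held:
                by_class.setdefault(y, []).append(v)
        centroids = {y: centroid(vs) for y, vs in by_class.items()}
        # Predict the class whose centroid is closest to each held-out vector.
        correct = sum(
            1 for i in held_out
            if min(centroids, key=lambda c: sq_dist(vectors[i], centroids[c])) == labels[i]
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)

# Synthetic stand-in for document vectors (hypothetical data, two clusters).
rng = random.Random(42)
dim = 10
X = [[rng.gauss(3.0 if i % 2 else -3.0, 1.0) for _ in range(dim)] for i in range(200)]
y = [i % 2 for i in range(200)]
separable = cross_validate(X, y)
print(f"mean cross-validation accuracy: {separable:.3f}")
```

To use it on real data, replace `X` with the vectors produced by the tool and `y` with the matching labels. If a simple classifier like this scores well on data known to be separable but poorly on the learned vectors, the problem is more likely in the vectors, or in how they are paired with labels, than in the cross-validation code.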