The PTB dataset downloaded using the script in the manual has the wrong number of unique words #12

The PTB dataset downloaded using the script in the manual seems to be wrong: it fails the `assert ntokens == 10000` check in `test_ptb_dataset`. With that data, the total number of unique words is 8481, which is less than the expected 10000 (I think my code is right). When I download the data from this [url](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz) instead, the test passes. I then compared the dataset files (train.txt, test.txt, valid.txt) from the two sources using `diff`, and they differ in train.txt.
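
For anyone who wants to reproduce the count independently of the course code, here is a minimal sketch. It assumes the vocabulary is built from whitespace-separated tokens and that one `<eos>` marker is added per line; both are assumptions about the loader, and `count_unique_tokens` is a hypothetical helper, not part of the repository.

```python
def count_unique_tokens(paths, eos="<eos>"):
    """Count distinct whitespace-separated tokens across the given text files.

    Assumption: the dataset loader appends an end-of-sentence marker to every
    line. Pass eos=None to count only the tokens present in the files.
    """
    vocab = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # str.split() with no arguments splits on any whitespace run
                vocab.update(line.split())
                if eos is not None:
                    vocab.add(eos)
    return len(vocab)
```

Calling `count_unique_tokens(["train.txt"])` on the copy from each source should show whether the gap (8481 versus 10000 in my case) comes from the data files rather than from the dictionary code.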
