Seed problem

Hello

I'm trying to go through the week 3 lab, but there seems to be a problem with the proportions used to partition the data into train, validation and test sets. I'm using the supplied seed along with the defined weights, and I get a different number of examples in each set than expected. As a result, the tests that follow are bound to fail.

Snippet:

```
weights = [.8, .1, .1]
seed = 42
raw_train_df, raw_validation_df, raw_test_df = raw_df.randomSplit(weights, seed)

n_train = raw_train_df.cache().count()
n_val = raw_validation_df.cache().count()
n_test = raw_test_df.cache().count()
print n_train, n_val, n_test, n_train + n_val + n_test
raw_df.show(1)
```

Output:

```
80115 9955 9930 100000
+--------------------+
|                text|
+--------------------+
|0,1,1,5,0,1382,4,...|
+--------------------+
only showing top 1 row
```
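To make the discrepancy concrete, here is a small sketch that compares the counts printed above with what the weights would give if the split were exactly proportional. It is plain Python with no Spark dependency; `split_report` is a hypothetical helper introduced only for this illustration, and the "expected" column assumes an exact 80/10/10 split, which `randomSplit` (being a random split) may not guarantee.

```python
# Weights and counts copied from the snippet and output above.
weights = [.8, .1, .1]
observed = [80115, 9955, 9930]


def split_report(weights, observed):
    """Hypothetical helper: compare observed split sizes with weight * total.

    Returns one (expected, observed, difference, observed_fraction) tuple
    per split.
    """
    total = sum(observed)
    rows = []
    for w, n in zip(weights, observed):
        expected = w * total  # size under an exactly proportional split
        rows.append((expected, n, n - expected, n / float(total)))
    return rows


report = split_report(weights, observed)
for expected, n, diff, frac in report:
    print("expected %.0f  observed %d  diff %+.0f  fraction %.4f"
          % (expected, n, diff, frac))
```

The three differences sum to zero because the total is still 100000; only the distribution across the sets deviates from the nominal proportions.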

The same thing happens in lab 2 (linear regression).


Seed problem #22
