[EGGO-30] Generate partitioned data. #33
tomwhite wants to merge 6 commits into bigdatagenomics:master from
bbc5eed to 5f18147
5f18147 to e186f7b
I think this is ready to review. @laserson can you have a look please? I have run this successfully locally, and on EC2 I generated flat data after using the workaround described in #43. Partitioning isn't working on EC2 since it has an old version of MR on it, so we need to work out what to do there.
Do we need Hadoop 2.x for the partitioning to work? If so, we can just run the Spark EC2 scripts with |
Thanks for the pointer Frank. I tried it, but it doesn't start the Hadoop cluster daemons correctly, so I need to debug a bit more.
Ah, odd. I haven't tried it myself as I don't often have an explicit need for Hadoop 2.
can this file be combined with the other test-genotypes.json file? Or do you want to keep the partitioning separate?
Yes, I want to have an example that exercises the partitioning.
lgtm, generally. if we end up pulling in other tools in the hadoop stack, it perhaps provides further rationale for switching to using cloudera director. in my experience, the spark-ec2 scripts are a bit uneven.
I didn't have much luck with the MR2 installation on EC2, as it's using an old version. I'm looking into #44 to improve the cluster experience, so I won't commit this until I have a better idea of how feasible that is. It will also be useful if we want to use Impala for querying genome data.
https://github.com/tomwhite/adam-partitioning, while Spark version is being debugged.
e186f7b to 1e23617
This is #30 rebased on #29