Pick last file sorted by path for schema #269
koertkuipers wants to merge 6 commits into databricks:master
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #269     +/-   ##
=========================================
+ Coverage   92.21%   92.61%   +0.4%
=========================================
  Files           5        5
  Lines         321      325      +4
  Branches       43       41      -2
=========================================
+ Hits          296      301      +5
+ Misses         25       24      -1
    )
  }

  def sampleFilePath = if (conf.getBoolean(IgnoreFilesWithoutExtensionProperty, true)) {
    files.iterator.map(_.getPath).filter(_.getName.endsWith(".avro"))
files.map(_.getPath).sortBy(_.getName)....
it has the same result, right?
files can be a very large sequence. the iterator approach avoids creating 2 copies of that sequence. also it is not necessary to do a full sort just to get the first sorted element.
are you saying it's not worth the optimization?
You are right about not sorting all the file names.
But I don't think we need to convert it to an iterator.
Maybe we can make it shorter, like files.map(_.getPath).minBy(_.getName)?
We can create a function that accepts a Seq[Path] parameter, then check if it is empty before getting the minimal one.
iterator is lightweight and avoids materialization
minBy(_.getName) wouldn't work because we want to sort by the full path, not just the filename (e.g. /some/path/x=1/part-0000.avro comes before /some/path/x=2/part-0000.avro)
minBy(_.toString) might work but i don't feel too certain about it. i would rather use Comparable to do the right thing. unfortunately Path is just Comparable, not Comparable[Path], so scala doesn't know how to derive an Ordering from it, which is why i resorted to calling compareTo directly.
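The pattern being discussed can be sketched in plain Scala. `HPath` below is a hypothetical stand-in for Hadoop's `org.apache.hadoop.fs.Path`, which implements the raw `Comparable` interface rather than `Comparable[Path]`; the selection makes a single lazy pass over the iterator with `reduceLeftOption`, so there is no full sort and no extra copy of the file sequence.

```scala
// HPath is a hypothetical stand-in for org.apache.hadoop.fs.Path,
// which implements raw Comparable rather than Comparable[Path].
class HPath(val str: String) extends Comparable[AnyRef] {
  def getName: String = str.substring(str.lastIndexOf('/') + 1)
  override def toString: String = str
  override def compareTo(o: AnyRef): Int = str.compareTo(o.toString)
}

object PickSchemaFile {
  // Single lazy pass over the listing: keep the extreme element by full
  // path, calling compareTo directly since scala has no Ordering[HPath].
  def lastByPath(paths: Iterator[HPath]): Option[HPath] =
    paths.reduceLeftOption((a, b) => if (a.compareTo(b) >= 0) a else b)
}
```

With paths `/some/path/x=1/part-0000.avro` and `/some/path/x=2/part-0000.avro` this keeps the `x=2` file; flipping the comparison keeps the first instead, and `None` signals an empty listing.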
    files.headOption.getOrElse {
      throw new FileNotFoundException("No Avro files found.")
    }

    files.iterator.map(_.getPath)
    df1.write.avro(s"$tempDir/different_schemas/z=1")
    val df2 = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("b")))
    df2.write.avro(s"$tempDir/different_schemas/z=2")
    val df3 = spark.read.avro(s"$tempDir/different_schemas")
maybe add a loop around the read? I am not sure if the file order will be the same every time
Have you considered using the schema from the newest data file to get the most up to date version of the schema? Or perhaps a configuration option to do that? Seems like most would update their schemas in a backwards compatible way and using the most recent schema would expose newer fields in the schema.
that is not a bad idea. a switch seems reasonable. i would suggest doing this in a separate branch
@cwlaird3 good idea
@koertkuipers @cwlaird3 I checked with @liancheng, who is a PMC member and one of the original authors of the Data source project. This PR changes the behavior and could cause regressions for other users.
currently it uses a random file to pick the schema. what would be an example of a user for whom things break by going from a random file to the last file?
I agree with @koertkuipers, but if there's still a concern, adding a configuration option to change the behavior could address that.
spark-avro already provides a mechanism for the user to provide a schema; the thing that is currently missing is merging of schemas across all files
By configuration I meant a flag to enable the behavior you've implemented here - not to provide a schema.
oh, a flag to go from a random schema to a non-random schema?
if someone can come up with a user for whom this pullreq breaks their usage i am up for that, otherwise no :)
Picking the same file consistently for the schema avoids weird bugs where the schema of an Avro data source changes randomly or unexpectedly.
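That point can be illustrated with a small, hypothetical sketch: taking whatever file happens to come first in a directory listing is order-dependent, while taking the last file in path order gives the same answer no matter how the listing is shuffled. Plain `String`s stand in for file paths here.

```scala
object DeterministicPick {
  // Last file in lexicographic path order; Strings stand in for paths.
  def lastByPath(listing: Seq[String]): String = listing.max

  def main(args: Array[String]): Unit = {
    val a = Seq("/data/z=2/part-0000.avro", "/data/z=1/part-0000.avro")
    val b = a.reverse // same files, different listing order
    // head depends on how the filesystem happened to list the files...
    assert(a.head != b.head)
    // ...but the path-sorted pick is stable across listing orders.
    assert(lastByPath(a) == lastByPath(b))
  }
}
```

This is the essence of the change: schema selection becomes a pure function of the set of files, not of the listing order.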