Pick last file sorted by path for schema #269
koertkuipers wants to merge 6 commits into databricks:master
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #269     +/-   ##
=========================================
+ Coverage   92.21%   92.61%   +0.4%
=========================================
  Files           5        5
  Lines         321      325      +4
  Branches       43       41      -2
=========================================
+ Hits          296      301      +5
+ Misses         25       24      -1
    )
  }

  def sampleFilePath = if (conf.getBoolean(IgnoreFilesWithoutExtensionProperty, true)) {
    files.iterator.map(_.getPath).filter(_.getName.endsWith(".avro"))
files.map(_.getPath).sortBy(_.getName)....
it has the same result, right?
files can be a very large sequence. the iterator approach avoids creating 2 copies of that sequence. also it is not necessary to do a full sort just to get the first sorted element.
are you saying it's not worth the optimization?
You are right about not sorting all the file names.
But I don't think we need to convert it to an iterator.
Maybe we can make it shorter, like files.map(_.getPath).minBy(_.getName)?
We can create a function that accepts a Seq[Path] parameter, then check if it is empty before getting the minimal one.
iterator is lightweight and avoids materialization
minBy(_.getName) wouldn't work because we want to sort by the full path, not just the filename (e.g. /some/path/x=1/part-0000.avro comes before /some/path/x=2/part-0000.avro)
minBy(_.toString) might work but i don't feel too certain about it. i would rather use Comparable to do the right thing. unfortunately Path is just Comparable, not Comparable[Path], so scala doesn't know how to derive an Ordering from it, which is why i resorted to calling compareTo directly.
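The pattern being discussed can be sketched in plain Scala. `HPath` below is a hypothetical stand-in for Hadoop's `org.apache.hadoop.fs.Path`, which implements the raw `Comparable` interface rather than `Comparable[Path]`; the selection makes a single lazy pass over the iterator with `reduceLeftOption`, so there is no full sort and no extra copy of the file sequence.

```scala
// HPath is a hypothetical stand-in for org.apache.hadoop.fs.Path,
// which implements raw Comparable rather than Comparable[Path].
class HPath(val str: String) extends Comparable[AnyRef] {
  def getName: String = str.substring(str.lastIndexOf('/') + 1)
  override def toString: String = str
  override def compareTo(o: AnyRef): Int = str.compareTo(o.toString)
}

object PickSchemaFile {
  // Single lazy pass over the listing: keep the extreme element by full
  // path, calling compareTo directly since scala has no Ordering[HPath].
  def lastByPath(paths: Iterator[HPath]): Option[HPath] =
    paths.reduceLeftOption((a, b) => if (a.compareTo(b) >= 0) a else b)
}
```

With paths `/some/path/x=1/part-0000.avro` and `/some/path/x=2/part-0000.avro` this keeps the `x=2` file; flipping the comparison keeps the first instead, and `None` signals an empty listing.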
    files.headOption.getOrElse {
      throw new FileNotFoundException("No Avro files found.")
    }

    files.iterator.map(_.getPath)
    df1.write.avro(s"$tempDir/different_schemas/z=1")
    val df2 = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("b")))
    df2.write.avro(s"$tempDir/different_schemas/z=2")
    val df3 = spark.read.avro(s"$tempDir/different_schemas")
maybe add a loop around the read? I am not sure if the file order will be the same every time
Have you considered using the schema from the newest data file to get the most up to date version of the schema? Or perhaps a configuration option to do that? Seems like most would update their schemas in a backwards compatible way and using the most recent schema would expose newer fields in the schema.
that is not a bad idea. a switch seems reasonable. i would suggest doing this in a separate branch
@cwlaird3 good idea
@koertkuipers @cwlaird3 I checked with @liancheng, who is a PMC member and one of the original authors of the Data source project. This PR changes the behavior and could cause regressions for other users.
currently it uses a random file to pick the schema. what would be an example of a user for whom things break by going from a random file to the last file?
I agree with @koertkuipers, but if there's still a concern, adding a configuration option to change the behavior could address that.
spark-avro already provides a mechanism for the user to provide a schema; the thing that is currently missing is merging of schemas across all files
By configuration I meant a flag to enable the behavior you've implemented here - not to provide a schema.
oh, a flag to go from a random schema to a non-random schema?
if someone can come up with a user for whom this pullreq breaks their usage i am up for that, otherwise no :)
Picking the same file consistently for the schema avoids weird bugs where the schema of an Avro data source changes randomly or unexpectedly.
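That point can be illustrated with a small, hypothetical sketch: taking whatever file happens to come first in a directory listing is order-dependent, while taking the last file in path order gives the same answer no matter how the listing is shuffled. Plain `String`s stand in for file paths here.

```scala
object DeterministicPick {
  // Last file in lexicographic path order; Strings stand in for paths.
  def lastByPath(listing: Seq[String]): String = listing.max

  def main(args: Array[String]): Unit = {
    val a = Seq("/data/z=2/part-0000.avro", "/data/z=1/part-0000.avro")
    val b = a.reverse // same files, different listing order
    // head depends on how the filesystem happened to list the files...
    assert(a.head != b.head)
    // ...but the path-sorted pick is stable across listing orders.
    assert(lastByPath(a) == lastByPath(b))
  }
}
```

This is the essence of the change: schema selection becomes a pure function of the set of files, not of the listing order.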