Sparkdeduplication #429
A new version of the clustering mechanism with auto-sizing clusters, to obtain better scalability.
Added programmatic logging to stdout, for easier log reading.
Added reshuffle for better work balance.
Needs code cleanup and quality assurance.
Version used for performance testing.
Task tiling class rewritten in Scala, with tests.
Cleaning up project files. Fixed workflow building for Oozie.
…into sparkdeduplication
import pl.edu.icm.coansys.deduplication.document.comparator.VotesProductComparator
import pl.edu.icm.coansys.deduplication.document.comparator.WorkComparator
import scala.collection.mutable.ListBuffer
import pl.edu.icm.coansys.document.deduplication._
miconi left a comment
The code is overall good, though there is one major issue: the CartesianTaskSplit.processPairs function returns an empty list, so it doesn't look like it should work.
There are also some minor style and performance observations (reduceByKey and foldByKey used to achieve the result of a groupByKey operation) noted in comments.
Apart from the places indicated in the comments, it would be great to go through the IDE suggestions and reformat the whole codebase.
Thanks for the code.
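The reviewer's point about reduceByKey/foldByKey being used where a grouping operation would do can be illustrated with plain Scala collections (a hypothetical sketch, not the PR's Spark code; `pairs` is an invented example dataset):

```scala
val pairs = Seq(("a", 1), ("a", 2), ("b", 3))

// fold-style accumulation of per-key lists, the pattern the review flags
val folded = pairs.foldLeft(Map.empty[String, List[Int]]) {
  case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, Nil) :+ v)
}

// the same result expressed directly as a single grouping operation
val grouped = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).toList }

assert(folded == grouped)
```

In Spark the analogous simplification would be replacing the fold with a single `groupByKey`, which states the intent directly.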
 *
 */
object DeduplicateDocuments {
  val log = org.slf4j.LoggerFactory.getLogger(getClass().getName())
Empty parentheses should be removed in method calls that do not have side effects (here and in other places in this code)
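As a quick illustration of the convention (using a plain `String` rather than the PR's logger setup):

```scala
// Scala convention: call parameterless, side-effect-free methods without
// parentheses; both forms compile and return the same value.
val s = "spark"
val preferred = s.getClass.getName       // no empty parens
val discouraged = s.getClass().getName() // same value, noisier style

assert(preferred == discouraged)
assert(preferred == "java.lang.String")
```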
'log' variable can be private
object DeduplicateDocuments {
  val log = org.slf4j.LoggerFactory.getLogger(getClass().getName())

  implicit def toJavaBiPredicate[A, B](predicate: (A, B) => Boolean) =
Import scala.language.implicitConversions to turn off compiler warnings about implicits
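A minimal sketch of how this looks, assuming the conversion targets `java.util.function.BiPredicate` (the target type is not shown in the quoted diff, so that part is an assumption):

```scala
import scala.language.implicitConversions // silences the implicit-conversion warning
import java.util.function.BiPredicate

// Hypothetical version of the conversion under review.
implicit def toJavaBiPredicate[A, B](predicate: (A, B) => Boolean): BiPredicate[A, B] =
  new BiPredicate[A, B] { def test(a: A, b: B): Boolean = predicate(a, b) }

// Assigning a Scala Function2 value where a BiPredicate is expected
// exercises the implicit conversion.
val f: (String, String) => Boolean = (s, prefix) => s.startsWith(prefix)
val startsWith: BiPredicate[String, String] = f

assert(startsWith.test("spark", "sp"))
assert(!startsWith.test("spark", "de"))
```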
  } else {
    false
  }
}
isValidDocument could be rewritten in a functional programming style using the Option monad:

def isValidDocument(doc: DocumentWrapper): Boolean =
  Option(doc.getDocumentMetadata)
    .flatMap(md => Option(md.getBasicMetadata))
    .exists(bmd => bmd.getTitleCount > 0 || bmd.getAuthorCount > 0 || bmd.hasDoi || bmd.hasJournal)

"The Option companion object's apply method serves as a conversion function from nullable references" https://stackoverflow.com/questions/4692506/wrapping-null-returning-method-in-java-with-option-in-scala

It would be even better if we used the Scala protobuf compiler, which directly supports Option for optional values in protobufs (see https://scalapb.github.io/generated-code.html).

flatMap acts on monads like bind in Haskell, if that clarifies anything. A simple introduction to using Option as a monad in Scala and how it makes code clearer: https://www.slideshare.net/jankrag/introduction-to-option-monad-in-scala
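A runnable illustration of the pattern, using hypothetical stand-in case classes instead of the real generated protobuf types:

```scala
// Stand-ins for the protobuf wrappers (hypothetical, for illustration only).
case class BasicMetadata(titleCount: Int)
case class DocumentMetadata(basicMetadata: BasicMetadata)

// Option(...) lifts a possibly-null reference into Some/None, and
// flatMap/exists short-circuit on None, replacing nested null checks.
def hasTitle(md: DocumentMetadata): Boolean =
  Option(md)
    .flatMap(m => Option(m.basicMetadata))
    .exists(_.titleCount > 0)

assert(hasTitle(DocumentMetadata(BasicMetadata(2))))
assert(!hasTitle(null))
assert(!hasTitle(DocumentMetadata(null)))
```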
def main(args: Array[String]): Unit = {
It would be more readable if the main method was placed before the helper methods.
    }
  }
}).mapValues(_._1)
inputDocs.join(selectedClusters).map(p => (p._2._2, p._2._1)).groupByKey
The last expression could be simplified to:
selectedClusters.join(inputDocs).values.groupByKey
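A plain-collections sketch (not Spark; `join` and the sample data below are invented for illustration) of why the two expressions agree: joining in the reversed order already yields (cluster, doc) pairs, so the per-pair reordering becomes unnecessary.

```scala
// Toy inner join on the first tuple element, mimicking RDD.join semantics.
def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for ((k, a) <- left; (k2, b) <- right if k == k2) yield (k, (a, b))

val inputDocs = Seq((1, "docA"), (2, "docB"))
val selectedClusters = Seq((1, "clusterX"), (2, "clusterX"))

// original shape: join then swap each pair before grouping
val original = join(inputDocs, selectedClusters)
  .map(p => (p._2._2, p._2._1))
  .groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

// simplified shape: reversed join, then just take the values
val simplified = join(selectedClusters, inputDocs)
  .map(_._2)
  .groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

assert(original == simplified)
```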
 */
def mergeDocuments(docs: List[DocumentWrapper]): DocumentWrapper = {
  val merger = buildDocumentsMerger()
  val merged = merger.merge(docs);
No need to define val merged
val normalized = StringTools.normalize(title);
// seems that normalize removes stopwords, which is wrong, and quite expensive
//val normalized = StringTools.removeStopWords(StringTools.normalize(title));
val res = normalized.replaceAll("\\s+", "")
There is no need to define val res here; the expression assigned to it could simply be the last expression in this function.
def generateKeys(title: String): Seq[String] = {
  val ctitle = cleanUpString(title)
  val mlen = keySizes.max
  val longestKey = ctitle.zipWithIndex.filter(_._2 % 2 == 0).map(_._1).take(mlen).mkString
  keySizes.map(keyLength => longestKey.substring(0, Math.min(keyLength, longestKey.size))).distinct
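For readers following the snippet above: it builds one long key from the characters at even positions of the cleaned title, then emits a distinct prefix per requested key size. A self-contained sketch, with a trivial stand-in for cleanUpString (the real one presumably normalizes more than lowercasing and whitespace removal):

```scala
// Hypothetical standalone version of the key-generation logic above.
def generateKeys(title: String, keySizes: Seq[Int]): Seq[String] = {
  val ctitle = title.toLowerCase.replaceAll("\\s+", "") // simplified cleanUpString
  val mlen = keySizes.max
  // keep only characters at even positions, up to the longest requested key
  val longestKey = ctitle.zipWithIndex.filter(_._2 % 2 == 0).map(_._1).take(mlen).mkString
  keySizes.map(n => longestKey.substring(0, math.min(n, longestKey.length))).distinct
}

// "Spark Deduplication" cleans to "sparkdeduplication";
// even positions give "sakeulcto", so sizes 3 and 5 yield "sak" and "sakeu".
assert(generateKeys("Spark Deduplication", Seq(3, 5)) == Seq("sak", "sakeu"))
```

Sampling every other character keeps the key cheap to compute while still discriminating between most titles; identical prefixes of different sizes collapse via `distinct`.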
A ready, working version of the Scala/Spark deduplication, to replace the original MapReduce/Pig solution.