ARGOT

ARGOT is an Apache spaRk based text mininG tOolkiT and library that supports and demonstrates the use of n-gram graphs within Natural Language Processing applications.

Argot is a French word that translates into English as "slang".

Specifications

Code Snippets
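The snippets below assume an existing SparkContext named `sc`. A minimal local setup (the application name here is illustrative) could look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local Spark setup; "local[*]" uses all available cores.
// "argot-example" is an illustrative application name.
val conf = new SparkConf()
  .setAppName("argot-example")
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```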

  • Create an n-gram graph from a string:
import graph.NGramGraph

...

	// Create a graph object with n-gram size and window size (dwin) equal to 3
	val g = new NGramGraph(3,3)
	// Create the graph from a string
	g.fromString("Hello World!")
  • Create an n-gram graph from a file:
import graph.NGramGraph

...
        
	val g = new NGramGraph(3,3) 
	// Load the data string from the file
	g.fromFile("file.txt")
  • Merge multiple small graphs into a single distributed graph:
import graph.operators.MultipleGraphMergeOperator
import traits.DocumentGraph
import org.apache.spark.rdd.RDD

...

	// RDD of document graphs
	// A document graph may be an n-gram graph, a word n-gram graph
	// or any other graph that extends the DocumentGraph trait
	val manyGraphs: RDD[DocumentGraph] = ...
	// create an instance of the operator with 8 partitions
	val merger = new MultipleGraphMergeOperator(8)
	// create the distributed graph
	val dGraph = merger.getGraph(manyGraphs)
  • Compare a small graph with a distributed one, extracting their similarity:
import graph.similarity.DiffSizeGSCalculator
import org.apache.spark.SparkContext
import graph.DistributedCachedNGramGraph
import traits.DocumentGraph

...

    val gsc = new DiffSizeGSCalculator(sc) // pass the spark context as parameter
    // extract their similarity
    val gs = gsc.getSimilarity(smallGraph,dGraph) 
    // print containment similarity
    println(gs.getSimilarityComponents("containment"))
    // print value similarity
    println(gs.getSimilarityComponents("value"))
    // print normalized value similarity
    println(gs.getSimilarityComponents("normalized"))
    // print size similarity
    println(gs.getSimilarityComponents("size"))
  • Run an n-fold cross-validation classification experiment:
import experiments.CrossValidation
import org.apache.spark.{SparkConf, SparkContext}

...

	// create an instance of the experiment class
	// the directory contains one subdirectory per class, each holding that class's texts
	// options for classifiers are: "Random Forest","Naive Bayes","SVMBinary","SVMMulticlass"
	// spark context, classification algorithm, directory to classify, number of folds
	val exp = new CrossValidation(sc,"Random Forest","docs",10)
	// run the experiment with 8 partitions
	exp.run(8)
	// or classify on a single random fold only
	exp.classify(8)
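For reference, the similarity components returned above (containment, value, normalized value, size) are commonly defined in the n-gram graph literature as follows. These definitions follow Giannakopoulos' thesis and may differ in detail from ARGOT's implementation:

```latex
% G_1, G_2: n-gram graphs; |G| = number of edges; w_G(e) = weight of edge e in G
% Size Similarity
SS(G_1, G_2) = \frac{\min(|G_1|, |G_2|)}{\max(|G_1|, |G_2|)}
% Containment Similarity: fraction of the smaller graph's edges found in the other
CS(G_1, G_2) = \frac{|\{e : e \in G_1 \wedge e \in G_2\}|}{\min(|G_1|, |G_2|)}
% Value Similarity: weight-aware overlap
VS(G_1, G_2) = \frac{\sum_{e \in G_1 \cap G_2}
  \frac{\min(w_{G_1}(e),\, w_{G_2}(e))}{\max(w_{G_1}(e),\, w_{G_2}(e))}}{\max(|G_1|, |G_2|)}
% Normalized Value Similarity: value similarity with the size factor removed
NVS(G_1, G_2) = \frac{VS(G_1, G_2)}{SS(G_1, G_2)}
```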

Details

The current version/branch of ARGOT contains the following:

  • The n-gram graph (NGG) representation. See Chapter 3 of George Giannakopoulos' thesis for more info.
  • The NGG operators: update/merge, intersect, allNotIn, etc. See Chapter 4 of George Giannakopoulos' thesis for more info.
  • A text tokenizer (extraction of n-grams, words, sentences, etc. from a text).
  • Feature extraction algorithm for document classification with the use of n-gram graphs.
  • Naive Bayes Multinomial Classifier.
  • Support Vector Machines with Stochastic Gradient Descent Classifier (binary, multi-class).
  • Random Forest Classifier.
  • Markov clustering algorithm for similarity matrices. Many thanks to user joandre.
  • A simple clustering algorithm for documents based on graph similarities (under construction).
  • A multiple document summarizer (under construction).

The above operations can be executed on a cluster and make use of every available processor.
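To illustrate the core idea behind the NGG representation listed above, here is a minimal, self-contained sketch of character n-gram extraction and window-based edge counting. This is an illustration only, not ARGOT's implementation: ARGOT's `NGramGraph` handles windowing and edge weighting differently (e.g. the window is measured in characters, and edges may be symmetric).

```scala
object NGramSketch {
  // Extract character n-grams from a string using a sliding window
  def ngrams(text: String, n: Int): List[String] =
    text.sliding(n).toList

  // Build weighted co-occurrence edges: two n-grams are connected when
  // one follows the other within a distance window (dwin); the edge
  // weight counts how often that happens
  def edges(grams: List[String], dwin: Int): Map[(String, String), Int] =
    grams.zipWithIndex.flatMap { case (g, i) =>
      grams.slice(i + 1, i + 1 + dwin).map(h => (g, h))
    }.groupBy(identity).map { case (e, occ) => (e, occ.size) }

  def main(args: Array[String]): Unit = {
    val grams = ngrams("Hello World!", 3)
    println(grams.take(3))        // → List(Hel, ell, llo)
    println(edges(grams, 3).size) // → 24 distinct edges
  }
}
```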