ARGOT

ARGOT is an Apache spaRk based text mininG tOolkiT and library that supports and demonstrates the use of n-gram graphs within Natural Language Processing applications.

Argot is a French word that translates into English as "slang".

Specifications

Code Snippets
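The snippets below assume an existing SparkContext named `sc`. A minimal local setup (the application name here is illustrative) could look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local Spark setup; "local[*]" uses all available cores.
// "argot-example" is an illustrative application name.
val conf = new SparkConf()
  .setAppName("argot-example")
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```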

  • Create an n-gram graph from a string:
import graph.NGramGraph

...

	// Create a graph object with n-gram size and window size (dwin) equal to 3
	val g = new NGramGraph(3,3)
	// Create the graph from a string
	g.fromString("Hello World!")
  • Create an n-gram graph from a file:
import graph.NGramGraph

...
        
	val g = new NGramGraph(3,3) 
	// Load the data string from the file
	g.fromFile("file.txt")
  • Merge multiple small graphs into a single distributed graph:
import graph.operators.MultipleGraphMergeOperator
import traits.DocumentGraph
import org.apache.spark.rdd.RDD

...

	// RDD of document graphs
	// A document graph may be an n-gram graph, a word n-gram graph
	// or any other graph that extends the DocumentGraph trait
	val manyGraphs: RDD[DocumentGraph] = ...
	// create an instance of the operator with 8 partitions
	val merger = new MultipleGraphMergeOperator(8)
	// create the distributed graph
	val dGraph = merger.getGraph(manyGraphs)
  • Compare a small graph with a distributed one, extracting their similarity:
import graph.similarity.DiffSizeGSCalculator
import org.apache.spark.SparkContext
import graph.DistributedCachedNGramGraph
import traits.DocumentGraph

...

    val gsc = new DiffSizeGSCalculator(sc) // pass the spark context as parameter
    // extract their similarity
    val gs = gsc.getSimilarity(smallGraph,dGraph) 
    // print containment similarity
    println(gs.getSimilarityComponents("containment"))
    // print value similarity
    println(gs.getSimilarityComponents("value"))
    // print normalized value similarity
    println(gs.getSimilarityComponents("normalized"))
    // print size similarity
    println(gs.getSimilarityComponents("size"))
  • Run an n-fold cross-validation classification experiment:
import experiments.CrossValidation
import org.apache.spark.{SparkConf, SparkContext}

...

	// create an instance of the experiment class
	// the directory contains one subdirectory per class, each holding that class's texts
	// options for classifiers are: "Random Forest","Naive Bayes","SVMBinary","SVMMulticlass"
	// spark context, classification algorithm, directory to classify, number of folds
	val exp = new CrossValidation(sc,"Random Forest","docs",10)
	// run the experiment with 8 partitions
	exp.run(8)
	// or classify on a single random fold only
	exp.classify(8)
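For reference, the similarity components returned above (containment, value, normalized value, size) are commonly defined in the n-gram graph literature as follows. These definitions follow Giannakopoulos' thesis and may differ in detail from ARGOT's implementation:

```latex
% G_1, G_2: n-gram graphs; |G| = number of edges; w_G(e) = weight of edge e in G
% Size Similarity
SS(G_1, G_2) = \frac{\min(|G_1|, |G_2|)}{\max(|G_1|, |G_2|)}
% Containment Similarity: fraction of the smaller graph's edges found in the other
CS(G_1, G_2) = \frac{|\{e : e \in G_1 \wedge e \in G_2\}|}{\min(|G_1|, |G_2|)}
% Value Similarity: weight-aware overlap
VS(G_1, G_2) = \frac{\sum_{e \in G_1 \cap G_2}
  \frac{\min(w_{G_1}(e),\, w_{G_2}(e))}{\max(w_{G_1}(e),\, w_{G_2}(e))}}{\max(|G_1|, |G_2|)}
% Normalized Value Similarity: value similarity with the size factor removed
NVS(G_1, G_2) = \frac{VS(G_1, G_2)}{SS(G_1, G_2)}
```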

Details

The current version/branch of ARGOT contains the following:

  • The n-gram graph (NGG) representation. See Chapter 3 of George Giannakopoulos' thesis for more info.
  • The NGG operators: update/merge, intersect, allNotIn, etc. See Chapter 4 of George Giannakopoulos' thesis for more info.
  • A text tokenizer (extraction of n-grams, words, sentences, etc. from a text).
  • Feature extraction algorithm for document classification with the use of n-gram graphs.
  • Naive Bayes Multinomial Classifier.
  • Support Vector Machines with Stochastic Gradient Descent Classifier (binary, multi-class).
  • Random Forest Classifier.
  • Markov clustering algorithm for similarity matrices. Many thanks to user joandre.
  • A simple clustering algorithm for documents based on graph similarities (under construction).
  • A multiple document summarizer (under construction).

The above operations can be executed on a cluster and make use of every available processor.
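To illustrate the core idea behind the NGG representation listed above, here is a minimal, self-contained sketch of character n-gram extraction and window-based edge counting. This is an illustration only, not ARGOT's implementation: ARGOT's `NGramGraph` handles windowing and edge weighting differently (e.g. the window is measured in characters, and edges may be symmetric).

```scala
object NGramSketch {
  // Extract character n-grams from a string using a sliding window
  def ngrams(text: String, n: Int): List[String] =
    text.sliding(n).toList

  // Build weighted co-occurrence edges: two n-grams are connected when
  // one follows the other within a distance window (dwin); the edge
  // weight counts how often that happens
  def edges(grams: List[String], dwin: Int): Map[(String, String), Int] =
    grams.zipWithIndex.flatMap { case (g, i) =>
      grams.slice(i + 1, i + 1 + dwin).map(h => (g, h))
    }.groupBy(identity).map { case (e, occ) => (e, occ.size) }

  def main(args: Array[String]): Unit = {
    val grams = ngrams("Hello World!", 3)
    println(grams.take(3))        // → List(Hel, ell, llo)
    println(edges(grams, 3).size) // → 24 distinct edges
  }
}
```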