GitHub - RobinHYuan/Document_Analyzer

Document Analyzer

The document analyzer analyzes text documents based on various characteristics of each document. It can be used for text extraction, plagiarism checking, and sentiment analysis using third-party APIs.

Part I: Text analysis

A document class is created for storing the text and helping perform certain operations and analyses. In this task, I performed the following operations after retrieving the text from a URL or a local file path:

Split the document into its constituent sentences¹ to determine:
- The average number of words² in a sentence, m₁
- The average number of phrases³ per sentence, m₂

Break the document down further into phrases and words to compute:
- The average word length, m₃
- The ratio of the number of unique words to the total number of words, m₄
- The ratio of the number of words occurring exactly once to the total number of words, m₅
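
The five metrics above can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the class and method names (`TextMetricsSketch`, `sentences`, `phrases`, `words`, `metrics`) are hypothetical, words are lower-cased before counting (an assumption), and sentence splitting is simplified to cutting at every `.`, `!` or `?`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of metrics m1..m5; not the repository's implementation.
final class TextMetricsSketch {
    // Characters stripped from the start of a token (as listed in the word definition).
    private static final String START = "!\"$%&'()*+,-./:;<=>?@[\\]^_`{|}";
    // Characters stripped from the end of a token ('#' and '~' are added here).
    private static final String END = START + "#~";

    // Simplification: every '.', '!' or '?' ends a sentence; the remainder up to EOF counts too.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("[.!?]")) {
            if (!s.trim().isEmpty()) out.add(s.trim());
        }
        return out;
    }

    // Phrases are the non-blank parts of a sentence separated by ',', ':' or ';'.
    static List<String> phrases(String sentence) {
        List<String> out = new ArrayList<>();
        for (String p : sentence.split("[,:;]")) {
            if (!p.trim().isEmpty()) out.add(p.trim());
        }
        return out;
    }

    // Words are whitespace-separated tokens with edge punctuation removed.
    static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.split("\\s+")) {
            int start = 0;
            int end = token.length();
            while (start < end && START.indexOf(token.charAt(start)) >= 0) start++;
            while (end > start && END.indexOf(token.charAt(end - 1)) >= 0) end--;
            // Assumption: words are compared case-insensitively.
            if (start < end) out.add(token.substring(start, end).toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Returns {m1, m2, m3, m4, m5}; entries are NaN when the text has no sentences or words.
    static double[] metrics(String text) {
        List<String> sentenceList = sentences(text);
        Map<String, Integer> counts = new HashMap<>();
        int totalWords = 0;
        int totalPhrases = 0;
        int totalChars = 0;
        for (String s : sentenceList) {
            totalPhrases += phrases(s).size();
            for (String w : words(s)) {
                counts.merge(w, 1, Integer::sum);
                totalWords++;
                totalChars += w.length();
            }
        }
        long once = counts.values().stream().filter(c -> c == 1).count();
        double n = sentenceList.size();
        return new double[] {
            totalWords / n,                       // m1: average words per sentence
            totalPhrases / n,                     // m2: average phrases per sentence
            (double) totalChars / totalWords,     // m3: average word length
            (double) counts.size() / totalWords,  // m4: unique words / total words
            (double) once / totalWords            // m5: words occurring once / total words
        };
    }
}
```

For example, `TextMetricsSketch.metrics("The cat sat, the dog ran. Go!")` sees two sentences, three phrases and seven words, six of them unique and five occurring exactly once.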

Part II: Sentiment Analysis

In this part, the Google Cloud Natural Language client library is used to detect the sentiment⁴ of each sentence in the given document. Using Google's Natural Language AI system, we can obtain a sentiment score and a sentiment magnitude. A sentence is positive if the sentiment score is ≥ 0.3. A sentence is negative if the sentiment score is ≤ -0.3.
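
The two thresholds can be expressed as a small classifier. This is only a sketch: the names `SentimentLabelSketch`, `Label` and `classify` are hypothetical, the score is assumed to have already been obtained from the Natural Language API, and treating everything strictly between -0.3 and 0.3 as `NEUTRAL` is an assumption, since the text above only defines the positive and negative cases.

```java
// Hypothetical sketch of the score thresholds; not the repository's implementation.
final class SentimentLabelSketch {
    enum Label { POSITIVE, NEGATIVE, NEUTRAL }

    // score: sentiment score of one sentence, assumed to come from the API response.
    static Label classify(double score) {
        if (score >= 0.3) return Label.POSITIVE;
        if (score <= -0.3) return Label.NEGATIVE;
        return Label.NEUTRAL; // assumption: the text does not name this middle band
    }
}
```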

Part III: Document Similarity Analysis

We now have all the required metrics to group different documents. The similarity of any two documents is determined by their document divergence and union-find. To determine the document divergence, we first compute their Jensen-Shannon divergence:

In our context, we defined p_i as the probability of word ω_i appearing in document P and q_i as the probability of word ω_i appearing in document Q. The summation is over all words that appear in the two documents together. If ω_i appears in P and not in Q then q_i = 0. Further, by definition, and . The Jensen-Shannon Divergence uses the frequency with which words appear to determine if two documents are divergent or not.
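
As an illustration, the commonly used form of the Jensen-Shannon divergence with these definitions is 𝛿_js(P, Q) = ½ Σ_i [ p_i log₂(p_i / m_i) + q_i log₂(q_i / m_i) ], where m_i = (p_i + q_i) / 2 and a term whose probability is 0 contributes 0. The sketch below assumes that form; the base-2 logarithm, the use of word-count maps as input, and the names `JensenShannonSketch` and `divergence` are assumptions, not details confirmed by this README.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the Jensen-Shannon divergence over word frequencies.
final class JensenShannonSketch {
    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // pCounts and qCounts map each word to its number of occurrences in P and Q.
    // Both maps are assumed to be non-empty.
    static double divergence(Map<String, Integer> pCounts, Map<String, Integer> qCounts) {
        double pTotal = 0;
        double qTotal = 0;
        for (int c : pCounts.values()) pTotal += c;
        for (int c : qCounts.values()) qTotal += c;

        // The summation runs over every word that appears in either document.
        Set<String> vocabulary = new HashSet<>(pCounts.keySet());
        vocabulary.addAll(qCounts.keySet());

        double sum = 0;
        for (String word : vocabulary) {
            double p = pCounts.getOrDefault(word, 0) / pTotal; // 0 if the word is absent from P
            double q = qCounts.getOrDefault(word, 0) / qTotal; // 0 if the word is absent from Q
            double m = (p + q) / 2;
            if (p > 0) sum += p * log2(p / m); // a zero probability contributes nothing
            if (q > 0) sum += q * log2(q / m);
        }
        return sum / 2;
    }
}
```

With this form and base, two documents with identical word frequencies give 0 and two documents with no words in common give 1.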

We will now calculate the group divergence of any two given documents using the following definition:
Note that 𝛿_js is the Jensen-Shannon divergence. The larger the divergence, the more dissimilar the two documents are.
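
One way union-find could be used to group documents by divergence is sketched below. The merge rule (join the least divergent pairs first until a target number of groups remains), the precomputed pairwise divergence matrix, and the names `DocumentGroupingSketch`, `group` and `targetGroups` are all assumptions made for illustration; the README does not spell out the grouping procedure.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: grouping documents with union-find over pairwise divergences.
final class DocumentGroupingSketch {
    private final int[] parent;
    private int groups;

    DocumentGroupingSketch(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i; // every document starts in its own group
        groups = n;
    }

    // Returns the representative of x's group, halving the path along the way.
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Merges the groups of a and b; returns false if they were already together.
    boolean union(int a, int b) {
        int rootA = find(a);
        int rootB = find(b);
        if (rootA == rootB) return false;
        parent[rootA] = rootB;
        groups--;
        return true;
    }

    // divergence[i][j] is the divergence between documents i and j (assumed symmetric).
    // Returns one group representative per document.
    static int[] group(double[][] divergence, int targetGroups) {
        int n = divergence.length;
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) pairs.add(new int[] {i, j});
        }
        // Most similar (least divergent) pairs are considered first.
        pairs.sort(Comparator.comparingDouble(p -> divergence[p[0]][p[1]]));

        DocumentGroupingSketch sets = new DocumentGroupingSketch(n);
        for (int[] pair : pairs) {
            if (sets.groups <= targetGroups) break;
            sets.union(pair[0], pair[1]);
        }
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) labels[i] = sets.find(i);
        return labels;
    }
}
```

Two documents belong to the same group exactly when their entries in the returned array are equal.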

¹ We consider a sentence to be a sequence of characters that is terminated by one of the characters ! ? . or by EOF, excludes whitespace on either end, and is not empty.
² A word is a non-empty token that is not completely made up of punctuation. If a token begins or ends with punctuation, a word can be obtained by removing the leading and trailing punctuation. Specifically, the start of a word should not contain any of ! " $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | }. The end of a word should not include any of ! # " $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. Hyphenated words are considered to be one word. Words are allowed to start with a hashtag (#).
³ A phrase is a non-empty part of a sentence (empty meaning an empty string or whitespace only) that is separated from another phrase by a comma, colon, or semicolon.
⁴ It is important to note that the Natural Language API indicates differences between positive and negative emotion in a document, but does not identify specific positive and negative emotions. For example, "angry" and "sad" are both considered negative emotions. However, when the Natural Language API analyzes text that is considered "angry", or text that is considered "sad", the response only indicates that the sentiment in the text is negative, not "sad" or "angry".
