The document analyzer analyzes text documents based on various characteristics of each document. It can be used for text extraction, plagiarism checking, and sentiment analysis using third-party APIs.
A document class is created for storing the text and helping perform certain operations and analyses.
In this task, I performed the following operations after retrieving the text from a URL or a local file path:
- Split the document into its constituent sentences¹ to determine:
- Break the document down further into phrases and words to compute:
  - The average word length, m3
  - The ratio of the number of unique words to the total number of words, m4
  - The ratio of the number of words occurring exactly once to the total number of words, m5
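The word-level metrics can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names (`split_sentences`, `extract_words`, `word_metrics`) are hypothetical, the punctuation sets are taken from the footnote on words, and lowercasing words before counting is an assumption the text does not state.

```python
import re
from collections import Counter

# Punctuation stripped from the start and end of a token, following the
# footnote on words. '#' is absent from LEADING so hashtags are preserved.
LEADING = "!\"$%&'()*+,-./:;<=>?@[\\]^_`{|}"
TRAILING = "!#\"$%&'()*+,-./:;<=>?@[\\]^_`{|}~"


def split_sentences(text):
    """Split on '!', '?' or '.', trim whitespace, and drop empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]


def extract_words(text):
    """Tokenize on whitespace and strip punctuation from both ends."""
    words = []
    for token in text.split():
        # Lowercasing is an assumption, not something the source specifies.
        word = token.lstrip(LEADING).rstrip(TRAILING).lower()
        if word:  # tokens made up entirely of punctuation are discarded
            words.append(word)
    return words


def word_metrics(text):
    """Return (m3, m4, m5) for the given text."""
    words = extract_words(text)
    if not words:
        return 0.0, 0.0, 0.0
    counts = Counter(words)
    m3 = sum(len(w) for w in words) / len(words)  # average word length
    m4 = len(counts) / len(words)  # unique words / total words
    m5 = sum(1 for c in counts.values() if c == 1) / len(words)  # words seen once / total
    return m3, m4, m5
```

For example, `word_metrics("The cat saw the dog. The dog ran!")` sees eight words of three letters each, five of them distinct and three occurring exactly once.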
In this part, the Google Cloud Natural Language client library is used to detect the sentiment⁴ of each sentence in the given document. Using Google's Natural Language AI system, we can obtain a sentiment score and a sentiment magnitude. A sentence is positive if its sentiment score is ≥ 0.3 and negative if its sentiment score is ≤ -0.3.
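The thresholding rule can be sketched in a few lines. This example deliberately leaves out the API call and only classifies scores that have already been obtained. The function names are hypothetical, and labelling everything strictly between the two thresholds as "neutral" is an assumption, since the text only defines the positive and negative cases.

```python
from collections import Counter


def classify_sentiment(score, threshold=0.3):
    """Map a sentence-level sentiment score to a label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    # Assumption: scores strictly between the thresholds are called neutral.
    return "neutral"


def tally_sentiments(scores, threshold=0.3):
    """Count how many sentence scores fall under each label."""
    return Counter(classify_sentiment(s, threshold) for s in scores)
```

Given per-sentence scores such as `[0.8, -0.5, 0.1]`, `tally_sentiments` would report one sentence under each label.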
We now have all the required metrics to group different documents. The similarity of any two documents is determined by their document divergences and union-find.
In order to determine their document divergences, we will first compute their Jensen-Shannon divergence:
$$\delta_{js}(P, Q) = \frac{1}{2} \sum_{i} \left( p_i \log \frac{p_i}{m_i} + q_i \log \frac{q_i}{m_i} \right)$$
In our context, we define $p_i$ as the probability of word $\omega_i$ appearing in document $P$ and $q_i$ as the probability of word $\omega_i$ appearing in document $Q$.
The summation is over all words that appear in the two documents together. If $\omega_i$ appears in $P$ and not in $Q$ then $q_i = 0$. Further, $0 \log 0 = 0$ by definition, and $m_i = \frac{p_i + q_i}{2}$.
The Jensen-Shannon Divergence uses the frequency with which words appear to determine if two documents are divergent or not.
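As a rough illustration, the divergence could be computed from two word lists as below. The sketch assumes the standard form of the Jensen-Shannon divergence, in which each word's probabilities are compared against their average (p_i + q_i) / 2, terms with zero probability contribute nothing, and logarithms are taken base 2 (an assumption that bounds the result between 0 and 1). The function names are hypothetical.

```python
import math
from collections import Counter


def word_probabilities(words):
    """Relative frequency of each word in a list of words."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def jensen_shannon_divergence(words_p, words_q):
    """Jensen-Shannon divergence (base 2) between two word lists."""
    p = word_probabilities(words_p)
    q = word_probabilities(words_q)
    total = 0.0
    # Sum over every word that appears in either document.
    for word in set(p) | set(q):
        p_i = p.get(word, 0.0)  # 0 if the word is missing from P
        q_i = q.get(word, 0.0)  # 0 if the word is missing from Q
        m_i = (p_i + q_i) / 2
        # Terms with zero probability are skipped (0 * log 0 = 0).
        if p_i > 0:
            total += p_i * math.log2(p_i / m_i)
        if q_i > 0:
            total += q_i * math.log2(q_i / m_i)
    return total / 2
```

Under these assumptions, two documents with identical word frequencies have a divergence of 0, and two documents with no words in common have a divergence of 1.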
We will now calculate the group divergence of any two given documents using the following definition:
Note that 𝛿js is the Jensen-Shannon divergence. The larger the divergence, the more dissimilar the two documents are.
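The text does not spell out how union-find is combined with the divergences, so the following is only one plausible sketch: documents are indexed 0..n-1, a precomputed symmetric matrix of pairwise divergences is given, and any two documents whose divergence falls below a hypothetical `threshold` are merged into the same group. The class and function names are illustrative.

```python
class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False  # already in the same group
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a  # attach the smaller tree to the larger
        self.size[root_a] += self.size[root_b]
        return True


def group_documents(divergence, threshold):
    """Group document indices whose pairwise divergence is below threshold."""
    n = len(divergence)
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            if divergence[i][j] < threshold:  # assumed merge criterion
                uf.union(i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())
```

Because merging is transitive, two documents can end up in the same group through a chain of similar documents even if their direct divergence exceeds the threshold.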
Footnotes
1. We will consider a sentence to be a sequence of characters that is terminated by the characters `!`, `?`, `.` or EOF, excludes whitespace on either end, and is not empty.
2. A word is a non-empty token that is not completely made up of punctuation. If a token begins or ends with punctuation then a word can be obtained by removing the starting and trailing punctuation. Specifically, the start of a word should not contain any of `` ! " $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ``. The end of a word should not include any of `` ! # " $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ``. Hyphenated words are considered to be one word. Words are allowed to start with hashtags (`#`).
3. A phrase is a non-empty (empty = empty string or whitespace only) part of a sentence that is separated from another phrase by commas, colons and semi-colons.
4. It is important to note that the Natural Language API indicates differences between positive and negative emotion in a document, but does not identify specific positive and negative emotions. For example, "angry" and "sad" are both considered negative emotions. However, when the Natural Language API analyzes text that is considered "angry", or text that is considered "sad", the response only indicates that the sentiment in the text is negative, not "sad" or "angry".
