Document Analyzer

The document analyzer analyzes text documents based on various characteristics of each document. It can be used for text extraction, plagiarism checking, and sentiment analysis via third-party APIs.

Part I: Text analysis

A document class is created to store the text and support the various operations and analyses. In this task, I perform the following operations after retrieving the text from a URL or a local file path:

  • Split the document into its constituent sentences[^1] to determine:
    • The average number of words[^2] per sentence, m1
    • The average number of phrases[^3] per sentence, m2
  • Break the document down further into phrases and words to compute:
    • The average word length, m3
    • The ratio of unique words to the total number of words, m4
    • The ratio of words occurring exactly once to the total number of words, m5
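The five metrics above can be sketched as follows. This is an illustrative approximation, not the repository's actual code: it splits on whitespace and the basic terminator characters rather than applying the full tokenization rules from the footnotes.

```python
import re

def analyze(text):
    """Compute the five text metrics m1..m5 for a document string."""
    # Split into sentences on the terminators ! ? . and drop empty pieces.
    sentences = [s.strip() for s in re.split(r"[!?.]", text) if s.strip()]
    # Split each sentence into phrases on commas, colons, and semicolons.
    phrases = [p.strip() for s in sentences
               for p in re.split(r"[,:;]", s) if p.strip()]
    words = [w for p in phrases for w in p.split()]

    m1 = len(words) / len(sentences)              # avg words per sentence
    m2 = len(phrases) / len(sentences)            # avg phrases per sentence
    m3 = sum(len(w) for w in words) / len(words)  # avg word length
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    m4 = len(counts) / len(words)                 # unique words / total words
    m5 = sum(1 for c in counts.values() if c == 1) / len(words)  # once / total
    return m1, m2, m3, m4, m5
```

For example, `analyze("Hello world. Hello there, friend.")` sees two sentences, three phrases, and five words, giving m1 = 2.5 and m2 = 1.5.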

Part II: Sentiment Analysis

In this part, the Google Cloud Natural Language client library is used to detect the sentiment[^4] of each sentence in the given document. Using Google's Natural Language AI system, we obtain a sentiment score and a sentiment magnitude. A sentence is positive if the sentiment score is ≥ 0.3 and negative if the sentiment score is ≤ -0.3.
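The thresholds above can be expressed as a small helper. The API call sketched in the comment follows the `google-cloud-language` client library's request style and requires configured credentials; the threshold function itself is pure and runs anywhere.

```python
# With credentials configured, a score could be obtained roughly like this
# (google-cloud-language client library):
#
#   from google.cloud import language_v1
#   client = language_v1.LanguageServiceClient()
#   doc = language_v1.Document(content=sentence,
#                              type_=language_v1.Document.Type.PLAIN_TEXT)
#   score = client.analyze_sentiment(
#       request={"document": doc}).document_sentiment.score

def classify_sentiment(score):
    """Label a sentence by its sentiment score, using the thresholds above."""
    if score >= 0.3:
        return "positive"
    if score <= -0.3:
        return "negative"
    return "neutral"
```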

Part III: Document Similarity Analysis

We now have all the required metrics to group different documents. The similarity of any two documents is determined by their document divergence, and similar documents are merged into groups using union-find. To determine the document divergence, we first compute the Jensen-Shannon divergence:


$$\delta_{JS}(P \,\|\, Q) = \frac{1}{2}\sum_{i} p_i \log_2\!\frac{2p_i}{p_i + q_i} \;+\; \frac{1}{2}\sum_{i} q_i \log_2\!\frac{2q_i}{p_i + q_i}$$

In our context, we define $p_i$ as the probability of word $\omega_i$ appearing in document $P$ and $q_i$ as the probability of word $\omega_i$ appearing in document $Q$. The summation runs over all words that appear in either document; if $\omega_i$ appears in $P$ but not in $Q$, then $q_i = 0$. Further, by definition, $\sum_i p_i = 1$ and $\sum_i q_i = 1$. The Jensen-Shannon divergence uses the frequencies with which words appear to determine how divergent two documents are.
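The divergence can be computed directly from word-probability dictionaries. A sketch (not the repository's code), where a word missing from a document has probability 0:

```python
from math import log2

def jensen_shannon(p, q):
    """Jensen-Shannon divergence of two word-probability distributions.

    p and q map words to probabilities; the sum runs over the union of
    the words in both documents, and a missing word contributes 0.
    """
    total = 0.0
    for w in set(p) | set(q):
        pi, qi = p.get(w, 0.0), q.get(w, 0.0)
        mi = (pi + qi) / 2
        if pi > 0:
            total += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * log2(qi / mi)
    return total
```

Identical distributions give a divergence of 0, and distributions with no words in common give the maximum of 1 (using log base 2).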

We will now calculate the group divergence of any two groups of documents using the following definition:

$$\delta_{\text{group}}(G_1, G_2) = \frac{1}{|G_1|\,|G_2|}\sum_{P \in G_1}\sum_{Q \in G_2}\delta_{JS}(P, Q)$$

Note that $\delta_{JS}$ is the Jensen-Shannon divergence. The larger the divergence, the more dissimilar the two groups are.
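The union-find step can be sketched as follows. The merge rule here is an illustrative assumption: two documents are placed in the same group when their pairwise divergence falls below a chosen threshold, with both the threshold and the precomputed divergence matrix supplied by the caller.

```python
class UnionFind:
    """Minimal union-find for grouping similar documents (illustrative)."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def group_documents(divergence, n, threshold):
    """Merge documents i, j whenever divergence[i][j] < threshold,
    then return the resulting groups as sorted lists of indices."""
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            if divergence[i][j] < threshold:
                uf.union(i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())
```

For three documents where only documents 0 and 1 are similar, this yields the grouping `[[0, 1], [2]]`.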

Footnotes

[^1]: We consider a sentence to be a sequence of characters that is terminated by the characters `!`, `?`, `.` or EOF, excludes whitespace on either end, and is not empty.

[^2]: A word is a non-empty token that is not made up entirely of punctuation. If a token begins or ends with punctuation, a word can be obtained by removing the leading and trailing punctuation. Specifically, the start of a word should not contain any of `` ! " $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ``. The end of a word should not include any of `` ! # " $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ``. Hyphenated words are considered to be one word. Words are allowed to start with hashtags `#`.

[^3]: A phrase is a non-empty (empty = empty string or whitespace only) part of a sentence that is separated from other phrases by commas, colons, and semicolons.

[^4]: It is important to note that the Natural Language API distinguishes positive from negative emotion in a document, but does not identify specific positive and negative emotions. For example, "angry" and "sad" are both considered negative emotions; when the Natural Language API analyzes text that is considered "angry" or "sad", the response indicates only that the sentiment in the text is negative, not "sad" or "angry".
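The sentence and word rules in footnotes 1 and 2 can be sketched as below. This is an illustrative approximation, not the repository's actual implementation:

```python
import re

# Punctuation that may not start or end a word, per footnote 2.
# Note '#' is absent from the start set, so hashtags survive.
START_PUNCT = "!\"$%&'()*+,-./:;<=>?@[\\]^_`{|}"
END_PUNCT = "!#\"$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def split_sentences(text):
    """Split on the terminators ! ? . (or end of input), trim surrounding
    whitespace, and drop empty pieces, per footnote 1."""
    return [s.strip() for s in re.split(r"[!?.]", text) if s.strip()]

def clean_word(token):
    """Strip leading and trailing punctuation from a token, per footnote 2.
    Internal hyphens are preserved; a token made up entirely of punctuation
    cleans to the empty string."""
    return token.lstrip(START_PUNCT).rstrip(END_PUNCT)
```

For example, `clean_word("(well-known)")` keeps the internal hyphen and yields `well-known`, while `clean_word("#tag,")` keeps the leading hashtag and yields `#tag`.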
