Text Identifier Project

A Python project that implements an algorithm for identifying the author of a passage by comparing it with other texts.

General info

Statistical models of text are one way to quantify how similar one piece of text is to another. In the summer of 2013, such models were used as evidence that the book The Cuckoo’s Calling was written by J. K. Rowling under the name Robert Galbraith. That story interested me and motivated me to create such an algorithm on my own. Techniques for identifying the author of a text include analyzing patterns in word choice, word and sentence lengths, the frequency of unique words, and others.

Attributes used in the algorithm:

Unique words
Word lengths
Unique stems
Sentence lengths
Use of special characters
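
The attributes above can each be represented as a frequency count. The following is a minimal sketch of that idea, not the project's actual code: the function names `extract_features` and `simple_stem`, the suffix list, and the sentence-splitting rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def simple_stem(word):
    """Very rough stemmer (an assumption, not the project's own rules)."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip a suffix if at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Return one frequency Counter per attribute in the list above."""
    # Split into sentences after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercase words; apostrophes are kept inside words.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "words": Counter(words),
        "word_lengths": Counter(len(w) for w in words),
        "stems": Counter(simple_stem(w) for w in words),
        "sentence_lengths": Counter(len(s.split()) for s in sentences),
        "special_chars": Counter(
            c for c in text if not c.isalnum() and not c.isspace()
        ),
    }


features = extract_features("The cat sat. The cat was sitting quietly!")
print(features["words"].most_common(2))
print(features["sentence_lengths"])
```

Keeping every attribute as a `Counter` means the same comparison routine can be applied to all five of them.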

Testing the Algorithm

To test my algorithm, I used Stephen King’s works and The Fault in Our Stars by John Green as the source bodies of text. The new texts to classify were:

Canterbury Tales by Geoffrey Chaucer,
The Running Man (a book written by Stephen King under a pseudonym),
a paper from the BU WR100 Journal titled “Sylvia Plath: The Dialogue Between Poetry and Painting”,
The Hunger Games by Suzanne Collins

In the end, the first two were more similar to Stephen King, whereas the last two resembled The Fault in Our Stars. I expected The Running Man to be similar to Stephen King’s work because he wrote it, just under a pseudonym. I also expected Canterbury Tales to be closer to Stephen King’s work because it is an old piece of literature and Stephen King writes more formally than John Green. In the same way, I expected both the WR100 paper and The Hunger Games to resemble The Fault in Our Stars because they are more modern and casual in style.

Even though the algorithm is not 100% accurate and did not pick up on themes in the writing as much as I thought it would, it worked every time I compared a piece of work to something written by the same author. It could be more accurate if it compared more features of the texts and if longer texts were used as input, since it would then have more data to work with.

Setup

To quickly check that the code works and see the algorithm for yourself, you only need three short strings to pass into test(), as shown in the screenshots.

To work with larger pieces, you need three bodies of text. Two of them are used as source bodies, so ideally they should be as long as possible to give the program more data to analyze. The code stores them as "source1" and "source2", and each is loaded with a call such as source1.add_file('name_of_file.txt'). The third text, named "new", is the one you are trying to identify. After you have chosen the files, you can run the test and see the conclusion.

Code Examples

Example of comparing Canterbury Tales to John Green and Stephen King:

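def run_tests():
    # Build one model per known author from a source text file.
    source1 = TextModel('John Green')
    source1.add_file('fault_stars.txt')

    source2 = TextModel('Stephen King')
    source2.add_file('stephen_king.txt')

    # Build a model of the text to identify.
    new1 = TextModel('Canterbury Tales')
    new1.add_file('canterbury.txt')

    # Report which of the two sources the new text resembles more.
    new1.classify(source1, source2)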

Screenshots

Examples of the algorithm running on test strings:

Status

Project is: finished

Contact

Created by Ekaterina Gorbunova - feel free to contact me at eginfo@bu.edu!
