%pprint
Pretty printing has been turned OFF
This notebook documents the NER annotation of my data throughout this project. The data I am starting with are transcripts of the presidential debates from the 2016 US election: the 10 Democratic primary debates, the 12 Republican primary debates, and the general election debates between Hillary Clinton and Donald Trump. The transcripts were taken from UCSB's American Presidency Project. The citations for these transcripts can be found in the README.
- Defining a Tree-Generating Function
- Generating Trees
- Mapping Speaker to Tree
- Gathering Relevant Names
- Creating a Dictionary for NER Linking
- Tagging Missed Entities- Last Names
- NER Linking Part 1
- Pulling in Titles
- Tagging Missed Entities- Titles and First Names
- NER Linking Part 2
import nltk
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import glob
import os
from collections import defaultdict
import matplotlib.pyplot as plt
import re
import pickle

I'm going to create a mapping function that takes the sentence in each row of each data frame and runs nltk's chunking operation on it to get a tree with annotated NEs.
#Import the saved list of data frames I created in secondary_data_processing
import pickle
with open('/Users/Paige/Documents/Data_Science/dataframes_list.p', 'rb') as f:
    dataframes = pickle.load(f)

def get_tree(sent):
    #Tokenize, POS-tag, and NE-chunk the text. Each row holds one sentence, so only the first tree is returned.
    sents = nltk.sent_tokenize(sent)
    words = [nltk.word_tokenize(s) for s in sents]
    pos = [nltk.pos_tag(w) for w in words]
    chunk = nltk.ne_chunk_sents(pos)
    return list(chunk)[0]

master_df = pd.concat(dataframes)
master_df = master_df.reset_index(drop=True)
master_df.head()

| | Date | Debate Type | Speaker | Sents |
|---|---|---|---|---|
| 0 | 1-14-16 | primary_rep | CAVUTO | It is 9:00 p.m. here at the North Charleston ... |
| 1 | 1-14-16 | primary_rep | CAVUTO | Welcome to the sixth Republican presidential o... |
| 2 | 1-14-16 | primary_rep | CAVUTO | I'm Neil Cavuto, alongside my friend and co-mo... |
| 3 | 1-14-16 | primary_rep | BARTIROMO | Tonight we are working with Facebook to ask t... |
| 4 | 1-14-16 | primary_rep | BARTIROMO | And according to Facebook, the U.S. election h... |
master_df.tail()

| | Date | Debate Type | Speaker | Sents |
|---|---|---|---|---|
| 37054 | 9-26-16 | general | HOLT | The conversation will continue. |
| 37055 | 9-26-16 | general | HOLT | A reminder. |
| 37056 | 9-26-16 | general | HOLT | The vice presidential debate is scheduled for ... |
| 37057 | 9-26-16 | general | HOLT | My thanks to Hillary Clinton and to Donald Tru... |
| 37058 | 9-26-16 | general | HOLT | Good night, everyone. |
master_df['Tree'] = master_df.Sents.map(get_tree)
master_df.head()

| | Date | Debate Type | Speaker | Sents | Tree |
|---|---|---|---|---|---|
| 0 | 1-14-16 | primary_rep | CAVUTO | It is 9:00 p.m. here at the North Charleston ... | [(It, PRP), (is, VBZ), (9:00, CD), (p.m., NN),... |
| 1 | 1-14-16 | primary_rep | CAVUTO | Welcome to the sixth Republican presidential o... | [(Welcome, VB), (to, TO), (the, DT), (sixth, J... |
| 2 | 1-14-16 | primary_rep | CAVUTO | I'm Neil Cavuto, alongside my friend and co-mo... | [(I, PRP), ('m, VBP), [(Neil, JJ), (Cavuto, NN... |
| 3 | 1-14-16 | primary_rep | BARTIROMO | Tonight we are working with Facebook to ask t... | [(Tonight, NN), (we, PRP), (are, VBP), (workin... |
| 4 | 1-14-16 | primary_rep | BARTIROMO | And according to Facebook, the U.S. election h... | [(And, CC), (according, VBG), (to, TO), [(Face... |
I've created an NER tree! Notice that nltk's chunker pulled out Neil Cavuto and Maria Bartiromo as people. Now I want to change the S label at the top of each tree to show who said the utterance, using the information in the Speaker column.
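To make the relabeling idea concrete, here is a minimal sketch on a single hand-built tree. The tree and its POS tags are made up for illustration (they are not real chunker output), and `speaker` stands in for one value from the Speaker column.

```python
from nltk.tree import Tree

# Hand-built stand-in for one chunked sentence; the POS tags are illustrative only.
tree = Tree('S', [
    ('I', 'PRP'),
    ("'m", 'VBP'),
    Tree('PERSON', [('Neil', 'NNP'), ('Cavuto', 'NNP')]),
])

speaker = 'CAVUTO'       # assumed value taken from the Speaker column
tree.set_label(speaker)  # replaces the default root label 'S'

print(tree.label())      # root label is now the speaker
print(tree[2].label())   # subtree labels are left untouched
```

Only the root label changes, so the NE subtrees keep whatever label the chunker gave them.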
master_df.iloc[2]['Tree']

#Use the Speaker column to relabel the root of each tree in the Tree column.
for tree, speaker in zip(master_df['Tree'], master_df['Speaker']):
    tree.set_label(speaker)

master_df.iloc[2]['Tree']
master_df.iloc[2]['Speaker']
'CAVUTO'
#Uh oh. The good news is the chunker got Trump's title, Businessman. The bad news is that it's separated from the rest of his name. I'll have to fix that.
master_df.iloc[8]['Tree']
master_df.iloc[8]['Speaker']
'CAVUTO'
#Again, it got Hillary Clinton and Martin O'Malley but missed Secretary and Governor. It DID get Senator Bernie Sanders, though.
master_df.iloc[1603]['Tree']
master_df.iloc[1603]['Speaker']
'HOLT'
#This section is with help from a datacamp tutorial
#https://campus.datacamp.com/courses/natural-language-processing-fundamentals-in-python/named-entity-recognition?ex=3
ner_categories = defaultdict(int)
# Create the nested for loop
for tree in master_df['Tree']:
    for chunk in tree:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())
# Create a list of the values: values
values = [ner_categories.get(l) for l in labels]
# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)
# Display the chart
plt.show()

people = []
for tree in master_df['Tree']:
    for chunk in tree:
        if hasattr(chunk, 'label'):
            if chunk.label() == 'PERSON':
                people.append(chunk)

people[1]
people[1].leaves()[0][0]
words = [leaf[0] for leaf in people[1].leaves()]
words
['Neil', 'Cavuto']
people_names = []
for tree in people:
    #Join the tokens of each PERSON chunk into a single name string
    name = ' '.join(str(leaf[0]) for leaf in tree.leaves())
    people_names.append(name)

names = set(people_names)
list(names)[100:200]
['ClINTON', 'Cardinal', 'Nebraska', 'Democrats', 'Daily News', 'Stevens', 'Mark Levin', 'Harvard', 'Collison', 'Schumer', 'Ronald Reagan', 'Saint Peter', 'Mr. Perkins', 'Did', 'Josh', 'John Kennedy', 'Maria Celesta Arrasas', 'America', 'Cuomo', 'Holder', 'Tip', 'Miriam', 'Nikki', 'Merkel', 'Ginger', 'Tamir Rice', 'Tata', 'Alexander Hamilton', 'Jeffrey Sonnenfeld', 'Drake', 'Kerry', 'Bill', 'Kentucky', 'Donald Sussman', 'Hank Paulson', 'Matter', 'Center', 'Lawrence Tribe', 'Paulson', 'Crimea', 'Kenya', 'Trayvon Martin', 'Pol Pot', 'Kasich', 'Carrier', 'Kelly Ayotte', 'Hotline Went', 'Mike Lee', 'Daniel Ortega', 'Ronald', 'Al Gore', 'Carly', 'Hugh Hewitt', 'Amazon', 'Darnell Ishmel', 'Heller', 'Nabela', 'Stephen', 'John Boehner', 'Black', 'Ashley Tofil', 'Shaheen', 'Steve King', 'William', 'Hitler', 'Haley', 'Wendy Sherman', 'Senator Clinton', 'John Pietricone', 'Panetta', 'Hill', 'Apple CEO Tim Cook', 'Mr. Cruz', 'Jane', 'Iowa Caucus', 'Hilary Clinton', 'Patti Solis Doyle', 'Brett', 'Warren Buffett', 'Nevada', 'Nancy Reagan', 'Bibi Netanyahu', 'Wired', 'Delaware', 'Putin', 'Keane', 'Ohio Governor', 'Trust', 'James Bergdahl', 'Doug', 'Jordan', 'Maria Celeste', 'Tax Reform', 'Harvard Law', 'Wolf Blitzer', 'Dan Tuohy', 'Henry Kissinger', 'Pope', 'Sinjar', 'Alzheimer']
len(names)
1134
Notice that a good number of titles are missing from this set.
'Senator' in names
True
'Governor' in names
False
'Mr.' in names
True
'Mrs.' in names
False
'Miss' in names
False
'Doctor' in names
True
'President' in names
False
'Secretary' in names
False
'Sir' in names
True
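The same membership checks can be run in one pass. This is only a sketch: `names` below is a tiny made-up sample (the real set has 1,134 entries), and `titles_to_check` is a hypothetical helper list holding the titles tested above.

```python
# Small hypothetical sample; in the notebook `names` is the full set of chunk strings.
names = {'Senator', 'Mr.', 'Doctor', 'Sir', 'Neil Cavuto', 'Senator Clinton'}

# The titles checked one by one above, gathered into a single list.
titles_to_check = ['Senator', 'Governor', 'Mr.', 'Mrs.', 'Miss',
                   'Doctor', 'President', 'Secretary', 'Sir']

# Split the titles into those the chunker produced as stand-alone chunks and those it did not.
found = [t for t in titles_to_check if t in names]
missing = [t for t in titles_to_check if t not in names]

print('found:  ', found)
print('missing:', missing)
```

Note that the test is for an exact match, so 'Senator Clinton' being in the set says nothing about whether 'Senator' is.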
I'm going to go through all of these tags by hand, and link them to who they are referring to. I will create a dictionary of NEs. The expression that was used will be the key, and the person it refers to will be the value. I think copying and pasting this set and deleting things that obviously are not people by hand first will speed things up.
#f = open('/Users/Paige/Documents/Data_Science/names.txt', 'w')
#for name in names:
#    f.write(str(name)+'\n')
#f.close()

I'm going to read in the text file I made and turn it into the dictionary described above. In this linked.txt file, I kept only the most relevant people: leaders of countries, all of the candidates, all of the moderators, and people involved in the events discussed during the debates. If the chunker only pulled out a title like 'Madame', it was tagged as TITLE to be resolved later. If the title was part of the chunk, it was included and mapped to whoever it referred to. Titles that could clearly refer to only one of the relevant people, like Secretary, were mapped to that entity. Other "names" like Lady Liberty and Mr. Average were mapped to NICKNAME unless I knew right away who was being referred to.
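Because linked.txt is edited by hand, it may be worth checking the lines while reading them. The sketch below assumes the `expression;entity` format shown in this notebook's output. `parse_links` is a hypothetical helper, not part of the project, and the `io.StringIO` object stands in for the real file.

```python
import io

# Stand-in for linked.txt; the entries are copied from output shown in this notebook.
sample = io.StringIO(
    "Sean;Sean Hannity\n"
    "Hill;Hillary Clinton\n"
    "Senator;TITLE\n"
    "Martin;Martin O'Malley\n"
    "Martin;Martin O'Malley\n"
    "\n"
)

def parse_links(handle):
    """Build {expression: entity} and collect lines that look wrong."""
    link_dict, problems = {}, []
    for lineno, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        parts = line.split(';')
        if len(parts) != 2:
            problems.append((lineno, line))   # not exactly one ';'
            continue
        expression, entity = parts
        # A repeated expression is fine only if it maps to the same entity.
        if link_dict.get(expression, entity) != entity:
            problems.append((lineno, line))
            continue
        link_dict[expression] = entity
    return link_dict, problems

link_dict, problems = parse_links(sample)
print(link_dict)
print(problems)
```

Harmless duplicates such as the repeated `Martin` line pass silently, while a line mapping the same expression to two different people would be reported instead of silently overwriting the first mapping.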
with open('/Users/Paige/Documents/Data_Science/2016-Election-Project/data/Lists/linked.txt') as f:
    name_link = f.readlines()

#I considered 384 NEs to be relevant to the debates.
len(name_link)
384
name_link[:20]
['Sean Hannity;Sean Hannity\n', 'Sean;Sean Hannity\n', 'Hannity;Sean Hannity\n', 'Jake Tapper;Jake Tapper\n', 'Tapper;Jake Tapper\n', 'Jake;Jake Tapper\n', 'Florida Senator;Marco Rubio\n', 'Hill;Hillary Clinton\n', 'Bibi Netanyahu;Benjamin Netanyahu\n', 'Mr. Cruz;Ted Cruz\n', 'Ted;Ted Cruz\n', 'Cruz;Ted Cruz\n', 'Bret;Bret Baier\n', 'Baier;Bret Baier\n', 'John Quincy Adams;John Quincy Adams\n', 'Yasser Arafat;Yasser Arafat\n', 'Rubio;Marco Rubio\n', 'Marco;Marco Rubio\n', 'Bobby Jindal;Bobby Jindal\n', 'Andrea;Andrea Mitchell\n']
links = [x.strip().split(';') for x in name_link]
links[30:50]
[['Giuliani', 'Rudy Giuliani'], ['Rudy', 'Rudy Giuliani'], ['Miss Piggy', 'Alicia Machado'], ['Snowden', 'Edward Snowden'], ['Malley', "Martin O'Malley"], ["O'Malley", "Martin O'Malley"], ['Martin', "Martin O'Malley"], ['Martha', 'Martha Raddatz'], ['Raddatz', 'Martha Raddatz'], ['Dana', 'Dana Bash'], ['Hillary Rodham', 'Hillary Clinton'], ['John Mccain', 'John McCain'], ['McCain', 'John McCain'], ['Senator Lindsey Graham', 'Lindsey Graham'], ['Lindsey', 'Lindsey Graham'], ['Graham', 'Lindsey Graham'], ['Barbara Bush', 'Barbara Bush'], ['Lyndon Johnson', 'Lyndon Johnson'], ['Martin', "Martin O'Malley"], ['Senator Webb', 'Jim Webb']]
link_dict = {x[0]:x[1] for x in links}
with open('/Users/Paige/Documents/Data_Science/link_dict.pkl', 'wb') as f:
    pickle.dump(link_dict, f, -1)

#This is the set of all of the relevant people who were referred to in the debates, plus the TITLE and NICKNAME placeholders.
set(link_dict.values())
{'Megyn Kelly', 'TITLE', 'Dana Bash', 'Pope Francis', 'Chelsea Clinton', 'Deborah Wasserman Schultz', 'Antonin Scalia', 'Hillary Clinton', 'Omran Daqneesh', 'Major Garrett', 'Yasser Arafat', 'George Washington', 'Freddie Gray', 'Nancy Pelosi', 'Paul Ryan', 'Ben Carson', 'Barbara Bush', 'Merrick Garland', 'John Podesta', 'Michael Bloomberg', 'Marco Rubio', 'Abraham Lincoln', 'Abigail Adams', 'Sandra Bland', 'Sean Hannity', 'Lyndon Johnson', 'Humayun Khan', 'Donald Trump', 'Andrea Mitchell', 'Michael Brown', 'Calvin Coolidge', 'Neil Cavuto', 'Martha Raddatz', 'Rand Paul', 'Bernie Sanders', 'Bashar al-Assad', 'John Kasich', 'Theodore Roosevelt', 'Michael Flynn', 'Ivanka Trump', 'Mitch McConnell', 'NICKNAME', 'David Muir', 'George H. W. Bush', 'Nikki Haley', 'Rosa Parks', 'Joseph Stalin', 'Rudy Giuliani', 'John Adams', 'Adolf Hitler', 'Thomas Jefferson', "Tip O'Neill", 'Jeb Bush', 'Ashraf Ghani', 'Lindsey Graham', 'Lester Holt', "Martin O'Malley", 'Andrew Cuomo', 'Kim Davis', 'Rachel Maddow', 'Mark Zuckerburg', 'Ken Bone', 'Chris Cuomo', 'Ted Cruz', 'Ronald Reagan', 'George Bush', 'Kimberley Strassel', 'John Kennedy', 'Rush Limbaugh', 'Abdullah', 'Dwight Eisenhower', 'Winston Churchill', 'Tamir Rice', 'Benjamin Netanyahu', 'Alexander Hamilton', 'Lincoln Chafee', "Rosie O'Donnell", 'Trayvon Martin', 'Kim Jong Un', 'Hugh Hewitt', "Katie O'Malley", 'Chris Christie', 'Al Gore', 'Jake Tapper', 'John Boehner', 'Angela Merkel', 'Scott Walker', 'James Comey', 'Mitt Romney', 'Maria Bartiromo', 'Senator Clinton', 'Anderson Cooper', 'Kim Jong-Un', 'Osama bin Laden', 'Woodrow Wilson', 'Benjamin Franklin', 'Nancy Reagan', 'Saddam Hussein', 'Wolf Blitzer', 'Chuck Todd', 'Carly Fiorina', 'Frederick Douglas', 'Carl Quintanilla', 'Don Lemon', 'Dylann Roof', 'Chuck Schumer', 'Chris Wallace', 'Joe Biden', 'Jorge Ramos', 'James Carter', 'Bobby Jindal', 'Elizabeth Warren', 'Mike Huckabee', 'Bret Baier', 'Jeff Sessions', 'Eric Trump', 'George W. Bush', 'John Quincy Adams', 'Eric Garner', 'John Kerry', 'Edward Snowden', 'Richard Nixon', 'Abdel Fattah el-Sisi', 'John McCain', 'Hosni Mubarak', 'Rick Santorum', 'Jim Webb', 'Harry Truman', 'Maria Celeste Arraras', 'Vladimir Putin', 'Joseph Mattis', 'Sonia Sotomayor', 'Maria Elena Salinas', 'Alicia Machado', 'Nelson Mandela', 'Muammar Gaddafi', 'Franklin D. Roosevelt', 'Joe Arpaio', 'Michelle Obama', 'David Duke', 'Barack Obama', 'Fidel Castro', 'Bill Clinton'}
#Here are some of the ways those people were referred to.
list(link_dict.keys())[:40]
['Sean Hannity', 'Sean', 'Hannity', 'Jake Tapper', 'Tapper', 'Jake', 'Florida Senator', 'Hill', 'Bibi Netanyahu', 'Mr. Cruz', 'Ted', 'Cruz', 'Bret', 'Baier', 'John Quincy Adams', 'Yasser Arafat', 'Rubio', 'Marco', 'Bobby Jindal', 'Andrea', 'Ohio Governor', 'Deborah Wasserman Schultz', 'Ted Cruz', 'Barak Obama America', 'Jim', 'Webb', 'Shultz', 'Fiorina', 'Carly', 'Mayor Giuliani', 'Giuliani', 'Rudy', 'Miss Piggy', 'Snowden', 'Malley', "O'Malley", 'Martin', 'Martha', 'Raddatz', 'Dana']
link_dict['Ohio Governor']
'John Kasich'
link_dict['Secretary']
'Hillary Clinton'
link_dict['Senator']
'TITLE'
link_dict['Andrea']
'Andrea Mitchell'
link_dict['Senator Webb']
'Jim Webb'
link_dict['Hilary Clinton']
'Hillary Clinton'
link_dict['Hillary Clinton']
'Hillary Clinton'
link_dict['Senator Bernie Sanders']
'Bernie Sanders'
link_dict['Christie']
'Chris Christie'
link_dict['Mr. Trump']
'Donald Trump'
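Some lookups return a placeholder (TITLE or NICKNAME) rather than a person, and an expression that is not in the dictionary raises a `KeyError`. A small wrapper can make both cases explicit. This is a sketch with a hypothetical `resolve` helper and a four-entry excerpt of `link_dict`, not code from the project.

```python
# Placeholder labels used in linked.txt for expressions that are not yet resolved.
PLACEHOLDERS = {'TITLE', 'NICKNAME'}

# Four entries copied from the lookups above; the real dictionary is much larger.
link_dict = {
    'Ohio Governor': 'John Kasich',
    'Secretary': 'Hillary Clinton',
    'Senator': 'TITLE',
    'Mr. Trump': 'Donald Trump',
}

def resolve(expression, links):
    """Return the linked person, or None if the expression is unknown or only a placeholder."""
    entity = links.get(expression)
    if entity is None or entity in PLACEHOLDERS:
        return None
    return entity

print(resolve('Ohio Governor', link_dict))  # a resolved person
print(resolve('Senator', link_dict))        # placeholder, so None
print(resolve('Lady Liberty', link_dict))   # not in this excerpt, so None
```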
I also want to find REs that the chunker missed completely. To tag them, I am first going to look for relevant last names and tag those as the respective person. Then, I am going to repeat the process I used above, pulling titles and first names into the new subtrees I just created.
last_names = list(set(link_dict.values()))
last_names = [x.split() for x in last_names]
#Most of the relevant people have a two-token name (exceptions include one of the moderators, Maria Celeste Arraras,
#and Deborah Wasserman Schultz), so I'm just going to look for a single token, the last name.
last_names = [x[-1] for x in last_names]
last_names = set(last_names)
last_names
{'Baier', 'TITLE', 'Huckabee', 'Quintanilla', 'Gore', 'Bone', 'Martin', 'Salinas', 'Jefferson', 'Schultz', 'Brown', 'Scalia', 'Netanyahu', 'Laden', 'Jong-Un', 'Bartiromo', 'Roosevelt', 'Strassel', 'Mattis', 'Pelosi', 'Hussein', 'Rice', 'Romney', 'Hannity', 'NICKNAME', 'Zuckerburg', 'Santorum', 'McCain', 'Boehner', 'Khan', 'Chafee', 'Trump', 'Arafat', 'Ryan', 'Carson', 'Biden', 'Garland', "O'Donnell", "O'Malley", 'Bash', 'Muir', 'Gray', 'Davis', 'Graham', 'Schumer', 'Cavuto', 'Ghani', 'Cuomo', 'Bush', 'Abdullah', 'Merkel', 'Clinton', 'Adams', 'Walker', 'Arpaio', 'Snowden', 'Washington', 'Kerry', 'Carter', 'Podesta', 'Holt', 'Kasich', 'al-Assad', "O'Neill", 'Ramos', 'Warren', 'Sotomayor', 'Wilson', 'Francis', 'Limbaugh', 'Truman', 'Hitler', 'Haley', 'Fiorina', 'Daqneesh', 'Wallace', 'Stalin', 'Raddatz', 'Coolidge', 'Jindal', 'Franklin', 'Rubio', 'Bland', 'Sessions', 'Blitzer', 'Cooper', 'Cruz', 'Putin', 'Un', 'Arraras', 'Garrett', 'Nixon', 'Lincoln', 'Webb', 'Comey', 'Christie', 'Flynn', 'Parks', 'Machado', 'Eisenhower', 'Churchill', 'Gaddafi', 'Mandela', 'Mubarak', 'Garner', 'Hewitt', 'Lemon', 'Kennedy', 'el-Sisi', 'Bloomberg', 'Kelly', 'Johnson', 'Mitchell', 'Sanders', 'McConnell', 'Giuliani', 'Tapper', 'Maddow', 'Duke', 'Obama', 'Reagan', 'Hamilton', 'Paul', 'Todd', 'Roof', 'Douglas', 'Castro'}
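The set above also contains the placeholder labels TITLE and NICKNAME, because they are values in `link_dict` too. If that is not wanted, one option is to filter them out while taking the final token. This is a sketch on a small hypothetical sample of entity names, not a change to the notebook's own variables.

```python
# Placeholder labels that appear among the dictionary values but are not people.
PLACEHOLDERS = {'TITLE', 'NICKNAME'}

# Hypothetical sample of values from link_dict, including names with one, two, and more tokens.
entities = {'Hillary Clinton', 'Maria Celeste Arraras', 'George H. W. Bush',
            'Abdullah', 'TITLE', 'NICKNAME'}

# Take the final whitespace-separated token of every real name.
last_names = {name.split()[-1] for name in entities if name not in PLACEHOLDERS}

print(sorted(last_names))
```

Taking the final token works for multi-token names as well, although a single-token name such as Abdullah ends up as its own "last name".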
#This loop looks for REs that should have been tagged as stand alone REs, but were missed
for tree in master_df['Tree']:
    for i, t in enumerate(tree):
        #Untagged tokens are (word, POS) tuples; wrap a matching one in a PERSON subtree in place
        if isinstance(t, tuple) and t[0] in last_names:
            tree[i] = nltk.tree.Tree('PERSON', [t])

Now I need to use this dictionary to label all of the named entities in my NER trees with who the NE is referring to instead of just PERSON or GPE, etc.
def name_linking(tree):
    for chunk in tree:
        #Look for relevant names with ANY label. Maybe "Hillary Clinton" was mistakenly tagged as a GPE
        if hasattr(chunk, 'label'):
            name = ' '.join(str(leaf[0]) for leaf in chunk.leaves())
            if name in link_dict:
                chunk.set_label(link_dict[name])
    return tree

master_df['Tree'] = master_df.Tree.map(name_linking)
master_df.head()

| | Date | Debate Type | Speaker | Sents | Tree |
|---|---|---|---|---|---|
| 0 | 1-14-16 | primary_rep | CAVUTO | It is 9:00 p.m. here at the North Charleston ... | [(It, PRP), (is, VBZ), (9:00, CD), (p.m., NN),... |
| 1 | 1-14-16 | primary_rep | CAVUTO | Welcome to the sixth Republican presidential o... | [(Welcome, VB), (to, TO), (the, DT), (sixth, J... |
| 2 | 1-14-16 | primary_rep | CAVUTO | I'm Neil Cavuto, alongside my friend and co-mo... | [(I, PRP), ('m, VBP), [(Neil, JJ), (Cavuto, NN... |
| 3 | 1-14-16 | primary_rep | BARTIROMO | Tonight we are working with Facebook to ask t... | [(Tonight, NN), (we, PRP), (are, VBP), (workin... |
| 4 | 1-14-16 | primary_rep | BARTIROMO | And according to Facebook, the U.S. election h... | [(And, CC), (according, VBG), (to, TO), [(Face... |
master_df.iloc[8]['Tree']
master_df.iloc[9]['Tree']
master_df.iloc[1509]['Tree']
master_df.iloc[1603]['Tree']

Next, I'm going to fix up some of the tagging to include titles and any first names that might have been missed. NLTK's NE chunker is supposed to remove titles like Mr., Senator, Mrs., etc., and those are the very things I'm looking for! Luckily, it doesn't always do this well, so some of those titles are already included in the tagged trees, but I'm going to go through and try to add back the missing ones. First, I'm going to create a list of titles and first names. Then, I'm going to cycle through all of the trees that refer to people, look at the word preceding the tagged chunk, and if that word is a title, pull it into the tagged chunk.
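Here is a minimal sketch of the "pull the preceding title into the chunk" step on one hand-built tree. The tree, its POS tags, the three-item `titles` set, and the `pull_in_titles` helper are all made up for illustration. It handles only the simplest case, an untagged title token directly before an already-linked chunk.

```python
from nltk.tree import Tree

titles = {'Secretary', 'Governor', 'Mr.'}   # small illustrative subset

def pull_in_titles(tree, titles):
    """Move an untagged title token into the labelled chunk that follows it."""
    i = 0
    while i < len(tree):
        node = tree[i]
        if isinstance(node, Tree) and i > 0:
            prev = tree[i - 1]
            # Untagged tokens are (word, POS) tuples; subtrees are Tree objects.
            if isinstance(prev, tuple) and prev[0] in titles:
                node.insert(0, prev)   # title becomes the chunk's first leaf
                del tree[i - 1]        # remove it from the sentence level
                continue               # the chunk moved to i - 1, so i is already the next node
        i += 1
    return tree

# Hand-built stand-in for a linked tree; POS tags are illustrative only.
sentence = Tree('HOLT', [
    ('Thank', 'VB'),
    ('you', 'PRP'),
    ('Secretary', 'NNP'),
    Tree('Hillary Clinton', [('Clinton', 'NNP')]),
])

result = pull_in_titles(sentence, titles)
print(result)
```

An explicit index with a `while` loop is used here because the tree shrinks while it is being walked, and the `i > 0` guard keeps a chunk at the start of a sentence from looking at `tree[-1]`, the last token.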
#Here, I'm creating a list of first names of all of the relevant people in the corpora so I can pull these first names
#into the correct label if they were mistakenly left untagged.
first_names = list(set(link_dict.values()))
first_names = [x.split() for x in first_names]
first_names = [x[0] for x in first_names]
first_names = set(first_names)

titles = ['Mr.', 'Mister', 'Lady', 'Speaker', 'Mrs.', 'Miss', 'Madam', 'Sir', 'President', 'Senator', 'Governor', 'Secretary', 'Congressman', 'Dr.', 'Doctor', 'Sheriff', 'Chairman']
titles.extend(first_names)
#The following is a list of ways you can refer to a person that might not be followed by a name. I'm going to look for any
#of these that were missed as well.
standalone_res = ['Secretary', 'Governor', 'Congressman', 'Senator', 'Sir', 'Madam', 'Doctor', 'Dr.']
#Need a list of last names and a list of first names
titles
['Mr.', 'Mister', 'Lady', 'Speaker', 'Mrs.', 'Miss', 'Madam', 'Sir', 'President', 'Senator', 'Governor', 'Secretary', 'Congressman', 'Dr.', 'Doctor', 'Sheriff', 'Chairman', 'Omran', 'TITLE', 'Megyn', 'Joe', 'Tamir', 'Mitch', 'Ted', 'Ken', 'Bernie', 'Jeff', 'Martin', 'Alicia', 'Wolf', 'Jeb', 'Alexander', 'Bashar', 'David', 'Trayvon', 'Fidel', 'Chuck', 'Jorge', 'Ben', 'Abigail', 'Neil', 'Rosie', 'Dwight', 'Jim', 'Angela', 'Don', 'Sonia', 'Kimberley', 'Saddam', 'Yasser', 'NICKNAME', 'Senator', 'Bret', 'Mitt', 'Andrew', 'Deborah', 'Lyndon', 'Hillary', 'Edward', 'Richard', 'Freddie', 'Benjamin', 'Donald', 'Sandra', 'Thomas', 'Rick', 'Eric', 'Dana', 'Abdullah', 'Tip', 'Nikki', 'Barack', 'Dylann', 'Michael', 'Sean', 'Anderson', 'Bill', 'Scott', 'Bobby', 'Rush', 'Abdel', 'Ronald', 'Rand', 'Joseph', 'George', 'Carly', 'Katie', 'Vladimir', 'Mark', 'Carl', 'John', 'Marco', 'Calvin', 'Chelsea', 'Rudy', 'Franklin', 'Barbara', 'Chris', 'Humayun', 'Martha', 'Muammar', 'Merrick', 'Michelle', 'Pope', 'Ivanka', 'Lincoln', 'Osama', 'Nelson', 'Mike', 'Kim', 'Woodrow', 'Nancy', 'Antonin', 'Hugh', 'Adolf', 'Elizabeth', 'Rachel', 'Maria', 'Lester', 'Harry', 'Jake', 'Theodore', 'Abraham', 'Hosni', 'Lindsey', 'James', 'Major', 'Winston', 'Frederick', 'Ashraf', 'Rosa', 'Andrea', 'Paul', 'Al']
for tree in master_df['Tree']:
    i = 0
    while i < len(tree):
        node = tree[i]
        #if we find a subtree, and it is a relevant entity, we need to look at the node preceding it
        if isinstance(node, nltk.tree.Tree) and node.label() in link_dict.values():
            #if the node in front of the subtree is another subtree, and it has the same label or it's labelled 'TITLE'
            #we want to pull it in.
            if i > 0 and isinstance(tree[i-1], nltk.tree.Tree) and tree[i-1].label() in (node.label(), 'TITLE'):
                node = nltk.tree.Tree(node.label(), list(tree[i-1]) + list(node))
                tree[i-1:i+1] = [node]
                i -= 1
            #if the token in front of the subtree is an untagged title or first name, pull it in as well
            if i > 0 and isinstance(tree[i-1], tuple) and tree[i-1][0] in titles:
                node.insert(0, tree[i-1])
                del tree[i-1]
                i -= 1
        i += 1

master_df.iloc[1509]['Tree']
master_df.iloc[1603]['Tree']
master_df.iloc[8]['Tree']

Next, I'm going to look for titles or first names that were missed and that stand alone, without a last name following them, using the same method as above.
master_df.iloc[30537]['Tree']

#This loop looks for REs that should have been tagged as stand alone REs, but were missed
for tree in master_df['Tree']:
    for i, t in enumerate(tree):
        if isinstance(t, tuple) and t[0] in titles:
            tree[i] = nltk.tree.Tree('PERSON', [t])

master_df.iloc[30537]['Tree']

And again, I'm going to change the tag of PERSON to the appropriate entity, and pull in any chunks that belong to one entity but were tagged separately.
master_df['Tree'] = master_df.Tree.map(name_linking)

for tree in master_df['Tree']:
    i = 0
    while i < len(tree):
        node = tree[i]
        #if we find a subtree, and it is a relevant entity, we need to look at the node preceding it
        if isinstance(node, nltk.tree.Tree) and node.label() in link_dict.values():
            #if the node in front of the subtree is another subtree, and it has the same label or it's labelled 'TITLE'
            #we want to pull it in.
            if i > 0 and isinstance(tree[i-1], nltk.tree.Tree) and tree[i-1].label() in (node.label(), 'TITLE'):
                node = nltk.tree.Tree(node.label(), list(tree[i-1]) + list(node))
                tree[i-1:i+1] = [node]
                i -= 1
            #if the token in front of the subtree is an untagged title or first name, pull it in as well
            if i > 0 and isinstance(tree[i-1], tuple) and tree[i-1][0] in titles:
                node.insert(0, tree[i-1])
                del tree[i-1]
                i -= 1
        i += 1

master_df.iloc[30537]['Tree']

Finally, if something was tagged as a PERSON, I've decided to leave that tag as it is instead of untagging it, even if it was mistakenly tagged as a person, because I or another researcher might want to go back in the future and look at those other entities. Correcting all of them would be very time consuming and not particularly relevant to this project. The entities of importance are tagged, and a dictionary of important entities is saved as link_dict.
##Saving master dataframe to a CSV
master_df.to_csv('/Users/Paige/Documents/Data_Science/2016-Election-Project/data/Debates/csv/master.csv')
with open('/Users/Paige/Documents/Data_Science/master_df.pkl', 'wb') as f:
    pickle.dump(master_df, f, -1)
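A quick way to confirm that a pickle was written correctly is to load it straight back and compare. The sketch below does this for a two-entry stand-in dictionary in a temporary directory; the file name and the sample entries are assumptions for illustration. Passing `pickle.HIGHEST_PROTOCOL` is equivalent to the `-1` used above.

```python
import os
import pickle
import tempfile

# Two entries copied from link_dict; a stand-in for the real objects being saved.
link_dict = {'Mr. Cruz': 'Ted Cruz', 'Hill': 'Hillary Clinton'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'link_dict.pkl')   # hypothetical file name

    # Write with the highest available protocol (same effect as passing -1).
    with open(path, 'wb') as f:
        pickle.dump(link_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back to confirm the round trip preserves the data.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == link_dict)
```

Using `with` blocks means the file is closed even if dumping or loading raises an error.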














