- Scraper [MPI]
- Abstract class
- Process [MPI]
- Document similarity filtering
- Bag of words, bigrams
- Remove stopwords / bigrams in which both words are stopwords
- TF-IDF
- Similarity [MRJob]
- Cosine similarity; Jaccard index
- Topic Modeling [MPI and Gensim]
- LSA; LDA
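A rough sketch of the document-representation and similarity steps listed above, in plain Python so it runs without MPI or MRJob. The stopword list, the whitespace tokenizer, and all function names are placeholders for illustration, not the project's actual code, and the TF-IDF weighting shown (raw count times log(N / df)) is only one common variant.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption; the project's real list is not shown).
STOPWORDS = {"the", "of", "a", "in", "and", "is"}


def tokenize(text):
    """Lowercase and split on whitespace; real abstracts need more cleaning."""
    return text.lower().split()


def bag_of_words(tokens):
    """Count unigrams, dropping stopwords."""
    return Counter(t for t in tokens if t not in STOPWORDS)


def bigrams(tokens):
    """Count adjacent word pairs, dropping pairs in which both words are stopwords."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs
                   if not (p[0] in STOPWORDS and p[1] in STOPWORDS))


def tfidf(doc_counts):
    """Weight each document's raw counts by log(N / document frequency)."""
    n_docs = len(doc_counts)
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())
    return [
        {term: tf * math.log(n_docs / float(df[term]))
         for term, tf in counts.items()}
        for counts in doc_counts
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_index(u, v):
    """Size of the shared term set divided by the size of the combined term set."""
    a, b = set(u), set(v)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))
```

The same functions work on bigram counts, since `tfidf` only needs hashable terms; swap `bag_of_words` for `bigrams` when building `doc_counts`.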
- Scraper [masterslave, scattergather on same plots]
- Speedup
- Efficiency
- [n = 1-16, change "PAGES" at top of scraper.py to something more manageable]
- Process functions
- Speedup
- Efficiency
- Topic models [LSA, LDA on same plots]
- Example abstract, with LSA / LDA representation shown
- Finding numtopics with perplexity (holdout set of 10 docs) --> I'll do this later in the week for our website
- Speedup
- Efficiency
- Cluster [4 metrics: tfidfbow, tfidfbigram, lsa, lda; 3 distance: euc, cos, jac]; table format
- Sum of distances
- Purity
- Entropy
- Rand index
- F1 measure
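The speedup and efficiency plots planned above for the scraper, process functions, and topic models could be computed from wall-clock timings roughly as follows. This is a sketch using the usual definitions (speedup = T1 / Tn, efficiency = speedup / n); the function names and the `timings` dictionary layout are assumptions, not part of the existing code.

```python
def speedup(serial_time, parallel_time):
    """Ratio of the 1-process runtime to the n-process runtime."""
    return serial_time / float(parallel_time)


def efficiency(serial_time, parallel_time, n_procs):
    """Speedup divided by the number of processes (1.0 is ideal scaling)."""
    return speedup(serial_time, parallel_time) / float(n_procs)


def scaling_table(timings):
    """Build (n, speedup, efficiency) rows from {n_procs: seconds}.

    The dictionary must contain an entry for n_procs == 1, which is
    used as the serial baseline.
    """
    t1 = timings[1]
    return [
        (n, speedup(t1, t), efficiency(t1, t, n))
        for n, t in sorted(timings.items())
    ]
```

Each row of `scaling_table` gives one point on the speedup plot and one on the efficiency plot, so running it once per implementation (for example masterslave and scattergather) yields the series to draw on the same axes.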
- Motivation
- Data source + scraping
- Screenshot of Web of Science search page
- Screenshot of sample abstract
- Document representations: BoW, Bigrams
- TF-IDF
- Topic modeling: LSA, LDA
- Clustering, evaluation metrics
- Screenshot of shell
- Clustering to evaluate? Measure with purity and entropy
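A minimal sketch of the purity and entropy measures mentioned above, assuming every document has a known reference label (for example a subject category; which label to use is an open choice here) in addition to its assigned cluster id. Function names are illustrative only.

```python
import math
from collections import Counter, defaultdict


def _labels_by_cluster(cluster_ids, true_labels):
    """Group the reference labels of the documents by assigned cluster."""
    clusters = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        clusters[cluster].append(label)
    return clusters


def purity(cluster_ids, true_labels):
    """Fraction of documents that carry their cluster's majority label."""
    clusters = _labels_by_cluster(cluster_ids, true_labels)
    majority = sum(max(Counter(members).values())
                   for members in clusters.values())
    return majority / float(len(true_labels))


def entropy(cluster_ids, true_labels):
    """Size-weighted average of each cluster's label entropy, in bits."""
    clusters = _labels_by_cluster(cluster_ids, true_labels)
    n_docs = float(len(true_labels))
    total = 0.0
    for members in clusters.values():
        size = float(len(members))
        cluster_entropy = -sum(
            (count / size) * math.log(count / size, 2)
            for count in Counter(members).values()
        )
        total += (size / n_docs) * cluster_entropy
    return total
```

Higher purity and lower entropy both indicate clusters that line up better with the reference labels, so the two can be reported side by side for each representation and distance combination.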
- Make sure you're in the git repo directory; install pip, virtualenv, and the list of packages; test it out with yolk.

      sudo easy_install pip
      sudo pip install virtualenv
      virtualenv final [--no-site-packages]
      source final/bin/activate
- On LOCAL MACHINE, clone the repo
- In a RESONANCE NODE or RESONANCE HEADNODE:

      echo "module load packages/epd/7.1-2" >> ~/.bashrc
      # [reload shell?]
      virtualenv final
      source final/bin/activate
- To deactivate a virtualenv:

      deactivate
- Numpy and Scipy: link; you may need to install FORTRAN compilers as noted in the link

      pip install numpy
      pip install scipy
- Yolk to see what packages you have*

      pip install yolk
      yolk -l
- Beautiful Soup for scraping*

      pip install beautifulsoup4
- lxml for faster html parsing

      pip install lxml
- Mechanize for web-navigation*

      pip install mechanize
- Gensim for topic modeling*

      pip install gensim
- nltk for K-means clustering*

      pip install nltk
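As a complement to `yolk -l`, a short Python check run inside the activated virtualenv can confirm that the packages above actually import. This is only a sketch: `check_packages` and `REQUIRED` are names made up for this example, and the list uses import names, which can differ from pip names (beautifulsoup4 is imported as `bs4`).

```python
def check_packages(module_names):
    """Return {module_name: True/False} depending on whether it imports."""
    status = {}
    for name in module_names:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Import names for the packages installed above (assumed, see lead-in).
REQUIRED = ["numpy", "scipy", "bs4", "lxml", "mechanize", "gensim", "nltk"]
```

Calling `check_packages(REQUIRED)` from a Python prompt returns a dictionary; any entry mapped to `False` points at a package that still needs `pip install`.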