| layout | page |
|---|---|
| title | Resources |
There is no official textbook for this course. The following books are for your reference.
- Mining Text Data. Charu C. Aggarwal and ChengXiang Zhai, Springer, 2012
- Speech and Language Processing, 2nd edition. Daniel Jurafsky and James H. Martin, Pearson Education, 2009.
- Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008.
- Statistical Language Models for Information Retrieval. ChengXiang Zhai, Morgan & Claypool Publishers, 2008.
- Foundations of Statistical Natural Language Processing. Christopher D. Manning and Hinrich Schütze, MIT Press, 1999.
- Mining the Web: Analysis of Hypertext and Semi Structured Data (The Morgan Kaufmann Series in Data Management Systems). Soumen Chakrabarti, Morgan Kaufmann, 2002.
- Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Bing Liu, Springer, 2006.
It is useful to see how text mining is taught at other top universities, especially by leading researchers in the field. Here is a list of excellent text mining courses selected by the instructor.
- UIUC: Text Mining and Analytics: Explore algorithms for mining and analyzing big text data to discover interesting patterns, extract useful knowledge, and support decision making, by Dr. ChengXiang Zhai.
- CMU 95-865: Text Analytics: The focus is on algorithms and techniques, but the course also provides an introduction to open-source software tools, by Dr. Jamie Callan.
- Gatech CSE 6240: Web Search and Text Mining: A mix of information retrieval and text mining, by Dr. Alexander Gray.
The following list and comments represent only the instructor's personal opinion.
- KDD: One of the most important and influential conferences in the field of data mining. Proceedings of publications can be found here.
- SIGIR: One of the most important and influential conferences in the field of information retrieval (attracts more attention from academia). Proceedings of publications can be found here.
- WWW: Another of the most important and influential conferences in the IR field (attracts more attention from industry). Proceedings of publications can be found here.
- WSDM: A new but quickly rising conference in the field, attracting attention from both industry and academia. Proceedings of publications can be found here.
- CIKM: A major conference in the field of data mining and information retrieval. Proceedings of publications can be found here.
- ACL: A major conference for computational linguistics research. A digital archive of research papers in computational linguistics is available at the ACL Anthology.
- TOIS: One of the major journals in the fields of information retrieval and data mining.
- If you are interested in rankings or indices of those conferences and journals, you may take a look at Google Scholar's Metrics.
- Lucene: Apache Lucene is a free, open-source information retrieval software library. While suitable for any application that requires full-text indexing and searching, Lucene has been widely recognized for its utility in implementing Internet search engines and local, single-site search.
- MeTA: MeTA is a modern C++ data sciences toolkit developed by the TIMAN group at the University of Illinois. It implements various text mining and machine learning algorithms.
- RankLib (a collection of learning-to-rank algorithms, University of Massachusetts Amherst)
- Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
- Stanford NLP parser (Stanford University NLP group)
- OpenNLP (Apache)
- LingPipe (Java-based)
- NLTK (Python-based)
- Weka: A rich collection of machine learning algorithms, Machine Learning Group at the University of Waikato.
- Mallet: An alternative package to Weka, developed by Andrew McCallum at the University of Massachusetts Amherst.
- LibSVM: A collection of SVMs, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University
- SVM-light: Another collection of SVMs, developed by Thorsten Joachims at Cornell University
- GraphLab: Large-scale machine learning package
- Mahout: Apache large-scale machine learning package
- Spark: A fast and general engine for large-scale data processing.
- Topic Models (David Blei's collection of various topic models)
- Twitter: Twitter is currently open to the public. Twitter streams can be accessed via its APIs, and some crawled Twitter data sets are also available, e.g., the Stanford SNAP Twitter data set and the TREC Microblog collection.
- UCI Machine Learning Repository: a standard machine learning benchmark repository (a bit small and old).
- Yelp Dataset Challenge: A large set of Yelp reviews and entities provided by Yelp. Also, "If you are a student and come up with an appealing project, you’ll have the opportunity to win one of ten Yelp Dataset Challenge awards for $5,000."
- Here are the LaTeX files necessary to write the project report.
- We want everyone to use the same format so we can grade each paper fairly.
- Additionally, LaTeX is a skill we feel you should learn if you haven't already!
- Official LaTeX website: http://www.latex-project.org/
- TeX editors for Windows: WinEdt, LEd
- TeX editors for macOS: Texpad, Latexian
- Please share your favorite TeX editors or integrated solutions with the class via Piazza.