Project Assignments
The point of the Project Assignments is to try out the skills you've learned in the course on your own dataset. The goal is ambitious: we want you to examine your dataset and learn something new. Something new that you've learned is also called a "research finding", which means that you can write a scientific paper about it. (You may be thinking: What!? That's too much!! But not to worry, we will guide you. More on this below.)
The point is simply that we've been working on understanding networks and natural language processing, so the idea is to find a dataset to analyze that will let you show off that you can use these tools to learn about the world.
Here are some example datasets that might be fun to work with.
- Specialized wikis for whatever you're into. E.g.
  - Wookieepedia
  - Game of Thrones Wiki
  - Simpsons Wiki or one of the other Simpsons wikis.
  - Brucebase
  - Etc, etc.
- IMDb (networks of movies connected by actors), reviews of movies, movie durations, etc. For movie-datasets, also check out https://grouplens.org/datasets/movielens/latest/
- Wikileaks
- More examples can be found in Jim Bagrow's list
- And new ideas for datasets are very welcome. You should work with something that interests you - that way the project will be much more fun to work on.
Combining datasets is encouraged. This is a great way to learn new things (e.g. how the weather impacts network structures on Reddit or something).
You will be working together in groups just as for the first two assignments. For allowed group sizes, see this Wiki's Groups page.
- Note: If the network you want to analyze is big (on the order of thousands of nodes), you are more likely to find interesting patterns in your data. If it is small (on the order of hundreds of nodes), we expect you to compensate for the possible lack of complex patterns/structure with other types of analysis, e.g., more in-depth text analysis, deeper characterization of nodes and connections, etc.
The first part of the final project is a 2-minute movie, which should explain the central idea/concept that you will investigate in your final project. You're making the movie so that the TAs and I can give you feedback, and so that other groups can 'steal' your ideas (and you can steal ideas from them). The movie must contain the following:
- An explanation of the central idea behind your final project (What is the idea? Why is it interesting? Which datasets did you need to explore the idea? How did you download them?)
- A walk-through of your preliminary data analysis, addressing
  - What is the total size of your data? (MB, number of pages, other variables, etc.)
  - What is the network you will be analyzing? (Number of nodes? Number of links? Degree distributions? What are the node attributes? Etc.)
  - What is the text you will be analyzing?
  - How will you tie networks and text together in your paper?
But other than that, there are no constraints on the video format. We appreciate funny/inventive/beautiful movies, although the academic content is most important. Note that we will display the movie to the entire class.
I've put some example videos here for your viewing pleasure.
Handing in the assignment: You should upload your videos to vimeo.com (the higher the resolution the better) and submit the link to peergrade (in the assignment called Project A).
You have 2 weeks to create the video presentation.
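For the preliminary network statistics the movie asks for (number of nodes, number of links, degree distribution), here is a minimal sketch in plain Python. The edge list and the function name `network_summary` are made up for illustration; you would substitute the links from your own dataset. It assumes an undirected network with no duplicate links or self-loops, and it does not count isolated nodes, since they never appear in an edge list.

```python
from collections import Counter

# Hypothetical toy edge list; replace with the links from your own dataset.
edges = [
    ("Homer", "Marge"),
    ("Homer", "Moe"),
    ("Homer", "Bart"),
    ("Marge", "Bart"),
    ("Moe", "Barney"),
]


def network_summary(edges):
    """Summarize an undirected edge list: node count, link count, degree distribution."""
    degree = Counter()
    for u, v in edges:
        # Each link adds one to the degree of both of its endpoints.
        degree[u] += 1
        degree[v] += 1
    # Degree distribution: how many nodes have each degree value.
    distribution = Counter(degree.values())
    return {
        "nodes": len(degree),
        "links": len(edges),
        "degree_distribution": dict(sorted(distribution.items())),
    }


summary = network_summary(edges)
print(summary)
```

A graph library such as NetworkX gives you the same numbers with less code; the point here is only to show what the quantities are.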
The deliverables for the Final project will be
- A Paper (`.pdf` format). The paper should contain your analysis; it should tell the story about the data and the research finding.
- An Explainer Notebook (`.ipynb` format). The Notebook should contain all the behind-the-scenes stuff. You should link to the notebook from the paper (in the Methods section).
With the project, we try to push you to go beyond simply running the analyses we did on the rappers and actually use your skills to do some science. By asking you to present a finding, I want you to think carefully about the results of your analyses. What is it that you learn through your analysis? Once you can answer that, you have the first steps toward a real research finding 🤓
It doesn't have to be something super fancy. Maybe your finding is just that the houses in Game of Thrones don't align with the communities in the networks. Then think about why that's the case given what you know about the show, and run further analyses to support and expand your theories. Or maybe you find that the most important character in The Simpsons is Moe the bartender; then you confirm this using additional centrality measures and investigate whether it changes over time across the episodes. Perhaps you have a hypothesis that you'd like to disprove. You can of course also be much more ambitious - I'm giving these examples to provide a sense that it's perhaps not so hard.
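To make the "confirm using additional centrality measures" idea concrete, here is a small sketch that ranks the nodes of a toy network by two different measures, degree centrality and closeness centrality, and checks whether they agree on the top node. The network, the character links, and the helper names are all invented for illustration and are not results from any real dataset. The closeness computation assumes the network is connected and undirected.

```python
from collections import deque

# Hypothetical toy network as an adjacency dict (undirected, connected).
graph = {
    "Homer": {"Marge", "Bart", "Moe"},
    "Marge": {"Homer", "Bart"},
    "Bart": {"Homer", "Marge"},
    "Moe": {"Homer", "Barney", "Lenny", "Carl"},
    "Barney": {"Moe"},
    "Lenny": {"Moe"},
    "Carl": {"Moe"},
}


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path distances to all other nodes."""
    n = len(graph)
    scores = {}
    for source in graph:
        # Breadth-first search gives shortest-path distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        scores[source] = (n - 1) / sum(dist.values())
    return scores


def ranking(scores):
    """Node names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print("Top by degree:   ", ranking(degree)[0])
print("Top by closeness:", ranking(closeness)[0])
```

If two measures agree, that supports the finding; if they disagree, the disagreement is itself something worth explaining in the paper.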
This part of the assignment is quite free. The main point of the paper is to present your idea and analyses to the world in a way that showcases your use of what you've learned in class.
Super important video: Click on the link below to hear me talk about the elements of scientific papers.
Link to video
The paper should have all the elements that are in the templates (links below):
- Abstract
- Significance Statement
- Introduction
- Results
- Discussion
- Methods
- References
In the video, I discuss tips & tricks that will bring success. Also, use the template's author contributions (it is not OK simply to write "All group members contributed equally").
The paper has a maximum length of 5 pages (everything must be within those 5 pages, including the references) and no more than 5 figures. Fewer/shorter is OK, longer is not OK.
Links to the templates:
- [Article template (overleaf)](https://www.overleaf.com/latex/templates/template-for-preparing-your-research-report-submission-to-pnas-using-overleaf/fzcbzjvpvnxn)
- [Article template (MS Word)](https://www.pnas.org/pb-assets/authors/PNASTemplateforMainManuscript-1645821915200.docx)
Make sure that you use references when they're needed and follow academic standards.
TIP: When you have an idea for analysis, do a search to see if someone already studied your dataset - or the question you're interested in. There are lots of cool analyses out there that could be an inspiration. And it's OK to use that stuff, just remember to cite the work that you're drawing on (if you copy without citing, then it's cheating ... don't cheat).
The notebook is where we go to check exactly what you did to get your analysis. The notebook should be a pleasant read: Please structure it nicely with clear headlines of what cells make what figures, etc. Also, put in enough text to ensure that a fellow student taking this class could understand exactly what you are doing in the present cell, and why you are doing it like this.
Please also add the following further details about your data and analysis:
- Write in more detail about your choices in data cleaning and preprocessing.
- If you did analyses or calculated statistics that didn't make it into the main text, put them here, but clearly mark that they did not make it into the main text.
- The main point is to show off what you've learned in the course, so the first thing is to make sure your dataset contains both networks and text.
- It is also important that you do a thorough analysis that shows what you've learned in the class. (And we can only know about this if you use and show key parts of that analysis in the paper.)
- Did you manage to get to a research finding about your dataset? (And not just reproduce the analyses from the lectures on your own dataset)
- All the formal things in the paper-writing video. Good abstract, Intro, etc. Informative figures with thoughtful captions, etc. The right references. Readable text.
- A well-structured and pleasant-to-read explainer notebook.
- Q: May I use methods for analyses we didn't learn in class?
- A: Yes for sure! But remember that the point is to show off what you've learned in the class, so using tools from network science and NLP is essential.
- Q: My network only has 50 nodes, is that enough?
- A: Hmm. That's a tough question to answer. My best guess is "probably not". But if you have interesting temporal information and lots of textual data for your network, perhaps it could be great. If in doubt, talk to me.