Added the first custom crawler documentation files #25
ebenp wants to merge 10 commits into datatogether:master
Conversation
Added folder / files for custom crawls
This is a response to #19

Hey @ebenp 🎉 It looks like you deleted the main README in this PR, are you able to revert that?
> @@ -1,74 +0,0 @@
> # Data Together Learning Materials

We still need the main README, can you revert to have it in this commit?
custom-crawls/README.md (Outdated)

> ## Lessons
>
> 1. What is custom crawling?
>    * Why do some websites need custom crawls?

We use 2 spaces not tabs, NBD, but nice to be consistent
custom-crawls/README.md (Outdated)

> ## Prerequisites
>
> * You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading

I think we could cut out "You care about a dataset that exists on the web and..."
As the prereq on this is probably more complex (e.g., you have a dataset that can't be downloaded automatically.)
We may want to revisit the DR guides we prepared and pull in some language:
https://edgi-govdata-archiving.github.io/guides/
custom-crawls/README.md (Outdated)

> After going through this tutorial you will know
>
> * What a custom crawler is and why some websites need one
> * What should your custom crawler extract from a webpage?

Can we have this second point phrased not as a question, so
something like: "What your custom crawler needs to extract from a web page"
custom-crawls/README.md (Outdated)

> ## Key Concepts
>
> * Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.

s/auotmated/automated
s/DataTogether/Data Together
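For anyone reading along, the "script file that is written specifically for a dataset" idea can be sketched in a few lines of Python. Everything here is made up for illustration (the page content, the `.csv` link pattern); a real crawler would fetch the index page with `urllib.request` and then download each matched file.

```python
from html.parser import HTMLParser

class DatasetLinkExtractor(HTMLParser):
    """Collect href values that point at data files (.csv in this sketch)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".csv"):
                    self.links.append(value)

# Hypothetical index page for a dataset; a real crawler would fetch this
# with urllib.request.urlopen(...) instead of hardcoding it.
page = """
<html><body>
  <a href="/data/air-quality-2016.csv">2016</a>
  <a href="/data/air-quality-2017.csv">2017</a>
  <a href="/about.html">About</a>
</body></html>
"""

parser = DatasetLinkExtractor()
parser.feed(page)
print(parser.links)  # the two .csv links; the about page is skipped
```

The dataset-specific part is exactly the bit you can't generalize: which links on the page are data, and which are navigation.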
custom-crawls/README.md (Outdated)

> * Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
> * Morph.io: An online service that automates and saves user created scripts.
> * Archivertools: An Python package to aid in accessing Morph.io and DataTogether APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.

delete extra space
s/DataTogether/Data Together
wondering if you want to link to archivertools? (And how is this package published?)
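As context for the Morph.io concept above: a Morph.io scraper is expected to leave its results in a local SQLite file named `data.sqlite`, which the service then picks up. A minimal stdlib-only sketch follows; the table and column names are illustrative, not a fixed Morph.io schema, and I'm deliberately not guessing the archivertools `Archiver` API, which would presumably wrap details like this.

```python
import sqlite3

def save_rows(rows, db_path="data.sqlite"):
    """Persist scraped rows so Morph.io (or anything else) can pick them up.

    `rows` is a list of (url, title) tuples; the schema here is illustrative.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data (url TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR REPLACE keeps the scraper idempotent across re-runs.
    conn.executemany(
        "INSERT OR REPLACE INTO data (url, title) VALUES (?, ?)", rows
    )
    conn.commit()
    conn.close()

# Rows a crawler might have scraped (hypothetical values).
save_rows([
    ("https://example.com/data/1", "Dataset one"),
    ("https://example.com/data/2", "Dataset two"),
])
```

Re-running the script leaves the same two rows in place, which matters for scrapers that run on a schedule.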
custom-crawls/README.md (Outdated)

> 3. Some example custom crawls scripts and implementation
>
> ## Next Steps
> Look at the other resources under DataTogether for more bakground on DataTogether and storing datasets

s/DataTogether/Data Together
s/bakground/background
custom-crawls/lessons/lesson-name.md (Outdated)

> @@ -0,0 +1,11 @@
> # Lesson: Name

You can delete this if you aren't using separate lessons for now!
Updated the text and files.

Bump. Any thoughts?

I think this looks great @ebenp 🎉 just made a minor formatting tweak -- my suggestion is that this is ready to flesh out during the sprint, but we could decide on the Thursday call :)

Sounds great! Thanks @dcwalk

Okay, so can I merge this?
Yup!

Okay, just realized this is only the table of contents, without the tutorial. I'd like to wait and add them all together. Maybe in the coming week I can help with reviewing/writing that content. (Or am I missing something and it is there?)

So I think initially this was to provide some structure to start writing, and the tutorial was started in the README. I'm fine with adding a tutorial.md and standardizing the README and contributing docs here in the PR if you want to take that on. Or feel free to close out this structure one and start a new one with the standard documentation, with a new tutorial file containing what's in the README. Sorry, I think this is a product of the PR sitting for a while and the documentation changing.
Splitting out lesson stubs

OK, I'm (slowly) moving on this. Split out the topic areas and will begin to flesh them out.
Here's some structure for the custom crawlers documentation