Added the first custom crawler documentation files #25
ebenp wants to merge 10 commits into datatogether:master
Conversation
Added folder / files for custom crawls
This is a response to #19

Hey @ebenp 🎉 It looks like you deleted the main README in this PR, are you able to revert that?
> @@ -1,74 +0,0 @@
> # Data Together Learning Materials

We still need the main README, can you revert to have it in this commit?
custom-crawls/README.md (Outdated)

> ## Lessons
>
> 1. What is custom crawling?
>    * Why do some websites need custom crawls?

We use 2 spaces not tabs, NBD, but nice to be consistent
custom-crawls/README.md (Outdated)

> ## Prerequisites
>
> * You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading

I think we could cut out "You care about a dataset that exists on the web and..."
As the prereq on this is probably more complex (e.g., you have a dataset that can't be downloaded automatically.)
We may want to revisit the DR guides we prepared and pull in some language:
https://edgi-govdata-archiving.github.io/guides/
custom-crawls/README.md (Outdated)

> After going through this tutorial you will know
>
> * What a custom crawler is and why some websites need one
> * What should your custom crawler extract from a webpage?

Can we have this second point phrased not as a question, so
something like: "What your custom crawler needs to extract from a web page"
custom-crawls/README.md (Outdated)

> ## Key Concepts
>
> * Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.

s/auotmated/automated
s/DataTogether/Data Together
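For anyone reading along, the "script file that is written specifically for a dataset" idea can be sketched in a few lines of Python. Everything here is made up for illustration (the page content, the `.csv` link pattern); a real crawler would fetch the index page with `urllib.request` and then download each matched file.

```python
from html.parser import HTMLParser

class DatasetLinkExtractor(HTMLParser):
    """Collect href values that point at data files (.csv in this sketch)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".csv"):
                    self.links.append(value)

# Hypothetical index page for a dataset; a real crawler would fetch this
# with urllib.request.urlopen(...) instead of hardcoding it.
page = """
<html><body>
  <a href="/data/air-quality-2016.csv">2016</a>
  <a href="/data/air-quality-2017.csv">2017</a>
  <a href="/about.html">About</a>
</body></html>
"""

parser = DatasetLinkExtractor()
parser.feed(page)
print(parser.links)  # the two .csv links; the about page is skipped
```

The dataset-specific part is exactly the bit you can't generalize: which links on the page are data, and which are navigation.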
custom-crawls/README.md (Outdated)

> * Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
> * Morph.io: An online service that automates and saves user created scripts.
> * Archivertools: An Python package to aid in accessing Morph.io and DataTogether APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.

delete extra space
s/DataTogether/Data Together
wondering if you want to link to archivertools? (And how is this package published?)
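As context for the Morph.io concept above: a Morph.io scraper is expected to leave its results in a local SQLite file named `data.sqlite`, which the service then picks up. A minimal stdlib-only sketch follows; the table and column names are illustrative, not a fixed Morph.io schema, and I'm deliberately not guessing the archivertools `Archiver` API, which would presumably wrap details like this.

```python
import sqlite3

def save_rows(rows, db_path="data.sqlite"):
    """Persist scraped rows so Morph.io (or anything else) can pick them up.

    `rows` is a list of (url, title) tuples; the schema here is illustrative.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data (url TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR REPLACE keeps the scraper idempotent across re-runs.
    conn.executemany(
        "INSERT OR REPLACE INTO data (url, title) VALUES (?, ?)", rows
    )
    conn.commit()
    conn.close()

# Rows a crawler might have scraped (hypothetical values).
save_rows([
    ("https://example.com/data/1", "Dataset one"),
    ("https://example.com/data/2", "Dataset two"),
])
```

Re-running the script leaves the same two rows in place, which matters for scrapers that run on a schedule.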
custom-crawls/README.md (Outdated)

> 3. Some example custom crawls scripts and implementation
>
> ## Next Steps
> Look at the other resources under DataTogether for more bakground on DataTogether and storing datasets

s/DataTogether/Data Together
s/bakground/background
custom-crawls/lessons/lesson-name.md (Outdated)

> @@ -0,0 +1,11 @@
> # Lesson: Name

You can delete this if you aren't using separate lessons for now!
Updated the text and files.

Bump. Any thoughts?

I think this looks great @ebenp 🎉 just made a minor formatting tweak -- my suggestion is that this is ready to flesh out during the sprint, but we could decide on the Thursday call :)

Sounds great! Thanks @dcwalk

Okay, so can I merge this?
Yup!

Okay, just realized this is only the table of contents, without the tutorial. I'd like to wait and add them all together. Maybe in the coming week I can help with reviewing/writing that content. (Or am I missing something and it is there?)

So I think initially this was to provide some structure to start writing, and the tutorial was started in the README. I'm fine with adding a tutorial.md and standardizing the README and contributing docs here in the PR if you want to take that on. Or feel free to close out this structure one and start a new one with the standard documentation, with a new tutorial file containing what's in the README. Sorry, I think this is a product of the PR sitting for a while and the documentation changing.
Splitting out lesson stubs

OK, I'm (slowly) moving on this. Split out the topic areas and will begin to flesh them out.
Here's some structure for the custom crawlers documentation