
Task Manager's task implementation

Tingxuan Gu edited this page Sep 5, 2019 · 2 revisions

This page shows how to construct a task for the Task Manager to run.

Description

Typically, a runnable/task is supported by three parts:

  • crawler - crawls data from the selected website
  • extractor - parses the downloaded data and pulls out the information you need
  • dumper - writes that information into the database

These parts can be changed if the task serves a different purpose, e.g. classification.
The task file itself usually contains only a run function, which the Task Manager calls.
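
As a rough illustration of that layout, here is a minimal sketch of a task file. Everything in it is an assumption made for the example: the function names crawl, extract and dump, the items table, and the in-memory SQLite database standing in for the real one. The crawler returns fixed text instead of downloading anything so the sketch runs without network access.

```python
import sqlite3


def crawl():
    """Crawler stand-in: a real task would download this text from the website."""
    return "alice,3\nbob,5\n"


def extract(raw):
    """Extractor: turn the raw text into (name, qty) rows."""
    rows = []
    for line in raw.splitlines():
        name, qty = line.split(",")
        rows.append((name, int(qty)))
    return rows


def dump(rows, conn):
    """Dumper: write the rows into a (hypothetical) items table."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def run(conn=None):
    """Entry point: the only function the Task Manager would need to call."""
    if conn is None:
        # Assumption: an in-memory database replaces the project's real one.
        conn = sqlite3.connect(":memory:")
    return dump(extract(crawl()), conn)
```

Keeping the three parts as separate functions means a task with a different purpose can swap one of them out while run stays the single entry point.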

Implementation

Normally, you run the pipeline one file at a time.
For each file you want from the website:

  • first, crawl (download) the file, using wget or request
  • then extract the information from it and dump it into the database
  • finally, delete the file to save space
  • move on to the next file
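
The steps above can be sketched as a simple loop. This is only one possible shape, not the project's actual code: process_files and its crawl and extract_and_dump parameters are hypothetical names, and the download step is passed in as a function so that either wget or an HTTP library could be plugged in.

```python
import os


def process_files(urls, crawl, extract_and_dump):
    """Run the crawl -> extract/dump -> delete cycle once per file.

    crawl(url) must download one file and return its local path.
    extract_and_dump(path) must parse that file and store the result.
    Both are supplied by the caller; they are assumptions of this sketch.
    """
    processed = 0
    for url in urls:
        local_path = crawl(url)            # 1. download one file
        try:
            extract_and_dump(local_path)   # 2. extract and dump it
        finally:
            os.remove(local_path)          # 3. delete it to save space
        processed += 1                     # 4. move on to the next file
    return processed
```

Deleting inside a finally block means the downloaded file is removed even when extraction fails, so at most one file is on disk at any time.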

Remember to add logging so that any exceptions raised in the task you are working on are caught and recorded.
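
One way to do this, sketched with Python's standard logging module, is to wrap the work for each file in a try/except block. The logger name, the run signature and the process_one callback are assumptions made for the example, not part of the Task Manager.

```python
import logging

# Hypothetical logger name; use whatever naming the project already follows.
logger = logging.getLogger("task_manager.example_task")


def run(files, process_one):
    """Process each file, logging failures instead of aborting the whole task."""
    failed = []
    for name in files:
        try:
            process_one(name)
            logger.info("processed %s", name)
        except Exception:
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception("failed to process %s", name)
            failed.append(name)
    return failed
```

Returning the list of failed files lets the caller decide whether to retry them or report them.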
