Architecture Design

johans edited this page May 29, 2018 · 16 revisions

Architecture

1. Core

    The following diagram gives an overview of how Pider interacts with its components and outlines the data flow inside the framework.

[Diagram: Pider]

Spiders

     Spiders provides the Spider interface, which programmers use to customize their crawlers for different scenarios.
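As a rough illustration of what "customizing a crawler through an interface" can look like, here is a minimal Python sketch. It is not Pider's actual API: the `Spider` base class, the `start_urls` attribute, the `parse` method, and `TitleSpider` are all hypothetical names chosen for this example.

```python
from abc import ABC, abstractmethod


class Spider(ABC):
    """Hypothetical base interface; not Pider's actual API."""

    start_urls = []  # assumed attribute: pages the crawl begins with

    @abstractmethod
    def parse(self, url, body):
        """Turn one fetched page into a list of extracted records."""


class TitleSpider(Spider):
    """Example customization: extract the <title> of each page."""

    start_urls = ["http://example.com/"]

    def parse(self, url, body):
        start = body.find("<title>")
        end = body.find("</title>")
        if start == -1 or end == -1:
            return []  # no title tag found, so nothing to extract
        title = body[start + len("<title>"):end].strip()
        return [{"url": url, "title": title}]
```

Under this assumption, each scenario gets its own subclass that overrides `parse`, while the rest of the framework only depends on the shared interface.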

Kernel

     The Kernel controls the data flow between all components and modules, and triggers events when certain actions occur.
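The two duties described here (routing data between components and firing events at certain points) can be sketched in a few lines of Python. This is an assumption-laden illustration, not Pider's real kernel: the `Kernel` class, the `on`/`trigger`/`run` methods, and the event names `page_fetched` and `record_parsed` are all made up for the example.

```python
from collections import defaultdict


class Kernel:
    """Hypothetical sketch; not Pider's actual kernel."""

    def __init__(self):
        self._handlers = defaultdict(list)  # event name -> list of callbacks

    def on(self, event, handler):
        """Register a callback for a named event."""
        self._handlers[event].append(handler)

    def trigger(self, event, payload):
        """Call every callback registered for the event, in order."""
        for handler in self._handlers[event]:
            handler(payload)

    def run(self, urls, fetch, parse, process):
        """Move data fetch -> parse -> process, firing an event per step."""
        results = []
        for url in urls:
            body = fetch(url)
            self.trigger("page_fetched", url)
            for record in parse(url, body):
                self.trigger("record_parsed", record)
                results.append(process(record))
        return results
```

In this sketch the components never call each other directly; the kernel passes each stage's output to the next, which is one common way to keep fetching, parsing, and cleaning independently replaceable.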

2. Components

ActivedCarbon

     The ActivedCarbon component is similar to Item, ItemLoader, and ItemPipeline in Scrapy, but offers more functionality. Its ultimate purpose is to provide a complete mechanism for data cleaning.

[Diagram: ActivedCarbon prototype]

As shown in the prototype above, ActivedCarbon has many Pores, which supply a variety of ETL operations.

  • Pore

     A Pore can be regarded as a container that holds a collection of ETL handlers, which perform data transformation (Reaction), data filtering (Filter), and data assimilation (Absorber).

  • Reaction

     A Reaction performs a transform operation on data.

  • Absorber

     An Absorber collects information, which can then be used for analysis.

  • Filter

     If you don't want to process all the data, you can define a Filter to skip invalid data.
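The relationship between ActivedCarbon, Pore, Reaction, Absorber, and Filter can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Pider's implementation: the constructor arguments, the `process` and `clean` methods, and the order in which handlers run (filters first, then reactions, then absorbers) are all choices made for this example.

```python
class Pore:
    """Hypothetical container of ETL handlers; not Pider's actual API."""

    def __init__(self, filters=(), reactions=(), absorbers=()):
        self.filters = list(filters)      # predicates: keep the record if all return True
        self.reactions = list(reactions)  # functions: record -> transformed record
        self.absorbers = list(absorbers)  # functions: observe the record, return nothing

    def process(self, record):
        """Return the transformed record, or None if a filter rejects it."""
        if not all(keep(record) for keep in self.filters):
            return None
        for react in self.reactions:
            record = react(record)
        for absorb in self.absorbers:
            absorb(record)
        return record


class ActivedCarbon:
    """Runs each record through a sequence of pores (assumed design)."""

    def __init__(self, pores):
        self.pores = list(pores)

    def clean(self, records):
        cleaned = []
        for record in records:
            for pore in self.pores:
                record = pore.process(record)
                if record is None:
                    break  # rejected by a filter; skip the remaining pores
            if record is not None:
                cleaned.append(record)
        return cleaned
```

For instance, a pore could combine a filter such as `lambda r: r.get("price") is not None`, a reaction that converts the price string to a float, and an absorber that appends each surviving record to a list for later analysis.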

For more details on customizing your own ETL model, please check out DataProcess in Pider.