Database And STACK Architecture

Billy Ceskavich edited this page Jan 22, 2015 · 2 revisions

This wiki section details the structure of the databases that STACK uses for configuration and data storage, along with a discussion of STACK's architecture and semantics.

Databases

STACK works with MongoDB to store information as document objects. This system works well for both managing configuration information and storing the JSON responses typical of social APIs.

In this section, we assume that you have at least a cursory knowledge of using MongoDB. If you are new to MongoDB, check out the MongoDB Manual for an introduction to working with the database from the command line.

While our DB wrapper (see Interacting with STACK) abstracts away most standard interactions with Mongo, the database itself remains accessible to any user. It is often quicker to crawl through Mongo manually to double-check configuration information, examine collected data, and so on.

STACK uses three types of databases, each of which is detailed further below:

  • Config Database - The master configuration database used to store information on the project account(s) for a given collection server.
  • Project Config Database - The configuration database for a given project account, used to identify the various collectors a project owns.
  • Data Storage Database - The database used to store processed social data for a given project account.

Config Database

The config database tracks project accounts. DB.auth() queries the config database to authenticate an account, and other STACK processes check it for account info as well.

Database Name: config

Info Stored:

  • project_name
  • password
  • description
  • collectors - A list of all collectors owned by a project account
    • active - If the given collector is actively running or not
    • collector_id
    • collector_name
  • configdb - The name of the project config database (see below)

Example Document Structure:

{
  "_id": ObjectID(mongo_id_value),
  "project_name": your_project_name,
  "description": your_project_description,
  "collectors": [
    {
      "active": 0|1,
      "collector_id": collector_id,
      "name": collector_name
    }
  ],
  "configdb": project_config_db,
  "password": project_account_password
}
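The account document above can be sketched in plain Python. This is a hedged illustration, not STACK's own code: the helper name and the sample values are ours, and in practice STACK reads and writes these documents via Mongo.

```python
def make_account_doc(project_name, password, description):
    """Build a config-database account document matching the documented
    schema (hypothetical helper; field values are examples only)."""
    return {
        "project_name": project_name,
        "password": password,
        "description": description,
        "collectors": [],  # filled in as the account creates collectors
        # Naming convention documented in the Project Config Database section:
        "configdb": project_name + "Config",
    }

doc = make_account_doc("test", "s3cret", "A demo project account")
```

Mongo supplies the `_id` field itself on insert, so the sketch omits it.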

Project Config Database

The project config database contains the information for each collector owned by the given project account. In addition, this database includes configuration information for each network module, which is used to control the data processor and inserter processes. To learn more about the difference between collectors, data processors, and inserters, see below in the STACK Architecture section.

In addition to the standard configuration information, each document here has a series of what we call "flags." These flags are set and reset by the controller script as a way to communicate with the daemon processes that run in the background as part of STACK. Interaction with flags as a way to control STACK processes is abstracted away from the user through the wrappers detailed in Interacting w/ STACK.

Database Name: [project_account_name]Config. For example:

  • project_account_name - test
  • project_config_db - testConfig
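The naming convention can be captured in a one-line helper (a sketch; the function name is ours, not part of STACK):

```python
def project_config_db_name(project_account_name):
    """Derive the project config database name from the account name,
    per the [project_account_name]Config convention."""
    return project_account_name + "Config"
```

With the example above, `project_config_db_name("test")` yields `"testConfig"`.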

Collector Documents

Info Stored:

  • project_name
  • collector_name
  • api - The type of API filter called when collecting data
  • api_auth - OAuth information
  • active
  • collector - Flags for the collector
    • collect
    • update
    • run
  • terms_list - List of terms to collect (if provided); for each:
    • collect - Set to collect on this term or not
    • term
    • type - Term or handle (if a user screenname)
    • id - Twitter ID value if type == 'handle', or null
  • languages - Array of language codes if applicable
  • location - Array of location coordinates if applicable
  • project_id
  • stream_limit_loss - Total number of tweets lost to a stream rate limit
  • rate_limit_count - Number of times collection process has been rate limited
  • error_code - Logged error code if collection disconnected due to error

Example Doc Structure:

{
  "_id": ObjectID(mongo_id),
  "project_name": your_project_name,
  "collector_name": your_collector_name,
  "api": "track"|"follow"|"none",
  "api_auth": {
    "consumer_key": your_consumer_key,
    "consumer_secret": your_consumer_secret,
    "access_token": your_token,
    "access_token_secret": your_token_secret
  },
  "active": 0|1,
  "collector": {
    "collect": 0|1,
    "run": 0|1,
    "update": 0|1
  },
  "network": "twitter",
  "terms_list": [
    {
      "collect": 0|1,
      "term": your_term,
      "type": "term"|"handle",
      "id": twitter_id|null
    }
  ]|null,
  "languages": ["your", "array", "of", "lang", "codes"]|null,
  "location": ["your", "array", "of", "location", "coords"]|null,
  "stream_limit_loss": {
    "counts": [],
    "total": 0
  },
  "rate_limit_count": 0,
  "error_code": 0
}
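The flag mechanism described above can be sketched as follows. This is a conceptual illustration, not STACK's controller code: the function name is hypothetical, and it operates on a plain dict where the real controller would write the flags to the Mongo document.

```python
def signal_collector(collector_doc, collect=None, update=None, run=None):
    """Set the 0|1 collector flags the controller uses to communicate
    with the background daemon (hypothetical sketch)."""
    flags = collector_doc.setdefault("collector", {})
    for name, value in (("collect", collect), ("update", update), ("run", run)):
        if value is not None:
            flags[name] = 1 if value else 0
    return collector_doc

doc = {"collector_name": "demo",
       "collector": {"collect": 0, "run": 0, "update": 0}}
# Ask the daemon to start running and collecting:
signal_collector(doc, run=True, collect=True)
```

The daemon side of the exchange polls these fields and reacts, which is why the wrappers in Interacting w/ STACK never require the user to touch them directly.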

Network Module Documents

To learn more about the file directories mentioned below, please see the File Structure section of the wiki.

Info Stored:

  • module - Network name; currently, Twitter is the only supported network
  • insert_queue_dir - Directory for the raw file insert queue
  • raw_tweets_dir - Directory for raw tweets to be stored
  • tweet_archive_dir - Directory for archived processed files
  • inserter_active
  • inserter - Flags for the inserter process
    • run
  • processor_active
  • processor - Flags for the data processor
    • run
  • collection_script - Name of the collection script to be used by a Twitter collector
  • insertion_script - Name of the script to be called by the inserter
  • processor_script - Name of the script to be called by the processor

Example Doc Structure:

{
  "_id": ObjectID(mongo_id),
  "insert_queue_dir": your_insert_queue_dir,
  "inserter_active": 0|1,
  "inserter": {
    "run": 0|1
  },
  "module": "twitter",
  "raw_tweets_dir": your_raw_dir,
  "processor_active": 0|1,
  "tweet_archive_dir": your_archive_dir,
  "collection_script": "ThreadedCollector",
  "insertion_script": "mongoBatchInsert",
  "processor_script": "preprocess",
  "processor": {
    "run": 0|1
  }
}
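The inserter and processor flags follow the same 0|1 pattern as the collector flags. A small status helper, purely for illustration (the function is ours, not STACK's), shows how the fields combine:

```python
def pipeline_status(module_doc):
    """Summarize processor/inserter state from a network module document,
    using the documented field names (hypothetical helper)."""
    return {
        "processor": bool(module_doc["processor_active"])
                     and bool(module_doc["processor"]["run"]),
        "inserter": bool(module_doc["inserter_active"])
                    and bool(module_doc["inserter"]["run"]),
    }

module = {"processor_active": 1, "processor": {"run": 1},
          "inserter_active": 1, "inserter": {"run": 0}}
status = pipeline_status(module)
```

Here the processor is active and flagged to keep running, while the inserter has been signaled to stop.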

Data Storage Database

The data storage database stores all the final, processed social data for a given project account. Each collection within the data storage database corresponds to a given social network.

Database Name: [project_account_name]_[project_id]. For example:

  • project_account_name - test
  • project_id - 1234
  • database_name - test_1234
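As with the project config database, the naming convention is simple enough to express as a helper (again a sketch; the function name is ours):

```python
def data_db_name(project_account_name, project_id):
    """Derive the data storage database name,
    per the [project_account_name]_[project_id] convention."""
    return "%s_%s" % (project_account_name, project_id)
```

With the example above, `data_db_name("test", "1234")` yields `"test_1234"`.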

Database Collections:

  • Twitter - .tweets

STACK's Architecture

STACK is a tiered application that abstracts much of the complex backend work away from the user. Through the command line syntax detailed in Interacting w/ STACK, users call wrappers that in turn run processes.

In the '/stack/stack/' directory, controller.py is the main script that interfaces between the command line and the processes. The controller calls the given process and starts it as a daemon. All STACK processes then run as daemons, persistently working in the background.
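Conceptually, each daemon keeps polling its run flag and exits once the controller clears it. The loop below is a minimal sketch of that interaction, not STACK's actual daemon code; `read_run_flag` and `do_work` are placeholder callables standing in for a Mongo flag lookup and the real collection/processing work.

```python
import time

def daemon_loop(read_run_flag, do_work, poll_interval=0.0):
    """Keep working until the controller clears the run flag
    (conceptual sketch of STACK's daemon/flag interaction)."""
    while read_run_flag():
        do_work()
        time.sleep(poll_interval)

# Simulate a controller that clears the flag after three work cycles.
state = {"run": 1, "ticks": 0}

def fake_flag():
    return state["run"] == 1

def fake_work():
    state["ticks"] += 1
    if state["ticks"] >= 3:
        state["run"] = 0  # the "controller" resets the flag

daemon_loop(fake_flag, fake_work)
```

In STACK itself, the flag lives in the Mongo config documents shown above, so the controller and the daemon never need a direct channel between them.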

Collectors, Processors, and Inserters

There are three types of processes that can be run individually or concurrently, based on a researcher's needs: collectors, data processors, and inserters.

Collectors - These are the most complex STACK processes: they establish and maintain an API connection to scrape raw social data. Collectors work at the collector level: a project account creates a collector for a given set of parameters. These are represented above by the collector documents in the project config database.

Processors - Processors take raw data files and clean them up for insertion into the MongoDB data storage database. Processors work at a network level: processors handle all raw data for a given social network for a project account. Therefore, if a project account is running multiple Twitter collectors, all these data will be processed together by the processor.

Inserters - Inserters take the processed data files from the processor and insert the information into the MongoDB data storage database. Inserters also work at a network level: they handle all processed data for a given social network for a project account. Therefore, if a project account is running multiple Twitter collectors, all these data will be inserted together into a single project database.

NOTE - If you wish to have multiple storage databases, simply create multiple project accounts.
