Database And STACK Architecture

Billy Ceskavich edited this page Jan 22, 2015 · 2 revisions

This wiki section details the structure of the databases that STACK uses for configuration and data storage, along with a discussion of STACK's architecture and semantics.

Databases

STACK works with MongoDB to store information as document objects. This system works well for both managing configuration information and storing the JSON responses typical of social APIs.

In this section, we assume that you have at least a cursory knowledge of using MongoDB. If you are new to MongoDB, check out the MongoDB Manual for an introduction to working with the database from the command line.

While our DB wrapper (see Interacting with STACK) abstracts away most standard interactions with Mongo, the database itself remains accessible to any user. It is often quicker to crawl through Mongo manually to double-check configuration information, examine collected data, and so on.

STACK uses three types of databases, each of which is detailed further below:

  • Config Database - The master configuration database used to store information on the project account(s) for a given collection server.
  • Project Config Database - The configuration database for a given project account, used to identify the various collectors a project owns.
  • Data Storage Database - The database used to store processed social data for a given project account.

Config Database

The config database tracks project accounts. DB.auth() queries the config database to authenticate an account, and other STACK processes check it for account info as well.

Database Name: config

Info Stored:

  • project_name
  • password
  • description
  • collectors - A list of all collectors owned by a project account
    • active - If the given collector is actively running or not
    • collector_id
    • collector_name
  • configdb - The name of the project config database (see below)

Example Document Structure:

{
  "_id": ObjectID(mongo_id_value),
  "project_name": your_project_name,
  "description": your_project_description,
  "collectors": [
    {
      "active": 0|1,
      "collector_id": collector_id,
      "name": collector_name
    }
  ],
  "configdb": project_config_db,
  "password": project_account_password
}
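The account document above can be sketched in plain Python. This is a hedged illustration, not STACK's own code: the helper name and the sample values are ours, and in practice STACK reads and writes these documents via Mongo.

```python
def make_account_doc(project_name, password, description):
    """Build a config-database account document matching the documented
    schema (hypothetical helper; field values are examples only)."""
    return {
        "project_name": project_name,
        "password": password,
        "description": description,
        "collectors": [],  # filled in as the account creates collectors
        # Naming convention documented in the Project Config Database section:
        "configdb": project_name + "Config",
    }

doc = make_account_doc("test", "s3cret", "A demo project account")
```

Mongo supplies the `_id` field itself on insert, so the sketch omits it.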

Project Config Database

The project config database contains the information for each collector owned by the given project account. In addition, this database includes configuration information for each network module, which is used to control the data processor and inserter processes. To learn more about the difference between collectors, data processors, and inserters, see below in the STACK Architecture section.

In addition to the standard configuration information, each document here has a series of what we call "flags." These flags are set and reset by the controller script as a way to communicate with the daemon processes that run in the background as part of STACK. Interaction with flags as a way to control STACK processes is abstracted away from the user through the wrappers detailed in Interacting w/ STACK.

Database Name: [project_account_name]Config. For example:

  • project_account_name - test
  • project_config_db - testConfig
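The naming convention can be captured in a one-line helper (a sketch; the function name is ours, not part of STACK):

```python
def project_config_db_name(project_account_name):
    """Derive the project config database name from the account name,
    per the [project_account_name]Config convention."""
    return project_account_name + "Config"
```

With the example above, `project_config_db_name("test")` yields `"testConfig"`.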

Collector Documents

Info Stored:

  • project_name
  • collector_name
  • api - The type of API filter called when collecting data
  • api_auth - OAuth information
  • active
  • collector - Flags for the collector
    • collect
    • update
    • run
  • terms_list - List of terms to collect (if provided); for each:
    • collect - Set to collect on this term or not
    • term
    • type - Term or handle (if a user screenname)
    • id - Twitter ID value if type == 'handle', or null
  • languages - Array of language codes if applicable
  • location - Array of location coordinates if applicable
  • project_id
  • stream_limit_loss - Total number of tweets lost to a stream rate limit
  • rate_limit_count - Number of times collection process has been rate limited
  • error_code - Logged error code if collection disconnected due to error

Example Doc Structure:

{
  "_id": ObjectID(mongo_id),
  "project_name": your_project_name,
  "collector_name": your_collector_name,
  "api": "track"|"follow"|"none",
  "api_auth": {
    "consumer_key": your_consumer_key,
    "consumer_secret": your_consumer_secret,
    "access_token": your_token,
    "access_token_secret": your_token_secret
  },
  "active": 0|1,
  "collector": {
    "collect": 0|1,
    "run": 0|1,
    "update": 0|1
  },
  "network": "twitter",
  "terms_list": [
    {
      "collect": 0|1,
      "term": your_term,
      "type": "term"|"handle",
      "id": twitter_id|null
    }
  ]|null,
  "languages": ["your", "array", "of", "lang", "codes"]|null,
  "location": ["your", "array", "of", "location", "coords"]|null,
  "stream_limit_loss": {
    "counts": [],
    "total": 0
  },
  "rate_limit_count": 0,
  "error_code": 0
}
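The flag mechanism described above can be sketched as follows. This is a conceptual illustration, not STACK's controller code: the function name is hypothetical, and it operates on a plain dict where the real controller would write the flags to the Mongo document.

```python
def signal_collector(collector_doc, collect=None, update=None, run=None):
    """Set the 0|1 collector flags the controller uses to communicate
    with the background daemon (hypothetical sketch)."""
    flags = collector_doc.setdefault("collector", {})
    for name, value in (("collect", collect), ("update", update), ("run", run)):
        if value is not None:
            flags[name] = 1 if value else 0
    return collector_doc

doc = {"collector_name": "demo",
       "collector": {"collect": 0, "run": 0, "update": 0}}
# Ask the daemon to start running and collecting:
signal_collector(doc, run=True, collect=True)
```

The daemon side of the exchange polls these fields and reacts, which is why the wrappers in Interacting w/ STACK never require the user to touch them directly.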

Network Module Documents

To learn more about the file directories mentioned below, please see the File Structure section of the wiki.

Info Stored:

  • module - Network name; currently, Twitter is the only supported network
  • insert_queue_dir - Directory for the raw file insert queue
  • raw_tweets_dir - Directory for raw tweets to be stored
  • tweet_archive_dir - Directory for archived processed files
  • inserter_active
  • inserter - Flags for the inserter process
    • run
  • processor_active
  • processor - Flags for the data processor
    • run
  • collection_script - Name of the collection script to be used by a Twitter collector
  • insertion_script - Name of the script to be called by the inserter
  • processor_script - Name of the script to be called by the processor

Example Doc Structure:

{
  "_id": ObjectID(mongo_id),
  "insert_queue_dir": your_insert_queue_dir,
  "inserter_active": 0|1,
  "inserter": {
    "run": 0|1
  },
  "module": "twitter",
  "raw_tweets_dir": your_raw_dir,
  "processor_active": 0|1,
  "tweet_archive_dir": your_archive_dir,
  "collection_script": "ThreadedCollector",
  "insertion_script": "mongoBatchInsert",
  "processor_script": "preprocess",
  "processor": {
    "run": 0|1
  }
}
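The inserter and processor flags follow the same 0|1 pattern as the collector flags. A small status helper, purely for illustration (the function is ours, not STACK's), shows how the fields combine:

```python
def pipeline_status(module_doc):
    """Summarize processor/inserter state from a network module document,
    using the documented field names (hypothetical helper)."""
    return {
        "processor": bool(module_doc["processor_active"])
                     and bool(module_doc["processor"]["run"]),
        "inserter": bool(module_doc["inserter_active"])
                    and bool(module_doc["inserter"]["run"]),
    }

module = {"processor_active": 1, "processor": {"run": 1},
          "inserter_active": 1, "inserter": {"run": 0}}
status = pipeline_status(module)
```

Here the processor is active and flagged to keep running, while the inserter has been signaled to stop.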

Data Storage Database

The data storage database stores all the final, processed social data for a given project account. Each collection within the data storage database corresponds to a given social network.

Database Name: [project_account_name]_[project_id]. For example:

  • project_account_name - test
  • project_id - 1234
  • database_name - test_1234
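As with the project config database, the naming convention is simple enough to express as a helper (again a sketch; the function name is ours):

```python
def data_db_name(project_account_name, project_id):
    """Derive the data storage database name,
    per the [project_account_name]_[project_id] convention."""
    return "%s_%s" % (project_account_name, project_id)
```

With the example above, `data_db_name("test", "1234")` yields `"test_1234"`.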

Database Collections:

  • Twitter - .tweets

STACK's Architecture

STACK is a tiered application that abstracts much of the complex backend work away from the user. Through the command line syntax detailed in Interacting w/ STACK, users call wrappers that in turn run processes.

In the '/stack/stack/' directory, controller.py is the main script that interfaces between the command line and the processes. The controller calls the given process and starts it as a daemon. All STACK processes then run as daemons, persistently working in the background.
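Conceptually, each daemon keeps polling its run flag and exits once the controller clears it. The loop below is a minimal sketch of that interaction, not STACK's actual daemon code; `read_run_flag` and `do_work` are placeholder callables standing in for a Mongo flag lookup and the real collection/processing work.

```python
import time

def daemon_loop(read_run_flag, do_work, poll_interval=0.0):
    """Keep working until the controller clears the run flag
    (conceptual sketch of STACK's daemon/flag interaction)."""
    while read_run_flag():
        do_work()
        time.sleep(poll_interval)

# Simulate a controller that clears the flag after three work cycles.
state = {"run": 1, "ticks": 0}

def fake_flag():
    return state["run"] == 1

def fake_work():
    state["ticks"] += 1
    if state["ticks"] >= 3:
        state["run"] = 0  # the "controller" resets the flag

daemon_loop(fake_flag, fake_work)
```

In STACK itself, the flag lives in the Mongo config documents shown above, so the controller and the daemon never need a direct channel between them.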

Collectors, Processors, and Inserters

There are three types of processes that can be run individually or concurrently, based on a researcher's needs: collectors, data processors, and inserters.

Collectors - These are the most complex STACK processes: they establish and maintain an API connection to scrape raw social data. Collectors work at the collector level: a project account creates a collector for a given set of parameters. These are represented above by the collector documents in the project config database.

Processors - Processors take raw data files and clean them up for insertion into the MongoDB data storage database. Processors work at a network level: processors handle all raw data for a given social network for a project account. Therefore, if a project account is running multiple Twitter collectors, all these data will be processed together by the processor.

Inserters - Inserters take the processed data files from the processor and insert the information into the MongoDB data storage database. Inserters also work at a network level: they handle all processed data for a given social network for a project account. Therefore, if a project account is running multiple Twitter collectors, all these data will be inserted together into a single project database.

NOTE - If you wish to have multiple storage databases, simply create multiple project accounts.
