Skip to content

Latest commit

 

History

History
482 lines (359 loc) · 32.8 KB

File metadata and controls

482 lines (359 loc) · 32.8 KB

DeepExtract: Database Schema and Binary Analysis Data Format

This document provides a technical reference for the data architecture, schemas, and analysis heuristics generated by the DeepExtract binary analysis extractor for IDA Pro.


System Overview

DeepExtract is an IDA Pro Python extractor that converts binaries into structured, queryable data. It performs PE metadata extraction and function‑level analysis (xrefs, stack frames, loop analysis, indirect call resolution, vtable context reconstruction) and stores results in SQLite. Optional report generation produces file_info.md/file_info.json and per‑function .cpp files only when --generate-cpp is enabled and at least one function has usable decompiled output (C++ files are emitted only for functions passing the decompilation filter).

Data Architecture

The structured output uses a SQLite‑based storage model supporting concurrent analysis and large‑scale binary repositories.

  1. Tracking Database (analyzed_files.db): A registry tracking status, hashes, and metadata for binaries in the analysis pipeline. analysis_flags stores AnalysisConfig.to_dict() (see key list below; it does not include cpp_output_dir or SQLite pragma overrides).
  2. Individual Analysis Databases: Per‑binary databases containing disassembly/decompilation output, cross‑references, and analysis metadata. In batch mode the default is {filename}_{hash}.db under extracted_dbs/; in single‑file mode this is the explicit --sqlite-db path.
  3. Optional Report/Code Output: When --generate-cpp is enabled and at least one function has valid decompiled output, the extractor writes file_info.md, file_info.json, and per‑function .cpp files under {cpp_output_dir}/{module_name}/ (or {sqlite_db_dir}/extracted_raw_code/{module_name}/ if --cpp-output-dir is not provided).

Headless Batch Extractor Outputs (PowerShell)

When using headless_batch_extractor.ps1, the script orchestrates multiple IDA runs and produces a consistent on‑disk layout in {StorageDir}:

  • analyzed_files.db — Tracking database used by check_analyzed_files.py
  • extracted_dbs/ — SQLite analysis databases ({base_name}.db, see naming below)
  • extracted_code/ — Generated .cpp files plus file_info.md/file_info.json (only when -NoGenerateCpp is not set)
  • idb_cache/ — IDA database cache files (.i64)
  • logs/ — IDA run logs ({base_name}_{timestamp}.log), batch execution log (batch_extractor_{timestamp}.log), and optional symbol download logs (symchk_*.log, symchk_*.log.err)
  • analyzed_modules_list.txt — One absolute input file path per line (UTF-8), for the files that were analyzed
  • extraction_report.json — JSON summary report for the batch run

extraction_report.json schema (generated by Write-ExtractionReport)

{
  "extraction_info": {
    "timestamp": "YYYY-MM-DDTHH:mm:ss",
    "mode": "Directory | FileList | Pid",
    "storage_directory": "path",
    "timeout_hours": number,
    "extract_dirs": ["path", ...],
    "extract_dirs_recursive": ["path", ...],
    "target_pids": [number, ...],
    "process_names": ["string", ...]
  },
  "summary": {
    "total_files": number,
    "successful": number,
    "failed": number,
    "timed_out": number,
    "skipped": number,
    "empty_log_files": number
  },
  "successful_extractions": [
    { "FileName": "path", "DbPath": "path", "IdbPath": "path", "LogFile": "path" }
  ],
  "failed_extractions": [
    { "FileName": "path", "ExitCode": number, "LogFile": "path" }
  ],
  "timed_out_extractions": [
    { "FileName": "path", "LogFile": "path", "ElapsedTime": number, "Reason": "string" }
  ],
  "skipped_files": [
    { "FileName": "path", "Reason": "string" }
  ],
  "empty_log_files": [
    { "FileName": "path", "LogFile": "path", "ExitCode": "number | \"unknown\"", "Reason": "string" }
  ]
}

Notes:

  • target_pids and process_names are arrays included only when mode is Pid.
  • extract_dirs (non-recursive) and extract_dirs_recursive are arrays included only when mode is Directory. Either or both may be present depending on which parameters were used.
  • Per-file base name ({base_name}) is computed as: {file.Name.Replace('.', '_')}_{md5_prefix} lower-cased, where md5_prefix is the first 10 characters of the MD5 of the file contents. Example: kernel32.dll -> kernel32_dll_a1b2c3d4e5.

Analysis Heuristics for Prioritization

The extractor records function‑level heuristics in the functions table (JSON strings) and binary‑level heuristics in the file_info table.

1. Loop Analysis (loop_analysis)

  • Algorithm: Tarjan SCC + dominator/back‑edge detection over IDA flow graphs (loop_analysis.py).
  • Metrics: Per‑loop entries include complexity, block_count, instruction_count, has_function_calls, memory_access, is_infinite, exit_condition_count, cyclomatic_complexity, back_edges, and nesting_level.
  • Availability: If analysis is disabled or unavailable, the extractor stores { "loops": [], "loop_count": 0, "analysis_available": false }. Large functions may be skipped with skipped, reason, block_count, and limit. Errors may add an error field.

2. Dangerous API Detection (dangerous_api_calls)

  • Algorithm: Matches outbound xref function_name values against dangerous_apis.json via constants.is_dangerous_api() (case‑insensitive, strips import prefixes like __imp_/_imp_, and treats WPP_ as dangerous).
  • Output: JSON string containing a list of unique API names. Empty list when none are detected or on error.

3. String Reference Extraction (string_literals)

  • Algorithm: Builds a function→strings map by enumerating idautils.Strings() and xrefs to functions (string_analysis.py). A per‑binary cache in %TEMP% is used when available (validated against IDB mtime).
  • Output: JSON list of unique referenced string literals (capped at 2000 items, order not guaranteed).

4. Stack Frame Analysis (stack_frame)

  • Fields (when available): local_vars_size, args_size, saved_regs_size, exception_handler, frame_pointer_present, has_canary.
  • Canary detection: Combines frame variable names with instruction‑level patterns (__security_check_cookie, stack_chk_fail, XOR against frame values).
  • Availability: When disabled or on failure, the extractor stores { "local_vars_size": null, "args_size": null, "has_canary": null, "analysis_available": false } (other keys are omitted).

5. VTable Contexts (vtable_contexts)

  • Source: Inbound xrefs flagged as vtable candidates; surrounding vtable slots are read and resolved.
  • Output: Entries include reconstructed_classes (C++ skeleton strings), source_ea, and extraction_type. Capped at 100 entries.

6. TLS Callback Analysis (tls_callbacks)

  • Scope: Binary‑level analysis (file‑info) when TLS data exists.
  • Output: Per‑callback metadata includes function identity, segment info, xrefs, string/API/crypto indicators, anti‑debug findings, analysis_notes, and threat_score/threat_level.

Analysis Systems

The extractor uses analysis engines to resolve binary patterns:

Inter-procedural Data Flow Analysis

Implemented in interprocedural_analysis.py and invoked by indirect call analysis when available:

  • Summary-based, bounded: max depth 3, limited callers/callees per function, LRU cache (size 1000) for summaries.
  • Tracks: parameter usage as function pointers, return value sources (constant/parameter/global), and global‑based call targets.
  • Outputs: analysis warnings when incomplete or when targets cannot be resolved.

Indirect Call & Jump Table Resolution

  • Indirect calls: Detects indirect call patterns (call reg, call [mem], call [reg+off]), performs intra‑procedural data‑flow to identify targets, then applies inter‑procedural analysis when no targets are found or confidence < 50%.
  • Jump tables: Uses IDA switch info when available and a manual fallback for indirect jumps; both produce target lists with confidence scoring and detection metadata (minimum acceptance threshold 30%).
  • Outbound xrefs: Encoded with confidence and validation checks in the outbound_xrefs JSON entries.

Offline Code Generation (CPP Generator)

  • Hierarchy: Generates a structured directory of .cpp files plus file_info.md/file_info.json.
  • Categorization: Groups functions into C++ class method files or standalone global function files based on naming conventions.
  • Cleaned Source: Replaces IDA escape sequences and wraps comments for readability.

SQLite Database Schemas

Schema validation uses deep_extract/schema.py, but the actual table creation occurs in deep_extract/pe_context_extractor.py:init_sqlite_db().

Tracking Database (analyzed_files.db)

Purpose: Ledger tracking the analysis status of all binaries.

Table: analyzed_files

Column Name Data Type Description
file_path TEXT PRIMARY KEY. Absolute path to the binary file.
base_dir TEXT The directory containing the binary file.
file_name TEXT The name of the binary file, including its extension.
file_extension TEXT The file's extension (e.g., .exe, .dll).
md5_hash TEXT The MD5 cryptographic hash for change detection.
sha256_hash TEXT The SHA256 cryptographic hash for change detection.
analysis_db_path TEXT Relative path to the individual analysis SQLite database (relative to the common DB directory; forward slashes; may be null until completion).
status TEXT Status: PENDING, ANALYZING, or COMPLETE.
analysis_flags TEXT JSON string of the analysis configuration used (see below).
analysis_start_timestamp TIMESTAMP Recorded when analysis began (used for stale lock recovery).
analysis_completion_timestamp TIMESTAMP Recorded when analysis completed.

analysis_flags keys (from AnalysisConfig.to_dict):

  • sqlite_db, extract_dangerous_apis, extract_strings, extract_stack_frame, extract_globals
  • analyze_loops, extract_pe_info, extract_pe_metadata, extract_advanced_pe, extract_runtime_info
  • generate_cpp, force_reanalyze, use_interprocedural_analysis, thunk_depth, min_conf

Note: check_analyzed_files.py compares flags with force_reanalyze removed. cpp_output_dir, sqlite_pragmas, and common_db_pragmas are intentionally not stored in analysis_flags.

Indices

Index Name Column(s) Purpose
idx_files_name file_name Fast filename lookups.
idx_files_hash md5_hash, sha256_hash Hash‑based change detection.
idx_an_files_status_lower_name status, LOWER(file_name) Status + case‑insensitive name search.
idx_an_files_lower_name LOWER(file_name) Case‑insensitive filename search.
idx_an_files_lower_ext LOWER(file_extension) Case‑insensitive extension search.

Individual Analysis Database (per binary)

Purpose: Stores results for a single binary.

Table: file_info

Column Name Data Type Description
file_path TEXT Absolute path to the binary (PK).
base_dir TEXT Directory containing the file.
file_name TEXT Filename with extension.
file_extension TEXT Extension (e.g., .exe, .dll).
file_size_bytes BIGINT Total size in bytes.
md5_hash TEXT MD5 cryptographic hash.
sha256_hash TEXT SHA256 cryptographic hash.
imports TEXT JSON string of imported modules/functions (regular + delay‑load merged; per‑function is_delay_loaded).
exports TEXT JSON string of exported functions.
entry_point JSON JSON string of detected entry points (from extract_all_entry_points_with_methods).
file_version TEXT Resource file version string.
product_version TEXT Product version string.
company_name TEXT Vendor identification.
file_description TEXT File purpose description.
internal_name TEXT Internal module name.
original_filename TEXT Original filename from PE header.
legal_copyright TEXT Copyright string.
product_name TEXT Product name.
time_date_stamp_str TEXT Compilation timestamp.
file_modified_date_str TEXT Filesystem modification date.
sections TEXT JSON string of PE sections.
pdb_path TEXT Path to PDB debug symbols.
rich_header TEXT JSON string with Rich Header metadata.
tls_callbacks TEXT JSON string of TLS callback metadata.
is_net_assembly BOOLEAN Flag for .NET binaries.
clr_metadata TEXT JSON string with CLR header details.
idb_cache_path TEXT Relative path to cached .i64/.idb (from the analysis DB directory; forward slashes).
dll_characteristics TEXT JSON string of DLL flags.
security_features TEXT JSON string of feature assessments.
exception_info TEXT JSON string for exception directory.
load_config TEXT JSON string for PE Load Config data.
analysis_timestamp TIMESTAMP Timestamp (YYYY-MM-DD HH:MM:SS) recorded when the file_info row is inserted.

Table: functions

Column Name Data Type Description
function_id INTEGER Unique primary key.
function_signature TEXT Demangled signature (IDA long form).
function_signature_extended TEXT Extended signature using IDA type info when present.
mangled_name TEXT Original decorated name.
function_name TEXT Raw IDA function name (may be mangled or sub_…).
assembly_code TEXT Disassembly text (capped at MAX_ASSEMBLY_LINES).
decompiled_code TEXT Hex‑Rays output or placeholder text on failure.
inbound_xrefs TEXT Detailed JSON of inbound references.
outbound_xrefs TEXT Detailed JSON of outbound calls/jumps.
simple_inbound_xrefs TEXT Simplified JSON caller list.
simple_outbound_xrefs TEXT Simplified JSON callee list.
vtable_contexts TEXT JSON list of reconstructed class skeletons.
global_var_accesses TEXT JSON list of global read/write accesses.
dangerous_api_calls TEXT JSON list of dangerous API names (stored as a JSON string).
string_literals TEXT JSON list of referenced string literals (order not guaranteed).
stack_frame TEXT JSON object with stack/frame metrics.
loop_analysis TEXT JSON object with loop analysis data.
analysis_errors TEXT JSON array of processing errors/warnings.
created_at TIMESTAMP Record creation timestamp.

Note: JSON fields in file_info/functions are stored as TEXT (serialized JSON). Most fields use json_safety.to_json_safe(), so values may be "null" or include truncation metadata.

Table: function_xrefs

Purpose: Normalized, relational cross-reference table that represents the call graph and data-reference graph as individual rows. Each row is a single directed edge between a source function and a target (which may be an internal function, an imported API, a data reference, or a vtable slot).

This table complements the JSON xref columns on the functions table (simple_inbound_xrefs, simple_outbound_xrefs, inbound_xrefs, outbound_xrefs). The JSON columns are optimized for per-function reads (loading one function's context), while function_xrefs is optimized for cross-module graph queries (traversals, reachability, call-chain analysis) using standard SQL joins and index lookups.

Column Name Data Type Description
id INTEGER PRIMARY KEY (autoincrement). Row identifier.
source_id INTEGER function_id of the calling/referencing function. NULL for data-only inbound references with no resolved source function.
target_id INTEGER function_id of the called/referenced function. NULL when the target is an external import, data reference, or unresolved symbol.
target_name TEXT Name of the target function or symbol (e.g., "CreateProcessAsUserW", "AiCheckForAdminUser", "_SVCHOST_GLOBAL_DATA * g_pSvchostSharedGlobals").
target_module TEXT Where the target lives: "internal" for same-module functions, DLL name for imports (e.g., "kernelbase.dll"), "data" for globals, "vtable" for vtable slots, "static_library" for statically linked library functions.
function_type INTEGER Target classification using IDA function type constants: FT_UNK=0, FT_GEN=1 (internal), FT_LIB=2 (static library), FT_API=3 (import), FT_MEM=4 (data), FT_VTB=8 (vtable/unknown offset).
xref_type TEXT IDA cross-reference type: "Call Near", "Offset", "Read", "Write", etc.
direction TEXT Edge direction relative to the function being processed: "outbound" (this function calls/references the target) or "inbound" (the target calls/references this function).

Uniqueness constraint: UNIQUE(source_id, target_id, target_name, target_module, xref_type, direction) prevents duplicate edges. Insertion uses INSERT OR IGNORE to silently drop duplicates from IDA's raw xref data.

Relationship to JSON xref columns: The function_xrefs rows are derived from the same raw xref data as simple_inbound_xrefs and simple_outbound_xrefs. Both representations are populated during the same extraction pass. The JSON columns retain the original per-function grouping; the function_xrefs table flattens all edges into a single queryable structure.

Use Cases

Reverse lookup -- find all callers of a specific function:

SELECT DISTINCT f.function_name
FROM function_xrefs x
JOIN functions f ON f.function_id = x.source_id
WHERE x.target_name = 'CreateProcessAsUserW'
  AND x.direction = 'outbound';

Internal call graph reconstruction:

SELECT s.function_name AS caller, t.function_name AS callee
FROM function_xrefs x
JOIN functions s ON s.function_id = x.source_id
JOIN functions t ON t.function_id = x.target_id
WHERE x.direction = 'outbound'
  AND x.target_module = 'internal';

Import dependency surface -- which DLLs does a function depend on:

SELECT DISTINCT x.target_module, x.target_name
FROM function_xrefs x
WHERE x.source_id = ?
  AND x.direction = 'outbound'
  AND x.function_type = 3;  -- FT_API

Dead code detection -- functions with no inbound references:

SELECT f.function_name
FROM functions f
WHERE f.function_id NOT IN (
    SELECT target_id FROM function_xrefs
    WHERE direction = 'inbound' AND target_id IS NOT NULL
);

Transitive call chain -- two-hop callers of a function (who calls the callers):

SELECT DISTINCT f2.function_name AS indirect_caller
FROM function_xrefs x1
JOIN function_xrefs x2 ON x2.target_id = x1.source_id
JOIN functions f2 ON f2.function_id = x2.source_id
WHERE x1.target_name = 'NtDuplicateToken'
  AND x1.direction = 'outbound'
  AND x2.direction = 'outbound'
  AND x2.target_module = 'internal';

Global data access analysis -- find all functions that read a specific global:

SELECT f.function_name
FROM function_xrefs x
JOIN functions f ON f.function_id = x.source_id
WHERE x.target_name LIKE '%g_pSvchostSharedGlobals%'
  AND x.direction = 'outbound'
  AND x.xref_type = 'Read';

Table: schema_version

Column Name Data Type Description
version INTEGER PRIMARY KEY. Schema version number.
description TEXT Description of the schema version.
applied_timestamp TIMESTAMP When this schema version was applied.
migration_notes TEXT Details regarding migration from prior version.

Indices

Index Name Table Column(s) Purpose
idx_file_info_lower_ext file_info LOWER(file_extension) Case-insensitive extension search.
idx_file_info_lower_name file_info LOWER(file_name) Case-insensitive filename search.
idx_functions_mangled functions mangled_name COLLATE NOCASE Case-insensitive mangled name lookups.
idx_functions_name functions function_name COLLATE NOCASE Case-insensitive function name lookups.
idx_functions_signature functions function_signature COLLATE NOCASE Case-insensitive signature search.
idx_fxrefs_source function_xrefs source_id Fast caller lookups and outbound joins.
idx_fxrefs_target function_xrefs target_id Fast callee lookups and inbound joins.
idx_fxrefs_target_name function_xrefs target_name COLLATE NOCASE Name-based target search (e.g., imports).
idx_fxrefs_direction function_xrefs direction Filter by inbound/outbound direction.

Function JSON Field Formats

The following JSON fields are stored as strings in the functions table.

inbound_xrefs

Detailed inbound references (callers). Each entry commonly includes:

  • source_ea (hex string), source_location_details, xref_type, is_code, is_vtable_candidate
  • source_name_info (object with mangled/short/long or null)
  • source_function_name, source_function_name_short, source_function_name_long, source_function_name_mangled
  • source_module_name, source_module_host, source_function_type
  • call_confidence, validation_checks, validation_warnings, extraction_type
  • Optional: function_name (for code call/jump/offset xrefs)

outbound_xrefs

Detailed outbound calls/jumps. Base fields include:

  • source_instruction_ea, target_ea, resolved_target_ea, xref_type
  • function_name, short_function_name, function_name_short, function_name_long, function_name_mangled
  • module_name, module_host (pre‑API‑set resolution), function_type
  • is_tail_call, is_thunk_resolved
  • call_confidence, validation_checks, validation_warnings, extraction_type
  • Optional: original_target_name (present when a thunk or jump table target was resolved)

Specialized entries add:

  • VTable calls: is_vtable_call, vtable_info (vtable address, name, method offset/names, size)
  • Indirect calls: is_indirect_call, indirect_call_info (call_type, confidence, analysis_method)
  • Jump table targets: is_jump_table_target, jump_table_detection_confidence, jump_table_detection_method

simple_inbound_xrefs / simple_outbound_xrefs

Simplified xref objects with fields:

  • function_name, function_id (null for data/external refs), module_name, function_type, xref_type, extraction_type ("script")

function_type values: FT_UNK=0, FT_GEN=1, FT_LIB=2, FT_API=3, FT_MEM=4, FT_VTB=8, FT_SYS=16.

vtable_contexts

List of reconstructed class skeletons derived from vtable slots:

  • reconstructed_classes (array of C++ class skeleton strings)
  • source_ea
  • extraction_type (e.g., detailed_vtable_analysis)

global_var_accesses

List of global variable reads/writes:

  • address (hex string)
  • name
  • access_type (Read or Write)

loop_analysis

Object containing:

  • loops: list of loop entries with complexity, block_count, instruction_count, has_function_calls, memory_access, is_infinite, exit_condition_count, cyclomatic_complexity, back_edges, nesting_level
  • loop_count
  • Optional: analysis_available (false when disabled or unsupported), skipped, reason, block_count, limit, function_name, or error

stack_frame

Object containing:

  • local_vars_size, args_size, saved_regs_size, exception_handler, frame_pointer_present, has_canary
  • Optional: analysis_available (false when stack analysis is disabled or fails). When disabled/unavailable the extractor stores { "local_vars_size": null, "args_size": null, "has_canary": null, "analysis_available": false }.

string_literals

List of string literals referenced by the function.

dangerous_api_calls

List of unique API names flagged as dangerous.

analysis_errors

List of error/warning objects. All entries include stage. Entries created from exceptions include error; truncation warnings may include severity, reason, and max_lines.


JSON Serialization Limits

JSON fields are serialized with json_safety.to_json_safe, which enforces size and item limits. When truncation occurs:

  • Lists may include a trailing {"_truncated": true, "original_count": N, "shown_count": M} entry.
  • Dicts may include _truncated_keys when key limits are exceeded.
  • Both lists and dicts receive _serialization_metadata with truncated: true and a transformations list.
  • If a field exceeds the byte limit (default 50MB), an additional _truncated object (or _error object) may be inserted with fields like reason, original_size_bytes, and max_bytes.

Common per‑field caps (list limits):

  • entry_point: 100 items
  • imports: 1000 modules
  • exports: 5000 entries
  • sections: 200 entries
  • tls_callbacks: 256 entries
  • string_literals: 2000 entries
  • vtable_contexts: 100 entries
  • loop_analysis: 500 items per list inside the object
  • global_var_accesses: 1000 entries
  • analysis_errors: 100 entries
  • inbound_xrefs / outbound_xrefs / simple_*_xrefs: capped by EXTRACTOR_MAX_XREFS (env var; default 50,000; clamped to 1,000–500,000)

Usage Notes for AI Agents

  1. Registry Discovery: Query analyzed_files.db to find individual module databases and verify analysis completeness across software packages.
  2. Call Chain Reconstruction: Use function_xrefs for efficient SQL-based call chain queries. For lightweight per-function reads, use simple_outbound_xrefs. For indirect calls/jump tables/vtables, use outbound_xrefs entries with indirect_call_info, jump_table_*, or vtable_info.
  3. Cross-Reference Analysis: Use function_xrefs with SQL joins for graph queries (reverse lookups, transitive callers, import surface analysis). Use inbound_xrefs and outbound_xrefs JSON columns when full xref detail is needed for a single function.
  4. Heuristic Data Retrieval: Use SQL JSON_EXTRACT to filter functions by technical metrics such as loop complexity, stack size, or anti-debug indicators.
  5. Technical Verification: Correlate observations in decompiled_code with the raw assembly_code and stack_frame (canary status) for technical validation.