This document provides a technical reference for the data architecture, schemas, and analysis heuristics generated by the DeepExtract binary analysis extractor for IDA Pro.
DeepExtract is an IDA Pro Python extractor that converts binaries into structured, queryable data. It performs PE metadata extraction and function‑level analysis (xrefs, stack frames, loop analysis, indirect call resolution, vtable context reconstruction) and stores results in SQLite. Optional report generation produces file_info.md/file_info.json and per‑function .cpp files only when --generate-cpp is enabled and at least one function has usable decompiled output (C++ files are emitted only for functions passing the decompilation filter).
The structured output uses a SQLite‑based storage model supporting concurrent analysis and large‑scale binary repositories.
- Tracking Database (
analyzed_files.db): A registry tracking status, hashes, and metadata for binaries in the analysis pipeline.analysis_flagsstoresAnalysisConfig.to_dict()(see key list below; it does not includecpp_output_diror SQLite pragma overrides). - Individual Analysis Databases: Per‑binary databases containing disassembly/decompilation output, cross‑references, and analysis metadata. In batch mode the default is
{filename}_{hash}.dbunderextracted_dbs/; in single‑file mode this is the explicit--sqlite-dbpath. - Optional Report/Code Output: When
--generate-cppis enabled and at least one function has valid decompiled output, the extractor writesfile_info.md,file_info.json, and per‑function.cppfiles under{cpp_output_dir}/{module_name}/(or{sqlite_db_dir}/extracted_raw_code/{module_name}/if--cpp-output-diris not provided).
When using headless_batch_extractor.ps1, the script orchestrates multiple IDA runs and produces a consistent on‑disk layout in {StorageDir}:
analyzed_files.db— Tracking database used bycheck_analyzed_files.pyextracted_dbs/— SQLite analysis databases ({base_name}.db, see naming below)extracted_code/— Generated.cppfiles plusfile_info.md/file_info.json(only when-NoGenerateCppis not set)idb_cache/— IDA database cache files (.i64)logs/— IDA run logs ({base_name}_{timestamp}.log), batch execution log (batch_extractor_{timestamp}.log), and optional symbol download logs (symchk_*.log,symchk_*.log.err)analyzed_modules_list.txt— One absolute input file path per line (UTF-8), for the files that were analyzedextraction_report.json— JSON summary report for the batch run
{
"extraction_info": {
"timestamp": "YYYY-MM-DDTHH:mm:ss",
"mode": "Directory | FileList | Pid",
"storage_directory": "path",
"timeout_hours": number,
"extract_dirs": ["path", ...],
"extract_dirs_recursive": ["path", ...],
"target_pids": [number, ...],
"process_names": ["string", ...]
},
"summary": {
"total_files": number,
"successful": number,
"failed": number,
"timed_out": number,
"skipped": number,
"empty_log_files": number
},
"successful_extractions": [
{ "FileName": "path", "DbPath": "path", "IdbPath": "path", "LogFile": "path" }
],
"failed_extractions": [
{ "FileName": "path", "ExitCode": number, "LogFile": "path" }
],
"timed_out_extractions": [
{ "FileName": "path", "LogFile": "path", "ElapsedTime": number, "Reason": "string" }
],
"skipped_files": [
{ "FileName": "path", "Reason": "string" }
],
"empty_log_files": [
{ "FileName": "path", "LogFile": "path", "ExitCode": "number | \"unknown\"", "Reason": "string" }
]
}Notes:
target_pidsandprocess_namesare arrays included only whenmodeisPid.extract_dirs(non-recursive) andextract_dirs_recursiveare arrays included only whenmodeisDirectory. Either or both may be present depending on which parameters were used.- Per-file base name (
{base_name}) is computed as:{file.Name.Replace('.', '_')}_{md5_prefix}lower-cased, wheremd5_prefixis the first 10 characters of the MD5 of the file contents. Example:kernel32.dll->kernel32_dll_a1b2c3d4e5.
The extractor records function‑level heuristics in the functions table (JSON strings) and binary‑level heuristics in the file_info table.
- Algorithm: Tarjan SCC + dominator/back‑edge detection over IDA flow graphs (
loop_analysis.py). - Metrics: Per‑loop entries include
complexity,block_count,instruction_count,has_function_calls,memory_access,is_infinite,exit_condition_count,cyclomatic_complexity,back_edges, andnesting_level. - Availability: If analysis is disabled or unavailable, the extractor stores
{ "loops": [], "loop_count": 0, "analysis_available": false }. Large functions may be skipped withskipped,reason,block_count, andlimit. Errors may add anerrorfield.
- Algorithm: Matches outbound xref
function_namevalues againstdangerous_apis.jsonviaconstants.is_dangerous_api()(case‑insensitive, strips import prefixes like__imp_/_imp_, and treatsWPP_as dangerous). - Output: JSON string containing a list of unique API names. Empty list when none are detected or on error.
- Algorithm: Builds a function→strings map by enumerating
idautils.Strings()and xrefs to functions (string_analysis.py). A per‑binary cache in%TEMP%is used when available (validated against IDB mtime). - Output: JSON list of unique referenced string literals (capped at 2000 items, order not guaranteed).
- Fields (when available):
local_vars_size,args_size,saved_regs_size,exception_handler,frame_pointer_present,has_canary. - Canary detection: Combines frame variable names with instruction‑level patterns (
__security_check_cookie,stack_chk_fail, XOR against frame values). - Availability: When disabled or on failure, the extractor stores
{ "local_vars_size": null, "args_size": null, "has_canary": null, "analysis_available": false }(other keys are omitted).
- Source: Inbound xrefs flagged as vtable candidates; surrounding vtable slots are read and resolved.
- Output: Entries include
reconstructed_classes(C++ skeleton strings),source_ea, andextraction_type. Capped at 100 entries.
- Scope: Binary‑level analysis (file‑info) when TLS data exists.
- Output: Per‑callback metadata includes function identity, segment info, xrefs, string/API/crypto indicators, anti‑debug findings,
analysis_notes, andthreat_score/threat_level.
The extractor uses analysis engines to resolve binary patterns:
Implemented in interprocedural_analysis.py and invoked by indirect call analysis when available:
- Summary-based, bounded: max depth 3, limited callers/callees per function, LRU cache (size 1000) for summaries.
- Tracks: parameter usage as function pointers, return value sources (constant/parameter/global), and global‑based call targets.
- Outputs: analysis warnings when incomplete or when targets cannot be resolved.
- Indirect calls: Detects indirect call patterns (
call reg,call [mem],call [reg+off]), performs intra‑procedural data‑flow to identify targets, then applies inter‑procedural analysis when no targets are found or confidence < 50%. - Jump tables: Uses IDA switch info when available and a manual fallback for indirect jumps; both produce target lists with confidence scoring and detection metadata (minimum acceptance threshold 30%).
- Outbound xrefs: Encoded with confidence and validation checks in the
outbound_xrefsJSON entries.
- Hierarchy: Generates a structured directory of
.cppfiles plusfile_info.md/file_info.json. - Categorization: Groups functions into C++ class method files or standalone global function files based on naming conventions.
- Cleaned Source: Replaces IDA escape sequences and wraps comments for readability.
Schema validation uses deep_extract/schema.py, but the actual table creation occurs in deep_extract/pe_context_extractor.py:init_sqlite_db().
Purpose: Ledger tracking the analysis status of all binaries.
| Column Name | Data Type | Description |
|---|---|---|
file_path |
TEXT | PRIMARY KEY. Absolute path to the binary file. |
base_dir |
TEXT | The directory containing the binary file. |
file_name |
TEXT | The name of the binary file, including its extension. |
file_extension |
TEXT | The file's extension (e.g., .exe, .dll). |
md5_hash |
TEXT | The MD5 cryptographic hash for change detection. |
sha256_hash |
TEXT | The SHA256 cryptographic hash for change detection. |
analysis_db_path |
TEXT | Relative path to the individual analysis SQLite database (relative to the common DB directory; forward slashes; may be null until completion). |
status |
TEXT | Status: PENDING, ANALYZING, or COMPLETE. |
analysis_flags |
TEXT | JSON string of the analysis configuration used (see below). |
analysis_start_timestamp |
TIMESTAMP | Recorded when analysis began (used for stale lock recovery). |
analysis_completion_timestamp |
TIMESTAMP | Recorded when analysis completed. |
analysis_flags keys (from AnalysisConfig.to_dict):
sqlite_db,extract_dangerous_apis,extract_strings,extract_stack_frame,extract_globalsanalyze_loops,extract_pe_info,extract_pe_metadata,extract_advanced_pe,extract_runtime_infogenerate_cpp,force_reanalyze,use_interprocedural_analysis,thunk_depth,min_conf
Note: check_analyzed_files.py compares flags with force_reanalyze removed. cpp_output_dir, sqlite_pragmas, and common_db_pragmas are intentionally not stored in analysis_flags.
| Index Name | Column(s) | Purpose |
|---|---|---|
idx_files_name |
file_name |
Fast filename lookups. |
idx_files_hash |
md5_hash, sha256_hash |
Hash‑based change detection. |
idx_an_files_status_lower_name |
status, LOWER(file_name) |
Status + case‑insensitive name search. |
idx_an_files_lower_name |
LOWER(file_name) |
Case‑insensitive filename search. |
idx_an_files_lower_ext |
LOWER(file_extension) |
Case‑insensitive extension search. |
Purpose: Stores results for a single binary.
| Column Name | Data Type | Description |
|---|---|---|
file_path |
TEXT | Absolute path to the binary (PK). |
base_dir |
TEXT | Directory containing the file. |
file_name |
TEXT | Filename with extension. |
file_extension |
TEXT | Extension (e.g., .exe, .dll). |
file_size_bytes |
BIGINT | Total size in bytes. |
md5_hash |
TEXT | MD5 cryptographic hash. |
sha256_hash |
TEXT | SHA256 cryptographic hash. |
imports |
TEXT | JSON string of imported modules/functions (regular + delay‑load merged; per‑function is_delay_loaded). |
exports |
TEXT | JSON string of exported functions. |
entry_point |
JSON | JSON string of detected entry points (from extract_all_entry_points_with_methods). |
file_version |
TEXT | Resource file version string. |
product_version |
TEXT | Product version string. |
company_name |
TEXT | Vendor identification. |
file_description |
TEXT | File purpose description. |
internal_name |
TEXT | Internal module name. |
original_filename |
TEXT | Original filename from PE header. |
legal_copyright |
TEXT | Copyright string. |
product_name |
TEXT | Product name. |
time_date_stamp_str |
TEXT | Compilation timestamp. |
file_modified_date_str |
TEXT | Filesystem modification date. |
sections |
TEXT | JSON string of PE sections. |
pdb_path |
TEXT | Path to PDB debug symbols. |
rich_header |
TEXT | JSON string with Rich Header metadata. |
tls_callbacks |
TEXT | JSON string of TLS callback metadata. |
is_net_assembly |
BOOLEAN | Flag for .NET binaries. |
clr_metadata |
TEXT | JSON string with CLR header details. |
idb_cache_path |
TEXT | Relative path to cached .i64/.idb (from the analysis DB directory; forward slashes). |
dll_characteristics |
TEXT | JSON string of DLL flags. |
security_features |
TEXT | JSON string of feature assessments. |
exception_info |
TEXT | JSON string for exception directory. |
load_config |
TEXT | JSON string for PE Load Config data. |
analysis_timestamp |
TIMESTAMP | Timestamp (YYYY-MM-DD HH:MM:SS) recorded when the file_info row is inserted. |
| Column Name | Data Type | Description |
|---|---|---|
function_id |
INTEGER | Unique primary key. |
function_signature |
TEXT | Demangled signature (IDA long form). |
function_signature_extended |
TEXT | Extended signature using IDA type info when present. |
mangled_name |
TEXT | Original decorated name. |
function_name |
TEXT | Raw IDA function name (may be mangled or sub_…). |
assembly_code |
TEXT | Disassembly text (capped at MAX_ASSEMBLY_LINES). |
decompiled_code |
TEXT | Hex‑Rays output or placeholder text on failure. |
inbound_xrefs |
TEXT | Detailed JSON of inbound references. |
outbound_xrefs |
TEXT | Detailed JSON of outbound calls/jumps. |
simple_inbound_xrefs |
TEXT | Simplified JSON caller list. |
simple_outbound_xrefs |
TEXT | Simplified JSON callee list. |
vtable_contexts |
TEXT | JSON list of reconstructed class skeletons. |
global_var_accesses |
TEXT | JSON list of global read/write accesses. |
dangerous_api_calls |
TEXT | JSON list of dangerous API names (stored as a JSON string). |
string_literals |
TEXT | JSON list of referenced string literals (order not guaranteed). |
stack_frame |
TEXT | JSON object with stack/frame metrics. |
loop_analysis |
TEXT | JSON object with loop analysis data. |
analysis_errors |
TEXT | JSON array of processing errors/warnings. |
created_at |
TIMESTAMP | Record creation timestamp. |
Note: JSON fields in file_info/functions are stored as TEXT (serialized JSON). Most fields use json_safety.to_json_safe(), so values may be "null" or include truncation metadata.
Purpose: Normalized, relational cross-reference table that represents the call graph and data-reference graph as individual rows. Each row is a single directed edge between a source function and a target (which may be an internal function, an imported API, a data reference, or a vtable slot).
This table complements the JSON xref columns on the functions table (simple_inbound_xrefs, simple_outbound_xrefs, inbound_xrefs, outbound_xrefs). The JSON columns are optimized for per-function reads (loading one function's context), while function_xrefs is optimized for cross-module graph queries (traversals, reachability, call-chain analysis) using standard SQL joins and index lookups.
| Column Name | Data Type | Description |
|---|---|---|
id |
INTEGER | PRIMARY KEY (autoincrement). Row identifier. |
source_id |
INTEGER | function_id of the calling/referencing function. NULL for data-only inbound references with no resolved source function. |
target_id |
INTEGER | function_id of the called/referenced function. NULL when the target is an external import, data reference, or unresolved symbol. |
target_name |
TEXT | Name of the target function or symbol (e.g., "CreateProcessAsUserW", "AiCheckForAdminUser", "_SVCHOST_GLOBAL_DATA * g_pSvchostSharedGlobals"). |
target_module |
TEXT | Where the target lives: "internal" for same-module functions, DLL name for imports (e.g., "kernelbase.dll"), "data" for globals, "vtable" for vtable slots, "static_library" for statically linked library functions. |
function_type |
INTEGER | Target classification using IDA function type constants: FT_UNK=0, FT_GEN=1 (internal), FT_LIB=2 (static library), FT_API=3 (import), FT_MEM=4 (data), FT_VTB=8 (vtable/unknown offset). |
xref_type |
TEXT | IDA cross-reference type: "Call Near", "Offset", "Read", "Write", etc. |
direction |
TEXT | Edge direction relative to the function being processed: "outbound" (this function calls/references the target) or "inbound" (the target calls/references this function). |
Uniqueness constraint: UNIQUE(source_id, target_id, target_name, target_module, xref_type, direction) prevents duplicate edges. Insertion uses INSERT OR IGNORE to silently drop duplicates from IDA's raw xref data.
Relationship to JSON xref columns: The function_xrefs rows are derived from the same raw xref data as simple_inbound_xrefs and simple_outbound_xrefs. Both representations are populated during the same extraction pass. The JSON columns retain the original per-function grouping; the function_xrefs table flattens all edges into a single queryable structure.
Reverse lookup -- find all callers of a specific function:
SELECT DISTINCT f.function_name
FROM function_xrefs x
JOIN functions f ON f.function_id = x.source_id
WHERE x.target_name = 'CreateProcessAsUserW'
AND x.direction = 'outbound';Internal call graph reconstruction:
SELECT s.function_name AS caller, t.function_name AS callee
FROM function_xrefs x
JOIN functions s ON s.function_id = x.source_id
JOIN functions t ON t.function_id = x.target_id
WHERE x.direction = 'outbound'
AND x.target_module = 'internal';Import dependency surface -- which DLLs does a function depend on:
SELECT DISTINCT x.target_module, x.target_name
FROM function_xrefs x
WHERE x.source_id = ?
AND x.direction = 'outbound'
AND x.function_type = 3; -- FT_APIDead code detection -- functions with no inbound references:
SELECT f.function_name
FROM functions f
WHERE f.function_id NOT IN (
SELECT target_id FROM function_xrefs
WHERE direction = 'inbound' AND target_id IS NOT NULL
);Transitive call chain -- two-hop callers of a function (who calls the callers):
SELECT DISTINCT f2.function_name AS indirect_caller
FROM function_xrefs x1
JOIN function_xrefs x2 ON x2.target_id = x1.source_id
JOIN functions f2 ON f2.function_id = x2.source_id
WHERE x1.target_name = 'NtDuplicateToken'
AND x1.direction = 'outbound'
AND x2.direction = 'outbound'
AND x2.target_module = 'internal';Global data access analysis -- find all functions that read a specific global:
SELECT f.function_name
FROM function_xrefs x
JOIN functions f ON f.function_id = x.source_id
WHERE x.target_name LIKE '%g_pSvchostSharedGlobals%'
AND x.direction = 'outbound'
AND x.xref_type = 'Read';| Column Name | Data Type | Description |
|---|---|---|
version |
INTEGER | PRIMARY KEY. Schema version number. |
description |
TEXT | Description of the schema version. |
applied_timestamp |
TIMESTAMP | When this schema version was applied. |
migration_notes |
TEXT | Details regarding migration from prior version. |
| Index Name | Table | Column(s) | Purpose |
|---|---|---|---|
idx_file_info_lower_ext |
file_info |
LOWER(file_extension) |
Case-insensitive extension search. |
idx_file_info_lower_name |
file_info |
LOWER(file_name) |
Case-insensitive filename search. |
idx_functions_mangled |
functions |
mangled_name COLLATE NOCASE |
Case-insensitive mangled name lookups. |
idx_functions_name |
functions |
function_name COLLATE NOCASE |
Case-insensitive function name lookups. |
idx_functions_signature |
functions |
function_signature COLLATE NOCASE |
Case-insensitive signature search. |
idx_fxrefs_source |
function_xrefs |
source_id |
Fast caller lookups and outbound joins. |
idx_fxrefs_target |
function_xrefs |
target_id |
Fast callee lookups and inbound joins. |
idx_fxrefs_target_name |
function_xrefs |
target_name COLLATE NOCASE |
Name-based target search (e.g., imports). |
idx_fxrefs_direction |
function_xrefs |
direction |
Filter by inbound/outbound direction. |
The following JSON fields are stored as strings in the functions table.
Detailed inbound references (callers). Each entry commonly includes:
source_ea(hex string),source_location_details,xref_type,is_code,is_vtable_candidatesource_name_info(object withmangled/short/longor null)source_function_name,source_function_name_short,source_function_name_long,source_function_name_mangledsource_module_name,source_module_host,source_function_typecall_confidence,validation_checks,validation_warnings,extraction_type- Optional:
function_name(for code call/jump/offset xrefs)
Detailed outbound calls/jumps. Base fields include:
source_instruction_ea,target_ea,resolved_target_ea,xref_typefunction_name,short_function_name,function_name_short,function_name_long,function_name_mangledmodule_name,module_host(pre‑API‑set resolution),function_typeis_tail_call,is_thunk_resolvedcall_confidence,validation_checks,validation_warnings,extraction_type- Optional:
original_target_name(present when a thunk or jump table target was resolved)
Specialized entries add:
- VTable calls:
is_vtable_call,vtable_info(vtable address, name, method offset/names, size) - Indirect calls:
is_indirect_call,indirect_call_info(call_type,confidence,analysis_method) - Jump table targets:
is_jump_table_target,jump_table_detection_confidence,jump_table_detection_method
Simplified xref objects with fields:
function_name,function_id(null for data/external refs),module_name,function_type,xref_type,extraction_type("script")
function_type values: FT_UNK=0, FT_GEN=1, FT_LIB=2, FT_API=3, FT_MEM=4, FT_VTB=8, FT_SYS=16.
List of reconstructed class skeletons derived from vtable slots:
reconstructed_classes(array of C++ class skeleton strings)source_eaextraction_type(e.g.,detailed_vtable_analysis)
List of global variable reads/writes:
address(hex string)nameaccess_type(ReadorWrite)
Object containing:
loops: list of loop entries withcomplexity,block_count,instruction_count,has_function_calls,memory_access,is_infinite,exit_condition_count,cyclomatic_complexity,back_edges,nesting_levelloop_count- Optional:
analysis_available(false when disabled or unsupported),skipped,reason,block_count,limit,function_name, orerror
Object containing:
local_vars_size,args_size,saved_regs_size,exception_handler,frame_pointer_present,has_canary- Optional:
analysis_available(false when stack analysis is disabled or fails). When disabled/unavailable the extractor stores{ "local_vars_size": null, "args_size": null, "has_canary": null, "analysis_available": false }.
List of string literals referenced by the function.
List of unique API names flagged as dangerous.
List of error/warning objects. All entries include stage. Entries created from exceptions include error; truncation warnings may include severity, reason, and max_lines.
JSON fields are serialized with json_safety.to_json_safe, which enforces size and item limits. When truncation occurs:
- Lists may include a trailing
{"_truncated": true, "original_count": N, "shown_count": M}entry. - Dicts may include
_truncated_keyswhen key limits are exceeded. - Both lists and dicts receive
_serialization_metadatawithtruncated: trueand atransformationslist. - If a field exceeds the byte limit (default 50MB), an additional
_truncatedobject (or_errorobject) may be inserted with fields likereason,original_size_bytes, andmax_bytes.
Common per‑field caps (list limits):
entry_point: 100 itemsimports: 1000 modulesexports: 5000 entriessections: 200 entriestls_callbacks: 256 entriesstring_literals: 2000 entriesvtable_contexts: 100 entriesloop_analysis: 500 items per list inside the objectglobal_var_accesses: 1000 entriesanalysis_errors: 100 entriesinbound_xrefs/outbound_xrefs/simple_*_xrefs: capped byEXTRACTOR_MAX_XREFS(env var; default 50,000; clamped to 1,000–500,000)
- Registry Discovery: Query
analyzed_files.dbto find individual module databases and verify analysis completeness across software packages. - Call Chain Reconstruction: Use
function_xrefsfor efficient SQL-based call chain queries. For lightweight per-function reads, usesimple_outbound_xrefs. For indirect calls/jump tables/vtables, useoutbound_xrefsentries withindirect_call_info,jump_table_*, orvtable_info. - Cross-Reference Analysis: Use
function_xrefswith SQL joins for graph queries (reverse lookups, transitive callers, import surface analysis). Useinbound_xrefsandoutbound_xrefsJSON columns when full xref detail is needed for a single function. - Heuristic Data Retrieval: Use SQL
JSON_EXTRACTto filter functions by technical metrics such as loop complexity, stack size, or anti-debug indicators. - Technical Verification: Correlate observations in
decompiled_codewith the rawassembly_codeandstack_frame(canary status) for technical validation.