Progress towards tree-sitter feature #3102
Code Review
This pull request introduces Tree-Sitter Script Analysis to capa, enabling feature extraction from script languages such as C#, Python, HTML, and ASPX templates. It adds a new Tree-Sitter-based feature extractor, auto-detection capabilities, and signature-based tools, along with comprehensive tests and updated dependencies. The code review feedback primarily addresses compatibility issues with the upgraded tree-sitter library (version 0.25.0), specifically pointing out that QueryCursor has been removed and the Parser instantiation has changed in tree-sitter versions >= 0.21.0. The feedback provides actionable suggestions to execute queries directly and update parser usage. Additionally, it identifies a bug in integer suffix parsing and recommends replacing a deprecated importlib.resources API.
| def get_function_call_names(self, node: Node) -> Iterator[Node]: | ||
| cursor = QueryCursor(self.query.function_call_name) | ||
| yield from self.get_captured_nodes(cursor, node) | ||
| def get_imported_constants(self, node: Node) -> Iterator[Node]: | ||
| cursor = QueryCursor(self.query.imported_constant_name) | ||
| yield from self.get_captured_nodes(cursor, node) |
Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.
| - def get_function_call_names(self, node: Node) -> Iterator[Node]: | |
| -     cursor = QueryCursor(self.query.function_call_name) | |
| -     yield from self.get_captured_nodes(cursor, node) | |
| - def get_imported_constants(self, node: Node) -> Iterator[Node]: | |
| -     cursor = QueryCursor(self.query.imported_constant_name) | |
| -     yield from self.get_captured_nodes(cursor, node) | |
| + def get_function_call_names(self, node: Node) -> Iterator[Node]: | |
| +     yield from self.get_captured_nodes(self.query.function_call_name, node) | |
| + def get_imported_constants(self, node: Node) -> Iterator[Node]: | |
| +     yield from self.get_captured_nodes(self.query.imported_constant_name, node) |
QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/
| def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]: | ||
| target_node = self.tree.root_node if node is None else node | ||
| cursor = QueryCursor(self.query.namespace) | ||
| namespace_captures: List[Tuple[Node, str]] = [] | ||
| for query_name, nodes in cursor.captures(target_node).items(): | ||
| for namespace_node in nodes: | ||
| namespace_captures.append((namespace_node, query_name)) | ||
| return sorted(namespace_captures, key=self.get_node_capture_sort_key) | ||
Since QueryCursor is removed in tree-sitter >= 0.21.0, you should execute the query directly using self.query.namespace.captures(target_node). This returns a list of (Node, str) tuples representing the captured nodes and their capture names.
| - def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]: | |
| -     target_node = self.tree.root_node if node is None else node | |
| -     cursor = QueryCursor(self.query.namespace) | |
| -     namespace_captures: List[Tuple[Node, str]] = [] | |
| -     for query_name, nodes in cursor.captures(target_node).items(): | |
| -         for namespace_node in nodes: | |
| -             namespace_captures.append((namespace_node, query_name)) | |
| -     return sorted(namespace_captures, key=self.get_node_capture_sort_key) | |
| + def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]: | |
| +     target_node = self.tree.root_node if node is None else node | |
| +     captures = self.query.namespace.captures(target_node) | |
| +     namespace_captures: List[Tuple[Node, str]] = [] | |
| +     for namespace_node, query_name in captures: | |
| +         namespace_captures.append((namespace_node, query_name)) | |
| +     return sorted(namespace_captures, key=self.get_node_capture_sort_key) |
QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/
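The two versions of `get_namespaces` in this thread disagree about the shape of the value returned by `captures()`: the PR code iterates over a mapping from capture name to a list of nodes, while the suggestion iterates over `(node, name)` pairs. A minimal sketch of a helper that accepts either shape is shown below. The name `normalize_captures` is hypothetical and not part of the PR, and plain strings stand in for tree-sitter nodes.

```python
from typing import Any, Dict, Iterable, List, Tuple, Union

# Either a mapping {capture_name: [node, ...]} or an iterable of (node, capture_name) pairs.
Captures = Union[Dict[str, List[Any]], Iterable[Tuple[Any, str]]]


def normalize_captures(captures: Captures) -> List[Tuple[Any, str]]:
    """Flatten either capture shape into a list of (node, capture_name) pairs.

    Hypothetical helper for illustration only; it does not call tree-sitter.
    """
    if isinstance(captures, dict):
        # Mapping shape: one entry per capture name, each holding several nodes.
        return [(node, name) for name, nodes in captures.items() for node in nodes]
    # Pair shape: already (node, capture_name) tuples.
    return [(node, name) for node, name in captures]
```

With a helper of this kind, the sorting step at the end of `get_namespaces` would not depend on which shape the installed binding returns.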
| def get_string_literals(self, node: Node) -> Iterator[Node]: | ||
| cursor = QueryCursor(self.query.string_literal) | ||
| yield from self.get_captured_nodes(cursor, node) | ||
| def get_integer_literals(self, node: Node) -> Iterator[Node]: | ||
| cursor = QueryCursor(self.query.integer_literal) | ||
| yield from self.get_captured_nodes(cursor, node) |
Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.
| - def get_string_literals(self, node: Node) -> Iterator[Node]: | |
| -     cursor = QueryCursor(self.query.string_literal) | |
| -     yield from self.get_captured_nodes(cursor, node) | |
| - def get_integer_literals(self, node: Node) -> Iterator[Node]: | |
| -     cursor = QueryCursor(self.query.integer_literal) | |
| -     yield from self.get_captured_nodes(cursor, node) | |
| + def get_string_literals(self, node: Node) -> Iterator[Node]: | |
| +     yield from self.get_captured_nodes(self.query.string_literal, node) | |
| + def get_integer_literals(self, node: Node) -> Iterator[Node]: | |
| +     yield from self.get_captured_nodes(self.query.integer_literal, node) |
QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/
| def parse(self) -> Tree: | ||
| parser = Parser(self.query.language) | ||
| return parser.parse(self.buf) |
In tree-sitter >= 0.21.0, the Parser constructor does not accept a Language argument. You must set the language attribute after instantiating the parser.
| - def parse(self) -> Tree: | |
| -     parser = Parser(self.query.language) | |
| -     return parser.parse(self.buf) | |
| + def parse(self) -> Tree: | |
| +     parser = Parser() | |
| +     parser.language = self.query.language | |
| +     return parser.parse(self.buf) |
The Parser constructor accepts a Language argument.
Link: https://tree-sitter.github.io/py-tree-sitter/classes/tree_sitter.Parser.html
| def get_content_sections(self) -> Iterator[Node]: | ||
| cursor = QueryCursor(self.query.content) | ||
| yield from self.get_captured_nodes(cursor, self.tree.root_node) |
Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.
| - def get_content_sections(self) -> Iterator[Node]: | |
| -     cursor = QueryCursor(self.query.content) | |
| -     yield from self.get_captured_nodes(cursor, self.tree.root_node) | |
| + def get_content_sections(self) -> Iterator[Node]: | |
| +     yield from self.get_captured_nodes(self.query.content, self.tree.root_node) |
QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/
| import re | ||
| from typing import List, Tuple, Iterator, Optional | ||
| from tree_sitter import Node, Tree, Parser, QueryCursor |
QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/
| def parse_integer(self, integer: str) -> int: | ||
| for suffix in self.integer_suffixes: | ||
| if integer.endswith(suffix): | ||
| integer = integer[:-1] | ||
| for prefix, base in self.integer_prefixes: | ||
| if integer.startswith(prefix): | ||
| return int(integer, base) | ||
| return int(integer) |
There is a bug in parse_integer when stripping suffixes. If an integer literal has multiple suffixes (e.g., 100ul in C#), the loop only strips the last character of the suffix that matched first, leaving the other suffix character intact (e.g., leaving 100u), which then fails to parse as an integer and raises a ValueError. Using rstrip with a combined set of uppercase and lowercase suffixes is more robust and correctly handles multi-character suffixes.
| - def parse_integer(self, integer: str) -> int: | |
| -     for suffix in self.integer_suffixes: | |
| -         if integer.endswith(suffix): | |
| -             integer = integer[:-1] | |
| -     for prefix, base in self.integer_prefixes: | |
| -         if integer.startswith(prefix): | |
| -             return int(integer, base) | |
| -     return int(integer) | |
| + def parse_integer(self, integer: str) -> int: | |
| +     suffixes = "".join(self.integer_suffixes) | |
| +     if suffixes: | |
| +         integer = integer.rstrip(suffixes + suffixes.upper()) | |
| +     for prefix, base in self.integer_prefixes: | |
| +         if integer.startswith(prefix): | |
| +             return int(integer, base) | |
| +     return int(integer) |
Addressed
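To make the suffix bug concrete, here is a small standalone sketch that contrasts the loop-based stripping with the `rstrip`-based version from the suggestion. The suffix and prefix tables are assumed values chosen for illustration; the PR's actual `integer_suffixes` and `integer_prefixes` are not shown in this thread, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed tables for illustration; the real per-language values may differ.
INTEGER_SUFFIXES: Sequence[str] = ("u", "l")
INTEGER_PREFIXES: Sequence[Tuple[str, int]] = (("0x", 16), ("0b", 2))


def _to_int(text: str) -> int:
    """Convert a suffix-free literal, honoring a base prefix if present."""
    for prefix, base in INTEGER_PREFIXES:
        if text.startswith(prefix):
            # int() accepts the matching base prefix, e.g. int("0xFF", 16).
            return int(text, base)
    return int(text)


def parse_integer_loop(text: str) -> int:
    """Original shape: drops one character per suffix that matches, in table order."""
    for suffix in INTEGER_SUFFIXES:
        if text.endswith(suffix):
            text = text[:-1]
    return _to_int(text)


def parse_integer_rstrip(text: str) -> int:
    """Suggested shape: strips every trailing suffix character, in either case."""
    chars = "".join(INTEGER_SUFFIXES)
    if chars:
        text = text.rstrip(chars + chars.upper())
    return _to_int(text)
```

With these assumed tables, `parse_integer_loop("100ul")` leaves `100u` behind and raises `ValueError`, while `parse_integer_rstrip("100ul")` parses cleanly. One caveat worth keeping in mind: `rstrip` removes characters rather than whole suffixes, so a suffix letter that is also a valid hexadecimal digit would eat into hex literals.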
| def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]: | ||
| signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file)) | ||
| return {category: set(names) for category, names in signatures.items()} |
importlib.resources.read_text is deprecated since Python 3.11. Use importlib.resources.files instead, which is fully supported in Python 3.10+.
| - def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]: | |
| -     signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file)) | |
| -     return {category: set(names) for category, names in signatures.items()} | |
| + def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]: | |
| +     ref = importlib.resources.files(capa.features.extractors.ts.signatures) / signature_file | |
| +     signatures = json.loads(ref.read_text(encoding="utf-8")) | |
| +     return {category: set(names) for category, names in signatures.items()} |
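The `importlib.resources.files` pattern from the suggestion can be tried in isolation. The sketch below is a free function rather than the PR's method, and it takes the package as a parameter so it can be pointed at any importable package that ships JSON files; that parameter is an assumption made for illustration.

```python
import importlib.resources
import json
from typing import Dict, Set


def load_import_signatures(package: str, signature_file: str) -> Dict[str, Set[str]]:
    """Read a JSON signature file bundled inside `package`.

    `package` is the dotted name of an importable package; this parameter is
    hypothetical and replaces the hard-coded package used in the PR.
    """
    # files() returns a Traversable; "/" joins a resource name onto it.
    ref = importlib.resources.files(package) / signature_file
    signatures = json.loads(ref.read_text(encoding="utf-8"))
    # Each category maps to a list of names; sets make membership checks cheap.
    return {category: set(names) for category, names in signatures.items()}
```

Because the resource is resolved through the import system rather than a filesystem path, the same call is intended to work for packages that are not laid out as plain directories.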
e71ca37 to 5c19383
2e75cd1 to 050ab84
Revives and supersedes old PR mandiant#1080. Resolves merge conflicts and brings up to date with current master.
46a6446 to 19308f7
19308f7 to 291975a
@mike-hunhoff @larchchen @Maijin The first version of the script analysis feature is ready for review.
mr-tz left a comment
Initial review of the first few files
| if extractor.is_library_function(f.address): | ||
| function_name = extractor.get_function_name(f.address) | ||
| - logger.debug("skipping library function 0x%x (%s)", f.address, function_name) | ||
| + logger.debug("skipping library function %s (%s)", f.address, function_name) |
Good point. The address in this debug log is always an int, so I will revert this line to the integer format.
On line 208, however, it has to be formatted as a string, because f.address there can be a FileOffsetRangeAddress for Tree-sitter functions.
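The distinction in this reply can be illustrated with a small formatting helper that prints integer addresses in hex and falls back to `str()` for anything else. This is only a sketch: `format_address` and the `RangeAddress` class are hypothetical stand-ins, not capa's own types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeAddress:
    """Hypothetical stand-in for a non-integer address such as a file offset range."""

    start: int
    end: int

    def __str__(self) -> str:
        return f"range(0x{self.start:x}, 0x{self.end:x})"


def format_address(address: object) -> str:
    """Format integer addresses as hex; use str() for every other address type."""
    if isinstance(address, int):
        return f"0x{address:x}"
    return str(address)
```

Passing the result to a `%s` placeholder keeps one log call valid for both kinds of address, whereas applying `0x%x` to a non-integer object raises `TypeError`.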
| def get_language_from_ext(path: str) -> str: | ||
| if path.endswith(EXT_ASPX): | ||
| return LANG_TEM |
"Embedded template" is not immediately clear to me; some documentation on that would be good (maybe it is further down).
I haven’t added documentation yet, but I’ll add a section describing the script analysis feature in the near future.
| buf = f.read() | ||
| return get_language_ts(buf) | ||
| except ValueError: | ||
| return get_language_from_ext(str(path)) |
I guess the order is debatable here, so flagging for further discussion
My thinking here was to prefer Tree-sitter’s language detection when possible, since it can inspect the file contents rather than relying only on the extension. The extension fallback is mainly for cases where content-based detection fails.
That said, I agree that the order is debatable. Happy to discuss more on this!
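The ordering under discussion, content-based detection first with the file extension as a fallback, can be sketched as follows. The extension table and function names are hypothetical, and the content detector is passed in as a callable so the sketch does not depend on Tree-sitter; it is assumed to raise `ValueError` when it cannot decide, as in the hunk above.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical extension table; the PR's real constants are not reproduced here.
EXTENSION_LANGUAGES: Dict[str, str] = {
    ".cs": "c_sharp",
    ".py": "python",
    ".aspx": "embedded_template",
}


def language_from_extension(path: Path) -> str:
    """Map a file extension to a language name, or raise ValueError."""
    try:
        return EXTENSION_LANGUAGES[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported extension: {path.suffix!r}") from None


def detect_language(path: Path, buf: bytes, from_content: Callable[[bytes], str]) -> str:
    """Prefer content-based detection; fall back to the extension on ValueError."""
    try:
        return from_content(buf)
    except ValueError:
        return language_from_extension(path)
```

Swapping the two calls inside `detect_language` gives the opposite policy, so the design question is contained in a single function.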
Kudos for making this huge PR work 👍. I have two main high-level comments:
| @@ -0,0 +1,95 @@ | |||
| { | |||
| "classes" : [ | |||
| "System.Data.SqlClient.SqlCommand", | |||
Would recommend sorting all of these signatures everywhere. This will ease contributions.
| @@ -0,0 +1,47 @@ | |||
| { | |||
| "classes": [ | |||
| "socket.socket", | |||
Same here - sort everywhere
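One way to act on the sorting request is a small normalization step that can be run over each signature file. The sketch below assumes every file is a JSON object mapping a category name to a list of strings, as in the two excerpts above; the function names are hypothetical.

```python
import json
from typing import Dict, List


def sort_signatures(signatures: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy with categories and names sorted and duplicate names dropped."""
    return {category: sorted(set(names)) for category, names in sorted(signatures.items())}


def dump_signatures(signatures: Dict[str, List[str]]) -> str:
    """Serialize sorted signatures with stable formatting for clean diffs."""
    return json.dumps(sort_signatures(signatures), indent=4) + "\n"
```

Note that `sorted` orders by code point, so uppercase names sort before lowercase ones; a case-insensitive key could be used instead if that reads better for contributors.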
Sure, Maxime. We can discuss the architecture in the upcoming meetings.
| taste = sample_path.open("rb").read(8) | ||
| return taste |
Undo this change to leverage the default context manager, which closes the file.
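For reference, a minimal sketch of the context-manager form the reviewer is asking for. The function name `read_taste` is hypothetical; only the 8-byte read comes from the hunk above.

```python
from pathlib import Path


def read_taste(sample_path: Path) -> bytes:
    """Read the first 8 bytes of a file, closing the handle deterministically."""
    # The with-block closes the file even if read() raises.
    with sample_path.open("rb") as f:
        return f.read(8)
```

The one-line `sample_path.open("rb").read(8)` form relies on the garbage collector to close the handle, which is the behavior the review comment wants to avoid.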
| def get_format_from_report(sample: Path) -> str: | ||
| - if sample.name.endswith((".log", ".log.gz")): | ||
| + if sample.name.endswith((".log", "log.gz")): |
is this change needed/wanted?
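The practical effect of dropping the leading dot can be checked directly with `str.endswith`, which accepts a tuple of suffixes. The two tuples below are copied from the hunk; the file names and the helper `matches` are made up for illustration.

```python
from typing import Tuple

STRICT_SUFFIXES: Tuple[str, ...] = (".log", ".log.gz")  # before the change
LOOSE_SUFFIXES: Tuple[str, ...] = (".log", "log.gz")  # after the change


def matches(name: str, suffixes: Tuple[str, ...]) -> bool:
    """Return True if the file name ends with any of the given suffixes."""
    return name.endswith(suffixes)
```

With the looser tuple, a name such as `catalog.gz` is also treated as a log report, which the dotted form rejects.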
| # assume VMRay zipfile at a minimum has these files | ||
| return FORMAT_VMRAY | ||
| - elif sample.name.endswith((".json", ".json_", ".json.gz")): | ||
| + elif sample.name.endswith(("json", "json_", "json.gz")): |
| "feature": "string: /innerRename/", | ||
| "marks": [ | ||
| { | ||
| "backend": "binexport", | ||
| "mark": "xfail", | ||
| "reason": "string extraction mismatch in BinExport fixture" | ||
| } | ||
| ] |
where do diffs on BinExport tests come from?
| "file": "c91887", | ||
| "location": "function=0x401A77", | ||
| - "feature": "api: kernel32.CreatePipe", | ||
| + "feature": "api: CreatePipe", |
same here, how does this lead to test changes here?
does this PR need to be rebased?
mr-tz left a comment
nice progress overall, thanks for all the work on getting this feature finally integrated :)
Addresses PR #1080 and #2931
This PR updates capa’s Tree-sitter support for script analysis and fixes the related integration issues so the feature works correctly in both source and packaged builds.
Changes include:
Checklist
Parts of this implementation were assisted using AI tools (Codex and ChatGPT).
AI was used for:
All code was reviewed, modified and tested manually before submission.
Testing:

I built it using PyInstaller and tested the capabilities on script inputs. It functions as intended.