Progress towards tree-sitter feature by saniyafatima07 · Pull Request #3102 · mandiant/capa

saniyafatima07 · 2026-06-10T18:39:59Z

Addresses PR #1080 and #2931

This PR updates capa’s Tree-sitter support for script analysis and fixes the related integration issues so the feature works correctly in both source and packaged builds.

Changes include:

Updated a few outdated Tree-sitter queries
Ensured that all CI tests pass
Fixed a few import errors
Added Python test scripts
Added PyInstaller packaging support so the Tree-sitter signature JSON files are included in bundled builds.

Checklist

CHANGELOG updated

A few tests added

No documentation update needed currently.

This submission includes AI-generated code and I have provided details in the description.

Parts of this implementation were assisted using AI tools (Codex and ChatGPT).

AI was used for:

writing the test script for Python
fixing a few errors

All code was reviewed, modified and tested manually before submission.

Testing:
I built it using PyInstaller and tested the capabilities on script inputs. It functions as intended.

gemini-code-assist

Code Review

This pull request introduces Tree-Sitter Script Analysis to capa, enabling feature extraction from script languages such as C#, Python, HTML, and ASPX templates. It adds a new Tree-Sitter-based feature extractor, auto-detection capabilities, and signature-based tools, along with comprehensive tests and updated dependencies. The code review feedback primarily addresses compatibility issues with the upgraded tree-sitter library (version 0.25.0), specifically pointing out that QueryCursor has been removed and the Parser instantiation has changed in tree-sitter versions >= 0.21.0. The feedback provides actionable suggestions to execute queries directly and update parser usage. Additionally, it identifies a bug in integer suffix parsing and recommends replacing a deprecated importlib.resources API.

gemini-code-assist · 2026-06-10T18:42:53Z

+    def get_function_call_names(self, node: Node) -> Iterator[Node]:
+        cursor = QueryCursor(self.query.function_call_name)
+        yield from self.get_captured_nodes(cursor, node)
+
+    def get_imported_constants(self, node: Node) -> Iterator[Node]:
+        cursor = QueryCursor(self.query.imported_constant_name)
+        yield from self.get_captured_nodes(cursor, node)


Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.

Suggested change

-    def get_function_call_names(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.function_call_name)
-        yield from self.get_captured_nodes(cursor, node)
-
-    def get_imported_constants(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.imported_constant_name)
-        yield from self.get_captured_nodes(cursor, node)
+    def get_function_call_names(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.function_call_name, node)
+
+    def get_imported_constants(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.imported_constant_name, node)

QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/

gemini-code-assist · 2026-06-10T18:42:54Z

+    def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:
+        target_node = self.tree.root_node if node is None else node
+        cursor = QueryCursor(self.query.namespace)
+        namespace_captures: List[Tuple[Node, str]] = []
+
+        for query_name, nodes in cursor.captures(target_node).items():
+            for namespace_node in nodes:
+                namespace_captures.append((namespace_node, query_name))
+
+        return sorted(namespace_captures, key=self.get_node_capture_sort_key)
+


Since QueryCursor is removed in tree-sitter >= 0.21.0, you should execute the query directly using self.query.namespace.captures(target_node). This returns a list of (Node, str) tuples representing the captured nodes and their capture names.

Suggested change

-    def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:
-        target_node = self.tree.root_node if node is None else node
-        cursor = QueryCursor(self.query.namespace)
-        namespace_captures: List[Tuple[Node, str]] = []
-
-        for query_name, nodes in cursor.captures(target_node).items():
-            for namespace_node in nodes:
-                namespace_captures.append((namespace_node, query_name))
-
-        return sorted(namespace_captures, key=self.get_node_capture_sort_key)
+    def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:
+        target_node = self.tree.root_node if node is None else node
+        captures = self.query.namespace.captures(target_node)
+        namespace_captures: List[Tuple[Node, str]] = []
+
+        for namespace_node, query_name in captures:
+            namespace_captures.append((namespace_node, query_name))
+
+        return sorted(namespace_captures, key=self.get_node_capture_sort_key)

QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/

gemini-code-assist · 2026-06-10T18:42:54Z

+    def get_string_literals(self, node: Node) -> Iterator[Node]:
+        cursor = QueryCursor(self.query.string_literal)
+        yield from self.get_captured_nodes(cursor, node)
+
+    def get_integer_literals(self, node: Node) -> Iterator[Node]:
+        cursor = QueryCursor(self.query.integer_literal)
+        yield from self.get_captured_nodes(cursor, node)


Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.

Suggested change

-    def get_string_literals(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.string_literal)
-        yield from self.get_captured_nodes(cursor, node)
-
-    def get_integer_literals(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.integer_literal)
-        yield from self.get_captured_nodes(cursor, node)
+    def get_string_literals(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.string_literal, node)
+
+    def get_integer_literals(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.integer_literal, node)

QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/

gemini-code-assist · 2026-06-10T18:42:54Z

+    def parse(self) -> Tree:
+        parser = Parser(self.query.language)
+        return parser.parse(self.buf)


In tree-sitter >= 0.21.0, the Parser constructor does not accept a Language argument. You must set the language attribute after instantiating the parser.

Suggested change

-    def parse(self) -> Tree:
-        parser = Parser(self.query.language)
-        return parser.parse(self.buf)
+    def parse(self) -> Tree:
+        parser = Parser()
+        parser.language = self.query.language
+        return parser.parse(self.buf)

The Parser constructor accepts a Language argument.
Link: https://tree-sitter.github.io/py-tree-sitter/classes/tree_sitter.Parser.html

gemini-code-assist · 2026-06-10T18:42:54Z

+    def get_content_sections(self) -> Iterator[Node]:
+        cursor = QueryCursor(self.query.content)
+        yield from self.get_captured_nodes(cursor, self.tree.root_node)


Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.

Suggested change

-    def get_content_sections(self) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.content)
-        yield from self.get_captured_nodes(cursor, self.tree.root_node)
+    def get_content_sections(self) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.content, self.tree.root_node)

QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/

gemini-code-assist · 2026-06-10T18:42:55Z

+import re
+from typing import List, Tuple, Iterator, Optional
+
+from tree_sitter import Node, Tree, Parser, QueryCursor


In tree-sitter >= 0.21.0, QueryCursor has been removed. Import Query instead to execute queries directly.

Suggested change

-from tree_sitter import Node, Tree, Parser, QueryCursor
+from tree_sitter import Node, Tree, Parser, Query

QueryCursor exists in the latest version. Verified with the official documentation.
Link - https://tree-sitter.github.io/py-tree-sitter/

gemini-code-assist · 2026-06-10T18:42:55Z

+    def parse_integer(self, integer: str) -> int:
+        for suffix in self.integer_suffixes:
+            if integer.endswith(suffix):
+                integer = integer[:-1]
+        for prefix, base in self.integer_prefixes:
+            if integer.startswith(prefix):
+                return int(integer, base)
+        return int(integer)


There is a bug in parse_integer when stripping suffixes. If an integer literal has multiple suffixes (e.g., 100ul in C#), the loop only strips the last character of the suffix that matched first, leaving the other suffix character intact (e.g., leaving 100u), which then fails to parse as an integer and raises a ValueError. Using rstrip with a combined set of uppercase and lowercase suffixes is more robust and correctly handles multi-character suffixes.

Suggested change

-    def parse_integer(self, integer: str) -> int:
-        for suffix in self.integer_suffixes:
-            if integer.endswith(suffix):
-                integer = integer[:-1]
-        for prefix, base in self.integer_prefixes:
-            if integer.startswith(prefix):
-                return int(integer, base)
-        return int(integer)
+    def parse_integer(self, integer: str) -> int:
+        suffixes = "".join(self.integer_suffixes)
+        if suffixes:
+            integer = integer.rstrip(suffixes + suffixes.upper())
+        for prefix, base in self.integer_prefixes:
+            if integer.startswith(prefix):
+                return int(integer, base)
+        return int(integer)
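
One caveat with `rstrip`: it removes any run of the listed characters, so it is only safe while no suffix letter can also be a digit in the literal's base (a suffix set containing `d` or `f`, for example, would eat trailing hex digits). A minimal sketch of an alternative that removes one whole suffix, longest first, is below. The suffix and prefix tables are assumptions for illustration and are not capa's actual values.

```python
from typing import List, Tuple

# Hypothetical tables for illustration; the real extractor's values may differ.
INTEGER_SUFFIXES: List[str] = ["ul", "lu", "u", "l"]
INTEGER_PREFIXES: List[Tuple[str, int]] = [("0x", 16), ("0b", 2)]


def parse_integer(integer: str) -> int:
    lowered = integer.lower()
    # try longer suffixes first so "ul" is removed as a unit rather than just "l"
    for suffix in sorted(INTEGER_SUFFIXES, key=len, reverse=True):
        if lowered.endswith(suffix):
            lowered = lowered[: -len(suffix)]
            break
    for prefix, base in INTEGER_PREFIXES:
        if lowered.startswith(prefix):
            # int() accepts a "0x"/"0b" prefix when the base matches it
            return int(lowered, base)
    return int(lowered)
```

Stripping at most one suffix keeps the behavior predictable: a literal such as `100ul` loses `ul` in a single step, and anything left over that is not a valid integer still raises `ValueError` instead of being silently trimmed.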

gemini-code-assist · 2026-06-10T18:42:55Z

+    def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]:
+        signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file))
+        return {category: set(names) for category, names in signatures.items()}


importlib.resources.read_text is deprecated since Python 3.11. Use importlib.resources.files instead, which is fully supported in Python 3.10+.

Suggested change

-    def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]:
-        signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file))
-        return {category: set(names) for category, names in signatures.items()}
+    def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]:
+        ref = importlib.resources.files(capa.features.extractors.ts.signatures) / signature_file
+        signatures = json.loads(ref.read_text(encoding="utf-8"))
+        return {category: set(names) for category, names in signatures.items()}
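
To try the `importlib.resources.files` pattern outside capa, here is a small self-contained sketch. It builds a throwaway package on disk so it runs without capa installed; the package name `demo_signatures` and the file contents are made up for the example.

```python
import importlib
import importlib.resources
import json
import sys
import tempfile
from pathlib import Path
from typing import Dict, Set


def load_import_signatures(package: str, signature_file: str) -> Dict[str, Set[str]]:
    # files() returns a Traversable; "/" joins a resource name onto it
    ref = importlib.resources.files(package) / signature_file
    signatures = json.loads(ref.read_text(encoding="utf-8"))
    return {category: set(names) for category, names in signatures.items()}


# build a temporary package (hypothetical name) holding one signature file
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_signatures"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    (pkg_dir / "py.json").write_text(json.dumps({"classes": ["socket.socket"]}), encoding="utf-8")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        loaded = load_import_signatures("demo_signatures", "py.json")
    finally:
        # undo the import-path changes made for the demo
        sys.path.remove(tmp)
        sys.modules.pop("demo_signatures", None)

print(loaded)
```

Because the lookup goes through the import system rather than a filesystem path, the same call should keep working when the data files are bundled, provided the packaging step actually includes them.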

saniyafatima07 · 2026-06-23T18:44:44Z

@mike-hunhoff @larchchen @Maijin The first version of the script analysis feature is ready for review.
Could you please take a look when you have a chance?

mr-tz

Initial review of the first few files

mr-tz · 2026-06-24T14:00:36Z

            if extractor.is_library_function(f.address):
                function_name = extractor.get_function_name(f.address)
-                logger.debug("skipping library function 0x%x (%s)", f.address, function_name)
+                logger.debug("skipping library function %s (%s)", f.address, function_name)


Aren't these ints?

Good point. The address in this debug log is an int, so I will revert this line to the int format.
However, the log at line 208 has to use a string format, because f.address there can be a FileOffsetRangeAddress for Tree-sitter functions.
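
One way to avoid choosing between `0x%x` and `%s` per call site is a small formatting helper. This is only a sketch: `format_address` and the `RangeAddress` stand-in are hypothetical names, and the real FileOffsetRangeAddress may render differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeAddress:
    # hypothetical stand-in for a non-integer address type
    start: int
    end: int

    def __str__(self) -> str:
        return f"range(0x{self.start:x}, 0x{self.end:x})"


def format_address(address: object) -> str:
    # ints keep the familiar hex rendering; everything else uses its own str()
    if isinstance(address, int):
        return f"0x{address:x}"
    return str(address)


print(format_address(0x401A77))
print(format_address(RangeAddress(0, 16)))
```

A call such as `logger.debug("skipping library function %s (%s)", format_address(f.address), function_name)` would then work for both integer and range addresses.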

mr-tz · 2026-06-24T17:22:21Z

+
+def get_language_from_ext(path: str) -> str:
+    if path.endswith(EXT_ASPX):
+        return LANG_TEM


Embedded template is not immediately clear to me, some doc on that would be good (maybe it is further down).

I haven’t added documentation yet, but I’ll add a section describing the script analysis feature in the near future.

mr-tz · 2026-06-24T17:23:17Z

+            buf = f.read()
+        return get_language_ts(buf)
+    except ValueError:
+        return get_language_from_ext(str(path))


I guess the order is debatable here, so flagging for further discussion

My thinking here was to prefer Tree-sitter’s language detection when possible, since it can inspect the file contents rather than relying only on the extension. The extension fallback is mainly for cases where content-based detection fails.

That said, I agree that the order is debatable. Happy to discuss more on this!
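
For the discussion, the ordering described above can be sketched roughly as follows. The extension table, the language names, and the toy content detector are all assumptions made for the example; they do not reproduce the PR's constants or Tree-sitter's actual detection.

```python
from typing import Callable, Dict

# hypothetical extension table; the PR's own constants are not reproduced here
EXTENSION_TO_LANGUAGE: Dict[str, str] = {
    ".py": "python",
    ".cs": "c_sharp",
    ".aspx": "embedded_template",
}


def language_from_extension(path: str) -> str:
    for ext, language in EXTENSION_TO_LANGUAGE.items():
        if path.endswith(ext):
            return language
    raise ValueError(f"unsupported extension: {path}")


def toy_content_detector(buf: bytes) -> str:
    # hypothetical detector: recognizes only an obvious Python shebang
    if buf.startswith(b"#!/usr/bin/env python"):
        return "python"
    raise ValueError("could not detect language from content")


def detect_language(path: str, buf: bytes, detect_from_content: Callable[[bytes], str]) -> str:
    # prefer content-based detection; fall back to the extension only when it fails
    try:
        return detect_from_content(buf)
    except ValueError:
        return language_from_extension(path)
```

Swapping the order is a matter of trying `language_from_extension` first and catching its `ValueError`, so the decision can be revisited without touching either detector.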

Maijin · 2026-06-25T08:22:00Z

Kudos for making this huge PR work 👍. I have two high-level comments:

* Since automated detection may struggle with heavily obfuscated scripts, it might be beneficial to allow users to force a specific language on the CLI. This would not only improve the user experience but also serve as a useful tool for regression testing. I suggest considering this for the next PR.
* Regarding the cs.json, py.json etc. signature files, what is the intended process for updating these as new malicious script patterns emerge? Since these are separate from capa-rules, we should document how rule authors can contribute new signatures. It may also be worth having an architectural discussion about the long-term management of these files. Tagging @mike-hunhoff for visibility on this.
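
As a starting point for the follow-up PR, a forced-language option could look something like the sketch below. The `--language` flag, its choices, and `resolve_language` are hypothetical; capa's CLI does not necessarily expose anything by these names.

```python
import argparse
from typing import Optional

# hypothetical language identifiers for illustration
SUPPORTED_LANGUAGES = ["auto", "cs", "py", "html", "aspx"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="script analysis (sketch)")
    parser.add_argument("sample", help="path to the script to analyze")
    parser.add_argument(
        "--language",
        choices=SUPPORTED_LANGUAGES,
        default="auto",
        help="force a script language instead of auto-detecting it",
    )
    return parser


def resolve_language(forced: str, detected: Optional[str]) -> str:
    # an explicit choice always wins over detection
    if forced != "auto":
        return forced
    if detected is None:
        raise ValueError("could not detect language; pass --language explicitly")
    return detected
```

For regression tests, forcing the language removes detection as a variable, so a failure points at the extractor rather than at the detector.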

Maijin · 2026-06-25T08:23:01Z

@@ -0,0 +1,95 @@
+{
+    "classes" : [
+        "System.Data.SqlClient.SqlCommand",


Would recommend sorting all of these signatures everywhere. This will ease contributions.

Maijin · 2026-06-25T08:23:13Z

@@ -0,0 +1,47 @@
+{
+    "classes": [
+        "socket.socket",


Same here - sort everywhere
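
A small helper along these lines could keep the signature files sorted going forward. It is a sketch only: the function names are made up, and dropping duplicate entries is an extra design choice that was not requested above.

```python
import json
from typing import Dict, List


def sort_signatures(signatures: Dict[str, List[str]]) -> Dict[str, List[str]]:
    # sort the category names and the entries within each category;
    # set() also drops exact duplicates (an assumption, not a stated requirement)
    return {category: sorted(set(names)) for category, names in sorted(signatures.items())}


def render_signatures(signatures: Dict[str, List[str]]) -> str:
    # stable formatting keeps future diffs small
    return json.dumps(sort_signatures(signatures), indent=4) + "\n"


example = {"classes": ["socket.socket", "os.popen", "os.popen"], "apis": ["eval"]}
print(render_signatures(example))
```

Note that `sorted` orders by code point, so uppercase names sort before lowercase ones; if a case-insensitive order is preferred, `key=str.lower` would be the change to make.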

saniyafatima07 · 2026-06-26T06:44:45Z

Kudos for making this huge PR work 👍. I have two high-level comments:

* Since automated detection may struggle with heavily obfuscated scripts, it might be beneficial to allow users to force a specific language on the CLI. This would not only improve the user experience but also serve as a useful tool for regression testing. I suggest considering this for the next PR.

* Regarding the `cs.json`, `py.json` etc. signature files, what is the intended process for updating these as new malicious script patterns emerge? Since these are separate from `capa-rules`, we should document how rule authors can contribute new signatures. It may also be worth having an architectural discussion about the long-term management of these files. Tagging @mike-hunhoff for visibility on this.

Sure, Maxime. We can discuss the architecture in the upcoming meetings.

mr-tz · 2026-06-27T10:23:12Z

+    taste = sample_path.open("rb").read(8)
+    return taste


Undo this change to use the default context manager, which closes the file.
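
For reference, the context-manager form would look roughly like this. The function name `get_file_taste` is assumed for the example, since the hunk above does not show the enclosing function.

```python
from pathlib import Path


def get_file_taste(sample_path: Path) -> bytes:
    # the with-block closes the file handle even if read() raises
    with sample_path.open("rb") as f:
        return f.read(8)
```

The one-liner `sample_path.open("rb").read(8)` leaves closing the handle to the garbage collector, which can surface as a `ResourceWarning` in tests.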

mr-tz · 2026-06-27T10:24:23Z


 def get_format_from_report(sample: Path) -> str:
-    if sample.name.endswith((".log", ".log.gz")):
+    if sample.name.endswith((".log", "log.gz")):


is this change needed/wanted?

mr-tz · 2026-06-27T10:24:55Z

                # assume VMRay zipfile at a minimum has these files
                return FORMAT_VMRAY
-    elif sample.name.endswith((".json", ".json_", ".json.gz")):
+    elif sample.name.endswith(("json", "json_", "json.gz")):


same here, wanted change?
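
To make the effect of dropping the leading dots concrete, here is a quick check of both suffix tuples against a few file names. The file names are hypothetical and chosen only to expose the difference.

```python
# suffix tuples before and after the change under review
OLD_LOG = (".log", ".log.gz")
NEW_LOG = (".log", "log.gz")
OLD_JSON = (".json", ".json_", ".json.gz")
NEW_JSON = ("json", "json_", "json.gz")

# hypothetical file names for illustration
for name in ("trace.log.gz", "catalog.gz", "report.json", "notjson"):
    print(
        name,
        "log old/new:", name.endswith(OLD_LOG), name.endswith(NEW_LOG),
        "json old/new:", name.endswith(OLD_JSON), name.endswith(NEW_JSON),
    )
```

Without the dot, the check matches any name that merely ends in those letters, such as `catalog.gz` or `notjson`, so the new tuples are strictly more permissive than the old ones.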

mr-tz · 2026-06-27T10:26:52Z

+      "feature": "string: /innerRename/",
+      "marks": [
+        {
+          "backend": "binexport",
+          "mark": "xfail",
+          "reason": "string extraction mismatch in BinExport fixture"
+        }
+      ]


where do diffs on BinExport tests come from?

mr-tz · 2026-06-27T10:27:55Z

      "file": "c91887",
      "location": "function=0x401A77",
-      "feature": "api: kernel32.CreatePipe",
+      "feature": "api: CreatePipe",


same here, how does this lead to test changes here?

does this PR need to be rebased?

mr-tz

nice progress overall, thanks for all the work on getting this feature finally integrated :)

gemini-code-assist Bot reviewed Jun 10, 2026

saniyafatima07 marked this pull request as ready for review June 10, 2026 18:58

saniyafatima07 marked this pull request as draft June 10, 2026 18:58

saniyafatima07 force-pushed the script-feature branch 2 times, most recently from e71ca37 to 5c19383 Compare June 17, 2026 16:41

mr-tz reviewed Jun 18, 2026

Comment thread capa/capabilities/common.py

saniyafatima07 force-pushed the script-feature branch 7 times, most recently from 2e75cd1 to 050ab84 Compare June 22, 2026 20:54

saniyafatima07 and others added 13 commits June 23, 2026 02:32

Fix issues

7bd2acd

fix: address PR review feedback (regex optimization, typos, duplicate…

f8d3848

… sigs)

Update CHANGELOG.md

bcfe22e

Fix outdated code

d86a44e

Fix outdated code

3ebf4bd

Modernize Tree-sitter: Replace local C-compilation with native PyPI w…

9b36aac

…heels

feat: revive script analysis ts feature

ddb701c

Revives and supersedes old PR mandiant#1080. Resolves merge conflicts and brings up to date with current master.

Fix: CI builds

772e55d

Fix import errors

fcca9ad

Fix CI: trial-1

88a98f4

Fix tree-sitter packaging

18342bb

Address valid gemini reviews

14610dc

Minor fixes

8057248

saniyafatima07 force-pushed the script-feature branch 4 times, most recently from 46a6446 to 19308f7 Compare June 22, 2026 22:31

Fix tests

291975a

saniyafatima07 force-pushed the script-feature branch from 19308f7 to 291975a Compare June 22, 2026 22:41

saniyafatima07 marked this pull request as ready for review June 23, 2026 14:15

saniyafatima07 requested review from Maijin, larchchen and mike-hunhoff June 23, 2026 14:16

mr-tz reviewed Jun 24, 2026

Address comments

652b6cf

Maijin reviewed Jun 25, 2026

Sort json files

af4fd99

mr-tz reviewed Jun 27, 2026
