Contributor

@SharafMohamed SharafMohamed commented Jan 26, 2026

Description

This PR adds documentation for the new tagged DFA (TDFA) used in CLP via Log Surgeon:

  • Summarizes the background on DFAs.
  • Explains the additions needed to go from a DFA to a TDFA.
  • Demonstrates how the TDFA is used in compression.
  • Demonstrates how the TDFA is used in search (dynamic programming algorithm).
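
As a rough illustration of the compression use case listed above, the sketch below splits a log line into a logtype and its captured variables. It uses Python's backtracking `re` module rather than Log Surgeon's TDFA, so it only mirrors the input/output shape. The `user_id` rule follows the schema snippet quoted in the review comments in this thread (Python spells named groups `(?P<name>...)`); the `session` rule and the `split_logtype` helper are assumptions made for this example.

```python
import re

RULES = [
    re.compile(r"user_id=(?P<user_id>\d+)"),
    # Hypothetical rule: the thread does not show the real session pattern.
    re.compile(r"session=(?P<session>[A-Za-z0-9]+)"),
]

def split_logtype(line):
    """Replace each named capture with <name>; return (logtype, variables)."""
    captures = []
    for rule in RULES:
        for match in rule.finditer(line):
            for name in match.groupdict():
                captures.append((match.start(name), match.end(name), name))
    captures.sort()  # process captures left to right (assumes no overlaps)

    pieces, variables, cursor = [], [], 0
    for start, end, name in captures:
        pieces.append(line[cursor:start])   # static text before the capture
        pieces.append(f"<{name}>")          # placeholder in the logtype
        variables.append((name, line[start:end]))
        cursor = end
    pieces.append(line[cursor:])            # trailing static text
    return "".join(pieces), variables

line = "My log has user_id=55 session=AB23 and that's it."
logtype, variables = split_logtype(line)
print(logtype)
print(variables)
```

The static prefix of each rule (`user_id=`, `session=`) stays in the logtype, and only the captured span is stored as a variable.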

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive guide on Tagged Deterministic Finite Automata (TDFA), covering schema definitions, regex rule handling, NFA→DFA construction and traversal, capture‑group/tag semantics, register behaviour, ambiguity resolution, and match semantics. Includes practical examples and end‑user workflows for compression and search (execution traces, data formats, normalization, interpretation, decompression, and grep‑style matching).

@SharafMohamed SharafMohamed requested a review from a team as a code owner January 26, 2026 15:29
Contributor

coderabbitai bot commented Jan 26, 2026

Walkthrough

Adds a new developer document describing Tagged Deterministic Finite Automata (TDFA): schema components, NFA→DFA construction and traversal, TDFA extensions for capture groups/tags/registers, tag semantics (positive/negative), ambiguity and match resolution, and end-to-end compression and search examples.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation** `docs/src/dev-docs/tagged-dfa.md` | New documentation introducing TDFA: schema (delimiters, regex rules), NFA construction, DFA traversal, TDFA extensions for capture groups/tags/registers, tag semantics (positive/negative), register management (final/intermediate), ambiguity and match semantics, plus practical compression/search examples and execution traces. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding documentation for the tagged DFA with coverage of internals and practical usage examples in compression and search. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@SharafMohamed SharafMohamed changed the title from "doc: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage." to "docs: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage." on Jan 26, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 42-47: The markdown file has multiple fenced code blocks missing
language identifiers (markdownlint MD040); update each triple-backtick block
(including the table block shown with the header "Variable Name | Regex Pattern
| Input | Match") to include an appropriate language tag (e.g., ```text for
plain tables, ```regex for regex examples, ```yaml where YAML appears, or
```none when no highlighting is desired) so all instances listed in the comment
are annotated consistently and lint errors are resolved.
- Around line 49-50: The documentation contains multiple typos and grammatical
errors that confuse readers; update the text in docs/src/dev-docs/tagged-dfa.md
by correcting misspellings and grammar (e.g., change “durign” to “during”,
“Literal character produce” to “Literal characters produce”, “reasons” to
“reason”, complete the
fragment “Above, we i” into a full sentence, “out-going” to “outgoing”, and
“thats” to “that’s”); apply these fixes consistently in the noted sections
(around the existing sentence that mentions NFA/DFA construction and the other
referenced blocks) and re-read nearby sentences for similar small errors to
ensure clarity and correct pluralization/possessives.
- Around line 151-185: The doc has incomplete sections: finish or remove the
"Capture Groups in Regex", "Tagged NFA", and "Ambiguity and Leftmost-Greedy
Resolution" placeholders—specifically, complete "Capture Groups in Regex" to
describe how regex capture groups map to start/end tags and how those tags are
recorded into registers (referencing final(tag) and intermediate(tag,i)); add a
short "Tagged NFA" subsection explaining how an NFA is augmented with tag
actions on transitions and how those are compiled into TDFA operations; and add
an "Ambiguity and Leftmost-Greedy Resolution" paragraph that defines the
leftmost-greedy tie-breaker and how TDFA resolves ambiguous matches. If you
prefer not to author full text, remove the empty headings so the document
contains only the finished "Tags and Registers in the DFA", "Tagged
Transitions", and "Match Semantics" sections.
- Around line 8-13: Update the table of contents anchor links so they match the
actual heading texts: replace the entries linking to `#5-compression` and
`#6-search` with anchors that correspond to the headings "Compression Example" and
"Search Example" (e.g. use `#compression-example` and `#search-example`), and make
the same fixes for the other occurrences that reference the Compression and
Search headings; ensure the TOC entries that reference "Compression" and
"Search" exactly match the generated anchors for the headings.

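The table-of-contents mismatch in the last item can be checked mechanically. The sketch below approximates GitHub-style heading anchors (lowercase, punctuation dropped, spaces turned into hyphens) and reports in-page links that match no heading. The slug rule is an approximation, not the exact algorithm used by GitHub or by the MyST/Sphinx build, and `heading_slug` and `broken_anchor_links` are names made up for this example.

```python
import re

def heading_slug(heading):
    """Approximate a GitHub-style anchor for a heading's text."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation such as "."
    return slug.replace(" ", "-")

def broken_anchor_links(markdown):
    """Return in-page link targets that match no heading slug."""
    headings = re.findall(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    valid = {heading_slug(h) for h in headings}
    links = re.findall(r"\]\(#([^)]+)\)", markdown)
    return [anchor for anchor in links if anchor not in valid]

sample = "\n".join([
    "- [Compression](#5-compression)",
    "- [Search](#search-example)",
    "",
    "## Compression Example",
    "",
    "## Search Example",
])
print(heading_slug("Compression Example"))
print(broken_anchor_links(sample))
```

Under this approximation, `#5-compression` would only resolve if the heading were written as "5. Compression". The sketch does not skip `#` lines inside fenced code blocks, so a real check would need to track fences as well.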
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 56-58: The Markdown heading "### Example Schema" and the adjacent
fenced code block (```regex ... ```) need blank lines before and after them to
satisfy markdownlint; add an empty line above the "### Example Schema" heading,
an empty line between the heading and the opening ```regex fence, and an empty
line after the closing ``` fence (and apply the same around the fenced block at
lines referenced 132-134) so headings and fenced code blocks are separated by
blank lines.
- Around line 20-21: Revise the two bullet definitions so grammar is clear and
parallel: change "- **Variables** are text in the log that contain information
pertinent to the user." to a tighter form like "Variables are text within a log
entry that convey user-relevant information." and change "- **Static-text** is
the remaining, non-variable, text in the log." to a parallel form like "Static
text is the remaining non-variable content of the log." Apply the same
grammatical tightening to the corresponding lines referenced (137-139) to ensure
consistency.
♻️ Duplicate comments (2)
docs/src/dev-docs/tagged-dfa.md (2)

49-50: Fix remaining typos/grammar for readability.
These lingering errors reduce clarity and were flagged earlier; please correct them consistently.

✏️ Proposed edits
-Each regex rule is used to construct an NFA, which is eventually considered durign DFA construction.
+Each regex rule is used to construct an NFA, which is eventually considered during DFA construction.

-- Literal character produce linear sequences of states.
-- Characters classes, quantifiers, and optional segments produce branches.
+- Literal characters produce linear sequences of states.
+- Character classes, quantifiers, and optional segments produce branches.

-At this stage, a single input can lead to multiple possible next states. This nondeterminism is the
-reasons NFAs are inefficient for traversal. For runtime performance, an NFA must be converted to
+At this stage, a single input can lead to multiple possible next states. This nondeterminism is the
+reason NFAs are inefficient for traversal. For runtime performance, an NFA must be converted to

-Above, we i
+Above, we introduce capture groups in regex patterns.

-At various DFA states register value are set or copied into other registers based on the register
-action corresponding to the out-going symbol.
+At various DFA states, register values are set or copied into other registers based on the register
+action corresponding to the outgoing symbol.

-My log has user_id=55 session=AB23 and thats it.
+My log has user_id=55 session=AB23 and that's it.

-My log has user_id=<user_id> session=<session> and thats it.
+My log has user_id=<user_id> session=<session> and that's it.

Also applies to: 76-79, 92-94, 153-154, 167-169, 275-276, 283-284


151-185: Complete or remove placeholder sections.
These empty subsections make the doc feel unfinished and leave key concepts undefined.
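
To make the register vocabulary in these comments concrete, here is a toy, hand-built automaton for the rule `user_id=(?<user_id>\d+)`. One transition carries a tag operation that records the current position in an intermediate register, and the accepting step copies it into the final result. This is only a sketch: the state numbering, the register name `intermediate_start`, and the `step`/`run` helpers are assumptions for illustration, and a real TDFA is derived from a tagged NFA rather than written by hand.

```python
PREFIX = "user_id="
DIGIT_STATE = len(PREFIX) + 1  # the only accepting state in this toy automaton

def step(state, ch):
    """Return (next_state, set_start_register), or None if no transition exists."""
    if state < len(PREFIX):
        # States 0..7 consume the literal prefix one character at a time.
        return (state + 1, False) if ch == PREFIX[state] else None
    if "0" <= ch <= "9":
        # Only the first digit transition carries the start-tag operation.
        return (DIGIT_STATE, state == len(PREFIX))
    return None

def run(text, start=0):
    """Walk the automaton from `start`; return capture offsets or None."""
    state, pos, registers = 0, start, {}
    while pos < len(text):
        nxt = step(state, text[pos])
        if nxt is None:
            break  # no valid transition: stop and check for acceptance
        state, set_start = nxt
        if set_start:
            registers["intermediate_start"] = pos  # tag op: record position
        pos += 1
    if state != DIGIT_STATE:
        return None
    # Accepting operations: copy the intermediate register, set the end offset.
    return {"start": registers["intermediate_start"], "end": pos}

line = "My log has user_id=55 session=AB23 and that's it."
result = run(line, start=line.index("user_id="))
print(result, line[result["start"]:result["end"]])
```

The walk is anchored at `start`, so the caller chooses where matching begins. A lexer would instead restart after each delimiter.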

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 414-420: Update the subsection heading "D.Grep" to include a space
after the period so it matches the other subsections' format; locate the heading
text "D.Grep against the original query:" and change it to "D. Grep against the
original query:" ensuring the spacing is consistent with subsections A, B, and
C.
- Around line 152-162: Reword two sentences for clarity and grammar: change the
awkward sentence that begins "During TDFA traversal," to a clearer phrasing like
"During TDFA traversal, each tag corresponds to one or more registers:" and fix
the subject-verb agreement in the sentence that begins "At various DFA states
register value are set..." to "At various DFA states, register values are set or
copied into other registers based on the register action corresponding to the
outgoing symbol." Ensure references to final(tag), intermediate(tag,i), TDFA
traversal, and registers remain unchanged.
- Around line 58-64: Replace the invalid Pygments lexer identifier "regex" in
the fenced code blocks with a supported lexer like "text" (or "none") to
eliminate build warnings; locate the triple-backtick blocks that start with
```regex containing the snippet lines "delimiters: \n\r\t", "int:-?\d+",
"float:-?\d+\.\d+", and "tagged_user_id:user_id=(?<user_id>\d+)" and change
their opening fence to ```text (apply the same replacement for the two other
occurrences that contain the same snippet).
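
The fence-language issues raised here and in the earlier MD040 comment could be audited with a short script like the one below. It reports opening fences that have no language or a language outside an allow-list. The allow-list is a hypothetical stand-in, since a real check would ask the docs toolchain which lexers it supports, and `audit_fences` is a name invented for this sketch.

```python
import re

FENCE = re.compile(r"^\s*`{3}(.*)$")         # a line that opens or closes a fence
ALLOWED = {"text", "none", "yaml", "diff"}   # hypothetical allow-list of lexers
TICKS = "`" * 3                              # avoids literal fences in the sample

def audit_fences(markdown):
    """Return (line_number, problem) for each questionable opening fence."""
    problems = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if match is None:
            continue
        if not in_fence:
            lang = match.group(1).strip()
            if not lang:
                problems.append((lineno, "missing language"))
            elif lang not in ALLOWED:
                problems.append((lineno, f"unrecognized language: {lang}"))
        in_fence = not in_fence  # every fence line toggles open/closed
    return problems

sample = "\n".join([
    TICKS + "regex", r"int:-?\d+", TICKS,
    "",
    TICKS, "S0 -> S1", TICKS,
])
print(audit_fences(sample))
```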

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Line 125: The sentence contains a subject-verb agreement error: the plural
noun phrase "no valid transitions" is paired with the singular verb "exists";
update the text fragment "no valid transitions exists" to use the plural verb
"exist" so it reads "no valid transitions exist".
- Around line 268-269: Wrap the character classes that are being parsed as
footnote refs in inline code formatting so MyST/Sphinx won't treat them as
footnotes: replace the plain occurrences of [^0-9] and [^A-Za-z] in the
tagged-dfa.md text with inline code spans (e.g., use backticks around the
expressions) wherever they appear in the sentence about the true end of the
capture so the docs build error is resolved.
- Around line 316-330: Change the sentence to use correct subject-verb agreement
("no valid transitions exist") and remove leading/trailing spaces inside inline
code spans (e.g., change `My ` to `My`, and ensure `R2`, `R1`, `R0`, `S0`, `S8`,
`S9`, `user_id=` are all formatted without extra spaces inside backticks); keep
the rest of the paragraph semantics the same.
- Around line 270-277: Fix inconsistent Markdown list indentation by reducing
the nested list indentation from 4 spaces to 2 spaces so it matches project
linting; update the block describing the final transition (lines mentioning S9,
set R6, and the accepting operations copying R6→R0, setting R1, and copying
negated R4/R5→R2/R3) to use 2-space indents for the nested bullets and their
sub-bullets to ensure consistent rendering and lint compliance.
- Around line 196-217: The fenced ASCII diagram blocks lack a language
identifier; update the opening fence for the diagram shown (the block starting
with ``` on the TDFA/Log Surgeon/CLP diagram) to ```text so Markdownlint passes,
and apply the same change to the other ASCII diagram block referenced (the
second fenced diagram at the later section). Ensure you only add the word text
immediately after the opening triple backticks (```text) for both blocks.
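
For the footnote-parsing item above, a small check can list negated character classes such as `[^0-9]` that appear outside inline code spans. This is a sketch under simplifying assumptions: it works line by line, ignores fenced blocks, and the helper name `bare_negated_classes` is made up here.

```python
import re

INLINE_CODE = re.compile(r"`[^`]*`")        # a single-backtick code span
NEGATED_CLASS = re.compile(r"\[\^[^\]]+\]")  # text shaped like [^...]

def bare_negated_classes(line):
    """Return [^...] fragments on this line that are not inside inline code."""
    without_code = INLINE_CODE.sub("", line)
    return NEGATED_CLASS.findall(without_code)

print(bare_negated_classes("The capture ends at [^0-9] or `[^A-Za-z]`."))
print(bare_negated_classes("The capture ends at `[^0-9]` or `[^A-Za-z]`."))
```

Any fragment the function returns is a candidate for wrapping in backticks.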

@junhaoliao junhaoliao added this to the February 2026 milestone Jan 27, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 366-368: The markdown lacks a blank line before the fenced code
block that begins with ```text after the sentence "After TDFA execution Log
Surgeon produces:", so add a single blank line between that sentence and the
opening ```text fence to satisfy markdown formatting; locate the snippet
containing "After TDFA execution Log Surgeon produces:" and the following
"```text" (and the inner "LogType:") and insert one empty line immediately
before the fenced code block.
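
The blank-line rule in this comment and in the earlier one about headings and fences can be sketched as a check on the lines adjacent to each fence. The function name `fence_spacing_problems` is an assumption for this example, and the script only approximates what markdownlint enforces.

```python
TICKS = "`" * 3  # avoids literal fences in the samples

def fence_spacing_problems(markdown):
    """Report fences that lack a blank line before opening or after closing."""
    lines = markdown.splitlines()
    problems = []
    in_fence = False
    for i, line in enumerate(lines):
        if not line.lstrip().startswith(TICKS):
            continue
        if not in_fence and i > 0 and lines[i - 1].strip():
            problems.append((i + 1, "no blank line before opening fence"))
        if in_fence and i + 1 < len(lines) and lines[i + 1].strip():
            problems.append((i + 1, "no blank line after closing fence"))
        in_fence = not in_fence  # every fence line toggles open/closed
    return problems

before = "\n".join([
    "After TDFA execution Log Surgeon produces:",
    TICKS + "text", "LogType:", TICKS,
])
after = "\n".join([
    "After TDFA execution Log Surgeon produces:",
    "",
    TICKS + "text", "LogType:", TICKS,
])
print(fence_spacing_problems(before))
print(fence_spacing_problems(after))
```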
