Commit 1f575fd

feat(ie-html): implement core HTML tokenizer state machine #54
WHATWG HTML tokenizer with ~40 core states:
- Token types: Doctype (with public/system IDs, force_quirks), StartTag (with attributes, self_closing), EndTag, Character, Comment, Eof
- States: Data, TagOpen, EndTagOpen, TagName, all attribute states (before/name/after/value with double/single/unquoted quoting), SelfClosingStartTag, BogusComment, MarkupDeclarationOpen, all comment states, all doctype states, CDataSection
- Iterator-based: impl Iterator<Item = Token>
- Tag names lowercased per spec
- Pending token queue for multi-token emissions
- set_state() for tree builder feedback
- Parse errors logged via tracing, never abort
- Placeholder stubs for script/raw text/entity states (Step 1b)
- 17 unit tests
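
The tokenizer source itself is not shown on this page, so the following is only a minimal sketch of how the design points in the commit message (iterator-based output, a pending token queue, lowercased tag names, set_state()) might fit together. `MiniToken`, `MiniTokenizer`, `State`, `step` and `begin_tag` are hypothetical names invented for illustration. The sketch covers four states only, and it ignores attributes, comments, doctypes and parse-error logging.

```rust
use std::collections::VecDeque;

// Hypothetical, cut-down token type; the commit's real `Token` has more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum MiniToken {
    StartTag(String),
    EndTag(String),
    Character(char),
    Eof,
}

// Four of the ~40 states named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Data,
    TagOpen,
    EndTagOpen,
    TagName,
}

pub struct MiniTokenizer {
    input: Vec<char>,
    pos: usize,
    state: State,
    pending: VecDeque<MiniToken>, // holds tokens when one step emits several
    name: String,
    is_end: bool,
    done: bool,
}

impl MiniTokenizer {
    pub fn new(input: &str) -> Self {
        Self {
            input: input.chars().collect(),
            pos: 0,
            state: State::Data,
            pending: VecDeque::new(),
            name: String::new(),
            is_end: false,
            done: false,
        }
    }

    // Lets a caller (for example a tree builder) force a state switch.
    pub fn set_state(&mut self, state: State) {
        self.state = state;
    }

    // Start a new tag and reconsume the current character in TagName.
    fn begin_tag(&mut self, is_end: bool) {
        self.is_end = is_end;
        self.name.clear();
        self.pos -= 1;
        self.state = State::TagName;
    }

    // Consume one character (or EOF) and update state / pending queue.
    fn step(&mut self) {
        let c = self.input.get(self.pos).copied();
        if c.is_some() {
            self.pos += 1;
        }
        match (self.state, c) {
            (State::Data, Some('<')) => self.state = State::TagOpen,
            (State::Data, Some(ch)) => self.pending.push_back(MiniToken::Character(ch)),
            (State::TagOpen, Some('/')) => self.state = State::EndTagOpen,
            (State::TagOpen, Some(ch)) if ch.is_ascii_alphabetic() => self.begin_tag(false),
            (State::TagOpen, Some(_)) => {
                // Not a tag after all: emit '<' and reconsume in Data.
                self.pending.push_back(MiniToken::Character('<'));
                self.pos -= 1;
                self.state = State::Data;
            }
            (State::EndTagOpen, Some(ch)) if ch.is_ascii_alphabetic() => self.begin_tag(true),
            // Simplification: the spec enters the bogus comment state for most inputs here.
            (State::EndTagOpen, Some(_)) => self.state = State::Data,
            (State::TagName, Some('>')) => {
                let name = std::mem::take(&mut self.name);
                let tok = if self.is_end { MiniToken::EndTag(name) } else { MiniToken::StartTag(name) };
                self.pending.push_back(tok);
                self.state = State::Data;
            }
            (State::TagName, Some(ch)) => self.name.push(ch.to_ascii_lowercase()),
            (state, None) => {
                // EOF can require several tokens at once, hence the queue.
                match state {
                    State::TagOpen => self.pending.push_back(MiniToken::Character('<')),
                    State::EndTagOpen => {
                        self.pending.push_back(MiniToken::Character('<'));
                        self.pending.push_back(MiniToken::Character('/'));
                    }
                    _ => {}
                }
                self.pending.push_back(MiniToken::Eof);
                self.done = true;
            }
        }
    }
}

impl Iterator for MiniTokenizer {
    type Item = MiniToken;

    fn next(&mut self) -> Option<MiniToken> {
        loop {
            if let Some(tok) = self.pending.pop_front() {
                return Some(tok);
            }
            if self.done {
                return None;
            }
            self.step();
        }
    }
}
```

Draining the queue before advancing the state machine means `next()` hands out one token per call even when a single input event, such as EOF right after `<`, produces more than one token.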
1 parent f303e7b commit 1f575fd

3 files changed

Lines changed: 1262 additions & 14 deletions

crates/ie-html/src/lib.rs

Lines changed: 3 additions & 0 deletions
@@ -3,7 +3,10 @@
 //! WHATWG HTML Living Standard parser.
 //! Targets latest spec only — no quirks mode, no legacy element support.
 
+pub mod token;
 pub mod tokenizer;
 pub mod tree_builder;
 
+pub use token::Token;
+pub use tokenizer::Tokenizer;
 pub use tree_builder::parse;

crates/ie-html/src/token.rs

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#[derive(Debug, Clone, PartialEq)]
+pub enum Token {
+    Doctype {
+        name: Option<String>,
+        public_id: Option<String>,
+        system_id: Option<String>,
+        force_quirks: bool,
+    },
+    StartTag {
+        name: String,
+        attributes: Vec<Attribute>,
+        self_closing: bool,
+    },
+    EndTag {
+        name: String,
+    },
+    Character(char),
+    Comment(String),
+    Eof,
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct Attribute {
+    pub name: String,
+    pub value: String,
+}
+
+impl Token {
+    pub fn is_start_tag(&self, name: &str) -> bool {
+        matches!(self, Token::StartTag { name: n, .. } if n == name)
+    }
+
+    pub fn is_end_tag(&self, name: &str) -> bool {
+        matches!(self, Token::EndTag { name: n } if n == name)
+    }
+}
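
To illustrate how the `is_start_tag` and `is_end_tag` helpers from token.rs might be used by a consumer, here is a small sketch. The `Token` and `Attribute` definitions are copied from the diff above so the block compiles on its own; `text_inside` is a hypothetical helper that does not appear in the commit.

```rust
// Definitions copied from the token.rs diff so the example is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Doctype {
        name: Option<String>,
        public_id: Option<String>,
        system_id: Option<String>,
        force_quirks: bool,
    },
    StartTag {
        name: String,
        attributes: Vec<Attribute>,
        self_closing: bool,
    },
    EndTag {
        name: String,
    },
    Character(char),
    Comment(String),
    Eof,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

impl Token {
    pub fn is_start_tag(&self, name: &str) -> bool {
        matches!(self, Token::StartTag { name: n, .. } if n == name)
    }

    pub fn is_end_tag(&self, name: &str) -> bool {
        matches!(self, Token::EndTag { name: n } if n == name)
    }
}

// Hypothetical helper (not in the commit): collect the Character tokens
// that appear between matching start and end tags named `tag`.
pub fn text_inside(tokens: &[Token], tag: &str) -> String {
    let mut depth = 0usize;
    let mut out = String::new();
    for t in tokens {
        if t.is_start_tag(tag) {
            depth += 1;
        } else if t.is_end_tag(tag) {
            depth = depth.saturating_sub(1);
        } else if let Token::Character(c) = t {
            if depth > 0 {
                out.push(*c);
            }
        }
    }
    out
}
```

Because the tokenizer lowercases tag names, callers of these helpers would be expected to pass lowercase names such as `"title"`.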
