diff --git a/lib/Regexp/Parser.pm b/lib/Regexp/Parser.pm
index 9412df5..d29fd5a 100644
--- a/lib/Regexp/Parser.pm
+++ b/lib/Regexp/Parser.pm
@@ -474,6 +474,17 @@ value:
 
   my $capture_2 = $parser->captures(2);
 
+=head2 Getting Named Captures
+
+You can access the named capture groups with the named_captures() method:
+
+  my $all_named = $parser->named_captures();
+
+This returns a hash reference mapping capture names to their node objects.
+To look up a specific named capture:
+
+  my $node = $parser->named_captures('year');
+
 =head2 Walking the Tree
 
 To walk over the created tree, create an iterator with walker()Z<>:
@@ -708,229 +719,87 @@ Invalid [] range "%s-%s"
 
 =back
 
-=head1 EXTENSIONS
-
-Here are some ideas for extensions (sub-classes) for this module. Some
-of them may be absorbed into the core functionality of F
-in the future. Module names are merely the author's suggestions.
-
-=over 4
-
-=item Regexp::WordBounds
-
-Adds handlers for C<< < >> and C<< > >> anchors, which match at the
-beginning and end of a "word", respectively. C<< /> is equivalent to
-C, and C<< />/ >> is equivalent to C. (So
-that's the object's qr() method for you right there!)
-
-=item Regexp::MinLength
-
-Implements a min_length() method for all objects that determines the
-minimum length of a string that would be matched by the regex; provides
-a front-end method for the parser.
-
-=item Regexp::QuantAttr
-
-Removes quantifiers as objects, and makes 'min' and 'max' attributes of
-other objects themselves.
-
-=item Regexp::Explain (pending, Jeff Pinyan)
-
-Produces a human-readable explanation of the execution of a regex. Will
-be able to produce HTML output that color-codes the elements of the regex
-according to a style-sheet (syntax highlighting).
-
-=item Regexp::Reverse (difficulty rating: ****)
-
-Reverses a regex so it matches backwards. Ex.: C becomes
-C, which perhaps gets optimized to C.
The difficulty
-rating is so high because of cases like C which, when
-reversed, I match differently.
-
-  "100years" =~ /(\d+)(\w+)/; # $1 = 100, $2 = years
-  "sraey001" =~ /(\w+)(\d+)/; # $1 = sraey00, $2 = 1
-
-This means character classes should store a hash of what characters
-they represent, as well as the macros C<\w>, C<\d>, etc. Then this
-example would be reversed into something like C.
-The other difficulty is complex regexes with if-then assertions. I
-don't want to think about that. This module is more of a theoretical
-exercise, a jump-start to built-in reversing capability in Perl.
-
-=item Regexp::CharClassOps
-
-Implements character class operations like union, intersection, and
-subtraction.
-
-=item Regexp::Optimize
-
-Eliminates redundancy from a regex. It should have various options,
-such as whether to do optimize...
-
-  # strings
-  /foo|father|fort/ => /f(?:o(?:o|rt)|ather)/
-
-  # char classes
-  /[\w\d][a-zaeiou]/ => /[\w][a-z]/
-
-  # redundancy
-  /^\n?\s+/ => /^\s+/
-  /[\w]/ => /\w/
-
-There are other possibilities as well.
-
-=back
-
-=head1 HISTORY
-
-=head2 0.022b -- July 6, 2004
-
-=over 4
-
-=item Hierarchy Changes
-
-There are now abstract classes I and I. You can't call
-their new() method directly, you can only call it through an object that
-inherits from that class.
-
-There are no longer I, I, and I classes; they have been
-combined into one class, I. You pass it the min and max,
-and the object's C is determined dynamically.
-
-=item Character Class Hashes
-
-Character classes (I objects) now have another attribute, C,
-which is a hash reference holding character values (eg. 65 for 'A') and
-the number of times that character appeared in the character class. The
-character class C<[A-CB-E]> would have a character map of C<< { 65 => 1, 66
-=> 2, 67 => 2, 68 => 1, 69 => 1} >>. This will reflect ranges and embedded
-classes (such as C<[:cntrl:]> or C<\p{Print}>.
-
-=item Character Class Rendering
-
-The visual() method of I objects will quell the repetition of any
-character in the class I of embedded classes, so the class
-C<[\w\d:4-65:]> will render as C<[\w\d:4-6]>. If you want to prevent
-characters and ranges from being display if they are included in an embedded
-class, set the I object's C attribute to 1; the character
-class would render as C<[\w\d:]>. If you want to go even further and remove
-any embedded class that is I redundant (that is, I
-character in that embedded class is already found in the class), set the
-C attribute to 2; the class above would render as C<[\w:]>.
-
-=back
+=head1 SUPPORTED CONSTRUCTS
 
-=head2 0.021 -- July 3, 2004
+This module supports parsing the following Perl regex constructs:
 
 =over 4
 
-=item I Changed
-
-If an I element is a Unicode property or a Perl class (like
-C<\w> or C<\S>), the object's C field points to the underlying
-object type (I, I, etc.). If the element is a POSIX class,
-the C field is the string "POSIX". POSIX classes don't exist in a
-regex outside of a character class, so I'm a little wary of making them
-objects in their own right, even if it would create a better sense of
-uniformity.
-
-=item Documentation
-
-Fixed some poor wording, and documented the problem with using F
-inside F.
-
-=item Bug Fixes
-
-Character classes weren't closing properly in the tree. Fixed.
-
-Standard escapes (C<\a>, C<\e>, etc.) were being returned as I
-nodes instead of I nodes when inside character classes. Fixed.
-(Mike Lambert)
+=item Grouping
 
-Non-grouping parentheses weren't being parsed properly. Fixed. (Mike
-Lambert)
+C<(...)>, C<(?:...)>, C<< (?<name>...) >>,
+C<(?|...)> (branch reset), C<< (?>...) >> (atomic).
+Also supports Python-compatible C<(?P=name)> and C<< (?P>name) >> syntax.
 
-Flags weren't being turned off. Fixed.
+=item Quantifiers
 
-=back
+C<*>, C<+>, C<?>, C<{n}>, C<{n,}>, C<{n,m}> -- with greedy (default),
+lazy (C<?>), and possessive (C<+>) variants
 
-=head2 0.02 -- July 1, 2004
+=item Assertions
 
-=over 4
+C<^>, C<$>, C<\b>, C<\B>, C<\A>, C<\Z>, C<\z>, C<\G>,
+C<\K> (keep), C<\b{type}> (extended boundaries)
 
-=item Better Abstracting
+=item Lookaround
 
-The object() method calls force_object(). force_object() creates an
-object no matter what pass the parser is making; object() will return
-immediately if it's just the first pass. This means that force_object()
-should be used to create stand-alone objects.
+C<(?=...)>, C<(?!...)>, C<(?<=...)>, C<(?<!...)>,
+and alphabetic forms C<(*pla:...)>, C<(*nla:...)>, C<(*plb:...)>,
+C<(*nlb:...)>
 
-Each object now has an insert() method that defines how it gets placed
-into the regex tree. Most objects inherit theirs from the base object
-class.
+=item Character classes
 
-The walker() method is also now abstracted -- each node it comes across
-will have its walk() method called. And the ending node for stack-type
-nodes has been abstracted to the ender() method of the node.
+C<[...]>, C<[^...]>, POSIX classes C<[[:alpha:]]>,
+C<\d>, C<\D>, C<\w>, C<\W>, C<\s>, C<\S>, C<\h>, C<\H>, C<\v>, C<\V>,
+C<\R> (linebreak), C<\N> (non-newline), C<\X> (extended grapheme cluster),
+C<.> (any)
 
-The init() method has been moved to another file to help keep I
-file as abstract as possible. F installs its handlers
-in F. That file might end up being where
-documentation on writing handlers goes.
+=item Unicode properties
 
-The documentation on sub-classing includes an ordered list of what
-packages a method is looked up in for a given object of type 'OBJ':
-F, F, F,
-F.
+C<\p{Name}>, C<\P{Name}>, C<\p{Script=Latin}>, etc.
 
-=item Cleaner Grammar Flow
+=item Escape sequences
 
-Now the only places 'atom' gets pushed to the queue are after an opening
-parenthesis or after 'atom' matches. This makes things flow more
-cleanly.
+C<\a>, C<\e>, C<\f>, C<\n>, C<\r>, C<\t>,
+C<\xHH>, C<\x{HHHH}>, C<\NNN> (octal), C<\o{NNN}>,
+C<\cX> (control), C<\N{NAME}>, C<\N{U+HHHH}>
 
-=item Flag Handlers
+=item Backreferences
 
-Flag handlers now receive an additional argument that says whether
-they're being turned on or off. Also, if the flag handler returns 0,
-that flag is removed from the resulting object's visual flag set. That
-means C<(?gi-o)> becomes C<(?i)>.
+C<\1>..C<\9>, C<\g{N}>, C<\g{-N}>, C<\g{+N}>,
+C<< \k<name> >>, C<\k'name'>, C<\k{name}>,
+C<(?P=name)>
 
-=item Diagnostics and Bug Fixes
+=item Flags
 
-More tests added (specifically, making sure C<(?(N)T|F)> works right).
-In doing so, found that the "too many branches" error wasn't being raised
-until the second pass. Figured out how to improve the grammar to get
-it to work properly. Also added tests for the new captures() method.
+C<(?imsx...)>, C<(?-imsx...)>, C<(?^...)> (caret reset),
+C<(?a)>, C<(?aa)>, C<(?d)>, C<(?l)>, C<(?u)>, C<(?n)>, C
 
-I changed the field 'class' to 'family' in objects. I was getting
-confused by it, so I figured it was a sign that I'd chosen an awful name
-for the field. There will still be a class() method in F<__object__>,
-but it will throw a "use of class() is deprecated" warning.
+=item Conditionals
 
-Quantifiers of the form C<{n}> were being misrepresented as C<{n,}>.
-It's been corrected. (Mike Lambert)
+C<(?(N)...|...)>, C<(?(DEFINE)...)>,
+C<< (?(<name>)...|...) >>, C<(?('name')...|...)>
 
-C<\b> was being turned into "b" inside a character class, instead of
-a backspace. (Mike Lambert)
+=item Backtracking control
 
-Fixed errant "Quantifier unexpected" warning raised by a zero-width
-assertion followed by C, which doesn't warrant the warning.
+C<(*ACCEPT)>, C<(*FAIL)>, C<(*F)>, C<(*MARK:name)>,
+C<(*PRUNE)>, C<(*SKIP)>, C<(*THEN)>, C<(*COMMIT)>
 
-Added "Unrecognized escape" warnings to I escape sequence handlers.
+=item Recursive patterns
 
-The 'g', 'c', and 'o' flags now evoke "Useless ..."
warnings when used
-in flag and non-capturing group constructs.
+C<(?R)>, C<(?N)>, C<(?&name)>, C<< (?P>name) >>
 
-=back
+=item Script runs
 
-=head2 0.01 -- June 29, 2004
+C<(*script_run:...)>, C<(*sr:...)>,
+C<(*atomic_script_run:...)>, C<(*asr:...)>
 
-=over 4
+=item Special
 
-=item First Release
-
-Documentation not complete, etc.
+C<(?{code})>, C<(??{code})> (opaque -- code is stored as string),
+C<(?[...])> (extended character class, opaque),
+C<(?#comment)>,
+C<\Q...\E> (quotemeta)
 
 =back
 
@@ -938,21 +807,27 @@ Documentation not complete, etc.
 
 =over 4
 
-=item * Bugs...?
+=item Two-pass parsing
 
-I'd like to say this module doesn't have bugs. I don't know of any in
-this current version, because I've tried to fix those I've already
-found. Those who find bugs should email me. Messages should include the
-code you ran that contains the bug, and your opinion on what's wrong
-with it.
+The parser uses a two-pass model: the first pass (via C) checks
+structural validity; the second pass (triggered by C, C,
+or C) builds the object tree and checks semantics. Some errors
+(such as invalid backreferences) are only detected on the second pass.
 
-=item * Variable interpolation
+=item Variable interpolation
 
 This module parses I, not Perl. If you send a single-quoted
-string as a regex with a variable in it, that '$' will be interpreted as
-an anchor. If you want to include variables, use C, or mix single-
+string as a regex with a variable in it, that C<$> will be interpreted as
+an anchor. If you want to include variables, use C, or mix single-
 and double-quoted strings in building your regex.
 
+=item Opaque constructs
+
+Code blocks C<(?{...})> and C<(??{...})> store their content as opaque
+strings. Extended character classes C<(?[...])> are also stored as
+opaque strings -- their internal set operations are not decomposed into
+structured nodes.
+
 =back
 
 =head1 AUTHOR