# Abstraction
+---------------------+
|                     |
|                     |
|   State register    |
|                     |
|                     |
+---------------------+

+---------------------------------+
|     State transition table      |
+---------------------------------+

+---------------------------------+
|    Fallback transition table    |
+---------------------------------+

---

State transition table look up
  + success --> continue
  + fail --> look up fallback table
    * success --> continue
    * fail --> return

EOS ? --> look up fallback table
  + success --> is 0 width?
    * success --> continue
    * fail --> return
  + fail --> return

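The look up order above can be sketched in C roughly as follows. This is only an illustration: the struct layouts and the names `delta_t`, `offshoot_t`, `machine_t`, `step()` and `run()` are assumptions, not the engine's actual definitions, and the driver is a plain loop without backtracking.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table rows; the real layouts are not shown in this document. */
typedef struct { int in; char input; int to; } delta_t;    /* state transition table */
typedef struct { int in; int width; int to; } offshoot_t;  /* fallback table         */

typedef struct {
    const delta_t    *delta;    size_t delta_n;
    const offshoot_t *offshoot; size_t offshoot_n;
    int accepting_state;
} machine_t;

static const delta_t *find_delta(const machine_t *m, int state, char c) {
    for (size_t i = 0; i < m->delta_n; i++)
        if (m->delta[i].in == state && m->delta[i].input == c)
            return &m->delta[i];
    return NULL;
}

static const offshoot_t *find_fallback(const machine_t *m, int state) {
    for (size_t i = 0; i < m->offshoot_n; i++)
        if (m->offshoot[i].in == state)
            return &m->offshoot[i];
    return NULL;
}

/* One step of the machine: true means "continue", false means "return". */
static bool step(const machine_t *m, int *state, const char **s) {
    if (**s != '\0') {
        /* State transition table look up. */
        const delta_t *d = find_delta(m, *state, **s);
        if (d) { *state = d->to; *s += 1; return true; }
        /* Fail: look up the fallback table instead. */
        const offshoot_t *o = find_fallback(m, *state);
        if (!o) return false;
        *state = o->to; *s += o->width;
        return true;
    }
    /* EOS: only a 0 width fallback may still be taken. */
    const offshoot_t *o = find_fallback(m, *state);
    if (!o || o->width != 0) return false;
    *state = o->to;
    return true;
}

/* Simplified driver: step until we accept or get stuck. */
static bool run(const machine_t *m, int state, const char *s) {
    while (state != m->accepting_state && step(m, &state, &s)) { }
    return state == m->accepting_state;
}
```

With one transition row `{0, 'a', 1}` and one fallback row `{1, 1, 2}` (accepting state 2), `run()` accepts "ab", rejects "a" because at EOS the only fallback is not 0 width, and rejects "b" because neither table has an entry for it.
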
---
# Legend
|      | Start | End |
| :--: | :---: | :-: |
| Line |  SOS  | EOS |
| Word |  SOW  | EOW |

##### HALT\_AND\_CATCH\_FIRE
H&C is a special state signalling that we have hit a dead end.
The reason we need it, and cannot just quit instantly, is backtracking:
a dead end only rules out the current branch, while another branch may still match.

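To make the backtracking argument concrete, here is a minimal sketch of a recursive matcher. The encoding of H&C as `-1`, the `delta_t` layout and the function `assert_from()` are all assumptions made for this example; it does not reproduce the real `regex_assert()`.

```c
#include <stdbool.h>
#include <stddef.h>

enum { HALT_AND_CATCH_FIRE = -1 };  /* hypothetical encoding of the dead end state */

typedef struct { int in; char input; int to; } delta_t;  /* hypothetical table row */

/* Tries every row that fits the current state and char.
 * Reaching H&C fails only the branch that led there; the loop in the
 * caller's frame then moves on to the next candidate row.
 * Accepts as soon as the accepting state is reached (prefix match). */
static bool assert_from(const delta_t *delta, size_t n, int accepting,
                        int state, const char *s) {
    if (state == HALT_AND_CATCH_FIRE) return false;  /* abort this branch */
    if (state == accepting) return true;
    if (*s == '\0') return false;
    for (size_t i = 0; i < n; i++) {
        if (delta[i].in == state && delta[i].input == *s
            && assert_from(delta, n, accepting, delta[i].to, s + 1))
            return true;
    }
    return false;
}
```

Given the rows `{0, 'a', HALT_AND_CATCH_FIRE}`, `{0, 'a', 1}` and `{1, 'b', 2}` with accepting state 2, the input "ab" first runs into H&C, backtracks, and then succeeds through the second row. Had the machine quit at the first dead end, the match would have been lost.
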
---
##### [^example]
This is a negative range.
```
let myNegativeRange = {'e', 'x', 'a', 'm', 'p', 'l'}
```
None of the characters in `$myNegativeRange` may be accepted.
The way this is compiled is that we first hook all chars in `$myNegativeRange` to H&C,
then define an OFFSHOOT of width 1.
Put differently:
if we read something illegal we abort this branch;
if what we read was not illegal, we deduce that it must have been legal and we continue.

Handling "negatives" this way allows us to be "alphabet agnostic" in a sense.
Many implementations presume ASCII, with its fixed 7/8 bit width,
and create look up tables.
That is fast and cute, but the strategy becomes a giant memory hog
if we ever wanted to use it on, say, UTF-8, where a char can take up to 4 bytes
(from 256 te/c (table entries per char) to as many as 4'294'967'296 te/c when indexing by the raw 4 byte value).

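A rough sketch of that compilation step is given below. The row layouts, the `-1` encoding of H&C and the helpers `compile_negative_range()` and `next_state()` are assumptions for illustration only. The point is that the number of rows depends on the size of the range, not on the size of the alphabet.

```c
#include <stddef.h>

enum { HALT_AND_CATCH_FIRE = -1 };  /* hypothetical encoding of the dead end state */

typedef struct { int in; char input; int to; } delta_t;    /* state transition row */
typedef struct { int in; int width; int to; } offshoot_t;  /* fallback row         */

/* Emits one transition row per forbidden char, all hooked to H&C,
 * plus a single OFFSHOOT of width 1 leading to the next state.
 * Returns the number of transition rows written to delta_out. */
static size_t compile_negative_range(const char *set, int from, int to,
                                     delta_t *delta_out, offshoot_t *offshoot_out) {
    size_t n = 0;
    for (; set[n] != '\0'; n++)
        delta_out[n] = (delta_t){ from, set[n], HALT_AND_CATCH_FIRE };
    *offshoot_out = (offshoot_t){ from, 1, to };
    return n;
}

/* Resolves one char: an explicit row wins, otherwise the fallback applies. */
static int next_state(const delta_t *delta, size_t n, const offshoot_t *off,
                      int state, char c) {
    for (size_t i = 0; i < n; i++)
        if (delta[i].in == state && delta[i].input == c)
            return delta[i].to;
    return off->in == state ? off->to : HALT_AND_CATCH_FIRE;
}
```

Compiling "exampl" between states 2 and 3 yields 6 transition rows and 1 fallback row. Reading 'x' in state 2 then leads to H&C, while any other char, whatever the alphabet, falls through to state 3.
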
#### .
This is the dot operator.
It matches any 1 char.

Similar to how negative ranges are implemented,
it takes advantage of the fallback table.
It simply ignores the state transition table and instead unconditionally hooks itself to the next state.

#### ^
This is the caret operator.
It matches the SOS.

For the purposes of this explanation, multiline matching (matching '\n') is irrelevant:
that behaves just like a literal.

What is more interesting is how SOS is recognized.
Since `regex_assert()` is recursive, the current state is continuously passed along;
however, at our first frame it is not simply always 0.
`regex_match()` decides depending on the current position in the string.
Basically, the first 2 states (0, 1) are reserved and always missing from the state transition table.
+ 0 - SOS
+ 1 - !SOS

Normally both are _hooked_ to state 2,
and we pretend nothing has ever happened.
But when the caret operator is compiled, it sets a special compiler flag, FORCE\_START\_OF\_STRING,
which forbids the hooking of state 1 to 2.
Therefore, when `regex_match()` calls from, say, position 2,
it passes in 1 as the starting state,
no state transition table entry is found since that is forbidden to begin with,
no jumps are found(!),
the machine checks whether the current state (1) is the accepting state (>=2)
and finally returns failure.

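The starting state logic described above might look something like this in C. Everything here is a sketch under assumptions: `starting_state()`, `is_hooked()` and `can_enter()` are made-up helpers, and the flag is modelled as a plain `bool` rather than the real compiler flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* The two reserved states from the text, followed by the first real state. */
enum { STATE_SOS = 0, STATE_NOT_SOS = 1, STATE_FIRST = 2 };

/* What the caller would pass in as the first frame's state. */
static int starting_state(size_t position) {
    return position == 0 ? STATE_SOS : STATE_NOT_SOS;
}

/* State 0 is always hooked to state 2; state 1 only when the
 * pattern does not force the start of the string. */
static bool is_hooked(int reserved_state, bool force_start_of_string) {
    if (reserved_state == STATE_SOS) return true;
    return !force_start_of_string;
}

/* Can a match attempt beginning at `position` reach state 2 at all? */
static bool can_enter(size_t position, bool force_start_of_string) {
    return is_hooked(starting_state(position), force_start_of_string);
}
```

For a pattern with a caret, `can_enter(0, true)` holds while `can_enter(2, true)` does not, so an attempt from position 2 gets stuck in state 1 and fails. Without the caret, both positions proceed to state 2.
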
#### \<
This is the SOW operator.
SOW must match:
```
^myword
[^\h]myword
```
Not only that, this combination is key:
either it has to be the SOS
or there has to be at least one preceding char which is not a symbol char.
Without the last condition "eexample" would match "\\\<example\\\>"
as the iteration of `regex_match()` reaches "example".

From a more practical perspective:
``` C
\<myword\>
// Must match
"myword"
" myword"
```
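
The SOW condition itself can be sketched as a small predicate. This only covers the `\<` half, and it assumes that a "symbol char" means a letter or an underscore (the way Vim defines `\h`); the helper names are hypothetical and the search is a naive scan, not the engine's state machine.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: symbol chars are [A-Za-z_]. */
static bool is_symbol_char(char c) {
    return isalpha((unsigned char)c) || c == '_';
}

/* SOW holds at the SOS, or right after a char that is not a symbol char. */
static bool is_start_of_word(const char *s, size_t pos) {
    return pos == 0 || !is_symbol_char(s[pos - 1]);
}

/* Returns the offset of the first occurrence of `word` that begins
 * at a SOW, or -1 if there is none. */
static long find_at_word_start(const char *s, const char *word) {
    size_t n = strlen(word);
    for (size_t i = 0; s[i] != '\0'; i++)
        if (is_start_of_word(s, i) && strncmp(s + i, word, n) == 0)
            return (long)i;
    return -1;
}
```

Here "myword" is found at offset 0 in "myword" and at offset 1 in " myword", while "example" is not found in "eexample", because the only occurrence is preceded by a symbol char.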