documentation of some value

master
anon 8 months ago
parent
commit
900d7ecf7e
1 changed file with 91 additions and 7 deletions
documentation/README.md

@@ -1,11 +1,11 @@
# Abstraction
+---------------------+
| |
| |
| State register |
| |
| |
+---------------------+
+---------------------+
| |
| |
| State register |
| |
| |
+---------------------+


+---------------------------------+
@@ -16,3 +16,87 @@
+---------------------------------+
| Fallback transition table |
+---------------------------------+

---
State transition table look up
+ success --> continue
+ fail --> look up fallback table
    + success --> continue
    + fail --> return
? EOS --> look up fallback table
    + success --> is 0 width?
        + success --> continue
        + fail --> return
    + fail --> return
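
The look up order above can be sketched as a single function. This is only an illustration, not the project's code: the names `step`, `table` and `fallback`, and the use of `None` for EOS and for "return", are assumptions made for the example.

```python
def step(table, fallback, state, ch):
    """Return (next_state, width), or None when the branch must return.

    table:    {(state, char): next_state}      (hypothetical layout)
    fallback: {state: (next_state, width)}     (hypothetical layout)
    ch is None when we are at the end of the string (EOS).
    """
    # Regular look up: only possible when there is a char to read.
    if ch is not None and (state, ch) in table:
        return table[(state, ch)], 1
    # Regular look up failed, or we are at EOS: consult the fallback table.
    if state not in fallback:
        return None
    next_state, width = fallback[state]
    # At EOS nothing can be consumed, so only 0 width fallbacks may fire.
    if ch is None and width != 0:
        return None
    return next_state, width
```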
---
##### HALT\_AND\_CATCH\_FIRE
H&C is a special state signalling that we have hit a dead end.
The reason we need it, and cannot just quit instantly, is backtracking:
a dead end only ends the current branch, and another branch may still match.

---
##### [^example]
This is a negative range.
```
let myNegativeRange = {'e', 'x', 'a', 'm', 'p', 'l'}
```
None of the characters in $myNegativeRange must be accepted.
The way this is compiled is that we first hook all chars in $myNegativeRange to H&C,
then define an OFFSHOOT of width 1.
Put differently:
if we read something illegal, we abort this branch;
if what we read was not illegal, we deduce that it must have been legal and we continue.
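
A minimal sketch of that compilation step, assuming a dict-based transition table and a fallback table that stores `(next_state, width)` pairs. `compile_negative_range`, `step` and the numeric value of `HALT_AND_CATCH_FIRE` are made up for the example and are not taken from the project.

```python
HALT_AND_CATCH_FIRE = -1  # hypothetical id for the dead-end state


def compile_negative_range(chars, state, next_state):
    """Compile [^chars] as explicit H&C hooks plus one fallback entry."""
    # Every forbidden char gets an explicit transition into H&C.
    table = {(state, c): HALT_AND_CATCH_FIRE for c in chars}
    # Everything else falls through the OFFSHOOT of width 1.
    fallback = {state: (next_state, 1)}
    return table, fallback


def step(table, fallback, state, ch):
    """Explicit transitions win; otherwise take the fallback."""
    if (state, ch) in table:
        return table[(state, ch)]
    return fallback[state][0]
```

With `table, fallback = compile_negative_range("exampl", 2, 3)`, reading `'e'` in state 2 lands in H&C, while any other char, ASCII or not, moves on to state 3.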

Handling "negatives" this way allows us to be "alphabet agnostic" in a sense.
Many implementations presume ASCII, with its fixed 7/8 bit width,
and create look up tables.
That is fast and cute, but the strategy becomes a giant memory hog
if we ever wanted to use it on, say, UTF-8 (from 256 te/c (table entries per char) to as many as 4'294'967'296 te/c if the table is indexed by a full 4 byte sequence).
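
To make the comparison concrete, here is a back-of-the-envelope sketch. Both helper names are hypothetical, and the sparse count assumes the "hook to H&C plus one fallback entry" layout described above.

```python
def dense_entries(char_bits):
    """Entries per state for a look up table indexed directly by the char value."""
    return 2 ** char_bits


def sparse_entries(negated_chars):
    """Entries per state for a negative range: one per forbidden char, plus one fallback."""
    return len(set(negated_chars)) + 1
```

`dense_entries(8)` gives the 256 entries of a byte-indexed table, while `sparse_entries("example")` stays at 7 no matter how wide a char is.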


#### .
This is the dot operator.
It matches any 1 char.

Similar to how negative ranges are implemented,
it takes advantage of the fallback table.
It simply ignores the state transition table and instead unconditionally hooks itself to the next state.
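
As a rough sketch, assuming the same dict-based tables as in the other illustrations here (`compile_dot` and `step` are hypothetical names), the dot compiles to nothing but a fallback entry:

```python
def compile_dot(state, next_state):
    """Compile '.' as a single fallback entry of width 1."""
    table = {}                           # deliberately empty: no per-char entries
    fallback = {state: (next_state, 1)}  # consume exactly one char, whatever it is
    return table, fallback


def step(table, fallback, state, ch):
    """Explicit transitions first, then the fallback, else a dead end (None)."""
    if (state, ch) in table:
        return table[(state, ch)]
    if state in fallback:
        return fallback[state][0]
    return None
```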


#### ^
This is the caret operator.
It matches the SOS (start of the string).

For the purposes of this explanation, multilining (matching after '\n') is irrelevant.
That case behaves just like a literal.

What is more interesting is how SOS is recognized.
Since `regex_assert()` is recursive, the current state is continuously passed along;
however, at our first frame it is not always just 0.
`regex_match()` decides it depending on the current position in the string.
Basically we have the first 2 states (0, 1) reserved and always missing from the state transition table.
+ 0 - SOS
+ 1 - !SOS

Normally both are _hooked_ to state 2,
and we pretend nothing has ever happened.
But when the caret operator is compiled, it sets a special compiler flag FORCE\_START\_OF\_STRING,
which forbids the hooking of state 1 to 2.
Therefore, when `regex_match()` calls from, say, position 2,
it passes in 1 as the starting state,
no state transition table entry is found since that is forbidden to begin with,
no jumps are found(!),
the machine checks whether the current state (1) is an accepting state (>=2)
and finally returns failure.
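
The first-frame behaviour can be modelled in a few lines. This is a simplified sketch, not the real `regex_match()`: it only covers how the starting state is chosen and hooked, and every name in it (`compile_start_hooks`, `initial_state`, `can_start`) is an assumption made for illustration.

```python
SOS, NOT_SOS, FIRST_REAL_STATE = 0, 1, 2


def compile_start_hooks(force_start_of_string):
    """Build the hooks out of the two reserved states."""
    hooks = {SOS: FIRST_REAL_STATE}
    if not force_start_of_string:
        # Without a caret, state 1 is hooked to state 2 as well.
        hooks[NOT_SOS] = FIRST_REAL_STATE
    return hooks


def initial_state(position):
    """Pick the starting state from the position in the string."""
    return SOS if position == 0 else NOT_SOS


def can_start(hooks, position):
    """Follow the hook if there is one, then check that we left the reserved states."""
    state = initial_state(position)
    state = hooks.get(state, state)
    return state >= FIRST_REAL_STATE
```

With the flag set, `can_start(compile_start_hooks(True), 2)` is `False`: state 1 has nowhere to go, so the attempt from position 2 fails.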


#### \<
This is the SOW (start of word) operator.
SOW must match:
```
^myword
[^\h]myword
```
Not only that, the combination is key:
either it has to be the start of the string
or it has to be preceded by at least one char which is not a symbol char.
Without the last condition, "eexample" would match "\\\<example\\\>"
as the iteration of `regex_match()` reaches "example".
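
The SOW condition can be illustrated with a small stand-alone check. The set of symbol chars used here (ASCII letters, digits and underscore) is an assumption, as are the helper names; the real definition of `\h` belongs to the project.

```python
import string

# Assumed symbol chars; the project's own definition may differ.
SYMBOL_CHARS = set(string.ascii_letters + string.digits + "_")


def is_start_of_word(text, pos):
    """True at the start of the string or right after a non-symbol char."""
    return pos == 0 or text[pos - 1] not in SYMBOL_CHARS


def find_at_start_of_word(text, word):
    """Return the first position where word occurs at a start of word, or -1."""
    for pos in range(len(text) - len(word) + 1):
        if text.startswith(word, pos) and is_start_of_word(text, pos):
            return pos
    return -1
```

A plain substring search finds "example" inside "eexample" at position 1; with the SOW check that candidate is rejected because the preceding char is a symbol char.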
