documentation of some value

2023-09-22 20:36:14 +02:00
commit 900d7ecf7e
parent 84a5d503dc
1 changed files with 91 additions and 7 deletions
--- a/documentation/README.md
+++ b/documentation/README.md
@@ -1,11 +1,11 @@
 # Abstraction
-	+---------------------+
-	|                     |
-	|                     |
-	|   State register    |
-	|                     |
-	|                     |
-	+---------------------+
+	     +---------------------+
+	     |                     |
+	     |                     |
+	     |   State register    |
+	     |                     |
+	     |                     |
+	     +---------------------+


 	+---------------------------------+
@@ -16,3 +16,87 @@
 	+---------------------------------+
 	|    Fallback transition table    |
 	+---------------------------------+
+
+---
+State transition table look up
+    + success --> continue
+    + fail    --> look up fallback table
+		+ success --> continue
+		+ fail    --> return
+? EOS --> look up fallback table
+	+ success --> is 0 width?
+		+ success --> continue
+		+ fail    --> return 
+	+ fail    --> return
+---
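The lookup order in the diagram above can be sketched roughly as follows. This is only an illustration: the table layouts (`transitions`, `fallbacks`), the `step()` function and the "width" stored next to each fallback entry are assumptions made for this sketch, not the engine's actual data structures.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table layouts, chosen only for this sketch:
#   transitions: (state, char) -> next state
#   fallbacks:   state -> (next state, width), width = chars consumed
Transitions = Dict[Tuple[int, str], int]
Fallbacks = Dict[int, Tuple[int, int]]


def step(state: int, text: str, pos: int,
         transitions: Transitions,
         fallbacks: Fallbacks) -> Optional[Tuple[int, int]]:
    """Return (next_state, next_pos), or None when this branch must return."""
    if pos < len(text):
        # Regular case: try the state transition table first.
        nxt = transitions.get((state, text[pos]))
        if nxt is not None:
            return nxt, pos + 1
        # Transition lookup failed: try the fallback table.
        fallback = fallbacks.get(state)
        if fallback is None:
            return None
        nxt, width = fallback
        return nxt, pos + width
    # EOS: only the fallback table is consulted,
    # and only a 0 width entry may continue.
    fallback = fallbacks.get(state)
    if fallback is None:
        return None
    nxt, width = fallback
    if width != 0:
        return None
    return nxt, pos
```

With `transitions = {(2, "a"): 3}` and `fallbacks = {3: (4, 0)}`, reading "a" in state 2 moves to state 3, and at EOS the 0 width fallback of state 3 still lets the machine continue to state 4.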
+##### HALT\_AND\_CATCH\_FIRE
+H&C is a special state signalling that we have hit a dead end.
+The reason we need it, and cannot just quit instantly, is backtracking:
+a dead end only ends the current branch, and another branch may still match.
+
+---
+##### [^example]
+This is a negative range.
+```
+let myNegativeRange = {'e', 'x', 'a', 'm', 'p', 'l'}
+```
+No character in $myNegativeRange may be accepted.
+The way this is compiled is that we first hook all chars in $myNegativeRange to H&C,
+then define an OFFSHOOT of width 1.
+Put differently:
+if we read something illegal, we abort this branch;
+if what we read was not illegal, we deduce that it must have been legal and we continue.
+
+Handling "negatives" this way allows us to be "alphabet agnostic" in a sense.
+Many implementations presume ASCII, with its fixed 7/8 bit width,
+and create lookup tables.
+That is fast and cute, but the strategy becomes a giant memory hog
+if we ever wanted to use it on, say, UTF-8, where a char takes up to 4 bytes
+(from 256 te/c (table entries per char) to 4'294'967'296 te/c).
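
As a rough illustration of why this stays small regardless of the alphabet, here is a minimal sketch of compiling a negative range. The names (`compile_negative_range`, `next_state`), the `-1` sentinel for H&C and the dictionary-based tables are all assumptions made for this example, not the engine's real API.

```python
from typing import Dict, Tuple

HALT_AND_CATCH_FIRE = -1  # hypothetical sentinel for the dead-end state


def compile_negative_range(chars: str, state: int, next_state_id: int,
                           transitions: Dict[Tuple[int, str], int],
                           fallbacks: Dict[int, Tuple[int, int]]) -> None:
    """Hook every forbidden char to H&C, then add an offshoot of width 1."""
    for ch in chars:
        transitions[(state, ch)] = HALT_AND_CATCH_FIRE
    # Anything that was not explicitly forbidden consumes 1 char and continues.
    fallbacks[state] = (next_state_id, 1)


def next_state(state: int, ch: str,
               transitions: Dict[Tuple[int, str], int],
               fallbacks: Dict[int, Tuple[int, int]]) -> int:
    """Transition table first, fallback table second, H&C if both miss."""
    if (state, ch) in transitions:
        return transitions[(state, ch)]
    return fallbacks.get(state, (HALT_AND_CATCH_FIRE, 0))[0]


transitions: Dict[Tuple[int, str], int] = {}
fallbacks: Dict[int, Tuple[int, int]] = {}
compile_negative_range("exampl", 2, 3, transitions, fallbacks)
```

The transition table ends up with one entry per forbidden char (6 here), no matter how large the alphabet is; every other char, ASCII or not, goes through the single fallback entry.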
+
+
+#### .
+This is the dot operator.
+It matches any 1 char.
+
+Similar to how negative ranges are implemented,
+it takes advantage of the fallback table.
+It simply ignores the state transition table and rather unconditionally hooks itself to the next state.
+
+
+#### ^
+This is the caret operator.
+It matches the SOS (start of the string).
+
+For the purposes of this explanation, multilining (match '\n') is irrelevant:
+it behaves just like a literal.
+
+What is more interesting is how SOS is recognized.
+Since `regex_assert()` is recursive, the current state is continuously passed along;
+however, at our first frame it is not simply always 0.
+`regex_match()` decides it depending on the current position in the string.
+Basically we have the first 2 states (0, 1) reserved and always missing from the state transition table.
+	+ 0 -  SOS
+	+ 1 - !SOS
+Normally both are _hooked_ to state 2,
+and we pretend nothing has ever happened.
+But when the caret operator is compiled, it sets a special compiler flag FORCE\_START\_OF\_STRING,
+which forbids the hooking of state 1 to 2.
+Therefore, when `regex_match()` calls from, say, position 2,
+it passes in 1 as the starting state,
+no state transition table entry is found since that is forbidden to begin with,
+no jumps are found(!),
+the machine checks whether the current state (1) is an accepting state (>=2)
+and finally returns failure.
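
The choice of starting state described above might look roughly like this. It is a sketch under assumptions: the constant names, the helper functions and the idea that "hooking" is a 0 width fallback entry are illustrative guesses, not taken from the actual `regex_match()` source.

```python
from typing import Dict, Tuple

SOS_STATE = 0         # reserved: the attempt starts at the start of the string
NOT_SOS_STATE = 1     # reserved: the attempt starts anywhere else
FIRST_REAL_STATE = 2  # first state produced from the pattern itself


def hook_reserved_states(fallbacks: Dict[int, Tuple[int, int]],
                         force_start_of_string: bool) -> None:
    """Hook the reserved states to state 2 (assumed here to be 0 width jumps)."""
    fallbacks[SOS_STATE] = (FIRST_REAL_STATE, 0)
    if not force_start_of_string:
        # With the `^` flag set, state 1 is deliberately left unhooked.
        fallbacks[NOT_SOS_STATE] = (FIRST_REAL_STATE, 0)


def starting_state(pos: int) -> int:
    """Pick the reserved state that corresponds to the attempt position."""
    return SOS_STATE if pos == 0 else NOT_SOS_STATE


def can_start(pos: int, fallbacks: Dict[int, Tuple[int, int]]) -> bool:
    """An attempt can only get going if its starting state is hooked."""
    return starting_state(pos) in fallbacks


anchored: Dict[int, Tuple[int, int]] = {}
hook_reserved_states(anchored, force_start_of_string=True)

unanchored: Dict[int, Tuple[int, int]] = {}
hook_reserved_states(unanchored, force_start_of_string=False)
```

For the anchored tables, `can_start(0, anchored)` holds while `can_start(2, anchored)` does not, which mirrors the failure at position 2 walked through above.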
+
+
+#### \<
+This is the SOW (start of word) operator.
+SOW must match:
+```
+^myword
+[^\h]myword
+```
+The combination is key:
+either it has to be the start of the string
+or it has to be preceded by at least something which is not a symbol char.
+Without the last condition "eexample" would match "\\\<example\\\>"
+as the iteration of `regex_match()` reaches "example".
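
The start-of-word condition can be sketched on its own, independent of the state machine. Everything here is illustrative: `is_symbol_char()` assumes that symbol chars are letters, digits and `_`, which may differ from what the engine actually uses, and `find_word_starts()` is a hypothetical helper that only checks the `\<` side.

```python
from typing import List


def is_symbol_char(ch: str) -> bool:
    """Assumed definition of a symbol char: letter, digit or underscore."""
    return ch.isalnum() or ch == "_"


def is_start_of_word(text: str, pos: int) -> bool:
    """True at the start of the string or right after a non-symbol char."""
    return pos == 0 or not is_symbol_char(text[pos - 1])


def find_word_starts(text: str, word: str) -> List[int]:
    """Positions where `word` occurs and the start-of-word condition holds."""
    return [pos for pos in range(len(text) - len(word) + 1)
            if text.startswith(word, pos) and is_start_of_word(text, pos)]
```

Here `find_word_starts("eexample", "example")` finds nothing, because position 1 is preceded by the symbol char "e", while "example" and "an example" both yield a hit.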