/* regex.c
 * Copyright 2023 Anon Anonson, Ognjen 'xolatile' Milan Robovic, Emil Williams
 * SPDX Identifier: GPL-3.0-only / NO WARRANTY / NO GUARANTEE */

#include "regex.h"

#include <assert.h>
#include <string.h>
#include <limits.h>
#include <stdlib.h>

// ------------------
// ### Char tests ###
// ------------------
static bool is_quantifier(const char c) {
    for (const char * s = "+*?="; *s != '\0'; s++) {
        if (*s == c) {
            return true;
        }
    }
    return false;
}

bool is_magic(const char c) {
    if (is_quantifier(c)) {
        return true;
    }
    for (const char * s = "\\[]."; *s != '\0'; s++) {
        if (*s == c) {
            return true;
        }
    }
    return false;
}

// ----------------------
// ### Internal Types ###
// ----------------------
typedef struct {
    int in;
    char input;
    int to;
    int width;
} delta_t;

typedef struct {
    int in;
    int to;
} offshoot_t;

typedef struct {
    bool * do_catch;
    bool * is_negative;
    int * state;
    int * width;
    char * whitelist;
    char * blacklist;
    regex_t * regex;
} compiler_state;

// ----------------------------------
// ### Regex creation/destruction ###
// ----------------------------------
static int escape_1_to_1(const char c, compiler_state * cs) {
    char * target_list = (*cs->is_negative) ? cs->blacklist : cs->whitelist;
    switch (c) {
        case 't': {
            strcat(target_list, "\t");
        } return 1;
        case 'n': {
            strcat(target_list, "\n");
        } return 1;
        case 'r': {
            strcat(target_list, "\r");
        } return 1;
        case 'b': {
            strcat(target_list, "\b");
        } return 1;
        case '[': {
            strcat(target_list, "[");
        } return 1;
        case ']': {
            strcat(target_list, "]");
        } return 1;
        case '.': {
            strcat(target_list, ".");
        } return 1;
        case '=': {
            strcat(target_list, "=");
        } return 1;
        case '?': {
            strcat(target_list, "?");
        } return 1;
        case '+': {
            strcat(target_list, "+");
        } return 1;
        case '*': {
            strcat(target_list, "*");
        } return 1;
        case '\\': {
            strcat(target_list, "\\");
        } return 1;
    }
    return 0;
}

static int escape_1_to_N(const char c, compiler_state * cs) {
    char * target_list = (*cs->is_negative) ? cs->blacklist : cs->whitelist;
    // Every class is appended with strcat (as in escape_1_to_1) rather than
    // copied with strcpy, so characters already collected inside a range
    // such as [_\d] are kept.
    switch(c) {
        case 'i': {
            const char identifier_chars[] = "@0123456789_"
                                            "\300\301\302\303\304"
                                            "\305\306\307\310\311"
                                            "\312\313\314\315\316"
                                            "\317\320\321\322\323"
                                            "\324\325\326\327\330"
                                            "\331\332\333\334\335"
                                            "\336\337";
            strcat(target_list, identifier_chars);
            return sizeof(identifier_chars)-1;
        }
        case 'I': {
            const char identifier_chars[] = "@_"
                                            "\300\301\302\303\304"
                                            "\305\306\307\310\311"
                                            "\312\313\314\315\316"
                                            "\317\320\321\322\323"
                                            "\324\325\326\327\330"
                                            "\331\332\333\334\335"
                                            "\336\337";
            strcat(target_list, identifier_chars);
            return sizeof(identifier_chars)-1;
        }
        case 'k': {
            const char keyword_chars[] = "@0123456789_"
                                         "\300\301\302\303\304"
                                         "\305\306\307\310\311"
                                         "\312\313\314\315\316"
                                         "\317\320\321\322\323"
                                         "\324\325\326\327\330"
                                         "\331\332\333\334\335"
                                         "\336\337";
            strcat(target_list, keyword_chars);
            return sizeof(keyword_chars)-1;
        }
        case 'K': {
            const char keyword_chars[] = "@_"
                                         "\300\301\302\303\304"
                                         "\305\306\307\310\311"
                                         "\312\313\314\315\316"
                                         "\317\320\321\322\323"
                                         "\324\325\326\327\330"
                                         "\331\332\333\334\335"
                                         "\336\337";
            strcat(target_list, keyword_chars);
            return sizeof(keyword_chars)-1;
        }
        case 'f': {
            const char filename_chars[] = "@0123456789/.-_+,#$%~=";
            strcat(target_list, filename_chars);
            return sizeof(filename_chars)-1;
        }
        case 'F': {
            const char filename_chars[] = "@/.-_+,#$%~=";
            strcat(target_list, filename_chars);
            return sizeof(filename_chars)-1;
        }
        case 'p': {
            const char printable_chars[] = "@"
                                           "\241\242\243\244\245"
                                           "\246\247\250\251\252"
                                           "\253\254\255\256\257"
                                           "\260\261\262\263\264"
                                           "\265\266\267\270\271"
                                           "\272\273\274\275\276"
                                           "\277"
                                           "\300\301\302\303\304"
                                           "\305\306\307\310\311"
                                           "\312\313\314\315\316"
                                           "\317\320\321\322\323"
                                           "\324\325\326\327\330"
                                           "\331\332\333\334\335"
                                           "\336\337";
            strcat(target_list, printable_chars);
            return sizeof(printable_chars)-1;
        }
        case 'P': {
            const char printable_chars[] = "@"
                                           "\241\242\243\244\245"
                                           "\246\247\250\251\252"
                                           "\253\254\255\256\257"
                                           "\260\261\262\263\264"
                                           "\265\266\267\270\271"
                                           "\272\273\274\275\276"
                                           "\277"
                                           "\300\301\302\303\304"
                                           "\305\306\307\310\311"
                                           "\312\313\314\315\316"
                                           "\317\320\321\322\323"
                                           "\324\325\326\327\330"
                                           "\331\332\333\334\335"
                                           "\336\337";
            strcat(target_list, printable_chars);
            return sizeof(printable_chars)-1;
        }
        case 's': {
            const char whitespace_chars[] = " \t\v\n";
            strcat(target_list, whitespace_chars);
            return sizeof(whitespace_chars)-1;
        }
        case 'd': {
            const char digit_chars[] = "0123456789";
            strcat(target_list, digit_chars);
            return sizeof(digit_chars)-1;
        }
        case 'x': {
            const char hex_chars[] = "0123456789"
                                     "abcdef"
                                     "ABCDEF";
            strcat(target_list, hex_chars);
            return sizeof(hex_chars)-1;
        }
        case 'o': {
            const char oct_chars[] = "01234567";
            strcat(target_list, oct_chars);
            return sizeof(oct_chars)-1;
        }
        case 'w': {
            const char word_chars[] = "0123456789"
                                      "abcdefghijklmnopqrstuvwxyz"
                                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                      "_";
            strcat(target_list, word_chars);
            return sizeof(word_chars)-1;
        }
        case 'h': {
            const char very_word_chars[] = "abcdefghijklmnopqrstuvwxyz"
                                           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                           "_";
            strcat(target_list, very_word_chars);
            return sizeof(very_word_chars)-1;
        }
        case 'a': {
            const char alpha_chars[] = "abcdefghijklmnopqrstuvwxyz"
                                       "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
            strcat(target_list, alpha_chars);
            return sizeof(alpha_chars)-1;
        }
        case 'l': {
            const char lower_alpha_chars[] = "abcdefghijklmnopqrstuvwxyz";
            strcat(target_list, lower_alpha_chars);
            return sizeof(lower_alpha_chars)-1;
        }
        case 'u': {
            const char upper_alpha_chars[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
            strcat(target_list, upper_alpha_chars);
            return sizeof(upper_alpha_chars)-1;
        }
    }
    return 0;
}

static int escape_to_negative(const char c,
                              compiler_state * cs) {
    switch (c) {
        case 'D': {
            const char digit_chars[] = "0123456789";
            // Append, so characters already blacklisted by a range are kept.
            strcat(cs->blacklist, digit_chars);
            *cs->is_negative = true;
            return sizeof(digit_chars)-1;
        }
    }
    return 0;
}

//static int compile_hologram(char * hologram, char * whitelist) {
//    if (hologram[0] == '\\') {
//        switch (hologram[1]) {
//            case '<': {
//                const char very_word_chars[] = "abcdefghijklmnopqrstuvwxyz"
//                                               "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
//                                               "_";
//                strcat(whitelist, very_word_chars);
//                is_negative = true;
//                HOOK_ALL(0, whitelist, 0)
//            } break;
//        }
//    }
//}

static int compile_dot(compiler_state * cs) {
    *cs->do_catch = true;
    return true;
}

static int compile_escape(const char c,
                          compiler_state * cs) {
    return escape_1_to_1(c, cs)
        || escape_1_to_N(c, cs)
        || escape_to_negative(c, cs)
        //|| compile_hologram(*s, whitelist)
        ;
}

static int compile_range(const char * const range,
                         compiler_state * cs) {
    assert((range[0] == '[') && "Not a range.");
    const char * s;
    if (range[1] == '^') {
        *cs->is_negative = true;
        s = range + 2;
    } else {
        s = range + 1;
    }
    // Pick the list only after '^' has been handled, so that a negated
    // range fills the blacklist rather than the whitelist.
    char * target_list = (*cs->is_negative) ? cs->blacklist : cs->whitelist;
    for (; *s != ']'; s++) {
        assert((*s != '\0') && "Unclosed range.");
        char c = *s;
        if (c == '\\') {
            s += 1;
            // Call outside assert() so the escape is still compiled when
            // NDEBUG removes the assertion.
            const int known_escape = compile_escape(*s, cs);
            assert(known_escape && "Unknown escape.");
            (void)known_escape;
        } else if (*(s+1) == '-') {
            char end = *(s+2);
            assert((c < end) && "Endless range.");
            // Iterate with an int so the counter cannot wrap at CHAR_MAX.
            for (int cc = c; cc <= end; cc++) {
                const char ch = (char)cc;
                strncat(target_list, &ch, 1);
            }
            s += 2;
        } else {
            strncat(target_list, &c, 1);
        }
    }
    return (int)((s - range) + 1);
}

// Appends to `filtered` every blacklist character that is not whitelisted.
void filter_blacklist(const char * whitelist,
                      const char * blacklist,
                      char * filtered) {
    for (; *blacklist != '\0'; blacklist++) {
        // Search the whole whitelist for each blacklist character; advancing
        // the whitelist pointer itself would leave it exhausted after the
        // first pass.
        if (strchr(whitelist, *blacklist) == NULL) {
            strncat(filtered, blacklist, 1);
        }
    }
}

#define HALT_AND_CATCH_FIRE INT_MIN

// Adds one delta-table entry per character of `str`, leading from
// state + from to state + to (or to state -1 for HALT_AND_CATCH_FIRE).
void HOOK_ALL(int from,
              const char * const str,
              int to,
              compiler_state * cs) {
    int hook_to = (to == HALT_AND_CATCH_FIRE) ? -1 : ((*cs->state) + to);
    for (const char * s = str; *s != '\0'; s++) {
        delta_t * delta = malloc(sizeof(delta_t));
        delta->in = *cs->state + from;
        delta->input = *s;
        delta->to = hook_to;
        delta->width = *cs->width;
        vector_push(&cs->regex->delta_table,
                    &delta);
    }
}

// Adds a catch-table entry from state + from to state + to;
// catch_() follows it during matching.
void OFFSHOOT(int from,
              int to,
              compiler_state * cs) {
    offshoot_t * offshoot = malloc(sizeof(offshoot_t));
    offshoot->in = *cs->state + from;
    offshoot->to = *cs->state + to;
    vector_push(&cs->regex->catch_table,
                &offshoot);
}

regex_t * regex_compile(const char * const pattern) {
    regex_t * regex = (regex_t *)malloc(sizeof(regex_t));
    regex->str = strdup(pattern);
    vector_init(&regex->delta_table, sizeof(delta_t*), 0UL);
    vector_init(&regex->catch_table, sizeof(offshoot_t*), 0UL);

    int state = 0;
    bool do_catch;
    bool is_negative;
    int width;
    // \p alone expands to 64 characters plus the terminator, so 64-byte
    // buffers overflow; 256 bytes fit every distinct non-NUL char value.
    char whitelist[256];
    char blacklist[256];

    compiler_state cs = {
        .do_catch = &do_catch,
        .is_negative = &is_negative,
        .state = &state,
        .width = &width,
        .whitelist = whitelist,
        .blacklist = blacklist,
        .regex = regex,
    };

    assert(!is_quantifier(*pattern) && "Pattern starts with quantifier.");

    for (const char * s = pattern; *s != '\0';) {
        // Reset the compiler
        whitelist[0] = '\0';
        blacklist[0] = '\0';
        do_catch = false;
        is_negative = false;
        width = 1;

        // Translate char
        switch (*s) {
            case '.': {
                compile_dot(&cs);
            } break;
            case '\\': {
                s += 1;
                // Call outside assert() so the escape is still compiled
                // when NDEBUG removes the assertion.
                const int known_escape = compile_escape(*s, &cs);
                assert(known_escape && "Unknown escape.");
                (void)known_escape;
            } break;
            case '[': {
                s += compile_range(s, &cs) - 1;
            } break;
            default: {
                whitelist[0] = *s;
                whitelist[1] = '\0';
            } break;
        }
        s += 1;

        // Compile with quantifier
        switch (*s) {
            case '=':
            case '?': {
                HOOK_ALL(0, whitelist, +1, &cs);
                if (do_catch || is_negative) {
                    OFFSHOOT(0, +1, &cs);
                }
                s += 1;
            } break;
            case '*': {
                HOOK_ALL(0, whitelist, 0, &cs);
                if (do_catch) {
                    OFFSHOOT(0, +1, &cs);
                } else if (is_negative) {
                    OFFSHOOT(0, 0, &cs);
                }
                s += 1;
            } break;
            case '+': {
                HOOK_ALL(0, whitelist, +1, &cs);
                if (do_catch || is_negative) {
                    OFFSHOOT(0, +1, &cs);
                }
                state += 1;
                HOOK_ALL(0, whitelist, 0, &cs);
                if (do_catch || is_negative) {
                    OFFSHOOT(0, 0, &cs);
                }
                s += 1;
            } break;
            default: { // Literal
                HOOK_ALL(0, whitelist, +1, &cs);
                if (do_catch || is_negative) {
                    OFFSHOOT(0, +1, &cs);
                }
                state += 1;
            } break;
        }

        // Compile blacklist
        if (*blacklist) {
            char filtered_blacklist[256];
            filtered_blacklist[0] = '\0';
            filter_blacklist(whitelist, blacklist, filtered_blacklist);
            HOOK_ALL(0, filtered_blacklist, HALT_AND_CATCH_FIRE, &cs);
        }
    }

    regex->accepting_state = state;

    return regex;
}
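
int regex_free(regex_t * const regex) {
    // The tables store pointers to individually malloc'd entries (see
    // HOOK_ALL and OFFSHOOT), so release those before the vectors.
    for (size_t i = 0; i < regex->delta_table.element_count; i++) {
        free(*(delta_t**)vector_get(&regex->delta_table, i));
    }
    for (size_t i = 0; i < regex->catch_table.element_count; i++) {
        free(*(offshoot_t**)vector_get(&regex->catch_table, i));
    }
    free(regex->str);
    vector_free(&regex->delta_table);
    vector_free(&regex->catch_table);
    free(regex);
    return 0;
}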

// -----------------
// ### Searching ###
// -----------------
static bool catch_(const regex_t * const regex,
                   int * const state) {
    for (size_t i = 0; i < regex->catch_table.element_count; i++) {
        const offshoot_t * const offshoot = *(offshoot_t**)vector_get(&regex->catch_table, i);
        if (offshoot->in == *state) {
            *state = offshoot->to;
            return true;
        }
    }
    return false;
}

static int regex_assert(const regex_t * const regex,
                        const char * const string,
                        int state,
                        int width) {
    for (const char * s = string; *s != '\0'; s++) {
        // delta
        for (size_t i = 0; i < regex->delta_table.element_count; i++) {
            const delta_t * const delta = *(delta_t**)vector_get(&regex->delta_table, i);
            if ((delta->in == state)
            &&  (delta->input == *s)) {
                int r = regex_assert(regex, s + delta->width, delta->to, width + 1);
                if (r) {
                    return r;
                }
            }
        }

        if (catch_(regex, &state)) {
            width += 1;
            continue;
        }

        return (state == regex->accepting_state) ? width : false;
    }

    // End of input: a match that consumed the whole string must still be
    // accepted, so apply the same accepting-state test here.
    return (state == regex->accepting_state) ? width : false;
}

int regex_match(regex_t * regex,
                const char * const string) {
    if (regex == NULL) {
        return false;
    }
    if (string == NULL) {
        return true;
    }

    return regex_assert(regex, string, 0, 0);
}

bool regex_search(regex_t * regex,
                  const char * const string) {
    return (bool)regex_match(regex, string);
}