/*
Copyright (c) 2023 : Ognjen 'xolatile' Milan Robovic

Xhartae is free software! You will redistribute it or modify it under the terms of the GNU General Public License by Free Software Foundation.
And when you do redistribute it or modify it, it will use either version 3 of the License, or (at yours truly opinion) any later version.
It is distributed in the hope that it will be useful or harmful, it really depends... But no warranty what so ever, seriously. See GNU/GPLv3.
*/

#ifndef CHAPTER_4_SOURCE
#define CHAPTER_4_SOURCE

#include "chapter_4.h"

/*
Of course, we could just write something like the 'preview_unhighlighted_text_file' function below (the name is obviously a joke), but it would apply the same style (colour and effect character attributes) to our entire text file. When we're writing programs, syntax highlighting makes a lot of difference to readability, the same way code formatting and the initial program design structure do. If you use a lot of external variables everywhere, the entire program starts to be messy or difficult to maintain, write and debug (or all three). However, if you use a lot of variables, and you pass each of them separately into functions, or if you have one huge monolithic structure (this time a literal 'struct'), you aren't doing much better, except that the compiler will have an easier time optimizing your code, even tho that kind of code becomes a pain to write. So, having a few external functions that do one thing well, and a few external variables that won't be edited outside of their file, is the best in my opinion. Your function calls won't be long. If you don't want to make those external ("global") variables visible to some other file, just move them from the C header file to the C source file, and redeclare them as 'static', making them internal variables. Keep in mind, you have to use your brain more in that case, and think about what you're modifying, where and why.
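
Here's a minimal sketch of that last idea. It isn't taken from this project: 'counter_value', 'counter_increment' and 'counter_read' are hypothetical names used only for illustration. The variable lives in the source file and is declared 'static', so other files can only reach it through the two functions.

```c
// Hypothetical example, none of these names exist in this project.
static int counter_value = 0; // Internal variable, invisible outside of this source file.

// Only these two functions are allowed to modify or read the internal variable.
int counter_increment (void) {
	counter_value += 1;

	return (counter_value);
}

int counter_read (void) {
	return (counter_value);
}
```

Only the two function declarations would go into the header file, so any file that includes it can count, but can't overwrite the counter by accident.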

Well, if you want to write programs, and not to think about them, just close this and read Jin Ping Mei instead or something. There's no cheatsheet for making good programs, you choose your constraints, your program design structure, and you start working on it. Sometimes you'll have to choose between performance, maintainability, simplicity or low memory usage, and even if you are smart and manage to get three of them to work out in your project, the fourth won't. I can't teach you how to choose, maybe you want to learn embedded or game development, and they each have their own advantages and disadvantages.

@C
void preview_unhighlighted_text_file (char * text_file, int x, int y) {
	char * text_data;

	text_data = file_record (text_file);

	for (curses_active = 1; curses_active != 0; ) {
		curses_render_background (' ', COLOUR_WHITE, EFFECT_NORMAL);

		curses_render_string (text_data, COLOUR_WHITE, EFFECT_NORMAL, x, y);

		curses_synchronize ();
	}

	text_data = deallocate (text_data);
}
@

So, let's write very basic C programming language syntax highlighting, explain how we can easily do it in a little more than 150 lines of (scarily verbose and nicely aligned) code, and why we don't need regular expressions for it. You can use these 'syntax_*' functions to tokenize some source code, highlight its syntax, or do something else that I didn't even think about if you're creative. Of course, we can use them to highlight the syntax of some other programming language, not only C, and we'll use them later to highlight assembly, Ada, C++, and maybe a few more programming languages. Note that regular expressions are a far more powerful way of achieving the same results, and of doing even more things, like replacing some parts of the strings. This is a simple solution for a simple problem.
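
To make the "no regular expressions" idea concrete before the real code, here's a rough standalone sketch. It is not how the 'syntax_*' functions are written: 'token_length' is a hypothetical helper that uses the standard 'strncmp' and 'strstr' functions instead of this project's string functions, and it handles only the simplest case, a token with a fixed begin string and a fixed end string.

```c
#include <string.h>

// Hypothetical helper, not part of this chapter: returns the length of a token that starts with 'begin' at the start
// of 'text' and runs through the first 'end' found after it, or 0 if 'text' doesn't start with 'begin'.
int token_length (const char * text, const char * begin, const char * end) {
	size_t       begin_length = strlen (begin);
	const char * found;

	if (strncmp (text, begin, begin_length) != 0) {
		return (0); // Text doesn't start with our 'begin' string.
	}

	found = strstr (text + begin_length, end); // Search for 'end' only after the 'begin' string.

	if (found == NULL) {
		return ((int) strlen (text)); // No ending found, so the token runs to the end of the text.
	}

	return ((int) (found - text) + (int) strlen (end)); // Token includes both 'begin' and 'end' strings.
}
```

For a line comment, we'd call it with "//" as 'begin' and "\n" as 'end', and everything it measures gets one colour. The real functions generalize this with "match any character from this set" rules and escape characters.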

We'll define some internal variables below, then the functions 'syntax_delete' (that'll be called automatically when we exit the program), 'syntax_define' to make rules about our character and string matching, and 'syntax_select' to process our text file (which is just an array of characters, also known as, lo and behold, a string). The last function, 'syntax_select', will return the index of the syntax rule that matches at our offset in the string, and store the size of the match in the 'length' variable, we'll look into it.
*/

static int syntax_count = 0; // Number of previously defined syntax rules.

static int * syntax_enrange = NULL; // Syntax rule can start with any character from 'syntax_begin' if this value is TRUE.
static int * syntax_derange = NULL; // Syntax rule can end with any character from 'syntax_end' if this value is TRUE.
static char * * syntax_begin = NULL; // Strings containing valid character (sub)sequence for beginning the scan.
static char * * syntax_end = NULL; // Strings containing valid character (sub)sequence for ending the scan.
static char * syntax_escape = NULL; // Escape character for each rule, useful for line-breaks in C macros and line-based languages.
static int * syntax_colour = NULL; // Colour for our token, these two could be completely independent, but I like to keep them here.
static int * syntax_effect = NULL; // Effect for our token.

/*
Let's go into more detail about how this function works. Standard library function 'atexit' will take as an argument a function pointer of the form 'extern void name (void)' that will, imagine my shock, be executed at the exit point of our little program. We can make mistakes using it, if we don't think while we write our programs, but the error will be obvious, memory will be leaked or double-freed, Valgrind will detect it, and we'd fix it. Also keep in mind, you can't have too many functions executed at the end, you can check for the value of 'ATEXIT_MAX', which is at least 32 by some standards (POSIX).
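
Here's a small standalone sketch of that "register the cleanup once" pattern, under a few assumptions: 'scratch_buffer', 'cleanup_ready', 'scratch_delete' and 'scratch_prepare' are hypothetical names, and plain 'malloc' and 'free' stand in for this project's 'allocate' and 'deallocate'. Only 'atexit' itself is the real standard library function.

```c
#include <stdlib.h>

static char * scratch_buffer = NULL; // Hypothetical resource that must be freed at exit.
static int    cleanup_ready  = 0;    // Non-zero once the cleanup function has been registered.

static void scratch_delete (void) { // Matches the form that 'atexit' expects: no arguments, no return value.
	free (scratch_buffer);

	scratch_buffer = NULL; // Guard against double-free if this is ever called twice.
}

// Returns 1 if the cleanup function was registered by this call, 0 if it was already registered or 'atexit' failed.
int scratch_prepare (void) {
	if (cleanup_ready != 0) {
		return (0); // Already registered, don't spend another 'atexit' slot.
	}

	if (atexit (scratch_delete) != 0) { // Standard 'atexit' returns non-zero on failure.
		return (0);
	}

	scratch_buffer = malloc (64);
	cleanup_ready  = 1;

	return (1);
}
```

No matter how many times 'scratch_prepare' gets called, only one of the limited 'atexit' slots is used.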

So the goal is, we think a bit more when we structure our program, and we don't worry about whether we forgot to deinitialize something. We'll reuse this in chapter five, but with the opposite approach, where we want to explicitly deinitialize syntax. If you take a look at the next function, 'syntax_define', you'll see that it calls 'atexit' only when 'syntax_count' is zero, meaning no rules are defined yet, so 'atexit' won't be executed every time we call 'syntax_define', which is good. Lastly, in the 'syntax_delete' function, we're just deallocating (freeing) the memory, so we don't leak it and generate Valgrind warnings.
*/

static void syntax_delete (void) {
	int offset;

	if (syntax_count == 0) { // If syntax "library" wasn't used, we don't want to deallocate memory, we just return.
		return;
	}

	// We could reverse-loop through this without a local variable 'offset' using this approach, but I consider this bad for readability.
	// --syntax_count;
	// do {
	// 	syntax_begin [syntax_count] = deallocate (syntax_begin [syntax_count]);
	// 	syntax_end [syntax_count] = deallocate (syntax_end [syntax_count]);
	// } while (--syntax_count != -1);

	for (offset = 0; offset < syntax_count; ++offset) {
		syntax_begin [offset] = deallocate (syntax_begin [offset]); // Since these two are arrays of strings, we need to deallocate, otherwise we'll leak memory.
		syntax_end [offset] = deallocate (syntax_end [offset]); // We're basically freeing memory one by one string, you'll see below how we allocate it.
	}

	syntax_enrange = deallocate (syntax_enrange); // And now we're deallocating the rest of arrays, so no memory is leaked.
	syntax_derange = deallocate (syntax_derange);
	syntax_begin = deallocate (syntax_begin);
	syntax_end = deallocate (syntax_end);
	syntax_escape = deallocate (syntax_escape);
	syntax_colour = deallocate (syntax_colour);
	syntax_effect = deallocate (syntax_effect);

	syntax_count = 0; // Lastly, I like to do this, but you don't have to. We'll use it in later chapter tho.
}

/*
In the 'syntax_define' function we're reallocating (enlarging) memory, effectively adding a new element into our arrays, and assigning or copying data to them. These syntax rules will be used with the 'syntax_select' function to make our syntax highlighting. Let's explain what those function arguments do:

@C
int syntax_define (int enrange, // Strict matching of string 'begin' in buffer range if FALSE, any character matching if TRUE.
                   int derange, // Strict matching of string 'end' in buffer range if FALSE, and again, any character matching if TRUE.
                   char * begin, // String of array of characters to begin matching.
                   char * end, // String of array of characters to end matching, I don't know why I explain these...
                   char escape, // Escape character, useful for C preprocessor.
                   int colour, // Colour.
                   int effect); // Effect, I hate explaining the code when the identifiers are descriptive.
@
*/

int syntax_define (int enrange, int derange, char * begin, char * end, char escape, int colour, int effect) {
	if (syntax_count == 0) { // If our syntax isn't active, we'll execute this only once.
		atexit (syntax_delete); // Mark this function to be executed at program exit point.
	} // It's same if we use more 'syntax_highlight_*' functions.

	fatal_failure (begin == NULL, "syntax_define: Begin string is null pointer."); // I don't like checking for errors, but here, voila.
	fatal_failure (end == NULL, "syntax_define: End string is null pointer.");

	++syntax_count;

	syntax_enrange = reallocate (syntax_enrange, syntax_count * (int) sizeof (* syntax_enrange)); // First, a block of memory reallocations, to "make space" for our new rule:
	syntax_derange = reallocate (syntax_derange, syntax_count * (int) sizeof (* syntax_derange));
	syntax_begin = reallocate (syntax_begin, syntax_count * (int) sizeof (* syntax_begin));
	syntax_end = reallocate (syntax_end, syntax_count * (int) sizeof (* syntax_end));
	syntax_escape = reallocate (syntax_escape, syntax_count * (int) sizeof (* syntax_escape));
	syntax_colour = reallocate (syntax_colour, syntax_count * (int) sizeof (* syntax_colour));
	syntax_effect = reallocate (syntax_effect, syntax_count * (int) sizeof (* syntax_effect));

	syntax_enrange [syntax_count - 1] = enrange; // Then we store the actual data of the new rule.
	syntax_derange [syntax_count - 1] = derange;
	syntax_escape [syntax_count - 1] = escape;
	syntax_colour [syntax_count - 1] = colour;
	syntax_effect [syntax_count - 1] = effect;

	syntax_begin [syntax_count - 1] = allocate ((string_length (begin) + 1) * (int) sizeof (* * syntax_begin)); // We need to allocate enough memory for our strings now.
	syntax_end [syntax_count - 1] = allocate ((string_length (end) + 1) * (int) sizeof (* * syntax_end)); // Notice, we won't REallocate, just allocate!

	string_copy (syntax_begin [syntax_count - 1], begin); // Finally, we're copying our strings into syntax data.
	string_copy (syntax_end [syntax_count - 1], end);

	return (syntax_count - 1); // We return the index, but we won't use it in this chapter.
}

/*
This is more complex, but if you use your eyes to look, your brain to comprehend and your heart to love, I'm sure that you'll understand it.
*/

int syntax_select (char * string, int * length) {
	int offset, select;

	if (syntax_count == 0) { // Don't select without rules, return!
		if (length != NULL) { // Without rules, 'length' would stay untouched, and a caller that advances by it would never move forward.
			* length = 1;
		}

		return (syntax_count);
	}

	fatal_failure (string == NULL, "syntax_select: String is null.");
	fatal_failure (length == NULL, "syntax_select: Length is null.");

	// In this first part of the function, we need to check if our syntax rule has been detected at the string offset we've provided. We're looping defined syntax rules and
	// choosing whether to compare any of the characters, or full string, depending on 'syntax_enrange' value which is essentially boolean, true or false, which I express with
	// 'int' type for "type-safety simplicity". Keep in mind that we're not returning or modifying the string we provided, so it won't be null-terminated, instead I think
	// it's best to modify only variable 'length', hence we check with 'string_compare_limit' function.
	for (select = offset = 0; select != syntax_count; ++select) { // We're looping defined syntax rules:
		int begin = string_length (syntax_begin [select]);

		if (syntax_enrange [select] == FALSE) { // Choosing the comparison based on 'syntax_enrange':
			if (syntax_derange [select] == FALSE) { // Either full string, or any character in it.
				if (string_compare_limit (string, syntax_begin [select], begin) == TRUE) { // Limiting our string comparison.
					break; // If strings are same, we exit the loop.
				}
			} else {
				if ((string_compare_limit (string, syntax_begin [select], begin) == TRUE) // We need to see if we found our string, and:
				&&  (character_compare_array (string [offset + begin], syntax_end [select]) == TRUE)) { // If next character, after the string is in 'syntax_end'.
					break;
				} // Otherwise, we implicitly continue the loop.
			}
		} else { // Else, we compare any character.
			if (character_compare_array (string [offset], syntax_begin [select]) == TRUE) { // With our obviously named function...
				break; // We found it, exit the loop!
			} // If we didn't, just continue.
		}
	} // And now we have our 'select' value.

	// If there was no syntax rule detected, we need to return from a function, and increment the offset by setting variable 'length' to 1. If we don't increment it, at the
	// first unrecognized character, our second nested-loop inside function 'syntax_render_file' would use uninitialized or zero value, depending on how we structured our code
	// before that. We also return 'syntax_count' as the syntax rule index, which is invalid, and would produce Valgrind warning if we didn't handle it. In my unimportant
	// opinion, this if statement is the ugliest part of the function.
	if (select >= syntax_count) { // If we didn't find our string, return.
		* length = 1;

		return (syntax_count);
	}

	// In this second part, we have our 'select' value, index of the syntax rule, and we want to know the 'length' value, by trying to match with 'syntax_end' string. We have
	// to again, separate two cases for matching any character or full string, except that we use it to determine its match-length. Important difference is also that there's
	// special case where we have escape character matching, and where 'syntax_end' string is empty (but not NULL), so in that case we match only one character. We could have
	// nested loop there, and second loop would need goto statement to exit it, so we only use one loop.
	* length = string_length (string); // If we never find the end below, the match extends to the end of the string.

	for (offset = 1; string [offset - 1] != '\0'; ++offset) { // Now, offset must be 1, and we loop...
		int end = string_length (syntax_end [select]);

		if ((syntax_escape [select] != CHARACTER_NULL) && (string [offset] == syntax_escape [select])) { // Here's our escape exception, skipped when the rule has no escape character.
			++offset;
			continue;
		}

		if (syntax_derange [select] == FALSE) { // Choosing what to compare, yet again...
			if (string_compare_limit (& string [offset], syntax_end [select], end) == TRUE) { // Again, we're comparing full string.
				* length = offset + end; // We found it, yaay!
				break;
			}
		} else {
			if (syntax_end [select] [0] == CHARACTER_NULL) { // And here's our empty string exception.
				* length = offset; // In that case, we break from loop.
				break;
			}

			if (character_compare_array (string [offset], syntax_end [select]) == TRUE) { // Otherwise, we compare to see if the end is near!
				* length = offset;
				break;
			}
		} // These two loops look similar, but no!
	} // And now we have our 'length' value.

	return (select); // Lastly, return syntax rule index.
}

/*
Imagine my shock, we can now print coloured text, without regular expressions. Nothing much, we can print it without using 'curses_*' functions, but if we want to preview a large file, well more than 24 lines of code, we'd want to scroll it, or modify it if we're making a text editor, hence, using curses is good. Let's see how our "mini-main" subprogram-like function does its work, and how we use 'syntax_*' functions in it, and I also want to make a few syntax highlighting abstractions. If we called multiple 'syntax_highlight_*' functions one after another, their rules would get mixed, so each of them uses 'syntax_delete' to reset the rules first.

Before we begin (Ada pun intended, remove this in final version), I won't (re)align 'separators' and 'keywords', because they fuck-up my comments, which I never write in my "official" programs. I write comments only here, to explain stuff in more detail. Have fun... Oh, and the type of variable 'keywords' is an array of string pointers of automatic length, which we get with the "sizeof (keywords) / sizeof (keywords [0])" part, for the C keywords below, it would be 32UL, and we cast it to integer. I use "long" comments outside of functions, and "short" comments inside them, while aligning them to the longest line of code, or current indentation level.
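
If that "sizeof" trick is new to you, here's a tiny standalone sketch of it. The names 'example_words' and 'example_word_count' are hypothetical, made up for this illustration only.

```c
// Hypothetical example: counting the elements of an array of strings at compile time.
static const char * example_words [] = { "if", "else", "while", "for", "return" };

int example_word_count (void) {
	// Size of the whole array in bytes, divided by the size of one element (a pointer), gives the element count.
	return ((int) (sizeof (example_words) / sizeof (example_words [0])));
}
```

Keep in mind that this only works on a real array whose definition is visible. If you pass the array to a function, the parameter is just a pointer, and the same expression would no longer give you the element count.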
*/

void syntax_highlight_c (void) {
	char * separators = ".,:;<=>+*-/%!&~^?|()[]{}'\" \t\r\n";

	char * keywords [] = {
		"register", "volatile", "auto", "const", "static", "extern", "if", "else",
		"do", "while", "for", "continue", "switch", "case", "default", "break",
		"enum", "union", "struct", "typedef", "goto", "void", "return", "sizeof",
		"char", "short", "int", "long", "signed", "unsigned", "float", "double"
	};

	int word;

	if (syntax_count != 0) { // If syntax was used, free it, then we can redefine them.
		syntax_delete (); // This way, we won't mix syntaxes if we use this multiple times.
	}

	syntax_define (FALSE, FALSE, "/*", "*/", '\0', COLOUR_GREY, EFFECT_BOLD); // Below, we're simply using our 'syntax_define' function.
	syntax_define (FALSE, FALSE, "//", "\n", '\0', COLOUR_GREY, EFFECT_BOLD); // I really don't think I need to explain those, so...
	syntax_define (FALSE, FALSE, "#", "\n", '\\', COLOUR_YELLOW, EFFECT_ITALIC);
	syntax_define (FALSE, FALSE, "'", "'", '\\', COLOUR_PINK, EFFECT_BOLD);
	syntax_define (FALSE, FALSE, "\"", "\"", '\\', COLOUR_PINK, EFFECT_NORMAL);

	for (word = 0; word != (int) (sizeof (keywords) / sizeof (keywords [0])); ++word) {
		syntax_define (FALSE, TRUE, keywords [word], separators, '\0', COLOUR_YELLOW, EFFECT_BOLD);
	}

	syntax_define (TRUE, FALSE, "()[]{}", "", '\0', COLOUR_BLUE, EFFECT_NORMAL);
	syntax_define (TRUE, FALSE, ".,:;<=>+*-/%!&~^?|", "", '\0', COLOUR_CYAN, EFFECT_NORMAL);

	syntax_define (TRUE, TRUE, "0123456789", separators, '\0', COLOUR_PINK, EFFECT_BOLD);
	syntax_define (TRUE, TRUE, "abcdefghijklmnopqrstuvwxyz", separators, '\0', COLOUR_WHITE, EFFECT_NORMAL);
	syntax_define (TRUE, TRUE, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", separators, '\0', COLOUR_WHITE, EFFECT_BOLD);
	syntax_define (TRUE, TRUE, "_", separators, '\0', COLOUR_WHITE, EFFECT_ITALIC);
}

void syntax_highlight_ada (void) {
	char * separators = ".,:;<=>#+*-/&|()'\" \t\r\n";

	char * keywords [] = {
		"abort", "else", "new", "return", "abs", "elsif", "not", "reverse", "abstract", "end",
		"null", "accept", "entry", "select", "access", "of", "separate", "aliased", "exit", "or",
		"some", "all", "others", "subtype", "and", "for", "out", "array", "function", "at",
		"tagged", "generic", "package", "task", "begin", "goto", "pragma", "body", "private", "then",
		"type", "case", "in", "constant", "until", "is", "raise", "use", "if", "declare",
		"range", "delay", "limited", "record", "when", "delta", "loop", "rem", "while", "digits",
		"renames", "with", "do", "mod", "requeue", "xor", "procedure", "protected", "interface", "synchronized",
		"exception", "overriding", "terminate"
	};

	int word;

	if (syntax_count != 0) {
		syntax_delete ();
	}

	syntax_define (FALSE, FALSE, "--", "\n", '\0', COLOUR_GREY, EFFECT_BOLD);
	syntax_define (FALSE, FALSE, "'", "'", '\\', COLOUR_PINK, EFFECT_BOLD);
	syntax_define (FALSE, FALSE, "\"", "\"", '\\', COLOUR_PINK, EFFECT_NORMAL);

	for (word = 0; word != (int) (sizeof (keywords) / sizeof (keywords [0])); ++word) {
		syntax_define (FALSE, TRUE, keywords [word], separators, '\0', COLOUR_YELLOW, EFFECT_BOLD);
	}

	syntax_define (TRUE, FALSE, "()#", "", '\0', COLOUR_BLUE, EFFECT_NORMAL);
	syntax_define (TRUE, FALSE, ".,:;<=>+*-/&|'", "", '\0', COLOUR_CYAN, EFFECT_NORMAL);

	syntax_define (TRUE, TRUE, "0123456789", separators, '\0', COLOUR_PINK, EFFECT_BOLD);
	syntax_define (TRUE, TRUE, "abcdefghijklmnopqrstuvwxyz", separators, '\0', COLOUR_WHITE, EFFECT_NORMAL);
	syntax_define (TRUE, TRUE, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", separators, '\0', COLOUR_WHITE, EFFECT_BOLD);
	syntax_define (TRUE, TRUE, "_", separators, '\0', COLOUR_WHITE, EFFECT_ITALIC);
}

void syntax_render_file (char * text_file, int x, int y) {
	char * text_data; // This local variable will hold our data.

	int reset_x, reset_y; // Since we're using curses, we want to reset the offset.

	curses_configure (); // Curses configuration, aka printing ugly text.

	switch (file_type (text_file)) { // Depending on our file extension, we select highlighting.
		case FILE_TYPE_C_SOURCE:
		case FILE_TYPE_C_HEADER:
			syntax_highlight_c ();
			break;
		case FILE_TYPE_ADA_BODY:
		case FILE_TYPE_ADA_SPECIFICATION:
			syntax_highlight_ada ();
			break;
		default:
			break;
	}

	text_data = file_record (text_file); // And, imagine, importing our file data into a buffer!

	reset_x = x;
	reset_y = y;

	for (curses_active = 1; curses_active != 0; ) { // We enter our main subprogram loop.
		int offset, select, length;

		curses_render_background (' ', COLOUR_WHITE, EFFECT_NORMAL); // We need to clear the screen buffer before rendering.

		x = reset_x;
		y = reset_y;

		select = syntax_count; // I intentionally set this to an invalid value.
		length = 0;

		for (offset = 0; offset < string_length (text_data); offset += length) { // And it's time to start rendering our file.
			int suboffset, colour, effect;

			select = syntax_select (& text_data [offset], & length); // Here we're evaluating variables 'select' and 'length'.

			// We can do the same thing in 2 lines of code, but it's less readable in my opinion, I prefer longer verbose way below...
			// colour = (select >= syntax_count) ? COLOUR_WHITE : syntax_colour [select];
			// effect = (select >= syntax_count) ? EFFECT_NORMAL : syntax_effect [select];
			// Or, if you find this more intuitive:
			// colour = (select < syntax_count) ? syntax_colour [select] : COLOUR_WHITE;
			// effect = (select < syntax_count) ? syntax_effect [select] : EFFECT_NORMAL;
			if (select >= syntax_count) { // Here, we're handling error value of 'syntax_select'.
				colour = COLOUR_WHITE;
				effect = EFFECT_NORMAL;
			} else {
				colour = syntax_colour [select];
				effect = syntax_effect [select];
			}

			for (suboffset = 0; suboffset < length; ++suboffset) { // Sadly, we need to render them one by one character.
				if (text_data [offset + suboffset] == CHARACTER_LINE_FEED) { // Rendering of blank characters isn't counted, so:
					x = reset_x; // If there's a new line, we need to reset 'x' value.
					y += 1; // And increment 'y' value.
				} else if (text_data [offset + suboffset] == CHARACTER_TAB_HORIZONTAL) { // If there's a tab, we offset 'x' value by normal count.
					x += 8; // Normal indentation is 8-characters wide.
				} else {
					curses_render_character (text_data [offset + suboffset], colour, effect, x, y); // Finally, we can render it character by character.
					x += 1;
				}
			}
		}

		curses_synchronize (); // Lastly, we synchronize our terminal.
	}

	text_data = deallocate (text_data); // And deallocate the memory when we exit the subprogram.
}

#endif