{"id":621786,"date":"2023-03-25T09:48:52","date_gmt":"2023-03-25T14:48:52","guid":{"rendered":"https:\/\/news.sellorbuyhomefast.com\/index.php\/2023\/03\/25\/explaining-my-fast-6502-code-generator\/"},"modified":"2023-03-25T09:48:52","modified_gmt":"2023-03-25T14:48:52","slug":"explaining-my-fast-6502-code-generator","status":"publish","type":"post","link":"https:\/\/newsycanuse.com\/index.php\/2023\/03\/25\/explaining-my-fast-6502-code-generator\/","title":{"rendered":"Explaining my fast 6502 code generator"},"content":{"rendered":"<div>\n<p>\n        To learn how optimizing compilers are made, <a href=\"http:\/\/pubby.games\/nesfab.html\">I built one<\/a> targeting the <a href=\"https:\/\/en.wikipedia.org\/wiki\/MOS_Technology_6502\">6502 architecture<\/a>.<br \/>\n        In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to.\n        <\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/nesfab\/combined.png\"><\/p>\n<p>I reckon my compiler isn&#8217;t doing more when it comes to high-level optimizations,<br \/>\n        so the gains must be from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Code_generation_(compiler)\">code generation<\/a> side.<br \/>\n        This makes sense, as most compilers are multi-target, with backends<br \/>\n        designed for modern RISC-like systems, not the ancient 6502.<br \/>\n        It doesn&#8217;t matter how good GCC or LLVM&#8217;s high-level optimizations are<br \/>\n        if they falter at the last leg of the race.\n        <\/p>\n<p>Still, my compiler also beats those designed for retro and embedded systems, like VBCC, SDCC, and KickC.<br \/>\n        For this reason, it seemed like a good idea to write about my technique.<\/p>\n<h2>Cheating Disclaimer<\/h2>\n<p>Before I get into it, I want to cover three areas my compiler has an advantage, which aren&#8217;t tied to algorithmic design.<br \/>\n        I call these my &#8220;cheats&#8221;, as other compilers <i>could<\/i> do these things, but there are trade-offs involved.<\/p>\n<p>\n        First, my compiler generates what are know as<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Illegal_opcode\">&#8220;illegal&#8221; instructions<\/a>.<br \/>\n        An illegal instruction is one which isn&#8217;t officially documented by the manufacturer,<br \/>\n        but still exists in the hardware.<br \/>\n        On most 6502 chips, a few illegal instructions exist which combine the behavior of two &#8220;legal&#8221; instructions.<br \/>\n        Their use can save a few cycles, but not every backend generates them.\n        <\/p>\n<p>Second, some of my compiler&#8217;s gains can be explained by its computational effort.<br \/>\n        My compiler spends the majority of its CPU budget (a few milliseconds) doing instruction selection,<br \/>\n        but not all compilers do.<br \/>\n        Typically, the more time spent searching for a solution, the better the results.\n        <\/p>\n<p>\n        Lastly, there is loop unrolling and other optimizations that trade space efficiency for speed,<br \/>\n        which is hard to compare fairly when benchmarking different compilers.<br \/>\n        I tried to pick tests which weren&#8217;t affected by this,<br \/>\n        but obviously I can&#8217;t be perfect.\n        <\/p>\n<p>With my conscience clear, onto the algorithm!<\/p>\n<h2>A Quick, Vague Overview<\/h2>\n<p>\n        I&#8217;ve been calling my own attempt &#8220;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Outsider_art\">outsider art<\/a>&#8220;,<br \/>\n        as I didn&#8217;t know much about code generation when writing it.<br \/>\n        In fact, the technique didn&#8217;t even originate in my compiler;<br \/>\n        I based it on an old animation system I wrote years prior.<br \/>\n        My algorithm is interesting because it combines instruction selection<br \/>\n        with register allocation, but also because it&#8217;s written using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Continuation\">continuations<\/a>.<br \/>\n        In all my years of programming, it&#8217;s the only practical use I&#8217;ve found for such black magic.\n        <\/p>\n<p>\n        To briefly explain how it works, I&#8217;ll say that each basic block in my<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Intermediate_representation\">IR<\/a> is represented as a<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Directed_acyclic_graph\">DAG<\/a> in<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Static_single-assignment_form\">SSA form<\/a>.<br \/>\n        The first step of code generation is to break both of these properties,<br \/>\n        converting basic blocks to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Total_order\">totally-ordered<\/a> lists that do not contain phi nodes.\n        <\/p>\n<p>\n        The ordered basic blocks are then processed from beginning to end,<br \/>\n        generating multiple assembly code combinations per operation.<br \/>\n        The list of combinations will grow at an exponential rate,<br \/>\n        but most can be pruned using ideas from <a href=\"https:\/\/en.wikipedia.org\/wiki\/Dynamic_programming\">dynamic programming<\/a><br \/>\n        and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Branch_and_bound\">branch-and-bound<\/a>.\n        <\/p>\n<p>\n        To improve the generated code, output states of each basic block are fed into their successor blocks as inputs.<br \/>\n        This process repeats until a fixed point is reached.<br \/>\n        A cooling schedule a-la <a href=\"https:\/\/en.wikipedia.org\/wiki\/Simulated_annealing\">simulated annealing<\/a><br \/>\n        is applied to find results faster.\n        <\/p>\n<p>\n        After this, each basic block will have multiple code sequences to choose from.<br \/>\n        The most expensive ones can be heuristically pruned,<br \/>\n        then a good selection can be made by solving a <a href=\"https:\/\/beza1e1.tuxen.de\/articles\/pbqp.html\">partitioned boolean quadratic problem<\/a> (PBQP).\n        <\/p>\n<p>\n        Finally, additional instructions are inserted as transitions between the basic blocks,<br \/>\n        and a few optimization passes are run on the resulting assembly code for good measure.\n        <\/p>\n<hr>\n<h2>A More Detailed Explanation<\/h2>\n<h3>Basic Block IR<\/h3>\n<p>As stated, the<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Intermediate_representation\">IR<\/a><br \/>\n        of each basic block is a<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Directed_acyclic_graph\">DAG<\/a> in<br \/>\n        in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Static_single-assignment_form\">SSA form<\/a>,<br \/>\n        which sounds complicated until you see a picture of it.\n        <\/p>\n<p>Given the code below:<\/p>\n<pre>\nfoo = fn() ^ 5\nreturn foo - (foo & 3)<\/pre>\n<p>The IR DAG would look like this:<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/dag.png\"><\/p>\n<p>Each node in the graph represents a value, with edges denoting the flow of data.<br \/>\n        To compute a value, the nodes pointing to it must be computed first \u2014<br \/>\n        e.g. to compute the node labeled <code>(&)<\/code>, the nodes <code>(^)<\/code> and <code>(3)<\/code> must be computed first.<br \/>\n        It&#8217;s a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Dependency_graph\">dependency graph<\/a>.\n        <\/p>\n<h3>Ordering<\/h3>\n<p>The first step of code generation is to take the IR and create a<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Total_order\">total order<\/a><br \/>\n        out of its nodes, positioning each node to occur after its dependencies.<br \/>\n        This is akin to putting each node on a number line such that every edge points in the same direction (downwards in the diagram).<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/scheduled.png\"><\/p>\n<p>The generated assembly code will follow this order.<\/p>\n<h4>How does one find this order?<\/h4>\n<p>It&#8217;s easy to do using a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Topological_sorting\">topological sort<\/a>,<br \/>\n        but there&#8217;s a problem: DAGs can have multiple valid orderings, and some will generate better code than others.<\/p>\n<p>To find a good ordering (one which generates efficient code),<br \/>\n        the topological sorting algorithm is still used, but it can be modified using a few heuristics.<br \/>\n        First, &#8220;fake&#8221; edges can be added to the graph to constrain nodes to occur after other nodes.<br \/>\n        Second, a greedy algorithm can be used to influence the traversal order of the topological sort.<br \/>\n        In my compiler, I prioritize nodes which form the<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Longest_path_problem\">longest path<\/a> through the graph,<br \/>\n        which produces an ordering that has small<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Live-variable_analysis\">live ranges<\/a>,<br \/>\n        resulting in less register pressure and stores.\n        <\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/dag2.png\"><br \/>\n        <br \/><sup>Here&#8217;s an actual DAG from my compiler.<\/sup><\/p>\n<hr>\n<h3>Basic Block Instruction Selection &#8211; Part 1: Animation<\/h3>\n<p>With the IR prepared, now it&#8217;s time to generate some assembly code.<br \/>\n        I mentioned earlier that my algorithm was based on an animation system,<br \/>\n        and although that sounds weird and unrelated, it will be easier for me<br \/>\n        to explain the animation system first and then extend it to work like my compiler.<\/p>\n<p>The animation I&#8217;m talking about was for a NES game.<br \/>\n        On that system, animation can be done by loading a new tileset into video memory each frame,<br \/>\n        overwriting the previous frame&#8217;s tileset.\n        <\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/anim.jpg\"><\/p>\n<p>Most games accomplish this using special hardware on the cartridge called a mapper,<br \/>\n        which lets the system page a tileset into video memory quickly.<br \/>\n        But in my case, I was making a game that wasn&#8217;t going to use that piece of hardware.<br \/>\n        Instead, I had to copy the bytes using the CPU as a middle-man.\n        <\/p>\n<p>The obvious way to copy bytes is with a loop (think &#8220;memcpy&#8221; from C),<br \/>\n        but this turned out to be too slow.<br \/>\n        For each iteration, too much time was being spent incrementing, then comparing,<br \/>\n        then branching back to the loop.<\/p>\n<p>The only way I could see it working was if I unrolled the loop \u2014 not once \u2014 but fully.<br \/>\n        For example, if I wanted to copy the byte sequence (10, 20, 30) to video memory, I could write assembly code like this:<\/p>\n<pre>\nlda #20      ; Load constant: 20\nsta PPUDATA  ; Upload to video RAM\n\nlda #30      ; Load constant: 30\nsta PPUDATA  ; Upload to video RAM\n\nlda #40      ; Load constant: 40\nsta PPUDATA  ; Upload to video RAM<\/pre>\n<p>In other words, I was writing code like a noob who quit programming class before learning what a for-loop was.<br \/>\n        But it was fast, and that&#8217;s what mattered.<\/p>\n<p>Rather than writing these sequences by hand, it was clear I should automate it.<br \/>\n        A naive approach would be to write a script which generates one load and one store per uploaded byte,<br \/>\n        but it&#8217;s possible to do better.<br \/>\n        As registers preserve their values, a sequence like (1, 8, 1, 8, 1, 8) requires only two loads:<\/p>\n<pre>\nldx #1      ; Load register X\nlda #8      ; Load register A\nstx PPUDATA ; Upload register X to video RAM\nsta PPUDATA ; Upload register A to video RAM\nstx PPUDATA\nsta PPUDATA\nstx PPUDATA\nsta PPUDATA<\/pre>\n<p><b>How does one determine these loads optimally?<\/b><\/p>\n<p>To keep things simple, I&#8217;ll first explain how to solve this for a CPU with one register,<br \/>\n        then expand to a CPU with three.<\/p>\n<p>For a CPU with one register, the algorithm boils down to finding redundant loads and eliding them.<br \/>\n        By redundant load, I mean a load which has no effect, like the second load below:<\/p>\n<pre>\nlda #10\nsta PPUDATA\nlda #10       \/\/ redundant\nsta PPUDATA<\/pre>\n<p>Finding and eliding redundant loads is easy to do; just keep track of the previously loaded value.<br \/>\n        The pseudocode below does just that, optimally:<\/p>\n<pre>\nprev = -1\n\nfor each byte to upload:\n    if prev != byte:\n        emit_load(byte)\n        prev = byte\n\n    emit_store()<\/pre>\n<p>This idea can <i>almost<\/i> be extended to use three registers<br \/>\n        by using three &#8220;prev&#8221; variables instead of one.<br \/>\n        You might approach it using a struct or record type to keep things clean:<\/p>\n<pre>\nstruct cpu_state\n{\n    int a\n    int x\n    int y\n}<\/pre>\n<p>Unfortunately, there&#8217;s a problem.<br \/>\n        Although we can keep track of each register this and elide stores,<br \/>\n        it&#8217;s not clear how we should handle loads.<br \/>\n        Given a value not found in any register, which register should load it?\n        <\/p>\n<p>We can make these decisions optimally by generating assembly code for every possible combination of loads (brute force)<br \/>\n        and only keeping the best one, but this quickly becomes infeasible.<br \/>\n        For each byte that needs loading, there are three choices possible.<br \/>\n        The number of combinations will grow at an exponential rate.<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/tree.png\"><br \/>\n        <br \/><sup>Diagram of the brute-force search tree.<\/sup><\/p>\n<p>Luckily, this brute-force approach is still usable if a trick is used.<br \/>\n        <b>It&#8217;s possible to prune most combinations, shrinking the size of the search space.<\/b><\/p>\n<p>If two pieces of assembly code have been generated, and both have the same observable side effect,<br \/>\n        the higher costing one can be pruned.<br \/>\n        By side effects, I mean that both upload the same byte sequence<br \/>\n        and finish with identical register states.<br \/>\n        For example, these two pieces of code are equivalent:\n        <\/p>\n<pre>\nlda #1\nsta PPUDATA\nldx #2\nstx PPUDATA\nldy #3\nsty PPUDATA\nsta PPUDATA\n<\/pre>\n<pre>\nldy #1\nsty PPUDATA\nldx #2\nstx PPUDATA\nldy #3\nsty PPUDATA\nlda #1\nsta PPUDATA\n<\/pre>\n<p>Both upload the sequence (1, 2, 3, 1), and both leave the registers at (A = 1, X = 2, Y = 3).<br \/>\nThe second one requires an extra load though, so it can be pruned.<\/p>\n<p>Keep in mind that this works for partial snippets of code too.<br \/>\nInstead generating entire code sequences and comparing them,<br \/>\nit&#8217;s possible to generate the code instruction-by-instruction and compare at each step.<\/p>\n<p>To implement this, the algorithm will handle the combinations in a breadth-first fashion.<br \/>\nIt will first generate all the combinations for the first byte and prune some,<br \/>\nthen it will generate all the combinations for the second byte and prune some, and so on.<br \/>\nThe pruning is done by storing the results in a hash table, indexed by each combination&#8217;s &#8220;cpu_state&#8221;<br \/>\n(the struct introduced earlier).<\/p>\n<p>For better performance, an additional method of pruning will be used.<br \/>\nIf we keep track of the current best combination at each iteration step,<br \/>\nwe can prune other combinations that are slower by three or more loads.<br \/>\nThis preserves optimality of the final result, as three loads is enough to convert from any &#8220;cpu_state&#8221; to another.\n<\/p>\n<p>Putting all this together, the algorithm to elide loads in a 3-register CPU becomes:<\/p>\n<pre>\nprev_hash_map.clear()\n\nfor each byte to upload:\n    next_lowest_cost = INFINITY\n    next_hash_map.clear()\n\n    for each combination in prev_hash_map:\n        if combination.cost >= prev_lowest_cost + 3:   \/\/ Second method of pruning\n            continue\n\n        for each register (A, X, Y):\n            new_combination = combination\n            if new_combination.cpu_state.register != byte:\n                new_combination.code.append_load()\n                new_combination.cost += 1\n                new_combination.cpu_state.register = byte\n\n            if combination.cost < next_lowest_cost:\n                next_lowest_cost = combination.cost\n\n            next_hash_map.insert(new_combination)\n            if next_hash_map already had this cpu_state:  \/\/ First method of pruning\n                Keep the lowest-costing one\n\n    swap(next_hash_map, prev_hash_map)\n    prev_lowest_cost = next_lowest_cost<\/pre>\n<p>(If you want to see actual code, I've put a C++ implementation<br \/>\n<a href=\"https:\/\/coliru.stacked-crooked.com\/a\/d9b6528f72b870a1\">here<\/a><br \/>\nand <a href=\"https:\/\/pastebin.com\/raw\/HAh3smh1\">here<\/a>.)<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/bebop.gif\"><br \/>\n        <br \/><sup>I converted the Cowboy Bebop intro to a NES ROM using this animation process.<br \/>\n        The GIF runs at a lower framerate.<\/sup><\/p>\n<hr>\n<h3>Basic Block Instruction Selection - Part 2: The Compiler<\/h3>\n<p>Alright, alright. . .  we can generate code for animations \u2014 So what?<br \/>\n        This is an article about compilers, right?<\/p>\n<p>\n        Well, consider this: <b>Instead of processing a sequence of bytes with this algorithm,<br \/>\n            let's process the sequence of IR operations created earlier.<\/b><\/p>\n<p>Here's the ordered sequence of IR operations, from earlier.<br \/>\n        I've added variable names (Q, R, S, T) as placeholders for each node:<\/p>\n<pre>\nQ = call fn\nR = Q ^ 5\nS = R & 3\nT = R - S\nreturn T<\/pre>\n<p>Like we did for the sequence of bytes, we can generate combinations per operation and prune the worst ones,<br \/>\n        but there are two big differences:<\/p>\n<ol>\n<li>Different IR operations generate different assembly instructions (not just loads).<\/li>\n<li>The \"cpu_state\" struct will also track variable names.<\/li>\n<\/ol>\n<p>Putting this into practice, I'll show how the algorithm handles the ordered IR.<br \/>\n        The algorithm starts by generating combinations for the first operation: <b>\"call fn\"<\/b>,<br \/>\n        which there is only one:<\/p>\n<pre>\n; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)<\/pre>\n<p>After that, it will generate the combinations for the <b>XOR operation (^)<\/b>.<br \/>\n        There's a single assembly instruction which does this: \"EOR\",<br \/>\n        but since XOR is commutative, two code combinations exist:\n        <\/p>\n<pre>; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)\n; xor:\nlda #5 ; Load 5 into register A\neor Q  ; XOR register A with variable Q\nsta R  ; Store the result into variable R\n       ; cpu_state is (A = R, X = ?, Y = ?)<\/pre>\n<pre>; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)\n; xor:\neor #5 ; XOR register A with variable 5\nsta R  ; Store the result into variable R\n       ; cpu_state is (A = R, X = ?, Y = ?)<\/pre>\n<p>The second combination does not need a load instruction for the Q variable, as it's already loaded in the register.<br \/>\n        Since both of these code sequences have the same side effect, the first combination will be pruned.<\/p>\n<p>Now, for the <b>AND operation (&)<\/b>.<br \/>\n        While there is an AND instruction, there's also an illegal instruction called SAX,<br \/>\n        which perform a bitwise AND of register A with register X and stores the result.<br \/>\n        This results in four combinations, so I'll only show the two that won't get pruned:\n        <\/p>\n<pre>; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)\n; xor:\neor #5 ; XOR register A with variable 5\nsta R  ; Store the result into variable R\n       ; cpu_state is (A = R, X = ?, Y = ?)\n; and:\nand #3   ; AND register A with 5\nsta S    ; Store the result into variable S\n         ; cpu_state is (A = S, X = ?, Y = ?)<\/pre>\n<pre>; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)\n; xor:\neor #5 ; XOR register A with variable 5\nsta R  ; Store the result into variable R\n       ; cpu_state is (A = R, X = ?, Y = ?)\n; and:\nldx #3   ; Load 3 into register X\nsax S    ; Store the AND of registers A and X into variable S\n         ; cpu_state is (A = R, X = 3, Y = ?)<\/pre>\n<p>For the <b>subtraction operation (-)<\/b>, the SBC instruction will be used following a SEC instruction to set the carry flag.<br \/>\n        As subtraction isn't commutative, there's only one way to do this, but since the previous operation generated two combinations,<br \/>\n        this operation will result in two combinations too.\n        <\/p>\n<pre>; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)\n; xor:\neor #5 ; XOR register A with variable 5\nsta R  ; Store the result into variable R\n       ; cpu_state is (A = R, X = ?, Y = ?)\n; and:\nand #3   ; AND register A with 5\nsta S    ; Store the result into variable S\n         ; cpu_state is (A = S, X = ?, Y = ?)\n; subtract:\nlda R    ; Load R into register A\nsec\nsbc S    ; Subtract S from register A\nsta T    ; Store the result into variable T\n         ; cpu_state is (A = T, X = ?, Y = ?)<\/pre>\n<pre>; call fn:\njsr fn   ; Call the function\nsta Q    ; Store the result into variable Q\n         ; cpu_state is now (A = Q, X = ?, Y = ?)\n; xor:\neor #5 ; XOR register A with variable 5\nsta R  ; Store the result into variable R\n       ; cpu_state is (A = R, X = ?, Y = ?)\n; and:\nldx #3   ; Load 3 into register X\nsax S    ; Store the AND of registers A and X into variable S\n         ; cpu_state is (A = R, X = 3, Y = ?)\n; subtract:\nsec\nsbc S    ; Subtract S from register A\nsta T    ; Store the result into variable T\n         ; cpu_state is (A = T, X = 3, Y = ?)<\/pre>\n<p>Lastly, the <b>return operation<\/b>, which is implemented with a single RTS instruction.<br \/>\n        This one is so simple, I'm not going to illustrate it with a code example.<br \/>\n        Simply imagine a \"RTS\" appended onto the end of the previous two combinations.\n        <\/p>\n<h4>Finishing up<\/h4>\n<p>Once the combinations have been run, the algorithm selects the lowest-costing one.<br \/>\n        In this case, it's the one that uses a SAX instruction instead of AND,<br \/>\n        as it requires one less instruction.<\/p>\n<p>The resulting selection looks like this, but we're not done yet:<\/p>\n<pre>\njsr fn\nsta Q\neor #5\nsta R\nldx #3\nsax S\nsec\nsbc S\nsta T\nrts<\/pre>\n<p>The issue is, most of these store instructions are unecessary, as nothing will ever load them.<br \/>\n        To fix this, an extra pass is run which identifies stores that are never loaded.<br \/>\n        This results in the final code:<\/p>\n<pre>\njsr fn\neor #5\nldx #3\nsax S\nsec\nsbc S\nrts<\/pre>\n<p>In the actual compiler, the code generator is a bit smarter about stores,<br \/>\n        and can usually estimate if they're needed or not while generating the combinations.<br \/>\n        This is important, as the presence of a store influences a combination's cost.\n        <\/p>\n<h3>Basic Block Instruction Selection - Part 3: Continuations<\/h3>\n<p>As shown, compiling an IR operation boils down to generating a bunch of combinations.<br \/>\n        However, if you were write code to translate IR operations into assembly combinations,<br \/>\n        you'd find it to be a repetive, menial coding task.\n        <\/p>\n<p>The issue relates to the number of combinations.<br \/>\n        For each IR operation, there will be a few valid instructions that can implement it,<br \/>\n        and each of those instructions will have its own variants, resulting in more combinations.<br \/>\n        Multiply these together (and double it for communicative operations),<br \/>\n        and you'll end up with a lot of combinations per IR operation.<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/xzibit.jpg\"><br \/>\n        <br \/><sub>Our combinations have combinations!<\/sub><\/p>\n<p>It would be nice to abstract these details away and create building blocks for<br \/>\n        implementing operations. Luckily, I found a solution by using continuations.<\/p>\n<p>Each step (or \"building block\") will be implemented as a function written in<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Continuation-passing_style\">continuation-passing style<\/a>.<br \/>\n        Instead of returning, these functions will call their continuations with each instruction they want to generate a combination for.<br \/>\n        For example, to generate three combinations, the continuation will be called three times with three different instructions.\n        <\/p>\n<p>Here's how a function for loading a value in register A might look, in pseudocode:<\/p>\n<pre>\nload_A(desired_value, cpu_state, continuation)\n    if cpu_state.A == desired_value\n        continuation(cpu_state, NOP)\n    else if cpu_state.X == desired_value\n        continuation(cpu_state, TXA)\n    else if cpu_state.Y == desired_value\n        continuation(cpu_state, TYA)\n    else\n        continuation(cpu_state, LDA)\n        continuation(cpu_state, LAX)<\/pre>\n<p>\n        If the value is already present in one of the registers: either nothing happens (NOP), or a transfer between registers occurs (TXA, TYA).<br \/>\n        Otherwise, two combinations will occur: one using the LDA instruction,<br \/>\n        and another using the LAX instruction.\n        <\/p>\n<p>The point of writing functions like this is that they can be composed.<br \/>\n        With a bit of plumbing, it's possible to end up with a<br \/>\n        <a href=\"https:\/\/en.wikipedia.org\/wiki\/Domain-specific_language\"\">domain-specific language<\/a> that resembles:<\/p>\n<pre>\ncompose(load_A(L), load_X(R), instruction(SAX))<\/pre>\n<p>A single line like this can cover 50 different assembly sequences, all handling different combinations of instructions and addressing modes.<br \/>\n        It's much easier to write than trying to enumerate all the possibilities by hand.<\/p>\n<h3>Control Flow - Part 1: Basics<\/h3>\n<p>Although the algorithm I've described is tied to basic blocks, it can handle control flow (branches, loops, etc) too.<br \/>\n        Branch operations are not special \u2014 they generate assembly combinations like any other.\n        <\/p>\n<p>For example, take a look at this control flow graph (the rectangles are basic blocks):<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/cfg.png\"><\/p>\n<p>Each basic block in this graph will be assigned a unique label name (L1, L2, L3, etc).<\/p>\n<p>Now we'll generate the code for each basic block <i>separately<\/i>, implementing control flow as instructions involving these labels.<br \/>\n        Once every basic block has its code generated, we can concatenate the results:<\/p>\n<pre>\nL1:\n    ; (More code would go here)\n    beq L1 ; Branch instruction\n    bne L2\nL2:\n    ; (More code would go here)\n    jmp L4 ; Jump instruction\nL3:\n    ; (More code would go here)\n    beq L1\n    bne L4\nL4:\n    ; (More code would go here)\n    rts<\/pre>\n<p>With this implemented, the compiler now works.<br \/>\n        It can generate executables!<\/p>\n<h3>Control Flow - Part 2: Optimizations<\/h3>\n<p>The process above works, but does not produce good code across basic block boundaries.<br \/>\n        The issue stems from compiling basic blocks separately without sharing knowledge of the register states between them.<br \/>\n        If one basic block needs a value from a previous block, it will emit a load instruction<br \/>\n        even if it's redundant.<\/p>\n<p>For example, this is how a loop would look:<\/p>\n<pre>\nL0:\n    ldx #0 ; Load register X with 0\n    stx I  ; Store variable I\nL1:\n    ldx I  ; Load variable I\n    inx    ; Increment register X\n    stx I  ; Store variable I\n    bne L1 ; Branch if register X is not zero<\/pre>\n<p>But this is how a loop <i>should<\/i> look:<\/p>\n<pre>\n    ldx #0 ; Load register X with 0\nL1:\n    inx    ; Increment register X\n    bne L1 ; Branch if register X is not zero<\/pre>\n<p>The second loop lacks redundant loads.<\/p>\n<p>To solve this, the final \"cpu_state\" from basic blocks need to be shared.<br \/>\n        This can be done by passing each basic block's result into its successors as input.\n        <\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/prop1.png\"><br \/>\n        <br \/><sup>Old method: Generate code for each basic block individually.<\/sup><\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/prop2.png\"><br \/>\n        <br \/><sup>New method: Use the output states from generating a basic block as the input states to the next.<\/sup><\/p>\n<p>This is an iterative process where each node can be processed multiple times.<br \/>\n        When a node is processed, it will trigger its successors to be processed next passing outputs as inputs. This will repeat until a fixed point is reached.\n        <\/p>\n<p>Going back to the loop example, initial iterations will produce inefficient code,<br \/>\n        but later iterations will propagate the register states through the loop from (L3) to (L1).<br \/>\n        Once variable \"I\" appears pre-loaded in register X, optimal code will follow.<\/p>\n<p>To expedite this process, it's beneficial to only pass the lowest-costing results along<br \/>\n        and get stricter each iteration.<br \/>\n        This resembles <a href=\"https:\/\/en.wikipedia.org\/wiki\/Simulated_annealing\">simulated annealing<\/a>.\n        <\/p>\n<h3>Control Flow - Part 3: Some Loads Required<\/h3>\n<p>Although the method above generates improved code,<br \/>\n        it's no longer possible to get a working program by concatenating the lowest-costing combinations.<br \/>\n        The issue is, each code combination now expects some specific register values as inputs,<br \/>\n        but the output of one combination may not correspond to the input of another.<br \/>\n        You could get nonsensical results like initializing a loop variable using the Y register,<br \/>\n        but incrementing using the X register:<\/p>\n<pre>\n; This code is incorrectly compiled!\nL0:\n    ldy #0   ; Output Y as the iteration count\n             ; NOTE: A sty instruction was removed by the \"remove unused stores\" pass.\nL1:\n             ; Input X as the iteration count\n    inx\n    bne L1<\/pre>\n<p>To fix this, the input states of each basic block must match the output states of their predecessors.<br \/>\n        If there is a discrepency, a load can be inserted to correct it.<\/p>\n<p>By inserting loads, the code becomes:<\/p>\n<pre>\nL0:\n    ldy #0\n    sty I   ; This store is no longer removed.\nL1_entry:\n    ldx I   ; Inserted load\nL1:\n    inx\n    bne L1<\/pre>\n<p>Which works, but isn't ideal.<\/p>\n<h3>Control Flow - Part 4: PBQP<\/h3>\n<p>The woes above are caused by selecting only the lowest-costing combinations per basic block,<br \/>\n        without taking into consideration the compatability between input\/output states.<br \/>\n        Ideally, we'd factor in the cost of inserted loads.<\/p>\n<p>Doing so is an optimization problem, and one which closely maps to the<br \/>\n        <a href=\"https:\/\/beza1e1.tuxen.de\/articles\/pbqp.html\">Partitioned Boolean Quadratic Problem<\/a> (PBQP).<br \/>\n        If we can solve a PBQP problem, we can select optimal combinations for our basic blocks.<\/p>\n<p>Unfortunately, not much has been written on PBQP, and what does exist is hard to read for most:<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/pbqp.png\"><br \/>\n        <br \/><sup><a href=\"http:\/\/www.complang.tuwien.ac.at\/scholz\/pbqp.html\">Image Source<\/a><\/sup><\/p>\n<p>My advice is that instead of struggling over equations, it's easier to learn PBQP by thinking visually, in terms of graphs.\n        <\/p>\n<p>Imagine a graph where each node has a finite list of choices,<br \/>\n        and each of those choices has a cost.<br \/>\n        The goal PBQP is to pick one choice for each node, minimizing the total cost.<br \/>\n        The catch is, an additional cost is incurred for every edge in the graph<br \/>\n        based on the choices made at the edge's vertices.\n        <\/p>\n<p>The picture below illustrates such a graph.<br \/>\n        Each node has a choice of two colors, and those colors have a cost next to them.<br \/>\n        When selecting two colors, the cost of the edge must be paid as well.<br \/>\n        This is done by looking up the cost in a table unique to each edge, using the two selected colors as keys.\n        <\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/pbqp2.png\"><\/p>\n<p>Below, I've made some choices, and have crossed out everything I didn't choose.<br \/>\n        The total cost is the sum of the visible numbers, which is 3 + 1 + 7 + 3 + 6 + 7 = 27.<br \/>\n        The goal of PBQP is to find choices which minimizes this sum.<\/p>\n<p>        <img decoding=\"async\" src=\"http:\/\/pubby.games\/codegen\/pbqp3.png\"><\/p>\n<p>A <a href=\"https:\/\/pp.ipd.kit.edu\/uploads\/publikationen\/mbaumstark19bachelorarbeit.pdf\">fast algorithm<\/a><br \/>\n        exists to solve PBQP problems near-optimally.<br \/>\n        The algorithm works on the graph, reducing and combining nodes until none remain.<br \/>\n        At the end, the algorithm works backwards, using the steps it took to decide which choices to make.<\/p>\n<p>Going back to the compiler, the problem I was describing fits perfectly into this visualization:<\/p>\n<ul>\n<li>The graph is the control-flow graph.<\/li>\n<li>The nodes are the basic blocks.<\/li>\n<li>The node choices are the code combinations generated prior.<\/li>\n<li>The edge costs are the number of extra loads that have to be inserted given two code combinations.<\/li>\n<\/ul>\n<p>By implementing a PBQP solver and using it on our combinations, the end result will produce near-optimal code.<br \/>\n        The assembly loop which gave us trouble before will become:<\/p>\n<pre>\n    ldx #0\nL1:\n    inx\n    bne L1<\/pre>\n<p>Which is optimal.<\/p>\n<h2>Complexity and Performance<\/h2>\n<ul>\n<li>Creating the order is O(N<sup>2<\/sup>), but has a negligable runtime cost.<br \/>\n            In practice, typical code is very easy to order.<\/li>\n<li>I don't know what the complexity of going out of SSA is, but I'm assuming its O(N<sup>2<\/sup>).<br \/>\n                    Regardless, the runtime cost is negligable.<\/li>\n<li>Code-generating basic blocks is O(N) with respect to the number of IR operations, with a high constant multiplier.<br \/>\n                This has a impactful runtime cost, with the bottleneck being the hash table.<br \/>\n                I use a very fast hash-table of my own design, which alleviates this.\n            <\/li>\n<li>The PBQP problem is solved in O(N) relative to the number of edges, and O(N<sup>3<\/sup>) relative to the number of choices.<br \/>\n                This <i>can<\/i> be a big runtime cost, but isn't an issue so long as the number of choices is kept small.<br \/>\n                I reckon the PBQP solver could be implemented using SIMD if necessary, but I haven't found the need.<\/li>\n<li>Repeatedly generating basic blocks has some awful complexity,<br \/>\n                but because of the simulated annealing it becomes O(N).<br \/>\n                A significant cost at this step is cache misses caused by reading assembly combinations.<br \/>\n                With proper technique, many of these cache misses can be avoided.\n            <\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>This code generator has worked well enough for me.<br \/>\n        Although parts of it are hairy, the continuations make it easy to add new operations<br \/>\n        and I don't see much technical debt in the code. The generated assembly code has been good, albeit not perfect.<br \/>\n        As stated, it beat other compilers for the benchmarks I ran, but I'm sure others could beat mine as well.<br \/>\n        My only goal was to produce a good compiler, not the best.\n        <\/p>\n<p>As I mentioned in the beginning, I don't have deep knowlege of other code generation techniques.<br \/>\n        Most of the papers on code generation I've seen involve ILP \/ SAT solvers and take minutes, if not hours to compile a long function.<br \/>\n        Besides, everything nowadays is designed for RISC architectures, which is totally unlike the 6502.<br \/>\n        You might think that papers from the 1970s would be more useful, but in the 70s computers were sluggish<br \/>\n        and the algorithms designed for them were O(N) and compensating.<br \/>\n        You just don't see advice like \"hash everything and run it a million times\" back then.\n        <\/p>\n<p><a href=\"http:\/\/pubby.games\/index.html#softwarearticles\">[More programming articles]<\/a> <a href=\"http:\/\/pubby.games\/index.html\">[Back to my homepage]<\/a><a href=\"http:\/\/pubby.games\/work.html\">[Hire me]<\/a><\/p>\n<\/p><\/div>\n<p><a href=\"https:\/\/pubby.games\/codegen.html\" class=\"button purchase\" rel=\"nofollow noopener\" target=\"_blank\">Read More<\/a><br \/>\n Diego Noren<\/p>\n","protected":false},"excerpt":{"rendered":"<p>To learn how optimizing compilers are made, I built one targeting the 6502 architecture. In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to. I reckon my compiler isn&#8217;t doing more when it comes to high-level optimizations, so the gains must be from the code<\/p>\n","protected":false},"author":1,"featured_media":621787,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2515,36166,46],"tags":[],"class_list":{"0":"post-621786","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-explaining","8":"category-generator","9":"category-technology"},"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/621786","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/comments?post=621786"}],"version-history":[{"count":0,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/621786\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media\/621787"}],"wp:attachment":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media?parent=621786"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/categories?post=621786"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/tags?post=621786"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}