Lesson 3: Local Analysis & Optimization #348

sampsyo · 2023-08-21T20:19:07Z

sampsyo
Aug 21, 2023
Maintainer

The tasks for this lesson include implementing basic dead-code elimination (DCE) and, as the main event, implementing local value numbering (LVN). I ❤️ LVN!

MelindaFang-code · 2023-09-06T17:09:06Z

MelindaFang-code
Sep 6, 2023

Summarization of work

For this task, I implemented global dead code elimination to remove variables that are never used after their definition, a local optimization to remove redefinitions of variables that are not used, and local value numbering. For local value numbering, I handled copy propagation by first finding the value number of the variable that the id instruction copies, and then rewriting the id instruction to use the variable for that number, or changing it to a const. For common subexpression elimination, I check the 'op' field for commutativity and sort the arguments if it is add or mul. For constant folding, I only implemented add/mul/sub/div. Each time a value is not already stored in the table, I try to replace the args with the values stored in the table; by induction, each arg should already be a const value if it was computed earlier in the same block. I then do the calculation and replace the instruction with a const operation. Finally, I combined all the optimizations: I first run the local optimization and local value numbering, and then run global dead code elimination until convergence.
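
As an illustration of the commutativity and constant-folding steps described above, here is a minimal sketch. The helper names (`canonical_value`, `try_fold`, `trunc_div`) and the `consts` map from value numbers to known integers are hypothetical, not taken from the post, and truncating division is an assumption about how `div` should fold.

```python
# Hypothetical helpers; op coverage mirrors the post (add/mul/sub/div only).
COMMUTATIVE = {"add", "mul"}


def trunc_div(a, b):
    """Integer division truncating toward zero (assumed semantics for div)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


FOLDABLE = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
    "div": trunc_div,
}


def canonical_value(op, arg_nums):
    """Build the table key, sorting argument numbers for commutative ops."""
    if op in COMMUTATIVE:
        arg_nums = sorted(arg_nums)
    return (op, *arg_nums)


def try_fold(op, arg_nums, consts):
    """Return the folded integer if both arguments are known constants, else None."""
    if op not in FOLDABLE or any(n not in consts for n in arg_nums):
        return None
    a, b = (consts[n] for n in arg_nums)
    if op == "div" and b == 0:
        return None  # leave division by zero for the runtime to report
    return FOLDABLE[op](a, b)
```

With this shape, `add a b` and `add b a` produce the same key, while `sub` keeps its argument order. A fold that returns `None` simply falls back to ordinary value numbering.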

Testing

I first manually ran my program to produce Bril files and ran them using turnt to see if the results were correct. Then I ran the test cases in the core and examples folders using brench, and compared the profiling results from the existing dce.py, lvn.py, and my implementation. I noticed that my implementation performs better than the existing dce.py, but worse than lvn.py. On the test suite in examples/lvn, my implementation performs similarly to the provided lvn.py, but for more complex tests there are differences. My guess is that my LVN function does not implement all the optimization techniques and thus does not optimize as aggressively as the existing implementation. For instance, for constant folding there are operations like eq and le that I didn't fold.

Hardest part

What was the hardest part of the task? How did you solve this problem?
The hardest part was figuring out the edge cases. For example, there are many instructions that do not have a 'dest' field or an 'args' field, like labels and terminators. We also need to think about the case where, at the beginning, nothing exists in the tables yet. In addition, I spent some time figuring out how to handle the case where a dest will be overwritten. My original thought was to keep track of the existing variables and where they are used, and if a variable is redefined, go back and change them, because I was trying to do everything in one pass. Then I realized I should probably first scan the instructions bottom-up to find where each variable is last defined, and then compare against that when we encounter the variable.

keikun555 · 2023-09-06T21:05:23Z

keikun555
Sep 6, 2023

Summarize what you did.

Explain how you know your implementation works—how did you test it? Which test inputs did you use? Do you have any quantitative results to report?

For trivial dead code elimination, I referred to the pseudocode we wrote in class, and it was relatively straightforward to implement.
For local value numbering, I referred to the pseudocode given in Lesson 3 and found that there are a lot of edge cases we didn't consider in depth during class. One such edge case is how we want to handle value operations that may have side effects, for example, alloc calls and function calls that return a value. We want to give unique identifiers to such calls because we don't want to replace them with precomputed values -- we want the side effects as well. That's why in my value tuple type, I have an optional string for identifying such function/alloc calls.
I also implemented a python type hint system for Python dictionaries generated from Bril JSON files. This was mainly possible due to the Bril documentation, and the Literal and TypedDict types available from Python's typing and typing_extensions (pip install typing-extensions) packages. This library is definitely incomplete, and I will add more to it when needed for future tasks.

The .toml file for the benchmarks I ran is here (will need the bril library outside of the git repo) and the results are here. I found that lvn + tdce substantially decreased the total number of dynamic instances, whereas just running local value numbering, as expected, did not (and sometimes even added more instances!). As for the commutative lvn optimization, I ran the optimization + tdce on this test file and found that it successfully removed the commuting redundant case using this toml file and this csv output.

What was the hardest part of the task? How did you solve this problem?

As written previously, there were many edge cases I had to consider when writing my LVN implementation. The most difficult part of this is figuring out what the edge cases were from failed tests. For one, when the mem/adj2csr test case failed, I needed to find what my LVN optimizer changed. To do that, I needed to remove all comments and empty lines using sed -e '/^\s*#.*$/d' -e '/^\s*$/d' benchmarks/mem/adj2csr.bril and used diff to compare with what my optimizer produced. I then reverted block by block what was changed by the optimizer and found that the culprit was modifying

num_edges: int = call @adj2csr num_nodes adjmat csr_offset csr_edges;

to

num_edges: int = call @adj2csr num_nodes adjmat csr_offset csr_offset;

This was because csr_offset and csr_edges were both assigned by a call to the same function with the same arguments, and my optimizer was treating function calls as Values without side effects.

  csr_offset: ptr<int> = call @zeroarray sqsize;
  csr_edges: ptr<int> = call @zeroarray sqsize;

Handling function calls as side effect operations fixed the issue.
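
A minimal sketch of the fix described above, assuming a tuple-shaped value with an optional unique tag. The name `make_value`, the `SIDE_EFFECT_OPS` set, and the tag format are hypothetical. The instruction dicts follow the Bril JSON form.

```python
import itertools

_fresh = itertools.count()
SIDE_EFFECT_OPS = {"call", "alloc"}  # assumed set; extend for other impure ops


def make_value(instr, env):
    """Build an LVN table key for a Bril JSON instruction.

    Impure ops carry a never-repeated tag, so two textually identical
    calls can never be merged into one value.
    """
    # Map each argument to its value number; unknown names stay as names.
    args = tuple(env.get(a, a) for a in instr.get("args", []))
    funcs = tuple(instr.get("funcs", []))
    tag = None
    if instr["op"] in SIDE_EFFECT_OPS:
        tag = f"unique{next(_fresh)}"
    return (instr["op"], args, funcs, tag)
```

Under this scheme the two `call @zeroarray sqsize` instructions get different keys, so `csr_edges` is never rewritten to `csr_offset`.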

3 replies

keikun555 Sep 7, 2023

One addition -- I made a Python script to count the number of dynamic instances that were removed, and filtered the results down to the rows that removed at least one dynamic instance, in this csv file.

keikun555 Sep 11, 2023

Forgot to update the links that are dead -- it should work if the urls changed from task3 -> lesson3

keikun555 Sep 12, 2023

Updated with commit based URL

matth2k · 2023-09-06T23:26:00Z

matth2k
Sep 6, 2023

Summary
- Trivial Dead Code Elimination
- Local Value Numbering Implementation with Clobbering & Constant Folding
Implementation Details
- For TDCE, the implementation closely followed what we learned in lecture. Nothing particularly interesting to report.
- LVN, on the other hand, took a lot of effort to create a solution that lent itself to robust constant folding.
  - Part of the strength and weakness of my solution is that I created a class called Expr, which has four subclasses: BinaryExpr, UnaryExpr, Const, and Value. So in this case, an LVN number is a Value, which is itself a type of sub-expression. If an expression can be folded, we do so; otherwise we use the Value expression at face value and use the LVN table to rewrite the output variable as an id operation. This allowed me to implement constant folding, but it caused me a lot of confusion, since I changed the simple numbering algorithm so that Values are a subtype of expressions.
  - Finally, I implemented clobbering by storing the canonical variables of a Value as an entire set. This way, when we clobber a variable in the table, we don't always have to make a new variable name. Only if we are about to lose the subexpression (i.e. there is only 1 canonical variable left), I rewrite the variable with an id operation.
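
The clobbering scheme in the last bullet can be sketched roughly as follows. This is an illustration under assumptions, not the post's Expr-based code: the `ValueTable` class, its fields, and the `lvn.N` fresh-name format are all hypothetical, and a real pass would also need to check that the fresh name does not collide with a program variable.

```python
class ValueTable:
    """Hypothetical table that tracks every variable holding each value."""

    def __init__(self):
        self.env = {}      # variable name -> value number
        self.holders = {}  # value number -> set of variables holding it
        self.types = {}    # value number -> Bril type, needed to emit an id

    def bind(self, var, num, typ):
        """Record that `var` now holds value `num`.

        Returns an id instruction to emit *before* the overwriting
        instruction when the old value would otherwise be lost, else None.
        """
        rescue = self._clobber(var, num)
        self.holders.setdefault(num, set()).add(var)
        self.types[num] = typ
        self.env[var] = num
        return rescue

    def _clobber(self, var, new_num):
        old = self.env.get(var)
        if old is None or old == new_num:
            return None
        holders = self.holders[old]
        holders.discard(var)
        if holders:
            return None  # another variable still holds the old value
        # `var` was the last holder: copy the value to a fresh name first.
        fresh = f"lvn.{old}"
        holders.add(fresh)
        self.env[fresh] = old
        return {"op": "id", "dest": fresh, "type": self.types[old], "args": [var]}
```

Because a rescue copy is only produced when the last holder is overwritten, most clobbers cost nothing, and any rescue copy that turns out to be unused can be removed by a later DCE pass.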

Evaluation

To test correctness of the transforms, I ran dce and lvn on all the benchmarks in the Bril repo and ensured they did not break: core, float, mem, and mixed.
As for testing the strength of my constant folding, I used every test in the examples/test/lvn and I was happy with the results I found. I could shortcircuit boolean logic and constant fold past clobbered variables.
~~I did not see great speedups still. The greatest speedups where only because the application did not have many instructions in the first place.~~ Here are some highlights of my results:

| App | DCE % Speedup | DCE + LVN % Speedup |
| --- | --- | --- |
| ray-sphere-intersection | 0.00% | 58.45% |
| mandelbrot | 0.00% | 46.01% |
| bitshift | 0.00% | 29.34% |
| pow | 5.56% | 25.00% |
| euler | 0.05% | 24.37% |
| pascals-row | 4.79% | 17.81% |
| two-sum | 10.2% | 16.33% |

I also have no slowdowns, which is nice.

Anything Hard or Interesting?
- For some reason, it was confusing/challenging for me to understand if folding a Value expression was a different case than other types of Expr. The answer is that, the folding is the same, but Values have one less step of indirection in that a Value is already bound to a variable. In the end the logic goes like this
- Case On
  - Case: A non-Value Expr has a Value associated with it
    - Add this output as a canonical variable for the Value we found.
  - Case: We reuse a Value directly
    - Add this output as a canonical variable.
  - Case: This is a new Expr
    - Allocate a new Value for this expression and associate it with the new Expr

2 replies

matth2k Sep 7, 2023

After reading @stephenverderame's difficulty with function calls, it made me realize I had to tweak my LVN code to properly process function arguments. Now I get much better speedups across more of the programs!

matth2k Sep 8, 2023

One more update: I had forgotten that floating point operations start with "f". I thought those ops were overloaded by type for some reason. Fixing that basically doubled the speedups again.

stephenverderame · 2023-09-07T01:01:38Z

stephenverderame
Sep 7, 2023

Summary

Repo
Trivial and Local Dead Code Elimination
Local Value Numbering with Copy Propagation, Constant Folding, Algebraic Simplification, and Constant Propagation

Implementation Details

For TDCE, I first did local dead code elimination and then performed the global, trivial pass.

For LVN, I used four hashmaps. One corresponds to the environment which maps variable names to value numbers. The next two correspond to the 3-way table in the lecture. One maps value expressions to their value number, and a second maps value numbers to their home location. A final map was used for constant folding which maps value numbers to literals or None. I created a custom type to represent value expressions (instead of the tuples in class) which also enabled basic algebraic simplification and value numbering to be.
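
A rough sketch of the four-map layout described above, under assumptions: the `LvnState` class, its field names, and the tuple-shaped expressions are hypothetical (the post uses a custom expression type), and overwrite handling is left out.

```python
from dataclasses import dataclass, field


@dataclass
class LvnState:
    env: dict = field(default_factory=dict)       # variable name -> value number
    expr_num: dict = field(default_factory=dict)  # value expression -> value number
    home: dict = field(default_factory=dict)      # value number -> home variable
    const: dict = field(default_factory=dict)     # value number -> literal or None

    def number(self, dest, expr, literal=None):
        """Give `dest` the number for `expr`, adding a row on first sight.

        Returns (value number, whether the expression was already known).
        """
        if expr in self.expr_num:
            num = self.expr_num[expr]
            self.env[dest] = num
            return num, True
        num = len(self.home)
        self.expr_num[expr] = num
        self.home[num] = dest
        self.const[num] = literal
        self.env[dest] = num
        return num, False
```

When `number` reports a known expression, the instruction can be rewritten as an id of `home[num]`, or as a const when `const[num]` is not None.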

Slightly different from in class, I decided that whenever an instruction would overwrite a variable that was the "home" to a value, I would first insert an id instruction to move that value to a new home. This prevented the need to insert unnecessary copies at the end of the basic block (so any escaping variables maintain the same values as the original program) and allowed dead code elimination to remove any new home variables that weren't actually needed.

For any instructions that I didn't support (function calls, loads, floating point), I gave them unique value numbers that could never be repeated to make sure they didn't break anything.

Testing/Evaluation

Once again, I wrote a few manual tests, mostly for checking specific aspects of LVN and DCE, but mainly tested correctness by ensuring all preexisting benchmarks and benchmarks from the previous task worked using TURNT.

To test LVN was actually doing something, I did a few ad-hoc tests manually by looking at the generated BRIL instructions and used brench to collect run statistics on all benchmarks.

In my brench run, I followed up lvn with local and trivial dce. Overall, I found that all but 2 tests had reduced dynamic instructions. The slowdown here is likely due to the fact that I used a greedy algorithm to reconstruct CFGs back into programs, which may have found a worse ordering of basic blocks than the original program. I measured the speedup by computing baseline dyn instr / lvn dynamic instr and computed the following summary statistics:

| Mean | StdDev | Quartile 1 | Median | Quartile 3 |
| --- | --- | --- | --- | --- |
| 1.3504x | 0.475 | 1.0001x | 1.0833x | 1.5949x |
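
For readers who want to compute the same kind of summary from brench output, here is a minimal sketch. The function name `speedup_summary` and the input shape (benchmark name mapped to dynamic instruction count) are hypothetical. The use of the sample standard deviation and of Python's default quartile method is an assumption, since the post does not say which definitions were used.

```python
import statistics


def speedup_summary(baseline, optimized):
    """Summarize per-benchmark speedups (baseline count / optimized count).

    Both arguments map benchmark name -> dynamic instruction count.
    """
    speedups = [baseline[name] / optimized[name] for name in baseline]
    q1, median, q3 = statistics.quantiles(speedups, n=4)
    return {
        "mean": statistics.mean(speedups),
        "stdev": statistics.stdev(speedups),  # sample standard deviation
        "q1": q1,
        "median": median,
        "q3": q3,
    }


# Hypothetical counts for illustration only; these are not the post's data.
example = speedup_summary(
    {"a": 100, "b": 200, "c": 300, "d": 90},
    {"a": 100, "b": 100, "c": 200, "d": 60},
)
```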

Many tests resulted in instruction counts similar to the original, however, some had large improvements such as euclid, mandelbrot and check-primes with less than half the instructions as the baseline.

Difficulties

One issue is that, while I didn't support floating point arithmetic for value numbering, I did want it to not break those programs and still perform copy propagation. The issue I was running across is that integer constants could be used to initialize float variables. To solve this, I created a separate value key for IntAsFloat constants to distinguish const 0 as an integer and const 0 as a float to avoid type mismatches.

A similarly related issue was figuring out how to treat things I wasn't/couldn't support such as function calls and floating point ops. Take this for example:

x: int = mul a b;
x: int = call @f;
y: int = mul a b;

Here LVN needs to know that x has been overwritten. The way I dealt with this was to set x to a new, unique value number on line 2, so that the system recognizes that mul a b is no longer stored in x.

kevinnegy · 2023-09-07T03:49:10Z

kevinnegy
Sep 7, 2023

Summary

Implementation details/testing
The DCE implementation is straightforward, using a forward pass. When tested with brench on all benchmarks, my code seemed to do nothing, despite improving my own small set of programs. Brench reported that no instructions were deleted, and therefore there was no incorrect program output.

LVN was very complicated and is not currently working. For my own small toy examples that included the examples from class, the code was optimized and still gave the correct output. However, when I tried to test with brench on all the benchmarks/*bril programs, most programs do not have any difference in number of instructions. There are a few benchmarks that are also either reported as incorrect or "missing". The 4 missing programs generated the following errors:

error: mismatched main argument arity: expected 1; got 0 (2 programs)
error: fsub argument 1 must be a float
error: add argument 1 must be a int

I think in generating new instructions, I may have done poor type conversion. I'm not sure what is causing the first error, but it seems to be indicating that I'm accidentally deleting an argument somewhere.

There was only one benchmark from the bril benchmarks folder that showed an improvement. The quadratic.bril benchmark went from 785 to 412 and seemingly still got the same output. I am suspicious about this since my program didn't optimize any other program, but would need to look deeper into that program to verify.

Difficulties
Two overarching difficulties:

Just getting a grasp on how to manipulate bril programs in block form and then to put it back together. Last week's task, I created basic blocks but didn't do anything with them. So then I was very stumped this week on how to convert blocks back to an entire json. After much struggling, I looked at the example/tdce.py file and realized that I should be reattaching the blocks (instructions) directly back to the original json object instead of trying to create a new json program object. I also took the flatten function for combining blocks back together from that example program. A one-character error I made that Python kindly and silently let me do, led to a frustrating hour or two trying to figure out why I couldn't get my new blocks to be stored correctly in the original json object. Finally, I got it to work. I'm hoping this initial obstacle won't be an issue in the future since I'll already have the code template for doing bril json manipulations and can just plug and play.
The LVN implementation was (and is) very challenging. As others have mentioned, not all instructions have the same format. Labels have no opcode, several functions are missing destination variables, and then unlike practically every other op+dest instruction, const instructions have no args, just a value. Handling each type of instruction was a challenge. I ended up skipping labels, not putting any non-dest instructions into the table, and just having some manual handling for other "irregular" instructions.

Second, pointers proved difficult because my LVN code wanted to treat two different pointer alloc's of the same size as the exact same "expression" when in reality the memory underneath the hood was different. I ended up just skipping all pointer instructions with the idea that since pointer values are under the surface, it wasn't safe to try to optimize. This led to instructions whose pointer values were not in the lvn table/environment. I ended up just throwing in any arguments that had not been defined as "noop" variables into the table. For example,

# Skip handling for these instructions
valA: int = load vectorA;
valB: int = load vectorB;

# valA and valB don't show up in LVN table/env.
# Add both as [expression, canonical variable] rows, e.g. [(valA), valA]
sum: int = add valA valB;

I handled function arguments the same since it's unknown what the expression for function arguments would be.
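
The block round trip described under the first difficulty (split into blocks, optimize, write the instructions back into the original JSON object) can be sketched as below. This is an illustration, not the course's example code: `form_blocks`, `optimize_program`, and the `block_pass` callback are hypothetical names, and the terminator set covers only the core ops.

```python
import itertools

TERMINATORS = {"jmp", "br", "ret"}  # assumed: core Bril terminators only


def form_blocks(instrs):
    """Split a flat instruction list into basic blocks."""
    blocks, cur = [], []
    for instr in instrs:
        if "label" in instr:
            # A label always starts a new block.
            if cur:
                blocks.append(cur)
            cur = [instr]
        else:
            cur.append(instr)
            if instr["op"] in TERMINATORS:
                blocks.append(cur)
                cur = []
    if cur:
        blocks.append(cur)
    return blocks


def optimize_program(prog, block_pass):
    """Run `block_pass` on every block and reattach the result in place."""
    for func in prog["functions"]:
        blocks = [block_pass(b) for b in form_blocks(func["instrs"])]
        # Flatten the blocks back into the function's own "instrs" list.
        func["instrs"] = list(itertools.chain.from_iterable(blocks))
    return prog
```

Mutating `func["instrs"]` on the parsed object keeps everything else (function names, arguments, return types) intact, so the result can be dumped straight back out with `json.dump`.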

bcarlet · 2023-09-07T04:59:19Z

bcarlet
Sep 7, 2023

Summary

Details

The TDCE implementation follows the straightforward approach discussed in lecture.

The LVN implementation includes the basic algorithm and some limited semantic extensions, namely, special handling for commutative operations and id. It doesn't attempt the fancier extensions like constant folding. To handle overwrites, I diverged from the pseudocode discussed in lecture by not rewriting variable names, but rather marking values in the table as unavailable when they are overwritten. Additionally, I store a set of variables for each value rather than a single variable, so it may be possible to fall back to another variable if the canonical variable is overwritten.

Testing

To test correctness, I used brench to run the optimizations on all the benchmarks in the Bril repository, ensuring that they all produced the same output as the unoptimized program.

To test the quality of the optimizations, I first inspected their behavior on the test cases under examples/test/lvn to ensure that they were behaving as I expected. For a more systematic evaluation, I again used brench to collect dynamic instruction counts across all benchmarks in the Bril repository.

Across all benchmarks, I observed a minimum speedup of 1.0× (which is to say, I never observed any slowdowns), and a maximum speedup of 1.97×. Average speedups for the various benchmark suites were as follows:

| Suite | Baseline | LVN | TDCE |
| --- | --- | --- | --- |
| core | 1 | 1.16494 | 1.00424 |
| float | 1 | 1.27775 | 1.00572 |
| mem | 1 | 1.04344 | 1.0157 |
| mixed | 1 | 1.07608 | 1.00013 |

Difficulties

As has been mentioned a lot already, there were a lot of edge cases to consider when applying the high-level algorithm to actual code. Some fun ones included instructions with side effects (or otherwise impure functions) and variables defined outside the basic block. Some edge cases only arose when testing with one of the various Bril extensions. To try to accommodate extensions without having to consider them each individually, I adopted a conservative "opt-in" approach to optimization, where only instructions that are explicitly supported will be rewritten.
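
One way to picture the conservative "opt-in" approach is an explicit allowlist, as in the sketch below. The set contents and the name `is_rewritable` are assumptions for illustration. The post does not list which ops it supports.

```python
# Assumed allowlist of pure core ops; anything not listed is left untouched.
REWRITABLE_OPS = frozenset({
    "add", "sub", "mul", "div",
    "eq", "lt", "gt", "le", "ge",
    "not", "and", "or",
    "id",
})


def is_rewritable(instr):
    """True only for value-producing instructions the pass explicitly supports."""
    return "dest" in instr and instr.get("op") in REWRITABLE_OPS
```

Ops from extensions (for example `fadd`, `load`, or `call`) fall through automatically, so a new extension cannot be miscompiled just because the pass has never heard of it.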

collinzrj · 2023-09-07T22:43:40Z

collinzrj
Sep 7, 2023

Summary of what you did
I have implemented trivial dead code elimination, locally killed instruction elimination, LVN, and constant folding. Everything is implemented in the optimization.py file, which accepts a Bril program in JSON form and outputs the optimized program as JSON.

Testing
I have attached a result.csv. I tested my optimization with brench against all the Bril code in the benchmarks folder. Every run succeeds except for one program that doesn't work even under the baseline. The result column shows the total dynamic instruction count. We can see this number drops for some programs; in particular, for the function_call program, it drops from 59809726 to 54208816.

Difficulties
I spent quite a bit of time debugging LVN. There are many corner cases I had to pay attention to.

willwng · 2023-09-08T00:22:18Z

willwng
Sep 8, 2023

Summary

We (Vivian and I) did our implementation in Kotlin. We wanted to future-proof/set up infrastructure, namely parsing JSON into our own data classes
We hope this will help out for future implementation tasks (but took non-negligible time compared to work on the actual optimizations)
We implemented local and global trivial DCE and LVN. This enabled CSE and copy propagation, which we built into the LVN pass.

Implementation details

We first parse the JSON into our own data classes; this allowed neat features such as distinguishing between all types
of instructions and values. We had to write custom serializers, so we could output our data structures back into JSON
We then implemented trivial local DCE based on pseudocode shown in class. We decided to do a forward-pass approach
after giving up on our backward-pass approach. We eventually found our mistakes and compared our results between the two
to make sure they matched.
Then, we worked on LVN.
We maintained a couple different notions of values that need to be recorded: constants
and computed tuples as discussed in class, as well as types representing a value that was defined outside the block,
and impure values. We kept track of impure values purely because it made it easier to record when existing
equivalences were broken; however, no impure values ever alias one another as a result of numbering.
To handle overwriting, each time a canonical variable is about to be overwritten, we store it in a new canonical
variable defined at that point. This sometimes creates extraneous definitions, but we can easily remove them again
with trivial DCE. We also considered maintaining a set of variables pointing to each value, or scanning through the
environment to find backup canonical variables before creating new ones. However, we decided since a DCE pass occurs
after LVN anyway, it seems a little cheaper and simpler to just remove extraneous definitions later.
After doing this, we realized LVN combined with trivial local DCE was quite weak, since the assumption that all variables are live upon exiting a basic block meant that almost no definitions were removed. So we also implemented global DCE, which is stronger in that it can view the entire program, but weaker in that if it sees any variable being used, it preserves all definitions of it. It turned out that using both local and global DCE after LVN/CSE/copy propagation produced much better results.
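
The global pass described above can be sketched in Python as follows (the post's own implementation is in Kotlin). The name `global_dce` and the decision to always keep `call` instructions are assumptions made for this illustration.

```python
SIDE_EFFECT_OPS = {"call"}  # assumed: calls are kept even if their result is unused


def global_dce(func):
    """Drop definitions whose destination is never read anywhere in the function.

    Repeats until nothing changes, since removing one instruction can make
    the definitions it read dead as well.
    """
    changed = True
    while changed:
        used = {arg for instr in func["instrs"] for arg in instr.get("args", [])}
        kept = [
            instr for instr in func["instrs"]
            if "dest" not in instr
            or instr["dest"] in used
            or instr.get("op") in SIDE_EFFECT_OPS
        ]
        changed = len(kept) != len(func["instrs"])
        func["instrs"] = kept
    return func
```

This shows the trade-off in the paragraph above: the pass sees the whole function, but a single use of `x` anywhere keeps every definition of `x`, including ones that are overwritten before being read. The local pass is what catches those.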

How did you test it? Which test inputs did you use? Do you have any quantitative results to report?

For LVN, we also printed handy tables: this allowed us to compare results to our handmade tests and ones shown in
class. For example:

We first used tests we wrote ourselves targeting things like copy propagation, references to variables not defined in
the same block, CSE enabling more DCE, and checking that overwritten variables are stored in fresh names.
We then ran brench on all the benchmarks in the bril repo, where we found our results agree for all
(except "function_call" which timed out for both the optimized and unoptimized case)
We also found an average of 10.2% fewer dynamic instructions for the benchmarks after our optimizations:

What was the hardest part of the task? How did you solve this problem?

Parsing big integers was annoying in Kotlin; some tests had integers that wouldn't fit into the typical Kotlin int
We had one issue with floating point accuracy (around 1e-6) for the "cordic.bril" test

Enochen · 2023-09-08T01:03:25Z

Enochen
Sep 8, 2023

Summary

Trivial Dead Code Elimination
Local Value Numbering w/ Copy Propagation, Constant Folding

Details

The trivial dead code elimination implementation was pretty straightforward, it was pretty much just the approach we covered in class.

Local value numbering, on the other hand, was more complicated. I began implementation with very simple test cases in mind, and had a pretty good time. The general implementation involves an array of "table entries" each containing a value and a list of variables that refer to it (the first variable being treated as the canonical). I then created indexes for both values and variables to jump to the corresponding table entry. Finally, it was just a matter of creating temporary variable names for clobbered variables and basically updating the table/index data structure when appropriate. I also solved commutativity by sorting the arguments within each value.

I had a lot of fun implementing constant folding! I didn't have time to implement all the operations that are possible to fold on, such as those involving booleans, but I believe the structure is there and implementing the rest should be a breeze. This felt like writing a super basic interpreter.

As I started to test my optimization on more parts of the language, I realized there were a lot more considerations to take care of, such as function arguments and call instructions which I hadn't considered in the beginning. As a result, I ended up with some hacky approaches to these cases, such as when handling arguments that have been assigned to in the block being processed.

Testing

I started by writing a couple of extremely contrived examples and manually ran my optimizations to compare the outputs while implementing/fixing basic functionality.

I then used the programs in examples/test/lvn to validate my approach and as a sanity check on intended results.

Finally, I used brench to run en masse against all the benchmarks and collected/analyzed dynamic instruction counts. All outputs are correct, and none have more instructions than the baseline, which is good to see. Certain benchmarks like core/fizz-buzz have significantly reduced instruction counts, which made sense after looking at the source bril. I also specifically took a look at my own benchmark which I added in the previous task. I had used the TypeScript compiler to generate the source and was pleased to see my LVN + TDCE reducing instruction count by over 15%!

Difficulties

I chose to implement LVN in TypeScript, which I thought struck a good balance for developer experience, having Bril typings available while also being able to run as a script without a compilation step. However, JavaScript is just totally messed up -- for example, the Map data structure has no way of checking deep equality, meaning that I cannot use "tuples" (ie lists) as or within keys. I managed to overcome this by stringifying my keys, but I do not feel good about this workaround at all. I think I'm going to migrate to some other language like rust for later projects, if I have the time to port all my util code over 🥲.

One tricky "problem" (moreso aesthetic annoyance) I encountered and was not able to fully solve was using the "correct" variable naming when dealing with clobbered variable names. My implementation creates new temporary variable names for write instructions that eventually will be overwritten later, and since these replacement names come first, they end up being the canonical name for the value.

For example:

@main {
  a: int = const 4;
  sum: int = add a a;
  print sum;
  sum: int = add a a;
  print sum;
}

becomes

@main {
  lvn_temp_0: int = const 8;
  print lvn_temp_0;
  print lvn_temp_0;
}

while aesthetically I'd like it to be

@main {
  sum: int = const 8;
  print sum;
  print sum;
}

One approach I tried was to update the canonical variable of a value with the latest variable for that value, however this would increase the number of instructions, making the optimization worse.

20ashah · 2023-09-08T01:20:00Z

20ashah
Sep 8, 2023

Summary

Me and @JohnDRubio worked on writing the trivial dead code elimination and LVN algorithms discussed in class. Our LVN algorithm follows the basic implementation along with a few extensions.

Implementation details

We implemented the overall table and environment cloud by creating a Table class. Our Table class contained 2 main data structures, one for the actual table, and the other for the cloud. The actual table was a dictionary, where the keys represented the value column in the table. These keys mapped to another tuple representing the other 2 columns, the variable name and the row number. For example, if we had a row 1 in our table with the value add 0, 0 and the canonical variable name x, the entry in the table data structure would be (add, a, b) -> (x, 1). The second data structure was also a dictionary representing the cloud, which simply maps the variable name to the row number.
Our algorithm follows the approach described in class, with a few parts that are slightly different. One involved the issue described in class with the edge case of variables being re-defined, and using the wrong value in our table. We fixed this issue by performing a check as we go through each instruction. For each instruction, we scan down the list and check if we see another instruction with the same definition - if we do, we rename the definition of the initial instruction and all its uses between the instruction and the redefinition. We also make sure that the name we are generating does not have a duplicate in the basic block.
In addition to the basic LVN algorithm, we implemented a few extensions that were listed in class. The first was a fairly simple extension: ordering the argument numbers in the value column tuple so that we support commutative operations. The second extension that we implemented was copy propagation. To do this, as we scan through the instructions and see an 'id', we look up the variable name we are copying in the cloud data structure to get the row that holds its value. We then point the destination of the id instruction to this row. By doing this, we never create a row in the table with an id value; instead, all the copies point to the original row and value in the table.
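
A minimal sketch of the table-plus-cloud structure and the id handling described above. The method names (`add_row`, `canonical`, `visit_id`) are hypothetical, and the linear scan in `canonical` is only for brevity.

```python
class Table:
    """Hypothetical version of the two dictionaries described above."""

    def __init__(self):
        self.rows = {}   # value tuple -> (canonical variable, row number)
        self.cloud = {}  # variable name -> row number

    def add_row(self, value, var):
        """Add a new value with `var` as its canonical variable."""
        num = len(self.rows)
        self.rows[value] = (var, num)
        self.cloud[var] = num
        return num

    def canonical(self, num):
        """Find the canonical variable for a row number (linear scan)."""
        for var, n in self.rows.values():
            if n == num:
                return var
        raise KeyError(num)

    def visit_id(self, instr):
        """Copy propagation: alias dest to the source's row, adding no id row."""
        src = instr["args"][0]
        num = self.cloud[src]
        self.cloud[instr["dest"]] = num
        return {**instr, "args": [self.canonical(num)]}
```

A chain such as `b = id a; c = id b` ends up with both copies reading `a` directly, which is what lets a later DCE pass delete `b` if nothing else uses it.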

Testing

To test this, we first wrote some bril programs which specifically tested some of the features that we wanted, for example copy propagation, renaming duplicate variables, common sub-expression elimination, etc making sure that the output of the program didn't change.
Once we did this and verified that our algorithm worked for these hand-written tests, we started running our algorithms against the benchmarks.
We first verified that our optimizations still produce the correct result for all the benchmarks. To look at how our optimizations did, we used Brench to output dynamic instruction count for each benchmark.
The results comparing the unoptimized version with the version after trivial dead code elimination are shown here
We also ran the same benchmarks with the LVN algorithm, with the results here
There are several test cases that error for the LVN benchmark, shown as 'missing' or 'incorrect'. However, when we run these cases individually, both unoptimized and after the LVN algorithm, we see the correct result, so we are not sure why we see an inconsistency in Brench.

Difficulties

The biggest difficulty for these tasks was implementing the LVN algorithm, specifically dealing with a lot of the edge cases that didn't present themselves from just the pseudo-code.
One of these edge cases was function arguments. We initially didn't consider them until we ran some of the benchmarks, which broke our algorithm. To get around this, we initialized our table with rows for each argument, allowing for common sub-expression optimizations involving the arguments.
Another edge case that we found was dealing with uses in a basic block where its definition was in another basic block. Since this is a local optimization, we are running our LVN optimization on one basic block at a time. When we saw a usage of a variable that was not already in the table (since it was defined in a different basic block), we added a row for it in the table, just like we did for arguments.
A final edge case we found was for the logic of renaming definitions if we saw a rewrite to the same variable later down the line. The edge case was if the very next instruction was the rewrite and it had that same variable that we were trying to rewrite as a use. Making sure that we changed the original definition name and the use in the very next instruction fixed this edge case.
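
A minimal Python sketch of the table seeding described above, for readers who want to see it concretely. It is not the authors' code: the function name `seed_table`, the `("live-in", name)` placeholder value, and passing arguments as a plain list of names are assumptions made for illustration. Instructions are dicts in the shape of Bril's JSON form.

```python
def seed_table(block, func_args=()):
    """Give a value number to every variable that is read in `block`
    before being written there: function arguments and values that
    flow in from other basic blocks."""
    table = []      # table[num] = (value, canonical_name)
    var2num = {}    # variable name -> value number
    defined = set() # variables assigned so far in this block

    def add_opaque(name):
        # An opaque row: we know nothing about the value except its name.
        if name not in var2num:
            var2num[name] = len(table)
            table.append((("live-in", name), name))

    for name in func_args:
        add_opaque(name)
    for instr in block:
        for arg in instr.get("args", []):
            if arg not in defined:
                add_opaque(arg)
        if "dest" in instr:
            defined.add(instr["dest"])
    return table, var2num
```

With rows like these in place, two expressions that both read the same argument or live-in variable share a value number, so the usual lookup can treat them as a common subexpression.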

1 reply

jdroob Sep 8, 2023

For whatever it's worth - wanted to drop a screenshot in here of an example of a test case that we're getting 'incorrect' for when we run brench but seem to be getting the correct result (with fewer instructions) when we run the test case manually

SanjitBasker · 2023-09-08T01:37:22Z

SanjitBasker
Sep 8, 2023

I worked with @obhalerao on this project.

Summary

Trivial Dead Code Elimination

Local Value Numbering

Details

For this assignment, we changed the C++ JSON parsing library we used from RapidJSON to nlohmann/json, as the latter had much cleaner syntax. We can safely say that we did not regret this decision.

For the trivial dead code elimination assignment, we followed a similar algorithm as the one that was discussed in class: simply iterate backwards through the function, maintain a set of dead variables that are assigned to, and remove them from the set. We iterated until convergence here.

Local value numbering was more tricky. The data structure we chose for the value table was a vector of pairs, the first of which contained a Value representation, and the second of which contained a queue of potential homes for each value, the first of which would be the canonical home. In our haste to write a Value representation, we overlooked how to process constant values, and came up with the admittedly janky solution of reusing the Value representation we made, but putting the value of the constant itself instead of the number of the relevant value. This also means that we do not support processing of float constants; this is something that we can easily change. We do, however, support commutativity of relevant operations.

In addition to the value table, we also maintain maps from each variable to its value number and also from each Value itself to its value number, and update these on each iteration of the overall LVN pass. Then, after we update the table for each instruction, we rewrite the args of each instruction with the home of each value, after popping names off the queue corresponding to the value that are not actually its home.

Lastly, after the LVN pass is performed for each basic block and all instructions are rewritten, we perform global DCE (the version that simply removes all unused assignments) until convergence on the entire program, and then the same local DCE implemented previously on all basic blocks.
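
For reference, the "remove all unused assignments until convergence" pass can be sketched in a few lines of Python. This is an illustrative sketch rather than the C++ implementation described in this post; the function name `global_dce` is hypothetical, and keeping every `call` is an assumption made so that side effects are preserved.

```python
def global_dce(instrs):
    """Remove instructions whose destination is never read anywhere in
    the function, repeating until a pass removes nothing."""
    instrs = list(instrs)
    while True:
        # Every variable that appears as an argument anywhere.
        used = {arg for instr in instrs for arg in instr.get("args", [])}
        kept = [
            instr for instr in instrs
            if "dest" not in instr          # labels, print, br, ...
            or instr["dest"] in used        # result is read somewhere
            or instr.get("op") == "call"    # assumption: may have effects
        ]
        if len(kept) == len(instrs):
            return kept
        instrs = kept
```

Iterating matters because deleting one dead instruction can make the instructions that fed it dead as well.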

Testing

We tested our programs using brench on the entirety of the bril benchmark suite, in addition to the bril programs in the tdce and lvn directories from the bril repo. Here are the speedups we observed on some of the benchmark directories.

Here are the average speedups we observed in all of the Bril benchmarks.

Difficulties

The primary difficulty we had with this assignment was, as per usual, getting used to C++. However, after working with it for an assignment at this scale, we feel as though we have a better handle of how it works now. In addition, there were many edge cases that we missed on our first pass through (e.g. how to handle numbering of call instructions), so we're glad we had the benchmark suite to confirm our suspicions on how to handle these edge cases.

0 replies

alifarahbakhsh · 2023-09-08T02:06:13Z

alifarahbakhsh
Sep 8, 2023

Link to the repo

Summary:

I have implemented both the trivial DCE and the local DCE. The tests are run with the trivial one.
I have implemented LVN with support for commutativity and awareness of the semantics of id.

Experiment results:
I used brench to test my optimizations. The code supports the benchmarks in the core and float folders of the
canonical Bril benchmarks. I achieve a maximum of 50% and an average of 6% improvement in core, and a maximum of 8% and an average of 1% in float. I break nothing.

Difficulties:
So many interesting corner cases! I encountered the function side effect issue mentioned above, and also a subtle
issue with the value field of the const opcode. I had to repetitively modify how I encode values in a tuple to make things
work. Moreover, I just manually ignored cases that I was not interested in, e.g., support for float. All in all, great experience!

0 replies

Arthur-Chang016 · 2023-09-08T03:03:37Z

Arthur-Chang016
Sep 8, 2023

Summary

LVN
TDCE

I started with LVN first. I felt it's more challenging and interesting.
I switched from Python to C++. I tried to use C++ polymorphism and some features that I rarely use. Learning those features took even more time than implementing the algorithms themselves.
I chose nlohmann/json as the JSON parsing library. The syntax is pretty intuitive, though it's not the fastest library.

Details

LVN
- Data structure: vector for table, hash map for var2num. Also need an extra tree map from Value to Number (index into the table)
- Breaking programs (functions) into basic blocks
- Design the framework to process bril.
- For the "id" operation, I just refer to its argument's number, instead of creating a new table entry for it.
- Currently only supports the Bril core.
TDCE
- Share the same framework with LVN.
- Go backward by adding/ deleting defs/ uses variables.

Testing

Currently I have only run it on my trivial hand-made test cases plus simple.bril and reassign.bril.

Difficulties

Although I am kind of familiar with C style programming and pointers, C++ is pretty hard to manipulate. I was pretty used to Java and sometimes when I want to do polymorphism (eg. isinstanceof function) in C++, it's pretty annoying.

LVN
- For each basic block, some variables might come from other places (func args or other blocks). For those unnumbered variables, I just skip those instructions instead of numbering them.
- Need to specially handle "call" and some non-computing rhs instructions. We cannot number them; we can only rewrite their arguments.
TDCE
- It's much easier than LVN.
- The trickiest part is thinking clearly about the order of adding/deleting variables in the live container.
- Choosing the initial set of variables so that it will eliminate correctly.

Generative AI

I only asked ChatGPT to create a Makefile for a given file directory structure. Since Makefiles are pretty hard to write and read, it was helpful in this case.

0 replies

he-andy · 2023-09-08T03:15:23Z

he-andy
Sep 8, 2023

Summary

Trivial Dead Code Elimination & LVN (CSE, Copy Prop) [repo]

Details

My optimizations use the bril-rs library to parse the Bril JSON. TDCE was written closely following the pseudocode given in class and I had few issues getting it to work.

LVN was a lot more involved for me. To represent the table, I made a struct LVNTable with three hashmaps maintaining all the info needed for LVN. Most notably, I mapped numbering to a list of all associated canonical names to deal with clobbering. To speed this up, I also maintained the last def point of each variable and relabelled variables if they were re-def'd. This way, there would only ever be two elements in the list of associated canonical names.

Challenges

TDCE was quite straightforward to implement. LVN/CSE were a little more complicated, but after wrapping my mind around and designing the data structure, it wasn't too bad. In this initial implementation, I used a last_def table to keep track of whether the variable an instruction def'd (if any) was the last_def of that variable. However, this ran into numerous issues when trying to do Copy Prop, as it wouldn't consider a variable defined outside a preceding basic block and redefined in the current block. To work around this, I kept a list of all canonical names as I mentioned before.

Testing & Verification

To sanity check while coding, I tested against some simple examples designed for DCE/CSE/Copy Prop. Afterwards, I used brench against the entire benchmark suite. This would verify my code's outputs against the baseline. I wrote a script using python to aggregate the statistics.

Full Suite

	DCE	LVN + DCE
mean	1.007	1.275
stddev	0.019	1.007
min	1.00	1.00
max	1.114	3.381

By Suite (mean speedup)

Suite	DCE	LVN + DCE
core	1.004	1.282
float	1.006	1.534
mem	1.016	1.064
mixed	1.001	1.240

0 replies

NgaiJustin · 2023-09-08T03:42:40Z

NgaiJustin
Sep 8, 2023

Summary

Details

For TDCE I followed the pseudocode outlined in the lecture. Did not have too much time this time round so fell back onto Python. I also added some types for the Bril to make it easier to work with.

LVN was more challenging. Implementation-wise, I used four hashtables: num2val, val2var, var2num, and a num2var to handle the cases with clobbered names. I also batched the block mutations. Finally, after the LVN pass per block, I did a global + local DCE to clean everything up.

Difficulties

TDCE was not too bad, but LVN was much more challenging. My first few implementations fully modeled the pseudocode using the three-way hashtable but I ran into issues with variables being defined in a preceding block and used in a subsequent block. This resulted in patches upon patches and made everything quite messy. Thinking back, I should have perhaps spent more time rethinking my structs.

Testing

I first tested my TDCE and LVN implementation on some small handwritten test cases (can be found under test/tdce and test/lvn respectively)
Then, I tweaked the test cases from the bril examples repo and manually inspected the difference
Finally, I ran brench on the core benchmark (results.csv can be found in the repo). There was some last-minute breakage on some of the test cases, but as a whole TDCE is 100% correct and LVN does not make anything worse than the TDCE results. Most notably, I observed a large speedup with the quadratic core benchmark:

quadratic	baseline	785	1
quadratic	tdce	783	1.002547771
quadratic	lvn_tdce	412	1.475159236

0 replies

evanmwilliams · 2023-09-08T03:51:42Z

evanmwilliams
Sep 8, 2023

Partners

@emwangs and I worked on this assignment together.

Summary

Lesson 3

Implementation Details

We opted to use C++ for this assignment. While we think it was a good choice in many ways, we spent some time fighting against the nlohmann package for silly JSON parsing details
Honestly setting up CMake took more time for us than some of the other implementation task work. It was fairly simple
Trivial Dead Code Elimination (TDCE) was rather simple to implement. We simply scan through the instructions to check if there are any that are never used before they are reassigned - if so it's deleted
LVN was much harder than we expected - the pseudocode from lecture hides quite a few tricky implementation details. Namely, we were a bit confused on how to handle some edge cases such as when there is a use of a variable but no def before it in the basic block. We also ran into issues just understanding how to represent some of the data structures efficiently. We ultimately settled on using a dictionary for canonical values to their local value numbers and a vector that is indexed on the value numbers to store both the canonical values and the canonical names

Challenges

As we discussed above, LVN was pretty hard to implement. There are a lot of edge cases in designing the right data structures, and it's also difficult to parse the JSON format since not every instruction has a standard format. I think it'd actually be easier to parse the JSON file into an AST format and work with that rather than the JSON directly, but we stuck with the JSON format for this lab due to some time constraints
I also think that 2 days was maybe a bit of an aggressive timeline to get this one implemented, but that's okay! We learned a lot :D

Testing

We ran the example from lecture and all of the examples in the benchmark to verify that our programs were correct. We visually inspected the efficiency of some of our optimizations, but did not have a whole lot of time to aggregate pretty statistics. Will try to post them in the next day if I am able to!

0 replies

ryanwmao · 2023-09-08T03:54:37Z

ryanwmao
Sep 8, 2023

@xalbt and I worked together on lesson 3.

Summary

repo

Implementation

We implemented TDCE in C++. The program takes json from stdin and outputs json to stdout. Our LVN is written in Python and takes json from stdin and outputs json to stdout.
We followed the pseudocode outlined in lecture. Because of some design issues in the LVN and the resulting time crunch, we weren't able to extend our LVN implementation to the other analyses presented in lecture.
In terms of debugging, we found this pretty challenging, but we were able to get by with comparing the bril2txt after using our tools and the original .bril files.

Difficulties

We underestimated the amount of time it would take to implement LVN correctly -- there were many issues that we did not foresee and had to spend extensive time fixing.
It seemed like one of the test cases in Core used memory extensions, which we did not support.
There were a ton of edge cases we did not initially consider for LVN -- variables reaching some uses that were outside the scope of the block, function argument variables, etc. For each, we applied a fix as best as we could integrate into our infrastructure, and so our code ended up looking very messy for the LVN. In the future, more extensive planning procedures would have to take place to avoid such difficulties.

Testing

We ran our pipeline (first TDCE, then LVN, then TDCE) on the core benchmarks with brench. In terms of correctness, we compared our test results with those of the example code provided in the repository; both implementations failed the dot-product test case, which uses memory extensions, so we thought that our code was fine.
Our data

0 replies

AliceSzzze · 2023-09-08T03:59:08Z

AliceSzzze
Sep 8, 2023

implementation repo
results

I used Java to implement the tasks for lesson 3.

Implementation:

DCE was not too bad. I did something similar to what we talked about in the lecture, i.e. removing a previous assignment if there has not been a use between the last assignment and the current one. I also added a second pass to remove variables that are not used anywhere globally.
LVN was more difficult than I thought. I broke the multi-way table in the lecture down to multiple hashmaps to allow easier and faster indexing, at the expense of space. I numbered the variables to deal with reassignments and used a queue to keep track of all the potential candidates for the canonical home of an expression.
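
The local DCE rule described above (drop an assignment when the variable is assigned again with no use in between) can be sketched in Python as follows. This is an illustration rather than the Java implementation from this post; the name `local_dce` is hypothetical, and never deleting a `call` is an assumption made to stay safe around side effects.

```python
def local_dce(block):
    """Within one basic block, drop an assignment when the same variable
    is assigned again later with no use in between."""
    block = list(block)
    changed = True
    while changed:
        changed = False
        last_def = {}   # variable -> index of its most recent unused def
        dead = set()    # indices of instructions to delete
        for idx, instr in enumerate(block):
            # Uses come first: an instruction reads its args before writing.
            for arg in instr.get("args", []):
                last_def.pop(arg, None)
            dest = instr.get("dest")
            if dest is not None:
                prev = last_def.get(dest)
                if prev is not None and block[prev].get("op") != "call":
                    dead.add(prev)
                last_def[dest] = idx
        if dead:
            block = [i for n, i in enumerate(block) if n not in dead]
            changed = True
    return block
```

Processing uses before the definition is what keeps `a = add a a` from killing the earlier definition of `a` that it reads.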

Difficulties:

I decided to switch to Java for this project, so I spent some time setting everything up. JSON parsing in Java is definitely not as nice as Python.
Had to learn how to use Gradle
I definitely should have started this task earlier. I started coding LVN before I thought everything through, and realized that there were a lot of corner cases and subtleties that needed to be resolved.

Testing:

I first tested on a few small bril programs with clear opportunities for optimization and verified that the redundancy was removed. I then ran them on all the benchmarks. The charts below show the proportional decrease in instruction counts after different transformations. I haven't extended LVN to work for floats, so some of the float benchmarks broke. Missing bars indicate a timeout or an incorrect result. Thankfully none of the non-float benchmarks broke (I think?)

What I haven't done and would like to do if I had time:

associativity
constant folding
support for floats

0 replies

yxd97 · 2023-09-08T04:06:12Z

yxd97
Sep 8, 2023

Summary

Implemented trivial dead code elimination that is similar to the algorithm discussed in class.
Implemented local value numbering that can reuse common subexpressions, with the ability to address commutative operators and optimize chains of id instructions.
Codebase
I did not use any AI tool in this task.

Implementation Details

The trivial dead code elimination simply deletes definitions that are never used anywhere in a function. It does not analyze re-definitions.
To deal with re-definitions, I implemented a renaming algorithm, which Renames any Old Definition (ROD in short) of a variable within one basic block. The new name is the original name suffixed by the name of the basic block and the line number within this block, so that it is guaranteed to be unique (how do I name a basic block). This algorithm also takes care of uses that should be renamed accordingly.
I implemented the LVN algorithm which takes in a program pre-processed by the DCE and ROD algorithms. In this case, the LVN does not need to deal with re-definitions. I created a class for the value of a variable, with a custom __eq__ method to account for commutative operators. The LVN algorithm takes two passes on a basic block; the first one builds the two tables to index variables with numbers, and vice versa. The second pass will try to rewrite any instruction according to the tables. Finally, another DCE pass is used to clean up.
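
A rough Python sketch of a renaming pass in the spirit of ROD, under stated assumptions: the function name `rename_old_defs`, the default block name `b0`, and the `name.block.index` format are illustrative choices, and the sketch assumes the generated names do not already occur in the program.

```python
def rename_old_defs(block, block_name="b0"):
    """Rename every definition that is overwritten later in the same
    block, together with the uses that read it, so each remaining name
    is assigned at most once per block."""
    # Index of the last definition of each variable in the block.
    last_def = {i["dest"]: n for n, i in enumerate(block) if "dest" in i}
    current = {}    # original name -> fresh name currently holding it
    out = []
    for n, instr in enumerate(block):
        new = dict(instr)
        # Rewrite uses first, so `a = add a a` reads the old definition.
        if "args" in new:
            new["args"] = [current.get(a, a) for a in new["args"]]
        dest = new.get("dest")
        if dest is not None:
            if last_def[dest] != n:     # overwritten later: rename
                fresh = f"{dest}.{block_name}.{n}"
                new["dest"] = fresh
                current[dest] = fresh
            else:                       # final definition keeps its name
                current.pop(dest, None)
        out.append(new)
    return out
```

Because the final definition keeps the original name, code in later blocks that reads the variable is unaffected.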

Testing Strategy

I created unit tests for the DCE and ROD algorithms. The test cases are either hand-crafted, or abstracted from a part of a benchmark program when the test on that program fails.
There are also test cases for the combined optimization algorithm, which is a combination of hand-crafted ones and representative ones from the Bril repo.
Finally, I tested the optimization algorithm on all benchmarks in the Bril repo. The results indicate my optimization only reduces the dynamic instruction count of some programs.

Challenges

The LVN algorithm is not easy to tackle with many details to polish, so it relies heavily on testing to find out flaws in my implementation. Therefore, I created unit tests, and kept adding test cases whenever anything goes wrong. I also made an effort to understand which kind of coding style in the program caused issues for my optimization algorithms to create a minimal test case for it, instead of simply copying the whole benchmark program.

0 replies

bennyrubin · 2023-09-08T11:43:49Z

bennyrubin
Sep 8, 2023

Summary

For this task I implemented

TDCE
Local Value Numbering

Implementation & Testing

I followed the template from class and carefully reconstructed a few examples myself to get the intuition behind how the solution worked, before starting to code it up. As we discussed in class, I used my tdce tool to clean-up unused instructions after my LVN pass. I implemented commutativity by canonicalizing the order of my arguments in the tuple for commutative operations. I initially implemented copy propagation, but was spending a lot of time working on "variable clobbering", so I left it for another day.
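
Canonicalizing argument order for commutative operations can look roughly like the Python sketch below. The helper name `value_key` and the exact contents of `COMMUTATIVE` are assumptions for illustration; the set lists the core Bril operations for which swapping operands does not change the result.

```python
# Core Bril ops whose result does not depend on operand order.
COMMUTATIVE = {"add", "mul", "eq", "and", "or"}

def value_key(op, arg_nums):
    """Build the hashable LVN value for `op` applied to value numbers.
    Commutative ops get their operands sorted, so `add 1 2` and
    `add 2 1` map to the same table entry."""
    nums = sorted(arg_nums) if op in COMMUTATIVE else list(arg_nums)
    return (op, *nums)
```

Sorting value numbers (rather than variable names) keeps the key stable even when the same value is reachable through different variables.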

After doing a few of these assignments, I don't think python is particularly well suited for these tasks, considering the types of annoying runtime errors I was encountering that a "smarter" compiler could point out upfront. This led me to write some pretty hacky code. I might either switch up which language I use for the future assignments or start annotating more types in my programs.

To start with testing, I used a few small handcrafted examples from the bril repo, just to make sure my LVN program was sensible. Then I used brench on all of the core benchmarks. A few of them broke, so I manually compared 2 or 3 to my own implementation and fixed a couple subtle bugs. Then, my implementation worked on all the benchmarks. The speed-up was interesting. I suppose it heavily depends on the source bril program because some only had a couple instructions fewer, but some had a few hundred instructions fewer. I think the compiler generated bril programs are particularly amenable to LVN speedup.

Difficulties

I found quite a few challenges with transforming the pseudo-code from class into a real implementation. There were lots of corner cases I had not considered, and I had to be extra careful with how I dealt with certain types of instructions or else the whole thing would break. Debugging the LVN solution was particularly frustrating. It took a while to compare the "correct" bril program with the output of my LVN tool to see what was going wrong. A particularly funny bug I found was that python treats True and 1 as equivalent when hashing, which was breaking some of the programs. I had to add logic to treat them differently. Finally, I didn't think hard about how to optimize things like Ptrs, but I didn't want my LVN to break those programs, so I explicitly state in the code which instructions I support optimizing and which I don't.
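
The True/1 bug is easy to reproduce. In Python, `bool` is a subclass of `int`, so `True == 1` and `hash(True) == hash(1)`, and the two constants collide as dictionary keys. The sketch below shows the collision and one possible fix; the helper `const_key` is hypothetical and not necessarily how this post's implementation handles it.

```python
# A naive value table keyed on ("const", value).
naive_table = {("const", 1): "x"}
# `const true` finds the entry for `const 1`:
collides = ("const", True) in naive_table

def const_key(value):
    """Hypothetical helper: tag the key with the Python type name so
    that 1 (int) and True (bool) are kept apart."""
    return ("const", type(value).__name__, value)

typed_table = {const_key(1): "x"}
# With the type tag, the bool constant no longer matches the int one.
still_collides = const_key(True) in typed_table
```

The same tagging also separates `1` from `1.0`, which compare and hash equal in Python as well.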

0 replies

jiahanxie353 · 2023-09-11T02:17:23Z

jiahanxie353
Sep 11, 2023

For this task, my summary is that I implemented

trivial dead code elimination and
local value numbering

For trivial dead code elimination, it's rather easy to eliminate globally unused variables and to kill any instruction that is reassigned before used. TDCE was "trivial" except when I erase the unused instructions, nlohmann json will leave an empty json object behind, which results in printing {} in the new json file. For local value numbering, I started out implementing the algorithm based on my own understanding. But later I found I didn't think of dealing with const values and their numbering mappings, and rename unused variables with fresh names. Later, I just followed the pseudocode given in the lesson, which made it easier to implement.
For testing, I used a bunch of hand crafted test cases from the bril repo, as well as my own test cases, in which I came up with numerous tricky examples to test out LVN. I also designed test cases intended for globally unused but not locally killed, and vice versa, to emphasize their different use cases. Finally, I wrote test cases to test out the optimization performance of combining LVN + TDCE.

0 replies

zachary-kent · 2023-09-11T03:58:52Z

zachary-kent
Sep 11, 2023

Summary

TDCE/LVN Implementation with Extensions

For this assignment, I implemented TDCE and LVN as described in class with additional constant propagation, copy propagation, algebraic identities, and constant folding optimizations. I decided to pivot and implement these tasks in Haskell, which I came to believe was an excellent language for the job. Although the algorithms were presented in an imperative manner during lecture, I used the effectful algebraic effects library to enjoy seamless pseudo-imperative programming where effects are tracked by the type system.

Implementation Challenges

The local and global TDCE implementations were fairly straightforward, following from the pseudocode from class.
LVN, however, was much more difficult to implement. To decouple the "renaming" and LVN step, I first developed a separate pass that renames all variables in a basic block that will be redefined later. This way, the input program to LVN is in "local SSA" and I did not have to worry about renaming variables on the fly.
The base LVN algorithm implementation actually worked the first time! Decoupling the renaming and numbering steps really made it easier to reason about program behavior.
Copy propagation was much more difficult to implement. Although variable renaming had allowed us to avoid variable clobbering in the base implementation, it rears its head once again here. The issue is that variables live-in to a basic block cannot be renamed, and can be easily clobbered if copy propagation is implemented naively. For example, consider the following Bril: naive copy propagation will replace id b in the third instruction with id a, which is incorrect because a has been redefined by that point.

@main(a: int) {
  b: int = id a;
  a: int = const 0;
  c: int  = id b;
}

The problem required some deep thinking to solve. The insight I eventually had was that the var2num table implicitly partitions the variables into equivalence classes; when a variable x with value number i is redefined, I move it out of its equivalence class and select a different variable from the equivalence class for value number i (that is, the set { y : var2num[y] = i }) as its representative and update the value number table accordingly. This resulted in a robust, effective solution.
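
The re-selection step can be sketched in Python as follows (Python is used here only for brevity; the implementation in this post is in Haskell). The function name `clobber`, the `home` dictionary, and picking the alphabetically first remaining member are all assumptions made for illustration.

```python
def clobber(dest, var2num, home):
    """Call just before `dest` is redefined.

    var2num: variable name -> value number
    home:    value number  -> canonical variable name
    """
    num = var2num.pop(dest, None)
    if num is None or home.get(num) != dest:
        return  # dest was not the representative; nothing to repair
    # Remaining members of the class { y : var2num[y] == num }.
    others = sorted(y for y, n in var2num.items() if n == num)
    if others:
        home[num] = others[0]   # promote another member
    else:
        del home[num]           # the value no longer lives in any variable
```

On the Bril example above, after `b: int = id a` both `a` and `b` carry the same value number with `a` as its home. Calling `clobber("a", ...)` before `a: int = const 0` promotes `b`, so the later `id b` is left reading `b` rather than the overwritten `a`.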
Algebraic identities were fairly easy to implement; when implementing an Eq instance for value tuples, I simply baked in the relevant commutativity properties.

Testing and Results

I leveraged brench to test my optimizations over all benchmarks and ensure they were correct and effective. I also wrote several small sanity-checking diff tests with turnt to ensure that my optimizations behaved as intended. The distribution of benchmark speedups is below, where the horizontal axis is the ratio "optimized dynamic insts / original dynamic insts", and the vertical axis is frequency. Overall, I achieved a mean 13.4% speedup across all benchmarks, which is pretty impressive! The speedups were much better in benchmarks generated from the TypeScript frontend.

0 replies

janpaulpl · 2023-09-15T04:59:44Z

janpaulpl
Sep 15, 2023

Source

Trivial dead code elimination
An attempt at LVN
Benchmark statistics

Implementation and issues

For this assignment, I implemented a trivial dead code elimination and a local value numbering algorithm. I initially attempted a functional approach in OCaml, which wasn’t successful due to the imperative nature of code presented in class. I kept getting issues on how to deal with my state, which was trivially solved by converting all my code to Python for TDCE. However, on LVN, my issues in the imperative world persisted. I am not sure exactly why my python LVN implementation keeps getting timed out on larger examples, albeit it seems to at least be functional (although not optimal) on smaller test cases. I would prefer to take some more time and re-implement LVN in C++ with a closer attention to the pseudo-code to get it working further.

Benchmark analysis

The trivial dead code elimination seems to have worked as expected! I used the entire benchmark directory under the bril source code and the Brench tool for automation. Consistently for larger examples, TDCE would be more efficient by a bit. However, my attempt at a working LVN didn’t succeed fully, either timing out for bigger examples or having the same performance as no optimizations.

Leftmost bar is baseline, followed by dce and lvn

Benchmark	Optimization	Result
is-decreasing	baseline	127
is-decreasing	dce	127
is-decreasing	lvn	127
sum-divisors	baseline	159
sum-divisors	dce	159
sum-divisors	lvn	159
palindrome	baseline	298
palindrome	dce	298
palindrome	lvn	298
relative-primes	baseline	1923
relative-primes	dce	1914
relative-primes	lvn	1923
birthday	baseline	484
birthday	dce	483
birthday	lvn	timeout

0 replies
