User:Yosh/ProtoFlux Notes

Optimization Notes

Notes on ProtoFlux optimization. Most of these only become relevant in extreme cases (such as making parsers or extremely large-scale projects), but it's nonetheless good to write down.

Some of these notes--most notably Bit Manipulation and Strings As Lookup Tables--most likely arise because of how old the .NET version is that Resonite currently runs, that being .NET Framework 4.6.2. Such an old compiler isn't quite optimized compared to those nowadays, making some of those ancient techniques that are usually resolved by compilers be relevant again. They might become much less relevant once the .NET 8 update happens, but we can only wait and see.

Dynamic Impulses

Dynamic Impulses on slots containing a lot of children are slow, at least when a lot happen in one update. I think it's just because of searching for the intended receivers. Not sure. In any case, if you only need one receiver for an impulse, you can directly reference the receiver node as the slot to trigger a dynamic impulse on. This makes a considerable speedup from my tests on slots of ~400 children and thousands up dynamic impulses per update.

Impulse Multiplexers Are Fast

Impulse multiplexers themselves are fast--O(1) that is. They have two conditionals (num < 0 || num >= Impulses.Count) and that's really it. It's how you obtain the index that can be the real bottleneck of code, and you would have to figure out that bottleneck yourself first (usually, any Index Of First Match for this index is pretty slow).

One interesting thing about impulse multiplexers is that any index outside the bounds of the multiplexer causes the impulse to drop, which is useful for parsers. You can combine this with the string as lookup table part to filter out any specific characters (such as whitespace characters) by setting them to 1 until you reach a non-whitespace character, which will be 0. This prevents the need to use a separate if statement or evaluation while staying fast due to using a LUT.

Reducing Branches

Branches are the main obstacle to avoid when trying to optimize code. Taking branches mean either waiting for the pipeline to complete before trying, or branch predicting and risking a miss. Compilers nowadays are really good at compiling out what one would assume to be branches into faster code, but due to the old mono runtime Resonite uses, ProtoFlux's architecture, or both, branches aren't quite optimized out that well, which can lead to really slow code. Here's a few ways to reduce the amount of branches in ProtoFlux.

Bit Manipulation

Bit Manipulation is achieving a desired result by leveraging individual bits of a number, usually involving a bunch of shifts and bit operators like OR, AND, NOR, etc.

For a large list of bit manipulations and other optimization goodies along the line that you could utilize in ProtoFlux, take a look at Bit Twiddling Hacks.

Strings As Lookup Tables

This one is a pretty neat one that allowed me to reduce my parsing of a 50k character long JSON string from ~260ms (using a two-input index of first match) to ~160ms. If you're making a parser and can be sure that the "relevant" characters (for e.g. escaping or determining a type) lie in pure ASCII, it doesn't take that much memory at all to make a lookup table. Here's how:

For your string, signify any "irrelevant" characters as the null character (0x00) and any "relevant" characters as something other than the null character. Whatever the relevant character value is depends on your use case, but it can go up to 65535 given that a string is just an array of chars. For my JSON string parser, I made the relevant values decimal 1 and 2 for backslash and quote. These relevant values would then directly change the state and plug into an impulse multiplexer for determining what to do next. You can either make this string in game, which would involve either manual character entering (pain) or a ton of string multiply -> concat string char (also pain), or you can create it externally with a hex editor and import it as a .txt file. One thing to be careful about when using a hex editor is to not use any bytes above 0x7F, since importing a file with bytes above 0x7F will corrupt the data. Or, if you're smart, you can make the file UTF-8 from the get-go so it works... but that's also pain. I've personally never needed any values higher than 0x7F myself

Reducing Nodes

It's not just branches that can be reduced. Reducing nodes in general can pose a non-insignificant speedup. I highly recommend testing if reducing nodes actually matter for your use case before committing to it at all, but here's a few examples of what I have experienced.

Index Of First Match<char> VS Index Of String

An Index Of First Match<char> and an Index Of String node should theoretically be relatively the same thing with the same speed. You have to iterate over all the characters of the inputs or input string, comparing each one to the desired character. However, from my testing, Index Of String scales *much* better than Index Of First Match. This was mainly relevant for my "replacing characters with other characters" bit of JSON string parsing, but the results were quite surprising, with Index Of String being quite faster for large strings with ~8 needed replacements. I don't have concrete numbers, but I will probably add them later.

Reducing String Writes

Strings are more expensive than you might think. There's no real concept of slices in C# or ProtoFlux, so any string operation creates a completely new string that needs to completely override the old string. Consider the following pseudocode:

String input         // user-provided input
String output = ""   // starts empty

// define a state enum where the values are Normal and Escape
// in protoflux, you'd make this an integer. this is for readability
enum State: { Normal, Escape }
State state = Normal

for (i from 0 to input.length - 1):
  switch (state):                   // no fallthrough
    case Normal:
      if (input[i] == '\'):
        state = Escape
      else:
        output += input[i]
    case Escape:                    // we want to use \n to represent a newline
      if (input[i] == 'n'):
        output += newline
      else:
        output += input[i]
      State = Normal                // go back to the normal state
}

Intuitively, one would think that output is a fixed memory location (that can get reallocated when needed), and "adding" to it is simply adding another character to the end of that same thing. However, this is not how that works in C# due to string being immutable--and thus ProtoFlux has this problem too. Each write is a completely new string at a new memory location generated dynamically from the concatenation. This starts to get really slow as strings get longer and longer.

In C#, the solution is to use [StringBuilder](https://learn.microsoft.com/en-us/dotnet/api/system.text.stringbuilder) to efficiency build a string in this manner, but we don't have access to that in ProtoFlux. As such, the next best option is to buffer these writes. If most of the string is expected to be unchanged, then writes can be significantly reduced by only writing when you really need to. Here's how you'd rewrite the code to buffer these writes:

String input
String output = ""

// add the NormalInit state here for "initializing" the buffer
enum State: { NormalInit, Normal, Escape }
State state = NormalInit

int buffer_start

for (i from 0 to input.length - 1):
  switch (state):
    case NormalInit:
      buffer_start = i              // start of write buffer is the current index
      state = Normal
        // FALLTHROUGH TO NEXT CASE //
    case Normal:
      if (input[i] == '\'):
        // perform the actual write since input == output until here
        output = output + input.substring(buffer_start, i - buffer_start)
        state = Escape
    case Escape:
      if (input[i] == 'n'):
        output = output + newline
      else:
        output = output + input[i]
      State = NormalInit            // this causes the buffer start to be correct
}

// append the rest to the output
output = output + input.substring(buffer_start)

This causes the amount of writes to be linear with the amount of escaping we need to do. Since we expect most of the string to be "normal", and only escaping when we need a newline, this will result in a significant speedup. For my JSON string parser I mentioned earlier, it went from handling a 50k character long string without any escapes in a full second or two to ~260ms (this was before the string lookup table mentioned before).