Optimization guidelines

From Resonite Wiki
<!--T:35-->This is a community-generated page to help other users discover how to optimize the performance of their creations in Resonite. As with the whole Wiki, anyone is free to edit this page. However, since this page makes specific technical recommendations, it is best if information is backed up with an official source (e.g. a Discord comment from one of the Resonite development team).
<languages/>


This page details information and various avenues related to optimizing assets or creations in Resonite, ranging from high-level concepts to technical details. Not everything on this page will be applicable to every creation, and the more technical parts of this page can result in diminishing returns with more effort/time. Therefore, do not worry too much about following everything on this page all the time, as just being aware of the high-level concepts will be enough for most users.


== Rendering ==


<!--T:42-->
Resonite currently uses [https://learn.microsoft.com/en-us/windows/win32/direct3d11/atoc-dx-graphics-direct3d-11 Direct3D 11] for rendering in [https://unity.com/ Unity]. The [https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm D3D 11 technical specifications] provide the most comprehensive information about the low-level pipeline, but are generally not useful to read directly.


=== Textures ===


[[File:StaticTexture2D format and memory display.webp|thumb|right|The [[Component:StaticTexture2D|StaticTexture2D]] and [[Component:StaticTexture3D|StaticTexture3D]] components will display the resolution of a texture, compression format that the texture is using, and total VRAM usage of the texture + its mipmaps.]]


{{hatnote|More information: [[Texture compression]]}}


<!--T:4-->
There are two main metrics that can be optimized with textures: ''asset size'' and ''VRAM size''.


''Asset size'' is the size of the original asset file on disk (i.e. PNG, WEBP, etc.). Optimizing this metric is useful for saving on cloud storage space. If a texture has <code>DirectLoad</code> enabled, then reducing the asset size will also reduce the transfer size of the asset to other clients. Otherwise, the transfer size is determined by the variant size, not the asset size, which is more out of your control.


''VRAM size'' is the size that the texture will take up in [https://en.wikipedia.org/wiki/Video_random-access_memory video RAM] on the GPU. This metric is more important to optimize, as high VRAM usage will cause performance drops. VRAM size should be kept as low as reasonably possible.


<!--T:7-->
Textures should only be as big as is necessary to express a "good enough" visual on a mesh, which will save on both asset size and VRAM. For avatars, a 2048x2048 ("2k") texture will usually be enough for most purposes. Larger textures may be needed for noisy normal maps, incredibly large meshes (e.g. splat maps), or texture atlasing.


Textures should almost always use some form of [[Texture compression#Block_compression|block compression]] to reduce their VRAM size. The only exceptions to this should be low-resolution textures that require color perfection (e.g. pixel art).


Square textures provide no real optimization benefit over non-square textures. Non-power-of-two textures are also not worse than power-of-two textures on any GPU made after the late 2000s. However, it is generally recommended to keep texture dimensions a power of two for better [[mipmap]] generation. At the very least, texture dimensions should be a multiple of 4 as this makes texture block compression not use any extraneous space.
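As an illustration of why dimensions that are multiples of 4 matter, here is a rough Python sketch (an illustration, not Resonite code) estimating the VRAM footprint of a block-compressed texture with a mip chain. The 16-bytes-per-block figure corresponds to formats like BC7; other block-compression formats use different block sizes.

```python
import math

def bc_vram_bytes(width, height, bytes_per_block=16, mipmaps=True):
    """Rough VRAM estimate for a block-compressed 2D texture.

    Block compression stores the image as 4x4 pixel blocks, so any
    dimension that is not a multiple of 4 gets padded up to whole
    blocks, wasting a little space at the edges.
    """
    total = 0
    w, h = width, height
    while True:
        blocks_x = math.ceil(w / 4)  # padded to whole 4x4 blocks
        blocks_y = math.ceil(h / 4)
        total += blocks_x * blocks_y * bytes_per_block
        if not mipmaps or (w == 1 and h == 1):
            break
        w, h = max(1, w // 2), max(1, h // 2)  # next mip level
    return total

# A full mip chain adds roughly one third on top of the base level:
print(bc_vram_bytes(2048, 2048, mipmaps=False))  # 4194304 bytes (4 MiB)
print(bc_vram_bytes(2048, 2048))                 # 5592432 bytes (~5.33 MiB)
```

Note how an uncompressed 2048x2048 RGBA texture would instead take 4 bytes ''per pixel'' (16 MiB for the base level alone), which is why block compression is almost always worth it.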


If you are working with many small textures that require the same material, consider [https://en.wikipedia.org/wiki/Texture_atlas atlasing] the textures into one large texture, avoiding large empty spaces in the resulting atlas as they waste VRAM. All conventional materials have a <code>TextureScale</code> and <code>TextureOffset</code> field to manage atlasing at the shader level, which is much more efficient than using many individual textures. The [[Component:UVAtlasAnimator|UVAtlasAnimator]] component can be used to easily change these two fields, assuming a rectangular atlas of uniform sprite sizes. Note that atlasing does not help if parts of the mesh need different material types altogether (e.g. a [[Component:FurMaterial|Fur]] part on a mostly PBS avatar), or different blend modes (e.g. one part needs Alpha blend while the rest is fine with Opaque or Cutout), as those parts still require separate materials.
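To illustrate the shader-level atlasing math, here is a small Python sketch (a hypothetical helper, not Resonite code) computing the kind of values <code>TextureScale</code> and <code>TextureOffset</code> hold to select one cell of a uniform grid atlas. It assumes cell (0, 0) sits at the origin of the texture's UV space.

```python
def atlas_cell_uv(column, row, columns, rows):
    """UV scale and offset selecting one cell of a uniform grid atlas.

    Returns (scale, offset) as (u, v) pairs: scale shrinks the UVs down
    to one cell, offset shifts them to the chosen column/row.
    """
    scale = (1.0 / columns, 1.0 / rows)
    offset = (column / columns, row / rows)
    return scale, offset

# Select the cell in column 2, row 1 of a 4x4 sprite atlas:
scale, offset = atlas_cell_uv(2, 1, 4, 4)
print(scale, offset)  # (0.25, 0.25) (0.5, 0.25)
```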


<!--T:11-->
=== Blendshapes ===
 
For every frame that a mesh is being rendered, if there is any blendshape set to a nonzero value, every vertex on the mesh is recalculated to account for blendshape distortions. This calculation does <em>not</em> scale with blendshape amount: 1 nonzero blendshape will be about as heavy as several nonzero blendshapes. As such, it is generally best practice to separate out parts of a mesh that are not affected by blendshapes and to bake non-zero blendshapes that do not change on a mesh.
 
However, no matter the values of the blendshapes, when a skinned mesh is first spawned (e.g. when an avatar loads), every blendshape must be calculated. Depending on the vertex count, this can reduce framerate while the mesh is being loaded and cause the mesh to take several seconds to appear. It is recommended to separate out parts of a mesh with high-density blendshapes, such as a head with face-tracking blendshapes, from the rest of the mesh. This can improve mesh loading time while hurting framerate less.
 
Resonite can attempt to automatically do blendshape optimizations with the <code>Separate parts of mesh unaffected by blendshapes</code> and <code>Bake non-driven blendshapes</code> buttons found under the [[Component:SkinnedMeshRenderer|SkinnedMeshRenderer]] component. Additionally, there are functions to bake a blendshape by index or split a blendshape across an axis. Whether these optimizations are worth it or not varies on a case-by-case basis, so you'll have to test your before/after performance with nonzero blendshapes to be sure.
 
=== Materials ===
 
Some materials, notably the [[Component:FurMaterial|Fur material]], are much more expensive than others. Take care in using the least-expensive materials that accomplish your goal.
 
Using the <code>Opaque</code> or <code>Cutout</code> blend mode on a material is generally advised whenever possible, as these blend modes use the deferred rendering pipeline and handle dynamic lights better. The <code>Alpha</code>, <code>Transparent</code>, <code>Additive</code>, and <code>Multiply</code> blend modes are all treated as transparent materials and use the forward rendering pipeline, which is generally more expensive and doesn't handle dynamic lights as consistently.
 
=== Procedural assets ===
 
If you are not driving the parameters of a procedural mesh, then you can save performance by baking it into a static mesh. Procedural meshes and textures are per-world, as the procedural asset is duplicated with the item. Static meshes and textures are automatically instanced across worlds so there's only a single copy in memory at all times and do not need to be saved on the item itself.
 
=== GPU mesh instancing ===
 
[https://docs.unity3d.com/2019.4/Documentation/Manual/GPUInstancing.html '''Mesh instancing'''] is when multiple copies of the same mesh are drawn at once, reducing the amount of draw calls to the GPU. If you have multiple instances of the same static mesh/material combination, they will be instanced on most shaders. This can significantly improve performance when rendering multiple instances of the same object, such as trees or generic buildings.
 
[[Component:SkinnedMeshRenderer|SkinnedMeshRenderers]] are not eligible for GPU instancing. Additionally, different [[material property block|material property blocks]] will prevent instancing across different copies of the same mesh, even if the underlying material is the same.


<!--T:12-->
=== Mirrors and cameras ===


Mirrors and cameras can be quite expensive, especially at higher resolutions in complex worlds, as they require additional rendering passes. Mirrors are generally more expensive than cameras, as they require two additional passes (one per eye).


The performance of cameras can be improved by using appropriate near/far clip values and using the selective/exclusive render lists. Basically, avoid rendering what you don't need to.


<!--T:14-->
In addition, it's good practice to localize mirrors and cameras with [[Component:ValueUserOverride|ValueUserOverride]] so users can opt in if they're willing to sacrifice performance to see them.


The complexity of the scene being rendered will affect how much a camera or mirror costs at a given resolution. As a rule of thumb, a 1024x1024 mirror can work well on an integrated GPU, while a 4096x4096 mirror (normally fine on a dedicated GPU) tends to ruin performance. It is recommended to provide mirror or camera resolution options in worlds or on items so users can choose what resolution suits them best.
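The reason resolution matters so much is that the per-pass cost scales roughly with the number of pixels rendered, so it grows quadratically with resolution. A quick illustrative calculation (a simplification, not an actual Resonite metric):

```python
def relative_mirror_cost(resolution, baseline=1024):
    """Relative per-pass pixel count of a square mirror/camera render,
    normalized to a baseline resolution. Real cost also depends on
    scene complexity, but pixel count sets the scaling."""
    return (resolution * resolution) / (baseline * baseline)

for res in (512, 1024, 2048, 4096):
    print(res, relative_mirror_cost(res))  # 0.25, 1.0, 4.0, 16.0
```

A 4096x4096 mirror pushes 16 times the pixels of a 1024x1024 one per pass, which is why quality options for mirror and camera resolution help so much.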


<!--T:16-->
=== Reflection probes ===


<!--T:17-->
[[File:Reflection probe timer changes setup.jpg|thumb|right|A simple timer setup for a reflection probe to update every 5 seconds, which can be changed with the <code>_speed</code> field on the Panner1D. The <code>ChangesSources</code> field points to the ValueField&lt;int&gt;.]]


<!--T:36-->
Baked [[Component:ReflectionProbe|reflection probes]] are quite cheap, especially at the default resolution of 128x128. The only real cost is the VRAM used to store the cube map. Even then, at low resolutions, the VRAM usage is insignificant.


Realtime reflection probes are extremely expensive and are comparable to six cameras. Additionally, if the change sources for an <code>OnChanges</code> reflection probe update frequently, then the probe will be no better than a realtime reflection probe. If the change sources update continuously yet subtly, consider setting the change sources to something hooked up to a timer to only update the reflection probe every few seconds.
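The timer trick above is essentially a throttle. This Python sketch (an analogy in plain Python, not Resonite components) shows the behavior it achieves: however often the change sources fire, the expensive probe update runs at most once per interval.

```python
import time

class ThrottledProbeUpdater:
    """Illustrative analogue of throttling reflection probe updates:
    no matter how often the change sources fire, the update runs at
    most once every `interval` seconds."""

    def __init__(self, interval, update_fn):
        self.interval = interval
        self.update_fn = update_fn
        self._last = float("-inf")  # so the first change always updates

    def on_source_changed(self):
        now = time.monotonic()
        if now - self._last >= self.interval:
            self._last = now
            self.update_fn()  # e.g. re-render the probe's cube map

renders = []
probe = ThrottledProbeUpdater(5.0, lambda: renders.append("rendered"))
for _ in range(1000):          # sources changing continuously...
    probe.on_source_changed()  # ...but only the first call re-renders
print(len(renders))            # 1
```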


<!--T:37-->
=== Lighting ===


Light impact is proportional to how many pixels a light shines on. This is determined by the size of the visible light volume in the world, regardless of how much geometry it affects. Short range or partially occluded lights are therefore cheaper.


<!--T:19-->
Lights with shadows are much more expensive than lights without. In deferred shading, shadow-casting meshes still need to be rendered once or more for each shadow-casting light. Furthermore, the lighting shader that applies shadows has a higher rendering overhead than the one used when shadows are disabled.


<!--T:39-->
Point lights with shadows are very expensive, as they render the surrounding scene six times. If you need shadows, try to keep them restrained to a spot or directional light.


It is possible to control whether a [[MeshRenderer (Component)|MeshRenderer]] or [[SkinnedMeshRenderer (Component)|SkinnedMeshRenderer]] component casts shadows using its <code>ShadowCastMode</code> enum value. Changing this to <code>Off</code> may be helpful if users wish to have some meshes cast shadows, but not all (and hence don't want to disable shadows on the relevant lights). Alternatively, there may be some performance benefit in turning off shadow casting for a highly detailed mesh and placing a similar, but lower-detail, mesh at the same location with <code>ShadowCastMode</code> set to <code>ShadowOnly</code>.


<!--T:20-->
== ProtoFlux ==


How much ProtoFlux optimization matters for you depends on what type of ProtoFlux code you are making. The optimization requirements for, say, incredibly hot loops with a lot of calculation are not the same as for drives that only get evaluated at most once per frame. ProtoFlux optimization can be a deep rabbit hole, and it's important not to overdo it at the expense of readability or simplicity.


This section mainly focuses on "everyday user" flux, disregarding complex cases or hot loops for maximum performance gain. It is neither exhaustive nor deep. If you are interested in extreme ProtoFlux optimization, check out the [[ProtoFlux optimization]] page.


<!--T:33-->
=== Overall ===


Less ProtoFlux is not always better! It's more important to make your calculations do less work than it is to fit them in a smaller space. Users should also not feel pressured to avoid ProtoFlux because they believe using 'only components' is somehow more optimal. There are many cases where specific components are a better solution than several ProtoFlux nodes and vice versa; the decision mainly comes down to using the best tool for the job.


<!--T:23-->
=== Writes and drives ===


<!--T:24-->
[[Impulse]] chains, and thus writes to a field, make evaluation flow explicit, while [[drive]] chains are more implicit about when nodes get evaluated. Careful impulse chains that listen for changes yourself give you more control over when things are re-evaluated than letting the game decide for a drive chain.
Generally it's cheaper to perform computations locally and avoid network activity, but for more expensive computations it's better to have one user do it and sync the result.


By default, writing to a field will incur network traffic at the end of the [[update]], as the change to the data model needs to be sent to the other users in the session. Multiple writes can be done in one update and only the last write will induce traffic. To prevent network traffic entirely, one can either drive a field or use a self-driven [[Component:ValueCopy|ValueCopy]] component with writeback to "localize" the field for writing. Note that [[Component:ValueUserOverride|ValueUserOverride]] does not remove this network activity, as the overrides themselves are Syncs.
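The coalescing behavior can be sketched with a toy analogy in Python (not FrooxEngine code): pending writes per field are collected during the update, and only the last value per field is replicated when the update ends.

```python
class FakeSession:
    """Toy analogy of write coalescing: multiple writes to the same
    field within one update produce a single network change."""

    def __init__(self):
        self.pending = {}  # field -> last value written this update
        self.sent = []     # changes actually replicated to other users

    def write(self, field, value):
        self.pending[field] = value  # later writes overwrite earlier ones

    def end_of_update(self):
        for field, value in self.pending.items():
            self.sent.append((field, value))  # one change per field
        self.pending.clear()

session = FakeSession()
session.write("Position", 1.0)
session.write("Position", 2.0)
session.write("Position", 3.0)
session.end_of_update()
print(session.sent)  # [('Position', 3.0)] -- only the last write is sent
```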
=== Dynamic variables ===

[[Dynamic Variables]] are extremely efficient and can be used without concern for performance. However, creating and destroying Dynamic Variable Spaces can be costly and should be done infrequently. Care should also be taken when driving dynamic variables; while this is possible, you must ensure that ''every'' instance of the same [[DynamicValueVariable`1 (Component)|DynamicValueVariable]] or [[DynamicReferenceVariable`1 (Component)|DynamicReferenceVariable]] is driven as well.


<!--T:26-->
=== Dynamic impulses ===


[[Dynamic impulses]] are used to dynamically call "receivers" of some tagged impulse under a slot hierarchy. This works by recursively looking at all children of the input slot given to a [[ProtoFlux:DynamicImpulseTrigger|trigger node]] and checking whether there is a receiver. As such, it is recommended to minimize the scope of a dynamic impulse as much as is reasonable if the impulse is called very frequently.
=== Frequent impulses ===

High-frequency impulses from the [[Update (ProtoFlux)|Update]] node, [[Fire While True (ProtoFlux)|Fire While True]], and similar should be avoided where possible if the action results in network replication. Consider replacing them with [[Drive|drives]].


=== Continuously changing drive chains ===


Certain nodes are marked with an attribute called [[ContinuouslyChanging]]. This will cause any [[listener node]] that uses the output of such a node to re-evaluate the entire input chain every update. A list of these nodes can be found on the [[:Category:ContinuouslyChanging nodes]] page.


[[ProtoFlux:Continuously Changing Relay|Continuously Changing Relay]] simply passes through the value, but since the node is marked as Continuously Changing, everything to the right of it also becomes Continuously Changing.
If potentially heavy nodes, such as [[ProtoFlux:BodyNodeSlot|BodyNodeSlot]] or [[ProtoFlux:FindChildByName|Find Child By Name]], are used in such a chain, this can be a silent killer of performance depending on how heavy the chain is and how many instances of the flux exist.
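The difference can be sketched in Python (an analogy, not ProtoFlux): recomputing a heavy lookup on every update versus computing it once and reading a cached value afterwards.

```python
calls = []

def find_child_by_name(root, name):
    """Stand-in for a heavy hierarchy search such as Find Child By Name."""
    calls.append(name)  # count how often the heavy work actually runs
    return f"<slot {name}>"

# Anti-pattern: the heavy search sits in a continuously changing chain,
# so it re-runs on every single update.
for frame in range(60):
    target = find_child_by_name("Root", "LeftHand")
assert len(calls) == 60

# Better: run the search once when it actually needs updating and read
# the stored result afterwards (the equivalent of writing to a Store).
calls.clear()
stored = find_child_by_name("Root", "LeftHand")
for frame in range(60):
    target = stored  # cheap read, no search
print(len(calls))    # 1
```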


If you are using any heavy node in a chain that is marked continuously changing, it is highly recommended to see whether the result of the heavy node can instead be written to a [[ProtoFlux:Store|Store]] whenever it needs to be updated. This allows the heavy node to not be re-evaluated every update.

=== Sample Color node === <!--T:34-->
[[ProtoFlux:Sample ColorX|Sample Color]] is an inherently expensive node to use, as it works by rendering a small and narrow view. It is best used sparingly. The performance cost can be reduced by limiting the range that must be rendered using the NearClip and FarClip inputs.


=== Measuring ProtoFlux runtime ===


The runtime of a ProtoFlux [[Impulses|Impulse]] chain can be measured using [[ProtoFlux:Utc_Now|Utc Now]] and a single variable storing its value just before executing the measured code. Subtracting this value from the one evaluated just after the execution yields the duration it took to run.
Take note that this does not measure indirect performance impacts like increased network load or physics updates!
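The pattern can be sketched in Python (an analogue of the Utc Now technique, not ProtoFlux itself): store the current time in a variable just before the measured code runs, then subtract it from the time just after.

```python
from datetime import datetime, timezone

def measure(impulse_chain):
    """Python analogue of the Utc Now measurement pattern: store the
    time just before the measured code runs, subtract just after.
    This only measures direct runtime, not indirect costs such as
    network load or physics updates."""
    start = datetime.now(timezone.utc)  # the 'Utc Now' stored in a variable
    impulse_chain()                     # the impulse chain being measured
    return datetime.now(timezone.utc) - start

duration = measure(lambda: sum(range(100_000)))
print("took", duration.total_seconds(), "seconds")
```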


The most sensible approach would be to create an [[Impulses|Impulse]]-based measurement that measures the runtime of a simple [[Protoflux:Write|Write]] node connected to the same expression as the drive.


If you{{em dash}}for whatever reason{{em dash}}need to compute the execution time during drive evaluation, you could abuse the order of evaluations:


<gallery widths=480px heights=200px>
</gallery>


== Common performance myths ==


A few myths occasionally propagate regarding optimization. This list addresses the most common ones:


* In general, [[slot count]] does not inherently matter for the performance of an object. Should anything iterate over the slots or elements of an object, it will matter, but not in the general case. Likewise, packing ProtoFlux nodes onto a single slot does not reduce loading and saving costs for complex setups, as Resonite still has to load and save the exact same number of components.
* In general, there is no notable difference in performance between using ProtoFlux versus components for driving values. Performance differences would occur on a case-by-case basis depending on the complexity of the component and the potential re-implementation in ProtoFlux.
 
== Culling ==
 
{{hatnote|More information: [[Culling]]}}
 
'''Culling''' refers to not rendering or processing specific parts of a world to reduce performance costs. Resonite currently has built-in [[Culling#Built-in culling|frustum culling]] for rendered meshes, but this does not affect CPU-based work like ProtoFlux or components. It is recommended to set up some form of additional world-based culling if your world is large and complex; very efficient culling systems can be built with the [[Component:ColliderUserTracker|ColliderUserTracker]] component to detect when a user is inside a specific collider (though, as it works on colliders, it cannot detect users in no-clip). Avoid toggling the colliders themselves on and off very frequently, however: collider costs are already heavily optimized under the hood, as colliders are only checked when relevant, and toggling them regularly can disrupt that optimization and may even be ''more'' expensive. There are also user-made systems to cull only user avatars in a world for a potential per-session performance increase with a large number of users.
 
== Profiling ==
 
'''[https://en.wikipedia.org/wiki/Profiling_(computer_programming) Profiling]''', in general, is the act of measuring what and where is taking up the most of a particular resource. Profiling can be used to help determine what is taking up the most CPU or GPU frame time in a world. Some methods of profiling a world or item include:
 
* The '''Debug''' facet on the home screen of the [[dash menu]] provides timings for various aspects of Resonite. The two most useful tabs here are the ''Worlds'' and ''Focused World'' tabs, the latter of which will show the time it takes to execute every part of the world [[update|update cycle]].
* The [https://github.com/esnya/ResoniteMetricsCounter ResoniteMetricsCounter] mod provides timings for various slot hierarchies, protoflux chains, and components in the current session.
** Do not take the exact timings at face value, as there is non-insignificant overhead from patching everything that is required for the mod to work. Nevertheless, it serves as an excellent tool for exploring what is <em>relatively</em> taking up the most frame time in a world.
* The [https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace dotnet-trace] tool is able to perform in-depth traces of FrooxEngine and enumerate what functions are taking up the most runtime. While it provides the most extensive and accurate profiling, it can't single out individual slot hierarchies, ProtoFlux groups, or components, and is generally for more advanced users.
* SteamVR has a "Display Performance Graph" that can show GPU frametimes. This can also be shown in-headset from a toggle in the developer settings (toggle "Advanced Settings" on in the settings menu)
* SteamVR has a "Display Performance Graph" that can show GPU frametimes. This can also be shown in-headset from a toggle in the developer settings (toggle "Advanced Settings" on in the settings menu)
* [https://store.steampowered.com/app/908520/fpsVR/ fpsVR] is a paid SteamVR tool that gives a lot of information about CPU and GPU timings, vram/ram usage and other metrics that can be very useful when profiling inside of VR.
* [https://store.steampowered.com/app/908520/fpsVR/ fpsVR] is a paid SteamVR tool that gives a lot of information about CPU and GPU timings, vram/ram usage and other metrics that can be very useful when profiling inside of VR.
[[Category:Optimization]]

Latest revision as of 00:17, 20 December 2025


This page details information and various avenues related to optimizing assets or creations in Resonite, ranging from high-level concepts to technical details. Not everything on this page will be applicable to every creation, and the more technical parts of this page can result in diminishing returns with more effort/time. Therefore, do not worry too much about following everything on this page all the time, as just being aware of the high-level concepts will be enough for most users.

== Rendering ==

Resonite currently uses [https://learn.microsoft.com/en-us/windows/win32/direct3d11/atoc-dx-graphics-direct3d-11 Direct3D 11] for rendering in [https://unity.com/ Unity]. The [https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm D3D 11 technical specifications] provide the most comprehensive information on the low-level pipeline, but are generally not useful to read directly.

=== Textures ===

''Image caption: The StaticTexture2D and StaticTexture3D components will display the resolution of a texture, the compression format that the texture is using, and the total VRAM usage of the texture and its mipmaps.''

{{hatnote|More information: [[Texture compression]]}}

There are two main metrics that can be optimized with textures: asset size and VRAM size.

Asset size is the size of the original asset file on disk (e.g. PNG or WebP). Optimizing this metric is useful for saving on cloud storage space. If a texture has DirectLoad enabled, then reducing the asset size will also reduce the transfer size of the asset to other clients. Otherwise, the transfer size is determined by the variant size, not the asset size, which is largely out of your control.

VRAM size is the size that the texture will take up in video RAM on the GPU. This metric is more important to optimize, as high VRAM usage will cause performance drops. VRAM size should be kept as low as reasonably possible.

Textures should only be as big as is necessary to express a "good enough" visual on a mesh, which will save on both asset size and VRAM. For avatars, a 2048x2048 ("2k") texture will usually be enough for most purposes. Larger textures may be needed for noisy normal maps, incredibly large meshes (e.g. splat maps), or texture atlasing.

Textures should almost always use some form of block compression to reduce their VRAM size. The only exceptions to this should be low-resolution textures that require color perfection (e.g. pixel art).

Square textures provide no real optimization benefit over non-square textures. Non-power-of-two textures are also not worse than power-of-two textures on any GPU made after the late 2000s. However, it is generally recommended to keep texture dimensions a power of two for better mipmap generation. At the very least, texture dimensions should be a multiple of 4 as this makes texture block compression not use any extraneous space.
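The arithmetic behind these recommendations can be sketched in Python. This is an illustrative estimate, not Resonite's actual allocator: it assumes a full mipmap chain, uncompressed RGBA8 at 32 bits per pixel, and 4x4 block compression (e.g. BC1 at 4 bits per pixel, BC7 at 8) with dimensions rounded up to a multiple of 4 per level.

```python
import math

def mip_chain_bytes(width, height, bits_per_pixel, block_size=1, mipmaps=True):
    """Approximate GPU memory for a texture and its mipmap chain.

    bits_per_pixel: 32 for uncompressed RGBA8, 8 for BC7, 4 for BC1.
    block_size: block-compressed formats store 4x4 texel blocks, so each
    mip level's dimensions are rounded up to a multiple of the block size.
    """
    total = 0
    w, h = width, height
    while True:
        bw = math.ceil(w / block_size) * block_size
        bh = math.ceil(h / block_size) * block_size
        total += bw * bh * bits_per_pixel // 8
        if not mipmaps or (w == 1 and h == 1):
            break
        w, h = max(w // 2, 1), max(h // 2, 1)
    return total

# A 2k RGBA8 texture is 16 MiB at the base level; the full mip chain
# adds roughly a third on top of that.
rgba8 = mip_chain_bytes(2048, 2048, 32)
# The same texture in BC7 (8 bits/pixel) is about a quarter of the size.
bc7 = mip_chain_bytes(2048, 2048, 8, block_size=4)
# A 1022x1022 block-compressed texture occupies the same blocks as
# 1024x1024, which is why multiples of 4 waste no space.
```

Note how a non-multiple-of-4 dimension pays for the padding anyway, while a power-of-two size halves cleanly at every mip level.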

If you are working with many small textures that require the same material, consider atlasing the textures into one large texture. All conventional materials have TextureScale and TextureOffset fields to manage atlasing at the shader level, which is much more efficient than using many individual textures. The UVAtlasAnimator component can be used to easily change these two fields, assuming a rectangular atlas of uniform sprite sizes.
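The scale/offset math for a uniform rectangular atlas is simple. Here is a hypothetical Python sketch of the computation such a setup performs (the function name is invented for illustration; sampled UV = uv * scale + offset, and whether row 0 is the top or bottom of the image depends on the texture's V-axis convention):

```python
def atlas_cell_uv(column, row, columns, rows):
    """UV scale/offset selecting one cell of a uniform rectangular atlas.

    Mirrors the role of a material's TextureScale/TextureOffset fields:
    the shader samples at uv * scale + offset.
    """
    scale = (1.0 / columns, 1.0 / rows)
    offset = (column / columns, row / rows)
    return scale, offset

# Select the cell in column 1, row 2 of a 4x4 atlas.
scale, offset = atlas_cell_uv(1, 2, 4, 4)
# scale == (0.25, 0.25), offset == (0.25, 0.5)
```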

=== Blendshapes ===

For every frame that a mesh is being rendered, if there is any blendshape set to a nonzero value, every vertex on the mesh is recalculated to account for blendshape distortions. This calculation does not scale with blendshape amount: 1 nonzero blendshape will be about as heavy as several nonzero blendshapes. As such, it is generally best practice to separate out parts of a mesh that are not affected by blendshapes and to bake non-zero blendshapes that do not change on a mesh.
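The cost model described above can be illustrated with a small Python sketch (not Resonite's actual skinning code, which runs on optimized native/GPU paths): once any weight is nonzero, the per-vertex loop runs over the whole mesh, so the dominant cost is vertex count, not the number of active shapes.

```python
def apply_blendshapes(base_vertices, blendshape_deltas, weights):
    """Per-frame blendshape evaluation sketch.

    If every weight is zero, nothing is recomputed. Once any weight is
    nonzero, every vertex is retransformed, which is why one active
    blendshape costs about as much as several.
    """
    active = [(blendshape_deltas[name], w)
              for name, w in weights.items() if w != 0.0]
    if not active:
        return list(base_vertices)  # fast path: no recompute needed
    out = []
    for i, (x, y, z) in enumerate(base_vertices):
        for deltas, w in active:
            dx, dy, dz = deltas[i]
            x, y, z = x + dx * w, y + dy * w, z + dz * w
        out.append((x, y, z))
    return out

verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = {"smile": [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)]}
moved = apply_blendshapes(verts, deltas, {"smile": 0.5})
# moved[0] == (0.0, 0.5, 0.0)
```

Separating unaffected geometry into its own mesh shrinks `base_vertices` for the part that actually needs the loop, which is exactly the optimization described next.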

However, when a skinned mesh is first spawned (e.g. when an avatar loads), every blendshape must be calculated, no matter its value. Depending on the vertex count, this can reduce framerate while the mesh is loading and cause the mesh to take several seconds to appear. It is recommended to separate out parts of a mesh with high-density blendshapes, such as a head with face tracking blendshapes, from the rest of the mesh. This improves mesh loading time and reduces the framerate hit while loading.

Resonite can attempt these blendshape optimizations automatically with the ''Separate parts of mesh unaffected by blendshapes'' and ''Bake non-driven blendshapes'' buttons found under the [[SkinnedMeshRenderer (Component)|SkinnedMeshRenderer]] component. Additionally, there are functions to bake a blendshape by index or split a blendshape across an axis. Whether these optimizations are worth it varies on a case-by-case basis, so you'll have to test your before/after performance with nonzero blendshapes to be sure.

=== Materials ===

Some materials, notably the Fur material, are much more expensive than others. Take care to use the least expensive materials that accomplish your goal.

Using the Opaque or Cutout blend mode on a material is generally advised whenever possible, as these blend modes use the deferred rendering pipeline and handle dynamic lights better. The Alpha, Transparent, Additive, and Multiply blend modes are all treated as transparent and use the forward rendering pipeline, which is generally more expensive and handles dynamic lights less consistently.

=== Procedural assets ===

If you are not driving the parameters of a procedural mesh, then you can save performance by baking it into a static mesh. Procedural meshes and textures are per-world, as the procedural asset is duplicated with the item. Static meshes and textures are automatically instanced across worlds, so there is only a single copy in memory at all times, and they do not need to be saved on the item itself.

=== GPU mesh instancing ===

Mesh instancing is when multiple copies of the same mesh are drawn at once, reducing the number of draw calls issued to the GPU. If you have multiple instances of the same static mesh/material combination, they will be instanced by most shaders. This can significantly improve performance when rendering many copies of the same object, such as trees or generic buildings.

SkinnedMeshRenderers are not eligible for GPU instancing. Additionally, different material property blocks will prevent instancing across different copies of the same mesh, even if the underlying material is the same.

=== Mirrors and cameras ===

Mirrors and cameras can be quite expensive, especially at higher resolutions in complex worlds, as they require additional rendering passes. Mirrors are generally more expensive than cameras, as they require two additional passes (one per eye).

The performance of cameras can be improved by using appropriate near/far clip values and using the selective/exclusive render lists. Basically, avoid rendering what you don't need to.

In addition, it's good practice to localize mirrors and cameras with ValueUserOverride so users can opt in if they're willing to sacrifice performance to see them.

The complexity of the scene being rendered affects how expensive a camera or mirror is at a given resolution. It is recommended to provide mirror or camera resolution options in worlds or items so users can choose the resolution that suits them best.

=== Reflection probes ===

''Image caption: A simple timer setup for a reflection probe to update every 5 seconds, which can be changed with the _speed field on the Panner1D. The ChangesSources points to the ValueField<int>.''

Baked reflection probes are quite cheap, especially at the default resolution of 128x128. The only real cost is the VRAM used to store the cubemap; even then, at low resolutions, the VRAM usage is insignificant.

Realtime reflection probes are extremely expensive and are comparable to six cameras. Additionally, if the change sources for an OnChanges reflection probe update frequently, then the probe will be no better than a realtime reflection probe. If the change sources update continuously yet subtly, consider setting the change sources to something hooked up to a timer to only update the reflection probe every few seconds.
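The timer-gating idea can be sketched abstractly in Python. This is not a Resonite API, just an illustration of the pattern: change notifications pass through a gate that only lets one through per interval, so a continuously varying source triggers a probe re-render every few seconds instead of every update.

```python
class ThrottledChangeSource:
    """Forward change notifications at most once per interval.

    Analogous to pointing an OnChanges reflection probe's change source
    at a timer instead of a continuously varying value.
    """

    def __init__(self, interval_seconds, on_update):
        self.interval = interval_seconds
        self.on_update = on_update
        self.last_fired = float("-inf")

    def notify(self, now):
        # Only propagate the change if enough time has passed.
        if now - self.last_fired >= self.interval:
            self.last_fired = now
            self.on_update()

updates = []
probe = ThrottledChangeSource(5.0, lambda: updates.append("render"))
# Six change notifications arriving over 11 seconds...
for t in [0.0, 0.1, 2.0, 5.0, 6.0, 11.0]:
    probe.notify(t)
# ...result in only three probe re-renders (at t = 0.0, 5.0, 11.0).
```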

=== Lighting ===

Light impact is proportional to how many pixels a light shines on. This is determined by the size of the visible light volume in the world, regardless of how much geometry it affects. Short range or partially occluded lights are therefore cheaper.

Lights with shadows are much more expensive than lights without. In deferred shading, shadow-casting meshes still need to be rendered once or more for each shadow-casting light. Furthermore, the lighting shader that applies shadows has a higher rendering overhead than the one used when shadows are disabled.

Point lights with shadows are very expensive, as they render the surrounding scene six times. If you need shadows, try to keep them restrained to a spot or directional light.

It is possible to control whether a MeshRenderer or SkinnedMeshRenderer component casts shadows using the ShadowCastMode enum value on the component. Changing this to 'Off' may be helpful if users wish to have some meshes cast shadows, but not all (and hence don't want to disable shadows on the relevant lights). Alternatively, there may be some performance benefit in turning off shadow casting for a highly detailed mesh and placing a similar, but lower detail, mesh at the same location with ShadowCastMode 'ShadowOnly'.

== ProtoFlux ==

How much ProtoFlux optimization matters for you depends on what type of ProtoFlux code you are making. The optimization requirements for, say, incredibly hot loops with a lot of calculation are not the same as for drives that only get evaluated at most once per frame. ProtoFlux optimization can be a great rabbit hole, and it's important not to try to do too much at the expense of readability or simplicity.

This section mainly focuses on "everyday user" flux, disregarding complex cases or hot loops tuned for maximum performance gain. It is not intended to be extensive or deep. If you are interested in extreme ProtoFlux optimization, check out the [[ProtoFlux optimization]] page.

=== Overall ===

Less ProtoFlux is not always better! It's more important to make your calculations do less work than it is to do them in a smaller space. Users should also not feel pressured to avoid ProtoFlux because they believe using "only components" is somehow more optimal. There are many cases where specific components are a better solution than several ProtoFlux nodes and vice versa; the decision mainly comes down to using the best tool for the job.

=== Writes and drives ===

Impulse chains, and thus writes to a field, allow for more explicit evaluation flow, while drive chains are more implicit about when nodes get evaluated. This means you have more control over when things are re-evaluated by using careful impulse chains and listening to values yourself, instead of letting the game decide as it does for a drive chain.

By default, writing to a field will incur network traffic at the end of the update. Multiple writes can be done in one update, and only the last write will induce traffic. To prevent network traffic entirely, one can either drive the field or use a self-driven ValueCopy component with writeback to "localize" the field for writing.
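The last-write-wins behavior can be sketched in Python. This is a simplified model, not FrooxEngine's replication code: writes within an update cycle only mark the field dirty, and a single sync message carrying the final value is emitted when the update ends.

```python
class ReplicatedField:
    """Sketch of per-update write coalescing.

    Multiple writes within one update overwrite the same pending value;
    only the value present at the end of the update is sent over the
    network, so the last write wins.
    """

    def __init__(self, value):
        self.value = value
        self.dirty = False
        self.sent = []  # stands in for outgoing network traffic

    def write(self, value):
        self.value = value
        self.dirty = True

    def end_of_update(self):
        if self.dirty:
            self.sent.append(self.value)  # one sync message per update
            self.dirty = False

field = ReplicatedField(0)
for v in (1, 2, 3):      # three writes in the same update...
    field.write(v)
field.end_of_update()
# ...produce a single network send, carrying only the final value 3
```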

=== Dynamic impulses ===

Dynamic impulses are used to dynamically call "receivers" of some tagged impulse under a slot hierarchy. This works by recursively looking at all children of the slot given to a trigger node and checking whether each is a receiver. As such, if the impulse is called very frequently, it is recommended to minimize the scope of the dynamic impulse as much as is reasonable.
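Why scope matters can be seen in a toy Python model of the search (the `Slot` class and traversal here are illustrative, not Resonite's implementation): every slot under the target is visited whether or not it holds a receiver, so the trigger's cost scales with the size of the searched hierarchy.

```python
class Slot:
    """Minimal slot-hierarchy sketch for dynamic impulse routing."""

    def __init__(self, name, receiver_tag=None, children=()):
        self.name = name
        self.receiver_tag = receiver_tag
        self.children = list(children)

def find_receivers(root, tag, visited):
    """Recursively collect receivers with a matching tag.

    `visited` records every slot touched: the cost of triggering a
    dynamic impulse grows with the hierarchy searched, which is why a
    narrowly scoped target slot is preferable for hot impulses.
    """
    visited.append(root.name)
    found = [root] if root.receiver_tag == tag else []
    for child in root.children:
        found += find_receivers(child, tag, visited)
    return found

handler = Slot("Handler", receiver_tag="Teleport")
root = Slot("Root", children=[
    Slot("Decor", children=[Slot(f"Prop{i}") for i in range(3)]),
    Slot("Logic", children=[handler]),
])
visited = []
receivers = find_receivers(root, "Teleport", visited)
# all 7 slots are visited even though only one receiver matches
```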

=== Continuously changing drive chains ===

Certain nodes are marked with an attribute called ContinuouslyChanging. This causes any node that uses the output of such a node to re-evaluate its entire input chain every update. A list of these nodes can be found on the [[:Category:ContinuouslyChanging nodes]] page.

If potentially-heavy nodes are used in such a chain, such as BodyNodeSlot or [[ProtoFlux:FindChildByName|FindChildByName]], this can be a silent killer of performance depending on how heavy the chain is and how many instances of the flux exist.

If you are using a heavy node in a chain that is marked continuously changing, it is highly recommended to see whether the result of the heavy node can instead be written to a Store whenever it needs to be updated. This way, the heavy node is not re-evaluated every update.
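The Store pattern amounts to caching: recompute the expensive result only when something relevant happens, and let the per-update chain read the cached value. A hypothetical Python sketch (names invented for illustration):

```python
calls = {"heavy": 0}

def heavy_lookup():
    """Stand-in for an expensive node, e.g. a recursive child search."""
    calls["heavy"] += 1
    return "found-slot"

class Store:
    """Cached value, refreshed only when the result may have changed."""

    def __init__(self):
        self.value = None

    def refresh(self):
        self.value = heavy_lookup()

store = Store()
store.refresh()               # write once, e.g. from an event impulse
for _frame in range(100):     # the continuously changing drive chain...
    _ = store.value           # ...reads the cache instead of re-running
# heavy_lookup ran once instead of 100 times
```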

=== Measuring ProtoFlux runtime ===

The runtime of a ProtoFlux Impulse chain can be measured using Utc Now and a single variable storing its value just before executing the measured code. Subtracting this value from the one evaluated just after the execution yields the duration it took to run. Take note that this does not measure indirect performance impacts like increased network load or physics updates!
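The same two-timestamp technique, expressed in Python for illustration (using `time.perf_counter` where the ProtoFlux setup would use Utc Now and a variable):

```python
import time

def measure_runtime(impulse_body):
    """Two-timestamp measurement, mirroring the Utc Now approach:
    store the time just before running the measured code, then subtract
    it from the time read immediately after it returns."""
    start = time.perf_counter()   # "write Utc Now to a variable"
    impulse_body()                # the impulse chain being measured
    return time.perf_counter() - start

elapsed = measure_runtime(lambda: sum(range(100_000)))
# elapsed covers only the direct execution time; network traffic or
# physics updates triggered by the code are not included
```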

Measuring the performance impact of ProtoFlux drives is not as straightforward. The most sensible approach would be to create an impulse-based measurement that measures the runtime of a simple Write node connected to the same expression as the drive.

If you—for whatever reason—need to compute the execution time during drive evaluation, you could abuse the order of evaluations:

== Common performance myths ==

A few myths occasionally propagate regarding optimization. This list addresses some of the most common ones:

* In general, [[slot count]] does not inherently matter for the performance of an object. Should anything iterate over the slots or elements of an object, it will matter, but not in the general case.
* In general, there is no notable difference in performance between using ProtoFlux versus components for driving values. Performance differences would occur on a case-by-case basis depending on the complexity of the component and the potential re-implementation in ProtoFlux.

== Culling ==

{{hatnote|More information: [[Culling]]}}

'''Culling''' refers to not rendering or processing specific parts of a world or certain slots to reduce performance costs. Resonite currently has built-in [[Culling#Built-in culling|frustum culling]] for rendering meshes, but this does not affect CPU-side processing such as ProtoFlux or components. It is recommended to set up some form of additional world-based culling if your world is large and complex. There are also user-made systems that cull only user avatars in a world for a potential per-session performance increase with a large number of users.

== Profiling ==

'''[https://en.wikipedia.org/wiki/Profiling_(computer_programming) Profiling]''', in general, is the act of measuring what is consuming the most of a particular resource, and where. Profiling can help determine what is taking up the most CPU or GPU frame time in a world. Some methods of profiling a world or item include:

* The '''Debug''' facet on the home screen of the [[dash menu]] provides timings for various aspects of Resonite. The two most useful tabs here are the ''Worlds'' and ''Focused World'' tabs, the latter of which will show the time it takes to execute every part of the world [[update|update cycle]].
* The [https://github.com/esnya/ResoniteMetricsCounter ResoniteMetricsCounter] mod provides timings for various slot hierarchies, ProtoFlux chains, and components in the current session.
** Do not take the exact timings at face value, as there is non-insignificant overhead from the patching the mod requires to work. Nevertheless, it serves as an excellent tool for exploring what is <em>relatively</em> taking up the most frame time in a world.
* The [https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace dotnet-trace] tool is able to perform in-depth traces of FrooxEngine and enumerate which functions are taking up the most runtime. While it provides the most extensive and accurate profiling, it can't single out individual slot hierarchies, ProtoFlux groups, or components, and is generally for more advanced users.
* SteamVR has a "Display Performance Graph" that can show GPU frametimes. This can also be shown in-headset via a toggle in the developer settings (toggle "Advanced Settings" on in the settings menu).
* [https://store.steampowered.com/app/908520/fpsVR/ fpsVR] is a paid SteamVR tool that gives a lot of information about CPU and GPU timings, VRAM/RAM usage, and other metrics that can be very useful when profiling inside of VR.

[[Category:Optimization]]