Tweaking the HDT-SMP config to properly utilize OpenCL for big gains in performance and accuracy

goaway · March 27, 2020

Did you ever wonder why SMP physics in Skyrim is so performance intensive? Surprisingly, it could be because it is not set up to take full advantage of your hardware. OpenCL is extremely good at the kind of calculations a physics engine runs across dozens or hundreds of parallel entities, but the way HDT-SMP is set up by default is not tuned to make use of it.

<configs>
	<opencl>
		<!--warning : not finish yet-->
		<!--warning : this can be slower because of the bad PCI-E transfer and bad schedule-->
		<enable>false</enable>
		<platformID>0</platformID>
		<numQueue>16</numQueue>
	</opencl>
	<solver>
		<numIterations>20</numIterations>
		<groupIterations>1</groupIterations>
		<groupEnableMLCP>true</groupEnableMLCP>
		<erp>0.2</erp>
		<min-fps>60</min-fps>
	</solver>
  </configs>

This is the default config for HDT-SMP, found in your skse\hdtSkinnedMeshConfigs folder. As you can see, the OpenCL section is commented with warnings. After messing around with it in my spare time, I have discovered a couple of things.

1. These numbers are nowhere near what OpenCL should be run at.

2. This is probably 99% of the reason why OpenCL seems to have bad performance.

So how can you tweak this to fully make use of OpenCL? For starters, you are going to need a tool called "GPU Caps Viewer", which lets you see the OpenCL profile for your CPU/GPU. You can find the link to it on this page

When you download and install it, open it up and you will see a screen that looks like this

[Screenshot: Part1.png]

Click over to the tab marked "OpenCL" and you will see a screen that looks like this

[Screenshot: Part2.png]

Depending on your CPU, GPU, motherboard, and BIOS, you may have multiple entries here. For this example, you can see that I have my GPU listed, an Nvidia RTX 2080 Ti

[Screenshot: Part3.png]

I also have the "Intel HD Graphics" platform profile, which is essentially the on-board display portion of my CPU chipset. A lot of Skylake/Coffee Lake or newer CPUs will have this. You may want to come back and try this platform later, but for right now, let's stick with the GPU platform.

There are a few things we want to know from the profiler. The first is the platformID. Take a look at the default config again, and you should notice this line.

		<enable>false</enable>
		<platformID>0</platformID>

The platformID, 0, actually denotes NOT using an OpenCL capable platform on my device, and the OpenCL capable platforms are 1: GPU, and 2: CPU graphics. If you enable OpenCL but don't select the correct platformID, you will almost certainly see a drop in performance. This may explain the commented section.

Since my GPU is platform 1, I will change my config to look like this

		<enable>true</enable>
		<platformID>1</platformID>

Right away this gave me a pretty sizable performance boost. It actually gets better though.

Next you want to determine the maximum number of queues you can scale up to. One of the biggest bottlenecks for HDT-SMP is actually not how quickly it can perform the calculations, but the fact that physics engines are designed to calculate in real time! If the physics calculations are not completed for whatever reason before the GPU completes a given frame, the frametime render may actually be throttled until the physics solver can keep up.

This means that even if your system can easily handle the calculations, if the solver is not able to process enough calculations simultaneously, it will create a bottleneck! And due to how world physics seems to work, this bottleneck may actually throttle your FPS down to keep the time-based physics calculations accurate.

Note: This is my own speculation. I am not 100% certain that the implementation of SMP works this way, but it certainly seems to be the case.

Back in GPU caps viewer on the OpenCL screen, you want to find the value for maximum parallel work-item/work-group sizes

[Screenshot: Part4.png]

As you can see from my GPU, it is capable of 1024 x 1024 x 64 array calculations with a max workgroup size of 1024. I'm not exactly a programmer, nor particularly familiar with multi-threaded architecture, but I think this may be where some of the confusion originally came from.

Intel describes a maximum of 16 work groups per slice, and gives the calculation for the number of work groups needed as (work items) / (work items per work group). However, this is per slice of CPU graphics, NOT the total, which Intel recommends as 256. When I went back to look at my CPU HD graphics platform, this is exactly the max work group size listed, 256.
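The work-group arithmetic above can be sketched in a few lines of Python. This is only an illustration of the division: the function name and the sample numbers are made up, and nothing here is taken from Intel's documentation or from HDT-SMP itself.

```python
def work_group_count(total_work_items: int, work_group_size: int) -> int:
    """Number of work groups needed to cover every work item (rounded up)."""
    if total_work_items < 0 or work_group_size <= 0:
        raise ValueError("work item count must be >= 0 and group size must be > 0")
    # Integer ceiling division: a partial group still counts as one group.
    return (total_work_items + work_group_size - 1) // work_group_size


# Hypothetical numbers, chosen only to show the arithmetic.
print(work_group_count(4096, 256))  # 4096 items in groups of 256 -> 16 groups
```

A partially filled group still has to be scheduled, which is why the division rounds up.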

So going on that value, and using my GPU max work group size value of 1024, I changed my config to look like this now

	<opencl>
		<!--warning : not finish yet-->
		<!--warning : this can be slower because of the bad PCI-E transfer and bad schedule-->
		<enable>true</enable>
		<platformID>1</platformID>
		<numQueue>1024</numQueue>
	</opencl>
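If you would rather script the change than edit the file by hand, something like the following sketch could work. It assumes the config has the same layout as the default shown earlier; the function name is hypothetical, and it only rewrites the three values in the opencl section. Note that Python's default XML parser drops comments, so the two warning comments would not survive a round trip.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the default <opencl> section, used only as sample input.
SAMPLE_CONFIG = """<configs>
	<opencl>
		<enable>false</enable>
		<platformID>0</platformID>
		<numQueue>16</numQueue>
	</opencl>
</configs>"""


def set_opencl(xml_text: str, enable: bool, platform_id: int, num_queue: int) -> str:
    """Return the config text with the three <opencl> values replaced."""
    root = ET.fromstring(xml_text)  # default parser discards XML comments
    opencl = root.find("opencl")
    if opencl is None:
        raise ValueError("no <opencl> section found")
    new_values = {
        "enable": "true" if enable else "false",
        "platformID": str(platform_id),
        "numQueue": str(num_queue),
    }
    for tag, value in new_values.items():
        node = opencl.find(tag)
        if node is None:
            node = ET.SubElement(opencl, tag)  # create the tag if it is missing
        node.text = value
    return ET.tostring(root, encoding="unicode")


print(set_opencl(SAMPLE_CONFIG, enable=True, platform_id=1, num_queue=1024))
```

Keep a backup of the original file before writing the result back.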

The first time I tried this I was sure this would cause a crash. After all, the default value for queues is 16, and this is 64 times larger.

Not only did it not crash, but it ran absolutely flawlessly! I even stress-tested it a bit by loading up a huge number of NPCs in a single cell, all of which were wearing HDT-SMP outfits that required calculations. Previously this would drop my framerate considerably, but not so anymore!

You may want to experiment with this a bit, but there are two things to keep in mind. You shouldn't set your numQueue higher than your maximum workgroup size, and whatever value you do set needs to be a power of 2.
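The two rules above (a power of 2, and no larger than the max work group size) are easy to check before you start the game. A minimal sketch, with hypothetical function names; it only encodes the rules as stated in this guide and knows nothing about HDT-SMP itself:

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (a single bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def is_valid_num_queue(num_queue: int, max_work_group_size: int) -> bool:
    """Apply both rules from the guide: power of 2 and within the max work group size."""
    return is_power_of_two(num_queue) and num_queue <= max_work_group_size


# Try a few candidates against a max work group size of 1024.
for candidate in (16, 256, 1000, 1024, 2048):
    print(candidate, is_valid_num_queue(candidate, 1024))
```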

Let's go back to the Intel CPU platform for a second.

If you run a very demanding ENB, you might already have your GPU close to maxed out with just post-processing enhancements. In that case, you might find more benefit in using your CPU to perform the calculations, since Skyrim LE almost never taxes modern CPUs anywhere close to what they are capable of.

If I wanted to use my CPU instead, I would go to the OpenCL tab back in GPU Caps viewer

[Screenshot: Part5.png]

Based on these values, I would then set my config like this

	<opencl>
		<!--warning : not finish yet-->
		<!--warning : this can be slower because of the bad PCI-E transfer and bad schedule-->
		<enable>true</enable>
		<platformID>2</platformID>
		<numQueue>256</numQueue>
	</opencl>

You may need to experiment a little to see which option gives you better performance. In my case, my GPU is powerful enough to be able to handle both ENB and physics, but your results may differ.

With that set up, you can either be satisfied with the performance improvements, or you can try and tweak the solver calculations too.

The default values for these vary from what I have seen, but generally it looks like this

	<solver>
		<numIterations>20</numIterations>
		<groupIterations>1</groupIterations>
		<groupEnableMLCP>true</groupEnableMLCP>
		<erp>0.2</erp>
		<min-fps>60</min-fps>
	</solver>

There are a few things we can try changing here. The first one is the ERP value. What is ERP? It is the error reduction parameter, and it represents the fraction of error that is corrected with each time step calculated by the physics solver. When a physics engine runs a calculation on a joint between two connected bodies (bones), some divergence from the imposed constraints naturally occurs. This is a good reference

It ranges from 0.1 to 1.0 (in theory), but most references I have found say 1.0 is impossible, and 0.9 is the limit. It controls how quickly (closer to 1.0) or how smoothly (closer to 0.1) the error is corrected. This is relevant for outfits and hair that have a long chain made out of multiple joints. I have found that I prefer setting this a bit higher, so errors are corrected quicker, and I set mine to 0.4. Setting it to 0.5 or higher introduced some instability that made my game start crashing, although you may get different results.
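To get a feel for what different ERP values mean, here is a rough sketch of an idealized model. It assumes that each step removes exactly the ERP fraction of whatever error is left and that no new error appears in between. That is a simplification for illustration, not how the HDT-SMP solver is actually implemented, and the function name is made up.

```python
def remaining_error(initial_error: float, erp: float, steps: int) -> float:
    """Error left after `steps` solver steps in the idealized model."""
    if not 0.0 <= erp <= 1.0:
        raise ValueError("erp must be between 0.0 and 1.0")
    if steps < 0:
        raise ValueError("steps must be >= 0")
    # Each step keeps (1 - erp) of the error that was left before it.
    return initial_error * (1.0 - erp) ** steps


# Compare the default 0.2 with 0.4 over the first few steps.
for erp in (0.2, 0.4):
    print(erp, [round(remaining_error(1.0, erp, n), 3) for n in range(1, 5)])
```

In this model a higher ERP shrinks the error faster per step, which matches the "quicker versus smoother" trade-off described above.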

Next you can try increasing the iterations. Most of the physics engine references I have found suggest that 20 is already on the high end of iterations, but what about the group iterations? What I have found suggests this is performed on a connected group (such as every joint in a strand of cloth, a tail, or hair, for example) that is performed in addition to the total quick-step world iterations.

So, since using OpenCL allowed for a huge expansion of the number of calculations that can be performed, I changed it to match the total iterations (which may be overkill) and now my config looks like this

	<solver>
		<numIterations>20</numIterations>
		<groupIterations>20</groupIterations>
		<groupEnableMLCP>true</groupEnableMLCP>
		<erp>0.4</erp>
		<min-fps>60</min-fps>
	</solver>

This worked pretty well! I decided to push the total iterations up a little until I started getting stability/performance issues, and ended up with the final values of num 32 and group 20

and adding the previous section, my overall configs.xml is now

	
<configs>
	<opencl>
		<!--warning : not finish yet-->
		<!--warning : this can be slower because of the bad PCI-E transfer and bad schedule-->
		<enable>true</enable>
		<platformID>1</platformID>
		<numQueue>1024</numQueue>
	</opencl>
	<solver>
		<numIterations>32</numIterations>
		<groupIterations>20</groupIterations>
		<groupEnableMLCP>true</groupEnableMLCP>
		<erp>0.4</erp>
		<min-fps>60</min-fps>
	</solver>
  </configs>

This has not only given me a SIGNIFICANT performance boost, especially when surrounded by many actors all wearing HDT-SMP outfits, but it has also improved the accuracy (i.e., realism) of the physics quite a bit.

Give it a try and let me know what you think. It may take some experimenting to find the ideal values if your system is already strained by other parts of your Skyrim build, but hopefully this will make SMP physics more appealing and minimize whatever performance impacts it may have on your system.

If you experience problems or have stability issues and need to undo any changes, the default settings are shown in the first box.

agiz19 · March 27, 2020

hey.

this looks very interesting, but it would be nice if you made a comparison video before/after type for this so we can see the actual results and differences , at least a snapshot of same scenes with framerate visible would already help, but video is the way to go, I highly doubt many people will try this based on few words, otherwise it seems something very useful and really a comparison would be very appreciated.

Also have you tried this in Special Edition?

RomeoZero · March 28, 2020

23 hours ago, goaway said:

If you experience problems or have stability issues and need to undo any changes, the default settings are shown in the first box.

Gonna test this out.

RomeoZero · March 29, 2020

goaway, this actually works pretty well, with the same results as yours on a GTX 1060, be it clothes and body or a single body for testing, even with the HDT+SMP combo. I always knew there was something in the config file that put heavy stress on FPS from calculations on just rigged bones and xml data. Good job on it!

OrrieL · March 29, 2020

I am sorry to floor your excitement, but the opencl tag is not implemented in either the LE or SSE binaries. I devoted a lot of time to figuring it out in the past (before the sources were released) and ended up disassembling the DLL, only to find out the tag is not implemented.

Now that the source code has been released, you can check for yourself: https://github.com/aers/hdtSMP64/blob/master/hdtSMP64/config.cpp

You can check the SMP log and you will see something like an unknown parameter warning in the first lines. If you comment out the opencl tag, there will be no more unknown parameter warning in the log.
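The log check described above could be scripted roughly like this. It is a sketch under assumptions: the exact wording of the warning and the sample log lines below are hypothetical, so the filter simply looks for the word "unknown" in any casing. You would feed it the lines of your own SMP log file.

```python
from typing import Iterable, List


def find_unknown_warnings(log_lines: Iterable[str]) -> List[str]:
    """Return every log line that mentions 'unknown', ignoring case."""
    return [line.rstrip("\n") for line in log_lines if "unknown" in line.lower()]


# Hypothetical log content, only to demonstrate the filter.
sample_log = [
    "[Info] loading config\n",
    "[Warning] Unknown config : opencl\n",
    "[Info] solver configured\n",
]
for hit in find_unknown_warnings(sample_log):
    print(hit)
```

To use it on a real file, open the log and pass the file object, since iterating a text file yields its lines.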

If SMP code does not select OpenCL platform/device then it is on bullet engine to choose one and I bet it always defaults to CPU if not specified otherwise. Maybe someone else can find an answer or even better modify the sources to actually support OpenCL platform/device selection.

The solver section IS implemented and settings can have impact on the performance. Especially group iterations which are used for group constraints and LERPs. The values you set are extremely high and if used with complex mesh physics utilizing both group constraints and LERPs will have serious performance impact while having very little accuracy difference.

Still, the biggest performance killers are collision calculations, for which the GPU might help.

Do you have proof to support your theory that SMP on your system is using the GPU?

goaway · March 29, 2020

On 3/27/2020 at 4:57 AM, agiz19 said:

hey.

this looks very interesting, but it would be nice if you made a comparison video before/after type for this so we can see the actual results and differences , at least a snapshot of same scenes with framerate visible would already help, but video is the way to go, I highly doubt many people will try this based on few words, otherwise it seems something very useful and really a comparison would be very appreciated.

Unfortunately, this is very hard to quantify or explain. My FPS is capped at 60 and doesn't really change except in cells with many NPCs. The best way I know to see the performance impact of physics is to make a cell that brings me down to about 40-45 fps and then open the console. When you do this, the graphics are still rendering at full power, but the physics engine stops calculating while the console is open, so my fps shoots back up to 60.
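The console trick above can be turned into a rough number by converting FPS into frame time. A minimal sketch with made-up function names; since the FPS is capped at 60, the result is only a lower bound on what physics costs per frame.

```python
def frame_time_ms(fps: float) -> float:
    """Milliseconds spent per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def physics_cost_ms(fps_with_physics: float, fps_without_physics: float) -> float:
    """Extra time per frame while physics is running (console closed vs. open)."""
    return frame_time_ms(fps_with_physics) - frame_time_ms(fps_without_physics)


# 40 fps with physics running, back to the 60 fps cap with the console open.
print(round(physics_cost_ms(40, 60), 2), "ms per frame")
```

At 40 fps a frame takes 25 ms and at 60 fps about 16.67 ms, so physics accounts for at least roughly 8.33 ms per frame in that scene.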

When I set it up to use OpenCL, I haven't been able to drop my FPS by physics, the only way it ever goes down is when I overload my GPU with supersampling resolutions and that doesn't have anything to do with physics.

That said, I already went through a lot of trouble writing the guide. It is fairly simple and does not take very long to test this yourself, so I don't really have any intention of making a youtube video to "prove it". It's ok with me if you aren't interested, I just wanted to share it in case somebody else is.

13 hours ago, OrrieL said:

I am sorry to floor your excitement, but the opencl tag is not implemented in either the LE or SSE binaries. I devoted a lot of time to figuring it out in the past (before the sources were released) and ended up disassembling the DLL, only to find out the tag is not implemented.

Now that the source code has been released, you can check for yourself: https://github.com/aers/hdtSMP64/blob/master/hdtSMP64/config.cpp

You can check the SMP log and you will see something like an unknown parameter warning in the first lines. If you comment out the opencl tag, there will be no more unknown parameter warning in the log.

Still, the biggest performance killers are collision calculations, for which the GPU might help.

Do you have proof to support your theory that SMP on your system is using the GPU?

The fact that it is using the GPU is very easy to see. Using MSI Afterburner and RTSS, I can track my GPU utilization and VRAM. My GPU maxes out at about 80% in Skyrim using my ENB settings, and I can switch to the GPU platformID of OpenCL and see that rise to 95% without an increase in VRAM. Changing the platformID is the only variable that is different. I can't really think of an alternative explanation for that.

I should add that I do not have anything in my logs like unknown parameter. Maybe it's different based on which version of the .dll you are running?

Quote

If SMP code does not select OpenCL platform/device then it is on bullet engine to choose one and I bet it always defaults to CPU if not specified otherwise. Maybe someone else can find an answer or even better modify the sources to actually support OpenCL platform/device selection.

I was pretty sure that it used ODE quick-step physics as a solver, but it sounds like you know a lot about this, so I will take your word for it.

But... the whole point of setting the platformID is to specify the GPU version of OpenCL (which is OpenCL 1.2 C, as opposed to the CPU graphics, which uses OpenCL 2.1). The variables are part of the actual OpenCL code language, so maybe it doesn't really matter if SMP defines them in the source as long as it integrates the OpenCL libraries. Your guess is as good as mine.

I'd also like to add that I verified the relationship between numQueue and max work group size by going over the limit. When I set it to 1025, my game would bog down and gradually become unstable, and setting it any higher than 1025 caused an immediate crash. I am uncertain how I could possibly explain this observation if it doesn't actually use OpenCL.

Quote

The solver section IS implemented and settings can have impact on the performance. Especially group iterations which are used for group constraints and LERPs. The values you set are extremely high and if used with complex mesh physics utilizing both group constraints and LERPs will have serious performance impact while having very little accuracy difference.

That's sort of the point. If I used it with OpenCL= false or PlatformID=0, it would cause serious performance issues. But when I set it up like this, it works just fine.

You are right, I probably overdid it with the iterations, but my methodology was to keep incrementing the values until I got stability or performance issues, and then scale back a few steps.

The collisions are the aspect where the accuracy of the simulation is most noticeably improved. Any deformation caused by one mesh interacting with another is significantly less, well... deformed looking. Hand/breast interactions (this is loverslab, after all), for example, do not have the same unnatural looking stretching; the deforming mesh looks more stable and less jerky. My assumption was that this is a result of the combination of increased total iterations plus the group iterations, since I use a bodymesh and SMP config that utilizes multiple breast bones.

OrrieL · March 30, 2020

9 hours ago, goaway said:

I should add that I do not have anything in my logs like unknown parameter. Maybe it's different based on which version of the .dll you are running?

Interesting, are you on LE or SE? Can you please send me the DLL that you are using via msg? Thanks.

EDIT: It made me curious and I asked competent sources about it. OpenCL in the SE version was never implemented; everything runs on the CPU. LE sources were not released so we can only speculate, but I guess the SE SMP code is just an evolution of the LE SMP code, and Hydrogen probably used most of the LE code as a base, so if it is not in the SE version then it is probably not in the LE version either. But I am still curious about the DLL you are using.

27X · March 30, 2020

18 hours ago, OrrieL said:

Interesting, are you on LE or SE? Can you please send me the DLL that you are using via msg? Thanks.

EDIT: It made me curious and I asked competent sources about it. OpenCL in the SE version was never implemented; everything runs on the CPU. LE sources were not released so we can only speculate, but I guess the SE SMP code is just an evolution of the LE SMP code, and Hydrogen probably used most of the LE code as a base, so if it is not in the SE version then it is probably not in the LE version either. But I am still curious about the DLL you are using.

There are older versions that use OCL.

You also want the target per frame Interpolation to be 64, not 60.

Donselino · April 1, 2020

OK, I think a thank you is in order: 3 to 9 fps up depending on the area.

OrrieL · April 2, 2020

On 3/31/2020 at 1:30 AM, 27X said:

There are older versions that use OCL.

You also want the target per frame Interpolation to be 64, not 60.

I thought you might be right. I have used (and dug into) SMP for about 2 years and never tried the old DLLs, so I gave it a shot.

First we are talking about Skyrim LE, not SE.

I searched my drive and found an archive containing 5 different DLL versions from 2015-2018.

I disassembled every one of them and looked through the strings to see whether there is a condition that parses the <opencl> tag, and whether there is an "Unknown config" warning when an unknown parameter is found.
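A quick first pass before a full disassembly is to search the raw bytes of the DLL for the tag name. The sketch below is not the method used above, just a simpler string check with a made-up function name. It looks for the text in both ASCII and UTF-16LE, the two encodings Windows binaries commonly use for strings.

```python
from typing import Dict


def find_tag_strings(data: bytes, needle: str) -> Dict[str, bool]:
    """Report whether `needle` occurs in the bytes as ASCII or as UTF-16LE text."""
    return {
        "ascii": needle.encode("ascii") in data,
        "utf16le": needle.encode("utf-16-le") in data,
    }


# Made-up bytes standing in for a DLL; a real check would use
# open(path_to_dll, "rb").read() with your own file path.
fake_binary = b"\x00\x01opencl\x00" + "solver".encode("utf-16-le")
print(find_tag_strings(fake_binary, "opencl"))
print(find_tag_strings(fake_binary, "solver"))
```

A missing string strongly suggests the tag is never parsed, but finding it does not prove the feature is implemented, so a hit still needs a closer look at the code.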

I modified the config with the same settings as OP is using.

Then I loaded a game and used F-F animation using the high poly CBBE body with custom XML to trigger triangle-triangle collisions.

I do not think that testing with NPCs just walking around is good enough, as physics calculations (i.e., movement, not collisions) are not that CPU intensive and even potato CPUs can handle them. Collisions are where things get ugly.

Here are the results:

This is my OpenCL system (info taken from hashcat)

OpenCL Info:

Platform ID #1
Vendor : NVIDIA Corporation
Name : NVIDIA CUDA
Version : OpenCL 1.2 CUDA 10.2.131

Device ID #1
Type : GPU
Vendor ID : 32
Vendor : NVIDIA Corporation
Name : GeForce GTX 1660 Ti
Version : OpenCL 1.2 CUDA
Processor(s) : 24
Clock : 1875
Memory : 1536/6144 MB allocatable
OpenCL Version : OpenCL C 1.2
Driver Version : 442.19