Hi!
Interesting program you want to make. Coincidentally, I just finished working on a private key generator in C++ and CUDA.
I went through the Python code that you shared and saw that you need 256-bit keys. If I understood correctly, you want to use a brute-force strategy, which is a very compute-intensive and memory-consuming process.
The compute-intensive part can easily be handled by a GPU, but the memory-intensive part will be tricky even for a GPU: you need to use every bit of memory available, whereas Python is known to trade memory efficiency for ease of use.
My proposal is to port your project to C++ and, once that is done, make it use CUDA.
Any current NVIDIA GPU has a compute capability of 2.0 or higher, meaning that it supports up to 1024 threads per block. Since you want to use brute force, it is essential that those threads generate as many keys as possible.
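With that much work per launch, a kernel will run for more than 2 seconds, which is the default watchdog timeout (TDR) that Windows applies to GPUs under the WDDM driver model. I would raise or disable that timeout, or switch the card to TCC mode where it is supported, to allow longer execution times.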
To keep GPU occupancy high, I would start with 128 threads per block and 8 blocks per grid (1024 threads in total) and tune from there by profiling on your card.
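To make that launch shape concrete, here is a minimal host-side C++ sketch of the indexing it implies. It is only an illustration: the names global_thread_id and candidate_index are made up for this message, and the strided layout is one common way to spread candidates over threads, not necessarily what the final kernel would use.

```cpp
#include <cassert>
#include <cstdint>

// Launch shape from the proposal: 128 threads per block, 8 blocks per grid.
constexpr std::uint32_t kThreadsPerBlock = 128;
constexpr std::uint32_t kBlocksPerGrid = 8;
constexpr std::uint32_t kTotalThreads = kThreadsPerBlock * kBlocksPerGrid;

// Host-side equivalent of CUDA's blockIdx.x * blockDim.x + threadIdx.x.
// (Hypothetical helper, named for illustration only.)
inline std::uint32_t global_thread_id(std::uint32_t block, std::uint32_t thread) {
    assert(block < kBlocksPerGrid);
    assert(thread < kThreadsPerBlock);
    return block * kThreadsPerBlock + thread;
}

// Index of the i-th candidate a thread handles in a strided layout:
// thread t takes candidates t, t + total, t + 2 * total, and so on,
// so no two threads ever work on the same candidate.
// (Hypothetical helper, named for illustration only.)
inline std::uint64_t candidate_index(std::uint32_t tid, std::uint64_t i) {
    assert(tid < kTotalThreads);
    return static_cast<std::uint64_t>(tid) + i * kTotalThreads;
}
```

In an actual kernel the same arithmetic would run on the device, with the block and thread numbers coming from CUDA's built-in blockIdx and threadIdx variables.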
Even so, no matter how powerful your GPU is, there is a risk of exhausting GPU memory. The answer to this problem is batching: give each thread a fixed number of keys to generate per batch, sized so that a batch fills the available GPU memory without exceeding it.
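As a rough sketch of how a batch could be sized, the C++ below divides a memory budget evenly across threads. It assumes each key takes exactly 32 bytes (256 bits) with no per-key overhead. BatchPlan, plan_batch and batches_needed are hypothetical names for this illustration, not part of any existing code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: one 256-bit key is stored as 32 raw bytes.
constexpr std::size_t kKeyBytes = 32;

// Hypothetical result type describing one batch.
struct BatchPlan {
    std::uint64_t keys_per_thread;  // keys each thread generates per batch
    std::uint64_t keys_per_batch;   // keys_per_thread * total_threads
    std::size_t bytes_per_batch;    // memory the batch actually occupies
};

// Split a memory budget evenly across threads, rounding down so the
// batch never exceeds the budget.
inline BatchPlan plan_batch(std::size_t memory_budget_bytes,
                            std::uint32_t total_threads) {
    assert(total_threads > 0);
    const std::uint64_t max_keys = memory_budget_bytes / kKeyBytes;
    const std::uint64_t per_thread = max_keys / total_threads;
    const std::uint64_t per_batch = per_thread * total_threads;
    return BatchPlan{per_thread, per_batch,
                     static_cast<std::size_t>(per_batch * kKeyBytes)};
}

// Number of batches needed to cover total_keys (ceiling division).
inline std::uint64_t batches_needed(std::uint64_t total_keys,
                                    std::uint64_t keys_per_batch) {
    assert(keys_per_batch > 0);
    return (total_keys + keys_per_batch - 1) / keys_per_batch;
}
```

For example, a 1 GiB budget shared by 1024 threads gives 32768 keys per thread per batch. On a real card the budget would probably come from something like the free-memory figure reported by cudaMemGetInfo, minus a safety margin for whatever else the kernel allocates.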
In case you find my proposal useful, feel free to contact me.