Can you explain why 64-bit matters? CUDA should not require 64-bit DLLs. If your point is that a 32-bit DLL limits how much memory your computations can use, you might want to consider a different architecture that works around the limit. Specifically, it would be trivial to launch a new (64-bit) process as a host controller for the CUDA functions instead of calling cuda*() operations from the same 32-bit process. The overhead of CreateProcess() should be negligible if you're dealing with computations on 4+ GB data sets, and you can keep the process open so that it's no heavier than loading the DLL in the first place. Incidentally, this architecture also gives you a better sandbox to work with, so that if the CUDA program crashes, it won't take mIRC with it. Given how low-level CUDA functions are, that's probably something you want.
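
To make the idea concrete, here is a rough sketch of the host/helper split, written in Python only because it is short and portable. Everything in it is an assumption for illustration: the `Worker` class, the JSON-over-pipes protocol, and the stand-in helper that just sums numbers are all hypothetical. In a real setup the helper would be a separate 64-bit executable that makes the cuda*() calls, and the 32-bit DLL would start it once (for example with CreateProcess()) and talk to it over pipes.

```python
import json
import subprocess
import sys

# Stand-in for the 64-bit helper executable (hypothetical). A real helper
# would call the CUDA runtime; this one just sums the numbers it receives.
WORKER_SOURCE = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    sys.stdout.write(json.dumps({"sum": sum(request["values"])}) + "\n")
    sys.stdout.flush()
"""


class Worker:
    """Keeps one helper process alive and talks to it over pipes."""

    def __init__(self):
        # Started once and reused, so the launch cost is paid a single time.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def call(self, values):
        # If the helper has died, the host gets an exception instead of
        # crashing along with it.
        if self.proc.poll() is not None:
            raise RuntimeError("helper process has exited")
        self.proc.stdin.write(json.dumps({"values": values}) + "\n")
        self.proc.stdin.flush()
        reply = self.proc.stdout.readline()
        if not reply:
            raise RuntimeError("helper process closed its output")
        return json.loads(reply)["sum"]

    def close(self):
        # Closing stdin ends the helper's read loop so it exits cleanly.
        self.proc.stdin.close()
        self.proc.wait()
```

The host creates one `Worker`, calls `call()` as often as it likes, and calls `close()` when it unloads. A crash in the helper shows up on the host side as an error on the pipe, which is the sandboxing benefit mentioned above.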