Quick question: is this just pure C code that can be loaded onto an Nvidia GPU and run (via the Python code)? I scanned the C and didn't see anything CUDA-related (maybe I missed something; I'm not a GPU programmer!). K mentions something about a direct CUDA implementation coming soon. How would that be different from what this is?