I badly needed some insight into CUDA memory performance, and the two chapters on memory coalescing were very useful to me. I had to write a very simple kernel to convert 3-byte RGB into 4-byte RGBA so that OpenGL could render it correctly, but my naive approach spawned far too many threads, and it slowed the code down so much that I wondered what I had actually done.
Later I learned that memory utilization was my bottleneck. As a traditional programmer, I had never paid attention to how I access memory. With this book's insights on memory coalescing, I got a real education in how the GPU really works.
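For illustration only, here is a minimal sketch of what such a conversion might look like with one thread per pixel. It is not code from the book. The names `rgbToRgba` and `convertOnGpu` are hypothetical, and the sketch assumes tightly packed RGB input, an opaque alpha of 255, and omits error checking for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per pixel. Each thread reads 3 source bytes and writes a single
// 4-byte uchar4, so neighbouring threads write neighbouring 32-bit words.
__global__ void rgbToRgba(const unsigned char* rgb, uchar4* rgba, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;               // guard the partial last block

    const unsigned char* p = rgb + 3 * i;     // assumes tightly packed RGB
    rgba[i] = make_uchar4(p[0], p[1], p[2], 255);  // assumed opaque alpha
}

// Hypothetical host helper: copies the image to the GPU, converts it,
// and copies the RGBA result back. Error checking is omitted.
std::vector<unsigned char> convertOnGpu(const std::vector<unsigned char>& rgb)
{
    int numPixels = static_cast<int>(rgb.size() / 3);
    std::vector<unsigned char> out(4 * static_cast<size_t>(numPixels));
    if (numPixels == 0) return out;

    unsigned char* dRgb = nullptr;
    uchar4* dRgba = nullptr;
    cudaMalloc(&dRgb, 3 * static_cast<size_t>(numPixels));
    cudaMalloc(&dRgba, numPixels * sizeof(uchar4));
    cudaMemcpy(dRgb, rgb.data(), 3 * static_cast<size_t>(numPixels),
               cudaMemcpyHostToDevice);

    int threads = 256;                                   // a common block size
    int blocks = (numPixels + threads - 1) / threads;    // round up
    rgbToRgba<<<blocks, threads>>>(dRgb, dRgba, numPixels);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), dRgba, numPixels * sizeof(uchar4),
               cudaMemcpyDeviceToHost);
    cudaFree(dRgb);
    cudaFree(dRgba);
    return out;
}
```

The writes here are one aligned 4-byte word per thread, which is the access pattern coalescing favours. The reads are still single bytes at a stride of 3, so a tuned version would probably load the input in wider words, for example by staging it through shared memory.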
I also use CUDA under Linux, so it was important to learn that CUDA provides a profiling tool for diagnosing memory bottlenecks, and that cuda-gdb can be used to step through a CUDA kernel. The book also spends quite some time illustrating how to use nvcc. It is up to date as well, covering the Fermi architecture (Compute Capability 2.x, with L1 and L2 caches) and CUDA Toolkit 4.0.
Get this book if you need real insights on CUDA from a real practitioner. For an introduction, you should also get CUDA by Example (Jason Sanders and Edward Kandrot). You will be as glad as I was.