I am currently at the NVIDIA GTC 2013 conference, and this is my blog on a variety of things which strike me as interesting.
Update: all of the talks are now available on the NVIDIA GTC website.
Monday, March 18th:
The highlight for me was an interesting talk, "Getting Started with OpenACC" by Jeff Larkin. OpenACC is an approach similar to OpenMP in which the programmer inserts pragmas (compiler directives) into their C/FORTRAN code, avoiding the need to write applications in CUDA.
Previously I've been a bit sceptical about how many applications will be able to benefit from OpenACC. However, this talk covered various features (such as the ability to control the parameters of the generated CUDA kernels, and the ability to execute code asynchronously) which make me think it may be more useful than I had thought.
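To give a flavour of the directive-based style, here is a minimal sketch of a saxpy loop with OpenACC pragmas (my own illustrative example, not one from the talk; the clause names follow the OpenACC spec of the time):

```cuda
// The pragma asks the compiler to generate and launch a GPU kernel for
// the loop; no CUDA code is written by the programmer.
void saxpy(int n, float a, float *x, float *y)
{
    // copyin: x is sent to the GPU; copy: y is sent and copied back.
    // vector_length is one of the tuning clauses that let you control
    // the parameters of the generated CUDA kernel.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n]) vector_length(128)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Adding an `async(...)` clause to the pragma is what allows the generated kernel to execute asynchronously with host code.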
Tuesday, March 19th:
The day started with a 2.5 hour keynote presentation by Jen-Hsun Huang, the NVIDIA CEO. He covered a number of topics, but the main ones that interested me were:
- high-end Tesla development:
- Maxwell (due 2014) -- 2x performance compared to Kepler, key innovation is unified virtual memory with CPU
- Volta (due 2016) -- 4x performance compared to Kepler, key innovation is stacked memory leading to 1TB/s memory bandwidth
- low-end Tegra development
- Tegra 5 (aka Logan, due 2014) -- 3x performance compared to Tegra 4, has a Kepler GPU
- Tegra 6 (aka Parker, due 2015) -- 10x performance compared to Tegra 4, Maxwell GPU plus Denver CPU with ARM cores
- notice the dates compared to Tesla development -- much more rapid performance increase with Tegra
- Grid VCA (Visual Computing Appliance)
- a back-to-the-future model -- a central server plus thin clients
- a 4U server box with 16 GPUs and two 8-core CPUs
- thin client software can run on any "edge" device
- it's aimed particularly at SMEs with heavy graphics computing needs -- a big selling point is having just one system to manage for the whole company
- for similar reasons, it might be good for academic research groups or departments
- I'll try to learn more about this tomorrow
There is an HPCWire report which includes some video from the presentation.
During the lunchtime period I went around the Exhibition Area to see some of the displays by various NVIDIA partners. One thing that I've been wanting to find out about is the status of GPUDirect for transferring data to/from GPUs directly from other PCIe-attached devices such as cameras and SSDs. I learned that there is now something called GPUDirect RDMA which uses standard PCIe RDMA protocols -- this means that many (or most) PCIe devices in the future should be able to transfer data directly into GPUs. GE Defense Systems had info on FPGA-based boards for handling radar data, and someone else told me about Fusion-IO's SSDs which can sustain up to 6GB/s data rates.
In the afternoon, the highlight for me was a talk by Mark Harris, whose title is something like Chief Technologist for CUDA. He discussed the "Future Directions for CUDA", and some of the things which caught my eye were:
- unified virtual memory so that either CPU or GPU can access any data (fully supported with the Maxwell GPU, but some support coming sooner)
- LLVM compiler developments to support development of DSLs (domain specific languages)
- new CUDA support in Python -- I'll go to a talk on this on Thursday
Wednesday, March 20th:
The day started with an NDA briefing on various aspects of the future product plans, filling in some details on the outline provided by Jen-Hsun. I obviously can't tell you the details, but I think I can say I'm happy with what I heard -- sorry if that sounds rather elliptical. If anyone has a sufficiently strong need to know more, contact me and I'll see if I can include you under the NDA.
I also got more non-NDA details on the Grid VCA I mentioned above. It's a Windows-based product using virtualisation. The ideal target is a company like Rolls-Royce with lots of designers doing really heavy-duty single-GPU computation/visualisation. Through a simple client app, the user can remotely log onto a virtual machine, do what they need to do and then log off again. They don't need their own private high-end desktop system. So I think it's potentially a very good product but probably not appropriate for academic research.
A bit more on new language support:
- Python: Continuum Analytics is the company which has developed CUDA support within Python. There's a writeup on it by AnandTech, and the product is called Anaconda Accelerate.
- R: a company called Fuzzy Logix has developed GPU analytics within R.
- C/C++/FORTRAN: Accelereyes has sold their MATLAB Jacket product to Mathworks, but has now created a C/C++/FORTRAN product called ArrayFire which is based on the same lazy evaluation technology. It is free (at least for academic research), and I think it merits some investigation, so I'll try to find a couple of undergraduates to do an evaluation project this summer.
Thursday, March 21st:
It's the final day of GTC 2013, and I'm starting to flag a bit.
An interesting talk in the morning, "SHFL: Tips and Tricks" by Julien Demouth, covered the "shuffle" instructions for moving data between registers within the same warp. These are useful for a wide variety of applications, including reduction, scan, sort and FFT operations within a warp. The bottom line is that shuffling is always more efficient than the standard approach using shared memory, and it doesn't need any shared memory allocation.
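As an illustration of the idea, here is a sketch of the well-known warp sum reduction built on the `__shfl_down` intrinsic (my own hedged version of the standard pattern, for compute capability 3.0 hardware with a warp size of 32; not code from the talk):

```cuda
// Warp-level sum reduction using shuffles -- no shared memory needed.
__device__ int warp_reduce_sum(int val)
{
    // Each step, every lane adds in the value held by the lane
    // 'offset' positions above it; halving the offset each time
    // funnels all 32 partial sums down to lane 0.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;  // lane 0 now holds the warp total
}
```

The whole reduction happens in registers, which is why it avoids both the shared memory traffic and the shared memory allocation of the classic approach.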
The hardware shuffle instructions are only for 32-bit registers, but there's a way of using them to do 64-bit shuffles. An example of this was given in the talk, but hopefully NVIDIA will later put it into the standard maths header file.
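The usual trick (this is my sketch of the standard idiom, not necessarily the exact code shown in the talk) is to split the 64-bit value into two 32-bit halves, shuffle each half, and reassemble:

```cuda
// 64-bit shuffle built from two 32-bit shuffles, using CUDA's
// bit-reinterpretation intrinsics to split and rejoin a double.
__device__ double shfl_down_double(double x, int delta)
{
    int hi = __double2hiint(x);    // upper 32 bits of x
    int lo = __double2loint(x);    // lower 32 bits of x
    hi = __shfl_down(hi, delta);
    lo = __shfl_down(lo, delta);
    return __hiloint2double(hi, lo);
}
```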
Julien also mentioned a new open-source NVIDIA Research initiative called "cub" which provides routines for block-level operations like reduction, scan and sort. I'd heard about this previously from Jon Cohen, and met the lead developer Duane Merrill on Monday. I think this could be very useful, because optimising these kinds of operations is very tricky, and the best implementation can change significantly from one architecture to the next.
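To show the flavour of the library, here is a hedged sketch of a block-level sum using cub's `BlockReduce` primitive (my own example assuming 128 threads per block; check the cub documentation for the current interface):

```cuda
#include <cub/cub.cuh>

// Each thread block sums 128 inputs; cub picks an implementation
// tuned for the target architecture, so this code need not change
// from one GPU generation to the next.
__global__ void block_sums(const int *in, int *out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp;

    int x = in[blockIdx.x * 128 + threadIdx.x];
    int sum = BlockReduce(temp).Sum(x);  // result valid in thread 0 only

    if (threadIdx.x == 0)
        out[blockIdx.x] = sum;
}
```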
I've spent more time than usual at this conference in the Exhibitor Hall, seeing a wide variety of commercial offerings from companies other than NVIDIA. One I talked to just now is Green Revolution Computing. They have a very interesting immersion technology for cooling computer systems, putting the servers inside a "bath" so they are cooled by liquid rather than air. This leads to considerable savings, reducing total power consumption by up to a factor of 2. I think it's something we ought to at least consider for future HPC systems in Oxford.
An interesting talk in the afternoon was on atomic operations. A lot of this I knew already -- it's material I discuss in my CUDA course. However, there were a couple of new things I learned. One is that atomic operations on global device memory (which are handled in the L2 cache) are much faster if different threads in a warp are updating different elements of the same cache line -- this is an unusual situation, but could perhaps be exploited for vector increments by first using a shuffle-based matrix transposition.
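The contrast can be sketched as follows (my own illustrative kernels, not from the talk; `counts` is assumed to be a 32-word, 128-byte-aligned array so one warp touches a single cache line):

```cuda
// Worst case: all 32 lanes of a warp serialise on one word.
__global__ void same_word(int *count)
{
    atomicAdd(count, 1);
}

// Faster case: each lane updates a different word of the same
// 128-byte line, so the L2 cache handles the warp's updates with
// far less serialisation.
__global__ void same_line(int *counts)
{
    atomicAdd(&counts[threadIdx.x % 32], 1);
}
```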
They also showed how insertion sort could be implemented using atomics and an interesting list data structure called a SkipList. Just as the railway line to London has some trains that stop at every station and others that go faster by stopping at only a few, a SkipList has a hierarchical structure in which the bottom level links each item to the next, and the higher levels skip over multiple items.
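To make the railway analogy concrete, here is a hedged sketch of the search step that an insertion would use (my own illustrative code with hypothetical names, not the talk's implementation, and without the atomics needed for concurrent updates):

```cuda
#define MAX_LEVEL 8   // illustrative number of levels

// next[0] is the "stopping train" linking every item in order;
// higher levels are "express trains" that skip over many items.
struct Node { int key; Node *next[MAX_LEVEL]; };

// Ride the fastest train that doesn't overshoot the target, dropping
// down a level whenever it would; returns the last node with
// key < target, i.e. the position where an insertion would go.
__device__ Node *find_position(Node *head, int target, int top_level)
{
    Node *p = head;
    for (int lvl = top_level; lvl >= 0; lvl--)
        while (p->next[lvl] != NULL && p->next[lvl]->key < target)
            p = p->next[lvl];
    return p;
}
```

Because the search only ever moves forward, each level visits only a handful of nodes, giving the expected logarithmic cost that makes a SkipList competitive with balanced trees while being much friendlier to atomic updates.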
The main overall message is that the performance of atomics is improving steadily, so if you have dismissed them in the past as giving poor performance, it may be worth having a second look.
The final talk I went to was by Jon Cohen from NVIDIA on short read DNA sequence alignment. They're looking for academics to work with, so if there's anyone in Oxford who is interested please let me know -- Jon is a great guy to work with.