Queue E-Mail Newsletter

for the Week of May 13, 2008





Latest Articles:

Scalable Parallel Programming with CUDA
Is CUDA the parallel programming model that application developers have been waiting for?
(scroll down to read an excerpt from this article)

GPUs: A Closer Look
As the line between GPUs and CPUs begins to blur, it's important to understand what makes GPUs tick.




Latest Queuecasts:
A Conversation with Jason Hoffman, pt. 2
Queue's January/February 2008 issue features a conversation with Jason Hoffman, CTO of Joyent, a provider of scalable infrastructure for Web applications. Interviewed by Sun's Bryan Cantrill (of DTrace fame), Hoffman discusses a range of topics, from providing scalable infrastructure for Facebook apps, to virtualization, to Ruby on Rails.

A Conversation with Jason Hoffman, pt. 1
Queue's January/February 2008 issue features a conversation with Jason Hoffman, CTO of Joyent, a provider of scalable infrastructure for Web applications. Interviewed by Sun's Bryan Cantrill (of DTrace fame), Hoffman discusses a range of topics, from providing scalable infrastructure for Facebook apps, to virtualization, to Ruby on Rails.

The Ever Expanding Ecosystem for Embedded Computing
Mike Vizard from ACM Queue talks with Oracle's Mike Olson about the changing architecture of network-enabled applications. Olson explains the thinking behind the company's new focus on embedded database and middleware technology. He explores the technical, business and economic forces shaping this fast-growing market. Tune in to learn how Oracle plans to serve customers way outside the enterprise.





New article on ACM Queue:

Scalable Parallel Programming with CUDA



Is CUDA the parallel programming model that application developers have been waiting for?

by John Nickolls, Ian Buck, and Michael Garland, Nvidia; Kevin Skadron, University of Virginia

From the GPUs for Parallel Programming issue, vol. 6, no. 2 - March/April 2008
article excerpt:

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

According to conventional wisdom, parallel programming is difficult. Early experience with the CUDA [1, 2] scalable parallel programming model and C language, however, shows that many sophisticated programs can be readily expressed with a few easily understood abstractions. Since NVIDIA released CUDA in 2007, developers have rapidly developed scalable parallel programs for a wide range of applications, including computational chemistry, sparse matrix solvers, sorting, searching, and physics models. These applications scale transparently to hundreds of processor cores and thousands of concurrent threads. NVIDIA GPUs with the new Tesla unified graphics and computing architecture (described in the GPU sidebar) run CUDA C programs and are widely available in laptops, PCs, workstations, and servers. The CUDA model is also applicable to other shared-memory parallel processing architectures, including multicore CPUs [3].

CUDA provides three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that provide a clear parallel structure to conventional C code for one thread of the hierarchy. Multiple levels of threads, memory, and synchronization provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. The abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. The programming model scales transparently to large numbers of processor cores: a compiled CUDA program executes on any number of processors, and only the runtime system needs to know the physical processor count.

The CUDA Paradigm

CUDA is a minimal extension of the C and C++ programming languages. The programmer writes a serial program that calls parallel kernels, which may be simple functions or full programs. A kernel executes in parallel across a set of parallel threads. The programmer organizes these threads into a hierarchy of grids of thread blocks. A thread block is a set of concurrent threads that can cooperate among themselves through barrier synchronization and shared access to a memory space private to the block. A grid is a set of thread blocks that may each be executed independently and thus may execute in parallel.

When invoking a kernel, the programmer specifies the number of threads per block and the number of blocks making up the grid. Each thread is given a unique thread ID number threadIdx within its thread block, numbered 0, 1, 2, ..., blockDim - 1, and each thread block is given a unique block ID number blockIdx within its grid. CUDA supports thread blocks containing up to 512 threads. For convenience, thread blocks and grids may have one, two, or three dimensions, accessed via .x, .y, and .z index fields.

As a very simple example of parallel programming, suppose that we are given two vectors x and y of n floating-point numbers each and that we wish to compute the result of y ← ax + y, for some scalar value a. This is the so-called saxpy kernel defined by the BLAS (basic linear algebra subprograms) library. The code for performing this computation both on a serial processor and in parallel using CUDA is shown in figure 1.
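Figure 1 is not reproduced in this excerpt. What follows is only a sketch of how the serial and CUDA versions of saxpy might look, based on the description in the text. The function names saxpy_serial and saxpy_parallel, the helper saxpy_mismatches, and the test values are assumptions made for illustration and are not taken from the article.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Serial saxpy: a loop whose iterations are independent of one another.
void saxpy_serial(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Parallel saxpy: each thread handles the one element selected by its IDs.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may extend past the end of the vectors
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper (not from the article): runs both versions on the same
// input (n > 0) and returns the number of differing elements, or -1 on error.
int saxpy_mismatches(int n, float a)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *x = (float *)malloc(bytes);
    float *y_serial = (float *)malloc(bytes);
    float *y_gpu = (float *)malloc(bytes);
    float *d_x = NULL, *d_y = NULL;
    int result = -1;

    if (x && y_serial && y_gpu) {
        for (int i = 0; i < n; ++i) {
            x[i] = (float)i;      // small integers keep the arithmetic exact
            y_serial[i] = 1.0f;
            y_gpu[i] = 1.0f;
        }
        if (cudaMalloc(&d_x, bytes) == cudaSuccess &&
            cudaMalloc(&d_y, bytes) == cudaSuccess &&
            cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(d_y, y_gpu, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
            int threads = 256;                         // threads per block
            int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
            saxpy_parallel<<<blocks, threads>>>(n, a, d_x, d_y);
            if (cudaGetLastError() == cudaSuccess &&
                cudaMemcpy(y_gpu, d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
                saxpy_serial(n, a, x, y_serial);
                result = 0;
                for (int i = 0; i < n; ++i)
                    if (y_gpu[i] != y_serial[i])
                        ++result;
            }
        }
    }
    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y_serial);
    free(y_gpu);
    return result;
}

int main()
{
    int mismatches = saxpy_mismatches(1000, 2.0f);
    printf("mismatching elements: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The bounds check in the kernel matters because the element count is rarely an exact multiple of the block size, so the last block usually contains a few threads with no element to process.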

The __global__ declaration specifier indicates that the procedure is a kernel entry point. CUDA programs launch parallel kernels with the extended function-call syntax

kernel<<<dimGrid, dimBlock>>>(... parameter list ...);

where dimGrid and dimBlock are three-element vectors of type dim3 that specify the dimensions of the grid in blocks and the dimensions of the blocks in threads, respectively. Unspecified dimensions default to 1.
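To make the dim3 launch parameters concrete, here is a minimal sketch of a two-dimensional launch. The kernel fill_indices, the helper fill_mismatches, and the 16 x 16 block shape are hypothetical choices for this illustration, not taken from the article. Only the x and y dimensions are given, so z is left at its default of 1.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread derives a (row, col) position from the .x and .y index fields
// and writes that position's linear index into a row-major matrix.
__global__ void fill_indices(int width, int height, int *out)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[row * width + col] = row * width + col;
}

// Hypothetical helper: launches the kernel on a 2D grid and returns the
// number of wrong entries, or -1 on an allocation or CUDA error.
int fill_mismatches(int width, int height)
{
    size_t count = (size_t)width * (size_t)height;
    size_t bytes = count * sizeof(int);
    int *host = (int *)malloc(bytes);
    int *dev = NULL;
    int result = -1;

    if (host && cudaMalloc(&dev, bytes) == cudaSuccess) {
        dim3 dimBlock(16, 16);  // 256 threads per block; z defaults to 1
        dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
                     (height + dimBlock.y - 1) / dimBlock.y);
        fill_indices<<<dimGrid, dimBlock>>>(width, height, dev);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = 0;
            for (size_t i = 0; i < count; ++i)
                if (host[i] != (int)i)
                    ++result;
        }
    }
    cudaFree(dev);
    free(host);
    return result;
}

int main()
{
    int mismatches = fill_mismatches(100, 37);
    printf("wrong entries: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```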

In the example, we launch a grid that assigns one thread to each element of the vectors and puts 256 threads in each block. Each thread computes an element index from its thread and block IDs and then performs the desired calculation on the corresponding vector elements. The serial and parallel versions of this code are strikingly similar. This represents a fairly common pattern. The serial code consists of a loop where each iteration is independent of all the others. Such loops can be mechanically transformed into parallel kernels: each loop iteration becomes an independent thread. By assigning a single thread to each output element, we avoid the need for any synchronization among threads when writing results to memory.

The text of a CUDA kernel is simply a C function for one sequential thread. Thus, it is generally straightforward to write and is typically simpler than writing parallel code for vector operations. Parallelism is determined clearly and explicitly by specifying the dimensions of a grid and its thread blocks when launching a kernel.

Parallel execution and thread management are automatic. All thread creation, scheduling, and termination are handled for the programmer by the underlying system. Indeed, a Tesla-architecture GPU performs all thread management directly in hardware. The threads of a block execute concurrently and may synchronize at a barrier by calling the __syncthreads() intrinsic. This guarantees that no thread participating in the barrier can proceed until all participating threads have reached the barrier. After passing the barrier, these threads are also guaranteed to see all writes to memory performed by participating threads before the barrier. Thus, threads in a block may communicate with each other by writing and reading per-block shared memory at a synchronization barrier.
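The barrier and per-block shared memory described above can be illustrated with a small sketch in which each block reverses its own segment of an array. The kernel reverse_in_block, the helper reverse_mismatches, and the BLOCK_SIZE constant are hypothetical names chosen for this example. The sketch assumes the array length is an exact multiple of the block size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Each block stages its segment in shared memory, waits at the barrier, and
// then each thread reads the element written by its mirror-image thread.
__global__ void reverse_in_block(const int *in, int *out)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = in[base + t];
    __syncthreads();  // all writes to tile are visible past this point
    out[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical helper: reverses num_blocks segments of BLOCK_SIZE elements
// and returns the number of wrong entries, or -1 on an error.
int reverse_mismatches(int num_blocks)
{
    int n = num_blocks * BLOCK_SIZE;
    size_t bytes = (size_t)n * sizeof(int);
    int *h_in = (int *)malloc(bytes);
    int *h_out = (int *)malloc(bytes);
    int *d_in = NULL, *d_out = NULL;
    int result = -1;

    if (h_in && h_out) {
        for (int i = 0; i < n; ++i)
            h_in[i] = i;
        if (cudaMalloc(&d_in, bytes) == cudaSuccess &&
            cudaMalloc(&d_out, bytes) == cudaSuccess &&
            cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
            reverse_in_block<<<num_blocks, BLOCK_SIZE>>>(d_in, d_out);
            if (cudaGetLastError() == cudaSuccess &&
                cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
                result = 0;
                for (int i = 0; i < n; ++i) {
                    int base = (i / BLOCK_SIZE) * BLOCK_SIZE;
                    int expected = base + (BLOCK_SIZE - 1 - (i - base));
                    if (h_out[i] != expected)
                        ++result;
                }
            }
        }
    }
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return result;
}

int main()
{
    int mismatches = reverse_mismatches(4);
    printf("wrong entries: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Without the __syncthreads() call, a thread could read a tile entry before the thread responsible for it had written it, which is the hazard the barrier guarantee rules out.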



Read the rest at acmqueue.com





© 2008 ACM, Inc. All rights reserved.