CUDA

These instructions explain how to set up the environment for the CUDA toolkit and install your own copy of the CUDA SDK.

  • The CUDA toolkit includes the compiler and libraries and has been installed globally.
  • The CUDA SDK includes programs and examples that use CUDA. You should install your own copy so that you can modify the examples and explore the algorithms.

Installing the CUDA SDK in your home directory

  1. Add the line:
    • addpackage cuda
    to the .login file in your home directory (don't put this in your .cshrc or .tcshrc file).
  2. Download the CUDA SDK from the following link:
    • http://developer.nvidia.com/object/cuda_3_0_downloads.html#Linux
    Under the LINUX platform, download "GPU Computing SDK code samples and more". The file name is:
    • gpucomputingsdk_3.0_linux.run
  3. Log in to the machine that contains your GPU graphics card. For example:
    • ssh -X wave3d

The remaining steps are performed on the machine containing the GPU card(s) (for example: wave3d). You must be logged in to a machine with at least one GPU card to use the CUDA developer tools (compiler) and to run any CUDA programs.
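
Before installing the SDK, it can help to confirm that the login environment exposes the toolkit on the GPU machine. The following is a minimal sketch in POSIX sh, assuming that "addpackage cuda" puts the CUDA compiler nvcc on your PATH. The helper name have_tool is hypothetical and is not part of CUDA or of the site setup.

```shell
#!/bin/sh
# Sketch: check whether the CUDA compiler is reachable from this shell.
# Assumption: "addpackage cuda" adds nvcc to PATH.
# have_tool is a hypothetical helper, not part of the CUDA toolkit.

have_tool() {
    # Succeeds (exit status 0) if the named command is found on PATH.
    command -v "$1" >/dev/null 2>&1
}

if have_tool nvcc; then
    echo "nvcc found: $(command -v nvcc)"
else
    echo "nvcc not found; check the addpackage line in ~/.login and log in again"
fi
```

Saved under a name of your choice (for example check_cuda.sh), it can be run with "sh check_cuda.sh" from a csh or tcsh login, since the script only inspects the PATH it inherits.
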

  1. Move the gpucomputingsdk_3.0_linux.run file to the directory where you want to install the CUDA SDK (for example: ~/GPU). Change to that directory.
  2. Execute the following command:
    • sh gpucomputingsdk_3.0_linux.run
  3. At the prompt "Enter install path (default ~/NVIDIA_GPU_Computing_SDK):", enter the path where you want to install the CUDA SDK. For example:
    • ~/GPU/NVIDIA_GPU_Computing_SDK
  4. At the prompt "Enter CUDA install path (default /soft/cuda/cuda):", press Enter to accept the default if you are using the shared CUDA toolkit.
  5. If the installation succeeds, the directory you entered above will now exist. For example:
    • ~/GPU/NVIDIA_GPU_Computing_SDK
    In the remaining instructions this directory will be referred to as "the SDK directory".
  6. Change to the SDK directory. For example:
    • cd ~/GPU/NVIDIA_GPU_Computing_SDK
  7. Change to the "C" directory:
    • cd C
  8. Execute the make command:
    • make
  9. All the executable program examples are now compiled and located at "C/bin/linux/release" under the SDK directory. Change to this directory. For example:
    • cd ~/GPU/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
  10. To check the number and properties of the CUDA devices, run deviceQuery:
    • ./deviceQuery
  11. You will get output describing each of the GPU boards in your system. The output depends on the number and type of boards. If deviceQuery runs successfully, your SDK installation is complete. Here is example output for a machine with 4 Tesla boards:
    • ./deviceQuery Starting...
      
       CUDA Device Query (Runtime API) version (CUDART static linking)
      
      There are 4 devices supporting CUDA
      
      Device 0: "Tesla C1060"
        CUDA Driver Version:                           3.0
        CUDA Runtime Version:                          3.0
        CUDA Capability Major revision number:         1
        CUDA Capability Minor revision number:         3
        Total amount of global memory:                 4294770688 bytes
        Number of multiprocessors:                     30
        Number of cores:                               240
        Total amount of constant memory:               65536 bytes
        Total amount of shared memory per block:       16384 bytes
        Total number of registers available per block: 16384
        Warp size:                                     32
        Maximum number of threads per block:           512
        Maximum sizes of each dimension of a block:    512 x 512 x 64
        Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
        Maximum memory pitch:                          2147483647 bytes
        Texture alignment:                             256 bytes
        Clock rate:                                    1.30 GHz
        Concurrent copy and execution:                 Yes
        Run time limit on kernels:                     No
        Integrated:                                    No
        Support host page-locked memory mapping:       Yes
        Compute mode:                                  Default (multiple host threads can use this device simultaneously)
      
      Device 1: "Tesla C1060"
        CUDA Driver Version:                           3.0
        CUDA Runtime Version:                          3.0
        CUDA Capability Major revision number:         1
        CUDA Capability Minor revision number:         3
        Total amount of global memory:                 4294770688 bytes
        Number of multiprocessors:                     30
        Number of cores:                               240
        Total amount of constant memory:               65536 bytes
        Total amount of shared memory per block:       16384 bytes
        Total number of registers available per block: 16384
        Warp size:                                     32
        Maximum number of threads per block:           512
        Maximum sizes of each dimension of a block:    512 x 512 x 64
        Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
        Maximum memory pitch:                          2147483647 bytes
        Texture alignment:                             256 bytes
        Clock rate:                                    1.30 GHz
        Concurrent copy and execution:                 Yes
        Run time limit on kernels:                     No
        Integrated:                                    No
        Support host page-locked memory mapping:       Yes
        Compute mode:                                  Default (multiple host threads can use this device simultaneously)
      
      Device 2: "Tesla C1060"
        CUDA Driver Version:                           3.0
        CUDA Runtime Version:                          3.0
        CUDA Capability Major revision number:         1
        CUDA Capability Minor revision number:         3
        Total amount of global memory:                 4294770688 bytes
        Number of multiprocessors:                     30
        Number of cores:                               240
        Total amount of constant memory:               65536 bytes
        Total amount of shared memory per block:       16384 bytes
        Total number of registers available per block: 16384
        Warp size:                                     32
        Maximum number of threads per block:           512
        Maximum sizes of each dimension of a block:    512 x 512 x 64
        Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
        Maximum memory pitch:                          2147483647 bytes
        Texture alignment:                             256 bytes
        Clock rate:                                    1.30 GHz
        Concurrent copy and execution:                 Yes
        Run time limit on kernels:                     No
        Integrated:                                    No
        Support host page-locked memory mapping:       Yes
        Compute mode:                                  Default (multiple host threads can use this device simultaneously)
      
      Device 3: "Tesla C1060"
        CUDA Driver Version:                           3.0
        CUDA Runtime Version:                          3.0
        CUDA Capability Major revision number:         1
        CUDA Capability Minor revision number:         3
        Total amount of global memory:                 4294770688 bytes
        Number of multiprocessors:                     30
        Number of cores:                               240
        Total amount of constant memory:               65536 bytes
        Total amount of shared memory per block:       16384 bytes
        Total number of registers available per block: 16384
        Warp size:                                     32
        Maximum number of threads per block:           512
        Maximum sizes of each dimension of a block:    512 x 512 x 64
        Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
        Maximum memory pitch:                          2147483647 bytes
        Texture alignment:                             256 bytes
        Clock rate:                                    1.30 GHz
        Concurrent copy and execution:                 Yes
        Run time limit on kernels:                     No
        Integrated:                                    No
        Support host page-locked memory mapping:       Yes
        Compute mode:                                  Default (multiple host threads can use this device simultaneously)
      
      deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4243455, CUDA Runtime Version = 3.0, NumDevs = 4, Device = Tesla C1060, Device = Tesla C1060
      
      
      PASSED
      
      Press <Enter> to Quit...
      -----------------------------------------------------------
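
The success check described in the last step can also be scripted. The sketch below, in POSIX sh, reads deviceQuery output that has been saved to a file and reports how many devices were listed and whether the run printed PASSED. It assumes the output layout shown above (lines beginning with "Device N:" and a line containing only "PASSED"); other SDK versions may print a different layout. The function names count_devices and query_passed are hypothetical.

```shell
#!/bin/sh
# Sketch: summarize deviceQuery output that was saved to a file.
# Assumption: the output looks like the SDK 3.0 example above
# ("Device N:" lines and a bare PASSED line).
# count_devices and query_passed are hypothetical helper names.

count_devices() {
    # Count lines of the form:  Device <number>: ...
    grep -c '^[[:space:]]*Device [0-9][0-9]*:' "$1"
}

query_passed() {
    # Succeeds if some line consists only of PASSED (ignoring whitespace).
    grep -q '^[[:space:]]*PASSED[[:space:]]*$' "$1"
}

# When a file name is given as the first argument, print a short summary.
if [ -n "${1:-}" ] && [ -f "${1:-}" ]; then
    echo "devices listed: $(count_devices "$1")"
    if query_passed "$1"; then
        echo "result: PASSED"
    else
        echo "result: PASSED line not found"
    fi
fi
```

Anchoring the device pattern at the start of the line keeps the final summary line ("deviceQuery, CUDA Driver = ...") from being counted, even though it also mentions "Device = ...".
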