TACC Stampede User Guide


Last update: May 18, 2017 (see revision history)

Notices

05/18/17 - Stampede1's KNL sub-system is no longer available, and the KNL material in the Stampede1 User Guide is now obsolete. We have begun the process of moving Stampede1's 508 KNL nodes to Stampede2. See the Stampede2 documentation for more information.

04/05/17 - The maximum number of nodes requestable in the development queue has been reduced. Jobs are now limited to four nodes. See Stampede Production Queues for other limits.

09/27/16 - This user guide has been updated substantially to reflect the new Knights Landing (KNL) Upgrade. Most of the older Knights Corner (KNC) coprocessor content has been moved to the new Stampede Archive: Knights Corner Technical Material document.

09/01/16 - Multi-Factor Authentication (MFA) is now mandated in order to access all TACC resources. Please see the Multi-Factor Authentication at TACC tutorial for assistance in setting up your account.


Files stored in your Stampede1 $WORK directory will remain available to you from other TACC systems; these files are on the Global Shared File System hosted on Stockyard. See $WORK: Stampede2 vs Stampede1 and Temporary Mounts of the Stampede1 File Systems for more information.


Stampede1 will no longer be available as of April 2, 2018. As of that date you won't be able to log in to the system, and Stampede2 will no longer provide read-only mounts of the Stampede1 home and scratch file systems.


Introduction

TACC's Stampede system, generously funded by the National Science Foundation (NSF) through award ACI-1134872, entered production in January 2013 as a 6,400+ node cluster of Dell PowerEdge server nodes featuring Intel Xeon E5 Sandy Bridge host processors and the Intel Knights Corner (KNC) coprocessor, the first generation of processors based on Intel's Many Integrated Core (MIC) architecture.

Stampede's 2016 Intel Knights Landing (KNL) Upgrade prepares the way for Stampede2 by adding 508 Intel Xeon Phi 7250 second-generation KNL MIC compute nodes. The KNL represents a radical break with the first-generation KNC MIC coprocessor. Unlike the legacy KNC, a Stampede KNL is not a coprocessor: each KNL is a stand-alone, self-booting processor that is the sole processor in its node.

While the KNL and Sandy Bridge nodes share the same three Lustre file systems, the Stampede KNL Upgrade is largely its own independent cluster. In fact it can be helpful to think of Stampede as two related but largely independent sub-systems (Figure 1): the Sandy Bridge cluster and the KNL cluster. Note that the KNL upgrade adds new nodes (and new capabilities) to the system, but leaves the original Stampede hardware intact. The Sandy Bridge cluster, which is the original Stampede system consisting of Sandy Bridge compute nodes with their KNC coprocessors, remains available for production use. When you initiate a login session you do so on either the Sandy Bridge or KNL cluster. When you submit a job on the Sandy Bridge side, for example, it will run only on the Sandy Bridge compute nodes.

The early sections of this User Guide address information important to all Stampede users as well as material specific to the Sandy Bridge cluster. A single self-contained section below focuses on the KNL cluster. For simplicity we have migrated most of the older KNC material to its own stand-alone legacy document, Stampede Archive: Knights Corner (KNC) Technical Material.


Figure 1. Stampede Sandy Bridge and KNL clusters


Sandy Bridge: System Overview

Stampede began production as a 10 petaflop (PF) Dell Linux cluster based on 6,400+ Dell PowerEdge server nodes, most of which contain two Intel Xeon E5 Sandy Bridge processors and a first-generation KNC coprocessor. These components are still in production. The aggregate peak performance of the Xeon E5 Sandy Bridge processors is 2+ PF, while the KNC coprocessors deliver an additional aggregate peak performance of 7+ PF. The system also includes login nodes, large-memory nodes, graphics nodes (for both remote visualization and computation), and dual-coprocessor nodes. Additional nodes (not directly accessible to users) provide management and file system services. The KNL Upgrade, which adds 1.5 PF to Stampede's capabilities, consists of 508 Intel Xeon Phi 7250 KNL compute nodes. One of the important design considerations for Stampede was to create a multi-use cyberinfrastructure resource offering large-memory nodes and other specialized capabilities in addition to conventional compute nodes.

Use "showq -u" or "squeue -u username" to see your jobs. Example job scripts are available online in /share/doc/slurm. They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

Sample Slurm Job Scripts

Sample Slurm job submission scripts are provided for the following cases:

- Trivial serial job
- Simple MPI job
- OpenMP applications
- Symmetric applications, (Host + MIC) or (MIC only)
- Multiple MPI jobs within one batch job
- Large MPI applications in the largemem queue
- Hybrid applications (MPI/OpenMP or MPI/pthreads)
- MPI job on the KNL cluster
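The pop-up sample scripts themselves are not reproduced in this copy of the guide. As a minimal sketch (not one of the official TACC samples; the job name, allocation, and executable are placeholders), a simple MPI batch script for the Sandy Bridge normal queue might look like this:

#!/bin/bash
#SBATCH -J myMPIjob            # job name (placeholder)
#SBATCH -o myMPIjob.o%j        # output file; %j expands to the job ID
#SBATCH -p normal              # queue (partition)
#SBATCH -N 4                   # number of nodes
#SBATCH -n 64                  # total MPI tasks (16 per Sandy Bridge node)
#SBATCH -t 01:00:00            # maximum run time (hh:mm:ss)
#SBATCH -A projectnumber       # allocation to charge (placeholder)

ibrun ./a.out                  # launch the MPI executable with ibrun

Submit such a script from a login node with "sbatch myjobscript".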

Viewing Job & Queue Status

After job submission, users may monitor the status of their jobs in several ways. While a job is in the waiting state, the system continuously monitors the number of nodes that become available and applies a fair-share algorithm and a backfill algorithm to determine a fair, expedient schedule that keeps the machine running at optimum capacity. The latest queue information can be displayed in several different ways using the utilities described below.

Quick view with showq

TACC's "showq" job-monitoring command-line utility displays jobs in the batch system in a manner similar to the PBS utility of the same name. showq summarizes running, idle, and pending jobs, and also shows any advanced reservations scheduled within the next week. See Table 13 for some showq options.

login1$ showq
ACTIVE JOBS--------------------
JOBID    JOBNAME     USERNAME   STATE     CORE   REMAINING   STARTTIME
================================================================================
5201623  ms1b_MG_13  bernuzzi   Running   128    18:03:43    Mon May 18 04:24:41
5224194  SET_6_0     minerj3    Running   32     18:05:01    Mon May 18 04:25:59
5226688  5kTp4kp04   blandrum   Running   48     10:00:07    Mon May 18 20:21:05
5256143  CUUG        tg817524   Running   16     19:43:59    Mon May 18 06:04:57
5265360  NAMD        mahmoud    Running   256    5:20:05     Mon May 18 15:41:03
...

login1$ showq -u janeuser
...
WAITING JOBS-------------------
JOBID    JOBNAME     USERNAME   STATE     CORE   WCLIMIT     QUEUETIME
================================================================================
1676351  helloworld  janeuser   Waiting   4096   15:30:00    Wed Sep 11 11:59:53
1676352  helloworld  janeuser   Waiting   4096   15:30:00    Wed Sep 11 12:00:07
1676354  helloworld  janeuser   Waiting   4096   15:30:00    Wed Sep 11 12:00:09
...

Table 13. showq options

Option    Description
-l        displays queue and node count columns
-u        only active and waiting jobs of the user are reported
--help    get more information on options

Get some info with sinfo

The "sinfo" command without arguments may give you more information than you want. Use the print options in the snippet below with sinfo for a more readable listing that summarizes each queue on a single line. The column labeled "NODES(A/I/O/T)" in this summary listing displays the number of nodes in the Allocated, Idle, and Other states along with the Total node count for the partition. See "man sinfo" for more information. See also the squeue command detailed below. The following lists the availability and status of the queues:

login1$ sinfo -o "%20P %5a %.10l %16F"

Job Monitoring with squeue

Both the showq and squeue commands with the "-u username" option display similar information:

login1$ squeue -u janeuser
  JOBID  PARTITION      NAME      USER  ST  TIME  NODES  NODELIST(REASON)
1676351     normal  hellowor  janeuser  PD  0:00    256  (Resources)
1676352     normal  hellowor  janeuser  PD  0:00    256  (Resources)
1676354     normal  hellowor  janeuser  PD  0:00    256  (Resources)

Each command's output lists the three jobs (1676351, 1676352 and 1676354) waiting to run. The showq command displays the cores and time requested, while the squeue command displays the partition (queue) and the state (ST) of the job, along with the node list once it is allocated. In this case, all three jobs are in the Pending (PD) state awaiting "Resources" (nodes to free up). Table 14 details common squeue options and Table 15 describes the command's output fields.

Table 14. Common squeue Options (arguments are comma-separated lists)

Option           Result
-i <interval>    Repeatedly report at intervals (in seconds).
-j <job_list>    Displays information for specified job(s).
-p <part_list>   Displays information for specified partitions (queues).
-t <state_list>  Shows jobs in the specified state(s): "all" or a list of {PD,R,S,CG,CD,CF,CA,F,TO,PR,NF}. See the squeue man page for state abbreviations.

The "squeue" command output includes a listing of jobs and the following fields for each job: Table 15. Columns in the squeue command output Field

Description

JOBID job id assigned to the job user that owns the job

USER

STATE current job status, including, but not limited to: CD

(completed)

CF

(cancelled)

F

(failed)

PD

(pending)

R

(running)

Using the squeue command with the --start and -j options can provide an estimate of when a particular job will be scheduled:

login1$ squeue --start -j 1676354
  JOBID  PARTITION    NAME      USER  ST           START_TIME  NODES  NODELIST(REASON)
1676534     normal  hellow  janeuser  PD  2013-08-21T13:42:03    256  (Resources)

Even more extensive job information can be found using the "scontrol" command. The output shows quite a bit about the job: job dependencies, submission time, number of nodes, location of the job script and the working directory, etc. See the man page for more details.

login1$ scontrol show job 1676354
JobId=1676991 Name=mpi-helloworld
UserId=slindsey(804387) GroupId=G-40300(40300)
Priority=1397 Account=TG-STA110012S QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=15:30:00 TimeMin=N/A
SubmitTime=2013-09-11T15:12:49 EligibleTime=2013-09-11T15:12:49
StartTime=2013-09-11T17:40:00 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=login4:27520
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=256-256 NumCPUs=4096 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home1/01158/slindsey/mpi/submit.slurm
WorkDir=/home1/01158/slindsey/mpi

About Pending Jobs

Viewing the status of your jobs in the queue may reveal jobs in a pending (PD) state. Jobs submitted to Slurm may be, and may remain, in a pending state for many reasons, such as:

- A queue (partition) may be temporarily offline
- The resources (number of nodes) requested exceed those available
- Queues are being drained in anticipation of system maintenance
- The system is running other high-priority jobs

The reason codes summarized below identify why a job is awaiting execution. If a job is pending for multiple reasons, only one of those reasons is displayed. For a full list, view the squeue man page.

Job Pending Code   Description
Dependency         This job is waiting for a dependent job to complete.
NodeDown           A node required by the job is down.
PartitionDown      The partition (queue) required by this job is in a DOWN state and temporarily accepting no jobs, for instance because of maintenance. Note that this message may be displayed for a time even after the system is back up.
Priority           One or more higher-priority jobs exist for this partition or advanced reservation. Other jobs in the queue have higher priority than yours.
ReqNodeNotAvail    No nodes can be found satisfying your limits, for instance because maintenance is scheduled and the job cannot finish before it begins.
Reservation        The job is waiting for its advanced reservation to become available.
Resources          The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes.
SystemFailure      Failure of the Slurm system, a file system, the network, etc.

Launching Applications

This section discusses how to submit jobs for your particular programming model: MPI, hybrid (OpenMP+MPI), symmetric, and serial codes. Stampede also provides interactive access to the development nodes for users still experimenting with codes. In the examples below we use "a.out" as the executable name, but of course the name may be any valid application name, along with arguments and file redirections, e.g.:

ibrun tacc_affinity ./myprogram myargs < myinput

Please consult the sample Slurm job submission scripts above for various runtime configurations.

Launching Scalable MPI Programs

The MVAPICH2 MPI package provides a runtime environment that can be tuned for scalable codes. For codes dominated by short messages, there is a FAST_PATH option that can reduce communication costs, as well as a mechanism to share receive queues. There is also a Hot-Spot Congestion Avoidance option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, Scalable Features for Large Scale Clusters and Performance Tuning, and Chapter 10, MVAPICH2 Parameters, of the MVAPICH2 User Guide for more information; the MVAPICH documentation is available on the MVAPICH project web site. It is relatively easy to distribute tasks across nodes in the Slurm parallel environment when only the E5 cores are used: the "-N" option sets the number of nodes and the "-n" option sets the total number of MPI tasks.
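These MVAPICH2 tuning options are typically enabled through MV2_* environment variables set in the job script before the ibrun line. The variable names below are taken from the MVAPICH2 User Guide chapters cited above and should be verified against the MVAPICH2 version loaded on Stampede; this is a sketch rather than a TACC-recommended setting:

# Enable the RDMA fast path for short messages
export MV2_USE_RDMA_FAST_PATH=1
# Use shared receive queues to reduce memory footprint at scale
export MV2_USE_SRQ=1
# Hot-Spot Avoidance with multi-pathing (if supported by the fabric)
export MV2_USE_HSAM=1

ibrun ./a.out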

Launching MPI Applications with ibrun

For all codes compiled with any MPI library, use the ibrun command (NOT mpirun) to launch the executable within the job script. The syntax is:

ibrun ./myprogram
ibrun ./a.out

The ibrun command supports options for advanced host selection. A subset of the processors from the list of all hosts can be selected to run an executable. An offset must be applied. This offset can also be used to run two different executables on two different subsets, simultaneously. The option syntax is:

ibrun -n number_of_cores -o hostlist_offset myprogram myprogram_args

For the following advanced example, 64 cores were requested in the job script.

ibrun -n 32 -o  0 ./a.out &
ibrun -n 32 -o 32 ./a.out &
wait

The first call launches a 32-core run on the first 32 hosts in the hostfile, while the second call launches a 32-core run on the second 32 hosts in the hostfile, concurrently (each command is put in the background by terminating it with "&"). The wait command (required) waits for all processes to finish before the shell continues; it works in all shells. Note that the "-n" and "-o" options must be used together. The ibrun command also supports the "-np" option, which limits the total number of tasks used by the batch job.

ibrun -np number_of_cores myprogram myprogram_args

Unlike the "-n" option, the "-np" option requires no offset; the offset is assumed to be 0.

Using a multiple of 16 cores per node

For many pure MPI applications, the most cost-efficient choice is to use a multiple of 16 tasks per node. This ensures that each core on all the nodes is assigned one task. Specify the total number of tasks (use a value evenly divisible by 16), and Slurm will automatically place 16 tasks on each node. (If the number of tasks is not divisible by 16, fewer than 16 tasks will be placed on one of the nodes.) The following example will run on 4 nodes with 16 tasks per node:

#SBATCH -n 64

Do not use the -N (number of nodes) option alone; in that case only a single task will be launched on each node.

Using fewer than 16 cores per node

When fewer than 16 tasks are needed per node, use a combination of -n and -N. The following resource request

#SBATCH -N 4 -n 32

requests 4 nodes with 32 tasks distributed evenly across the nodes and sockets. The number of tasks per node is determined from the ratio tasks/nodes, and the tasks on a node are divided evenly across the two sockets (one socket acquires an extra task when the task count is odd). When the tasks/nodes ratio is not an integer, floor(tasks/nodes) tasks are placed on each node, and the remaining tasks are assigned sequentially to nodes in the hostfile list, one to each node until none remain. The distribution across sockets allows maximal memory bandwidth to each socket.
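As a worked illustration of that placement rule (hypothetical numbers, not taken from the guide), consider a non-integer tasks/nodes ratio:

#SBATCH -N 4 -n 10    # 10 tasks on 4 nodes: floor(10/4) = 2 tasks per node;
                      # the remaining 2 tasks go to the first two nodes in the
                      # hostfile, giving a 3/3/2/2 distribution. On a node with
                      # 3 tasks, one socket gets 2 tasks and the other gets 1.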

Launching Hybrid Programs

For hybrid jobs, specify a total-tasks/nodes ratio of 1, 2, 4, 6, 8, 12, or 16 tasks per node. Then set the $OMP_NUM_THREADS environment variable to the number of threads per task, and use tacc_affinity with ibrun. The Bourne-type shell example below illustrates the parameters for a hybrid job: it requests 2 nodes and 4 tasks, 2 tasks per node, and 8 threads per task.

#SBATCH -n 4 -N 2
...
export OMP_NUM_THREADS=8    # 8 threads/task
...
ibrun tacc_affinity ./a.out

Please see the sample Slurm job scripts section for a complete hybrid job script.
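Since the pop-up samples are not included in this copy, here is a minimal self-contained sketch of such a hybrid script (job name, allocation, and run time are placeholders, not an official TACC sample):

#!/bin/bash
#SBATCH -J hybrid            # job name (placeholder)
#SBATCH -o hybrid.o%j        # output file
#SBATCH -p normal            # queue (partition)
#SBATCH -N 2                 # 2 nodes
#SBATCH -n 4                 # 4 MPI tasks total (2 tasks per node)
#SBATCH -t 00:30:00          # max run time
#SBATCH -A projectnumber     # allocation to charge (placeholder)

export OMP_NUM_THREADS=8     # 8 OpenMP threads per MPI task
ibrun tacc_affinity ./a.out  # launch with TACC affinity settings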

Launching Serial Programs

For serial batch executions, use one node and one task, do not use the ibrun command to launch the executable (just use the executable name), and submit your job to the serial queue (partition). The serial queue has a 12-hour runtime limit and allows up to 6 simultaneous runs per user. There are 148 nodes available for the serial queue.

#SBATCH -N 1 -n 1    # one node and one task
#SBATCH -p serial    # run in serial queue
...
./a.out              # execute your application (no ibrun)

Interactive Sessions

Interactive access to a single node on the supercomputer is extremely useful for developing and debugging codes that may not be ready for full-scale deployment. Interactive sessions are charged to projects just like normal batch jobs. Please restrict usage to the (default) development queue.

idev on Stampede

TACC's HPC staff have implemented the idev application on Stampede. idev provides interactive access to a single node and lets you attach as many terminal sessions as needed for debugging purposes. idev is simple to use, bypassing the more arcane syntax of the srun command. Further idev documentation can be found at https://portal.tacc.utexas.edu/software/idev. In the sample session below, a user requests interactive access to a single node for 15 minutes in order to debug the progindevelopment application. idev returns a compute-node login prompt:

login1$ idev -m 15
...
--> Sleeping for 7 seconds...OK
...
--> Creating interactive terminal session (login) on master node c557-704.
...
c557-704$ vim progindevelopment.c
c557-704$ make progindevelopment

Now the user may open another window to run the newly-compiled application, while continuing to debug in the original terminal session:

WINDOW2 c557-704$ ibrun -np 16 ./progindevelopment
WINDOW2 ...program output ...
WINDOW2 c557-704$

Use the "-h" switch to see more options:

login1$ idev -h

srun

Slurm's srun command will interactively request a batch job, returning a compute-node name as a prompt, usually scheduled within a short period of time. Issue the srun command only from a login node. The command syntax is:

srun --pty -A projectnumber -p queue -t hh:mm:ss -n tasks -N nodes /bin/bash -l

The "-A", "-p", "-t", "-n" and "-N" batch options respectively specify the project/allocation number, the queue (partition), the maximum runtime, the total number of tasks, and the number of nodes. The "-A" option is only necessary for users with multiple allocations. The batch job is terminated when the shell is exited. The following example illustrates a request for 1 hour in the development queue on one compute node using the bash shell, followed by an MPI executable launch.

login1$ srun --pty -p development -t 01:00:00 -n16 /bin/bash -l
...
c423-001$
c423-001$ ibrun ./a.out

Affinity and Memory Locality

HPC workloads often benefit from pinning processes to hardware instead of allowing the operating system to migrate them at will. This is particularly important in multicore and heterogeneous systems, where process (and thread) migration can lead to less-than-optimal memory access and resource sharing patterns, and thus significant performance degradation. TACC provides an affinity script called tacc_affinity that enforces strict local memory allocation and pins processes to the socket. For most HPC workloads, using tacc_affinity will ensure that processes do not migrate and memory accesses are local. To use tacc_affinity with your MPI executable, use this command:

c423-001$ ibrun tacc_affinity a.out

This applies an affinity based on the tasks_per_socket option (or an appropriate affinity if tasks_per_socket is not used) and a memory policy that forces memory assignments to the local socket. Try ibrun with and without tacc_affinity to determine whether your application runs better with the TACC affinity settings. There may, however, be instances in which tacc_affinity is not flexible enough to meet the user's requirements. This section describes techniques to control process affinity and memory locality that can be used to improve execution performance on Stampede and other HPC resources. In this section an MPI task is synonymous with a process. Do not use multiple methods to set affinity simultaneously, as this can lead to unpredictable results.

Using numactl

numactl is a Linux command that allows explicit control of process affinity and memory policy. Since each MPI task is launched as a separate process, numactl can be used to specify the affinity and memory policy for each task. There are two ways it can be used to exercise numa control when launching a batch executable:

c423-001$ ibrun numactl options ./a.out
c423-001$ ibrun my_affinity ./a.out

The first command sets the same options for each task. Because the ranks for the execution of each a.out are not known to numactl, it is not possible to use this command line to tailor options for each individual task. The second command launches an executable script, my_affinity, that sets affinity for each task. The script has access to the number of tasks per node and the rank of each task, so it is possible to set individual affinity options for each task with this method; a sketch of such a wrapper script appears below, after Table 16. In general, any execution using more than one task should employ the second method to set affinity so that tasks can be properly pinned to the hardware.

In threaded applications, the same numactl command may be used, but its scope extends globally to all threads, because every forked process or thread inherits the affinity and memory policy of the parent. This behavior can be modified from within a program using the numa API: the basic calls for binding tasks and threads are "sched_getaffinity" and "sched_setaffinity", while the numa library provides calls for controlling memory policy.

Note that on the login nodes the core numbers for masking are assigned round-robin to the sockets (cores 0, 2, 4, ... are on socket 0 and cores 1, 3, 5, ... are on socket 1), while on the compute nodes they are assigned contiguously (cores 0-7 are on socket 0 and cores 8-15 are on socket 1). The TACC-provided affinity script, tacc_affinity, enforces a strict local memory allocation to the socket, forcing eviction of the previous user's I/O buffers, and also distributes tasks evenly across sockets. Use this script as a template if a custom affinity script is needed for your jobs.

Table 16. Common numactl options

Option        Arguments              Description
-N            0,1                    Socket affinity. Execute process only on this (these) socket(s).
-C            [0-15]                 Core affinity. Execute process on this (these, comma-separated list) core(s).
-l            none                   Memory policy. Allocate only on the socket where the process runs. Fall back to another if full.
-i            0,1                    Memory policy. Strictly allocate round robin on these (comma-separated list) sockets. No fallback; abort if no more allocation space is available.
-m            0,1                    Memory policy. Strictly allocate on this (these, comma-separated list) socket(s). No fallback; abort if no more allocation space is available.
--preferred=  0,1 (select only one)  Memory policy. Allocate on this socket. Fall back to the other if full.

Additional details on numactl are given in its man page and help information:

login1$ man numactl
login1$ numactl --help
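The following is a minimal sketch of a my_affinity-style wrapper, not the TACC script itself. It assumes the MVAPICH2 stack, which exports each task's node-local rank in MV2_COMM_WORLD_LOCAL_RANK, and a two-socket, 16-core Sandy Bridge compute node; adjust the variable name for other MPI stacks:

#!/bin/bash
# my_affinity (sketch): pin each MPI task to one socket and its local memory
localrank=${MV2_COMM_WORLD_LOCAL_RANK:-0}   # node-local rank of this task
socket=$(( localrank % 2 ))                 # alternate tasks between sockets 0 and 1
# -N binds the task to the socket; -l keeps its memory allocations local
exec numactl -N $socket -l "$@"

Launched as "ibrun my_affinity ./a.out", each task computes its own socket number and execs the real program under numactl.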

Using Intel's KMP_AFFINITY

To alleviate the complexity of setting affinity on architectures that support multiple hardware threads per core, such as the MIC family of coprocessors, Intel provides a means of controlling thread pinning via the environment variables $KMP_AFFINITY and $MIC_KMP_AFFINITY. Set these variables to control affinity on the host cores and the Phi coprocessors, respectively.

login1$ export KMP_AFFINITY=[<modifier>,...]<type>

Table 17. KMP_AFFINITY types

Affinity type  Description
compact        Pack threads close to each other.
explicit       Use the proclist modifier to pin threads.
none           Does not pin threads.
scatter        Round-robin threads to cores.
balanced       (Phi coprocessor only) Use scatter, but keep OMP thread ids consecutive.

KMP_AFFINITY type modifiers include:

- norespect or respect (OS thread placement)
- noverbose or verbose
- nowarnings or warnings
- granularity=[fine|core], where fine means pinned to a hardware thread and core means the thread may jump between hardware threads within the core
- proclist={<proc_list>}, used with the explicit affinity type

The meaning of the different affinity types is best explained with an example. Imagine that we have a system with 4 cores and 4 hardware threads per core. If we place 8 threads, the assignments produced by the compact, scatter, and balanced types are shown in Figure 5 below. Notice that compact does not fully utilize all the cores in the system. For this reason it is recommended that applications be run using the scatter or balanced (Phi coprocessor only) types in most cases.
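For example, a common pattern (a sketch, not a TACC-mandated setting) combines a granularity modifier with a type; the two lines below would apply to an OpenMP run on the host and on a KNC coprocessor, respectively:

export KMP_AFFINITY=granularity=core,scatter        # host: round-robin threads across cores
export MIC_KMP_AFFINITY=granularity=fine,balanced   # coprocessor: consecutive thread ids, all cores used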

Figure 5. KMP Affinity (compact, scatter, and balanced thread assignments)

See also Intel's documentation: Thread Affinity Interface, OpenMP* Thread Affinity Control, and Balanced Affinity Type.

Please see the KNL: Running Jobs section below for information on launching jobs on the KNL cluster.

File Systems

The TACC HPC platforms have several different file systems with distinct storage characteristics. There are pre-defined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

The $HOME, $WORK and $SCRATCH directories on Stampede are Lustre file systems, designed for parallel and high-performance access to large files from within applications. They have been configured to work well with MPI-IO and support access from many compute nodes. Since metadata services for each file system go through a single server (a limitation of Lustre), users should consider efficient strategies for minimizing file operations (opening and closing files) when scaling applications to large node counts.

To determine the amount of disk space used in a file system, cd to the directory of interest and execute the "df -k ." command, including the dot that represents the current directory, as demonstrated below:

login1$ cd mydirectory
login1$ df -k .
Filesystem           1K-blocks     Used       Available    Use%  Mounted on
206.76.192.2:/home1  15382877568   31900512   15350977056  1%    /home1

In the command output above, the file system name appears on the left (IP number, followed by the file system name), the used and available space (-k, in units of 1 KByte) appear in the middle columns, followed by the percent used and the mount point. To determine the amount of space used in a user-owned directory, cd to the directory and execute the "du -sh" command (s = summary, h = human-readable units):

login1$ du -sh

To determine quota limits and usage on $HOME and $WORK, execute the Lustre file system "lfs quota" command without any options (from any directory). Usage and quotas are also reported at each login.

login1$ lfs quota $HOME login1$ lfs quota $WORK Stampede's major file systems, $HOME, $WORK, $SCRATCH, /tmp and $ARCHIVE, are detailed below.

$HOME

At login, the system automatically sets the current working directory to your home directory. Store your source code and build your executables here. This directory has a quota limit of 5GB and 150K files. This file system is backed up. The login nodes and any compute node can access this directory. Use the environment variable $HOME to reference your home directory in scripts. Use the "cdh" or "cd" commands to change to $HOME.

$WORK

This directory has a quota limit of 1TB and 3M files. Store large files here. Change to this directory in your batch scripts and run jobs in this file system. The work file system is approximately 450TB. This file system is not backed up. The login nodes and any compute node can access this directory. Purge policy: not purged. Use the environment variable $WORK to reference this directory in scripts. Use "cdw" to change to $WORK.

$SCRATCH

Store large files here. Change to this directory in your batch scripts and run jobs in this file system. The scratch file system is approximately 8.5PB. This file system is not backed up. The login nodes and any compute node can access this directory. Purge policy: files with production access times* greater than 10 days may be purged. Use $SCRATCH to reference this directory in scripts. Use the "cds" command to change to $SCRATCH.

NOTE: TACC staff may periodically delete files from the $SCRATCH file system even if files are less than 10 days old. A full file system inhibits use of the system for everyone. Using programs or scripts to actively circumvent the file purge policy will not be tolerated.

* A file's access time is updated when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing it on a compute node does. Preserving access times on login nodes keeps utilities such as tar, scp, etc. from obscuring production usage for purging. To view files' access times:

login1$ ls -ul .

Do NOT install software in the $SCRATCH file system as it is subject to purging.

/tmp

This is a directory on a local disk on each node where you can store files and perform local I/O for the duration of a batch job. It is often more efficient to use and store files directly in $SCRATCH (to avoid moving files off /tmp at the end of a batch job). The /tmp file system provides approximately 80GB to users. Files stored in the /tmp directory on each node are removed immediately after the job terminates. Use "/tmp" to reference this file system in scripts.

$ARCHIVE

Stampede's archival storage system is Ranch, accessible via the $ARCHIVER and $ARCHIVE environment variables. Store permanent files here for archival storage. This file system is NOT NFS-mounted (directly accessible) on any node. Use the scp command to transfer data to this system:

login1$ scp ${ARCHIVER}:$ARCHIVE/mybigfile $WORK

or

login1$ scp mybigfile ${ARCHIVER}:

Use the ssh command to login to the Ranch system from any TACC machine. For example:

login1$ ssh $ARCHIVER

Some files stored on the archiver may require staging prior to retrieval. See the Ranch user guide for more on archiving your files.

Sharing Files

Users often wish to collaborate with fellow project members by sharing files and data with each other. Project managers or delegates can create shared workspaces, areas that are private and accessible only to other project members, using UNIX group permissions and commands. Shared workspaces may be created as read-only or read-write, functioning as data repositories and providing a common work area to all project members. Please see Sharing Project Files on TACC Systems for step-by-step instructions.

GPU Programming

CUDA is available on the login nodes and the GPU-equipped compute nodes. GPU nodes are accessible through the gpu queue for production work and the gpudev queue for development work. Production job scripts should include the "module load cuda" command before executing CUDA code; likewise, load the cuda module before or after acquiring an interactive development GPU node with the "idev" command.
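As a minimal sketch (assuming the gpu queue and placeholder job/allocation names, not an official TACC sample), a GPU batch script might look like this:

#!/bin/bash
#SBATCH -J mygpujob          # job name (placeholder)
#SBATCH -o mygpujob.o%j      # output file
#SBATCH -p gpu               # GPU production queue (use gpudev for development)
#SBATCH -N 1 -n 16           # one node, 16 tasks
#SBATCH -t 00:30:00          # max run time
#SBATCH -A projectnumber     # allocation to charge (placeholder)

module load cuda             # make the CUDA runtime available
ibrun ./a.out                # or simply "./a.out" for a single-process GPU code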

Accelerator (CUDA) Programming

NVIDIA's CUDA compiler and libraries are accessed by loading the cuda module:

$ module load cuda

Use the nvcc compiler on the login node to compile code, and run executables on nodes with GPUs; there are no GPUs on the login nodes. Stampede's K20 GPUs are compute capability 3.5 devices. When compiling your code, make sure to specify this level of capability with:

nvcc -arch=compute_35 -code=sm_35 ...

The NVIDIA CUDA debugger is cuda-gdb. Applications must be debugged through a VNC session or an interactive srun session. Please see the relevant srun and VNC sections for more details. The NVIDIA Compute Visual Profiler, computeprof, can be used to profile both CUDA and OpenCL programs that have been developed in the NVIDIA CUDA/OpenCL programming environment. Since the profiler is X based, it must be run either within a VNC session or by ssh-ing into an allocated compute node with X-forwarding enabled. The profiler command and library paths are included in the $PATH and $LD_LIBRARY_PATH variables by the cuda module. The computeprof executable and libraries can be found in the following respective directories:

$TACC_CUDA_DIR/computeprof/bin
$TACC_CUDA_DIR/computeprof/lib

For further information on the CUDA compiler, programming, the API, and the debugger, please see:

$TACC_CUDA_DIR/doc/nvcc.pdf
$TACC_CUDA_DIR/doc/CUDA_C_Programming_Guide.pdf
$TACC_CUDA_DIR/doc/CUDA_Toolkit_Reference_Manual.pdf
$TACC_CUDA_DIR/doc/cuda-gdb.pdf

Heterogeneous (OpenCL) Programming

The OpenCL heterogeneous computing language is supported on all Stampede computing platforms. The Intel OpenCL environment supports the Xeon processors and Xeon Phi coprocessors, and the NVIDIA OpenCL environment supports the Tesla accelerators. Note that the prompts in the examples below are deliberately generic: you can compile an OpenCL application on essentially any Stampede node, but you can run it only on nodes with the hardware the application requires (e.g. GPU or KNC).

Using the Intel OpenCL Environment

The Intel OpenCL drivers and runtimes are supported at TACC for all installed Intel compilers. Execute the compiler command with the "-lOpenCL" loader option to include the OpenCL libraries, and prepend the "/opt/apps/intel/opencl/lib64" path to the $LD_LIBRARY_PATH environment variable when running an OpenCL executable, as illustrated below.

$ icc -lOpenCL -o ocl.out ocl_prog.c
$ export LD_LIBRARY_PATH=/opt/apps/intel/opencl/lib64:$LD_LIBRARY_PATH
$ ./ocl.out

Using the NVIDIA OpenCL Environment

The NVIDIA OpenCL environment supports the v1.1 API and is accessible through the cuda module:

$ module load cuda

For programming with NVIDIA OpenCL, please see the OpenCL specification at https://www.khronos.org/registry/cl/specs/opencl-1.1.pdf. Use the g++ compiler to compile NVIDIA-based OpenCL. The include files are located in the $TACC_CUDA_DIR/include subdirectory. The OpenCL library is installed in the /usr/lib64 directory, which is on the default library path. Use this path and g++ options to compile OpenCL code:

$ export OCL=$TACC_CUDA_DIR
$ g++ -I $OCL/include -lOpenCL prog.cpp

Visualization on Stampede

While batch visualization can be performed on any Stampede node, a set of nodes has been configured for hardware-accelerated rendering. The vis queue contains a set of 128 compute nodes configured with one NVIDIA K20 GPU each. The largemem queue contains a set of 16 compute nodes configured with one NVIDIA Quadro 2000 GPU each.

Remote Desktop Access

Remote desktop access to Stampede is provided through a VNC connection to one or more visualization nodes. Users must first connect to a Stampede login node (see System Access) and submit a special interactive batch job that:

- allocates a set of Stampede visualization nodes
- starts a vncserver process on the first allocated node
- sets up a tunnel through the login node to the vncserver access port

Once the vncserver process is running on the visualization node and a tunnel through the login node is created, an output message identifies the access port for connecting a VNC viewer. A VNC viewer application run on the user's remote system then presents the desktop to the user.

Note: If this is your first time connecting to Stampede, you must run vncpasswd to create a password for your VNC servers. This should NOT be your login password! This mechanism only deters unauthorized connections; it is not fully secure, as only the first eight characters of the password are saved. All VNC connections are tunnelled through SSH for extra security, as described below.

Follow the steps below to start an interactive session.

1. Start a Remote Desktop

TACC has provided a VNC job script (/share/doc/slurm/job.vnc) that requests one node in the vis queue for four hours, creating a VNC session.

login1$ sbatch /share/doc/slurm/job.vnc

You may modify or overwrite script defaults with sbatch command-line options:

- "-t hours:minutes:seconds" modify the job runtime
- "-A projectnumber" specify the project/allocation to be charged
- "-N nodes" specify the number of nodes needed
- "-p partition" specify an alternate queue

See more sbatch options in Table 11. All arguments after the job script name are sent to the vncserver command. For example, to set the desktop resolution to 1440x900, use:

login1$ sbatch /share/doc/slurm/job.vnc -geometry 1440x900

The job.vnc script starts a vncserver process and writes the connect port for the vncviewer to the output file vncserver.out in the job submission directory. Watch for the "To connect via VNC client" message at the end of the output file, or watch the output stream in a separate window with the commands:

login1$ touch vncserver.out ; tail -f vncserver.out

The lightweight window manager xfce is the default VNC desktop and is recommended for remote performance. Gnome is also available; to use gnome, open the "~/.vnc/xstartup" file (created after your first VNC session) and replace "startxfce4" with "gnome-session". Note that gnome may lag over slow internet connections.

2. Create an SSH Tunnel to Stampede

TACC requires users to create an SSH tunnel from the local system to the Stampede login node to assure that the connection is secure. On a Unix or Linux system, execute the following command once the port has been opened on the Stampede login node:

localhost$ ssh -f -N -L xxxx:stampede.tacc.utexas.edu:yyyy username@stampede.tacc.utexas.edu

where:

- "yyyy" is the port number given by the vncserver batch job
- "xxxx" is a port on your local system. Generally, the port number specified on the Stampede login node, yyyy, is a good choice to use on your local system as well
- "-f" instructs SSH to only forward ports, not to execute a remote command
- "-N" puts the ssh command into the background after connecting
- "-L" forwards the port

On Windows systems, find the menu in your SSH client where tunnels can be specified, enter the local and remote ports as required, then ssh to Stampede.

3. Connecting vncviewer

Once the SSH tunnel has been established, use a VNC client to connect to the local port you created, which will then be tunneled to your VNC server on Stampede. Connect to localhost:xxxx, where xxxx is the local port you used for your tunnel. In the examples above, we would connect the VNC client to localhost::xxxx (some VNC clients accept localhost:xxxx). We recommend the TigerVNC client, a platform-independent client/server application.

Once the desktop has been established, two initial xterm windows are presented (which may be overlapping). One, which is white-on-black, manages the lifetime of the VNC server process. Killing this window (typically by typing "exit" or "ctrl-D" at the prompt) will cause the vncserver to terminate and the original batch job to end. Because of this, we recommend that this window not be used for other purposes; it is just too easy to accidentally kill it and terminate the session. The other xterm window is black-on-white, and can be used to start both serial programs running on the node hosting the vncserver process and parallel jobs running across the set of cores associated with the original batch job. Additional xterm windows can be created using the window manager's left-button menu.

Running Applications on the VNC Desktop

From an interactive desktop, applications can be run from icons or from xterm command prompts. Two special cases arise: running parallel applications, and running applications that use OpenGL.

Running Parallel Applications from the Desktop

Parallel applications are run on the desktop using the same ibrun wrapper described above (see Launching Applications). The command:

c442-001$ ibrun [ibrun options] application [application options]

will run application on the associated nodes, as modified by the ibrun options.

Running OpenGL/X Applications On The Desktop

Running OpenGL/X applications on Stampede visualization nodes requires that the native X server be running on each participating visualization node. As on other TACC visualization servers, on Stampede the X servers are started automatically on each node (this happens for all jobs submitted to the vis and largemem queues). Once native X servers are running, several scripts are provided to enable rendering in different scenarios.

vglrun: Because VNC does not support OpenGL applications, VirtualGL is used to intercept OpenGL/X commands issued by application code and redirect them to a local native X display for rendering; rendered results are then automatically read back and sent to VNC as pixel buffers. To run an OpenGL/X application from a VNC desktop command prompt:

c442-001$ vglrun [vglrun options] application application-args

tacc_xrun: Some visualization applications present a client/server architecture, in which every process of a parallel server renders to local graphics resources, then returns rendered pixels to a separate, possibly remote client process for display. By wrapping server processes in the tacc_xrun wrapper, the $DISPLAY environment variable is manipulated to share the rendering load across the GPUs available on each node. For example,

c442-001$ ibrun tacc_xrun application application-args

will cause the tasks to utilize each node, but will not render to any VNC desktop windows.

tacc_vglrun: Other visualization applications incorporate the final display function in the root process of the parallel application. This case is much like the one described above, except that the root node must use vglrun to return rendered pixels to the VNC desktop. For example,

c442-001$ ibrun tacc_vglrun application application-args

will cause the tasks to utilize the GPU for rendering, but will transfer the root process' graphics results to the VNC desktop.

Visualization Applications

Stampede provides a set of visualization-specific modules, discussed below:

- Parallel VisIt on Stampede
- Parallel ParaView on Stampede
- Amira on Stampede

Parallel VisIt on Stampede

VisIt was compiled under the Intel compiler and the mvapich2 MPI stack. After connecting to a VNC server on Stampede, as described above, load the VisIt module at the beginning of your interactive session before launching the VisIt application:

c442-001$ module load visit
c442-001$ vglrun visit

VisIt first loads a dataset and presents a dialog allowing for selecting either a serial or parallel engine. Select the parallel engine. Note that this dialog will also present options for the number of processes to start and the number of nodes to use; these options are actually ignored in favor of the options specified when the VNC server job was started.

Preparing data for Parallel VisIt

In order to take advantage of parallel processing, VisIt input data must be partitioned and distributed across the cooperating processes. This requires that the input data be explicitly partitioned into independent subsets at the time it is input to VisIt. VisIt supports SILO data, which incorporates a parallel, partitioned representation. Otherwise, VisIt supports a metadata file (with a .visit extension) that lists multiple data files of any supported format that are to be associated into a single logical dataset. In addition, VisIt supports a "brick of values" format, also using the .visit metadata file, which enables single files containing data defined on rectilinear grids to be partitioned and imported in parallel. Note that VisIt does not support the VTK parallel XML formats (.pvti, .pvtu, .pvtr, .pvtp, and .pvts). For more information on importing data into VisIt, see Getting Data Into VisIt; though this documentation refers to VisIt version 2.0, it appears to be the most current available.

Parallel ParaView on Stampede

After connecting to a VNC server on Stampede, as described above, do the following:

1. Set the $NO_HOSTSORT environment variable to 1

csh shell:

login1$ setenv NO_HOSTSORT 1

bash shell:

login1$ export NO_HOSTSORT=1

2. Set up your environment with the necessary modules. If you intend to use the Python interface to ParaView via any of the following methods:

- the Python scripting tool available through the ParaView GUI
- pvpython
- loading the paraview.simple module into python

then load the python, qt and paraview modules in this order:

c442-001$ module load python qt paraview

Otherwise, just load the qt and paraview modules in this order:

c442-001$ module load qt paraview

Note that the qt module is always required and must be loaded prior to the paraview module.

3. Launch ParaView:

c442-001$ vglrun paraview [paraview client options]

4. Click the "Connect" button, or select File -> Connect

5. If this is the first time you've used ParaView in parallel (or you did not save your connection configuration in prior runs):

   I.   Select "Add Server"
   II.  Enter a "Name", e.g. "ibrun"
   III. Click "Configure"
   IV.  For "Startup Type" in the configuration dialog, select "Command" and enter the command:

        c442-001$ ibrun tacc_xrun pvserver [paraview server options]

        and click "Save"
   V.   Select the name of your server configuration, and click "Connect"

You will see the parallel servers being spawned and the connection established in the ParaView Output Messages window.

Amira on Stampede

Amira runs only on one specific Stampede node, c400-116. You must explicitly request this node when submitting a VNC job script:

login1$ sbatch -w c400-116 -A project /share/doc/slurm/job.vnc

After connecting to a VNC server, load Amira as follows:

c400-116$ module load amira
c400-116$ vglrun $AMIRA_BIN/start

Tools

Timing Tools

Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric, but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

Package Timers

The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax:

c123-456$ /usr/bin/time -p ./a.out

The -p option specifies traditional precision output, with units in seconds. See the time man page for additional information. To use time with an MPI task, use:

c123-456$ /usr/bin/time -p ibrun ./a.out

This example provides timing information only for the rank 0 task on the master node (the node that executes the job script); however, the time output labeled real is applicable to all tasks, since MPI tasks terminate together. The user and sys times may vary markedly from task to task if the tasks do not perform the same amount of computational work (i.e., are not load balanced).

Code Section Timers

Section timing is another popular mechanism for obtaining timing information. Use these timers to measure the performance of individual routines or blocks of code by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below.

Table 18. Code Section Timers

Routine         Type           Resolution (usec)  OS/Compiler
times           user/sys       1000               Linux/AIX/IRIX/UNICOS
getrusage       wall/user/sys  1000               Linux/AIX/IRIX
gettimeofday    wall clock     1                  Linux/AIX/IRIX/UNICOS
rdtsc           wall clock     0.1                Linux
read_real_time  wall clock     0.001              AIX
system_clock    wall clock     system dependent   Fortran90 intrinsic
MPI_Wtime       wall clock     system dependent   MPI library (C and Fortran)

For general-purpose or coarse-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and to time as many loop iterations as possible, so that the measured duration is at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:

Fortran example:

real*8, external :: x_timer
real*8 :: sec0, sec1, tseconds
...
sec0 = x_timer()
sec1 = x_timer()
tseconds = sec1 - sec0

C example:

double x_timer(void);
double sec0, sec1, tseconds;
...
sec0 = x_timer();
sec1 = x_timer();
tseconds = sec1 - sec0;

Standard Profilers

The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to the most time-consuming routines, the hotspots, where optimization might be most beneficial. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the "-p" (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below with a sample Fortran program.

Profiling Serial Executables

login1$ ifort -p prog.f90    # instruments code
login1$ idev
...
c123-456$ a.out              # produces gmon.out trace file
c123-456$ gprof              # reads gmon.out (default args: a.out gmon.out); report sent to STDOUT

Profiling Parallel Executables

login1$ mpif90 -p prog.f90               # instruments code
login1$ setenv GMON_OUT_PREFIX gout.*    # forces each task to produce a gout file
login1$ idev
...
c123-456$ ibrun a.out                    # produces gmon.out trace files
c123-456$ gprof -s gout.*                # combines gout files into gmon.sum
c123-456$ gprof a.out gmon.sum           # reads executable (a.out) and gmon.sum; report sent to STDOUT

Detailed documentation is available at www.gnu.org.

Profiling with PerfExpert

Source-code performance optimization has four stages: measurement, diagnosis of bottlenecks, determination of optimizations, and rewriting source code. Executing these steps for today's complex many-processor and heterogeneous computer architectures requires a wide spectrum of knowledge that many application developers would rather not have to learn. PerfExpert, an expert system built on generations of performance measurement and analysis tools, uses knowledge of architectures and compilers to (partially) automate performance optimization for multicore chips and heterogeneous nodes of cluster computers. PerfExpert automates the first three performance-optimization stages, then implements those optimizations as part of the fourth stage. PerfExpert is available on the Stampede Sandy Bridge nodes, but not yet on the MICs. PerfExpert depends on the Java interface, HPCToolkit, and the PAPI hardware counter utility, and requires the papi, hpctoolkit, and perfexpert modules to be loaded. The "module help" command provides additional information.

login1$ module load papi hpctoolkit perfexpert
login1$ module help perfexpert

Stampede KNL Cluster

Stampede1's KNL sub-system is no longer available, and the KNL material in the Stampede1 User Guide is now obsolete. We have begun the process of moving Stampede1's 508 KNL nodes to Stampede2. See the Stampede2 documentation for more information.

While the Stampede KNL Upgrade and the Sandy Bridge cluster share the /home1, /work, and /scratch file systems, the Stampede KNL Upgrade is largely an independent cluster. The KNL cluster has its own dedicated Haswell login node, a separate OmniPath network, a KNL-compatible software stack, its own Slurm scheduler, and KNL-specific queues. It also runs a newer Linux distribution (CentOS 7) than the Sandy Bridge cluster (CentOS 6).

Moreover, there are implications associated with sharing $HOME across the Sandy Bridge and KNL clusters. The section below on KNL Cluster: Modules addresses the most important such implication. More generally, remember that a Linux $HOME directory contains startup and configuration files and directories (often invisible so-called "dotfiles" that begin with the "." character) that will be active on both sides of Stampede. This means, for example, that your .bash_history may contain commands that are meaningful or correct only on one side or the other.

The KNL cluster includes 508 Intel Xeon Phi 7250 KNL compute nodes (68 cores per node, 4 hardware threads per core) housed in 9 racks. Each KNL is a self-hosted node, running CentOS 7 and supporting a KNL-compatible software stack. The nodes include 112GB of local solid state drive (SSD). The interconnect is a 100Gb/sec OmniPath network: a fat-tree topology of eight core switches and 320 leaf switches with 5/4 oversubscription.

The lightweight KNL cores have a clock frequency of 1.4 GHz, about half that of more conventional processors. This means that performance on KNL depends to a large degree on making effective use of a large number of cores; to put it another way, good performance requires a program or workflow that exposes a high degree of parallelism.

Each of Stampede's KNL nodes includes 96GB of traditional DDR4 Random Access Memory (RAM). In addition, the KNL processors feature an additional 16GB of high-bandwidth, on-package memory known as Multi-Channel Dynamic Random Access Memory (MCDRAM) that is up to four times faster than DDR4. The KNL's memory is configurable in two important ways: there are BIOS settings that determine at boot time the processor's memory mode and cluster mode. The memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The cluster mode determines the mechanisms for achieving cache coherency, which in turn determines latency: roughly speaking, this amounts to specifying whether and how one can think of some memory addresses as "closer" to a given core than others. Following Intel's defaults and recommendations, the nodes in the KNL cluster's normal and development queues are configured as "Cache-Quadrant" (memory mode set to "Cache", cluster mode set to "Quadrant"). See "KNL: Programming and Performance Considerations" below for a top-level description of these and other available memory and cluster modes.

KNL: System Access

The KNL cluster has a single dedicated login node. Unlike the other login and compute nodes in Stampede, this login node is a Haswell processor. Access this login node by executing:

localhost$ ssh login-knl1.stampede.tacc.utexas.edu

Note that the characters after the "login-" prefix are the three lower-case letters "k", "n", and "l" (el), followed by the digit "1" (one). Access to this login node requires MFA. During TACC's transition to site-wide MFA, connecting to login-knl1 from the Sandy Bridge cluster may require MFA even if you have already authenticated.

KNL: Modules

The module system treats the Sandy Bridge and KNL clusters as separate systems. On the KNL cluster, for example, the module system will load modules that are appropriate for KNL. If you use the "module save" command to create personal collections of modules, the module system will ignore collections you created on the Sandy Bridge cluster when you are on the KNL cluster; in fact, such collections are invisible on the KNL cluster. See Using Stampede: Modules above, or execute "module help", for more information on these and other commands that allow you to define and manage personal collections of modules.
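For example, the standard Lmod commands for working with personal collections look like the following; the collection name "knl_tools" is just a placeholder.

knl-login1$ module save knl_tools      # save the currently loaded modules as a named collection
knl-login1$ module savelist            # list the collections visible from this cluster
knl-login1$ module restore knl_tools   # reload the collection in a later KNL session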

KNL: Building Software

The KNL architecture is binary-compatible with earlier Intel architectures: applications built for Haswell and Sandy Bridge may, in theory, run on KNL without recompiling. However, the software stacks are different: currently the 2017 Intel compiler and Intel MPI (IMPI) libraries are installed only on the KNL cluster, so you will almost certainly need to rebuild all applications and libraries to achieve the best performance.

You may be tempted to use the shared file systems to compile once and run on both sides of Stampede. This is not likely to work: even though the CPUs are compatible, the OmniPath network stack on KNL is not compatible with the Infiniband network stack on the Sandy Bridge cluster. Also note that KNL is not binary-compatible with the legacy KNC coprocessors on the Sandy Bridge cluster; in particular, the "-mmic" flag (used to compile for KNC) is not supported on KNL.

Finally, remember that the login node is a Haswell, not a KNL, processor. This has important implications: in particular, it will affect the way you compile code for KNL. You will need to think about both compatibility and performance when deciding where and how to compile your code for KNL.

You can compile for KNL on either the Haswell login node or any KNL compute node. Building on the login node is likely to be faster and is the approach we currently recommend. When building on the Haswell login node, it may be enough to cross-compile: use the "-xMIC-AVX512" switch at both compile and link time to produce compiled code appropriate only for the KNL, e.g.

knl-login1$ icc -xMIC-AVX512 -o mycode.exe mycode.c
knl-login1$ ifort -xMIC-AVX512 -o mycode.exe mycode.f90

You can also elect to build on the KNL compute nodes; again, use the "-xMIC-AVX512" flag to produce KNL-optimized code. Allow extra time to compile on KNL: the configure stage in particular may be many times slower than it would be on the Haswell login node.

For applications with build systems that need to build and run their own test programs in the build environment (e.g. Autotools/configure, SCons, and CMake), you may need to specify flags that produce code that will run on both the Haswell login node (the build architecture where these tests will run) and the KNL compute nodes (the actual target architecture). This is done through an Intel compiler feature called CPU dispatch that produces binaries containing alternate code paths optimized for multiple architectures. To produce such a binary containing optimized code for both Haswell and KNL, supply two flags when compiling and linking:

-xCORE-AVX2 -axMIC-AVX512

In a typical build system, it may be enough to add these flags to the CFLAGS, CXXFLAGS, FFLAGS, and LDFLAGS makefile variables. Expect the build to take longer than it would for one target architecture, and expect the resulting binary to be larger.
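As a sketch of how this might look in practice (the package, compiler names, and configure options shown are placeholders, not a tested recipe), an Autotools-based build on the Haswell login node could pass the dual-target flags like this:

knl-login1$ export ARCHFLAGS="-xCORE-AVX2 -axMIC-AVX512"
knl-login1$ ./configure CC=icc CXX=icpc FC=ifort \
                CFLAGS="$ARCHFLAGS" CXXFLAGS="$ARCHFLAGS" \
                FFLAGS="$ARCHFLAGS" LDFLAGS="$ARCHFLAGS"
knl-login1$ make -j 8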

KNL: Running Jobs

In general, the job submission process on the KNL cluster is the same as it is on the Sandy Bridge side: use sbatch to submit a batch job, and idev to begin an interactive session. A typical job script for KNL looks no different from its Sandy Bridge counterpart; below is an example for an MPI job. When submitting KNL jobs, however, you do need to specify explicitly the total number of nodes your job requires. This means including in your script or submission command a value for "-N". This requirement reduces the chance of accidentally assigning more tasks to a node than you actually intend. See KNL: Best Practices below for more information.

Sample KNL Job Script

#!/bin/bash
#SBATCH -J myjob                   # Job name
#SBATCH -o myjob.o%j               # Name of stdout output file
#SBATCH -e myjob.e%j               # Name of stderr error file
#SBATCH -p normal                  # Queue name
#SBATCH -N 4                       # Total number of nodes
#SBATCH -n 32                      # Total number of MPI tasks
#SBATCH -t 01:30:00                # Run time (hh:mm:ss)
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=all            # Send email at begin and end of job
#SBATCH -A myproject               # Allocation name (req'd if you have more than one)

# Other commands must follow all #SBATCH directives...
module reset
module list
pwd
date

# Launch MPI job...
ibrun ./mycode.exe                 # Use ibrun instead of mpirun or mpiexec

As on other TACC systems, the MPI launcher is ibrun. However, the KNL cluster has its own independent Slurm scheduler and queue list. This means that commands like sinfo and squeue issued on the KNL cluster will show only the KNL queues. Each KNL queue reflects a specific configuration of its KNL nodes (memory and cluster mode). See above for a general description of KNL modes, and KNL: Programming and Performance Considerations below for additional detail.

Although the KNL has 68 cores per node, note that the charge for a KNL node-hour is 16 SU (the same charge as a Sandy Bridge node-hour).

Note: hyper-threading is enabled on the KNLs. While there are 68 active cores on each KNL, the operating system and scheduler see a total of 68 x 4 = 272 CPUs (hardware threads). The "showq" utility will report 272 "cores" (hardware threads) for each node associated with your job, regardless of the number of hardware threads you use.

Table 19. KNL Production Queues

Queue           Max Runtime   Max Nodes and Associated Cores per Job   Max Jobs in Queue   Charge per Node Hour   Configuration (Memory-Cluster)
development     2 hrs         4 nodes (272 cores)                      1                   16 SU                  Cache-Quadrant
normal          48 hrs        80 nodes* (5440 cores)                   10                  16 SU                  Cache-Quadrant
Flat-Quadrant   48 hrs        40 nodes* (2720 cores)                   5                   16 SU                  Flat-Quadrant
Flat-All2All    12 hrs        2 nodes* (136 cores)                     1                   16 SU                  Flat-All-to-All
Flat-SNC-4      12 hrs        2 nodes* (136 cores)                     1                   16 SU                  Flat-SNC-4

*To make special arrangements for larger jobs, or for jobs requiring special non-hybrid node configurations, submit a ticket through the TACC User Portal. Include in your request reasonable evidence of your readiness to run under the conditions you are requesting. In most cases this should include strong or weak scaling results summarizing experiments you have run on the KNL cluster.
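Once you have a job script like the sample above (saved here under the placeholder name myjob.slurm), submission and monitoring follow the usual Slurm pattern; idev is TACC's utility for interactive sessions. The job id and option values below are illustrative only.

knl-login1$ sbatch myjob.slurm          # submit the batch script to the KNL scheduler
knl-login1$ squeue -u $USER             # check the status of your queued and running jobs
knl-login1$ scancel 123456              # cancel a job, using the job id reported by squeue
knl-login1$ idev -p development -N 1    # request an interactive session on one development node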

KNL: Visualization

The Stampede KNL cluster uses the KNL processors for all visualization and rendering operations. OpenGL-based graphics are rendered using the Intel OpenSWR (http://openswr.org/) library. This capability is harnessed by loading the "swr" module and prefixing the application command with "swr" (e.g. "swr glxgears"), similar to the "vglrun" syntax. There is no separate visualization queue on Stampede-KNL.

All visualization apps are (or will soon be) available on all nodes. We are in the process of porting visualization application builds to KNL. If an application that you use on Stampede is not yet available, please submit a consulting ticket at the TACC or XSEDE portal. We expect that most users will notice little difference in the visualization application experience on KNL compared to other Stampede nodes. Some users will see a performance improvement due to data caching in MCDRAM.
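For example, a minimal software-rendering session on a KNL compute node (the prompt is illustrative, and glxgears stands in for your own OpenGL application) looks roughly like this:

c123-456$ module load swr     # load the OpenSWR module on the compute node
c123-456$ swr glxgears        # prefix the OpenGL application command with "swr"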

KNL: Programming and Performance Considerations

Architecture

KNL cores are grouped in pairs; each pair of cores occupies a tile. Since there are 68 cores on each node in Stampede's KNL cluster, each node has 34 active tiles. These 34 active tiles are connected by a two-dimensional mesh interconnect. Each KNL has 2 DDR memory controllers on opposite sides of the chip, each with 3 channels. There are 8 controllers for the fast, on-package MCDRAM, two in each quadrant.

Each core has its own local L1 cache (32KB data, 32KB instruction) and two 512-bit vector units. These vector units are almost identical, but only one of them can execute legacy (non-AVX512) vector instructions. This means that, in order to use both vector units, you must compile to generate AVX512 instructions. Each core can run up to 4 hardware threads. The two cores on a tile share a 1MB L2 cache. Different cluster modes specify the L2 cache coherence mechanism at the node level.

Memory Modes

The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The output of commands like "top", "free", and "ps -v" reflects the memory mode in which the processor is actually running: such commands show the amount of RAM actually available to the operating system, not the total memory (DDR + MCDRAM) installed on the processor.

Cache Mode. In this mode, the fast MCDRAM is configured as an L3 cache. The operating system transparently uses the MCDRAM to move data from main memory. In this mode, the user has access to 96GB of RAM, all of it traditional DDR4. The KNL normal and development queues are configured in cache mode.

Flat Mode. In this mode, DDR4 and MCDRAM act as two distinct Non-Uniform Memory Access (NUMA) nodes. It is therefore possible to specify the type of memory (DDR4 or MCDRAM) when allocating memory. In this mode, the user has access to 112GB of RAM: 96GB of traditional DDR4 and 16GB of fast MCDRAM. By default, memory allocations occur in DDR4. To use MCDRAM, use the numactl utility or the memkind library; see Managing Memory below for more information.

Hybrid Mode (not supported on Stampede). In this mode, the MCDRAM is configured so that a portion acts as L3 cache and the rest as RAM (a second NUMA node supplementing DDR4).

Figure 6. KNL Memory Modes
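A quick way to confirm which memory mode a node is actually running in is to inspect its NUMA layout with standard Linux tools (a sketch; exact output varies by node and configuration):

numactl --hardware    # Flat mode: MCDRAM appears as an extra ~16GB NUMA node with no CPUs; Cache mode: it does not appear
free -g               # roughly 96GB total in Cache mode vs roughly 112GB total in Flat mode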

Cluster Modes

The KNL's core-level L1 and tile-level L2 caches can reduce the time it takes for a core to access the data it needs. In order for the cores to share memory safely, however, there must be mechanisms in place to ensure cache coherency. Cache coherency means that all cores have a consistent view of the data: if data value x changes as a result of a calculation on a given core, there must be no risk of other cores using outdated values of x. This, of course, is essential on any multi-core chip, but it is especially difficult to achieve on many-core processors.

The details for KNL are proprietary, but the key idea is this: each tile tracks an assigned range of memory addresses. It does so on behalf of all cores on the chip, maintaining a data structure (tag directory) that tells it which cores are using data from its assigned addresses. Coherence requires both tile-to-tile and tile-to-memory communication. Cores that read or modify data must communicate with the tiles that manage the memory associated with that data. Similarly, when cores need data from main memory, the tile(s) that manage the associated addresses will communicate with the memory controllers on behalf of those cores.

The KNL can do this in several ways, each of which is called a cluster mode. Each cluster mode, specified in the BIOS as a boot-time option, represents a tradeoff between simplicity and control. There are three major cluster modes with a few minor variations:

All-to-All. This is the most flexible and most general mode, intended to work on all possible hardware and memory configurations of the KNL. But this mode may also have higher latencies than other cluster modes, because the processor does not attempt to optimize coherency-related communication paths.

Quadrant (variation: hemisphere). This is Intel's recommended default, and the cluster mode in Stampede's normal and development queues. This mode attempts to localize communication without requiring explicit memory management by the programmer/user. It does this by grouping tiles into four logical/virtual (not physical) quadrants, then requiring each tile to manage MCDRAM addresses only in its own quadrant (and DDR addresses in its own half of the chip). This reduces the average number of "hops" that tile-to-memory requests require compared to all-to-all mode, which can reduce latency and congestion on the mesh.

Sub-NUMA 4 (variation: Sub-NUMA 2). This mode, abbreviated SNC-4, divides the chip into four NUMA nodes so that it acts like a four-socket processor. SNC-4 aims to optimize coherency-related on-chip communication by confining this communication to a single NUMA node when it is possible to do so. To achieve any performance benefit, this mode requires explicit manual memory management by the programmer/user (in particular, allocating memory within the NUMA node that will use it). See "Managing Memory" below for more information.

Figure 7. KNL Cluster Modes

TACC's early experience with the KNL suggests that there is little reason to deviate from Intel's recommended default memory and cluster modes. Cache-Quadrant tends to be a good choice for almost all workflows; it offers a nice compromise between performance and ease of use for the applications we have tested. Flat-Quadrant is the most promising alternative and sometimes offers moderately better performance, especially when memory requirements per node are less than 16GB (see "Managing Memory" below). We have not yet observed significant performance differences across cluster modes, and our current recommendation is that configurations other than Cache-Quadrant and Flat-Quadrant are worth considering only for very specialized needs. We have configured the KNL queues accordingly.

Managing Memory

By design, any application can run in any memory and cluster mode, and applications always have access to all available RAM. Moreover, regardless of memory and cluster mode, no code changes or other manual interventions are required to run your application safely. However, there are times when explicit manual memory management is worth considering to improve performance.

The Linux numactl utility allows you to specify at runtime where your code should allocate memory. For pure MPI or hybrid MPI-threaded codes launched with TACC's ibrun launcher, tacc_affinity manages the details for you by calling numactl under the hood. When running in Flat-Quadrant mode, launch your code with simple numactl settings to specify whether memory allocations occur in DDR or MCDRAM:

numactl --membind=0 ./a.out        # launch a.out (non-MPI); use DDR (default)
ibrun numactl --membind=0 ./a.out  # launch a.out (MPI-based); use DDR (default)
numactl --membind=1 ./a.out        # use only MCDRAM
numactl --preferred=1 ./a.out      # use MCDRAM if possible; else DDR
numactl --hardware                 # show numactl settings
numactl --help                     # list available numactl options

Other settings (e.g. --membind=4,5,6,7) specify fast memory within NUMA nodes when in Flat-SNC-4 mode. Please consult TACC training materials for additional information.

Intel's new memkind library adds the ability to manage memory in source code, with a special memory allocator for C code and a corresponding attribute for Fortran. This makes possible a level of control over memory allocation down to the level of the individual data element. As this library matures, it will likely become an important tool for those whose needs require fine-grained control of memory.
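For source-level control with memkind, the typical pattern in C is to allocate bandwidth-critical arrays with hbw_malloc() from <hbwmalloc.h> and link against the library. The build line below is a hedged sketch: the module name and its availability on Stampede are assumptions, so check "module spider memkind" (or consult TACC staff) before relying on it.

knl-login1$ module spider memkind                                # check whether a memkind module is available (assumed name)
knl-login1$ icc -xMIC-AVX512 -o mycode.exe mycode.c -lmemkind    # link against the memkind library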

Best Known Practices and Preliminary Observations

It may not be a good idea to use all 272 hardware threads simultaneously, and it's certainly not the first thing you should try. In most cases it's best to specify no more than 64-68 MPI tasks or independent processes per node, and 1-2 threads per core. One exception is worth noting: when calling threaded MKL from a serial code, it's safe to set OMP_NUM_THREADS or MKL_NUM_THREADS to 272, because MKL will choose an appropriate thread count less than or equal to the value you specify. See Controlling Threading in MKL above for more information.

When measuring KNL performance against traditional processors, compare node-to-node rather than core-to-core. KNL cores run at lower frequencies than traditional multicore processors. Thus, for a fixed number of MPI tasks and threads, a given simulation may run 2-3x slower on KNL than the same submission on Sandy Bridge. A well-designed parallel application, however, should be able to run more tasks and/or threads on a KNL node than is possible on Sandy Bridge. If so, it may exhibit better performance per KNL node than it does on Sandy Bridge.

General Expectations. From a pure hardware perspective, a single Stampede KNL node could improve performance by as much as 6x compared to Stampede's dual-socket Sandy Bridge nodes; this is true for both memory bandwidth-bound and compute-bound codes. This assumes the code is running out of (fast) MCDRAM in the queues configured in flat mode (450 GB/s bandwidth vs 75 GB/s on Sandy Bridge), or using cache-contained workloads in the queues configured in cache mode (memory footprint < 16GB). It also assumes perfect scalability and no latency issues. In practice we have observed application improvements between 1.3x and 5x for several HPC workloads typically run on TACC systems. Codes with poor vectorization or scalability could see much smaller improvements. In terms of network performance, the OmniPath network provides 100 Gbit/s peak bandwidth, with point-to-point exchange performance measured at over 11 GB/s for a single task pair across nodes. Latency values will be higher than those for the Sandy Bridge FDR Infiniband network: on the order of 2-4 microseconds for exchanges across nodes.

Affinity. In Cache-Quadrant mode (normal and development queues), default affinity settings are usually sensible and often optimal for threaded codes as well as MPI-threaded hybrid applications. In other modes, use tacc_affinity (for MPI codes) or investigate manual affinity settings.
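Putting these observations together, a hedged sketch of a hybrid MPI/OpenMP launch on KNL (the task and thread counts are illustrative, not prescriptive) might look like the following lines in a job script that requests 4 nodes and 16 tasks:

# 4 tasks per node x 17 OpenMP threads per task = 68 threads, one per physical core
export OMP_NUM_THREADS=17
ibrun tacc_affinity ./mycode.exe    # tacc_affinity pins tasks and applies sensible numactl settings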

References

The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.

Bash Users' Startup Files: Quick Start Guide
Globus
Globus URL Copy
Intel Compilers
Intel Math Kernel Library Documentation
Intel Product Search
Lmod: Environmental Modules Systems
NVIDIA
OpenSSH
rsync
Slurm
Stampede Archive: Knights Corner Technical Material
TACC Training Courses

KNL References

TACC Technical Report: KNL Utilization Guidelines
Intel Developer Zone
Intel Compiler Options for Intel SSE and Intel AVX generation and processor-specific optimizations
Intel Xeon Phi Processor High Performance Programming (book)
TACC staff contributed many presentations at the latest Intel Xeon Phi Users Group (IXPUG) 2016 conference.
KNL Memory Configuration
KNL Memory Configuration Lab
KNL Hardware
KNL Introduction Lab
KNL MPI Hybrid
KNL Hybrid Execution Lab
OpenMP Affinity Hybrid Lab
OpenMP Affinity Lab

Policies

TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation, and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities. The Usage Policies are documented at http://www.tacc.utexas.edu/user-services/usage-policies.

Help

Help is available 24/7. Please submit a helpdesk ticket via the TACC User Portal.

Revision History

The "Last Update" date is the date of the most recent change to this document. This revision history is a list of non-trivial updates; it excludes routine items such as corrected typos and minor format changes.
