Dell Presentation Template Standard 4:3 Layout - HPC Advisory Council [PDF]



Dell HPC
Dr. Jeffrey Layton ([email protected])
Enterprise Technologist - HPC

GPU Computing at Dell

GPU Computing Approach
• Hardware changes rapidly
  – New CPUs
  – New GPUs
  – New interconnects
  – New software
• All of these happen at different rates and at different times
• GPU applications are evolving very rapidly
• How do you adapt to these changes? How do you protect your investment? How do you adapt to new and evolving applications?
• Be flexible

Great example of flexibility
• From initial development to “final” code version, performance improves by a factor of 9!
• Software changes during development result in hardware changes

Implementation
• Develop on something smaller such as a laptop or workstation
• Deploy production applications onto a cluster
• For cluster deployments:
  – Move GPUs to external PCIe chassis
    › Allows CPUs and GPUs to be changed independently
    › Allows network to be changed independently
    › Optimize power and cooling for GPUs and CPUs separately
• Add GPUs to host nodes as applications evolve
  – It may be 1 GPU today and 8 GPUs tomorrow

Dell C410x
• 3U PCIe chassis
  – 16 slots (10 in front, 6 in back), all x16
  – 8 PCIe connections to host nodes (1-8 slots per connection)
• Redundant power supplies (4x 1400W)
• BMC (IPMI 2.0) on-board

Host nodes:
• C6100:
  – 4-in-2U
  – 2S Intel with IB mezz card (x8)
  – PCIe x16 HIC card
  – Redundant power
• C6145:
  – 2x 4S AMD boards in 2U
  – (4) x16 slots
    › 3 are open
    › 1 has iPASS connector
  – IB mezz card (x8)
  – Redundant power

Host/GPU combinations
• Many combinations are possible
  – Intel or AMD?
  – How many GPUs per node?
  – How many lanes per GPU?

Internal vs. External: NAMD
[Figure: bar chart, NAMD STMV benchmark. Y-axis: Steps/Second (0 to 1.2). Legend: SuperMicro (2), C410x / C6100 (2). Bar labels: 0.95, 0.82.]

Internal vs. External: CUDASW++
[Figure: CUDASW++ performance. Y-axis: GFLOPS (0 to 30). X-axis: Query Length. Legend: C410x / C6100 (2), SuperMicro (2).]

Scalability: NAMD
[Figure: bar chart, NAMD STMV benchmark. Y-axis: Steps/Second (0 to 1.6). Legend: CPU, C410x / C6100 (1), C410x / C6100 (2), C410x / C6100 (4), SuperMicro (2). Bar labels: 1.52, 0.84, 0.47, 0.95, 0.10.]
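
The scalability chart can be turned into speedup and parallel-efficiency figures. The following is a minimal sketch, not part of the original slides. It assumes the bar labels pair with the legend so that 0.47, 0.84 and 1.52 steps/second correspond to 1, 2 and 4 GPUs on the C410x / C6100 configuration; the extracted text does not state this pairing explicitly, and the function name `scaling_table` is hypothetical.

```python
# Steps/second read off the "Scalability: NAMD" chart (STMV benchmark).
# ASSUMPTION: bar labels are paired with the legend entries by magnitude;
# the extracted slide text does not state the pairing explicitly.
steps_per_second = {1: 0.47, 2: 0.84, 4: 1.52}  # GPUs -> steps/s (C410x / C6100)


def scaling_table(results, baseline_gpus=1):
    """Return (gpus, speedup, efficiency) tuples relative to the baseline run."""
    base = results[baseline_gpus]
    rows = []
    for gpus in sorted(results):
        speedup = results[gpus] / base
        # Efficiency is the achieved speedup divided by the ideal speedup.
        efficiency = speedup / (gpus / baseline_gpus)
        rows.append((gpus, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for gpus, speedup, eff in scaling_table(steps_per_second):
        print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Under that assumed pairing, two GPUs give roughly a 1.79x speedup and four GPUs roughly 3.23x, or about 81% parallel efficiency at four GPUs.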


Impact of CUDA versions
• Heisenberg Spin Glass (HSG) Model
  – Spin Glass modeling is a technique used in statistical mechanics to simulate and predict the behavior of various physical phenomena
• HSG is multi-GPU capable using MPI
  – Recent upgrade to CUDA 4.0
• Two code versions:
  – MPI based
    › GPUs communicate by sending data to the host, then to the appropriate GPU
  – CUDA 4.0
    › GPUs communicate directly (no host)
• Compare performance

HSG results
• CUDA 4.0 (GPU Direct) is 15-30% faster than MPI
• For Intel systems, GPU Direct requires all GPUs to be connected to the same IOH
• C410x allows you to expand to multiple GPUs per single IOH

Data Management and Storage


Realities
• HPC storage is about 15-25% of the cost of a system but about 90% of the problems
• HPC storage is about solutions, not just hardware
  – Hardware, file system, client, management/monitoring, documentation, best practices, sizing and performance guidance, services and support
• No one, two, or even three file systems/solutions satisfy the various requirements
  – Recent IDC study: 25 customers = 13 file systems
• Applications/processes drive solutions (just like compute), but
  – Very few customers understand the IO characteristics of their apps
• Access frequency requirements don’t match the underlying storage platform
  – A very large percentage of data is never touched again approximately 2-4 weeks after it is created

HPC Storage Solutions Aren’t Easy
• Ignoring cost, name the top 3 storage attributes
  1. Performance
  2. Reliability
  3. Capacity
• Difficult or impossible to get all 3 attributes in a single solution with HPC price constraints
• Can we get all 3 attributes in different solutions and integrate them?
  – Maintains attributes, improves flexibility, and increases options

Flexibility, Adaptability, and Options
• The performance importance of data changes over the life of the data
  – At first, performance is very important
  – After a period of time, the performance is less important
• Why keep data that isn’t being used on high-performance storage?
• Based on applications and performance importance there are three basic categories of data requirements:
  1. Fast Scratch
     • Performance, performance, performance
  2. Primary (/home)
     • Reliability
  3. Long-term
     • Capacity (very little performance)
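
One way to act on the three categories is an age-based placement policy. The sketch below is illustrative only and is not from the slides: the function name `suggest_tier` and both thresholds are hypothetical, loosely motivated by the observation that most data goes untouched roughly 2-4 weeks after creation. Real placement would also depend on the application and on reliability needs, not on idle time alone.

```python
from datetime import datetime, timedelta

# HYPOTHETICAL thresholds: the slides only say that most data is untouched
# roughly 2-4 weeks after creation; the exact cut-offs are a site policy choice.
SCRATCH_MAX_IDLE = timedelta(weeks=4)
PRIMARY_MAX_IDLE = timedelta(days=365)


def suggest_tier(last_access, now):
    """Suggest a storage category from how long the data has been idle."""
    idle = now - last_access
    if idle <= SCRATCH_MAX_IDLE:
        return "fast-scratch"   # performance matters most
    if idle <= PRIMARY_MAX_IDLE:
        return "primary"        # reliability matters most
    return "long-term"          # capacity matters most


if __name__ == "__main__":
    now = datetime.now()
    for days in (3, 90, 800):
        print(days, "days idle ->", suggest_tier(now - timedelta(days=days), now))
```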


Dell’s approach to deliver HPC storage solutions
• Dell is delivering solutions using two approaches:
  – Complete solutions: fully vetted, tested, supported
    › Come with end-to-end support from Dell and partners
    › Detailed documentation including best practices, performance and sizing guidance
    › Deployment services if necessary
  – Roll-it-your-own
    › Dell creates technical whitepapers containing:
      – Recommended configurations
      – Details on configuration
      – Best practices and sizing guidance
    › Customer buys hardware and uses whitepapers as a reference guide
    › Full Dell warranty and support on Dell components
      – Limited or no deployment services; no solution type services
• Over time, deliver building blocks that will integrate into the larger storage ecosystem

Fast Scratch Storage
• Requirements:
  – Very fast (above 1.4 GB/s), more than NFS
  – Scalability in performance and capacity
  – Cost effective
  – Reliability is not necessarily a primary requirement
• Roll-Your-Own reference configurations and supporting data
  Cambridge University Developed Lustre Reference Configuration
  – Detailed whitepaper discussing architecture and performance analysis of the Lustre solution deployed at University of Cambridge
  – The deployment steps and best practices listed in the paper can be used to architect similar Lustre solutions using Dell server and storage products
  – Work is in progress to develop a reference architecture using latest generation Dell PowerEdge servers and PowerVault storage
• Complete Dell HPC Fast Scratch Solutions
  Dell | Terascala High Performance Computing Storage Solution (DT-HSS)
  – Third generation Lustre solution from Dell and Terascala referred to as DT-HSS3
  – Utilizes Dell’s latest generation 6Gb/s SAS based PowerVault MD series storage

The Dell | Terascala HPC Storage Solution (DT-HSS3)
• Unique scale out storage appliance for throughput intensive applications
• Fully supported storage appliance that leverages Lustre, the industry’s leading open-source parallel file system
• Simple, linear scalability
  – Up to 6.2 GB/s of read and 4.2 GB/s of write throughput per base object pair. Scale aggregate performance by adding object pairs.
  – 48TB to petabytes in a single name space
  – Pre-defined configurations from 48TB to 336TB in a single rack (building blocks)
  – Configurations serve as building blocks for larger and faster solutions
• Rich management including hardware and file system monitoring
  – Automated Install & Maintenance, Health Monitoring, Failover Solution, Root Cause Analysis
[Diagram labels: Metadata Storage Server (MDS) Pair; Object Storage Server (OSS) Pair]
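
The linear-scalability claim lends itself to a quick sizing estimate. This is a rough sketch under the assumption of ideal linear scaling from the quoted "up to" per-pair figures; the helper names `aggregate_throughput` and `pairs_needed` are hypothetical, and real deployments would likely see less than the ideal.

```python
import math

# Per-object-pair figures quoted on the slide ("up to" values, in GB/s).
READ_GBPS_PER_PAIR = 6.2
WRITE_GBPS_PER_PAIR = 4.2


def aggregate_throughput(object_pairs):
    """Ideal aggregate throughput, ASSUMING perfectly linear scaling."""
    if object_pairs < 1:
        raise ValueError("at least one object pair is required")
    return {
        "read_gbps": object_pairs * READ_GBPS_PER_PAIR,
        "write_gbps": object_pairs * WRITE_GBPS_PER_PAIR,
    }


def pairs_needed(target_gbps, per_pair_gbps):
    """Smallest number of object pairs that meets a throughput target."""
    return max(1, math.ceil(target_gbps / per_pair_gbps))


if __name__ == "__main__":
    print(aggregate_throughput(3))
    print(pairs_needed(10.0, WRITE_GBPS_PER_PAIR))
```

For example, a 10 GB/s write target would need three object pairs under this idealized model.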


Primary Storage
• Requirements:
  – Performance is usually not a big deal
  – Reliability is important
  – Ease of use is important
• Typical usage for home directories, user data, application data and results
• NFS is a widely used protocol for such use cases
• Roll-Your-Own reference configurations and supporting data:
  – Dell PowerVault MD1200 as a Network File System Backend Storage Solution
  – Optimizing Dell PowerVault MD1200 Storage Arrays for High Performance Computing (HPC) Deployments
• Complete Dell HPC NFS Storage Solutions
  – Dell HPC NFS Storage Solution (NSS)
    › Leverages Dell PowerEdge and PowerVault storage
    › 24-96TB (raw storage) in a single namespace using Red Hat XFS file system
    › Dell developed tuning and best practices

The Dell HPC NFS Storage Solution
• Takes the guesswork out of NFS configurations
  – Appliance approach to inexpensive NFS solutions
• Range of capacity:
  – Up to 96TB in a single namespace
• HA configuration options
• Good performance
  – NFS performance of up to 1.47 GB/s for writes and 2.4 GB/s for reads
  – 6Gbps SAS, optional IB or 10GigE
  – Tuned storage and file system configurations
• Cost effective
• Reliable and supported
  – Proven hardware
  – 3 years support with Dell including XFS support
  – Redundant power supplies, connections, plus drive spares kit
• Easy to install
  – Dell configuration and deployment: Whitepaper and Dell PS
  – Affordable installation services available
[Diagram labels: NFS Gateway; Storage – MD1200; … Expansion MD1200s]

Benefits of Dell NSS
• Performance tuned NFS server
  – Best possible performance
  – No need to experiment with tuning options: already tuned
[Figure: NFS throughput, tuned vs. not tuned. Y-axis: Throughput KB/s (0 to 1,400,000). X-axis: Clients (2, 4, 8, 12, 16, 24, 32). Annotation: 30%.]

NSS Options

Common Aspects
• NFS Gateway
  – Dell Server (R710)
  – RAID-1 for OS (plus 1 hot-spare)
  – RAID-0 for additional swap space
  – 3 years of support on OS, file system, hardware
  – Cold spares (disks)
  – IB, 10GigE options
  – RHEL 5.5 OS
  – Red Hat Scalable File System (XFS)
  – Dell ProSupport

NSS
• Single NFS Gateway
  – PERC H800 RAID card(s) in NFS gateway
    › Dell MD1200 JBODs connected to RAID cards
  – RAID-60 or RAID-60+LVM

NSS-HA
• Two Active-Passive NFS Gateways
  – Dell MD3200 RBOD contains RAID card
  – Dell MD1200 JBODs are connected to RBOD
  – RAID-6 + LVM

NSS Large Solution: 96TB
[Diagram label: QDR IB or 10GigE]

Summary
Raw capacity: 96TB
Formatted capacity: ~80TB
RAID-60 and LVM
RAID-6 within each MD1200
RAID-0 across MD1200 pairs
LVM to combine LUNs

10GigE NFS Performance
Peak Sequential Read: 850 MB/s
Peak Sequential Write: 1,180 MB/s

InfiniBand NFS Performance
Peak Sequential Read: 1,350 MB/s
Peak Sequential Write: 1,470 MB/s
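
The gap between raw and formatted capacity follows from the RAID-6 parity overhead. Here is a back-of-the-envelope sketch; the 12 drives per MD1200 and the 2 TB drive size are assumptions not stated on the slide, chosen because four such enclosures reproduce the quoted 96TB raw figure. File system overhead and hot spares are ignored, and the function name `raid60_capacity_tb` is hypothetical.

```python
# ASSUMPTIONS (not on the slide): 12 drive bays per MD1200 and 2 TB drives.
DRIVES_PER_ENCLOSURE = 12
DRIVE_TB = 2
RAID6_PARITY_DRIVES = 2  # RAID-6 stores two drives' worth of parity per group


def raid60_capacity_tb(enclosures):
    """Return (raw_tb, usable_tb) with one RAID-6 group per enclosure.

    Striping (RAID-0) across enclosures and combining LUNs with LVM
    do not cost additional capacity in this simplified model.
    """
    raw = enclosures * DRIVES_PER_ENCLOSURE * DRIVE_TB
    data_drives = DRIVES_PER_ENCLOSURE - RAID6_PARITY_DRIVES
    usable = enclosures * data_drives * DRIVE_TB
    return raw, usable


if __name__ == "__main__":
    print(raid60_capacity_tb(4))
```

With four enclosures this model gives 96 TB raw and 80 TB usable, in line with the ~80TB formatted capacity on the slide.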


NSS-HA: Large
[Diagram labels: Dell R710 NSS-HA Server (two shown); PowerVault MD3200; PowerVault MD1200. Connections: GigE; power cords; IB or 10GigE; SAS (6Gbps).]

Summary
Raw capacity: 96TB
Formatted capacity: ~80TB
RAID-6 and LVM
RAID-6 within each MD3200/1200
LVM to combine LUNs

10GigE NFS Performance
Peak Sequential Read: 560 MB/s
Peak Sequential Write: 1,130 MB/s

InfiniBand NFS Performance
Peak Sequential Read: 2,430 MB/s
Peak Sequential Write: 1,274 MB/s

Summary
• Two most recent trends:
• GPU Computing
  – GPU Computing is still evolving
    › Hardware (CPUs, GPUs, Interconnect), and software (CUDA)
  – Best course of action is to remain flexible
  – Ability to upgrade CPUs or GPUs or software independently of each other
  – External PCIe chassis affords flexibility
    › Good host nodes
• Data Management and Storage
  – Overall it’s the largest problem for users today
  – Focus on performance (fast-scratch), reliability (primary), and capacity (long-term)
    › Develop a product for each piece and integrate them together
  – Roll-it-your-own and fully supported solutions are available
  – Tools for data management are becoming highly critical

Thanks!
