Custom hardware for training [[custom-hardware-for-training]]

๋ชจ๋ธ ํ›ˆ๋ จ๊ณผ ์ถ”๋ก ์— ์‚ฌ์šฉํ•˜๋Š” ํ•˜๋“œ์›จ์–ด๋Š” ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. GPU์— ๋Œ€ํ•ด ์ž์„ธํžˆ ์•Œ์•„๋ณด๋ ค๋ฉด, Tim Dettmer์˜ ํ›Œ๋ฅญํ•œ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ๋ฅผ ํ™•์ธํ•ด๋ณด์„ธ์š”. ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŠธ ๋งํฌ (์˜์–ด๋กœ ์ž‘์„ฑ๋จ).

Let's look at some practical advice for GPU setups.

GPU [[gpu]]

๋” ํฐ ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ฌ ๋•Œ๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ ์„ธ ๊ฐ€์ง€ ์˜ต์…˜์ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋” ํฐ GPU
  • ๋” ๋งŽ์€ GPU
  • ๋” ๋งŽ์€ CPU ๋ฐ NVMe (DeepSpeed-Infinity๋ฅผ ํ†ตํ•œ ์˜คํ”„๋กœ๋“œ(offload))

์šฐ์„ , ํ•˜๋‚˜์˜ GPU๋งŒ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด๋ด…์‹œ๋‹ค.

Power and Cooling [[power-and-cooling]]

If you bought an expensive high-end GPU, make sure you give it the correct power supply and sufficient cooling.

Power:

Some high-end consumer GPUs have 2, and sometimes 3, PCI-E 8-pin power sockets. Make sure you have as many independent 12V PCI-E 8-pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as a pigtail cable). That is, if the GPU has 2 sockets, you want 2 separate PCI-E 8-pin cables going from the PSU (power supply unit) to the card, not a single cable with 2 PCI-E 8-pin connectors at the end. Otherwise you may not get the full performance out of the card.

๊ฐ๊ฐ์˜ PCI-E 8ํ•€ ์ „์› ์ผ€์ด๋ธ”์€ PSU ์ชฝ์˜ 12V ๋ ˆ์ผ์— ์—ฐ๊ฒฐ๋˜์–ด์•ผ ํ•˜๋ฉฐ ์ตœ๋Œ€ 150W์˜ ์ „๋ ฅ์„ ๊ณต๊ธ‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Some other GPUs use PCI-E 12-pin connectors, which can deliver up to 500W-600W of power.

Low-end GPUs use 6-pin connectors, which supply up to 75W of power.

You should also choose a high-end PSU so that the GPU receives a stable voltage. Some lower-quality PSUs may not reliably supply the voltage the GPU needs to run at its peak.

And of course, the PSU needs to have enough unused watts left over to power the GPU.
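The connector limits above can be turned into a quick sanity check. The following is a minimal sketch, not a sizing tool: the function names and the example wattages (an 850W PSU, a 300W load from other components, a 350W GPU) are hypothetical, and only the per-connector limits come from the text. It counts cable power only.

```python
# Per-connector limits quoted in the text above (watts).
CONNECTOR_WATTS = {"6-pin": 75, "8-pin": 150}


def cable_power_budget(connectors):
    """Return the maximum power (W) the listed PCI-E power cables can supply."""
    return sum(CONNECTOR_WATTS[c] for c in connectors)


def psu_has_headroom(psu_watts, other_load_watts, gpu_watts):
    """Return True if the PSU still has enough unused watts to power the GPU."""
    return psu_watts - other_load_watts >= gpu_watts


# Hypothetical example: a card with two 8-pin sockets, each on its own cable.
print(cable_power_budget(["8-pin", "8-pin"]))  # 300

# Hypothetical example: 850W PSU, 300W used by other components, 350W GPU.
print(psu_has_headroom(psu_watts=850, other_load_watts=300, gpu_watts=350))  # True
```

If the cable budget comes out lower than the card's rated power draw, that suggests a cable or connector is missing.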

๋ƒ‰๊ฐ:

GPU๊ฐ€ ๊ณผ์—ด๋˜๋ฉด ์„ฑ๋Šฅ์ด ์ €ํ•˜๋˜๊ณ  ์ตœ๋Œ€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋„ˆ๋ฌด ๋œจ๊ฑฐ์›Œ์ง€๋ฉด ์ค‘์ง€๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU๊ฐ€ ๊ณผ์—ด๋  ๋•Œ ์ •ํ™•ํ•œ ์ ์ • ์˜จ๋„๋ฅผ ์•Œ๊ธฐ ์–ด๋ ค์šฐ๋‚˜, ์•„๋งˆ๋„ +80โ„ƒ ๋ฏธ๋งŒ์ด๋ฉด ์ข‹์ง€๋งŒ ๋” ๋‚ฎ์„์ˆ˜๋ก ์ข‹์Šต๋‹ˆ๋‹ค. 70โ„ƒ-75โ„ƒ ์ •๋„๊ฐ€ ํ›Œ๋ฅญํ•œ ์˜จ๋„ ๋ฒ”์œ„์ž…๋‹ˆ๋‹ค. ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‹œ์ž‘ํ•˜๋Š” ์˜จ๋„๋Š” ๋Œ€๋žต 84โ„ƒ-90โ„ƒ ์ •๋„์ผ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์„ฑ๋Šฅ ์ €ํ•˜ ์ด์™ธ์—๋„ ์ง€์†์ ์œผ๋กœ ๋งค์šฐ ๋†’์€ ์˜จ๋„๋Š” GPU ์ˆ˜๋ช…์„ ๋‹จ์ถ•์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Next, let's look at one of the most important aspects of using multiple GPUs: how the GPUs are connected to each other.

Multi-GPU Connectivity [[multigpu-connectivity]]

If you use multiple GPUs, the way they are interconnected can have a big impact on total training time. If the GPUs are on the same physical node, you can check how they are connected by running:

nvidia-smi topo -m

On a dual-GPU machine where the GPUs are connected with NVLink, you will see output like this:

        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            N/A
GPU1    NV2      X      0-23            N/A

NVLink๋ฅผ ์ง€์›ํ•˜์ง€ ์•Š๋Š” ๋‹ค๋ฅธ ํ™˜๊ฒฝ์˜ ๊ฒฝ์šฐ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-11            N/A
GPU1    PHB      X      0-11            N/A

์ด ๊ฒฐ๊ณผ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฒ”๋ก€๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

๋”ฐ๋ผ์„œ ์ฒซ ๋ฒˆ์งธ ๊ฒฐ๊ณผ์˜ NV2๋Š” GPU๊ฐ€ 2๊ฐœ์˜ NVLink๋กœ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋‚ด๊ณ , ๋‘ ๋ฒˆ์งธ ๊ฒฐ๊ณผ์˜ PHB๋Š” ์ผ๋ฐ˜์ ์ธ ์†Œ๋น„์ž์šฉ PCIe+๋ธŒ๋ฆฟ์ง€ ์„ค์ •์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

์„ค์ •์—์„œ ์–ด๋–ค ์œ ํ˜•์˜ ์—ฐ๊ฒฐ ๋ฐฉ์‹์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”. ์ผ๋ถ€ ์—ฐ๊ฒฐ ๋ฐฉ์‹์€ GPU ๊ฐ„ ํ†ต์‹ ์„ ๋” ๋น ๋ฅด๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ์œผ๋ฉฐ(NVLink์™€ ๊ฐ™์ด), ์–ด๋–ค ์—ฐ๊ฒฐ ๋ฐฉ์‹์€ ๋” ๋Š๋ฆฌ๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(PHB์™€ ๊ฐ™์ด).

Depending on the scalability solution you use, connectivity speed can have a major or a minor impact. When GPUs rarely need to sync, as in DDP, a slower connection matters less. When GPUs need to communicate with each other often, as in ZeRO-DP, faster connectivity becomes important for faster training.

NVLink [[nvlink]]

NVLink๋Š” Nvidia์—์„œ ๊ฐœ๋ฐœํ•œ ์œ ์„  ๊ธฐ๋ฐ˜์˜ ์ง๋ ฌ ๋‹ค์ค‘ ๋ ˆ์ธ ๊ทผ๊ฑฐ๋ฆฌ ํ†ต์‹  ๋งํฌ์ž…๋‹ˆ๋‹ค.

์ƒˆ๋กœ์šด ์„ธ๋Œ€์˜ NVLink๋Š” ๋” ๋น ๋ฅธ ๋Œ€์—ญํญ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. Nvidia Ampere GA102 GPU Architecture์—์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Third-Generation NVLink® GA102 GPUs utilize NVIDIA's third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)

๋”ฐ๋ผ์„œ nvidia-smi topo -m์˜ ๊ฒฐ๊ณผ์—์„œ NVX์˜ ๊ฐ’์ด ๋†’์„์ˆ˜๋ก ๋” ์ข‹์Šต๋‹ˆ๋‹ค. ์„ธ๋Œ€๋Š” GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Now let's see how NVLink affects training by comparing runs that train gpt2 on a small sample of wikitext.

๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

NVLink  Time
Y       101s
N       131s

With NVLink, training completes about 23% faster. In the second benchmark, NCCL_P2P_DISABLE=1 tells the GPUs not to use NVLink.

์ „์ฒด ๋ฒค์น˜๋งˆํฌ ์ฝ”๋“œ์™€ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}

Hardware: 2x TITAN RTX 24GB each + NVLink with 2 NVLinks (NV2 in nvidia-smi topo -m)
Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0
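The two result dictionaries reported by the benchmark runs can be compared directly. The sketch below copies the numbers from those outputs. The helper names are hypothetical. It reports both the reduction in wall-clock time and the gain in throughput, which are different ways of expressing the same speedup.

```python
# Values copied from the two benchmark outputs above.
with_nvlink = {"train_runtime": 101.9003, "train_samples_per_second": 1.963}
without_nvlink = {"train_runtime": 131.4367, "train_samples_per_second": 1.522}


def time_saved(fast_runtime, slow_runtime):
    """Fraction of wall-clock time saved by the faster run."""
    return 1 - fast_runtime / slow_runtime


def throughput_gain(fast_rate, slow_rate):
    """Relative increase in samples per second for the faster run."""
    return fast_rate / slow_rate - 1


saved = time_saved(with_nvlink["train_runtime"], without_nvlink["train_runtime"])
gain = throughput_gain(
    with_nvlink["train_samples_per_second"],
    without_nvlink["train_samples_per_second"],
)
print(f"time saved: {saved:.1%}, throughput gain: {gain:.1%}")
```

The time saved lands between 22% and 23%, in line with the rounded figure quoted for the table. The throughput gain is larger because it is measured relative to the slower run.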