Introduction to the vLLM v1 Engine 🧠

An overview of the vLLM architecture

❓ What is the v1 Engine?

  • The vLLM v1 engine is a new core architecture that replaces the earlier v0 engine. The scheduler, KV cache manager, workers, sampler, and API server have all been redesigned. Its defining trait is a modular design that reduces the code complexity of v0 and improves maintainability.
  • v1 minimizes CPU overhead and handles decode and prefill efficiently inside a single scheduler, which improves overall throughput and stability.
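To make the "single scheduler" idea concrete, here is a minimal toy sketch of one scheduling step that serves decode and prefill work from a single shared token budget. It is not vLLM's actual scheduler: the `Request` class, the `schedule_step` function, and the `token_budget` parameter are hypothetical names chosen for illustration, and the policy (decodes first, then prefill chunks) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Toy request (hypothetical): prompt tokens to prefill, then tokens to decode."""
    name: str
    prompt_left: int   # prompt tokens not yet processed (prefill work)
    decode_left: int   # output tokens still to generate


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Return {request name: tokens scheduled this step} under one shared budget."""
    plan: dict[str, int] = {}
    # Requests that are already decoding go first: one token each per step.
    for req in requests:
        if token_budget == 0:
            break
        if req.prompt_left == 0 and req.decode_left > 0:
            plan[req.name] = 1
            req.decode_left -= 1
            token_budget -= 1
    # The remaining budget goes to prefill, splitting long prompts into chunks.
    for req in requests:
        if token_budget == 0:
            break
        if req.prompt_left > 0:
            chunk = min(req.prompt_left, token_budget)
            plan[req.name] = chunk
            req.prompt_left -= chunk
            token_budget -= chunk
    return plan


reqs = [
    Request("a", prompt_left=0, decode_left=3),   # already decoding
    Request("b", prompt_left=10, decode_left=2),  # long prompt, still in prefill
]
print(schedule_step(reqs, token_budget=8))  # {'a': 1, 'b': 7}
print(schedule_step(reqs, token_budget=8))  # {'a': 1, 'b': 3}
```

In this toy model, request `b` never blocks request `a`: its 10-token prompt is processed in chunks of 7 and 3 while `a` keeps producing one token per step.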

⚖️ v0 vs v1 comparison

| Item | v0 Engine | v1 Engine (currently recommended) |
| --- | --- | --- |
| Architecture | Complex, separately built components | Unified single scheduler and modular structure (docs.vllm.ai, vLLM Blog) |
| Performance | Stable, but CPU overhead shows up at scale | Up to 1.7x higher throughput, with consistently low latency at high QPS (vLLM Blog, Red Hat Developer) |
| Prefill method | Prefill of the input completes before decoding | Chunked prefill lets prompt processing and decoding run in the same step → lower TTFT (kaitchup.substack.com, rocm.blogs.amd.com) |
| Prefix caching | Limited use | Prefix caching is enabled by default and optimized, including for VLMs (vLLM Blog, rocm.blogs.amd.com) |
| Supported models | Broad model support | Decoder-only, MoE, and some VLMs are supported; embedding, encoder-decoder, and Mamba-family models have only limited support so far (vLLM Blog, docs.vllm.ai) |
| Supported features | Wide range, including LoRA, speculative decoding, and structured output | Some features are still missing or deprecated (pipeline parallelism, best_of, structured decoding, etc.) (docs.vllm.ai) |
| Hardware compatibility | Supports a variety of GPUs and TPUs | Currently only NVIDIA GPUs of the Ampere generation or newer are officially supported; TPU, AMD, and others are WIP (docs.vllm.ai, rocm.blogs.amd.com) |

🚀 Why was v1 built?

1. Reducing code complexity and securing extensibility

As the v0 engine grew, features were added independently of one another. System complexity rose, and both maintenance and adding new features became difficult. v1 simplifies this with a more structured design, which improves developer productivity (docs.vllm.ai, GitHub).

2. Performance optimization

  • v1 is designed to reduce CPU overhead and to use chunked prefill and FlashAttention 3 so that it sustains up to 1.7x higher throughput even at high QPS.
  • The gains are even more pronounced for VLM (vision-language model) workloads, where prefix caching and multimodal scheduling have been optimized.
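The prefix caching mentioned above can be pictured as hashing the prompt block by block, where each block's hash also covers everything before it. The sketch below is a toy model under that assumption, not vLLM's implementation: `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical names, and a real engine would store KV tensors rather than just a set of hashes.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; an illustrative value, not a vLLM default


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with the hash of all preceding blocks."""
    hashes: list[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache (hypothetical) that only remembers which block hashes it has seen."""

    def __init__(self) -> None:
        self.cached: set[str] = set()

    def cached_tokens(self, tokens: list[int]) -> int:
        """Count the leading prompt tokens whose blocks are already cached."""
        hits = 0
        for h in block_hashes(tokens):
            if h not in self.cached:
                break
            hits += 1
        return hits * BLOCK_SIZE

    def insert(self, tokens: list[int]) -> None:
        self.cached.update(block_hashes(tokens))


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
cache.insert(system_prompt + [9, 10, 11, 12])
# A second request sharing the same system prompt reuses its two blocks.
print(cache.cached_tokens(system_prompt + [20, 21, 22, 23]))  # 8
```

Because each hash chains in its parent, two prompts only share a cached block when everything before that block is identical too, which is why the shared system prompt is reused but the differing tail is not.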

3. A foundation for new features

With a single-scheduler design, a consistent logic flow, and a reworked cache architecture, v1 provides a structure that makes it easier to introduce new features such as speculative decoding, structured output, and logits processors.

✅ Usage summary

  1. Set the environment variable VLLM_USE_V1=1.
  2. Set the multiprocessing method, for example VLLM_WORKER_MULTIPROC_METHOD=spawn.
  3. The v1 engine is then activated automatically when you use the Python API or run vllm serve <model name> (falling back to v0 remains possible for backward compatibility).
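The steps above can be sketched as a small launcher script. This is a hedged illustration rather than an official recipe: the helper names `build_vllm_env`, `build_serve_command`, and `launch` are hypothetical, and `facebook/opt-125m` is only a placeholder model name. The helpers build the environment and command without starting anything, so the script runs even where vLLM is not installed; only `launch` needs the `vllm` CLI.

```python
import os
import shlex
import subprocess
from typing import Optional


def build_vllm_env(base: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Copy the environment and add the two variables from steps 1 and 2."""
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_V1"] = "1"
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    return env


def build_serve_command(model: str) -> list[str]:
    """Argument list for `vllm serve <model>` (step 3)."""
    return ["vllm", "serve", model]


def launch(model: str) -> subprocess.Popen:
    """Start the server as a child process; requires the vllm CLI on PATH."""
    return subprocess.Popen(build_serve_command(model), env=build_vllm_env())


# Print the equivalent shell command instead of launching a server.
cmd = build_serve_command("facebook/opt-125m")  # placeholder model name
print("VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn " + shlex.join(cmd))
```

Passing the variables through `env=` keeps them scoped to the server process instead of changing the environment of the calling shell or script.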

⚠️ Current limitations and future plans

  • Limited model support: encoder-decoder models, embedding models, and some Mamba-family models are not yet supported.
  • Feature gaps: best_of, structured decoding, pipeline parallelism, and some LoRA and speculative decoding features are still limited compared with v0 (work in progress).
  • Planned improvements: FP8 KV cache, LoRA optimization, broader support for embedding and encoder-decoder models, and wider compatibility for VLMs and for TPU/AMD hardware.

📚 Reference links

  • vLLM Blog: “vLLM V1: A Major Upgrade to vLLM’s Core Architecture”
  • Official documentation: v1 User Guide, deprecated/feature comparison, and model support status (docs.vllm.ai)
  • Performance benchmarks: v0.8.1 comparison results and an analysis of v1-based server performance (Red Hat Developer)
This post is licensed under CC BY 4.0 by the author.