
Transformers documentation

Quanto


์ด ๋…ธํŠธ๋ถ์œผ๋กœ Quanto์™€ transformers๋ฅผ ์‚ฌ์šฉํ•ด ๋ณด์„ธ์š”!

๐Ÿค— Quanto ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๋‹ค๋ชฉ์  ํŒŒ์ดํ† ์น˜ ์–‘์žํ™” ํˆดํ‚ท์ž…๋‹ˆ๋‹ค. ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์–‘์žํ™” ๋ฐฉ๋ฒ•์€ ์„ ํ˜• ์–‘์žํ™”์ž…๋‹ˆ๋‹ค. Quanto๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ๊ฐ€์ค‘์น˜ ์–‘์žํ™” (float8,int8,int4,int2)
  • ํ™œ์„ฑํ™” ์–‘์žํ™” (float8,int8)
  • ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š์Œ (e.g CV,LLM)
  • ์žฅ์น˜์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š์Œ (e.g CUDA,MPS,CPU)
  • torch.compile ํ˜ธํ™˜์„ฑ
  • ํŠน์ • ์žฅ์น˜์— ๋Œ€ํ•œ ์‚ฌ์šฉ์ž ์ •์˜ ์ปค๋„์˜ ์‰ฌ์šด ์ถ”๊ฐ€
  • QAT(์–‘์žํ™”๋ฅผ ๊ณ ๋ คํ•œ ํ•™์Šต) ์ง€์›

์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋‹ค์Œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์„ค์น˜๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pip install quanto accelerate transformers

์ด์ œ from_pretrained() ๋ฉ”์†Œ๋“œ์— QuantoConfig ๊ฐ์ฒด๋ฅผ ์ „๋‹ฌํ•˜์—ฌ ๋ชจ๋ธ์„ ์–‘์žํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ torch.nn.Linear ๋ ˆ์ด์–ด๋ฅผ ํฌํ•จํ•˜๋Š” ๋ชจ๋“  ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ์˜ ๋ชจ๋“  ๋ชจ๋ธ์—์„œ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

ํ—ˆ๊น…ํŽ˜์ด์Šค์˜ transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๊ฐœ๋ฐœ์ž ํŽธ์˜๋ฅผ ์œ„ํ•ด quanto์˜ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ผ๋ถ€ ํ†ตํ•ฉํ•˜์—ฌ ์ง€์›ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ด ๋ฐฉ์‹์œผ๋กœ๋Š” ๊ฐ€์ค‘์น˜ ์–‘์žํ™”๋งŒ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ํ™œ์„ฑํ™” ์–‘์žํ™”, ์บ˜๋ฆฌ๋ธŒ๋ ˆ์ด์…˜, QAT ๊ฐ™์€ ๋” ๋ณต์žกํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” quanto ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ•ด๋‹น ํ•จ์ˆ˜๋ฅผ ์ง์ ‘ ํ˜ธ์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)

์ฐธ๊ณ ๋กœ, transformers์—์„œ๋Š” ์•„์ง ์ง๋ ฌํ™”๊ฐ€ ์ง€์›๋˜์ง€ ์•Š์ง€๋งŒ ๊ณง ์ง€์›๋  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค! ๋ชจ๋ธ์„ ์ €์žฅํ•˜๊ณ  ์‹ถ์œผ๋ฉด quanto ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋Œ€์‹  ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Quanto ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ์–‘์žํ™”๋ฅผ ์œ„ํ•ด ์„ ํ˜• ์–‘์žํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋น„๋ก ๊ธฐ๋ณธ์ ์ธ ์–‘์žํ™” ๊ธฐ์ˆ ์ด์ง€๋งŒ, ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป๋Š”๋ฐ ์•„์ฃผ ํฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค! ๋ฐ”๋กœ ์•„๋ž˜์— ์žˆ๋Š” ๋ฒค์น˜๋งˆํฌ(llama-2-7b์˜ ํŽ„ํ”Œ๋ ‰์„œํ‹ฐ ์ง€ํ‘œ)๋ฅผ ํ™•์ธํ•ด ๋ณด์„ธ์š”. ๋” ๋งŽ์€ ๋ฒค์น˜๋งˆํฌ๋Š” ์—ฌ๊ธฐ ์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(Figure: llama-2-7b perplexity benchmark with Quanto)

์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๋Œ€๋ถ€๋ถ„์˜ PTQ ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ํ˜ธํ™˜๋  ๋งŒํผ ์ถฉ๋ถ„ํžˆ ์œ ์—ฐํ•ฉ๋‹ˆ๋‹ค. ์•ž์œผ๋กœ์˜ ๊ณ„ํš์€ ๊ฐ€์žฅ ์ธ๊ธฐ ์žˆ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜(AWQ, Smoothquant)์„ ์ตœ๋Œ€ํ•œ ๋งค๋„๋Ÿฝ๊ฒŒ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.


ะ›ัƒั‡ัˆะธะน ั‡ะฐัั‚ะฝั‹ะน ั…ะพัั‚ะธะฝะณ