Transformers documentation

Quark

Quark is a deep learning quantization toolkit designed to be agnostic to specific data types, algorithms, and hardware. Different pre-processing strategies, algorithms, and data types can be combined in Quark.

The PyTorch support integrated through 🤗 Transformers primarily targets AMD CPUs and GPUs and is mainly intended for evaluation. For example, lm-evaluation-harness can be used with the 🤗 Transformers backend to seamlessly evaluate a wide range of models quantized with Quark.
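
As a rough illustration of the evaluation workflow, the sketch below assembles an lm-evaluation-harness command line that uses the Hugging Face backend. The flags are assumed from the lm-evaluation-harness README and may differ between versions, `lm_eval_command` is a hypothetical helper, and the `hellaswag` task is only an example choice. The command is printed rather than executed.

```python
import shlex


def lm_eval_command(model_id: str, tasks: list, device: str = "cuda:0", batch_size: int = 8) -> list:
    """Build an argv list for the lm-evaluation-harness CLI (Hugging Face backend)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
    ]


cmd = lm_eval_command(
    "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym",
    tasks=["hellaswag"],  # example task, chosen for illustration
)
# Print the command instead of running it; pass `cmd` to subprocess.run to execute.
print(shlex.join(cmd))
```

Running the printed command requires both lm-evaluation-harness and amd-quark to be installed in the same environment.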

Users interested in Quark can refer to its documentation to start quantizing models and using them in supported open-source libraries.

Quark has its own checkpoint and configuration format, but it can also produce models with a serialization layout compatible with other quantization and runtime implementations (AutoAWQ, native fp8).

To load Quark-quantized models in Transformers, first install the library:

pip install amd-quark
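
Before loading a checkpoint, it can help to confirm that the package is visible to the current interpreter. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution name matches the one used in the `pip install` command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


quark_version = installed_version("amd-quark")
if quark_version is None:
    print("amd-quark is not installed; run `pip install amd-quark` first.")
else:
    print(f"amd-quark {quark_version} is installed.")
```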

Support matrix

Models quantized with Quark support a wide range of features that can be combined with one another. Regardless of configuration, every quantized model can be seamlessly reloaded through PreTrainedModel.from_pretrained.

The table below shows some of the features supported by Quark:

| Feature | Supported in Quark |
|---|---|
| Data types | int8, int4, int2, bfloat16, float16, fp8_e5m2, fp8_e4m3, fp6_e3m2, fp6_e2m3, fp4, OCP MX, MX6, MX9, bfp16 |
| Pre-quantization model transformation | SmoothQuant, QuaRot, SpinQuant, AWQ |
| Quantization algorithm | GPTQ |
| Supported operators | nn.Linear, nn.Conv2d, nn.ConvTranspose2d, nn.Embedding, nn.EmbeddingBag |
| Granularity | per-tensor, per-channel, per-block, per-layer, per-layer type |
| KV cache | fp8 |
| Activation calibration | MinMax / Percentile / MSE |
| Quantization strategy | weight-only, static, dynamic, with or without output quantization |
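
To make the granularity and calibration rows more concrete, here is a small pure-Python sketch of symmetric int8 weight quantization with a MinMax-style scale, computed once per tensor and once per output channel. It illustrates the general technique only and is not Quark's implementation; all helper names are hypothetical.

```python
from typing import List

QMAX = 127  # largest magnitude used for symmetric int8


def minmax_scale(values: List[float]) -> float:
    """Symmetric MinMax scale: map the largest absolute value to QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Round to the nearest integer step and clamp to [-QMAX, QMAX]."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


def max_error(original, restored) -> float:
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(a - b) for ro, rr in zip(original, restored) for a, b in zip(ro, rr))


# A toy 2x3 weight matrix whose rows (output channels) differ in magnitude.
weight = [[0.01, -0.02, 0.03], [1.0, -2.0, 3.0]]

# Per-tensor: a single scale shared by every element.
tensor_scale = minmax_scale([v for row in weight for v in row])
per_tensor = [dequantize(quantize(row, tensor_scale), tensor_scale) for row in weight]

# Per-channel: one scale per row, so small-magnitude rows keep their resolution.
per_channel = []
for row in weight:
    row_scale = minmax_scale(row)
    per_channel.append(dequantize(quantize(row, row_scale), row_scale))

print("per-tensor max error: ", max_error(weight, per_tensor))
print("per-channel max error:", max_error(weight, per_channel))
```

In this toy case the small-magnitude row is reconstructed far more accurately with per-channel scales, which is the usual motivation for finer granularity at the cost of storing more scale values.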

Models on Hugging Face Hub

Public models that use Quark native serialization can be found at https://huggingface.co/models?other=quark.

Quark also supports models that use quant_method="fp8" and models that use quant_method="awq", but Transformers loads these models through AutoAWQ or its native fp8 support instead.
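
The distinction between these loading paths can be sketched as a lookup on the checkpoint's quantization method. The example assumes the method is recorded under `quantization_config.quant_method` in the checkpoint's `config.json`, as is usual for Transformers quantization integrations. The helper `loading_path`, the sample dictionaries, and the value "quark" for Quark-native checkpoints are assumptions made for illustration.

```python
import json

# Mapping based on the loading paths described on this page.
# The "quark" key for native checkpoints is an assumption.
LOADING_PATHS = {
    "quark": "Quark native serialization (requires amd-quark)",
    "awq": "AutoAWQ",
    "fp8": "native fp8 support in Transformers",
}


def loading_path(config: dict) -> str:
    """Describe how a checkpoint would be loaded, based on its quant_method."""
    quant_config = config.get("quantization_config")
    if not quant_config:
        return "not quantized"
    method = quant_config.get("quant_method", "")
    return LOADING_PATHS.get(method, f"other quantization method: {method!r}")


# Hypothetical config.json contents, for illustration only.
sample = json.loads('{"quantization_config": {"quant_method": "awq"}}')
print(loading_path(sample))
print(loading_path({"quantization_config": {"quant_method": "quark"}}))
print(loading_path({}))
```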

Using Quark models in Transformers

Here is an example of how to load a Quark model in Transformers:

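from transformers import AutoModelForCausalLM, AutoTokenizer

# Public checkpoint stored with Quark native serialization.
model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
# PyTorch builds for AMD GPUs (ROCm) also use the "cuda" device name.
model = model.to("cuda")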

print(model.model.layers[0].self_attn.q_proj)
# QParamsLinear(
#   (weight_quantizer): ScaledRealQuantizer()
#   (input_quantizer): ScaledRealQuantizer()
#   (output_quantizer): ScaledRealQuantizer()
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
inp = tokenizer("Where is a good place to cycle around Tokyo?", return_tensors="pt")
inp = inp.to("cuda")

res = model.generate(**inp, min_new_tokens=50, max_new_tokens=100)

print(tokenizer.batch_decode(res)[0])
# <|begin_of_text|>Where is a good place to cycle around Tokyo? There are several places in Tokyo that are suitable for cycling, depending on your skill level and interests. Here are a few suggestions:
# 1. Yoyogi Park: This park is a popular spot for cycling and has a wide, flat path that's perfect for beginners. You can also visit the Meiji Shrine, a famous Shinto shrine located in the park.
# 2. Imperial Palace East Garden: This beautiful garden has a large, flat path that's perfect for cycling. You can also visit the