
AWQ

Try out AWQ quantization hands-on with this notebook!

Activation-aware Weight Quantization (AWQ) doesn't quantize all the weights in a model, and instead preserves the weights that are important for LLM performance. This significantly reduces quantization loss, so you can run models in 4-bit precision without experiencing any performance degradation.

There are several libraries for quantizing models with the AWQ algorithm, such as llm-awq, autoawq, and optimum-intel. Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide shows how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.

Make sure autoawq is installed:

pip install autoawq

An AWQ-quantized model can be identified by the quantization_config attribute in the model's config.json file:

{
  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
  "architectures": [
    "MistralForCausalLM"
  ],
  ...
  ...
  ...
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
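
You can also check this attribute programmatically without downloading the model weights. Below is a minimal sketch, assuming the quantized checkpoint above and that the quantization_config entry is exposed as a dictionary when loading only the configuration:

from transformers import AutoConfig

# Load only the configuration file (no weights) and inspect the quantization settings
config = AutoConfig.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
print(config.quantization_config["quant_method"])  # expected to print "awq" for AWQ-quantized checkpoints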

A quantized model is loaded with the from_pretrained() method. If you loaded your model on the CPU, make sure to move it to a GPU device first. Use the device_map parameter to specify where to place the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
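
Once the quantized model is loaded, it can be used for inference like any other Transformers model. The snippet below is a small usage sketch that reuses model_id from above with an arbitrary example prompt:

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize a prompt and move it to the same device as the model
inputs = tokenizer("What is AWQ quantization?", return_tensors="pt").to(model.device)

# Generate up to 64 new tokens with the 4-bit model
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))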

Loading an AWQ-quantized model automatically sets the weights to fp16 by default for performance reasons. If you want to load the weights in a different format, use the torch_dtype parameter:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

AWQ quantization can also be combined with FlashAttention-2 to further accelerate inference:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")

Fused modules

Fused modules offer improved accuracy and performance. They are supported out of the box for the AWQ modules of the Llama and Mistral architectures, but you can also fuse AWQ modules for unsupported architectures.

Fused modules cannot be combined with other optimization techniques such as FlashAttention-2.


To enable fused modules for supported architectures, create an AwqConfig and set the parameters fuse_max_seq_len and do_fuse=True. The fuse_max_seq_len parameter is the total sequence length and should include the context length and the expected generation length. You can set it to a larger value to be safe.

For example, let's fuse the AWQ modules of the TheBloke/Mistral-7B-OpenOrca-AWQ model.

import torch
from transformers import AwqConfig, AutoModelForCausalLM

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
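
For architectures that are not supported out of the box, you can instead pass your own fusing mapping through the modules_to_fuse argument of AwqConfig. The sketch below illustrates the shape of such a mapping; the checkpoint name, module names, and model dimensions are examples and must be adapted to the architecture you are actually fusing:

from transformers import AwqConfig, AutoModelForCausalLM

model_id = "TheBloke/Yi-34B-AWQ"  # example checkpoint; replace with your own

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    modules_to_fuse={
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projection layers to fuse
        "layernorm": ["ln1", "ln2", "norm"],                    # layernorm layers to fuse
        "mlp": ["gate_proj", "up_proj", "down_proj"],           # MLP layers to fuse
        "use_alibi": False,                                     # whether the model uses ALiBi positional embeddings
        "num_attention_heads": 56,                              # the values below must match the model's config
        "num_key_value_heads": 8,
        "hidden_size": 7168,
    },
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)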

The TheBloke/Mistral-7B-OpenOrca-AWQ model was benchmarked with batch_size=1, with and without fused modules.

Unfused module

| Batch Size | Prefill Length | Decode Length | Prefill tokens/sec | Decode tokens/sec | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 60.0984 | 38.4537 | 4.50 GB (5.68%) |
| 1 | 64 | 64 | 1333.67 | 31.6604 | 4.50 GB (5.68%) |
| 1 | 128 | 128 | 2434.06 | 31.6272 | 4.50 GB (5.68%) |
| 1 | 256 | 256 | 3072.26 | 38.1731 | 4.50 GB (5.68%) |
| 1 | 512 | 512 | 3184.74 | 31.6819 | 4.59 GB (5.80%) |
| 1 | 1024 | 1024 | 3148.18 | 36.8031 | 4.81 GB (6.07%) |
| 1 | 2048 | 2048 | 2927.33 | 35.2676 | 5.73 GB (7.23%) |

Fused module

| Batch Size | Prefill Length | Decode Length | Prefill tokens/sec | Decode tokens/sec | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) |
| 1 | 64 | 64 | 1756.1 | 106.26 | 4.00 GB (5.05%) |
| 1 | 128 | 128 | 2479.32 | 105.631 | 4.00 GB (5.06%) |
| 1 | 256 | 256 | 1813.6 | 85.7485 | 4.01 GB (5.06%) |
| 1 | 512 | 512 | 2848.9 | 97.701 | 4.11 GB (5.19%) |
| 1 | 1024 | 1024 | 3044.35 | 87.7323 | 4.41 GB (5.57%) |
| 1 | 2048 | 2048 | 2715.11 | 89.4709 | 5.57 GB (7.04%) |

The speed and throughput of the fused and unfused modules were also tested with the optimum-benchmark library.

Benchmark plots: forward peak memory per batch size, forward latency per batch size, and generate throughput per batch size.

ExLlama-v2 support

Recent versions of autoawq support ExLlama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of autoawq:

pip install git+https://github.com/casper-hansen/AutoAWQ.git

Create an AwqConfig() with the parameter version="exllama" and pass it to the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)

input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
output = model(input_ids)
print(output.logits)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Note that this feature is supported on AMD GPUs.
