
AWQ

์ด ๋…ธํŠธ๋ถ ์œผ๋กœ AWQ ์–‘์žํ™”๋ฅผ ์‹ค์Šตํ•ด๋ณด์„ธ์š” !

Activation-aware Weight Quantization (AWQ) doesn't quantize all the weights in a model, and instead preserves the weights that are important for LLM performance. This significantly reduces quantization loss, so you can run models in 4-bit precision without performance degradation.

AWQ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ์–‘์žํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ์—ฌ๋Ÿฌ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด llm-awq, autoawq , optimum-intel ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. Transformers๋Š” llm-awq, autoawq ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•ด ์–‘์žํ™”๋œ ๋ชจ๋ธ์„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋„๋ก ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” autoawq๋กœ ์–‘์žํ™”๋œ ๋ชจ๋ธ์„ ๊ฐ€์ ธ์˜ค๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ๋“œ๋ฆฌ๋‚˜, llm-awq๋กœ ์–‘์žํ™”๋œ ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ๋„ ์œ ์‚ฌํ•œ ์ ˆ์ฐจ๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.

autoawq๊ฐ€ ์„ค์น˜๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pip install autoawq
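This guide only loads checkpoints that have already been quantized, but for context, the sketch below shows roughly how such a checkpoint is produced with autoawq. The model path, output directory, and quant_config values are illustrative assumptions; check the autoawq documentation for the exact options supported by the version you installed.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # placeholder: any supported fp16 model
quant_path = "mistral-7b-awq"             # placeholder: output directory

# 4-bit AWQ settings with group size 128, following autoawq's conventions
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate and quantize the weights, then save an AWQ checkpoint that Transformers can load
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)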

AWQ ์–‘์žํ™”๋œ ๋ชจ๋ธ์€ ํ•ด๋‹น ๋ชจ๋ธ์˜ config.json ํŒŒ์ผ์˜ quantization_config ์†์„ฑ์„ ํ†ตํ•ด ์‹๋ณ„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.:

{
  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
  "architectures": [
    "MistralForCausalLM"
  ],
  ...
  ...
  ...
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
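If you prefer to check this programmatically, the same quantization_config entry is exposed on the loaded config object. A minimal sketch (the attribute is only present for quantized checkpoints, hence the getattr fallback):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
# For AWQ checkpoints this prints a dict with quant_method, bits, group_size, etc.
print(getattr(config, "quantization_config", None))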

์–‘์žํ™”๋œ ๋ชจ๋ธ์€ from_pretrained() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ๋ชจ๋ธ์„ CPU์— ๊ฐ€์ ธ์™”๋‹ค๋ฉด, ๋จผ์ € ๋ชจ๋ธ์„ GPU ์žฅ์น˜๋กœ ์˜ฎ๊ฒจ์•ผ ํ•ฉ๋‹ˆ๋‹ค. device_map ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๋ฐฐ์น˜ํ•  ์œ„์น˜๋ฅผ ์ง€์ •ํ•˜์„ธ์š”:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
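Once loaded, the model is used like any other Transformers model; a minimal generation sketch (the prompt is arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

# Tokenize a prompt, move it to the model's device, and generate a short completion
inputs = tokenizer("What is AWQ quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))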

AWQ ์–‘์žํ™” ๋ชจ๋ธ์„ ๊ฐ€์ ธ์˜ค๋ฉด ์ž๋™์œผ๋กœ ์„ฑ๋Šฅ์ƒ์˜ ์ด์œ ๋กœ ์ธํ•ด ๊ฐ€์ค‘์น˜๋“ค์˜ ๊ธฐ๋ณธ๊ฐ’์ด fp16์œผ๋กœ ์„ค์ •๋ฉ๋‹ˆ๋‹ค. ๊ฐ€์ค‘์น˜๋ฅผ ๋‹ค๋ฅธ ํ˜•์‹์œผ๋กœ ๊ฐ€์ ธ์˜ค๋ ค๋ฉด, dtype ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float32)

์ถ”๋ก ์„ ๋”์šฑ ๊ฐ€์†ํ™”ํ•˜๊ธฐ ์œ„ํ•ด AWQ ์–‘์žํ™”์™€ FlashAttention-2 ๋ฅผ ๊ฒฐํ•ฉ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")

ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ

ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ์€ ์ •ํ™•๋„์™€ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค. ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ์€ Llama ์•„ํ‚คํ…์ฒ˜์™€ Mistral ์•„ํ‚คํ…์ฒ˜์˜ AWQ๋ชจ๋“ˆ์— ๊ธฐ๋ณธ์ ์œผ๋กœ ์ง€์›๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ง€์›๋˜์ง€ ์•Š๋Š” ์•„ํ‚คํ…์ฒ˜์— ๋Œ€ํ•ด์„œ๋„ AWQ ๋ชจ๋“ˆ์„ ํ“จ์ฆˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ์€ FlashAttention-2์™€ ๊ฐ™์€ ๋‹ค๋ฅธ ์ตœ์ ํ™” ๊ธฐ์ˆ ๊ณผ ๊ฒฐํ•ฉํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.


์ง€์›๋˜๋Š” ์•„ํ‚คํ…์ฒ˜์—์„œ ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ์„ ํ™œ์„ฑํ™”ํ•˜๋ ค๋ฉด, AwqConfig ๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ๋งค๊ฐœ๋ณ€์ˆ˜ fuse_max_seq_len ๊ณผ do_fuse=True๋ฅผ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. fuse_max_seq_len ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ์ „์ฒด ์‹œํ€€์Šค ๊ธธ์ด๋กœ, ์ปจํ…์ŠคํŠธ ๊ธธ์ด์™€ ์˜ˆ์ƒ ์ƒ์„ฑ ๊ธธ์ด๋ฅผ ํฌํ•จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์•ˆ์ „ํ•˜๊ฒŒ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ๋” ํฐ ๊ฐ’์œผ๋กœ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, TheBloke/Mistral-7B-OpenOrca-AWQ ๋ชจ๋ธ์˜ AWQ ๋ชจ๋“ˆ์„ ํ“จ์ฆˆํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

import torch
from transformers import AwqConfig, AutoModelForCausalLM

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)

The TheBloke/Mistral-7B-OpenOrca-AWQ model was benchmarked with batch_size=1, with and without fused modules.

ํ“จ์ฆˆ๋˜์ง€ ์•Š์€ ๋ชจ๋“ˆ
๋ฐฐ์น˜ ํฌ๊ธฐ ํ”„๋ฆฌํ•„ ๊ธธ์ด ๋””์ฝ”๋“œ ๊ธธ์ด ํ”„๋ฆฌํ•„ ํ† ํฐ/์ดˆ ๋””์ฝ”๋“œ ํ† ํฐ/์ดˆ ๋ฉ”๋ชจ๋ฆฌ (VRAM)
1 32 32 60.0984 38.4537 4.50 GB (5.68%)
1 64 64 1333.67 31.6604 4.50 GB (5.68%)
1 128 128 2434.06 31.6272 4.50 GB (5.68%)
1 256 256 3072.26 38.1731 4.50 GB (5.68%)
1 512 512 3184.74 31.6819 4.59 GB (5.80%)
1 1024 1024 3148.18 36.8031 4.81 GB (6.07%)
1 2048 2048 2927.33 35.2676 5.73 GB (7.23%)
ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ
๋ฐฐ์น˜ ํฌ๊ธฐ ํ”„๋ฆฌํ•„ ๊ธธ์ด ๋””์ฝ”๋“œ ๊ธธ์ด ํ”„๋ฆฌํ•„ ํ† ํฐ/์ดˆ ๋””์ฝ”๋“œ ํ† ํฐ/์ดˆ ๋ฉ”๋ชจ๋ฆฌ (VRAM)
1 32 32 81.4899 80.2569 4.00 GB (5.05%)
1 64 64 1756.1 106.26 4.00 GB (5.05%)
1 128 128 2479.32 105.631 4.00 GB (5.06%)
1 256 256 1813.6 85.7485 4.01 GB (5.06%)
1 512 512 2848.9 97.701 4.11 GB (5.19%)
1 1024 1024 3044.35 87.7323 4.41 GB (5.57%)
1 2048 2048 2715.11 89.4709 5.57 GB (7.04%)

ํ“จ์ฆˆ๋œ ๋ชจ๋“ˆ ๋ฐ ํ“จ์ฆˆ๋˜์ง€ ์•Š์€ ๋ชจ๋“ˆ์˜ ์†๋„์™€ ์ฒ˜๋ฆฌ๋Ÿ‰์€ optimum-benchmark๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…Œ์ŠคํŠธ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Figures: forward peak memory per batch size; generate throughput per batch size.

ExLlama-v2 ์„œํฌํŠธ

์ตœ์‹  ๋ฒ„์ „ autoawq๋Š” ๋น ๋ฅธ ํ”„๋ฆฌํ•„๊ณผ ๋””์ฝ”๋”ฉ์„ ์œ„ํ•ด ExLlama-v2 ์ปค๋„์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ์‹œ์ž‘ํ•˜๊ธฐ ์œ„ํ•ด ๋จผ์ € ์ตœ์‹  ๋ฒ„์ „ autoawq ๋ฅผ ์„ค์น˜ํ•˜์„ธ์š” :

pip install git+https://github.com/casper-hansen/AutoAWQ.git

๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ version="exllama"๋กœ ์„ค์ •ํ•ด AwqConfig()๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ๋ชจ๋ธ์— ๋„˜๊ฒจ์ฃผ์„ธ์š”.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)

input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
output = model(input_ids)
print(output.logits)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
print(tokenizer.decode(output[0], skip_special_tokens=True))

์ด ๊ธฐ๋Šฅ์€ AMD GPUs์—์„œ ์ง€์›๋ฉ๋‹ˆ๋‹ค.


ะ›ัƒั‡ัˆะธะน ั‡ะฐัั‚ะฝั‹ะน ั…ะพัั‚ะธะฝะณ