
Preprocess[[preprocess]]

[[open-in-colab]]

๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋ ค๋ฉด ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ชจ๋ธ์— ๋งž๋Š” ์ž…๋ ฅ ํ˜•์‹์œผ๋กœ ์ „์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ…์ŠคํŠธ, ์ด๋ฏธ์ง€ ๋˜๋Š” ์˜ค๋””์˜ค์ธ์ง€ ๊ด€๊ณ„์—†์ด ๋ฐ์ดํ„ฐ๋ฅผ ํ…์„œ ๋ฐฐ์น˜๋กœ ๋ณ€ํ™˜ํ•˜๊ณ  ์กฐ๋ฆฝํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๐Ÿค— Transformers๋Š” ๋ชจ๋ธ์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ค€๋น„ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ์ผ๋ จ์˜ ์ „์ฒ˜๋ฆฌ ํด๋ž˜์Šค๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ๋Š” ๋‹ค์Œ ๋‚ด์šฉ์„ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • For text, use a Tokenizer to convert the text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
  • For speech and audio, use a Feature extractor to extract sequential features from audio waveforms and convert them into tensors.
  • For image inputs, use an ImageProcessor to convert images into tensors.
  • For multimodal inputs, use a Processor to combine a tokenizer with a feature extractor or image processor.

AutoProcessor always works and automatically chooses the correct class for the model you're using, whether it's a tokenizer, image processor, feature extractor, or processor.

Before you begin, install 🤗 Datasets so you can load some data to experiment with:

pip install datasets

์ž์—ฐ์–ด์ฒ˜๋ฆฌ[[natural-language-processing]]

ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์ „์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ณธ ๋„๊ตฌ๋Š” tokenizer์ž…๋‹ˆ๋‹ค. ํ† ํฌ๋‚˜์ด์ €๋Š” ์ผ๋ จ์˜ ๊ทœ์น™์— ๋”ฐ๋ผ ํ…์ŠคํŠธ๋ฅผ ํ† ํฐ์œผ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. ํ† ํฐ์€ ์ˆซ์ž๋กœ ๋ณ€ํ™˜๋˜๊ณ  ํ…์„œ๋Š” ๋ชจ๋ธ ์ž…๋ ฅ์ด ๋ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ์— ํ•„์š”ํ•œ ์ถ”๊ฐ€ ์ž…๋ ฅ์€ ํ† ํฌ๋‚˜์ด์ €์— ์˜ํ•ด ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค.

์‚ฌ์ „ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ๊ณ„ํš์ด๋ผ๋ฉด ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์‚ฌ์ „ํ›ˆ๋ จ๋œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ํ…์ŠคํŠธ๊ฐ€ ์‚ฌ์ „ํ›ˆ๋ จ ๋ง๋ญ‰์น˜์™€ ๋™์ผํ•œ ๋ฐฉ์‹์œผ๋กœ ๋ถ„ํ• ๋˜๊ณ  ์‚ฌ์ „ํ›ˆ๋ จ ์ค‘์— ๋™์ผํ•œ ํ•ด๋‹น ํ† ํฐ-์ธ๋ฑ์Šค ์Œ(์ผ๋ฐ˜์ ์œผ๋กœ vocab์ด๋ผ๊ณ  ํ•จ)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

To get started, load a pretrained tokenizer with the [AutoTokenizer.from_pretrained] method. This downloads the vocab the model was pretrained with:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

๊ทธ ๋‹ค์Œ์œผ๋กœ ํ…์ŠคํŠธ๋ฅผ ํ† ํฌ๋‚˜์ด์ €์— ๋„ฃ์–ด์ฃผ์„ธ์š”:

>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

ํ† ํฌ๋‚˜์ด์ €๋Š” ์„ธ ๊ฐ€์ง€ ์ค‘์š”ํ•œ ํ•ญ๋ชฉ์„ ํฌํ•จํ•œ ๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • input_ids are the indices corresponding to each token in the sentence.
  • attention_mask indicates whether a token should be attended to or not.
  • token_type_ids identifies which sequence a token belongs to when there is more than one sequence.
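
For example, when a sentence pair is passed to the tokenizer, token_type_ids marks which segment each token came from. A plain-Python sketch of the idea (the token ids below are illustrative; 101 and 102 are BERT's [CLS] and [SEP] ids, as in the output above):

```python
# Sketch of how token_type_ids distinguishes two sequences in a pair.
first_ids = [1252, 1184, 1164]   # stand-in for an already-tokenized first sentence
second_ids = [1248, 6462]        # stand-in for the second sentence

# [CLS] first [SEP] second [SEP] — segment 0 covers up to the first [SEP]
input_ids = [101] + first_ids + [102] + second_ids + [102]
token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)

print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Both lists have the same length, one entry per token.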

Return your input by decoding the input_ids:

>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ๋‘ ๊ฐœ์˜ ํŠน์ˆ˜ํ•œ ํ† ํฐ(๋ถ„๋ฅ˜ ํ† ํฐ CLS์™€ ๋ถ„ํ•  ํ† ํฐ SEP)์„ ๋ฌธ์žฅ์— ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋“  ๋ชจ๋ธ์— ํŠน์ˆ˜ํ•œ ํ† ํฐ์ด ํ•„์š”ํ•œ ๊ฒƒ์€ ์•„๋‹ˆ์ง€๋งŒ, ํ•„์š”ํ•˜๋‹ค๋ฉด ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ์ž๋™์œผ๋กœ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

If there are several sentences you want to preprocess, pass them to the tokenizer as a list:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}

Padding[[pad]]

๋ชจ๋ธ ์ž…๋ ฅ์ธ ํ…์„œ๋Š” ๋ชจ์–‘์ด ๊ท ์ผํ•ด์•ผ ํ•˜์ง€๋งŒ, ๋ฌธ์žฅ์˜ ๊ธธ์ด๊ฐ€ ํ•ญ์ƒ ๊ฐ™์ง€๋Š” ์•Š๊ธฐ ๋•Œ๋ฌธ์— ๋ฌธ์ œ๊ฐ€ ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํŒจ๋”ฉ์€ ์งง์€ ๋ฌธ์žฅ์— ํŠน์ˆ˜ํ•œ ํŒจ๋”ฉ ํ† ํฐ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ํ…์„œ๋ฅผ ์ง์‚ฌ๊ฐํ˜• ๋ชจ์–‘์ด ๋˜๋„๋ก ํ•˜๋Š” ์ „๋žต์ž…๋‹ˆ๋‹ค.

padding ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ True๋กœ ์„ค์ •ํ•˜์—ฌ ๋ฐฐ์น˜ ๋‚ด์˜ ์งง์€ ์‹œํ€€์Šค๋ฅผ ๊ฐ€์žฅ ๊ธด ์‹œํ€€์Šค์— ๋งž์ถฐ ํŒจ๋”ฉํ•ฉ๋‹ˆ๋‹ค.

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

๊ธธ์ด๊ฐ€ ์งง์€ ์ฒซ ๋ฌธ์žฅ๊ณผ ์„ธ ๋ฒˆ์งธ ๋ฌธ์žฅ์ด ์ด์ œ 0์œผ๋กœ ์ฑ„์›Œ์กŒ์Šต๋‹ˆ๋‹ค.

์ž˜๋ผ๋‚ด๊ธฐ[[truncation]]

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

๋ชจ๋ธ์—์„œ ํ—ˆ์šฉํ•˜๋Š” ์ตœ๋Œ€ ๊ธธ์ด๋กœ ์‹œํ€€์Šค๋ฅผ ์ž๋ฅด๋ ค๋ฉด truncation ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ True๋กœ ์„ค์ •ํ•˜์„ธ์š”:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Check out the Padding and truncation concept guide to learn more about the different padding and truncation arguments.
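
Combined, padding to and truncating at a fixed max_length can be sketched as follows (pad_or_truncate is a hypothetical helper; a real tokenizer also takes care to keep the final [SEP] token when truncating):

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    # cut sequences longer than max_length, pad shorter ones with pad_id
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

print(pad_or_truncate([101, 1252, 1184, 136, 102], 3))  # [101, 1252, 1184]
print(pad_or_truncate([101, 136, 102], 5))              # [101, 136, 102, 0, 0]
```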

Build tensors[[build-tensors]]

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the return_tensors parameter to pt for PyTorch, or tf for TensorFlow:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
For TensorFlow, the same call with return_tensors="tf" returns tf.Tensor objects instead:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(encoded_input)
{'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}

Audio[[audio]]

์˜ค๋””์˜ค ์ž‘์—…์€ ๋ชจ๋ธ์— ๋งž๋Š” ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ค€๋น„ํ•˜๊ธฐ ์œ„ํ•ด ํŠน์„ฑ ์ถ”์ถœ๊ธฐ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํŠน์„ฑ ์ถ”์ถœ๊ธฐ๋Š” ์›์‹œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์—์„œ ํŠน์„ฑ๋ฅผ ์ถ”์ถœํ•˜๊ณ  ์ด๋ฅผ ํ…์„œ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.

์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ํŠน์„ฑ ์ถ”์ถœ๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด๊ธฐ ์œ„ํ•ด MInDS-14 ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค์„ธ์š”. (๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๋ฐฉ๋ฒ•์€ ๐Ÿค— ๋ฐ์ดํ„ฐ ์„ธํŠธ ํŠœํ† ๋ฆฌ์–ผ์—์„œ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.)

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

audio ์—ด์˜ ์ฒซ ๋ฒˆ์งธ ์š”์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ์ž…๋ ฅ์„ ์‚ดํŽด๋ณด์„ธ์š”. audio ์—ด์„ ํ˜ธ์ถœํ•˜๋ฉด ์˜ค๋””์˜ค ํŒŒ์ผ์„ ์ž๋™์œผ๋กœ ๊ฐ€์ ธ์˜ค๊ณ  ๋ฆฌ์ƒ˜ํ”Œ๋งํ•ฉ๋‹ˆ๋‹ค.

>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}

This returns three items:

  • array is the speech signal loaded as a 1D array - and resampled if necessary.
  • path points to the location of the audio file.
  • sampling_rate refers to how many data points of the speech signal are measured per second.

์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ๋Š” Wav2Vec2 ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ ์นด๋“œ๋ฅผ ๋ณด๋ฉด Wav2Vec2๊ฐ€ 16kHz ์ƒ˜ํ”Œ๋ง๋œ ์Œ์„ฑ ์˜ค๋””์˜ค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ์ „ํ›ˆ๋ จ๋œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์„ ์‚ฌ์ „ํ›ˆ๋ จํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ์™€ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์˜ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ๊ฐ€ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์˜ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ๊ฐ€ ๋‹ค๋ฅด๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ๋ฆฌ์ƒ˜ํ”Œ๋งํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  1. ๐Ÿค— Datasets์˜ [~datasets.Dataset.cast_column] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ๋ฅผ 16kHz๋กœ ์—…์ƒ˜ํ”Œ๋งํ•˜์„ธ์š”:
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
  1. ์˜ค๋””์˜ค ํŒŒ์ผ์„ ๋ฆฌ์ƒ˜ํ”Œ๋งํ•˜๊ธฐ ์œ„ํ•ด audio ์—ด์„ ๋‹ค์‹œ ํ˜ธ์ถœํ•ฉ๋‹ˆ๋‹ค:
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
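
Resampling itself is handled by 🤗 Datasets with a proper bandlimited resampler, but its effect on the array can be sketched with naive linear interpolation (resample_linear is an illustration only, not what the library does internally):

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr):
    # naive linear-interpolation resampler, for illustration only;
    # real resampling uses a bandlimited filter to avoid aliasing
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio)

tone = np.sin(np.linspace(0, 2 * np.pi, 8000))  # one second of audio at 8kHz
print(resample_linear(tone, 8000, 16000).shape)  # (16000,)
```

Upsampling from 8kHz to 16kHz doubles the number of samples while preserving the clip's duration, which is why the array above is longer than before the cast.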

๋‹ค์Œ์œผ๋กœ, ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•˜๊ณ  ํŒจ๋”ฉํ•  ํŠน์„ฑ ์ถ”์ถœ๊ธฐ๋ฅผ ๊ฐ€์ ธ์˜ค์„ธ์š”. ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ, ๋” ์งง์€ ์‹œํ€€์Šค์— ๋Œ€ํ•ด 0์ด ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค. ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์—๋„ ๊ฐ™์€ ๊ฐœ๋…์ด ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ํŠน์„ฑ ์ถ”์ถœ๊ธฐ๋Š” ๋ฐฐ์—ด์— 0(๋ฌต์Œ์œผ๋กœ ํ•ด์„)์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

Load the feature extractor with [AutoFeatureExtractor.from_pretrained]:

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

Pass the audio array to the feature extractor. We also recommend adding the sampling_rate argument to the feature extractor call in order to better debug any silent errors that may occur.

>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}

ํ† ํฌ๋‚˜์ด์ €์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ฐฐ์น˜ ๋‚ด์—์„œ ๊ฐ€๋ณ€์ ์ธ ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ํŒจ๋”ฉ ๋˜๋Š” ์ž˜๋ผ๋‚ด๊ธฐ๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋‘ ๊ฐœ์˜ ์˜ค๋””์˜ค ์ƒ˜ํ”Œ์˜ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ํ™•์ธํ•ด๋ณด์„ธ์š”:

>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)
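
As a sanity check, the sample count divided by the sampling rate gives each clip's duration in seconds:

```python
# duration (seconds) = number of samples / samples per second
sampling_rate = 16000  # after the cast_column upsampling above

print(round(173398 / sampling_rate, 2))  # 10.84
print(round(106496 / sampling_rate, 2))  # 6.66
```

The two clips differ by about four seconds, which is why they need padding or truncation before batching.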

์˜ค๋””์˜ค ์ƒ˜ํ”Œ์˜ ๊ธธ์ด๊ฐ€ ๋™์ผํ•˜๋„๋ก ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ „์ฒ˜๋ฆฌํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“œ์„ธ์š”. ์ตœ๋Œ€ ์ƒ˜ํ”Œ ๊ธธ์ด๋ฅผ ์ง€์ •ํ•˜๋ฉด ํŠน์„ฑ ์ถ”์ถœ๊ธฐ๊ฐ€ ํ•ด๋‹น ๊ธธ์ด์— ๋งž์ถฐ ์‹œํ€€์Šค๋ฅผ ํŒจ๋”ฉํ•˜๊ฑฐ๋‚˜ ์ž˜๋ผ๋ƒ…๋‹ˆ๋‹ค:

>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs

preprocess_function์„ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ฒ˜์Œ ์˜ˆ์‹œ ๋ช‡ ๊ฐœ์— ์ ์šฉํ•ด๋ณด์„ธ์š”:

>>> processed_dataset = preprocess_function(dataset[:5])

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)

Computer vision[[computer-vision]]

์ปดํ“จํ„ฐ ๋น„์ „ ์ž‘์—…์˜ ๊ฒฝ์šฐ, ๋ชจ๋ธ์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ค€๋น„ํ•˜๊ธฐ ์œ„ํ•ด ์ด๋ฏธ์ง€ ํ”„๋กœ์„ธ์„œ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ๋Š” ์ด๋ฏธ์ง€๋ฅผ ๋ชจ๋ธ์ด ์˜ˆ์ƒํ•˜๋Š” ์ž…๋ ฅ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋‹จ๊ณ„์—๋Š” ํฌ๊ธฐ ์กฐ์ •, ์ •๊ทœํ™”, ์ƒ‰์ƒ ์ฑ„๋„ ๋ณด์ •, ์ด๋ฏธ์ง€์˜ ํ…์„œ ๋ณ€ํ™˜ ๋“ฑ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ๋Š” ์ด๋ฏธ์ง€ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์„ ๋ช‡ ๊ฐ€์ง€ ์ ์šฉํ•œ ๋’ค์— ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ ๋ฐ ์ด๋ฏธ์ง€ ์ฆ๊ฐ•์€ ๋ชจ๋‘ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ˜•ํ•˜์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ๋ชฉ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์ด๋ฏธ์ง€ ์ฆ๊ฐ•์€ ๊ณผ์ ํ•ฉ(over-fitting)์„ ๋ฐฉ์ง€ํ•˜๊ณ  ๋ชจ๋ธ์˜ ๊ฒฌ๊ณ ํ•จ(resiliency)์„ ๋†’์ด๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ๊ธฐ์™€ ์ƒ‰์ƒ ์กฐ์ •, ์ž๋ฅด๊ธฐ, ํšŒ์ „, ํฌ๊ธฐ ์กฐ์ •, ํ™•๋Œ€/์ถ•์†Œ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฆ๊ฐ•ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ฆ๊ฐ•์œผ๋กœ ์ด๋ฏธ์ง€์˜ ์˜๋ฏธ๊ฐ€ ๋ฐ”๋€Œ์ง€ ์•Š๋„๋ก ์ฃผ์˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ๋Š” ์ด๋ฏธ์ง€๊ฐ€ ๋ชจ๋ธ์ด ์˜ˆ์ƒํ•˜๋Š” ์ž…๋ ฅ ํ˜•์‹๊ณผ ์ผ์น˜ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค. ์ปดํ“จํ„ฐ ๋น„์ „ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•  ๋•Œ ์ด๋ฏธ์ง€๋Š” ๋ชจ๋ธ์ด ์ดˆ๊ธฐ์— ํ›ˆ๋ จ๋  ๋•Œ์™€ ์ •ํ™•ํžˆ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ์ „์ฒ˜๋ฆฌ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ด๋ฏธ์ง€ ์ฆ๊ฐ•์—๋Š” ์›ํ•˜๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋ฌด์—‡์ด๋“  ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ์—๋Š” ๋ชจ๋ธ๊ณผ ์—ฐ๊ฒฐ๋œ ImageProcessor๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Load the food101 dataset to see how you can use an image processor with computer vision datasets. See the 🤗 Datasets tutorial for how to load a dataset.

๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ์ƒ๋‹นํžˆ ํฌ๊ธฐ ๋•Œ๋ฌธ์— ๐Ÿค— Datasets์˜ split ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ›ˆ๋ จ ์„ธํŠธ์—์„œ ์ž‘์€ ์ƒ˜ํ”Œ๋งŒ ๊ฐ€์ ธ์˜ค์„ธ์š”!

>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")

๋‹ค์Œ์œผ๋กœ, ๐Ÿค— Datasets์˜ image๋กœ ์ด๋ฏธ์ง€๋ฅผ ํ™•์ธํ•ด๋ณด์„ธ์š”:

>>> dataset[0]["image"]

Load the image processor with [AutoImageProcessor.from_pretrained]:

>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

๋จผ์ € ์ด๋ฏธ์ง€ ์ฆ๊ฐ• ๋‹จ๊ณ„๋ฅผ ์ถ”๊ฐ€ํ•ด ๋ด…์‹œ๋‹ค. ์•„๋ฌด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋‚˜ ์‚ฌ์šฉํ•ด๋„ ๊ดœ์ฐฎ์ง€๋งŒ, ์ด๋ฒˆ ํŠœํ† ๋ฆฌ์–ผ์—์„œ๋Š” torchvision์˜ transforms ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด๋ณด๊ณ  ์‹ถ๋‹ค๋ฉด, Albumentations ๋˜๋Š” Kornia notebooks์—์„œ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉํ•˜๋Š”์ง€ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  1. Use Compose to chain together a couple of transforms such as RandomResizedCrop and ColorJitter. Note that for resizing, you can get the image size requirements from the image_processor. Some models expect an exact height and width, while for others only the shortest edge (shortest_edge) is defined.
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose

>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )

>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
  1. ๋ชจ๋ธ์€ ์ž…๋ ฅ์œผ๋กœ pixel_values๋ฅผ ๋ฐ›์Šต๋‹ˆ๋‹ค. ImageProcessor๋Š” ์ด๋ฏธ์ง€ ์ •๊ทœํ™” ๋ฐ ์ ์ ˆํ•œ ํ…์„œ ์ƒ์„ฑ์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐฐ์น˜ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์ด๋ฏธ์ง€ ์ฆ๊ฐ• ๋ฐ ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ๋ฅผ ๊ฒฐํ•ฉํ•˜๊ณ  pixel_values๋ฅผ ์ƒ์„ฑํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค:
>>> def transforms(examples):
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples

์œ„์˜ ์˜ˆ์—์„œ๋Š” ์ด๋ฏธ์ง€ ์ฆ๊ฐ• ์ค‘์— ์ด๋ฏธ์ง€ ํฌ๊ธฐ๋ฅผ ์กฐ์ •ํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— do_resize=False๋กœ ์„ค์ •ํ•˜๊ณ , ํ•ด๋‹น image_processor์—์„œ size ์†์„ฑ์„ ํ™œ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์ฆ๊ฐ• ์ค‘์— ์ด๋ฏธ์ง€ ํฌ๊ธฐ๋ฅผ ์กฐ์ •ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ ์ด ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ƒ๋žตํ•˜์„ธ์š”. ๊ธฐ๋ณธ์ ์œผ๋กœ๋Š” ImageProcessor๊ฐ€ ํฌ๊ธฐ ์กฐ์ •์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

์ฆ๊ฐ• ๋ณ€ํ™˜ ๊ณผ์ •์—์„œ ์ด๋ฏธ์ง€๋ฅผ ์ •๊ทœํ™”ํ•˜๋ ค๋ฉด image_processor.image_mean ๋ฐ image_processor.image_std ๊ฐ’์„ ์‚ฌ์šฉํ•˜์„ธ์š”.

  1. ๐Ÿค— Datasets์˜ set_transform๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹ค์‹œ๊ฐ„์œผ๋กœ ๋ณ€ํ™˜์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค:
>>> dataset.set_transform(transforms)
  1. ์ด์ œ ์ด๋ฏธ์ง€์— ์ ‘๊ทผํ•˜๋ฉด ์ด๋ฏธ์ง€ ํ”„๋กœ์„ธ์„œ๊ฐ€ pixel_values๋ฅผ ์ถ”๊ฐ€ํ•œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋“œ๋””์–ด ์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ชจ๋ธ์— ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!
>>> dataset[0].keys()

๋‹ค์Œ์€ ๋ณ€ํ˜•์ด ์ ์šฉ๋œ ํ›„์˜ ์ด๋ฏธ์ง€์ž…๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€๊ฐ€ ๋ฌด์ž‘์œ„๋กœ ์ž˜๋ ค๋‚˜๊ฐ”๊ณ  ์ƒ‰์ƒ ์†์„ฑ์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))

ImageProcessor๋Š” ๊ฐ์ฒด ๊ฐ์ง€, ์‹œ๋งจํ‹ฑ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜(semantic segmentation), ์ธ์Šคํ„ด์Šค ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜(instance segmentation), ํŒŒ๋†‰ํ‹ฑ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜(panoptic segmentation)๊ณผ ๊ฐ™์€ ์ž‘์—…์— ๋Œ€ํ•œ ํ›„์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์€ ๋ชจ๋ธ์˜ ์›์‹œ ์ถœ๋ ฅ์„ ๊ฒฝ๊ณ„ ์ƒ์ž๋‚˜ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ๋งต๊ณผ ๊ฐ™์€ ์˜๋ฏธ ์žˆ๋Š” ์˜ˆ์ธก์œผ๋กœ ๋ณ€ํ™˜ํ•ด์ค๋‹ˆ๋‹ค.

Padding[[pad]]

In some cases, such as with DETR, the model applies scale augmentation at training time. This may cause images in a batch to be different sizes. You can use [DetrImageProcessor.pad] from [DetrImageProcessor] and define a custom collate_fn to batch the images together.

>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch
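
What [DetrImageProcessor.pad] produces can be sketched in plain NumPy: images are zero-padded to the largest height and width in the batch, and pixel_mask marks the real pixels (pad_images is a hypothetical helper, not the library implementation):

```python
import numpy as np

def pad_images(images):
    # images: list of (C, H, W) arrays with varying H and W
    max_h = max(im.shape[1] for im in images)
    max_w = max(im.shape[2] for im in images)
    pixel_values, pixel_mask = [], []
    for im in images:
        c, h, w = im.shape
        padded = np.zeros((c, max_h, max_w), dtype=im.dtype)
        padded[:, :h, :w] = im          # top-left corner keeps the image
        mask = np.zeros((max_h, max_w), dtype=np.int64)
        mask[:h, :w] = 1                # 1 = real pixel, 0 = padding
        pixel_values.append(padded)
        pixel_mask.append(mask)
    return np.stack(pixel_values), np.stack(pixel_mask)

values, mask = pad_images([np.ones((3, 4, 6)), np.ones((3, 5, 5))])
print(values.shape, mask.shape)  # (2, 3, 5, 6) (2, 5, 6)
```

The resulting pixel_mask plays the same role for images that attention_mask plays for padded token sequences.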

Multimodal[[multimodal]]

๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž…๋ ฅ์ด ํ•„์š”ํ•œ ์ž‘์—…์˜ ๊ฒฝ์šฐ, ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ค€๋น„ํ•˜๊ธฐ ์œ„ํ•œ ํ”„๋กœ์„ธ์„œ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ์„ธ์„œ๋Š” ํ† ํฌ๋‚˜์ด์ €์™€ ํŠน์„ฑ ์ถ”์ถœ๊ธฐ์™€ ๊ฐ™์€ ๋‘ ๊ฐ€์ง€ ์ฒ˜๋ฆฌ ๊ฐ์ฒด๋ฅผ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค.

Load the LJ Speech dataset to see how you can use a processor for automatic speech recognition (ASR). (See the 🤗 Datasets tutorial for more details on how to load a dataset.)

>>> from datasets import load_dataset

>>> lj_speech = load_dataset("lj_speech", split="train")

์ž๋™ ์Œ์„ฑ ์ธ์‹(ASR)์—์„œ๋Š” audio์™€ text์—๋งŒ ์ง‘์ค‘ํ•˜๋ฉด ๋˜๋ฏ€๋กœ, ๋‹ค๋ฅธ ์—ด๋“ค์€ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])

Now take a look at the audio and text columns:

>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'

๊ธฐ์กด์— ์‚ฌ์ „ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์™€ ์ƒˆ๋กœ์šด ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ๋ฅผ ์ผ์น˜์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ๋ฅผ ๋ฆฌ์ƒ˜ํ”Œ๋งํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค!

>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))

Load a processor with [AutoProcessor.from_pretrained]:

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
  1. Create a function to process the audio data contained in array into input_values, and to tokenize text into labels. These are the model inputs:
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
  1. ์ƒ˜ํ”Œ์„ prepare_dataset ํ•จ์ˆ˜์— ์ ์šฉํ•˜์„ธ์š”:
>>> prepare_dataset(lj_speech[0])

The processor has now added input_values and labels, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!