Posts tagged 'tokenizer'

[Transformers] Exploring the Bert Tokenizer

The Transformers package is one of the most widely used packages in natural language processing (NLP). It makes building models such as BERT very convenient. This post takes apart the tokenizer that Transformers provides for BERT. BertTokenizer: BertTokenizer inherits from PreTrainedTokenizer. PreTrainedTokenizer will be covered later; for now it is enough to think of it as a pre-trained tokenizer. BertTokenizer takes a variety of parameters, such as vocab_file, do_lower_case, and unk_token, and focusing on the important ones ..

Python/Transformers 2022.12.04
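
The excerpt above names the do_lower_case and unk_token parameters. As a rough illustration of what they control, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece-style tokenizers. It is a toy written for this listing, not the Transformers implementation: the function name `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical, and a real BertTokenizer loads its vocabulary from vocab_file.

```python
# Toy illustration only: NOT the Transformers implementation.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", do_lower_case=True):
    """Greedy longest-match-first split of a single word into subword pieces."""
    if do_lower_case:
        word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to unk_token
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##izer", "##s", "play", "##ing"}  # hypothetical vocabulary

print(wordpiece_tokenize("Tokenizers", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, turning off lowercasing makes "Tokenizers" fall back to the unknown token, because only the lowercase piece "token" is in the vocabulary.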

[NLP] Stemming and Lemmatization

Stemming refers to extracting a word's stem, and lemmatization refers to extracting its lemma (dictionary headword). For more detail on the theory, a more thorough write-up is available here. Stemming from nltk.stem import PorterStemmer from nltk.tokenize import word_tokenize stemmer = PorterStemmer() sentence = "This was not the map we found in Billy Bones's chest, but an accurate copy, \ complete in all things--names and heights and soundings--with the single exception of the red \ cros..

Deep Learning/Natural Language Processing 2022.01.18
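
To make the difference between the two concrete, the following is a minimal standard-library sketch. It is not the Porter algorithm used by nltk's PorterStemmer: `toy_stem` strips a few suffixes by rule, and `toy_lemmatize` looks words up in a small hand-written dictionary. The suffix list and the `TOY_LEMMAS` table are both assumptions made for illustration.

```python
# Toy illustration only: a rule-based stemmer versus a dictionary-based lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in this order

# Hypothetical lemma dictionary; a real lemmatizer relies on a full lexicon.
TOY_LEMMAS = {"was": "be", "found": "find", "things": "thing", "names": "name"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for w in ["was", "found", "names", "soundings"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stemmer can produce strings that are not words (here "names" becomes "nam"), while the lemmatizer returns real dictionary forms but only for words it knows.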

[NLP] Tokenization

Tokenization splits text into smaller units. nltk is used to tokenize English, and konlpy is used to tokenize Korean. from nltk.tokenize import word_tokenize from nltk.tokenize import WordPunctTokenizer from torchtext.data import get_tokenizer sentence = "Don't be fooled by the dark sounding name, \ Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop." print('word_tokenize', word_tokenize(se..

Deep Learning/Natural Language Processing 2022.01.18
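
Different tokenizers disagree mainly on how they treat punctuation and apostrophes. The sketch below uses only the standard library to contrast two simple strategies: splitting on whitespace, and splitting into runs of word characters versus runs of punctuation (similar in spirit to a word-and-punctuation tokenizer). The function names are hypothetical and the results are not guaranteed to match any nltk tokenizer.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def word_punct_tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


sample = "Don't be fooled, Mr. Jone's Orphanage is cheery."
print(whitespace_tokenize(sample))
print(word_punct_tokenize(sample))
```

Whitespace splitting keeps "Don't" and "fooled," as single tokens, while the regex version breaks "Don't" into three pieces and separates the comma.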

[NLP] Lexical Analysis

Lexical Analysis: Lexical analysis means, literally, analyzing text at the word or token level, the smallest level at which meaning is preserved. It converts an ordered sequence of characters into tokens, where each token is a character string that carries meaning. In NLP the morpheme is the most basic unit, while text mining sometimes also treats whole words as tokens. Process of lexical analysis - Tokenizing - Part-of-Speech (POS) tagging - Additional analysis : NER, noun phrase reco..

Deep Learning/Natural Language Processing 2021.07.20
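
The first two steps of the process listed above, tokenizing and POS tagging, can be sketched as a tiny pipeline. This is an illustration under strong assumptions: the tagger is a plain dictionary lookup over a hypothetical `TOY_LEXICON`, whereas real POS taggers are statistical or neural and use context.

```python
import re

# Hypothetical word-to-tag lexicon; unknown tokens get the tag "UNK".
TOY_LEXICON = {
    "the": "DET", "cat": "NOUN", "sat": "VERB",
    "on": "ADP", "mat": "NOUN", ".": "PUNCT",
}


def tokenize(text):
    """Step 1: turn a character sequence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def pos_tag(tokens):
    """Step 2: attach a part-of-speech tag to each token by lexicon lookup."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "UNK")) for tok in tokens]


tokens = tokenize("The cat sat on the mat.")
print(tokens)
print(pos_tag(tokens))
```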


ok-lab

tokenizer 4
