Machine Learning with Python 8(1): Natural Language Processing Basics

디지털랫드 2024. 3. 24. 12:20

Natural Language Processing Basics


In this lesson, I will explain how to turn strings into numeric vectors that a model can process, and introduce morphological analyzers.

Lesson summary


CountVectorizer

  • Counts how many times each word appears in each sentence (BoW, Bag of Words)

TfidfVectorizer

  • Gives a higher weight to words that appear often in a particular document but rarely in the other documents
  • TF-IDF (Term Frequency - Inverse Document Frequency)
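
As a rough guide, one common textbook form of this weighting is sketched below. The symbols are assumptions for illustration: tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents that contain t, and N is the total number of documents. Libraries typically apply smoothed or normalized variants, so actual values can differ.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```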

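Morphological analyzer

  • konlpy: a Python package of morphological analyzers for Korean text (this lesson uses Okt)
  • Morpheme: the smallest unit of language that carries meaning and cannot be broken down any further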

konlpy must be installed before running the examples (pip install konlpy). In the environment used here, pip reported konlpy 0.5.2 and its dependencies, including JPype1 1.3.0, as already installed under Python 3.9.
 
 
    # Assumed setup: `tokenizer` is konlpy's Okt analyzer
    from konlpy.tag import Okt
    tokenizer = Okt()

    # Tokenization (morpheme units)
    # The sentence reads roughly: "AIFFEL, an AI school where we explore and grow together"
    text = "함께 탐험하며 성장하는 AI 학교 AIFFEL"
    tokenizer.morphs(text)

    # Tokenization (nouns only)
    tokenizer.nouns(text)

    # Tokenization (part-of-speech tagging)
    tokenizer.pos(text)

1. CountVectorizer


[Reminder] Counts how many times each word appears in each sentence (BoW, Bag of Words)

 
 
    # Assumed setup: create the vectorizer used below
    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer()

    # Word tokenization (Okt)
    words = tokenizer.morphs(text)

    # Fit: each token in `words` is treated as its own document
    vect.fit(words)

    # Learned vocabulary
    vect.get_feature_names_out()

    # Word-to-index dictionary
    vect.vocabulary_

    # Vocabulary size
    len(vect.vocabulary_)

    # Encoding
    df_t = vect.transform(words)

    # Encoded data matrix
    df_t.toarray()

    # Vocabulary and features
    import pandas as pd
    pd.DataFrame(df_t.toarray(), columns=vect.get_feature_names_out())

    # Hypothetical test sentence; the original value of `test` is not shown
    test = "AI 공부하며 함께 성장해요"

    # Word tokenization (Okt)
    words = tokenizer.morphs(test)
    words

    # Encode the test tokens with the fitted vocabulary
    test_t = vect.transform(words)
    test_t.toarray()

    pd.DataFrame(test_t.toarray(), columns=vect.get_feature_names_out())
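
One detail is worth checking when encoding morphemes: with default settings, CountVectorizer lowercases text and keeps only tokens of two or more word characters, so single-character morphemes are dropped. The sketch below compares the defaults with a custom tokenizer and shows that text with no known tokens encodes to all zeros. Here `split_tokens` is a hypothetical stand-in for a morphological analyzer such as Okt().morphs, and the sentences are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_tokens(doc):
    # Hypothetical stand-in for a morphological analyzer (e.g. Okt().morphs)
    return doc.split()

docs = ["나 는 AI 학교 에 간다"]  # pre-segmented toy sentence (illustrative)

# Defaults: lowercase the text and keep only tokens of 2+ word characters
default_vect = CountVectorizer().fit(docs)
default_vocab = sorted(default_vect.vocabulary_)

# Custom tokenizer: keep every token exactly as the tokenizer returns it
custom_vect = CountVectorizer(
    tokenizer=split_tokens, token_pattern=None, lowercase=False
).fit(docs)
custom_vocab = sorted(custom_vect.vocabulary_)

# A document made only of unseen tokens encodes to an all-zero row
unseen = custom_vect.transform(["처음 보는 문장"]).toarray()

print(default_vocab)
print(custom_vocab)
print(unseen)
```

Passing the tokenizer to the vectorizer also lets whole sentences serve as documents, instead of fitting on a list of individual tokens.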

2. TfidfVectorizer


[Reminder] Gives a higher weight to words that appear often in a particular document but rarely in the other documents

 
 
    # Assumed setup: import the TF-IDF vectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # tf-idf
    vect = TfidfVectorizer()
    words = tokenizer.morphs(text)
    vect.fit(words)
    vect.vocabulary_

    # TF-IDF weights for each token
    vect.transform(words).toarray()