Fixing pandas import problems when using parallel_apply(): NameError: not defined error

별수호자룰루 2025. 2. 4. 10:04

I write these posts while studying on my own. If you know a more efficient fix or spot a mistake, please leave a comment!

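Fixing the NameError with parallel_apply(), part 1

While using the parallel functions, I learned that a module has to be available inside the worker processes too, not only in the main process.

I first wrote the code below and got NameError: name 're' is not defined, even though I had definitely imported re at the top.

# Code that raises the error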
def remove_one_char_words(text):
    cleaned_text = re.sub(r'\b[가-힣]\b', '', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    
    return cleaned_text

df['cleaned'] = df['split_sentence'].parallel_apply(remove_one_char_words)

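So I looked it up, and apparently the worker processes do not share the global names (including imported modules) that the main process set up. There seemed to be two ways around it.

1. Import the module inside the function

2. Run everything as a script that contains all the modules ('__main__')

I'm working in a notebook file, so option 2 would probably be possible too, but I just went with the simple and reliable option.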
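The difference between the two versions can be imitated without starting any workers. This is only a rough sketch: it rebuilds each function with an empty global namespace, which is my stand-in for a worker where the top-level import never ran. The helper in_empty_namespace and both function names are made up for this illustration, and a real worker process may behave differently in its details.

```python
import builtins
import re
import types


def uses_global_re(text):
    # Relies on the module-level `import re` above.
    return re.sub(r"\s+", " ", text).strip()


def uses_local_re(text):
    import re  # imported inside, so the function carries its own dependency
    return re.sub(r"\s+", " ", text).strip()


def in_empty_namespace(func):
    """Rebuild func so its globals hold only the builtins (hypothetical helper)."""
    return types.FunctionType(func.__code__, {"__builtins__": builtins}, func.__name__)


try:
    in_empty_namespace(uses_global_re)("a   b")
    global_result = "ok"
except NameError as exc:
    # The global name `re` cannot be found in the stripped namespace.
    global_result = str(exc)

local_result = in_empty_namespace(uses_local_re)("a   b ")

print(global_result)
print(local_result)
```

The version with the import inside the function keeps working because the import runs every time the function is called, wherever that happens to be.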

# Fixed code
def remove_one_char_words(text):
    import re
    cleaned_text = re.sub(r'\b[가-힣]\b', '', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    
    return cleaned_text

df['cleaned'] = df['split_sentence'].parallel_apply(remove_one_char_words)

Ta-da, super simple fix, done.
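Before running it in parallel, the cleaning function can be checked on a couple of plain strings. The sample sentences here are made up for illustration and are not from my data.

```python
def remove_one_char_words(text):
    import re  # kept inside the function so it also works in a worker process
    # Drop Hangul words that are exactly one syllable long.
    cleaned_text = re.sub(r'\b[가-힣]\b', '', text)
    # Collapse the whitespace left behind and trim both ends.
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text


samples = ["나 는 학교 에 간다", "데이터 분석"]
cleaned = [remove_one_char_words(s) for s in samples]
print(cleaned)
```

Single-syllable words such as 나, 는 and 에 are removed, while longer words are left alone.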

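But then another problem came up.

Fixing the NameError with parallel_apply(), part 2

I imported Okt inside the function the same way and assumed it would obviously work..!

# Code that raises the error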
def tokens_sep(text, stopwords= stopword_path):
    from konlpy.tag import Okt
    
    okt = Okt()
    tags = okt.pos(text, stem=True)
    
    nouns = [word for word, tag in tags if tag =='Noun']
    verbs = [word for word, tag in tags if tag == 'Verb']
    adjs = [word for word, tag in tags if tag == 'Adjective']
    
    return nouns, verbs, adjs

df[['nouns', 'verbs', 'adjs']] = df['cleaned'].parallel_apply(lambda x:pd.Series(tokens_sep(x), index=['nouns','verbs', 'adjs']))

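This time I got NameError: name 'pd' is not defined.

Inside the lambda I was turning the returned lists into a Series, and that is where pandas could not be found, because the lambda itself looks up pd as a global name. Argh.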
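One way to see which names a function or lambda expects to find in its globals is to look at its bytecode with the standard dis module. This is just a sketch for inspection: global_lookups is a helper I made up, tokens_sep is cut down to a stub, and nothing here is actually called, so pandas does not even need to be installed to run it.

```python
import dis


def global_lookups(func):
    """List the names func resolves through its globals (hypothetical helper)."""
    seen = []
    for ins in dis.get_instructions(func):
        if ins.opname == "LOAD_GLOBAL" and ins.argval not in seen:
            seen.append(ins.argval)
    return seen


def tokens_sep(text):
    # Stub of the real function: the import makes `pd` a local name.
    import pandas as pd
    return pd.Series({"nouns": [text]})


# Same shape as the lambda passed to parallel_apply in the failing code.
wrapper = lambda x: pd.Series(tokens_sep(x), index=["nouns", "verbs", "adjs"])

needed_by_lambda = global_lookups(wrapper)
needed_by_function = global_lookups(tokens_sep)

print(needed_by_lambda)
print(needed_by_function)
```

The lambda depends on the global names pd and tokens_sep, while the function with the import inside depends on no globals at all, which matches where the error showed up.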

So I just changed the function to return a Series directly. ~~This wouldn't happen if I had written a main function..~~

# Fixed code
def tokens_sep(text, stopwords= stopword_path):
    import pandas as pd
    from konlpy.tag import Okt
    
    okt = Okt()
    tags = okt.pos(text, stem=True)
    
    nouns = [word for word, tag in tags if tag =='Noun']
    verbs = [word for word, tag in tags if tag == 'Verb']
    adjs = [word for word, tag in tags if tag == 'Adjective']
    
    return pd.Series({'nouns': nouns, 'verbs':verbs, 'adjs':adjs})

df[['nouns', 'verbs', 'adjs']] = df['cleaned'].parallel_apply(tokens_sep)

Now it works!
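As a side note, the three list comprehensions in tokens_sep each walk over the tag list once. Below is a small sketch of doing the split in a single pass. The helper split_by_tag and the sample tag list are made up for illustration; the list only stands in for the (word, tag) pairs that okt.pos returns, so konlpy is not needed to try it.

```python
from collections import defaultdict


def split_by_tag(tags, wanted=("Noun", "Verb", "Adjective")):
    """Group words by POS tag in one pass over the tags (hypothetical helper)."""
    groups = defaultdict(list)
    for word, tag in tags:
        if tag in wanted:
            groups[tag].append(word)
    # Return the groups in the same order as `wanted`.
    return [groups[tag] for tag in wanted]


# Hand-written stand-in for the output of okt.pos(text, stem=True)
sample_tags = [
    ("학교", "Noun"),
    ("에", "Josa"),
    ("가다", "Verb"),
    ("예쁘다", "Adjective"),
    ("친구", "Noun"),
]

nouns, verbs, adjs = split_by_tag(sample_tags)
print(nouns, verbs, adjs)
```

The result has the same shape as the three lists that tokens_sep builds, so it could be dropped into the returned Series in the same way.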

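From now on, when this kind of error comes up:

for a simple job I'll just import the module inside the function, and for anything bigger it seems better to write a main function.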

License: Attribution, NonCommercial, NoDerivatives