imputer를 이용한 결측치 처리

결측치 확인은 info() 또는 pandas의 isnull()과 합계를 구하는 sum()을 통해서도 할 수 있다.

train.isnull().sum()

단일 컬럼에 대해 단순한 값으로 결측치를 채우는 것은 fillna()로 가능

# 나이 컬럼의 결측치를 0으로 채우고 정보 보기
train['Age'].fillna(0).describe()

# 나이 컬럼의 결측치를 나이의 평균으로 채우고 정보 보기
train['Age'].fillna(train['Age'].mean()).describe()

2개 이상의 컬럼에 대해 보다 세밀한 처리는 imputer를 이용
fit() 함수로 학습을 진행한 후 transform() 함수로 실제 처리 진행

from sklearn.impute import SimpleImputer

# 평균값으로 채우는 전략을 선택하여 imputer 생성
imputer = SimpleImputer(strategy='mean')

# 아래 Pclass 컬럼은 결측치가 없음. imputer는 결측치가 있는 대상에 대해서만 처리함
# 결측치에 대한 학습
imputer.fit(train[['Age', 'Pclass']])

# 실제 결측치 처리
result = imputer.transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result

fit_transform()은 fit()과 transform()을 한번에 해주는 함수이다.

# 중간값으로 채우는 전략을 선택하여 imputer 생성
imputer = SimpleImputer(strategy='median')
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result

Categorical column 데이터에 대한 결측치 처리

# 1개의 컬럼을 처리하는 경우
train['Embarked'].fillna('S')

# imputer를 이용하여 2개 이상의 컬럼을 처리할 때
imputer = SimpleImputer(strategy='most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result

이전Spring - IoC와 DI

다음AI/ML(패스트 캠퍼스) - 훈련 데이터 세트와 테스트 데이터 세트 나누기