Data Analysis Code Transcription (2020.07.22)¶

- Source: Titanic Data Science Solutions (by Manav Sehgal)¶

- Contents¶

1. Problem definition¶

2. Defining the training and test data¶

3. Data analysis¶

4. Data preprocessing¶
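Problem definition¶

On April 15, 1912, the Titanic sank in the North Atlantic after colliding with an iceberg.
Of the 2224 passengers and crew on board, 1502 died (a survival rate of about 32%).
A shortage of lifeboats was one reason for the high death toll.
Some groups (for example children and the upper class) had comparatively high survival rates.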
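
The survival rate quoted above follows directly from the two counts. A minimal check of the arithmetic, using only the figures given in the text (the variable names are illustrative):

```python
# Figures quoted in the problem definition
total_on_board = 2224
deaths = 1502

survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"survivors: {survivors}, survival rate: {survival_rate:.1%}")
```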

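Goal: use passenger information to predict whether each Titanic passenger survived or died in the disaster.
Training data: passenger information together with the survival outcome.
Test data: passenger information only.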

Defining the training and test data¶

Loading the training and test data & checking the available variables¶

import pandas as pd

train_df = pd.read_csv("./DATA/train.csv")
test_df  = pd.read_csv("./DATA/test.csv")

print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

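Categorical
- Nominal: survival (Survived), sex (Sex), cabin (Cabin), port of embarkation (Embarked)
- Ordinal: ticket class (Pclass)
Numerical
- Discrete: number of siblings and spouses aboard (SibSp), number of parents and children aboard (Parch)
- Continuous: age (Age, which includes fractional values such as 0.42), fare (Fare)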

Training data info¶

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Missing values in the training data: Cabin (687) > Age (177) > Embarked (2)
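
The missing counts can also be computed directly instead of being read off the `info()` output. The sketch below uses a small made-up frame (`toy_df` is a stand-in for `train_df`, and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column, keep only columns with gaps, largest first
missing = toy_df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the real data, the same two lines on `train_df` should reproduce the ranking listed above.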

Test data info¶

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Missing values in the test data: Cabin (327) > Age (86) > Fare (1)

Inspecting the training data¶

train_df.head()

train_df.tail()

Data analysis¶

Summary statistics for the int and float variables in the training data¶

train_df.describe()

Number of samples in the training data: 891 (count = 891)
Survival rate in the training sample: 38.4% (mean of Survived = 0.384)

Summary statistics for the object variables in the training data¶

train_df.describe(include = ["O"])

Number of men in the training data: 577 (top = male, freq = 577)
Most common port of embarkation in the training data: S, 644 passengers (top = S, freq = 644)

Survival rate by ticket class in the training data¶

train_df[["Pclass", "Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by = "Survived", ascending = False)
# as_index = False keeps Pclass as a regular column instead of the index; ascending = False sorts in descending order

The better the ticket class, the higher the survival rate

train_df[["Pclass", "Survived"]].groupby(["Pclass"], as_index = True).mean().sort_values(by = "Survived", ascending = False)
# as_index = True uses Pclass as the index
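
The two `as_index` settings differ only in where the group key ends up. A small sketch on made-up data (`toy_df` is hypothetical, not the Titanic data) shows the difference:

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# as_index=False: Pclass stays a regular column, the index is 0..n-1
flat = toy_df.groupby("Pclass", as_index=False)["Survived"].mean()

# as_index=True: Pclass becomes the index of the result
indexed = toy_df.groupby("Pclass", as_index=True)["Survived"].mean()

print(flat)
print(indexed)
```

The flat form is convenient for display and further column-based operations; the indexed form allows direct lookups such as `indexed.loc[1]`.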

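Survival rate by sex in the training data¶

train_df[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

Women survived at a higher rate than men

Survival rate by number of siblings and spouses aboard in the training data¶

train_df[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

Survival rate by number of parents and children aboard in the training data¶

train_df[["Parch", "Survived"]].groupby(["Parch"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

Passengers with a few companions (1 to 2 siblings/spouses, or 1 to 3 parents/children) survived at higher rates than those travelling with none or with many

Age distribution by survival in the training data¶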

import seaborn as sns
import matplotlib.pyplot as plt

g = sns.FacetGrid(train_df, col = "Survived") # one column of plots per survival outcome
g.map(plt.hist, "Age", bins = 20) # histogram of the age distribution with 20 bins

<seaborn.axisgrid.FacetGrid at 0x294d4b9e448>

Infants aged 4 or younger had a high survival rate
Passengers aged 15 to 25 had a low survival rate

Age distribution by ticket class and survival in the training data¶

g = sns.FacetGrid(train_df, col = "Survived", row = "Pclass", hue = "Pclass", height = 2.2, aspect = 1.6)
# columns: survival; rows and colors (hue): ticket class; width of each facet = height * aspect
g.map(plt.hist, "Age", alpha = 0.5, bins = 20) # alpha (transparency): 0.5
g.add_legend() # add a legend

<seaborn.axisgrid.FacetGrid at 0x294d790f288>

Third class had the most passengers and the lowest survival rate
Most infants in second class survived
First class had a comparatively high survival rate

Survival rate by port of embarkation and ticket class in the training data¶

g = sns.FacetGrid(train_df, col = "Embarked", height = 2.2, aspect = 1.6)
g.map(sns.pointplot, "Pclass", "Survived", "Sex", palette = "deep", order = [1, 2, 3], hue_order = ["male", "female"])
# point plot; x: ticket class, y: survival, color: sex, x-axis order: [1, 2, 3], color order: [male, female]
g.add_legend()

<seaborn.axisgrid.FacetGrid at 0x294d7cf21c8>

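The error bars are seaborn's default for pointplot, a 95% confidence interval around the mean, not one standard deviation
For ports C and Q, the plot suggests that third-class men may have survived at a higher rate than second-class men

Fare by port of embarkation, survival, and sex in the training data¶

g = sns.FacetGrid(train_df, row = "Embarked", col = "Survived", hue = "Sex", height = 2.2, aspect = 1.6)
g.map(sns.barplot, "Sex", "Fare", alpha = 0.5, ci = None, order = ["male", "female"])
# bar plot; x: sex, y: fare; ci = None hides the error bars (seaborn 0.12 and later use errorbar = None instead)
g.add_legend()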

<seaborn.axisgrid.FacetGrid at 0x294d7e9bec8>

For ports S and C, passengers who survived paid a comparatively higher average fare

Data preprocessing¶

Dropping unused variables (Ticket, Cabin)¶

print("Before", train_df.shape, test_df.shape)

train_df = train_df.drop(["Ticket", "Cabin"], axis = 1) # drop columns (axis = 1)
test_df  =  test_df.drop(["Ticket", "Cabin"], axis = 1)

print("After", train_df.shape, test_df.shape)

Before (891, 12) (418, 11)
After (891, 10) (418, 9)

Extracting the title from each name into a Title column¶

train_df["Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

train_df["Title"] = train_df.Name.str.extract(r"([A-Za-z]+)\.", expand = False)
test_df["Title"]  =  test_df.Name.str.extract(r"([A-Za-z]+)\.", expand = False)
# raw string avoids an invalid-escape warning for "\."; the pattern captures the first run of letters followed by a period; expand = False returns a Series
train_df

pandas.Series.str.extract usage
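
To see what the pattern captures, here is a small self-contained check on three names taken from the `train_df["Name"]` listing above. It is only a sketch of the extraction step, not the full pipeline:

```python
import pandas as pd

# Three names copied from the train_df["Name"] output shown earlier
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# The capture group keeps the letters; the trailing "\." must match but is not kept
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Rev']
```

The surname is skipped because it is followed by a comma rather than a period, so the first match is always the title.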

Building a frequency table¶

pd.crosstab(train_df["Title"], train_df["Sex"]) # rows: title, columns: sex

Replacing rare titles¶

train_df["Title"] = train_df["Title"].replace(["Capt", "Col", "Don", "Jonkheer", "Major", "Rev", "Sir"], "MaleRare")
# in the training data, replace the titles "Capt", "Col", "Don", "Jonkheer", "Major", "Rev", "Sir" with "MaleRare"
train_df["Title"] = train_df["Title"].replace(["Countess", "Lady", "Mlle", "Mme", "Ms"], "FemaleRare")
train_df["Title"] = train_df["Title"].replace(["Dr"], "Rare")

test_df["Title"] = test_df["Title"].replace(["Capt", "Col", "Don", "Jonkheer", "Major", "Rev", "Sir"], "MaleRare")
test_df["Title"] = test_df["Title"].replace(["Countess", "Lady", "Mlle", "Mme", "Ms"], "FemaleRare")
test_df["Title"] = test_df["Title"].replace(["Dr"], "Rare")

train_df
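
Because the same three replacements are applied to both frames, one possible refactoring is to keep the groups in a single dictionary and apply it through a helper. This is a sketch, not the original notebook's approach; `rare_groups`, `title_to_group`, and `group_titles` are hypothetical names introduced here:

```python
import pandas as pd

# Same groups as in the replace() calls above, collected in one place
rare_groups = {
    "MaleRare": ["Capt", "Col", "Don", "Jonkheer", "Major", "Rev", "Sir"],
    "FemaleRare": ["Countess", "Lady", "Mlle", "Mme", "Ms"],
    "Rare": ["Dr"],
}

# Invert to {original title: group label}
title_to_group = {
    title: group for group, titles in rare_groups.items() for title in titles
}

def group_titles(titles: pd.Series) -> pd.Series:
    """Replace rare titles with their group label; other titles are unchanged."""
    return titles.replace(title_to_group)

sample = pd.Series(["Mr", "Rev", "Mlle", "Dr", "Miss"])
print(group_titles(sample).tolist())
# ['Mr', 'MaleRare', 'FemaleRare', 'Rare', 'Miss']
```

With such a helper, the training and test frames would each need a single call, which keeps the two in sync if the groups change.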

Survival rate by title in the training data¶

train_df[["Title", "Survived"]].groupby(["Title"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

Passengers with female titles (FemaleRare, Mrs, Miss) had comparatively high survival rates

Converting titles to numbers¶

title_mapping = {"FemaleRare": 1, "Mrs": 2, "Miss": 3, "Master": 4, "Rare": 5, "MaleRare": 6, "Mr": 7} # number assigned to each title

train_df["Title"] = train_df["Title"].map(title_mapping)
train_df["Title"] = train_df["Title"].fillna(0) # titles missing from the mapping become NaN and are filled with 0

test_df["Title"] = test_df["Title"].map(title_mapping)
test_df["Title"] = test_df["Title"].fillna(0)

train_df
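
The `fillna(0)` step matters whenever a title is absent from the mapping, because `map` returns NaN for unknown keys and turns the column into floats. The sketch below uses `Dona` as an example of a title that is not in any of the lists above; the final `astype(int)` is an optional addition, not part of the original code:

```python
import pandas as pd

title_mapping = {"FemaleRare": 1, "Mrs": 2, "Miss": 3, "Master": 4,
                 "Rare": 5, "MaleRare": 6, "Mr": 7}

# "Dona" is not a key of title_mapping, so map() yields NaN for it
sample = pd.Series(["Mr", "Miss", "Dona"])
mapped = sample.map(title_mapping)

# Fill the unknown title with 0 and restore an integer dtype
encoded = mapped.fillna(0).astype(int)
print(encoded.tolist())  # [7, 3, 0]
```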

Dropping unused variables (Name, PassengerId)¶

print("Before", train_df.shape, test_df.shape)

train_df = train_df.drop(["Name", "PassengerId"], axis = 1)
test_df  =  test_df.drop(["Name"], axis = 1)

print("After", train_df.shape, test_df.shape)

Before (891, 11) (418, 10)
After (891, 9) (418, 9)

Converting sex to numbers¶

sex_mapping = {"male": 0, "female": 1}

train_df["Sex"] = train_df["Sex"].map(sex_mapping).astype(int)
test_df["Sex"]  =  test_df["Sex"].map(sex_mapping).astype(int)

train_df.head()

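Filling in missing ages¶

Method: estimate age from the variables related to it, sex and ticket class, using the median of each group

Checking the age distribution by sex and ticket class¶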

g = sns.FacetGrid(train_df, row = "Sex", col = "Pclass", hue = "Sex", height = 2.2, aspect = 1.6)
g.map(plt.hist, "Age", alpha = 0.5, bins = 20)
g.add_legend()

<seaborn.axisgrid.FacetGrid at 0x294d80ffc48>

Computing the median age by sex and ticket class in the training data¶

import numpy as np

guess_ages = np.zeros((2, 3)) # table of estimated ages: sex (2) x ticket class (3)

for i in range(guess_ages.shape[0]):
    for j in range(guess_ages.shape[1]):
        guess_df = train_df[(train_df["Sex"] == i) & (train_df["Pclass"] == j + 1)]["Age"].dropna()
        # ages in the training data where sex is i and ticket class is j + 1, with missing values dropped
        age_guess = guess_df.median() # median of the extracted group
        guess_ages[i, j] = round(age_guess) # round to a whole number and store it in the table

guess_ages

array([[40., 30., 25.],
       [35., 28., 22.]])

Within each ticket class, the median age of men is higher than that of women
The better the ticket class, the higher the median age
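
The same sex-by-class table of medians can be produced without explicit loops. The following sketch uses `groupby` and `unstack` on a small invented frame (`toy_df` is hypothetical and much smaller than the real data):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (Sex: 0 = male, 1 = female)
toy_df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1, 0, 1],
    "Pclass": [1, 1, 3, 1, 3, 3, 3, 1],
    "Age":    [40.0, 44.0, 25.0, 35.0, 20.0, 24.0, np.nan, np.nan],
})

# median() skips NaN by default; unstack() turns Pclass into columns
median_table = toy_df.groupby(["Sex", "Pclass"])["Age"].median().unstack()
print(median_table)
```

Rows are indexed by `Sex` and columns by `Pclass`, so `median_table.loc[0, 1]` is the median age of first-class men in the toy data.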

Filling in the missing ages¶

for i in range(guess_ages.shape[0]):
    for j in range(guess_ages.shape[1]):
        train_df.loc[(train_df["Age"].isnull()) & (train_df["Sex"] == i) & (train_df["Pclass"] == j + 1), "Age"] = guess_ages[i, j]
        # in the training data, rows with a missing age, sex i, and ticket class j + 1 receive the matching estimate
        test_df.loc[ ( test_df["Age"].isnull()) & ( test_df["Sex"] == i) & ( test_df["Pclass"] == j + 1), "Age"] = guess_ages[i, j]

train_df["Age"] = train_df["Age"].astype(int) # convert age to integer
test_df[ "Age"] =  test_df["Age"].astype(int)

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int32  
 3   Age       891 non-null    int32  
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  889 non-null    object 
 8   Title     891 non-null    int64  
dtypes: float64(1), int32(2), int64(5), object(1)
memory usage: 55.8+ KB

All missing ages have been filled in (Age now shows 891 non-null values)
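
For a single frame, the fill step can also be written with `groupby(...).transform`, which broadcasts each group's median back to the original rows. This is an alternative sketch on invented data, not the method used in the code above, and it fills from the frame's own medians rather than from a separate training set:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the last two ages are missing
toy_df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1, 0, 1],
    "Pclass": [1, 1, 3, 1, 3, 3, 3, 1],
    "Age":    [40.0, 44.0, 25.0, 35.0, 20.0, 24.0, np.nan, np.nan],
})

# One median per row, taken from that row's (Sex, Pclass) group
group_median = toy_df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing entries are replaced; then convert to integers
toy_df["Age"] = toy_df["Age"].fillna(group_median.round()).astype(int)
print(toy_df["Age"].tolist())  # [40, 44, 25, 35, 20, 24, 25, 35]
```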

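Survival rate by age band¶

train_df["AgeBand"] = pd.cut(train_df["Age"], 5) # split the training ages into 5 equal-width bands
train_df[["AgeBand", "Survived"]].groupby(["AgeBand"], as_index = False).mean().sort_values(by = "AgeBand", ascending = True)

( ): endpoint excluded (strictly greater than or less than)
[ ]: endpoint included (greater than or equal to, less than or equal to)
Passengers aged 16 or younger had the highest survival rate at 55%, while those older than 64 had a survival rate of about 9%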
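
The interval notation can be checked on a few boundary values. In this sketch the ages are chosen by hand so that the range is 0 to 80, which gives the same cut points (16, 32, 48, 64) as in the text:

```python
import pandas as pd

# Hand-picked ages around the band boundaries
ages = pd.Series([0, 16, 17, 32, 64, 65, 80])

# Five equal-width bands; each interval is open on the left, closed on the right
bands = pd.cut(ages, 5)
print(bands.cat.categories)

# Integer position of each age's band (0 = youngest band)
codes = bands.cat.codes.tolist()
print(codes)  # [0, 0, 1, 1, 3, 4, 4]
```

An age of exactly 16 lands in the first band and 17 in the second, which is what "(16, 32]" expresses.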

Adding the age band to the data and converting it to numbers¶

train_df["AgeBand"] = pd.cut(train_df["Age"], bins = 5, labels = [0, 1, 2, 3, 4])

test_df.loc[                        (test_df["Age"] <= 16), "AgeBand"] = 0
test_df.loc[(test_df["Age"] > 16) & (test_df["Age"] <= 32), "AgeBand"] = 1
test_df.loc[(test_df["Age"] > 32) & (test_df["Age"] <= 48), "AgeBand"] = 2
test_df.loc[(test_df["Age"] > 48) & (test_df["Age"] <= 64), "AgeBand"] = 3
test_df.loc[(test_df["Age"] > 64)                         , "AgeBand"] = 4
# the test data is banded with the cut points derived from the training data
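
Instead of typing the thresholds by hand, the cut points can be captured from the training data with `retbins=True` and reused on the test data. The sketch below uses made-up age series; replacing the outer edges with infinities is an assumption added here so that test ages outside the training range still receive a band:

```python
import numpy as np
import pandas as pd

# Toy series, invented for illustration
train_age = pd.Series([0, 10, 25, 40, 55, 80])
test_age = pd.Series([5, 16, 33, 70, 90])  # 90 is outside the training range

# retbins=True also returns the bin edges computed from the training ages
_, edges = pd.cut(train_age, bins=5, retbins=True)

# Open the outer edges so out-of-range test values are not turned into NaN
edges[0], edges[-1] = -np.inf, np.inf

test_band = pd.cut(test_age, bins=edges, labels=[0, 1, 2, 3, 4]).astype(int)
print(test_band.tolist())  # [0, 0, 2, 4, 4]
```

This also yields an integer column directly, whereas assigning through `.loc` on a new column produces floats.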

test_df

Adding family size to the data & survival rate by family size¶

train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1 # 1: the passenger themselves
test_df[ "FamilySize"] =  test_df["SibSp"] +  test_df["Parch"] + 1

train_df[["FamilySize", "Survived"]].groupby(["FamilySize"], as_index = False).mean().sort_values(by = "FamilySize")

Adding a travelling-alone flag to the data & survival rate by that flag¶

train_df["IsAlone"] = 0
train_df.loc[train_df["FamilySize"] == 1, "IsAlone"] = 1

test_df["IsAlone"] = 0
test_df.loc[test_df["FamilySize"] == 1, "IsAlone"] = 1

train_df[["IsAlone", "Survived"]].groupby(["IsAlone"], as_index = False).mean()
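
The two-step assignment of `IsAlone` can be condensed into one boolean comparison. A minimal sketch on invented values (`toy_df` is hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the passenger themselves
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# True where the passenger travels alone, converted to 1/0
toy_df["IsAlone"] = (toy_df["FamilySize"] == 1).astype(int)
print(toy_df)
```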

Adding an artificial (interaction) variable¶

train_df["Age*Class"] = train_df["Age"] * train_df["Pclass"]
test_df[ "Age*Class"] =  test_df["Age"] *  test_df["Pclass"]

train_df.loc[:, ["Age*Class", "Age", "Pclass"]].head(10)

Filling in the missing port of embarkation¶

Method: fill with the most frequent port of embarkation

Finding the most frequent port of embarkation¶

freq_port = train_df["Embarked"].dropna().mode()[0]

freq_port

'S'

Filling in the missing port of embarkation & survival rate by port of embarkation¶

train_df["Embarked"] = train_df["Embarked"].fillna(freq_port)
test_df[ "Embarked"] =  test_df["Embarked"].fillna(freq_port)

train_df[["Embarked", "Survived"]].groupby(["Embarked"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

Converting the port of embarkation from letters to numbers¶

embarked_mapping = {"S": 0, "C": 1, "Q": 2}

train_df["Embarked"] = train_df["Embarked"].map(embarked_mapping).astype(int)
test_df[ "Embarked"] =  test_df["Embarked"].map(embarked_mapping).astype(int)

train_df.head()

Filling in the missing fare in the test data¶

Method: fill with the median of the training data

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    int32  
 3   Age          418 non-null    int32  
 4   SibSp        418 non-null    int64  
 5   Parch        418 non-null    int64  
 6   Fare         417 non-null    float64
 7   Embarked     418 non-null    int32  
 8   Title        418 non-null    float64
 9   AgeBand      418 non-null    float64
 10  FamilySize   418 non-null    int64  
 11  IsAlone      418 non-null    int64  
 12  Age*Class    418 non-null    int64  
dtypes: float64(3), int32(3), int64(7)
memory usage: 37.7 KB

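test_df["Fare"] = test_df["Fare"].fillna(train_df["Fare"].median()) # assign the result back instead of calling fillna(inplace = True) on a column selection; median() already skips NaN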

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    int32  
 3   Age          418 non-null    int32  
 4   SibSp        418 non-null    int64  
 5   Parch        418 non-null    int64  
 6   Fare         418 non-null    float64
 7   Embarked     418 non-null    int32  
 8   Title        418 non-null    float64
 9   AgeBand      418 non-null    float64
 10  FamilySize   418 non-null    int64  
 11  IsAlone      418 non-null    int64  
 12  Age*Class    418 non-null    int64  
dtypes: float64(3), int32(3), int64(7)
memory usage: 37.7 KB

Survival rate by fare band¶

train_df["FareBand"] = pd.qcut(train_df["Fare"], q = 4) # 4 fare bands with roughly equal numbers of passengers
train_df[["FareBand", "Survived"]].groupby(["FareBand"], as_index = False).mean().sort_values(by = "FareBand", ascending = True)

The higher the fare band, the higher the survival rate

Adding the fare band to the data & converting it to numbers¶

train_df["FareBand"] = pd.qcut(train_df["Fare"], q = 4, labels = [0, 1, 2, 3])

test_df.loc[                             (test_df["Fare"] <=   7.91), "FareBand"] = 0
test_df.loc[(test_df["Fare"] >   7.91) & (test_df["Fare"] <= 14.454), "FareBand"] = 1
test_df.loc[(test_df["Fare"] > 14.454) & (test_df["Fare"] <= 31.0  ), "FareBand"] = 2
test_df.loc[(test_df["Fare"] > 31.0  )                              , "FareBand"] = 3
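
The thresholds 7.91, 14.454, and 31.0 are rounded quartiles of the training fares, so a test fare lying between a rounded and an exact quartile could in principle land in a different band. A sketch of one way to avoid hand-typed thresholds, using quantile edges from `pd.qcut` on made-up fares (the infinite outer edges are an assumption added here):

```python
import numpy as np
import pandas as pd

# Toy series, invented for illustration
train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 14.0, 20.0, 31.0, 100.0])
test_fare = pd.Series([3.0, 9.0, 25.0, 500.0])

# Quartile edges computed from the training fares
_, edges = pd.qcut(train_fare, q=4, retbins=True)

# Open the outer edges so fares below or above the training range get a band
edges[0], edges[-1] = -np.inf, np.inf

test_band = pd.cut(test_fare, bins=edges, labels=[0, 1, 2, 3]).astype(int)
print(test_band.tolist())  # [0, 1, 3, 3]
```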

train_df.head(10)

test_df.head(10)

Output of train_df.describe():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Survival rate by Pclass (as_index = False):
	Pclass	Survived
0	1	0.629630
1	2	0.472826
2	3	0.242363

Survival rate by Pclass (as_index = True, Pclass as the index):
	Survived
Pclass
1	0.629630
2	0.472826
3	0.242363

Survival rate by SibSp:
	SibSp	Survived
1	1	0.535885
2	2	0.464286
0	0	0.345395
3	3	0.250000
4	4	0.166667
5	5	0.000000
6	8	0.000000

Survival rate by Parch:
	Parch	Survived
3	3	0.600000
1	1	0.550847
2	2	0.500000
0	0	0.343658
5	5	0.200000
4	4	0.000000
6	6	0.000000

Output of train_df.describe(include = ["O"]):
	Name	Sex	Ticket	Cabin	Embarked
count	891	891	891	204	889
unique	891	2	681	147	3
top	Leader, Dr. Alice (Farnham)	male	CA. 2343	G6	S
freq	1	577	7	4	644

Survival rate by Sex:
	Sex	Survived
0	female	0.742038
1	male	0.188908

Frequency table of Title by Sex (pd.crosstab output):
Sex	female	male
Title
Capt	0	1
Col	0	2
Countess	1	0
Don	0	1
Dr	1	6
Jonkheer	0	1
Lady	1	0
Major	0	2
Master	0	40
Miss	182	0
Mlle	2	0
Mme	1	0
Mr	0	517
Mrs	125	0
Ms	1	0
Rev	0	6
Sir	0	1

Survival rate by Title (after grouping rare titles):
	Title	Survived
0	FemaleRare	1.000000
5	Mrs	0.792000
3	Miss	0.697802
2	Master	0.575000
6	Rare	0.428571
1	MaleRare	0.214286
4	Mr	0.156673

Kaggle Titanic (problem definition through data preprocessing)
