1. Problem
1.1 Background
Happiness is an old and profound topic, one that humanity has pursued for generations. Thousands of factors relate to happiness; among these intertwined influences, can we find commonalities and glimpse the essence of well-being?
This case study tackles the classic problem of happiness prediction, attempting algorithmic approaches beyond existing social-science research. By combining the strengths of multiple disciplines, it aims to uncover latent influencing factors and discover more interpretable, understandable correlations.
Concretely, this is a baseline for a data-mining competition on happiness prediction. We use 139 dimensions of information, covering individual variables (gender, age, region, occupation, health, marriage, political status, etc.), family variables (parents, spouse, children, family capital, etc.), and social attitudes (fairness, trust, public services, etc.), to predict an individual's happiness.
The data come from the official Chinese General Social Survey (CGSS), so the source is reliable.
Data package download: Data of case.
1.2 Data information
Using the 139 features above and more than 8,000 samples, the task is to predict each individual's happiness. The target takes the values 1, 2, 3, 4, 5, where 1 means the lowest happiness and 5 the highest.
Because there are many variables, some with complex interrelations, the data are provided in two versions: complete and abridged. In addition, an index file lists the questionnaire item corresponding to each variable and the meaning of its values, and a survey file contains the original questionnaire as supplementary material for understanding the problem background.
2. Data processing
2.1 Load data
import pandas as pd
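The source only shows the pandas import; the `train` and `test` frames used below are presumably read from the competition CSVs. A minimal loading sketch (the file names are assumptions, not given in the source), with an in-memory CSV standing in for the real file so it runs anywhere:

```python
import io
import pandas as pd

# Assumed file names -- substitute the actual paths from the data package:
# train = pd.read_csv("happiness_train_complete.csv", encoding="gbk")
# test = pd.read_csv("happiness_test_complete.csv", encoding="gbk")

# Self-contained stand-in for the real file:
csv_text = "id,happiness,income\n1,4,20000\n2,-8,15000\n3,5,30000\n"
train = pd.read_csv(io.StringIO(csv_text))
print(train.shape)             # (3, 3)
print(train.columns.tolist())  # ['id', 'happiness', 'income']
```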
Deleting the rows whose happiness value equals -8

# Delete the rows whose happiness value equals -8 (invalid answers)
target_name = 'happiness'
train = train[train[target_name] != -8].reset_index(drop = True)

Train and target set

train_copy = train.copy()
target = train_copy[target_name]  # target labels

Concatenation

del train_copy[target_name]  # drop the target column
# Stack the train set (without happiness values) and the test set
data = pd.concat([train_copy, test], axis = 0, ignore_index = True)
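On toy frames (made-up numbers), the split-and-concatenate step above behaves as follows:

```python
import pandas as pd

train = pd.DataFrame({"happiness": [4, -8, 5, 3], "income": [1, 2, 3, 4]})
test = pd.DataFrame({"income": [5, 6]})

target_name = "happiness"
train = train[train[target_name] != -8].reset_index(drop=True)  # drop invalid rows
target = train[target_name]                                     # labels
features = train.drop(columns=[target_name])                    # features only
data = pd.concat([features, test], axis=0, ignore_index=True)   # stack train + test

print(target.tolist())  # [4, 5, 3]
print(data.shape)       # (5, 1)
```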
2.2 Information
train.happiness.describe()

Results:

count    7988.000000
mean        3.867927
std         0.818717
min         1.000000
25%         4.000000
50%         4.000000
75%         4.000000
max         5.000000
Name: happiness, dtype: float64

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Columns: 140 entries, id to public_service_9
dtypes: datetime64[ns](1), float64(25), int64(111), object(3)
memory usage: 8.5+ MB
2.3 Data preparation
2.3.1 Missing values
Number of columns having missing values
The following code inspects the missing values in the data:
data_nomiss = data.dropna(axis=1, inplace = False)  # columns without missing values
name_data = data.columns
name_miss = name_data.drop(data_nomiss.columns)     # names of the columns with missing values
data_miss = data.loc[:, name_miss]                  # keep the columns with missing values for inspection

data_miss.shape

(10968, 25)

So 25 of the columns contain missing values; the column names are as follows:
name_miss

Index(['edu_other', 'edu_status', 'edu_yr', 'join_party', 'property_other',
       'hukou_loc', 'social_neighbor', 'social_friend', 'work_status',
       'work_yr', 'work_type', 'work_manage', 'family_income', 'invest_other',
       'minor_child', 'marital_1st', 's_birth', 'marital_now', 's_edu',
       's_political', 's_hukou', 's_income', 's_work_exper', 's_work_status',
       's_work_type'],
      dtype='object')

Processing
Inspect the missing data further:

>>> data_miss.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10968 entries, 0 to 10967
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   edu_other        6 non-null      object
 1   edu_status       9399 non-null   float64
 2   edu_yr           8212 non-null   float64
 3   join_party       1126 non-null   float64
 4   property_other   89 non-null     object
 5   hukou_loc        10964 non-null  float64
 6   social_neighbor  9871 non-null   float64
 7   social_friend    9871 non-null   float64
 8   work_status      4029 non-null   float64
 9   work_yr          4029 non-null   float64
 10  work_type        4030 non-null   float64
 11  work_manage      4030 non-null   float64
 12  family_income    10967 non-null  float64
 13  invest_other     45 non-null     object
 14  minor_child      9520 non-null   float64
 15  marital_1st      9839 non-null   float64
 16  s_birth          8601 non-null   float64
 17  marital_now      8521 non-null   float64
 18  s_edu            8601 non-null   float64
 19  s_political      8601 non-null   float64
 20  s_hukou          8601 non-null   float64
 21  s_income         8601 non-null   float64
 22  s_work_exper     8601 non-null   float64
 23  s_work_status    3524 non-null   float64
 24  s_work_type      3524 non-null   float64
dtypes: float64(22), object(3)
memory usage: 2.1+ MB
The four columns edu_other, join_party, property_other, and invest_other are missing for more than 90% of the samples, so after consideration we drop these four variables from the analysis.

name_del = ['edu_other', 'join_party', 'property_other', 'invest_other']
name_data = name_data.drop(name_del)
name_miss = name_miss.drop(name_del)    # drop the four variables above
data_miss = data_miss.loc[:, name_miss]
data = data.loc[:, name_data]           # keep all remaining columns

Information
Count the missing values in each column:

>>> data_miss.isnull().sum()
edu_status         0
edu_yr             0
hukou_loc          0
social_neighbor    0
social_friend      0
work_status        0
work_yr            0
work_type          0
work_manage        0
family_income      0
minor_child        0
marital_1st        0
s_birth            0
marital_now        0
s_edu              0
s_political        0
s_hukou            0
s_income           0
s_work_exper       0
s_work_status      0
s_work_type        0
dtype: int64

Fillna
Some columns have many missing values and others only a few, so we handle them as follows:
- Columns whose missing count is below 20% of the sample size (i.e. 2193.6): fill with the column's mode;
- Columns whose missing count exceeds that threshold: fill with 0, to avoid distorting the data through over-imputation.
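On toy data (made-up column names and values), the two fill rules look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "few":  [1.0] * 7 + [2.0, 2.0, None],  # 1/10 missing  -> mode fill
    "many": [None] * 8 + [7.0, 7.0],       # 8/10 missing  -> zero fill
})
threshold = 0.2 * len(df)  # 20% of the sample size
for col in df.columns:
    if df[col].isnull().sum() < threshold:
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(0)
print(df["few"].tolist())   # mode 1.0 fills the gap
print(df["many"].tolist())  # zeros fill the gaps
```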
num_nulls = data_miss.isnull().sum()  # missing-value count per column
modes = data_miss.mode().loc[0]       # per-column mode
for name in name_miss:
    if num_nulls[name] < 0.2 * data.shape[0]:
        data[name] = data[name].fillna(modes[name])  # few gaps: fill with the mode
    else:
        data[name] = data[name].fillna(0)            # many gaps: fill with 0
data_miss = data.loc[:, name_miss]
This completes the missing-value treatment.
2.3.2 Negative values
Next we handle the negative values that appear throughout the data. Since the only negative codes are -1, -2, -3, and -8, we proceed as follows:
Self-defined functions
import numpy as np

# The CSV contains the negative codes -1, -2, -3 and -8; treat them as
# problematic answers but keep the rows, recording per-row counts instead.
def getres1(row):  # any negative code
    return len([x for x in row.values if isinstance(x, (int, np.integer)) and x < 0])

def getres2(row):  # code -8
    return len([x for x in row.values if isinstance(x, (int, np.integer)) and x == -8])

def getres3(row):  # code -1
    return len([x for x in row.values if isinstance(x, (int, np.integer)) and x == -1])

def getres4(row):  # code -2
    return len([x for x in row.values if isinstance(x, (int, np.integer)) and x == -2])

def getres5(row):  # code -3
    return len([x for x in row.values if isinstance(x, (int, np.integer)) and x == -3])
# Inspect the data
data['neg1'] = data[data.columns].apply(lambda row: getres1(row), axis=1)
data.loc[data['neg1'] > 20, 'neg1'] = 20  # smoothing: cap the count at 20
data['neg2'] = data[data.columns].apply(lambda row: getres2(row), axis=1)
data['neg3'] = data[data.columns].apply(lambda row: getres3(row), axis=1)
data['neg4'] = data[data.columns].apply(lambda row: getres4(row), axis=1)
data['neg5'] = data[data.columns].apply(lambda row: getres5(row), axis=1)
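A vectorized equivalent of the row-wise counting above, shown on toy data; comparing the whole frame against 0 and summing across the row gives the same per-row count without a Python-level loop:

```python
import pandas as pd

# Count how many cells in each row hold a negative code.
df = pd.DataFrame({"a": [1, -8, -1], "b": [-2, 3, -3]})
df["neg1"] = (df[["a", "b"]] < 0).sum(axis=1)
print(df["neg1"].tolist())  # [1, 1, 2]
```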
Process
Record the names of the columns containing negative values, then fill those values by consulting the information in index.csv.
# First convert the survey timestamp to a year
data['survey_time'] = pd.to_datetime(data['survey_time'], format = '%Y-%m-%d',
                                     errors = 'coerce')  # coerce unparseable values to NaT
data['survey_time'] = data['survey_time'].dt.year        # keep the year, used later to compute age
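A small illustration of the conversion above: `errors='coerce'` turns unparseable entries into NaT instead of raising, and `.dt.year` then yields NaN for those rows (so the column becomes float):

```python
import pandas as pd

s = pd.Series(["2015-08-01", "2015-10-12", "not a date"])
dt = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")  # bad rows -> NaT
years = dt.dt.year
print(years.iloc[0])           # 2015.0
print(pd.isna(years.iloc[2]))  # True
```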
# List all columns containing negative values
neg_name = data.min()[data.min() < 0].index  # columns whose minimum is below 0
# Use index.csv to look up what the negative codes mean
data_index = pd.read_csv("index.csv", encoding='utf-8')
data_index.set_index('变量名', inplace = True)  # '变量名' = variable name
data_index_column = data_index.columns
print('There are {} columns that have negative values'.format(len(neg_name)))

There are 100 columns that have negative values

So 100 columns of the table contain negative values. Inspect the field details:
data_index.loc[neg_name, data_index_column[1:3]].head()
| 变量名 (variable) | Question | Value meaning |
| --- | --- | --- |
| nationality | Your ethnic group | 1 = Han; 2 = Mongol; 3 = Manchu; 4 = Hui; 5 = Tibetan; 6 = Zhuang; 7 = ... |
| religion | Your religious belief: no religion | 0 = no; 1 = yes |
| religion_freq | How often you take part in religious activities | 1 = never; 2 = less than once a year; 3 = roughly 1-2 times a year; 4 = several times a ... |
| edu | Your current highest education level (including ongoing studies) | 1 = no formal education; 2 = sishu / literacy class; 3 = primary school; 4 = junior high school; 5 = ... |
| edu_status | Status of your current highest education level | 1 = currently enrolled; 2 = dropped out; 3 = discontinued without graduating; 4 = graduated |
Filling
Based on the actual situation, the negative codes are filled in one of the following ways:
- fill based on common-sense knowledge;
- fill with the mode of the non-negative values;
- fill with the mean of the non-negative values.
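The mode-of-valid-answers rule, sketched on a toy column: negative codes are replaced by the mode computed over the non-negative entries only, so invalid answers cannot dominate the fill value:

```python
import pandas as pd

s = pd.Series([4, 4, -8, 3, -1, 4])
valid_mode = s[s >= 0].mode()[0]  # mode over valid (non-negative) answers
s[s < 0] = valid_mode             # overwrite the negative codes
print(s.tolist())  # [4, 4, 4, 3, 4, 4]
```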
for col in neg_name:
    if data[col].loc[data[col] < 0].count() > 1000:
        print(col, data[col].loc[data[col] < 0].count(), sep = ' | ')

edu_yr | 1679
marital_1st | 1901
f_birth | 4664
m_birth | 4448
inc_ability | 1325
inc_exp | 1501
trust_3 | 1399
trust_4 | 1413
trust_6 | 1562
trust_8 | 1381
trust_10 | 2451
trust_11 | 5456
trust_12 | 3382

From this check we can see that only 5 columns contain more than 2000 negative values.
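The count-of-negatives filter used above, on a toy column:

```python
import pandas as pd

s = pd.Series([4, -8, -1, 3, -8])
neg_count = s.loc[s < 0].count()  # number of negative codes in the column
print(neg_count)  # 3
```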
# ===================== common sense ========================
# Ethnic group
data.loc[data['nationality'] < 0, 'nationality'] = 1  # 1 = Han
# Religion
data.loc[data['religion'] < 0, 'religion'] = 1            # 1 = no religious belief
data.loc[data['religion_freq'] < 0, 'religion_freq'] = 1  # 1 = never attended religious activities
# ======================= mode ============================
# Education
data.loc[data['edu'] < 0, 'edu'] = data['edu'].mode()[0]                       # mode is 4, junior high school
data.loc[data['edu_status'] < 0, 'edu_status'] = data['edu_status'].mode()[0]  # mode is 4, graduated
data.loc[data['edu_yr'] < 0, 'edu_yr'] = 0  # graduation year
# Political status
data.loc[data['political'] < 0, 'political'] = data['political'].mode()[0]  # mode is 1, ordinary citizen
# Health
data.loc[data['health'] < 0, 'health'] = data['health'].mode()[0]  # mode is 4, fairly healthy
data.loc[data['health_problem'] < 0, 'health_problem'] = data['health_problem'].mode()[0]  # mode is 4
# Depression
data.loc[data['depression'] < 0, 'depression'] = data['depression'].mode()[0]  # mode is 4, rarely
# Media: newspapers, magazines, radio, TV, internet, mobile messages
for i in range(6):
    col = 'media_' + str(i + 1)
    data.loc[data[col] < 0, col] = data[col].mode()[0]  # fill with the mode
# Leisure: watching discs, watching movies, shopping, reading, cultural activities, etc. (12 items)
for i in range(12):
    col = 'leisure_' + str(i + 1)
    data.loc[data[col] < 0, col] = data[col].mode()[0]  # fill with the mode
# socialize, relax, learn, social_neighbor, social_friend, socia_outing, equity,
# class, class_10_before, class_10_after, class_14, work_status, work_yr,
# work_type, work_manage, insur_1, insur_2, insur_3, insur_4
for i in range(29, 48):
    col = neg_name[i]
    data.loc[data[col] < 0, col] = data[col].mode()[0]  # fill with the mode
data.loc[data['socialize'] < 0, 'socialize'] = data['socialize'].mode()[0]
# Income status
for i in range(49, 56):
    col = neg_name[i]
    data.loc[data[col] < 0, col] = data[col].mode()[0]  # fill with the mode
for i in range(58, 64):
    col = neg_name[i]
    data.loc[data[col] < 0, col] = data[col].mode()[0]
# Trust
for i in range(77, 100):
    col = neg_name[i]
    # the trust columns contain many negatives, so take the mode over the positive values only
    data.loc[data[col] < 0, col] = data[col].loc[data[col] > 0].mode()[0]
# ========================= mean ==========================
# Income
data.loc[data['income'] < 0, 'income'] = data['income'].mean()  # fill income with the mean
# Family income: column 49
data.loc[data['family_income'] < 0, 'family_income'] = data['family_income'].mean()
# Expected income: column 76
data.loc[data['inc_exp'] < 0, 'inc_exp'] = data['inc_exp'].mean()
# ======================== fill with 0 ===========================
# Marriage: columns 56-57
data.loc[data['marital_1st'] < 0, 'marital_1st'] = 0  # year of first marriage
data.loc[data['marital_now'] < 0, 'marital_now'] = 0  # year of current marriage
# Father: column 64
data.loc[data['f_birth'] < 0, 'f_birth'] = 0  # year of birth
for i in range(65, 68):
    col = neg_name[i]
    data.loc[data[col] < 0, col] = data[col].mode()[0]  # mode
# Mother: column 68
data.loc[data['m_birth'] < 0, 'm_birth'] = 0  # year of birth
for i in range(69, 76):
    col = neg_name[i]
    data.loc[data[col] < 0, col] = data[col].mode()[0]  # mode

Confirm that the negative values are handled
Check all columns to confirm that every negative value has been handled.
neg_count = len(data.min()[data.min() < 0].index)  # number of columns still containing negatives
if neg_count != 0:
    print('{} columns still have negative values!'.format(neg_count))
else:
    print('Congratulations! No columns have negative values!')

Congratulations! No columns have negative values!
All negative values have now been handled.
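The verification trick above relies on `DataFrame.min()` returning one per-column minimum, which can then be filtered; a toy run:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [-1, 3]})
neg_cols = df.min()[df.min() < 0].index  # columns whose minimum is negative
print(list(neg_cols))  # ['b']
```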
Special information processing
Age stratification
Since age strongly influences income, education level, and so on, we compute age from the survey_time and birth columns and stratify it into bands.

# Calculate age
data['age'] = data['survey_time'] - data['birth']
# Age stratification
bins = [0, 17, 26, 34, 50, 63, 100]
data['age_bin'] = pd.cut(data['age'], bins, labels = [0, 1, 2, 3, 4, 5])  # stratify
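`pd.cut` assigns each age to the band whose half-open interval (low, high] contains it, and the labels index the bands from youngest to oldest; on toy ages:

```python
import pandas as pd

ages = pd.Series([10, 20, 30, 45, 60, 80])
bins = [0, 17, 26, 34, 50, 63, 100]
age_bin = pd.cut(ages, bins, labels=[0, 1, 2, 3, 4, 5])
print(age_bin.tolist())  # [0, 1, 2, 3, 4, 5]
```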
Data augmentation
To mine more information, we further analyze the relationships between features and augment the data. After careful analysis, the following features are added:
- age at first marriage
- age at the most recent marriage
- whether remarried
- spouse's age
- age gap with the spouse
- various income ratios (income relative to the spouse's, expected income in ten years relative to current income, etc.)
- income-to-floor-area ratios (including the expected income in 10 years, etc.)
- social class (social class in 10 years, social class at age 14, etc.)
- leisure index
- satisfaction index
- trust index
# Age at first marriage (feature 143)
data['marital_1stbir'] = data['marital_1st'] - data['birth']
# Age at the most recent marriage (144)
data['marital_nowtbir'] = data['marital_now'] - data['birth']
# Whether remarried (145)
data['mar'] = data['marital_nowtbir'] - data['marital_1stbir']
# Spouse's age (146)
data['marital_sbir'] = data['marital_now'] - data['s_birth']
# Age gap with the spouse (147)
data['age__'] = data['marital_nowtbir'] - data['marital_sbir']
# Income ratios (147 + 7 = 154)
data['income/s_income'] = data['income'] / (data['s_income'] + 1)
data['income+s_income'] = data['income'] + (data['s_income'] + 1)
data['income/family_income'] = data['income'] / (data['family_income'] + 1)
data['all_income/family_income'] = (data['income'] + data['s_income']) / (data['family_income'] + 1)
data['income/inc_exp'] = data['income'] / (data['inc_exp'] + 1)
data['family_income/m'] = data['family_income'] / (data['family_m'] + 0.01)
data['income/m'] = data['income'] / (data['family_m'] + 0.01)
# Income / floor-area ratios (154 + 4 = 158)
data['income/floor_area'] = data['income'] / (data['floor_area'] + 0.01)
data['all_income/floor_area'] = (data['income'] + data['s_income']) / (data['floor_area'] + 0.01)
data['family_income/floor_area'] = data['family_income'] / (data['floor_area'] + 0.01)
data['floor_area/m'] = data['floor_area'] / (data['family_m'] + 0.01)
# Class (158 + 3 = 161)
data['class_10_diff'] = data['class_10_after'] - data['class']
data['class_diff'] = data['class'] - data['class_10_before']
data['class_14_diff'] = data['class'] - data['class_14']
# Leisure index (162)
leisure_fea_lis = ['leisure_' + str(i) for i in range(1, 13)]
data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1)  # skew
# Satisfaction index (163)
public_service_fea_lis = ['public_service_' + str(i) for i in range(1, 10)]
data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1)  # skew
# Trust index (164)
trust_fea_lis = ['trust_' + str(i) for i in range(1, 14)]
data['trust_sum'] = data[trust_fea_lis].sum(axis=1)  # skew
# Province means (164 + 13 = 177)
data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values
data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values
data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values
data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values
data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values
data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values
data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values
data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values
data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values
data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values
data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values
data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values
data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values
# City means (177 + 13 = 190)
data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values
data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values
data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values
data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values
data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values
data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values
data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values
data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values
data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values
data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values
data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values
data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values
data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values
# County means (190 + 13 = 203)
data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values
data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values
data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values
data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values
data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values
data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values
data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values
data['county_class_mean'] = data.groupby(['county'])['class'].transform('mean').values
data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values
data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values
data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values
data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values
data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values
# Ratios relative to the province (203 + 13 = 216)
data['income/province'] = data['income'] / (data['province_income_mean'])
data['family_income/province'] = data['family_income'] / (data['province_family_income_mean'])
data['equity/province'] = data['equity'] / (data['province_equity_mean'])
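The groupby-transform pattern used for all of the regional features above can be sketched on toy data: `transform('mean')` broadcasts each group's mean back to every row, so the per-row ratio against the regional mean is a plain division:

```python
import pandas as pd

df = pd.DataFrame({"province": [1, 1, 2, 2], "income": [10, 30, 40, 60]})
df["province_income_mean"] = df.groupby("province")["income"].transform("mean")
df["income/province"] = df["income"] / df["province_income_mean"]
print(df["province_income_mean"].tolist())  # [20.0, 20.0, 50.0, 50.0]
print(df["income/province"].tolist())       # [0.5, 1.5, 0.8, 1.2]
```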