Salary Analysis of Data Analyst Positions in Major Cities

1. Project Background

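Because I am considering a career change into data analysis, I analyzed job-posting data to understand the market demand, industry distribution, and salary levels for this role, and so to clarify my job-search direction.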

2. Data Acquisition

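The data comes from the BOSS直聘 recruitment site (zhipin.com) and was collected with a web scraper. The cities covered are mainly first-tier and new first-tier cities, that is, the more economically developed ones. The scraper code is as follows: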

from selenium import webdriver
from bs4 import BeautifulSoup

# Path to the local chromedriver binary
driver = webdriver.Chrome(r'D:\PycharmProjects\python_present\boss直聘爬取\chromedriver.exe')

# Target cities with their zhipin.com city codes
cities = [
    {"name": "北京", "code": 101010100, "url": "/beijing/"},
    {"name": "上海", "code": 101020100, "url": "/shanghai/"},
    {"name": "广州", "code": 101280100, "url": "/guangzhou/"},
    {"name": "深圳", "code": 101280600, "url": "/shenzhen/"},
    {"name": "杭州", "code": 101210100, "url": "/hangzhou/"},
    {"name": "天津", "code": 101030100, "url": "/tianjin/"},
    {"name": "苏州", "code": 101190400, "url": "/suzhou/"},
    {"name": "武汉", "code": 101200100, "url": "/wuhan/"},
    {"name": "厦门", "code": 101230200, "url": "/xiamen/"},
    {"name": "长沙", "code": 101250100, "url": "/changsha/"},
    {"name": "成都", "code": 101270100, "url": "/chengdu/"},
    {"name": "郑州", "code": 101180100, "url": "/zhengzhou/"},
    {"name": "重庆", "code": 101040100, "url": "/chongqing/"},
    {"name": "青岛", "code": 101120200, "url": "/qingdao/"},
    {"name": "南京", "code": 101190100, "url": "/nanjing/"},
]

for city in cities:
    # Result pages 1 to 7 for the query "数据分析" (data analysis) in this city
    urls = ['https://www.zhipin.com/c{}/?query=数据分析&page={}&ka=page-{}'.format(city['code'], i, i)
            for i in range(1, 8)]
    for url in urls:
        driver.get(url)
        html = driver.page_source
        bs = BeautifulSoup(html, 'html.parser')
        job_all = bs.find_all('div', {"class": "job-primary"})
        for job in job_all:
            position = job.find('span', {"class": "job-name"}).get_text()
            address = job.find('span', {'class': "job-area"}).get_text()
            company = job.find('div', {'class': 'company-text'}).find('h3', {'class': "name"}).get_text()
            salary = job.find('span', {'class': 'red'}).get_text()
            # The job-limit text ends with the two-character degree; the rest is the experience requirement
            diploma = job.find('div', {'class': 'job-limit'}).find('p').get_text()[-2:]
            experience = job.find('div', {'class': 'job-limit'}).find('p').get_text()[:-2]
            labels = job.find('a', {'class': 'false-link'}).get_text()
            # Append one comma-separated row per posting; commas in the title are replaced first
            with open('position.csv', 'a+', encoding='UTF-8-SIG') as f_obj:
                f_obj.write(position.replace(',', '、') + "," + address + "," + company + "," + salary + "," + diploma
                            + "," + experience + ',' + labels + "\n")

driver.quit()
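The scraper above joins fields with commas by hand and only replaces commas in the position field, so a comma in any other field would shift the columns. As a hedged alternative, not the author's method, the standard-library csv module quotes such fields automatically. The helper name `append_rows` and the sample record (including the company name "Example Co., Ltd") are hypothetical and introduced only for illustration.

```python
import csv
import io


def append_rows(f_obj, rows):
    """Write job rows with csv.writer so fields containing commas are quoted."""
    writer = csv.writer(f_obj)
    for row in rows:
        writer.writerow(row)


# Hypothetical record whose company name contains a comma
rows = [("数据分析", "北京·朝阳区", "Example Co., Ltd", "15-25K", "本科", "3-5年", "银行")]

# An in-memory buffer stands in for the real output file
buffer = io.StringIO()
append_rows(buffer, rows)
csv_text = buffer.getvalue()

# Reading the text back yields the original seven fields
parsed = list(csv.reader(io.StringIO(csv_text)))
```

With a real file, opening it with `newline=''` is the usual companion to `csv.writer`, so that line endings are not doubled on Windows.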

3. Data Cleaning

In [59]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings 
from scipy.stats import norm,mode
import re
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

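The raw data has no header row, so column names are assigned:
position: job title
address: district where the company is located
company: company name
salary: salary
diploma: education requirement
experience: work-experience requirement
lables: industry label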

In [60]:

df = pd.read_csv('job.csv',header=None,names=['position','address','company','salary','diploma','experience','lables'])

Take a first look at the data as a whole.

In [61]:

df.head()

Out[61]:

   position    address              company       salary        diploma  experience  lables
0  数据分析    北京·朝阳区·亚运村   中信百信银行  25-40K·15薪   本科     5-10年      银行
1  数据分析    北京·朝阳区·太阳宫   BOSS直聘      25-40K·16薪   博士     1-3年       人力资源服务
2  数据分析    北京·朝阳区·鸟巢     京东集团      50-80K·14薪   本科     3-5年       电子商务
3  数据分析    北京·海淀区·清河     一亩田        15-25K        本科     3-5年       O2O
4  数据分析岗  北京·海淀区·西北旺   建信金科      20-40K·14薪   硕士     5-10年      银行

In [62]:

df.shape

Out[62]:

(3045, 7)

In [63]:

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3045 entries, 0 to 3044
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   position    3045 non-null   object
 1   address     3045 non-null   object
 2   company     3045 non-null   object
 3   salary      3045 non-null   object
 4   diploma     3045 non-null   object
 5   experience  3045 non-null   object
 6   lables      3045 non-null   object
dtypes: object(7)
memory usage: 83.3+ KB

There are 45 duplicate rows, which are dropped.

In [64]:

df.duplicated().sum()

Out[64]:

45

In [65]:

df.drop_duplicates(keep='first',inplace=True)

In [66]:

df.duplicated().sum()

Out[66]:

0

In [67]:

df.shape

Out[67]:

(3000, 7)
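The relationship between the duplicate count and the row count after dropping can be checked on a small example. The sketch below uses a made-up four-row frame (`toy` is a hypothetical name, not part of the original notebook) to show that `duplicated().sum()` equals the number of rows removed by `drop_duplicates(keep='first')`.

```python
import pandas as pd

# Made-up data: rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    "position": ["A", "A", "B", "A"],
    "salary": ["10-15K", "10-15K", "8-12K", "10-15K"],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# Keeping the first occurrence removes exactly the marked rows
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
rows_removed = len(toy) - len(deduped)
```

The same reasoning explains the shapes above: 3045 rows minus 45 duplicates leaves 3000.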

In [68]:

df.isnull().sum()

Out[68]:

position      0
address       0
company       0
salary        0
diploma       0
experience    0
lables        0
dtype: int64

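The data includes internship positions. Internships are paid by the day, so their salaries have little reference value; rows whose position contains "实习" (internship) are therefore removed.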

In [69]:

#df['position'] = df['position'].astype('string')

In [70]:

x=df['position'].str.contains('实习')
df=df[~x]
df.reset_index(drop=True,inplace=True)

The values in the address column are not in a uniform format, so they are all reduced to the city name.

In [71]:

df['address']=df['address'].str[:2]

In [72]:

df['address'].unique()

Out[72]:

array(['北京', '上海', '广州', '深圳', '杭州', '天津', '苏州', '武汉', '厦门', '长沙', '成都',
       '郑州', '重庆', '青岛', '南京'], dtype=object)
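Taking the first two characters works here because all fifteen scraped cities have two-character names. As a hedged alternative, assuming the address always uses "·" as the separator (as in the sample rows), splitting on that character also handles longer city names. The address "乌鲁木齐·天山区" below is a hypothetical example and is not in the dataset.

```python
import pandas as pd

# Sample addresses; the third one is hypothetical and has a four-character city name
addresses = pd.Series(["北京·朝阳区·亚运村", "上海·浦东新区", "乌鲁木齐·天山区", "南京"])

# Keep everything before the first "·"; values without a separator are returned unchanged
city = addresses.str.split("·").str[0]

# For comparison: the fixed two-character slice used in the notebook
city_two_chars = addresses.str[:2]
```

The two approaches agree on the two-character cities and differ only on the hypothetical longer name.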

Inspect the values in the salary column.

In [73]:

df['salary'].unique()

Out[73]:

array(['25-40K·15薪', '25-40K·16薪', '50-80K·14薪', '15-25K', '20-40K·14薪',
       '15-30K·14薪', '20-30K', '15-25K·14薪', '40-55K·13薪', '20-35K',
       '30-55K·13薪', '20-40K·16薪', '35-40K·15薪', '45-65K', '15-30K',
       '25-50K·14薪', '25-35K·14薪', '15-25K·16薪', '15-28K·14薪', '18-28K',
       '30-50K·13薪', '20-35K·14薪', '15-28K', '20-30K·13薪', '30-50K·16薪',
       '18-30K·14薪', '18-22K·15薪', '25-45K·16薪', '13-25K', '14-25K·14薪',
       '18-35K·14薪', '25-45K·14薪', '25-40K', '15-26K·13薪', '12-24K',
       '25-45K', '20-40K', '20-30K·15薪', '15-25K·15薪', '25-40K·17薪',
       '20-30K·14薪', '18-35K', '18-27K', '30-45K', '20-40K·15薪',
       '20-30K·16薪', '25-30K·15薪', '17-27K', '28-50K·14薪', '25-35K',
       '30-60K·14薪', '30-55K', '35-60K·14薪', '15-22K', '30-50K',
       '30-50K·14薪', '40-70K', '30-60K·13薪', '25-50K·15薪', '13-26K·16薪',
       '25-50K', '12-24K·14薪', '17-25K·15薪', '18-25K·15薪', '28-40K·16薪',
       '30-40K', '28-40K·13薪', '20-25K·16薪', '30-60K·16薪', '25-30K·14薪',
       '15-30K·15薪', '25-40K·14薪', '35-65K·16薪', '30-45K·14薪',
       '20-35K·16薪', '15-30K·16薪', '35-65K·15薪', '25-26K', '20-25K',
       '25-50K·16薪', '18-35K·16薪', '18-25K·14薪', '25-30K', '19-35K',
       '12-22K·14薪', '28-45K·14薪', '18-30K', '18-25K', '15-25K·13薪',
       '15-25K·17薪', '15-30K·13薪', '40-60K·15薪', '18-30K·15薪',
       '25-40K·13薪', '25-30K·13薪', '20-35K·15薪', '18-24K', '30-60K',
       '40-70K·14薪', '18-30K·13薪', '16-25K·13薪', '20-28K·15薪',
       '15-20K·13薪', '15-20K·14薪', '12-18K', '11-20K', '20-40K·13薪',
       '14-28K', '11-17K·13薪', '15-20K', '9-14K', '12-15K', '11-22K',
       '10-15K', '12-20K', '12-17K', '9-13K·13薪', '10-15K·14薪',
       '10-15K·13薪', '7-12K·14薪', '10-11K', '6-9K', '10-12K',
       '20-25K·14薪', '8-10K·13薪', '9-13K·14薪', '7-10K', '7-10K·13薪',
       '20-35K·13薪', '25-35K·16薪', '30-40K·13薪', '30-50K·15薪',
       '30-60K·15薪', '12-20K·14薪', '28-55K', '23-45K', '8-13K',
       '30-35K·15薪', '30-45K·16薪', '15-28K·15薪', '60-90K·16薪', '40-60K',
       '30-35K', '12-24K·16薪', '16-30K·15薪', '11-15K·15薪', '15-16K',
       '6-10K·13薪', '4-8K', '5-7K', '4-6K', '4-7K', '8-13K·13薪',
       '14-20K·13薪', '18-28K·16薪', '6-8K', '35-50K', '11-18K', '6-10K',
       '25-35K·15薪', '5-10K·13薪', '8-10K', '5-10K', '12-17K·14薪',
       '11-20K·13薪', '10-13K·14薪', '8-12K', '13-25K·14薪', '11-22K·18薪',
       '28-40K·14薪', '3-6K', '12-22K', '5-8K', '9-14K·16薪', '13-20K',
       '14-20K·14薪', '15-17K·13薪', '5-6K', '6-8K·13薪', '15-17K', '3-5K',
       '6-7K·13薪', '18-35K·15薪', '3-4K', '8-13K·14薪', '8-12K·13薪',
       '7-12K·13薪', '4-5K', '9-14K·13薪', '5-9K', '12-18K·13薪',
       '20-25K·15薪', '9-11K', '8-16K', '13-23K', '14-25K', '7-12K',
       '12-15K·13薪', '3-5K·13薪', '12-24K·13薪', '16-23K', '6-10K·15薪',
       '11-16K', '7-11K', '16-22K·13薪', '10-20K', '14-22K', '60-90K',
       '30-35K·14薪', '35-50K·16薪', '13-22K·14薪', '5-8K·13薪', '10-15K·16薪',
       '5-6K·13薪', '13-25K·13薪', '8-11K', '13-26K', '16-32K', '16-28K',
       '80-110K·14薪', '9-13K', '12-16K', '21-22K', '20-40K·18薪', '16-30K',
       '30-55K·16薪', '11-16K·13薪', '70-100K·14薪', '15-22K·13薪',
       '18-25K·13薪', '20-21K', '10-15K·15薪', '9-12K', '23-45K·16薪',
       '25-50K·13薪', '25-30K·20薪', '35-50K·15薪', '30-40K·18薪',
       '40-70K·16薪', '15-26K', '14-28K·14薪', '18-22K', '35-65K', '15-21K',
       '30-55K·18薪', '12-20K·13薪', '21-35K·16薪', '15-30K·17薪', '4-9K',
       '9-14K·15薪', '20-40K·17薪', '18-36K', '6-8K·15薪', '4-6K·13薪',
       '25-35K·13薪', '16-30K·14薪', '22-27K', '11-18K·13薪', '18-26K',
       '28-50K·13薪', '35-40K', '20-24K', '17-25K', '13-21K·13薪',
       '12-20K·17薪', '12-24K·15薪', '15-22K·14薪', '12-18K·15薪',
       '30-50K·18薪', '8-13K·15薪', '65-95K', '24-38K', '6-11K·13薪',
       '6-11K', '9-15K', '11-15K', '7-8K', '8-9K', '2-5K', '7-11K·13薪',
       '6-7K', '4-8K·13薪', '3-4K·13薪', '3-7K', '12-13K·13薪', '12-17K·15薪',
       '7-9K', '14-28K·13薪', '8-15K', '9-11K·13薪', '10-12K·13薪', '8-14K',
       '12-18K·14薪', '4-5K·13薪', '9-14K·14薪', '12-16K·13薪', '5-8K·15薪',
       '5-10K·14薪', '11-20K·14薪', '12-20K·15薪', '17-30K·15薪', '6-9K·14薪',
       '15-18K·13薪', '40-70K·13薪', '11-22K·14薪', '12-22K·15薪', '15-23K',
       '18-23K', '14-28K·15薪', '35-50K·14薪', '50-80K', '13-20K·15薪',
       '15-20K·15薪', '6-8K·14薪', '17-30K', '7-8K·13薪', '10-13K',
       '4-6K·14薪', '2-4K', '6-12K', '6-11K·14薪', '10-13K·13薪',
       '8-12K·14薪', '5-7K·13薪', '35-50K·13薪', '11-12K', '4-5K·14薪',
       '10-13K·15薪', '27-40K', '16-25K·14薪', '12-22K·13薪', '11-22K·13薪',
       '5-9K·13薪', '13-21K', '13-17K', '11-20K·15薪', '11-19K', '14-18K',
       '11-20K·17薪', '3-8K', '13-18K', '10-20K·18薪', '8-11K·13薪',
       '45-60K·15薪', '13-26K·14薪', '13-20K·14薪', '15-16K·13薪',
       '11-18K·14薪', '2-6K', '8-10K·14薪', '3-5K·14薪', '2-3K',
       '10-11K·16薪', '18-20K', '12-13K', '12-13K·15薪', '2-7K',
       '8-12K·15薪', '15-30K·18薪', '6-7K·14薪', '5-8K·16薪', '18-22K·18薪',
       '11-16K·15薪', '15-25K·20薪', '18-35K·13薪', '14-20K', '13-16K',
       '4-7K·13薪', '10-12K·15薪', '7-14K', '12-14K', '3-7K·13薪',
       '7-10K·14薪', '22-40K', '4-6K·15薪', '15-24K', '13-22K·16薪',
       '26-50K', '10-18K', '6-9K·13薪', '14-15K·14薪', '9-10K', '3-6K·13薪',
       '4-9K·13薪', '16-20K·13薪', '12-23K', '1-4K', '11-16K·14薪',
       '13-18K·13薪', '12-15K·15薪', '20-28K·13薪', '6-10K·14薪',
       '12-17K·13薪', '13-15K', '13-14K', '11-20K·16薪', '50-60K',
       '5-7K·14薪', '10-15K·17薪', '13-20K·13薪', '4-9K·14薪', '17-34K',
       '20-25K·19薪'], dtype=object)

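Split the salary values into two new columns, bottom and top, holding the lowest and highest monthly salary (in thousands of yuan) for each position.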

In [74]:

df['bottom']=df['salary'].str.extract(r'^(\d+).*', expand=False)

In [75]:

df['top']=df['salary'].str.extract(r'^.*?-(\d+).*', expand=False)

Some companies list a single salary value rather than a range; in that case the top column is filled with the value from bottom.

In [76]:

df['top']=df['top'].fillna(df['bottom'])

In [77]:

df

Out[77]:

      position            address  company       salary        diploma  experience  lables        bottom  top
0     数据分析            北京     中信百信银行  25-40K·15薪   本科     5-10年      银行          25      40
1     数据分析            北京     BOSS直聘      25-40K·16薪   博士     1-3年       人力资源服务  25      40
2     数据分析            北京     京东集团      50-80K·14薪   本科     3-5年       电子商务      50      80
3     数据分析            北京     一亩田        15-25K        本科     3-5年       O2O           15      25
4     数据分析岗          北京     建信金科      20-40K·14薪   硕士     5-10年      银行          20      40
...   ...                 ...      ...           ...           ...      ...         ...           ...     ...
2921  助理数据分析员      南京     万得          4-6K          本科     经验不限    数据服务      4       6
2922  数据分析师（经济）  南京     万得          4-6K          本科     经验不限    数据服务      4       6
2923  （金融）数据分析员  南京     万得          4-6K          本科     经验不限    数据服务      4       6
2924  数据分析员          南京     万得          4-6K          本科     1年以内     数据服务      4       6
2925  助理数据分析员      南京     万得          4-8K          本科     经验不限    数据服务      4       8

2926 rows × 9 columns

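Some companies state a year-end bonus as extra months of pay, for example 14薪 (14 months of salary per year). A new column, commision_pct (spelled as in the code), stores the bonus multiplier for each position: the number of paid months divided by 12, with 12 months assumed when none is stated.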

In [78]:

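# Extract the number of paid months (e.g. 15 from '25-40K·15薪'); NaN when not stated
df['commision_pct']=df['salary'].str.extract(r'^.*?·(\d{2})薪', expand=False)
# Postings without a stated bonus are treated as 12 months of pay
df['commision_pct']=df['commision_pct'].fillna(12)
df['commision_pct']=df['commision_pct'].astype('float64')
df['commision_pct']=df['commision_pct']/12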

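Convert bottom and top to integers (commision_pct is already numeric) and use the three columns to compute each position's average monthly salary, stored in the new column avg_salary.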

In [79]:

df['bottom'] = df['bottom'].astype('int64')
df['top'] = df['top'].astype('int64')
df['avg_salary'] = (df['bottom']+df['top'])/2*df['commision_pct']
df['avg_salary'] = df['avg_salary'].astype('int64')
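The arithmetic behind avg_salary can be checked on single strings. The sketch below is a hedged, pandas-free restatement of the steps above, not the notebook's code: `parse_salary` and `SALARY_RE` are hypothetical names, and the combined regular expression is an assumption about the salary format based on the values listed earlier.

```python
import re

# Assumed format: "<bottom>-<top>K" optionally followed by "·<months>薪"
SALARY_RE = re.compile(r"^(\d+)(?:-(\d+))?K(?:·(\d+)薪)?$")


def parse_salary(text):
    """Return (bottom, top, months, avg_salary) for one salary string."""
    m = SALARY_RE.match(text)
    if m is None:
        raise ValueError("unrecognized salary format: " + text)
    bottom = int(m.group(1))
    # A single value (no range) uses bottom as top
    top = int(m.group(2)) if m.group(2) else bottom
    # No stated bonus means 12 months of pay
    months = int(m.group(3)) if m.group(3) else 12
    # Midpoint of the range scaled by months/12, truncated to an integer
    avg_salary = int((bottom + top) / 2 * months / 12)
    return bottom, top, months, avg_salary


first_row = parse_salary("25-40K·15薪")   # (25 + 40) / 2 * 15 / 12 = 40.625
no_bonus = parse_salary("15-25K")         # (15 + 25) / 2 * 12 / 12 = 20.0
```

For '25-40K·15薪' the midpoint is 32.5 and the multiplier is 15/12 = 1.25, giving 40.625, which truncates to 40.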

In [80]:

df.head()

Out[80]:

   position    address  company       salary        diploma  experience  lables        bottom  top  commision_pct  avg_salary
0  数据分析    北京     中信百信银行  25-40K·15薪   本科     5-10年      银行          25      40   1.250000       40
1  数据分析    北京     BOSS直聘      25-40K·16薪   博士     1-3年       人力资源服务  25      40   1.333333       43
2  数据分析    北京     京东集团      50-80K·14薪   本科     3-5年       电子商务      50      80   1.166667       75
3  数据分析    北京     一亩田        15-25K        本科     3-5年       O2O           15      25   1.000000       20
4  数据分析岗  北京     建信金科      20-40K·14薪   硕士     5-10年      银行          20      40   1.166667       35

In [81]:

cols=list(df)
cols.insert(4,cols.pop(cols.index('bottom')))
cols.insert(5,cols.pop(cols.index('top')))
cols.insert(6,cols.pop(cols.index('commision_pct')))
cols.insert(7,cols.pop(cols.index('avg_salary')))
df=df.loc[:,cols]
df

Out[81]:

      position            address  company       salary        bottom  top  commision_pct  avg_salary  diploma  experience  lables
0     数据分析            北京     中信百信银行  25-40K·15薪   25      40   1.250000       40          本科     5-10年      银行
1     数据分析            北京     BOSS直聘      25-40K·16薪   25      40   1.333333       43          博士     1-3年       人力资源服务
2     数据分析            北京     京东集团      50-80K·14薪   50      80   1.166667       75          本科     3-5年       电子商务
3     数据分析            北京     一亩田        15-25K        15      25   1.000000       20          本科     3-5年       O2O
4     数据分析岗          北京     建信金科      20-40K·14薪   20      40   1.166667       35          硕士     5-10年      银行
...   ...                 ...      ...           ...           ...     ...  ...            ...         ...      ...         ...
2921  助理数据分析员      南京     万得          4-6K          4       6    1.000000       5           本科     经验不限    数据服务
2922  数据分析师（经济）  南京     万得          4-6K          4       6    1.000000       5           本科     经验不限    数据服务
2923  （金融）数据分析员  南京     万得          4-6K          4       6    1.000000       5           本科     经验不限    数据服务
2924  数据分析员          南京     万得          4-6K          4       6    1.000000       5           本科     1年以内     数据服务
2925  助理数据分析员      南京     万得          4-8K          4       8    1.000000       6           本科     经验不限    数据服务

2926 rows × 11 columns

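Looking at the data again reveals extreme outliers. Monthly salaries as low as 1,000 yuan or as high as 100,000 yuan are very rare, so rows with a monthly salary below 2,000 yuan or above 55,000 yuan are removed (avg_salary is in thousands of yuan).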

In [82]:

df.describe()

Out[82]:

       bottom       top          commision_pct  avg_salary
count  2926.000000  2926.000000  2926.000000    2926.000000
mean   11.980861    20.058442    1.057929       17.056391
std    7.841004     13.824406    0.100427       12.582388
min    1.000000     3.000000     1.000000       2.000000
25%    6.000000     9.000000     1.000000       7.000000
50%    10.000000    15.000000    1.000000       13.000000
75%    15.000000    30.000000    1.083333       23.000000
max    80.000000    110.000000   1.666667       110.000000

In [83]:

df=df[(df.avg_salary>2)&(df.avg_salary
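The filtering cell above is cut off in the source. A minimal sketch of the step as the text describes it follows, run on made-up values rather than the real data. The strict lower bound (> 2) comes from the surviving code; the upper bound of 55 comes from the prose, and whether it is strict is an assumption. `toy`, `LOW`, and `HIGH` are hypothetical names.

```python
import pandas as pd

# Made-up average salaries in thousands of yuan per month
toy = pd.DataFrame({"avg_salary": [1, 2, 5, 20, 54, 55, 110]})

LOW = 2    # lower threshold, from the surviving part of the original cell
HIGH = 55  # upper threshold, from the text; strictness is an assumption

# Keep rows strictly between the two thresholds
mask = (toy["avg_salary"] > LOW) & (toy["avg_salary"] < HIGH)
filtered = toy[mask].reset_index(drop=True)
```

If the boundary values should be kept instead, `Series.between(LOW, HIGH)` is inclusive on both ends by default.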
