Contents

- I. Introduction to multiprocessing and multithreading
- II. Why use multithreading or multiprocessing
- III. Two ways to write a multithreaded crawler
  - 1. Plain function call
  - 2. Thread subclass
- IV. Multiprocess crawler

I. Introduction to multiprocessing and multithreading

For the background concepts, see the earlier companion article on multiprocessing and multithreading.
II. Why use multithreading or multiprocessing

First, look at a simple piece of code and how long it takes to request Baidu 100 times sequentially:
```python
# coding: utf-8
import time

import requests


def get_response():
    try:
        url = 'https://www.baidu.com/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
        }
        response = requests.get(url, headers=headers, timeout=2)
        print(response.status_code)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    a = time.time()
    for i in range(100):
        get_response()
    print(time.time() - a)
```
If we fetch concurrently with multiple threads or processes instead, will it be much faster?

III. Two ways to write a multithreaded crawler

1. Plain function call
```python
# coding: utf-8
import time
import threading

import requests


def get_response():
    try:
        url = 'https://www.baidu.com/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
        }
        response = requests.get(url, headers=headers, timeout=2)
        print(response.status_code)
    except Exception as e:
        print(e)


def fun():
    for i in range(10):
        get_response()


if __name__ == '__main__':
    for i in range(10):
        threading.Thread(target=fun).start()
```
On Windows, 100 requests split across 10 threads took about 7 seconds.
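Note that the snippet above only starts the threads and does not wait for them or time the run. Below is a minimal sketch of one way to measure the total elapsed time with join(); fetch_ten is a hypothetical stand-in for fun() and this code is not from the original article.

```python
# coding: utf-8
# Timing sketch (assumption, not in the original article): start the 10
# worker threads, wait for all of them with join(), then read the clock.
import time
import threading

import requests


def fetch_ten():  # hypothetical stand-in for fun() above
    for _ in range(10):
        try:
            print(requests.get('https://www.baidu.com/', timeout=2).status_code)
        except Exception as e:
            print(e)


if __name__ == '__main__':
    start = time.time()
    threads = [threading.Thread(target=fetch_ten) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished its 10 requests
    print(time.time() - start)
```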
2. Thread subclass

```python
# coding: utf-8
import threading

import requests


class Spider(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def get_response(self):
        try:
            url = 'https://www.baidu.com/'
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
            }
            response = requests.get(url, headers=headers, timeout=2)
            print(response.status_code)
        except Exception as e:
            print(e)

    def run(self):
        for i in range(10):
            self.get_response()


if __name__ == '__main__':
    for i in range(10):
        # start() spawns a new thread that executes run();
        # calling run() directly would execute everything in the main thread
        Spider().start()
```
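The subclass form pays off when each thread needs its own state. Below is a hypothetical extension, not part of the original article, that hands each Spider its own batch of URLs through __init__ so the 100 requests are partitioned across the 10 threads.

```python
# coding: utf-8
# Hypothetical extension (not from the original article): each Spider thread
# receives its own list of URLs via __init__ and works through that list.
import threading

import requests


class Spider(threading.Thread):
    def __init__(self, urls):
        threading.Thread.__init__(self)
        self.urls = urls  # this thread's share of the work

    def run(self):
        for url in self.urls:
            try:
                print(requests.get(url, timeout=2).status_code)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    urls = ['https://www.baidu.com/'] * 100       # 100 requests in total
    chunks = [urls[i::10] for i in range(10)]     # split across 10 threads
    spiders = [Spider(chunk) for chunk in chunks]
    for s in spiders:
        s.start()
    for s in spiders:
        s.join()
```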
IV. Multiprocess crawler
```python
# coding: utf-8
import multiprocessing

import requests


def get_response():
    try:
        url = 'https://www.baidu.com/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400',
        }
        response = requests.get(url, headers=headers, timeout=2)
        print(response.status_code)
    except Exception as e:
        print(e)


def fun():
    for i in range(25):
        get_response()


if __name__ == '__main__':
    for i in range(4):
        multiprocessing.Process(target=fun).start()
```
On Windows, 100 requests split across 4 concurrent processes took about 12 seconds.
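For completeness, here is a sketch, not from the original article, of the same 100 requests driven through multiprocessing.Pool, which also makes it easy to wait for all the work and time it in the parent process; get_one is a hypothetical wrapper because Pool.map passes one argument to its worker.

```python
# coding: utf-8
# Sketch (assumption, not in the original article): the same 100 requests
# through a 4-worker multiprocessing.Pool, timed in the parent process.
import time
import multiprocessing

import requests


def get_one(_):
    try:
        print(requests.get('https://www.baidu.com/', timeout=2).status_code)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    start = time.time()
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(get_one, range(100))  # blocks until all 100 requests finish
    print(time.time() - start)
```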