Crawling Xici Proxies (西刺代理) with Scrapy
1. Create the project
- scrapy startproject XcSpider
2. Generate the spider
- scrapy genspider xcdl xicidaili.com
First mark the project folder as Sources Root in PyCharm, so that imports of your own modules resolve correctly.
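If you are not using PyCharm, a rough equivalent (purely illustrative, not part of the original setup) is to put the project root on sys.path yourself before importing project modules:
import os
import sys

# Make the XcSpider package importable regardless of the working directory
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))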
3. Create a launcher file, main.py
from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())
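Equivalently, the crawl can be started programmatically through Scrapy's CrawlerProcess API; a minimal sketch:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the 'xcdl' spider
process = CrawlerProcess(get_project_settings())
process.crawl('xcdl')
process.start()  # blocks until the crawl finishes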
4. Overall project tree
On Windows, print the tree with:
tree /F
(/F lists files as well as folders)
│ main.py
│ scrapy.cfg
│ xcdl.log
│
└───XcSpider
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├───mysqlpipelines
│ │ pipelines.py
│ │ sql.py
│ │ __init__.py
│ │
│ └───__pycache__
│ pipelines.cpython-36.pyc
│ sql.cpython-36.pyc
│ __init__.cpython-36.pyc
│
├───spiders
│ │ xcdl.py
│ │ __init__.py
│ │
│ └───__pycache__
│ xcdl.cpython-36.pyc
│ __init__.cpython-36.pyc
│
└───__pycache__
items.cpython-36.pyc
pipelines.cpython-36.pyc
settings.cpython-36.pyc
__init__.cpython-36.pyc
5. settings.py configuration
- Add the MySQL and MongoDB connection settings
- Register ITEM_PIPELINES (the pipeline classes themselves come later; see below)
- Set DEFAULT_REQUEST_HEADERS: to make blocking by anti-scraping measures less likely, we send browser-like request headers
# -*- coding: utf-8 -*-
# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XcSpider'
SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
' Chrome/80.0.3987.149 Safari/537.36',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Enable logging to a file
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True
# MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'
# MongoDB settings
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = 'XCDL'
# Collection that stores the data
MONGODB_SHEETNAME = 'xicidaili'
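Custom values like these can be read anywhere in the project. The sql.py module in step 8 simply imports the settings module; the more idiomatic Scrapy way is from_crawler, sketched below (ExamplePipeline is hypothetical, for illustration only):
# Option 1: plain import (the style used by sql.py below)
from XcSpider import settings
print(settings.MYSQL_HOST)

# Option 2: let Scrapy inject the settings into a pipeline
class ExamplePipeline(object):
    def __init__(self, mysql_host):
        self.mysql_host = mysql_host

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(crawler.settings.get('MYSQL_HOST'))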
6. items.py
- Define fields for the data you want to scrape
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class XcspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class XiciDailiItem(scrapy.Item):
    country = scrapy.Field()
    ipaddress = scrapy.Field()
    port = scrapy.Field()
    serveraddr = scrapy.Field()
    isanonymous = scrapy.Field()
    type = scrapy.Field()
    alivetime = scrapy.Field()
    verificationtime = scrapy.Field()
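A Scrapy item behaves like a dict with a fixed set of keys, which catches typos early; a quick illustration (values made up):
item = XiciDailiItem()
item['ipaddress'] = '1.2.3.4'  # OK: 'ipaddress' is a declared field
# item['ip'] = '1.2.3.4'       # would raise KeyError: 'ip' is not declared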
7. xcdl.py
- Parse the page and extract the fields we need
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem
class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/']

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        # Proxy rows alternate between <tr class="odd"> and <tr class="">
        items_1 = response.xpath('//tr[@class="odd"]')
        items_2 = response.xpath('//tr[@class=""]')
        items = items_1 + items_2
        for item in items:
            # Build a fresh item for every row; reusing a single instance
            # across yields would make all emitted items share the same data
            infos = XiciDailiItem()
            # .get(default='None') returns the first match, or the string
            # 'None' when a cell is empty, replacing the original
            # try/except around extract()[0]
            # Country flag image URL
            infos['country'] = item.xpath('./td[@class="country"]/img/@src').get(default='None')
            # IP address
            infos['ipaddress'] = item.xpath('./td[2]/text()').get(default='None')
            # Port
            infos['port'] = item.xpath('./td[3]/text()').get(default='None')
            # Server location
            infos['serveraddr'] = item.xpath('./td[4]/text()').get(default='None')
            # Anonymity level
            infos['isanonymous'] = item.xpath('./td[5]/text()').get(default='None')
            # Proxy type (HTTP / HTTPS)
            infos['type'] = item.xpath('./td[6]/text()').get(default='None')
            # Alive time
            infos['alivetime'] = item.xpath('./td[7]/text()').get(default='None')
            # Verification time
            infos['verificationtime'] = item.xpath('./td[8]/text()').get(default='None')
            print(dict(infos))
            yield infos
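To iterate on these XPath expressions without re-running the whole spider, Scrapy's interactive shell is convenient (a sketch; the site must be reachable):
scrapy shell 'https://www.xicidaili.com/'
>>> rows = response.xpath('//tr[@class="odd"]') + response.xpath('//tr[@class=""]')
>>> rows[0].xpath('./td[2]/text()').get()  # IP address of the first proxy row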
8. pipelines.py
i. Storing into MongoDB
- With the data extracted, we can now write it to a database. We start with the MongoDB pipeline, which can live directly in the project's pipelines.py file
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from XcSpider import settings
class XcspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class XcPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbname = settings.MONGODB_DBNAME
        sheetname = settings.MONGODB_SHEETNAME
        # Connect to MongoDB
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection that will hold the scraped data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one() is the current pymongo API (the old insert() is deprecated)
        self.post.insert_one(data)
        return item
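After a run, the collection can be spot-checked from a Python shell; a minimal sketch, assuming the MongoDB settings above:
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
collection = client['XCDL']['xicidaili']
print(collection.count_documents({}))  # number of stored proxies
print(collection.find_one())           # peek at one document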
ii. Storing into MySQL
- For MySQL we define our own custom pipelines
- Create a mysqlpipelines folder (or Package) under the project package; its exact location is shown in the tree above
- First, write a small SQL helper module --> sql.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> sql
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc
=================================================='''
import pymysql
from XcSpider import settings
MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB
db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()
class Sql(object):
    @classmethod
    def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
        sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
              ' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s)'
        value = {
            'country': country,
            'ipaddress': ipaddress,
            'port': port,
            'serveraddr': serveraddr,
            'isanonymous': isanonymous,
            'type': type,
            'alivetime': alivetime,
            'verificationtime': verificationtime,
        }
        try:
            cursor.execute(sql, value)
            db.commit()
        except Exception as e:
            print('Insert failed:', e)
            db.rollback()

    # Deduplication: check whether an IP address is already stored
    @classmethod
    def select_name(cls, ipaddress):
        sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
        value = {
            'ipaddress': ipaddress
        }
        cursor.execute(sql, value)
        return cursor.fetchall()[0]
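Note that SELECT EXISTS(...) returns a single row holding 0 or 1, so select_name gives back a one-element tuple; a quick illustration (the IP is made up):
ret = Sql.select_name('1.2.3.4')  # (0,) if unseen, (1,) if already stored
if ret[0] == 1:
    print('ip already stored')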
- Then the pipeline file, mysqlpipelines\pipelines.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> pipelines
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc :
=================================================='''
from XcSpider.items import XiciDailiItem
from .sql import Sql
class XicidailiPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, XiciDailiItem):
            ipaddress = item['ipaddress']
            ret = Sql.select_name(ipaddress)
            if ret[0] == 1:
                print("ip: {} already exists".format(ipaddress))
            else:
                country = item['country']
                ipaddress = item['ipaddress']
                port = item['port']
                serveraddr = item['serveraddr']
                isanonymous = item['isanonymous']
                type = item['type']
                alivetime = item['alivetime']
                verificationtime = item['verificationtime']
                Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
        # Return the item so any later pipelines still receive it
        return item
9. Pipeline settings in settings.py
- These lines are already in the settings.py shown earlier; they are repeated here for reference
- One entry is the MySQL pipeline, the other the MongoDB pipeline
- The priority numbers can be chosen freely; pipelines run in ascending order, so lower numbers run first
- You can enable both pipelines at once, or either one on its own
A small tip: first write an import statement so the IDE resolves the pipeline's dotted path for you, then copy that path into ITEM_PIPELINES, as shown below
# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
10. Run the program
- Now run main.py to start the spider
- The scraped data then appears in both databases. Note that the MySQL table must exist beforehand (the spider does not create it); the xicidaili table used here is:
Create Table: CREATE TABLE `xicidaili` (
`id` int(255) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(1000) NOT NULL,
`ipaddress` varchar(1000) NOT NULL,
`port` int(255) NOT NULL,
`serveraddr` varchar(50) NOT NULL,
`isanonymous` varchar(30) NOT NULL,
`type` varchar(30) NOT NULL,
`alivetime` varchar(30) NOT NULL,
`verificationtime` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=64 DEFAULT CHARSET=utf8;
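To confirm that rows actually landed in MySQL, a quick check from Python (a sketch, reusing the connection settings above):
import pymysql

db = pymysql.connect(user='root', password='root', host='127.0.0.1',
                     port=3306, database='db_xici', charset='utf8')
cursor = db.cursor()
cursor.execute('select count(*) from xicidaili')
print(cursor.fetchone()[0])  # number of rows stored
db.close()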
end. Results
MySQL database
MongoDB database