Crawling Xici Proxies (西刺代理) with Scrapy
1. Create the project
- scrapy startproject XcSpider
2. Generate the spider
- scrapy genspider xcdl xicidaili.com
First mark the project folder as Sources Root in PyCharm, so that imports of your own modules resolve correctly.
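If you are not using PyCharm, a rough equivalent (purely illustrative, not part of the original setup) is to put the project root on sys.path yourself before importing project modules:
import os
import sys

# Make the XcSpider package importable regardless of the working directory
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))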
3. Create a launcher file, main.py
from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())
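Equivalently, the crawl can be started programmatically through Scrapy's CrawlerProcess API; a minimal sketch:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the 'xcdl' spider
process = CrawlerProcess(get_project_settings())
process.crawl('xcdl')
process.start()  # blocks until the crawl finishes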
4. Overall project tree
On Windows, print the tree with:
tree /F
(/F lists files as well as folders)
│ main.py
│ scrapy.cfg
│ xcdl.log
│
└───XcSpider
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├───mysqlpipelines
│ │ pipelines.py
│ │ sql.py
│ │ __init__.py
│ │
│ └───__pycache__
│ pipelines.cpython-36.pyc
│ sql.cpython-36.pyc
│ __init__.cpython-36.pyc
│
├───spiders
│ │ xcdl.py
│ │ __init__.py
│ │
│ └───__pycache__
│ xcdl.cpython-36.pyc
│ __init__.cpython-36.pyc
│
└───__pycache__
items.cpython-36.pyc
pipelines.cpython-36.pyc
settings.cpython-36.pyc
__init__.cpython-36.pyc
5. settings.py configuration
- Add the MySQL and MongoDB connection settings
- Register ITEM_PIPELINES (the pipeline classes themselves come later; see below)
- Set DEFAULT_REQUEST_HEADERS: to make blocking by anti-scraping measures less likely, we send browser-like request headers
# -*- coding: utf-8 -*-
# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XcSpider'
SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
' Chrome/80.0.3987.149 Safari/537.36',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Enable logging to a file
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True
# MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'
# MongoDB settings
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = 'XCDL'
# Collection that stores the data
MONGODB_SHEETNAME = 'xicidaili'
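Custom values like these can be read anywhere in the project. The sql.py module in step 8 simply imports the settings module; the more idiomatic Scrapy way is from_crawler, sketched below (ExamplePipeline is hypothetical, for illustration only):
# Option 1: plain import (the style used by sql.py below)
from XcSpider import settings
print(settings.MYSQL_HOST)

# Option 2: let Scrapy inject the settings into a pipeline
class ExamplePipeline(object):
    def __init__(self, mysql_host):
        self.mysql_host = mysql_host

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(crawler.settings.get('MYSQL_HOST'))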
6. items.py
- Define fields for the data you want to scrape
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class XcspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class XiciDailiItem(scrapy.Item):
    country = scrapy.Field()
    ipaddress = scrapy.Field()
    port = scrapy.Field()
    serveraddr = scrapy.Field()
    isanonymous = scrapy.Field()
    type = scrapy.Field()
    alivetime = scrapy.Field()
    verificationtime = scrapy.Field()
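A Scrapy item behaves like a dict with a fixed set of keys, which catches typos early; a quick illustration (values made up):
item = XiciDailiItem()
item['ipaddress'] = '1.2.3.4'  # OK: 'ipaddress' is a declared field
# item['ip'] = '1.2.3.4'       # would raise KeyError: 'ip' is not declared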
7. xcdl.py
- Parse the page and extract the fields we need
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem
class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/']

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        # Proxy rows alternate between <tr class="odd"> and <tr class="">
        items_1 = response.xpath('//tr[@class="odd"]')
        items_2 = response.xpath('//tr[@class=""]')
        items = items_1 + items_2
        for item in items:
            # Build a fresh item for every row; reusing a single instance
            # across yields would make all emitted items share the same data
            infos = XiciDailiItem()
            # .get(default='None') returns the first match, or the string
            # 'None' when a cell is empty, replacing the original
            # try/except around extract()[0]
            # Country flag image URL
            infos['country'] = item.xpath('./td[@class="country"]/img/@src').get(default='None')
            # IP address
            infos['ipaddress'] = item.xpath('./td[2]/text()').get(default='None')
            # Port
            infos['port'] = item.xpath('./td[3]/text()').get(default='None')
            # Server location
            infos['serveraddr'] = item.xpath('./td[4]/text()').get(default='None')
            # Anonymity level
            infos['isanonymous'] = item.xpath('./td[5]/text()').get(default='None')
            # Proxy type (HTTP / HTTPS)
            infos['type'] = item.xpath('./td[6]/text()').get(default='None')
            # Alive time
            infos['alivetime'] = item.xpath('./td[7]/text()').get(default='None')
            # Verification time
            infos['verificationtime'] = item.xpath('./td[8]/text()').get(default='None')
            print(dict(infos))
            yield infos
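To iterate on these XPath expressions without re-running the whole spider, Scrapy's interactive shell is convenient (a sketch; the site must be reachable):
scrapy shell 'https://www.xicidaili.com/'
>>> rows = response.xpath('//tr[@class="odd"]') + response.xpath('//tr[@class=""]')
>>> rows[0].xpath('./td[2]/text()').get()  # IP address of the first proxy row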
8. pipelines.py
i. Storing into MongoDB
- With the data extracted, we can now write it to a database. We start with the MongoDB pipeline, which can live directly in the project's pipelines.py file
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from XcSpider import settings
class XcspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class XcPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbname = settings.MONGODB_DBNAME
        sheetname = settings.MONGODB_SHEETNAME
        # Connect to MongoDB
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection that will hold the scraped data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one() is the current pymongo API (the old insert() is deprecated)
        self.post.insert_one(data)
        return item
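After a run, the collection can be spot-checked from a Python shell; a minimal sketch, assuming the MongoDB settings above:
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
collection = client['XCDL']['xicidaili']
print(collection.count_documents({}))  # number of stored proxies
print(collection.find_one())           # peek at one document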
ii. Storing into MySQL
- For MySQL we define our own custom pipelines
- Create a mysqlpipelines folder (or Package) under the project package; its exact location is shown in the tree above
- First, write a small SQL helper module --> sql.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> sql
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc
=================================================='''
import pymysql
from XcSpider import settings
MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB
db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()
class Sql(object):
    @classmethod
    def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
        sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
              ' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s)'
        value = {
            'country': country,
            'ipaddress': ipaddress,
            'port': port,
            'serveraddr': serveraddr,
            'isanonymous': isanonymous,
            'type': type,
            'alivetime': alivetime,
            'verificationtime': verificationtime,
        }
        try:
            cursor.execute(sql, value)
            db.commit()
        except Exception as e:
            print('Insert failed:', e)
            db.rollback()

    # Deduplication: check whether an IP address is already stored
    @classmethod
    def select_name(cls, ipaddress):
        sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
        value = {
            'ipaddress': ipaddress
        }
        cursor.execute(sql, value)
        return cursor.fetchall()[0]
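Note that SELECT EXISTS(...) returns a single row holding 0 or 1, so select_name gives back a one-element tuple; a quick illustration (the IP is made up):
ret = Sql.select_name('1.2.3.4')  # (0,) if unseen, (1,) if already stored
if ret[0] == 1:
    print('ip already stored')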
- Then the pipeline file, mysqlpipelines\pipelines.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> pipelines
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc :
=================================================='''
from XcSpider.items import XiciDailiItem
from .sql import Sql
class XicidailiPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, XiciDailiItem):
            ipaddress = item['ipaddress']
            ret = Sql.select_name(ipaddress)
            if ret[0] == 1:
                print("ip: {} already exists".format(ipaddress))
            else:
                country = item['country']
                ipaddress = item['ipaddress']
                port = item['port']
                serveraddr = item['serveraddr']
                isanonymous = item['isanonymous']
                type = item['type']
                alivetime = item['alivetime']
                verificationtime = item['verificationtime']
                Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
        # Return the item so any later pipelines still receive it
        return item
9. Pipeline settings in settings.py
- These lines are already in the settings.py shown earlier; they are repeated here for reference
- One entry is the MySQL pipeline, the other the MongoDB pipeline
- The priority numbers can be chosen freely; pipelines run in ascending order, so lower numbers run first
- You can enable both pipelines at once, or either one on its own
A small tip: first write an import statement so the IDE resolves the pipeline's dotted path for you, then copy that path into ITEM_PIPELINES, as shown below
# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
10. Run the program
- Now run main.py to start the spider
- The scraped data then appears in both databases. Note that the MySQL table must exist beforehand (the spider does not create it); the xicidaili table used here is:
Create Table: CREATE TABLE `xicidaili` (
`id` int(255) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(1000) NOT NULL,
`ipaddress` varchar(1000) NOT NULL,
`port` int(255) NOT NULL,
`serveraddr` varchar(50) NOT NULL,
`isanonymous` varchar(30) NOT NULL,
`type` varchar(30) NOT NULL,
`alivetime` varchar(30) NOT NULL,
`verificationtime` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=64 DEFAULT CHARSET=utf8;
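To confirm that rows actually landed in MySQL, a quick check from Python (a sketch, reusing the connection settings above):
import pymysql

db = pymysql.connect(user='root', password='root', host='127.0.0.1',
                     port=3306, database='db_xici', charset='utf8')
cursor = db.cursor()
cursor.execute('select count(*) from xicidaili')
print(cursor.fetchone()[0])  # number of rows stored
db.close()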
end. Results
MySQL database
MongoDB database