
The Fifth Scrapy Crawler (Scraping Detail Pages and Writing to SQLite)


Learning goals:

  • Implement pagination while crawling.
  • Scrape the content of detail pages.
  • Create an SQLite database and data table with the Navicat GUI.
  • Walk through the full Scrapy workflow, from creating a spider to writing the results into an SQLite database.


Test environment:
Windows 7 Ultimate
Python 3.5.2 (Anaconda3 4.2.0 64-bit)

1. Create the project and spider
Create a project named teachers, then use the command below to generate a teacher.py file under spiders:

scrapy genspider teacher http://ggglxy.scu.edu.cn

Scrapy uses its "basic" template to generate teacher.py automatically, with the following code:

# -*- coding: utf-8 -*-
import scrapy


class TeacherSpider(scrapy.Spider):
    name = 'teacher'
    allowed_domains = ['http://ggglxy.scu.edu.cn']
    start_urls = ['http://http://ggglxy.scu.edu.cn/']

    def parse(self, response):
        pass

2. Set up items.py to define the data to collect

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class TeachersItem(scrapy.Item):
  # define the fields for your item here like:
  name = scrapy.Field()      # name
  position = scrapy.Field()  # professional title
  workfor = scrapy.Field()   # department
  email = scrapy.Field()     # email address
  link = scrapy.Field()      # link to the detail page
  desc = scrapy.Field()      # profile / bio

3. Write the teacher.py spider

Note: in expressions such as the one assigned to item['name'], there is no leading // before div; the XPath is relative to each li node selected in the loop.

import scrapy
import hashlib

from scrapy.selector import Selector
from teachers.items import *


class Teachers(scrapy.Spider):
  name="tspider"
  allowed_domains=["ggglxy.scu.edu.cn"]
  start_urls=[
    'http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1',
  ]

  def parse(self,response):
    for teacher in response.xpath("//ul[@class='teachers_ul mt20 cf']/li"):
      item=TeachersItem()
      item['name']=teacher.xpath("div[@class='r fr']/h3/text()").extract_first()
      item['position']=teacher.xpath("div[@class='r fr']/p/text()").extract_first()
      item['email']=teacher.xpath("div[@class='r fr']/div[@class='desc']/p[2]/text()").extract_first()
      item['workfor']=teacher.xpath("div[@class='r fr']/div[@class='desc']/p[1]/text()").extract_first()
      href=teacher.xpath("div[@class='l fl']/a/@href").extract_first()
      request=scrapy.http.Request(response.urljoin(href),callback=self.parse_desc)
      request.meta['item']=item
      yield request


    # li[last()-1] in the pager holds the "next page" link; stop when it is missing
    next_page=response.xpath("//div[@class='pager cf tc pt10 pb10 mobile_dn']/li[last()-1]/a/@href").extract_first()
    if next_page:
        next_page=response.urljoin(next_page)
        yield scrapy.http.Request(next_page,callback=self.parse)

  def parse_desc(self,response):
    item=response.meta['item']
    item['link']=response.url
    item['desc']=response.xpath("//div[@class='desc']/text()").extract()
    yield item

About request.meta:
Scrapy is callback-driven: the handling of a page is handed over to the callback of the next request, and meta is used to pass data along with that request.

Request(url=item_details_url, meta={'item': item},callback=self.parse_details)

Both simple values and objects can be passed.
Several values can also be passed at once:

yield Request(url, meta={'item': item, 'rdt': rdt, 'comments':cmt,'rewards':rewards,'total': total, 'curpage': cur}, callback=self.parse)
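
A minimal, self-contained sketch of the pattern (the spider name and URL below are placeholders, not part of this project): the first callback attaches a partially built item to the request via meta, and the follow-up callback reads it back from response.meta and completes it.

import scrapy


class MetaDemoSpider(scrapy.Spider):
    name = "meta_demo"                         # illustrative only
    start_urls = ["http://example.com/list"]   # placeholder URL

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            item = {"link": None}              # a plain dict also works as an item
            request = scrapy.Request(response.urljoin(href), callback=self.parse_details)
            request.meta["item"] = item        # attach the partially built item
            yield request

    def parse_details(self, response):
        item = response.meta["item"]           # read it back in the follow-up callback
        item["link"] = response.url
        yield item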

Let's look at how the corresponding code behaves:
[Screenshot: Snap1]
As you can see, class="fl" corresponds to the block on the left, which is further split into the "l fl" and "r fr" parts.
[Screenshot: Snap2]
We can test the code with print statements:
[Screenshot: Snap3]

4. Run the spider

scrapy crawl tspider -o teachers.json -s FEED_EXPORT_ENCODING=utf-8
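
The encoding can also be set once in settings.py instead of being passed with -s on every run:

FEED_EXPORT_ENCODING = 'utf-8'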

5. Check the results

[Screenshot: Snap295]

Reference: https://www.jianshu.com/p/ad6bf3f2a883

Below is the workflow and code from my own run:

1. Create a project named tech2

scrapy startproject tech2

2. Generate a teacher2.py file with the scrapy genspider command (generic usage shown below).

cd tech2
scrapy genspider example example.com
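
For this project the actual call would have been along these lines (the spider name and domain are taken from the files shown below; pass the bare domain, as explained next):

scrapy genspider teacher2 ggglxy.scu.edu.cn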

Note: in the generated file I had

allowed_domains = ['http://ggglxy.scu.edu.cn']

Because of that extra http://, pagination kept failing later on, and it took me a long time to realize that the scheme was the cause. Remember not to include http:// in this line.
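
For reference, the wrong and the right form side by side (the corrected value is the one used in teacher2.py below):

# wrong: with the scheme included, followed requests are filtered as offsite
allowed_domains = ['http://ggglxy.scu.edu.cn']
# right: the bare domain only
allowed_domains = ['ggglxy.scu.edu.cn']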

3. Edit items.py

import scrapy
class Tech2Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    rank = scrapy.Field()
    depart = scrapy.Field()
    email = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

4. The teacher2.py spider file
When scraping desc, the extracted text contained a newline ("\n") that caused trouble when writing to the database; wrapping the expression in normalize-space() solved it.

import scrapy
from tech2.items import Tech2Item

class Teacher2Spider(scrapy.Spider):
    name = 'teacher2'
    allowed_domains = ['ggglxy.scu.edu.cn']
    start_urls = ['http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1/']

    def parse(self, response):
        for teacher in response.xpath("//ul[@class='teachers_ul mt20 cf']/li"):
            item = Tech2Item()
            item['name'] = teacher.xpath("div[@class='r fr']/h3/text()").extract_first()
            item['rank'] = teacher.xpath("div[@class='r fr']/p/text()").extract_first()
            item['email'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[2]/text()").extract_first()
            item['depart'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[1]/text()").extract_first()
            href = teacher.xpath("div[@class='l fl']/a/@href").extract_first()
            request = scrapy.http.Request(response.urljoin(href), callback=self.parse_desc)
            request.meta['item'] = item
            yield request

        next_page = response.xpath("//div[@class='pager cf tc pt10 pb10 mobile_dn']/li[last()-1]/a/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_desc(self, response):
        item = response.meta['item']
        item['link'] = response.url
        item['desc'] = response.xpath("normalize-space(//div[@class='desc']/text())").extract_first()
        yield item

5. Edit pipelines.py
This file can be reused as a template: for a new project you only need to change the database file name (tbase.sqlite here) and the table name.

import sqlite3


class Tech2Pipeline(object):
    def open_spider(self, spider):
        self.con = sqlite3.connect("tbase.sqlite")
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        print(spider.name, 'pipelines')
        insert_sql = "insert into tbase (name,rank,depart,email,desc) values('{}','{}','{}','{}','{}')".format(item['name'], item['rank'],item['depart'],item['email'],item['desc'])
        print(insert_sql)  # printed to make debugging easier
        self.cu.execute(insert_sql)
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()
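
One caution about this template: building the SQL with str.format() breaks as soon as a value contains a single quote, and it is open to SQL injection. A safer sketch of the same pipeline using sqlite3 placeholders (same table and column names as above; the existing tbase table is assumed):

import sqlite3


class Tech2Pipeline(object):
    def open_spider(self, spider):
        self.con = sqlite3.connect("tbase.sqlite")
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        # placeholders let sqlite3 quote and escape the values itself;
        # column names are double-quoted because desc is also an SQL keyword
        insert_sql = ('insert into tbase ("name", "rank", "depart", "email", "desc") '
                      'values (?, ?, ?, ?, ?)')
        self.cu.execute(insert_sql, (item['name'], item['rank'], item['depart'],
                                     item['email'], item['desc']))
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()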

6. Enable the pipeline in settings.py by uncommenting the lines below (the CTRL+/ shortcut toggles comments).

ITEM_PIPELINES = {
   'tech2.pipelines.Tech2Pipeline': 300,
}

7. Create an SQLite data table
[Screenshot: Snap309]
[Screenshot: Snap310]
[Screenshot: Snap311]
Right-click and choose "Add Field" in the dialog to add five fields to the table.
[Screenshot: Snap312]
Click Save and enter the table name.
[Screenshot: Snap313]
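
If you prefer not to click through Navicat, a minimal sketch that creates an equivalent table from Python (the column names follow the pipeline's insert statement; the exact column types chosen in Navicat are an assumption):

import sqlite3

# one-off helper: create tbase.sqlite with the five columns the pipeline expects
con = sqlite3.connect("tbase.sqlite")
con.execute('CREATE TABLE IF NOT EXISTS tbase '
            '("name" TEXT, "rank" TEXT, "depart" TEXT, "email" TEXT, "desc" TEXT)')
con.commit()
con.close()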

8. Run the spider
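The spider is started from the project directory, using the name defined in teacher2.py:

scrapy crawl teacher2
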
Open the tbase table and you can see that the data has been written to the database.
[Screenshot: Snap314]

Appendix:
Another version of teacher2.py (this one extracts every field directly from the list page with zip() and does not follow the detail pages):

import scrapy
from tech2.items import Tech2Item


class Teacher2Spider(scrapy.Spider):
    name = 'teacher2'
    allowed_domains = ['ggglxy.scu.edu.cn']
    start_urls = ['http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1/']

    def parse(self, response):
        # column-wise extraction: each list holds one field for every teacher on the page
        name_list = response.xpath("//h3[@class='mb10']/text()").extract()
        rank_list = response.xpath("//p[@class='color_main f14']/text()").extract()
        depart_list = response.xpath("//div[@class='desc']/p[1]/text()").extract()
        email_list = response.xpath("//div[@class='desc']/p[2]/text()").extract()
        url_list = response.xpath("//div[@class='l fl']/a/@href").extract()
        for name, rank, depart, email, url in zip(name_list, rank_list, depart_list, email_list, url_list):
            item = Tech2Item()      # create a fresh item for each teacher
            item['name'] = name
            item['rank'] = rank
            item['depart'] = depart
            item['email'] = email
            item['link'] = url      # the item field is 'link'; 'url' would raise a KeyError
            yield item

Originally published at: 蜗牛博客 (Snail Blog)
URL: http://www.snailtoday.com
Please respect copyright: when reposting, credit the author, link to the original source, and keep this notice.
