Learning objectives:
- Learn how to implement pagination (following "next page" links) while crawling.
- Learn how to crawl the content of detail pages.
- Learn how to create an SQLite database and tables with the Navicat GUI.
- Learn the complete Scrapy workflow, from creating a spider to writing the results into an SQLite database.
Test environment:
Windows 7 Ultimate
Python 3.5.2 (Anaconda3 4.2.0, 64-bit)
1. Create the project and the spider
Create a project named teachers, then use the following command under the spiders directory to generate a spider file:
scrapy genspider teacher http://ggglxy.scu.edu.cn
Scrapy automatically uses its default "basic" template, and the generated file contains the following code:
# -*- coding: utf-8 -*-
import scrapy


class Teacher2Spider(scrapy.Spider):
    name = 'teacher2'
    allowed_domains = ['http://ggglxy.scu.edu.cn']
    start_urls = ['http://http://ggglxy.scu.edu.cn/']

    def parse(self, response):
        pass
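Note that because a full URL (with the http:// scheme) was passed to genspider, the generated allowed_domains still contains the scheme and start_urls even ends up with a doubled http://http://. Both have to be corrected by hand before the spider can run; judging from the working spider shown later, the intended values are roughly:

allowed_domains = ['ggglxy.scu.edu.cn']        # domain only, no scheme
start_urls = ['http://ggglxy.scu.edu.cn/']     # a single, valid URL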
2. Define the data to collect in items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TeachersItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()      # teacher's name
    position = scrapy.Field()  # professional title
    workfor = scrapy.Field()   # department
    email = scrapy.Field()     # email address
    link = scrapy.Field()      # URL of the detail page
    desc = scrapy.Field()      # profile / biography text
3. Write the spider file teacher.py
Note: in the line that sets item['name'], the XPath starts with div and has no leading //; it is a relative expression, evaluated against the current teacher node selected in the loop.
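The distinction matters: an expression starting with // restarts the search from the document root, while a relative expression only looks inside the node it is called on. A quick illustration from a scrapy shell session (hypothetical output, assuming the list-page structure used in this tutorial):

# scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1"
teacher = response.xpath("//ul[@class='teachers_ul mt20 cf']/li")[0]

# relative path: searches only inside this <li>, returns this teacher's name
teacher.xpath("div[@class='r fr']/h3/text()").extract_first()

# absolute path: // goes back to the document root, so this matches the <h3>
# of every teacher on the page and returns the first one found
teacher.xpath("//div[@class='r fr']/h3/text()").extract_first()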
import scrapy
import hashlib
from scrapy.selector import Selector
from teachers.items import *


class Teachers(scrapy.Spider):
    name = "tspider"
    allowed_domains = ["ggglxy.scu.edu.cn"]
    start_urls = [
        'http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1',
    ]

    def parse(self, response):
        for teacher in response.xpath("//ul[@class='teachers_ul mt20 cf']/li"):
            item = TeachersItem()
            item['name'] = teacher.xpath("div[@class='r fr']/h3/text()").extract_first()
            item['position'] = teacher.xpath("div[@class='r fr']/p/text()").extract_first()
            item['email'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[2]/text()").extract_first()
            item['workfor'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[1]/text()").extract_first()
            # follow the link to the detail page and pass the partially filled item along via meta
            href = teacher.xpath("div[@class='l fl']/a/@href").extract_first()
            request = scrapy.http.Request(response.urljoin(href), callback=self.parse_desc)
            request.meta['item'] = item
            yield request
        # pagination: the second-to-last pager link is "next page", the last one is "last page"
        next_page = response.xpath("//div[@class='pager cf tc pt10 pb10 mobile_dn']/li[last()-1]/a/@href").extract_first()
        last_page = response.xpath("//div[@class='pager cf tc pt10 pb10 mobile_dn']/li[last()]/a/@href").extract_first()
        if last_page:
            next_page = "http://ggglxy.scu.edu.cn/" + next_page
            yield scrapy.http.Request(next_page, callback=self.parse)

    def parse_desc(self, response):
        item = response.meta['item']
        item['link'] = response.url
        item['desc'] = response.xpath("//div[@class='desc']/text()").extract()
        yield item
About request.meta:
Scrapy is callback-based: handling of a response is delegated to the callback of the request that produced it, and the meta dictionary is used to pass data along with a request.
Request(url=item_details_url, meta={'item': item},callback=self.parse_details)
Both simple values and whole objects can be passed this way.
Several values can also be passed at once:
yield Request(url, meta={'item': item, 'rdt': rdt, 'comments':cmt,'rewards':rewards,'total': total, 'curpage': cur}, callback=self.parse)
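On the receiving side, the callback named in the Request reads the values back from response.meta, exactly as parse_desc does in the spider above; a minimal sketch of the pattern:

def parse_details(self, response):
    item = response.meta['item']   # the item that was attached when the Request was built
    item['link'] = response.url
    yield item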
Looking at the page's HTML, each teacher li element is split into a left block (class "l fl", which holds the link to the teacher's detail page) and a right block (class "r fr", which holds the name, title, department and email). While developing, intermediate results of the XPath expressions can be checked simply by printing them from parse().
4. Run the spider
scrapy crawl tspider -o teachers.json -s FEED_EXPORT_ENCODING=utf-8
The -o option exports the scraped items to teachers.json, and -s FEED_EXPORT_ENCODING=utf-8 makes Chinese text appear as readable UTF-8 instead of \uXXXX escape sequences.
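The encoding can also be set once in settings.py instead of being passed with -s on every run (FEED_EXPORT_ENCODING is a standard Scrapy setting):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'   # export feeds as UTF-8 rather than ASCII-escaped text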
5. Check the results
Reference: https://www.jianshu.com/p/ad6bf3f2a883
Below are the workflow and code from my own hands-on run:
1. Create a project named tech2
scrapy startproject tech2
2. Generate teacher2.py with the scrapy genspider command:
cd tech2
scrapy genspider teacher2 http://ggglxy.scu.edu.cn
A note on allowed_domains: in the generated file I had
allowed_domains = ['http://ggglxy.scu.edu.cn']
Because of that extra http://, the pagination requests kept failing, and it took quite a while to trace the problem back to the scheme. allowed_domains expects bare domain names, not URLs, so remember not to include http:// here.
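The reason is Scrapy's offsite filtering: requests whose host does not match an entry in allowed_domains are dropped by the offsite middleware, and an entry that still carries the scheme never matches any host, so every follow-up page request gets filtered and is easy to miss in the logs. A minimal sketch of the wrong and the corrected setting:

class Teacher2Spider(scrapy.Spider):
    name = 'teacher2'
    # wrong: a URL, not a domain -- follow-up requests are filtered as offsite
    # allowed_domains = ['http://ggglxy.scu.edu.cn']
    # correct: the bare domain name
    allowed_domains = ['ggglxy.scu.edu.cn']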
3. Modify items.py
import scrapy


class Tech2Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    rank = scrapy.Field()
    depart = scrapy.Field()
    email = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
4. The spider file teacher2.py
When scraping desc, the extracted text came with a newline character ("\n") in it, which caused trouble when the value was written into the database; wrapping the XPath expression in normalize-space() solved it (see the quick shell check after the code below).
import scrapy
from tech2.items import Tech2Item


class Teacher2Spider(scrapy.Spider):
    name = 'teacher2'
    allowed_domains = ['ggglxy.scu.edu.cn']
    start_urls = ['http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1/']

    def parse(self, response):
        for teacher in response.xpath("//ul[@class='teachers_ul mt20 cf']/li"):
            item = Tech2Item()
            item['name'] = teacher.xpath("div[@class='r fr']/h3/text()").extract_first()
            item['rank'] = teacher.xpath("div[@class='r fr']/p/text()").extract_first()
            item['email'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[2]/text()").extract_first()
            item['depart'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[1]/text()").extract_first()
            href = teacher.xpath("div[@class='l fl']/a/@href").extract_first()
            request = scrapy.http.Request(response.urljoin(href), callback=self.parse_desc)
            request.meta['item'] = item
            yield request
        next_page = response.xpath("//div[@class='pager cf tc pt10 pb10 mobile_dn']/li[last()-1]/a/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_desc(self, response):
        item = response.meta['item']
        item['link'] = response.url
        item['desc'] = response.xpath("normalize-space(//div[@class='desc']/text())").extract_first()
        yield item
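To see what normalize-space() actually changes, the two expressions can be compared in a scrapy shell session on one of the detail pages (a hypothetical check; the exact text depends on the page):

# scrapy shell "<URL of one teacher's detail page>"
response.xpath("//div[@class='desc']/text()").extract_first()
# -> the raw text node, still containing the surrounding whitespace and "\n"

response.xpath("normalize-space(//div[@class='desc']/text())").extract_first()
# -> the same text with leading/trailing whitespace stripped and internal
#    runs of whitespace collapsed to single spaces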
5. Modify pipelines.py
This file can be reused as a template: the next time you only need to change the database file name (tbase.sqlite here) and the table name.
import sqlite3


class Tech2Pipeline(object):
    def open_spider(self, spider):
        self.con = sqlite3.connect("tbase.sqlite")
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        print(spider.name, 'pipelines')
        insert_sql = "insert into tbase (name,rank,depart,email,desc) values('{}','{}','{}','{}','{}')".format(
            item['name'], item['rank'], item['depart'], item['email'], item['desc'])
        print(insert_sql)  # for easier debugging
        self.cu.execute(insert_sql)
        self.con.commit()
        return item

    def close_spider(self, spider):  # must be named close_spider for Scrapy to call it
        self.con.close()
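The string-formatted SQL above breaks as soon as a field contains a single quote and is generally unsafe; a sketch of the same insert written as a parameterized query (same table and column names as above, with desc quoted since DESC is also an SQL keyword):

    def process_item(self, item, spider):
        insert_sql = 'insert into tbase (name, rank, depart, email, "desc") values (?, ?, ?, ?, ?)'
        self.cu.execute(insert_sql, (item['name'], item['rank'], item['depart'],
                                     item['email'], item['desc']))
        self.con.commit()
        return item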
6. Enable the pipeline in settings.py by uncommenting the following lines (the CTRL+/ shortcut toggles comments in most editors):
ITEM_PIPELINES = {
    'tech2.pipelines.Tech2Pipeline': 300,
}
7. Create an SQLite table
In Navicat, right-click and use "Add Field" in the dialog to add the five fields the pipeline inserts (name, rank, depart, email, desc).
Click Save and enter the table name (tbase).
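For anyone who prefers to create the table in code rather than through the Navicat GUI, a minimal equivalent sketch (assuming the same tbase.sqlite file and the five columns used by the pipeline; the TEXT column types are an assumption):

import sqlite3

con = sqlite3.connect("tbase.sqlite")
con.execute(
    'create table if not exists tbase ('
    'name TEXT, rank TEXT, depart TEXT, email TEXT, "desc" TEXT)'
)
con.commit()
con.close()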
8. Run the spider
Run scrapy crawl teacher2 (teacher2 is the name defined in the spider); afterwards, open the tbase table and the scraped data is there.
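The result can also be checked from Python instead of Navicat, for example with a quick query against the same database file (a sketch; column names as created above):

import sqlite3

con = sqlite3.connect("tbase.sqlite")
for row in con.execute("select name, rank, depart, email from tbase limit 5"):
    print(row)
con.close()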
Addendum:
An alternative version of teacher2.py:
import scrapy
from tech2.items import Tech2Item


class Teacher2Spider(scrapy.Spider):
    name = 'teacher2'
    allowed_domains = ['ggglxy.scu.edu.cn']
    start_urls = ['http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18&page_1_page=1/']

    def parse(self, response):
        # this variant pulls the parallel lists straight off the listing page and zips
        # them together; it does not follow the detail pages or the pagination links
        name_list = response.xpath("//h3[@class='mb10']/text()").extract()
        rank_list = response.xpath("//p[@class='color_main f14']/text()").extract()
        depart_list = response.xpath("//div[@class='desc']/p[1]/text()").extract()
        email_list = response.xpath("//div[@class='desc']/p[2]/text()").extract()
        url_list = response.xpath("//div[@class='l fl']/a/@href").extract()
        for i, j, k, l, m in zip(name_list, rank_list, depart_list, email_list, url_list):
            item = Tech2Item()      # create a fresh item for every teacher
            item['name'] = i
            item['rank'] = j
            item['depart'] = k
            item['email'] = l
            item['link'] = m        # Tech2Item defines 'link', not 'url'
            yield item
Originally published at: 蜗牛博客 (Snail Blog)
URL: http://www.snailtoday.com
Please respect copyright: when reposting, credit the author and the original source with a link, and keep this notice.