
《Python网络爬虫实战》 Notes (XPath)


I. Python naming rules

II. XPath usage:


Note that XPath indexes start at 1, not 0.
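For example, with lxml (a tiny made-up document):

from lxml import etree

dom_tree = etree.HTML('<ul><li>first</li><li>second</li></ul>')
# li[1] selects the FIRST <li>, unlike Python's 0-based indexing
print(dom_tree.xpath('//ul/li[1]/text()'))   # ['first']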

Scraping images:
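A minimal sketch of grabbing image URLs with requests and lxml (the page URL and selector are assumptions for illustration):

import requests
from lxml import etree

html = requests.get('http://example.com/gallery').content
dom_tree = etree.HTML(html)
# @src yields the attribute value directly, one string per matching <img>
img_urls = dom_tree.xpath('//img/@src')
print(img_urls)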

Tip:
What if all you get back is a bracketed list of elements (something like [<Element a at 0x...>])? You can inspect each element's tag, attributes and text:

links = dom_tree.xpath("//a[@class='download']")  # locate the nodes in the document; returns a list
for index in range(len(links)):
    # links[index] is an lxml Element object
    if (index % 2) == 0:
        print(links[index].tag)
        print(links[index].attrib)
        print(links[index].text)

For example, if the URL is a magnet link, then

print(links[index].tag)     # the tag name: a
print(links[index].attrib)  # the a tag's attributes, href and class
print(links[index].text)    # the a tag's text content

produce the following output:

a
{'href': 'magnet:?xt=urn:btih:7502edea0dfe9c2774f95118db3208a108fe10ca', 'class': 'download'}
磁力链接

Reference: https://www.cnblogs.com/z-x-y/p/8260213.html

Extracting hyperlinks

//a[@class="text--link"]/@href

//span[@class='l fl']/a/@href   # extract the hyperlink
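A hypothetical usage sketch with lxml (dom_tree is an already-parsed document as above):

hrefs = dom_tree.xpath('//a[@class="text--link"]/@href')
for href in hrefs:
    print(href)   # each item is the href string itself, not an Element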

Extracting multiple tags
For example, if an article body contains h2, h3 and p tags, they can all be matched with the expression below. However, when I tried it, it collected all the h2 tags first and then all the p tags, so the original document order was lost.
So in the end I fell back on regular expressions (an order-preserving alternative is sketched after the expression).

xpath('//*[@id="xxx"]/h2 | //*[@id="xxx"]/h3 | //*[@id="xxx"]/p')
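As an alternative to falling back on regular expressions, a sketch that walks the container's children in document order and keeps only the wanted tags (lxml, same hypothetical id "xxx"):

container = dom_tree.xpath('//*[@id="xxx"]')[0]
parts = []
for child in container:                     # children come back in document order
    if child.tag in ('h2', 'h3', 'p'):
        parts.append(child.xpath('string(.)'))
text = '\n'.join(parts)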

Special fields

ip.xpath('string(td[5])')[0].extract().strip()  # get all the text in the 5th cell

ip.xpath('td[8]/div[@class="bar"]/@title').re(r'\d{0,2}\.\d{0,}')[0]  # match the number inside <div class="bar" title="0.0885秒">

If the image URLs live in different places in the page, join the XPath expressions with the "|" operator.
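For instance (the div classes here are assumptions):

img_urls = response.xpath('//div[@class="photo"]/img/@src | //div[@class="gallery"]/img/@src')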

Common problems:
(1) If the Chrome XPath plugin shows a value for q_urls = response.xpath('//div[@class="line content"]') but the same expression returns nothing in code, use the following:

import requests
from lxml import etree

def get_detail(url):
    html = requests.get(url, headers=headers)   # headers is assumed to be defined elsewhere
    response = etree.HTML(html.content)
    q_urls = response.xpath('//div[@class="line content"]')
    result = q_urls[0].xpath('string(.)').strip()
    return result

(2) Viewing an element

content = selector.xpath('//div[@class="metarial"]')[0]

Reference:
https://www.cnblogs.com/just-do/p/9778941.html

(3) Garbled characters
If the Chinese text you get via XPath comes out garbled, the following fixes it:

content = etree.tostring(content, encoding="utf-8").decode('utf-8')
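In context, a minimal sketch reusing the element selected in (2):

from lxml import etree

content = selector.xpath('//div[@class="metarial"]')[0]
# serialize the element back to UTF-8 bytes, then decode into a normal str
content = etree.tostring(content, encoding="utf-8").decode('utf-8')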

Reference:
https://www.cnblogs.com/Rhythm-/p/11374832.html

Chapter 5: The Scrapy crawler framework
1. The __init__.py file is an empty file; it turns its parent directory into a package that Python can import.
2. items.py decides which items to scrape, wuhanmoviespider.py decides how to crawl them, settings.py decides who handles the scraped content, and pipelines.py decides how that content is processed.
3. Given HTML like the following,

<h3>武汉<font color="#0066cc">今天</font>天气</h3>

the selector should be h3//text() rather than h3/text(), so that the text inside the nested <font> tag is included as well.
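A quick check of the difference with lxml (output shown in the comments):

from lxml import etree

h3 = etree.HTML('<h3>武汉<font color="#0066cc">今天</font>天气</h3>')
print(h3.xpath('//h3/text()'))    # ['武汉', '天气']  -- the <font> text is skipped
print(h3.xpath('//h3//text()'))   # ['武汉', '今天', '天气']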

4. Outputting as JSON
The ITEM_PIPELINES entry in settings.py is a dictionary, and a dictionary can have elements added to it, so you can write your own Python file and plug it in.
(1) Create a pipelines2json.py file

import time
import json
import codecs

class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item

(2) Edit settings.py and add pipelines2json to ITEM_PIPELINES (the number is the priority; lower values run first):

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
}

5. Database and table commands

We actually covered this earlier; there the existing pipeline was modified directly, whereas here a new pipelines2mysql.py is created to write to the database.

The main point here is to note down the database commands.

# Create the database scrapyDB with utf8 encoding; every statement ends with ';'
CREATE DATABASE scrapyDB CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

# Switch to the database we just created:
use scrapyDB;

# Create the fields we need; they must match the fields in our code one-to-one so the SQL statements are easy to write
CREATE TABLE weather(
id INT AUTO_INCREMENT,
date char(24),
week char(24),
img char(128),
temperature char(24),
weather char(24),
wind char(24),
PRIMARY KEY(id)
)ENGINE=InnoDB DEFAULT CHARSET='utf8';

Take a look at the structure of the weather table:

show columns from weather;   # or: desc weather;

6. Adding a User-Agent

Scrapy does ship with default headers, but they differ from a browser's headers. Some sites check the headers, so Scrapy needs to be given browser-like headers.

Of course, the following approach also works:

from getProxy import userAgents

BOT_NAME = 'getProxy'

SPIDER_MODULES = ['getProxy.spiders']

NEWSPIDER_MODULE = 'getProxy.spiders'

USER_AGENT = userAgents.pcUserAgent.get('Firefox 4.0.1 – Windows')

ITEM_PIPELINES = {'getProxy.pipelines.GetProxyPipeline': 300}

All you need to do is add a USER_AGENT entry to settings.py.

Here USER_AGENT is set by importing the userAgents module; the code for the userAgents module is given below:

pcUserAgent = {
"safari 5.1 – MAC":"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"safari 5.1 – Windows":"User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"IE 9.0":"User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
"IE 8.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"IE 7.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"IE 6.0":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Firefox 4.0.1 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Firefox 4.0.1 – Windows":"User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera 11.11 – MAC":"User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera 11.11 – Windows":"User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Chrome 17.0 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Maxthon":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Tencent TT":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"The World 2.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"The World 3.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"sogou 1.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"360":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Avant":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Green Browser":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}

mobileUserAgent = {
"iOS 4.33 – iPhone":"User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPod Touch":"User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPad":"User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Android N1":"User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android QQ":"User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android Opera ":"User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Android Pad Moto Xoom":"User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"BlackBerry":"User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"WebOS HP Touchpad":"User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Nokia N97":"User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Windows Phone Mango":"User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UC":"User-Agent: UCWEB7.0.2.37/28/999",
"UC standard":"User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
"UCOpenwave":"User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
"UC Opera":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}

7. Defeating User-Agent blocking
The approach is quite simple:
(1) Create a middlewares directory containing an __init__.py, a resource file resource.py, and a middleware file customUserAgent.py.
(2) In customUserAgent.py, pick a User-Agent at random from resource.py and use it as Scrapy's User-Agent (see the sketch below).
(3) Edit settings.py and add RandomUserAgent to DOWNLOADER_MIDDLEWARES.
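A minimal sketch of such a middleware, assuming resource.py defines the UserAgents list from Chapter 6 (the module and class names here are my own, not the book's):

# middlewares/customUserAgent.py
import random
from .resource import UserAgents

class RandomUserAgent(object):
    def process_request(self, request, spider):
        # give every outgoing request a randomly chosen browser User-Agent
        request.headers.setdefault('User-Agent', random.choice(UserAgents))

And in settings.py (assuming the project is called getProxy as above):

DOWNLOADER_MIDDLEWARES = {
    'getProxy.middlewares.customUserAgent.RandomUserAgent': 542,
    # disable Scrapy's built-in UserAgentMiddleware so it does not overwrite ours
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}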

8. Defeating IP blocking
The approach is much the same as above:
(1) Add a PROXIES list to resource.py.
(2) Create customProxy.py so that Scrapy uses a random proxy from the IP pool while crawling (see the sketch below).
(3) Edit settings.py accordingly.

See also: https://www.cnblogs.com/hqutcy/p/7341212.html
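A minimal sketch of the proxy middleware, again with assumed module and class names, and a PROXIES list of "host:port" strings as in resource.py below:

# middlewares/customProxy.py
import random
from .resource import PROXIES

class RandomProxy(object):
    def process_request(self, request, spider):
        # route every request through a randomly chosen HTTP proxy
        request.meta['proxy'] = 'http://' + random.choice(PROXIES)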

Chapter 6: A hand-written crawler template

Project file layout:

Code:
getTrendsMV.py

from bs4 import BeautifulSoup
import urllib.request
import time
from mylog import MyLog as mylog
import resource
import random


class Item(object):
    top_num = None  # rank
    score = None  # score
    mvname = None  # MV name
    singer = None  # singer
    releasetime = None  # release time


class GetMvList(object):
    """All of the data comes from www.yinyuetai.com"""
    def __init__(self):
        self.urlbase = 'http://vchart.yinyuetai.com/vchart/trends?'
        self.areasDic = {
                         'ALL': '总榜',
                         'ML': '内地篇',
                         'HT': '港台篇',
                         'US': '欧美篇',
                         'KR': '韩国篇',
                         'JP': '日本篇',
                         }
        self.log = mylog()
        self.geturls()

    def geturls(self):
        # build the URL pool
        areas = [i for i in self.areasDic.keys()]
        pages = [str(i) for i in range(1, 4)]
        for area in areas:
            urls = []
            for page in pages:
                urlEnd = 'area=' + area + '&page=' + page
                url = self.urlbase + urlEnd
                urls.append(url)
                self.log.info('Added URL {} to the URL pool'.format(url))
            self.spider(area, urls)

    def getResponseContent(self, url):
        """fetch the page and return its content"""
        fakeHeaders = {"User-Agent": self.getRandomHeaders()}
        request = urllib.request.Request(url, headers=fakeHeaders)
        proxy = urllib.request.ProxyHandler({'http': 'http://' + self.getRandomProxy()})
        opener = urllib.request.build_opener(proxy)
        urllib.request.install_opener(opener)
        try:
            response = urllib.request.urlopen(request)
            html = response.read().decode('utf8')
            time.sleep(1)
        except Exception as e:
            self.log.error('Failed to fetch data from URL: {}'.format(url))
            return ''
        else:
            self.log.info('Fetched data from URL: {}'.format(url))
            return html

    def getRandomProxy(self):
        # pick a random proxy address
        return random.choice(resource.PROXIES)

    def getRandomHeaders(self):
        # pick a random User-Agent header
        return random.choice(resource.UserAgents)

    def spider(self, area, urls):
        items = []
        for url in urls:
            responseContent = self.getResponseContent(url)
            if not responseContent:
                continue
            soup = BeautifulSoup(responseContent, 'lxml')
            tags = soup.find_all('li', attrs={'name': 'dmvLi'})
            for tag in tags:
                item = Item()
                item.top_num = tag.find('div', attrs={'class': 'top_num'}).get_text()
                if tag.find('h3', attrs={'class': 'desc_score'}):
                    item.score = tag.find('h3', attrs={'class': 'desc_score'}).get_text()
                else:
                    item.score = tag.find('h3', attrs={'class': 'asc_score'}).get_text()
                item.mvname = tag.find('a', attrs={'class': 'mvname'}).get_text()
                item.singer = tag.find('a', attrs={'class': 'special'}).get_text()
                item.releasetime = tag.find('p', attrs={'class': 'c9'}).get_text()
                items.append(item)
                self.log.info('Added data for mvName {}'.format(item.mvname))
        self.pipelines(items, area)

    def pipelines(self, items, area):
        filename = '音悦台V榜-榜单.txt'
        nowtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        with open(filename, 'a', encoding='utf8') as f:
            f.write('{} --------- {}\r\n'.format(self.areasDic.get(area), nowtime))
            for item in items:
                f.write("{} {} \t {} \t {} \t {}\r\n".format(item.top_num,
                                                             item.score,
                                                             item.releasetime,
                                                             item.mvname,
                                                             item.singer
                                                             ))
                self.log.info('Wrote MV {} to {}...'.format(item.mvname, filename))
            f.write('\r\n'*4)


if __name__ == '__main__':
    GetMvList()

mylog.py

#!/usr/bin/env python
# coding: utf-8
import logging
import getpass
import sys


# define the MyLog class
class MyLog(object):
    def __init__(self):
        self.user = getpass.getuser()  # get the current user
        self.logger = logging.getLogger(self.user)
        self.logger.setLevel(logging.DEBUG)

        # log file name
        self.logfile = sys.argv[0][0:-3] + '.log'  # derived from the name of the calling script
        self.formatter = logging.Formatter('%(asctime)-12s %(levelname)-8s %(message)-12s\r\n')

        # log both to the screen and to the log file
        self.logHand = logging.FileHandler(self.logfile, encoding='utf-8')
        self.logHand.setFormatter(self.formatter)
        self.logHand.setLevel(logging.DEBUG)

        self.logHandSt = logging.StreamHandler()
        self.logHandSt.setFormatter(self.formatter)
        self.logHandSt.setLevel(logging.DEBUG)

        self.logger.addHandler(self.logHand)
        self.logger.addHandler(self.logHandSt)

    # the five log levels map to the five methods below
    def debug(self, msg):
        self.logger.debug(msg)

    def info(self, msg):
        self.logger.info(msg)

    def warn(self, msg):
        self.logger.warning(msg)

    def error(self, msg):
        self.logger.error(msg)

    def critical(self, msg):
        self.logger.critical(msg)


if __name__ == '__main__':
    mylog = MyLog()
    mylog.debug(u"I'm debug 中文测试")
    mylog.info(u"I'm info 中文测试")
    mylog.warn(u"I'm warn 中文测试")
    mylog.error(u"I'm error 中文测试")
    mylog.critical(u"I'm critical 中文测试")

resource.py

#!/usr/bin/env python
# coding: utf-8
UserAgents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
]

# Proxy IP addresses; if these no longer work, find a few free ones online
# All of these are http proxies
PROXIES = [
    "120.83.102.255:808",
    "111.177.106.196:9999",
]
