一、Python命名规则

二、xpath用法:


这里的下标是从1开始的,不是0


抓取图片:

小技巧:
如果遇到]怎么办?
1 | links = dom_tree.xpath( "//a[@class='download']" )#在xml中定位节点,返回的是一个列表 |
2 | for index in range(len(links)): |
5 | print (links[index].tag) |
6 | print (links[index].attrib) |
7 | print (links[index].text) |
比如url为磁力链接
那么
print(links[index].tag)#获取a标签名a
print(links[index].attrib)#获取a标签的属性href和class
print(links[index].text)#获取a标签的文字部分
的结果分别为:
2 | { 'href' : 'magnet:?xt=urn:btih:7502edea0dfe9c2774f95118db3208a108fe10ca' , 'class' : 'download' } |
参考:https://www.cnblogs.com/z-x-y/p/8260213.html
截取超级链接
截取多个标签
比如,文章的正文部分又有h2,又有h3,又有p标签,可以通过下面的方法截取,不过我试了一下,发现它先将所有的h2,找出来,再将所有的p标签找出来排列,那么和原文章的顺序就乱了。。
所以最后还是要用正则。
1 | xpath( '//*[@id="xxx"]/h2|*[@id="xxx"]/h3|*[@id="xxx"]/p' ) |
特殊的字段
1 | ip.xpath( 'string(td[5])' )[0].extract().strip() #获取第5个单元格的所有文本。 |
3 | ip.xpath( 'td[8]/div[@class="bar"]/@title' ).re( '\d{0,2}\.\d{0,}' )[0] #匹配<div class = "bar" title= "0.0885秒" > 中的数字 |
如果图片url在不同的存储位置,xpath的时候用“|”符号。

常见问题:
(1)如果在chrome的xpath插件中可以用q_urls = response.xpath('//div[@class="line content"]')看到取值,但是在代码中取不到值,需要用下面的代码:
2 | html = requests.get(url,headers = headers) |
3 | response = etree.HTML(html.content) |
4 | q_urls = response.xpath( '//div[@class="line content"]' ) |
5 | result = q_urls[0].xpath( 'string(.)' ).strip() |
(2)查看元素
1 | content = selector.xpath( '//div[@class="metarial"]' )[0] |
参考:
https://www.cnblogs.com/just-do/p/9778941.html
(3)乱码问题
如果用xpath取到的中文乱码,可以用下面的方案:
1 | content=etree.tostring(content,encoding= "utf-8" ).decode( 'utf-8' ) |
参考:
https://www.cnblogs.com/Rhythm-/p/11374832.html
第五章 scrapy爬虫框架
1. __init__.py文件,它是个空文件,作用是将它的上级目录变成了一个模块,,可以供python导入使用。
2. items.py决定抓取哪些项目,wuhanmoviespider.py决定怎么爬的,settings.py决定由谁去处理爬取的内容,pipilines.py决定爬取后的内容怎么样处理。
3.
1 | <h3>武汉<font color= "#0066cc" >今天</font>天气</h3> |
选择方式为:.h3//text()而不是.h3/text()
4.使用json输出
settings.py中的Item_pipeline项,它是一个字典,字典是可以添加元素的,因此完全可以自行构造一个Python文件,然后加进去。
(1)创建pipelines2json.py文件
5 | class WeatherPipeline(object): |
6 | def process_item(self, item, spider): |
7 | today=time. strftime (‘%Y%m%d‘,time.localtime()) |
9 | with codecs.open(fileName,‘a‘,encoding=‘utf8‘) as fp: |
10 | line=json.dumps(dict(item),ensure_ascii=False)+‘\n‘ |
(2)修改settings.py,将pipeline2json加入到Item_pipelines中去
2 | 'weather.pipelines.weatherpipeline' :1, |
3 | 'weather.将pipelines2json.weatherpipeline' :2, |
5.登陆数据库,数据表的命令
其实之前在这里已经学过了。不过那里是直接修改pipeline,这里新建了一个pipelines2mysql.py用来入库。
这里主要是为了标记一下数据库操作命令。
1 | # 创建数据库:scrapyDB ,以utf8位编码格式,每条语句以’;‘结尾 |
2 | CREATE DATABASE scrapyDB CHARACTER SET 'utf8' COLLATE 'utf8_general-Ci' ; |
7 | # 创建我们需要的字段:字段要和我们代码里一一对应,方便我们一会写sql语句 |
17 | )ENGINE=InnoDB DEFAULT CHARSET= 'utf8' ; |
查看一下weather表格的样子
1 | show columns from weather 或者:desc weather |
6.添加一个User_agent
在scrapy中的确是有默认的headers,但这个headrs与浏览器的headers是有区别的。有的网站会检查headers,所以需要给scrapy一个浏览器的headers。

当然还可以用下面的方法:
1 | from getProxy import userAgents |
5 | SPIDER_MODULES=[ 'getProxy.spiders' ] |
7 | NEWSPIDER_MODULE= 'getProxy.spiders' |
9 | USER_AGENT=userAgents.pcUserAgent.get( 'Firefox 4.0.1 – Windows' ) |
11 | ITEM_PIPELINES={ 'getProxy.pipelines.GetProxyPipeline' :300} |
只需要在settings.py里添加一个User_agent项就可以了。
这里修改了USER_AGENT,导入userAgents模块,下面给出userAgents模块代码:
2 | "safari 5.1 – MAC" : "User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50" , |
3 | "safari 5.1 – Windows" : "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50" , |
4 | "IE 9.0" : "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);" , |
5 | "IE 8.0" : "User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" , |
6 | "IE 7.0" : "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)" , |
7 | "IE 6.0" : "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" , |
8 | "Firefox 4.0.1 – MAC" : "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" , |
9 | "Firefox 4.0.1 – Windows" : "User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" , |
10 | "Opera 11.11 – MAC" : "User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11" , |
11 | "Opera 11.11 – Windows" : "User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11" , |
12 | "Chrome 17.0 – MAC" : "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11" , |
13 | "Maxthon" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)" , |
14 | "Tencent TT" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)" , |
15 | "The World 2.x" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" , |
16 | "The World 3.x" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)" , |
17 | "sogou 1.x" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)" , |
18 | "360" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)" , |
19 | "Avant" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)" , |
20 | "Green Browser" : "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" |
24 | "iOS 4.33 – iPhone" : "User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" , |
25 | "iOS 4.33 – iPod Touch" : "User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" , |
26 | "iOS 4.33 – iPad" : "User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" , |
27 | "Android N1" : "User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" , |
28 | "Android QQ" : "User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" , |
29 | "Android Opera " : "User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10" , |
30 | "Android Pad Moto Xoom" : "User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13" , |
31 | "BlackBerry" : "User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+" , |
32 | "WebOS HP Touchpad" : "User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0" , |
33 | "Nokia N97" : "User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124" , |
34 | "Windows Phone Mango" : "User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)" , |
35 | "UC" : "User-Agent: UCWEB7.0.2.37/28/999" , |
36 | "UC standard" : "User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999" , |
37 | "UCOpenwave" : "User-Agent: Openwave/ UCWEB7.0.2.37/28/999" , |
38 | "UC Opera" : "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999" |
7.封锁user-agent破解
方法其实很简单:
(1).新建一个middlewares目录,创建__init__.py,以及资源文件resource.py和中间件文件customUserAgent.py。
(2).编辑customUserAgent.py,将资源文件resource.py中的user-agent随机选择一个出来,作为Scrapy的user-agent。
(3)修改settings.py文件,将RandomUserAgent加入DOWNLOADER_MIDDLEWARES

8.封锁IP破解
方法和上面差不多:
(1)在resource.py中添加PROXIES。
(2)创建customProxy.py,让Scrapy爬取网站时随机使用IP池中的代理。
(3)修改settings.py文件。

可以参考:https://www.cnblogs.com/hqutcy/p/7341212.html
第六章 自写爬虫模板
项目文件图示:

代码:
getTrendsMV.py
1 | from bs4 import BeautifulSoup |
4 | from mylog import MyLog as mylog |
14 | releasetime = None # 发布时间 |
17 | class GetMvList(object): |
18 | "" " the all data from www.yinyuetai.com |
19 | 所有的数据都来自www.yinyuetai.com |
36 | areas = [i for i in self.areasDic.keys()] |
37 | pages = [str(i) for i in range(1, 4)] |
41 | urlEnd = 'area=' + area + '&page=' + page |
42 | url = self.urlbase + urlEnd |
44 | self.log.info( '添加URL:{}到URLS' .format(url)) |
45 | self.spider(area, urls) |
47 | def getResponseContent(self, url): |
49 | fakeHeaders = { "User-Agent" : self.getRandomHeaders()} |
50 | request = urllib.request.Request(url, headers=fakeHeaders) |
51 | proxy = urllib.request.ProxyHandler({ 'http' : 'http://' + self.getRandomProxy()}) |
52 | opener = urllib.request.build_opener(proxy) |
53 | urllib.request.install_opener(opener) |
55 | response = urllib.request.urlopen(request) |
56 | html = response.read().decode( 'utf8' ) |
58 | except Exception as e: |
59 | self.log.error( 'Python 返回 URL:{} 数据失败' .format(url)) |
62 | self.log.info( 'Python 返回 URL:{} 数据成功' .format(url)) |
65 | def getRandomProxy(self): |
67 | return random.choice(resource.PROXIES) |
69 | def getRandomHeaders(self): |
71 | return random.choice(resource.UserAgents) |
73 | def spider(self, area, urls): |
76 | responseContent = self.getResponseContent(url) |
77 | if not responseContent: |
79 | soup = BeautifulSoup(responseContent, 'lxml' ) |
80 | tags = soup.find_all( 'li' , attrs={ 'name' : 'dmvLi' }) |
83 | item.top_num = tag.find( 'div' , attrs={ 'class' : 'top_num' }).get_text() |
84 | if tag.find( 'h3' , attrs={ 'class' : 'desc_score' }): |
85 | item.score = tag.find( 'h3' , attrs={ 'class' : 'desc_score' }).get_text() |
87 | item.score = tag.find( 'h3' , attrs={ 'class' : 'asc_score' }).get_text() |
88 | item.mvname = tag.find( 'a' , attrs={ 'class' : 'mvname' }).get_text() |
89 | item.singer = tag.find( 'a' , attrs={ 'class' : 'special' }).get_text() |
90 | item.releasetime = tag.find( 'p' , attrs={ 'class' : 'c9' }).get_text() |
92 | self.log.info( '添加mvName为{}的数据成功' .format(item.mvname)) |
93 | self.pipelines(items, area) |
95 | def pipelines(self, items, area): |
96 | filename = '音悦台V榜-榜单.txt' |
97 | nowtime = time. strftime ( '%Y-%m-%d %H:%M:S' , time.localtime()) |
98 | with open(filename, 'a' , encoding= 'utf8' ) as f: |
99 | f.write( '{} --------- {}\r\n' .format(self.areasDic.get(area), nowtime)) |
101 | f.write( "{} {} \t {} \t {} \t {}\r\n" .format(item.top_num, |
107 | self.log.info( '添加mvname为{}的MV到{}...' .format(item.mvname, filename)) |
111 | if __name__ == '__main__' : |
mylog.py
11 | self.user = getpass.getuser() # 获取用户 |
12 | self.logger = logging.getLogger(self.user) |
13 | self.logger.setLevel(logging.DEBUG) |
16 | self.logfile = sys.argv[0][0:-3] + '.log' # 动态获取调用文件的名字 |
17 | self.formatter = logging.Formatter( '%(asctime)-12s %(levelname)-8s %(message)-12s\r\n' ) |
20 | self.logHand = logging.FileHandler(self.logfile, encoding= 'utf-8' ) |
21 | self.logHand.setFormatter(self.formatter) |
22 | self.logHand.setLevel(logging.DEBUG) |
24 | self.logHandSt = logging.StreamHandler() |
25 | self.logHandSt.setFormatter(self.formatter) |
26 | self.logHandSt.setLevel(logging.DEBUG) |
28 | self.logger.addHandler(self.logHand) |
29 | self.logger.addHandler(self.logHandSt) |
33 | self.logger.debug(msg) |
42 | self.logger.error(msg) |
44 | def critical(self, msg): |
45 | self.logger.critical(msg) |
48 | if __name__ == '__main__' : |
50 | mylog.debug(u "I'm debug 中文测试" ) |
51 | mylog.info(u "I'm info 中文测试" ) |
52 | mylog.warn(u "I'm warn 中文测试" ) |
53 | mylog.error(u "I'm error 中文测试" ) |
54 | mylog.critical(u "I'm critical 中文测试" ) |
resource.py
4 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0" , |
5 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50" , |
6 | "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50" , |
7 | "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" , |
8 | "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;" , |
9 | "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" , |
10 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)" , |
11 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" , |
12 | "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" , |
13 | "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11" , |
14 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)" , |
15 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)" , |
16 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)" , |
17 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)" , |
18 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" , |
19 | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)" , |
22 | # 代理ip地址,如果不能使用,去网上找几个免费的使用 |
26 | "111.177.106.196:9999" , |