I recently worked on a project that required Selenium. The target site is hosted overseas, so access is very slow, and using Selenium triggered its CAPTCHA mechanism, which made even logging in a hassle. In the end, Selenium's cookie-loading feature solved the login problem.
The final code follows:
I. Saving the Cookies
from selenium import webdriver
import time
import pickle

def save_cookies(driver, location):
    pickle.dump(driver.get_cookies(), open(location, "wb"))

print("Launching the browser and opening the SimilarWeb login page")
# Start Chrome via webdriver
driver = webdriver.Chrome(executable_path=r"C:\Users\xxx\AppData\Local\Google\Chrome\Application\chromedriver.exe")
# Open the target page
driver.get('https://pro.similarweb.com/#/website/worldwide-overview/snailtoday.com/*/999/3m?webSource=Total')
author = "your_username"
password = "your_password"
# Fill in the username automatically
driver.find_element_by_xpath("./*//input[@name='UserName']").send_keys(author)
# Fill in the password automatically
driver.find_element_by_xpath("./*//input[@name='Password']").send_keys(password)
# Click the login button
driver.find_element_by_xpath("./*//button[@class='form__submit']").click()
print("Logged in")
# Wait 150 seconds (leaves time to handle a CAPTCHA manually)
time.sleep(150)
# Save the cookies
save_cookies(driver, r"H:\py_project\similiarweb\cookies.txt")
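The pickle round-trip at the heart of `save_cookies` can be exercised without a browser: Selenium's `get_cookies()` returns a plain list of dicts. A minimal sketch (the cookie names and values below are made up):

```python
import os
import pickle
import tempfile

# A fake cookie list shaped like Selenium's driver.get_cookies() output
cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".similarweb.com"},
    {"name": "csrftoken", "value": "xyz789", "domain": ".similarweb.com"},
]

# Save to disk, the same way save_cookies() does
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(path, "wb") as f:
    pickle.dump(cookies, f)

# Load them back, the same way load_cookies() does
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored[0]["name"])  # → sessionid
```

Because the cookies are ordinary dicts, the file written here is exactly what the loading script in the next section reads back.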
II. Loading the Cookies and Switching Tabs
from selenium import webdriver
import time
import pickle
from bs4 import BeautifulSoup

def load_cookies(driver, location, url=None):
    cookies = pickle.load(open(location, "rb"))
    driver.delete_all_cookies()
    # Cookies can only be added for the domain that is currently open,
    # so load the site first
    url = "https://pro.similarweb.com/#/website/worldwide-overview/snailtoday.com/*/999/3m?webSource=Total" if url is None else url
    driver.get(url)
    for cookie in cookies:
        driver.add_cookie(cookie)

print("Launching the browser and opening the SimilarWeb login page")
# Start Chrome via webdriver
driver = webdriver.Chrome(executable_path=r"C:\Users\xxx\AppData\Local\Google\Chrome\Application\chromedriver.exe")
load_cookies(driver, r"H:\py_project\similiarweb\cookies.txt")
# Open the target page
driver.get('https://pro.similarweb.com/#/website/worldwide-overview/snailtoday.com/*/999/3m?webSource=Total')
time.sleep(30)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
visitors = soup.find_all('div', class_='big-text u-blueMediumMedium')[0].text
print(visitors)
# Open a new tab
js = 'window.open("https://pro.similarweb.com/#/website/worldwide-overview/baidu.com/*/999/3m?webSource=Total");'
driver.execute_script(js)
time.sleep(30)
handles = driver.window_handles
driver.switch_to_window(handles[2])
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
visitors = soup.find_all('div', class_='big-text u-blueMediumMedium')[0].text
print(visitors)
print("Done")
三、填坑
Since this was my first time using Beautiful Soup, I ran into quite a few pitfalls.
1. Picking one value out of multiple results
visitors = soup.find_all('div', class_='big-text u-blueMediumMedium')
The code above returns a list with two matches; to take the first one, append "[0]".
2. Extracting the text
With "[0]" appended, the code above prints:
<div class="big-text u-blueMediumMedium" title="16,100">16,103</div>
To get just the text, append ".text":
visitors = soup.find_all('div', class_='big-text u-blueMediumMedium')[0].text
This was another big trap. At first I had left out the "[0]", so appending ".text" directly failed; then, after adding "[0]", the site's slow loading produced "IndexError: list index out of range" (find_all returned an empty list before the page finished rendering). In short, this small issue cost me a lot of back-and-forth.
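The IndexError happens whenever the list is indexed before the page has rendered anything matching. One way to guard against it is to poll until a result shows up; this is a sketch, where the hypothetical `results` callable stands in for the real `lambda: soup.find_all(...)` call:

```python
import time

def first_text(results, retries=3, delay=0.01):
    """Return the first item from results(), retrying while the list is empty."""
    for _ in range(retries):
        items = results()   # e.g. lambda: soup.find_all('div', class_='...')
        if items:           # non-empty: safe to index with [0]
            return items[0]
        time.sleep(delay)   # page may still be loading; wait and retry
    raise IndexError("no results after {} attempts".format(retries))

# Simulated: the first two polls return nothing, the third returns data
attempts = iter([[], [], ["16,103"]])
value = first_text(lambda: next(attempts))
print(value)  # → 16,103
```

In the real script you would pass a larger `delay` (seconds, not milliseconds) to match the site's load time.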
3. Switching tabs
My Chrome opens with two tabs by default, so in
driver.switch_to_window(handles[2])
the index into the handles list needs to match your own setup.
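Rather than hard-coding `handles[2]`, a more robust pattern is to diff the handle list before and after opening the new tab; this is a sketch with made-up handle strings standing in for what `driver.window_handles` returns:

```python
# Handles before window.open(), e.g. Chrome's two default tabs
before = ["tab-A", "tab-B"]
# Handles after executing window.open(...)
after = ["tab-A", "tab-B", "tab-C"]

# The new tab is whichever handle appeared in the meantime,
# regardless of how many tabs were already open
new_handle = (set(after) - set(before)).pop()
print(new_handle)  # → tab-C
```

In the real script, `before` would be captured as `driver.window_handles` just before `driver.execute_script(js)` and `after` just after it; the result is then passed to the tab-switching call.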
IV. The Final Code
from selenium import webdriver
import time
import pickle
from bs4 import BeautifulSoup

def load_cookies(driver, location, url=None):
    cookies = pickle.load(open(location, "rb"))
    driver.delete_all_cookies()
    # Cookies can only be added for the domain that is currently open,
    # so load the site first
    url = "https://pro.similarweb.com/#/website/worldwide-overview/snailtoday.com/*/999/3m?webSource=Total" if url is None else url
    driver.get(url)
    for cookie in cookies:
        driver.add_cookie(cookie)

print("Launching the browser and opening the SimilarWeb login page")
# Start Chrome via webdriver
driver = webdriver.Chrome(executable_path=r"C:\Users\xxx\AppData\Local\Google\Chrome\Application\chromedriver.exe")
load_cookies(driver, r"H:\py_project\similiarweb\cookies.txt")
for domain in open("domains.txt"):
    domain = domain.strip()  # drop the trailing newline so the URL is valid
    print(domain)
    url = 'https://pro.similarweb.com/#/website/worldwide-overview/{}/*/999/3m?webSource=Total'.format(domain)
    driver.get(url)
    time.sleep(30)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    visitors = soup.find_all('div', class_='big-text u-blueMediumMedium')[0].text
    print(visitors)
    time.sleep(5)
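One detail worth noting in the loop above: each line read from a file keeps its trailing newline, which would corrupt the formatted URL if left in place. A small standalone sketch of the stripping step, with a sample domains file written on the fly:

```python
import os
import tempfile

# Write a sample domains file (a stand-in for domains.txt)
path = os.path.join(tempfile.mkdtemp(), "domains.txt")
with open(path, "w") as f:
    f.write("snailtoday.com\nbaidu.com\n")

urls = []
for domain in open(path):
    # Each line ends with "\n"; strip it before building the URL
    domain = domain.strip()
    urls.append('https://pro.similarweb.com/#/website/worldwide-overview/{}/*/999/3m?webSource=Total'.format(domain))

print(urls[0])
```

Without the `strip()`, the `999/3m?...` suffix would end up on its own line inside the URL string and the request would fail.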
Originally published on: 蜗牛博客
URL: http://www.snailtoday.com
Please respect copyright: when reposting, credit the author and link to the original source, and keep this notice.