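Writing a crawler in Python is a pleasant job: there are many relevant libraries, they are easy to use, and a dozen or so lines of code are enough for a working crawler.
Things get harder with sites that deploy anti-crawling measures, sites that load content dynamically with JS, and data that lives in apps. Distributed and high-performance crawlers need even more careful design.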

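Common tools for building crawlers in Python

requests: HTTP request library for Python;
pyquery: HTML DOM parsing library for Python with a jQuery-like syntax;
BeautifulSoup: HTML and XML parsing library for Python;
selenium: browser automation and testing framework that can also be used for crawling;
phantomjs: headless browser that can be paired with selenium to fetch content loaded dynamically by JS;
re: Python's built-in regular expression module;
fiddler: packet capture tool that works as a proxy server and can also capture mobile traffic;
anyproxy: proxy server that lets you write your own rules to intercept requests or responses, usually used for collecting data from client apps;
celery: distributed task framework for Python that can be used to build distributed crawlers;
gevent: coroutine-based networking library for Python, useful for high-performance crawlers;
grequests: asynchronous requests;
aiohttp: asynchronous HTTP client/server framework;
asyncio: Python's built-in asynchronous I/O and event loop library;
uvloop: a very fast event loop that is highly efficient when combined with asyncio;
concurrent: Python's built-in package for running tasks concurrently;
scrapy: crawler framework for Python;
Splash: JavaScript rendering service, roughly a lightweight browser, that renders pages through its HTTP API and can be scripted with Lua;
Splinter: open-source Python tool for automated web testing;
pyspider: crawler system written in Python.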
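
Several of the tools above (gevent, asyncio, aiohttp, concurrent) exist to run many requests at once. As a minimal sketch of the idea, the code below uses only the standard-library asyncio and caps the number of requests in flight with a semaphore. The names fake_fetch, crawl and crawl_sync are made up for this illustration, and fake_fetch merely simulates a network call; a real crawler would substitute an HTTP client such as aiohttp.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return "<html>%s</html>" % url


async def crawl(urls, fetch=fake_fetch, max_concurrency=3):
    """Fetch every URL, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(pairs)


def crawl_sync(urls, **kwargs):
    """Convenience wrapper for callers that are not already async."""
    return asyncio.run(crawl(urls, **kwargs))
```

Capping concurrency keeps the crawler fast without flooding the target site, which also ties in with the advice on throttling.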

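How to approach a page

Can the data be read straight from the HTML? The data is embedded directly in the page's HTML structure;
Is the data rendered into the page by JS? The data sits inside JS code and is then loaded into the page by JS or rendered through Ajax;
Does the page require authentication? The page can only be accessed after logging in;
Is the data available directly through an API? Some data can be fetched straight from an API, which saves the trouble of parsing HTML; most APIs return JSON;
How do you collect data that comes from a client application? For example: the WeChat app and the WeChat client.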
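
When the data is embedded in JS code rather than in the HTML structure, it can often be pulled out with a regular expression and decoded as JSON, without running any JS. The sketch below assumes a page that assigns a JSON object literal to a variable. SAMPLE_HTML, the variable name pageData and the helper extract_embedded_json are all invented for this illustration.

```python
import json
import re

# Made-up page: the data lives in an inline script, not in the HTML tags.
SAMPLE_HTML = """
<html><body>
<script>
  var pageData = {"title": "demo", "items": [1, 2, 3]};
</script>
</body></html>
"""


def extract_embedded_json(html: str, var_name: str):
    """Return the JSON object assigned to `var <var_name> = {...};`, or None."""
    # Non-greedy match up to the first "}" that is followed by ";".
    pattern = r"var\s+%s\s*=\s*(\{.*?\})\s*;" % re.escape(var_name)
    m = re.search(pattern, html, flags=re.S)
    if m is None:
        return None
    return json.loads(m.group(1))
```

The non-greedy pattern is deliberately simple; it would cut the match short if the object itself contained the sequence "};" inside a string, so treat it as a starting point rather than a general parser.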

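Dealing with anti-crawling measures

Don't overdo it: throttle the crawler so that you don't bring the site down, which would hurt both sides;
Use proxies to hide your real IP and get around IP-based blocking;
Make the crawler look like a human user by selectively setting the following HTTP headers:
    Host: www.baidu.com
    Connection: keep-alive
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36
    Referer: http://s.weibo.com/user/gamelife1314&Refer=index
    Accept-Encoding: gzip, deflate
    Accept-Language: zh-CN,zh;q=0.8
Check the site's cookies; in some cases a request must carry cookies to pass server-side checks.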
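
The first and third points can be sketched in a few lines: a throttle that enforces a minimum interval between requests, and a helper that builds browser-like headers. Throttle, build_headers and DEFAULT_HEADERS are names made up for this sketch, and the interval is an arbitrary example value, not a recommendation for any particular site.

```python
import time

# Browser-like defaults taken from the header list above.
DEFAULT_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/59.0.3071.104 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
}


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


def build_headers(referer=None, extra=None):
    """Copy the defaults and optionally add a Referer and extra headers."""
    headers = dict(DEFAULT_HEADERS)
    if referer:
        headers["Referer"] = referer
    if extra:
        headers.update(extra)
    return headers
```

In use, call wait() before each request and pass the result of build_headers() as the headers argument of requests.get.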

Case studies

Parsing a static page (fetching a WeChat official account article)

import re

import pyquery


def weixin_article_html_parser(html):
    """
    Parse a WeChat article and return a dict describing the article.

    :param html: HTML source of the article
    :return: dict with account and article fields
    """

    pq = pyquery.PyQuery(html)

    article = {
        "weixin_id": pq.find("#js_profile_qrcode "
                             ".profile_inner .profile_meta").eq(0).find("span").text().strip(),
        "weixin_name": pq.find("#js_profile_qrcode .profile_inner strong").text().strip(),
        "account_desc": pq.find("#js_profile_qrcode .profile_inner "
                                ".profile_meta").eq(1).find("span").text().strip(),
        "article_title": pq.find("title").text().strip(),
        "article_content": pq("#js_content").remove('script').text().replace("\r\n", ""),
        "is_orig": 1 if pq("#copyright_logo").length > 0 else 0,
        "article_source_url": pq("#js_sg_bar .meta_primary").attr('href') if pq(
            "#js_sg_bar .meta_primary").length > 0 else '',
    }

    # Use regular expressions to pull values out of the page's inline JS.
    # Raw strings keep \d and \w from being treated as string escapes.
    match = {
        "msg_cdn_url": {"regexp": r'(?<=").*(?=")', "value": ""},  # cover image
        "var ct": {"regexp": r'(?<=")\d{10}(?=")', "value": ""},  # publish timestamp
        "publish_time": {"regexp": r'(?<=")\d{4}-\d{2}-\d{2}(?=")', "value": ""},  # publish date
        "msg_desc": {"regexp": r'(?<=").*(?=")', "value": ""},  # article summary
        "msg_link": {"regexp": r'(?<=").*(?=")', "value": ""},  # article link
        "msg_source_url": {"regexp": r"(?<=').*(?=')", "value": ""},  # original source link
        "var biz": {"regexp": r'(?<=")\w{1}.+?(?=")', "value": ""},
        "var idx": {"regexp": r'(?<=")\d{1}(?=")', "value": ""},
        "var mid": {"regexp": r'(?<=")\d{10,}(?=")', "value": ""},
        "var sn": {"regexp": r'(?<=")\w{1}.+?(?=")', "value": ""},
    }
    count = 0
    for line in html.split("\n"):
        for item, value in match.items():
            if item in line:
                m = re.search(value["regexp"], line)
                if m is not None:
                    count += 1
                    match[item]["value"] = m.group(0)
                    break
        # Stop scanning once every field has been matched
        if count >= len(match):
            break

    article["article_short_desc"] = match["msg_desc"]["value"]
    article["article_pos"] = int(match["var idx"]["value"])
    article["article_post_time"] = int(match["var ct"]["value"])
    article["article_post_date"] = match["publish_time"]["value"]
    article["article_cover_img"] = match["msg_cdn_url"]["value"]
    article["article_source_url"] = match["msg_source_url"]["value"]
    article["article_url"] = "https://mp.weixin.qq.com/s?__biz={biz}&mid={mid}&idx={idx}&sn={sn}".format(
        biz=match["var biz"]["value"],
        mid=match["var mid"]["value"],
        idx=match["var idx"]["value"],
        sn=match["var sn"]["value"],
    )

    return article


if __name__ == '__main__':

    from pprint import pprint
    import requests
    url = ("https://mp.weixin.qq.com/s?__biz=MzI1NjA0MDg2Mw==&mid=2650682990&idx=1"
           "&sn=39419542de39a821bb5d1570ac50a313&scene=0#wechat_redirect")
    pprint(weixin_article_html_parser(requests.get(url).text))

# {'account_desc': '夜聽，讓更多的家庭越來越幸福。',
# 'article_content': '文字：安夢 \xa0 \xa0 聲音：劉筱 得到了什么？又失去了什么？',
# 'article_cover_img': 'http://mmbiz.qpic.cn/mmbiz_jpg/4iaBNpgEXstYhQEnbiaD0AwbKhmCVWSeCPBQKgvnSSj9usO4q997wzoicNzl52K1sYSDHBicFGL7WdrmeS0K8niaiaaA/0?wx_fmt=jpeg',
# 'article_pos': 1,
# 'article_post_date': '2017-07-02',
# 'article_post_time': 1499002202,
# 'article_short_desc': '周日 來自劉筱的晚安問候。',
# 'article_source_url': '',
# 'article_title': '【夜聽】走到這里',
# 'article_url': 'https://mp.weixin.qq.com/s?__biz=MzI1NjA0MDg2Mw==&mid=2650682990&idx=1&sn=39419542de39a821bb5d1570ac50a313',
# 'is_orig': 0,
# 'weixin_id': 'yetingfm',
# 'weixin_name': '夜聽'}
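
The regular expressions in the parser rely on lookbehind and lookahead: (?<=") and (?=") require a quote on each side of the value without including the quotes in the match. The short sketch below runs three of those patterns against made-up lines written in the style of the inline script; the actual lines on a WeChat page may differ.

```python
import re

# Made-up lines in the style of the inline JS that the parser scans.
SAMPLE_LINES = [
    'var ct = "1499002202";',
    'var publish_time = "2017-07-02" || "";',
    "var msg_source_url = '';",
]

# Ten digits enclosed in double quotes: the publish timestamp.
ct = re.search(r'(?<=")\d{10}(?=")', SAMPLE_LINES[0]).group(0)

# A YYYY-MM-DD date enclosed in double quotes.
date = re.search(r'(?<=")\d{4}-\d{2}-\d{2}(?=")', SAMPLE_LINES[1]).group(0)

# Anything between single quotes; an empty value yields an empty match.
source = re.search(r"(?<=').*(?=')", SAMPLE_LINES[2]).group(0)

print(ct, date, repr(source))
```

Because the quotes are matched only as context, group(0) is already the bare value and needs no stripping afterwards.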

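Parsing a JS-rendered page with phantomjs: Weibo search

Some pages rely on complex JS logic, with all kinds of Ajax requests and encryption steps between them. Getting the data by working through that JS logic and reproducing the rendering yourself is extremely hard; without a solid JS background and familiarity with the various JS frameworks, it is not realistic.
Rendering the page the way a browser does and then reading the resulting HTML is far easier.

For example, the search results on http://s.weibo.com/ are rendered dynamically by JS, so fetching the raw HTML does not return them. We have to run the page's JS, wait for the page to finish rendering, and then parse the resulting HTML.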

#!/usr/bin/env python3
# encoding: utf-8
# Note: webdriver.PhantomJS is not available in Selenium 4, so this script needs Selenium 3.x.
import time
from urllib import parse

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from pyquery import PyQuery


def weibo_user_search(url: str):
    """Fetch the HTML of a search page through phantomjs."""

    desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    desired_capabilities["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                                                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                                 "Chrome/59.0.3071.104 Safari/537.36")
    desired_capabilities["phantomjs.page.settings.loadImages"] = True
    # Custom headers
    desired_capabilities["phantomjs.page.customHeaders.Upgrade-Insecure-Requests"] = 1
    desired_capabilities["phantomjs.page.customHeaders.Cache-Control"] = "max-age=0"
    desired_capabilities["phantomjs.page.customHeaders.Connection"] = "keep-alive"

    driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs",  # path to the phantomjs binary
                                 desired_capabilities=desired_capabilities,
                                 service_log_path="ghostdriver.log",)
    # Timeout for locating elements
    driver.implicitly_wait(1)
    # Timeout for the page to load completely, including rendering and all sync/async scripts
    driver.set_page_load_timeout(60)
    # Timeout for asynchronous scripts
    driver.set_script_timeout(60)

    driver.maximize_window()
    try:
        driver.get(url=url)
        time.sleep(1)
        try:
            # Interact with the page after it opens
            company = driver.find_element_by_css_selector("p.company")
            # perform() is required; without it the queued action never runs
            ActionChains(driver).move_to_element(company).perform()
        except WebDriverException:
            pass
        html = driver.page_source
        pq = PyQuery(html)
        person_lists = pq.find("div.list_person")
        if person_lists.length > 0:
            for index in range(person_lists.length):
                person_ele = person_lists.eq(index)
                print(person_ele.find(".person_name > a.W_texta").attr("title"))
        return html
    except Exception as e:  # includes TimeoutException
        print(e)
    finally:
        driver.quit()


if __name__ == '__main__':
    weibo_user_search(url="http://s.weibo.com/user/%s" % parse.quote("新聞"))
# 央視新聞
# 新浪新聞
# 新聞
# 新浪新聞客戶端
# 中國新聞周刊
# 中國新聞網
# 每日經濟新聞
# 澎湃新聞
# 網易新聞客戶端
# 鳳凰新聞客戶端
# 皇馬新聞
# 網絡新聞聯播
# CCTV5體育新聞
# 曼聯新聞
# 搜狐新聞客戶端
# 巴薩新聞
# 新聞日日睇
# 新垣結衣新聞社
# 看看新聞KNEWS
# 央視新聞評論

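Simulating a login with Python to obtain cookies

Some sites are a pain: you usually have to log in before you can get any data. The simple example below logs in to a site and collects its cookies, which can then be used in other requests.

Note that this only covers logins without a CAPTCHA; SMS, image, or email verification requires a separate design.

Target site: http://www.newrank.cn, as of 2017-07-03. If the site structure changes, the code below will need to be updated.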

#!/usr/bin/env python3
# encoding: utf-8
# Note: webdriver.PhantomJS is not available in Selenium 4, so this script needs Selenium 3.x.

from time import sleep
from pprint import pprint

from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver


def login_newrank():
    """Log in to newrank and collect its cookies."""

    desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    desired_capabilities["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                                                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                                 "Chrome/59.0.3071.104 Safari/537.36")
    desired_capabilities["phantomjs.page.settings.loadImages"] = True

    # Custom headers
    desired_capabilities["phantomjs.page.customHeaders.Upgrade-Insecure-Requests"] = 1
    desired_capabilities["phantomjs.page.customHeaders.Cache-Control"] = "max-age=0"
    desired_capabilities["phantomjs.page.customHeaders.Connection"] = "keep-alive"

    # Fill in your own account to test
    user = {
        "mobile": "user",
        "password": "password"
    }

    print("login account: %s" % user["mobile"])

    driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs",
                                 desired_capabilities=desired_capabilities,
                                 service_log_path="ghostdriver.log", )

    # Timeout for locating elements
    driver.implicitly_wait(1)
    # Timeout for the page to load completely, including rendering and all sync/async scripts
    driver.set_page_load_timeout(60)
    # Timeout for asynchronous scripts
    driver.set_script_timeout(60)

    driver.maximize_window()

    try:
        driver.get(url="http://www.newrank.cn/public/login/login.html?back=http%3A//www.newrank.cn/")
        # Switch to the account/password login tab
        driver.find_element_by_css_selector(".login-normal-tap:nth-of-type(2)").click()
        sleep(0.2)
        driver.find_element_by_id("account_input").send_keys(user["mobile"])
        sleep(0.5)
        driver.find_element_by_id("password_input").send_keys(user["password"])
        sleep(0.5)
        driver.find_element_by_id("pwd_confirm").click()
        sleep(3)
        # Flatten the cookie list into a name -> value dict
        cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}
        pprint(cookies)

    except TimeoutException as exc:
        print(exc)
    except WebDriverException as exc:
        print(exc)
    finally:
        driver.quit()


if __name__ == '__main__':
    login_newrank()
# login account: 15395100590
# {'CNZZDATA1253878005': '1487200824-1499071649-%7C1499071649',
# 'Hm_lpvt_a19fd7224d30e3c8a6558dcb38c4beed': '1499074715',
# 'Hm_lvt_a19fd7224d30e3c8a6558dcb38c4beed': '1499074685,1499074713',
# 'UM_distinctid': '15d07d0d4dd82b-054b56417-9383666-c0000-15d07d0d4deace',
# 'name': '15395100590',
# 'rmbuser': 'true',
# 'token': 'A7437A03346B47A9F768730BAC81C514',
# 'useLoginAccount': 'true'}

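Once you have the cookies you can attach them to subsequent requests. Cookies expire, though, so they have to be refreshed periodically.
One way to do this is a cookie pool: log in to a batch of accounts on a schedule and store the resulting cookies in a database (redis, MySQL, and so on).
For each request, take one usable cookie from the database and attach it to the request.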
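
A minimal sketch of such a cookie pool follows. To stay self-contained it keeps everything in memory instead of in redis or MySQL. The class CookiePool, its method names and the fixed time-to-live are assumptions made for this illustration; a real pool would use the expiry reported by the site and would trigger a fresh login for stale accounts.

```python
import random
import time


class CookiePool:
    """In-memory cookie pool; a real one would persist to redis or MySQL."""

    def __init__(self, ttl_seconds: float, clock=time.time):
        self.ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}   # account -> (cookies dict, time stored)

    def put(self, account: str, cookies: dict) -> None:
        """Store the cookies obtained by logging in with this account."""
        self._entries[account] = (dict(cookies), self._clock())

    def _is_fresh(self, stored_at: float) -> bool:
        return self._clock() - stored_at < self.ttl

    def get(self):
        """Return the cookies of a random non-expired account, or None."""
        fresh = [c for c, ts in self._entries.values() if self._is_fresh(ts)]
        return dict(random.choice(fresh)) if fresh else None

    def stale_accounts(self):
        """Accounts whose cookies have expired and need a new login."""
        return [a for a, (_, ts) in self._entries.items()
                if not self._is_fresh(ts)]
```

A scheduled job would call stale_accounts(), log in again for each account returned, and put() the new cookies back; request code only ever calls get().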

Trying a scrape with PyQt5 (PyQt 5.9.2)

import sys
import csv

import pyquery

from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView


class Browser(QWebEngineView):

    def __init__(self):
        super(Browser, self).__init__()
        self.__results = []
        # Parse the page once it has finished loading
        self.loadFinished.connect(self.__result_available)

    @property
    def results(self):
        return self.__results

    def __result_available(self):
        # toHtml is asynchronous and passes the HTML to the callback
        self.page().toHtml(self.__parse_html)

    def __parse_html(self, html):
        pq = pyquery.PyQuery(html)
        for rows in [pq.find("#table_list tr"), pq.find("#more_list tr")]:
            for row in rows.items():
                columns = row.find("td")
                d = {
                    "avatar": columns.eq(1).find("img").attr("src"),
                    "url": columns.eq(1).find("a").attr("href"),
                    "name": columns.eq(1).find("a").attr("title"),
                    "fans_number": columns.eq(2).text(),
                    "view_num": columns.eq(3).text(),
                    "comment_num": columns.eq(4).text(),
                    "post_count": columns.eq(5).text(),
                    "newrank_index": columns.eq(6).text(),
                }
                self.__results.append(d)

        # newline="" lets the csv module control line endings
        with open("results.csv", "a+", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "fans_number", "view_num", "comment_num", "post_count",
                                                   "newrank_index", "url", "avatar"])
            writer.writerows(self.results)

    def open(self, url: str):
        self.load(QUrl(url))


if __name__ == '__main__':
    app = QApplication(sys.argv)
    browser = Browser()
    browser.open("https://www.newrank.cn/public/info/list.html?period=toutiao_day&type=data")
    browser.show()
    app.exec_()

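To be continued:

5. Analyzing captured traffic with Fiddler

Capturing browser traffic
Capturing mobile traffic with fiddler

6. Capturing client data with anyproxy: collecting data from client apps

7. Notes on building high-performance crawlers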

Original article: https://www.cnblogs.com/pypypy/p/12019283.html

