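Scrape http://www.santostang.com/2018/07/04/hello-world/
Capture the network traffic to find the URL that delivers the data
First capture the traffic: F12 -> Network -> F5. AJAX data is usually fetched as JSON
Filter by XHR, then click
Preview
to inspect the data: it turns out to be empty, not JSON
so the only option is to go back to All
and look again
Finding it this way is tedious; using selenium would be easier
After clicking through the requests one by one, finally found it: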
https://api-zero.livere.com/v1/comments/list?callback=jQuery112409131255202867523_1543847210853&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543847210855
This returns the JSON from which the data can be parsed
Opening the response, it looks like this
typeof jQuery112409131255202867523_1543847210853 === 'function' && jQuery112409131255202867523_1543847210853({the JSON data being passed goes here});
import requests
link = "https://api-zero.livere.com/v1/comments/list?callback=jQuery1124049866736766120545_1506309304525&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506309304527"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
json_string = r.text
# Keep only the JSON object: from the first '{' up to, but excluding, the trailing ');'
json_string = json_string[json_string.find('{'):-2]
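The slice above relies on the response ending in exactly two extra characters, `);`. As a slightly more defensive sketch, the JSON object can be cut out between the first `{` and the last `}` instead. The helper name `unwrap_jsonp` and the sample string are made up for illustration; they are not part of the original notes or of any library.

```python
import json


def unwrap_jsonp(text):
    """Parse the JSON object wrapped inside a JSONP response body.

    Hypothetical helper: takes everything from the first '{' to the
    last '}' so trailing characters such as ');' or a newline do not matter.
    """
    start = text.index("{")      # raises ValueError if there is no '{'
    end = text.rindex("}") + 1   # raises ValueError if there is no '}'
    return json.loads(text[start:end])


# Made-up sample in the same shape as the response shown above
sample = "typeof cb_1 === 'function' && cb_1({\"results\": {\"parents\": [{\"content\": \"121212\"}]}});\n"
data = unwrap_jsonp(sample)
print(data["results"]["parents"][0]["content"])
```

With a real response, `unwrap_jsonp(r.text)` would stand in for the slicing step followed by `json.loads`.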
Look at the JSON structure
"results": {
    "parents":
    [
        { ....
            "content": "评论试试啊 :smiley:",
            ....
        },
        { ....
            "content": "121212",
            ....
        },
    ]
}
import json
# Parse the JSON string, then walk results -> parents to reach each comment
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for eachone in comment_list:
    message = eachone['content']
    print(message)
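The loop above assumes every response contains `results`, `parents` and `content`. A minimal sketch of a more forgiving version, which returns the comment texts as a list and tolerates missing keys, might look like this. The function name `extract_contents` is hypothetical, and the sample dict only mirrors the structure shown above.

```python
def extract_contents(json_data):
    """Collect the 'content' field of each entry under results -> parents.

    Hypothetical helper: missing keys yield an empty list instead of a KeyError.
    """
    parents = json_data.get("results", {}).get("parents", [])
    return [item["content"] for item in parents if "content" in item]


# Sample shaped like the structure in these notes
sample = {
    "results": {
        "parents": [
            {"content": "评论试试啊 :smiley:"},
            {"content": "121212"},
        ]
    }
}
for message in extract_contents(sample):
    print(message)
```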
URL pattern
The URL above, https://api-zero.livere.com/v1/comments/list?callback=jQuery112409131255202867523_1543847210853&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543847210855
returns only part of the comments
When the second page is requested
the URL is
https://api-zero.livere.com/v1/comments/list?callback=jQuery1124010814306767104775_1543848692464&limit=10&offset=2&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543848692469
Compare the parameters of page 1 and page 2
The comparison shows that the key parameter is offset
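The comparison can also be done programmatically rather than by eye. The sketch below parses both query strings with the standard library and reports the parameters that differ. The function name `diff_query_params` is made up for illustration. Besides `offset`, it also reports `callback` and `_`, which look like a generated JSONP callback name and a timestamp rather than anything tied to paging; that reading is an assumption.

```python
from urllib.parse import urlsplit, parse_qs


def diff_query_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ.

    Hypothetical helper; a value is None when the parameter is absent.
    """
    a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# The first-page and second-page URLs from these notes
page1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112409131255202867523_1543847210853&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543847210855"
page2 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery1124010814306767104775_1543848692464&limit=10&offset=2&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543848692469"

changed = diff_query_params(page1, page2)
for name, (before, after) in changed.items():
    print(name, before, after)
```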
for page in range(1,4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    # Concatenate the pieces into the full URL
    link = link1 + page_str + link2
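As an alternative to gluing string halves together, the query string can be built from a dict so that `offset` is the only thing that varies. This is a sketch using `urllib.parse.urlencode`; the names `BASE_URL` and `build_comment_url` are hypothetical, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL in these notes
BASE_URL = "https://api-zero.livere.com/v1/comments/list"


def build_comment_url(offset):
    """Build the comment-list URL for a given offset (hypothetical helper)."""
    params = {
        "callback": "jQuery112407875296433383039_1506267778283",
        "limit": 10,
        "offset": offset,
        "repSeq": 3871836,
        "requestPath": "/v1/comments/list",  # urlencode escapes '/' as %2F
        "consumerSeq": 1020,
        "livereSeq": 28583,
        "smartloginSeq": 5154,
        "_": 1506267778285,
    }
    return BASE_URL + "?" + urlencode(params)


for page in range(1, 4):
    print(build_comment_url(page))
```

The same dict could also be passed to `requests.get` through its `params` argument, leaving the encoding to the library.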
Full code
import requests
import json

# Print the comments on one page
def single_page_comment(link):
    headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # Get the JSON string: drop the callback wrapper before '{' and the trailing ');'
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

# Comments for pages 1 to 3 (range(1, 4) stops before 4)
for page in range(1,4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)