Scrapy is a Python crawler framework that makes scraping pages fairly easy; it works well for sites with a simple structure.
Installation and usage
First install Python (Python 3 is recommended), then install Scrapy via pip3:
pip3 install scrapy
Create an example file test.py with the code below; it scrapes all blog posts related to python from a blog:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://www.xxx.com/?s=python']

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        # Append results to a file; the context manager closes it when this page is done.
        with open("result.txt", 'a+') as file1:
            for h2 in response.css('h2.article-title'):
                title = h2.css('a::text').get()
                href = h2.css('a').attrib['href']
                file1.write(title + " " + href + "\n")
                yield {'title': title}
        # Follow pagination links and parse them with this same callback.
        for next_page in response.css('a.page-ens'):
            yield response.follow(next_page, self.parse)
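What `response.css('h2.article-title')` does for each result can be sketched without Scrapy using the standard library's html.parser as a simplified stand-in (the sample HTML and the `TitleLinkParser` class below are made up for illustration, not part of Scrapy):

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collects (title, href) pairs from <h2 class="article-title"><a href=...> markup."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # inside an <h2 class="article-title">
        self.in_a = False    # inside an <a> within that <h2>
        self.href = None
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2' and 'article-title' in attrs.get('class', ''):
            self.in_h2 = True
        elif tag == 'a' and self.in_h2:
            self.in_a = True
            self.href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        elif tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        # Text inside the <a> is the post title.
        if self.in_a:
            self.results.append((data, self.href))

html = '<h2 class="article-title"><a href="/post/1">Python tips</a></h2>'
p = TitleLinkParser()
p.feed(html)
print(p.results)  # [('Python tips', '/post/1')]
```

Scrapy's real selectors handle this matching for you; the point is only that each `h2.article-title` element yields one title/link pair.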
Run it:
scrapy runspider test.py
When the crawl finishes, result.txt in the same directory holds the titles and links of all the posts.
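Since each line of result.txt is "title URL", reading it back into (title, link) pairs is straightforward. A sketch with made-up sample lines (splitting from the right, since a title may itself contain spaces):

```python
# Sample lines in the "title URL" format the spider writes (contents made up).
lines = [
    "Scrapy tutorial https://www.xxx.com/post/1",
    "Python tips and tricks https://www.xxx.com/post/2",
]

# rsplit with maxsplit=1 keeps spaces inside the title intact.
pairs = [tuple(line.rsplit(' ', 1)) for line in lines]
print(pairs)
# [('Scrapy tutorial', 'https://www.xxx.com/post/1'),
#  ('Python tips and tricks', 'https://www.xxx.com/post/2')]
```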