Scrapy is a Python crawler framework that makes scraping pages fairly easy; it works well for sites with a simple structure.
Installation and usage
First install Python (Python 3 is recommended), then install Scrapy via pip3:
pip3 install scrapy
Create an example file test.py with the code below; it scrapes all blog posts related to python from a blog:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://www.xxx.com/?s=python']

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        # Append results to a file; the context manager closes it when this page is done.
        with open("result.txt", 'a+') as file1:
            for h2 in response.css('h2.article-title'):
                title = h2.css('a::text').get()
                href = h2.css('a').attrib['href']
                file1.write(title + " " + href + "\n")
                yield {'title': title}
        # Follow pagination links and parse them with this same callback.
        for next_page in response.css('a.page-ens'):
            yield response.follow(next_page, self.parse)
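What `response.css('h2.article-title')` does for each result can be sketched without Scrapy using the standard library's html.parser as a simplified stand-in (the sample HTML and the `TitleLinkParser` class below are made up for illustration, not part of Scrapy):

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collects (title, href) pairs from <h2 class="article-title"><a href=...> markup."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # inside an <h2 class="article-title">
        self.in_a = False    # inside an <a> within that <h2>
        self.href = None
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2' and 'article-title' in attrs.get('class', ''):
            self.in_h2 = True
        elif tag == 'a' and self.in_h2:
            self.in_a = True
            self.href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        elif tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        # Text inside the <a> is the post title.
        if self.in_a:
            self.results.append((data, self.href))

html = '<h2 class="article-title"><a href="/post/1">Python tips</a></h2>'
p = TitleLinkParser()
p.feed(html)
print(p.results)  # [('Python tips', '/post/1')]
```

Scrapy's real selectors handle this matching for you; the point is only that each `h2.article-title` element yields one title/link pair.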
Run it:
scrapy runspider test.py
When the crawl finishes, result.txt in the same directory holds the titles and links of all the posts.
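Since each line of result.txt is "title URL", reading it back into (title, link) pairs is straightforward. A sketch with made-up sample lines (splitting from the right, since a title may itself contain spaces):

```python
# Sample lines in the "title URL" format the spider writes (contents made up).
lines = [
    "Scrapy tutorial https://www.xxx.com/post/1",
    "Python tips and tricks https://www.xxx.com/post/2",
]

# rsplit with maxsplit=1 keeps spaces inside the title intact.
pairs = [tuple(line.rsplit(' ', 1)) for line in lines]
print(pairs)
# [('Scrapy tutorial', 'https://www.xxx.com/post/1'),
#  ('Python tips and tricks', 'https://www.xxx.com/post/2')]
```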