
## broadcrawler.py

Uses CrawlSpider with Rule to extract links, and the customized newspaper library to decide whether a page is a news article and to parse its content. If it is news we parse it; if it is not, the page can simply be discarded for now, since no downstream system consumes it yet.
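
A minimal sketch of that flow, assuming a spider named `broadcrawler`, a placeholder seed URL, and an illustrative `looks_like_news` check standing in for the repo's real patched logic:

```python
# Illustrative only: rule patterns, field names and the news check are assumptions.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from newspaper import Article


def looks_like_news(article):
    """Hypothetical stand-in for the patched is-news check."""
    return bool(article.title) and len(article.text) > 200


class BroadCrawlerSpider(CrawlSpider):
    name = 'broadcrawler'
    start_urls = ['http://news.example.com/']  # placeholder seed

    # Follow every extracted link and hand each response to parse_item.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        article = Article(response.url)
        article.download()   # the repo's patched download/parse would run here
        article.parse()
        if not looks_like_news(article):
            return           # not news: drop it for now
        yield {
            'url': response.url,
            'title': article.title,
            'content': article.text,
        }
```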

## Begin.py

Run or debug this file directly; it replaces starting the spider from the command line, which is more convenient.
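
A launcher like this usually just wraps Scrapy's command line; a minimal sketch, assuming the spider is registered as `broadcrawler`:

```python
# Begin.py - start the crawl from an IDE instead of typing the command in a shell.
from scrapy.cmdline import execute

if __name__ == '__main__':
    # Equivalent to running: scrapy crawl broadcrawler
    execute(['scrapy', 'crawl', 'broadcrawler'])
```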

## items.py

BroadcrawlerItem holds the news item fields; the similarity field can be ignored.
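
The item class is an ordinary Scrapy Item; a sketch with assumed field names (the repo's actual fields may differ):

```python
import scrapy


class BroadcrawlerItem(scrapy.Item):
    # Field names below are illustrative guesses at typical news fields.
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    authors = scrapy.Field()
    publish_date = scrapy.Field()
```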

## patcher.py

Customizes the newspaper library by overriding some of its functions. The pattern is:

1. Implement the replacement method you need.
2. In enable_patch(), bind the replacement (or newly added) method onto the original class.
3. Test that the patch works.

```python
def enable_patch(self):
    # Swap the customized implementations in for newspaper's originals
    Article.download = download
    Article.parse = parse
    Article.is_news = None
    Config.fetch_videos = None
    ContentExtractor.get_publishing_date = get_publishing_date
    ContentExtractor.get_authors = get_authors
    ContentExtractor.get_title = get_title
```
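
Each replacement is a plain module-level function whose first parameter is `self`, so binding it onto the class makes it behave like a normal method. A hypothetical illustration of the pattern (not the repo's actual patch code):

```python
from newspaper.extractors import ContentExtractor


# 1. Implement the replacement; `self` will be the ContentExtractor instance.
def get_authors(self, doc):
    """Hypothetical override: skip author extraction entirely."""
    return []


# 2. Bind it onto the original class, exactly as enable_patch() does above.
ContentExtractor.get_authors = get_authors

# 3. Test: Article.parse() now calls the patched method.
```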

## pipeline.py

- JsonWithEncodingPipeline saves items as JSON.
- MongoPipeline saves items to MongoDB. If MongoDB is not set up yet, configure it first; there are plenty of tutorials online. A sketch of both pipelines follows below.
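
The sketch below shows what the two pipelines typically look like (output file name, database/collection names and the Mongo URI are placeholders, not the repo's actual values):

```python
import json

import pymongo


class JsonWithEncodingPipeline:
    """Write each item as one UTF-8 JSON line so Chinese text stays readable."""

    def open_spider(self, spider):
        self.file = open('news.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


class MongoPipeline:
    """Insert each item into a MongoDB collection."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['broadcrawler']['news']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```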

## settings.py

Every relevant setting field in settings.py is documented with comments.
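
The settings a broad crawl usually touches look roughly like this (values and module paths below are illustrative, not the repo's actual configuration):

```python
# settings.py excerpts - illustrative values only.
BOT_NAME = 'broadcrawler'
ROBOTSTXT_OBEY = False        # adjust to your own crawling policy
CONCURRENT_REQUESTS = 32      # broad crawls benefit from higher concurrency
DOWNLOAD_DELAY = 0.25         # stay polite to individual hosts

# Enable the pipelines described above; lower numbers run first.
ITEM_PIPELINES = {
    'broadcrawler.pipelines.JsonWithEncodingPipeline': 300,
    'broadcrawler.pipelines.MongoPipeline': 400,
}
```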

## Reference