Build a multi-source (WeChat official accounts, RSS), clean, and personalized reading environment

Last update: Dec 28, 2022

Overview

2C

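As a heavy user of WeChat official accounts, I have always treated them as a place to pick up knowledge. The more you use them, though, the more likely you are to run into one annoying problem: advertising.

Suppose you follow about ten official accounts. If each one runs an ad every two weeks, in theory you will see twenty-odd ads a month. In practice it is more, and on an unlucky day everything you scroll past may be an ad. If you follow twenty or thirty accounts, it is hard to escape the ad bombardment of today's official-account ecosystem.

Worse, most of these ads sell anxiety and spread a negative mood, which I find intolerable and which seriously affects my own mood. Yet some accounts do publish genuinely good articles. So how can you read the articles without the ads? If you have felt the same helplessness about ads while reading official accounts, this project is what you need.

That is why this project exists: to build a multi-source (WeChat official accounts, RSS), clean, and personalized reading environment.

PS: To be clear, viewing ads is a way of supporting authors, and to some extent it encourages them to produce better work. But when I see something I like, I tip the author directly, so the free-rider argument does not hold for me. Thanks.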

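Implementation

My approach is simple. In brief, the flow has three stages:

Collector: monitors the official accounts or blog sources you follow and builds a feed stream as the input source;
Classifier (ads): uses machine learning on historical ad data to implement an ad classifier (custom rules are supported), then automatically tags each article and persists it to MongoDB;
Distributor: relies on the API layer for data requests and responses, offers users personalized configuration, and distributes automatically according to that configuration, sending the clean articles to WeChat, DingTalk, TG, or even a self-hosted website.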

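This gives you a clean reading environment. By extension, it can also serve as a personal knowledge base, with features such as tag management and graph construction, all of which can be implemented in the API layer.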
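The three stages can be pictured with a small in-memory sketch. This is not the project's code: the Article fields, the function names, and the keyword rule standing in for the machine-learning classifier are all assumptions made for illustration, and persistence to MongoDB is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Article:
    """A collected article. Field names are illustrative, not the project's schema."""
    source: str
    title: str
    url: str
    tags: List[str] = field(default_factory=list)


def collect(feeds: Iterable[Iterable[Article]]) -> List[Article]:
    """Collector: merge several sources into a single feed."""
    return [article for feed in feeds for article in feed]


def classify(articles: Iterable[Article],
             is_ad: Callable[[Article], bool]) -> List[Article]:
    """Classifier: tag every article as 'ad' or 'clean'."""
    tagged = []
    for article in articles:
        article.tags.append("ad" if is_ad(article) else "clean")
        tagged.append(article)
    return tagged


def distribute(articles: Iterable[Article],
               senders: Iterable[Callable[[Article], None]]) -> int:
    """Distributor: hand every clean article to every sender; return the send count."""
    senders = list(senders)
    sent = 0
    for article in articles:
        if "ad" in article.tags:
            continue  # ads never reach a sender
        for send in senders:
            send(article)
            sent += 1
    return sent
```

A sender here is any callable that accepts an Article, so appending to a list works as a stand-in for a WeChat or DingTalk client when trying the flow out.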

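For implementation details, see the article [打造一个干净且个性化的公众号阅读环境] (Building a clean and personalized official-account reading environment).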

Usage

This project uses pipenv for project management. Install and run it as follows:

# Make sure a Python 3.6+ environment is available
git clone https://github.com/howie6879/2c.git
cd 2c

# Create the base environment
pipenv install --python={your_python3.6+_path}  --skip-lock --dev
# Configure .env; see doc/00.环境变量.md for details
# Start
pipenv run dev

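Reading the documentation before use is recommended:

00. 2C environment variables
01. 2C usage tutorial (more comfortable to read at the blog address)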

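Help

To improve the model's recognition accuracy, I hope everyone will contribute some ad samples. See the sample file .files/datasets/ads.csv, which uses the following format:

title	url
ad article title	ad article link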

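The same ad is usually placed across several official accounts, so before adding an entry, please check whether the record already exists. I really do hope everyone can pitch in, so please send a PR and contribute!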
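The duplicate check can be scripted. The sketch below assumes the sample file has a header row with the two columns title and url; the helper name add_ad_sample is hypothetical, and the delimiter is left as a parameter because the listing above does not settle whether the file uses commas or tabs.

```python
import csv
from pathlib import Path


def add_ad_sample(csv_path, title, url, delimiter=","):
    """Append (title, url) unless either value is already recorded.

    Returns True when a row was written and False when it was a duplicate.
    """
    path = Path(csv_path)
    rows = []
    if path.exists():
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f, delimiter=delimiter))

    # Treat a matching title OR a matching url as an existing record.
    if any(r.get("title") == title or r.get("url") == url for r in rows):
        return False

    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"], delimiter=delimiter)
        if write_header:
            writer.writeheader()
        writer.writerow({"title": title, "url": url})
    return True
```

Calling add_ad_sample(".files/datasets/ads.csv", "some ad title", "https://example.com/ad") would then either append the row or report that it is already present. The sketch assumes the existing file ends with a newline.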

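Acknowledgements

Many thanks to the following projects:

Thanks to the following developers for their contributions (in no particular order):

About

Feel free to get in touch with me (follow to join the group):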

Comments

Installed with docker in one step; running fails with ERROR Liuli 执行失败！'doc_source'

The run log is below. What is the problem?

[2022:02:18 10:51:54] INFO  Liuli Schedule(v0.2.1) task([email protected]_team) started successfully :)

[2022:02:18 10:51:54] INFO  Liuli Task([email protected]_team) schedule time:

 00:10

 12:10

 21:10

[2022:02:18 10:51:54] ERROR Liuli 执行失败！'doc_source'

opened by GuoZhaoHui628 24

Collecting official accounts whose names contain spaces always fails

[2022:05:27 08:11:47] INFO Request <GET: https://weixin.sogou.com/weixin?type=1&query=丁爸20%情报分析师的工具箱&ie=utf8&s_from=input&sug=n&sug_type=>
liuli_schedule | [2022:05:27 08:11:48] ERROR SGWechatSpider <Item: Failed to get target_item's value from html.>
liuli_schedule | Traceback (most recent call last):
liuli_schedule | File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/spider.py", line 197, in _process_async_callback
liuli_schedule | async for callback_result in callback_results:
liuli_schedule | File "/data/code/src/collector/wechat/sg_ruia_start.py", line 58, in parse
liuli_schedule | async for item in SGWechatItem.get_items(html=html):
liuli_schedule | File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/item.py", line 127, in get_items
liuli_schedule | raise ValueError(value_error_info)
liuli_schedule | ValueError: <Item: Failed to get target_item's value from html.>
bug
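Whether query encoding is the cause is not established in this thread, but it is easy to rule out. The sketch below shows how Python's standard library percent-encodes an account name containing a space. The helper name build_search_url is hypothetical, and the parameter set is a subset copied from the logged request, not the collector's actual code.

```python
from urllib.parse import quote, urlencode


def build_search_url(account_name):
    """Build a search URL with the account name percent-encoded."""
    params = {"type": "1", "query": account_name, "ie": "utf8"}
    # quote_via=quote encodes a space as %20; the default (quote_plus) would use "+".
    return "https://weixin.sogou.com/weixin?" + urlencode(params, quote_via=quote)
```

Comparing the output of this helper with the URL in the log is one way to check whether the space in the name reached the request intact.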

opened by hackdoors 7

liuli_schedule exited with code 0

I installed by following the instructions at https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA.

The actual files and code are as follows:

Contents of pro.env:

PYTHONPATH=${PYTHONPATH}:${PWD}
LL_M_USER="liuli"
LL_M_PASS="liuli"
LL_M_HOST="liuli_mongodb"
LL_M_PORT="27017"
LL_M_DB="admin"
LL_M_OP_DB="liuli"
LL_FLASK_DEBUG=0
LL_HOST="0.0.0.0"
LL_HTTP_PORT=8765
LL_WORKERS=1
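# None of the settings above need changing; only the ones below need to be configured individually
# Enter your actual IP
LL_DOMAIN="http://172.17.0.1:8765"
# Enter your WeChat distribution settings
LL_WECOM_ID="custom"
LL_WECOM_AGENT_ID="custom"
LL_WECOM_SECRET="custom"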

Contents of default.json:

{
    "name": "default",
    "author": "liuli_team",
    "collector": {
        "wechat_sougou": {
            "wechat_list": [
                "老胡的储物柜"
            ],
            "delta_time": 5,
            "spider_type": "playwright"
        }
    },
    "processor": {
        "before_collect": [],
        "after_collect": [{
            "func": "ad_marker",
            "cos_value": 0.6
        }, {
            "func": "to_rss",
            "link_source": "github"
        }]
    },
    "sender": {
        "sender_list": ["wecom"],
        "query_days": 7,
        "delta_time": 3
    },
    "backup": {
        "backup_list": ["mongodb"],
        "query_days": 7,
        "delta_time": 3,
        "init_config": {},
        "after_get_content": [{
            "func": "str_replace",
            "before_str": "data-src=\"",
            "after_str": "src=\"https://images.weserv.nl/?url="
        }]
    },
    "schedule": {
        "period_list": [
            "00:10",
            "12:10",
            "21:10"
        ]
    }
}
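The config above lists an ad_marker step with cos_value set to 0.6. How the project computes that value is not shown here. The sketch below assumes it is a cosine-similarity threshold between a new title and known ad titles, and uses character bigram counts as a deliberately simple stand-in for whatever vectorization the project really uses. All function names are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count overlapping two-character slices of the text."""
    text = text.strip()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def looks_like_ad(title, known_ad_titles, threshold=0.6):
    """Flag a title whose similarity to any known ad title reaches the threshold."""
    vec = char_bigrams(title)
    return any(cosine(vec, char_bigrams(ad)) >= threshold for ad in known_ad_titles)
```

Under this reading, raising the threshold marks fewer articles as ads and lowering it marks more.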

Contents of docker-compose.yml:

version: "3"
services:
  liuli_api:
    image: liuliio/api:v0.1.3
    restart: always
    container_name: liuli_api
    ports:
      - "8765:8765"
    volumes:
      - ./pro.env:/data/code/pro.env
    depends_on:
      - liuli_mongodb
    networks:
      - liuli-network
  liuli_schedule:
    image: liuliio/schedule:v0.2.4
    restart: always
    container_name: liuli_schedule
    volumes:
      - ./pro.env:/data/code/pro.env
      - ./liuli_config:/data/code/liuli_config
    depends_on:
      - liuli_mongodb
    networks:
      - liuli-network
  liuli_mongodb:
    image: mongo:3.6
    restart: always
    container_name: liuli_mongodb
    environment:
      - MONGO_INITDB_ROOT_USERNAME=liuli
      - MONGO_INITDB_ROOT_PASSWORD=liuli
    ports:
      - "27027:27017"
    volumes:
      - ./mongodb_data:/data/db
    command: mongod
    networks:
      - liuli-network

networks:
  liuli-network:
    driver: bridge

The error output is:

liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Loading .env environment variables...
liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
liuli_schedule  | Loading .env environment variables...
liuli_schedule exited with code 0

I suspect it is a Python path problem. My Python path is:

which python3 # /usr/bin/python3

My VPS does not have a ${PYTHONPATH} system variable:

echo ${PYTHONPATH} # NULL

How should I fix this?

opened by huangwb8 7

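The Liuli project needs a logo

Origin of the project name, provided by group member @ Sngxpro:

Codename: 琉璃 (Liuli)

English: RuriElysion
 or: RuriWorld

slogan: 琉璃开净界，薜荔启禅关 (roughly, "Lapis lazuli opens the pure realm; climbing fig unlocks the gate of Zen"), from Mei Yaochen, 《缑山子晋祠 会善寺》

Meaning: to build a pure land like the Eastern Pure Lapis Lazuli World. The Medicine Buddha Sutra (《药师经》) says: 「然彼佛土，一向清净，无有女人，亦无恶趣，及苦音声。」 (roughly, "That Buddha land has always been pure; there are no women, no evil destinies, and no sounds of suffering.")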

help wanted

opened by howie6879 7

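Please include the original article link in the RSS feed

I am planning to write a script that fetches full text through a full-text extraction API and then emails it to my Gmail in a custom format... That way, besides newsletters, RSS subscriptions and WeChat official accounts could all be read directly in Spark...

However, the paid full-text API I found is fairly strict: it rejects the link format used in the RSS feed, and even after conversion with decodeURIComponent the format is still incorrect.

If the RSS feed carried a link to the original web page, I could fetch the full text from the original link without errors!

I hope the author can support this! Thanks :)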

opened by CenBoMin 5
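Feature request: stop changing updated in the generated RSS
Below is an excerpt of the generated RSS. The updated date here is the time at which liuli refreshed the entry during a periodic run; even for an RSS entry from long ago, updated is reset to the current time.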

<entry>
  <id>liuli_wechat - 谷歌开发者 - 社区说｜TensorFlow 在工业视觉中的落地</id>
  <title>社区说｜TensorFlow 在工业视觉中的落地 </title>
  <updated>2022-05-28T13:17:35.903720+00:00</updated>
  <author>
    <name>liuli_wechat - GDG</name>
  </author>
  <content/>
  <link href="https://ddns.ysmox.com:8766/backup/liuli_wechat/谷歌开发者/%E7%A4%BE%E5%8C%BA%E8%AF%B4%EF%BD%9CTensorFlow%20%E5%9C%A8%E5%B7%A5%E4%B8%9A%E8%A7%86%E8%A7%89%E4%B8%AD%E7%9A%84%E8%90%BD%E5%9C%B0" rel="alternate"/>
  <published>2022-05-25T17:30:46+08:00</published>
</entry>

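This causes problems: some RSS readers (such as Tiny Tiny RSS) sort the timeline by updated rather than published, so there is no effective way to tell which entries were generated recently and which were generated earlier.

So I would like updated to stay fixed (for example, record the current time when the entry is first stored in MongoDB and leave it unchanged on periodic updates), or to match published.

Finally, I hope I have expressed my problem and request clearly. Thanks!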
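The first option requested above (record the time once, then leave it alone) can be sketched with a plain dict standing in for the MongoDB collection. The function name upsert_entry and the field layout are assumptions for illustration, not liuli's actual storage code.

```python
from datetime import datetime, timezone


def upsert_entry(store, entry_id, fields, now=None):
    """Insert or refresh an entry while keeping its first-seen 'updated' time.

    store is a dict used here in place of a database collection.
    """
    now = now or datetime.now(timezone.utc)
    # setdefault only writes 'updated' when the entry is new.
    entry = store.setdefault(entry_id, {"updated": now})
    # Later runs may refresh other fields but never overwrite 'updated'.
    entry.update({k: v for k, v in fields.items() if k != "updated"})
    return entry
```

With MongoDB itself, the $setOnInsert update operator combined with an upsert gives the same write-once behavior for a field.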
enhancement
opened by YsMox 3
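The demo for crawling WeChat official accounts fails to run
Following https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA, I just started the demo to try crawling WeChat official account content, but the log shows that the run failed.

Loading .env environment variables...
[2022:05:09 10:55:45] INFO Liuli Schedule(v0.2.4) task([email protected]_team) started successfully :)
[2022:05:09 10:55:45] INFO Liuli Task([email protected]_team) schedule time:
 00:10
 12:10
 21:10
[2022:05:09 10:55:45] ERROR Liuli 执行失败！'doc_source'

The liuli schedule image version used in the docker compose file given in the article does not include playwright, while the default json in the article specifies playwright for crawling WeChat content. I tried switching to the image version that includes playwright, and it still reports an execution failure.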
opened by Colin-XKL 3
Timestamp cleaning fails when crawling official-account articles
Test script:

from src.collector.wechat_feddd.start import WeiXinSpider

WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
WeiXinSpider.start()

Cause: during data cleaning the expected format is 2022-03-21 20:59, but the data actually returned is 2022-03-22 20:37:12, which makes the clean_doc_ts function raise an error.
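One way to tolerate both shapes is to try each known format in turn. This is a minimal sketch, not the project's clean_doc_ts: the function name parse_doc_ts and the format list are assumptions based only on the two timestamps quoted above.

```python
from datetime import datetime

# Formats seen in the report: with seconds and without.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")


def parse_doc_ts(raw):
    """Parse a publish time written in any of the known formats."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized timestamp: {raw!r}")
```

Listing the formats explicitly keeps the failure loud if a third shape ever shows up, instead of silently guessing.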
opened by showthesunli 3
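Fetch the WeCom distribution department ID parameter dynamically
Two new configuration options:

# WeCom recipients (enter user accounts, case-insensitive); separate multiple users with ;
CC_WECOM_TO_USER=""
# WeCom recipient departments (enter department names); separate multiple departments with ;
CC_WECOM_PARTY=""

If neither option is set, messages go by default to all users in all departments of the current application; if they are set, distribution follows the configured values.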
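The rule described above can be sketched as a small parser. The helper names split_setting and load_wecom_targets, and the shape of the returned dict, are assumptions for illustration; only the two variable names and the ; separator come from the description.

```python
import os


def split_setting(value):
    """Split a ;-separated setting into trimmed, non-empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def load_wecom_targets(env=None):
    """Read both settings and apply the 'send to everyone if both are empty' rule."""
    env = os.environ if env is None else env
    users = split_setting(env.get("CC_WECOM_TO_USER", ""))
    parties = split_setting(env.get("CC_WECOM_PARTY", ""))
    return {
        "users": users,
        "parties": parties,
        "send_to_all": not users and not parties,
    }
```

Passing a plain dict as env makes the rule easy to try without touching real environment variables.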
opened by zyd16888 1
With version 0.24, the schedule cannot be started by following the tutorial

If I add the pro.env file manually as the tutorial says, docker fails to start. If I do not add the file manually, starting docker automatically creates a pro.env folder, and docker then loops, printing the following log:

Loading .env environment variables...
Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
Warning: file PIPENV_DOTENV_LOCATION=./pro.env does not exist!! Not loading environment variables.
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/code/src/liuli_schedule.py", line 84, in run_liuli_schedule
ll_config = json.load(load_f)
File "/usr/local/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/local/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
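The traceback ends in json.load raising JSONDecodeError with "Expecting value", which Python raises when the text does not begin with a JSON value, for example when the file is empty. A pre-flight check along the following lines can separate the likely cases (path is a directory, file is empty, file is malformed). The function name load_config is hypothetical and this is not part of liuli.

```python
import json
from pathlib import Path


def load_config(path):
    """Load a JSON config and fail with a message that names the actual problem."""
    p = Path(path)
    if not p.is_file():
        # Also covers a bind mount that created a directory in place of a file.
        raise FileNotFoundError(f"config is missing or not a regular file: {p}")
    text = p.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"config file is empty: {p}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"{p}: invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
```

Running such a check against the mounted config file before starting the container narrows down which of these cases applies.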

opened by zhyueyueniao 1
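Distributor support
The current plan is to support sending articles to the following endpoints:

[x] DingTalk: fairly open and easy to integrate, recommended @howie6879

[x] WeChat: WeCom or a hook could be considered @howie6879

[x] RSS generator module @howie6879

[x] TG @123seven

[x] Bark @LeslieLeung

[ ] Feishu

Requests for more distribution endpoints are welcome in the comments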
enhancement help wanted
opened by howie6879 14

Releases(v0.2.0)

v0.2.0(Feb 10, 2022)
v0.2.0 2022-02-11

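liuli v0.2.0 👏 has been released. See the project board for the plan; related features and improvements are described below.

Improvements:

Refactored part of the code and renamed the project to liuli

Improved deployment efficiency with docker-compose support #17

Reduced project size from 100m to 3m (models removed)

Fixes:

Distributor: the WeCom distribution department ID parameter was not fixed #16 @zyd16888

Fixed connection failure for passwords containing special characters #35 @gclm

Features:

Official website @123seven

LOGO @我妹妹 (my sister)

[Collector] Subscription support for the books and novels category

[Distributor] Support for TG and Bark #8

TG @123seven

Bark @LeslieLeung

RSS support

Backup support: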

MongoDB

GitHub


Owner

howie.hu

奇文共欣赏，疑义相与析 (roughly, "Remarkable writing is enjoyed together; doubtful points are worked out together")

