This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

Overview

crawler_to_visual_gmane

Analyzing an EMAIL Archive from gmane and vizualizing the data using the D3 JavaScript library.

This is a set of tools that allow you to pull down an archive of a gmane repository using the instructions at:

http://gmane.org/export.php

In order not to overwhelm the gmane.org server, I have used a copy of the messages present at:

http://mbox.dr-chuck.net/

This server will be faster and take a lot of load off the gmane.org server.

You should install the SQLite browser to view and modify the databases from:

http://sqlitebrowser.org/

The first step is to spider the gmane repository. The base URL is hard-coded in the gmane.py and is hard-coded to the Sakai developer list. You can spider another repository by changing that base url. Make sure to delete the content.sqlite file if you switch the base url. The gmane.py file operates as a spider in that it runs slowly and retrieves one mail message per second so as to avoid getting throttled by gmane.org. It stores all of its data in a database and can be interrupted and re-started as often as needed. It may take many hours to pull all the data down. So you may need to restart several times.

The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues spidering until it has spidered the desired number of messages or it reaches a page that does not appear to be a properly formatted message.

Sometimes gmane.org is missing a message. Perhaps administrators can delete messages or perhaps they get lost - I don't know. If your spider stops, and it seems it has hit a missing message, go into the SQLite Manager and add a row with the missing id - leave all the other fields blank - and then restart gmane.py. This will unstick the spidering process and allow it to continue. These empty messages will be ignored in the next phase of the process.

IMPORTANT

One nice thing is that once you have spidered all of the messages and have them in content.sqlite, you can run gmane.py again to get new messages as they get sent to the list. gmane.py will quickly scan to the end of the already-spidered pages and check if there are new messages and then quickly retrieve those messages and add them to content.sqlite.

The content.sqlite data is pretty raw, with an innefficient data model, and not compressed. This is intentional as it allows you to look at content.sqlite to debug the process. It would be a bad idea to run any queries against this database as they would be slow.

The second process is running the program gmodel.py. gmodel.py reads the rough/raw data from content.sqlite and produces a cleaned-up and well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.

Each time gmodel.py runs - it completely wipes out and re-builds index.sqlite, allowing you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the data cleaning process.

You can re-run the gmodel.py over and over as you look at the data, and add mappings to make the data cleaner and cleaner. When you are done, you will have a nicely indexed version of the email in index.sqlite. This is the file to use to do data analysis. With this file, data analysis will be really quick.

You can look at the data in index.sqlite and if you find a problem, you can update the Mapping table and DNSMapping table in content.sqlite and re-run gmodel.py.

  1. The first, simplest data analysis is to do a "who does the most" and "which organzation does the most"? This is done using gbasic.py.

  2. There is a simple vizualization of the word frequence in the subject lines in the file gword.py.

  3. A second visualization is in gline.py. It visualizes email participation by organizations over time.

Owner
Saim Zafar
I am a student at University of Waterloo, currently studying Mathematics. My interests include Computer Science, problem analysis, Algorithm design and gaming.
Saim Zafar
This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

crawler_to_visual_gmane Analyzing an EMAIL Archive from gmane and vizualizing the data using the D3 JavaScript library. This is a set of tools that al

Saim Zafar 1 Dec 20, 2021
Explore scraping with BeautifulSoup!

beautifulsoup-scrape Explore scraping with BeautifulSoup! Part One: Start from Shakespeare As my professor is a poet (yes, and he teaches me data and

Chuqin 2 Oct 05, 2022
LSpider 一个为被动扫描器定制的前端爬虫

LSpider LSpider - 一个为被动扫描器定制的前端爬虫 什么是LSpider? 一款为被动扫描器而生的前端爬虫~ 由Chrome Headless、LSpider主控、Mysql数据库、RabbitMQ、被动扫描器5部分组合而成。

Knownsec, Inc. 321 Dec 12, 2022
jd_maotai rpa 基于selenium驱动的jd抢购rpa机器人

jd_maotai rpa 基于selenium驱动的jd抢购rpa机器人, 照顾我们这样的马大哈, 不会忘记抢购了, 祝大家过年都能喝上茅台. 特别声明: 本仓库发布的jd_maotai_rpa项目定义为自动化rpa项目, 是用于防止忘记参与jd茅台的活动(由于本人时常忘记), 而不是为了秒杀和抢

35 Nov 18, 2022
A way to scrape sports streams for use with Jellyfin.

Sportyfin Description Stream sports events straight from your Jellyfin server. Sportyfin allows users to scrape for live streamed events and watch str

axelmierczuk 38 Nov 05, 2022
Audio media crawler for lbry.

Audio media crawler for lbry. Requirements Python 3.8 Poetry 1.1.7 Elasticsearch 7.14.0 Lbry-sdk 0.99.0 Development This project uses poetry as a depe

Hound.fm 4 Dec 03, 2022
Incredibly fast crawler designed for OSINT.

Photon Incredibly fast crawler designed for OSINT. Photon Wiki • How To Use • Compatibility • Photon Library • Contribution • Roadmap Key Features Dat

Somdev Sangwan 9.3k Jan 02, 2023
This scrapper scrapes the mail ids of faculty members from a given linl/page and stores it in a csv file

This scrapper scrapes the mail ids of faculty members from a given linl/page and stores it in a csv file

Devansh Singh 1 Feb 10, 2022
Find papers by keywords and venues. Then download it automatically

paper finder Find papers by keywords and venues. Then download it automatically. How to use this? Search CLI python search.py -k "knowledge tracing,kn

Jiahao Chen (TabChen) 2 Dec 15, 2022
PaperRobot: a paper crawler that can quickly download numerous papers, facilitating paper studying and management

PaperRobot PaperRobot 是一个论文抓取工具,可以快速批量下载大量论文,方便后期进行持续的论文管理与学习。 PaperRobot通过多个接口抓取论文,目前抓取成功率维持在90%以上。通过配置Config文件,可以抓取任意计算机领域相关会议的论文。 Installation Down

moxiaoxi 47 Nov 23, 2022
A tool to easily scrape youtube data using the Google API

YouTube data scraper To easily scrape any data from the youtube homepage, a youtube channel/user, search results, playlists, and a single video itself

7 Dec 03, 2022
Python web scrapper

Website scrapper Web scrapping project in Python. Created for learning purposes. Start Install python Update configuration with websites Launch script

Nogueira Vitor 1 Dec 19, 2021
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 07, 2023
A Very simple free proxy list scraper.

Scrappp A Very simple free proxy list scraper, made in python The tool scrape proxy from diffrent sites and api's. Screenshots About the script !!! RE

Joji aka Moncef 12 Oct 27, 2022
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
👁️ Tool for Data Extraction and Web Requests.

httpmapper 👁️ Project • Technologies • Installation • How it works • License Project 🚧 For educational purposes. This is a project that I developed,

15 Dec 05, 2021
An automated, headless YouTube Watcher and Scraper

Searches YouTube, queries recommended videos and watches them. All fully automated and anonymised through the Tor network. The project consists of two independently usable components, the YouTube aut

44 Oct 18, 2022
Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

1 Jan 28, 2022
Web Scraping COVID 19 Meta Portal with Python

Web-Scraping-COVID-19-Meta-Portal-with-Python - Requests API and Beautiful Soup to scrape real-time COVID statistics from worldometer website and perform data cleaning and visual analysis in Jupyter

Aarif Munwar Jahan 1 Jan 04, 2022
淘宝茅台抢购最新优化版本,淘宝茅台秒杀,优化了茅台抢购线程队列

淘宝茅台抢购最新优化版本,淘宝茅台秒杀,优化了茅台抢购线程队列

MaoTai 118 Dec 16, 2022