Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
Instagram-follower-bot - An Instagram follower bot written in Python

Instagram Follower Bot An Instagram follower bot written in Python. The bot follows the follower of which account you want. e.g. (You want to follow @

Aytaç Kaşoğlu 1 Dec 31, 2021
A Python wrapper for the Dogehouse API.

Python wrapper for the dogehouse API Installation pip install dogehouse Example from dogehouse import DogeClient, event, command from dogehouse.entiti

Arthur 36 Jun 15, 2022
LoL 台版10周年活動自動輸入邀請碼

LoLTW_10Year_88Event LoLTW 8.8 周年慶 邀請碼自動輸入 設定 在 LoLTW_10Year_88Evnet.exe 的位置建立一個檔案 .env,內容如下 Bahamut_Discussion = https://forum.gamer.com.tw/C.php?bsn

古丁丁 5 Dec 13, 2021
Fetch Flipkart product details including name, price, MRP and Stock details in general as well as specific to a pincode

Fetch Flipkart product details including name, price, MRP and Stock details in general as well as specific to a pincode

Vishal Das 6 Jul 11, 2022
Library written in Python that wraps Halo Infinite API.

haloinfinite Library written in Python that wraps Halo Infinite API. Before start It's unofficial, reverse-engineered, neither stable nor production r

Miguel Ferrer 4 Dec 28, 2022
Interact and easily use Google Chat room webhooks.

Chat Webhooks Easily interact and send messages with Google Chat's webhooks feature. This API is small, but should be a nice framework for working wit

BD103 2 Dec 13, 2021
WhatsApp Api Python - This documentation aims to exemplify the use of Moorse Whatsapp API in Python

WhatsApp API Python ChatBot Este repositório contém uma aplicação que se utiliza

Moorse.io 3 Jan 08, 2022
Python script to extract all Humble Bundle keys and redeem them on Steam automagically.

humble-steam-key-redeemer Python script to extract all Humble keys and redeem them on Steam automagically. This is primarily designed to be a set-it-a

74 Jan 08, 2023
A mass account list editor for python

Account-List-Editor This is an mass account list editor Usage Run the editor.py file with python (python3 ./editor.py) Press a button (1/2) and drag &

ExtremeDev 1 Dec 20, 2021
A fork of lavalink.py built for nextcord

nextcord-ext-lava is a wrapper for Lavalink which abstracts away most of the code necessary to use Lavalink, allowing for easier integration into your projects, while still promising full API coverag

nextcord-ext 4 Feb 27, 2022
Pancakeswap Sniper Bot GUI Uniswap Matic 2022 (WINDOWS LINUX MAC) AUTO BUY TOKEN ON LAUNCH AFTER ADD LIQUIDITY

Pancakeswap Sniper Bot GUI Uniswap Matic 2022 (WINDOWS LINUX MAC) ⭐️ AUTO BUY TOKEN ON LAUNCH AFTER ADD LIQUIDITY ⭐️ ⭐️ First GUI SNIPER BOT for WINDO

Crypto Trader 1 Jan 05, 2022
An API wrapper for Discord written in Python.

disnake A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. About disnake All the contributors and develop

557 Jan 05, 2023
Wrapper for the Swiss Parliament API for Python

swissparlpy This module provides easy access to the data of the OData webservice of the Swiss parliament. Table of Contents Installation Usage Get tab

Stefan Oderbolz 8 Jun 13, 2022
A Simple Telegram Bot that can Download Files From Mega.nz and Upload It to Telegram

MegaDL-Bot A Simple Telegram Bot By @mrkpbots to Download Files From Mega.nz and Upload It to Telegram Features No Login Required All Mega.nz File Lin

MRKP BOTS 5 Feb 20, 2022
Trust-minimized Bitcoin wallet

coldcore Trust-minimized, airgapped Bitcoin management This is experimental software. Wait for a formal release before use with real funds. A trust-mi

James O'Beirne 121 Jan 01, 2023
A simple Discord Mass Dm with Scraper

Python-Mass-DM A simple Discord Mass Dm with Scraper If Member Scraper in Taliban.py doesn't work. You can DM me cuz that scraper is for tokens that g

RyanzSantos 4 Sep 02, 2022
Telegram File Renamer Bot

RENAMER_BOT Telegram File Renamer Bot Configs TG_BOT_TOKEN - Get bot token from @BotFather API_ID - From my.telegram.org API_HASH - From my.telegram.o

Lntechnical 37 Dec 27, 2022
This Telegram bot allows you to create direct links with pre-filled text to WhatsApp Chats

WhatsApp API Bot Telegram bot to create direct links with pre-filled text for WhatsApp Chats You can check our bot here. The bot is based on the API p

RobotTrick • רובוטריק 17 Aug 20, 2022
An unofficial library for discord components (under-development)

discord-components An unofficial library for discord components (under-development) Welcome! Discord components are cool, but discord.py will support

11 Jun 14, 2021
Coin-based opinion monitoring system

介绍 本仓库提供了基于币安 (Binance) 的二级市场舆情系统,可以根据自己的需求修改代码,设定各类告警提示 代码结构 binance.py - 与币安API交互 data_loader.py - 数据相关的读写 monitor.py - 监控的核心方法实现 analyze.py - 基于历史数

luv_dusk 6 Jun 08, 2022