Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
⚡TIKTOK BOT - FAST OPTIMIZED ZEFOY SCRIPT

⚡ ZEFOY [ TikTok Zefoy Bot ] Get the script in: discord.gg/onlp !! Official shop: onlp.sellix.io Newest version v.9.0.0 Requirements pip install p

Tekky 186 Dec 31, 2022
Documentation and Samples for the Official HN API

Hacker News API Overview In partnership with Firebase, we're making the public Hacker News data available in near real time. Firebase enables easy acc

Y Combinator Hacker News 9.6k Jan 03, 2023
As Slack no longer provides an API to invite people, this is a Selenium Python script to do so

As Slack no longer provides an API to invite people, this is a Selenium Python script to do so

A Python API for Connected 2

connected API for Connected 2 api for the { connected 2 } programmer : api report api follow api check username api forget password api Search api cha

2 Jun 05, 2022
Maintained Fork of Jishaku For nextcord

Onami a debugging and utility extension for nextcord bots Read the documentation online. Fork Onami is a actively maintained fork of Jishaku for nextc

RPS 11 Dec 14, 2022
Unofficial Meteor Client wiki

Welcome to the Unofficial Meteor Client wiki! Meteor FAQs | A rewritten and better FAQ page. Installation Guide | A guide on how to install Meteor Cli

Anti Cope 0 Feb 21, 2022
A simple terminal UI for viewing fund P/L analysis through TEFAS

Tefas UI A simple terminal UI for viewing fund P/L analysis through TEFAS. Features (that my own bank's UI lack): Daily and weekly P/L FX comparisons

Batuhan Taskaya 4 Mar 14, 2022
A telegram bot writen in python for mirroring files on the internet to Google Drive

owner of this repo :- AYUSH contact me :- AYUSH Slam Mirror Bot This is a telegram bot writen in python for mirroring files on the internet to our bel

Thanusara Pasindu 1 Nov 21, 2021
Coinbase Listing Sniper

Coinbase Listing Sniper Script that listens to the @CoinbaseAssets twitter to find information about new Coinbase listings, and automatically buys 100

4 Oct 26, 2022
in-progress decompilation of Gauntlet Legends decompression code on the N64

Gauntlet-Legends A in-progress decompilation of Gauntlet-Legends (N64) decompression code. This project currently supports the US release. Building (L

6 Jul 23, 2022
CLI tool that checks who does and who does not follow you back on Instagram

CLI tool that checks who does and who does not follow you back on Instagram. It also checks who you don't follow back on Instagram.

Ayushman Roy 3 Dec 02, 2022
Cryptocurrency Prices Telegram Bot For Python

Cryptocurrency Prices Telegram Bot How to Run Set your telegram bot token as environment variable TELEGRAM_BOT_TOKEN: export TELEGRAM_BOT_TOKEN=your_

Sina Nazem 3 Oct 31, 2022
The Social-Engineer Toolkit (SET) is specifically designed to perform advanced attacks against the human element.

The Social-Engineer Toolkit (SET) The Social-Engineer Toolkit (SET) is specifically designed to perform advanced attacks against the human element. SE

Professor 6 Nov 28, 2022
Zendesk Ticket Viewer is a lightweight commandline client for fetching and displaying tickets from a Zendesk account provided by the user

Zendesk Ticket Viewer is a lightweight commandline client for fetching and displaying tickets from a Zendesk account provided by the user.

Parthesh Soni 1 Jan 24, 2022
A tool to customize your discord tokens

Fastest Discord Token Manager - Features: Change Token Username Change Token Password Change Token Avatar Change Token Bio This tool is created by Ace

trey 15 Dec 27, 2022
Discord bots that update their status to the price of any coin listed on x.vite.net

Discord bots that update their status to the price of any coin listed on x.vite.net

5am 3 Nov 27, 2022
Running Performance Calculator

Running Performance Calculator 👉 Have you ever wondered if you ran 10km at 2000

Davide Liu 6 Oct 26, 2022
Music cog for discord bots. Supports YouTube, YoutubeMusic, SoundCloud and Spotify.

dismusic Music cog for discord bots. Supports YouTube, YoutubeMusic, SoundCloud and Spotify. Installation python3 -m pip install dismusic Usage from d

Md Shahriyar Alam 59 Jan 08, 2023
A delightful and complete interface to GitHub's amazing API

ghapi A delightful and complete interface to GitHub's amazing API ghapi provides 100% always-updated coverage of the entire GitHub API. Because we aut

fast.ai 428 Jan 08, 2023
Administration Panel for Control FiveM Servers From Discord

FiveM Discord Administration Panel Version 1.0.0 If you would like to report an issue or request a feature. Join our Discord or create an issue. Contr

NIma 9 Jun 17, 2022