Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
Send to Telegram, Vk, Discord

Triple send Версия для русских: здесь Demo: Telegram: @Triple_project_bot Discord: Triple project#0877 Vkontakte: @dev.santaspeen How to run Install r

2 Sep 27, 2022
A file-based quote bot written in Python

Let's Write a Python Quote Bot! This repository will get you started with building a quote bot in Python. It's meant to be used along with the Learnin

1 Feb 23, 2022
A python package to fetch results of various national examinations done in Tanzania.

Necta-API Get a formated data of examination results scrapped from necta results website. Note this is not an official NECTA API and is still in devel

vincent laizer 16 Dec 23, 2022
Example app to be deployed to AWS as an API Gateway / Lambda Stack

Disclaimer I won't answer issues or emails regarding the project anymore. The project is old and not maintained anymore. I'm not sure if it still work

Ben 123 Jan 01, 2023
Projeto de estudantes do primeiro período do CIn - UFPE voltado para a criação de um sistema interativo no fechamento da disciplina IF669 - Introdução a Programação.

Projeto Game: Dona da Lua Alunos: Beatriz Férre Clara Kenderessy Matheus Silva Rafael Baltar Roseane Oliveira Samuel Marsaro Sinopse O Cebolinha apron

Maria Clara Kenderessy 5 Dec 20, 2021
An API which returns random AOT quote everytime it's invoked

An API which returns random AOT quote everytime it's invoked

Nishant Sapkota 1 Feb 07, 2022
A Telegram Userbot to play Audio and Video songs / files in Telegram Voice Chats.

VC UserBot A Telegram Userbot to play Audio and Video songs / files in Telegram Voice Chats. It's made with PyTgCalls and Pyrogram Requirements Python

조던 1 Nov 29, 2021
A simple google translator telegram bot

Translator-Bot A simple google translator telegram bot Please fork this repository don't import code Made with Python3 (C) @FayasNoushad Copyright per

Fayas Noushad 14 Nov 12, 2022
Aria/qBittorrent Telegram mirror/leech bot

This Telegram Bot written in Python for mirroring files on the Internet to our Google Drive or Telegram. Based on python-aria-mirror-bot Features: qBi

Anas 2.1k Jan 04, 2023
Monitoring plugin for MikroTik devices

check_routeros - Monitoring MikroTik devices This is a monitoring plugin for Icinga, Nagios and other compatible monitoring solutions to check MikroTi

DinoTools 6 Dec 24, 2022
A Python Jupyter Kernel in Slack. Just send Python code as a message.

Slack IPython bot 🤯 One Slack bot to rule them all. PyBot. Just send Python code as a message. Install pip install slack-ipython To start the bot, si

Rick Lamers 44 May 23, 2022
LoL 台版10周年活動自動輸入邀請碼

LoLTW_10Year_88Event LoLTW 8.8 周年慶 邀請碼自動輸入 設定 在 LoLTW_10Year_88Evnet.exe 的位置建立一個檔案 .env,內容如下 Bahamut_Discussion = https://forum.gamer.com.tw/C.php?bsn

古丁丁 5 Dec 13, 2021
Using twitter lists as your feed

Twitlists A while ago, Twitter changed their timeline to be algorithmically-fed rather than a simple reverse-chronological feed. In particular, they p

Peyton Walters 5 Nov 21, 2022
Skautský discord bot

Jáchym 🤖 Open-source skautský discord bot postavený na discord.py O čem? • Funkce • TODO • Poděkování ❓ O čem? Jáchym vznikl jako projekt do odborky

10 May 12, 2022
Create CDK projects with projen

The Projenator: I'll be back! Description This is a CDKv2 project that takes the grind out of setting up new cdk projects/implementations by using aut

Andrew 2 Dec 11, 2021
Shred your reddit comment and post history

trasheddit Shred your reddit comment and post history (x89/Shreddit replacement) Usage Simple Example Download trasheddit: git clone https://github.co

Elly 2 Jan 05, 2022
Discord bot that generates boba drinks. Submission for sunhacks 2021

boba-bot Team Poggies' submission for Sunhacks 2021. Find our project page on Devpost, and a video demonstration can be found on YouTube. Commands $he

Joshua Tenorio 3 Nov 02, 2022
This will create new discord accounts and add them to your server

Discord-Botter This tool will create new discord accounts add them to your server, this tool needs a captcha api like capmonster.cloud or anti-captcha

Shahzain 27 Nov 30, 2022
Telegram Group Management Bot based on Pyrogram

Komi-San Telegram Group Management Bot based on Pyrogram More updates coming soon Support Group Open a Pull request if you wana contribute Example for

33 Nov 07, 2022
This is a Telegram Bot that tracks packages from the Brazilian Mail Service.

RastreioBot About Setup Run Contribute Contact About This is a Telegram Bot that tracks packages from the Brazilian Mail Service. It runs on Python 3

Gabriel R F 320 Dec 22, 2022