Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Overview

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

This project is a good starting point for those who have little or no experience with Apache Spark Streaming. We use Twitter data since Twitter provides an API for developers that is easy to access. We present an end-to-end architecture on how to stream data from Twitter, clean it, and apply a simple sentiment analysis model to detect the polarity and subjectivity of each tweet.

Input data: Live tweets with a keyword
Main model: Data preprocessing and apply sentiment analysis on the tweets
Output: A parquet file with all the tweets and their sentiment analysis scores (polarity and subjectivity)

We use Python version 3.7.6 and Spark version 2.4.7. We should be cautious on the versions that we use because different versions of Spark require a different version of Python.

Main Libraries

tweepy: interact with the Twitter Streaming API and create a live data streaming pipeline with Twitter
pyspark: preprocess the twitter data (Python's Spark library)
textblob: apply sentiment analysis on the twitter text data

Instructions

First, run the Part 1: twitter_connection.py and let it continue running.
Then, run the Part 2: sentiment_analysis.py from a different IDE.

Part 1: Send tweets from the Twitter API 

In this part, we use our developer credentials to authenticate and connect to the Twitter API. We also create a TCP socket between Twitter's API and Spark, which waits for the call of the Spark Structured Streaming and then sends the Twitter data. Here, we use Python's Tweepy library for connecting and getting the tweets from the Twitter API. 

Part 2: Tweet preprocessing and sentiment analysis

In this part, we receive the data from the TCP socket and preprocess it with the pyspark library, which is Python's API for Spark. Then, we apply sentiment analysis using textblob, which is Python's library for processing textual data. After sentiment analysis, we save the tweet and the sentiment analysis scores in a parquet file, which is a data storage format.

Owner
Himanshu Kumar singh
Always Passionate to work on Artificial Intelligence technologies. Getting better at AI with each passing day. #TowardsAI
Himanshu Kumar singh
Flood modeling by 2D shallow water equation

hydraulicmodel Flood modeling by 2D shallow water equation. Refer to Hunter et al (2005), Bates et al. (2010). Diffusive wave approximation Local iner

6 Nov 30, 2022
PyClustering is a Python, C++ data mining library.

pyclustering is a Python, C++ data mining library (clustering algorithm, oscillatory networks, neural networks). The library provides Python and C++ implementations (C++ pyclustering library) of each

Andrei Novikov 1k Jan 05, 2023
EOD Historical Data Python Library (Unofficial)

EOD Historical Data Python Library (Unofficial) https://eodhistoricaldata.com Installation python3 -m pip install eodhistoricaldata Note Demo API key

Michael Whittle 20 Dec 22, 2022
PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j.

PostQF Copyright ยฉ 2022 Ralph Seichter PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j. See the ma

Ralph Seichter 11 Nov 24, 2022
Larch: Applications and Python Library for Data Analysis of X-ray Absorption Spectroscopy (XAS, XANES, XAFS, EXAFS), X-ray Fluorescence (XRF) Spectroscopy and Imaging

Larch: Data Analysis Tools for X-ray Spectroscopy and More Documentation: http://xraypy.github.io/xraylarch Code: http://github.com/xraypy/xraylarch L

xraypy 95 Dec 13, 2022
.npy, .npz, .mtx converter.

npy-converter Matrix Data Converter. Expand matrix for multi-thread, multi-process Divid matrix for multi-thread, multi-process Support: .mtx, .npy, .

taka 1 Feb 07, 2022
statDistros is a Python library for dealing with various statistical distributions

StatisticalDistributions statDistros statDistros is a Python library for dealing with various statistical distributions. Now it provides various stati

1 Oct 03, 2021
Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.

Streaming Data Pipeline - Kafka + ELK Stack Streaming weather data using Apache Kafka and Elastic Stack. Data source: https://openweathermap.org/api O

Felipe Demenech Vasconcelos 2 Jan 20, 2022
Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment

Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment Brief explanation of PT Bukalapak.com Tbk Bukalapak was found

Najibulloh Asror 2 Feb 10, 2022
BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Vo Cong Thanh 1 Jan 06, 2022
First steps with Python in Life Sciences

First steps with Python in Life Sciences This course material is part of the "First Steps with Python in Life Science" three-day course of SIB-trainin

SIB Swiss Institute of Bioinformatics 22 Jan 08, 2023
Calculate multilateral price indices in Python (with Pandas and PySpark).

IndexNumCalc Calculate multilateral price indices using the GEKS-T (CCDI), Time Product Dummy (TPD), Time Dummy Hedonic (TDH), Geary-Khamis (GK) metho

Dr. Usman Kayani 3 Apr 27, 2022
Programmatically access the physical and chemical properties of elements in modern periodic table.

API to fetch elements of the periodic table in JSON format. Uses Pandas for dumping .csv data to .json and Flask for API Integration. Deployed on "pyt

the techno hack 3 Oct 23, 2022
Nobel Data Analysis

Nobel_Data_Analysis This project is for analyzing a set of data about people who have won the Nobel Prize in different fields and different countries

Mohammed Hassan El Sayed 1 Jan 24, 2022
๐Ÿงช Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

๐Ÿงช๐Ÿ“ˆ ๐Ÿ. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python a

Marc Skov Madsen 97 Dec 08, 2022
Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

google_takeout_parser parses both the Historical HTML and new JSON format for Google Takeouts caches individual takeout results behind cachew merge mu

Sean Breckenridge 27 Dec 28, 2022
Analysis scripts for QG equations

qg-edgeofchaos Analysis scripts for QG equations FIle/Folder Structure eigensolvers.py - Spectral and finite-difference solvers for Rossby wave eigenf

Norman Cao 2 Sep 27, 2022
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

Bell Eapen 14 Jan 02, 2023
Very basic but functional Kakuro solver written in Python.

kakuro.py Very basic but functional Kakuro solver written in Python. It uses a reduction to exact set cover and Ali Assaf's elegant implementation of

Louis Abraham 4 Jan 15, 2022
Data analysis and visualisation projects from a range of individual projects and applications

Python-Data-Analysis-and-Visualisation-Projects Data analysis and visualisation projects from a range of individual projects and applications. Python

Tom Ritman-Meer 1 Jan 25, 2022