Reading streams of Twitter data, saving them to Kafka, then processing them with the Kafka Streams API and Spark Streaming

Last update: Dec 06, 2021

Related tags

Data Analysis kafka-to-spark-streaming

Overview

Using Streaming Twitter Data with Kafka and Spark

Read streams of Twitter data, publish them to a Kafka topic, and process the messages with the Kafka Streams API and Spark Streaming.

Twitter is blocked in some countries. If that applies to you, make sure your VPN is switched on so that you can reach Twitter.

You also need your own consumer_key, consumer_secret, access_token, and access token secret, stored in the config.py file.
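
As an illustration, a config.py along these lines would hold the four credentials. This is only a sketch: the name access_token_secret and the TWITTER_* environment variable names are assumptions, not taken from the project, and the placeholder strings must be replaced with your own values.

```python
# config.py: a minimal sketch, not the project's actual file.
# The environment variable names and access_token_secret are assumed names.
import os

# Each value is read from the environment if set, otherwise a placeholder is used.
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "YOUR_CONSUMER_KEY")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "YOUR_CONSUMER_SECRET")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN")
access_token_secret = os.environ.get(
    "TWITTER_ACCESS_TOKEN_SECRET", "YOUR_ACCESS_TOKEN_SECRET"
)
```

Reading the values from the environment keeps real secrets out of version control; hard-coding them in config.py works too, as long as the file is never committed.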

Create an environment using conda with Python 3.8:
- conda create -n python38 python=3.8
- conda activate python38
- Check the requirements in requirements.txt and install them using conda:
  - conda install -c conda-forge tweepy==4.4.0
  - conda install -c conda-forge kafka-python==2.0.2
Kafka must be installed on your machine; check the Kafka documentation for installation instructions. If you use Homebrew on a Mac, you can run: brew install kafka
Start ZooKeeper (port 2181): zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
In another terminal window, start the broker (port 9092): kafka-server-start /usr/local/etc/kafka/server.properties
In a terminal window, list the topics you have: kafka-topics --list --bootstrap-server localhost:9092
Create the Kafka topic "tweeter" with 1 partition and a replication factor of 1 (no replication, because we are using a single local machine): kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Now list your topics again: kafka-topics --list --bootstrap-server localhost:9092
Let's see what is inside the "tweeter" topic: kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
There is absolutely nothing yet, but data will appear once we start streaming.
Now run python kafka_producer.py to start streaming from Twitter and pushing messages to the topic.
Then check that the data is in the topic: kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
Congrats! You have done it!
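
The contents of kafka_producer.py are not shown here. As a rough sketch of what such a producer could look like, assuming tweepy 4.4.0 and kafka-python 2.0.2 as installed above, the code below subclasses tweepy.Stream and forwards each tweet to the "tweeter" topic. The names encode_tweet, TweetForwarder, TRACK_TERMS, and main, the choice of JSON as the message format, and the attribute names read from config.py are all assumptions made for illustration.

```python
import json

TOPIC = "tweeter"                      # topic created in the steps above
BOOTSTRAP_SERVERS = "localhost:9092"   # local broker started above
TRACK_TERMS = ["kafka"]                # hypothetical keywords to follow


def encode_tweet(tweet_id, text):
    """Serialize the fields we keep into UTF-8 JSON bytes for Kafka."""
    payload = {"id": tweet_id, "text": text}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def main():
    # Third-party imports live here so encode_tweet can be used without them.
    import tweepy
    from kafka import KafkaProducer

    import config  # assumed to define the four credentials used below

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)

    class TweetForwarder(tweepy.Stream):
        def on_status(self, status):
            # Called by tweepy for every tweet received on the stream.
            producer.send(TOPIC, encode_tweet(status.id, status.text))

    stream = TweetForwarder(
        config.consumer_key,
        config.consumer_secret,
        config.access_token,
        config.access_token_secret,
    )
    try:
        stream.filter(track=TRACK_TERMS)  # blocks until interrupted
    finally:
        producer.close()  # flushes any buffered messages before exiting
```

Calling main() starts the stream. Keeping the serialization in a separate function makes it easy to change which tweet fields end up in the topic.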

So what's next?

You can use the generated data with Kafka Streams and Spark Streaming, and practice more!
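
Before moving on to Kafka Streams or Spark Streaming, it can help to read the topic from Python first. The following is a minimal sketch using kafka-python, assuming the messages are UTF-8 text that may or may not be JSON; decode_message and main are hypothetical names, not part of the project.

```python
import json

TOPIC = "tweeter"
BOOTSTRAP_SERVERS = "localhost:9092"


def decode_message(raw):
    """Decode one Kafka message value: parsed JSON if possible, else plain text."""
    text = raw.decode("utf-8", errors="replace")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def main():
    # Imported here so decode_message can be used without kafka-python.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BOOTSTRAP_SERVERS,
        # No group_id is set, so no offsets are committed and every run
        # starts from the earliest message, similar to --from-beginning.
        auto_offset_reset="earliest",
        value_deserializer=decode_message,
    )
    for message in consumer:  # blocks, waiting for new messages
        print(message.offset, message.value)
```

Calling main() prints each message with its offset; replacing the print call with your own logic is a simple way to experiment before wiring the topic into a streaming framework.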

Owner

Rustam Zokirov
