Studies and projects built with PySpark.

Last update: Nov 06, 2022

Overview

PySpark (Spark with Python)

PySpark is the Python API for Apache Spark. Its goal is to enable interactive data analysis in a distributed environment. It is especially useful for large volumes of data (big data), because Spark processes large datasets efficiently.

Documentation

Data

The data for this tutorial comes from Kaggle. The dataset is small, so in theory using PySpark here is overkill ("using a cannon to kill a fly"). Since the goal is to explore the main functions, though, it serves us well.

To download this dataset you need a Kaggle account.

Topics

We will explore the main functions:

Count
Describe
Select
OrderBy
WithColumnRenamed
WithColumn
When
Drop
Filter
Where
GroupBy

Requirements

You will need Python 3 and pip. It is highly recommended to use a virtual environment, created with virtualenv or conda, and to install the project's dependencies from the requirements.txt file:

Conda

$ conda create --name nameenv python
$ conda activate nameenv
$ pip install -r requirements.txt

Virtualenv

$ pip3 install virtualenv
$ virtualenv venv -p python3
$ source venv/bin/activate
$ pip install -r requirements.txt

Note

To run PySpark, you also need Java installed.
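
A quick way to check for Java before starting a Spark session is sketched below, using only the Python standard library. The helper name find_java is hypothetical. The check only looks for a java executable on PATH; Spark can also locate Java through the JAVA_HOME environment variable, which this sketch does not inspect.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("Java not found on PATH; install a JDK before starting PySpark.")
else:
    # `java -version` usually writes its banner to stderr rather than stdout.
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    print(result.stderr.strip() or result.stdout.strip())
```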

Owner

Karinne Cristina
