PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the documentation for the vanilla Python bindings at:

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, "resolution": 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+
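
As with any PySpark UDF, the resolution can also be supplied as a literal rather than a dataframe column. A minimal sketch, assuming geo_to_h3 accepts Column arguments like a standard UDF (the h3_9_lit column name is purely illustrative):

>>> df = df.withColumn('h3_9_lit', h3_pyspark.geo_to_h3('lat', 'lng', F.lit(9)))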

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*
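
Step 1 means editing the version field in setup.cfg. A minimal sketch of the relevant block, assuming a standard setuptools [metadata] layout (the values shown are illustrative, not copied from the repo):

[metadata]
name = h3-pyspark
# bump this value before building
version = 1.2.7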
Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following Spark dataframe, where the column of H3 indices is created by applying the lat/lng pairs and the resolution to the h3_pyspark.geo_to_h3(lat, lng, resolution) function. However, I encountered the following error when I tried to check whether there are any nulls in the index column. It's not just isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

    (screenshots of the dataframe and the resulting errors omitted)

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings now say "superset", but at least it should be complete.

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Currently, the behavior of all UDFs is that if any row in your dataframe contains a null value, the entire build fails.

    This type of behavior would be better / more resilient (a generic sketch of such a null guard appears after this comments list):

    from pyspark.sql import functions as F, types as T

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        # Skip the null row instead of failing the entire build
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)
    
    opened by kevinschaich 1
  • Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Previous vs. new behavior for the problematic line and polygon: (before/after screenshots omitted)

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Reported by @rwaldman: in the worst case, we can miss several hexes when a line's start and end points run east-to-west near the north or south edge of the hexes (diagram omitted).

    The proposed solution, for long line segments (length ≥ s, where s is the hex side length), is to interpolate several points along the line, based on the selected resolution, so that we catch the hexes in between (diagram omitted).
    opened by kevinschaich 0
  • polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON string is provided. This is expected behavior when using the native H3 library, which accepts only single polygons.

    However, I thought it would be helpful if this library were able to accept multipolygons. Could I get permission to push a PR?

    Proposed implementation in src/h3_pyspark/__init__.py:

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from the default;
        # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    Test in tests/test_core.py:

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
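
The null handling requested in "Better error handling when null values are passed in" above shipped in 1.2.4 as a decorator; the @handle_nulls wrapper is visible in the polyfill snippet. A minimal sketch of how such a decorator could be written (an illustration, not the library's actual implementation):

from functools import wraps

def handle_nulls(func):
    # Illustrative sketch: return null for the row when any input is null,
    # instead of raising and failing the entire build.
    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper

Stacked under @F.udf, as in the polyfill snippet above, this makes a UDF return null for rows with missing inputs rather than aborting the whole job.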
Releases (1.2.6)
  • 1.2.6 (Mar 10, 2022)

  • 1.2.4 (Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

  • 1.2.3 (Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

  • 1.1.0 (Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.