PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding documentation at:

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*
Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following Spark dataframe, in which the column of H3 indices is created by passing the lat/lng pairs and the resolution to the h3_pyspark.geo_to_h3(lat, lng, resolution) function. However, I hit the following error when I tried to check whether the index column contains any nulls. It is not only isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

    dataframe: image

    errors: image

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings now say "superset", but at least it should be complete.

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire build will fail.

    This type of behavior would be more resilient:

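    from pyspark.sql import functions as F, types as T

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        # Return null for a null geometry instead of failing the whole build
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)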
    
    opened by kevinschaich 1
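
The same guard could be applied to every UDF through a small decorator. The sketch below is plain Python and is only an illustration of the idea, not the library's actual implementation: the decorator body is an assumption, and `describe_cell` is a hypothetical stand-in for a real H3 call.

```python
import functools


def handle_nulls(func):
    """Return None instead of calling func when any argument is None."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        has_null = any(a is None for a in args) or any(
            v is None for v in kwargs.values()
        )
        if has_null:
            # Propagate the null instead of raising inside the wrapped function
            return None
        return func(*args, **kwargs)

    return wrapper


@handle_nulls
def describe_cell(lat, lng, resolution):
    # Hypothetical stand-in for an H3 call that would fail on None inputs
    return f"{lat:.3f},{lng:.3f}@{resolution}"
```

With this in place, a row with a missing coordinate yields a null result for that row rather than an exception that aborts the job.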
  • Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Screenshots (not preserved) showed the previous and new behavior for a problematic line and for a problematic polygon.

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Reported by @rwaldman: in the worst case we can miss several hexes if a line's start and end points run east-to-west and sit towards the north or south edge:

    image

    The proposed solution, for long line segments (≥ s, where s = hex side length), is to interpolate several points along the line based on the selected resolution, so that we catch the hexes in between:

    image
    opened by kevinschaich 0
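
The interpolation idea from this issue can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the fix that was merged: `densify_segment` is a hypothetical name, coordinates are treated as planar (straight-line rather than great-circle interpolation), and `max_spacing` stands in for the hex side length s at the chosen resolution, expressed in the same units as the coordinates.

```python
import math


def densify_segment(start, end, max_spacing):
    """Return points from start to end (inclusive), at most max_spacing apart."""
    if max_spacing <= 0:
        raise ValueError("max_spacing must be positive")
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    # Number of equal sub-segments so that none is longer than max_spacing
    steps = max(1, math.ceil(length / max_spacing))
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        for i in range(steps + 1)
    ]
```

Each returned point would then be indexed at the target resolution and the resulting cells de-duplicated, so a long segment contributes the hexes between its endpoints as well as the two at its ends.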
  • polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided. This is expected behavior when using the native h3 library.

    However, I thought it would be helpful if this library could accept MultiPolygons. Could I get permission to push a PR?

    implementation in src/h3_pyspark/__init__.py

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from default
        # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    test in tests/test_core.py

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
Releases (1.2.6)
  • 1.2.6 (Mar 10, 2022)

  • 1.2.4 (Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

  • 1.2.3 (Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

  • 1.1.0 (Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.