Delta Sharing: An Open Protocol for Secure Data Sharing

Last update: Jan 02, 2023

Overview

Delta Sharing is an open protocol for secure real-time exchange of large datasets, which enables organizations to share data in real time regardless of which computing platforms they use. It is a simple REST protocol that securely shares access to part of a cloud dataset and leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer data.

With Delta Sharing, a user accessing shared data can directly connect to it through pandas, Tableau, Apache Spark, Rust, or other systems that support the open protocol, without having to deploy a specific compute platform first. Data providers can share a dataset once to reach a broad range of consumers, while consumers can begin using the data in minutes.

This repo includes the following components:

Delta Sharing protocol specification.
Python Connector: A Python library that implements the Delta Sharing Protocol to read shared tables as pandas DataFrames or Apache Spark DataFrames.
Apache Spark Connector: An Apache Spark connector that implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. The tables can then be accessed in SQL, Python, Java, Scala, or R.
Delta Sharing Server: A reference implementation server for the Delta Sharing Protocol for development purposes. Users can deploy this server to share existing tables in Delta Lake and Apache Parquet format on modern cloud storage systems.

Python Connector

The Delta Sharing Python Connector is a Python library that implements the Delta Sharing Protocol to read tables from a Delta Sharing Server. You can load shared tables as a pandas DataFrame, or as an Apache Spark DataFrame if running in PySpark with the Apache Spark Connector installed.

System Requirements

Python 3.6+

Installation

pip install delta-sharing

If you are using Databricks Runtime, you can follow Databricks Libraries doc to install the library on your clusters.

Accessing Shared Data

The connector accesses shared tables based on profile files, which are JSON files containing a user's credentials to access a Delta Sharing Server. We have several ways to get started:

Download the profile file to access an open, example Delta Sharing Server that we're hosting here. You can try the connectors with this sample data.
Start your own Delta Sharing Server and create your own profile file following profile file format to connect to this server.
Download a profile file from your data provider.
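As a rough illustration of what a profile file might look like, the sketch below writes one with Python's standard library. The field names (shareCredentialsVersion, endpoint, bearerToken) are an assumption based on the profile file format described in the protocol specification and are not spelled out in this README. The endpoint and token values are placeholders, and the helper name write_profile is hypothetical. Check the profile file format documentation before relying on this.

```python
import json
import os
import tempfile


def write_profile(path, endpoint, bearer_token):
    """Write a JSON profile file (field names are assumed, see the note above)."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(profile, f, indent=2)
    return path


# Hypothetical endpoint and token, for illustration only.
profile_path = os.path.join(tempfile.mkdtemp(), "example.share")
write_profile(profile_path, "https://sharing.example.com/delta-sharing/", "<token>")
```

The resulting path can then be passed wherever the connector expects a profile file.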

Quick Start

After you save the profile file, you can use it in the connector to access shared tables.

import delta_sharing

# Point to the profile file. It can be a file on the local file system or a file on remote storage.
profile_file = "<profile-file-path>"
          
# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
client.list_all_tables()

# Create a url to access a shared table.
# A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
table_url = profile_file + "#<share-name>.<schema-name>.<table-name>"
            
# Fetch 10 rows from a table and convert it to a Pandas DataFrame. This can be used to read sample data from a table that cannot fit in the memory.
delta_sharing.load_as_pandas(table_url, limit=10)

# Load a table as a Pandas DataFrame. This can be used to process tables that can fit in the memory.
delta_sharing.load_as_pandas(table_url)

# If the code is running with PySpark, you can use `load_as_spark` to load the table as a Spark DataFrame.
delta_sharing.load_as_spark(table_url)

You can try this by running our examples with the open, example Delta Sharing Server.

Details on Profile Paths

The profile file path for SharingClient and load_as_pandas can be any URL supported by FSSPEC (such as s3a://my_bucket/my/profile/file). If you are using Databricks File System, you can also preface the path with /dbfs/ to access the profile file as if it were a local file.
The profile file path for load_as_spark can be any URL supported by Hadoop FileSystem (such as s3a://my_bucket/my/profile/file).
A table path is the profile file path followed by # and the fully qualified name of a table (<share-name>.<schema-name>.<table-name>).
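The table path rule can be sketched as a pair of small helpers. This is only an illustration of the string format described above: make_table_url and split_table_url are hypothetical names, not part of the delta_sharing package, and the sketch assumes that share, schema, and table names contain no dots.

```python
def make_table_url(profile_path, share, schema, table):
    """Join a profile file path and a fully qualified table name with '#'."""
    return f"{profile_path}#{share}.{schema}.{table}"


def split_table_url(table_url):
    """Split a table URL back into (profile_path, share, schema, table)."""
    # Split on the last '#', so the profile path itself may contain dots.
    profile_path, sep, qualified_name = table_url.rpartition("#")
    if not sep or not profile_path:
        raise ValueError(f"missing profile path or '#': {table_url!r}")
    parts = qualified_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected share.schema.table, got {qualified_name!r}")
    return profile_path, parts[0], parts[1], parts[2]


url = make_table_url("s3a://my_bucket/my/profile/file", "share1", "default", "table1")
```

Validating the format up front can give a clearer error than a failed request against the server.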

Apache Spark Connector

The Apache Spark Connector implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. It can be used in SQL, Python, Java, Scala and R.

System Requirements

Java 8+
Scala 2.12.x
Apache Spark 3+ or Databricks Runtime 7+

Accessing Shared Data

The connector loads user credentials from profile files. Please see Download the share profile file to download a profile file for our example server or for your own data sharing server.

Configuring Apache Spark

You can set up Apache Spark to load the Delta Sharing connector in the following two ways:

Run interactively: Start the Spark shell (Scala or Python) with the Delta Sharing connector and run the code snippets interactively in the shell.
Run as a project: Set up a Maven or SBT project (Scala or Java) with the Delta Sharing connector, copy the code snippets into a source file, and run the project.

If you are using Databricks Runtime, you can skip this section and follow Databricks Libraries doc to install the connector on your clusters.

Set up an interactive shell

To use the Delta Sharing connector interactively within Spark's Scala or Python shell, launch the shell as follows.

PySpark shell

pyspark --packages io.delta:delta-sharing-spark_2.12:0.2.0

Scala Shell

bin/spark-shell --packages io.delta:delta-sharing-spark_2.12:0.2.0

Set up a standalone project

If you want to build a Java/Scala project using Delta Sharing connector from Maven Central Repository, you can use the following Maven coordinates.

Maven

Include the Delta Sharing connector in your Maven project by adding it as a dependency in your POM file. The connector is compiled with Scala 2.12.

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-sharing-spark_2.12</artifactId>
  <version>0.2.0</version>
</dependency>

SBT

You include Delta Sharing connector in your SBT project by adding the following line to your build.sbt file:

libraryDependencies += "io.delta" %% "delta-sharing-spark" % "0.2.0"

Quick Start

After you save the profile file and launch Spark with the connector library, you can access shared tables using any language.

SQL

-- A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
CREATE TABLE mytable USING deltaSharing LOCATION '<profile-file-path>#<share-name>.<schema-name>.<table-name>';
SELECT * FROM mytable;

Python

# A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
table_path = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
df = spark.read.format("deltaSharing").load(table_path)

Scala

// A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
val df = spark.read.format("deltaSharing").load(tablePath)

Java

// A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
String tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>";
Dataset<Row> df = spark.read.format("deltaSharing").load(tablePath);

R

# A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
table_path <- "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
df <- read.df(table_path, "deltaSharing")

You can try this by running our examples with the open, example Delta Sharing Server.

Table paths

A profile file path can be any URL supported by Hadoop FileSystem (such as s3a://my_bucket/my/profile/file).
A table path is the profile file path followed by # and the fully qualified name of a table (<share-name>.<schema-name>.<table-name>).

Delta Sharing Reference Server

The Delta Sharing Reference Server is a reference implementation server for the Delta Sharing Protocol. It can be used to set up a small service to test your own connector that implements the Delta Sharing Protocol. Please note that this is not a complete implementation of a secure web server. We highly recommend putting it behind a secure proxy if you want to expose it to the public.

Some vendors offer managed services for Delta Sharing too (for example, Databricks). Please refer to your vendor's website for how to set up sharing there. Vendors that are interested in being listed as a service provider should open an issue on GitHub to be added to this README and our project's website.

Here are the steps to set up the reference server to share your own data.

Get the pre-built package

Download the pre-built package delta-sharing-server-x.y.z.zip from GitHub Releases.

Server configuration and adding Shared Data

Unpack the pre-built package and copy the server config template file conf/delta-sharing-server.yaml.template to create your own server yaml file, such as conf/delta-sharing-server.yaml.
Make changes to your yaml file. You may also need to update some server configs for special requirements.
To add shared data, add references to the Delta Lake tables you would like to share from this server to this config file.
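To give a feel for what adding shared tables involves, the sketch below renders a small server config as text. The key names (version, shares, schemas, tables, name, location, host, port, endpoint) mirror a user's config quoted in the comments further down this page; the template file conf/delta-sharing-server.yaml.template remains the authoritative reference. The function render_server_config and the s3a location are hypothetical.

```python
def render_server_config(shares, host, port, endpoint):
    """Render a server config as YAML text.

    `shares` maps share name -> schema name -> table name -> table location.
    """
    lines = ["version: 1", "shares:"]
    for share_name, schemas in shares.items():
        lines.append(f'- name: "{share_name}"')
        lines.append("  schemas:")
        for schema_name, tables in schemas.items():
            lines.append(f'  - name: "{schema_name}"')
            lines.append("    tables:")
            for table_name, location in tables.items():
                lines.append(f'    - name: "{table_name}"')
                lines.append(f'      location: "{location}"')
    lines.append(f'host: "{host}"')
    lines.append(f"port: {port}")
    lines.append(f'endpoint: "{endpoint}"')
    return "\n".join(lines) + "\n"


# Hypothetical share layout and table location, for illustration only.
config_text = render_server_config(
    {"share1": {"default": {"test_facilities": "s3a://my_bucket/test_facilities"}}},
    host="localhost",
    port=9999,
    endpoint="/delta-sharing",
)
print(config_text)
```

Each table entry pairs a name that recipients see with the storage location of the underlying Delta Lake table.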

Config the server to access tables on cloud storage

We support sharing Delta Lake tables on S3, Azure Blob Storage and Azure Data Lake Storage Gen2.

S3

There are multiple ways to configure the server to access S3.

EC2 IAM Metadata Authentication (Recommended)

Applications running in EC2 may associate an IAM role with the VM and query the EC2 Instance Metadata Service for credentials to access S3.

Authenticating via the AWS Environment Variables

We support configuration via the standard AWS environment variables. The core environment variables are for the access key and associated secret:

export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
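Before starting the server, it can be useful to confirm that both variables are actually set in the environment the server process will inherit. The following is a minimal sketch of such a check; the helper name missing_aws_vars is hypothetical and not part of Delta Sharing.

```python
import os

# The two core variables named above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing AWS environment variables: " + ", ".join(missing))
```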

Other S3 authentication methods

The server uses hadoop-aws to read S3. You can find other approaches in the hadoop-aws documentation.

Azure Blob Storage

The server is using hadoop-azure to read Azure Blob Storage. Using Azure Blob Storage requires configuration of credentials. You can create a Hadoop configuration file named core-site.xml and add it to the server's conf directory. Then add the following content to the xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.azure.account.key.YOUR-ACCOUNT-NAME.blob.core.windows.net</name>
    <value>YOUR-ACCOUNT-KEY</value>
  </property>
</configuration>

YOUR-ACCOUNT-NAME is your Azure storage account and YOUR-ACCOUNT-KEY is your account key.

Azure Data Lake Storage Gen2

The server is using hadoop-azure to read Azure Data Lake Storage Gen2. We support the Shared Key authentication. You can create a Hadoop configuration file named core-site.xml and add it to the server's conf directory. Then add the following content to the xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.azure.account.auth.type.YOUR-ACCOUNT-NAME.dfs.core.windows.net</name>
    <value>SharedKey</value>
    <description>
    </description>
  </property>
  <property>
    <name>fs.azure.account.key.YOUR-ACCOUNT-NAME.dfs.core.windows.net</name>
    <value>YOUR-ACCOUNT-KEY</value>
    <description>
    The secret password. Never share these.
    </description>
  </property>
</configuration>

YOUR-ACCOUNT-NAME is your Azure storage account and YOUR-ACCOUNT-KEY is your account key.

Support for more cloud storage systems will be added in the future.

Authorization

The server supports basic authorization with a pre-configured bearer token. You can add the following config to your server yaml file:

authorization:
  bearerToken: <token>

Every request must then be sent with the above token; otherwise, the server will refuse the request.

If you don't configure the bearer token in the server yaml file, all requests will be accepted without authorization.

For better security, we recommend that you put the server behind a secure proxy such as NGINX to set up JWT authentication.
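On the client side, sending the token usually means attaching it as an HTTP Authorization header. The sketch below builds, but does not send, such a request with Python's standard library. It assumes the conventional "Bearer" scheme and a /shares path under the server endpoint. The host, port, and token are placeholders, and build_request is a hypothetical helper.

```python
import urllib.request


def build_request(endpoint, path, bearer_token=None):
    """Build a GET request for `path` under `endpoint`, adding the token if given."""
    url = endpoint.rstrip("/") + path
    request = urllib.request.Request(url, method="GET")
    if bearer_token:
        # Assumed header format: "Authorization: Bearer <token>".
        request.add_header("Authorization", f"Bearer {bearer_token}")
    return request


# Placeholder endpoint and token, for illustration only.
req = build_request("http://localhost:8080/delta-sharing", "/shares", "<token>")
print(req.full_url, req.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen would actually contact the server; the connectors handle this for you when they read the token from the profile file.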

Start the server

Run the following shell command:

bin/delta-sharing-server -- --config <the-server-config-yaml-file>

<the-server-config-yaml-file> should be the path of the yaml file you created in the previous step. You can find options to configure the JVM in the sbt-native-packager documentation.

Use the pre-built Docker image

You can use the pre-built Docker image from https://hub.docker.com/r/deltaio/delta-sharing-server by running the following command:

docker run -p <host-port>:<container-port> --mount type=bind,source=<the-server-config-yaml-file>,target=/config/delta-sharing-server-config.yaml deltaio/delta-sharing-server:0.2.0 -- --config /config/delta-sharing-server-config.yaml

Note that <container-port> should be the same as the port defined inside the config file.
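If you launch the container from a script, assembling the command as an argument list avoids shell quoting problems. The sketch below only builds and prints the command. The flags, image name, and target path are taken from the command in this section, while docker_run_command, the port numbers, and the config path are hypothetical.

```python
import shlex

# Path inside the container where the config file is mounted.
CONFIG_TARGET = "/config/delta-sharing-server-config.yaml"


def docker_run_command(host_port, container_port, config_file,
                       image="deltaio/delta-sharing-server:0.2.0"):
    """Return the docker run invocation as a list of arguments."""
    return [
        "docker", "run",
        "-p", f"{host_port}:{container_port}",
        "--mount", f"type=bind,source={config_file},target={CONFIG_TARGET}",
        image,
        "--", "--config", CONFIG_TARGET,
    ]


# Hypothetical ports and config path, for illustration only.
cmd = docker_run_command(8080, 8080, "/path/to/delta-sharing-server.yaml")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to start the container.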

API Compatibility

The REST APIs provided by Delta Sharing Server are stable public APIs. They are defined by Delta Sharing Protocol and we will follow the entire protocol strictly.

The interfaces inside Delta Sharing Server are not public APIs. They are considered internal, and they are subject to change across minor/patch releases.

Delta Sharing Protocol

The Delta Sharing Protocol specification details the protocol.

Building this Project

Python Connector

To execute tests, run

python/dev/pytest

To install in develop mode, run

cd python/
pip install -e .

To install locally, run

cd python/
pip install .

To generate a wheel file, run

cd python/
python setup.py sdist bdist_wheel

It will generate python/dist/delta_sharing-x.y.z-py3-none-any.whl.

Apache Spark Connector and Delta Sharing Server

Apache Spark Connector and Delta Sharing Server are compiled using SBT.

To compile, run

build/sbt compile

To execute tests, run

build/sbt test

To generate the Apache Spark Connector, run

build/sbt spark/package

It will generate spark/target/scala-2.12/delta-sharing-spark_2.12-x.y.z.jar.

To generate the pre-built Delta Sharing Server package, run

build/sbt server/universal:packageBin

It will generate server/target/universal/delta-sharing-server-x.y.z.zip.

To build the Docker image for Delta Sharing Server, run

build/sbt server/docker:publishLocal

This will build a Docker image tagged delta-sharing-server:x.y.z, which you can run with:

docker run -p <host-port>:<container-port> --mount type=bind,source=<the-server-config-yaml-file>,target=/config/delta-sharing-server-config.yaml delta-sharing-server:x.y.z -- --config /config/delta-sharing-server-config.yaml

Note that <container-port> should be the same as the port defined inside the config file.

Refer to SBT docs for more commands.

Reporting Issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.

Contributing

We welcome contributions to Delta Sharing. See our CONTRIBUTING.md for more details.

We also adhere to the Delta Lake Code of Conduct.

License

Apache License 2.0.

Community

We use the same community resources as the Delta Lake project:

Public Slack Channel
- Register here
- Login here
Public Mailing list

Comments

Errors trying the delta-sharing access with pandas and spark

Hello team,

I'm trying to use your delta-sharing library with python for both PANDAS and SPARK

SPARK METHOD

As you can see from the below screenshot I can access the columns or schema using the load_as_spark() method

However, when I try .show() I see the following Java certificate errors

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

I've tried to use different Java versions and also added the certificate of the delta-sharing server hosted on Kubernetes, but nothing seems to be working. If I try this from a linux machine or kubernetes I seem to have the same problem

PANDAS METHOD

When I use the load_as_pandas() method I get a FileNotFoundError. The strange thing is that when I click on that s3 url link that you can see from the screenshot, I download the parquet file (therefore the link does work). Is the PyArrow library trying to look for a local file or what do you think the issue may be ?

Any ideas about the above 2 errors using spark and pandas ?

Thank you very much,

Peter

opened by pknowles-9 19
_delta_log and EC2 instance

Hi,

I am trying to run a Delta Sharing server locally from my Mac terminal. Is that possible, or do I have to run it on an EC2 instance instead?

Also, I am trying to fetch data from S3. I created a bucket and uploaded a file to it. Do I have to create an empty _delta_log myself, or will it be generated on its own?

Thanks

opened by skaplan81 15
fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties must be present

Even after having IAM role access to S3 and specifying AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY. I am having this error in server logs. As it uses hadoop-aws to read files, how do we authenticate and pass credentials in server?

Before running the below command, where should I specify fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties? bin/delta-sharing-server -- --config conf/delta-sharing-server.yaml

opened by Aayushpatel007 15
Connection Issue

Whenever i am trying to read the delta table from s3 using load_as_pandas function i am getting a connection issue in ec2 instance. Following is the issue: Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn (self._dns_host, self.port), self.timeout, **extra_kw File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection raise err File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen chunked=chunked, File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request conn.request(method, url, **httplib_request_kw) File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request super(HTTPConnection, self).request(method, url, body=body, headers=headers) File "/usr/lib64/python3.7/http/client.py", line 1277, in request self._send_request(method, url, body, headers, encode_chunked) File "/usr/lib64/python3.7/http/client.py", line 1323, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File "/usr/lib64/python3.7/http/client.py", line 1272, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib64/python3.7/http/client.py", line 1032, in _send_output self.send(msg) File "/usr/lib64/python3.7/http/client.py", line 972, in send self.connect() File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect conn = self._new_conn() File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn self, "Failed to establish a new connection: %s" % e urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb6514c1c10>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send timeout=timeout File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=5044): Max retries exceeded with url: /delta-sharing/test/shares/share1/schemas/schema1/tables/table1/query (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb6514c1c10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "", line 1, in File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/delta_sharing.py", line 61, in load_as_pandas rest_client=DataSharingRestClient(profile), File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/reader.py", line 62, in to_pandas self._table, predicateHints=self._predicateHints, limitHint=self._limitHint File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 84, in func_with_retry raise e File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 77, in func_with_retry return func(self, *arg, **kwargs) File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 182, in list_files_in_table f"/shares/{table.share}/schemas/{table.schema}/tables/{table.name}/query", data=data, File "/usr/lib64/python3.7/contextlib.py", line 112, in enter return next(self.gen) File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 204, in _request_internal response = request(f"{self._profile.endpoint}{target}", json=data) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 590, in post return self.request('POST', url, data=data, json=json, **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 542, in request resp = self.send(prep, **send_kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 655, in send r = adapter.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=5044): Max retries exceeded with url: /delta-sharing/test/shares/share1/schemas/schema1/tables/table1/query (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb6514c1c10>: Failed to establish a new connection: [Errno 111] 
Connection refused'))

opened by nish4528 10
Add possibility to set endpoint-url for s3

Unfortunately, default AWS variables doesn't support endpoint url for compatible s3 storages, only through CLI "--endpoint-url". Any chances it could be implemented?
enhancement

opened by alarex 9
Support new parameter includeHistoricalMetadata for queryTableChange RPC
A couple changes:

Support new parameter includeHistoricalMetadata for queryTableChanges.

Update the way SparkStructuredStreaming is put in user agent header.

Added two more tests on service side to verify additional metadata is only returned for queryTableChanges from spark streaming.

Update to real id in DeltaSharingRestClientSuite.scala
opened by linzhou-db 7
limit feature returning an empty dataframe
The delta_sharing limit feature doesn't seem to work properly. It returns an empty dataframe.

table = delta_sharing.load_as_pandas(table_url, limit=10)

Empty DataFrame Columns: [a, b, c, ...] Index: []
opened by YannOrieult-EngieDigital 7

Delta sharing container on AWS ecs getting access denied error even with all s3 permissions there

Delta sharing container on ecs getting access denied error even with all Iam s3 and kms permission there for the bucket on the ecs service .

Error

io.delta.sharing.server.DeltaInternalException: java.util.concurrent.ExecutionException: java.nio.file.AccessDeniedException:  s3a://foo-lake/foo/foo_fact/_delta_log: getFileStatus on s3a://foo-lake/foo/foo_fact/_delta_log: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C0DZXXXYNWKQ9CWD; S3 
Extended Request ID: eDvJMbR8UtIRDg8nXD7+0ix04VN8UPsVSEJDIBosFC5u/YJPsnAGpm/hvGdrXteQBpeQNu5DW9Q=), S3 Extended Request ID: eDvJMbR8UtIRDg8nXD7+0ix04VN8UPsVSEJDIBosFC5u/YJPsnAGpm/hvGdrXteQBpeQNu5DW9Q=
--
@timestamp | 1635392335291

Also


(s3a://foo-lake/foo/foo_fact/_delta_log/_delta_log/_last_checkpoint is corrupted.
 Will search the checkpoint files directly,java.nio.file.AccessDeniedException: s3a://foo-lake/foo/foo_fact/_delta_log/_last_checkpoint: getFileStatus on 
s3a://wfg1stg-datahub-lake/ovdm/tran_fact/_delta_log/_last_checkpoint: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C0DM6WHHRX0RR7DG; S3 Extended Request ID: yEdeFmnz49jLTXu+LzqoNZqFy0sK8X3ge0p7Gmp5ia9FVFjWN7/HLLZ5sWatqfi6cDH0ZRGGf9s=), S3 Extended Request ID: yEdeFmnz49jLTXu+LzqoNZqFy0sK8X3ge0p7Gmp5ia9FVFjWN7/HLLZ5sWatqfi6cDH0ZRGGf9s=)
--

Shouldn't delta-server already use EC2ContainerCredentialsProviderWrapper or if there is a way to still configure this

opened by gauravbrills 7

OSS Delta Sharing Server: Adds api to accept cdf query

@Get("/shares/{share}/schemas/{schema}/tables/{table}/changes")

Parse url parameters and construct the cdfoptions map

Add classes DeltaErrors and DeltaDataSource for some exceptions and constants

Add DeltaSharedTable.queryCDF and return not implemented exception
opened by linzhou-db 6
Doc on how to create a "share"?

If I'm self hosting the server, getting it to read S3, how do I create a share? I've checked https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md - which focuses on list share, query table etc.

opened by felixsafegraph 6
Delta Share Fails When Attempting to Read Delta Table
import delta_sharing

table_url = "/Users/user1/Applications/open-datasets.share#share1.default.test_facilities"

pandas_df = delta_sharing.load_as_pandas(table_url)

pandas_df.head(10)

My config.yaml:

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "share1"
  schemas:
  - name: "default"
    tables:
    - name: "test_facilities"
      location: "/tmp/test_facilities"
host: "localhost"
port: 9999
endpoint: "/delta-sharing"

I keep getting the following error when I run the above program. The delta table is there

HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:9999/delta-sharing/shares/share1/schemas/default/tables/test_facilities/query Response from server: {'errorCode': 'INTERNAL_ERROR', 'message': ''}

Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: File system class org.apache.hadoop.fs.LocalFileSystem is not supported at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2055) at com.google.common.cache.LocalCache.get(LocalCache.java:3966) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863) at io.delta.standalone.internal.DeltaSharedTableLoader.loadTable(DeltaSharedTableLoader.scala:54) at io.delta.sharing.server.DeltaSharingService.$anonfun$listFiles$1(DeltaSharingService.scala:282) at io.delta.sharing.server.DeltaSharingService.processRequest(DeltaSharingService.scala:169) ... 60 more Caused by: java.lang.IllegalStateException: File system class org.apache.hadoop.fs.LocalFileSystem is not supported at io.delta.standalone.internal.DeltaSharedTable.$anonfun$fileSigner$1(DeltaSharedTableLoader.scala:97) at io.delta.standalone.internal.DeltaSharedTable.withClassLoader(DeltaSharedTableLoader.scala:109) at io.delta.standalone.internal.DeltaSharedTable.(DeltaSharedTableLoader.scala:84) at io.delta.standalone.internal.DeltaSharedTableLoader.$anonfun$loadTable$1(DeltaSharedTableLoader.scala:58) at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049) ... 65 more
opened by dtgdev 5
Rest client in Java - access data without spark session

I am trying to create rest client (receiver) in Java. Is there a possibility to load data without using spark session and eventually be able to filter that data, similar to load as pandas in python? I tried delta standalone reader, however, it does not provide data filtering capabilities.

opened by kaundinyaekta 0
Clarification regarding predicate pushdown

Does this library take into account the stats field present in the transaction log of a particular table version to filter the parquet files that are supposed to be read while instantiating a DeltaTable?

For example, I have a table with total 100 records across 2 table versions and each version has 5 parquet files associated with it covering 50 records of that version (assuming even 10 records per parquet file). The transaction log for both versions contains a stats field with information like minValues, maxValues, nullCount etc.

I have already verified that if I try to read first version (0), the DeltaTable object will read only 1 files and for second version (1) it will read both the files. This means that the number of parquet files are affected by the version already.

Just needed a clarification whether the stats fields are also used anywhere for selection of files to be read?

opened by chitralverma 0
Add load_as methods for pyarrow dataset and table
Adds separate implementations for load_as_pyarrow_table and load_as_pyarrow_dataset that allows users to read delta sharing tables as pyarrow table and dataset respectively.

[x] Add basic implementation

[x] Fix lint

[x] Refactor common code

[x] Verify performance with and without limit

[x] Add tests - converter

[x] Add tests - reader

[ ] Add tests - delta_sharing

[x] Add examples

closes https://github.com/delta-io/delta-sharing/issues/238
opened by chitralverma 2
While accessing data on the recipient side using delta_sharing.load_table_changes_as_spark(), data for all versions is returned.

When I try to access a specific version by setting starting_version and ending_version to that version number, I get the data for all versions.

data1 = delta_sharing.load_table_changes_as_spark(table_url, starting_version=1, ending_version=1)

data2 = delta_sharing.load_table_changes_as_spark(table_url, starting_version=2, ending_version=2)

Here data1 and data2 contain the same data. When I check the same versions using load_table_changes_as_pandas(), I get only the data for the specified version.

data1 = delta_sharing.load_table_changes_as_pandas(table_url, starting_version=1, ending_version=1)

data2 = delta_sharing.load_table_changes_as_pandas(table_url, starting_version=2, ending_version=2)

In the pandas scenario, data1 has version 1 data and data2 has version 2 data. The two differ, as expected.

What do we have to do to get data for a specific version in a Spark DataFrame using the load_table_changes_as_spark function?
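
Until the root cause is known, one possible workaround is to filter the returned change rows explicitly. The sketch below assumes the change feed DataFrame exposes a `_commit_version` column; the helper name `restrict_to_versions` is hypothetical. It relies only on `DataFrame.filter` accepting a SQL expression string, and it does not explain why the Spark path returns every version.

```python
def restrict_to_versions(changes_df, starting_version, ending_version):
    """Keep only change rows committed in [starting_version, ending_version].

    `changes_df` is expected to be a Spark DataFrame of table changes that
    carries a `_commit_version` column (an assumption about the output).
    """
    if starting_version > ending_version:
        raise ValueError("starting_version must be <= ending_version")
    condition = (
        f"_commit_version >= {int(starting_version)} "
        f"AND _commit_version <= {int(ending_version)}"
    )
    return changes_df.filter(condition)
```

For example, `restrict_to_versions(data1, 1, 1)` would keep only the rows committed in version 1.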

opened by MaheshChahare123 0
Support for load_as_pyarrow_dataset or load_as_pyarrow_table

This is a feature request, or rather a small refactoring of the reader code, to allow users to read shared tables directly as pyarrow datasets and tables.

The reader already creates the pyarrow dataset and table, which are then converted to a pandas DataFrame in the to_pandas method.

I would like to refactor this part and expose it as separate functions: to_pyarrow_dataset and to_pyarrow_table.

The advantage of this refactoring is that users can efficiently get the pyarrow objects directly, without an additional full copy or conversion to a pandas DataFrame when pandas is not needed. This would allow delta-sharing to be extended to other processing systems such as DataFusion and Polars, since they rely extensively on pyarrow datasets.

Please let me know if this issue makes sense to you; I can raise a PR for it in a day or so.

Note: the existing functionalities will remain unaffected by this refactoring.

opened by chitralverma 0

Releases(v0.5.3)

v0.5.3(Dec 21, 2022)
We’d like to announce the release of Delta Sharing 0.5.3, which introduces the following bug fixes.

Bug fixes:

Extends DeltaSharingProfileProvider to customize tablePath and refresher (#225)

Refresh pre-signed urls for cdf queries (#226)

Fix partitionFilters issue for cdf queries. (#230)

Fix comparison of the expiration time to current time for pre-signed urls.(#237)
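
As an illustration of the kind of check involved in the last fix, the sketch below decides whether a pre-signed URL should be refreshed by comparing its expiration to the current time. It is not the connector's code: the function name `needs_refresh`, the safety margin, and the choice of epoch milliseconds for both values are assumptions made for the example.

```python
import time


def needs_refresh(expiration_ms, now_ms=None, margin_ms=60_000):
    """True when a pre-signed URL expires within `margin_ms` of now.

    Both timestamps are epoch milliseconds, so the comparison stays in a
    single unit.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return expiration_ms - now_ms <= margin_ms
```

Keeping both sides of the comparison in one unit is the point of the sketch; a mix of seconds and milliseconds would make every URL look either long-lived or already expired.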

Credits: Abhijit Chakankar, Lin Zhou, William Chau
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.5.3.zip(380.31 MB)
delta-sharing-server-0.5.3.zip.asc(833 bytes)
delta-sharing-server-0.5.3.zip.asc.sha256(101 bytes)
delta-sharing-server-0.5.3.zip.sha256(97 bytes)
v0.6.2(Dec 20, 2022)
We’d like to announce the release of Delta Sharing 0.6.2, which introduces the following bug fix.

Bug fixes:

Fix comparison of the expiration time to current time for pre-signed urls.(#236)

Credits: Lin Zhou, William Chau
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.6.2.zip(385.80 MB)
delta-sharing-server-0.6.2.zip.asc(833 bytes)
delta-sharing-server-0.6.2.zip.asc.sha256(101 bytes)
delta-sharing-server-0.6.2.zip.sha256(97 bytes)
v0.6.1(Dec 20, 2022)
We’d like to announce the release of Delta Sharing 0.6.1, which introduces the following improvements and bug fixes.

Improvements:

Spark connector changes to consume size from metadata. (#228)

Improve delta sharing error messages(#235)

Bug fixes:

Extends DeltaSharingProfileProvider to customize tablePath and refresher (#223)

Refresh pre-signed urls for cdf and streaming queries (#221, #222)

Allow 0 for versionAsOf parameter, to be consistent with Delta (#224)

Fix partitionFilters issue: apply it to all file indices. (#227, #229)

Credits: Abhijit Chakankar, Lin Zhou
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.6.1.zip(385.80 MB)
delta-sharing-server-0.6.1.zip.asc(833 bytes)
delta-sharing-server-0.6.1.zip.asc.sha256(101 bytes)
delta-sharing-server-0.6.1.zip.sha256(97 bytes)
v0.6.0(Dec 2, 2022)
We are excited to announce the release of Delta Sharing 0.6.0, which introduces the following improvements.

Improvements:

Support using a delta sharing table as a source in spark structured streaming, which allows recipients to stay up to date with the shared data. (#189, #190, #194, #195, #198, #199, #200, #201, #204, #205, #207, #208, #209, #211, #212, #214, #216, #217, #218, #219)

Fix a few nits in the PROTOCOL documentation (#213)

Support timestampAsOf parameter in delta sharing data source. (#186, #187, #188)

Credits: Abhijit Chakankar, Lin Zhou, Xiaotong Sun
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.6.0.zip(385.80 MB)
delta-sharing-server-0.6.0.zip.asc(833 bytes)
delta-sharing-server-0.6.0.zip.asc.sha256(101 bytes)
delta-sharing-server-0.6.0.zip.sha256(97 bytes)
v0.5.2(Oct 10, 2022)
Delta Sharing 0.5.2 has a single change: it adds the ability to override the HTTP headers included in requests to the Delta Sharing server.

Add a Custom Http Header Provider (#192)

Credits: Xiaotong Sun
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.5.2.zip(380.31 MB)
delta-sharing-server-0.5.2.zip.asc(833 bytes)
delta-sharing-server-0.5.2.zip.asc.sha256(101 bytes)
delta-sharing-server-0.5.2.zip.sha256(97 bytes)
v0.5.1(Sep 8, 2022)
We are excited to announce the release of Delta Sharing 0.5.1, which introduces the following changes.

Improvements:

Upgrade AWS SDK to 1.12.189 (#170)

More tests on the error message when loading table fails (#164)

Add ability to configure armeria server request timeout (#163)

Documentation improvements (#171, #179)

Bug fixes:

Fix column selection bug on Delta Sharing CDF spark dataframe (#184)

Fix GCS path reading (#181)

Credits: Antonio Irizarry, Lin Zhou, Shixiong Zhu, Pat McCauley
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.5.1.zip(380.31 MB)
delta-sharing-server-0.5.1.zip.asc(833 bytes)
delta-sharing-server-0.5.1.zip.asc.sha256(101 bytes)
delta-sharing-server-0.5.1.zip.sha256(97 bytes)
v0.5.0(Aug 30, 2022)
We are excited to announce the release of Delta Sharing 0.5.0, which introduces the following improvements.

Improvements:

Support for Change Data Feed which allows clients to fetch incremental changes for the shared tables. (#135, #136, #137, #138, #140, #141, #142, #145, #146, #147, #148, #149, #150, #151, #152, #153, #155, #159)

Include response body in HTTPError exception in Python library (#124)

Improve the error message for the /share/schema/table APIs (#120)

Protocol and REST API documentation improvements (#121, #128, #131)

Add query_table_version to the rest client (#111)

Credits: Abhijit Chakankar, Alex Ott, Lin Zhou, Shixiong Zhu, William Chau, Xiaotong Sun, harksin, Kohei Toshimitsu, Vuong Nguyen
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.5.0.zip(222.53 MB)
delta-sharing-server-0.5.0.zip.asc(833 bytes)
delta-sharing-server-0.5.0.zip.asc.sha256(101 bytes)
delta-sharing-server-0.5.0.zip.sha256(97 bytes)
v0.4.0(Jan 14, 2022)
We are excited to announce the release of Delta Sharing 0.4.0, which introduces the following improvements and fixes.

Improvements:

Support Google Cloud Storage on Delta Sharing Server (#81, #105)

Add a new API to get the metadata of a Share (#97)

Protocol and REST API documentation enhancements (#85, #89, #93, #98)

Allow for customization of recipient profile in Apache Spark connector (#99, #107)

Bug fixes:

Block managed table creation for Delta Sharing to prevent user confusion (#92)

Credits: Denny Lee, Lin Zhou, Shixiong Zhu, William Chau, Xiaotong Sun, Kohei Toshimitsu
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.4.0.zip(219.01 MB)
delta-sharing-server-0.4.0.zip.asc(833 bytes)
delta-sharing-server-0.4.0.zip.asc.sha256(101 bytes)
delta-sharing-server-0.4.0.zip.sha256(97 bytes)
v0.3.0(Dec 1, 2021)
We are excited to announce the release of Delta Sharing 0.3.0, which introduces the following improvements and fixes issues:

Improvements:

Support Azure Blob Storage and Azure Data Lake Gen2 in Delta Sharing Server (#56, #59)

Apache Spark Connector can now send the limitHint parameter when a user query uses limit (#55)

load_as_pandas in Python Connector now accepts a limit parameter so that users can fetch only a few rows to explore the data (#76)

Apache Spark Connector will re-fetch pre-signed urls before they expire to support long running queries (#69)

Add a new API to list all tables in a share to save network round trips (#63, #66, #67, #88)

Add a User-Agent header to requests sent from the Apache Spark Connector and the Python Connector (#75)

Add an optional expirationTime field to Delta Sharing Profile File Format to provide the token expiration time (#77)
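
A client can use the optional expirationTime field to detect an expired token before sending any request. The sketch below is an illustration under stated assumptions: the helper names `parse_expiration` and `token_expired` are hypothetical, the other profile fields (`shareCredentialsVersion`, `endpoint`, `bearerToken`) follow the profile file format as I understand it, and the timestamp is assumed to be UTC ISO 8601 ending in `Z`.

```python
import json
from datetime import datetime, timezone


def parse_expiration(profile_json):
    """Return the profile's expirationTime as an aware UTC datetime,
    or None when the optional field is absent."""
    profile = json.loads(profile_json)
    raw = profile.get("expirationTime")
    if raw is None:
        return None
    # Assumes a UTC timestamp such as "2021-11-12T00:12:29.0Z";
    # fractional seconds are dropped.
    seconds_part = raw.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def token_expired(profile_json, now=None):
    """True when the profile declares an expiration that has passed."""
    expires = parse_expiration(profile_json)
    if expires is None:
        return False  # no expiration declared
    now = now or datetime.now(timezone.utc)
    return now >= expires
```

A profile without the field is treated as not expired, which matches the field being optional.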

Bug fixes:

Fix a corner case where list_all_tables may not return correct results in the Python Connector (#84)

Credits: Denny Lee, Felix Cheung, Lin Zhou, Matei Zaharia, Shixiong Zhu, Will Girten, Xiaotong Sun, Yuhong Chen, kohei-tosshy, William Chau
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.3.0.zip(202.59 MB)
delta-sharing-server-0.3.0.zip.asc(833 bytes)
delta-sharing-server-0.3.0.zip.asc.sha256(101 bytes)
delta-sharing-server-0.3.0.zip.sha256(97 bytes)
v0.2.0(Aug 11, 2021)
We are excited to announce the release of Delta Sharing 0.2.0, which introduces the following improvements and fixes multiple issues:

Improvements:

Added official Docker images for Delta Sharing Server

Added an examples project to show how to try the open Delta Sharing Server (#26)

Added the conf directory to the Delta Sharing Server classpath to allow users to add their Hadoop configuration files in the directory (#45)

Added retry with exponential backoff for REST requests in the Python connector (#49)
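
Retry with exponential backoff, as mentioned in the last item, can be sketched as follows. This is a generic illustration rather than the connector's implementation: the function name `retry_with_backoff`, the default delays, and the set of retryable exceptions are all assumptions.

```python
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retryable=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each
    retryable failure and re-raising once attempts run out."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits are base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
```

The `sleep` parameter is injectable so that the wait schedule can be checked in tests without actually pausing.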

Bug fixes:

Added the minimum fsspec requirement in the Python connector (#23)

Fixed an issue when files in a table have no stats in the Python connector (#30)

Improved error handling in Delta Sharing Server to report 400 Bad Request properly (#32)

Fixed the table schema when a table is empty in the Python connector (#37)

Fixed KeyError when there are no shared tables in the Python connector (#50)

Credits: Denny Lee, Matei Zaharia, Shixiong Zhu, Yaohua, Yuhong Chen, dobachi
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.2.0.zip(201.42 MB)
delta-sharing-server-0.2.0.zip.asc(833 bytes)
delta-sharing-server-0.2.0.zip.asc.sha256(101 bytes)
delta-sharing-server-0.2.0.zip.sha256(97 bytes)
v0.1.0(May 26, 2021)
We are excited to announce the release of Delta Sharing 0.1.0.

Delta Sharing is an open protocol for secure real-time exchange of large datasets, which enables organizations to share data in real time regardless of which computing platforms they use. It is a simple REST protocol that securely shares access to part of a cloud dataset and leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer data.

With Delta Sharing, a user accessing shared data can directly connect to it through pandas, Tableau, Apache Spark, Rust, Python, or dozens of other systems that support the open protocol, without having to deploy a specific compute platform first. This makes life simpler for both data providers and consumers. Data providers can share a dataset once to reach a broad range of consumers on any platform, and data consumers can get started using the data in minutes on their existing computing tools.

This repo includes the following components:

Delta Sharing protocol specification.

Python Connector: A Python library that implements the Delta Sharing Protocol to read shared tables as pandas DataFrame or Apache Spark DataFrames.

Apache Spark Connector: An Apache Spark connector that implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. The tables can then be accessed in SQL, Python, Java, Scala, or R.

Delta Sharing Server: A reference implementation server for the Delta Sharing Protocol for development purposes. Users can deploy this server to share existing tables in Delta Lake and Apache Parquet format on modern cloud storage systems.

See the documentation for more details.
Source code(tar.gz)
Source code(zip)
delta-sharing-server-0.1.0.zip(201.42 MB)
delta-sharing-server-0.1.0.zip.asc(833 bytes)
delta-sharing-server-0.1.0.zip.asc.sha256(101 bytes)
delta-sharing-server-0.1.0.zip.sha256(97 bytes)


Owner

Delta Lake
