Public Datasets Pipelines (public-datasets-pipelines)

Overview

Cloud-native, data pipeline architecture for onboarding datasets to the Google Cloud Public Datasets Program.

Requirements

The steps below assume a working Python 3 installation, Pipenv, Terraform, and access to a Google Cloud project.

Environment Setup

We use Pipenv to make environment setup more deterministic and uniform across different machines.

If you haven't done so, install Pipenv by following its official installation instructions. With Pipenv installed, run the following command:

pipenv install --ignore-pipfile --dev

This uses the Pipfile.lock found in the project root and installs all the development dependencies.

Finally, initialize the Airflow database:

pipenv run airflow initdb

Building Data Pipelines

Configuring, generating, and deploying data pipelines in a programmatic, standardized, and scalable way is the main purpose of this repository.

Follow the steps below to build a data pipeline for your dataset:

1. Create a folder hierarchy for your pipeline

mkdir -p datasets/DATASET/PIPELINE

For example:
datasets/covid19_tracking/national_testing_and_outcomes

where DATASET is the dataset name or category that your pipeline belongs to, and PIPELINE is your pipeline's name.

For examples of pipeline names, browse the existing pipeline folders under the datasets directory in the repo.

Use only underscores and alpha-numeric characters for the names.

2. Write your config (YAML) files

If you created a new dataset directory above, you need to create a datasets/DATASET/dataset.yaml config file. See the YAML Config Reference section below for details on dataset.yaml.

Create a datasets/DATASET/PIPELINE/pipeline.yaml config file for your pipeline. See the YAML Config Reference section below for details on pipeline.yaml.

If you'd like to get started faster, you can inspect config files that already exist in the repository and infer the patterns from there.

Every YAML file supports a resources block. To use it, identify which Google Cloud resources need to be provisioned for your pipelines. Some examples are:

  • BigQuery datasets and tables to store final, customer-facing data
  • GCS bucket to store intermediate, midstream data
  • GCS bucket to store final, downstream, customer-facing data
  • Sometimes, for very large datasets, you might need to provision a Dataflow job

3. Generate Terraform files and actuate GCP resources

Run the following command from the project root:

$ python scripts/generate_terraform.py \
    --dataset DATASET_DIR_NAME \
    --gcp-project-id GCP_PROJECT_ID \
    --region REGION \
    --bucket-name-prefix UNIQUE_BUCKET_PREFIX \
    [--env] dev \
    [--tf-apply] \
    [--impersonating-acct] IMPERSONATING_SERVICE_ACCT

This generates Terraform files (*.tf) in a _terraform directory inside that dataset. The files contain infrastructure-as-code that describes which GCP resources need to be actuated for use by the pipelines. If you passed in the --tf-apply parameter, the command will also run terraform apply to actuate those resources.

The --bucket-name-prefix is used to ensure that the buckets created by different environments and contributors are kept unique. This satisfies the requirement that bucket names be globally unique across all of GCS. Use hyphenated names (some-prefix-123) instead of snake_case or underscores (some_prefix_123).

In addition, the command above creates a "dot" directory in the project root. The directory name is the value you pass to the --env parameter of the command. If no --env argument was passed, the value defaults to dev (which generates the .dev folder).

Consider this "dot" directory as your own dedicated space for prototyping. The files and variables created in that directory will use an isolated environment. All such directories are gitignored.

As a concrete example, the unit tests use a temporary .test directory as their environment.

4. Generate DAGs and container images

Run the following command from the project root:

$ python scripts/generate_dag.py \
    --dataset DATASET_DIR \
    --pipeline PIPELINE_DIR \
    [--skip-builds] \
    [--env] dev

This generates a Python file that represents the DAG (directed acyclic graph) for the pipeline (the dot dir also gets a copy). To standardize DAG files, the resulting Python code is based entirely on the contents of the pipeline.yaml config file.

Using KubernetesPodOperator requires a container image to be available for use. The command above lets this architecture build that image and push it to Google Container Registry on your behalf. Follow the steps below to prepare your container image:

  1. Create an _images folder under your dataset folder if it doesn't exist.

  2. Inside the _images folder, create another folder and name it after what the image is expected to do, e.g. process_shapefiles, read_cdf_metadata.

  3. In that subfolder, create a Dockerfile and any scripts you need to process the data. See the samples/container folder for an example, and see the sketch at the end of this step for what such a script might look like. Use the COPY command in your Dockerfile to include your scripts in the image.

The resulting file tree for a dataset that uses two container images may look like

datasets
└── DATASET
    ├── _images
    │   ├── container_a
    │   │   ├── Dockerfile
    │   │   ├── requirements.txt
    │   │   └── script.py
    │   └── container_b
    │       ├── Dockerfile
    │       ├── requirements.txt
    │       └── script.py
    ├── _terraform/
    ├── PIPELINE_A
    ├── PIPELINE_B
    ├── ...
    └── dataset.yaml

Docker images will be built and pushed to GCR by default whenever the command above is run. To skip building and pushing images, use the optional --skip-builds flag.
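For reference, the scripts inside an _images subfolder are usually driven entirely by environment variables that the pipeline's operator passes in at runtime. The following is only an illustrative sketch, assuming a hypothetical image that downloads a CSV and normalizes its column names; the environment variable names and the transform logic are assumptions, not code taken from samples/container:

# Illustrative sketch of an _images/<image_name>/script.py (not from the repository).
# The container receives all of its inputs via environment variables set by the operator.
import logging
import os
import pathlib

import pandas as pd  # assumed to be listed in the image's requirements.txt
import requests      # assumed to be listed in the image's requirements.txt


def main(source_url: str, source_file: pathlib.Path, target_file: pathlib.Path) -> None:
    # Download the raw source file
    logging.info(f"Downloading {source_url}")
    response = requests.get(source_url, timeout=300)
    response.raise_for_status()
    source_file.write_bytes(response.content)

    # Apply a simple in-memory transform: snake_case the column names
    df = pd.read_csv(source_file)
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

    # Write the BigQuery-friendly CSV; a downstream task picks it up from here
    df.to_csv(target_file, index=False)
    logging.info(f"Wrote {len(df)} rows to {target_file}")


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    main(
        source_url=os.environ["SOURCE_URL"],
        source_file=pathlib.Path(os.environ["SOURCE_FILE"]),
        target_file=pathlib.Path(os.environ["TARGET_FILE"]),
    )

The matching Dockerfile then only needs to install the dependencies and COPY this script into the image.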

5. Declare and set your pipeline variables

Running the command in the previous step will parse your pipeline config and inform you about the templated variables that need to be set for your pipeline to run.

All variables used by a dataset must have their values set in

  [.dev|.test]/datasets/{DATASET}/{DATASET}_variables.json

Airflow variables use JSON dot notation to access the variable's value. For example, if you're using the following variables in your pipeline config:

  • {{ var.json.shared.composer_bucket }}
  • {{ var.json.parent.nested }}
  • {{ var.json.parent.another_nested }}

then your variables JSON file should look like this

{
  "shared": {
    "composer_bucket": "us-east4-test-pipelines-abcde1234-bucket"
  },
  "parent": {
    "nested": "some value",
    "another_nested": "another value"
  }
}
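If you want to double-check that every templated variable reported by generate_dag.py has a matching entry, a small throwaway helper can flatten the JSON file into var.json-style dotted paths. This is not a script that ships with the repository, just an illustrative sketch (the file path below assumes the covid19_tracking dataset in the .dev environment):

# Hypothetical helper (not part of the repository): flatten a {DATASET}_variables.json
# file into the dotted paths that pipeline configs reference via var.json.
import json
import pathlib


def flatten(prefix: str, value) -> dict:
    """Recursively map nested JSON objects to 'var.json.<dotted.path>' keys."""
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        flat.update(flatten(f"{prefix}.{key}", child))
    return flat


variables_file = pathlib.Path(".dev/datasets/covid19_tracking/covid19_tracking_variables.json")
for dotted_path, value in sorted(flatten("var.json", json.loads(variables_file.read_text())).items()):
    print(f"{{{{ {dotted_path} }}}} -> {value}")

Run against the example JSON above, this prints {{ var.json.shared.composer_bucket }}, {{ var.json.parent.nested }}, and {{ var.json.parent.another_nested }} alongside their values.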

6. Deploy the DAGs and variables

Deploy the DAG and the variables to your own Cloud Composer environment using the following command:

$ python scripts/deploy_dag.py \
  --dataset DATASET \
  --composer-env CLOUD_COMPOSER_ENVIRONMENT_NAME \
  --composer-bucket CLOUD_COMPOSER_BUCKET \
  --composer-region CLOUD_COMPOSER_REGION \
  --env ENV

Testing

Run the unit tests from the project root as follows:

$ pipenv run python -m pytest -v

YAML Config Reference

Every dataset and pipeline folder must contain a dataset.yaml and a pipeline.yaml configuration file, respectively. To see how the two files are structured, inspect the config files that already exist under the datasets folder in the repository.

Best Practices

  • When running scripts/generate_terraform.py, the --bucket-name-prefix argument helps prevent GCS bucket name collisions because bucket names must be globally unique. Use hyphens rather than underscores for the prefix, and make it as unique and specific to your own environment or use case as possible.

  • When naming BigQuery columns, always use snake_case and lowercase.

  • When specifying BigQuery schemas, be explicit and always include name, type, and mode for every column. For column descriptions, derive them from the data source's definitions when available.

  • When provisioning resources for pipelines, a good rule-of-thumb is one bucket per dataset, where intermediate data used by various pipelines (under that dataset) are stored in distinct paths under the same bucket. For example:

    gs://covid19-tracking-project-intermediate
        /dev
            /preprocessed_tests_and_outcomes
            /preprocessed_vaccinations
        /staging
            /national_tests_and_outcomes
            /state_tests_and_outcomes
            /state_vaccinations
        /prod
            /national_tests_and_outcomes
            /state_tests_and_outcomes
            /state_vaccinations
    
    

    The "one bucket per dataset" rule prevents us from creating too many buckets for too many purposes. This also helps in discoverability and organization as we scale to thousands of datasets and pipelines.

    Quick note: If the data conveniently fits in memory and the transforms are close to trivial and computationally cheap, you may skip storing midstream data altogether. Just apply the transformations in one go and store the resulting data in its final destination, as in the sketch below.
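    As a rough sketch of that in-memory approach, assuming a small CSV, a trivial transform, and pandas-gbq for the load (the URL, project, and table names below are placeholders):

        # Illustrative only: the dataset is small and the transform is trivial, so there is
        # no intermediate GCS storage; download, transform in memory, load straight to BigQuery.
        import pandas as pd
        import pandas_gbq

        df = pd.read_csv("https://example.com/national_tests_and_outcomes.csv")
        df.columns = [col.lower().replace(" ", "_") for col in df.columns]  # snake_case column names

        pandas_gbq.to_gbq(
            df,
            destination_table="covid19_tracking.national_tests_and_outcomes",
            project_id="my-gcp-project",
            if_exists="replace",
        )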

Comments
  • Feat: Onboard New york taxi trips dataset

    Description

    dataset: new_york_taxi_trips
    pipelines: tlc_green_trips, tlc_yellow_trips

    Checklist

    Note: If an item applies to you, all of its sub-items must be fulfilled

    • [x] (Required) This pull request is appropriately labeled
    • [x] Please merge this pull request after it's approved
    • [x] I'm adding or editing a dataset
      • [ ] The Google Cloud Datasets team is aware of the proposed dataset
      • [ ] I put all my code inside datasets/new_york_taxi_trips and nothing outside of that directory
    opened by nlarge-google 11
  • feature: Initial implementation for austin_311.311_service_requests

    "Pipeline for austin_311.311_Service_Requests"

    Description

    v2 architecture implementation of 311_service_requests in Austin, TX. This implements the first version of the CSV transform Python script.

    Based on #

    Note: It's recommended to open an issue first for context and discussion.

    Checklist

    Note: Delete items below that aren't applicable to your pull request.

    • [ ] Please merge this PR for me once it is approved.
    • [ ] If this PR adds or edits a feature, I have updated the README accordingly.
    • [ ] If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
    • [ ] If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/<YOUR-DATASET> and nothing outside of that directory.
    • [ ] If this PR adds or edits a dataset or pipeline that I'm responsible for maintaining, my GitHub username is in the CONTRIBUTORS file.
    • [ ] This PR is appropriately labeled.
    cla: yes 
    opened by nlarge-google 11
  • feat: Onboard NOAA

    Checklist

    Note: Delete items below that aren't applicable to your pull request.

    • [x] Please merge this PR for me once it is approved.
    • [x] If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
    • [x] If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/noaa and nothing outside of that directory.
    • [x] This PR is appropriately labeled.
    data onboarding cla: yes 
    opened by nlarge-google 7
  • feat: Onboard EPA historical air quality dataset

    Description

    Included: Annual summaries, CO Daily Summary, CO Hourly Summary, HAP Daily Summary, HAP Hourly Summary, Lead Daily Summary, NO2 Daily Summary, NO2 Hourly Summary, NONOxNOy Daily Summary, NONOxNOy Hourly Summary, Ozone Daily Summary, Ozone Hourly Summary, PM 10 Daily Summary, PM10 Hourly Summary, PM25 Frm Hourly Summary, PM25 NonFrm Daily Summary, PM25 NonFrm Hourly Summary, PM25 Speciation Daily Summary, PM25 Speciation Hourly Summary, Pressure Daily Summary, Pressure Hourly Summary, RH and DP Daily Summary, RH and DP Hourly Summary, SO2 Daily Summary, SO2 Hourly Summary, Temperature Daily Summary, Temperature Hourly Summary, VOC Daily Summary, VOC Hourly Summary, Wind Daily Summary, Wind Hourly Summary

    Checklist

    Note: Delete items below that aren't applicable to your pull request.

    • [x] Please merge this PR for me once it is approved.
    • [x] If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
    • [x] If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/epa_historical_air_quality and nothing outside of that directory.
    • [x] This PR is appropriately labeled.
    cla: yes 
    opened by nlarge-google 6
  • feat: Onboard San Francisco Bikeshare Trips

    Checklist

    Note: Delete items below that aren't applicable to your pull request.

    • [x] Please merge this PR for me once it is approved.
    • [x] If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
    • [x] If this PR adds or edits a dataset or pipeline, I put all my code inside san_francisco_bikeshare_trips/bikeshare_trips and nothing outside of that directory.
    • [x] This PR is appropriately labeled.
    cla: yes 
    opened by nlarge-google 6
  • feat: Onboard Census opportunity atlas tract outcomes

    Description

    Tract Outcomes

    Checklist

    Note: Delete items below that aren't applicable to your pull request.

    • [x] Please merge this PR for me once it is approved.
    • [x] If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
    • [x] If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/census_opportunity_atlas and nothing outside of that directory.
    • [x] This PR is appropriately labeled.
    opened by nlarge-google 5
  • feat: Onboard Census Bureau International Dataset

    Description

    Based on #

    Note: It's recommended to open an issue first for context and discussion.

    Checklist

    Note: Delete items below that aren't applicable to your pull request.

    • [x] Please merge this PR for me once it is approved.

    • [x] If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.

    • [x] If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/census_bureau_international and nothing outside of that directory.

    • [x] This PR is appropriately labeled.

    data onboarding cla: yes 
    opened by vasuc-google 5
  • Containerize custom tasks

    Note: The following is taken from @tswast's recommendation on a separate thread.

    What are you trying to accomplish?

    One of the Airflow "gotchas" is that workers share resources with the scheduler, so any "real work" that uses CPU and/or memory can cause slowdowns in the scheduler or even instability if memory is used up.

    The recommendation is to do any "real work" outside the Airflow workers themselves, for example in a task scheduled on a separate node pool (such as a KubernetesPodOperator).

    What challenges are you running into?

    In the generated DAG, I see the following operator:

        # Run the custom/csv_transform.py script to process the raw CSV contents into a BigQuery friendly format
        process_raw_csv_file = bash_operator.BashOperator(
            task_id="process_raw_csv_file",
            bash_command="SOURCE_CSV=$airflow_home/data/$dataset/$pipeline/{{ ds }}/raw-data.csv TARGET_CSV=$airflow_home/data/$dataset/$pipeline/{{ ds }}/data.csv python $airflow_home/dags/$dataset/$pipeline/custom/csv_transform.py\n",
            env={'airflow_home': '{{ var.json.shared.airflow_home }}', 'dataset': 'covid19_tracking', 'pipeline': 'city_level_cases_and_deaths'},
        )
    

    I haven't looked closely at the csv_transform.py script yet, but I'd expect it to use non-trivial CPU / memory resources.

    For custom Python scripts such as this, I'd expect us to use the KubernetesPodOperator, where the work is scheduled on a separate node pool.
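    As a rough illustration of that suggestion (this is not code from the repository; the image name, node pool, and GCS paths are assumptions), the same step might instead be declared with a KubernetesPodOperator along these lines:

        # Sketch only: run the CSV transform in its own container rather than on the Airflow
        # workers. The import path is the Airflow 2 cncf.kubernetes provider location.
        from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

        process_raw_csv_file = KubernetesPodOperator(
            task_id="process_raw_csv_file",
            name="process_raw_csv_file",
            namespace="default",
            # Hypothetical image built from datasets/covid19_tracking/_images/csv_transform
            image="gcr.io/PROJECT_ID/covid19_tracking__csv_transform",
            env_vars={
                # The pod cannot see the worker's local disk, so the paths point at GCS instead
                "SOURCE_CSV": "gs://{{ var.json.shared.composer_bucket }}/data/covid19_tracking/city_level_cases_and_deaths/{{ ds }}/raw-data.csv",
                "TARGET_CSV": "gs://{{ var.json.shared.composer_bucket }}/data/covid19_tracking/city_level_cases_and_deaths/{{ ds }}/data.csv",
            },
            # Keep the heavy lifting on a dedicated node pool (the pool name is an assumption)
            affinity={
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "cloud.google.com/gke-nodepool",
                                        "operator": "In",
                                        "values": ["pool-e2-standard-4"],
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
        )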

    Checklist

    • [x] I created this issue in accordance with the Code of Conduct.
    • [x] This issue is appropriately labeled.
    feature request 
    opened by adlersantos 5
  • Feat: Onboard Mimiciii dataset

    Description

    This is to onboard the mimiciii dataset with 25 pipelines using Airflow v2 operators only.

    Checklist

    • [x] (Required) This pull request is appropriately labeled
    • [x] Please merge this pull request after it's approved

    Use the sections below based on what's applicable to your PR and delete the rest:

    Feature

    • [ ] I'm adding or editing a feature
    • [ ] I have updated the README accordingly
    • [ ] I have added/revised tests for the feature

    Data Onboarding

    • [x] I'm adding or editing a dataset
    • [x] The Google Cloud Datasets team is aware of the proposed dataset
    • [x] I put all my code inside datasets/mimiciii and nothing outside of that directory

    Code cleanup or refactoring

    • [x] I'm refactoring or cleaning up some code
    data onboarding 
    opened by Naveen130 4
  • Refactor: Combine New York pipelines into one

    Description

    These are changes and clean-up to the existing dataset pipelines for new-york:

    311_service_requests, citibike_stations, nypd_mv_collisions

    Checklist

    Note: If an item applies to you, all of its sub-items must be fulfilled

    • [x] (Required) This pull request is appropriately labeled
    • [x] Please merge this pull request after it's approved
    • [x] I'm adding or editing a dataset
      • [x] The Google Cloud Datasets team is aware of the proposed dataset
      • [x] I put all my code inside datasets/new_york and nothing outside of that directory
    • [x] I'm refactoring or cleaning up some code
    opened by nlarge-google 4
  • Feat: Onboard SEC Failure to Deliver dataset

    Checklist

    Note: If an item applies to you, all of its sub-items must be fulfilled

    • [x] (Required) This pull request is appropriately labeled
    • [x] Please merge this pull request after it's approved
    • [x] I'm adding or editing a dataset
      • [x] The Google Cloud Datasets team is aware of the proposed dataset
      • [x] I put all my code inside datasets/sec_failure_to_deliver and nothing outside of that directory
    • [x] I'm refactoring or cleaning up some code
    opened by nlarge-google 4
  • chore(deps): update dependency black to v22.12.0

    Mend Renovate

    This PR contains the following updates:

    | Package | Change |
    |---|---|
    | black (changelog) | ==22.10.0 -> ==22.12.0 |


    Release Notes

    psf/black

    v22.12.0

    Compare Source

    Preview style
    • Enforce empty lines before classes and functions with sticky leading comments (#​3302)
    • Reformat empty and whitespace-only files as either an empty file (if no newline is present) or as a single newline character (if a newline is present) (#​3348)
    • Implicitly concatenated strings used as function args are now wrapped inside parentheses (#​3307)
    • Correctly handle trailing commas that are inside a line's leading non-nested parens (#​3370)
    Configuration
    • Fix incorrectly applied .gitignore rules by considering the .gitignore location and the relative path to the target file (#​3338)
    • Fix incorrectly ignoring .gitignore presence when more than one source directory is specified (#​3336)
    Parser
    • Parsing support has been added for walruses inside generator expression that are passed as function args (for example, any(match := my_re.match(text) for text in texts)) (#​3327).
    Integrations
    • Vim plugin: Optionally allow using the system installation of Black via let g:black_use_virtualenv = 0 (#3309)

    Configuration

    📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

    🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

    Rebasing: Never, or you tick the rebase/retry checkbox.

    🔕 Ignore: Close this PR and you won't be reminded about these updates again.


    • [ ] If you want to rebase/retry this PR, check this box

    This PR has been generated by Mend Renovate. View repository job log here.

    dependencies 
    opened by renovate-bot 0
  • Fix: Onboard HRRR processes in NOAA ETL

    Description

    Notes:

    • If you are adding or editing a dataset, please specify the dataset folder involved, e.g. datasets/google_trends.
    • If you are an external contributor, please contact the Google Cloud Datasets team for your proposed dataset or feature.
    • If you are adding or editing a dataset, please do it one dataset at a time. Have all the code changes inside a single datasets/noaa folder.
    opened by nlarge-google 0
  • chore(deps): update dependency pandas-gbq to v0.18.0

    Mend Renovate

    This PR contains the following updates:

    | Package | Change |
    |---|---|
    | pandas-gbq | ==0.17.9 -> ==0.18.0 |


    Release Notes

    googleapis/python-bigquery-pandas

    v0.18.0

    Compare Source

    Features
    • Map "if_exists" value to LoadJobConfig.WriteDisposition (#​583) (7389cd2)

    Configuration

    📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

    🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

    Rebasing: Never, or you tick the rebase/retry checkbox.

    🔕 Ignore: Close this PR and you won't be reminded about these updates again.


    • [ ] If you want to rebase/retry this PR, check this box

    This PR has been generated by Mend Renovate. View repository job log here.

    dependencies 
    opened by renovate-bot 1
  • chore(deps): update dependency flake8 to v6

    Mend Renovate

    This PR contains the following updates:

    | Package | Change |
    |---|---|
    | flake8 (changelog) | ==5.0.4 -> ==6.0.0 |


    Release Notes

    pycqa/flake8

    v6.0.0

    Compare Source


    Configuration

    📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

    🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

    Rebasing: Never, or you tick the rebase/retry checkbox.

    🔕 Ignore: Close this PR and you won't be reminded about these updates again.


    • [ ] If you want to rebase/retry this PR, check this box

    This PR has been generated by Mend Renovate. View repository job log here.

    dependencies 
    opened by renovate-bot 0
  • Feat: Onboard Af dag notifications

    Description

    Notes:

    • If you are adding or editing a dataset, please specify the dataset folder involved, e.g. datasets/google_trends.
    • If you are an external contributor, please contact the Google Cloud Datasets team for your proposed dataset or feature.
    • If you are adding or editing a dataset, please do it one dataset at a time. Have all the code changes inside a single datasets/af_dag_notifications folder.
    opened by nlarge-google 0
Releases: v5.2.0