Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy

lbs-data

Motivation

Location data is collected from the public by private firms via mobile devices. Can this data also be used to serve the public good while preserving privacy? Can we realize this goal by generating synthetic data for use instead of the real data? The synthetic data would need to balance utility and privacy.

Overview

What:

This project uses location-based services (LBS) data provided by a location intelligence company to train an RNN model that generates synthetic location data. The goal is for the synthetic data to maintain the properties of the real data, at both the individual and aggregate levels, so that it retains its utility. At the same time, the synthetic data should differ sufficiently from the real data at the individual level to preserve user privacy.

Furthermore, the system uses home and work areas as labels and inputs in order to generate location data for synthetic users with the given home and work areas.
This addresses the issue of limited sample sizes. Population data, such as census data, can be used to create the input necessary to output a synthetic location dataset that represents the true population in size and distribution.

Data

/data/

ACS data

data/ACS/ma_acs_5_year_census_tract_2018/

Population data is sourced from the 2018 American Community Survey 5-year estimates.

LBS data

/data/mount/

Privately stored on a remote server.

Geography and time period

  • Geography: The region of study is limited to 3 counties surrounding Boston, MA.
  • Time period: The training and output data is for the first 5-day workweek of May 2018.

Data representation

The LBS data are provided as rows with the following fields:

device ID, latitude, longitude, timestamp, dwelltime

The data are transformed into "stay trajectories". Each stay trajectory represents the data for one user (device ID). It is a sequence in which each index represents a 1-hour time interval, and the value at that index is the location/area (census tract) where the user spent the most time during that interval.

e.g.

[A,B,D,C,A,A,A,NULL,B...]

Each letter represents a location. NULL marks an interval in which no location data was reported.
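
The transformation can be sketched as follows. This is a minimal illustration, not the repository's implementation. It assumes the rows for one device already carry an area (census tract), a start timestamp in epoch seconds, and a dwell time in seconds. The function name, the row layout, and the way dwell time is split across hour boundaries are all assumptions.

```python
from collections import defaultdict

HOUR = 3600  # seconds per 1-hour interval


def build_stay_trajectory(rows, start_ts, n_hours):
    """Build one device's stay trajectory.

    rows: iterable of (area, timestamp, dwell_seconds) for a single device
          (hypothetical layout; the area is assumed to be attached already).
    Returns a list of length n_hours holding, per hour, the area with the
    most dwell time, or None when nothing was reported in that hour.
    """
    seconds = [defaultdict(float) for _ in range(n_hours)]
    end_ts = start_ts + n_hours * HOUR
    for area, ts, dwell in rows:
        # Clip the stay to the study window.
        t = max(ts, start_ts)
        stay_end = min(ts + dwell, end_ts)
        # Split the stay across the hourly intervals it overlaps.
        while t < stay_end:
            idx = int((t - start_ts) // HOUR)
            chunk_end = min(stay_end, start_ts + (idx + 1) * HOUR)
            seconds[idx][area] += chunk_end - t
            t = chunk_end
    return [max(s, key=s.get) if s else None for s in seconds]


rows = [("A", 0, 2400), ("B", 2400, 3600), ("A", 3 * HOUR, 600)]
trajectory = build_stay_trajectory(rows, start_ts=0, n_hours=4)
```

Note that in this sketch a row with zero dwell time contributes nothing; how such rows should be treated is a design choice left open here.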

Home and work locations are inferred for each user's stay trajectory. Stay trajectories are prefixed with the home and work locations. These home and work prefixes then serve as labels.

[home,work,A,B,D,C,A,A,A,NULL,B...]

The home and work values are also elements that occur (frequently) in their associated stay trajectory (e.g. home=A).

These sequences are used to train the model and are also output by the model.
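
As an illustration of the prefixed representation, the sketch below serializes a trajectory to a single line of text and parses it back. The space-separated layout, the `NULL` token spelling, and the function names are assumptions made for the example; the files in this repository may use a different layout.

```python
NULL_TOKEN = "NULL"  # assumed spelling of the missing-data token


def to_prefixed_line(home, work, trajectory):
    """Serialize [home, work, area, area, ...] as one space-separated line."""
    tokens = [home, work] + [NULL_TOKEN if a is None else a for a in trajectory]
    return " ".join(str(t) for t in tokens)


def from_prefixed_line(line):
    """Parse a line back into (home, work, trajectory)."""
    tokens = line.split()
    home, work = tokens[0], tokens[1]
    trajectory = [None if t == NULL_TOKEN else t for t in tokens[2:]]
    return home, work, trajectory


line = to_prefixed_line("A", "B", ["A", "B", None, "A"])
home, work, trajectory = from_prefixed_line(line)
```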

RNN

The RNN model developed in this work is meant to be simple and replicable. It was implemented with the open source textgenrnn library: https://github.com/minimaxir/textgenrnn

Many models (>70) are trained with a variety of hyperparameter values. The models are each trained on the same training data and then use the same input (home, work labels) to generate output synthetic data. The output is evaluated via a variety of utility and privacy metrics in order to determine the best model/parameters.

Pipeline

Preprocessing

Define geography / shapefiles

./shapefile_shaper.ipynb

Our study uses 3 counties surrounding Boston, MA: Middlesex, Norfolk, Suffolk counties.

shapefile_shaper prunes MA shapefiles for this geography.

Output files are in ./shapefiles/ma/

Census tracts are used as "areas"/locations in stay trajectories.

Data filtering

./preprocess_filtering.ipynb

The LBS data is sparse. Some users report just a few datapoints, while other users report many. In order to confidently infer home and work locations, and learn patterns, we only include data from devices with sufficient reporting.

./preprocess_filtering.ipynb filters the data accordingly. It explores the data to determine an appropriate level of filtering, then saves files with the filtered data. Namely, it saves a data file with LBS data from devices that reported at least 3 days and 3 nights of data during the 1 workweek of the study period. This is the pruned dataset used in the following work.
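
A rough sketch of the "3 days and 3 nights" criterion is shown below. It is not the notebook's code. The daytime window (08:00 to 20:00), the rule that early-morning records belong to the night that began the previous evening, and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DAY_HOURS = range(8, 20)  # assumed daytime hours; all other hours count as night


def sufficient_devices(records, min_days=3, min_nights=3):
    """Return the IDs of devices that reported on enough distinct days and nights.

    records: iterable of (device_id, datetime) pairs (hypothetical layout).
    """
    days = defaultdict(set)
    nights = defaultdict(set)
    for device_id, ts in records:
        if ts.hour in DAY_HOURS:
            days[device_id].add(ts.date())
        elif ts.hour >= DAY_HOURS.stop:
            nights[device_id].add(ts.date())
        else:
            # Early-morning hours belong to the night that started the day before.
            nights[device_id].add((ts - timedelta(days=1)).date())
    return {
        d for d in days
        if len(days[d]) >= min_days and len(nights[d]) >= min_nights
    }


records = [
    ("a", datetime(2018, 5, 7, 10)), ("a", datetime(2018, 5, 7, 22)),
    ("a", datetime(2018, 5, 8, 10)), ("a", datetime(2018, 5, 8, 23)),
    ("a", datetime(2018, 5, 9, 10)), ("a", datetime(2018, 5, 10, 2)),
    ("b", datetime(2018, 5, 7, 12)), ("b", datetime(2018, 5, 7, 22)),
    ("b", datetime(2018, 5, 8, 12)), ("b", datetime(2018, 5, 8, 3)),
    ("b", datetime(2018, 5, 9, 12)),
]
kept = sufficient_devices(records)
```

In this example device "a" has 3 days and 3 nights and is kept, while device "b" has 3 days but only 1 night and is dropped.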

Attach areas

/attach_areas.ipynb

Census areas are attached to LBS data rows.
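
Conceptually, attaching an area means finding the census tract polygon that contains each (longitude, latitude) point. In practice a geospatial library and the pruned shapefiles would handle this. The dependency-free sketch below only shows the underlying point-in-polygon test (ray casting) on hypothetical square "tracts"; the names and the polygon format are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon: list of (x, y) vertices in order, without repeating the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def attach_area(lon, lat, tracts):
    """Return the ID of the first tract containing the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Hypothetical tracts: two adjacent unit squares.
tracts = {
    "T1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "T2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
area = attach_area(0.5, 0.5, tracts)
```

Points outside the study geography map to None in this sketch; how such rows are handled in the notebook is not specified here.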

Home, work inference

./infer_home_work.ipynb

Defines functions to infer home and work locations (census tracts) for each device user, based on their LBS data. The home location is where the user spends the most time in nighttime hours. The "work" location is where the user spends the most time in workday hours. These locations can be the same.

This notebook also helps determine which hours to treat as nighttime hours. Once the functions are defined, they are used to evaluate the representativeness of the data by comparing the inferred population statistics to ACS 2018 census data.

Saves a mapping of LBS user IDs to the inferred home, work locations.
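
The inference rule can be sketched on an hourly stay trajectory as below. The specific hour windows are placeholders (the notebook is what determines good values), and the function name and tie-breaking behavior are assumptions.

```python
from collections import Counter

# Placeholder windows; the actual hours are chosen in the notebook.
NIGHT_HOURS = set(range(0, 6)) | set(range(20, 24))
WORK_HOURS = set(range(9, 17))


def infer_home_work(trajectory, start_hour=0):
    """Infer (home, work) from an hourly trajectory of areas (None = no data).

    start_hour is the hour of day (0-23) of the first trajectory element.
    Either value is None when no data falls in the corresponding window.
    """
    home_counts, work_counts = Counter(), Counter()
    for i, area in enumerate(trajectory):
        if area is None:
            continue
        hour = (start_hour + i) % 24
        if hour in NIGHT_HOURS:
            home_counts[area] += 1
        elif hour in WORK_HOURS:
            work_counts[area] += 1
    home = home_counts.most_common(1)[0][0] if home_counts else None
    work = work_counts.most_common(1)[0][0] if work_counts else None
    return home, work


one_day = ["A"] * 8 + [None] + ["B"] * 8 + ["C"] * 3 + ["A"] * 4
home, work = infer_home_work(one_day)
```

This sketch does not distinguish weekdays from weekends, which is reasonable only because the study period is a single 5-day workweek.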
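
One simple way to compare inferred home counts against census population is a per-tract correlation. The sketch below is only an example of such a comparison; the choice of Pearson correlation, the function names, and the toy numbers are assumptions and not necessarily what the notebook computes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def representativeness(inferred_home_counts, census_population):
    """Correlate inferred residents per tract with census population per tract.

    Tracts with no inferred residents count as 0.
    """
    tracts = sorted(census_population)
    xs = [inferred_home_counts.get(t, 0) for t in tracts]
    ys = [census_population[t] for t in tracts]
    return pearson(xs, ys)


# Toy numbers for illustration only, not real ACS values.
inferred = {"t1": 10, "t2": 20, "t3": 30}
census = {"t1": 1000, "t2": 2000, "t3": 3000}
r = representativeness(inferred, census)
```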

Stay trajectories setup

./trajectory_synthesis/trajectory_synthesis_notebook.ipynb

Transforms preprocessed LBS data into prefixed stay trajectories and outputs files for model training, data generation, and comparison.

Note: for the purposes of model training and data generation, the area tokens within stay trajectories can be arbitrary. What matters for the model's success is the relationship between them. To save the stay trajectories in this repository while keeping the real data private, we map real census areas to integers and replace each area in the stay trajectories with its integer. The transformed stay trajectories are used for model training and data generation. The mapping between real census areas and their integer representations is kept private. When needed (such as when evaluating trip distance metrics), the integers in stay trajectories can be mapped back to the real areas they represent.
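
The relabeling idea can be sketched as follows. The shuffling step (so that integer order reveals nothing about the original area IDs), the function names, and the example area names are assumptions for illustration; the notebook may assign integers differently.

```python
import random


def build_area_mapping(areas, seed=None):
    """Map each distinct area to an integer 1..N in shuffled order."""
    distinct = sorted(set(areas))
    random.Random(seed).shuffle(distinct)
    return {area: i + 1 for i, area in enumerate(distinct)}


def relabel(trajectory, mapping):
    """Replace real areas with their integer labels (None stays None)."""
    return [None if a is None else mapping[a] for a in trajectory]


def restore(trajectory, mapping):
    """Map integer labels back to real areas using the private mapping."""
    inverse = {label: area for area, label in mapping.items()}
    return [None if a is None else inverse[a] for a in trajectory]


# Hypothetical area names; the real mapping is kept private.
real = ["tract_a", "tract_b", None, "tract_a", "tract_c"]
mapping = build_area_mapping(a for a in real if a is not None)
public = relabel(real, mapping)
assert restore(public, mapping) == real
```

Only `public` would be stored in the repository; `mapping` stays with the private data.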

Output files:

./data/relabeled_trajectories_1_workweek.txt: D: Full training set of 22704 trajectories

./data/relabeled_trajectories_1_workweek_prefixes_to_counts.json: Maps D home,work label prefixes to counts

./data/relabeled_trajectories_1_workweek_sample_2000.txt: S: Random sample of 2000 trajectories from D.

./data/relabeled_trajectories_1_workweek_prefixes_to_counts_sample_2000.json: Maps S home,work label prefixes to counts

  • This is used as the input for data generation so that the output synthetic sample, S', has a home, work label pair distribution that matches S.
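
A minimal sketch of how the sample and the prefix counts relate is shown below. It assumes each trajectory is one line of space-separated tokens whose first two tokens are the home and work labels, and that the JSON keys join those two labels with a space. Both are assumptions about the file layout, and the function names are hypothetical.

```python
import json
import random
from collections import Counter


def prefixes_to_counts(lines):
    """Count how many trajectories share each (home, work) prefix."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts[" ".join(tokens[:2])] += 1
    return dict(counts)


def draw_sample(lines, k, seed=None):
    """Draw a random sample of k trajectories without replacement."""
    return random.Random(seed).sample(lines, k)


# Toy stand-in for D; the real file holds 22704 trajectories.
D = ["1 2 1 1 2 2", "1 2 1 2 2 1", "3 3 3 3 3 3", "4 5 4 5 5 4"]
S = draw_sample(D, k=2, seed=0)
S_counts_json = json.dumps(prefixes_to_counts(S))
```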

Model training and data generation

./trajectory_synthesis/textgenrnn_generator/

Models with a variety of hyperparameter combinations were trained and then used to generate a synthetic sample.

The files model_trainer.py and generator.py are the templates for the scripts used to train and generate.
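
The way prefix counts drive generation can be sketched independently of the model. In the sketch below, `generate_fn` is a hypothetical stand-in for whatever call the trained model exposes to produce one trajectory from a prefix; it is not a textgenrnn function, and the real scripts may batch or post-process differently.

```python
def generate_sample(prefix_counts, generate_fn):
    """Generate one synthetic trajectory per real trajectory, prefix by prefix.

    prefix_counts: dict mapping a "home work" prefix to its count in S.
    generate_fn:   hypothetical callable taking a prefix and returning one
                   generated trajectory line that starts with that prefix.
    """
    sample = []
    for prefix, count in prefix_counts.items():
        for _ in range(count):
            sample.append(generate_fn(prefix))
    return sample


def dummy_generate(prefix):
    """Placeholder generator: stays at the home area for three hours."""
    home = prefix.split()[0]
    return prefix + " " + " ".join([home] * 3)


synthetic = generate_sample({"1 2": 2, "3 3": 1}, dummy_generate)
```

Because each prefix is used exactly as many times as it occurs in S, the resulting S' has the same home, work label distribution as S by construction.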

The model (hyper)parameter combinations were tracked in a spreadsheet: ./trajectory_synthesis/textgenrnn_generator/textgenrnn_model_parameters_.csv
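
For illustration, a grid of parameter combinations can be enumerated and written out as CSV as sketched below. The parameter names and values are placeholders and do not reflect the columns or values in the actual spreadsheet.

```python
import csv
import io
import itertools

# Placeholder grid; names and values are illustrative only.
GRID = {
    "rnn_layers": [2, 3],
    "rnn_size": [128, 256],
    "max_length": [24, 48],
    "temperature": [0.8, 1.0],
}


def parameter_combinations(grid):
    """Yield one dict per combination, with a sequential model_id."""
    names = list(grid)
    value_lists = (grid[name] for name in names)
    for i, values in enumerate(itertools.product(*value_lists), start=1):
        yield {"model_id": i, **dict(zip(names, values))}


def to_csv(rows):
    """Render the combinations as CSV text with a header row."""
    rows = list(rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(parameter_combinations(GRID))
```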

Evaluation

./trajectory_synthesis/evaluation/evaluate_rnn.ipynb

A variety of utility and privacy evaluation tools and metrics were developed. Models were evaluated by their synthetic data outputs (S'). This was done in ./trajectory_synthesis/evaluation/evaluate_rnn.ipynb. The best model (i.e. best parameters) was determined by these evaluations. The results for this model are captured in trajectory_synthesis/evaluation/final_eval_plots.ipynb.
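
To make the utility/privacy trade-off concrete, the sketch below shows one example metric of each kind. These are illustrative choices and not necessarily the metrics used in the notebooks: total variation distance between the area-visit distributions of S and S' as an aggregate utility measure, and the minimum Hamming distance from a synthetic trajectory to any real trajectory as an individual-level privacy measure. It assumes all trajectories have the same length.

```python
from collections import Counter


def area_distribution(trajectories):
    """Share of reported hours spent in each area, across all trajectories."""
    counts = Counter(a for t in trajectories for a in t if a is not None)
    total = sum(counts.values())
    return {area: c / total for area, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))


def min_hamming_distance(synthetic, real_trajectories):
    """Smallest number of differing hours between one synthetic trajectory
    and any real trajectory. 0 means an exact copy of a real trajectory."""
    return min(
        sum(x != y for x, y in zip(synthetic, real))
        for real in real_trajectories
    )


# Toy data for illustration only.
S = [[1, 1, 2, None], [2, 2, 2, 3]]
S_prime = [[1, 1, 2, 2], [2, 2, 3, 3]]
utility_gap = total_variation(area_distribution(S), area_distribution(S_prime))
closest = [min_hamming_distance(t, S) for t in S_prime]
```

A low `utility_gap` suggests the aggregate distribution is preserved, while larger values in `closest` suggest the synthetic trajectories are not near-copies of real ones.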
