🎁 3,000,000+ Unsplash images made available for research and machine learning

Overview

The Unsplash Dataset

The Unsplash Dataset is built from the work of 250,000+ contributing photographers around the world and from data sourced from hundreds of millions of searches across a nearly unlimited number of uses and contexts. Because of the breadth of intent and semantics it contains, the dataset opens up new opportunities for research and learning.

The Unsplash Dataset is offered in two versions:

  • the Lite dataset: available for commercial and non-commercial usage, containing 25k nature-themed Unsplash photos, 25k keywords, and 1M searches
  • the Full dataset: available for non-commercial usage, containing 3M+ high-quality Unsplash photos, 5M keywords, and over 250M searches

As the Unsplash library continues to grow, we’ll release updates to the dataset with new fields and new images, with each subsequent release being semantically versioned.

We welcome any feedback regarding the content of the datasets or their format. With your input, we hope to close the gap between the data we provide and the data that you would like to leverage. You can open an issue to report a problem or to let us know what you would like to see in the next release of the datasets.

For more on the Unsplash Dataset, see our announcement and site.

Download

Lite Dataset

The Lite dataset contains all of the same fields as the Full dataset, but is limited to ~25,000 photos. It can be used for both commercial and non-commercial usage, provided you abide by the terms.

⬇️ Download the Lite dataset [~650MB compressed, ~1.4GB raw]

Full Dataset

The Full dataset is available for non-commercial usage, and all uses must abide by the terms. To get access, go to unsplash.com/data and submit a request. The dataset weighs ~20GB compressed (~43GB raw).

Documentation

See the documentation for a complete list of tables and fields.

Usage

You can follow the examples in the repository to load the dataset in common formats.
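As one illustration, here is a minimal Python sketch that uses only the standard library. It assumes the archive has been extracted into a local directory and that each table is split into one or more tab-separated chunks named like photos.tsv000 and keywords.tsv000 (the two file names mentioned elsewhere on this page). The `load_table` helper and the directory layout are hypothetical and are not part of the official examples.

```python
import csv
import glob
import os


def load_table(directory, table):
    """Read every `<table>.tsv*` chunk in `directory` into a list of dicts."""
    rows = []
    pattern = os.path.join(directory, table + ".tsv*")
    # Sort so that chunks such as .tsv000, .tsv001 are read in a stable order.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            # Assumption: every chunk starts with its own header row.
            reader = csv.DictReader(handle, delimiter="\t")
            rows.extend(reader)
    return rows
```

For example, `load_table("./unsplash-lite", "photos")` would return one dict per photo, keyed by column name, with every value as a string. For the Full dataset, a DataFrame library or a database is likely to be more practical than in-memory lists.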

Share your work

We're making this data open and available with the hopes of enabling researchers and developers to discover interesting and useful connections in the data.

We'd love to see what you create, whether that's a research paper, a machine learning model, a blog post, or just an interesting discovery in the data. Send us an email at [email protected].

If you're using the dataset in a research paper, you can attribute the dataset as Unsplash Lite Dataset 1.2.0 or Unsplash Full Dataset 1.2.0 and link to the permalink unsplash.com/data.


The Unsplash Dataset is made available for research purposes. It cannot be used to redistribute the images contained within. To use the Unsplash library in a product, see the Unsplash API.

Comments
  • Potential Data Cleanup activities

    In unsplash_photos.photo_location_country and unsplash_photos.photo_location_city, the values appear to be freeform text, probably direct user input, with effectively duplicate entries. For example:

    unsplash_lite=# select '>' || photo_location_city || '<' as city, '>' || photo_location_country || '<' as country,  count(*) from unsplash_photos where lower(photo_location_city) like '%london%' group by 1,2;
     ?column?  |       ?column?       | count
    -----------+----------------------+-------
     >LONDON < | >United Kingdom <    |     1
     >London<  | >Canada<             |     7
     >London<  | >Egyesült Királyság< |     1
     >London<  | >England<            |     1
     >London<  | >U.K.<               |     2
     >London<  | >U.K<                |     1
     >London<  | >United Kingdom <    |     1
     >London<  | >United Kingdom<     |    73
     >London<  |                      |     3
    (9 rows)
    

    It looks like these fields need some data cleaning, at a minimum stripping whitespace. Is it assumed that we should do our own location normalization, possibly adding normalized_photo_location_country and normalized_photo_location_city columns?
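The kind of consumer-side cleanup described here could be sketched as follows. This is a minimal illustration, not an official normalization: the alias table contains only the country variants visible in the query output above, mapping "England" to "United Kingdom" is a judgment call, and all function names are hypothetical.

```python
import string
import unicodedata

# Hypothetical alias table, built only from the variants in the query output.
COUNTRY_ALIASES = {
    "united kingdom": "United Kingdom",
    "u.k.": "United Kingdom",
    "u.k": "United Kingdom",
    "england": "United Kingdom",  # judgment call, see the note above
    "egyesült királyság": "United Kingdom",  # Hungarian name
}


def clean_text(value):
    """Trim, collapse inner whitespace, and map empty strings to None."""
    if value is None:
        return None
    text = " ".join(unicodedata.normalize("NFC", value).split())
    return text or None


def normalize_country(value):
    """Return the canonical country name, or the cleaned input if unknown."""
    text = clean_text(value)
    if text is None:
        return None
    return COUNTRY_ALIASES.get(text.casefold(), text)


def normalize_city(value):
    """Clean a city name and re-case values typed in a single case."""
    text = clean_text(value)
    if text is None:
        return None
    # Only touch the casing of values such as "LONDON" or "london".
    if text.isupper() or text.islower():
        return string.capwords(text)
    return text
```

Unknown countries pass through unchanged apart from whitespace cleanup, so the function is safe to run over the whole column. A real alias table would need to be far larger, and a city such as London in Canada still has to be kept apart from London in the United Kingdom by pairing city and country.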

    Also, unsplash_conversion.conversion_country appears to contain two-letter ISO country codes (ISO 3166-1 alpha-2). Is each value guaranteed to be a valid code? And was this data created from a MaxMind GeoIP lookup or something similar?
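Until that is confirmed, a quick shape check can at least flag values that cannot be alpha-2 codes. The sketch below only tests for two uppercase ASCII letters; it does not check membership in the official ISO 3166-1 list, and the function names are hypothetical.

```python
import re

# Shape of an ISO 3166-1 alpha-2 code: exactly two uppercase ASCII letters.
_ALPHA2_SHAPE = re.compile(r"[A-Z]{2}")


def has_alpha2_shape(code):
    """Return True if `code` is a string of exactly two uppercase letters."""
    return isinstance(code, str) and _ALPHA2_SHAPE.fullmatch(code) is not None


def malformed_codes(codes):
    """Collect the distinct values that do not have the two-letter shape."""
    bad = {code for code in codes if not has_alpha2_shape(code)}
    # Sort by repr so that None and strings can be ordered together.
    return sorted(bad, key=repr)
```

Running `malformed_codes` over the conversion_country column gives a short list of suspicious values to inspect by hand. Codes that pass the shape check can still be unassigned, so full validation needs the published code list.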

    Thanks so much for this dataset. I think it is going to be quite useful for demonstration purposes. I hope these questions help increase the quality of what is already a great dataset.

    bug 
    opened by copiousfreetime 5
  • API for Random Images

    I love your API and would like to integrate the commercially usable images into our product through an API. Would you consider creating an API endpoint for this?

    It could work the same way as your existing API, but serve only the images in these datasets.

    Flickr offers something similar for Creative Commons images, but I like your API more. 👍

    enhancement 
    opened by themataleao 5
  • 1.1.0 Release

    Overview

    • Closes #10
    • Closes #13
    • Closes #21
    • Closes #22
    • Closes #29

    A 1.1.0 release tag will be created with more details about what's new in this version:

    New:

    • #10: Included user descriptions of the photos
    • #21: Included width, height and aspect ratio of photos
    • #22: Included colors data from photos, coming from a 3rd party AI

    Fix:

    • #13: Trimmed some fields.
    • #29: Replaced newlines in the keywords file with spaces to avoid CSV import issues
    documentation enhancement 
    opened by TimmyCarbone 3
  • Image analysis metadata available?

    Ticket #21 mentions additional image metadata that should be in the dataset. Are there other image analysis results that Unsplash calculates that could be added? I'm thinking in particular of color value statistics: mean/median pixel value, min/max pixel values, etc.

    enhancement question 
    opened by copiousfreetime 3
  • 1.0.1 - Fix always empty landmark columns

    Overview

    • Closes #12
    • Fix: All landmark columns are null
    • Fix: Some landmark names showing as empty strings instead of null
    • Fix: Duplicate entry for one photo in the Full Dataset

    New release

    • Upon merging, create a 1.0.1 tag and release for the dataset.
    • /lite/latest API link updated to the 1.0.1 dataset.
    bug 
    opened by TimmyCarbone 3
  • Explanation on the ai_service_2_confidence column in keywords.tsv000 (range seems weird)

    Describe the bug

    Hello,

    I was looking at the data in the Lite dataset this morning and noticed something odd in the 'ai_service_2_confidence' column of the keywords.tsv000 file.

    When I computed summary statistics on the ai_service columns, 'ai_service_2_confidence' turned out to have extreme values exceeding 100, which I would expect to be the maximum (taking ai_service_1_confidence as a reference, for example).

    To Reproduce

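    Here is the code to build the stats:

    import pandas as pd

    # Load the tab-separated keywords file; the first row holds the column names.
    dfp_keywords_raw = pd.read_csv('keywords.tsv000', sep='\t', header=0)
    # Summary statistics (count, mean, std, min, quartiles, max) for both confidence columns.
    dfp_keywords_raw[['ai_service_1_confidence', 'ai_service_2_confidence']].describe()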
    

    Steps to reproduce the behavior: run the code above in a Python environment (3.6.13) with pandas 1.1.5 installed.

    Expected behavior

    I expect the values of the 'ai_service_2_confidence' column in the keywords.tsv000 file to be between 0 and 100. If that is not the case, the documentation should describe the column more precisely (for example, by stating its range).
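A check along these lines can list the affected rows directly. This is a minimal sketch using only the standard library; the 0 to 100 range is the expectation stated in this report, not a documented guarantee, and the function name is hypothetical.

```python
import csv


def out_of_range_rows(path, column, low=0.0, high=100.0):
    """Yield (line_number, value) for rows whose `column` is outside [low, high]."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            raw = row.get(column)
            if raw is None or raw == "":
                continue  # empty cells are treated as missing, not as errors
            value = float(raw)
            if not low <= value <= high:
                yield reader.line_num, value
```

For example, `list(out_of_range_rows("keywords.tsv000", "ai_service_2_confidence"))` would return the file line number and value of every offending row, which can then be compared against the attached list of keywords.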

    Additional context

    I have a list of the keywords that seem to be affected by these extreme values: unsplash_extreme_value.zip

    I hope this helps your investigation 🕵️‍♀️ (and I hope it's not just me missing something).

    PS: your dataset is great by the way (really hope to have access to the full version soon)👍

    bug 
    opened by jeanmidevacc 2
  • Downloading unsplash dataset

    I am trying to download the Lite version of the Unsplash dataset, but the download link for the Lite version doesn't give me any images. Are the images included in the download, or do I need to download them using the API?

    opened by darwinkeem 2
  • Question about the entities behind the "anonymous_user_id"

    Hey, I'm trying to calculate the average number of images a user downloads.

    As I know from my own photo stats, a lot of downloads are generated via API requests from external applications. You state in your API doc that external applications don't need to authenticate on a user level.

    My question: is a single anonymous user ID generated for an external application like Trello, or do you have a better approach to distinguishing the users "behind" the external application?

    Example from the test dataset: could the user in the first row (942 downloads) really be one person, or could it be a whole logical entity like Trello?

     anonymous_user_id                    | downloads
    --------------------------------------+-----------
     5a055748-57d2-45c1-a882-5b9bb9313509 |       942
     beb0923e-c17d-4a90-a8db-47b0f45fb0fc |       897
     85e5db9c-07c7-49bf-9e08-5cbd1603dd74 |       546
     ...                                  |       ...
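A per-user download count like the one in the table above can be sketched as follows. The sketch assumes one conversion row per download with an anonymous_user_id value on each row; the function names and the `heavy_threshold` cut-off are hypothetical. Because a few IDs may stand for whole applications, the median is reported next to the mean.

```python
from collections import Counter
from statistics import mean, median


def downloads_per_user(user_ids):
    """Count how many download rows belong to each anonymous_user_id."""
    return Counter(user_ids)


def summarize(counts, heavy_threshold=500):
    """Summarize a per-user download Counter."""
    values = list(counts.values())
    return {
        "users": len(values),
        "mean": mean(values),
        # The median is less sensitive to IDs that represent whole applications.
        "median": median(values),
        # IDs at or above the (arbitrary) threshold, most active first.
        "heavy": [uid for uid, n in counts.most_common() if n >= heavy_threshold],
    }
```

The "heavy" list is only a starting point for manual review. Whether such an ID is a single person or an integration like Trello cannot be decided from the counts alone.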


    Thanks a lot for the answer and great job with the data set. 👍

    enhancement question 
    opened by vii33 2
  • Is it expected to have fields where all values are null

    When reviewing the data in the lite dataset, all of the following fields are null in all records.

    • unsplash_photos.ai_primary_landmark_name
    • unsplash_photos.ai_primary_landmark_latitude
    • unsplash_photos.ai_primary_landmark_longitude
    • unsplash_photos.ai_primary_landmark_confidence

    If all of these are supposed to be null all the time, it may be useful to drop those columns from the dataset completely.

    If these columns do have data in the Full dataset, however, it makes sense to keep them. In that case, it may be useful to update the documentation to note that these fields are null in the Lite dataset and have values in the Full dataset.

    In any case, just checking to make sure that this is the expected behavior.

    bug 
    opened by copiousfreetime 2
  • Feature request: Add description = true feature to your random API requests

    Hello,

    I'm an English Learning Educator and I have a registered project set up already.

    I would like to get random images, but ONLY the ones that have a clear or detailed description field, not null or empty values.

    Is there a way to do this with your current API, or can it be implemented by your team as a feature request?

    something like this:

    https://api.unsplash.com/photos/random?description=true&description_min_chars=10&client_id=XXXXXX

    params

    description = true (required)

    description_min_chars = int (required): minimum length of the description, in characters
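Until such parameters exist, the same effect can be approximated on the client side by filtering the photos that come back. The sketch below assumes each photo is a dict decoded from a JSON response with an optional "description" key; that field name and both function names are assumptions for illustration, and the requested query parameters are not part of the current API as far as this request indicates.

```python
def has_usable_description(photo, min_chars=10):
    """Return True if the photo has a description of at least `min_chars` characters."""
    description = photo.get("description")
    if not isinstance(description, str):
        return False  # covers missing and null descriptions
    return len(description.strip()) >= min_chars


def filter_described(photos, min_chars=10):
    """Keep only the photos whose description is long enough."""
    return [photo for photo in photos if has_usable_description(photo, min_chars)]
```

In practice this means requesting a batch of random photos and discarding those without a usable description, at the cost of some wasted requests.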

    Regards

    Hugo Barbosa

    enhancement 
    opened by inglesuniversal 1
  • Will I be banned if I download the pictures?

    I applied for the Full dataset and was approved, and I now have the 47 GB dataset. Will I be banned if I download the pictures? I have a 1 Gb/s server and would like to download the pictures listed in 'photos.tsv000', whose URLs begin with 'images.unsplash.com'. Could I get banned for downloading them too quickly? (On my 1 Gb/s server, that would be about 5-6 pictures per second.)
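Whatever rate turns out to be acceptable, a bulk download can be capped with a small throttle. The sketch below is a generic rate limiter and makes no claim about which rate Unsplash permits; the `Throttle` class is hypothetical, and the `clock` and `sleep` parameters exist only so that the logic can be tested without waiting.

```python
import time


class Throttle:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time at which the next call may run

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `throttle.wait()` before each image request keeps the request rate at or below the configured value, for example `Throttle(rate=2)` for two pictures per second.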

    opened by orange2008 1
  • Photo ID 9_9hzZVjV8s

    Describe the bug

    URL is bad

    To Reproduce

    Try downloading it

    Expected behavior

    Good URL

    bug 
    opened by peardox 4
  • Elaboration of data fields and additions

    [1] photo_featured: does this include the featured topic categories at the top?
    [1.1] Could this field include which topics the photo is featured in, and also whether the photo was submitted to and rejected from a topic?
    [1.2] Is there historical data on which images were not 'searchable' before the search system was replaced so that everything is searchable?

    [2] suggested_by_user: the description mentions 'a user (human)'. At some point (maybe?) Unsplash was adding tags or keywords to its approved/moderated photos. Does (or could) this field distinguish who added the keyword (uploader or staff)?

    [4] keyword: is this referring to the search terms used to find the photo?
    [4.1] Assuming it is the search keywords, could a field be added for the position at which the photo was displayed on the website (i.e. whether it was in the first row and first column, or was 30 photos down and the searcher scrolled to find and pick it)?

    Thanks for releasing the dataset, it's a great contribution to the research community!

    enhancement 
    opened by cadop 1
  • Image cannot be displayed contains error

    Describe the bug

    The photo with id sEDzxW4NhL4 has errors. While it can be accessed via its photo_url, https://unsplash.com/photos/sEDzxW4NhL4, it cannot be accessed via its photo_img_url, https://images.unsplash.com/photo-1586019496196-bdbea65add07

    To Reproduce

    https://images.unsplash.com/photo-1586019496196-bdbea65add07

    Expected behavior

    Additional context

    bug 
    opened by sniafas 1
  • Dataset of all Unsplash contributors

    Is your feature request related to a problem? Please describe.

    Since the Unsplash Slack server, as the only connection point of the community, isn't as populated as it could be, I think there could be another way of bringing the people of Unsplash together.

    Describe the solution you'd like

    With a dataset of all Unsplash contributors, it would be possible to create a map, giving them a chance to find other motivated photographers nearby. The dataset should contain the name, the location, the number of photos, the URL and maybe the linked website.

    Describe alternatives you've considered

    There used to be local Slack channels on the server for connecting with other people from the same country, but as far as I know they have been shut down.

    Additional context

    Nothing to add

    enhancement 
    opened by juppcamper 3
  • Publishing on Kaggle

    Hello,

    Thanks for sharing / publishing this dataset. Are there any plans for mirroring this dataset on Kaggle? If not, can I publish the Lite Dataset as a public dataset on Kaggle linking it back to Unsplash as the source and using the same licensing terms as here?

    It would be great to make this data available on Kaggle where a lot of ML research and models can be built.

    question 
    opened by vopani 3
Releases(1.2.0)
  • 1.2.0(Jul 30, 2021)

    New:

    Fix:

    • #39: Fixed AI confidence for AI keywords

    Data:

    • The collection_id field in the collections dataset can now be a string
    • Added more search conversions
    • Historical data for search conversions will now be limited to 1 year before each version's release date.
    • Added ~1M photos to the Full Dataset.
    • Replaced 794 dead photos (removed from Unsplash) in the Lite Dataset with approved photos

    Lite dataset link:

    Integrity checks (SHA-256):

    • Lite: 461fa4a1796b7966fc3aa904ce2e7f18890323243ed0e95f47c7042b335fcd98
    • Full: daa99dab8ba7a47d530356311ffa73f17eb403898a75399c54812e9dd582f8af
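A downloaded archive can be checked against these values with Python's standard hashlib module. This is a minimal sketch; the function names are hypothetical, and the archive path is whatever the file was saved as locally.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published checksum, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the 1.2.0 Lite archive, `matches_checksum(local_path, "461fa4a1796b7966fc3aa904ce2e7f18890323243ed0e95f47c7042b335fcd98")` should return True if the download is intact.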
  • 1.1.0(Sep 24, 2020)

    New:

    • #10: Included user descriptions of the photos
    • #21: Included width, height and aspect ratio of photos
    • #22: Included colors data from photos, coming from a 3rd party AI

    Fix:

    • #13: Trimmed some fields.
    • #29: Replaced newlines in the keywords file with spaces to avoid CSV import issues

    Data:

    • Replaced 307 deleted photos in the Lite dataset with new approved photos
    • Removed about 17k deleted photos in the Full dataset
    • Updated conversions data with latest conversions. Full dataset now weighs ~25GB (vs. 16GB)

    Datasets:

    Integrity checks (SHA-256):

    • Lite: 266e45a8658ab2456779b3376b109e435e595646126846603f2efee5b47ee526
    • Full: 19abc3494bda06e36e61ccabf4dd2ca8e046ac50a5e4e3570cc8aa89ed6a9713
  • 1.0.1(Aug 12, 2020)

    Fix:

    • #13: AI landmarks were always empty.
    • Some landmark names were blank instead of NULL
    • Removed duplicated photo zV2-QjJqkI in the Full Dataset

    Datasets:

    • Lite: Version 1.0.1
    • Lite dataset link now follows the pattern: https://unsplash.com/data/lite/{version}
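The link pattern above can be turned into a small helper. The sketch assumes that versions are plain MAJOR.MINOR.PATCH strings, in line with the semantic versioning mentioned in the overview, and that "latest" is accepted as an alias, as the /lite/latest link in the 1.0.1 notes suggests. The function name is hypothetical.

```python
import re

LITE_URL_PATTERN = "https://unsplash.com/data/lite/{version}"
# Assumption: releases are tagged as MAJOR.MINOR.PATCH, e.g. "1.0.1".
_SEMVER = re.compile(r"\d+\.\d+\.\d+")


def lite_dataset_url(version="latest"):
    """Build the Lite dataset download URL for a version tag or "latest"."""
    if version != "latest" and _SEMVER.fullmatch(version) is None:
        raise ValueError("expected MAJOR.MINOR.PATCH or 'latest', got %r" % version)
    return LITE_URL_PATTERN.format(version=version)
```

Pinning an explicit version such as `lite_dataset_url("1.0.1")` keeps an experiment reproducible, because "latest" changes with each release.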

    Integrity checks (SHA-256):

    • Lite: aa199951dd8756563f7ffef4abbc2d20c845bcff62241ae677af523728819d60
    • Full: ee47f7542e5ef260e6b904046b4837532f420412a0e2c299dcecab55acd28d1f
  • 1.0.0(Aug 6, 2020)

Owner
Unsplash
Building the internet's open library of freely usable visuals. Join us 👫