Good news! Making the most of AWS public datasets

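AWS provides S3 storage for resources that are open for public download, and the storage itself is free of charge! Examples we use all the time in bioinformatics include NCBI SRA, ENCODE, 1000 Genomes (1000g), the NIH Human Microbiome Project data, nanopore reference data, the 3000 Rice Genomes project, TCGA, and ICGC. What is the benefit? If I am an AWS user, I no longer need to download the data over the public internet, which everyone knows is painfully slow. I only need to know where the data lives on AWS, and then I can pull it over AWS's internal network straight onto the machine where I run the analysis. The speed is blazing!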

--- D.C

AWS public dataset

What is a public dataset

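Scenario: one difficulty in data analysis is that massive datasets are hard to store locally and take an extremely long time to download. For example, downloading 1 TB at 3 MB/s (the current average broadband speed in China) takes about 4 days. Some datasets are tens of terabytes, so the download alone would take months.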
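The arithmetic behind those figures can be checked with a small shell sketch. The function name `download_days` is made up for illustration, it assumes decimal units (1 TB = 1,000,000 MB), and the 30 TB size is only an example of a dataset in the tens of terabytes.

```shell
#!/usr/bin/env bash
# Estimate download time in days at a sustained transfer rate.
# Usage: download_days SIZE_TB RATE_MB_PER_S
# Assumes decimal units: 1 TB = 1,000,000 MB; 1 day = 86,400 s.
download_days() {
  awk -v tb="$1" -v rate="$2" \
    'BEGIN { printf "%.1f\n", tb * 1000000 / rate / 86400 }'
}

download_days 1 3    # 1 TB at 3 MB/s  -> 3.9 (about 4 days)
download_days 30 3   # 30 TB at 3 MB/s -> 115.7 (close to 4 months)
```

At that rate, every extra terabyte adds almost four days of pure transfer time, before any analysis starts.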

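To solve this problem, the AWS cloud platform hosts many commonly used large-scale datasets as Public Data Sets (https://aws.amazon.com/datasets), which can be used from Amazon EC2 without downloading them first.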

The goals are:

[Image: dataset_list]

Here are two of the datasets as examples:

How to use a dataset

[Image: dataset_1000g]

$ aws s3 ls s3://1000genomes
                           PRE 1000G_2504_high_coverage/
                           PRE alignment_indices/
                           PRE changelog_details/
                           PRE complete_genomics_indices/
                           PRE data/
                           PRE hgsv_sv_discovery/
                           PRE phase1/
                           PRE phase3/
                           PRE pilot_data/
                           PRE release/
                           PRE sequence_indices/
                           PRE technical/
2015-09-08 21:16:09       1663 20131219.populations.tsv
2015-09-08 21:17:01         97 20131219.superpopulations.tsv
2015-09-08 15:01:44     257098 CHANGELOG
2014-09-02 15:39:53      15977 README.alignment_data
2014-01-30 11:13:29       5289 README.analysis_history
2014-01-31 03:44:08       5967 README.complete_genomics_data
2014-08-29 00:22:38        563 README.crams
2013-08-06 16:11:58        935 README.ebi_aspera_info
2013-08-06 16:11:58       8408 README.ftp_structure
2014-09-02 21:19:43       2082 README.pilot_data
2014-09-03 12:33:15       1938 README.populations
2013-08-06 16:11:58       7857 README.sequence_data
2015-06-18 18:28:31        672 README_missing_files_20150612
2015-06-03 19:43:32        136 README_phase3_alignments_sequence_20150526
2015-06-18 16:34:45        273 README_phase3_data_move_20150612
2014-09-03 12:34:30    3579471 alignment.index
2014-09-03 12:32:59   54743580 analysis.sequence.index
2014-09-03 12:34:57    3549051 exome.alignment.index
2014-09-03 12:35:15   67069489 sequence.index

$ aws s3 sync s3://1000genomes/1000G_2504_high_coverage ./
Completed 320.2 MiB/~76.9 GiB (152.0 MiB/s) with ~5 file(s) remaining (calculating...)
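The commands above presumably ran on a machine with AWS credentials configured. As a hedged sketch: buckets that allow anonymous reads can usually be accessed with the AWS CLI's `--no-sign-request` flag, and `--dryrun` previews a `sync` before any data moves. The helper names (`run`, `list_bucket`, `fetch_file`, `preview_sync`) and the `DRY_RUN` variable are assumptions made for this example, not part of the AWS CLI. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: build AWS CLI commands for reading a public bucket anonymously.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
# Run with DRY_RUN=0 on a machine that has the AWS CLI and network access.
DRY_RUN="${DRY_RUN:-1}"
BUCKET="s3://1000genomes"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List the top level of the bucket without credentials.
list_bucket() {
  run aws s3 ls --no-sign-request "$BUCKET/"
}

# Copy a single object instead of syncing a whole prefix.
# $1 = key inside the bucket, $2 = local destination (default: current dir).
fetch_file() {
  run aws s3 cp --no-sign-request "$BUCKET/$1" "${2:-.}"
}

# Preview what syncing one prefix would transfer; --dryrun is the CLI's
# own preview flag and copies nothing.
preview_sync() {
  run aws s3 sync --no-sign-request --dryrun "$BUCKET/$1" "${2:-.}"
}

list_bucket
fetch_file README.crams          # small file shown in the listing above
preview_sync changelog_details ./changelog_details
```

Fetching one README or previewing a single prefix first is a cheap way to confirm access and estimate size before committing to a sync of tens of gigabytes.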

Usage examples for all datasets

How to publish my own dataset

[Image: dataset_apply]

Things to note

No reign lasts forever.

Some interesting datasets

Data from the 1980 US Census

Data from the 1990 US Census

Data from the 2000 US Census

US Economic Data for years 2003 to 2006

Census 2000 and Current United States shapefiles

3D Version of the PubChem Library

Anthropometric data on children from two studies in 1975 and 1977

A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011

US Business and Industry Summary Data

C57BL/6J by C3H/HeJ mouse cross from the Jake Lusis lab at UCLA

A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.

A collection of daily weather measurements (temperature, wind speed, humidity, pressure, etc.) from 9000+ weather stations around the world.

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web

The high-coverage genome sequence of a Denisovan individual sequenced to ~30x coverage on the Illumina platform. Together with their sister group the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.

Enron email data publicly released as part of FERC’s Western Energy Markets investigation converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.

Ensembl sequence databases of transcript and translation models

The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.

A data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov.

Database of 20,059 U.S. economic time series.

Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories

A data dump of all the current facts and assertions in Freebase

A data dump of the basic identifying facts about every topic in Freebase

An annotated collection of all publicly available DNA sequences including more than 85.7B bases and 82.8M sequence records.

A data set containing Google Books n-gram corpuses. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.

Human Liver Cohort characterizing gene expression in liver samples

Human Microbiome Project Data Set

Jay Flatley (CEO of Illumina) human genome data set.

NCBI Influenza Resource Center Data.

Multiple data sets including: (1) Population Census of Japan (1995, 2000, 2005, 2010), (2) Establishment and Enterprise Census of Japan (1999, 2001, 2004, 2006), and (3) Economic Census of Japan (2009).

Various Labor Statistics

NDT test results created through Measurement Lab (M-Lab) between February 2009 and September 2009

NPAD test results created through Measurement Lab (M-Lab) between February 2009 and September 2009

This dataset is an example of a social collaboration network based on the characters in The Marvel Universe, that is, the artificial world that takes place in the universe of the Marvel comic books.

230,000 Material Safety Data Sheets.

The Million Songs Collection is a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

This is a 10,000 song subset of audio features and metadata from the Million Songs collection – a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

A collection of data from the modENCODE project ( http://www.modencode.org )

Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.

A PostGIS 8.3 data cluster of all OpenStreetMap data for the planet.

Public-domain data for the oil & gas industry, assembled from the contributions of participating agencies in the United States, Canada and around the world. This data provides industry stakeholders with an opportunity to focus their efforts on the analysis and interpretation of this data without concern for the trivial and time-consuming tasks of locating, downloading, reformatting and integrating the data prior to value-added work being performed.

A data set of information on the biological activities of small molecules.

The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken.

Whole Genome Shotgun Sequencing of the Cannabis Sativa Cultivar “Chemdawg”

The WestburyLab USENET corpus is an anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010.

Various transportation statistics

Twilio/Wigle.net database of mapped US street names and address ranges.

UniGene: An Organized View of the Transcriptome.

The University of Florida Sparse Matrix Collection is a large, widely available, and actively growing set of sparse matrices that arise in real applications.

A processed dump of the English language Wikipedia

This dataset contains a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).

Contains 7 months of hourly pageview statistics for all articles in Wikipedia

Contains 16 months of hourly pageview statistics for all articles in Wikipedia

A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.

Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria