Advanced 11: The S3 Data Sync Powerhouse, Putting Data in Your Hands

Back in Advanced 9 I mentioned the S3 powerhouse, the Amazon S3 multithread resumable migration tool, and today I finally have time to introduce it. For us bioinformatics folks it is a huge blessing: overseas databases often run to gigabytes, and download speeds of a few KB/s used to drive us crazy. With this tool, those worries are over. -- D.C

The usual introduction: the tool's full name is Amazon S3 MultiThread Resume Migration Solution (link). Officially there is no abbreviation, so to make it easy to remember let's call it SMRMS. Easy to memorize: "SM" on both sides, with a person ("Ren", R) in the middle.

Applicable scenarios

There are three versions

It also supports S3 versioning, with both real-time triggering and scheduled scanning.

Highlights

In a typical test, migrating 1.2 TB of data from S3 in us-east-1 (US East) to S3 in cn-northwest-1 (Ningxia, China) took only one hour.
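As a quick sanity check on that headline number, 1.2 TB in one hour implies a sustained rate of roughly 333 MB/s:

```python
# Back-of-the-envelope check of the benchmark: 1.2 TB moved in 1 hour.
total_bytes = 1.2 * 10**12   # decimal TB, as storage sizes are usually quoted
seconds = 3600
mb_per_s = total_bytes / seconds / 10**6
gbit_per_s = total_bytes * 8 / seconds / 10**9
print(f"{mb_per_s:.0f} MB/s = {gbit_per_s:.1f} Gbit/s")  # 333 MB/s = 2.7 Gbit/s
```

In other words, the tool was saturating close to 3 Gbit/s of cross-border bandwidth, which single-threaded downloads rarely get anywhere near.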

Which version should I choose?

Question: first think about what your use case actually is. Do you need to migrate a large amount of data, or just occasionally grab a database from overseas?

Here is a quick summary:

Version / Scenario
Single-node version: one-off migration jobs; three modes: LOCAL_TO_S3 (local upload), S3_TO_S3 (light-to-medium, one-shot runs), ALIOSS_TO_S3 (Alibaba Cloud OSS to S3)
Cluster version: large numbers of files, single files from 0 bytes up to the TB range; scheduled scanning or real-time sync (S3 triggering SQS); supports triggering a transfer when a new file lands in S3, or having the Jobsender periodically scan existing S3 files
Serverless version: light-to-medium workloads (single file < 50 GB recommended), ad-hoc transfers or real-time sync; thanks to break-point resume and SQS re-driving, Lambda's 15-minute timeout is not a concern; supports triggering on new S3 files, or Jobsender periodic scans of existing S3 files
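The table can be boiled down to a rough rule of thumb (my own heuristic distilled from the table above, not official guidance):

```python
def pick_version(largest_file_gb: float, recurring: bool) -> str:
    """Rough heuristic for choosing an SMRMS deployment flavor."""
    if not recurring:
        return "single-node"   # one-off migration job
    if largest_file_gb < 50:
        return "serverless"    # light/medium files, Lambda-friendly
    return "cluster"           # huge files or heavy ongoing sync

# Occasionally pulling a ~10 GB database from overseas, like in this post:
print(pick_version(largest_file_gb=10, recurring=False))  # single-node
```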

For more information, see the link at the start of this post; the README written by the AWS architects is outrageously detailed!

Preparation: installing the tool

Because my need is simple, just downloading one overseas database, I chose the single-node version's S3_TO_S3 mode.

The figure below is the full overview of the single-node version.

[Figure: singlenode]

Requirements for the single-node version: Python 3.6 or above, with the AWS SDK for Python, boto3, installed (more on this below). If you also want to pull data from Alibaba Cloud, install the Alibaba Cloud Python SDK oss2 as well.

PS: The downloaded package contains a requirements.txt file; if in doubt, just run it through pip:

$ pip install -r requirements.txt --user
Requirement already satisfied: boto3 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (1.10.34)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.9.4)
Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (0.2.1)
Requirement already satisfied: botocore<1.14.0,>=1.13.34 in /usr/local/lib/python3.7/site-packages (from boto3->-r requirements.txt (line 3)) (1.13.34)
Requirement already satisfied: python-dateutil<2.8.1,>=2.1; python_version >= "2.7" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (2.8.0)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (0.15.2)
Requirement already satisfied: urllib3<1.26,>=1.20; python_version >= "3.4" in /usr/local/lib/python3.7/site-packages (from botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.25.7)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<2.8.1,>=2.1; python_version >= "2.7"->botocore<1.14.0,>=1.13.34->boto3->-r requirements.txt (line 3)) (1.13.0)

For my scenario, the basic approach is:

Of course this is fairly rough; the gurus in the group could make the whole thing one-click automated with the AWS SDK and the like. Today we just walk through the basic flow.

Launch an overseas VM

python3 --version
Python 3.7.8
[ec2-user@ip-172-xx-xx-xxx ~]$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************6BXX         iam-role
secret_key     ****************6qXX         iam-role
    region                us-west-2      config-file    ~/.aws/config

Install the SMRMS tool

#!/bin/bash -v
yum update -y
yum install git -y
yum install python3 -y
pip3 install boto3

# Setup BBR
echo "Setup BBR"
cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
#!/bin/bash
exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
exec /sbin/modprobe sch_fq >/dev/null 2>&1
EOF
chmod 755 /etc/sysconfig/modules/tcpcong.modules
echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf
modprobe tcp_bbr
modprobe sch_fq
sysctl -w net.ipv4.tcp_congestion_control=bbr


echo "Download application amazon-s3-resumable-upload.git"
cd /home/ec2-user/  || exit
git clone https://github.com/aws-samples/amazon-s3-resumable-upload
cd amazon-s3-resumable-upload/single_node || exit
chmod 755 ec2_init.sh
sudo bash ec2_init.sh
$ tree
.
├── cluster
│   ├── cdk-cluster
│   │   ├── app.py
│   │   ├── cdk
│   │   │   ├── cdk_ec2_stack.py
│   │   │   ├── cdk_resource_stack.py
│   │   │   ├── cdk_vpc_stack.py
│   │   │   ├── cw_agent_config.json
│   │   │   ├── __init__.py
│   │   │   ├── user_data_jobsender.sh
│   │   │   ├── user_data_part1.sh
│   │   │   ├── user_data_part2.sh
│   │   │   └── user_data_worker.sh
│   │   ├── cdk.context.json
│   │   ├── cdk.json
│   │   ├── code
│   │   │   ├── requirements.txt
│   │   │   ├── s3_migration_cluster_config.ini
│   │   │   ├── s3_migration_cluster_jobsender.py
│   │   │   ├── s3_migration_cluster_worker.py
│   │   │   ├── s3_migration_ignore_list.txt
│   │   │   └── s3_migration_lib.py
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── setup.py
│   │   └── source.bat
│   ├── img
│   │   ├── 02-jobsender.png
│   │   ├── 02-new.png
│   │   ├── 03.png
│   │   ├── 04.png
│   │   ├── 05.png
│   │   ├── 07.png
│   │   ├── 08.png
│   │   ├── 09.png
│   │   └── 0a.png
│   ├── old-cdk-cluster-0.96
│   │   ├── app.py
│   │   ├── cdk
│   │   │   ├── cdk_ec2_stack.py
│   │   │   ├── cdk_resource_stack.py
│   │   │   ├── cdk_vpc_stack.py
│   │   │   ├── cw_agent_config.json
│   │   │   ├── user_data_jobsender.sh
│   │   │   ├── user_data_part1.sh
│   │   │   ├── user_data_part2.sh
│   │   │   └── user_data_worker.sh
│   │   ├── cdk.context.json
│   │   ├── cdk.json
│   │   ├── code
│   │   │   ├── requirements.txt
│   │   │   ├── s3_migration_cluster_config.ini
│   │   │   ├── s3_migration_cluster_jobsender.py
│   │   │   ├── s3_migration_cluster_worker.py
│   │   │   ├── s3_migration_ignore_list.txt
│   │   │   └── s3_migration_lib.py
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── setup.py
│   │   └── source.bat
│   ├── README-English.md
│   └── README.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── img
│   ├── 01.png
│   └── 02.png
├── LICENSE
├── README.md
├── serverless
│   ├── cdk-serverless
│   │   ├── app.py
│   │   ├── cdk.context.json
│   │   ├── cdk.json
│   │   ├── lambda
│   │   │   ├── lambda_function_jobsender.py
│   │   │   ├── lambda_function_worker.py
│   │   │   └── s3_migration_lib.py
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── s3_migration_ignore_list.txt
│   │   ├── setup.py
│   │   └── source.bat
│   ├── img
│   │   ├── 01.png
│   │   ├── 02-jobsender.png
│   │   ├── 02-new.png
│   │   ├── 05.png
│   │   ├── 06.png
│   │   ├── 07b.png
│   │   └── 09.png
│   ├── old-cdk-serverless-0.96
│   │   ├── app.py
│   │   ├── cdk.context.json
│   │   ├── cdk.json
│   │   ├── lambda
│   │   │   ├── lambda_function_jobsender.py
│   │   │   ├── lambda_function_worker.py
│   │   │   └── s3_migration_lib.py
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── s3_migration_ignore_list.txt
│   │   ├── setup.py
│   │   └── source.bat
│   ├── README-English.md
│   └── README.md
├── single_node
│   ├── ec2_init.sh
│   ├── img
│   │   ├── img01.png
│   │   ├── img02.png
│   │   ├── img03.png
│   │   ├── img04.png
│   │   ├── img05.png
│   │   └── img06.png
│   ├── os_x
│   ├── README.md
│   ├── requestPayer-exampleCodeFrom-\344\270\201\345\217\257_s3_download.py
│   ├── requirements.txt
│   ├── s3_download_config.ini
│   ├── s3_download_config.ini.default
│   ├── s3_download.py
│   ├── s3_upload_config.ini
│   ├── s3_upload_config.ini.default
│   ├── s3_upload.py
│   └── win64
│       ├── s3_download.zip
│       └── s3_upload.zip
└── tools
    ├── analystic_dynamodb_table.py
    ├── clean_unfinished_multipart_upload.py
    └── README.md

20 directories, 112 files

The key part! Configuring SMRMS

Since my scenario is transferring between AWS overseas and AWS China, I can enable BBR to improve transfer efficiency. What is BBR? TCP Bottleneck Bandwidth and RTT: a congestion-control feature supported by the Amazon Linux AMI kernel but disabled by default. Enable it as follows:

$ sudo modprobe tcp_bbr

$ sudo modprobe sch_fq

$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

To enable it permanently:

$ sudo su -

$ cat <<EOF>> /etc/sysconfig/modules/tcpcong.modules
> #!/bin/bash
> exec /sbin/modprobe tcp_bbr >/dev/null 2>&1
> exec /sbin/modprobe sch_fq >/dev/null 2>&1
> EOF

$ chmod 755 /etc/sysconfig/modules/tcpcong.modules

$ echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/00-tcpcong.conf

Since I am transferring from an overseas S3 bucket to a China-region S3 bucket, two accounts are involved, so two sets of AK/SK credentials need to be configured.

$ vi ~/.aws/credentials
[ningxia]
region=cn-northwest-1
aws_access_key_id=xxxxxxxxxxxxxxx
aws_secret_access_key=xxxxxxxxxxxxxxxxxxxx


[oregon]
region=us-west-2
aws_access_key_id = xxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxx
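As a quick sanity check that both profiles parse correctly (a minimal standard-library sketch; boto3 simply looks these profiles up by name in ~/.aws/credentials):

```python
import configparser

# Sample mirroring the ~/.aws/credentials content above (keys redacted)
sample = """\
[ningxia]
region=cn-northwest-1
aws_access_key_id=xxxxxxxxxxxxxxx
aws_secret_access_key=xxxxxxxxxxxxxxxxxxxx

[oregon]
region=us-west-2
aws_access_key_id = xxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxx
"""

config = configparser.ConfigParser()
config.read_string(sample)
# Both profile names that SMRMS will refer to should be present
print(sorted(config.sections()))  # ['ningxia', 'oregon']
```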

Go into the /home/ec2-user/amazon-s3-resumable-upload/single_node folder, open s3_upload_config.ini in vim, and edit mainly the following parameters:

[Basic]
JobType = S3_TO_S3  # Change this
# 'LOCAL_TO_S3' | 'S3_TO_S3' | 'ALIOSS_TO_S3'

DesBucket = mybucket  # Change this
# Destination S3 bucket name, type = str

S3Prefix = test  # Change this; if syncing a particular folder in the bucket, put that folder name here
# S3_TO_S3 mode: source S3 prefix (same as the destination prefix); LOCAL_TO_S3 mode: destination S3 prefix. type = str

SrcFileIndex = *
# Specify the file name to upload, type = str. Use wildcard "*" to upload everything.

DesProfileName = ningxia  # Change this to the China-region credentials profile name; in this post it is ningxia
# Profile name configured in ~/.aws/credentials that can access the destination S3

[LOCAL_TO_S3]
SrcDir = d:\mydir
# Local source file directory, type = str. Unused in S3_TO_S3 mode.

[S3_TO_S3]
SrcBucket = mybucket  # Change this
# Source bucket name. Unused in LOCAL_TO_S3 mode.

SrcProfileName = oregon  # Change this to the overseas credentials profile name; in this post it is oregon
# Profile name configured in ~/.aws/credentials that can access the source S3. Unused in LOCAL_TO_S3 mode.

[ALIOSS_TO_S3] # If migrating from Alibaba Cloud OSS to AWS China, set the Alibaba AK/SK here as well
ali_SrcBucket = img-process
ali_access_key_id = xxxx
ali_access_key_secret = xxx
ali_endpoint = oss-cn-beijing.aliyuncs.com

[Advanced]
ChunkSize = 5
# File chunk size in MB, not less than 5 MB, type = int. A single file may have at most 10,000 parts (an S3 multipart upload API limit), so the application auto-adjusts this value to the file size; you normally don't need to change it.

MaxRetry = 20
# Max retry count when an S3 API call fails, type = int

MaxThread = 5
# Max concurrent threads for ONE file, type = int

MaxParallelFile = 5
# Max files running in parallel, type = int; total concurrency = MaxParallelFile * MaxThread

StorageClass = STANDARD  # Pick according to the files you are syncing; this post downloads a database, so keep the default
# 'STANDARD'|'REDUCED_REDUNDANCY'|'STANDARD_IA'|'ONEZONE_IA'|'INTELLIGENT_TIERING'|'GLACIER'|'DEEP_ARCHIVE'

ifVerifyMD5 = False
# Whether to do a second, whole-file MD5 check.
# If True, after a file's parts are merged, the whole file's ETag MD5 is verified a second time.
# In S3_TO_S3 mode, True forces re-downloading all already-transferred parts on break-point resume to calculate the MD5, but parts already uploaded are not re-uploaded.
# In LOCAL_TO_S3 mode, True re-reads the whole file after the upload finishes and compares the local MD5 with the S3 ETag.
# This switch does not affect per-part verification: even when False, every part's MD5 is still verified on upload.

DontAskMeToClean = True
# If True: when an unfinished multipart upload exists on the destination S3, don't ask whether to clean up the unfinished parts; move on and resume the break-point upload.

LoggingLevel = INFO
# 'WARNING' | 'INFO' | 'DEBUG'

Save and quit.
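To see why the tool auto-adjusts ChunkSize: S3 multipart upload caps a file at 10,000 parts of at least 5 MB each, so large files force larger chunks. A hypothetical sketch of that adjustment (my own illustration, not the tool's actual code):

```python
import math

MIN_CHUNK_MB = 5     # S3 minimum part size
MAX_PARTS = 10_000   # S3 multipart upload part-count limit

def effective_chunk_mb(file_size_mb: int, configured_mb: int = 5) -> int:
    """Smallest chunk (MB) >= the configured value that keeps parts <= 10,000."""
    forced = math.ceil(file_size_mb / MAX_PARTS)  # chunk size forced by the part cap
    return max(configured_mb, forced, MIN_CHUNK_MB)

# A 1 TiB file (1,048,576 MB) cannot use 5 MB chunks: that would need ~210,000 parts.
print(effective_chunk_mb(1_048_576))  # 105
```

This is why the docs say you normally don't need to touch ChunkSize: the 5 MB default already works for anything up to about 50 GB, and bigger files get a larger chunk automatically.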

Run the SMRMS sync program

$ python3 /home/ec2-user/amazon-s3-resumable-upload/single_node/s3_upload.py --nogui
Reading config file: s3_upload_config.ini
Logging to file: /home/ec2-user/amazon-s3-resumable-upload/single_node/log/s3_upload-2020-08-23T13-06-09.log
Logging level: INFO
2020-08-23 13:06:09,201 INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-08-23 13:06:09,242 INFO - Checking write permission for: davischen
2020-08-23 13:06:10,668 INFO - Get source file list
2020-08-23 13:06:10,686 INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-08-23 13:06:10,708 INFO - Get s3 file list davischen
2020-08-23 13:06:10,764 INFO - Bucket list length:4
2020-08-23 13:06:10,765 INFO - Get s3 file list davischen
2020-08-23 13:06:11,028 INFO - Bucket list length:9
2020-08-23 13:06:11,028 INFO - Get unfinished multipart upload
2020-08-23 13:06:11,823 INFO - Start file: test/
2020-08-23 13:06:11,823 INFO - Start file: test/AtomSetup-x64.exe
2020-08-23 13:06:11,823 INFO - Duplicated. test/ same size, goto next file.
2020-08-23 13:06:11,823 INFO - Duplicated. test/AtomSetup-x64.exe same size, goto next file.
2020-08-23 13:06:11,824 INFO - Start file: test/test.1a.zip
2020-08-23 13:06:11,824 INFO - Start file: test/nt.26.tar.gz
2020-08-23 13:06:11,825 INFO - New upload: test/test.1a.zip
2020-08-23 13:06:11,825 INFO - Duplicated. test/nt.26.tar.gz same size, goto next file.
--->Downloading test/test.1a.zip - 1/5786
--->Downloading test/test.1a.zip - 2/5786
--->Downloading test/test.1a.zip - 3/5786
--->Downloading test/test.1a.zip - 4/5786
--->Downloading test/test.1a.zip - 5/5786
    --->Uploading test/test.1a.zip - 1/5786
    --->Uploading test/test.1a.zip - 4/5786
    --->Uploading test/test.1a.zip - 3/5786
    --->Uploading test/test.1a.zip - 2/5786
    --->Uploading test/test.1a.zip - 5/5786
        --->Complete test/test.1a.zip - 5/5786 0.02% - 1.7 MB/s
--->Downloading test/test.1a.zip - 6/5786
        --->Complete test/test.1a.zip - 2/5786 0.03% - 1.6 MB/s
--->Downloading test/test.1a.zip - 7/5786
    --->Uploading test/test.1a.zip - 6/5786
    --->Uploading test/test.1a.zip - 7/5786
        --->Complete test/test.1a.zip - 6/5786 0.05% - 5.4 MB/s
--->Downloading test/test.1a.zip - 8/5786
    --->Uploading test/test.1a.zip - 8/5786
        --->Complete test/test.1a.zip - 7/5786 0.07% - 4.5 MB/s
--->Downloading test/test.1a.zip - 9/5786
    --->Uploading test/test.1a.zip - 9/5786
        --->Complete test/test.1a.zip - 8/5786 0.09% - 5.6 MB/s
--->Downloading test/test.1a.zip - 10/5786
        --->Complete test/test.1a.zip - 1/5786 0.10% - 1.0 MB/s
--->Downloading test/test.1a.zip - 11/5786
        --->Complete test/test.1a.zip - 9/5786 0.12% - 6.2 MB/s
--->Downloading test/test.1a.zip - 12/5786
    --->Uploading test/test.1a.zip - 10/5786
    --->Uploading test/test.1a.zip - 11/5786
    --->Uploading test/test.1a.zip - 12/5786
        --->Complete test/test.1a.zip - 10/5786 0.14% - 5.2 MB/s
--->Downloading test/test.1a.zip - 13/5786
        --->Complete test/test.1a.zip - 12/5786 0.16% - 5.5 MB/s
...

MISSION ACCOMPLISHED - Time: 0:31:39.284875  - FROM: mybucket/test TO mybucket/test

Closing

Our greatest weakness lies in craving the approval of others.