- 1. Format Versioning
- 2. Overview
- 2.1 Sequence Numbers
- 2.2 Row-level Deletes
- 3. Specification
- 3.1 Writer and reader compatibility requirements for v1 and v2
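The table format has two versions, version 1 and version 2. The version is set with the format-version parameter; the default is 1.
Version 1: analytic data tables
Uses immutable file formats: Parquet, Avro, and ORC.
Version 2: row-level updates and deletes
Updates and deletes add immutable delete files that mark rows as updated or deleted. This version also places stricter requirements on writers; see the relevant section for details.
A version 1 table can be upgraded to version 2, with the following caveats:
- A version 1 reader fails when it reads version 2 features
- A version 1 writer can write data to a version 2 table
- A version 2 reader can read version 1 data and metadata, but the new features are represented by default field values; the relevant section describes these defaults
Let's first look at a metadata example from a Hadoop catalog. Part of vN.metadata.json is shown below: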
...... (part omitted) ......
"current-snapshot-id" : 4792595715782867813,
"snapshots" : [ {
"snapshot-id" : 3105878303282846379,
"timestamp-ms" : 1645089884916,
"summary" : {
"operation" : "append",
"flink.job-id" : "5f0ae342d300cfb6f2239425144aaa10",
"flink.max-committed-checkpoint-id" : "9223372036854775807",
"added-data-files" : "2",
"added-records" : "2",
"added-files-size" : "2487",
"changed-partition-count" : "2",
"total-records" : "2",
"total-files-size" : "2487",
"total-data-files" : "2",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-3105878303282846379-1-4203a078-6b68-403f-967f-a24c9195c009.avro",
"schema-id" : 0
}, {
"snapshot-id" : 781839495286765092,
"parent-snapshot-id" : 3105878303282846379,
"timestamp-ms" : 1645089898911,
"summary" : {
"operation" : "append",
"flink.job-id" : "51088171a7dc3266e29826c656d9e998",
"flink.max-committed-checkpoint-id" : "9223372036854775807",
"added-data-files" : "2",
"added-records" : "2",
"added-files-size" : "2487",
"changed-partition-count" : "2",
"total-records" : "4",
"total-files-size" : "4974",
"total-data-files" : "4",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-781839495286765092-1-6f23a7fd-e467-4153-a182-7e96cda61074.avro",
"schema-id" : 0
}, {
"snapshot-id" : 242516093407225541,
"parent-snapshot-id" : 781839495286765092,
"timestamp-ms" : 1645089917661,
"summary" : {
"operation" : "overwrite",
"replace-partitions" : "true",
"flink.job-id" : "4b805620d9f0b7065e9f70b757ddeafc",
"flink.max-committed-checkpoint-id" : "9223372036854775807",
"added-data-files" : "1",
"deleted-data-files" : "2",
"added-records" : "1",
"deleted-records" : "2",
"added-files-size" : "1244",
"removed-files-size" : "2459",
"changed-partition-count" : "1",
"total-records" : "3",
"total-files-size" : "3759",
"total-data-files" : "3",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-242516093407225541-1-4bbf0565-406b-428a-aad7-e32993df0fef.avro",
"schema-id" : 0
}, {
"snapshot-id" : 6992061996353419515,
"parent-snapshot-id" : 242516093407225541,
"timestamp-ms" : 1645089922843,
"summary" : {
"operation" : "overwrite",
"replace-partitions" : "true",
"flink.job-id" : "d02231da8d60b803e63006193384da23",
"flink.max-committed-checkpoint-id" : "9223372036854775807",
"added-data-files" : "1",
"deleted-data-files" : "1",
"added-records" : "1",
"deleted-records" : "1",
"added-files-size" : "1251",
"removed-files-size" : "1244",
"changed-partition-count" : "1",
"total-records" : "3",
"total-files-size" : "3766",
"total-data-files" : "3",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-6992061996353419515-1-9c8e870b-7e7b-49f0-96b4-be695105de9f.avro",
"schema-id" : 0
}, {
"snapshot-id" : 4375332003125072899,
"parent-snapshot-id" : 6992061996353419515,
"timestamp-ms" : 1645090105742,
"summary" : {
"operation" : "replace",
"added-data-files" : "1",
"deleted-data-files" : "2",
"added-records" : "2",
"deleted-records" : "2",
"added-files-size" : "1422",
"removed-files-size" : "2515",
"changed-partition-count" : "1",
"total-records" : "3",
"total-files-size" : "2673",
"total-data-files" : "2",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-4375332003125072899-1-7bab7463-2cd5-4bae-b59e-cd57d19da4f0.avro",
"schema-id" : 0
}, {
"snapshot-id" : 4792595715782867813,
"parent-snapshot-id" : 4375332003125072899,
"timestamp-ms" : 1645090219342,
"summary" : {
"operation" : "append",
"flink.job-id" : "42464deec75c72c5fa376544e2369f20",
"flink.max-committed-checkpoint-id" : "9223372036854775807",
"added-data-files" : "1",
"added-records" : "1",
"added-files-size" : "1258",
"changed-partition-count" : "1",
"total-records" : "4",
"total-files-size" : "3931",
"total-data-files" : "3",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-4792595715782867813-1-d62cb29a-4a8d-4cce-a58f-151606761101.avro",
"schema-id" : 0
} ],
"snapshot-log" : [ {
"timestamp-ms" : 1645089884916,
"snapshot-id" : 3105878303282846379
}, {
"timestamp-ms" : 1645089898911,
"snapshot-id" : 781839495286765092
}, {
"timestamp-ms" : 1645089917661,
"snapshot-id" : 242516093407225541
}, {
"timestamp-ms" : 1645089922843,
"snapshot-id" : 6992061996353419515
}, {
"timestamp-ms" : 1645090105742,
"snapshot-id" : 4375332003125072899
}, {
"timestamp-ms" : 1645090219342,
"snapshot-id" : 4792595715782867813
} ]
...... (part omitted) ......
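The snapshots in the metadata form a chain through parent-snapshot-id, ending at current-snapshot-id. As a minimal sketch (not Iceberg's own code), the Python below walks that chain on a trimmed copy of the JSON above; the function name `snapshot_lineage` and the trimmed `METADATA_JSON` string are introduced here purely for illustration.

```python
import json

# Trimmed copy of the metadata above: only the fields the walk needs.
METADATA_JSON = """
{
  "current-snapshot-id": 4792595715782867813,
  "snapshots": [
    {"snapshot-id": 3105878303282846379, "timestamp-ms": 1645089884916},
    {"snapshot-id": 781839495286765092, "parent-snapshot-id": 3105878303282846379, "timestamp-ms": 1645089898911},
    {"snapshot-id": 242516093407225541, "parent-snapshot-id": 781839495286765092, "timestamp-ms": 1645089917661},
    {"snapshot-id": 6992061996353419515, "parent-snapshot-id": 242516093407225541, "timestamp-ms": 1645089922843},
    {"snapshot-id": 4375332003125072899, "parent-snapshot-id": 6992061996353419515, "timestamp-ms": 1645090105742},
    {"snapshot-id": 4792595715782867813, "parent-snapshot-id": 4375332003125072899, "timestamp-ms": 1645090219342}
  ]
}
"""


def snapshot_lineage(metadata):
    """Return snapshot IDs from the current snapshot back to the first one."""
    by_id = {s["snapshot-id"]: s for s in metadata["snapshots"]}
    lineage = []
    current = metadata.get("current-snapshot-id")
    while current is not None:
        lineage.append(current)
        # The first snapshot has no parent-snapshot-id, which ends the walk.
        current = by_id[current].get("parent-snapshot-id")
    return lineage


metadata = json.loads(METADATA_JSON)
lineage = snapshot_lineage(metadata)
oldest_first = list(reversed(lineage))

by_id = {s["snapshot-id"]: s for s in metadata["snapshots"]}
timestamps = [by_id[i]["timestamp-ms"] for i in oldest_first]

# Compare each sequence with its sorted copy to see which one is ordered.
ids_increase = oldest_first == sorted(oldest_first)
timestamps_increase = timestamps == sorted(timestamps)
print(len(lineage), ids_increase, timestamps_increase)
```

Comparing the two flags shows whether snapshot IDs or timestamps follow commit order in this particular table.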
The files in the metadata directory are listed below:
[root@hive1 ~]# hadoop fs -ls /user/iceberg/warehouse/iceberg_db/my_user/metadata
Found 25 items
-rw-r--r-- 1 root supergroup 6475 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/4203a078-6b68-403f-967f-a24c9195c009-m0.avro
-rw-r--r-- 1 root supergroup 6491 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/4bbf0565-406b-428a-aad7-e32993df0fef-m0.avro
-rw-r--r-- 1 root supergroup 6490 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/4bbf0565-406b-428a-aad7-e32993df0fef-m1.avro
-rw-r--r-- 1 root supergroup 6422 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/4bbf0565-406b-428a-aad7-e32993df0fef-m2.avro
-rw-r--r-- 1 root supergroup 6474 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/6f23a7fd-e467-4153-a182-7e96cda61074-m0.avro
-rw-r--r-- 1 root supergroup 6423 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/7bab7463-2cd5-4bae-b59e-cd57d19da4f0-m0.avro
-rw-r--r-- 1 root supergroup 6422 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/7bab7463-2cd5-4bae-b59e-cd57d19da4f0-m1.avro
-rw-r--r-- 1 root supergroup 6432 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/7bab7463-2cd5-4bae-b59e-cd57d19da4f0-m2.avro
-rw-r--r-- 1 root supergroup 6423 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/9c8e870b-7e7b-49f0-96b4-be695105de9f-m0.avro
-rw-r--r-- 1 root supergroup 6423 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/9c8e870b-7e7b-49f0-96b4-be695105de9f-m1.avro
-rw-r--r-- 1 root supergroup 6424 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/d62cb29a-4a8d-4cce-a58f-151606761101-m0.avro
-rw-r--r-- 1 root supergroup 3836 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-242516093407225541-1-4bbf0565-406b-428a-aad7-e32993df0fef.avro
-rw-r--r-- 1 root supergroup 3792 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-3105878303282846379-1-4203a078-6b68-403f-967f-a24c9195c009.avro
-rw-r--r-- 1 root supergroup 3902 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-4375332003125072899-1-7bab7463-2cd5-4bae-b59e-cd57d19da4f0.avro
-rw-r--r-- 1 root supergroup 3928 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-4792595715782867813-1-d62cb29a-4a8d-4cce-a58f-151606761101.avro
-rw-r--r-- 1 root supergroup 3892 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-6992061996353419515-1-9c8e870b-7e7b-49f0-96b4-be695105de9f.avro
-rw-r--r-- 1 root supergroup 3864 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-781839495286765092-1-6f23a7fd-e467-4153-a182-7e96cda61074.avro
-rw-r--r-- 1 root supergroup 2115 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v1.metadata.json
-rw-r--r-- 1 root supergroup 3141 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v2.metadata.json
-rw-r--r-- 1 root supergroup 4197 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v3.metadata.json
-rw-r--r-- 1 root supergroup 5395 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v4.metadata.json
-rw-r--r-- 1 root supergroup 6597 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v5.metadata.json
-rw-r--r-- 1 root supergroup 7634 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v6.metadata.json
-rw-r--r-- 1 root supergroup 8694 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/v7.metadata.json
-rw-r--r-- 1 root supergroup 1 2022-02-17 17:34 /user/iceberg/warehouse/iceberg_db/my_user/metadata/version-hint.text
[root@hive1 ~]#
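We can observe the following:
- A newer snapshot-id is not necessarily larger than an older one, but a newer snapshot's timestamp-ms is later than an older snapshot's
- A manifest list file is named snap-{snapshot-id}-{a number}-{the manifest list's own identifier, a UUID}.avro
- A manifest file is named {the manifest list's identifier}-mN.avro
The Avro files can be downloaded to a local machine and converted to JSON for inspection with java -jar avro-tools-1.11.0.jar tojson xxx.avro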
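The naming patterns above can be checked mechanically. The sketch below is an illustration only: the regular expressions and the helper `group_by_identifier` are assumptions written to match the listing, not part of Iceberg. It groups manifest files under the manifest list that shares their identifier. It matches on file names alone, so it says nothing about which manifests a manifest list actually references; that requires reading the Avro content.

```python
import re
from collections import defaultdict

# Patterns inferred from the directory listing above (assumptions).
SNAP_RE = re.compile(r"^snap-(\d+)-(\d+)-([0-9a-f-]{36})\.avro$")
MANIFEST_RE = re.compile(r"^([0-9a-f-]{36})-m(\d+)\.avro$")

# A few file names copied from the listing.
FILES = [
    "4203a078-6b68-403f-967f-a24c9195c009-m0.avro",
    "4bbf0565-406b-428a-aad7-e32993df0fef-m0.avro",
    "4bbf0565-406b-428a-aad7-e32993df0fef-m1.avro",
    "4bbf0565-406b-428a-aad7-e32993df0fef-m2.avro",
    "snap-242516093407225541-1-4bbf0565-406b-428a-aad7-e32993df0fef.avro",
    "snap-3105878303282846379-1-4203a078-6b68-403f-967f-a24c9195c009.avro",
    "v1.metadata.json",
]


def group_by_identifier(names):
    """Group manifest lists and manifest files by the identifier in their names."""
    groups = defaultdict(lambda: {"snapshot_id": None, "manifests": []})
    for name in names:
        snap = SNAP_RE.match(name)
        if snap:
            groups[snap.group(3)]["snapshot_id"] = int(snap.group(1))
            continue
        manifest = MANIFEST_RE.match(name)
        if manifest:
            groups[manifest.group(1)]["manifests"].append(name)
        # Anything else (vN.metadata.json, version-hint.text) is ignored.
    return dict(groups)


groups = group_by_identifier(FILES)
for identifier, info in sorted(groups.items()):
    print(identifier, info["snapshot_id"], len(info["manifests"]))
```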
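There are two ways to delete rows:
- Position deletes: a row is deleted by the path of its data file and the row's position within that data file
- Equality deletes: one or more rows are deleted by the values of one or more columns, for example colA = some_value
Delete files are also stored by partition. Because a manifest file holds per-column statistics for each data file (such as a column's value count, null count, and lower and upper bounds), it is possible to work out which data files a delete file applies to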
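To make the two delete styles concrete, here is a small in-memory sketch. It is a simplification under stated assumptions: rows are plain dicts, the file paths are made up, and the helper names (`apply_position_deletes`, `apply_equality_deletes`, `may_contain`) are hypothetical. It also ignores everything else a real reader has to consider when applying delete files.

```python
def apply_position_deletes(path, rows, position_deletes):
    """Drop rows whose (data file path, position) pair appears in the deletes."""
    deleted = {pos for file_path, pos in position_deletes if file_path == path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]


def apply_equality_deletes(rows, equality_deletes):
    """Drop rows that match every column value of at least one equality delete."""
    def matches(row, predicate):
        return all(row.get(col) == val for col, val in predicate.items())

    return [r for r in rows if not any(matches(r, d) for d in equality_deletes)]


def may_contain(column_bounds, column, value):
    """Use a data file's lower/upper bounds to decide whether a value can be in it."""
    lower, upper = column_bounds[column]
    return lower <= value <= upper


DATA_FILE = "data/file-a.parquet"  # hypothetical path
ROWS = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

# Position delete: remove position 0 of DATA_FILE; the other entry targets another file.
after_position = apply_position_deletes(
    DATA_FILE, ROWS, [(DATA_FILE, 0), ("data/file-b.parquet", 1)]
)

# Equality delete: remove every row where name = 'b'.
after_equality = apply_equality_deletes(ROWS, [{"name": "b"}])

# Statistics check: with id bounds (1, 3), a delete on id = 7 cannot affect this file.
bounds = {"id": (1, 3)}
print(after_position, after_equality, may_contain(bounds, "id", 7))
```

The bounds check is the part that corresponds to the column statistics mentioned above: a data file whose bounds exclude the deleted value can be skipped without reading it.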
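3. Specification
3.1 Writer and reader compatibility requirements for v1 and v2
When a writer writes metadata files, whether a metadata field must be written follows the table below:

| Requirement | Write behavior |
|---|---|
| (blank) | The field is not written |
| optional | The field may or may not be written |
| required | The field must be written |

A version 2 reader can read version 1 manifest files and manifest lists. When it does, it behaves as follows:

| v1 | v2 | Version 2 reader behavior |
|---|---|---|
| (blank) | optional | The field may be skipped, or is read as NULL |
| (blank) | required | The field is read as NULL |
| optional | (blank) | The field is not read |
| optional | optional | One of: the field is not read; a non-NULL value is read; NULL is read |
| optional | required | A non-NULL value is read, or NULL is read |
| required | (blank) | The field is not read |
| required | optional | The field is not read, or a non-NULL value is read |
| required | required | A non-NULL value is read |

A version 2 reader is stricter when it reads a version 1 vN.metadata.json. If a field is required in v2 but is blank in v1, or is optional in v1 and absent, the version 2 reader throws an exception. For this reason, vN.metadata.json files are not interchangeable between version 1 and version 2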
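The difference between the lenient handling of manifests and the strict handling of vN.metadata.json can be sketched as follows. This is an illustration under assumptions, not Iceberg's implementation: `V2_REQUIRED_FIELDS` is a small illustrative subset (check the Iceberg spec for the authoritative list), and the function and exception names are hypothetical.

```python
# Illustrative subset of fields assumed to be required in v2 table metadata.
# Consult the Iceberg spec for the authoritative list.
V2_REQUIRED_FIELDS = ("table-uuid", "last-sequence-number")


class MetadataValidationError(ValueError):
    """Raised when table metadata lacks a field the reader treats as required."""


def read_manifest_field(entry, name):
    """Lenient read, as for manifests: a missing field is returned as None (NULL)."""
    return entry.get(name)


def validate_as_v2(metadata, required=V2_REQUIRED_FIELDS):
    """Strict read, as for vN.metadata.json: a missing required field is an error."""
    missing = [field for field in required if metadata.get(field) is None]
    if missing:
        raise MetadataValidationError(
            "missing required v2 fields: " + ", ".join(missing)
        )
    return metadata


# A v1-style document that never wrote the v2-only fields.
v1_style = {"format-version": 1, "location": "hdfs://nnha/user/iceberg/warehouse"}

# Lenient path: the absent field simply reads as None.
lenient_value = read_manifest_field(v1_style, "last-sequence-number")

# Strict path: the same absence is rejected.
try:
    validate_as_v2(v1_style)
    strict_error = None
except MetadataValidationError as exc:
    strict_error = str(exc)

print(lenient_value, strict_error)
```

Keeping the required-field list as a parameter makes it easy to swap in the real list once it has been checked against the spec.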