- 1. Key Generation
- 1.1 SimpleKeyGenerator
- 1.2 ComplexKeyGenerator
- 1.3 NonPartitionedKeyGenerator
- 1.4 CustomKeyGenerator
- 1.5 TimestampBasedKeyGenerator
- 2. Concurrency Control
Hudi provides several key generators. The configuration options common to all of them are:
| Config | Meaning/Purpose |
| --- | --- |
| hoodie.datasource.write.recordkey.field | Record key field(s) of the data; required |
| hoodie.datasource.write.partitionpath.field | Partition field(s) of the data; required |
| hoodie.datasource.write.keygenerator.class | Fully qualified key generator class name; required |
| hoodie.datasource.write.partitionpath.urlencode | Defaults to false; if true, the partition path is URL-encoded |
| hoodie.datasource.write.hive_style_partitioning | Defaults to false, in which case the partition path is just partition_field_value; if true, it is partition_field_name=partition_field_value |

1.1 SimpleKeyGenerator

Converts a single column to a string and uses it as the partition path.
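As a rough illustration of how those configs interact, here is a minimal Python sketch (the function name and row representation are mine, not Hudi's) of how SimpleKeyGenerator would derive a partition path from one column, including the url-encode and hive-style options:

```python
from urllib.parse import quote

def simple_partition_path(row, partition_field, hive_style=False, url_encode=False):
    """Sketch of SimpleKeyGenerator's partition path: one column,
    converted to a string (hypothetical helper, for illustration only)."""
    value = str(row[partition_field])
    if url_encode:
        # hoodie.datasource.write.partitionpath.urlencode=true
        value = quote(value, safe="")
    # hoodie.datasource.write.hive_style_partitioning=true prefixes the field name
    return f"{partition_field}={value}" if hive_style else value

row = {"id": 1, "region": "us-west"}
print(simple_partition_path(row, "region"))                   # us-west
print(simple_partition_path(row, "region", hive_style=True))  # region=us-west
```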
1.2 ComplexKeyGenerator

Both the record key and the partition path can be built from one or more fields, separated by commas. For example: "hoodie.datasource.write.recordkey.field" : "col1,col3"
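A minimal sketch of how a multi-field record key could be assembled (the function name and the field:value join format are illustrative; the format matches the CustomKeyGenerator example later in this post):

```python
def complex_record_key(row, recordkey_fields):
    """Sketch: join several field:value pairs into one record key,
    taking the comma-separated field list from the config value."""
    fields = [f.strip() for f in recordkey_fields.split(",")]
    return ",".join(f"{f}:{row[f]}" for f in fields)

row = {"col1": "value1", "col2": "x", "col3": "value3"}
print(complex_record_key(row, "col1,col3"))  # col1:value1,col3:value3
```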
1.3 NonPartitionedKeyGenerator

If the table is not partitioned, use NonPartitionedKeyGenerator, which generates an empty "" partition.
1.4 CustomKeyGenerator

Allows SimpleKeyGenerator, ComplexKeyGenerator, and TimestampBasedKeyGenerator to be combined:
- Specify the key generator class:
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
- Specify the record key field(s), which can follow SimpleKeyGenerator or ComplexKeyGenerator semantics:
hoodie.datasource.write.recordkey.field=col1,col3
The generated record key has the format: col1:value1,col3:value3
- Specify the partition path in the format "field1:PartitionKeyType1,field2:PartitionKeyType2,…", where each PartitionKeyType is either simple or timestamp:
hoodie.datasource.write.partitionpath.field=col2:simple,col4:timestamp
The partition path created on HDFS is: value2/value4
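The per-field logic above can be sketched in Python as follows. This is illustrative only: the function is mine, and for simplicity the timestamp type is assumed to receive an epoch-seconds value and a fixed output format, whereas the real CustomKeyGenerator delegates to TimestampBasedKeyGenerator and its configs:

```python
from datetime import datetime, timezone

def custom_partition_path(row, partitionpath_spec, output_fmt="%Y-%m-%d"):
    """Sketch of CustomKeyGenerator's partition path: each field in the
    spec carries a PartitionKeyType (simple or timestamp), and the
    resulting values are joined with '/'."""
    parts = []
    for spec in partitionpath_spec.split(","):
        field, key_type = spec.split(":")
        value = row[field]
        if key_type == "timestamp":
            # assumption: the column holds epoch seconds
            value = datetime.fromtimestamp(value, tz=timezone.utc).strftime(output_fmt)
        parts.append(str(value))
    return "/".join(parts)

row = {"col2": "value2", "col4": 1578283932}
print(custom_partition_path(row, "col2:simple,col4:timestamp"))
# value2/2020-01-06
```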
1.5 TimestampBasedKeyGenerator

This key generator is applied to the partition field. The required configuration is:
| Config | Meaning/Purpose |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | One of UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | Output date format |
| hoodie.deltastreamer.keygen.timebased.timezone | Timezone of the date format |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | Input date format |

Below are some usage examples.
Timestamp is GMT
| Config | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "EPOCHMILLISECONDS" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |

Input field value: "1578283932000L"; generated partition path: "2020-01-06 12"
If the input field value is null, the generated partition path is "1970-01-01 08".
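Both results can be reproduced with a small Python sketch (function name is mine; note that "hh" in the output format is a 12-hour clock, which maps to %I in Python, and null falls back to epoch 0):

```python
from datetime import datetime, timezone, timedelta

# GMT+8 timezone from the example config
tz = timezone(timedelta(hours=8))

def epoch_millis_partition(value_ms):
    """Format epoch milliseconds with 'yyyy-MM-dd hh' in GMT+8;
    a null input falls back to epoch 0."""
    ms = 0 if value_ms is None else value_ms
    return datetime.fromtimestamp(ms / 1000, tz=tz).strftime("%Y-%m-%d %I")

print(epoch_millis_partition(1578283932000))  # 2020-01-06 12
print(epoch_millis_partition(None))           # 1970-01-01 08
```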
Timestamp is DATE_STRING
| Config | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd hh:mm:ss" |

Input field value: "2020-01-06 12:12:12"; generated partition path: "2020-01-06 12"
If the input field value is null, the generated partition path is "1970-01-01 12:00:00".
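The non-null case amounts to parsing with the input format and re-emitting with the output format; a Python sketch (my function, timezone handling omitted since input and output share the same zone here):

```python
from datetime import datetime

def date_string_partition(value):
    """Parse with input format 'yyyy-MM-dd hh:mm:ss', re-emit with
    output format 'yyyy-MM-dd hh' (12-hour clock, hence %I)."""
    dt = datetime.strptime(value, "%Y-%m-%d %I:%M:%S")
    return dt.strftime("%Y-%m-%d %I")

print(date_string_partition("2020-01-06 12:12:12"))  # 2020-01-06 12
```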
Scalar examples
| Config | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "SCALAR" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT" |
| hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit | "days" |

Input field value: "20000L"; generated partition path: "2024-10-04 12". If the input field value is null, the generated partition path is "1970-01-02 12".
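With the time unit set to days, the scalar counts days since the epoch. A Python sketch reproducing both outputs (the function is mine; the null fallback of one day past the epoch is inferred from the documented example output):

```python
from datetime import datetime, timedelta, timezone

def scalar_days_partition(value_days):
    """SCALAR with time unit 'days': the value is days since the epoch;
    null is assumed to fall back to one day past the epoch here."""
    days = 1 if value_days is None else value_days
    dt = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(days=days)
    return dt.strftime("%Y-%m-%d %I")

print(scalar_days_partition(20000))  # 2024-10-04 12
print(scalar_days_partition(None))   # 1970-01-02 12
```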
2. Concurrency Control

Supported approaches:
- MVCC: Hudi's table services, such as compaction and cleaning, use MVCC to provide snapshot isolation between writes and reads. This supports a single writer with concurrent readers.
- OPTIMISTIC CONCURRENCY (experimental): enables concurrent writers, and requires ZooKeeper or the Hive Metastore for acquiring locks. For example, if writer_A writes file1 and file2 while writer_B writes file3 and file4, both writes succeed; if writer_A writes file1 and file2 while writer_B writes file2 and file3, only one write can succeed and the other fails.
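The conflict rule in the example boils down to a file-set overlap check, which this tiny illustrative sketch captures (the function is mine, not Hudi's actual conflict-resolution code):

```python
def conflicts(files_a, files_b):
    """Optimistic concurrency in a nutshell: two concurrent writers
    conflict iff the sets of files they touch overlap."""
    return bool(set(files_a) & set(files_b))

print(conflicts({"file1", "file2"}, {"file3", "file4"}))  # False -> both succeed
print(conflicts({"file1", "file2"}, {"file2", "file3"}))  # True  -> one must fail
```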
Multi Writer Guarantees
- upsert: the table will not contain duplicate data
- insert: the table may contain duplicate data even with dedup enabled
- bulk_insert: the table may contain duplicate data even with dedup enabled
- incremental pull: data consumption and checkpoints may arrive out of order