- 1. How data loading works
- 2. Loading data with the Data Loader
- 3. Loading data with a spec (via the web console)
- 4. Loading data with a spec (via the command line)
- 5. Loading data with curl
1. How data loading works
Loading data requires submitting an ingestion spec to the Overlord. An ingestion spec is just a piece of JSON metadata, and there are two ways to produce one:
- write it by hand
- load a small amount of data through the data loader built into the Druid web console and configure the parameters; the console then generates the spec for us
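Whichever way it is produced, every native batch ingestion spec has the same three-part shape. A minimal skeleton (placeholders only; the full example in section 4 below fills them in):

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { },
    "ioConfig": { },
    "tuningConfig": { }
  }
}

dataSchema describes the data itself (datasource name, timestamp, dimensions, metrics), ioConfig says where and how to read the input, and tuningConfig controls how the task executes.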
2. Loading data with the Data Loader
The steps are as follows:
Click Apply to preview the file data, then click Next: Parse data; the input format is automatically detected as json. Then click Next: Parse time.
Druid requires a time column named __time. If the data has no time field, you can choose Constant Value.
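For reference, choosing Constant Value corresponds to a timestampSpec whose missingValue field supplies a fixed timestamp for every row; a sketch (the column name and timestamp here are placeholders):

"timestampSpec": {
  "column": "no_such_column",
  "missingValue": "2015-09-12T00:00:00Z"
}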
Click Next: Transform, then Next: Filter, then Next: Configure schema; on the Configure schema step you can select the fields you need. Then click Next: Partition.
Then click Next: Publish.
You can also edit the spec directly; if you then go back to the earlier steps, you will see that your earlier selections have been updated to match.
3. Loading data with a spec (via the web console)
Paste the contents of /opt/apache-druid-0.22.1/quickstart/tutorial/wikipedia-index.json into the text box, then click Submit to submit the task.
This JSON file defines a task that reads /opt/apache-druid-0.22.1/quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz and creates a datasource named wikipedia.
4. Loading data with a spec (via the command line)
First, reinitialize the cluster to clear out its data. The cleanup steps are listed below, with a command sketch after the list:
- Stop the cluster
- Delete the /druid directory in HDFS
- Delete the /druid node in ZooKeeper
- Drop the druid database in MySQL, then recreate it
- On every server in the Druid cluster, delete var_master/*, var_data/*, var_query/*, var/druid, and var/tmp/*
- Start the cluster
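A sketch of the cleanup as shell commands (the ZooKeeper address, MySQL credentials, and character set are assumptions for this cluster; run the rm step on every Druid server, and stop/start the cluster with whatever scripts you normally use):

# delete the Druid directory in HDFS
hdfs dfs -rm -r /druid
# delete the Druid znode in ZooKeeper (deleteall requires ZooKeeper 3.5+)
zkCli.sh -server bigdata001:2181 deleteall /druid
# drop and recreate the metadata database in MySQL
mysql -uroot -p -e 'DROP DATABASE druid; CREATE DATABASE druid DEFAULT CHARACTER SET utf8mb4;'
# remove the local state directories (on every server)
rm -rf var_master/* var_data/* var_query/* var/druid var/tmp/*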
The contents of wikipedia-index.json are as follows:
[root@bigdata001 apache-druid-0.22.1]#
[root@bigdata001 apache-druid-0.22.1]# cat quickstart/tutorial/wikipedia-index.json
{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec" : {
        "dimensions" : [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          { "name": "added", "type": "long" },
          { "name": "deleted", "type": "long" },
          { "name": "delta", "type": "long" }
        ]
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}
[root@bigdata001 apache-druid-0.22.1]#
Run the task from the command line:
[root@bigdata001 apache-druid-0.22.1]#
[root@bigdata001 apache-druid-0.22.1]# pwd
/opt/apache-druid-0.22.1
[root@bigdata001 apache-druid-0.22.1]#
[root@bigdata001 apache-druid-0.22.1]# bin/post-index-task --file quickstart/tutorial/wikipedia-index.json --url http://bigdata003:9081
Beginning indexing data for wikipedia
Redirect response received, setting url to [http://bigdata002:9081]
Task started: index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z
Task log: http://bigdata002:9081/druid/indexer/v1/task/index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z/log
Task status: http://bigdata002:9081/druid/indexer/v1/task/index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z/status
Task index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z still running...
Task index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z still running...
Task index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
Traceback (most recent call last):
  File "/opt/apache-druid-0.22.1/bin/post-index-task-main", line 174, in <module>
    main()
  File "/opt/apache-druid-0.22.1/bin/post-index-task-main", line 171, in main
    await_load_completion(args, datasource, load_timeout_at)
  File "/opt/apache-druid-0.22.1/bin/post-index-task-main", line 119, in await_load_completion
    response = urllib2.urlopen(req, None, response_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
[root@bigdata001 apache-druid-0.22.1]#
- Port 9081 is the Coordinator's plain port
- This urllib2.HTTPError can be ignored: the output above already shows Task finished with status: SUCCESS, so the data was ingested
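To confirm the final state yourself, you can query the task status URL from the log above directly (substitute your own task ID); it returns a JSON document describing the task's status:

curl http://bigdata002:9081/druid/indexer/v1/task/index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z/status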
5. Loading data with curl
Switch to the apache-druid-0.22.1 directory:
[root@bigdata001 apache-druid-0.22.1]#
[root@bigdata001 apache-druid-0.22.1]# pwd
/opt/apache-druid-0.22.1
[root@bigdata001 apache-druid-0.22.1]#
[root@bigdata001 apache-druid-0.22.1]# curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://bigdata002:9081/druid/indexer/v1/task
{"task":"index_parallel_wikipedia_ffpapmhp_2022-03-28T15:02:43.737Z"}[root@bigdata001 apache-druid-0.22.1]#
[root@bigdata001 apache-druid-0.22.1]#
- The Coordinator you connect to must be the leader Coordinator (see the sketch below for how to find it)
- The JSON string is returned only if the task was submitted successfully
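To find out which Coordinator is currently the leader, query the leader endpoint on any Coordinator (a sketch; the endpoint returns the leader's address):

curl http://bigdata002:9081/druid/coordinator/v1/leader

The Overlord exposes an analogous endpoint at /druid/indexer/v1/leader.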