Prepare the Hadoop environment
master 192.168.0.200
slave1 192.168.0.201
slave2 192.168.0.202
On master, run start-all.sh, then check each node with jps.
master
5728 SecondaryNameNode
7828 Jps
5893 ResourceManager
5531 NameNode
slave1
3895 NodeManager
3772 DataNode
5646 Jps
slave2
3745 DataNode
5650 Jps
3868 NodeManager
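Before wiring up DataX, it is worth confirming that HDFS itself is healthy. A quick sanity check (assuming the Hadoop binaries are on the PATH of the master node; this command is not part of the original steps) is:
hdfs dfsadmin -report
It should report two live DataNodes (slave1 and slave2) along with their capacity.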
[root@master hadoop]# cat core-site.xml
<configuration>
    <!-- Address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <!-- Directory where files generated by Hadoop are stored -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoopdata</value>
    </property>
</configuration>
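To double-check that this value is the one the cluster actually picks up, you can ask the HDFS client directly (a simple sanity check, not part of the original write-up):
hdfs getconf -confKey fs.defaultFS
It should print hdfs://master:9000, the same address the DataX job below refers to by IP as hdfs://192.168.0.200:9000.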
[root@master hadoop]# cat hdfs-site.xml
<configuration>
    <!-- Number of replicas HDFS keeps for each block -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.http.address</name>
        <value>slave1:50070</value>
    </property>
</configuration>
Prepare datax.txt
1 张一 21
2 张二 22
3 张三 23
4 张四 24
5 张五 25
datax2.txt
1 李一 21
2 李二 22
3 李三 23
4 李四 24
5 李五 25
The fields are separated by spaces; this simply has to match the "fieldDelimiter": " " setting in the JSON file below.
hdfs2stream.json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/user/*",
                    "defaultFS": "hdfs://192.168.0.200:9000",
                    "column": [{
                            "index": 0,
                            "type": "long"
                        },
                        {
                            "index": 1,
                            "type": "string"
                        },
                        {
                            "index": 2,
                            "type": "long"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "fieldDelimiter": " "
                }
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {
                    "print": true
                }
            }
        }]
    }
}
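Before launching the job it can save time to verify that the JSON is well formed; one simple check (assuming a standard Python installation, which datax.py already requires) is:
python -m json.tool hdfs2stream.json
If the file parses, the pretty-printed configuration is echoed back; otherwise a syntax error with the offending position is reported.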
Note:
"path": "/user/*" — you can browse http://192.168.0.200:50070/explorer.html#/ to see where the files are stored.
"defaultFS": "hdfs://192.168.0.200:9000" must match the NameNode address configured in core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
</property>
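Instead of the web UI, the files matched by the /user/* wildcard can also be listed from the shell against the same defaultFS address (a quick check, assuming an HDFS client is available):
hdfs dfs -ls hdfs://192.168.0.200:9000/user/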
Result
python datax.py hdfs2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2020-10-30 16:19:59.688 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2020-10-30 16:19:59.698 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.261-b12
jvmInfo: Windows 10 amd64 10.0
cpu num: 4
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2020-10-30 16:19:59.723 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"hdfsreader",
"parameter":{
"column":[
{
"index":0,
"type":"long"
},
{
"index":1,
"type":"string"
},
{
"index":2,
"type":"long"
}
],
"defaultFS":"hdfs://192.168.0.200:9000",
"encoding":"UTF-8",
"fieldDelimiter":" ",
"fileType":"text",
"path":"/user/*"
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"print":true
}
}
}
],
"setting":{
"speed":{
"channel":3
}
}
}
2020-10-30 16:19:59.748 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-10-30 16:19:59.750 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-10-30 16:19:59.752 [main] INFO JobContainer - DataX jobContainer starts job.
2020-10-30 16:19:59.757 [main] INFO JobContainer - Set jobId = 0
2020-10-30 16:19:59.782 [job-0] INFO HdfsReader$Job - init() begin...
2020-10-30 16:20:00.205 [job-0] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":[]}
2020-10-30 16:20:00.206 [job-0] INFO HdfsReader$Job - init() ok and end...
2020-10-30 16:20:00.222 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-10-30 16:20:00.222 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] do prepare work .
2020-10-30 16:20:00.223 [job-0] INFO HdfsReader$Job - prepare(), start to getAllFiles...
2020-10-30 16:20:00.229 [job-0] INFO HdfsReader$Job - get HDFS all files in path = [/user/*]
2020-10-30 16:20:07.291 [job-0] INFO HdfsReader$Job - [hdfs://192.168.0.200:9000/user/datax.txt]是[text]类型的文件, 将该文件加入source files列表
2020-10-30 16:20:07.740 [job-0] INFO HdfsReader$Job - [hdfs://192.168.0.200:9000/user/datax2.txt]是[text]类型的文件, 将该文件加入source files列表
2020-10-30 16:20:07.744 [job-0] INFO HdfsReader$Job - 您即将读取的文件数为: [2], 列表为: [hdfs://192.168.0.200:9000/user/datax2.txt,hdfs://192.168.0.200:9000/user/datax.txt]
2020-10-30 16:20:07.746 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2020-10-30 16:20:07.748 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-10-30 16:20:07.749 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels.
2020-10-30 16:20:07.751 [job-0] INFO HdfsReader$Job - split() begin...
2020-10-30 16:20:07.754 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] splits to [2] tasks.
2020-10-30 16:20:07.754 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [2] tasks.
2020-10-30 16:20:07.774 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2020-10-30 16:20:07.789 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2020-10-30 16:20:07.794 [job-0] INFO JobContainer - Running by standalone Mode.
2020-10-30 16:20:07.804 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [2] channels for [2] tasks.
2020-10-30 16:20:07.821 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-10-30 16:20:07.821 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2020-10-30 16:20:07.838 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-10-30 16:20:07.851 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2020-10-30 16:20:07.910 [0-0-1-reader] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2020-10-30 16:20:07.917 [0-0-1-reader] INFO Reader$Task - read start
2020-10-30 16:20:07.944 [0-0-0-reader] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2020-10-30 16:20:07.957 [0-0-0-reader] INFO Reader$Task - read start
2020-10-30 16:20:07.962 [0-0-0-reader] INFO Reader$Task - reading file : [hdfs://192.168.0.200:9000/user/datax2.txt]
2020-10-30 16:20:07.959 [0-0-1-reader] INFO Reader$Task - reading file : [hdfs://192.168.0.200:9000/user/datax.txt]
2020-10-30 16:20:08.019 [0-0-0-reader] INFO UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":" ","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值为[null]
2020-10-30 16:20:08.019 [0-0-1-reader] INFO UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":" ","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值为[null]
2020-10-30 16:20:08.038 [0-0-0-reader] INFO Reader$Task - end read source files...
2020-10-30 16:20:08.038 [0-0-1-reader] INFO Reader$Task - end read source files...
1 李一 21
2 李二 22
3 李三 23
4 李四 24
5 李五 25
1 张一 21
2 张二 22
3 张三 23
4 张四 24
5 张五 25
2020-10-30 16:20:08.063 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[226]ms
2020-10-30 16:20:08.064 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[214]ms
2020-10-30 16:20:08.066 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-10-30 16:20:17.822 [job-0] INFO StandAloneJobContainerCommunicator - Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.379s | Percentage 100.00%
2020-10-30 16:20:17.823 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2020-10-30 16:20:17.827 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2020-10-30 16:20:17.831 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] do post work.
2020-10-30 16:20:17.837 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2020-10-30 16:20:17.839 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: d:\java\datax\hook
2020-10-30 16:20:17.849 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.032s | 0.032s | 0.032s
PS Scavenge | 1 | 1 | 1 | 0.016s | 0.016s | 0.016s
2020-10-30 16:20:17.851 [job-0] INFO JobContainer - PerfTrace not enable!
2020-10-30 16:20:17.853 [job-0] INFO StandAloneJobContainerCommunicator - Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.379s | Percentage 100.00%
2020-10-30 16:20:17.900 [job-0] INFO JobContainer -
任务启动时刻 : 2020-10-30 16:19:59
任务结束时刻 : 2020-10-30 16:20:17
任务总计耗时 : 18s
任务平均流量 : 5B/s
记录写入速度 : 1rec/s
读出记录总数 : 10
读写失败总数 : 0