I have a large table (millions of rows by hundreds of columns, tab-delimited). The first few columns look like this:

GT:DS:GP    0|0:0.181:0.827,0.165,0.008 0|0:0.181:0.827,0.165,0.008 0|0:0.181:0.827,0.165,0.008
GT:DS:GP    0|0:0.109:0.894,0.103,0.003 0|0:0.109:0.894,0.103,0.003 0|0:0.109:0.894,0.103,0.003
GT:DS:GP    0|0:0.004:0.996,0.004,0.000 0|0:0.004:0.996,0.004,0.000 0|0:0.004:0.996,0.004,0.000
GT:DS:GP    0|0:0.117:0.886,0.110,0.003 0|0:0.117:0.886,0.110,0.003 0|0:0.117:0.886,0.110,0.003

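All remaining columns look like columns 2 and 3. I need a new file based on the first one, with everything from the second colon (:) onward removed from each field. The output should look like this: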

GT:DS   0|0:0.181   0|0:0.181   0|0:0.181
GT:DS   0|0:0.109   0|0:0.109   0|0:0.109
GT:DS   0|0:0.004   0|0:0.004   0|0:0.004
GT:DS   0|0:0.117   0|0:0.117   0|0:0.117

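I think this might be close to something I found in another thread, but apparently the exit command tells awk to stop after the first occurrence, so it does not handle multiple occurrences (across multiple lines/columns)...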

awk -v RS=':' -v ORS=':' 'NR==1{print} NR==2{print; printf"\n";exit}' input > output

The output of this failed attempt is:

GT:DS:

Thanks for your help!

Answer
$ sed 's/\([^:]*:[^:]*\):[^:\t]*/\1/g' file
GT:DS   0|0:0.181       0|0:0.181       0|0:0.181
GT:DS   0|0:0.109       0|0:0.109       0|0:0.109
GT:DS   0|0:0.004       0|0:0.004       0|0:0.004
GT:DS   0|0:0.117       0|0:0.117       0|0:0.117
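
The sed command captures the first two colon-separated pieces of each field as group 1, then also matches the following colon and everything up to the next colon or tab, and replaces the whole match with group 1 alone. Note that \t inside a bracket expression is understood by GNU sed but may not be by other sed implementations. If that is a concern, the same idea can be sketched in awk by splitting each tab-delimited field on ":" and keeping only the first two pieces. The function name keep_two_subfields is made up for illustration, and the sketch assumes every field is tab-separated as described in the question:

```shell
# Hypothetical helper (name is illustrative): keep only the first two
# colon-separated pieces of every tab-delimited field.
# Reads from stdin, or from the files given as arguments.
keep_two_subfields() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (i = 1; i <= NF; i++) {
            # split the field on ":" and rebuild it from pieces 1 and 2
            n = split($i, part, ":")
            if (n >= 2) $i = part[1] ":" part[2]
        }
        print
    }' "$@"
}

# Example on a shortened version of the first sample row
printf 'GT:DS:GP\t0|0:0.181:0.827,0.165,0.008\t0|0:0.181:0.827,0.165,0.008\n' | keep_two_subfields
```

For the real data this would be called as keep_two_subfields input > output. Fields with fewer than two colons are left unchanged.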