我编写了一个脚本来更改一个文本文件(fasta格式)中的特定行,我想并行化,因为有很多行(〜800k)。
>CTC14_37541|M00842:336:000000000-C7WWK:1:2101:20913:9309:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=75;p=71|CO
我想将其转换为:
>Sample-CTC14_Read37541
我有两个问题。
我尝试在有功能和无功能的情况下运行脚本:
-
没有功能,它将起作用:我要更改的所有行均已修改。
当我使用一个函数时,只有一行被修改。我的功能
header()
有问题吗?
-
第二个问题是并行化。我用"&"尝试过一些东西,但是我不确定这是最好的解决方案。任何想法?
我的代码没有功能和并行:
#!/bin/bash
TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH
for fasta in *.fasta
do
echo $fasta
lines=$(grep ">" $fasta)
for line in $lines
do
if [[ $line = *">"* ]]; then
read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
newheader=$(echo ">Sample-$sample$read_nb")
sed -i -e "s/$line/$newheader/g" $fasta
sed -i -e "s/ /\n/g" $fasta
fi
done
done
echo "END"
我的函数和并行代码:
#!/bin/bash
TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH
n=0
maxjobs=500
header(){
if [[ $line = *">"* ]]; then
read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
newheader=$(echo ">Sample-$sample$read_nb")
sed -i -e "s/$line/$newheader/g" $fasta
sed -i -e "s/ /\n/g" $fasta
fi
}
for fasta in *.fasta
do
lines=$(grep ">" $fasta)
for line in $lines
do
header $line &
#limit jobs
if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
wait
echo $n wait
fi
done
done
我有一个fasta文件作为输入,其中包含几个标题和序列。我想转换标题,以便在特定的工作流程中使用我的fasta文件。我需要从那开始:
>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>CTC14_20535|M00842:336:000000000-C7WWK:1:1108:28568:20175:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=64|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>CTC14_24700|M00842:336:000000000-C7WWK:1:1110:7911:9824:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
对此:
>Sample-CTC14_Read18758
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>Sample-CTC14_Read20535
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>Sample-CTC14_Read24700
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
我想进行并行处理,因为我要更改很多行(〜700-800k),并且如果逐行运行脚本需要花费很长时间。
使用我的没有功能的脚本,工作正常,但是时间太长。
使用具有功能和并行功能的脚本,作业无法正常工作,因为在我的Fasta中仅更改了一个header,而不是所有header,并且我不明白为什么。我尝试了不同的方式编写和调用函数,但结果始终相同。
此外,我尝试使用gnu-parallel,但方法相同。我认为我的功能或通话有问题,但是我不知道在哪里。
我认为按照您的建议使用awk是个好主意,但我对此不太满意。你能帮我吗?
我的fasta文件的正确格式为:
>CTC14_1600|M00842:336:000000000-C7WWK:1:1101:26089:18004:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGACGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_11169|M00842:336:000000000-C7WWK:1:1105:11636:11876:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=76;p=65|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
假设>CTC14_18758|M00842:336:000000000-
在单独的行上,则此代码会将输入转换为输出。
#!/bin/sed -f
#skip blank lines
/^[[:space:]]*$/n
#change >CTC14_18758|M00842:336:000000000-
# to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
/^>/s/|.*$//
# remove 2ndary header
# C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGAATATTGGAC...
# to
# TGGGGAATATTGGAC...
s/^[^>].*| //
将其另存为file/script。 然后将其标记为可执行文件
chmod +x mySed
然后像这样运行
./mySed -i fileIn
或者,如果您收到有关-i
的warning/error消息,请运行
./mySed fileIn> fileOut && mv fileOut fileIn
现在,您可以消除函数header()
和代码中的2ndary循环。
只是
for file in *.fasta ; do
echo "processing file=$file"
/path/to/mySed -i "$file"
# run other processing if needed
# don't think you need wait any more
#uncomment? wait
done
--------------版本2 sed ---------------
#!/bin/sed -f
#skip blank lines
/^[[:space:]]*$/n
#>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGA...
#change >CTC14_18758|M00842:336:000000000-
# to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
s/|.*| / /
# /^>/s/-.*| / /
# s/-.*| / /
使用类似的数据
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
高温超导