我编写了一个脚本来更改一个文本文件(fasta格式)中的特定行,我想并行化,因为有很多行(〜800k)。

>CTC14_37541|M00842:336:000000000-C7WWK:1:2101:20913:9309:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=75;p=71|CO

我想将其转换为:

>Sample-CTC14_Read37541

我有两个问题。

  1. 我尝试在有功能和无功能的情况下运行脚本:

    • 没有功能,它将起作用:我要更改的所有行均已修改。

    • 当我使用一个函数时,只有一行被修改。我的功能header()有问题吗?

  2. 第二个问题是并行化。我用"&"尝试过一些东西,但是我不确定这是最好的解决方案。任何想法?

我的代码没有功能和并行:

#!/bin/bash

TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH

for fasta in *.fasta
do
  echo $fasta
    lines=$(grep ">" $fasta)
    for line in $lines
  do
    if [[ $line = *">"* ]]; then
      read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
      sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
      newheader=$(echo ">Sample-$sample$read_nb")
      sed -i -e "s/$line/$newheader/g" $fasta
      sed -i -e "s/ /\n/g" $fasta
    fi
    done
done

echo "END"

我的函数和并行代码:

#!/bin/bash

TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH

n=0
maxjobs=500


header(){
    if [[ $line = *">"* ]]; then
      read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
      sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
      newheader=$(echo ">Sample-$sample$read_nb")
      sed -i -e "s/$line/$newheader/g" $fasta
      sed -i -e "s/ /\n/g" $fasta
    fi
}

for fasta in *.fasta
do
    lines=$(grep ">" $fasta)
    for line in $lines
    do
       header $line &
       #limit jobs
       if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
         wait
         echo $n wait
       fi
    done
done

我有一个fasta文件作为输入,其中包含几个标题和序列。我想转换标题,以便在特定的工作流程中使用我的fasta文件。我需要从那开始:

>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>CTC14_20535|M00842:336:000000000-C7WWK:1:1108:28568:20175:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=64|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>CTC14_24700|M00842:336:000000000-C7WWK:1:1110:7911:9824:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG

对此:

>Sample-CTC14_Read18758
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>Sample-CTC14_Read20535
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>Sample-CTC14_Read24700
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG

我想进行并行处理,因为我要更改很多行(〜700-800k),并且如果逐行运行脚本需要花费很长时间。

使用我的没有功能的脚本,工作正常,但是时间太长。

使用具有功能和并行功能的脚本,作业无法正常工作,因为在我的Fasta中仅更改了一个header,而不是所有header,并且我不明白为什么。我尝试了不同的方式编写和调用函数,但结果始终相同。

此外,我尝试使用gnu-parallel,但方法相同。我认为我的功能或通话有问题,但是我不知道在哪里。

我认为按照您的建议使用awk是个好主意,但我对此不太满意。你能帮我吗?


我的fasta文件的正确格式为:

>CTC14_1600|M00842:336:000000000-C7WWK:1:1101:26089:18004:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGACGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_11169|M00842:336:000000000-C7WWK:1:1105:11636:11876:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=76;p=65|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
分析解答

假设>CTC14_18758|M00842:336:000000000-在单独的行上,则此代码会将输入转换为输出。

#!/bin/sed -f

#skip blank lines
/^[[:space:]]*$/n

#change >CTC14_18758|M00842:336:000000000-
#  to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
/^>/s/|.*$//

# remove 2ndary header
#  C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGAATATTGGAC...
# to
#  TGGGGAATATTGGAC...

s/^[^>].*| //

将其另存为file/script。 然后将其标记为可执行文件

chmod +x mySed

然后像这样运行

./mySed -i fileIn

或者,如果您收到有关-i的warning/error消息,请运行

./mySed fileIn> fileOut && mv fileOut fileIn

现在,您可以消除函数header()和代码中的2ndary循环。 只是

for file in *.fasta ; do
   echo "processing file=$file"
   /path/to/mySed -i "$file"
   # run other processing if needed
   # don't think you need wait any more
   #uncomment? wait
done

--------------版本2 sed ---------------

#!/bin/sed -f

#skip blank lines
/^[[:space:]]*$/n

#>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGA...

#change >CTC14_18758|M00842:336:000000000-
#  to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
s/|.*| / /
# /^>/s/-.*| / /
# s/-.*| / /

使用类似的数据

>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG

高温超导