This snippet labels protein sequences based on the name of the file they come from. For example,
suppose the file name is ClassA_Amine_Serotonin.txt
and it contains FASTA-format sequence records such as:
>gi|73954222|ref|XP_546316.2| PREDICTED: similar to 5-hydroxytryptamine (serotonin) receptor 4 isoform b [Canis familiaris]
MDELDANVSSKEGFGSVEKVVLLTFLSAVILMAILGNLLVMVAVCRDRQLRKIKTNYFIVSLAFVDLLVSVLVMPFGAIELVQDIWIYGEMFCLVRTSLDVLLTTASIFHLCCISLDRYYAICCQPLVYRNKMTPLRIALMLGGCWIIPMFISFLPIMQGWNNIGIIDLIEKRKFNQNSNSTYCIFMVNKPYAITCSVVAFYIPFLLMVLAYYRIYVTAKEHAHQIQMLQRAGAPSEGRPQPADQHSTHRMRTETKAAKTLCIIMGCFCLCWAPFFVTNIVDPFIDYTVPGQVWTAFLWLGYINSGLNPFLYAFLNKSFRRAFLIILCCDDERYRRPSILGQTVPCSTTTINGSTHVLRDAVECGGQWESHCHPPATSSLVAAHPSDP
The snippet's output would then be a CSV file containing this line:
ClassA,Amine,Serotonin,MDELDANVSSKEGFGSVEKVVLLTFLSAVILMAILGNLLVMVAVCRDRQLRKIKTNYFIVSLAFVDLLVSVLVMPFGAIELVQDIWIYGEMFCLVRTSLDVLLTTASIFHLCCISLDRYYAICCQPLVYRNKMTPLRIALMLGGCWIIPMFISFLPIMQGWNNIGIIDLIEKRKFNQNSNSTYCIFMVNKPYAITCSVVAFYIPFLLMVLAYYRIYVTAKEHAHQIQMLQRAGAPSEGRPQPADQHSTHRMRTETKAAKTLCIIMGCFCLCWAPFFVTNIVDPFIDYTVPGQVWTAFLWLGYINSGLNPFLYAFLNKSFRRAFLIILCCDDERYRRPSILGQTVPCSTTTINGSTHVLRDAVECGGQWESHCHPPATSSLVAAHPSDP
The raw data is available in the /data/ directory, and the final output is in sequences.txt.
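
The labelling step described above can be sketched roughly as follows. This is only an illustrative sketch, not the snippet's actual code: the function names read_fasta and label_sequences are hypothetical, and it assumes that every input file ends in .txt, that the labels in the file name are separated by underscores, and that each record starts with a header line beginning with ">" followed by one or more sequence lines.

```python
import csv
from pathlib import Path


def read_fasta(path):
    """Yield one sequence string per record, joining wrapped sequence lines."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def label_sequences(data_dir, out_path):
    """Write one CSV row per sequence: the file-name labels, then the sequence."""
    out_file = Path(out_path).resolve()
    # Collect the inputs first and skip the output file in case it lives in data_dir.
    inputs = [p for p in sorted(Path(data_dir).glob("*.txt"))
              if p.resolve() != out_file]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for fasta_file in inputs:
            # "ClassA_Amine_Serotonin.txt" -> ["ClassA", "Amine", "Serotonin"]
            labels = fasta_file.stem.split("_")
            for sequence in read_fasta(fasta_file):
                writer.writerow(labels + [sequence])
```

Under these assumptions, calling label_sequences("/data/", "sequences.txt") would read every .txt file in /data/ and write the labelled rows to sequences.txt. The header line of each record is discarded, since only the file name supplies the labels.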