Protein-sequence-labeler


This snippet labels protein sequences based on the name of the file that contains them. The underscore-separated parts of the file name become the labels. For example, suppose the file ClassA_Amine_Serotonin.txt contains FASTA records like the one below:

>gi|73954222|ref|XP_546316.2| PREDICTED: similar to 5-hydroxytryptamine (serotonin) receptor 4 isoform b [Canis familiaris]
MDELDANVSSKEGFGSVEKVVLLTFLSAVILMAILGNLLVMVAVCRDRQLRKIKTNYFIVSLAFVDLLVSVLVMPFGAIELVQDIWIYGEMFCLVRTSLDVLLTTASIFHLCCISLDRYYAICCQPLVYRNKMTPLRIALMLGGCWIIPMFISFLPIMQGWNNIGIIDLIEKRKFNQNSNSTYCIFMVNKPYAITCSVVAFYIPFLLMVLAYYRIYVTAKEHAHQIQMLQRAGAPSEGRPQPADQHSTHRMRTETKAAKTLCIIMGCFCLCWAPFFVTNIVDPFIDYTVPGQVWTAFLWLGYINSGLNPFLYAFLNKSFRRAFLIILCCDDERYRRPSILGQTVPCSTTTINGSTHVLRDAVECGGQWESHCHPPATSSLVAAHPSDP

The output is then a CSV file containing this line (the labels from the file name, followed by the sequence; the FASTA header is dropped):

ClassA,Amine,Serotonin,MDELDANVSSKEGFGSVEKVVLLTFLSAVILMAILGNLLVMVAVCRDRQLRKIKTNYFIVSLAFVDLLVSVLVMPFGAIELVQDIWIYGEMFCLVRTSLDVLLTTASIFHLCCISLDRYYAICCQPLVYRNKMTPLRIALMLGGCWIIPMFISFLPIMQGWNNIGIIDLIEKRKFNQNSNSTYCIFMVNKPYAITCSVVAFYIPFLLMVLAYYRIYVTAKEHAHQIQMLQRAGAPSEGRPQPADQHSTHRMRTETKAAKTLCIIMGCFCLCWAPFFVTNIVDPFIDYTVPGQVWTAFLWLGYINSGLNPFLYAFLNKSFRRAFLIILCCDDERYRRPSILGQTVPCSTTTINGSTHVLRDAVECGGQWESHCHPPATSSLVAAHPSDP

The raw data is available in the /data/ directory, and the final output is in sequences.txt.
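
The labeling described above could be sketched in Python roughly as follows. This is a minimal illustration, not necessarily how the repository's snippet is written: the function names (`parse_fasta`, `label_rows`, `label_directory`) are hypothetical, and it assumes that each input file is a `.txt` file in FASTA format, that labels are separated by underscores in the file name, and that a sequence may be wrapped over several lines.

```python
import csv
from pathlib import Path


def parse_fasta(text):
    """Yield one sequence string per FASTA record, joining wrapped lines."""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header starts a new record; emit the previous one, if any.
            if chunks:
                yield "".join(chunks)
            chunks = []
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def label_rows(file_name, text):
    """Return [label1, label2, ..., sequence] rows for one file's contents."""
    # Assumption: labels are the underscore-separated parts of the file stem.
    labels = Path(file_name).stem.split("_")
    return [labels + [seq] for seq in parse_fasta(text)]


def label_directory(data_dir, out_path):
    """Write one CSV line per sequence found in data_dir/*.txt to out_path."""
    out_path = Path(out_path)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, lineterminator="\n")
        for path in sorted(Path(data_dir).glob("*.txt")):
            # Skip the output file if it happens to live in the data directory.
            if path.resolve() == out_path.resolve():
                continue
            writer.writerows(label_rows(path.name, path.read_text()))


# Shortened, made-up record used only to illustrate the transformation.
example = ">gi|73954222| example header\nMDELDANVSS\nKEGFGSVEKV\n"
for row in label_rows("ClassA_Amine_Serotonin.txt", example):
    print(",".join(row))  # ClassA,Amine,Serotonin,MDELDANVSSKEGFGSVEKV
```

With the layout described above, a call such as `label_directory("data", "sequences.txt")` would produce the combined output file. A file name with more or fewer than three underscore-separated parts would simply yield more or fewer label columns.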