I'm working with multiple VCF files in a directory (on a Linux server), along with a tab-delimited key file that contains the sample names and their corresponding barcodes.
Here is how the files are named:
ra_4090_v1_ra_4090_rna_v1.vcf
ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf
ra_565_v1.vcf
ra_565_dup_v1.vcf
ra_hcc-78-2.vcf
Here are the contents of the key file:
Barcode ID      Sample Name
ionselect-2     ra_4090
ionselect-4     ra_565
ionselect-6     ra_hcc-78-2
ionselect-10    ra_4090_dup
ionselect-12    ra_565_dup
I need to correlate the correct sample name with each .vcf file and then rename each .vcf file.
There is one .vcf file for each sample. However, some sample names begin with the same substring, and because the sample names are not standardized, I can't match them correctly.
The following code works when the sample names are all different, but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names beginning with the same substring.
Please suggest something that will work. Here is my current code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);

my $home          = "/data/";
my $bam_directory = $home . "test_all_runs/" . $ARGV[0];
my $matrix_key    = $home . "test_all_runs/" . $ARGV[0] . "/key.txt";

# Read every line of the key file into an array
my @matrix_key = ();
open(TXT2, "$matrix_key") or die "Can't open '$matrix_key': $!";
while (<TXT2>) {
    push(@matrix_key, $_);
}
close(TXT2);

my @ant_vcf = glob "$bam_directory/*.vcf";
foreach my $tsv_file (@ant_vcf) {
    my $matrix_barcode_vcf = "";
    my $matrix_sample_vcf  = "";
    foreach (@matrix_key) {
        chomp($_);
        my @matrix_key = split("\t", $_);    # (barcode, sample name)
        # Substring test: ra_4090 also matches files named ra_4090_dup...
        if (index($tsv_file, $matrix_key[1]) != -1) {
            $matrix_barcode_vcf = $matrix_key[0];
            print $matrix_key[0];
            $matrix_sample_vcf = $matrix_key[1];
            chomp $matrix_barcode_vcf;
            chomp $matrix_sample_vcf;
            #print $bam_directory."/".$matrix_sample_id."_".$matrix_barcode.".bam";
            move $tsv_file, $bam_directory . "/" . $matrix_sample_vcf . "_" . $matrix_sample_vcf . ".vcf";
        }
    }
}
The key to solving the problem is sorting the 'sample name' entries by length, longest first.
For example, ra_4090_dup should come before ra_4090 in the @matrix_key array, so that the longer string is tried first. Then, after a match, stop searching (I used first from the List::Util module, which has been part of core Perl since version 5.8).
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'first';

my @files = qw(
    ra_4090_v1_ra_4090_rna_v1.vcf
    ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf
    ra_565_v1.vcf
    ra_565_dup_v1.vcf
    ra_hcc-78-2.vcf
);

open my $key, '<', 'junk.txt' or die $!;    # key file
<$key>;    # throw away header line in key file (first line)

# Each element is [barcode, sample name], sorted by sample name length, longest first
my @matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
                 map [ split ], <$key>;
close $key or die $!;

for my $tsv_file (@files) {
    # first() stops at the first (that is, longest) sample name found in the file name
    if ( my $aref = first { index($tsv_file, $_->[1]) != -1 } @matrix_key ) {
        print "$tsv_file \t matches $aref->[1]\n";
        print "\t$aref->[1]_$aref->[0]\n\n";
    }
}
This produced the output:

ra_4090_v1_ra_4090_rna_v1.vcf    matches ra_4090
    ra_4090_ionselect-2

ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf    matches ra_4090_dup
    ra_4090_dup_ionselect-10

ra_565_v1.vcf    matches ra_565
    ra_565_ionselect-4

ra_565_dup_v1.vcf    matches ra_565_dup
    ra_565_dup_ionselect-12

ra_hcc-78-2.vcf    matches ra_hcc-78-2
    ra_hcc-78-2_ionselect-6
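The same longest-match idea can also be wrapped in a small helper, which may be convenient when planning the actual renames. The following is only a sketch under stated assumptions: best_match is a hypothetical name, the key data is hard-coded instead of being read from the key file, and the loop is a dry run that prints the proposed new name without touching any files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the
# [barcode, sample name] pair whose sample name is the longest one
# contained in $file, or undef when no sample name occurs in $file.
sub best_match {
    my ($file, @pairs) = @_;
    my ($best) = sort { length($b->[1]) <=> length($a->[1]) }
                 grep { index($file, $_->[1]) != -1 } @pairs;
    return $best;
}

# Key data hard-coded for illustration; in practice it would be read
# from the tab-delimited key file.
my @pairs = (
    [ 'ionselect-2',  'ra_4090'     ],
    [ 'ionselect-4',  'ra_565'      ],
    [ 'ionselect-6',  'ra_hcc-78-2' ],
    [ 'ionselect-10', 'ra_4090_dup' ],
    [ 'ionselect-12', 'ra_565_dup'  ],
);

my @files = qw(
    ra_4090_v1_ra_4090_rna_v1.vcf
    ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf
    ra_565_v1.vcf
    ra_565_dup_v1.vcf
    ra_hcc-78-2.vcf
);

# Dry run: print the planned rename instead of performing it.
for my $file (@files) {
    my $hit = best_match($file, @pairs) or next;
    print "$file -> $hit->[1]_$hit->[0].vcf\n";
}
```

To perform the renames for real, the print could be replaced with move from File::Copy, as in the question's code. Running the dry run first makes it easy to check that every file maps to a distinct target name.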