Perl - Matching a string with substrings


I'm working with multiple VCF files in a directory (on a Linux server), plus a tab-delimited key file that contains the sample names and their corresponding barcodes.

Here is how the files are named:

    ra_4090_v1_ra_4090_rna_v1.vcf
    ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf
    ra_565_v1.vcf
    ra_565_dup_v1.vcf
    ra_hcc-78-2.vcf

Here are the contents of the key file:

    barcode id      sample name
    ionselect-2     ra_4090
    ionselect-4     ra_565
    ionselect-6     ra_hcc-78-2
    ionselect-10    ra_4090_dup
    ionselect-12    ra_565_dup

I need to correlate the correct sample name with each .vcf file and then rename each .vcf file.

There is one VCF file for each sample. However, some sample names begin with the same substring, and I can't find a way to match them correctly, since the sample names are not standardized.

The following code works when the sample names are different, but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names beginning with the same substring.

Please suggest something that will work. Here is my current code:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use File::Copy qw(move);

    my $home = "/data/";
    my $bam_directory = $home."test_all_runs/".$ARGV[0];
    my $matrix_key = $home."test_all_runs/".$ARGV[0]."/key.txt";

    # Read every line of the key file into an array
    my @matrix_key = ();
    open(TXT2, "$matrix_key") or die "Can't open '$matrix_key': $!";
    while (<TXT2>){
        push (@matrix_key, $_);
    }
    close(TXT2);

    my @ant_vcf = glob "$bam_directory/*.vcf";

    foreach my $tsv_file (@ant_vcf){
        my $matrix_barcode_vcf = "";
        my $matrix_sample_vcf = "";

        foreach (@matrix_key){
            chomp($_);
            my @matrix_key = split ("\t", $_);
            # Substring test: true for every sample name contained in the file name
            if (index ($tsv_file,$matrix_key[1]) != -1) {
                $matrix_barcode_vcf = $matrix_key[0]; print $matrix_key[0];
                $matrix_sample_vcf = $matrix_key[1];
                chomp $matrix_barcode_vcf;
                chomp $matrix_sample_vcf;
                #print $bam_directory."/".$matrix_sample_id."_".$matrix_barcode.".bam";
                move $tsv_file, $bam_directory."/".$matrix_sample_vcf."_".$matrix_sample_vcf.".vcf";
            }
        }
    }

"The following code works when the sample names are different, but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names beginning with the same substring."

The key to solving this problem is sorting the 'sample name' entries by length, longest first.

For example, ra_4090_dup should come before ra_4090 in the @matrix_key array, so that the longer string is tried first. Then, after the first match, stop searching (I used first from the List::Util module, which has been part of core Perl since version 5.8).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util 'first';

    my @files = qw(
        ra_4090_v1_ra_4090_rna_v1.vcf
        ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf
        ra_565_v1.vcf
        ra_565_dup_v1.vcf
        ra_hcc-78-2.vcf
    );

    open my $key, '<', 'junk.txt' or die $!; # key file

    <$key>; # throw away header line in key file (first line)

    # Split each line into [barcode, sample name], longest sample name first
    my @matrix_key = sort {length($b->[1]) <=> length($a->[1])} map [ split ], <$key>;
    close $key or die $!;

    for my $tsv_file (@files) {
        # first() stops at the first (that is, longest) sample name found in the file name
        if ( my $aref = first { index($tsv_file, $_->[1]) != -1 } @matrix_key ) {
            print "$tsv_file \t matches $aref->[1]\n";
            print "\t$aref->[1]_$aref->[0]\n\n";
        }
    }

This produced the following output:

    ra_4090_v1_ra_4090_rna_v1.vcf    matches ra_4090
            ra_4090_ionselect-2

    ra_4090_dup_v1_ra_4090_dup_rna_v1.vcf    matches ra_4090_dup
            ra_4090_dup_ionselect-10

    ra_565_v1.vcf    matches ra_565
            ra_565_ionselect-4

    ra_565_dup_v1.vcf        matches ra_565_dup
            ra_565_dup_ionselect-12

    ra_hcc-78-2.vcf          matches ra_hcc-78-2
            ra_hcc-78-2_ionselect-6
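
The script above only prints the proposed names. To actually rename the files, the same longest-first lookup could be combined with File::Copy, roughly as in the sketch below. It rests on several assumptions: the key file is called key.txt and sits in the same directory as the .vcf files (as in the question), the desired name is sample_barcode.vcf (as in the printed output), and the sub names load_key, new_name_for and main are made up for illustration. It has not been tested against a real run directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'first';
use File::Copy 'move';

# Read the key file and return [barcode, sample name] pairs,
# longest sample name first. Assumes the first line is a header.
sub load_key {
    my ($key_path) = @_;
    open my $fh, '<', $key_path or die "Can't open '$key_path': $!";
    <$fh>;    # discard the header line
    my @rows;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;    # skip blank lines
        my ($barcode, $sample) = split ' ', $line;
        push @rows, [ $barcode, $sample ];
    }
    close $fh;
    my @sorted = sort { length($b->[1]) <=> length($a->[1]) } @rows;
    return @sorted;
}

# Return "sample_barcode.vcf" for the longest sample name contained
# in $file, or nothing if no sample name matches.
sub new_name_for {
    my ($file, @key) = @_;
    my $row = first { index($file, $_->[1]) != -1 } @key;
    return unless $row;
    return "$row->[1]_$row->[0].vcf";
}

# Rename every .vcf file in $dir according to $dir/key.txt (assumed layout).
sub main {
    my ($dir) = @_;
    my @key = load_key("$dir/key.txt");
    for my $path (glob "$dir/*.vcf") {
        # Match against the file name only, not the directory part
        my ($base) = $path =~ m{([^/]+)\z};
        my $name = new_name_for($base, @key) or next;
        move($path, "$dir/$name") or warn "Can't rename '$path': $!";
    }
}

# Run only when called as a script with a directory argument
main($ARGV[0]) if !caller && @ARGV;
```

Matching against the base name rather than the full path avoids a false hit when a directory name happens to contain a sample name. It is probably worth keeping the print statements from the answer and checking the proposed names before letting move loose on real data.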
