SGM+ DNA Profile Test Data Generation

Some years ago I needed to create a few hundred million SGM+ DNA profiles as dummy/test data. Although the DNA profile file/input format was constrained, it also had to tolerate additional whitespace here and there, trailing commas, and so on.

An example SGM+ DNA profile adhering to this predefined format is shown below:

personID0: TH01 (8 , 8.3 ) ; FGA ( 51 , 49 ) ;; ; D21 ( 27.2 ,24.2) ; D19 ( F, 12 ); ;;;; D16 ( 12.2 , 7 ) ; ; D8 ( 14 , 14) ; ;; D2 (26 ,27 ) ; ; ;; D3 ( 17.2 , 20); ;; Amel (F , F ) ;; ;;

The format is plain text: a person ID, a colon, then one entry per locus. Each entry is a predefined marker (locus) name followed by two or three comma-separated values from a quantised list, enclosed in brackets. Entries are separated by one or more semicolons.
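
To make the format description concrete, here is a minimal sketch of a tolerant reader for a single profile line. It is not part of the generator and not how the consuming product parses its input; the name parseProfile is my own, and it assumes that person IDs and marker names contain only word characters and that values never contain spaces or commas.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one profile line into (person ID, { marker => [values] }).
# Returns an empty list if the line has no "id:" prefix.
# Hypothetical helper, for illustration only.
sub parseProfile
{
  my ($line) = @_;
  my ($id, $rest) = $line =~ /^\s*(\w+)\s*:(.*)$/s or return;
  my %loci;
  # each locus entry: a marker name, then the values inside brackets
  while ($rest =~ /(\w+)\s*\(([^)]*)\)/g)
  {
    my ($marker, $inner) = ($1, $2);
    # grab every run of non-space, non-comma characters, which
    # quietly skips extra whitespace and trailing commas
    my @values = $inner =~ /([^\s,]+)/g;
    $loci{$marker} = \@values;
  }
  return ($id, \%loci);
}

my ($id, $loci) = parseProfile('personID0: TH01 (8 , 8.3 ) ; FGA ( 51 , 49 ) ;; ; Amel (F , F ) ;;');
print "$id $_ = @{ $loci->{$_} }\n" for sort keys %$loci;
```

Stray semicolons between entries are ignored because the pattern only ever looks for a marker name followed by a bracketed list.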

I needed to provide some demo code to the client to generate test data for system throughput and performance testing. That was the primary objective. The other objectives were to avoid supporting a data-generating application or script in perpetuity, to avoid writing a formal application installer (for all MacOS/Unix, Windows x86/x64, Linux, …), and to avoid providing binary files with all their hidden dependency and security concerns. Java is an obvious candidate for overcoming some of these issues, but I find it too buggy, and on top of that there is all the routine pain of classpaths, jar hunting, and more that makes the technology very unappealing.

I decided that Perl was the right technology: it has been around for a while, it runs robustly on any old platform, and from a distance it looks like just another curly-brace language, so it already feels familiar to me. Given that this task is nothing more than a little ‘scripting job’, the choice seemed well matched to the task at hand.

Unfortunately I am not a natural Perl programmer, so a supposedly 10-minute job bloated out to several hours. A sanitised version of the code I produced is below:

#!/usr/bin/perl -w

#shuffle a 1D array of numbers
 sub shuffle1D
 {
   my $aOfN=shift;
   for (my $i= @$aOfN;--$i;)
   {
    my $j = int rand($i+1);
    @$aOfN[$i,$j] = @$aOfN[$j,$i];
   }
 }

#returns $_[0] repeated a random number of times, from 0 to $_[1]-1
 sub rnsc
 {
  return $_[0] x int rand($_[1]);
 }

#returns a random number of spaces between 0 and 4
 sub rns
 {
  return rnsc(' ',5);
 }

#returns random array element from 1D array
 sub rnElement
 {
  my $aOfN=shift;
  return $aOfN->[int rand(scalar(@$aOfN))];
 }

my @sgmPlus=(
 # marker  low                                                     high
 # ------- ------------------------------------------------------- ---------------------------------------------------------------
  [ "D2", [14..29,15.2,16.2,17.2,18.2,19.2,20.2,21.2,22.2,23.2,
          24.2,25.2,26.2,27.2,F],                                  [14..29,15.2,16.2,17.2,18.2,19.2,20.2,21.2,22.2,23.2,24.2, 25.2,26.2,27.2]],

  [ "D3", [9..21,12.2,13.2,14.2,15.2,16.2,17.2,18.2,F],            [9..21,12.2,13.2,14.2,15.2,16.2,17.2,18.2,F]],

  [ "D8", [6..20,8.2,9.2,10.2,11.2,12.2,13.2,14.2,15.2, 16.2,17.2,
          18.2,F],                                                 [6..20,8.2,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2, 18.2,F]],

  [ "D16", [4..16,5.2,6.2,7.2,8.2,9.2,10.2,11.2,12.2,13.2,14.2,F], [4..16,5.2,6.2,7.2,8.2,9.2,10.2,11.2,12.2,13.2,14.2,F]],

  [ "D18", [6..28,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2,
           18.2,19.2,20.2,21.2,22.2,23.2,24.2,25.2,26.2,F],        [6..28,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2, 19.2,20.2,21.2,22.2,23.2,24.2,25.2,26.2,F]],

  [ "D19", [7..17,9.2,10.2,11.2,12.1,12.2,13.2,14.2,15.2,16.2,
           17.2,18.2,F],                                           [7..17,9.2,10.2,11.2,12.1,12.2,13.2,14.2,15.2,16.2,17.2,18.2,F]],

  [ "D21", [23..39,24.2,24.3,25.2,26.2,27.2,28.1,28.2,28.3,29.1, 
           29.2,29.3,30.1,30.2,30.3,31.1,31.2,31.3,32.1,32.2,
           32.3,33.1,33.2,33.3,34.1,34.2,34.3,35.1,35.2,35.3,
           36.2,37.2,F],                                           [23..39,24.2,24.3,25.2,26.2,27.2,28.1,28.2,28.3,29.1,29.2, 29.3,30.1,30.2,30.3,31.1,31.2,31.3,32.1,32.2,32.3,33.1, 33.2,33.3,34.1,34.2,34.3,35.1,35.2,35.3,36.2,37.2,F]],

  [ "Amel", [X,F],                                                 [X,Y,F]],

  [ "FGA", [16..33,42..51,17.2,18.2,19.2,20.2,21.2,22.2,23.2,24.2,
           25.2,26.2,27.2,28.2,29.2,30.2,31.2,32.2,33.2,42.2,43.2,
           44.2,45.2,46.2,47.2,48.2,50.2,51.2,F],                  [16..33,42..51,17.2,18.2,19.2,20.2,21.2,22.2,23.2,24.2, 25.2,26.2,27.2,28.2,29.2,30.2,31.2,32.2,33.2,42.2, 43.2,44.2,45.2,46.2,47.2,48.2,50.2,51.2,F]],

  [ "TH01",[3..13,5.1,5.3,6.1,6.3,7.1,7.3,8.1,8.3,9.1,9.3,13.3,F], [3..13,5.1,5.3,6.1,6.3,7.1,7.3,8.1,8.3,9.1,9.3,13.3,F]],

  [ "vWA", [9..25,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2,19.2,
           20.2,21.2,22.2,23.2,24.2,F],                            [9..25,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2, 19.2,20.2,21.2,22.2,23.2,24.2,F]]
 );

my @locusOrder=(0..$#sgmPlus);
 for (my $i=0;$i<10;$i++)
 {
  shuffle1D(\@locusOrder);
  my $markersMissed = int rand(3); #number of markers (0 to 2) short of a full SGM+ profile

  print rns().'personID'.$i.rns().':'.rns();
  for (my $j=0;$j<scalar(@locusOrder)-$markersMissed;$j++)
  {
   my $sgmIndex = $locusOrder[$j]; 
   my $one = rnElement($sgmPlus[$sgmIndex][1]);
   my $two = rnElement($sgmPlus[$sgmIndex][2]);
   my $desigList = "(".rns().$one.rns().",".rns().$two.rns();
   $desigList = $desigList.((rand(100)>98.5)?",".rnElement($sgmPlus[$sgmIndex][2]):""); #trisomy, infrequent but certainly possible
   $desigList = $desigList.")".rns().';'.rns().rnsc(';',2).rns().rnsc(';',2);
   print rns().$sgmPlus[$sgmIndex][0].rns().$desigList.rnsc(';',3);
  }
  print "\n";
 }
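
One possible simplification, offered as a sketch and not as a tested replacement for the script above: the core List::Util module provides a shuffle function, which could stand in for the hand-written shuffle1D. The helper name pickMarkers and the plain list of marker names are my own, for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return the marker names in random order, dropping $missed of them.
# Hypothetical helper, for illustration only.
sub pickMarkers
{
  my ($markers, $missed) = @_;
  my @order = shuffle(@$markers);
  my $keep = @order - $missed;
  $keep = 0 if $keep < 0;         # never ask for a negative count
  return @order[0 .. $keep - 1];  # empty list when $keep is 0
}

my @markers = qw(D2 D3 D8 D16 D18 D19 D21 Amel FGA TH01 vWA);
print join(' ', pickMarkers(\@markers, int rand(3))), "\n";
```

Because shuffle returns a new list, the original marker list is left untouched between profiles.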

If anyone reading this blog can offer insight into how I can achieve the result more efficiently, feel free to drop me a line or comment below. I won’t be offended. I would actually welcome the feedback, as I am in the process of updating C$WILDNA1 for Oracle 12c. C$WILDNA1 is the COTS product that consumes the file of a million-plus sample DNA profiles.

Example output from executing this script is shown below. To produce more than 10 random SGM+ DNA profiles, change the constant 10 in the outer for loop (line 70 of the script) to something larger.

chemistry:~/sgmplus# ./sgmPlusTestData.pl

personID0 :     TH01  (    8.3   ,  5.1    )   ;    ;  D19    (  17.2,   12.2 ,9);    ; ;;   D2   (   20  ,27.2);       ;  Amel    ( X   ,    X   )   ; ; ; D16    ( 7  , 15 )   ;    ;   FGA(28  , 48) ; ;;D8 (16 ,15.2   )    ;      ;;;  D18 (   22.2 ,14 ) ; ;   D3    (12, 17.2   );     ;;    vWA  (17,  23   );      ;;;    D21 ( 34    ,  28)  ;      ;;;

personID1    :    D8(  6,  F   ) ;   ; ;  Amel (X  ,    X)  ;;  ;    D18  (    9.2    ,9.2 )  ;;  ; FGA   (    27 ,   31.2   )   ;;   ;;;  TH01    (13.3, 9.1 )   ;    ;; vWA   (  21.2    ,F)    ;;    D2 ( 23 ,20.2)    ;  ;;D19 ( 16.2 ,15 ) ;  ;;; D21 (32 ,28.2  )  ;    ;;   D16 ( 6    ,   12   ) ;  ;  ;;

personID2  :  D19( F,  17   )   ;     ; D18(    16.2   ,  17  )    ;    ;  ;;;  D8(15    ,   12    )    ;   ;  ;;  D16   (  6    ,    9.2  )    ;;   ; TH01    (  10   , 8    )    ;;;;;    D3(    17  ,   14.2)    ;   ;   ;; vWA  (  15 ,  F  )    ;    ;;;  FGA   (    51 ,   31 )   ;    ;;    D2    ( 27.2, 26.2  ) ;    ; ;;

personID3 :     TH01  (  5  ,   6.1  )  ;     ;;   D16   (7,8    ) ; ;;;    D19  (  9    ,    12.2)   ; ;  ;  D2  (  26.2    ,16  )   ;    ;;;    D18  (    18.2    ,   18 );  ; ;;  D21  (   32.1  ,   32.3)  ;    ;    Amel(    F  ,    Y    )   ;   ;; D3  (   17.2   ,  18) ;    ;   ;D8   (  11.2,   9   )  ;    ;  ;

personID4 :      vWA    (   13.2    , 17    );        D3 (15 , 18 ) ;     D2  (    25.2,   23  )    ;  ; ;;;Amel  (  F   ,  Y); ;  D21 (27, 26.2    )  ;  ;;;  FGA  (  20    ,   28.2)  ;  ;;;TH01   ( 11 ,   9.3    )  ; ;;;  D16    (  11   ,  14.2  ) ;    ;;    D19   (12.2,   9)  ;; ;;;  D18(    19 ,  13.2  ,23.2)   ;    ;;;    D8(   F    ,   18.2 ) ;; ;

personID5    :     D2    (    15.2  ,  22   )   ;  ;     D18   ( 21,    15   );   ;  ;; D19    (    13 ,  7)    ;  ;   ;;  TH01( 3, 5.3    ) ;    ;;;    vWA(   12.2    ,    16 )   ;   ;  ;; D8    ( 10  ,10.2 ) ;  ;  D21 (34.2    ,33.1    )   ;    ;       FGA    (   32.2  ,45.2    )  ; ;;  D16  (    5  ,    8.2  )   ;   ;    ;;;   Amel    (    X,  Y);

personID6   :    D3   (   10   , 9   )  ;       ;;  vWA    ( 10, 13   ) ;        ;   D18    (  14.2 ,   18.2)    ;    ;  ;;TH01   (   8   ,9  ) ;  D16 (   6.2   ,   6.2    )   ;  ;   ;;;   Amel(  X   ,    Y  );  ;     FGA   (    48   ,    48  );    ;  ;;    D8(    8,16.2  ) ;;;    D21   (   32.2 ,    35.1    );   ;  D2  (    29    , 27.2    );;

personID7 :  D21( 25    ,    30    )   ;        ;; D16  (    13.2 ,  12    )    ; ;Amel  ( F,    Y  );    ;TH01    (   7.3    ,    11  )    ;   ;; D2(  16.2  , 19.2   ) ;  ;;    D8  (  11.2, 7   )    ;    ;   ; D3 (   18.2   ,14.2  )    ;    ; ;;   D19   (15,    15  )    ;    FGA   ( 29  , 44.2   ); ;;   vWA   (  16  ,  16.2 )   ;;;;   D18    (    28    , 23 )    ;;  ;

personID8   : D19    (    17,   17)    ;     ;; TH01  (    12,  7.3   )  ;    ;       D8 ( 17.2   ,  8.2   )   ;     ;FGA(33 ,    18  )  ;   ;    D21(  31.2 ,  34.3 )    ;    ;    ;; Amel  (X,   X    );    ;   ;vWA(   19.2,  19.2    )    ; D18    (    8 ,    25) ;   ;   D16( 8.2 ,14.2   )  ;    ;;    D3   (    10   ,    15.2 ) ;    ;

personID9   :    Amel   (F  ,   F )  ;    ;   D3( 10  ,    13  );     ;;   D8   ( 6  ,15    )   ;   ;;;  D16   (   12   ,    14.2) ;    ; ;;D2    (29  ,26.2  ) ;    D21   (  31.2 ,  34.2  );  ;  ;;;  vWA (   9    ,    23 ,23.2)   ;  ;  ;    D18   (27  ,    14.2  )  ; ; FGA  (    28.2 ,  47    )  ;

chemistry:~/sgmplus#

— Published by Mike, 21:35:20 07 August 2016
