We first cleaned the raw sequencing files (fastq sequencing files in Illumina 1.3+ format) to remove any low quality reads and potential contaminating vector sequences. To ensure high quality of data, we discarded reads where one or more position had a phred quality score below 3 (in either paired end). We then aligned the reads against the reference transcriptome using bwa (aln and sampe command, default parameters except for read trimming parameter (-q) set at 20 to ensure high quality alignments ). Following alignments, we used the Indel Realigner from the genome analysis toolkit (gatk ) to correct alignment errors near indel regions. We used samtools to extract the base-pair information at each site for each individual (pileup command, 17 million sites). Note that one individual (PI586932) was not included in the analysis given low sequencing yield (see results section). We parsed this output (base-pair information at each site for each individual) based on several criteria. First, sites with a read depth of less than three were called as missing. Then, if that site passed this first filter, the genotype was called as heterozygous only if the minor allele was represented by more than two reads and the minor allele frequency was at least 10%. At this point, we concatenated information from all individuals for all sites. From this dataset, we filtered on a per site basis, in order to keep only sites for which no more than 20% of all individuals had missing data, expected heterozygosity (He) was greater than 0.2 and observed heterozygosity (Ho) was smaller than 0.6. Polymorphic sites with values above the latter threshold likely represent paralogous sequence variants instead of true SNPs. Conversely, polymorphic sites with values below the former threshold either represent rare alleles, which possess little information for the sake of differentiating individuals, or sequencing errors (unless very high coverage is attained). Nevertheless, we are aware that our analysis likely contains a small fraction of false positives due to alignment and/or sequencing errors. However, given the large amount of data, high overall coverage, strict quality thresholds cut-offs and visual inspection of random subsets of alignments (several tens of kilobases), we expect the data to be more than sufficient for the analysis we report here. This is evident when the same trends are observed when conducting replicate analysis using small random subsamples (for example in the structure runs, see results).