Protein-coding introns in mitochondrial genomes

I am studying the mitochondrial genome and have read that some contain introns. However, these introns code for proteins. I cannot really understand this. Could someone tell me which introns, in which mitochondrial genomes, this refers to?

Mitochondrial genomes differ greatly in size, coding potential and even whether they are circular or linear. Mammalian mitochondrial DNA is small (about 16-17 kbp; animal mitochondrial genomes as a whole span roughly 11-28 kbp) and intronless. However, the mitochondrial genomes of certain other organisms range up to 1000 kbp in size.

Certain sponges (demosponges) with large mitochondrial genomes contain group I and group II introns. Although introns were initially thought to have no other function than to separate exons, the introns of certain nuclear genes have themselves been found to contain genes. This turns out also to be true of some mitochondrial introns. To quote from the introduction to a paper I found through an internet search:

Most Group I introns encode homing endonuclease genes (HEG) and/or maturase of the LAGIDADG † family, while most Group II introns encode a reverse transcriptase (RT).

These introns are found in genes such as that encoding cytochrome oxidase subunit 1 (COI).

† This appears to be a misprint for the LAGLIDADG maturases, which are also endonucleases, and are involved in splicing the introns in which they reside.

Proposal of a new nomenclature for introns in protein-coding genes in fungal mitogenomes

Fungal mitochondrial genes are often invaded by group I or II introns, which represent an ideal marker for understanding fungal evolution. A standard nomenclature of mitochondrial introns is needed to avoid confusion when comparing different fungal mitogenomes. Currently, there is a standard nomenclature for introns present in rRNA genes, but none for introns present in protein-coding genes. In this study, we propose a new nomenclature system for introns in fungal mitochondrial protein-coding genes based on (1) a three-letter abbreviation of the host scientific name, (2) the host gene name, (3) one capital letter, P (for group I introns), S (for group II introns), or U (for introns of unknown type), and (4) the intron insertion site in the host gene according to the cyclosporin-producing fungus Tolypocladium inflatum. The suggested nomenclature proved feasible when naming introns present in the mitogenomes of 16 fungi of different phyla, including both basal and higher fungal lineages, although minor adjustment of the nomenclature is needed to fit certain special conditions. The nomenclature also has the potential to name plant/protist/animal mitochondrial introns. We hope future studies follow the proposed nomenclature to ensure direct comparison across different studies.


One of the defining and essential features of life is genetic material. An organism’s genome is the complete set of all genes and genetic material that is present in that organism or individual cell. Often we think of genes in terms of protein-coding genes, or genes that are transcribed into mRNAs and then translated into protein; however, genomes consist of a lot more than just protein-coding genes. In addition, the features of prokaryotic and eukaryotic genomes differ in terms of both size and content.
The image below shows the different ranges of genome sizes in different taxonomic groups of life. Note that, in general, prokaryotic genomes are smaller than eukaryotic genomes. However, eukaryotic genome sizes vary wildly and are not linked to organismal “complexity.” Refer to this diagram as you read on about the differences and similarities between prokaryotic and eukaryotic genomes.

Genome sizes, from Wikipedia

Prokaryotic Genomes

  • The genomes of Bacteria and Archaea are compact; essentially all of their DNA is “functional” (contains genes or gene regulatory elements).
  • The sizes of prokaryotic genomes range from about 1 million to 10 million base pairs of DNA, usually in a single, circular chromosome.
  • Genes in a biochemical pathway or signaling pathway are often clustered together and arranged into operons, where they are transcribed as a single mRNA that is translated to make all the proteins in the operon.
  • The size of prokaryotic genomes is directly related to their metabolic capabilities – the more genes, the more proteins and enzymes they make.

Eukaryotic Genomes

  • The genome sizes of eukaryotes are tremendously variable, even within a taxonomic group (so-called C-value paradox).
  • Eukaryotic genomes are divided into multiple linear chromosomes; each chromosome contains a single linear duplex DNA molecule.
  • Eukaryotic genes in a biochemical or signaling pathway are not organized into operons; one mRNA makes one protein.
  • Many eukaryotic genes (most human genes) are split; non-coding introns must be removed and the exons spliced together to make a mature mRNA. Introns are “intervening” sequences in genes that do not code for proteins. The image below shows a zoomed-in region of a gene highlighting the alternating exons and introns.

One gene is transcribed, then spliced in different ways to produce mRNAs that encode related proteins from different exon combinations.
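
As a toy illustration of splicing and of how one gene can yield more than one mRNA, here is a minimal Python sketch. The transcript sequence and the exon coordinates are made up for the example, and real splicing depends on splice-site recognition that is not modelled here.

```python
def splice(pre_mrna, exons):
    """Join exon segments of a transcript, dropping the introns between them.

    exons: list of (start, end) pairs, 0-based and end-exclusive,
    listed in transcript order.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCC" + "guaaguuuuag" + "GAUUAC" + "guccccag" + "UGGUAA"
exons = [(0, 6), (17, 23), (31, 37)]

# Mature mRNA containing all three exons.
full_length = splice(pre_mrna, exons)

# Alternative isoform that skips the middle exon.
exon_skipped = splice(pre_mrna, [exons[0], exons[2]])

print(full_length)   # AUGGCCGAUUACUGGUAA
print(exon_skipped)  # AUGGCCUGGUAA
```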

What accounts for the variation in genome size?
There is no good correlation between the body size or complexity of an organism and the size of its genome. Eukaryotic genomes sequenced thus far have up to ~30,000 protein-coding genes, or less than 10-fold variation in the number of genes. The human genome has about 21,000 protein-coding genes (recently revised to as few as ~19,000 genes). Therefore, the 10,000-fold variation in eukaryotic genome size is due mostly to varying amounts of non-coding DNA.
Here is a quick comparison of the genome size and predicted gene number for a sampling of eukaryotes:

It’s very interesting to note that humans have about the same number of genes as the microscopic nematode worm, C. elegans, and fewer genes than rice.

What’s in the human genome?

The content of the human genome, from Wikipedia

  • Protein-coding (exon) DNA sequences comprise less than 2% of the human genome.
  • Introns make up just over 1/4 of the human genome.
  • Transposable elements and DNA derived from them make up about 1/2 of the human genome. Transposable elements are essentially “parasitic” DNA that resides in a host genome, taking up space in the genome but not contributing useful or functional sequences to the genome. They are the DNA transposons, LTR retrotransposons, LINEs and SINEs.
  • Because they are parasitic DNA elements, transposable elements are extremely valuable for studying evolutionary relationships. If a transposable element “invades” an organism’s genome, then it is likely to remain in that genome as the population evolves and when speciation occurs. If the same transposable element is present in the same location in the genomes of two different species, this is strong evidence that those two species share a recent common ancestor who also had the transposable element in its genome.
  • One family of SINEs, called the Alu element, is a 300-nucleotide sequence that is present in over 1 million copies in human and chimpanzee genomes.
  • Segmental duplications are relatively long (> 1 kb; kb = 1,000 bp) segments of DNA that have become duplicated. These duplications create copies of genes that can mutate and acquire new functions. Gene families (e.g., alpha- and beta-hemoglobin, myoglobin) arose this way.
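
To get a feel for these proportions, the fractions in the list can be turned into rough base-pair counts. The sketch below assumes a haploid human genome of about 3.1 billion bp, a figure not stated in the text above; the other numbers come straight from the list.

```python
GENOME_BP = 3.1e9  # ASSUMPTION: approximate haploid human genome size in bp

# Alu elements: a 300-nucleotide sequence present in over 1 million copies.
alu_bp = 300 * 1_000_000
alu_fraction = alu_bp / GENOME_BP

# Rough sizes of the other categories named in the list.
exon_bp_upper = 0.02 * GENOME_BP   # "less than 2%" is protein-coding
intron_bp = 0.25 * GENOME_BP       # "just over 1/4" is intronic
transposon_bp = 0.5 * GENOME_BP    # "about 1/2" is transposon-derived

print(f"Alu elements alone: at least {alu_fraction:.1%} of the genome")
print(f"Protein-coding exons: under {exon_bp_upper / 1e6:.0f} Mb")
```

Under that assumption, a single SINE family accounts for roughly a tenth of the genome, several times more than all protein-coding exons combined.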

Is the human genome 80% “junk” or 80% functional?
Recent publication of data and papers from the ENCODE project, a systematic survey of the human genome variation and activity from chromatin modifications to transcription, has claimed that, contrary to previous belief, fully 80% of the human genome has at least some biochemical activity, such as transcription (The ENCODE Project Consortium, 2012). Indeed, many small RNAs, called microRNAs (miRNAs) with important regulatory roles are transcribed from intergenic regions. However, these miRNAs and other regulatory RNAs comprise less than 1% of the human genome, and other studies have indicated that only 10% of the genome appears to be subject to some evolutionary constraint (review by Palazzo and Gregory, 2014).

DNA sequencing
The human genome project was accomplished by large banks of automated sequencers that used the Sanger dideoxy sequencing technology. In recent years, however, massively parallel sequencing technologies have brought down the cost and raised the throughput of DNA sequencing much faster than computing speed and power have increased (Moore’s Law).

Being able to obtain huge amounts of DNA sequence quickly and cheaply has startling implications for biological research in all fields, and for human health. A TED talk by Richard Resnick discusses some of the applications.

Results and discussion

Mitochondrial genetic diversity across S. cerevisiae population

We explored 1011 S. cerevisiae sequenced isolates [47] to investigate the intraspecific mitochondrial genome diversity and evolution. Since mitochondrial genomes include variable long AT-rich intergenic regions that are difficult to compare, we first focused on the eight mitochondrial coding DNA sequences (CDSs). From 698 de novo genome assemblies, we collected the eight complete CDSs. Out of these, 553 isolates also had a complete or nearly complete mitochondrial sequence. A subset of 353 genome sequences did not have any ambiguous base across the CDSs (Additional file 1: Figure S1, Additional file 2: Table S1). We estimated the global genetic diversity by the average pairwise divergence π. Overall, we observed lower diversity in the coding nuclear (π ~ 0.003) [47] compared to the mitochondrial sequences (π ~ 0.0085, Additional file 2: Table S2), which contrasts with what was previously observed for other yeast species (Additional file 1: Figure S2). This opposing trend, more similar to the pattern observed in animals than in fungi [19, 48], is consistent with S. cerevisiae having experienced rapid evolution of mitochondrial genes after the whole genome duplication [49].
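
For readers unfamiliar with π, the following is a minimal sketch of an average pairwise divergence calculation on aligned sequences. It is not the pipeline used in this study: the function name and the toy alignment are illustrative, and gaps and ambiguous bases are simply treated as ordinary characters.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Average proportion of differing sites over all pairs of aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    total = sum(
        sum(a != b for a, b in zip(s1, s2)) / length
        for s1, s2 in pairs
    )
    return total / len(pairs)


# Toy alignment (made up): three sequences of 10 sites.
toy = ["ATGCATGCAT", "ATGCATGCAA", "ATGCTTGCAT"]
print(round(pairwise_diversity(toy), 4))  # 0.1333
```

The three pairs differ at 1, 1 and 2 of the 10 sites, so π is (0.1 + 0.1 + 0.2) / 3.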

We observed sharp differences in the genetic divergence of the nuclear and mitochondrial genomes between wild and domesticated isolates. In wild clades, despite higher nuclear divergence (up to 1.1% at CDS level), the mitochondrial CDS genetic distance reaches its maximum of ~0.4% of nuclear divergence and plateaus afterwards. In contrast, mitochondrial sequence divergence between domesticated clades has a larger increase, reaching its maximum at lower nuclear divergences (Additional file 1: Figure S3). This difference in variation is observed across all the mitochondrial CDSs, whose values of π are systematically higher in domesticated compared to wild isolates.

The shortest CDSs, ATP8 and ATP9, have the lowest proportion of polymorphic sites (~2%) and the lowest values of π (0.003 or less) and lack non-synonymous mutations. In contrast, COX1 and COX2 are highly polymorphic. Although COX1 has the highest proportion of polymorphic sites (8%), COX2 has the highest π value (0.0163, Table 1). We used the discriminant analysis of principal components (DAPC) [50] to evaluate the contribution of specific genes to classifying mitochondrial ‘haplotypes’ and population clustering. We quantified that ATP6 and COX2 account for 38% and 28% of population clustering, respectively. This observation supports the widespread usage of COX2 in mitochondrial phylogeny (Fig. 1a) [37, 38, 51, 52].

Fig. 1 Allele distribution across mitochondrial CDSs. a Distribution of major (blue) and minor (red) alleles for the 259 polymorphic positions in the 234 S. cerevisiae unique complete profiles (which include 353 isolates). Profiles are ordered according to their phylogenetic relationship using the phylogenetic neighbour-joining tree (left side). b The numbers of unique alleles for each mitochondrial CDS show a dramatic difference between genes

Next, we generated a non-redundant allele database. We observed a variable number of distinct CDS alleles (Fig. 1b), resulting in a high proportion of unique allelic profiles (234 out of 353 isolates, Additional file 1: Figure S1). We used these non-redundant allelic profiles as a proxy to investigate the mitochondrial genome distribution in the population. Globally, we observed a poor overlap between the mitochondrial and nuclear genome phylogenetic lineages [47], with few exceptions that include near-to-clonal nuclear genome lineages having specific mitochondrial profiles. These exceptions include a Sake subclade, the two clinical Wine/European subclades (Y′ amplification and S. boulardii), the North American and the reproductively isolated Malaysian clades [53]. In contrast, the mixed origin clade [47], which has highly diverse ecological (e.g. bakeries, beer, plants, animal, water, clinical sample) and geographical origins (e.g. Europe, Asia, Middle East, America), shows low mitochondrial intra-clade difference despite substantial nuclear genome variation (Additional file 1: Figure S4). Indeed, across the mixed origin clade, only very similar profiles of mitochondrial genes segregate, with variants limited to COX1 and VAR1, resulting in very low π (0.00008) compared to other clades (~0.001, Additional file 2: Table S2).
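
The idea of a non-redundant allele database and of allelic profiles can be sketched as follows. This is an illustration only, not the code used in the study: the function name, the data layout and the toy sequences are hypothetical, and every isolate is assumed to carry a complete sequence for every gene.

```python
from collections import defaultdict


def allelic_profiles(isolate_cds):
    """Group isolates by their combination of CDS alleles.

    isolate_cds: dict mapping isolate name -> dict of gene name -> sequence.
    Returns (alleles_per_gene, profiles), where profiles maps each distinct
    profile (a tuple of per-gene allele indices) to the isolates carrying it.
    """
    genes = sorted({g for cds in isolate_cds.values() for g in cds})
    allele_ids = {g: {} for g in genes}  # gene -> sequence -> allele index
    profiles = defaultdict(list)
    for isolate, cds in isolate_cds.items():
        # A sequence seen for the first time gets the next free index.
        profile = tuple(
            allele_ids[g].setdefault(cds[g], len(allele_ids[g])) for g in genes
        )
        profiles[profile].append(isolate)
    alleles_per_gene = {g: len(ids) for g, ids in allele_ids.items()}
    return alleles_per_gene, dict(profiles)


# Toy input (made-up sequences): two isolates share one profile.
toy = {
    "iso1": {"COX1": "ATGA", "COX2": "ATGC"},
    "iso2": {"COX1": "ATGA", "COX2": "ATGC"},
    "iso3": {"COX1": "ATGT", "COX2": "ATGC"},
}
alleles_per_gene, profiles = allelic_profiles(toy)
print(alleles_per_gene)  # {'COX1': 2, 'COX2': 1}
print(len(profiles))     # 2 unique profiles among 3 isolates
```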

The VAR1 gene is particularly variable, highly AT-rich and prone to non-synonymous mutations and indels. These indels mostly represent GC-rich byp-like elements able to cause jumps in protein translation in other yeast species [33]. Two positions were described, one named ‘common’ and another downstream with the GC cluster in inverted orientation [54]. We identified 35 allelic variants of the VAR1 gene harbouring these two clusters in 117 isolates, mainly belonging to the mosaic groups (N = 52) (Additional file 2: Table S1). While most of the reported cases harboured the GC cluster either in the common position (N = 91, accounting for 18 different VAR1 alleles) or in both positions (N = 6, across 4 VAR1 alleles) [54], a large fraction of the allelic variants observed here harboured the GC cluster only in the second site (N = 19, across 13 VAR1 alleles). We also discovered two novel variants, one with the GC cluster at the common position but in inverted orientation (2 isolates) and the second with the GC cluster in tandem duplication at the second position (3 isolates).

In addition to the canonical ORFs, we characterized the four non-canonical ORFs F-SceIV (OMEGA intron), F-SceI (RF3), RF2 and F-SceIII (RF1) [31, 32]. F-SceIV is relatively uncommon in the population (198 isolates), while F-SceI, RF2 and F-SceIII are more widespread (447, 542 and 477 isolates, respectively). These three ORFs are known to contain GC clusters, which often introduce frameshifts in the sequences. In F-SceIII, we identified three GC cluster positions. The first position is particularly rare (43 isolates), and in two cases, the GC cluster is truncated. The other two GC clusters are much more abundant (in 277 and 206 isolates, respectively). We identified 6 distinct GC cluster positions in both RF2 and F-SceI (see Additional file 2: Table S1). Altogether, these results uncovered high variability of mitochondrial sequences across the S. cerevisiae natural population.

Extensive admixture of mitochondrial genomes

We investigated the mitochondrial genome population structure using the eight concatenated CDSs to calculate the phylogenetic network using SPLITSTREE [55]. The dataset comprises 239 non-redundant CDS profiles, with 234 S. cerevisiae isolates with complete CDS sequences and five S. paradoxus representatives [56] as outgroups. The resulting intertwined network shows a strong interconnectivity of the sequences, pointing to frequent historical recombination (Fig. 2a). In contrast, classical phylogenetic trees are unable to consistently group the isolates (Fig. 2b). Using ADMIXTURE [57], we observed that the opposite edges of the trees fall in the same population for low K values (K = 2–3), further underlining the poor grouping.

Fig. 2 Complex mitochondrial genome phylogeny. Only S. cerevisiae isolates with complete CDS data have been used (N = 353) in addition to five S. paradoxus isolates used as an outgroup. a Phylogenetic network of non-redundant concatenated CDS sequences (N = 237 profiles) produced a highly intertwined network driven by recombination with few groups of closely related strains. b The rooted tree (left) shows a weak topology with few nodes (red) with bootstrap values over 75. ADMIXTURE analysis of genomic components (right) with K ranging from 2 to 15 confirms the high degree of mosaicism. The highly divergent Taiwanese lineage (green dot) is not divergent from the other lineages, in contrast to the nuclear genome phylogeny

Mitochondrial population structure appears to poorly reflect the clustering obtained from the nuclear genome. For example, the early divergent Taiwanese lineage based on nuclear genome does not show higher sequence distance. However, isolates belonging to the mosaic groups of the S. cerevisiae population show the highest degrees of admixture, indicating that outbreeding has impacted both mitochondrial and nuclear genomes.

We then calculated the coefficient of concordance ‘W’ using the congruence among distance matrices (CADM) metric [58] to be 0.79, with 0 indicating complete disagreement and 1 complete agreement between distance matrices. This value indicates a relatively good concordance between the phylogenetic networks of mitochondrial and nuclear genomes. This is likely driven by isolates with very close mitochondrial sequences often also having similar nuclear genome sequences, while the main branches of the mitochondrial tree are discordant. We further compared phylogenetic trees and networks based on concatenated sequences derived from the 8 mitochondrial CDSs and 8 nuclear genes previously used for phylogenetic studies [59, 60] in a selection of 14 isolates (Additional file 1: Figure S5). Consistently, mitochondrial sequences resulted in a wider network, implying a less defined phylogenetic structure and more pronounced admixture, with early branching lineages falling within the worldwide non-Chinese lineages. Overall, our results highlight a pronounced separation in evolutionary histories of the two coexisting genomes, and the extensive mitochondrial genome admixture provides additional support to its mitochondrial inheritance requiring recombination-driven replication [34, 61, 62].
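
The coefficient of concordance underlying CADM is Kendall's W computed on the ranked pairwise distances of each matrix. The sketch below shows that core calculation only. It is a simplified illustration rather than the CADM implementation cited above: it omits the tie correction and the permutation test, and the helper names and toy matrices are hypothetical.

```python
def upper_triangle(matrix):
    """Flatten the off-diagonal upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def kendall_w(distance_vectors):
    """Kendall's W (no tie correction) for m vectors of n distances each."""
    m = len(distance_vectors)
    n = len(distance_vectors[0])
    ranked = [ranks(v) for v in distance_vectors]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy matrices (made up) for the same four isolates in the same order.
nuclear = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
mito = [[0, 2, 4, 6], [2, 0, 8, 10], [4, 8, 0, 12], [6, 10, 12, 0]]
w = kendall_w([upper_triangle(nuclear), upper_triangle(mito)])
print(w)  # 1.0, because both matrices rank the isolate pairs identically
```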

Interspecies introgressions of mtDNA are rare

We recently described four clades (namely Alpechin, Mexican Agave, French Guiana and Brazilian bioethanol) with abundant S. paradoxus interspecies introgressions in the nuclear genome [47]. We analysed the mitochondrial CDSs to search for introgressed alleles. The four clades with abundant nuclear genome introgressions did not show any S. paradoxus mitochondrial alleles. Nevertheless, two isolates from America (CQS, YCL) and one from Africa (ADE), all genetically related to the French Guiana and the Mexican Agave clades, harbour two distinct patterns of S. paradoxus mitochondrial introgressions. We retrieved the complete CDS set for two of them (CQS and YCL), while the third (ADE) is incomplete but very close to YCL. The mitochondrial introgression in the YCL (YJM1399) strain was already reported, but no further analyses were presented [28]. We generated a set of polymorphic markers (see Methods) to accurately identify the introgression boundaries. The S. cerevisiae major alleles were identified from the 1011 isolates, whereas for S. paradoxus, they were derived from 23 North American isolates for which full chromosome sequences were available [21, 56]. Eurasian S. paradoxus isolates were not included because of their similarity with S. cerevisiae sequences, likely due to an ancient introgression event from S. cerevisiae to S. paradoxus [21, 56, 63]. We generated a catalogue of 110 polymorphic positions and derived different alleles between the two species. Several genes in these two isolates were catalogued either as partially or fully introgressed (Fig. 3a). Since the frequency of some alleles is close to 50% and often the less common allele of one species is the more common allele of the second one, there is a chance of calling false-positive introgressions. Nevertheless, long consecutive series of S. paradoxus markers in the COB, ATP9, COX1, COX2 and COX3 genes in YCL, as well as those in the COB, COX1, COX2 and COX3 genes in CQS, are likely to be genuine.
The absence of traces of introgression in S. cerevisiae isolates from Europe could be explained by the higher sequence similarity with European S. paradoxus, which prevents their detection. However, introgressions between S. cerevisiae and European S. paradoxus isolates could also be prevented by the non-collinearity in the structure of their mitochondrial genomes, which likely impairs recombination [56].

Fig. 3 Rare S. paradoxus introgressions. a Polymorphic markers between S. cerevisiae and S. paradoxus across the mitochondrial CDSs were used to identify introgression events. Introgression boundaries are set as the midpoint between markers. The two bottom rows indicate the frequency in the population of the major or consensus allele (AF), in the specific position and species. b Number of introgressed ORFs in the nuclear genome does not correlate with the percentage of genetic markers of S. paradoxus in mitochondrial CDS. Only isolates with complete unambiguous CDS data were included (N = 353). Position of isolate reported in a is circled in red

We further extended the analysis of the 110 polymorphic sites to 353 isolates with fully assembled CDSs. We observed additional potential cases of mitochondrial introgressions. The isolate YCL mitochondrial sequence harbours over 50% of S. paradoxus markers, possibly indicating a recombinant genome derived from a recent transfer event. In addition, a small number of S. paradoxus markers are found in each S. cerevisiae isolate, perhaps due to incomplete lineage sorting. Overall, the number of S. paradoxus markers in the mitochondrial genomes does not correlate with the number of introgressed ORFs in the nuclear genomes (Fig. 3b), suggesting that the interspecies gene flows were independent due to distinct origin and/or fate.
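
The marker-based logic described here (runs of consecutive S. paradoxus-like markers, with tract boundaries placed at the midpoint between markers) can be sketched as below. This is a hypothetical illustration, not the procedure from the Methods: the function name, the 'C'/'P' encoding, the `min_run` filter and the toy coordinates are all assumptions made for the example.

```python
def introgression_tracts(markers, min_run=2):
    """Find runs of consecutive foreign ('P') markers.

    markers: list of (position, call) sorted by position, where call is
    'C' (S. cerevisiae-like) or 'P' (S. paradoxus-like).
    min_run: ASSUMED minimum number of consecutive 'P' markers to report,
    reflecting the idea that isolated markers may be false positives.
    Returns a list of (start, end, n_markers).
    """
    tracts = []
    i = 0
    while i < len(markers):
        if markers[i][1] != "P":
            i += 1
            continue
        j = i
        while j + 1 < len(markers) and markers[j + 1][1] == "P":
            j += 1
        if j - i + 1 >= min_run:
            # Boundary = midpoint to the flanking marker; at either end of
            # the marker list, fall back to the marker position itself.
            start = (markers[i - 1][0] + markers[i][0]) / 2 if i > 0 else markers[i][0]
            end = (markers[j][0] + markers[j + 1][0]) / 2 if j + 1 < len(markers) else markers[j][0]
            tracts.append((start, end, j - i + 1))
        i = j + 1
    return tracts


# Toy marker calls (made-up coordinates).
toy_markers = [(100, "C"), (200, "P"), (260, "P"), (300, "P"),
               (400, "C"), (500, "P"), (600, "C")]
tracts = introgression_tracts(toy_markers)
print(tracts)  # [(150.0, 350.0, 3)]
```

The single 'P' marker at position 500 is ignored with the default `min_run`, mirroring the caution above about false-positive calls at sites where allele frequencies are close to 50%.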

Intron gain and loss during evolution and dispersal

Two mitochondrial protein coding genes, COB and COX1, harbour introns at multiple sites, and we explored their presence-absence patterns in the whole 1011 isolate collection. COX1 introns are found at varying frequencies (median 0.48) with highly variable presence-absence profiles (Fig. 4a). Intron patterns further support low variability within North American, Malaysian and mixed origin lineages (Additional file 1: Figure S6). In contrast, the groups of loosely related mosaics (M1, M2 and M3 clusters) show the lowest level of intron conservation, consistent with their admixed genetic backgrounds.

Fig. 4 Intron phylogeny underlies both loss and gain events. a The distribution of intron presence and absence is not consistent with the mitochondrial tree phylogeny. The rare introns bi1α and ai3β are highlighted (bold); intron ai4γ was not found in the sequenced collection and is not shown. b Only 4 non-redundant sequences have been found for the COX1 ai3β intron. Their sequences are unrelated to other Saccharomyces species, which could not be used for rooting. The peculiarity of the distribution of this intron could suggest a lineage-specific gain event. c Rooted tree of the COB intron bi1α using S. paradoxus and S. eubayanus sequences as an outgroup. Nodes with bootstrap values below 0.5 have been collapsed. Its presence in multiple highly divergent Asian lineages and in other Saccharomyces species is consistent with intron loss following the out-of-China dispersal. The intron of isolate CQS, which harbours introgressions in both nuclear and mitochondrial genomes, also derives from S. paradoxus. This is compatible with the downstream exonic sequence, which is also introgressed

The COX1 intron frequencies in the population are consistent with a previous report [28], ranging from 26 to 86%. We identified a total of 103 different COX1 intron combinations, with two introns, ai4β and ai5α, that are never found together (ai4β is in 89 and ai5α in 85, out of 408 alleles). Given the linkage between these closely spaced intronic positions, they either arose in two ancestral populations and are unlikely to be brought together by recombination, or the double presence is functionally incompatible. Two additional COX1 introns, ai3β and ai4γ, are very rare in the S. cerevisiae population. While ai4γ is also absent in most related Saccharomyces species, the ai3β intron is present in all of them. The only occurrence of ai3β previously reported in S. cerevisiae was in the YCL isolate, which also contains an S. paradoxus introgression around the intron position in COX1. However, although the ai3β intron is present in S. paradoxus, the ai3β intron sequence of YCL is closer to the one found in Lachancea meyersii [28]. In addition to the YCL allele, we found three other variants of ai3β, all related to the Lachancea sequence. Two variants are present in the YCL and ADE isolates with abundant S. paradoxus introgressions, while the CQS strain has a related version. Additional ai3β introns are present in two Asian isolates and in 19 French Guiana isolates, whose clade is highly introgressed from S. paradoxus (Fig. 4b). The presence of the ai3β intron among these highly introgressed lineages suggests separate lateral transfer events from Lachancea, although it cannot be ruled out that these introns were initially transferred from Lachancea, or a related genus, to S. paradoxus before the introgression occurred.
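
To see why the complete absence of ai4β/ai5α co-occurrence stands out, one can compare it with the count expected if the two introns were distributed independently, and check presence-absence profiles for pairs that never co-occur. The sketch below is illustrative only: independence is a simplifying assumption that the linkage argument above calls into question, and the function names and toy profiles are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Counts from the text: ai4beta in 89 and ai5alpha in 85 of 408 alleles.
# Expected number of alleles carrying both, ASSUMING independence.
expected_joint = 89 * 85 / 408
print(round(expected_joint, 1))  # 18.5 expected, versus 0 observed


def intron_cooccurrence(profiles):
    """Count how many alleles carry each intron and each pair of introns."""
    single = Counter()
    both = Counter()
    for introns in profiles:
        present = sorted(set(introns))
        single.update(present)
        both.update(combinations(present, 2))
    return single, both


def never_together(profiles):
    """List intron pairs that are each observed but never in the same allele."""
    single, both = intron_cooccurrence(profiles)
    return [pair for pair in combinations(sorted(single), 2) if both[pair] == 0]


# Toy presence profiles (made up), one set of intron names per allele.
toy_profiles = [{"ai1", "ai4beta"}, {"ai1", "ai5alpha"}, {"ai1"}, {"ai4beta"}]
exclusive_pairs = never_together(toy_profiles)
print(exclusive_pairs)  # [('ai4beta', 'ai5alpha')]
```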

In contrast, the six COB introns are more uniformly present (frequencies ranging from 88 to 99%, Fig. 4a), with the only exception of the recently described bi1α [28], which occurs at low frequency (~5%). Surprisingly, bi1α is common among the early-branching Asian clades [47]. Other isolates, mainly mosaic ones, also harbour it, but it segregates at low frequency in non-Asian clades. Its presence in several Saccharomyces outgroup species and in the S. cerevisiae early divergent lineages suggests a loss preceding or during the out-of-Asia dispersal. The intron could have been introduced again through secondary contacts with bi1α-positive Asian lines. To test these hypotheses, we constructed a phylogenetic tree using all the bi1α intron sequences and outgroups (Fig. 4c). The bi1α phylogenetic tree shows more variants of Asian sequences compared to non-Asian ones, which mainly cluster in two groups stemming from separate branches of Asian introns, consistent with multiple separate regain events in the worldwide population.

Self-splicing introns have been associated with increased mutation frequencies at the intron/exon boundary [29]. We scanned the exonic sequences in a window of 70 nucleotides both upstream and downstream of each intron in COX1 and COB. Consistently, the highly mobile COX1 introns are associated with a higher frequency of alternative alleles in a 20-nucleotide window adjacent to insertion boundaries (Additional file 1: Figure S7).
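
A window scan of this kind can be sketched as a comparison of variant density close to an insertion site with the density in the rest of the flanks. The code below is a simplified, hypothetical illustration: it counts variant positions rather than allele frequencies, uses 1-based exon coordinates with the intron removed, and the function name and toy positions are made up.

```python
def flanking_variant_density(variant_positions, last_upstream_nt, inner=20, outer=70):
    """Variants per nucleotide near an intron insertion site.

    The insertion site lies between exon positions `last_upstream_nt` and
    `last_upstream_nt + 1` (1-based, intron removed). Returns the density
    within `inner` nt of the site and the density in the remaining part of
    the `outer`-nt flanks, both summed over the two sides.
    """
    near = far = 0
    for p in variant_positions:
        # Distance to the boundary: 1 for the nucleotide touching it.
        d = last_upstream_nt - p + 1 if p <= last_upstream_nt else p - last_upstream_nt
        if d <= inner:
            near += 1
        elif d <= outer:
            far += 1
    return near / (2 * inner), far / (2 * (outer - inner))


# Toy variant positions (made up) around an insertion after exon position 100.
near_density, far_density = flanking_variant_density([30, 85, 95, 101, 110, 150], 100)
print(near_density, far_density)  # 0.1 0.01
```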

Structural rearrangements are rare in mitochondrial genomes

Next, we investigated the size and the presence of structural variation across the mitochondrial genomes. Considering the 250 circularized assemblies, the mitochondrial genome sizes range from 73,450 to 95,658 bp (Additional file 2: Table S3). As the gene content is entirely conserved between these isolates, this high size plasticity is driven by variability of the intergenic region (ranging from 45,254 to 69,807 bp) and the intron content (ranging from 7748 to 20,024 bp in size) (Additional file 1: Figure S8). Both factors are highly correlated with the total mitochondrial genome length (r² = 0.769 and 0.756, respectively; correlation-associated p values < 2.0E−04) (Additional file 1: Figure S9). Mitochondrial genome size is variable among isolates of the same lineage.
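
For reference, an r² of this kind is the square of the Pearson correlation between a component length and the total genome length. The sketch below shows the calculation on invented numbers; the per-isolate values are placeholders and are not taken from Additional file 2, and the significance test is not included.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up per-isolate lengths in bp, for illustration only.
intergenic_bp = [46000, 52000, 58000, 61000, 69000]
genome_bp = [75000, 80000, 84000, 92000, 95000]

r = pearson_r(intergenic_bp, genome_bp)
r_squared = r ** 2
print(round(r_squared, 3))
```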

Synteny analysis across the 553 isolates with genome on single scaffold highlights four distinct genomic inversions (Fig. 5, Additional file 1: Figure S10). Two strains from the Wine/European and Ale beer lineages, BKI and AQT, share an inversion of the region that ranges from trnW to the COX2 gene, while three closely related Wine/European strains (AIM, BNG and CFB) share a larger inversion that also encompasses the 15S rRNA gene (Fig. 5b, c). Inversions were also found in BDN (African beer) and CDN (Ecuadorean) and are related to regions ranging from the 15S rRNA or COX1 genes, respectively, to the ATP6 gene (Fig. 5d, e). All inversion boundaries map to highly repetitive AT-rich intergenic regions, which prevent their precise delimitation. Interestingly, all these inversions lead to the loss of a feature shared by most ascomycetous yeast, namely that all mitochondrial protein-coding genes are transcribed from the same DNA strand [64]. However, mitochondrial functions seem not to be impaired, as these isolates maintain their respiration capabilities.

Fig. 5 Structural variants in the mitochondrial genomes. Schematic of the mitochondrial genome organization annotated for protein-coding genes and rRNA and tRNA genes. The approximate breakpoint locations of the inversions are indicated by dotted lines. These mitochondrial genome organizations are related to different isolates. a S288C (shared by the vast majority of isolates). b AQI and BKI. c AIM, BNG and CFB. d CDN. e BDN

A recent report suggested that the alteration of the gene order within yeast genera could be related to the mitochondrial genome size [65]. While the Lachancea and Yarrowia clades, with mitochondrial genomes of less than 50 kb, show high synteny across species [66, 67], the Saccharomyces clade (mitochondrial genome size > 65 kb) is more prone to rearrangements [65]. Indeed, structural rearrangements were also detected in the mitochondrial genome of S. paradoxus [56]. Our results suggest that mtDNA structural variation can be tolerated, perhaps restricted to balanced events that do not alter the CDS copy number.

Variation in mtDNA copy number reveals natural petite isolates

Mitochondrial copy number can dramatically affect phenotypes but is hard to measure with high-throughput methods. We estimated mtDNA copy number using the relative coverage of ATP6, COX2 and COX3, which provide robust mapping. The number of mitochondrial genomes is generally constant across clades (Additional file 1: Figure S11), with no significant differences between domesticated and wild lineages, and with a median of 18 mitochondrial genomes for each haploid nuclear genome. The variation is, however, particularly high across the population, reaching over 80 copies. As previously reported [68], the mtDNA copy number scales up linearly with ploidy, with diploid strains having around double the number of mitochondrial genomes and triploid strains three times the number, compared to haploid cells (Fig. 6a).
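
A coverage-based copy number estimate of this kind can be sketched as the ratio of mitochondrial gene read depth to nuclear read depth. The exact normalisation used in the study is not described in this section, so the function below is an assumption-laden illustration: the name, the use of a simple mean over the three genes, and the toy depths are all hypothetical.

```python
from statistics import mean


def mtdna_copy_number(mito_gene_depth, nuclear_depth, ploidy=1):
    """Estimate mtDNA copies from sequencing depth.

    mito_gene_depth: dict of gene name -> mean read depth (e.g. ATP6, COX2, COX3).
    nuclear_depth: typical read depth over the nuclear genome, which
    corresponds to `ploidy` nuclear genome copies per cell.
    Returns (copies per haploid nuclear genome, copies per cell).
    """
    mito_depth = mean(mito_gene_depth.values())
    per_haploid_genome = mito_depth / nuclear_depth
    per_cell = per_haploid_genome * ploidy
    return per_haploid_genome, per_cell


# Toy depths (made up) for a diploid isolate.
depths = {"ATP6": 1750, "COX2": 1820, "COX3": 1830}
per_haploid, per_cell = mtdna_copy_number(depths, nuclear_depth=100, ploidy=2)
print(per_haploid, per_cell)  # 18.0 36.0
```

An isolate whose estimate is close to zero for all three genes would be a candidate for complete loss of the mitochondrial genome, to be confirmed experimentally.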

Fig. 6 Natural variation in mitochondrial genome copy number. a Mitochondrial genome copy number increases linearly with the nuclear genome content. Fifteen natural petite isolates were detected. The number of isolates is indicated above the corresponding plot. b Spotting assay on non-fermentable carbon source (YPEG) confirms natural petite isolates (a subset of tested isolates is shown). c Mitochondrial activity (as membrane potential) is strongly altered by the lack of mitochondrial genome (black versus grey symbols), while the volume remains unaltered. d Growth curve variation of isogenic strains with normal mitochondria (rho⁺, red), petite (rho⁰, green) and petite harbouring the ATP2 G1099T suppressing mutation (rho⁰ ATP2 sup, blue). Among natural petites, we can identify both isolates with high doubling time (DT, black solid line) and isolates with recovered growth rate, comparable to petite with suppressor mutations (black dashed line). e Generation times for isolates with different mitochondrial CN show at least two natural petite isolates that seem to have recovered normal growth rate on rich media. Growth curves for the circled isolates are shown in d

The presence of mitochondrial genomes is assumed to be the natural state of S. cerevisiae cells, which is defined as rho+. However, strains can lose mitochondrial functionality under different conditions, either by accumulating mutations (rho-) or by complete loss of the mitochondrial genome (rho0). These mutants are called ‘cytoplasmic petite’ (i.e. ‘small’) because they grow slowly and form small colonies on rich fermentable media. Since respiration-mediated ATP production is impaired in petite strains, they are unable to grow on non-fermentable carbon sources. We identified 15 potential natural petite isolates (Additional file 2: Tables S1 and S4) from coverage analysis and confirmed that they were unable to grow on media based on non-fermentable carbon sources (Fig. 6b). All petite isolates appear to be rho0, with the exception of two rho- isolates: ABM, which retained the COB gene, and AHV, which kept the ATP9 and VAR1 genes (Additional file 2: Table S4). These rho- isolates, together with five other rho0 petites, were laboratory-derived haploids (HO deleted); since strain manipulation could have caused their mitochondrial condition, they were excluded from further analyses. We examined a selection of four strains for mitochondrial activity (measured as membrane potential) and volume. As controls, we included a wild-type rho+ strain and two derived rho0 variants, one of which carries an additional mutation (ATP2 G1099T) that partially restores growth in rich media (Michael Breitenbach, unpublished data). As expected, the activity data are consistent with the inability of the petites to grow on non-fermentable carbon sources (YPEG). There was, however, no significant difference in mitochondrial volume between wild-type and petite isolates, consistent with the essentiality of maintaining mitochondria even in petite strains (Fig. 6c, Additional file 2: Table S4). We investigated whether these natural petites have a doubling-time defect by measuring growth curves in rich media (YPD). The petite strains showed different growth rates, with two of them having close to normal doubling time (Fig. 6d, e; Additional file 2: Table S4). These strains have neither the ATP2 G1099T nor the ATP3 G348T polymorphism, both of which partially restore growth in rich media (Michael Breitenbach, unpublished data); hence, other compensatory mutations might have partially restored growth in these strains. We cannot rule out that the petite phenotype arose during laboratory manipulations; however, restoring near-normal growth in some of these isolates has likely required extensive propagation with large population sizes, suggesting a more distant mtDNA loss event. Sporulation is known to be impaired in petite isolates [69], and consistently, none of the natural petite isolates sporulate.


We sequenced the mitochondrial genome of Liriodendron tulipifera, the first from the large (>10,000 species) magnoliid lineage, to fill an important phylogenetic gap and provide an outgroup for comparison to the previously studied monocot and eudicot lineages. The phylogenetic position of Liriodendron allowed us to polarize changes in monocots and eudicots, leading to a more detailed understanding of the patterns of loss of RNA editing, gains of plastid tRNAs, and gene cluster conservation across flowering plants. These efforts were bolstered by the fact that the Liriodendron mitochondrial genome evolves exceptionally slowly in terms of gene sequence, content and order, allowing an unprecedented look into the early evolution of plant mitochondrial genomes. Thus, in many striking ways, Liriodendron has a “fossilized” mitochondrial genome, having undergone remarkably little change over the last ~100 million years.

Insights into the acquisition of plastid-derived tRNAs

The evidence presented here points to a different evolutionary history of mitochondrial plastid-derived tRNAs in angiosperms than previously postulated [16, 54], generally pushing back their origins earlier in flowering plant evolution (Figure 2). Whereas Wang et al. [54] posited a recent origin of trnP(TGG)-cp on the branch leading to Nicotiana, its presence in monocots, eudicots and now magnoliids (Figure 2) suggests that its acquisition likely predated the common ancestor of these three lineages. Similarly, the presence of trnD(GTC)-cp in Liriodendron likely pushes the origin of that tRNA back from the common ancestor of eudicots to sometime after the gymnosperm/angiosperm divergence. It should be noted, however, that parallel gains in the magnoliids and eudicots are possible in this case as well. The small size and conserved nature of tRNA genes are such that these competing hypotheses are difficult, if not impossible, to test with phylogenetic analysis.

We know from other angiosperm mitochondrial genomes that sequence transfer from the plastid genome is frequent on an evolutionary timescale [14, 55] and that on occasion these transfer events led to the gain of functional tRNAs, based on their widespread conservation across angiosperms [56]. However, the timing of functional transfers has been unclear. Due to its slow rates of gene loss, sequence change and gene-cluster fragmentation, Liriodendron may have retained one or more regions of plastid DNA that date back to the original sequence transfers that permanently seeded some of the plastid tRNAs found across angiosperm mitochondrial genomes (Figures 2 and 3). Other interpretations are possible, however. For example, that Liriodendron and most eudicots have trnD(GTC)-cp (Figures 2 and 3A) could be due to independent parallel gains, once in a magnoliid ancestor and once early in eudicot evolution.

Part of our reasoning that the plastid-derived sequences in Figure 3A, B may be remnants of early functional plastid tRNA transfers is that the tRNA appears to be more strongly conserved than the flanking regions that were simultaneously transferred, suggesting that purifying selection has preserved the tRNA while the surrounding noncoding sequence deteriorated. The fragment in Figure 3A appears to be the oldest, having accumulated 15% pairwise sequence divergence. Given the inferred low rates of sequence evolution in both the mitochondrial and plastid genomes of Liriodendron, its transfer may well date to early in angiosperm evolution. We hesitate, however, to estimate the actual timing of the transfer event for several reasons. The current low substitution rates in the magnoliid lineage are possibly lower than rates were earlier in angiosperm evolution, precluding the use of a strict molecular clock. The transferred regions contain plastid sequence with intergenic DNA, as well as synonymous and nonsynonymous sites, which are under different constraints in the plastid relative to the mitochondrial genome, further complicating fragment-wide divergence time estimates. The plastid-derived fragment containing trnP(TGG)-cp (Figure 3B) appears to be more recently transferred than the fragment in Figure 3A, given the lower overall divergence from its cognate plastid sequence. In this case, however, more of the fragment consists of protein-coding genes, which would likely decrease the overall rate of pairwise sequence divergence following the transfer event.

Our interpretation of the time since transfer may also be complicated by the possibility that concerted evolution homogenizes homologous plastid and mitochondrial sequences [57]. For example, it is possible that a divergent, plastid-derived sequence fragment containing trnN(GTT)-cp (Figure 3C) was already present in the Liriodendron mitochondrial genome from an earlier transfer, and the short stretch containing the tRNA was “updated” via gene conversion between it and a reintroduced copy of the same stretch of plastid DNA, restoring the sequence identity between the plastid and mitochondrial copies. This concerted-evolution mechanism was postulated to explain patterns of sequence divergence in a stretch of plastid-derived sequence in the mitochondrial genomes of Oryza and Zea, where within-species plastid/mitochondrial divergence is less than between species in the mitochondrial region, despite the putatively shared origin of the transferred fragment [57]. If mitochondrial and plastid copies are evolving in concert, the nearly identical plastid-derived fragment in Figure 3C could be much older than suggested by the high sequence similarity.

Low mitochondrial and plastid substitution rates in magnoliids

The mitochondrial genes in Liriodendron evolve at an exceptionally low rate, accumulating just 0.035 nucleotide substitutions per silent site per billion years (ssb). As a point of reference, using the same computational approach as employed for the plant mitochondrial rate analysis, we aligned all 13 protein-coding genes from the full mitochondrial genomes of a human [58], a Neanderthal [59], a more distantly related Denisova hominin [60], and a chimpanzee outgroup [61]. We calculated an absolute silent substitution rate of 69.5 ssb in humans, using the relevant divergence dates from Krause et al. [60]. The human mitochondrial substitution rate is thus more than 5,000 times faster than that of Magnolia and about 2,000 times faster than that of Liriodendron. Stated differently, the average amount of silent-site mitochondrial divergence accrued over the course of a single generation (25 years) in humans would take roughly 50,000 years to accrue in Liriodendron and 130,000 years in Magnolia.
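The fold differences and year equivalents quoted above follow directly from the ratio of the rates. The short sketch below reproduces that arithmetic; the Magnolia rate of 0.013 ssb is the estimate reported elsewhere in the same study, and the function names are illustrative only.

```python
# Silent substitution rates in substitutions per silent site
# per billion years (ssb), as quoted in the text.
HUMAN_RATE = 69.5
LIRIODENDRON_RATE = 0.035
MAGNOLIA_RATE = 0.013
HUMAN_GENERATION_YEARS = 25


def fold_difference(fast_rate, slow_rate):
    """How many times faster the fast lineage accumulates substitutions."""
    return fast_rate / slow_rate


def equivalent_years(fast_rate, slow_rate, years_in_fast):
    """Years the slow lineage needs to accrue the divergence that the
    fast lineage accrues in `years_in_fast` years (constant rates assumed)."""
    return years_in_fast * fold_difference(fast_rate, slow_rate)


for name, rate in [("Liriodendron", LIRIODENDRON_RATE), ("Magnolia", MAGNOLIA_RATE)]:
    fold = fold_difference(HUMAN_RATE, rate)
    years = equivalent_years(HUMAN_RATE, rate, HUMAN_GENERATION_YEARS)
    print(f"{name}: human rate is {fold:,.0f}x faster; "
          f"one human generation is equivalent to {years:,.0f} years")
```

The calculation assumes constant rates along each lineage, which the surrounding discussion itself treats with caution.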

Mower et al. [10] characterized mitochondrial silent substitution rates across approximately 600 plant species with datasets of one to five genes and also found that Silene noctiflora is the fastest [10]. The slowest evolving mitochondrial genome reported by Mower et al. [10] was Cycas at 0.02 ± 0.1 ssb, similar to the Liriodendron rate reported here, and greater than our estimate for Magnolia using an 18-gene concatenated alignment. To our knowledge, the estimated rate of 0.013 ssb in Magnolia is the lowest reported genome-wide substitution rate in any organism, but this conclusion is tempered by the associated error in our estimates. For Magnolia and Liriodendron, the 95% likelihood confidence interval about the ssb estimate due to errors in branch specific synonymous substitution estimation was 0.003 to 0.034 and 0.015 to 0.065, respectively (Additional file 1: Table S3). In addition, our estimates rely heavily on fossil-calibrated divergence times, which add an additional source of error (for example, see [30, 31, 62, 63]). We used two widely accepted fossils within magnoliids [64, 65], which together should provide a relatively accurate divergence time estimate for the relevant Liriodendron–Magnolia split. The 95% highest probability density interval for this split was 94.9 to 102.2 mya, and the median value we used for our estimate was 97.4 mya (see Methods). Therefore, in our study, errors in absolute rate estimation for Liriodendron and Magnolia are less influenced by divergence time uncertainty than by error in the likelihood estimate of the branch-specific synonymous substitution rates.

We found that mitochondrial and chloroplast substitution rates were roughly correlated in the taxa examined here (Figure 4), an observation deserving of more detailed follow-up study. Although it is too early to draw firm conclusions, growth habit (annual vs. perennial, shrub vs. tree) might underlie this pattern [66]. Generation time and rates of synonymous substitution are generally inversely correlated in plants (for review, see [67]). The driving forces behind this relationship are unclear, however, as plants do not have a dedicated germ line, so generation time and number of reproductive cell divisions per year are not as closely linked as they are in animals. Differences between annuals and perennials, in terms of speciation rate and/or metabolism, could underlie the relationship between generation time and substitution rate [67], and might be expected to similarly influence each of the plant’s three genomes. As nuclear genomic data become available for a broader diversity of plants, it will be interesting to determine whether this correlation extends across all three genetic compartments.

Our data also recovered a greater ratio of plastid to mitochondrial silent substitution rates than was found previously [9, 13, 51, 52]. Our estimate benefited from considerably more sequence data and much broader taxon sampling than previous studies, which might account for the discrepancy. In addition, given the 5,000-fold and 40-fold range in mitochondrial and plastid substitution rates, respectively, that we found, it appears that taxon sampling can have a large effect on average inferred ratios. “High-rate” mitochondrial and plastid lineages do not always have proportionally elevated rates in both organelle genomes [48], leading to extreme plastid–mitochondrial rate relationships (for example, 0.08 in Silene conica) (Figure 4). Gene-to-gene variation in mitochondrial [10] and plastid [48, 53] silent substitution rates is common as well, underscoring the need to consider many mitochondrial and plastid genes for an accurate determination of relative rates.

Retention of RNA editing sites lost in many lineages

The overall high level of C-to-U RNA editing in Liriodendron, along with its large number of unique edit sites, adds further support for a model of relatively high levels of RNA editing in the ancestral angiosperm mitochondrial genome (approximately 700 sites in protein-coding genes), followed by various degrees of subsequent loss in different lineages (Figure 5) [26, 27]. RNA editing data from an angiosperm from an “early diverging” lineage, such as Amborella or Nymphaea, would help polarize the degree of editing loss in Liriodendron, which looks to be exceptionally low based on these data. There is no clear adaptive explanation for the emergence and maintenance of RNA editing in plants [25, 68, 69], but it may have emerged through neutral processes, only to become essential following substitutions at functionally important cytosines that required post-transcriptional editing to produce the conserved amino acid [70] – a hypothesis falling under the category of ‘constructive neutral evolution’ [71, 72]. Consistent with this model, most edit sites change the translated amino acid sequence [21, 73], a pattern underscored in Liriodendron, in which 82% of the edits were at nonsynonymous sites. While the emergence of RNA editing may be due to neutral processes, comparative work has found support for selection favoring loss of editing over time [26, 27], and it is likely that such selection would be stronger at nonsynonymous sites, where unreliable editing would be most deleterious. Consistent with this hypothesis, we found the ratio of loss to gain was 14:1 at nonsynonymous sites compared to 2:1 at silent sites across angiosperms (Figure 5).

Conservation of ancient gene clusters

Although overall gene order is highly variable among angiosperm mitochondrial genomes [13], even between closely related taxa [15], the results here underscore countervailing constraints on short clusters of gene linkage operating across angiosperm evolution. While some of the conserved clusters (for example, rrnS–rrn5 and rpl2–rps19–rps3–rpl16) date back to the original bacterial ancestor of mitochondria [19], others are unique to angiosperms, such as the atp8–cox3–sdh4 and rps13–nad1.x2.x3 clusters. The five clusters shared by Liriodendron and Cycas most likely were present early in seed plant evolution, and we can look outside of seed plants to infer which of these were also present early in vascular plant evolution. A comparative gene order analysis showed Huperzia to have experienced fewer rearrangements relative to bryophytes than any other vascular plant mitochondrial genome [74], making it a meaningful comparison for vascular plant gene order conservation. Of the five clusters shared by Cycas and Liriodendron, three are shared with Huperzia and two are not. All of the gene clusters found in Liriodendron to the exclusion of Cycas are also lacking in Huperzia, suggesting such clusters are indeed angiosperm-specific.

Transcription is likely an important constraint, whereby adjacent genes share a single promoter and are co-transcribed, as was shown for three conserved gene clusters in Nicotiana [16]. This could explain why all of the clusters conserved across angiosperms involve genes encoded on the same strand. Interestingly, three of the clusters inferred to be present in the ancestral angiosperm involve internal fragments of trans-spliced genes (Figure 6), which may, upon further examination, provide clues as to the regulation and reconstruction of full-length transcripts from trans-spliced genes.

The Liriodendron mitochondrial genome appears to have been subject to both low silent-substitution rates and infrequent gene-cluster fragmentation relative to sequenced eudicot and monocot mitochondrial genomes (Figures 4 and 6). However, levels of silent substitution and gene cluster fragmentation do not necessarily covary across all angiosperms in our study. For example, one of the taxa with a relatively high silent substitution rate (>30 × faster than Liriodendron), Cucurbita, has 11 conserved gene clusters compared to 12 in Liriodendron, whereas Zea, with a relatively slower rate (10 × faster than Liriodendron), has only five. In angiosperm plastid genomes, there is support for a positive relationship between rates of structural and sequence evolution [75], but this relationship is not universal [48, 53]. In Silene, for example, although rates of plastid gene order rearrangement are higher in species with higher substitution rates, many of these substitutions occur at nonsynonymous sites and so are not easily explained by a simple, mutationally-driven model [48].


Introns were first discovered in protein-coding genes of adenovirus, [8] [9] and were subsequently identified in genes encoding transfer RNA and ribosomal RNA. Introns are now known to occur within a wide variety of genes throughout organisms and viruses within all of the biological kingdoms.

The fact that genes were split or interrupted by introns was discovered independently in 1977 by Phillip Allen Sharp and Richard J. Roberts, for which they shared the Nobel Prize in Physiology or Medicine in 1993. [10] The term intron was introduced by American biochemist Walter Gilbert: [5]

"The notion of the cistron [i.e., gene] ... must be replaced by that of a transcription unit containing regions which will be lost from the mature messenger – which I suggest we call introns (for intragenic regions) – alternating with regions which will be expressed – exons." (Gilbert 1978)

The term intron also refers to intracistron, i.e., an additional piece of DNA that arises within a cistron. [11]

Although introns are sometimes called intervening sequences, [12] the term "intervening sequence" can refer to any of several families of internal nucleic acid sequences that are not present in the final gene product, including inteins, untranslated regions (UTR), and nucleotides removed by RNA editing, in addition to introns.

The frequency of introns within different genomes is observed to vary widely across the spectrum of biological organisms. For example, introns are extremely common within the nuclear genome of jawed vertebrates (e.g. humans and mice), where protein-coding genes almost always contain multiple introns, while introns are rare within the nuclear genes of some eukaryotic microorganisms, [13] for example baker's/brewer's yeast (Saccharomyces cerevisiae). In contrast, the mitochondrial genomes of vertebrates are entirely devoid of introns, while those of eukaryotic microorganisms may contain many introns. [14]

A particularly extreme case is the Drosophila dhc7 gene containing a ≥3.6 megabase (Mb) intron, which takes roughly three days to transcribe. [15] [16] On the other extreme, a recent study suggests that the shortest known eukaryotic intron length is 30 base pairs (bp) belonging to the human MST1L gene. [17]

Splicing of all intron-containing RNA molecules is superficially similar, as described above. However, different types of introns were identified through the examination of intron structure by DNA sequence analysis, together with genetic and biochemical analysis of RNA splicing reactions.

At least four distinct classes of introns have been identified: [1]

  • Introns in nuclear protein-coding genes that are removed by spliceosomes (spliceosomal introns)
  • Introns in nuclear and archaeal transfer RNA genes that are removed by proteins (tRNA introns)
  • Self-splicing group I introns that are removed by RNA catalysis
  • Self-splicing group II introns that are removed by RNA catalysis

Group III introns are proposed to be a fifth family, but little is known about the biochemical apparatus that mediates their splicing. They appear to be related to group II introns, and possibly to spliceosomal introns. [18]

Spliceosomal introns

Nuclear pre-mRNA introns (spliceosomal introns) are characterized by specific intron sequences located at the boundaries between introns and exons. [19] These sequences are recognized by spliceosomal RNA molecules when the splicing reactions are initiated. [20] In addition, they contain a branch point, a particular nucleotide sequence near the 3' end of the intron that becomes covalently linked to the 5' end of the intron during the splicing process, generating a branched (lariat) intron. Apart from these three short conserved elements, nuclear pre-mRNA intron sequences are highly variable. Nuclear pre-mRNA introns are often much longer than their surrounding exons.

tRNA introns

Transfer RNA introns that depend upon proteins for removal occur at a specific location within the anticodon loop of unspliced tRNA precursors, and are removed by a tRNA splicing endonuclease. The exons are then linked together by a second protein, the tRNA splicing ligase. [21] Note that self-splicing introns are also sometimes found within tRNA genes. [22]

Group I and group II introns

Group I and group II introns are found in genes encoding proteins (messenger RNA), transfer RNA and ribosomal RNA in a very wide range of living organisms. [23] [24] Following transcription into RNA, group I and group II introns also make extensive internal interactions that allow them to fold into a specific, complex three-dimensional architecture. These complex architectures allow some group I and group II introns to be self-splicing, that is, the intron-containing RNA molecule can rearrange its own covalent structure so as to precisely remove the intron and link the exons together in the correct order. In some cases, particular intron-binding proteins are involved in splicing, acting in such a way that they assist the intron in folding into the three-dimensional structure that is necessary for self-splicing activity. Group I and group II introns are distinguished by different sets of internal conserved sequences and folded structures, and by the fact that splicing of RNA molecules containing group II introns generates branched introns (like those of spliceosomal RNAs), while group I introns use a non-encoded guanosine nucleotide (typically GTP) to initiate splicing, adding it on to the 5'-end of the excised intron.

While introns do not encode protein products, they are integral to gene expression regulation. Some introns themselves encode functional RNAs through further processing after splicing to generate noncoding RNA molecules. [25] Alternative splicing is widely used to generate multiple proteins from a single gene. Furthermore, some introns play essential roles in a wide range of gene expression regulatory functions such as nonsense-mediated decay [26] and mRNA export. [27]

The biological origins of introns are obscure. After the initial discovery of introns in protein-coding genes of the eukaryotic nucleus, there was significant debate as to whether introns in modern-day organisms were inherited from a common ancient ancestor (termed the introns-early hypothesis), or whether they appeared in genes rather recently in the evolutionary process (termed the introns-late hypothesis). Another theory is that the spliceosome and the intron-exon structure of genes is a relic of the RNA world (the introns-first hypothesis). [28] There is still considerable debate about which of these hypotheses is most nearly correct. The popular consensus at the moment is that introns arose within the eukaryote lineage as selfish elements. [29]

Early studies of genomic DNA sequences from a wide range of organisms showed that the intron-exon structure of homologous genes in different organisms can vary widely. [30] More recent studies of entire eukaryotic genomes have now shown that the lengths and density (introns/gene) of introns vary considerably between related species. For example, while the human genome contains an average of 8.4 introns/gene (139,418 in the genome), the unicellular fungus Encephalitozoon cuniculi contains only 0.0075 introns/gene (15 introns in the genome). [31] Since eukaryotes arose from a common ancestor (common descent), there must have been extensive gain or loss of introns during evolutionary time. [32] [33] This process is thought to be subject to selection, with a tendency towards intron gain in larger species due to their smaller population sizes, and the converse in smaller (particularly unicellular) species. [34] Biological factors also influence which genes in a genome lose or accumulate introns. [35] [36] [37]
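Intron density is simply the intron count divided by the gene count, so the figures quoted above also imply the gene counts behind them. A small sketch of that arithmetic (the function name is illustrative, and the implied gene counts are derived values, not figures taken from the cited study):

```python
def implied_gene_count(total_introns, introns_per_gene):
    """Gene count implied by an intron total and a mean intron density."""
    return total_introns / introns_per_gene


human_genes = implied_gene_count(139_418, 8.4)        # roughly 16,600 genes
e_cuniculi_genes = implied_gene_count(15, 0.0075)     # roughly 2,000 genes

# Ratio of the two densities: how much more intron-rich human genes are.
density_ratio = 8.4 / 0.0075                          # roughly 1,120-fold

print(round(human_genes), round(e_cuniculi_genes), round(density_ratio))
```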

Alternative splicing of exons within a gene after intron excision acts to introduce greater variability of protein sequences translated from a single gene, allowing multiple related proteins to be generated from a single gene and a single precursor mRNA transcript. The control of alternative RNA splicing is performed by a complex network of signaling molecules that respond to a wide range of intracellular and extracellular signals.

Introns contain several short sequences that are important for efficient splicing, such as acceptor and donor sites at either end of the intron as well as a branch point site, which are required for proper splicing by the spliceosome. Some introns are known to enhance the expression of the gene that they are contained in by a process known as intron-mediated enhancement (IME).

Actively transcribed regions of DNA frequently form R-loops that are vulnerable to DNA damage. In highly expressed yeast genes, introns inhibit R-loop formation and the occurrence of DNA damage. [38] Genome-wide analysis in both yeast and humans revealed that intron-containing genes have decreased R-loop levels and decreased DNA damage compared to intronless genes of similar expression. [38] Insertion of an intron within an R-loop prone gene can also suppress R-loop formation and recombination. Bonnet et al. (2017) [38] speculated that the function of introns in maintaining genetic stability may explain their evolutionary maintenance at certain locations, particularly in highly expressed genes.

Starvation adaptation

The physical presence of introns promotes cellular resistance to starvation via intron enhanced repression of ribosomal protein genes of nutrient-sensing pathways. [39]

Introns may be lost or gained over evolutionary time, as shown by many comparative studies of orthologous genes. Subsequent analyses have identified thousands of examples of intron loss and gain events, and it has been proposed that the emergence of eukaryotes, or the initial stages of eukaryotic evolution, involved an intron invasion. [40] Two definitive mechanisms of intron loss, reverse transcriptase-mediated intron loss (RTMIL) and genomic deletions, have been identified, and are known to occur. [41] The definitive mechanisms of intron gain, however, remain elusive and controversial. At least seven mechanisms of intron gain have been reported thus far: intron transposition, transposon insertion, tandem genomic duplication, intron transfer, intron gain during double-strand break repair (DSBR), insertion of a group II intron, and intronization. In theory it should be easiest to deduce the origin of recently gained introns due to the lack of host-induced mutations, yet even introns gained recently did not arise from any of the aforementioned mechanisms. These findings thus raise the question of whether or not the proposed mechanisms of intron gain fail to describe the mechanistic origin of many novel introns because they are not accurate mechanisms of intron gain, or if there are other, yet to be discovered, processes generating novel introns. [42]

In intron transposition, the most commonly purported intron gain mechanism, a spliced intron is thought to reverse splice into either its own mRNA or another mRNA at a previously intron-less position. This intron-containing mRNA is then reverse transcribed and the resulting intron-containing cDNA may then cause intron gain via complete or partial recombination with its original genomic locus. Transposon insertions can also result in intron creation. Such an insertion could intronize the transposon without disrupting the coding sequence when a transposon inserts into the sequence AGGT, resulting in the duplication of this sequence on each side of the transposon. It is not yet understood why these elements are spliced, whether by chance, or by some preferential action by the transposon. In tandem genomic duplication, due to the similarity between consensus donor and acceptor splice sites, which both closely resemble AGGT, the tandem genomic duplication of an exonic segment harboring an AGGT sequence generates two potential splice sites. When recognized by the spliceosome, the sequence between the original and duplicated AGGT will be spliced, resulting in the creation of an intron without alteration of the coding sequence of the gene. Double-stranded break repair via non-homologous end joining was recently identified as a source of intron gain when researchers identified short direct repeats flanking 43% of gained introns in Daphnia. [42] These numbers must be compared to the number of conserved introns flanked by repeats in other organisms, though, for statistical relevance. For group II intron insertion, the retrohoming of a group II intron into a nuclear gene was proposed to cause recent spliceosomal intron gain.
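The tandem-duplication mechanism can be illustrated with a toy sequence. In the sketch below, an exonic segment beginning with AGGT is duplicated in tandem; treating the GT of the first copy as the donor site and the AG of the second copy as the acceptor site, the excised "intron" leaves the original coding sequence intact. The sequence and function names are invented for illustration, and real splice-site recognition depends on far more context than these four nucleotides.

```python
def tandem_duplicate(seq, start, end):
    """Return seq with the segment seq[start:end] duplicated in tandem."""
    return seq[:end] + seq[start:end] + seq[end:]


def splice_between_aggt(seq, first, second):
    """Excise the region from the GT of the AGGT at index `first`
    through the AG of the AGGT at index `second`.

    Returns (intron, mature_sequence).
    """
    if seq[first:first + 4] != "AGGT" or seq[second:second + 4] != "AGGT":
        raise ValueError("both positions must point at an AGGT motif")
    donor = first + 2          # first base of the intron (the G of GT)
    acceptor_end = second + 2  # one past the last intron base (the G of AG)
    return seq[donor:acceptor_end], seq[:donor] + seq[acceptor_end:]


exon = "ATGGCAAGGTCCGATTAA"            # toy exon containing one AGGT
motif = exon.find("AGGT")              # index of the motif in the exon
segment_end = motif + 8                # duplicate an 8-nt segment starting at AGGT

duplicated = tandem_duplicate(exon, motif, segment_end)
intron, mature = splice_between_aggt(duplicated, motif, segment_end)

print(duplicated)  # exon carrying the tandem duplication
print(intron)      # starts with GT and ends with AG
print(mature)      # identical to the original exon
```

Because the excised region is exactly one copy of the duplicated segment, shifted by two nucleotides, the spliced product matches the pre-duplication sequence, which is why this route does not alter the encoded peptide.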

Intron transfer has been hypothesized to result in intron gain when a paralog or pseudogene gains an intron and then transfers this intron via recombination to an intron-absent location in its sister paralog. Intronization is the process by which mutations create novel introns from formerly exonic sequence. Thus, unlike other proposed mechanisms of intron gain, this mechanism does not require the insertion or generation of DNA to create a novel intron. [42]

The only hypothesized mechanism of recent intron gain lacking any direct evidence is that of group II intron insertion, which when demonstrated in vivo, abolishes gene expression. [43] Group II introns are therefore likely the presumed ancestors of spliceosomal introns, acting as site-specific retroelements, and are no longer responsible for intron gain. [44] [45] Tandem genomic duplication is the only proposed mechanism with supporting in vivo experimental evidence: a short intragenic tandem duplication can insert a novel intron into a protein-coding gene, leaving the corresponding peptide sequence unchanged. [46] This mechanism also has extensive indirect evidence lending support to the idea that tandem genomic duplication is a prevalent mechanism for intron gain. The testing of other proposed mechanisms in vivo, particularly intron gain during DSBR, intron transfer, and intronization, is possible, although these mechanisms must be demonstrated in vivo to solidify them as actual mechanisms of intron gain. Further genomic analyses, especially when executed at the population level, may then quantify the relative contribution of each mechanism, possibly identifying species-specific biases that may shed light on varied rates of intron gain amongst different species. [42]


In order to establish a standard nomenclature for introns in protein-coding genes across the kingdom Fungi, it is necessary to find an appropriate reference mitogenome. By looking at fungal species with available mitogenomes, we chose the mitogenome of the cyclosporin-producing fungus Tolypocladium inflatum ARSEF 3280 (accession number NC_036382) as the reference mitogenome. The 25,328-bp mitogenome of T. inflatum contains all 15 protein-coding genes typically found in fungal mitogenomes, and none of these genes contains an intron (Zhang et al. 2017d). We did not choose any of the best-understood model fungi, such as baker's yeast Saccharomyces cerevisiae, the fission yeast Schizosaccharomyces pombe, the opportunistic fungal pathogen Candida albicans, or the filamentous euascomycete Neurospora crassa. This is because the yeasts Sa. cerevisiae and Sc. pombe both lack genes coding for NADH dehydrogenases in their mitogenomes (Foury et al. 1998), and C. albicans and N. crassa contain introns in many different protein-coding genes (Borkovich et al. 2004; Bartelli et al. 2013). We also did not choose the human mitochondrial genome, which was selected as the reference to name introns found in nad5 and cox1 in certain metazoans (Emblem et al. 2011). This is because the human mitogenome contains only 13 standard protein-coding genes and lacks atp9 and rps3, both of which are known to harbor introns in fungal mitogenomes.

Both basal and higher fungi may contain introns in their mitogenomes. We randomly selected representative species in each fungal phylum to locate and name possible introns (Table 1). Determining the insertion position of an intron relies on an alignment between the sequence of its host gene and the corresponding gene sequence of T. inflatum (Additional file 1). Although many sequence alignment programs are available, we recommend MAFFT, which is fast when aligning long sequences containing many introns and, in our experience, generates satisfactory alignments. The default settings of MAFFT work well in most cases. If exon-intron boundaries are not correctly identified under the default settings (probably due to interference from intron sequences or the presence of short exons), one may consider adjusting the alignment parameters (e.g., try ‘Unalignlevel > 0’ and possibly ‘Leave gappy regions’ by selecting the G-INS-1 or G-INS-i alignment strategy) and/or adding sequences from a species closely related to the test species to the alignment. In addition, it is always advisable to refer to known annotation results and/or characteristic nucleotides at splice sites of group I/II introns (Cech 1988) to ensure correct alignment and identification of exon-intron boundaries.
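Once a host gene has been aligned to the intron-free reference gene, intron insertion sites can be read off as long gap runs in the reference row. The sketch below is one possible way to do this, not the authors' procedure: the function name, the `min_len` threshold, and the toy alignment are all assumptions, and the two input strings are expected to be rows of an existing pairwise alignment (for example, one produced by MAFFT).

```python
def intron_insertion_sites(ref_aln: str, host_aln: str, min_len: int = 20):
    """Locate candidate introns as gap runs in the reference alignment row.

    Returns a list of (ref_pos, length), where ref_pos is the number of
    reference nucleotides that precede the insertion and length is the
    size of the inserted host sequence.
    """
    if len(ref_aln) != len(host_aln):
        raise ValueError("alignment rows must have equal length")
    sites = []
    ref_pos = 0  # ungapped reference nucleotides seen so far
    run = 0      # length of the current reference-gap run
    for r, h in zip(ref_aln, host_aln):
        if r == "-" and h != "-":
            run += 1
            continue
        if run >= min_len:
            sites.append((ref_pos, run))
        run = 0
        if r != "-":
            ref_pos += 1
    if run >= min_len:  # insertion at the very end of the alignment
        sites.append((ref_pos, run))
    return sites


# Toy alignment: a 5-nt insertion after reference position 6.
ref = "ATGGCC" + "-----" + "TTTAA"
host = "ATGGCC" + "GTAAG" + "TTTAA"
print(intron_insertion_sites(ref, host, min_len=3))
```

Short gap runs below `min_len` are ignored so that ordinary indels are not reported as introns; a suitable threshold depends on the data and should be checked against known annotations.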

Author information


Universidade Federal do Espírito Santo, Grupo de Ecologia Bêntica, Departamento de Oceanografia, Av. Fernando Ferrari, 514, Vitória, ES, 29075-910, Brazil

Auburn University, Department of Biological Sciences, 101 Life Sciences Building, Auburn, AL, 36849, USA

Yuanning Li & Kenneth M. Halanych

Department of Oceanography, SOEST, University of Hawaii at Manoa, 1000 Pope Road, Honolulu, HI, 96822, USA


Conceived and designed the experiments: C.R.S., K.M.H. Performed the experiments: A.F.B., Y.L., C.R.S., K.M.H. Analyzed the data: A.F.B., Y.L., K.M.H. Contributed reagents/materials/analysis tools: K.M.H., C.R.S. Wrote the paper: A.F.B., Y.L., C.R.S., K.M.H.


Fungal isolates and DNA extraction

Sixteen T. fuciformis isolates (TF01-TF16) were obtained by the Edible Fungal Germplasm Resources Management Center of Fujian province, Fuzhou, China. The origin of the isolates is listed in Supplementary Table 3. Among them, TF15 was isolated from Wuyishan National Parks, Fujian, China, in 2014, TF11 and TF14 were obtained from Wuyishan National Nature Reserve in 2015, and TF01 was another wild isolate from Huboliao National Nature Reserve of Fujian.

After being grown in potato dextrose broth at 25 °C for 48 h, single yeast-like cells of T. fuciformis were washed and harvested by centrifugation at 10,000 g for 5 min, then stored at −20 °C after freeze-drying. For Illumina sequencing, total genomic DNA of the 16 T. fuciformis isolates was extracted using the Omega HP Plant DNA Kit according to the manufacturer’s instructions; at least 500 ng DNA (> 18 ng/μL) was required for each sample. For PacBio single molecule real-time (SMRT) sequencing, long DNA fragments of TF02 and TF15 were isolated using the cetyl trimethylammonium bromide (CTAB) method as described in DNA-extraction-chlamy-CTAB-JGI.pdf; at least 20 μg DNA (OD260/280 between 1.8 and 2.0, OD260/230 between 2.0 and 2.2, intact gDNA > 20 kb) was required for each sample.

Genome sequencing, assembly, and gene annotation

Whole genome shotgun sequencing of 16 T. fuciformis isolates was performed at Beijing Novogene Bioinformatics Technology Co., Ltd. using the Illumina HiSeq 2500 platform with paired-end libraries, targeting 3–6 Gb data per isolate. The raw Illumina sequencing data of T. mesenterica ATCC28783 (accession SRX8046622) was downloaded from the SRA database of NCBI. Raw reads were assembled using Velvet 1.2.03 [52].

Mitochondrial contigs were identified by BLAST against the published mitochondrial genome of Cryptococcus neoformans var. grubii H99 (accession NC_004336). Mitochondrial contigs were extended step by step using the paired-end relationship of reads: if one read of a pair mapped to the end of a contig, its mate could be used to extend the sequence. Ambiguous extensions or gaps were confirmed or closed by PCR sequencing. Contigs were concatenated into single circular DNA sequences based on 100% overlap.
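The final circularization step can be pictured as finding an exact terminal overlap and removing one copy of it. The following is a minimal sketch under that assumption; the function name, the `min_overlap` parameter, and the toy contig are hypothetical, and real assemblies would normally also be checked by read mapping or PCR as described above.

```python
def circularize(contig: str, min_overlap: int = 20) -> str:
    """Trim an exact (100% identity) end-to-start overlap from a linear contig.

    Looks for the longest k >= min_overlap such that the last k bases equal
    the first k bases, then drops the trailing copy so the sequence can be
    treated as circular. Raises ValueError if no such overlap exists.
    """
    for k in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[-k:] == contig[:k]:
            return contig[:-k]
    raise ValueError("no exact terminal overlap of the required length")


# Toy contig whose last four bases repeat its first four bases.
print(circularize("ACGTTTGGACGT", min_overlap=4))
```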

PacBio sequencing technology was used to verify the assembly accuracy of two of the Illumina-sequenced isolates, TF13 and TF15. These were sequenced using PacBio RS II, targeting approximately 2.5 Gb raw data per isolate. Genome assembly for PacBio sequencing data was done using the Canu 1.3 program [53]. Single contigs for each mitogenome were identified by comparison with mitochondrial genomes of the corresponding isolates obtained from Illumina sequencing data, to obtain complete circular DNAs after trimming 3′ ends.

Both gene prediction and gene annotation were initially done using the online tool MFannot. tRNAs were annotated by combining the results of MFannot, tRNAscan-SE [54], and RNAweasel [55]. Conserved gene boundaries and exon-intron junction points were confirmed by comparison with the corresponding intron-free genes of other tested isolates using Clustal X [56].

Phylogenetic analysis of T. fuciformis isolates

To determine the evolutionary relationships among the 16 T. fuciformis isolates, concatenated amino acid sequences of 14 conserved genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5, and nad6), totalling 4252 characters, were used for phylogenetic analysis, with T. mesenterica as the outgroup. Amino acid alignments were done using Clustal W in the MEGA 6 program [57] with gap opening and gap extension penalties of 10 and 3, respectively (the same values for pairwise and multiple alignments). A phylogenetic tree was constructed using Maximum Likelihood in MEGA 6 and tested by bootstrap analysis with 500 replications. Gaps and missing data within alignments were treated as deletions.

PCR analysis to confirm special predicted introns

PCR analyses were used to confirm predicted introns. Primers (Supplementary Table 4) were designed using the online tool Primer-BLAST on the NCBI website. These primers targeted regions of cDNA from the upstream exon to the N-terminal sequence and, in special cases, regions from the upstream exon to the N-terminal duplication. Representative isolates were selected for the PCR work; the mtDNAs of these isolates had to include all of the introns as well as the corresponding intron-free sequences.

Yeast cells were collected at logarithmic phase, and RNA was extracted using the Omega HP Plant RNA Kit. cDNA was reverse transcribed using the PrimeScript™ RT-PCR Kit (Takara, Dalian), and used as PCR templates. PCR products were sequenced at Sangon Biotech (Shanghai).

Materials and Methods

Sampling and DNA extraction

The symptomatic mycelium of the pathogen of slippery scar from A. polytricha was collected from Jintang, Sichuan Province, China. The causative pathogen was isolated according to Peng et al. [1]. Suspected fungi were first cultured on PDA medium for 3 days and then inoculated into cultivation bags with healthy A. polytricha mycelia. The inoculated cultivation bags were cultured at 25 °C for 20 days. The pathogenic fungi were then re-isolated from the cultivation bags with infected A. polytricha, which showed the symptoms of slippery scar. The strain was identified as S. auriculariicola based on Koch’s postulates, morphology, and ITS sequences. The mycelium of S. auriculariicola was cultured in liquid potato dextrose medium for 4 days and then collected for DNA extraction. Total DNA was extracted from the mycelia using the fungal DNA Kit D3390-00 (Omega Bio-Tek, Norcross, GA, USA) according to the manufacturer’s instructions. The quality of extracted DNA was checked by electrophoresis, and DNA was stored at −20 °C until sequencing. The S. auriculariicola strain was stored in Sichuan Academy of Agricultural Sciences (No. SAAS_Sau), and is available from Cheng Chen and Daihua Lu of the Sichuan Academy of Agricultural Sciences, China.

Sequencing, assembly, and annotation of the mitochondrial genome

Purified DNA was used to construct sequencing libraries following the instructions of the NEBNext Ultra II DNA Library Prep Kit (NEB, Beijing, China). Whole genome shotgun sequencing was performed using an Illumina HiSeq 2500 Platform (Illumina, San Diego, CA, USA). We performed quality control and de novo assembly of the mitogenome according to Bi [52]. SPAdes 3.9.0 software [53] was used for de novo assembly of the mitogenome, and the MITObim V1.9 program [54] was used to fill in the gaps between contigs.

The MFannot and MITOS [55] tools were used for mitogenome annotation of S. auriculariicola, both based on genetic code 4. Uncertain results were adjusted manually by sequence alignment with intron-free orthologous genes from closely related species. The initially annotated protein-coding, rRNA, and tRNA genes of S. auriculariicola were also modified by alignment with previously published Leotiomycetes mitogenomes. ORFs were functionally annotated with InterProScan software [56]. The tRNAscan-SE 2.0 program was used to predict tRNA genes [57]. Finally, we used the OrganellarGenomeDraw (OGDRAW) tool [58] to draw a map of the complete S. auriculariicola mitogenome.

Analysis of the mitogenomic organization

We used the Lasergene v7.1 (DNASTAR) tool with default settings to analyze the base composition of the mitogenome of S. auriculariicola. Strand asymmetry of the mitogenome was assessed using the following formulas: AT skew = [A − T]/[A + T], and GC skew = [G − C]/[G + C] [59]. We calculated codon usage using the Sequence Manipulation Suite software [60] based on genetic code 4. We compared the arrangement of genes in S. auriculariicola with those of other published Leotiomycetes species. Genomic synteny analysis of mitogenomes from six representative species within the Leotiomycetes was conducted with Mauve v2.4.0 [61].
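The two skew formulas above translate directly into code. This is a minimal sketch that is independent of Lasergene; the function name and the toy sequence are illustrative only, and ambiguous bases such as N are simply ignored.

```python
def skews(seq: str) -> tuple[float, float]:
    """Return (AT skew, GC skew) for one strand of a nucleotide sequence.

    AT skew = (A - T) / (A + T); GC skew = (G - C) / (G + C).
    Characters other than A, C, G, T are not counted.
    """
    s = seq.upper()
    a, t, g, c = (s.count(base) for base in "ATGC")
    if a + t == 0 or g + c == 0:
        raise ValueError("sequence must contain A/T and G/C bases")
    return (a - t) / (a + t), (g - c) / (g + c)


# Toy sequence with A=3, T=1, G=3, C=1.
at_skew, gc_skew = skews("AAATGGGC")
print(at_skew, gc_skew)
```

A positive AT skew means the analyzed strand carries more A than T, and a positive GC skew means more G than C; negative values indicate the reverse.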

Repetitive elements analysis

We searched the entire mitogenome of S. auriculariicola against itself by BLASTn using Circoletto [62] with an E-value of <10^−10, aiming to identify large intragenomic replications of sequences and interspersed repeats. The Tandem Repeats Finder [63] with default settings was used to analyze tandem repeats. We searched for repeated sequences, including forward, reverse, complementary, and reverse complementary sequences, in S. auriculariicola using the REPuter [64] tool with E-values <10^−5.

Phylogenetic analysis

For the phylogenetic analysis, we constructed a phylogenetic tree based on 15 common mitochondrial genes from S. auriculariicola and 15 other species in Leotiomycetes, 8 species in Dothideomycetes, 11 species in Eurotiomycetes, and 3 species in Sordariomycetes (outgroup). The MAFFT algorithm within the TranslatorX online platform [65] was used to align the 15 conserved protein-coding genes. The Sequence Matrix 1.7.8 program [66] was used to combine the individual genes into a combined matrix. We used the Modelgenerator v851 [67] tool to determine the best-fit evolutionary model for the phylogenetic analysis.
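Combining per-gene alignments into one matrix amounts to joining each taxon's aligned sequences in a fixed gene order. The sketch below shows the idea only and does not reproduce Sequence Matrix itself: the function name, the dictionary layout, and the use of "?" to pad genes missing from a taxon are assumptions.

```python
def concatenate_alignments(gene_alns: list[dict[str, str]]) -> dict[str, str]:
    """Join per-gene alignments ({taxon: aligned_seq}) into one supermatrix.

    Genes are concatenated in list order. A taxon absent from a gene is
    padded with '?' over the full width of that gene's alignment.
    """
    taxa = sorted({taxon for aln in gene_alns for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    for aln in gene_alns:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one gene must be aligned")
        width = widths.pop()
        for taxon in taxa:
            matrix[taxon] += aln.get(taxon, "?" * width)
    return matrix


# Toy data: the second gene is missing for taxon "sp2".
cox1 = {"sp1": "ATG-", "sp2": "ATGA"}
nad1 = {"sp1": "CC"}
combined = concatenate_alignments([cox1, nad1])
print(combined)
```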

The Bayesian inference (BI) method was used for phylogenetic analysis based on the combined gene dataset with the MrBayes 3.2.6 [68] program. Two independent runs were performed for 2 × 10^6 generations each, sampling every 100 generations. Stationarity was assumed to have been reached when the effective sample size (ESS) was >100 and the potential scale reduction factor (PSRF) approached 1.0. After the analysis was stable, the first 25% of the yielded trees were discarded as burn-in, and a 50% majority-rule consensus tree with posterior probability (PP) values was generated from the remaining trees. In order to compare the mitochondrial phylogeny with the nuclear multi-locus phylogeny, we downloaded internal transcribed spacer (ITS), RNA polymerase II second largest subunit (RPB2), translation elongation factor-1 alpha (EF1-α), and beta-tubulin (β-TUB) genes of 38 species from the NCBI database. Phylogenetic trees were constructed using the same method as for the mitochondrial genes. We also used the BI method to analyze the phylogenetic relationships of S. auriculariicola and related species using individual mitochondrial genes (15 core protein-coding genes), in order to test whether these genes are useful as molecular markers for the phylogenetic analysis of Leotiomycetes species.
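The burn-in and PSRF criteria can be sketched for a single sampled parameter. This is an illustration using the textbook Gelman-Rubin form of the PSRF, which may differ in detail from what MrBayes reports; the function names and the toy chains are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def discard_burn_in(samples: list, fraction: float = 0.25) -> list:
    """Drop the first `fraction` of the samples (25% in the text)."""
    return samples[int(len(samples) * fraction):]


def psrf(chains: list[list[float]]) -> float:
    """Potential scale reduction factor for equal-length chains.

    W is the mean within-chain variance and B/n the variance of the chain
    means; PSRF = sqrt(((n - 1) / n * W + B / n) / W).
    """
    n = len(chains[0])
    if any(len(chain) != n for chain in chains):
        raise ValueError("chains must have equal length")
    within = mean(variance(chain) for chain in chains)
    between_over_n = variance([mean(chain) for chain in chains])
    return sqrt(((n - 1) / n * within + between_over_n) / within)


# Toy log-likelihood traces from two runs; the first 25% is discarded.
run1 = discard_burn_in([-90.0, -60.0, -12.0, -10.0, -11.0, -10.5, -11.5, -10.0])
run2 = discard_burn_in([-95.0, -70.0, -11.0, -10.5, -11.5, -10.0, -12.0, -10.5])
print(len(run1), psrf([run1, run2]))
```

Values close to 1.0 suggest that the independent runs are sampling the same distribution; chains that sit in different regions give values well above 1.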
