We are searching data for your request:

**Forums and discussions:**

**Manuals and reference books:**

**Data from registers:**

**Wait the end of the search in all databases.**

Upon completion, a link will appear to access the found materials.

Upon completion, a link will appear to access the found materials.

To calculate likelihoods for state-dependent diversification models we use a pruning algorithm with calculations that progress back through the tree from the tips to the root. We will follow the description of this algorithm in Maddison et al. (2007). We have already used this approach to derive likelihoods for constant rate birth-death models on trees (Chapter 12), and this derivation is similar.

We consider a phylogenetic tree with data on the character states at the tips. For the purposes of this example, we will assume that the tree is complete and correct – we are not missing any species, and there is no phylogenetic uncertainty. We will come back to these two assumptions later in the chapter.

We need to obtain the probability of obtaining the data given the model (the likelihood). As we have seen before, we will calculate that likelihood going backwards in time using a pruning algorithm (Maddison et al. 2007). The key principle, again, is that if we know the probabilities at some point in time on the tree, we can calculate those probabilities at some time point immediately before. By applying this method successively, we can move back towards the root of the tree. We move backwards along each branch in the tree, merging these calculations at nodes. When we get to the root, we have the probability of the data given the model and the entire tree – that is, we have the likelihood.

The other essential piece is that we have a starting point. When we start at the tips of the tree, we assume that our character states are fixed and known. We use the fact that we know all of the species and their character states at the present day as our starting point, and move backwards from there (Maddison et al. For example, for a species with character state 0, the likelihood for state 0 is 1, and for state 1 is zero. In other words, at the tips of the tree we can start our calculations with a probability of 1 for the state that matches the tip state, and 0 otherwise.

This discussion also highlights the fact that incorporating uncertainty and/or variation in tip states for these algorithms is not computationally difficult – we just need to start from a different point at the tips. For example, if we are completely unsure about the tip state for a certain taxa, we can begin with likelihoods of 0.5 for starting in state 0 and 0.5 for starting in state 1. However, such calculations are not commonly implemented in comparative methods software.

We now need to consider the change in the likelihood as we step backwards through time in the tree (Maddison et al. We will consider some very small time interval *Δ**t*, and later use differential equations to find out what happens in the limit as this interval goes to zero (Figure 13.2). Since we will eventually take the limit as *Δ**t* → 0, we can assume that the time interval is so small that, at most, one event (speciation, extinction, or character change) has happened in that interval, but never more than one. We will calculate the probability of the observed data given that the character is in each state at time *t*, again measuring time backwards from the present day. In other words, we are considering the probability of the observed data if, at time t, the character state were in state 0 [*p*_{0}(*t*)] or state 1 [*p*_{1}(*t*)]. For now, we can assume we know these probabilities, and try to calculate updated probabilities at some earlier time *t* + *Δ**t*: *p*_{0}(*t* + *Δ**t*) and *p*_{1}(*t* + *Δ**t*).

To calculate *p*_{0}(*t* + *Δ**t*) and *p*_{1}(*t* + *Δ**t*), we consider all of the possible things that could happen in a time interval *Δ**t* along a branch in a phylogenetic tree that are compatible with our dataset (Figure 13.2; Maddison et al. First, nothing at all could have happened; second, our character state could have changed; and third, there could have been a speciation event. This last event might seem incorrect, as we are only considering changes along branches in the tree and not at nodes. If we did not reconstruct any speciation events at some point along a branch, then how could one have taken place? The answer is that a speciation event could have occurred but all taxa descended from that branch have since gone extinct. We must also consider the possibility that either the right or the left lineage went extinct following the speciation event; that is why the speciation event probabilities appear twice in Figure 13.2 (Maddison et al. 2007).

We can write an equation for these updated probabilities. We will consider the probability that the character is in state 0 at time *t* + *Δ**t*; the equation for state 1 is similar (Maddison et al. 2007).

$$ egin{aligned} p_0 (t+Delta t)=(1-mu_0 )Delta t cdot [(1-q_{01} Delta t)(1-lambda_0 Delta t) p_0 (t)+q_{01} Delta t(1-lambda_0 Delta t) p_1 (t)+2 cdot (1-q_{01} Delta t) lambda_0 Delta t cdot E_0 (t) p_0 (t)] end{aligned} label{13.4}$$

We can multiply through and simplify. We will also drop any terms that include [*Δ**t*]^{2}, which become vanishingly small as *Δ**t* decreases. Doing that, we obtain (Maddison et al. 2007):

[p_0(t + Δt)=[1 − (λ_0 + μ_0 + q_{01})Δt]p_0(t)+(q_{01}Δt)p_1(t)+2(λ_0Δt)E_0(t)p_0(t) label{13.5}]

Similarly,

[p_1(t + Δt)=[1 − (λ_1 + μ_1 + q_{10})Δt]p_1(t)+(q_{10}Δt)p_0(t)+2(λ_1Δt)E_1(t)p_1(t) label{13.6}]

We can then find the instantaneous rate of change for these two equations by solving for *p*_{1}(*t* + *Δ**t*)/[*Δ**t*], then taking the limit as *Δ**t* → 0. This gives (Maddison et al. 2007):

$$ frac{dp_0}{dt} = -(lambda_0 + mu_0 + q_{01}) p_0(t) + q{01}p_1(t) + 2 lambda_0 E_0(t) p_0(t) label{13.7}$$

and:

$$ frac{dp_1}{dt} = -(lambda_1 + mu_1 + q_{10}) p_1(t) + q{10}p_1(t) + 2 lambda_1 E_1(t) p_1(t) label{13.8}$$

We also need to consider *E*_{0}(*t*) and *E*_{1}(*t*). These represent the probability that a lineage with state 0 or 1, respectively, and alive at time *t* will go extinct before the present day. Neglecting the derivation of these formulas, which can be found in Maddison et al. (2007) and is closely related to similar terms in Chapters 11 and 12, we have:

$$ frac{dE_0}{dt} = mu_0-(lambda_0+mu_0+q_{01} ) E_0 (t)+q_{01} E_1 (t)+lambda_0 [E_0 (t)]^2 label{13.9} $$

and:

$$ frac{dE_1}{dt} = mu_1-(lambda_1+mu_1+q_{10} ) E_1 (t)+q_{10} E_0 (t)+lambda_1 [E_1 (t)]^2 label{13.10}$$

Along a single branch in a tree, we can sum together many such small time intervals. But what happens when we get to a node? Well, if we consider the time interval that contains the node, then we already know what happened – a speciation event. We also know that the two daughters immediately after the speciation event were identical in their traits (this is an assumption of the model). So we can calculate the likelihood for their ancestor for each state as the product of the likelihoods of the two daughter branches coming into that node and the speciation rate (Maddison et al. In this way, we merge our likelihood calculations along each branch when we get to nodes in the tree.

When we get to the root of the tree, we are almost done – but not quite! We have partial likelihood calculations for each character state – so we know, for example, the likelihood of the data if we had started with a root state of 0, and also if we had started at 1. To merge these we need to use probabilities of each character state at the root of the tree (Maddison et al. For example, if we do not know the root state from any outside information, we might consider root probabilities for each state to be equal, 0.5 for state 0 and 0.5 for state 1. We then multiply the likelihood associated with each state with the root probability for that state. Finally, we add these likelihoods together to obtain the full likelihood of the data given the model.

The question of which root probabilities to use for this calculation has been discussed in the literature, and does matter in some applications. Aside from equal probabilities of each state, other options include using outside information to inform prior probabilities on each state (e.g. Hagey et al. 2017), finding the calculated equilibrium frequency of each state under the model (Maddison et al. 2007), or weighting each root state by its likelihood of generating the data, effectively treating the root as a nuisance parameter (FitzJohn et al. 2009).

I have described the situation where we have two character states, but this method generalizes well to multi-state characters (the MuSSE method; FitzJohn 2012). We can describe the evolution of the character in the same way as described for multi-state discrete characters in chapter 9. We then can assign unique diversification rate parameters to each of the *k* character states: *λ*_{0}, *λ*_{1}, …, *λ*_{k} and *μ*_{0}, *μ*_{1}, …, *μ*_{k} (FitzJohn 2012). It is worth keeping in mind, though, that it is not too hard to construct a model where parameters are not identifiable and model fitting and estimation become very difficult.

## Fit_musse : Fit a discrete-state-dependent diversification model via.

The Binary State Speciation and Extinction (BiSSE) model (Maddison et al. 2007) and its extension to Multiple State Speciation Extinction (MuSSE) models (FitzJohn et al. 2009, 2012), Hidden State Speciation Extinction (HiSSE) models (Beaulieu and O'meara, 2016) or Several Examined and Concealed States-dependent Speciation and Extinction (SecSSE) models (van Els et al. 2018), describe a Poissonian cladogenic process whose birth/death (speciation/extinction) rates depend on the states of an evolving discrete trait. Specifically, extant tips either go extinct or split continuously in time at Poissonian rates, and birth/death rates at each extant tip depend on the current state of the tip lineages tansition stochastically between states acccording to a continuous-time Markov process with fixed transition rates. At the end of the simulation (i.e., at "present-day"), extant lineages are sampled according to some state-dependent probability ("sampling_fraction"), which may depend on proxy state. Optionally, tips may also be sampled continuously over time according to some Poissonian rate (which may depend on proxy state), in which case the resulting tree may not be ultrametric.

This function takes as main input a phylogenetic tree (ultrametric unless Poissonian sampling is included) and a list of tip proxy states, and fits the parameters of a BiSSE/MuSSE/HiSSE/SecSSE model to the data via maximum-likelihood. Tips can have missing (unknown) proxy states, and the function can account for biases in species sampling and biases in the identification of proxy states. The likelihood is calculated using a mathematically equivalent, but computationally more efficient variant, of the classical postorder-traversal BiSSE/MuSSE/HiSSE/SecSSE algorithm, as described by Louca (2019). This function has been optimized for large phylogenetic trees, with a relatively small number of states (i.e. Nstates<<Ntips) its time complexity scales roughly linearly with Ntips.

## Materials and Methods

### PHYLOGENY ESTIMATION

Sequences were obtained for three mitochondrial (16S, 12S, COI) and three nuclear (18S, 28S, Histone H3) gene regions for 466 described crayfish species and subspecies, representing 70% of the described diversity (Table S1) (Crandall and De Grave 2017 ). These genes have proven useful for resolving relationships among crayfish and other decapods (Toon et al. 2009 Bracken-Grissom et al. 2014 ). Of the 1565 total sequences from 466 taxa used in the phylogeny estimate, 411 from 119 taxa were new to this study. The full molecular dataset contained 450 16S sequences, 289 12S sequences, 352 COI sequences, 86 18S sequences, 247 28S sequences, and 141 Histone H3 sequences. Protein coding sequences (COI, H3) were aligned by translating to amino acid sequences and back translating to nucleotides in Translator X (Abascal et al. 2010 ). Protein coding sequences that contained stop codons were removed to eliminate potential nuclear mitochondrial genes (Song et al. 2008 Buhay 2009 ). The rRNA genes were aligned with PASTA version 1.6 (Mirarab et al. 2015 ) using default alignment, merging, and tree searching algorithms suggested by the program. We used the greedy algorithm in PartitionFinder version 1.1.1 (Lanfear et al. 2012 ) to select an optimal partitioning scheme and best fitting models of molecular evolution for each partition using the Bayesian information criterion (BIC). Codon positions for protein coding genes and full rRNA genes were used as a priori data subsets.

Bootstrap phylogenies (*N* = 1000) were estimated from the concatenated alignment using RAxML version 8.2.13 (Stamatakis 2014 ) with the optimal partitioning scheme suggested by PartitionFinder. Option “-f d” was used to search for the maximum likelihood phylogeny from 200 randomized stepwise addition order parsimony trees under the GTRCAT model, with the best scoring tree optimized under the GTRGAMMA model with four rate categories.

The maximum likelihood phylogeny and all bootstrap phylogenies were calibrated to absolute time using penalized likelihood in the program treePL (Smith and O'Meara 2012 ) with an optimal smoothing parameter chosen using a cross-validation procedure on the maximum likelihood phylogeny. Eight fossil calibration points were used following the placements and justifications found in Bracken-Grissom et al. ( 2014 ). The maximum age of the root of the tree was set to 262 Ma to reflect an estimated split of the two crayfish superfamilies (Astacoidea and Parastacoidea) (Bracken-Grissom et al. 2014 ). The lower bounds of fossil age estimates were used as minimum estimates for their assigned clades.

Although our molecular dataset sought to maximize coverage and compatibility with existing sampling, we used a phylogenetic synthesis approach, rather than either molecular sampling or taxonomy alone, to take advantage of all recent phylogenetic studies and taxonomic updates to the freshwater crayfishes. We also used our synthesis tree (our newly estimated phylogeny + phylogenies from the literature + taxonomy) to assess data coverage and sampling (Hinchliff et al. 2015 ). The synthesis approach allows us to construct a more complete picture of phylogeny and taxonomy, using studies with different sampling schemes (in terms of taxa and genes), which often result in different topologies. This approach takes taxonomy as a backbone phylogeny and introduces bifurcations to the tree using existing phylogenetic studies, resolving conflicts among these inputs with user-defined rankings. We combined 19 published crayfish phylogenies (as referenced in Owen et al. 2015 ) and the maximum likelihood estimate with Open Tree taxonomy ott2.10 (Hinchliff et al. 2015 ). These trees were merged and assembled into a synthesis tree using the *propinquity* pipeline (Redelings and Holder 2017 ). Taxa not represented in the published phylogenies or in the maximum likelihood estimate are represented by taxonomy in the synthetic tree, giving us a phylogeny-informed understanding of data distribution and a complete phylogenetic hypothesis for the freshwater crayfishes.

### GEOGRAPHIC RANGE SIZE, HABITAT CORRELATION

We analyzed species’ current geographic range sizes to test if the contemporary distributions of cave lineages are smaller than those of other specialized or generalized surface lineages, a predicted consequence of the dead-end model. Dispersal ability is generally assumed to have a relationship with geographic range size, despite the many exceptions and other contributing factors to the range size of a species (Lester et al. 2007 ). Nevertheless, both dispersal ability and geographic range size influence speciation and extinction dynamics (Rosenzweig 1995 Birand et al. 2012 ). Extinction probability increases to 1 as range size tends to 0 (Jones et al. 2003 ) and range size is one of the most commonly used predictors of extinction risk (Foote et al. 2008 Harnik et al. 2012 ). Conversely, speciation probability should increase as geographic range size increases due to the increased probability of vicariance or isolation by distance yielding population fragmentation (Birand et al. 2012 ).

Range maps in the form of ESRI Shapefiles for 540 crayfish species were obtained from the IUCN Red List database (IUCN 2015 ) and converted into spatial polygons in the R packages *letsR* (Vilela and Villalobos 2015 ). Native geographic range sizes in square meters were calculated for each species from the spatial polygons. If multiple polygons were present for a given species, range sizes for each polygon were summed to obtain a total range size estimate for each species. Range sizes were converted to square kilometers and log_{10}-transformed before all analyses.

We used Felsenstein's threshold model to test for a correlation between rates of change in habitat-preference and geographic range size, thereby testing if cave-adapted species and other specialized species have significantly smaller geographic range sizes than nonspecialized species (Felsenstein 2012 ). The threshold model assumes that a binary trait has an underlying continuous trait called “liability” that controls the state of the binary trait. This facilitates estimation of the correlation between a discrete binary and a continuous trait. The maximum likelihood phylogeny was trimmed to the 386 tips with range maps in the IUCN Red List database (Table S2). Habitat preference was coded as either exclusively one state (i.e., cave = 0) or any other state or combination thereof (i.e., surface = 1). Habitat assignments were made using data from the IUCN Red List (IUCN 2015 ) and crayfish taxonomy web browser (Fetzner 2005 ). Species were assigned to one or more of the following categories: lentic (still) water, lotic (flowing) water, caves, or burrows. Primary (obligate) burrowers were coded as inhabiting burrows exclusively, whereas secondary (seasonal) burrowers were coded as inhabiting burrows and another habitat state (lentic, lotic, or both) (Table S2). Five tests were performed for each of the four “specialist” habitat types and one “generalist” type using the R package *phytools* (Revell 2012 ). A Markov chain Monti Carlo (MCMC) chain was run under default priors for 30 million generations for each test, sampling every 3000 generations. The first 25% of samples were discarded as burn-in prior to summarizing parameter estimates and effective sample size (ESS) values for the correlation coefficient, *r*, were checked (>200) to assess convergence. Correlations were assumed to be significant if the 95% highest posterior density (HPD) interval of the correlation coefficient, *r*, did not contain 0, because the power of the analysis can reliably reveal only the sign of the correlation (Felsenstein 2012 ).

### DIVERSIFICATION AND TRANSITION RATES

#### Parameter estimation and hypothesis testing

The evolutionary dead-end model predicts that cave adaptation is an irreversible state leading to increased extinction rates and decreased speciation rates. We used the binary-state speciation and extinction (BiSSE) and geographic-state speciation and extinction (GeoSSE) models to estimate diversification and transition rates, and test this prediction of the evolutionary dead-end hypothesis. GeoSSE was originally intended for use with discrete geographical areas, but is applicable to analyses of habitat preference due to the similar processes involved in range and habitat evolution (Goldberg et al. 2011 ). For GeoSSE, taxa were coded as obligate cave-dwellers (A), surface-dwellers (B), or both (AB). Taxa were only coded as AB if they are found both in surface and cave habitats, and can survive and reproduce in caves (troglophiles). For BiSSE, taxa were coded as obligate cave-dwellers (0) or not obligate cave-dwellers (1) to capture the dynamics of specialist species. Incomplete sampling was accounted for by specifying the percentage sampled in each state, estimated from the synthetic tree, and subspecies were trimmed to one per species to avoid conflating species- and population-level processes.

To avoid making process-based inferences based on point estimates of parameters and model fits, which have been shown to be problematic and potentially misleading (Goldberg et al. 2011 Rabosky and Goldberg 2015 Beaulieu and O'Meara 2016 ), we took a Bayesian approach to testing these predictions based on MCMC samples of the full (unconstrained) models. For each of 1000 bootstrap phylogenies, an MCMC chain with slice sampling (Neal 2003 ) was run for 5000 generations using a broad exponential prior distribution with a mean of 0.5 on all parameters of the BiSSE and GeoSSE models in the R package *diversitree* (FitzJohn 2012 ). The first 10% of samples from each run were discarded as burn-in. Posterior probabilities of different predictions made by the evolutionary dead-end hypothesis were assessed by calculating the percentage of MCMC samples that met each condition (e.g., the extinction rate is greater than the speciation rate). We assessed support for an irreversible model by constraining the transition rate out of caves to 0 and enforcing the root state to be in the “surface” state (Goldberg and Igic 2008 ). Reversible and irreversible models were compared using the Bayes factor (Kass and Raftery 1995 ) with marginal likelihoods estimated by taking the harmonic mean of the likelihood of the MCMC chains. All analyses were run on two sets of 1000 bootstrap trees: one set containing the full crayfish phylogeny and the second containing the largest pruned subtrees that maximized the percentage of cave species to reduce potential biases associated with an uneven tip-state distribution (Gamisch 2016 ). This subtree was identified using the synthetic phylogeny and happened to correspond to the *Cambaridae* family, which contains all described cave-adapted species.

To test if diversification rates in cave lineages are significantly different from other specialized habitat affinities, we repeated the above procedure for estimating BiSSE and GeoSSE parameters, categorizing taxa into three additional crayfish habitat types, as above. Taxa occurring in more than one habitat were coded as “AB” in GeoSSE and as “state 1” in BiSSE. To facilitate comparisons across separate analyses for each habitat type, we analyzed relative net diversification rates ([*s _{A}* −

*x*]/[

_{A}*s*−

_{B}*x*]).

_{B}#### SSE model adequacy

To test the adequacy (objective fit) of a state-dependent diversification model, we took two approaches. The first was a posterior-predictive approach in which we simulated phylogenies using parameter estimates from the model and compared the simulated data to the empirical trees. If the model adequately describes the data, phylogenies simulated under the model should be compatible with the observed data (Pennell et al. 2015 ). First, we tested whether data simulated under estimated model parameters produced the same tip state frequency as the empirical phylogenies. For each sampling scheme and model, we used *diversitree* (FitzJohn 2012 ) to simulate 1000 phylogenies of the same size as the total number of taxa with coded habitat data. Each simulation was run using parameters from a random sample from the MCMC chain. For each set of trees, we calculated the number of taxa surviving to the present in each state and compared this to the observed state distribution with a chi-squared test. We also tested if phylogenies simulated using estimated BiSSE and GeoSSE parameters would produce the same phylogenetic signal of habitat states as the empirical data. To keep the prevalence of each state constant, we simulated 1000 trees for each sampling scheme and model using backward simulation in *phylometrics* (Hua and Bromham 2016 ), using the estimated parameters from 1000 MCMC samples for each model. We compared the phylogenetic signal of habitat states in the simulated data and bootstrap phylogenies using the sum of sister clade differences (SSCD) (Fritz and Purvis 2010 ) calculated in the R package *phylometrics* (Hua and Bromham 2016 ). We tested if the mean of the simulated distribution of SSCD was significantly different from the observed distribution using a *t*-test and calculated the proportion of SSCD values in the bootstrap phylogenies that fell within the 95th and 50th percentiles of simulated SSCD values.

The second approach in testing the adequacy of state-dependent models in general was to fit a set of hidden-state speciation and extinction (HiSSE) models to the data to test if an unobserved character in cave lineages might influence diversification dynamics and if character-independent models fit the data better than the state-dependent models (i.e., the diversification process varies through the tree, but not according to any particular modeled trait) (Beaulieu and O'Meara 2016 ). HiSSE allows each state of a binary trait to contain an unobserved “hidden” state with diversification rate parameters that may be different from the observed states. This also facilitates the construction of “character-independent” diversification models with equal complexity to “character-dependent” models. When used in a model selection context, this has been shown to reduce the propensity to select state-dependent models (Beaulieu and O'Meara 2016 ). As the implementation of these models in the R package *hisse* does not currently support MCMC sampling, we do not extensively draw conclusions based on point parameter estimates given the complex likelihood surfaces of SSE models (Goldberg et al. 2011 Beaulieu and O'Meara 2016 ). We fit six HiSSE models using the R package *hisse* (Beaulieu and O'Meara 2016 ) on the 1000 bootstrap phylogenies trimmed to the *Cambaridae* subtree, including two character independent models, a full HiSSE model, and models in which only the cave or surface states had a hidden state. Taxa were coded as they were for the BiSSE analysis. The parameters of the “best-fit” HiSSE model were used to reconstruct marginal ancestral states on the maximum-likelihood phylogeny. All code used in our analyses is available in the Supporting Information.

## 1) Test hypothesis that American marsupials exhibit less variable body sizes than Australian mammals

Read in the tree and the trait data.

/Desktop/Bodega_2014/Datasets/marsupialmass_continent.txt”, header=T) # first column contains species names, second the discrete states and third the continuous trait

It is always a good idea to look at the data first and do some basic statistics to check everything is read in correctly and so you know what to expect from the more complex models. So we are going to plot histograms of the body mass data using different colours for the Australian (state ’) and American species (state ’)

**hist(marsupialc[,3][marsupialc[,2]==1], col=”blue”, add=T)** # Australia **hist(marsupialc[,3][marsupialc[,2]==2], col=”red”, add=T)** # America

We can also estimate the variance

**austvar<-var(marsupialc[,3][marsupialc[,2]==1])** **amvar<-var(marsupialc[,3][marsupialc[,2]==2])**

The distribution of the American species is much narrower than the Australian species, which would support our hypothesis that the American species have less variable body masses but what do the models suggest? Because our dataset is missing the South American microbiotheres the Australasian and South American species form monophyletic lineages, the clade we are identifying clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”) is the South American group. OUwie will automatically assign internal node labels and update the data matrix according to this clade designation.

Fitting models using clades

The simplest model is one where there is no difference between the continents and body mass just evolves according to a Brownian motion process (model=”BM1″ in OUwie).

**bm1sizecontinent<-OUwie(mtr, marsupialc, model=”BM1″, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))** # for completeness I have added the clade, root state and simmap tree but they will be ignored as this is just fitting a Brownian motion model across the whole tree.

The next most simple model is one where there is no difference between the continents and body mass is evolving to the same optimum across marsupials (model=”OU1″ in OUwie)

**ou1sizecontinent<-OUwie(mtr, marsupialc, model=”OU1″, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))** # for completeness I have added the clade and simmap tree but they will be ignored as this is just fitting a Brownian motion model across the whole tree.

What are we doing when we set root.station=TRUE? We are stating whether the starting state, theta_0, should be estimated and the default is TRUE, which means theta_0 is dropped from the model so the starting value is distributed according to the stationary distribution of the OU process. Dropping theta_0 from the model can stabilize estimates of the primary optima, especially if you find the estimates of theta in the full model do not make sense biologically. However, this assumption does not fit a biological scenario involving moving away from an ancestral state. A good idea is to run preliminary analyses using both TRUE and FALSE to see which works for your dataset, for this one TRUE is best.

Now we are going to start fitting models where continent influences body mass evolution. The first allows body size to evolve at different rates on the different continents (model=”BMS” in OUwie)

**bmssizecontinent<-OUwie(mtr, marsupialc, model=”BMS”, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))**

The second allows body mass to evolve towards different peaks but the rate and pull towards the peak are the same (model=”OUM” in OUwie)

**oumsizecontinent<-OUwie(mtr, marsupialc, model=”OUM”, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))**

The third allows body mass to evolve towards different peaks with different strengths of pull towards the peak but the rate is the same (model=”OUMA” in OUwie)

**oumasizecontinent<-OUwie(mtr, marsupialc, model=”OUMA”, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))**

The fourth allows body mass to evolve towards different peaks with rates but the pull towards the peak is the same (model=”OUMV” in OUwie)

**oumvsizecontinent<-OUwie(mtr, marsupialc, model=”OUMV”, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))**

The final model is the most complex and allows continent to influence all parameters (model=”OUMVA”)

**oumvasizecontinent<-OUwie(mtr, marsupialc, model=”OUMVA”, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”))**

First thing to do is to look at how OUwie outputs the data. To get a summary of the parameters and model-fit just type the name of the object you stored the results in e.g.

**bm1sizecontinent**

To extract information to calculate the delta AICc etc. we can look at the structure of the object

**str(bm1sizecontinent)**

We can see it is a list of 18 separate things, it includes the tree and the datset as well as the results. If we want to see how well each model fits the data we can calculate the delta AICc value by extracting the AICc. (We could of course use a likelihood ratio test to also test this etc.).

**bestfit<-c(bm1sizecontinent$AICc, ou1sizecontinent$AICc, bmssizecontinent$AICc, oumsizecontinent$AICc, oumasizecontinent$AICc, oumvsizecontinent$AICc, oumvasizecontinent$AICc)** **names(bestfit)<-c(“bm1”, “ou1”, “bms”, “oum”, “ouma”, “oumv”,”oumva”)**

**bestfit-min(bestfit)**

Now look at the results for the best fitting model – are the rates faster in Australia (1) or America (2)?

**bmssizecontinent**

Checking the model has run properly

These models we are fitting are very complex and sometimes the information contained within the dataset are not sufficient and as a result one or more parameters may be estimated incorrectly. If you include the diagnostics in the OUwie run (diagn=T) it will include the eigendecomposition of the Hessian matrix, which can indicate whether the maximum likelihood estimate has been found. If the eigenvalues of the Hessian are positive then parameter estimates are considered reliable and the ML estimate has been found (i.e. the Hessian is positive definite).

Run the best-fitting bms model but this time include diagnostics and look at the eigenvalues

**bmssizecontinentdiag<-OUwie(mtr, marsupialc, model=”BMS”, simmap.tree=FALSE, root.station=TRUE, clade=c(“Caluromys_derbianus”, “Monodelphis_domestica”), diagn=T)**

**bmssizecontinentdiag$eigval**

The eigenvalues are positive. Now lets look at the standard errors of the rate estimate, if they are large it indicates that if you change the parameter value in any direction it will have little effect on the log-likelihood and the parameter estimate is considered unstable.

**bmssizecontinentdiag$solution.se**

## Syllabus

**Important!** The syllabus below is currently mostly as it was when the course finished in Spring 2018. Modifications will occur as needed as the semester proceeds.

Dirichlet process priors, credible vs. confidence intervals. [CI applet] [Stick-breaking applet]

Bayes factors, steppingstone estimation of marginal likelihood.

An alternative to Pagel’s (1994) test for assessing whether correlation among characters goes beyond what is expected from inheritance alone. *[one 22 min lecture video and PDFs of slides have been posted in the HuskyCT course]*

Independent Contrasts and Phylogenetic Generalized Least Squares (PGLS). *[a 13.5 min lecture on PIC, a 21 min lecture on PGLS, and PDFs of slides have been posted in the HuskyCT course]*

State-dependent diversification models (BiSSE and its descendants) BAMM: estimating the number of shifts in diversification regime and where these occur on the tree.

Open book, due 8pm Wednesday of finals week (i.e. the end time of our scheduled final exam).

**Optional lecture: Estimating phylogenetic information**

A talk I gave at the American Museum of Natural History in January.

## Conclusions

In summary, the approach presented here shows that temporal dynamics of species diversification resulting from biologically relevant events such as key innovations or the impact of environmental change should best be studied in a Bayesian framework. The use of MCMC sampling provides an elegant way to estimate speciation and extinction rates while taking into account the often considerable uncertainty on divergence times. Furthermore, the model with taxon sampling represents an important step towards a more realistic estimation of the diversification parameters, where a non-random distribution of missing taxa can be incorporated with clade-specific sampling proportions. In addition to the models implemented in this study, recently developed modifications of the birth-death process [13, 61] could also be integrated in the algorithm. With the possibility to run customized analyses specifically designed for hypothesis testing, this method provides a useful and flexible statistical framework to investigate diversification processes. A promising future development would be to relax the *F*-model to incorporate more than two rates, by assigning a specific multiplier to each time frame or group of clades.

## Impact Summary

What an animal eats is a fundamental part of its biology. Surprisingly, the evolution of animal diets has not been studied across all animals. Here, we analyzed diet data across an evolutionary tree of animals to address three major questions. First, are diets evolutionary conserved over time, or are they highly labile and variable among species? Whether ecological traits are evolutionarily conserved has become a major debate in evolutionary biology and ecology. Most studies have examined traits over shorter timescales, but here we test a major ecological trait over an extraordinarily deep timescale (>800 million years). Second, does diet influence rates of species proliferation, and thereby determine patterns of diversity among animal phyla? Animal phyla vary from less than five species to more than 1.2 million (i.e., arthropods). Previous studies suggested that diet (especially a herbivorous, plant-eating diet) drives rates of species proliferation and diversity patterns in major groups of animals (e.g., mammals, insects, and crustaceans). However, this has not been tested across animal phyla. Third, what was the diet of the ancestor of all living animals, and of the major animal clades? We find three surprising results. First, we show that diet is highly conserved across animals, such that related species tend to share similar diets. Thus, we show that ecological traits can be evolutionarily conserved over incredibly deep timescales. Second, diet does not significantly influence large-scale patterns of animal diversity, despite previous studies showing that herbivorous diet increases rates of species proliferation. Finally, we find that the ancestor of all animals was most likely carnivorous (eating other heterotrophs), as were the ancestors of many of the largest animal groups (like arthropods, chordates, and molluscs). Our results suggest that many carnivorous species living today may have inherited this trait through a series of carnivorous ancestors dating back more than 800 million years.

One of the most fundamental aspects of an animal's biology is its diet. Animals have a remarkable diversity of diets and associated lifestyles, including mammalian carnivores that pursue large and dangerous prey, insect herbivores that specialize on a few plant species, and marine invertebrates that passively filter feed on tiny organisms (Hickman et al. 2012 ). Yet, the evolution of animal diets remains poorly understood at the largest phylogenetic scales (e.g., among phyla). Previous large-scale studies have suggested that food webs in natural systems are shaped (in part) by phylogenetic constraints on diet (Cattin et al. 2014 ) and that ecological interactions among species (e.g., predator–prey) are broadly conserved across the tree of life (Gomez et al. 2010 ). However, these important studies did not directly address the evolution and conservatism of trophic strategies at deep phylogenetic scales.

Here, we address three major unresolved questions about the evolution of diet across animals. First, are diets evolutionary conserved across the animal tree of life? There has been considerable debate about whether ecological niches are conserved or not, including which aspects of the niche are conserved and over what timescales (e.g., Peterson et al. 1999 Losos et al. 2003 Losos 2008 Crisp et al. 2009 Gomez et al. 2010 Wiens et al. 2010 Peterson 2011 Cattin et al. 2014 Anderson and Wiens 2017 ). Yet, as noted by Olalla-Tárraga et al. ( 2017 ), this literature typically focuses on the Grinnellian niche (e.g., large-scale climate) and not the Eltonian niche (e.g., local-scale species interactions terminology following Soberón 2007 ). Here, we provide the broadest test (so far) of conservatism in the Eltonian niche, with an analysis spanning >800 million years of evolutionary history (Fig. 1).

Second, does diet influence rates of species diversification at broad phylogenetic scales across animal phylogeny? Previous studies have shown evidence that diet (e.g., herbivory) influences diversification within some important groups (e.g., mammals: Price et al. 2012 hexapods: Wiens et al. 2015 birds: Burin et al. 2016 and crustaceans: Poore et al. 2017 ). However, it remains unclear whether diet influences diversification patterns among phyla. There is striking variation in richness among animal phyla (from less than five species to more than 1.2 million) that is strongly associated with variation in diversification rates (Wiens 2015 ). Recent analyses suggest that most variation in diversification rates and richness among animal phyla is explained by whether phyla are predominantly nonmarine, have skeletons, and are parasites on other animals (Jezkova and Wiens 2017 ). However, diet itself was not included. Here, we test whether diet significantly influences large-scale diversification patterns across animals.

Third, what was the ancestral diet of animals and what were the major shifts in diet across the animal tree of life? For example, were animals originally carnivores or herbivores? What about major phyla, such as arthropods, molluscs, and chordates? Few previous studies have explicitly addressed this topic. Vermeij and Lindberg ( 2000 ) suggested that “nonherbivory” was ancestral for animals, but focused on marine taxa, used a restricted definition of herbivory, and did not present explicit ancestral-state reconstructions. More recent studies have commented on the possible ancestral feeding ecology of animals (using phylogenies and fossils Sperling and Vinther 2010 Erwin et al. 2011 Sperling et al. 2013 ), and concluded that this ancestor was not carnivorous. However, they did not directly test whether this ancestor was more likely to be herbivorous (defined here as feeding on autotrophs) or carnivorous (i.e., feeding on heterotrophs). Although some might argue that such deep-scale patterns can only be estimated with fossils, many relevant animal phyla do not preserve well (e.g., small, soft-bodied taxa Sperling 2013 ) and diet may be difficult to infer for many fossil taxa. Furthermore, analyzing data from extant taxa allows use of new methods that can estimate ancestral states while accounting for the possible impact of those states on diversification rates and the impact of diversification rates on ancestral-state reconstructions (HiSSE Beaulieu and O'Meara 2016a ).

We address these three questions using a phylogenetic approach. We first assemble a dataset of 1087 carefully selected taxa with diet data from the literature (Dataset S1 all datasets and supplementary materials are available as Supporting Information and on Dryad, https://doi.org/10.5061/dryad.q2d60q3). The selected taxa have published diet data available, are represented in the time-calibrated phylogeny assembled here (Dataset S2), and are sampled in proportion to the richness of the phyla they belong to (Table S1). We then test for phylogenetic signal in diet across the tree, as a test of niche conservatism. We next test whether diet influences diversification rates using state-dependent speciation and extinction (HiSSE) models, and an alternative approach based on estimated net diversification rates for phyla. We then use the best-fitting HiSSE models to estimate ancestral states and major changes in diet across animal phylogeny. We also conduct these analyses on two alternative trees, based on different assumptions about animal phylogeny and divergence times (Dunn et al. 2014 Wiens 2015 ). Note that HiSSE analyses can account for the potential impact of other traits on diversification besides diet, and we also perform analyses to address the potential confounding effects of marine habitat.

## Introduction

In statistical phylogenetics, we are interested in learning the parameters of models in which evolutionary trees—phylogenies—play an important part. Such analyses have a surprisingly wide range of applications across the life sciences 1,2,3 . In fact, the research front in many disciplines is partly defined today by our ability to learn the parameters of realistic phylogenetic models.

Statistical problems are often analyzed using generic modeling and inference tools. Not so in phylogenetics, where empiricists are largely dependent on dedicated software developed by small teams of computational biologists 3 . Even though these software packages have become increasingly flexible in recent years, empiricists are still limited to a large extent by predefined model spaces and inference strategies. Venturing outside these boundaries typically requires the help of skilled programmers and inference experts.

If it were possible to specify arbitrary phylogenetic models in an easy and intuitive way, and then automatically learn the latent variables (the unknown parameters) in them, the full creativity of the research community could be unleashed, significantly accelerating progress. There are two major hurdles standing in the way of such a vision. First, we must find a formalism (a language) that can express phylogenetic models in all their complexity, while still being easy to learn for empiricists (the modeling language expressivity problem). Second, we need to be able to generate computationally efficient inference algorithms from such model descriptions, drawing from the full range of techniques available today (the automated inference problem).

In recent years, there has been significant progress toward solving the expressivity problem by adopting the framework of probabilistic graphical models (PGMs) 4,5 . PGMs can express many components of phylogenetic models in a structured way, so that efficient Markov chain Monte Carlo (MCMC) samplers—the current workhorse of Bayesian statistical phylogenetics—can be automatically generated for them 5 . Other, more novel inference strategies are also readily applied to PGM descriptions of phylogenetic model components, as exemplified by recent work using STAN 6 or the new Blang framework 7 .

Unfortunately, PGMs cannot express the core of phylogenetic models: the stochastic processes that generate the tree, and anything dependent on those processes. This is because the evolutionary tree has variable topology, while a PGM expresses a fixed topology. The problem even occurs on a fixed tree, if we need to express the possible existence of unobserved side branches that have gone extinct or have not been sampled. There could be any number of those for a given observed tree, each corresponding to a separate PGM instance.

Similar problems occur when describing evolutionary processes occurring on the branches of the tree. Many of the standard models considered today for trait evolution, such as continuous-time discrete-state Markov chains, are associated with an infinite number of possible change histories on a given branch. It is not always possible to represent this as a single distribution with an analytical likelihood that integrates out all change histories. Thus, it is sometimes necessary to describe the model as an unbounded stochastic loop or recursion over potential PGMs (individual change histories).

PGM-based systems may address these shortcomings by providing model components that hide underlying complexity. For instance, a tree may be represented as a single stochastic variable in a PGM-based model description 5 . An important disadvantage of this approach is that it removes information about complex model components from the model description. This forces users to choose among predefined alternatives instead of enabling them to describe how these model components are structured. Furthermore, computers can no longer “understand” these components from the model description, making it impossible to automatically apply generic inference algorithms to them. Instead, special-purpose code has to be developed manually for each of the components. Finally, hiding a complex model component, such as a phylogenetic tree, also makes dependent variables unavailable for automated inference. In phylogenetics, for instance, a single stochastic tree node makes it impossible to describe branch-wise relations between the processes that generated the tree and other model components, such as the rate of evolution, the evolution of organism traits, or the dispersal of lineages across space.

Here, we show that the expressivity problem can be solved using universal probabilistic programming languages (PPLs). A “universal PPL” is an extension of a Turing-complete general-purpose language, which can express models with an unbounded number of random variables. This means that random variables are not fixed statically in the model (as they are in a finite PGM) but can be created dynamically during execution.

PPLs have a long history in computer science 8 , but until recently they have been largely of academic interest because of the difficulty of generating efficient inference machinery from model descriptions using such expressive languages. This is now changing rapidly thanks to improved methods of automated inference for PPLs 9,10,11,12,13,14 , and the increased interest in more flexible approaches to statistical modeling and analysis. Current PPL inference algorithms provide state-of-the-art performance for many models but they are still quite inefficient for others. Improving PPL algorithms so that they can compete with manually engineered solutions for more problem domains is currently a very active research area.

To demonstrate the potential of PPLs in statistical phylogenetics, we tackle a tough problem domain: models that accommodate variation across lineages in diversification rate. These include the recent cladogenetic diversification rate shift (ClaDS) (ClaDS0–ClaDS2) 15 , lineage-specific birth–death-shift (LSBDS) 16 , and Bayesian analysis of macroevolutionary mixtures (BAMM) 17 models, attracting considerable interest among evolutionary biologists despite the difficulties in developing good inference algorithms for some of them 18 .

Using WebPPL—an easy-to-learn PPL 9 —and Birch—a language with a more computationally efficient inference machinery 14 —we develop techniques that allow us to automatically generate efficient sequential Monte Carlo (SMC) algorithms from short descriptions of these models (

100 lines of code each). Although we found it convenient to work with WebPPL and Birch for this paper, we emphasize that similar work could have been done using other universal PPLs. Adopting the PPL approach allows us generate the first efficient SMC algorithms for these models, and the first asymptotically exact inference machinery for the full BAMM model. Among other benefits, SMC inference allows us to directly estimate the marginal likelihoods of the models, so that we can assess their performance in explaining empirical data using rigorous Bayesian model comparison. Taking advantage of this, we show that models with slowing diversification, constant turnover and many small shifts (all combined in ClaDS2) generally explain the data from 40 bird phylogenies better than alternative models. We end the paper by discussing a few problems, all seemingly tractable, which remain to be solved before PPLs can be used to address the full range of phylogenetic models. Solving them would facilitate the adoption of a wide range of novel inference strategies that have seen little or no use in phylogenetics before.

## Abstract

Understanding how and why rates of evolutionary diversification vary is a key issue in evolutionary biology, ecology, and biogeography. Evolutionary rates are the net result of interacting processes summarized under concepts such as adaptive radiation and evolutionary stasis. Here, we review the central concepts in the evolutionary diversification literature and synthesize these into a simple, general framework for studying rates of diversification and quantifying their underlying dynamics, which can be applied across clades and regions, and across spatial and temporal scales. Our framework describes the diversification rate (*d*) as a function of the abiotic environment (*a*), the biotic environment (*b*), and clade-specific phenotypes or traits (*c*) thus, *d*

* *

*a,b,c*. We refer to the four components (*a*–*d*) and their interactions collectively as the “Evolutionary Arena.” We outline analytical approaches to this framework and present a case study on conifers, for which we parameterize the general model. We also discuss three conceptual examples: the *Lupinus* radiation in the Andes in the context of emerging ecological opportunity and fluctuating connectivity due to climatic oscillations oceanic island radiations in the context of island formation and erosion and biotically driven radiations of the Mediterranean orchid genus *Ophrys*. The results of the conifer case study are consistent with the long-standing scenario that low competition and high rates of niche evolution promote diversification. The conceptual examples illustrate how using the synthetic Evolutionary Arena framework helps to identify and structure future directions for research on evolutionary radiations. In this way, the Evolutionary Arena framework promotes a more general understanding of variation in evolutionary rates by making quantitative results comparable between case studies, thereby allowing new syntheses of evolutionary and ecological processes to emerge.

## Acknowledgments

We thank those people who helped in sample collection and analysis, especially SCL Knowles, J Martínez de la Puente, E Lobato, G Hegyi and M Herényi. We also thank H Winkler, R-T Klumpp and A Fojt for logistical support for the Vienna study. The study was supported by a Synthesys grant from the European Union and a grant from the Kungliga Fysiografiska Sällskapet i Lund to ES, grants from the Hungarian Scientific Research Fund (OTKA 75618, F68295, PD75481) to JT and BR, the Swedish Research Council (individual grants to JÅN and DH) and CAnMove (a Linnaeus Research Excellence Centre funded by the Swedish Research Council and Lund University) to JÅN and DH, the Max Planck Society to BK, the Thule Institute in University of Oulu to SR, the University of Antwerp to LZG and ME, the Spanish National Research Council (CSIC, Spain) to LZG, the Spanish Ministry of Science and Innovation (project CGL2009-09439) to SM and LZG, DS/WBiNoZ/INOS/767/10 to MC. MJW was supported by a Natural Environment Research Council standard grant to Prof Ben Sheldon. The European Union and the European Social Fund have provided financial support to the project under the grant agreement no. TÁMOP 4.2.1./B-09/1/KMR-2010-0003. We thank J Donés, Director of Centro Montes de Valsaín, the Dirección General del Medio Natural (Junta de Castilla y León) and the Pilis Park Forestry to authorize the work in the San Ildefonso and Pilisszentlászló study areas.