**Statistical Analysis Recreation of “Maternal Size and Age Shape Offspring Size in a Live-Bearing Fish, Xiphophorus birchmanni”**

Author: Benison P. Zerrudo

BIO 531 – Biological Data Analysis I with Dr. William B. Kristan (CSUSM)

**Synopsis**

Offspring size has long been studied as an outcome of maternal investment shaped by ecological factors such as competition or predation. Classical theory predicts that these ecological factors select for an optimal offspring size. However, there is evidence that maternal size or age can also influence offspring size. This pattern is still poorly understood, and some studies propose that it may reflect non-adaptive variation or be a by-product of spatial or temporal differences in the mother's environment. To investigate this pattern, I recreated the statistical analysis from the paper “Maternal Size and Age Shape Offspring Size in a Live-Bearing Fish, *Xiphophorus birchmanni*” by Kindsvater et al. (2012). I determined the relationship between maternal size and age in three populations of the swordtail *Xiphophorus birchmanni* using a linear model and ANCOVA in RStudio. I also tested the relationship between maternal size and offspring size using a linear regression analysis for each population. I found that population and age are good predictors of maternal size. I also found that maternal size is positively correlated with offspring size in some of the populations: larger and older females tend to have larger offspring in the Coacuilco and San Pedro populations. The results support previous theory predicting adaptive age and size dependence in maternal investment. Furthermore, the size and age of females could play an important role in the population growth and evolution of these swordtail fish.

**Statistical Analysis**

First, I modified the original Excel data table by removing all data points with “NA”, leaving only numeric data for analysis. I also removed groups with low counts, such as males, juveniles, and fish that had not matured.
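The filtering step can be sketched as follows. The actual cleanup was done by hand in Excel; this is only a minimal Python illustration, and the column names (`site`, `sex`, `stage`, `std_length`, `age`) and all values are hypothetical, not taken from the paper's data.

```python
# Hypothetical rows; column names and values are made up for illustration.
rows = [
    {"site": "Coacuilco", "sex": "F", "stage": "adult", "std_length": "41.2", "age": "310"},
    {"site": "Coacuilco", "sex": "M", "stage": "adult", "std_length": "38.0", "age": "290"},
    {"site": "San Pedro", "sex": "F", "stage": "juvenile", "std_length": "22.5", "age": "120"},
    {"site": "San Pedro", "sex": "F", "stage": "adult", "std_length": "NA", "age": "350"},
    {"site": "Cocalaco", "sex": "F", "stage": "adult", "std_length": "35.7", "age": "NA"},
    {"site": "Cocalaco", "sex": "F", "stage": "adult", "std_length": "36.9", "age": "305"},
]

NUMERIC_COLUMNS = ("std_length", "age")


def is_complete(row):
    """True when none of the numeric columns holds the string 'NA'."""
    return all(row[col] != "NA" for col in NUMERIC_COLUMNS)


def is_mature_female(row):
    """True for adult females; males and juveniles are excluded."""
    return row["sex"] == "F" and row["stage"] == "adult"


# Keep complete records for mature females and convert numeric fields to float.
clean = [
    {**row, **{col: float(row[col]) for col in NUMERIC_COLUMNS}}
    for row in rows
    if is_complete(row) and is_mature_female(row)
]
```

Only the first and last hypothetical rows survive both filters, so `clean` holds two records with numeric `std_length` and `age`.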

One goal of this study was to determine the correlation between maternal size, maternal age, and offspring size at different sites (populations). Using the hist() function in RStudio, I plotted histograms showing the distribution of standard length (maternal size) for each site. Overall, females in the Cocalaco population were shorter than those in the San Pedro population, while females in the Coacuilco population were of intermediate length.

To determine whether these differences in standard length among populations were statistically significant, I created a linear model with standard length as the response variable and both maternal age and population as predictors, and ran an ANOVA on the model. The population variable was converted to a factor before it was added to the linear model. The Type I SS ANOVA result indicated that at least one population mean of standard length differed from the others (ANCOVA: F2 = 74.816, P < 2.2e-16).
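The logic of the population test can be sketched in plain Python as a comparison of two nested models. This is an illustration only, not the R code used for the analysis: it assumes age is entered before population (Type I sums of squares depend on term order), and the three groups and their values are hypothetical.

```python
def centred_sums(xs, ys):
    """Return the centred sums of squares and cross-products (Sxx, Sxy, Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy


def population_f_after_age(groups):
    """F test for population after adjusting for age.

    groups maps a population name to a pair (ages, standard_lengths).
    """
    all_ages = [a for ages, _ in groups.values() for a in ages]
    all_lengths = [v for _, lengths in groups.values() for v in lengths]
    n, k = len(all_ages), len(groups)

    # Reduced model: length ~ age (a single line for all populations).
    sxx, sxy, syy = centred_sums(all_ages, all_lengths)
    rss_age_only = syy - sxy ** 2 / sxx

    # Full model: length ~ age + population (common slope, separate intercepts).
    within = [centred_sums(ages, lengths) for ages, lengths in groups.values()]
    wxx = sum(w[0] for w in within)
    wxy = sum(w[1] for w in within)
    wyy = sum(w[2] for w in within)
    rss_full = wyy - wxy ** 2 / wxx

    df_num = k - 1        # population parameters added to the model
    df_den = n - k - 1    # residual df of the full model
    f_stat = ((rss_age_only - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Hypothetical data: three populations sharing a slope but offset in length.
example_groups = {
    "A": ([1, 2, 3, 4], [10.1, 11.0, 11.8, 13.1]),
    "B": ([1, 2, 3, 4], [12.0, 13.2, 13.9, 15.0]),
    "C": ([1, 2, 3, 4], [14.1, 14.9, 16.2, 16.9]),
}
f_stat, df_num, df_den = population_f_after_age(example_groups)
```

With three populations the numerator has 2 degrees of freedom, which matches the subscript in the reported F2 statistic.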

I performed a Tukey test to see which pairs of populations differed significantly. All pairwise comparisons were significant, with all p-values less than 0.05, which suggests that each population had a different mean standard length from the other populations, consistent with the histograms for each site. The Cocalaco population had the lowest estimated marginal mean, the San Pedro population had the highest, and the Coacuilco population was intermediate.

I also generated the summary statistics for the linear model with standard length as the response variable and both age and population as predictors to show the relationship between age and standard length. The estimated coefficient (slope) for age was 0.09291, which indicates a positive relationship. This positive relationship between standard length and age was significant (t = 12.971, P < 2e-16). The result suggests that age is a good predictor of maternal size and that older fish are generally longer than younger ones.
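Because the t statistic is the estimate divided by its standard error, the two reported numbers also imply the standard error of the age slope. The short sketch below recovers it and adds a rough 95% interval; the multiplier 1.96 is a large-sample normal approximation that I am assuming here, since the residual degrees of freedom are not restated above.

```python
# Values reported in the model summary above.
slope = 0.09291
t_value = 12.971

# t = estimate / standard error, so the standard error is implied by the two.
std_error = slope / t_value

# Rough 95% interval using the normal approximation (assumed, see lead-in).
z = 1.96
ci_lower = slope - z * std_error
ci_upper = slope + z * std_error
```

The implied standard error is about 0.0072, and the approximate interval stays well above zero, in line with the very small p-value.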

To test for an interaction between site and age, I created a linear model with standard length as the response variable and with age, population, and the age-population interaction as predictors. The Type I SS analysis of variance showed that age and population were good predictors of maternal size; however, there was no significant interaction between age and population (ANCOVA: F2 = 0.5357, P = 0.5857). Thus, the age-population interaction was not included in any of the linear models created earlier.
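Testing the age-population interaction amounts to asking whether the populations need separate slopes. A minimal Python sketch of that comparison follows; it is an illustration with hypothetical data, not the R analysis itself, and it assumes the interaction is the last term entered in the model.

```python
def centred_sums(xs, ys):
    """Return the centred sums of squares and cross-products (Sxx, Sxy, Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy


def slope_homogeneity_f(groups):
    """F test of the age-by-population interaction (equal slopes vs. separate slopes)."""
    k = len(groups)
    n = sum(len(ages) for ages, _ in groups.values())
    per_group = [centred_sums(ages, lengths) for ages, lengths in groups.values()]

    # Reduced model: one common slope, separate intercepts.
    wxx = sum(g[0] for g in per_group)
    wxy = sum(g[1] for g in per_group)
    wyy = sum(g[2] for g in per_group)
    rss_common = wyy - wxy ** 2 / wxx

    # Full model: each population gets its own slope and intercept.
    rss_separate = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in per_group)

    df_num = k - 1        # extra slope parameters
    df_den = n - 2 * k    # residual df with k intercepts and k slopes
    f_stat = ((rss_common - rss_separate) / df_num) / (rss_separate / df_den)
    return f_stat, df_num, df_den


# Hypothetical data with nearly parallel lines in the three populations.
example_groups = {
    "A": ([1, 2, 3, 4], [10.1, 11.0, 11.8, 13.1]),
    "B": ([1, 2, 3, 4], [12.0, 13.2, 13.9, 15.0]),
    "C": ([1, 2, 3, 4], [14.1, 14.9, 16.2, 16.9]),
}
f_stat, df_num, df_den = slope_homogeneity_f(example_groups)
```

A small F value, as in this made-up example, means separate slopes barely reduce the residual variation, which is the situation that justifies dropping the interaction term.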

The main objective was to determine the correlation between maternal size and offspring size. First, however, I needed to know whether I could combine data from different years and populations. Environmental conditions may influence the dependent variables, so it was important to know whether conditions differed among years and populations. Using the lipid data worksheet, I created a linear model with condition as the response variable and both year and population as predictors. Both year and population were converted to factors. I ran an ANOVA on the linear model and performed a Tukey test using the emmeans() function. The results suggest that most years and populations differed from one another, as seen in Figure 6. To account for this variation in environmental condition, I created a separate dataset for each population within a single year and performed a linear regression analysis on each dataset.
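Splitting the data into one dataset per population and year can be sketched as below. This is a hypothetical Python illustration rather than the code used for the analysis, and the records (population, year, standard length, embryo weight) are invented values.

```python
from collections import defaultdict

# Hypothetical records: (population, year, standard_length, embryo_weight).
records = [
    ("Coacuilco", 2010, 40.1, 3.2),
    ("Coacuilco", 2010, 43.5, 3.6),
    ("Cocalaco", 2008, 34.0, 2.9),
    ("Cocalaco", 2010, 35.2, 2.8),
    ("San Pedro", 2008, 46.3, 3.9),
    ("San Pedro", 2008, 48.0, 4.1),
    ("San Pedro", 2010, 47.1, 4.0),
]

# Group observations so that each (population, year) pair becomes its own dataset.
subsets = defaultdict(list)
for population, year, std_length, embryo_weight in records:
    subsets[(population, year)].append((std_length, embryo_weight))

dataset_keys = sorted(subsets)
```

Each key in `subsets` then corresponds to one regression, so observations from different years or populations are never pooled.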

**Coacuilco 2010 Dataset:**

I performed Shapiro-Wilk and Breusch-Pagan tests and determined that this dataset met the GLM assumptions of normality and homogeneity of variance, with both p-values greater than 0.05. The results of these tests were not mentioned in the original paper and should have been reported.
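To show what the Breusch-Pagan test measures, here is a minimal Python sketch of its studentized (Koenker) form for a single predictor. I am assuming this variant for illustration; the R function and options actually used are not stated above. The data are hypothetical and deliberately fan-shaped so that the test rejects. The Shapiro-Wilk test is not sketched because its coefficients are more involved.

```python
import math


def breusch_pagan_koenker(x, y):
    """Studentized Breusch-Pagan test for a simple regression of y on x.

    Returns the LM statistic (n times the R-squared of squared residuals
    regressed on x) and its p-value from a chi-square with 1 df.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx

    # Squared residuals from the fitted line.
    sq_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]

    # Auxiliary regression of squared residuals on x; for one predictor
    # its R-squared is the squared correlation.
    mu = sum(sq_resid) / n
    suu = sum((u - mu) ** 2 for u in sq_resid)
    sxu = sum((xi - mx) * (u - mu) for xi, u in zip(x, sq_resid))
    r_squared = sxu ** 2 / (sxx * suu)

    lm_stat = n * r_squared
    # Upper tail of a chi-square with 1 df: P(X > v) = erfc(sqrt(v / 2)).
    p_value = math.erfc(math.sqrt(lm_stat / 2))
    return lm_stat, p_value


# Hypothetical data whose scatter widens as x increases.
x = list(range(1, 21))
y = [xi + ((-1) ** xi) * 0.05 * xi ** 2 for xi in x]
lm_stat, p_value = breusch_pagan_koenker(x, y)
```

A p-value above 0.05 would be read as no evidence against homogeneity of variance; the example data are built to fail that check.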

Using the ggplot function, I created a scatter plot of standard length versus embryo weight, which showed a positive relationship.

I created a linear model in RStudio with embryo weight as the response variable and standard length as the predictor. Summary statistics showed that the positive relationship between embryo weight and standard length was statistically significant (F1,102 = 21.05, P = 1.276e-05).
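The quantities in such a regression summary (slope, F statistic, degrees of freedom, R-squared) can be computed by hand, as in the Python sketch below. It is an illustration only: the five data points are hypothetical, and the p-value is left out because it requires the F distribution.

```python
def simple_regression(x, y):
    """Ordinary least squares of y on x with the usual summary quantities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = my - slope * mx
    ss_model = slope * sxy        # variation explained by the line
    ss_resid = syy - ss_model     # variation left over
    df_resid = n - 2              # n minus intercept and slope
    f_stat = ss_model / (ss_resid / df_resid)   # F with (1, n - 2) df
    r_squared = ss_model / syy
    return {
        "slope": slope,
        "intercept": intercept,
        "f": f_stat,
        "df": (1, df_resid),
        "r_squared": r_squared,
    }


# Hypothetical standard lengths and embryo weights (made-up values).
std_length = [30, 32, 34, 36, 38]
embryo_weight = [2.1, 2.3, 2.6, 2.7, 3.0]
fit = simple_regression(std_length, embryo_weight)
```

The denominator degrees of freedom equal the number of observations minus two, which is how a reported F1,102 corresponds to 104 observations.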

**Cocalaco 2008 and 2010 Dataset:**

I performed Shapiro-Wilk and Breusch-Pagan tests and determined that both datasets met the GLM assumptions of normality and homogeneity of variance, with all p-values greater than 0.05. These results should have been reported in the original paper.

Using the ggplot function, I created scatter plots of standard length versus embryo weight. The trendline was increasing in the 2008 dataset but decreasing in the 2010 dataset; however, the slopes appeared to be minimal in both years.

I created linear models in RStudio with embryo weight as the response variable and standard length as the predictor. Summary statistics showed that the relationship between embryo weight and standard length was not significant in either the 2008 dataset (F1,40 = 0.4263, P = 0.5175) or the 2010 dataset (F1,52 = 0.02113, P = 0.885). These results are consistent with the nearly flat regression lines.

**San Pedro 2008 and 2010 Dataset:**

I performed Shapiro-Wilk and Breusch-Pagan tests on the 2008 dataset and determined that it met the GLM assumptions of normality and homogeneity of variance, with both p-values greater than 0.05. The 2010 dataset passed the Shapiro-Wilk test; however, it failed the Breusch-Pagan test. The results of these tests should have been reported in the original paper.

Using the ggplot function, I created scatter plots of standard length versus embryo weight, which showed an increasing trendline in both the 2008 and 2010 datasets.

I created linear models in RStudio with embryo weight as the response variable and standard length as the predictor. Summary statistics showed that the relationship between embryo weight and standard length was significant in both the 2008 dataset (F1,155 = 40.03, P = 2.56e-09) and the 2010 dataset (F1,59 = 30.74, P = 7.26e-07).

The original paper did not report the results of the assumption tests. Additionally, the authors do not seem to have used all the available data points, as indicated by the lower degrees of freedom in their regression analyses compared to mine. For example, the denominator degrees of freedom for the 2010 San Pedro dataset was only 17 in their analysis, whereas my San Pedro analyses had denominator degrees of freedom of 155 (2008) and 59 (2010). The original paper could also have reported the precision of its results, including R-squared values. Overall, the results of the regression analysis between embryo weight and maternal size were consistent with the original paper. Both years in the San Pedro population showed a significant correlation, neither year in the Cocalaco population did, and the Coacuilco population showed a significant correlation.
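The degrees-of-freedom comparison can be made concrete: in a simple linear regression the denominator degrees of freedom equal the number of observations minus two. The short Python sketch below applies that rule to the F statistics reported above. It is plain arithmetic on those values and assumes each regression had a single predictor plus an intercept.

```python
# Denominator degrees of freedom reported for my regressions above.
reported_df = {
    ("Coacuilco", 2010): 102,
    ("Cocalaco", 2008): 40,
    ("Cocalaco", 2010): 52,
    ("San Pedro", 2008): 155,
    ("San Pedro", 2010): 59,
}

# Simple regression: residual df = n - 2, so n = df + 2.
sample_sizes = {key: df + 2 for key, df in reported_df.items()}

# The same rule applied to the denominator df of 17 quoted from the original paper.
original_paper_n = 17 + 2
```

Under that assumption, a denominator of 17 corresponds to only 19 observations, compared with 61 and 157 observations in my two San Pedro datasets.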

**Reference**

Kindsvater, H. K., Rosenthal, G. G., & Alonzo, S. H. (2012). Maternal size and age shape offspring size in a live-bearing fish, *Xiphophorus birchmanni*. PLoS ONE, 7(11), e48473. https://doi.org/10.1371/journal.pone.0048473