Functional GLM: population growth example

Data preparation

The data

data("popgrowthmillion")

contains population growth, i.e. population at the end of the year divided by population at the beginning of the year, in 134 countries in years from 1950 to 2015. The dataset includes only countries over million inhabitants in 1950. The data were extracted from the supplement of Nagy et al. (2017) distributed under the GPL-2 license.

The GDP data are publicly available at https://data.worldbank.org/indicator/NY.GDP.PCAP.CD. The excel file that we downloaded was called API_NY.GDP.PCAP.CD_DS2_en_excel_v2_3358980.xls. The inflation rates are publicly available at https://data.worldbank.org/indicator/NY.GDP.DEFL.KD.ZG. The excel file that we downloaded was called API_NY.GDP.DEFL.KD.ZG_DS2_en_excel_v2_3469555.xls, from there we took only the inflation rates for United States. Both are distributed under the CC-BY 4.0 license (see https://datacatalog.worldbank.org/public-licenses#cc-by).

GDP <- read_excel("API_NY.GDP.PCAP.CD_DS2_en_excel_v2_3358980.xls",
                  sheet=1, skip=3) #, detectDates=TRUE)
Inflation <- read_excel("API_NY.GDP.DEFL.KD.ZG_DS2_en_excel_v2_3469555.xls",
                        sheet=1, skip=3)
Inflation <- data.frame(Year=names(Inflation)[-(1:5)],
      Infl=as.numeric(Inflation[Inflation$`Country Code` == "USA", -(1:5)]))
Inflation$Infl <- Inflation$Infl/100 + 1
range(Inflation$Infl)

## [1] 1.007623 1.094621

For the population data available in GET, the continents/groups of the countries are as follows and we make a curve_set object of the data for nicer treatment of it. The Oceania group with three countries was left out from the analysis.

cset1 <- create_curve_set(list(r=1950:2014,
              obs=popgrowthmillion[, -(132:134)])) # Left out Oceania
groups <- c(rep("Africa", times=37),
            rep("Asia", times=37),
            rep("Europe and North America", times=37),
            rep("Latin America", times=20))

Then we discounted the GDP of every country in the study to the 1960 USD, and we extrapolated the missing values of the GDP of a country using the closest known ratio of the GDP of the country and the median GDP in that year. Further, the missing values of GDP were interpolated using linear interpolation of the two closest ratios.

Thereafter, we chose those countries for whom we had both population growth and GDP data, created sets of curves from both data sources and reduced them to years 1960-2014 available for both. Thus, we had the group (Continent), population growth curves and GDP curves:

groups <- as.factor(groups)
table(groups)

## groups
##                   Africa                     Asia Europe and North America 
##                       34                       28                       34 
##            Latin America 
##                       17

GDPcset

## A curve_set(1d) object with 113 curves observed at 55 argument values.
## Contains: 
## $ r     :  int [1:55] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
## $ funcs :  num [1:55, 1:113] 70.1 70.2 71.6 75.7 81.8 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:55] "1960" "1961" "1962" "1963" ...
##   ..$ : chr [1:113] "Burundi" "Eritrea" "Ethiopia" "Kenya" ...

POPcset

## A curve_set(1d) object with 113 curves observed at 55 argument values.
## Contains: 
## $ r     :  int [1:55] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
## $ funcs :  num [1:55, 1:113] 1.02 1.02 1.02 1.02 1.02 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:55] "1960" "1961" "1962" "1963" ...
##   ..$ : chr [1:113] "Burundi" "Eritrea" "Ethiopia" "Kenya" ...

Plotting the data simply

plot(GDPcset)

plot(POPcset)

Plotting the data using ggplot and colors for each continent:

POPfuncs <- POPcset[['funcs']]
GDPfuncs <- GDPcset[['funcs']]
nfunc <- ncol(POPcset[['funcs']])
narg <- nrow(POPcset[['funcs']])
POPdf <- data.frame(r = rep(POPcset$r, times=nfunc),
                 funcs = c(POPfuncs),
                 id =  factor(rep(1:nfunc, each=nrow(POPfuncs)), levels = 1:nfunc),
                 Continent = rep(groups, each=narg))
GDPdf <- data.frame(r = rep(POPcset$r, times=nfunc),
                    funcs = c(GDPfuncs),
                    id =  factor(rep(1:nfunc, each=nrow(GDPfuncs)), levels = 1:nfunc),
                    Continent = rep(groups, each=narg))
p1 <- ( ggplot() + geom_line(data=POPdf, aes_(x = ~r, y = ~funcs, group = ~id, col = ~Continent))
       + labs(title="", y="Population growth", x = "Year")
       + scale_color_manual(values=viridisLite::viridis(4)))#c("#21908CFF", "#440154FF", "#5DC863FF", "grey70")))
p2 <- ( ggplot() + geom_line(data=GDPdf, aes_(x = ~r, y = ~funcs, group = ~id, col = ~Continent))
        + labs(title="", y="GDP (in 1960 USD)", x = "Year")
        + scale_color_manual(values=viridisLite::viridis(4))) #c("#21908CFF", "#440154FF", "#5DC863FF", "grey70")))
p1 + p2 + plot_layout(guides="collect") & theme(legend.position="bottom")

Functional general linear model (GLM)

Mrkvička and Myllymäki (2022) explored the functional general linear model (GLM) for the population growth across years 1960-2014 with categorical covariate (continent), continuous covariate (GDP) and interactions (GDP with respect to the continent). For this purpose, they applied three tests, and on each of them, they used their proposed FDR control (IATSE) to obtain all years which are significant for the studied covariate. We refer the reader to Mrkvička and Myllymäki (2022) about the details and show below how these three tests can be run using GET.

First the effect of continent in the main effects model (formula.full) was tested. The function graph.flm was used to obtain simulations under the null and to test the effect under the control of FDR. The result can be directly plotted.

nsim <- 5000
resC <- graph.flm(nsim = nsim,
                  formula.full = Y ~ GDP + Continent,
                  formula.reduced = Y ~ GDP,
                  typeone = "fdr", # Choose the control
                  curve_sets = list(Y=POPcset, GDP=GDPcset),
                  factors = data.frame(Continent = groups),
                  contrasts = FALSE)

## You have chosen to control the false discovery rate (typeone = "fdr").

plot(resC)

Mrkvička and Myllymäki (2022) further added the fitted curve for every continent, i.e. they added the intercept \(\beta_{i}^0\) and mean effect of GDP, \(\beta_{i}^\text{GDP}\cdot \text{mean}(\text{GDP}_i)\), to every \(\beta_{j,i}^{\text{Cont}}\).

Second, the effect of GDP was tested:

resG <- graph.flm(nsim = nsim,
                  formula.full = Y ~ GDP + Continent,
                  formula.reduced = Y ~ Continent,
                  typeone = "fdr", # Choose the control
                  curve_sets = list(Y=POPcset, GDP=GDPcset),
                  factors = data.frame(Continent = groups),
                  contrasts = FALSE)

## You have chosen to control the false discovery rate (typeone = "fdr").

# Basic plot
plot(resG)

# Tuned labels:
labTextx <- 'Year'
plot(resG) +
  labs(x=bquote(.(labTextx[1]) ~ i),
       y=expression(hat(beta)[i]^GDP),
       title="95% FDR envelope for the coeff. of GDP")

Third, the effect of the interaction of continent and GDP was tested:

resCG <- graph.flm(nsim = nsim,
                  formula.full = Y ~ GDP + Continent + GDP:Continent,
                  formula.reduced = Y ~ GDP + Continent,
                  typeone = "fdr",
                  curve_sets = list(Y=POPcset, GDP=GDPcset),
                  factors = data.frame(Continent = groups),
                  contrasts = FALSE, savefuns = TRUE)

## You have chosen to control the false discovery rate (typeone = "fdr").

# Basic plot
plot(resCG)

# Tuned labels
plot(resCG) +
  labs(x=bquote(.(labTextx[1]) ~ i),
       y=expression(hat(beta)[list(j,i)]^I),
       title="95% FDR envelope for the interaction coeff.")

Functional GLM: population growth example

Mari Myllymäki

8/19/2022

Data preparation

Functional general linear model (GLM)