Making the most of Exponential Random Graph Modeling (ERGM)


Introduction

Exponential Random Graph Models (ERGMs) are statistical models concerned with the structure and formation of networks. As part of a network science toolkit, ERGMs provide insight into the social system that creates the network. More specifically, they help us understand the kinds of network structures that cause the formation of edges (Robins & Lusher, 2013).

In this article I conduct two studies. study0 aims to demonstrate the importance of precision, iteration, and reproducibility when building an ERGM. study1 aims to show a common pitfall in model selection and comparison. I encourage you to repeat these studies on your laptop and grow your expertise in network analysis. I assume that you are proficient in R scripting, are already familiar with ERGMs, and know your way around the ergm package and its reference manual (Handcock et al., 2023; Hunter et al., 2008; Krivitsky et al., 2023).

study0: An ERGM that is robust and reproducible

Reproducibility is a major principle in the scientific method and in scholarly literature. For computer simulations, it is common practice to publish the code and a simulated dataset that allows anyone to reproduce the study. However, most computer simulations involve the generation of random numbers. For example, ERGMs involve the generation of random networks. But if this randomness cannot be controlled or traced, then it would be impossible to recreate the simulated dataset or reproduce the exact results of the published study.

Computers normally use pseudo-random numbers, which statistically look a lot like truly random numbers but are actually traceable and reproducible. A simulated dataset can be recreated exactly if we know the random seed, a number that is used as the base for the generation of random numbers. For the purposes of reproducible research, it is common to publish the random seed values as part of the code that generates data for the study. Any interested reader must be able to use the published code and seeds to reproduce the published dataset, statistics, and conclusions that the researcher has reached, all with exact numerical values. Next, the reader must be able to run the same code with different random seeds. The new dataset will be numerically different, but it must lead to substantially the same statistics, results, and conclusions as those in the published study.
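A minimal sketch of both requirements in base R. The seed values are arbitrary and rnorm() merely stands in for a simulation; neither is part of the studies in this article.

```r
# Draw the same "simulated dataset" twice from the same seed.
set.seed(20230101)            # arbitrary illustrative seed
draws_a <- rnorm(1000)
set.seed(20230101)
draws_b <- rnorm(1000)

# A different seed gives numerically different draws ...
set.seed(20230102)            # another arbitrary seed
draws_c <- rnorm(1000)

same_seed_identical <- identical(draws_a, draws_b)
diff_seed_identical <- identical(draws_a, draws_c)

# ... whose summary statistics should nevertheless be close.
mean_gap <- abs(mean(draws_a) - mean(draws_c))

print(same_seed_identical)
print(diff_seed_identical)
print(mean_gap)
```

The first comparison is the "exact numerical values" requirement; the second and third illustrate the "substantially the same conclusions" requirement.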

An ERGM study that is reproducible is not necessarily a good study. Like other forms of statistical inference, fitting an ERGM is concerned with choosing values for a set of parameters that maximize the likelihood of the observed data under the model. Unlike many other forms of inference, though, an ERGM fit must explore the space of parameters numerically rather than analytically. Usually these parameters are real numbers, and there is only a finite amount of computational resources to explore them. We must be careful that the computer explores a reasonable region in the space of parameters; if the computer is too limited in its selection of parameter values, then it will come to biased, erroneous conclusions. The ergm package comes with a wide range of control parameters that can be tuned so that we can reach an appropriate compromise among computational resources, time, and precision in our results.
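The effect of exploring too small a region can be sketched with a toy one-parameter likelihood. This is only an analogy: ergm() estimates by MCMC rather than with optimize(), and the counts n_dyads and n_on below are made-up illustrative values.

```r
# Toy model: each of n_dyads dyads is "on" independently with probability plogis(theta).
n_dyads <- 100   # illustrative assumption
n_on    <- 30    # illustrative assumption

log_lik <- function(theta) {
  n_on * plogis(theta, log.p = TRUE) +
    (n_dyads - n_on) * plogis(-theta, log.p = TRUE)
}

# This toy model happens to have a closed-form maximum, which most ERGMs lack.
theta_exact <- qlogis(n_on / n_dyads)

# Numerical search over a generous region of the parameter space.
theta_wide <- optimize(log_lik, interval = c(-5, 5), maximum = TRUE)$maximum

# Numerical search over a region that excludes the true maximum.
theta_narrow <- optimize(log_lik, interval = c(0, 5), maximum = TRUE)$maximum

print(c(exact = theta_exact, wide = theta_wide, narrow = theta_narrow))
```

The wide search lands next to the exact answer, while the narrow search stops at the edge of its interval and reports a value that looks like an estimate but is an artifact of the search region.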

What we want to do is use the resources wisely to produce results that are robust as well as reproducible. There is no one-size-fits-all definition of robustness when it comes to ERGM, although the ergm package comes bundled with algorithms that diagnose multiple aspects of robustness at each step of the modeling process. One definition I like to use, for ERGM as well as for other forms of computer modeling, is this: after making sure the model has converged and has produced no warning or error messages, I check that the results do not change much when I repeat the code with increased precision.

All of this is vague, of course, because the absence of warnings or errors from running ergm() does not mean an absence of problems. There are only so many diagnostic algorithms that the authors can put into the package. And deciding that I have run my code with sufficiently high precision is different from deciding that I have explored enough points in the parameter space.

However, in adopting a Popperian approach to computer simulation, we use as many computational resources as we can to produce results that support our hypothesis, and after we have published our results we encourage and welcome any work that contributes evidence in support or in opposition to the conclusions of our study.

study0 aims to demonstrate the principles of reproducibility and robustness as described above. The study starts by creating a network of our choosing, then fitting several ERGMs to it. The ERGMs will differ by their precision as well as the random seed. We will be comparing the statistical significance across models as well as the importance of the random seed to the results.

Housekeeping and loading libraries

The code listings below come with an abridged output and some details are omitted:

# Note 1: Packages lattice and latticeExtra are used by mcmc.diagnostics(which = "plots")
# Note 2: Package stargazer was installed with:
# devtools::install_github("facorread/stargazer-ergm", ref = "fix-ergm", build = FALSE)
for (package_name in c("ergm", "ggplot2", "ggraph", "lattice", "latticeExtra", "Rglpk", "stargazer", "data.table")) {
  require(package_name, character.only = TRUE) # mask.ok has no useful effect.
  message("Package ", package_name, " version ", packageVersion(package_name),
  " released ", packageDate(package_name))
}
print_data_table_impl <- function(x) {
  data_table_char <- as.data.table(lapply(x, function(column) {
    if (is.numeric(column)) {
      column <- format(column)
      repeat {
        column1 <- sub("(\\.[[:digit:]]+)0( *)$", "\\1 \\2", column)
        if (study0$identical(column1, column)) {
          break
        }
        column <- column1
      }
    }
    column
  }))
  setnames(data_table_char, names(x))
  data_table_char
}

print_data_table <- function(x, ...) {
  print(print_data_table_impl(x), quote = FALSE, ...)
}

print_data_table_t <- function(x, new_col_names = NULL, ...) {
  data_frame_x <- t(print_data_table_impl(x))
  if (!is.null(new_col_names)) {
    colnames(data_frame_x) <- new_col_names
  }
  print(data_frame_x, quote = FALSE, ...)
}

Package ergm version 4.4.0 released 2023-01-26
Package ggplot2 version 3.4.0 released 2022-10-31
Package ggraph version 2.1.0 released 2022-10-05
Package lattice version 0.20.45 released 2021-09-18
Package latticeExtra version 0.6.30 released 2022-06-30
Package Rglpk version 0.6.4 released 2019-02-08
Package stargazer version 5.2.3 released 2022-03-03
Package data.table version 1.14.6 released 2022-11-16

set.seed(1024321215)
tmp <- list()
study0 <- list()
study0$cat <- function(...) cat(..., file = stderr(), sep = "")
study0$stopifna <- function(...) {
    stopifnot(!is.null(...))
    stopifnot(!anyNA(..., recursive = TRUE))
}
study0$identical <- function(x, y) {
    study0$stopifna(x)
    study0$stopifna(y)
    identical(x, y)
}
study0$n_cores <- 8
study0$verbose <- 3
study0$effectiveSize.interval_drop <- 8 # Integer, power of 2. Default 2 Recommended 4
study0$n_vertices <- 50
study0$vertex <- data.table(id = 1:(study0$n_vertices))
study0$cat("Building a directed network with ", study0$n_vertices,
" vertices. The maximum possible number of edges is ", study0$n_vertices * (study0$n_vertices - 1), ".\n")
# Number 0 is used for a trivial network, which acts as a starting point for generating networks
study0$edgelist0 <- data.table(ego_id = as.integer(1), alter_id = as.integer(2))
study0$net0 <- network(study0$edgelist0, vertices = study0$vertex)

Choosing our starting exponential random graph distribution (ERGM distribution)

In the code below, we arbitrarily choose an ERGM distribution and coefficients that we label with the number 1, in objects such as coef1 and formula1. Then we conduct a simulation of 1000 networks so that we can take a look at the properties of our distribution (nets1, means1).

study0$coef1 <- c(asymmetric = -0.85)
study0$coef_table1 <- as.data.table(t(study0$coef1))
study0$formula1 <- (study0$net0 ~ asymmetric)
study0$control_simulate <- control.simulate.formula(parallel = study0$n_cores)
tmp$nets1 <- simulate(study0$formula1, nsim = 1000, coef = study0$coef1, control = study0$control_simulate)
study0$per_net_stats1 <- as.data.table(summary(tmp$nets1 ~ asymmetric))
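# Column means of the simulated statistics, used as the target for san() below.
# This definition is missing from the abridged listing; it is assumed here.
study0$means1 <- colMeans(study0$per_net_stats1)
# Number 2 is used for our proposed network, a network that most closely matches stats1.
study0$net2 <- san(study0$net0 ~ asymmetric, target.stats = study0$means1)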
study0$formula2 <- (study0$net2 ~ asymmetric)
study0$stats2 <- as.data.table(t(summary(study0$formula2)))
study0$stats_table2 <- rbind(
    study0$coef_table1   [, c(Stat = "coef1", .(.SD))],
    study0$per_net_stats1[, c(Stat = "Means1", lapply(.SD, mean))],
    study0$per_net_stats1[, c(Stat = "Sd1",   lapply(.SD, sd))],
    study0$stats2        [, c(Stat = "net2", .(study0$stats2))]
)
print_data_table(study0$stats_table2, row.names = FALSE)
ggraph(study0$net2) + geom_edge_fan()

Stat     asymmetric
  coef1    -0.85   
 Means1   366.884  
    Sd1    16.19558
   net2   367.0

We have arbitrarily chosen to build an exponential random graph distribution (ERGM distribution) of networks of 50 nodes with the asymmetric statistic and a parameter of -0.85. This distribution generates networks that have fewer asymmetric (non-reciprocated) edges than purely random networks do. A random sample of 1000 networks generated from this distribution had an average of Means1 = 366.884 asymmetric edges and a standard deviation of Sd1 = 16.19558 asymmetric edges: a rather narrow distribution.

From this ERGM distribution we created one representative network, study0$net2, consisting of 367 asymmetric edges and illustrated in the figure above. We use this as the input for the ERGM fitting process as shown below.
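As a sanity check on these numbers, here is a sketch under one assumption that is mine rather than the article's. I assume that a model with the asymmetric term alone treats dyads as independent, and that each dyad has four states (empty, two one-way orientations, mutual) with weights 1, exp(theta), exp(theta), and 1. Under that assumption a dyad is asymmetric with probability plogis(theta), and the count of asymmetric dyads is binomial.

```r
n_vertices <- 50
theta      <- -0.85                                   # coef1 from the listing above
n_dyads    <- n_vertices * (n_vertices - 1) / 2       # unordered pairs of vertices

# Probability that a single dyad is asymmetric under the stated assumption.
p_asym <- plogis(theta)

expected_asym   <- n_dyads * p_asym                       # compare with Means1
sd_asym         <- sqrt(n_dyads * p_asym * (1 - p_asym))  # compare with Sd1
expected_random <- n_dyads * plogis(0)                    # purely random network, theta = 0

# Going the other way: the coefficient implied by the 367 asymmetric dyads of net2.
theta_implied <- qlogis(367 / n_dyads)

print(c(expected = expected_asym, sd = sd_asym,
        random = expected_random, theta_implied = theta_implied))
```

The expected count and standard deviation should land close to the simulated Means1 and Sd1, and the implied coefficient close to -0.85, which is a useful reference value when fitting an asymmetric-only model to net2.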

Fitting the models

study0$control_fast <- control.ergm(
    MCMC.effectiveSize = 100, # Default 64 Strict 1000
    MCMC.effectiveSize.burnin.pval = 0.2, # Default 0.2 Strict 0.4
    MCMLE.MCMC.precision = 0.1, # Default 0.1 Strict 0.02
    MCMLE.confidence = 0.9999, # Default 0.99 Strict 0.9999
    # MCMC.samplesize: MCMLE will increase it until it reaches the requested effectiveSize.
    # Do not make samplesize too small though, because the resulting burnin p-value will be 0.
    MCMC.samplesize = study0$n_cores * 100, # Minimum 100
    MCMLE.maxit = 100, # Default 60. Smaller values might cut ergm short and cause incorrect results.
    MCMC.effectiveSize.maxruns = 30, # Default 16. Smaller values might cause incorrect results.
    # parallel = study0$n_cores,
    MCMLE.effectiveSize.interval_drop = study0$effectiveSize.interval_drop
)
study0$formula3 <- (study0$net2 ~ asymmetric + cyclicalties + mutual)
study0$ergm_comparison <- function(random_seed) {
    set.seed(random_seed)
    study0$cat("\n\n\n\n\nRunning ergm with default precision for random seed ", random_seed, "\n")
    show(study0$stats_table2)
    ergm_default_precision <- ergm(study0$formula3,
        # control = study0$control_default,
        verbose = study0$verbose)
    study0$cat("\n\n\n\n\nRunning ergm with fast precision for random seed ", random_seed, "\n")
    show(study0$stats_table2)
    ergm_fast <- ergm(study0$formula3, control = study0$control_fast, verbose = study0$verbose)
    control_strict <- control.ergm(
        MCMC.effectiveSize = 1000, # Default 64 Strict 1000
        MCMC.effectiveSize.burnin.pval = 0.4, # Default 0.2 Strict 0.4
        MCMLE.MCMC.precision = 0.02, # Default 0.1 Strict 0.02
        MCMLE.confidence = 0.9999, # Default 0.99 Strict 0.9999
        init = ergm_fast$coefficients, # Set the initial parameter values explicitly.
        # MCMC.samplesize: MCMLE will increase it until it reaches the requested effectiveSize.
        # Do not make samplesize too small though, because the resulting burnin p-value will be 0.
        MCMC.samplesize = study0$n_cores * 100, # Minimum 100
        MCMLE.maxit = 100, # Default 60. Smaller values might cut ergm short and cause incorrect results.
        MCMC.effectiveSize.maxruns = 30, # Default 16. Smaller values might cause incorrect results.
        parallel = study0$n_cores,
        MCMLE.effectiveSize.interval_drop = study0$effectiveSize.interval_drop
    )
    study0$cat("\n\n\n\n\nRunning ergm with strict precision for random seed ", random_seed, "\n")
    show(study0$stats_table2)
    ergm_strict <- ergm(study0$formula3, control = control_strict, verbose = study0$verbose)
    data.table(`Random seed` = random_seed, Precision = c("Default", "Fast", "Strict"),
        Ergm = list(ergm_default_precision, ergm_fast, ergm_strict))
}
study0$random_seeds <- c(1092953241, 1025768792)
study0$ergm_table <- rbindlist(lapply(study0$random_seeds, study0$ergm_comparison))
study0$extract_ergm_statistics <- function(ergm_object) {
    summary1 <- summary(ergm_object)
    coef1 <- as.data.table(summary1$coefficients[, c(1, 2, 5)], keep.rownames = TRUE)
    setnames(coef1, c("rn", "Estimate", "Std. Error", "Pr(>|z|)"), c("row_name", "_est", "_se", "_p"))
    melt1 <- melt(coef1, id.vars = "row_name", variable.name = "suffix")
    melt1[, variable := paste0(row_name, suffix)]
    new_columns <- as.data.table(t(melt1$value))
    setnames(new_columns, melt1$variable)
    new_columns[, AIC := summary1$aic]
    new_columns[, BIC := summary1$bic]
    new_columns
}
# study0$extract_ergm_statistics(study0$ergm_table[1, Ergm[[1]]])
study0$add_ergm_table_columns <- function() {
    new_columns <- study0$ergm_table[, rbindlist(lapply(Ergm, study0$extract_ergm_statistics))]
    stopifnot(!(names(new_columns) %in% names(study0$ergm_table)))
    set(study0$ergm_table, j = names(new_columns), value = new_columns)
}
study0$add_ergm_table_columns()
stargazer(study0$ergm_table[["Ergm"]], type = "text", style = "all", column.labels = paste0("Model ", 1:7),
  model.numbers = FALSE,
  add.lines = list(`Random seed` = c("Random seed", study0$ergm_table[["Random seed"]]),
  Precision = c("Precision", study0$ergm_table[["Precision"]])))
=================================================================================================
                                                    Dependent variable:                          
                          -----------------------------------------------------------------------
                                                          study0                                 
                            Model 1     Model 2     Model 3     Model 4     Model 5     Model 6  
-------------------------------------------------------------------------------------------------
asymmetric                 -1.741***   -1.740***   -1.735***   -1.739***   -1.730***   -1.734*** 
                            (0.149)     (0.145)     (0.144)     (0.142)     (0.144)     (0.143)  
                          t = -11.647 t = -11.981 t = -12.046 t = -12.216 t = -12.026 t = -12.137
                           p = 0.000   p = 0.000   p = 0.000   p = 0.000   p = 0.000   p = 0.000 
cyclicalties                 0.157      0.154*      0.156*      0.155*       0.151      0.153*   
                            (0.098)     (0.093)     (0.093)     (0.092)     (0.093)     (0.093)  
                           t = 1.606   t = 1.647   t = 1.680   t = 1.691   t = 1.631   t = 1.653 
                           p = 0.109   p = 0.100   p = 0.093   p = 0.091   p = 0.103   p = 0.099 
mutual                     -4.132***   -4.113***   -4.118***   -4.124***   -4.106***   -4.109*** 
                            (0.340)     (0.340)     (0.338)     (0.317)     (0.340)     (0.351)  
                          t = -12.161 t = -12.092 t = -12.197 t = -13.030 t = -12.076 t = -11.710
                           p = 0.000   p = 0.000   p = 0.000   p = 0.000   p = 0.000   p = 0.000 
-------------------------------------------------------------------------------------------------
Random seed               1092953241  1092953241  1092953241  1025768792  1025768792  1025768792 
Precision                   Default      Fast       Strict      Default      Fast       Strict   
Akaike Inf. Crit.          2,204.353   2,204.633   2,204.659   2,203.297   2,205.034   2,204.646 
Bayesian Inf. Crit.        2,221.764   2,222.044   2,222.070   2,220.708   2,222.445   2,222.058 
Null Deviance (df = 2450)  3,396.421   3,396.421   3,396.421   3,396.421   3,396.421   3,396.421 
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

We have fitted 6 ERGMs that differ in the two fundamentals discussed above: random seed and precision. For the purposes of the discussion, I refer to the seed that ends in 1 as random seed 1, and to the seed that ends in 2 as random seed 2. ERGMs with fast precision are those that used study0$control_fast as the control parameter, and ERGMs with strict precision are those that used control_strict. I hope the comments in the code above are useful as guidelines for deciding the kind of precision you want from your ERGM fits.

For the purposes of our example, we have chosen ERGMs based on the formula study0$net2 ~ asymmetric + cyclicalties + mutual. This is obviously different from the formula we used to build study0$net2 itself, which is based on asymmetric only. This mirrors the reality of ERGM research, in which we capture data from a real-world network and try to figure out the terms that are appropriate to model the network. Those terms should be informed by our theoretical framework, built on existing scholarly literature. Let us take a look at the statistical significance that is found for those terms in our 6 models in the table above.

The statistical significance of cyclicalties depends not only on the precision of the model but also on the random seed that we have chosen. Models 3 and 6, the strict ones, both show cyclicalties as statistically significant at the p < 0.1 level (p = 0.093 and p = 0.099). In fact, statistical significance can be shown for strict precision on any choice of random seed (again, in a Popperian sense). In contrast, models 2 and 5, the fast ones, do not agree on the statistical significance of cyclicalties: model 2 is flagged as significant at p < 0.1 and model 5 is not, so significance depends on the choice of random seed. The same happens with the models that use the default precision settings, models 1 and 4.
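The comparison can be recomputed from the test statistics printed for cyclicalties in the table above. The sketch below copies those six values by hand and assumes a two-sided normal test, which matches the Pr(>|z|) column used in the code; the data frame and its column names are my own.

```r
# Test statistics for cyclicalties, copied from the table above (Models 1 to 6).
models <- data.frame(
  model     = 1:6,
  seed      = rep(c(1092953241, 1025768792), each = 3),
  precision = rep(c("Default", "Fast", "Strict"), times = 2),
  statistic = c(1.606, 1.647, 1.680, 1.691, 1.631, 1.653)
)

# Two-sided p-value under a standard normal reference distribution.
models$p_value <- 2 * pnorm(-abs(models$statistic))
models$significant_10pct <- models$p_value < 0.1

print(models)

# Do both strict fits agree, regardless of the seed?
strict_all <- all(models$significant_10pct[models$precision == "Strict"])
print(strict_all)
```

Only the Strict rows agree across both seeds. The Default and Fast rows flip depending on the seed, because every statistic sits within a few hundredths of the 10% cutoff.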

The 6 models in this study have very similar values for the model selection criteria AIC and BIC. We can trust their findings of statistical significance for asymmetric and mutual, and we can say that our conclusions for these two statistics are robust: they are independent of the choice of random seed or precision. We would be able to economize resources by using the default precision, were it not for the problem we described for cyclicalties in the previous paragraph.
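One rough way to put a number on "robust" here is to compare how far the estimates move across the six fits with the size of their standard errors. The values below are copied by hand from the table above; the ratio itself is an ad hoc yardstick of mine, not an ergm diagnostic.

```r
# Estimates and standard errors for Models 1 to 6, copied from the table above.
est <- rbind(
  asymmetric = c(-1.741, -1.740, -1.735, -1.739, -1.730, -1.734),
  mutual     = c(-4.132, -4.113, -4.118, -4.124, -4.106, -4.109)
)
se <- rbind(
  asymmetric = c(0.149, 0.145, 0.144, 0.142, 0.144, 0.143),
  mutual     = c(0.340, 0.340, 0.338, 0.317, 0.340, 0.351)
)
aic <- c(2204.353, 2204.633, 2204.659, 2203.297, 2205.034, 2204.646)

# Largest disagreement between any two fits, per term.
spread <- apply(est, 1, function(x) diff(range(x)))

# The same disagreement expressed in units of the smallest standard error.
spread_in_se <- spread / apply(se, 1, min)

# Range of AIC across the six fits.
aic_range <- diff(range(aic))

print(spread_in_se)
print(aic_range)
```

For both terms the fits disagree by less than a tenth of a standard error, which is why the choice of seed or precision cannot change their significance.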

To summarize, a study of the ERGM described by the formula study0$net2 ~ asymmetric + cyclicalties + mutual must be conducted with strict precision; only then can we confidently conclude that the three terms have statistically significant contributions to network formation.

As an exercise, you can create a seventh ERGM described by the formula study0$net2 ~ asymmetric and fit it with appropriate precision settings; you should be able to reach a confident conclusion as to which of the seven models best describes net2. Also, you can check these results against different random seeds.

As an appendix to study0, the graphs of mcmc.diagnostics() are shown below. We should adopt a habit of checking these figures to make sure that the exploration of the parameter space is not biased, autocorrelated, or clearly incomplete. A rule of thumb is that the figures on the left should look like white noise, with a smooth line that looks horizontal, and the figures on the right should resemble normal distributions. That seems to be the case for the 6 models.
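To make "looks like white noise" a little more concrete, the sketch below contrasts a white-noise series with a strongly autocorrelated one using the lag-1 autocorrelation from base R. Both series are synthetic stand-ins for MCMC traces, and this is not the computation that mcmc.diagnostics() performs.

```r
set.seed(42)   # arbitrary illustrative seed
n <- 5000

# A well-mixed chain should resemble independent noise around a constant level.
white_noise <- rnorm(n)

# A poorly mixed chain drifts: each value stays close to the previous one.
sticky <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))

# Lag-1 autocorrelation; element 1 of $acf is lag 0, element 2 is lag 1.
lag1 <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

lag1_white  <- lag1(white_noise)
lag1_sticky <- lag1(sticky)

print(c(white_noise = lag1_white, sticky = lag1_sticky))
```

A trace that wanders like the second series suggests that the sampler needs a larger interval between samples or a longer run before its estimates can be trusted.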

study0$ergm_table[, {
  print(paste0("Here is the MCMC diagnostic plot for random seed ", `Random seed`, " and ", Precision,
  " precision (Model ", .I, "):"))
  print(mcmc.diagnostics(Ergm[[1]], which = "plots"))
  NULL
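}, by = .(each_row = seq_len(nrow(study0$ergm_table)))] # One group per row; each_row is not defined in the abridged listing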

Here is the MCMC diagnostic plot for random seed 1092953241 and Default precision (Model 1):

Here is the MCMC diagnostic plot for random seed 1092953241 and Fast precision (Model 2):

Here is the MCMC diagnostic plot for random seed 1092953241 and Strict precision (Model 3):

Here is the MCMC diagnostic plot for random seed 1025768792 and Default precision (Model 4):

Here is the MCMC diagnostic plot for random seed 1025768792 and Fast precision (Model 5):

Here is the MCMC diagnostic plot for random seed 1025768792 and Strict precision (Model 6):

In another appendix to study0, the goodness-of-fit plots are shown below. We should adopt a habit of checking them for every model. The rule of thumb is to visually check that the black lines fall as close to the blue points as possible.

study0$ergm_table[, {
  gof1 <- gof(Ergm[[1]])
  print(paste0("Here is the goodness of fit plot for random seed ", `Random seed`, " and ", Precision,
  " precision (Model ", .I, "):"))
  plot(gof1)
  NULL
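}, by = .(each_row = seq_len(nrow(study0$ergm_table)))] # One group per row; each_row is not defined in the abridged listing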

Here is the goodness of fit plot for random seed 1092953241 and Default precision (Model 1):

Here is the goodness of fit plot for random seed 1092953241 and Fast precision (Model 2):

Here is the goodness of fit plot for random seed 1092953241 and Strict precision (Model 3):

Here is the goodness of fit plot for random seed 1025768792 and Default precision (Model 4):

Here is the goodness of fit plot for random seed 1025768792 and Fast precision (Model 5):