Introduction
Exponential Random Graph Models (ERGMs) are statistical models concerned with the structure and formation of networks. As part of a network science toolkit, ERGMs provide insight into the social system that creates the network. More specifically, they help us understand the kinds of network structures that drive the formation of edges (Robins & Lusher, 2013).
In this article I conduct two studies. study0 aims to demonstrate the importance of precision, iteration, and reproducibility when building an ERGM. study1 aims to show a common pitfall in model selection and comparison. I encourage you to repeat these studies on your laptop and grow your expertise in network analysis. I assume that you are proficient in R scripting, that you are already familiar with ERGMs, and that you know your way around the ergm package and reference manual (Handcock et al., 2023; Hunter et al., 2008; Krivitsky et al., 2023).
study0: An ERGM that is robust and reproducible
Reproducibility is a major principle in the scientific method and in scholarly literature. For computer simulations, it is common practice to publish the code and a simulated dataset that allows anyone to reproduce the study. However, most computer simulations involve the generation of random numbers. For example, ERGMs involve the generation of random networks. But if this randomness cannot be controlled or traced, then it would be impossible to recreate the simulated dataset or reproduce the exact results of the published study.
Computers normally use pseudo-random numbers, which statistically look a lot like truly random numbers but are actually traceable and reproducible. A simulated dataset can be exactly recreated if we know the random seed, a number used as the base for the generation of random numbers. For the purposes of reproducible research, it is common to publish the random seed values as part of the code that generates data for the study. Any interested reader must be able to use the published code and seeds to reproduce the published dataset, statistics, and conclusions, all with exact numerical values. Next, the reader must be able to run the same code with different random seeds. The new dataset will be numerically different, but it must lead to substantially the same statistics, results, and conclusions as those in the published study.
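For example, here is the principle in miniature (a generic R illustration of my own, separate from the study code):
|
set.seed(42) # fix the seed...
a <- rnorm(3)
set.seed(42) # ...reset it, and the same "random" draws come out again
b <- rnorm(3)
identical(a, b) # TRUE: the pseudo-random numbers are fully determined by the seed
|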
An ERGM study that is reproducible is not necessarily a good study. Like other forms of statistical inference, ERGM is concerned with choosing values for a set of parameters that maximize the likelihood of the observed data under the model distribution. Unlike other forms of inference, though, ERGM must conduct a numerical, rather than analytical, exploration of the parameter space. Usually these parameters are real numbers, and there is only a finite amount of computational resources with which to explore them. We must be careful that the computer explores a reasonable region of the parameter space; if the computer is too limited in its selection of parameter values, it will come to biased, erroneous conclusions. The ergm library comes with a wide range of control parameters that can be tuned to reach an appropriate compromise among computational resources, time, and precision in our results.
What we want is to use those resources wisely to produce results that are robust as well as reproducible. There is no one-size-fits-all definition of robustness when it comes to ERGM. Fortunately, the ergm library comes bundled with algorithms to diagnose multiple aspects of robustness at each step of the modeling process. One working definition I like to use for ERGMs, as for other forms of computer modeling, is this: after making sure the model has converged and has produced no warning or error messages, I check that the results do not change much when I repeat the run with increased precision.
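In code, that last check might look like the following sketch. The network net and the formula are placeholders of my own; MCMLE.MCMC.precision is a real control.ergm() setting that reappears later in this article:
|
fit_lo <- ergm(net ~ edges + mutual) # default precision
fit_hi <- ergm(net ~ edges + mutual, # tighter precision, more computation
control = control.ergm(MCMLE.MCMC.precision = 0.02))
# If the fit is robust, the estimates agree to well within their standard errors.
all.equal(coef(fit_lo), coef(fit_hi), tolerance = 0.05)
|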
All of this is vague, of course, because the absence of warnings or errors from running ergm() does not mean an absence of problems. There are only so many diagnostic algorithms that the authors can put into the library. And my deciding that I have run my code with sufficiently high precision is different from deciding that I have explored sufficiently many points in the parameter space.
However, adopting a Popperian approach to computer simulation, we use as many computational resources as we can to produce results that support our hypothesis, and after publishing our results we encourage and welcome any work that contributes evidence in support of, or in opposition to, the conclusions of our study.
study0 aims to demonstrate the principles of reproducibility and robustness described above. The study starts by creating a network of our choosing and then fitting several ERGMs to it. The ERGMs will differ in their precision as well as their random seed. We will compare statistical significance across models, as well as the importance of the random seed to the results.
Housekeeping and loading libraries
The code listings below come with abridged output; some details are omitted:
|
# Note 1: Packages lattice and latticeExtra are used by mcmc.diagnostics(which = "plots")
# Note 2: Package stargazer was installed with:
# devtools::install_github("facorread/stargazer-ergm", ref = "fix-ergm", build = FALSE)
for (package_name in c("ergm", "ggplot2", "ggraph", "lattice", "latticeExtra", "Rglpk", "stargazer", "data.table")) {
require(package_name, character.only = TRUE) # mask.ok has no useful effect.
message("Package ", package_name, " version ", packageVersion(package_name),
" released ", packageDate(package_name))
}
print_data_table_impl <- function(x) {
data_table_char <- as.data.table(lapply(x, function(column) {
if (is.numeric(column)) {
column <- format(column)
repeat {
column1 <- sub("(\\.[[:digit:]]+)0( *)$", "\\1 \\2", column)
if (identical(column1, column)) { # base identical(); this helper is shared by both studies
break
}
column <- column1
}
}
column
}))
setnames(data_table_char, names(x))
data_table_char
}
print_data_table <- function(x, ...) {
print(print_data_table_impl(x), quote = FALSE, ...)
}
print_data_table_t <- function(x, new_col_names = NULL, ...) {
data_frame_x <- t(print_data_table_impl(x))
if (!is.null(new_col_names)) {
colnames(data_frame_x) <- new_col_names
}
print(data_frame_x, quote = FALSE, ...)
}
|
|
Package ergm version 4.4.0 released 2023-01-26
Package ggplot2 version 3.4.0 released 2022-10-31
Package ggraph version 2.1.0 released 2022-10-05
Package lattice version 0.20.45 released 2021-09-18
Package latticeExtra version 0.6.30 released 2022-06-30
Package Rglpk version 0.6.4 released 2019-02-08
Package stargazer version 5.2.3 released 2022-03-03
Package data.table version 1.14.6 released 2022-11-16
|
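As a quick usage sketch of the printing helpers (my own example, not part of the study): print_data_table() pads away trailing zeros in numeric columns so that the decimal points line up:
|
print_data_table(data.table(Stat = c("mean", "sd"), value = c(366.8840, 16.19558)))
# The value column prints as "366.884  " and " 16.19558" rather than "366.88400".
|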
|
set.seed(1024321215)
tmp <- list()
study0 <- list()
study0$cat <- function(...) cat(..., file = stderr(), sep = "")
study0$stopifna <- function(...) {
stopifnot(!is.null(...))
stopifnot(!anyNA(..., recursive = TRUE))
}
study0$identical <- function(x, y) {
study0$stopifna(x)
study0$stopifna(y)
identical(x, y)
}
study0$n_cores <- 8
study0$verbose <- 3
study0$effectiveSize.interval_drop <- 8 # Integer, power of 2. Default 2 Recommended 4
study0$n_vertices <- 50
study0$vertex <- data.table(id = 1:(study0$n_vertices))
study0$cat("Building a directed network with ", study0$n_vertices,
" vertices. The maximum possible number of edges is ", study0$n_vertices * (study0$n_vertices - 1), ".\n")
# Number 0 is used for a trivial network, which acts as a starting point for generating networks
study0$edgelist0 <- data.table(ego_id = as.integer(1), alter_id = as.integer(2))
study0$net0 <- network(study0$edgelist0, vertices = study0$vertex)
|
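A few quick sanity checks on this starting network can be run with standard accessors from the network package (my own additions):
|
network.size(study0$net0) # 50 vertices
network.edgecount(study0$net0) # 1 edge: the single tie from vertex 1 to vertex 2
is.directed(study0$net0) # TRUE; network() builds directed networks by default
|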
Choosing our starting exponential random graph distribution (ERGM distribution)
In the code below, we arbitrarily choose an ERGM distribution and coefficients, which we label with the number 1 in objects such as coef1 and formula1. Then we simulate 1000 networks so that we can take a look at the properties of our distribution (nets1, means1).
|
study0$coef1 <- c(asymmetric = -0.85)
study0$coef_table1 <- as.data.table(t(study0$coef1))
study0$formula1 <- (study0$net0 ~ asymmetric)
study0$control_simulate <- control.simulate.formula(parallel = study0$n_cores)
tmp$nets1 <- simulate(study0$formula1, nsim = 1000, coef = study0$coef1, control = study0$control_simulate)
study0$per_net_stats1 <- as.data.table(summary(tmp$nets1 ~ asymmetric))
study0$means1 <- sapply(study0$per_net_stats1, mean) # per-statistic means; used as the SAN targets below
# Number 2 is used for our proposed network, a network that most closely matches stats1.
study0$net2 <- san(study0$net0 ~ asymmetric, target.stats = study0$means1)
study0$formula2 <- (study0$net2 ~ asymmetric)
study0$stats2 <- as.data.table(t(summary(study0$formula2)))
study0$stats_table2 <- rbind(
study0$coef_table1 [, c(Stat = "coef1", .(.SD))],
study0$per_net_stats1[, c(Stat = "Means1", lapply(.SD, mean))],
study0$per_net_stats1[, c(Stat = "Sd1", lapply(.SD, sd))],
study0$stats2 [, c(Stat = "net2", .(study0$stats2))]
)
print_data_table(study0$stats_table2, row.names = FALSE)
ggraph(study0$net2) + geom_edge_fan()
|
|
Stat asymmetric
coef1 -0.85
Means1 366.884
Sd1 16.19558
net2 367.0
|
We have arbitrarily chosen to build an exponential random graph distribution (ERGM distribution) of networks of 50 nodes with the asymmetric statistic and a parameter of -0.85. This distribution generates networks that have asymmetric (non-reciprocated) edges in amounts smaller than those found in purely random networks. A random sample of 1000 networks generated from this distribution turned out to have an average of Means1 = 366.884 asymmetric edges and a standard deviation of Sd1 = 16.19558 asymmetric edges; a rather narrow distribution.
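As a quick sanity check on that claim (a sketch of my own; nets_null is a hypothetical name), we can simulate from the null distribution, in which the coefficient is 0 and every directed graph is equally likely. Each of the choose(50, 2) = 1225 dyads is then asymmetric with probability 0.5, so we expect about 612.5 asymmetric dyads, far more than the 366.884 above:
|
tmp$nets_null <- simulate(study0$formula1, nsim = 100, coef = c(asymmetric = 0),
control = study0$control_simulate)
mean(summary(tmp$nets_null ~ asymmetric)) # roughly 612.5 = choose(50, 2) * 0.5
|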
From this ERGM distribution we created one representative network, study0$net2, consisting of 367 asymmetric edges and illustrated in the figure above. We use it as the input for the ERGM fitting process, as shown below.
Fitting the models
|
study0$control_fast <- control.ergm(
MCMC.effectiveSize = 100, # Default 64 Strict 1000
MCMC.effectiveSize.burnin.pval = 0.2, # Default 0.2 Strict 0.4
MCMLE.MCMC.precision = 0.1, # Default 0.1 Strict 0.02
MCMLE.confidence = 0.9999, # Default 0.99 Strict 0.9999
# MCMC.samplesize: MCMLE will increase it until it reaches the requested effectiveSize.
# Do not make samplesize too small though, because the resulting burnin p-value will be 0.
MCMC.samplesize = study0$n_cores * 100, # Minimum 100
MCMLE.maxit = 100, # Default 60. Smaller values might cut ergm short and cause incorrect results.
MCMC.effectiveSize.maxruns = 30, # Default 16. Smaller values might cause incorrect results.
# parallel = study0$n_cores,
MCMLE.effectiveSize.interval_drop = study0$effectiveSize.interval_drop
)
study0$formula3 <- (study0$net2 ~ asymmetric + cyclicalties + mutual)
study0$ergm_comparison <- function(random_seed) {
set.seed(random_seed)
study0$cat("\n\n\n\n\nRunning ergm with default precision for random seed ", random_seed, "\n")
show(study0$stats_table2)
ergm_default_precision <- ergm(study0$formula3,
# control = study0$control_default,
verbose = study0$verbose)
study0$cat("\n\n\n\n\nRunning ergm with fast precision for random seed ", random_seed, "\n")
show(study0$stats_table2)
ergm_fast <- ergm(study0$formula3, control = study0$control_fast, verbose = study0$verbose)
control_strict <- control.ergm(
MCMC.effectiveSize = 1000, # Default 64 Strict 1000
MCMC.effectiveSize.burnin.pval = 0.4, # Default 0.2 Strict 0.4
MCMLE.MCMC.precision = 0.02, # Default 0.1 Strict 0.02
MCMLE.confidence = 0.9999, # Default 0.99 Strict 0.9999
init = ergm_fast$coefficients, # Set the initial parameter values explicitly.
# MCMC.samplesize: MCMLE will increase it until it reaches the requested effectiveSize.
# Do not make samplesize too small though, because the resulting burnin p-value will be 0.
MCMC.samplesize = study0$n_cores * 100, # Minimum 100
MCMLE.maxit = 100, # Default 60. Smaller values might cut ergm short and cause incorrect results.
MCMC.effectiveSize.maxruns = 30, # Default 16. Smaller values might cause incorrect results.
parallel = study0$n_cores,
MCMLE.effectiveSize.interval_drop = study0$effectiveSize.interval_drop
)
study0$cat("\n\n\n\n\nRunning ergm with strict precision for random seed ", random_seed, "\n")
show(study0$stats_table2)
ergm_strict <- ergm(study0$formula3, control = control_strict, verbose = study0$verbose)
data.table(`Random seed` = random_seed, Precision = c("Default", "Fast", "Strict"),
Ergm = list(ergm_default_precision, ergm_fast, ergm_strict))
}
study0$random_seeds <- c(1092953241, 1025768792)
study0$ergm_table <- rbindlist(lapply(study0$random_seeds, study0$ergm_comparison))
study0$extract_ergm_statistics <- function(ergm_object) {
summary1 <- summary(ergm_object)
coef1 <- as.data.table(summary1$coefficients[, c(1, 2, 5)], keep.rownames = TRUE)
setnames(coef1, c("rn", "Estimate", "Std. Error", "Pr(>|z|)"), c("row_name", "_est", "_se", "_p"))
melt1 <- melt(coef1, id.vars = "row_name", variable.name = "suffix")
melt1[, variable := paste0(row_name, suffix)]
new_columns <- as.data.table(t(melt1$value))
setnames(new_columns, melt1$variable)
new_columns[, AIC := summary1$aic]
new_columns[, BIC := summary1$bic]
new_columns
}
# study0$extract_ergm_statistics(study0$ergm_table[1, Ergm[[1]]])
study0$add_ergm_table_columns <- function() {
new_columns <- study0$ergm_table[, rbindlist(lapply(Ergm, study0$extract_ergm_statistics))]
stopifnot(!(names(new_columns) %in% names(study0$ergm_table)))
set(study0$ergm_table, j = names(new_columns), value = new_columns)
}
study0$add_ergm_table_columns()
stargazer(study0$ergm_table[["Ergm"]], type = "text", style = "all", column.labels = paste0("Model ", 1:6),
model.numbers = FALSE,
add.lines = list(`Random seed` = c("Random seed", study0$ergm_table[["Random seed"]]),
Precision = c("Precision", study0$ergm_table[["Precision"]])))
|
|
=================================================================================================
Dependent variable:
-----------------------------------------------------------------------
study0
Model 1 Model 2 Model 3 Model 4 Model 5 Model 6
-------------------------------------------------------------------------------------------------
asymmetric -1.741*** -1.740*** -1.735*** -1.739*** -1.730*** -1.734***
(0.149) (0.145) (0.144) (0.142) (0.144) (0.143)
t = -11.647 t = -11.981 t = -12.046 t = -12.216 t = -12.026 t = -12.137
p = 0.000 p = 0.000 p = 0.000 p = 0.000 p = 0.000 p = 0.000
cyclicalties 0.157 0.154* 0.156* 0.155* 0.151 0.153*
(0.098) (0.093) (0.093) (0.092) (0.093) (0.093)
t = 1.606 t = 1.647 t = 1.680 t = 1.691 t = 1.631 t = 1.653
p = 0.109 p = 0.100 p = 0.093 p = 0.091 p = 0.103 p = 0.099
mutual -4.132*** -4.113*** -4.118*** -4.124*** -4.106*** -4.109***
(0.340) (0.340) (0.338) (0.317) (0.340) (0.351)
t = -12.161 t = -12.092 t = -12.197 t = -13.030 t = -12.076 t = -11.710
p = 0.000 p = 0.000 p = 0.000 p = 0.000 p = 0.000 p = 0.000
-------------------------------------------------------------------------------------------------
Random seed 1092953241 1092953241 1092953241 1025768792 1025768792 1025768792
Precision Default Fast Strict Default Fast Strict
Akaike Inf. Crit. 2,204.353 2,204.633 2,204.659 2,203.297 2,205.034 2,204.646
Bayesian Inf. Crit. 2,221.764 2,222.044 2,222.070 2,220.708 2,222.445 2,222.058
Null Deviance (df = 2450) 3,396.421 3,396.421 3,396.421 3,396.421 3,396.421 3,396.421
=================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
|
We have fitted 6 ERGMs that differ in the two fundamentals discussed above: random seed and precision. For the purposes of this discussion, I refer to the seed ending in 1 as random seed 1, and to the seed ending in 2 as random seed 2. ERGMs with fast precision are those that used study0$control_fast as the control argument, and ERGMs with strict precision are those that used control_strict. I hope the comments in the code above are useful as guidelines for deciding the kind of precision you want from your ERGM fits.
For the purposes of our example, we have chosen ERGMs based on the formula study0$net2 ~ asymmetric + cyclicalties + mutual. This is obviously different from the formula we used to build study0$net2 itself, which is based on asymmetric only. This mirrors the reality of ERGM research, in which we capture data from a real-world network and try to figure out the terms that are appropriate to model it. Those terms should be informed by our theoretical framework, built on the existing scholarly literature. Let us take a look at the statistical significance found for those terms in our 6 models in the table above.
The statistical significance of cyclicalties depends not only on the precision of the model but also on the random seed we have chosen. Models 3 and 6, the strict ones, consistently show statistical significance for cyclicalties. In fact, statistical significance can be shown under strict precision for any choice of random seed (again, in a Popperian sense). In contrast, models 2 and 5, the fast ones, do not agree on the statistical significance of cyclicalties; significance depends on the choice of random seed. The same happens with the models that use the default precision settings, models 1 and 4.
The 6 models in this study have very similar values of the model selection criteria AIC and BIC. We can trust their findings of statistical significance for asymmetric and mutual, and we can say that our conclusions for these two statistics are robust: they are independent of the choice of random seed or precision. We would be able to economize resources by using the default precision if it were not for the problem we described for cyclicalties in the previous paragraph.
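Because the fitted models live in study0$ergm_table, the criteria can be tabulated directly. This is a sketch of my own; it assumes AIC() and BIC() dispatch on ergm objects through logLik.ergm(), which the ergm package provides:
|
study0$ergm_table[, .(`Random seed`, Precision,
AIC = sapply(Ergm, AIC), BIC = sapply(Ergm, BIC))]
|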
To summarize, a study of the ERGM described by the formula study0$net2 ~ asymmetric + cyclicalties + mutual must be conducted with strict precision; only then can we confidently conclude that all three terms make statistically significant contributions to network formation.
As an exercise, you can create a seventh ERGM described by the formula study0$net2 ~ asymmetric and fit it with appropriate precision settings; you should be able to reach a confident conclusion as to which of the seven models best describes net2. You can also check these results against different random seeds.
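Here is a sketch of that exercise, reusing the strict settings from ergm_comparison() above (the ergm_exercise name is mine):
|
study0$ergm_exercise <- ergm(study0$formula2, # study0$net2 ~ asymmetric
control = control.ergm(MCMC.effectiveSize = 1000,
MCMC.effectiveSize.burnin.pval = 0.4,
MCMLE.MCMC.precision = 0.02,
MCMLE.confidence = 0.9999))
AIC(study0$ergm_exercise) # compare against the criteria of the 6 models above
|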
As an appendix to study0, the graphs of mcmc.diagnostics() are shown below. We should adopt the habit of checking these figures to make sure that the exploration of the parameter space is not biased, autocorrelated, or clearly incomplete. A rule of thumb is to check that the trace plots on the left look like white noise with a horizontal smooth line, and that the density plots on the right resemble normal distributions. That seems to be the case for all 6 models.
|
study0$ergm_table[, {
print(paste0("Here is the MCMC diagnostic plot for random seed ", `Random seed`, " and ", Precision,
" precision (Model ", .I, "):"))
print(mcmc.diagnostics(Ergm[[1]], which = "plots"))
NULL
}, by = seq_len(nrow(study0$ergm_table))] # group by row so that each model is plotted in turn
|
Here is the MCMC diagnostic plot for random seed 1092953241 and Default precision (Model 1):
Here is the MCMC diagnostic plot for random seed 1092953241 and Fast precision (Model 2):
Here is the MCMC diagnostic plot for random seed 1092953241 and Strict precision (Model 3):
Here is the MCMC diagnostic plot for random seed 1025768792 and Default precision (Model 4):
Here is the MCMC diagnostic plot for random seed 1025768792 and Fast precision (Model 5):
Here is the MCMC diagnostic plot for random seed 1025768792 and Strict precision (Model 6):
In another appendix to study0, the goodness-of-fit plots are shown below. We should adopt the habit of checking them for every model. The rule of thumb is to visually check that the black lines fall as close to the blue points as possible.
|
study0$ergm_table[, {
gof1 <- gof(Ergm[[1]])
print(paste0("Here is the goodness of fit plot for random seed ", `Random seed`, " and ", Precision,
" precision (Model ", .I, "):"))
plot(gof1)
NULL
}, by = seq_len(nrow(study0$ergm_table))] # group by row so that each model is plotted in turn
|
Here is the goodness of fit plot for random seed 1092953241 and Default precision (Model 1):
Here is the goodness of fit plot for random seed 1092953241 and Fast precision (Model 2):
Here is the goodness of fit plot for random seed 1092953241 and Strict precision (Model 3):
Here is the goodness of fit plot for random seed 1025768792 and Default precision (Model 4):
Here is the goodness of fit plot for random seed 1025768792 and Fast precision (Model 5):
Here is the goodness of fit plot for random seed 1025768792 and Strict precision (Model 6):
To reiterate, we visually check that the black lines fall as close to the blue points as possible. Generally speaking, most black lines fall within the grey boxes, but not all of them; we expect to see discrepancies, but we make sure they are infrequent. Additionally, all cyclicalties statistics fall within the grey boxes, even for models in which cyclicalties was found to make a statistically insignificant contribution. This is a key realization for a user of ergm: the computer certainly generates networks with values of cyclicalties similar to those of net2, but that fact is different from cyclicalties making a statistically significant contribution to the model. The gof plots visualize the first fact but not the second.
study1: We must explore ERGMs with many combinations of terms
Network statistics are associated with one another in complex ways; they are confounded by the network itself. When we build an ERGM, we put together a linear combination of these statistics. For other types of regression, such as lm() and glm(), it is usual to use a sequential algorithm such as stats::step() or MASS::stepAIC() to select the best model based on an appropriate combination of predictors. In this section, we will demonstrate how the network confounder impedes a sequential process of model selection in ERGM.
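Before fitting anything, we can instead enumerate every candidate specification up front. The sketch below (variable names are my own) generates the 7 formulas that will be fitted as models 3 through 9 in this study:
|
candidate_terms <- c("asymmetric", "cyclicalties", "twopath")
all_formulas <- unlist(lapply(seq_along(candidate_terms), function(k) {
combn(candidate_terms, k, simplify = FALSE,
FUN = function(s) paste("net2 ~", paste(s, collapse = " + ")))
}), recursive = FALSE)
all_formulas # 7 specifications: 3 single-term, 3 two-term, 1 three-term
|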
Housekeeping and loading libraries
The code listings below come with abridged output; some details are omitted:
|
set.seed(1024146415)
tmp <- list()
study1 <- list()
study1$cat <- function(...) cat(..., file = stderr(), sep = "")
study1$stopifna <- function(...) {
stopifnot(!is.null(...))
stopifnot(!anyNA(..., recursive = TRUE))
}
study1$identical <- function(x, y) {
study1$stopifna(x)
study1$stopifna(y)
identical(x, y)
}
study1$n_cores <- 8
study1$verbose <- 3
study1$effectiveSize.interval_drop <- 8 # Integer, power of 2. Default 2 Recommended 4
study1$n_vertices <- 50
study1$vertex <- data.table(id = 1:(study1$n_vertices))
study1$cat("Building a directed network with ", study1$n_vertices,
" vertices. The maximum possible number of edges is ", study1$n_vertices * (study1$n_vertices - 1), ".\n")
# Number 0 is used for a trivial network, which acts as a starting point for generating networks
study1$edgelist0 <- data.table(ego_id = as.integer(1), alter_id = as.integer(2))
study1$net0 <- network(study1$edgelist0, vertices = study1$vertex)
|
Choosing our starting ERGM distribution
|
study1$coef1 <- c(cyclicalties = 0.2)
study1$coef_table1 <- as.data.table(t(study1$coef1))
study1$formula1 <- (study1$net0 ~ cyclicalties)
study1$control_simulate <- control.simulate.formula(parallel = study1$n_cores)
tmp$nets1 <- simulate(study1$formula1, nsim = 1000, coef = study1$coef1, control = study1$control_simulate)
study1$per_net_stats1 <- as.data.table(summary(tmp$nets1 ~ cyclicalties))
study1$means1 <- sapply(study1$per_net_stats1, mean)
# Number 2 is used for our proposed network, a network that most closely matches stats1.
study1$net2 <- san(study1$net0 ~ cyclicalties, target.stats = study1$means1)
study1$formula2 <- (study1$net2 ~ cyclicalties)
study1$stats2 <- as.data.table(t(summary(study1$formula2)))
study1$stats_table2 <- rbind(
study1$coef_table1 [, c(Stat = "coef1", .(.SD))],
study1$per_net_stats1[, c(Stat = "Means1", lapply(.SD, mean))],
study1$per_net_stats1[, c(Stat = "Sd1", lapply(.SD, sd))],
study1$stats2 [, c(Stat = "net2", .(study1$stats2))]
)
print_data_table(study1$stats_table2, row.names = FALSE)
ggraph(study1$net2) + geom_edge_fan()
|
|
Stat cyclicalties
coef1 0.2
Means1 1346.238
Sd1 23.67319
net2 1346.0
|
|
study1$parallel_ergm <- function() {
local_cluster <- parallel::makeCluster(study1$n_cores, "PSOCK") # qualified: the parallel package is not attached above
on.exit(parallel::stopCluster(local_cluster))
study1$control_fast <- control.ergm(
# MCMC.samplesize: MCMLE will increase it until it reaches the requested effectiveSize.
# Do not make samplesize too small though, because the resulting burnin p-value will be 0.
MCMC.samplesize = study1$n_cores * 100, # Minimum 100
MCMLE.maxit = 100, # Default 60. Smaller values might cut ergm short and cause incorrect results.
MCMC.effectiveSize.maxruns = 30, # Default 16. Smaller values might cause incorrect results.
# parallel = study1$n_cores,
parallel = local_cluster,
MCMLE.effectiveSize.interval_drop = study1$effectiveSize.interval_drop
)
study1$ergm_combination <- function(ergm_formula) {
study1$cat("\n\n\n\n\nRunning ergm with default precision for formula:\n", deparse(ergm_formula), "\n")
show(study1$stats_table2)
ergm_fast <- ergm(ergm_formula, control = study1$control_fast, verbose = study1$verbose)
control_strict <- control.ergm(
MCMC.effectiveSize = 1000, # Default 64 Strict 1000
MCMC.effectiveSize.burnin.pval = 0.4, # Default 0.2 Strict 0.4
MCMLE.MCMC.precision = 0.02, # Default 0.1 Strict 0.02
MCMLE.confidence = 0.9999, # Default 0.99 Strict 0.9999
init = ergm_fast$coefficients, # Set the initial parameter values explicitly.
# MCMC.samplesize: MCMLE will increase it until it reaches the requested effectiveSize.
# Do not make samplesize too small though, because the resulting burnin p-value will be 0.
MCMC.samplesize = study1$n_cores * 100, # Minimum 100
MCMLE.maxit = 100, # Default 60. Smaller values might cut ergm short and cause incorrect results.
MCMC.effectiveSize.maxruns = 30, # Default 16. Smaller values might cause incorrect results.
# parallel = study1$n_cores,
parallel = local_cluster,
MCMLE.effectiveSize.interval_drop = study1$effectiveSize.interval_drop
)
study1$cat("\n\n\n\n\nRunning ergm with strict precision for formula:\n", deparse(ergm_formula), "\n")
show(study1$stats_table2)
ergm(ergm_formula, control = control_strict, verbose = study1$verbose)
}
# asymmetric + balance + cyclicalties + density + dsp(1) + dgwdsp(1) + edges + esp(5)
# + mutual + transitiveties + twopath
study1$ergm3 <- study1$ergm_combination(study1$net2 ~ asymmetric)
study1$gof3 <- gof(study1$ergm3)
study1$ergm4 <- study1$ergm_combination(study1$net2 ~ cyclicalties)
study1$gof4 <- gof(study1$ergm4)
study1$ergm5 <- study1$ergm_combination(study1$net2 ~ twopath)
study1$gof5 <- gof(study1$ergm5)
study1$ergm6 <- study1$ergm_combination(study1$net2 ~ asymmetric + cyclicalties)
study1$gof6 <- gof(study1$ergm6)
study1$ergm7 <- study1$ergm_combination(study1$net2 ~ asymmetric + twopath)
study1$gof7 <- gof(study1$ergm7)
study1$ergm8 <- study1$ergm_combination(study1$net2 ~ cyclicalties + twopath)
study1$gof8 <- gof(study1$ergm8)
study1$ergm9 <- study1$ergm_combination(study1$net2 ~ asymmetric + cyclicalties + twopath)
study1$gof9 <- gof(study1$ergm9)
save(study1, file = "study1.RData") # Data was saved to disk and then organized as shown below.
}
study1$parallel_ergm()
|
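Because parallel_ergm() saves its results to disk rather than returning them, a later session can restore the fitted models before producing the tables below (a one-line sketch using the file name from the listing above):
|
load("study1.RData") # restores the study1 list, including ergm3 through ergm9 and the gof objects
|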
|
stargazer(study1$ergm3, study1$ergm4, study1$ergm5, study1$ergm6, study1$ergm7, study1$ergm8,
study1$ergm9, type = "text", style = "all", column.labels = paste0("Model ", 3:9),
model.numbers = FALSE)
|
|
=======================================================================================================
Dependent variable:
-----------------------------------------------------------------------------
net2
Model 3 Model 4 Model 5 Model 6 Model 7 Model 8 Model 9
-------------------------------------------------------------------------------------------------------
asymmetric 0.069 0.090 0.090 0.091
(0.057) (0.058) (0.057) (0.058)
t = 1.213 t = 1.556 t = 1.573 t = 1.562
p = 0.226 p = 0.120 p = 0.116 p = 0.119
cyclicalties 0.198*** 0.207*** 0.161 0.175
(0.042) (0.041) (0.628) (0.622)
t = 4.703 t = 5.005 t = 0.257 t = 0.282
p = 0.00001 p = 0.00000 p = 0.798 p = 0.779
twopath 0.004*** 0.004*** 0.001 0.001
(0.001) (0.001) (0.012) (0.012)
t = 4.875 t = 5.182 t = 0.057 t = 0.051
p = 0.00001 p = 0.00000 p = 0.955 p = 0.960
-------------------------------------------------------------------------------------------------------
Akaike Inf. Crit. 3,397.640 3,374.478 3,374.529 3,374.047 3,373.877 3,376.480 3,375.953
Bayesian Inf. Crit. 3,403.444 3,380.282 3,380.333 3,385.655 3,385.485 3,388.088 3,393.365
Null Deviance (df = 2450) 3,396.421 3,396.421 3,396.421 3,396.421 3,396.421 3,396.421 3,396.421
=======================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
|
This table compares the ERGMs we have built, numbered 3 through 9. Model 3 seems to be the least appropriate of all, as it has the largest values of both information criteria. The corresponding term asymmetric has no statistical significance in any of the models evaluated. While the Akaike information criterion marginally favors Model 7 as the best of all, the Bayesian information criterion favors Model 4. And cyclicalties, the term that we used to build the observed network net2, shows statistical significance in Model 4 but is absent from Model 7. A sequential model search such as that performed by stats::step() or MASS::stepAIC() could have settled for either model depending on the starting point, the random seed, and the choice of information criterion.
This study shows that model selection, informed by information criteria and statistical significance, does not necessarily reach a single, authoritative best model when applied to ERGM. Instead, the selection of terms must be grounded in the theoretical framework used to study the particular network observed in the real world, followed by a comparison of all the ERGMs that can be constructed from those terms. More often than not, computational and time constraints prevent us from covering all possible combinations of terms in that comparison.
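A compact way to line up the criteria side by side (a sketch of my own, under the same logLik.ergm() assumption as in study0):
|
tmp$models <- list(study1$ergm3, study1$ergm4, study1$ergm5, study1$ergm6,
study1$ergm7, study1$ergm8, study1$ergm9)
data.table(Model = 3:9, AIC = sapply(tmp$models, AIC), BIC = sapply(tmp$models, BIC))
|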
As an appendix to this study, here are the MCMC diagnostic plots for each model:
|
mcmc.diagnostics(study1$ergm3, which = "plots")
|
|
mcmc.diagnostics(study1$ergm4, which = "plots")
|
|
mcmc.diagnostics(study1$ergm5, which = "plots")
|
|
mcmc.diagnostics(study1$ergm6, which = "plots")
|
|
mcmc.diagnostics(study1$ergm7, which = "plots")
|
|
mcmc.diagnostics(study1$ergm8, which = "plots")
|
|
mcmc.diagnostics(study1$ergm9, which = "plots")
|
As another appendix to this study, here are the goodness-of-fit plots for all ERGMs. The discrepancies between the models and the observed network (net2) are evident, especially for the ERGMs with the highest information criteria, namely models 3 and 8.
Conclusion
We conducted two studies of Exponential Random Graph Models with the purpose of exploring the technology and the limitations of this network science tool. By using networks we generated ourselves, we were able to develop a few best practices for future studies of real-world networks: make sure the model is reproducible, independent of the choice of random seed, and fitted with enough precision to be trustworthy; use an established theoretical framework to guide your choice of possible model terms; and try your best to explore all combinations of those terms so that you can choose the best models.
References
Handcock M, Hunter D, Butts C, Goodreau S, Krivitsky P, Morris M (2023). ergm: Fit, Simulate and Diagnose Exponential-Family Models for Networks. The Statnet Project (https://statnet.org). R package version 4.4.0, https://CRAN.R-project.org/package=ergm.
Hunter D, Handcock M, Butts C, Goodreau S, Morris M (2008). “ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks.” Journal of Statistical Software, 24(3), 1–29.
Krivitsky PN, Hunter DR, Morris M, Klumb C (2023). “ergm 4: New Features for Analyzing Exponential-Family Random Graph Models.” Journal of Statistical Software, 105(6), 1–44. doi:10.18637/jss.v105.i06.
Robins, G., & Lusher, D. (2013). What are exponential random graph models? Exponential random graph models for social networks: Theory, methods and applications, 9-15.
Taboga, M (2021). Markov Chain Monte Carlo (MCMC) diagnostics. Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix, available here.
Credits
Featured photo: Switch, cable, network by ubayrak, available here. Licensed under CC0 public domain license, free for personal & commercial use, no attribution required.