Making the most of Exponential Random Graph Modeling (ERGM)
Introduction
Exponential Random Graph Models (ERGMs) are statistical models of the structure and formation of networks. As part of a network science toolkit, ERGMs provide insight into the social system that creates the network. More specifically, they help us understand the kinds of network structures that cause the formation of edges (Robins & Lusher, 2013).
In this article I conduct two studies. study0 aims to demonstrate the importance of precision, iteration, and reproducibility when building an ERGM. study1 aims to show a common pitfall in model selection and comparison. I encourage you to repeat these studies on your laptop and grow your expertise in network analysis. I am assuming that you are proficient in R scripting, you are already familiar with ERGM, and you know your way around the ERGM package and reference manual (Handcock et al., 2023; Hunter et al., 2008; Krivitsky et al., 2023).
study0: An ERGM that is robust and reproducible
Reproducibility is a major principle in the scientific method and in scholarly literature. For computer simulations, it is common practice to publish the code and a simulated dataset that allows anyone to reproduce the study. However, most computer simulations involve the generation of random numbers. For example, ERGMs involve the generation of random networks. But if this randomness cannot be controlled or traced, then it would be impossible to recreate the simulated dataset or reproduce the exact results of the published study.
Computers normally use pseudo-random numbers, which statistically look a lot like truly random numbers, but are actually traceable and reproducible. A simulated dataset can be exactly recreated if we know the random seed, a number that is used as the base for the generation of random numbers. For the purposes of reproducible research, it is common to publish the random seed values as part of the code that generates data for the study. Any interested reader must be able to use the published code and seeds to reproduce the published dataset, statistics, and conclusions that the researcher has reached, all with exact numerical values. Next, the reader must be able to run the same code with different random seeds. The new dataset will be numerically different, but it must lead to substantially the same statistics, results, and conclusions as those in the published study.
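The role of the seed can be seen with base R alone. The sketch below is only an illustration and is not part of the study code; it borrows the two seed values that label the models later in this article and uses runif() as a stand-in for any random simulation.

```r
# Same seed: the pseudo-random stream restarts from the same point.
set.seed(1092953241)
draw_a <- runif(5)

set.seed(1092953241)
draw_b <- runif(5)

# Different seed: a numerically different stream.
set.seed(1025768792)
draw_c <- runif(5)

print(identical(draw_a, draw_b))  # same seed, so the draws match exactly
print(identical(draw_a, draw_c))  # different seed, so the draws differ
```

The first comparison is what exact reproduction relies on; the second is the situation in which we expect the conclusions, but not the numbers, to stay the same.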
An ERGM study that is reproducible is not necessarily a good study. Like other forms of statistical inference, ERGM is concerned with choosing values for a set of parameters that maximize the likelihood that the statistical distribution represents the observed data. Unlike other forms of inference though, ERGM must conduct a numerical, rather than analytical, exploration of the space of parameters. Usually these parameters are real numbers, and there is only a finite amount of computational resources to explore them. We must be careful that the computer explores a reasonable region in the space of parameters; if the computer is too limited in its selection of parameter values, then it will come to biased, erroneous conclusions. The ERGM library comes with a wide range of control parameters that can be tuned so that we can reach an appropriate compromise among computational resources, time, and precision in our results.
What we want to do is use the resources wisely to produce robust as well as reproducible results. There is no one-size-fits-all definition of robustness when it comes to ERGM, although the ERGM library comes bundled with algorithms to diagnose multiple aspects of robustness at each step of the modeling process. The definition I like to use, for ERGM as well as other forms of computer modeling, is this: after making sure the model has converged and has produced no warning or error messages, I check that the results do not change much when I repeat the code with increased precision.
All of this is vague of course, because the absence of warnings or errors from running ergm() does not mean an absence of problems. There are only so many diagnostic algorithms that the authors can put into the library. And my deciding that I have run my code with sufficiently high precision is different from deciding that I have explored sufficient points in the parameter space.
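One way to make "the results do not change much with increased precision" a little less vague is to measure how far the estimates move between a looser and a stricter fit, in units of the stricter fit's standard errors. The helper below is a hypothetical sketch of that idea (the name max_shift_in_se and the choice of scale are my assumptions, not something provided by the ergm package), and what counts as a small shift remains a judgment call.

```r
# Hypothetical helper: largest change in any coefficient between two fits,
# expressed in standard errors of the stricter fit.
max_shift_in_se <- function(coef_loose, coef_strict, se_strict) {
  # Both fits must estimate the same terms, in the same order.
  stopifnot(identical(names(coef_loose), names(coef_strict)))
  max(abs(coef_loose - coef_strict) / se_strict)
}

# With two fitted models (hypothetical names fit_loose and fit_strict),
# the inputs could be obtained through the generic accessors:
#   max_shift_in_se(coef(fit_loose), coef(fit_strict),
#                   sqrt(diag(vcov(fit_strict))))
```

A shift that is a small fraction of a standard error suggests that the extra precision did not buy much; a shift of a standard error or more suggests the looser fit should not be trusted.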
However, in adopting a Popperian approach to computer simulation, we use as many computational resources as we can to produce results that support our hypothesis, and after we have published our results we encourage and welcome any work that contributes evidence in support or in opposition to the conclusions of our study.
study0 aims to demonstrate the principles of reproducibility and robustness as described above. The study starts by creating a network of our choosing, then fitting several ERGMs to it. The ERGMs will differ by their precision as well as the random seed. We will be comparing the statistical significance across models as well as the importance of the random seed to the results.
Housekeeping and loading libraries
The code listings below come with an abridged output and some details are omitted:
Choosing our starting exponential random graph distribution (ERGM distribution)
In the code below, we arbitrarily choose an ERGM distribution and coefficients that we label with the number 1, in objects such as coef1 and formula1. Then we conduct a simulation of 1000 networks so that we can take a look at the properties of our distribution (nets1, means1).
We have arbitrarily chosen to build an exponential random graph distribution (ERGM distribution) of networks of 50 nodes with the asymmetric statistic and a parameter of -0.85. This distribution generates networks with fewer asymmetric (non-reciprocated) edges than purely random networks have. A random sample of 1000 networks generated from this distribution turned out to have an average of Means1 = 366.884 asymmetric edges and a standard deviation of Sd1 = 16.19558 asymmetric edges; a rather narrow distribution.
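These simulated values can be sanity-checked by hand, because a model with only the asymmetric term treats every pair of nodes independently. Each pair is either empty (weight 1), asymmetric in one of two directions (weight exp(theta) each), or mutual (weight 1), so the probability that a pair is asymmetric is 2 exp(theta) / (2 + 2 exp(theta)). The short calculation below is my own check, not part of the study code, and the variable names are illustrative.

```r
theta   <- -0.85
n_nodes <- 50
n_dyads <- choose(n_nodes, 2)   # 1225 unordered pairs of nodes

# Probability that a given pair is asymmetric; equals plogis(theta).
p_asym <- 2 * exp(theta) / (2 + 2 * exp(theta))

# The count of asymmetric pairs is binomial(n_dyads, p_asym).
expected_asym <- n_dyads * p_asym
sd_asym       <- sqrt(n_dyads * p_asym * (1 - p_asym))

print(round(c(expected = expected_asym, sd = sd_asym), 2))
```

The result is about 366.8 with a standard deviation of about 16.0, in line with the simulated Means1 and Sd1. Setting theta to 0 gives p_asym = 0.5 and an expected 612.5 asymmetric pairs, which is the purely random benchmark.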
From this ERGM distribution we created one representative network, study0$net2, consisting of 367 asymmetric edges and illustrated in the figure above. We use this as the input for the ERGM fitting process as shown below.
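For readers who want to try this step themselves, here is a minimal sketch of how such a distribution could be simulated with the ergm package. It is not the original listing: the object names basis, stats1, and counts are hypothetical, the seeds are simply the two values that label the models in this article, and picking the simulated network closest to the sample mean is only my assumption about what "representative" could mean.

```r
library(ergm)     # model terms and simulate() method
library(network)  # network.initialize()

# Empty directed network of 50 nodes, used as the basis for simulation.
basis    <- network.initialize(50, directed = TRUE)
formula1 <- basis ~ asymmetric
coef1    <- -0.85

# Sample only the network statistic for 1000 simulated networks.
stats1 <- simulate(formula1, coef = coef1, nsim = 1000,
                   output = "stats", seed = 1092953241)
means1 <- mean(stats1[, "asymmetric"])
sd1    <- sd(stats1[, "asymmetric"])

# Assumption: take as "representative" the network, out of a small batch,
# whose asymmetric count is closest to the sample mean.
nets1  <- simulate(formula1, coef = coef1, nsim = 20, seed = 1025768792)
counts <- sapply(nets1, function(net) summary(net ~ asymmetric))
net2   <- nets1[[which.min(abs(counts - means1))]]

print(c(mean = means1, sd = sd1))
print(summary(net2 ~ asymmetric))
```

Because the seeds and sampling settings here are my own choices, the numbers will not match Means1 and Sd1 exactly, but they should land close to them.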
Fitting the models
We have fitted 6 ERGMs that differ in the two fundamentals discussed above: random seed and precision. For the purposes of the discussion, random seed 1 is the seed that ends in the digit 1, and random seed 2 is the seed that ends in the digit 2. ERGMs with fast precision are those that used study0$control_fast as control parameter, and ERGMs with strict precision are those that used control_strict as control parameter. I hope the comments above are useful as guidelines to decide the kind of precision you want to get from your ERGM fits.
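To give a feel for what a fast and a strict setting might look like, here is a sketch built with control.ergm(). The numeric values are hypothetical and chosen only to show the direction of the trade-off; they are not the settings used for the six models above, and fit_study0 is an illustrative helper name.

```r
library(ergm)

# Hypothetical "fast" settings: short burn-in, closely spaced samples,
# small MCMC sample. Cheap, but the estimates are noisier.
control_fast <- control.ergm(seed = 1092953241,
                             MCMC.burnin = 1000,
                             MCMC.interval = 100,
                             MCMC.samplesize = 500)

# Hypothetical "strict" settings: longer burn-in, widely spaced samples,
# larger MCMC sample. Slower, but less sensitive to the seed.
control_strict <- control.ergm(seed = 1092953241,
                               MCMC.burnin = 50000,
                               MCMC.interval = 2000,
                               MCMC.samplesize = 5000)

# Illustrative helper: fit the three-term model with a given control list.
fit_study0 <- function(net, control) {
  ergm(net ~ asymmetric + cyclicalties + mutual, control = control)
}
```

Calling fit_study0(net, control_fast) and fit_study0(net, control_strict) on the same directed network, and then repeating both with a second seed, reproduces the structure of the comparison discussed here.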
For the purposes of our example, we have chosen ERGMs based on the formula study0$net2 ~ asymmetric + cyclicalties + mutual. This is obviously different from the formula we used to build study0$net2 itself, which is based on asymmetric only. This reflects the reality of ERGM research, in which we capture data from a real-world network and try to figure out the terms that are appropriate to model the network. Those terms should be informed by our theoretical framework, built on existing scholarly literature. Let us take a look at the statistical significance that is found for those terms in our 6 models in the table above.
The statistical significance of cyclicalties depends not only on the precision of the model but also on the random seed that we have chosen. Models 3 and 6, the strict ones, consistently show statistical significance in cyclicalties. In fact, statistical significance can be shown for strict precision on any choice of random seed (again, in a Popperian sense). In contrast, models 2 and 5, the fast ones, do not agree on the statistical significance of cyclicalties; significance depends on the choice of random seed. The same happens with the models that use the default precision settings, models 1 and 4.
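When comparing significance across several fits, it helps to put the same small table next to each model. The helper below is a hypothetical sketch (wald_table is my own name) that computes Wald z statistics and two-sided p-values from the generic coef() and vcov() accessors, so it works on any fitted model that supports them, ergm fits included. For an ERGM, the result should be close to what summary() reports, but I am stating that as an expectation rather than a guarantee.

```r
# Hypothetical helper: Wald z-tests for every coefficient of a fitted model.
wald_table <- function(fit, alpha = 0.05) {
  est <- coef(fit)
  se  <- sqrt(diag(vcov(fit)))   # standard errors from the covariance matrix
  z   <- est / se
  p   <- 2 * pnorm(-abs(z))      # two-sided p-value
  data.frame(estimate    = est,
             std_error   = se,
             z           = z,
             p_value     = p,
             significant = p < alpha)
}

# With a list of fitted models (hypothetical name fits), one table per model:
#   lapply(fits, wald_table)
```

A term whose significant flag flips between seeds at the same precision, as cyclicalties does for the fast and default models, is a sign that the precision is too low for that term.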
The 6 models in this study have very similar values for the model selection criteria AIC and BIC. We can trust their findings of statistical significance for asymmetric and mutual, and we can say that our conclusions for these two statistics are robust: they are independent of the choice of random seed or precision. We would be able to economize resources by using the default precision if it were not for the problem we described for cyclicalties in the previous paragraph.
To summarize, a study of the ERGM described by the formula study0$net2 ~ asymmetric + cyclicalties + mutual must be conducted with strict precision, and only then can we confidently conclude that the three terms have statistically significant contributions to network formation.
As an exercise, you can create a seventh ERGM described by the formula study0$net2 ~ asymmetric and fit it with appropriate precision settings; you should be able to reach a confident conclusion as to which of the seven models best describes net2. Also, you can check these results against different random seeds.
As an appendix to study0, the graphs of mcmc.diagnostics() are shown below. We should adopt a habit of checking these figures to make sure that the exploration of the parameter space is not biased, autocorrelated, or clearly incomplete. A rule of thumb is to check that the panels on the left look like white noise around a horizontal smooth line, and that the panels on the right resemble normal distributions. That seems to be the case for the 6 models.
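As a minimal sketch of how these diagnostics can be produced, the code below simulates a stand-in network, fits the three-term model with default precision, and calls mcmc.diagnostics() on the fit. The network is not study0$net2 (it is regenerated here with a seed of my choosing), so the plots will differ from the ones in this appendix.

```r
library(ergm)
library(network)

# Stand-in network drawn from the asymmetric-only distribution.
basis    <- network.initialize(50, directed = TRUE)
net_diag <- simulate(basis ~ asymmetric, coef = -0.85, nsim = 1,
                     seed = 1092953241)

# Default precision; only the seed is fixed for reproducibility.
fit_diag <- ergm(net_diag ~ asymmetric + cyclicalties + mutual,
                 control = control.ergm(seed = 1092953241))

# Trace plots (left) and density plots (right) for each model statistic.
mcmc.diagnostics(fit_diag)
```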
Here is the MCMC diagnostic plot for random seed 1092953241 and Default precision (Model 1):
Here is the MCMC diagnostic plot for random seed 1092953241 and Fast precision (Model 2):
Here is the MCMC diagnostic plot for random seed 1092953241 and Strict precision (Model 3):
Here is the MCMC diagnostic plot for random seed 1025768792 and Default precision (Model 4):
Here is the MCMC diagnostic plot for random seed 1025768792 and Fast precision (Model 5):
Here is the MCMC diagnostic plot for random seed 1025768792 and Strict precision (Model 6):
In another appendix to study0, the goodness-of-fit plots are shown below. We should adopt a habit of checking them for every model. The rule of thumb is to visually check that the black lines fall as close to the blue points as possible.
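A goodness-of-fit check can be sketched in the same way with gof() and its plot method. As before, the network below is a stand-in regenerated with a seed of my choosing rather than study0$net2, and the object names are illustrative.

```r
library(ergm)
library(network)

# Stand-in network drawn from the asymmetric-only distribution.
basis   <- network.initialize(50, directed = TRUE)
net_gof <- simulate(basis ~ asymmetric, coef = -0.85, nsim = 1,
                    seed = 1025768792)

fit_gof <- ergm(net_gof ~ asymmetric + cyclicalties + mutual,
                control = control.ergm(seed = 1025768792))

# Simulate networks from the fitted model and compare their statistics
# with those of the observed network.
gof_result <- gof(fit_gof)
plot(gof_result)
```

Each panel compares one family of statistics of the observed network against the range produced by networks simulated from the fitted model.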
Here is the goodness of fit plot for random seed 1092953241 and Default precision (Model 1):
Here is the goodness of fit plot for random seed 1092953241 and Fast precision (Model 2):
Here is the goodness of fit plot for random seed 1092953241 and Strict precision (Model 3):
Here is the goodness of fit plot for random seed 1025768792 and Default precision (Model 4):
Here is the goodness of fit plot for random seed 1025768792 and Fast precision (Model 5):