Tuesday, August 25, 2015

Statistical analysis for cross-sectional network data (ERGM) - Part I.

What is "statistical analysis on social networks"?
Anyone in the social sciences or economics, or anyone doing empirical studies, is familiar with linear and logistic regression. For instance, logistic regression models are particularly important when studying consumer choice behaviour (see, e.g., Modeling Household Purchase Behavior). However, agents' choice decisions are in many circumstances interdependent, so the independence assumption of logistic (and linear) regression models is violated. Classical examples of this interdependence arise in social networks, where agents' relational variables are not independent. Reciprocity (e.g. the tendency for students to form mutual friendships) and transitivity (e.g. the tendency to become a friend of a friend) are examples of dependency structures that might exist in relational data.

Which class of statistical models can I use for social network data?
Exponential random graph models (ERGMs) are a class of models that were developed to account for the dependency structures observed in social networks. They have been applied in organisational studies, political science, educational settings, etc.

What are ERGMs?
A full understanding of exponential random graph models requires some mathematical background, in particular concepts such as Markov random fields, Gibbs distributions, joint distributions, stationary distributions, ergodicity and mixing time.
It is not my intention to write a formal derivation of ERGMs in my first post, so I will give an intuitive idea here and leave the formalism for a future post.
Intuitive idea:
Let us imagine an infinite set of schools, and that at time zero a homogeneous population of young students is randomly partitioned among the schools, each school having exactly n students at the end of the partition.

Let us assume that at time t greater than zero, students start to add and delete friendship relations with other students in the same school according to certain "social mechanisms", which are represented as local dependencies between the relational variables. We call this the linking process, and we assume that it is the same across schools.
Figure 2.
The linking process is defined by social mechanisms, for example the tendency to become friends with other students, the tendency to reciprocate a friendship, and the tendency to become a friend of a friend. These are represented as local dependencies between relational variables in a subnetwork isomorphic to a link, a reciprocated pair of links, and a transitive triangle, respectively:
A. link, B. reciprocated pair of links, and C. transitive triangle.
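The three configurations above correspond to network statistics that can be counted directly from an adjacency matrix. The following is a minimal sketch, assuming a directed friendship network stored as a 0/1 NumPy array with a zero diagonal; the function name and the toy network are hypothetical, and transitive triangles are counted as ordered triples (i to j, j to k, i to k), which is one common convention.

```python
import numpy as np

def network_statistics(adj):
    """Count links, reciprocated pairs and transitive triples in a directed network.

    adj[i, j] == 1 means student i names student j as a friend.
    The diagonal is assumed to be zero (no self-nominations).
    """
    adj = np.asarray(adj)
    # A. number of links: every 1 in the matrix is one directed tie
    links = int(adj.sum())
    # B. reciprocated pairs: i -> j and j -> i both present (each pair counted once)
    reciprocated_pairs = int((adj * adj.T).sum() // 2)
    # C. transitive triples: i -> j, j -> k and i -> k all present
    transitive_triples = int(((adj @ adj) * adj).sum())
    return links, reciprocated_pairs, transitive_triples

# Toy network with three students: 0 <-> 1, 0 -> 2, 1 -> 2
toy = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
])
print(network_statistics(toy))  # (4, 1, 2)
```

In the toy network there are four ties, one mutual friendship (0 and 1), and two transitive triples (0, 1, 2) and (1, 0, 2).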

Now, if two research teams collect friendship relations between students in two different schools at the same time t1 (each team observes one network), then it is quite likely that the observed networks are different. It is also likely that if the same research teams collect the friendship relations in the same schools at a different point in time (t2), the observed networks will differ from the first ones. These differences are due to the randomness of the linking process.
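To make the thought experiment concrete, here is a minimal sketch of one possible linking process, assuming only two mechanisms (links and reciprocity) and a Gibbs-type update in which one randomly chosen tie is reconsidered at each step. The function names, parameter values and school size are hypothetical choices for illustration, not taken from the manuscript.

```python
import numpy as np

def gibbs_step(adj, theta_link, theta_recip, rng):
    """Reconsider one randomly chosen directed tie i -> j, given the rest of the network."""
    n = adj.shape[0]
    i, j = rng.choice(n, size=2, replace=False)
    # Adding i -> j creates one link, and one reciprocated pair if j -> i exists,
    # so the conditional log-odds of the tie are theta_link + theta_recip * adj[j, i].
    log_odds = theta_link + theta_recip * adj[j, i]
    p_tie = 1.0 / (1.0 + np.exp(-log_odds))
    adj[i, j] = int(rng.random() < p_tie)

def simulate_school(n, theta_link, theta_recip, steps, seed):
    """Run the linking process in one school, starting from an empty network."""
    rng = np.random.default_rng(seed)
    adj = np.zeros((n, n), dtype=int)
    for _ in range(steps):
        gibbs_step(adj, theta_link, theta_recip, rng)
    return adj

# Two schools, same mechanisms and parameters, different random histories.
school_a = simulate_school(n=20, theta_link=-1.5, theta_recip=2.0, steps=20000, seed=1)
school_b = simulate_school(n=20, theta_link=-1.5, theta_recip=2.0, steps=20000, seed=2)
print("ties that differ between the two schools:", int((school_a != school_b).sum()))
print("densities:", school_a.sum() / (20 * 19), school_b.sum() / (20 * 19))
```

The two simulated schools share the same linking process, yet their networks differ tie by tie, while summary quantities such as the density are similar.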
Under general conditions, if we collect our observations at a time larger than a certain time t, we may assume that the observed network is drawn from a stationary distribution, see Equation 1. Furthermore, if we observe the same school over time, the average fraction of time the friendship network spends in a given state is also described by Equation 1.
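Equation 1 itself is not reproduced in this text. As a hedged reminder, ERGMs are usually written in the standard form below, which is presumably what Equation 1 refers to. The notation is assumed here rather than taken from the post: x is a network, s(x) is the vector of network statistics (for example the numbers of links, reciprocated pairs and transitive triangles), \theta is the parameter vector, and \kappa(\theta) is the normalising constant summing over all possible networks on n students.

```latex
P_\theta(X = x) = \frac{\exp\{\theta^\top s(x)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{x' \in \mathcal{X}} \exp\{\theta^\top s(x')\}
```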
What does this mean?
Regardless of whether you collect observations of one school at different points in time or collect several networks at large t, you will end up with the same distribution. In particular, observing one network over time or several networks at one time provides the same information about the model. Here, we assume that data are collected at one point in time (cross-sectional data).



In empirical studies, researchers postulate the social mechanisms underlying the linking process (the network statistics are known), but the strength and direction of these mechanisms are unknown (the parameters are unknown). Postulates come from theory or previous empirical observations, and parameter estimation is often performed using MCMC-MLE (Markov chain Monte Carlo maximum likelihood estimation).

What are the limitations of ERGMs? 
If you are familiar with ERGMs or statistical network analysis, you might have observed that analyses are quite often performed on a single network. Unfortunately, in a working paper (see below), I showed that ERGM parameters are not constant in network size. This is a consequence of the fact that the probability of relating to someone is inversely proportional to the number of agents in the network (n), which forces the parameter for the number of links to converge to minus infinity as n tends to infinity; and if reciprocity is constant in n, then the parameter for the number of reciprocated pairs of links converges to infinity.
These observations make it almost impossible to compare estimated parameters across studies, or even to do statistical analysis on multiple networks.
For instance, fitting an ERGM whose only network statistic is the number of links to four datasets shows that the estimated parameters are a decreasing linear function of log n, see Figure 1. This is not a coincidence, since this linearity implies that the expected number of links is a linear function of n.
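The link between the two statements can be checked numerically. The sketch below assumes a links-only ERGM on a directed network, for which the maximum likelihood estimate is the log-odds of the network density, and a hypothetical scenario in which every student has on average 3 friends regardless of school size, so that the expected number of links (3n) is linear in n. The network sizes and the mean degree are illustrative values, not the datasets of Figure 1.

```python
import numpy as np

def link_parameter(n, mean_degree):
    """Links-only ERGM: the MLE is the log-odds of the density.

    With a fixed mean out-degree, the density is mean_degree / (n - 1),
    i.e. inversely proportional to the number of potential friends.
    """
    density = mean_degree / (n - 1)
    return np.log(density / (1.0 - density))

sizes = np.array([20, 50, 100, 200, 500, 1000])
thetas = np.array([link_parameter(n, mean_degree=3.0) for n in sizes])

# Fit theta as a linear function of log n.
slope, intercept = np.polyfit(np.log(sizes), thetas, 1)
for n, theta in zip(sizes, thetas):
    print(f"n = {n:5d}   theta = {theta:.3f}")
print(f"fitted slope on log n: {slope:.3f}")
```

Under these assumptions the fitted slope is close to -1: the link parameter decreases roughly like minus log n, which is why it drifts to minus infinity as the network grows.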
If we add the number of reciprocated pairs of links, the number of two-paths and the number of transitive triangles to the network statistics, we obtain the following results. The estimated parameters for links are well approximated by a linear function of log n with negative slope, and the estimated parameters for reciprocated pairs of links and transitive triangles can be approximated by a linear function of log n with positive slope, see Figure 2.
These observations have a theoretical foundation: if reciprocity is constant in n, then the functional form of the estimated parameters for links in n is equal to minus the functional form of the estimated parameters for reciprocated pairs of links in n.
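One way to see this relation is sketched here under two assumptions that are mine rather than the manuscript's: the model contains only the link and reciprocity statistics (so that pairs of students are independent of each other), and "reciprocity is constant" is read as the ratio r of reciprocated to one-directional pairs not depending on n. The symbols \theta_L, \theta_R and r are introduced for illustration only.

```latex
\begin{aligned}
P(i \to j,\ j \not\to i) &= \frac{e^{\theta_L}}{Z}, \qquad
P(i \leftrightarrow j) = \frac{e^{2\theta_L + \theta_R}}{Z}, \qquad
Z = 1 + 2e^{\theta_L} + e^{2\theta_L + \theta_R}, \\
\frac{P(i \leftrightarrow j)}{P(i \to j,\ j \not\to i)} &= e^{\theta_L + \theta_R} = r
\quad\Longrightarrow\quad
\theta_R(n) = \log r - \theta_L(n).
\end{aligned}
```

Up to the constant log r, the reciprocity parameter mirrors the link parameter with the opposite sign, so if the link parameter decreases like minus log n, the reciprocity parameter increases like log n.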

Figure 1. A. 84 networks, B. 75 networks, C. 36 networks and D. 19 networks.

Figure 2.





Link to the manuscript