An important step in theoretically guided big data analysis is to turn online, electronically stored data into the variables needed to develop theories. Unstructured data is not naturally fit for theoretical deduction or testing. Methods like dynamic network modeling require data that has been refined, effectively organized, and well structured. To have an effective dialogue between big data and theory, we need a process that structuralizes data through algorithms. For example, when analyzing the stratification of Chinese society, we need to understand every individual’s socioeconomic status (SES). Established variables from previous sociological research include the individual’s education, income, wealth, occupation, and reputation. Going a step further, family background becomes important, so the careers and education of parents could be included. Only then can we derive a composite SES measure. None of these variables, however, can be found in online, electronically stored data. Therefore, we may use survey data on SES as ground truth and look for possible behavioral correlates in the big data, such as residence location or product purchases. We can then trace a person’s movement trajectory to find their location at night (a possible residence), their location in the day (a possible office), the property prices at those locations, their web-surfing history (profiling their purchasing style), and their online purchasing history (profiling their income). This information can be found in electronic data, and we can thus derive an algorithm to predict this person’s SES. Based on the ground truth obtained from surveys, we look for an algorithm that turns big data into structured data, which can then be used for theoretical testing and deduction. Currently, the amount of survey data is very limited. But once we have an algorithm that can turn big data into theoretically effective measures or, a step further, deduce a behavioral pattern through theory, the algorithm and the pattern can be applied to all netizens, and the amount of usable data becomes enormous.
To summarize, theoretically guided big data analysis draws on a huge body of data sources, but the amount of data can be large or small after the process of refinement and effective organization that makes the data a good fit for theory development. For example, when we are able to match the SES scores of 10,000 surveyed persons to their online behaviors and obtain an algorithm with satisfactory prediction accuracy, we can make inferences about all netizens who are similar to these 10,000 individuals. If the 10,000 people are not randomly sampled but belong to a certain social category, inferences can only be made about that social category. Millions of online data points could then be turned into structured data that is a good fit for testing and developing theory.
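As a concrete illustration, the following sketch shows, under stated assumptions, what the survey-anchored prediction step could look like in practice: a model is trained on the matched survey sample, its accuracy is validated against the ground truth, and only then is it applied to comparable netizens. The file names, feature names, and the choice of a gradient boosting model are illustrative assumptions, not the procedure actually used in this paper.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Matched survey sample: ~10,000 respondents with a surveyed SES score
# and behavioral features derived from their electronic footprints.
surveyed = pd.read_csv("matched_survey_sample.csv")   # hypothetical file
features = ["night_location_price", "day_location_price",
            "browsing_style_score", "online_spending"]  # illustrative features

# Validate prediction accuracy on the survey ground truth first.
model = GradientBoostingRegressor(random_state=0)
r2 = cross_val_score(model, surveyed[features], surveyed["ses_score"],
                     cv=5, scoring="r2").mean()
print(f"cross-validated R^2 on the surveyed sample: {r2:.2f}")

# Only if accuracy is satisfactory, apply the model to netizens who resemble
# the surveyed sample, turning raw footprints into a structured variable.
model.fit(surveyed[features], surveyed["ses_score"])
netizens = pd.read_csv("netizen_footprints.csv")      # hypothetical file
netizens["predicted_ses"] = model.predict(netizens[features])
```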
The following sections of the paper take the structure of venture capital firms’ co-investment network (or syndication network) as an example of the triadic dialogue. We later explain the content of this case study in detail. Looking first only at data collection: against the current backdrop of the high-level digitalization of the economic and financial system, it is actually very easy for researchers to obtain the investment data of venture capital firms. Abundant and detailed information about investment behaviors can be found in the financial reports of listed companies, in economic news, and in material publicized by venture capital firms. However, it is far less easy to construct a syndication network from such information. Even after collecting big data with web-crawling technologies, the data usually remains a collection of sparse investment events. This is the circumstance that the venture capital data in this paper confront (see footnote 5). Matching these events and forming a syndication network is in itself a time-consuming job. In particular, because investment events commonly lack some information, the social science researcher must treat missing values to ensure the validity of theoretical development.
The examples presented above clearly show that big data does not directly contain the variables needed for our research (the syndication network and the corresponding VC firms’ indexes) and must thus be structuralized with a number of methods. This process is far more complicated than typical structured-data cleaning. Take data matching, a seemingly simple step. In the original data on venture capital firms, a firm’s name is recorded sometimes in full and sometimes in abbreviation across different electronic sources, and these records must be recognized and matched. However, there is no fixed pattern for such matching; for example, not every firm abbreviates its name as the first two letters of its full name. As a result, even a step as simple as data matching has to make use of multiple techniques, such as natural language processing and word parsing. In other words, unlike the case of finding individuals’ SES, structuralizing these data is not only a job of algorithm design but also requires many steps of labor-intensive work. For this reason, this study adopts the Zero2IPO Research database as its base, from which we collect additional online information to fill in missing values and transform the investment event data into the venture capital firms’ syndication network.
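To make the difficulty concrete, the sketch below illustrates one simple approach to the name-matching step, combining normalization with fuzzy string similarity from Python’s standard library. The suffix list, the prefix heuristic, and the similarity threshold are illustrative assumptions; matching real firm names, especially Chinese ones, would additionally require word parsing and the labor-intensive manual review noted above.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip common corporate suffixes and punctuation before comparison."""
    for suffix in ("Co., Ltd.", "Ltd.", "Inc.", "Capital", "Partners"):
        name = name.replace(suffix, "")
    return "".join(ch for ch in name.lower() if ch.isalnum())

def same_firm(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Heuristic match: treat two records as one firm if one normalized
    name is a prefix of the other (a common, but as noted not universal,
    abbreviation pattern) or if the names are highly similar."""
    a, b = normalize(name_a), normalize(name_b)
    if a.startswith(b) or b.startswith(a):
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(same_firm("Sequoia Capital China", "Sequoia China"))  # True
```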
In the process of structuralizing the data, we first need to identify the missing values in Zero2IPO. For example, for an investment event that lacks information about the time of the investment, researchers need to first conduct an online search for other key information about that event, such as the amount of the investment, the currency used, the place of the transaction, or information about the recipient, so as to match the corresponding investment events and fill in the missing time data. An algorithm must then be designed to match all investment events so as to identify the co-investment events and syndication ties. Structuralized big data thus provides us with key theoretical variables that can be used in theory testing and causal inference and helps us use dynamic network methods to build predictive models. The data used in the following paragraphs are essentially this kind of structured data, integrated from massive electronically stored data.
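A minimal sketch of this matching step, assuming the investment events have already been de-duplicated and their missing fields filled as described: two venture capital firms receive a syndication tie whenever they invest in the same portfolio company in the same financing round. The column names and the round-level definition of co-investment are illustrative assumptions rather than the paper’s exact coding rules.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Matched, de-duplicated investment events (hypothetical file and columns).
events = pd.read_csv("matched_investment_events.csv")

G = nx.Graph()
# A syndication tie links two VC firms that invest in the same portfolio
# company in the same round; ties are weighted by co-investment counts.
for _, round_events in events.groupby(["portfolio_company", "round"]):
    for a, b in combinations(sorted(round_events["vc_firm"].unique()), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(f"{G.number_of_nodes()} firms, {G.number_of_edges()} syndication ties")
```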
To illustrate the abovementioned triadic dialogue, this paper uses dynamic network modeling as an exemplar of predictive models. The reason is twofold. On the one hand, social science theories have guided the process of data mining and provided more directions for big data analysis. On the other hand, big data can not only be used to test theory and shed light on theory development but has also extended the directions of theory construction, especially those of dynamic complex system theory.
The evolution of a dynamic complex social system ought to be a co-evolution of individual behaviors and the overall social network structure (Padgett and Powell 2012). The previous difficulty of repeatedly collecting wide-range, long-term data can be remedied by the unstructured data of electronic footprints. Traditional ego-centered network surveys, despite their ability to draw wide-range random samples, capture only individuals’ local network conditions, which is not enough to determine the whole network structure across that range. Whole-network survey data can be used to analyze the entire structure of a network within a certain range, but with previous research methods this range is compressed to a very small network. It has been extremely difficult to collect whole-network information for even several hundred people, let alone for a huge social system containing millions of people (Wasserman and Faust 1994). Collecting data on network dynamics is even more difficult, since people become cautious when asked repeatedly about their personal interactions. It is very difficult to obtain comparative static information for even three to five time points (Burt and Burzynska 2017), let alone dynamic network information.
The emergence of social networking websites and apps such as Facebook, Twitter, QQ, and WeChat has completely changed the picture. For more than a decade, the personal networks of billions of people have been recorded. Data on the evolution of network structure can now be obtained over tens or even hundreds of time points simply by refining and organizing monthly or quarterly electronic footprint data. To this extent, the emergence of big data has turned constructing network dynamics theory and testing its hypotheses from nearly impossible into achievable tasks.
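As a rough sketch of the refining and organizing involved, the code below shows one way a log of timestamped interactions could be turned into monthly whole-network snapshots of the kind dynamic network models require. The log format and column names are assumptions made for the example.

```python
import networkx as nx
import pandas as pd

# Timestamped interaction records (hypothetical file and columns).
logs = pd.read_csv("interaction_logs.csv", parse_dates=["timestamp"])

# One whole-network snapshot per calendar month yields the tens or
# hundreds of observation points that repeated surveys could not provide.
snapshots = {
    str(month): nx.from_pandas_edgelist(rows, source="source", target="target")
    for month, rows in logs.groupby(logs["timestamp"].dt.to_period("M"))
}

for month, G in sorted(snapshots.items()):
    print(month, G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```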
Why is it so important to construct network dynamics theory and build predictive models on it? In both the natural and social sciences, complexity theory was developed to correct the reductionist tendency of previous theories (Prigogine 1955). The best-known discussion in the social sciences is Granovetter’s critique of “under-socialization” and “over-socialization” (Granovetter 1985). The former refers to equating collective behavior with the linear summation of individual behaviors, i.e., the collective depends on the individual. The latter refers to subjecting individual behavior entirely to the shaping power of the collective, i.e., the individual depends on the collective. In fact, both presume that individuals are atomized. Such reductionist oversimplification ignores the reality that a collectivity is not the simple sum of individuals; rather, collective behaviors are produced by individuals binding together and forming large-scale complex social networks. It is the aggregate effect of individual behavior and social network structure that produces collective actions (Granovetter 2017).
Coleman (1990) expresses similar arguments. As shown in Fig. 2, the reductionist view explains collective outcomes with collective elements (Process 4) and individual outcomes with individual elements (Process 2). Process 1, which explains individual outcomes with collective elements, is over-socializing, while Process 3, which explains collective outcomes with individual elements, is under-socializing. Coleman points out that such explanations overlook interpersonal interactions, relations, social networks, and the structure of networks. From the social-network point of view, Process 1 comprises four types of research (Luo et al. 2008), in which collective powers are conceptualized as field forces, including informational and normative field forces (DiMaggio and Powell 1982). The first line of research argues that collective powers affect individual relations and the formation of personal networks. The second maintains that these relations and egocentric networks influence individual behavioral outcomes, whether through the interactive effects among friends or through the social capital that the ego gains from the network (Lin 2001). The third holds that field forces drive changes in the broader networks surrounding the individual and thereby change the individual’s structural position in the network. The fourth argues that individual structural positions, such as structural holes (Burt 1992) or centrality in a closed network (Brass and Burkhardt 1993), also influence individual behavioral outcomes.
Process 3 also includes three different lines of research (Luo et al. 2008). The first addresses the changes in network structure induced by individuals’ cutting or building relations (Powell et al. 2005), i.e., a core topic of network dynamics. The second explores how collective actions emerge from the co-evolution of network structure and human actions (Padgett and Powell 2012). The third argues that long-term, continuous, large-scale, and influential collective actions will eventually form new field forces and become the powers that shape individual relationships and structural positions in Process 1 (DiMaggio and Powell 1982).
To summarize, Process 3 is clearly where big data can make the greatest contribution. Research on network dynamics and on the emergence of collective action from the co-evolution of structure and action has been filled with theoretical speculation, while information to test the theory has been scarce. It has thus been difficult to develop and refine the theory in greater depth and detail. The use of big data completes the research loop in Fig. 2, enabling the analysis of nonlinear developments such as the dynamic changes of large-scale complex networks; the emergence of crucial collective actions such as important innovations, social movements, and revolutionary outbreaks; extraordinary evolutions of complex social systems such as financial crises, sudden shifts in business cycles, and social transformations; and the transition of economic, social, and political institutions. Simply put, as argued above, social science theories can guide the development of big data analysis, discover new topics, collect ground truth, supervise the results of data mining, and support broader inferences. Big data, in turn, extends the room for theoretical development, making it possible to explain and test topics that were previously intractable and producing a new frontier for theory building.
Why and when does the business cycle of an industry take a sharp turn? When does the turning point come? Is there any pattern or indicator? These are questions that complexity theory strives to answer. Before answering them, however, we need to ask not only how the behaviors of industrial actors change but also what the industrial network structure looks like and how it evolves. We take the venture capital (VC) industry as an example and consider the structural analysis of the VC syndication network.