From early qualitative studies to the advent of big data
Early qualitative studies
On the qualitative side, we recall the seminal work of Thomas Kuhn. In his book The Structure of Scientific Revolutions (Kuhn, 1962), Kuhn claims that science evolves through shifts from old to new “paradigms”. In his vision, the acceptance or rejection of a particular paradigm is not only a logical process, but a social process too. Following a scientific discovery, widespread collaboration is necessary to establish a new framework.
In the 1970s, Nicholas C. Mullins published his paper The Development of a Scientific Specialty: The Phage Group and the Origins of Molecular Biology (Mullins, 1972). Therein he demonstrates that the birth of a new discipline cannot be explained only by means of the competitive position and relative status of each of the specialties from which it is formed: intellectual and social activities need to be taken into account too. Specifically, he investigates in detail the development of molecular biology from the American academic group studying the bacteriophage (a virus which infects bacteria) in the mid-20th century. Mullins shows that the development of this new discipline was possible thanks to the successful growth of the network of communications, co-authorship, colleagueship, and apprenticeship.
The “actor-network theory”
In the 1980s, French scholars Michel Callon and Bruno Latour, and collaborators, developed the so-called “actor-network theory” (ANT) (Latour, 1987), in response to the need for a new social theory adapted to science and technology studies.
Their approach differs from traditional sociology since they claim that there exists no such thing as a ‘social context’ to explain the features that economics, psychology, linguistics, and other sciences cannot account for. Latour defines the ‘social’ as “a trail of associations between heterogeneous elements”, and ‘sociology’ as “the tracing of associations between things that are not themselves social” (Latour, 2005). Therefore, in this context the ‘social’ is what emerges from the associations between different actors, and not a distinct domain of reality defined a priori.
ANT is based on the assumption that sociologists should track not only human actors, but also all the other non-human elements involved in the process of innovation and creation of knowledge in science and technology. To give an example, Callon explains that the history of the American electrical industry is not reducible to its inventors and their relations. To understand it we need to also take into account intellectual property, patent regulation, and the electric technologies themselves, and build a network that traces the associations between all these human and non-human actors (Callon and Ferrary, 2006).
In this context, to study the evolution of science, we should track not only researchers but also the traces they disseminate, especially their publications. The texts and the ideas therein play a central role and, moreover, in the ANT framework, these are put on the same level as human actors.
ANT has received several criticisms, mainly because of the role given to non-humans, which, according to (Winner, 1993), are not capable of intentionality and should therefore not be put at the same level as human actors. However, this methodology is still actively used today and we think that its founding principles are inspirational. Therefore, in this thesis we also consider both humans (researchers) and non-human actors (scientific concepts). Our approach is inspired by ANT, but also presents some differences that we will detail in Section 1.3.
While the “actor-network theory” was being developed by Callon and Latour, quantitative analyses of scientific activity also started to be carried out, giving birth to the field of scientometrics. Pioneers in this field were Eugene Garfield, who created the first scientific citation index (Garfield, 1979), and Derek John de Solla Price, who analyzed the growth of science (Price, 1963), and proposed the first model of growth of networks of citations between scientific papers (Price, 1965). A dedicated academic journal, Scientometrics, was created in 1978.
As described in (Leydesdorff and Milojević, 2012), whereas in the 1980s sociology of science started to increasingly address micro-level analysis focusing on the behavior of scientists in laboratories (Latour and Woolgar, 1979), scientometrics focused on the quantitative analysis of scientific literature at the macro scale, often considering a whole discipline. Since then, the field of science and technology studies has increasingly bifurcated into two streams of research: on the one hand, the qualitative sociology of scientific knowledge, and, on the other hand, quantitative studies of scientometrics and science indicators, which soon involved evaluation and policy issues too.
Network oriented studies
During the first decade of this century, the increasing availability of scientific publication archives and the development of network science led to large-scale studies of co-authorship and citation networks, starting with the seminal work of Newman (Newman, 2001d)1. The framework of network theory allows new kinds of studies, based on the relationships between authors and papers, such as the investigation of the heterogeneity in the number of collaborators, the transitivity of collaborations, and the emergence of community structure, in which authors and papers are clustered in different groups, often corresponding to expertise in different subfields of science (Girvan and Newman, 2002). Moreover, new network visualization techniques make it possible to study science and its different disciplines through maps representing the landscape of scientific knowledge (Börner et al., 2003). These new interdisciplinary exchanges between scientometrics, computer science and physics have led to an impressive growth of scientometric studies, making the discipline a very active area of research (Leydesdorff and Milojević, 2012).
1 The idea of studying co-authorship patterns was first introduced in (Mullins, 1972), but Newman’s work represents the first detailed reconstruction of an actual large-scale collaboration network.
False ideas about big data: “qualitative vs quantitative”
The debate on qualitative versus quantitative research is very active today. The availability of large datasets is in fact not restricted to scientific archives: thanks to the exponential growth of the Internet and related technologies, a huge amount of data is now produced online and partly available, for example interactions between people on social networks such as Twitter, Facebook, etc. This has led to the so-called “Big Data” science. A few years ago Chris Anderson, at that time editor-in-chief of Wired, wrote:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. […] The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years. […] But faced with massive data, this approach to science – hypothesize, model, test – is becoming obsolete. […] We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (Anderson, 2008)
We think that this vision is too extreme and even wrong in certain respects. The size of the data and the available computational power are of course contributing to revolutionize the way research in social sciences is done, but it is misleading to think that using digital traces is a straightforward process that makes all social theories obsolete. The ideas defended by Anderson are based on a series of wrong assumptions, as extensively discussed in (Boyd and Crawford, 2011).
Firstly, it is not because a corpus of data is large that it is representative. Therefore, the documents to study should be chosen with care. They should contain representative information and be diversified so that sampling does not introduce a bias and does not affect the observed dynamics. Moreover, while it is true that digital traces give more direct access to phenomena that were hard or even impossible to observe before the digital era, these traces are rarely directly usable: they must be selected, cleaned and organized so as to make sense. One should also keep in mind that digital traces are generally not produced for social science scholars (they are rather produced for observation, surveillance or information sharing purposes). Thus, they reflect a specific point of view or interest towards a given phenomenon and this point of view can be radically different from that of the social scientists. It is thus important to keep in mind what the data are, and possibly what is missing from them. Scientific archives are particularly good datasets with respect to this point. The two datasets that we analyze in this thesis do not contain all the publications in the concerned discipline, but they should be considered representative since they contain all the publications in several journals and conferences of the respective disciplines. Moreover, they were produced for scholars for research purposes, as stated by their providers2.
A second and more fundamental issue concerns the definition of the objects under study. Social sciences often operate with complex notions (science fields, sociological categories, etc.) which are hard to define formally and do not have precise boundaries (how to define the frontiers of a research area?). Digital traces provide very little help for defining relevant categories and formalizing them so that one can track their evolution in a longitudinal corpus, for example.
2 Concerning the APS dataset: “Over the years, APS has made available to researchers data based on our publications for use in research about networks and the social aspects of science. In order to further facilitate the use of our data sets in this type of research, researchers may now request access to this data by filling out a simple web form.” [http://journals.aps.org/datasets]
Concerning the ACL dataset: “This is the home page of the ACL Anthology Reference Corpus, a corpus of scholarly publications about Computational Linguistics. […] We hope this corpus will be used for benchmarking applications for scholarly and bibliometric data processing.” [http://acl-arc.comp.nus.edu.sg/]
Moreover, automation is not straightforward. By this observation, we mean that even if the data are clean and representative, they must be organized. Numerical methods are not neutral since any computing method involves some choices. What kinds of calculations are applied to the data? Even if the computer can automate some calculations, it does not give any insight into what kind of measure or modeling should be done. This is of course far from neutral: there are different ways to compute the similarity between two concepts, or the influence of the context, for instance.
Relationality and temporality
Another fundamental characteristic of big data, underlined by (Boyd and Crawford, 2011), is the following:
Big Data is notable not because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked. Its value comes from the patterns that can be derived by making connections between pieces of data, about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.
The characteristic relationality of data constitutes a new challenge that can be addressed in the framework of network theory, namely the science providing the mathematical tools to analyze and model complex systems as graphs illustrating relations between discrete objects. Network science allows us to uncover the properties of complex networked systems at different scales: from patterns of centrality and of similarity between objects, to the emergence of aggregates and connections among them.
Moreover, recent studies have focused on the analysis of the temporal aspect of networks (see (Holme and Saramäki, 2012) for a review on the subject). Thanks to the availability of data spanning several decades, we can nowadays empirically investigate the evolution of social systems and scientific activity with the help of new methods, still under development, to model time-evolving objects and relations.
We should underline that the use of network theory is not a straightforward process either. Network analysis provides a suitable framework and useful mathematical and computational tools, but some more fundamental questions in its applications to science studies remain: what objects do we want to study and use as nodes of our network, and how do we extract them from the data? How do we identify and quantify the strength of a relation between two objects? What measures on the resulting network are interesting for our study and would be useful to answer questions on the evolution of scientific fields?
Networks and social science theory
The methodology we follow in this thesis, based on the notion of network, does not rely on traditional social science theories. As very well explained in (Callon and Ferrary, 2006), the network-based approach has a series of advantages.
In particular, this approach lets us avoid making use of sociological categories, and of a strict distinction between micro and macro structures. In our methodology, ‘groups’ (also called ‘communities’ in network terminology) are defined as emerging structures in the network, namely as sets of nodes highly connected among each other, and loosely connected with the rest of the network. This is what we call the ‘mesoscopic’ level, and this is what we will use to model groups and communities, instead of accepting the traditional sociological definitions of these notions, which we think are too subjective.
Towards socio-semantic networks representing scientific research
This thesis constitutes a contribution to scientometrics studies. More precisely, we want to apply network theory to the study of the evolution of scientific fields. Many aspects can be studied, such as citations and their impact (Hirsch, 2005), collaborations (Newman, 2004), geographical distribution of laboratories and publications (Frenken et al., 2009), and research funding (Boyack and Börner, 2003). In this thesis we focus in particular on the social dimension of scientific production, and on the distribution of the resulting knowledge.
Examining collaborations among researchers can capture the social dimension of science. This kind of information can be extracted directly from publications by tracking co-authorship. From this information we can reconstruct networks of collaborations in different disciplines. In the last few decades a number of works reconstructed large-scale co-authorship networks representing scientific communities in different fields, such as physics (Newman, 2001b,c), mathematics (Grossman, 2002), neuroscience (Barabási et al., 2002), biomedical research, and computer science (Newman, 2001d).
These structures reveal interesting features of scientific communities. It has been shown that all fields seem to have a heterogeneous distribution of the number of collaborators per author, with most researchers having only a few collaborators, and a few having hundreds or in some extreme cases even thousands of them. Moreover, any researcher in the network can be easily reached from any other author in a small number of steps (moving from collaborator to collaborator)3. Relations also tend to be transitive: if two researchers both collaborated with a third researcher, chances are that the first two are also co-authors. Lastly, these networks also appear to have a well-defined community structure: researchers actually tend to group together so as to form scientific communities working on the same research topic or methodology (Newman, 2004).
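The three structural properties just mentioned (number of collaborators, short paths between authors, and transitivity) can be illustrated on a toy example. The following is only a minimal sketch in plain Python: the author names and the list of papers are invented for illustration, and the function names are our own, not those of any existing library.

```python
from collections import defaultdict, deque
from itertools import combinations

def coauthorship_graph(papers):
    """Build an undirected graph mapping each author to the set of co-authors."""
    graph = defaultdict(set)
    for authors in papers:
        for a in authors:
            graph[a]  # touch the key so single-author papers still create a node
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

def transitivity(graph):
    """Fraction of pairs of collaborators of a same author that are themselves co-authors."""
    closed = triples = 0
    for neighbours in graph.values():
        for u, v in combinations(sorted(neighbours), 2):
            triples += 1
            if v in graph[u]:
                closed += 1
    return closed / triples if triples else 0.0

def shortest_path_length(graph, source, target):
    """Number of collaboration steps between two authors (breadth-first search)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two authors are not connected

# Hypothetical toy corpus: each paper is given by its list of authors.
papers = [["Ada", "Bo", "Cy"], ["Cy", "Di"], ["Di", "Ed"]]
graph = coauthorship_graph(papers)
degrees = {author: len(coauthors) for author, coauthors in graph.items()}

print(degrees)
print(transitivity(graph))
print(shortest_path_length(graph, "Ada", "Ed"))
```

On real corpora the same quantities are usually computed with a dedicated network library; the sketch only makes the definitions concrete.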
Scientific collaboration networks have also been explored from a temporal perspective. (Newman, 2001a) shows that the probability that a researcher has new collaborators increases with the number of her/his past collaborators, and that the likelihood that two researchers initiate a new collaboration increases with the number of collaborators they share. (Barabási et al., 2002) then proposed a model for the evolution of co-authorship networks based on preferential attachment, i.e. on the idea that the more collaborators a researcher already has, the higher the probability that she/he will collaborate with even more scholars in the future. Since then, other works have explored the role of preferential attachment in the time-evolution of other empirical co-authorship networks using, for example, the Web of Science database (Wagner and Leydesdorff, 2005; Tomassini and Luthi, 2007).
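The preferential attachment idea can be made concrete with a short simulation. The sketch below is a deliberately simplified growth process, not the model of (Barabási et al., 2002) itself: we assume that each newcomer creates exactly one link, to an existing researcher chosen with probability proportional to her/his current number of collaborators. The function name `grow_network` and its parameters are hypothetical.

```python
import random

def grow_network(n_researchers, seed=0):
    """Grow a collaboration network one researcher at a time.

    Simplifying assumption: every newcomer links to a single existing
    researcher, picked with probability proportional to that researcher's
    current degree (number of collaborators).
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}   # start from two researchers who collaborated once
    endpoints = [0, 1]      # each researcher appears once per link she/he has
    for newcomer in range(2, n_researchers):
        # Uniform sampling from `endpoints` is degree-proportional sampling.
        partner = rng.choice(endpoints)
        degree[newcomer] = 1
        degree[partner] += 1
        endpoints.extend([newcomer, partner])
    return degree

degree = grow_network(1000)
print(max(degree.values()), sorted(degree.values())[len(degree) // 2])
```

Comparing the maximum and the median degree printed at the end gives a rough sense of how unevenly collaborators are distributed under this mechanism.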
(Guimerà et al., 2005b) investigate instead the mechanisms that lead to the formation of teams of creative agents, and how the structure of collaboration networks is determined by these mechanisms. Team organization and functioning have also been widely investigated by (Monge and Contractor, 2003). (Lazega et al., 2008) explore the interdependencies between collaboration networks and inter-organizational networks connecting the scientific laboratories in which researchers work. Other works have explored topological transitions in the structure of co-authorship networks as the corresponding scientific field develops (Bettencourt et al., 2009), or the emergence of disciplines from splitting and merging of social communities (Sun et al., 2013).
In this thesis we propose a model of growth of co-authorship networks which is based not only on preferential attachment mechanisms and social features, such as the number of common collaborators, but also on researcher similarity, as expressed through knowledge production and investigation.
We elaborate on the idea of Callon and Latour that the process of creation of knowledge can be understood only by tracking human and non-human actor traces, and analyze not only collaboration structures but also the semantic content of scientific publications. We do not directly consider the papers as nodes of our networks: instead we base our analysis on the relations between researchers and concepts extracted from the text and/or the metadata. Therefore the networks we build are composed of both humans (researchers) and non-human actors (scientific concepts), but we make a distinction between the two, and call ‘social’ the associations between researchers (that we trace through co-authorship), and ‘semantic’ the associations between concepts (that we trace through co-occurrences in texts). We thus acknowledge that equal importance should be given to the social and the semantic dimensions, but still assume that the two types of links are not equivalent, as they may support different processes.
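To fix ideas, the two kinds of links can be derived from the same publication records. The following sketch is only illustrative: the record layout (a dictionary with `authors` and `concepts` fields), the sample data and the helper `pair_counts` are assumptions made for this example, and the concepts are supposed to have already been extracted from the texts.

```python
from collections import Counter
from itertools import combinations

def pair_counts(records, field):
    """Count, for every pair of items, the number of papers in which both appear."""
    counts = Counter()
    for record in records:
        items = sorted(set(record[field]))  # sorted so each pair has one canonical form
        counts.update(combinations(items, 2))
    return counts

# Hypothetical records: authors and already-extracted concepts of three papers.
papers = [
    {"authors": ["Ada", "Bo"], "concepts": ["parsing", "hidden markov model"]},
    {"authors": ["Ada", "Bo", "Cy"], "concepts": ["parsing", "machine translation"]},
    {"authors": ["Cy"], "concepts": ["machine translation", "hidden markov model"]},
]

social_links = pair_counts(papers, "authors")     # co-authorship between researchers
semantic_links = pair_counts(papers, "concepts")  # co-occurrence between concepts

print(social_links)
print(semantic_links)
```

Keeping the two counters separate reflects the distinction made above: social and semantic links are built from the same papers but are not treated as equivalent.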
As already said, the analysis of texts is not straightforward. Firstly, we need to define what kind of information we want to extract from them, then we need to find or develop the methods to do it, which, for large datasets, should be automated tools. Finally, we need to understand how we can connect together the different pieces of information. These issues are at the core of contemporary sociology research, as explained by (Venturini et al., 2012):
Quantitative data can have many different forms (from a video recording to the very memory of the researcher), but they are often stored in a textual format (i.e. interviews transcriptions, field notes or archive documents…). The question therefore becomes: how can texts be explored quali-quantitatively? Or, more pragmatically, how can texts be turned into networks?
In this thesis, we try to extract, from the titles and abstracts of the papers, the terms that correspond to scientific concepts, in order to reconstruct the landscape of knowledge distribution of scientific fields (Callon et al., 1986). Moreover, we introduce an original method to automatically classify these terms in order to extract the ones corresponding to techniques, so that we can study more fine-grained facts about the evolution of scientific fields. On a methodological level, we therefore combine a network theory approach with computational linguistics methods that make it possible to automatically extract information directly from the publication content4.
Relevant maps of scientific domains can be built using network theory: in this context, nodes of the network correspond to terms extracted from texts and two nodes are connected if the corresponding terms co-occur in different papers or abstracts (Eck, 2011). For example, (Cambrosio et al., 2006) use inter-citation and co-word analysis to map clinical cancer research.
The study of the time-evolution of the structure of different scientific fields through co-word network representations is a prolific area of research, and different disciplines have been analyzed, such as chemistry (Boyack et al., 2009), physics (Herrera et al., 2010; Pan et al., 2012), and biology (Chavalarias and Cointet, 2013). In this thesis we mainly focus on the evolution of the field of computational linguistics, and, to some extent, also on the evolution of physics research.
The originality of this thesis lies in the fact that we consider the social and the semantic dimension of science at the same time, and we try to uncover how these two dimensions co-evolve over time. We rely on the work of (Roth, 2005), which was among the first to consider interactions in knowledge communities in both their social and semantic dimensions. Roth analyzes the community of biologists studying the zebrafish, and shows that collaborations are driven both by social distance and by semantic proximity between researchers. However, his approach focuses on only one variable at a time, and ignores the simultaneous effect of parameters with respect to each other.
The original contribution of this thesis, with respect to the work of Roth, is to build a more holistic statistical model that takes into account all features at the same time. Moreover, we explore the evolution of collaboration networks, but also the evolution of co-word networks representing scientific knowledge, and build a comprehensive model based on social and semantic features.
Our work is also largely inspired by the previous study of (Anderson et al., 2012). The field of computational linguistics was the subject of several scientometric studies in 2012, for the 50 years of the Association for Computational Linguistics (ACL). More specifically, a workshop called “Rediscovering 50 Years of Discoveries” was organized to examine 50 years of research in natural language processing (NLP). This workshop was also an opportunity to study a large scientific collection with recent NLP techniques and see how these techniques can be applied to study the dynamics of a scientific domain. The paper “Towards a computational History of the ACL: 1980-2008”, published in the proceedings of this workshop, is very relevant from this point of view. The authors propose a methodology for describing the evolution of the main sub-domains of research within NLP since the 1980s. They demonstrate, for instance, the influence of the American evaluation campaigns on the domain: when a US agency sponsors a sub-domain of NLP, one can observe a rapid concentration effect since a large number of research groups suddenly concentrate their efforts on the topic; when no evaluation campaign is organized, research is much more widespread across the different sub-domains of NLP.
Similarly, we propose to study the evolution of the field of computational linguistics, but we also make a technical contribution to the field itself, as we introduce a new method to automatically categorize keywords according to the information they carry. Among all the terms relevant in the domain, we are especially interested in terms referring to methods and techniques since these terms make it possible to trace the technical evolution of the field.
Table of contents:
I Methodological foundations
1 State of the art
2 Methods and data
II Modeling the socio-semantic space of scientific research
3 Modeling the textual content of scientific publications
4 Modeling scientific research as a socio-semantic network
5 Modeling the time evolution of scientific research
III Investigating the socio-semantic dynamics of scientific research at different scales
6 Investigating socio-semantic dynamics at the micro-level
7 Investigating semantic dynamics at the meso-level
8 Investigating the micro-meso bridge
A ACL Anthology term list
B ACL Anthology semantic clusters