Gender Publication Gap–Data & Methods

Data and Methods

Data Sources

The underlying data stem from three bibliographic repositories managed by scientific organizations that serve a public interest. The data are, at least in part, openly accessible.

The SAO/NASA Astrophysics Data System (ADS) is a digital library for research in Astronomy and Astrophysics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant. The data presented on this website were imported in March 2018 via the ADS Developer API.
Zentralblatt MATH (zbMATH) is the most comprehensive abstracting and reviewing service in Pure and Applied Mathematics. zbMATH is produced by FIZ Karlsruhe, a member of the Leibniz Association and as such a non-profit company. The data from zbMATH were updated regularly during the project and now reflect the state as of September 2019.
arXiv provides open access to e-prints in various fields, notably Physics and Mathematics but also other disciplines such as Computer Science or Quantitative Biology. The arXiv is funded by Cornell University, the Simons Foundation and member institutions. This pre-print repository is "an indispensable mode of scientific exchange", in particular in Physics, where it covers the majority of publications in subfields like Astronomy, Astrophysics, and Nuclear and Particle Physics. Its use in other disciplines is continuously increasing. The data from arXiv are updated on a daily basis.

We have further cross-referenced the data from ADS and arXiv with the Crossref database to enrich the information on journals and authors' forenames. In the case of arXiv this was especially useful because the e-prints in arXiv do not include standardized information on the journals in which the articles later appear.

Gender Inference

Bibliographic metadata do not include the authors' gender, so this information needs to be inferred. Usually, an author's name is the only piece of information that can provide an indication of their gender. For the present data we have combined responses from different gender assignment services, maximizing the recall (i.e. the number of names that can be assigned a gender) while keeping the error rate under a certain threshold. Our algorithm is based on a published benchmark in which we compared five dedicated web services and software packages. Roughly speaking, in a first stage we use the assignments from the Python package gender guesser, which had shown high precision but low recall. For all remaining names we rely on the response of the web service Gender API whenever it comes with a high probability score. For names leading to probability values between 75 and 90 in Gender API, we combine the responses from Gender API, gender guesser, and genderize.io. For authors without a first name but whose names appear in the Wikipedia list of Soviet surnames, we apply name-ending rules to infer the gender.
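As a rough illustration, the staged procedure described above could be sketched as follows. This is not the project's actual implementation: the function names, the cutoff of 90 for a "high" Gender API score, and the majority vote used to combine the three services are assumptions made for this example. The sketch takes already-fetched service responses as input (using gender guesser's labels such as "mostly_female" or "andy") and leaves out the surname-ending rules.

```python
from collections import Counter
from typing import Optional

HIGH_PROBABILITY = 90  # assumed cutoff for a "high" Gender API score
LOW_PROBABILITY = 75   # lower bound of the range mentioned in the text
BINARY = ("female", "male")


def normalize(label: Optional[str]) -> Optional[str]:
    """Map a service response to 'female', 'male' or None (no usable vote)."""
    if label is None:
        return None
    label = label.replace("mostly_", "")  # treat weak guesses as votes (assumption)
    return label if label in BINARY else None


def infer_gender(guesser: Optional[str], api_gender: Optional[str],
                 api_probability: float, genderize: Optional[str]) -> str:
    """Combine precomputed responses of three services into one label."""
    # Stage 1: trust gender guesser when it gives a definite answer.
    if guesser in BINARY:
        return guesser
    # Stage 2: otherwise accept Gender API if its probability is high.
    if api_gender in BINARY and api_probability >= HIGH_PROBABILITY:
        return api_gender
    # Stage 3: for probabilities of 75 up to 90, require at least two
    # agreeing services (hypothetical combination rule).
    if api_gender in BINARY and api_probability >= LOW_PROBABILITY:
        candidates = map(normalize, (api_gender, guesser, genderize))
        votes = Counter(v for v in candidates if v)
        label, count = votes.most_common(1)[0]
        if count >= 2:
            return label
    return "unknown"
```

With these assumptions, a name that gender guesser cannot resolve and that Gender API scores at 80 is only assigned a gender if at least one other service agrees with Gender API; otherwise it stays 'unknown'.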

Plenty of issues arise in connection with automated gender recognition (AGR). Names are not always "uniquely" associated with one gender, which leads to a bias towards certain countries. Furthermore, all AGR approaches, whether they build on names or on physiological features such as facial images or voice, only allow for a binary definition of gender, which fundamentally excludes individuals who do not conform to this societal concept. Despite these (and other) critiques, we have performed a name-based gender inference because academia is notoriously not gender-agnostic and because relevant gender differences can be observed and need to be explained. In this article, we have discussed various issues related to AGR and would welcome ideas towards more inclusive concepts, preferably based on self-identification. Those would allow fairer, more sustainable and statistically significant analyses of bibliographic corpora in terms of gender.

Author Profiles

The creation of author profiles from bibliographic records, i.e. the construction of clusters that group all publications of a given researcher, is essential for analyses of scholarly data at the level of individuals. Without this intricate process, also known as "author name disambiguation", research articles cannot be linked to each other on the basis of common authors, and aggregations per scientist, in particular those related to gender, are not feasible.

The mathematics database zbMATH provides author profiles, which we readily use for our analysis, whereas ADS and the arXiv lack this feature. To disambiguate the ADS records we have trained a machine learning (ML) model on manually disambiguated data and combine it with heuristics to fine-tune the resulting author profiles. In the case of the arXiv we have implemented a known initial-based approach, which yields appropriate results. This approach might be replaced by our trained ML model in the future, after suitable evaluations have been performed.
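To give an idea of what an initial-based approach looks like, here is a minimal sketch that groups authorships by surname and first initial. The exact variant used for the arXiv data is not specified here, so the key definition, the assumed 'Surname, Forenames' name format and the function names are illustrative assumptions only.

```python
from collections import defaultdict


def profile_key(name):
    """Build a (surname, first initial) key from 'Surname, Forenames'.

    The input format is an assumption made for this sketch.
    """
    surname, _, forenames = name.partition(",")
    initial = forenames.strip()[:1].lower()
    return (surname.strip().lower(), initial)


def build_profiles(authorships):
    """Group publication ids by profile key.

    `authorships` is an iterable of (publication_id, author_name) pairs.
    """
    profiles = defaultdict(list)
    for publication_id, name in authorships:
        profiles[profile_key(name)].append(publication_id)
    return dict(profiles)


records = [
    ("p1", "Noether, Emmy"),
    ("p2", "Noether, E."),
    ("p3", "Noether, Fritz"),
]
profiles = build_profiles(records)
```

In this toy example, 'Noether, Emmy' and 'Noether, E.' end up in the same profile, while 'Noether, Fritz' gets a separate one. The known weakness of such a scheme is that different people sharing a surname and initial are merged.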

Extraction of Geo-Information

We use the affiliation strings to identify the authors' places of work. Affiliation coverage differs between the data sources: in ADS about 80% of all authorships have an affiliation, while in arXiv the information is rarely available. In zbMATH, the metadata on publications from the last 10 years are the main source of affiliations, but these are recorded per article, not per authorship.

We extract the geo-information (currently mainly the country) via a multi-level algorithmic procedure based on (1) the extraction of locations using the Stanford Named Entity Recognizer (NER), (2) queries to the GeoNames database and (3) parsing of the affiliation strings with CERMINE, a machine-learning-based software for extracting meta-information from academic publications.

Publication, Authorship

Academic publications are authored by one or several individuals (=authors); formally speaking, we consider each pair of publication and author as one instance of authorship. For example, an article authored by three individuals yields three different authorships that we try to assign to three different authors (see the section 'Author Profiles' above for the creation of such profiles).

Authorships can be counted in various ways: typically they are weighted equally, regardless of the total number of authors of the paper and with no distinction by order of appearance. This leads to a counting scheme that does not discriminate between the authorship of a single-author article and those of a large collaboration. Alternatively, one can account for the importance of publishing solo by computing so-called fractional authorships, where each authorship is assigned a weight of 1/n, with n being the total number of authors (in the above example of three individuals jointly writing one article, each of the authorships would be weighted by 1/3). Furthermore, analyses can consider only a specific position in the list of authors as relevant; often it is the first or the last slot that carries a particularly important meaning.
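The counting schemes described above can be sketched in a few lines. The function name and the scheme labels ('equal', 'fractional', 'first', 'last') are chosen for this example and do not necessarily match the options offered in the visualizations.

```python
from collections import defaultdict


def count_authorships(papers, scheme="equal"):
    """Sum authorship weights per author.

    `papers` is a list of author lists, each in order of appearance.
    """
    totals = defaultdict(float)
    for authors in papers:
        n = len(authors)
        for position, author in enumerate(authors):
            if scheme == "equal":
                weight = 1.0                      # every authorship counts fully
            elif scheme == "fractional":
                weight = 1.0 / n                  # weight 1/n for n authors
            elif scheme == "first":
                weight = 1.0 if position == 0 else 0.0
            elif scheme == "last":
                weight = 1.0 if position == n - 1 else 0.0
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            totals[author] += weight
    return dict(totals)


# One article with three authors and one single-author article.
papers = [["A", "B", "C"], ["A"]]
equal = count_authorships(papers, "equal")
fractional = count_authorships(papers, "fractional")
```

Under equal counting, author A has 2 authorships and B has 1; under fractional counting, A has 1 + 1/3 and B has 1/3, so the solo article weighs as much as the whole three-author article.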

The sensible choice of counting scheme for authorships is highly field-specific and depends on the peculiarities of each discipline. In mathematics there are very few large collaborations, most articles being written by a handful of authors. In that situation, statistics on publication patterns remain roughly unchanged whether equal or fractional authorship counts are used. This is not the case in fields like astronomy or high-energy physics, though, where sizeable collaborations abound; in those cases it makes sense to compare with fractional authorships. We therefore provide different counting schemes in our visualizations, which can be selected from the corresponding drop-down menus.

Computation of Proportions

The drop-down menus for selecting gender usually also contain the 'All' option, which lets the associated chart display all data grouped by women, men and unknown. The percentages shown on hover then refer to the totality of all authors, including the authors labeled as unknown. For instance, the proportion of women is computed as 'women / (women + men + unknown)'. If instead you select 'Men' or 'Women' in the drop-down menu, the authors labeled as 'unknown' are excluded from the overall group. For example, the proportion of women is then computed as 'women / (women + men)'. This corresponds to the assumption that women and men are distributed among the 'unknowns' in the same way as among the authors with an assigned gender.
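The two ways of computing the proportion of women can be written out as a small sketch. The function name, the `include_unknown` flag and the counts are made up for illustration.

```python
def proportion_of_women(women, men, unknown, include_unknown):
    """Return the share of women among the selected group of authors."""
    denominator = women + men
    if include_unknown:
        # 'All' option: authors labeled as unknown stay in the total.
        denominator += unknown
    return women / denominator


# Hypothetical counts: 20 women, 60 men, 20 authors of unknown gender.
share_all = proportion_of_women(20, 60, 20, include_unknown=True)
share_known = proportion_of_women(20, 60, 20, include_unknown=False)
```

With these made-up counts the proportion of women is 20% under the 'All' option and 25% once the unknowns are excluded. The latter is the value one would also obtain by distributing the 20 unknown authors in the same 1:3 ratio as the known ones (25 women out of 100 authors).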

The proportions of women and men, respectively, within subfields of a discipline are represented as relative deviations from the mean, obtained by dividing the absolute deviation from the mean by the mean itself. For instance, say the proportion of women across all subfields of Mathematics equals 10%, while in the subfield of Number Theory women account for 8% of all authors. Then the relative deviation from the mean for the subfield Number Theory would be 100*(8-10)/10 = -200/10 = -20%, compared to the absolute deviation of -2 percentage points. The relative deviation from the mean thus helps to understand the differences in the context of the mean and, for instance, to compare the difference between 8% and an average of 10% (a relative deviation of -20%) to the difference between 92% and a group average of 90% (a relative deviation of about +2.2%).
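The same arithmetic as a short sketch; the function name is chosen for this example.

```python
def relative_deviation(value, mean):
    """Relative deviation from the mean, in percent.

    Both arguments are percentages, e.g. 8 and 10.
    """
    return 100 * (value - mean) / mean


number_theory = relative_deviation(8, 10)   # subfield at 8%, mean at 10%
high_share = relative_deviation(92, 90)     # subfield at 92%, mean at 90%
```

Both cases have an absolute deviation of 2 percentage points, but the first is a drop of 20% relative to the mean, while the second is an increase of only about 2.2%.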

Subject Classifications

Publications within a scientific field are typically subclassified into subjects according to various hierarchical and discipline-specific schemes. In our visualizations displaying gender proportions per subfield, we use the MSC2010 for publications indexed in zbMATH and the arXiv categories for publications in arXiv.org. For publications in the astronomy database ADS no suitable classification scheme is available so far, so ADS is not considered in the respective visualizations.

Datasets for Visualizations

Every visualization is based on a dynamically evaluated dataset. You can download the underlying datasets, e.g. to create your own plots offline, by right-clicking on 'Link to raw data' in the navigation on the left-hand side and selecting 'save as' from the context menu.