Handling OUCRU publications

Packages needed:

library(readxl)
library(dplyr)
library(rcrossref)

Reading the year 2022 from OUCRU excel file as an example:

pubs22 <- read_excel("2022 OUCRU Publications (Mar20).xlsx")

pubs22 contains the data from the first sheet of the excel file (here showing only the first 16 columns of the first 10 publications as an example):

pubs22

# A tibble: 134 × 34
     No. Author         Title Journal Ref   `Published date`    PMID PMCID DOI  
   <dbl> <chr>          <chr> <chr>   <chr> <chr>              <dbl> <chr> <chr>
 1     1 Nguyen Van Vi… Immu… Am J T… 2022… 44568             3.50e7 <NA>  10.4…
 2     2 Mark S Butler… Anal… Antimi… 2022… 44571             3.50e7 <NA>  10.1…
 3     3 Nothando Ngwe… A ne… Wellco… 2022… 44602            NA      <NA>  10.1…
 4     4 Duc Minh Tran… Digi… J Med … 2022… 44601             3.51e7 <NA>  10.2…
 5     5 Ton That Than… SARS… J Infe… 2022… 44574             3.50e7 PMC8… 10.1…
 6     6 RECOVERY Coll… Casi… Lancet. 2022… 44604             3.52e7 PMC8… 10.1…
 7     7 Antimicrobial… Glob… Lancet. 2022… 44604             3.51e7 <NA>  10.1…
 8     8 Pisani L, Alg… Geoe… Lancet… 2022… 44593             3.49e7 PMC8… 10.1…
 9     9 Damien K. Min… Appl… PLOS D… <NA>  44579            NA      <NA>  10.1…
10    10 Shah JA, Warr… REL … J Immu… 2022… 44617             3.52e7 <NA>  10.4…
# ℹ 124 more rows
# ℹ 25 more variables: `Lead or collaborator` <chr>, Topic <chr>, Group <chr>,
#   `Web link` <chr>, `Qualitative Research findings` <chr>,
#   `Primary research findings` <chr>, Editorial <chr>, Review <chr>,
#   `Book chapter` <lgl>, `Cost-effectiveness Publication` <chr>,
#   `Describing, or modelling, disease burden` <chr>,
#   `Defining or describing disease control strategies` <chr>, …

From pubs22 we want to extract the DOIs:

dois <- pull(pubs22, "DOI")

which looks like this (just showing the first few DOIs as an example):

head(dois)

[1] "10.4269/ajtmh.21-0849"            "10.1128/AAC.01991-21"            
[3] "10.12688/wellcomeopenres.17548.1" "10.2196/32392"                   
[5] "10.1016/j.jinf.2022.01.010"       "10.1016/S0140-6736(22)00163-5"

Then you can use this list of DOIs to query the complete information of each publication through Crossref online:

p22 <- cr_works(dois)

You can see the amazing quantity of information that is available for each publication. Here a full list of the variables (no less than 36!):

names(p22$data)

 [1] "container.title"        "created"                "deposited"             
 [4] "published.print"        "doi"                    "indexed"               
 [7] "issn"                   "issue"                  "issued"                
[10] "member"                 "page"                   "prefix"                
[13] "publisher"              "score"                  "source"                
[16] "reference.count"        "references.count"       "is.referenced.by.count"
[19] "subject"                "title"                  "type"                  
[22] "url"                    "volume"                 "abstract"              
[25] "author"                 "link"                   "license"               
[28] "alternative.id"         "update.policy"          "language"              
[31] "short.container.title"  "assertion"              "reference"             
[34] "published.online"       "funder"                 "archive"               
[37] "update_to"

This is basically what it looks like (again, showing here only the first 8 fields of the first 10 publications):

p22$data

# A tibble: 118 × 37
   container.title   created deposited published.print doi   indexed issn  issue
   <chr>             <chr>   <chr>     <chr>           <chr> <chr>   <chr> <chr>
 1 The American Jou… 2022-0… 2022-02-… 2022-02-02      10.4… 2023-0… 0002… 2    
 2 Antimicrobial Ag… 2022-0… 2022-03-… 2022-03-15      10.1… 2023-0… 0066… 3    
 3 Wellcome Open Re… 2022-0… 2022-03-… <NA>            10.1… 2022-0… 2398… <NA> 
 4 Journal of Medic… 2022-0… 2022-02-… <NA>            10.2… 2023-0… 1438… 2    
 5 Journal of Infec… 2022-0… 2022-07-… 2022-04         10.1… 2022-0… 0163… 4    
 6 The Lancet        2022-0… 2023-03-… 2022-02         10.1… 2023-0… 0140… 10325
 7 The Lancet        2022-0… 2023-03-… 2022-02         10.1… 2023-0… 0140… 10325
 8 The Lancet Globa… 2021-1… 2022-01-… 2022-02         10.1… 2023-0… 2214… 2    
 9 PLOS Digital Hea… 2022-0… 2022-01-… <NA>            10.1… 2023-0… 2767… 1    
10 The Journal of I… 2022-0… 2023-01-… 2022-03-15      10.4… 2023-0… 0022… 6    
# ℹ 108 more rows
# ℹ 29 more variables: issued <chr>, member <chr>, page <chr>, prefix <chr>,
#   publisher <chr>, score <chr>, source <chr>, reference.count <chr>,
#   references.count <chr>, is.referenced.by.count <chr>, subject <chr>,
#   title <chr>, type <chr>, url <chr>, volume <chr>, abstract <chr>,
#   author <list>, link <list>, license <list>, alternative.id <chr>,
#   update.policy <chr>, language <chr>, short.container.title <chr>, …

Some of these variables are very interesting, such as is.referenced.by.count that tells you how often the publication has been cited as of today! This is a key metric. Also, interestingly, there is a variable named subject that gives you the broad topic of the publication. There is also a funder variable that tells you what funded the study. For example, for the 3rd publication in this list we can see that it is Wellcome:

p22$data[3, "funder"][[1]][[1]]$name

[1] "Wellcome Trust"

But probably more interesting to you is what relates to the authors information, where given and family names are (1) full and (2) clearly distinguished, and the affiliations are given too. Here an example for the first publication:

p22$data[1, "author"][[1]]

[[1]]
# A tibble: 24 × 6
   given    family sequence affiliation.name affiliation1.name affiliation2.name
   <chr>    <chr>  <chr>    <chr>            <chr>             <chr>            
 1 Nguyen … Chau   first    1Hospital for T… <NA>              <NA>             
 2 Lam Anh  Nguyet additio… 2Oxford Univers… <NA>              <NA>             
 3 Nguyen … Truong additio… 1Hospital for T… <NA>              <NA>             
 4 Le Mau   Toan   additio… 1Hospital for T… <NA>              <NA>             
 5 Nguyen … Dung   additio… 1Hospital for T… <NA>              <NA>             
 6 Le Manh  Hung   additio… 1Hospital for T… <NA>              <NA>             
 7 Mai Tha… Nhan   additio… 1Hospital for T… <NA>              <NA>             
 8 Dinh Ng… Man    additio… 1Hospital for T… <NA>              <NA>             
 9 Nghiem … Ngoc   additio… 1Hospital for T… <NA>              <NA>             
10 Huynh P… Thao   additio… 1Hospital for T… <NA>              <NA>             
# ℹ 14 more rows

With the full given and family names here I believe you can fairly easily construct the hash table that will ensure consistency. Indeed, the most common inconsistency that happens with Vietnamese authors is that they sometimes switch given and family names. So, you can construct the hash table by trying both options, which takes no time with a bit of computer code. One point to note is that even if on the publication itself initials are usually used for the given names, in the actually data bases such as the one that we queried here (Crossref), you’ll always find the full names. Hence, with the DOIs and a little bit of code you can actually retrieve much more information than what is in the excel file that you sent me, in a tiny fraction of the time that it took to build this excel file in the first place! That’s why I’m basically recommending OUCRU to forget about this excel file and just make sure they have the DOI which is the only thing that really matters.