library("tidyverse")
library("rvest")
library("glue")
library("httr")
library("future")
Recently a friend told me they were doing a systematic review of the Hand Therapy Journal published by SAGE. They wanted a way to scrape the journal for all publications - without going mad in the process. It seemed like a good excuse to get them motivated to learn R and for me to practice web scraping with the rvest
package.
We’ll go through the following steps:
- Generate URLs for all issues of the journal
- Inspect the source code for the page for the DOI
- Scrape all the pages
- Scrape all the pages using parallel processing via the
future
package
Let’s load all the packaes we’re going to use up front:
URLs for each journal issue
Let's simplify things and only consider issues of the journal published up to the date this blogpost was written (2018-11-12). I can't guarantee that the journal won't completely change their URL scheme tomorrow, but until they do, all issues have the following URL structure:
https://journals.sagepub.com/toc/hthb/{volume}/{issue}
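As a quick sanity check (mine, not in the original post), glue() can fill the template for a single issue:

glue("https://journals.sagepub.com/toc/hthb/{volume}/{issue}", volume = 23, issue = 4)
#> https://journals.sagepub.com/toc/hthb/23/4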
There have always been 4 issues a year, and the most recent volume is 23. Let's set up a tibble() with this data:

issue_urls <- tibble(volume = 1:23, issue_1 = 1, issue_2 = 2, issue_3 = 3, issue_4 = 4)
I’ll now use gather()
to convert this into a long and tidy dataset that iterates through all issues:
issue_urls <- issue_urls %>%
  gather(issue.colname, issue, issue_1:issue_4) %>%
  select(-issue.colname) %>%
  arrange(volume)
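As an aside, the same long tibble can be built in one step with tidyr::crossing(), which generates every combination of volume and issue - a sketch of an alternative, not what the code above uses:

# Every volume/issue combination, already sorted by volume
crossing(volume = 1:23, issue = 1:4)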
Now we can construct our URLs using glue():

issue_urls <- issue_urls %>%
  mutate(issue_url = glue("https://journals.sagepub.com/toc/hthb/{volume}/{issue}"))
head(issue_urls)
# A tibble: 6 × 3
volume issue issue_url
<int> <dbl> <glue>
1 1 1 https://journals.sagepub.com/toc/hthb/1/1
2 1 2 https://journals.sagepub.com/toc/hthb/1/2
3 1 3 https://journals.sagepub.com/toc/hthb/1/3
4 1 4 https://journals.sagepub.com/toc/hthb/1/4
5 2 1 https://journals.sagepub.com/toc/hthb/2/1
6 2 2 https://journals.sagepub.com/toc/hthb/2/2
Inspect the source code for the DOI
Inspecting the source code reveals that the title of each article in the issue has the attribute data-item-name
, with the value click-article-title
.
Let’s use the most recent issue as a toy example:
"https://journals.sagepub.com/toc/hthb/23/4" %>%
read_html() %>%
html_nodes("[data-item-name=click-article-title]") %>%
html_attrs() %>%
1]] .[[
data-item-name class
"click-article-title" "ref nowrap"
href
"/doi/full/10.1177/1758998318784316"
The DOI for the article is almost the href value - there's some fluff we'll get rid of later. But we know enough to create a function for extracting the href value:
get_article_dois_from_issue <- function(issue_url) {
  # Return NA if the issue page can't be fetched or parsed
  issue_page <- tryCatch(
    issue_url %>%
      read_html(),
    error = function(c) NA
  )
  if (length(issue_page) == 1 && is.na(issue_page)) {
    return(NA)
  }
  # Pull the href from every article-title link on the issue page
  issue_page %>%
    html_nodes("[data-item-name=click-article-title]") %>%
    html_attr("href")
}
"https://journals.sagepub.com/toc/hthb/23/4" %>%
get_article_dois_from_issue()
[[1]]
[1] "/doi/full/10.1177/1758998318784316"
[[2]]
[1] "/doi/full/10.1177/1758998318796010"
[[3]]
[1] "/doi/full/10.1177/1758998318798668"
[[4]]
[1] "/doi/full/10.1177/1758998318809574"
Scrape all the pages
The wonderful purrr
package allows us to insert these (almost) DOIs into the rows of our tibble()
as follows:
example_dois <- issue_urls %>%
  slice(52:54) %>%
  mutate(doi = map(issue_url, function(x) get_article_dois_from_issue(x)))
Warning: All elements of `...` must be named.
Did you want `data = doi`?
# A tibble: 3 × 4
volume issue issue_url data
<int> <dbl> <glue> <list>
1 13 4 https://journals.sagepub.com/toc/hthb/13/4 <tibble [3 × 1]>
2 14 1 https://journals.sagepub.com/toc/hthb/14/1 <tibble [5 × 1]>
3 14 2 https://journals.sagepub.com/toc/hthb/14/2 <tibble [6 × 1]>
The unnest() function from tidyr allows us to unpack these list columns:
example_dois %>%
  unnest(doi)
# A tibble: 14 × 4
volume issue issue_url doi
<int> <dbl> <glue> <chr>
1 13 4 https://journals.sagepub.com/toc/hthb/13/4 /doi/pdf/10.1177/175…
2 13 4 https://journals.sagepub.com/toc/hthb/13/4 /doi/pdf/10.1177/175…
3 13 4 https://journals.sagepub.com/toc/hthb/13/4 /doi/pdf/10.1177/175…
4 14 1 https://journals.sagepub.com/toc/hthb/14/1 /doi/full/10.1258/ht…
5 14 1 https://journals.sagepub.com/toc/hthb/14/1 /doi/full/10.1258/ht…
6 14 1 https://journals.sagepub.com/toc/hthb/14/1 /doi/full/10.1258/ht…
7 14 1 https://journals.sagepub.com/toc/hthb/14/1 /doi/full/10.1258/ht…
8 14 1 https://journals.sagepub.com/toc/hthb/14/1 /doi/full/10.1258/ht…
9 14 2 https://journals.sagepub.com/toc/hthb/14/2 /doi/full/10.1258/ht…
10 14 2 https://journals.sagepub.com/toc/hthb/14/2 /doi/full/10.1258/ht…
11 14 2 https://journals.sagepub.com/toc/hthb/14/2 /doi/full/10.1258/ht…
12 14 2 https://journals.sagepub.com/toc/hthb/14/2 /doi/full/10.1258/ht…
13 14 2 https://journals.sagepub.com/toc/hthb/14/2 /doi/full/10.1258/ht…
14 14 2 https://journals.sagepub.com/toc/hthb/14/2 /doi/full/10.1258/ht…
All DOIs begin with 10.¹, which we can use to tidy up these almost-DOIs into real DOIs:
example_dois %>%
  unnest(doi) %>%
  mutate(doi = str_replace(doi, ".*/10.", "http://doi.org/10."))
# A tibble: 14 × 4
volume issue issue_url doi
<int> <dbl> <glue> <chr>
1 13 4 https://journals.sagepub.com/toc/hthb/13/4 http://doi.org/10.11…
2 13 4 https://journals.sagepub.com/toc/hthb/13/4 http://doi.org/10.11…
3 13 4 https://journals.sagepub.com/toc/hthb/13/4 http://doi.org/10.11…
4 14 1 https://journals.sagepub.com/toc/hthb/14/1 http://doi.org/10.12…
5 14 1 https://journals.sagepub.com/toc/hthb/14/1 http://doi.org/10.12…
6 14 1 https://journals.sagepub.com/toc/hthb/14/1 http://doi.org/10.12…
7 14 1 https://journals.sagepub.com/toc/hthb/14/1 http://doi.org/10.12…
8 14 1 https://journals.sagepub.com/toc/hthb/14/1 http://doi.org/10.12…
9 14 2 https://journals.sagepub.com/toc/hthb/14/2 http://doi.org/10.12…
10 14 2 https://journals.sagepub.com/toc/hthb/14/2 http://doi.org/10.12…
11 14 2 https://journals.sagepub.com/toc/hthb/14/2 http://doi.org/10.12…
12 14 2 https://journals.sagepub.com/toc/hthb/14/2 http://doi.org/10.12…
13 14 2 https://journals.sagepub.com/toc/hthb/14/2 http://doi.org/10.12…
14 14 2 https://journals.sagepub.com/toc/hthb/14/2 http://doi.org/10.12…
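One small caveat of my own: the dots in the ".*/10." pattern are unescaped regex metacharacters, so the match is looser than it looks. A slightly stricter version captures everything from the first "10." that follows a slash (and switches to the https form of the resolver):

# Capture everything from the first "10." that follows a slash
example_dois %>%
  unnest(doi) %>%
  mutate(doi = str_replace(doi, "^.*?/(10\\..+)$", "https://doi.org/\\1"))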
It's painfully slow going through all 92 issues in this fashion; thankfully, it's fairly easy to run this in parallel with the future package.
Scrape all the pages with future
To begin our work with the future
package we must tell it our plan to use multicore evaluation as follows:
plan(multiprocess)
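A quick note for anyone running this with a current version of future: plan(multiprocess) has since been deprecated in favour of naming the strategy explicitly, so today you would write something like:

# multisession uses separate background R sessions and works on all platforms
plan(multisession)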
We use future()
to tell the future
package to run according to the plan we just set:
start_scrape <- Sys.time()

scraped_dois <- issue_urls %>%
  mutate(dois = map(issue_url, ~future(get_article_dois_from_issue(.x))))

end_scrape <- Sys.time()
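As an aside, the furrr package wraps exactly this map-plus-future pattern. Assuming furrr is installed, the same scrape could be written as the sketch below - note that future_map() returns resolved values directly, so the value() step later on wouldn't be needed:

library("furrr")

scraped_dois <- issue_urls %>%
  mutate(dois = future_map(issue_url, get_article_dois_from_issue))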
This whole process hasn’t taken much time (at the time of writing):
end_scrape - start_scrape
Time difference of 1.755138 mins
But our dataset isn't ready to work with yet; our list column is full of MulticoreFuture objects:
scraped_dois
We use the value()
function to extract the value of our future calculations and unnest()
as previously:
scraped_dois %>%
  mutate(dois = map(dois, ~value(.x))) %>%
  unnest(dois) %>%
  filter(!is.na(dois)) %>%
  rename(doi = dois) %>%
  mutate(doi = str_replace(doi, ".*/10.", "http://doi.org/10.")) %>%
  select(-issue_url)
# A tibble: 459 × 3
volume issue doi
<int> <dbl> <chr>
1 1 1 http://doi.org/10.1177/175899839600100401
2 1 1 http://doi.org/10.1177/175899839600100402
3 1 1 http://doi.org/10.1177/175899839600100403
4 1 1 http://doi.org/10.1177/175899839600100404
5 1 1 http://doi.org/10.1177/175899839600100405
6 1 1 http://doi.org/10.1177/175899839600100406
7 1 1 http://doi.org/10.1177/175899839600100407
8 1 1 http://doi.org/10.1177/175899839600100408
9 1 1 http://doi.org/10.1177/175899839600100409
10 1 1 http://doi.org/10.1177/175899839600100410
# … with 449 more rows
This was the final output my friend needed - all 459 articles from the journal that have been issued a DOI. It was really easy to put all of this together, and I finally got a chance to use the future package properly. If I were asked how to make this more rigorous, I'd recommend the following:

- Programmatically discover the most recent volume and issue
- Don't assume a maximum of 4 issues per volume; allow the code to iterate through each volume until it runs out of issues (see the sketch below)
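Here is a rough sketch of that second idea, using httr (loaded earlier but otherwise unused) to probe whether an issue page exists before scraping it. The helper names and the assumption that missing issues return a non-200 status are mine, not something the journal documents:

# Hypothetical helper: does a given issue page exist?
issue_exists <- function(volume, issue) {
  url <- glue("https://journals.sagepub.com/toc/hthb/{volume}/{issue}")
  status_code(HEAD(url)) == 200
}

# Keep incrementing the issue number until a volume runs out of issues
issues_in_volume <- function(volume) {
  issue <- 1
  while (issue_exists(volume, issue)) {
    issue <- issue + 1
  }
  seq_len(issue - 1)
}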
Footnotes
1. See the DOI Handbook, DOI: 10.1000/182.
Citation
@online{hadley2018,
author = {Charlotte Hadley},
title = {Crawling {DOI} from a {SAGE}},
date = {2018-11-12},
url = {https://visibledata.co.uk/posts/2018-11-12-crawling-doi-from-a-sage},
langid = {en}
}