library("Rcrawler")
Rcrawler(
Website = "http://researchdata.ox.ac.uk",
no_cores = 4,
no_conn = 4 ,
NetworkData = TRUE,
statslinks = TRUE
)
During a meeting yesterday morning I offered to scrape a website that my colleagues are looking at redesigning. I didn’t really have any idea how to do it in R, but I was confident it wouldn’t be too tricky and there should be some good packages available for doing this kind of thing. I discovered the wonderful Rcrawler
library and also took the opportunity to get more comfortable with tidygraph
as I’ve been putting it off.
This blogpost covers the basics of how I scraped the data and wrangled it with tidygraph
into the graph visualisation you can see here.
The Rcrawler
documentation includes an example 1 of how to build a network of the internal links in a website, using the code below. This takes up to 2 minutes, so I’ve made the data available in a Gist here.
The Rcrawler
function creates two objects in your global workspace:
NetwIndex
: A list of all the web pages (URLs) from the websiteNetwEdges
: The internal links between these pages
It’s much easier to work with these objects if we augment the NetwIndex object with the ids used in the NetwEdges object, and I’ll export the data into a Gist for reproducibility and so I don’t have to recrawl the site each time I generate this post!
library("tidyverse")
library("gistr")
<- tibble(
rdo_oxford_index id = 1:length(NetwIndex),
url = NetwIndex
%>%
) mutate(name = url)
%>%
rdo_oxford_index write_csv("rdo_oxford_index.csv")
%>%
NetwEdges write_csv("rdo_oxford_edges.csv")
<- gist_create(
rdo_oxford_gist files = c(
"rdo_oxford_edges.csv",
"rdo_oxford_index.csv"
),description = "Blogpost: crawling with tidygraph"
)
Now we can import the data directly from the gist as follows:
library("tidyverse")
<- read_csv("https://gist.githubusercontent.com/charliejhadley/ba5a983e4e29cae29d379fc9daf1d873/raw/dea7df7ae9a1542372fe6203362991ec019bb1c3/rdo_oxford_edges.csv")
rdo_oxford_edges <- read_csv("https://gist.githubusercontent.com/charliejhadley/ba5a983e4e29cae29d379fc9daf1d873/raw/c3ead87cd4dc95696432b4f6547e4d7762132721/rdo_oxford_index.csv") rdo_oxford_index
Let’s generate an igraph
object for manipulation with the tidygraph
library, after first removing self-loops via the simplify
function but retain multiple edges between pairs of webpages. The graph is too large to visualise sensibly at present:
library("igraph")
<- graph_from_data_frame(
rdo_oxford_igraph
rdo_oxford_edges,vertices = rdo_oxford_index
%>%
) simplify(remove.multiple = FALSE)
tibble(
edges = ecount(rdo_oxford_igraph),
vertices = vcount(rdo_oxford_igraph)
)
# A tibble: 1 × 2
edges vertices
<dbl> <int>
1 20330 383
It would be useful if we could augment the rdo_oxford_index
object with some features that allow us to selectively visualise certain types of pages, which include:
- pages that are generated due to pagination rules
- auto-generated pages for tags
- auto-generated pages for “portfolio items”
<- rdo_oxford_index %>%
rdo_oxford_index mutate(paginated.page = if_else(str_detect(url, "/page/"),
TRUE,
FALSE)) %>%
mutate(tag.page = if_else(str_detect(url, "/tag/"),
TRUE,
FALSE)) %>%
mutate(portfolio.page = if_else(str_detect(url, "/portfolio/"),
TRUE,
FALSE))
Let’s also markup the pages linked to from the websites navigation menu and footer, as these are linked to on every single page!
library("rvest")
<-
navbar_links read_html(
"https://gist.githubusercontent.com/charliejhadley/ba5a983e4e29cae29d379fc9daf1d873/raw/4294b8180866b70f081c05a1dca9cb3bfd172fe3/navigation.html"
%>%
) html_nodes("a") %>%
html_attr('href') %>%
unique()
<- c("http://researchdata.ox.ac.uk/credits/",
footer_links "http://researchdata.ox.ac.uk/rdm-delivery-group/",
"https://www1.admin.ox.ac.uk/researchsupport/researchcommittees/scworkgroups/rdmopendata/")
<- rdo_oxford_index %>%
rdo_oxford_index mutate(navbar.page = if_else(url %in% navbar_links,
TRUE,
FALSE)) %>%
mutate(footer.page = if_else(url %in% footer_links,
TRUE,
FALSE)) %>%
mutate(node.id = row_number())
Now’s time for me to learn how to use tidygraph
! To filter out the paginated.pages
, portfolio.pages
and tag.pages
and all links to the navbar.pages
and footer.pages
I need to follow these steps:
- Convert an
igraph
object to a"tbl_graph" "igraph"
object - Activate nodes before filtering on them
- Activate edges before filtering on them
library("tidygraph")
<- graph_from_data_frame(
rdo_oxford_igraph
rdo_oxford_edges,vertices = rdo_oxford_index
%>%
) simplify(remove.multiple = FALSE)
<- rdo_oxford_index %>%
navbar_page_new_ids filter(navbar.page == TRUE) %>%
select(node.id) %>%
1]]
.[[
<- rdo_oxford_index %>%
footer_page_new_ids filter(footer.page == TRUE) %>%
select(node.id) %>%
1]]
.[[
<- rdo_oxford_igraph %>%
rdo_oxford_tidygraph as_tbl_graph() %>%
# to_undirected() %>%
activate(nodes) %>%
filter(paginated.page == FALSE & portfolio.page == FALSE & tag.page == FALSE) %>%
activate(edges) %>%
filter(!to %in% navbar_page_new_ids) %>%
filter(!to %in% footer_page_new_ids)
Finally, let’s add a group
column so we can easily colour the different types of pages with the ggraph
library:
library("ggraph")
<- rdo_oxford_tidygraph %>%
rdo_oxford_tidygraph activate(nodes) %>%
mutate(group = if_else(navbar.page == TRUE,
"Navbar Page",
"Other Page")) %>%
mutate(group = if_else(footer.page == TRUE,
"Footer Page",
group))
%>%
rdo_oxford_tidygraph ggraph() +
geom_edge_fan() +
geom_node_point(aes(color = group)) +
theme_graph()
Actually, there are two more things I’d like to do: extract the largest connected component and make an interactive network viz. It was slightly frustrating to figure out from the documentation how to use the group_components
function, but it became clear that it works similarly to group_indices
:
%>%
rdo_oxford_tidygraph activate(nodes) %>%
mutate(component = group_components()) %>%
filter(component == 1) %>%
ggraph() +
geom_edge_fan() +
geom_node_point(aes(color = group)) +
theme_graph()
It’s really easy to create interactive dataviz using htmlwidgets (what are htmlwidgets?), my favourite option for network visualisations is visNetwork
, which happily consumes igraph
objects…
library("visNetwork")
%>%
rdo_oxford_tidygraph activate(nodes) %>%
mutate(component = group_components()) %>%
filter(component == 1) %>%
mutate(title = url) %>%
mutate(label = "") %>% # remove long url labels from underneath nodes
as.igraph() %>%
visIgraph(idToLabel = FALSE) %>% # remove long url labels from underneath nodes
visOptions(highlightNearest = TRUE) %>%
visLegend()
This was a really fun way to learn how to use tidygraph
and ggraph
, which I’ve been putting off for a long time. I’m really grateful to Thomas Lin Pedersen for all his work on these amazing packages!
Footnotes
Actually, there was an error in the example so I made a pull request to fix it 😄↩︎
Reuse
Citation
@online{hadley2018,
author = {Charlie Hadley},
title = {Crawling with Tidygraph},
date = {2018-02-27},
url = {https://visibledata.co.uk/posts/2018-02-27-crawling-with-tidygraph},
langid = {en}
}