# Extracting text {#chunks}
Functions for extracting parts of articles used to live in `fulltext`, but have since
moved to the [pubchunks][] package.
The `pubchunks::pub_chunks` function tries to make it easy to extract the parts of
articles you want. It only works with articles in XML format: although
we can get text out of PDFs, a PDF offers no machine-readable way to say
"I want the abstract".
Besides requiring XML, this function only knows about a select set of publishers,
for which we've encoded how to get to the different sections of an article.
Publishers do not all use the same XML format, so the path to each section
differs slightly between them. That is, getting the abstract
requires slightly different XPath for publisher A vs. publisher B vs. publisher C.
An alternative to `pubchunks` is to use XPath or CSS selectors yourself to slice and
dice the XML.
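
As a rough sketch of the do-it-yourself route, the example below uses the `xml2` package to pull an abstract out of two tiny documents that stand in for two publishers' layouts. The documents, the `abstract_xpath` lookup table, and the `get_abstract()` helper are all made up for illustration; they are not how `pubchunks` stores its publisher rules.

```r
library(xml2)

# Two made-up documents standing in for two publishers' XML layouts
pub_a <- read_xml(
  "<article><front><abstract><p>Abstract from A.</p></abstract></front></article>"
)
pub_b <- read_xml(
  "<doc><head><summary>Abstract from B.</summary></head></doc>"
)

# Hypothetical lookup table: one XPath per publisher for the same section
abstract_xpath <- c(a = "//front/abstract", b = "//head/summary")

# Hypothetical helper: pick the XPath for the publisher, then extract the text
get_abstract <- function(doc, publisher) {
  xml_text(xml_find_first(doc, abstract_xpath[[publisher]]))
}

abs_a <- get_abstract(pub_a, "a")
abs_b <- get_abstract(pub_b, "b")
```

The same section needs a different path in each document, which is the bookkeeping that `pub_chunks()` takes care of for the publishers it knows about.
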
## Usage {#chunks-usage}
```{r}
library(fulltext)
library(pubchunks)
```
Get a full text article
```{r chunks_ft_get}
x <- ft_get('10.1371/journal.pone.0086169')
```
Note that, unlike in previous versions of `fulltext`, you now have to collect (`ft_collect()`)
the text from the XML file on disk. Then you can pass the result to `pub_chunks()`, here to get
the authors.
```{r chunks_ft_get_collect}
x %>% ft_collect %>% pub_chunks("authors")
```
In another example, let's search for PLOS articles.
```{r rplos_dois}
library("rplos")
(dois <- searchplos(q = "*:*", fl = "id",
  fq = list("doc_type:full", "article_type:\"research article\""),
  limit = 5)$data$id)
```
Then get the full text
```{r}
x <- ft_get(dois)
```
Then pull out various sections of each article.
> Remember to collect the full text first.
```{r}
x <- ft_collect(x)
```
```{r eval = FALSE}
x %>% pub_chunks("front")
x %>% pub_chunks("body")
x %>% pub_chunks("back")
x %>% pub_chunks("history")
x %>% pub_chunks("authors")
x %>% pub_chunks(c("doi","categories"))
x %>% pub_chunks("all")
x %>% pub_chunks("publisher")
x %>% pub_chunks("acknowledgments")
x %>% pub_chunks("permissions")
x %>% pub_chunks("journal_meta")
x %>% pub_chunks("article_meta")
```
## Tabularize {#chunks-tabularize}
The function `pub_tabularize()` is useful for coercing the output of `pub_chunks()` into a data.frame,
the lingua franca of data work in R.
```{r}
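library(data.table)
# extract only the DOI and title from each article
x <- pub_chunks(x, c("doi", "title"))
# convert the extracted chunks into data.frames
x <- pub_tabularize(x)
# stack the data.frames for the PLOS articles into one table
rbindlist(x$plos, fill = TRUE)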
```
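
To show what `rbindlist(..., fill = TRUE)` does in the last step, here is a minimal sketch with two hand-made data.frames, one of which lacks a `title` column. The DOIs and titles are invented placeholders, not output from `pub_tabularize()`.

```r
library(data.table)

# Two made-up one-row tables; the second has no `title` column
rows <- list(
  data.frame(doi = "10.0000/example.1", title = "First", stringsAsFactors = FALSE),
  data.frame(doi = "10.0000/example.2", stringsAsFactors = FALSE)
)

# fill = TRUE matches columns by name and pads missing ones with NA
combined <- rbindlist(rows, fill = TRUE)
```

With `fill = TRUE`, articles that are missing a section still end up as rows in the combined table rather than causing an error.
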
## Other inputs {#chunks-other-inputs}
`pub_chunks()` works with other inputs besides the output of `fulltext::ft_get()`.
### Files
```{r}
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
  package = "pubchunks")
```
```{r}
pub_chunks(x, "abstract")
pub_chunks(x, "title")
pub_chunks(x, "authors")
pub_chunks(x, c("title", "refs"))
```
The output of `pub_chunks()` is a list with an S3 class `pub_chunks` to make
internal work in the package easier. You can easily see the list structure
by using `unclass()`.
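
As a small illustration of what `unclass()` does, the sketch below builds a toy S3 list and strips its class. The `example_chunks` class and its contents are hypothetical stand-ins, not the real structure returned by `pub_chunks()`.

```r
# A toy list with a made-up S3 class attached
chunks <- structure(
  list(title = "An example title", authors = list("A. Author", "B. Author")),
  class = "example_chunks"
)

# unclass() drops the class attribute and leaves the plain list behind
plain <- unclass(chunks)
str(plain)
```

The same call on real `pub_chunks()` output gives a plain list that you can inspect with `str()` or index with `$` and `[[`.
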
### XML in a string
```{r}
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
```
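
The `readLines()` plus `paste0()` idiom above turns a file into a single string. A minimal base R sketch of it, using a made-up three-line XML file written to a temporary path:

```r
# Write a tiny, made-up XML file to a temporary location
path <- tempfile(fileext = ".xml")
writeLines(c("<article>", "  <title>An example title</title>", "</article>"), path)

# readLines() returns one element per line; collapse = "" joins them
# into a single string with nothing in between
xml_lines <- readLines(path)
xml_string <- paste0(xml_lines, collapse = "")
```

Here `xml_lines` has three elements, while `xml_string` is a length-one character vector holding the whole document.
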
### xml2 objects
```{r}
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
```
[pubchunks]: https://github.com/ropensci/pubchunks/