GitHub - chase-lab/checklist_change: Research project aggregating biodiversity data from checklists. Child project from chase-lab/homogenisation and brother project of chase-lab/metacommunity-surveys

Checklist change

Description

This research compendium gathers the scripts used to download, restructure and aggregate data sets into a large meta-analysis of communities sampled at least twice, 10 years apart or more. What sets this data set apart is that it aggregates data from studies that vary greatly in focus and methods but that all sampled an area exhaustively enough for the resulting species list to be considered a checklist. In some studies, only the past or only the present community was provided, and the other community was built by adding invading species or excluding extinct species.
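The reconstruction of a missing checklist amounts to set operations on species lists. Here is a minimal sketch, assuming made-up species names; the helpers `build_present()` and `build_past()` are hypothetical and are not part of this repository:

```r
# Hypothetical helpers: derive the missing checklist from the one provided,
# using the lists of introduced and extinct species reported by a study.
build_present <- function(past, introduced, extinct) {
  # present = (past + introduced species) - species that went extinct
  sort(setdiff(union(past, introduced), extinct))
}

build_past <- function(present, introduced, extinct) {
  # past = (present - introduced species) + species that went extinct since
  sort(union(setdiff(present, introduced), extinct))
}

# Invented example lists
past       <- c("Species a", "Species b", "Species c")
introduced <- c("Species d")
extinct    <- c("Species b")

present <- build_present(past, introduced, extinct)
present
build_past(present, introduced, extinct)
```

In this toy case the two helpers are inverses of each other, so rebuilding the past list from the derived present list returns the original species.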

Data

Raw and aggregated data tables are provided. Raw data are stored for each data set individually in the data/raw data/ folder as compressed .rds files. Aggregated data are in data/communities.csv and data/metadata.csv; column definitions are given in data/definitions_communities.txt and data/definitions_metadata.txt and reproduced at the bottom of this readme.

Here are commands exploring the data set:

dt <- data.table::fread(file = "data/communities.csv",
                        select = c("dataset_id","regional","local","year","species"),
                        stringsAsFactors = TRUE) |> 
   unique()

meta <- data.table::fread(file = "data/metadata.csv",
                          select = c("dataset_id","regional","local","taxon","realm",
                                     "year", "latitude","longitude",
                                     "gamma_bounding_box_km2", "gamma_sum_grains_km2"),
                          stringsAsFactors = TRUE,
                          colClasses = c(latitude = "numeric",
                                         longitude = "numeric")) [
                             j = year := as.integer(as.character(year))
                          ] |> 
   unique()

# How many species per taxon?
dt[i = meta,
   j = ":="(
      taxon = i.taxon,
      realm = i.realm
   ),
   on = .(dataset_id, regional, local)
][
   j = data.table::uniqueN(species),
   by = taxon
][
   order(-V1)
]

# How many dataset_ids
base::nlevels(meta$dataset_id)

# How many dataset_ids/regions
data.table::uniqueN(meta[, .(dataset_id, regional)])

# How many dataset_ids/regions/sites?
data.table::uniqueN(meta[, .(dataset_id, regional, local)])

# How many sites per region on average?
meta[j = data.table::uniqueN(local), keyby = .(dataset_id, regional) ][
   j = mean(V1)]

# How many dataset_ids/regions/sites with unique coordinates?
data.table::uniqueN(meta[i = meta[j = data.table::uniqueN(.SD),
                                  .SDcols = c("latitude", "longitude"),
                                  keyby = .(dataset_id, regional, local)][V1 == 1L],
                         on = .(dataset_id, regional, local)])

# How many samples?
data.table::uniqueN(meta[, .(dataset_id, regional, local, year)])

# What is the mean year range?
unique(meta[, .(dataset_id, regional, local, year)])[j = mean(diff(range(year)))]

# How many localities with 2 samples?
# How many localities with at least 4, 5, 10 samples?
meta[j = data.table::uniqueN(year),
     by = .(dataset_id, regional, local)][j = .(y2 = sum(V1 == 2L),
                                                y4 = sum(V1 >= 4L),
                                                y5 = sum(V1 >= 5L),
                                                y10 = sum(V1 >= 10L))]

# How many regions with 2 samples?
# How many regions with at least 4, 5, 10 samples?
meta[j = data.table::uniqueN(year),
     by = .(dataset_id, regional)][j = .(y2 = sum(V1 == 2L),
                                         y4 = sum(V1 >= 4L),
                                         y5 = sum(V1 >= 5L),
                                         y10 = sum(V1 >= 10L))]

# Beginning year range
meta[j = min(year), by = .(dataset_id, regional)][j = range(V1)]

# End year range
meta[j = max(year), by = .(dataset_id, regional)][j = range(V1)]

# How many regions were first sampled in 1800 or earlier?
meta[j = min(year) <= 1800L, by = .(dataset_id, regional)][j = sum(V1)]

# What is the maximum number of samples in a site?
meta[j = data.table::uniqueN(year), by = .(dataset_id, regional, local) ][
   j = max(V1)]

# How many samples per taxon group?
meta[j = data.table::uniqueN(.SD),
     .SDcols = c("dataset_id", "regional", "local", "year"),
     by = "taxon"][order(-V1)]

# How many regions per realm?
meta[j = data.table::uniqueN(.SD),
     .SDcols = c("dataset_id", "regional"),
     by = "realm"][order(-V1)]

# How many samples per realm?
meta[j = data.table::uniqueN(.SD),
     .SDcols = c("dataset_id", "regional", "local", "year"),
     by = "realm"][order(-V1)]

# Mean richness per sample?
dt[j = data.table::uniqueN(species), 
   by = .(dataset_id, regional, local, year)][j = mean(V1)]

# Range of gamma extents in km2 (sum of grains and bounding box)
meta[j = .(sum = range(gamma_sum_grains_km2, na.rm = TRUE),
           box = range(gamma_bounding_box_km2, na.rm = TRUE))]
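
The inclusion criterion stated in the description (sites sampled at least twice, 10 years apart or more) can be checked with the same data.table idiom. Here is a minimal sketch on a toy table; the values and the `n_years` and `span` column names are invented for illustration:

```r
library(data.table)

# Toy metadata with the same key columns as data/metadata.csv (values invented)
toy <- data.table(
   dataset_id = c("a", "a", "a", "b", "b"),
   regional   = c("r1", "r1", "r1", "r2", "r2"),
   local      = c("s1", "s1", "s2", "s1", "s1"),
   year       = c(1990L, 2005L, 2000L, 2010L, 2015L)
)

# Number of sampled years and time span per site
site_span <- toy[j = .(n_years = uniqueN(year),
                       span = diff(range(year))),
                 by = .(dataset_id, regional, local)]

# Sites meeting the criterion: at least 2 samples, 10 or more years apart
kept <- site_span[n_years >= 2L & span >= 10L]
kept
```

Only the first toy site is kept: the second was sampled once and the third has samples only 5 years apart.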

Workflow and reproducibility

Environment

To ensure reproducibility, the working environment (R version and package versions) was documented and isolated using the package renv. Running renv::restore() installs all missing packages at once: renv uses the renv.lock file to download the same package versions that we used and installs them on your system.

Relative paths

The repository includes an RStudio project file, checklist_change.Rproj, which should always be used to open the project so that the working directory is set correctly. All paths in the project are relative to the same root: the checklist_change folder where the .Rproj file is located. Using setwd() is discouraged.
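
As a small illustration of the relative-path convention, the sketch below builds a path from the project root and checks that the working directory is the folder holding the .Rproj file. The helper name `is_project_root()` is hypothetical and is not part of the repository:

```r
# Hypothetical helper: TRUE when `dir` contains the project file,
# i.e. when relative paths such as "data/communities.csv" will resolve.
is_project_root <- function(dir = getwd()) {
  file.exists(file.path(dir, "checklist_change.Rproj"))
}

# Paths are written relative to the project root, never after a setwd() call
communities_path <- file.path("data", "communities.csv")

if (!is_project_root()) {
  message("Open checklist_change.Rproj so that '", communities_path,
          "' resolves from the project root.")
}
```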

Workflow

After downloading or cloning this repository, run the following scripts, in order, to wrangle the raw data and merge all data sets into one long table.

renv::restore()
# Raw data are stored in the project so that users do not need to download the data
# source("R/1.0_downloading_raw_data.r")
source("R/2.0_wrangling_raw_data.r")
source("R/3.0_merging_long-format_tables.r")

Additional installations

You might need to install the 64-bit version of Java to run Tabulizer.

Variable definitions

Community data

`/data/definitions_communities.txt`

Variable name	Definition
dataset_id	Unique ID linked to a publication (article or data set). If the data set was split because different taxonomic groups are provided, a letter is added at the end. No missing value.
year	Year of sampling. If sampling was pooled over several years, the last sampling year is used here. No missing value.
regional	Region name; a region contains at least two localities. It can be a national park, a state or a forest, for example, but smaller scales are also included where the region is an experimental site. A data set can have several regions. No missing value. UTF-8 encoding.
local	Name or code of the sampled locality or experimental sample. For example, it can correspond to the name of an island, a lake or forest. No missing value. UTF-8 encoding.
species	Species names. Whenever possible, complete (Genus + species epithet) names were included rather than codes. No missing value. UTF-8 encoding.
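
The "No missing value" statements above can be verified after loading the table. Here is a minimal sketch on a toy data frame with the same columns; the rows and the helper name `check_no_missing()` are invented for illustration:

```r
# Hypothetical helper: count NA values and empty strings per column and
# return only the columns that have at least one.
check_no_missing <- function(df, cols) {
  df <- as.data.frame(df)
  n_missing <- vapply(
    df[cols],
    function(x) sum(is.na(x) | as.character(x) == ""),
    integer(1)
  )
  n_missing[n_missing > 0L]
}

# Toy community table (values invented)
toy <- data.frame(
  dataset_id = c("study_1", "study_1", "study_1"),
  year       = c(1950L, 1950L, 2010L),
  regional   = c("Region A", "Region A", "Region A"),
  local      = c("Island 1", "Island 1", "Island 1"),
  species    = c("Genus one", "Genus two", "Genus one")
)
cols <- c("dataset_id", "year", "regional", "local", "species")

# An empty result means no column has missing values
check_no_missing(toy, cols)
```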

Metadata

`/data/definitions_metadata.txt`

Variable name	Definition
dataset_id	Unique ID linked to a publication (article or data set). If the data set was split because different taxonomic groups are provided, a letter is added at the end. No missing value.
year	Year of sampling. If sampling was pooled over several years, the last sampling year is used here. Where the year (i.e., date) of a historical list was not available, it was estimated based on human visitation/colonisation history. No missing value.
regional	Region name; a region contains at least two localities. It can be a national park, a state or a forest, for example, but smaller scales are also included where the region is an experimental site. A data set can have several regions. No missing value.
local	Name or code of the sampled locality or experimental sample as given by the original data provider. For example, it can correspond to the name of an island, a lake or forest. No missing value.
latitude	Latitude North in decimal degree, WGS84. NA values indicate that information could not be collected.
longitude	Longitude East in decimal degree, WGS84. NA values indicate that information could not be collected.
effort	Sampling effort expressed for example as the number of visits to a plot or the total area sampled in a given year. See Comment column for a description. NA value means that exact effort is unknown but considered extensive and exhaustive.
data_pooled_by_authors	TRUE if the data provided by the authors was already pooled covering several years: several samples made over several years pooled together. No missing value.
data_pooled_by_authors_comment	If there was pooling by the original authors, contains free text describing how the authors pooled their data. NA values when no pooling was done.
sampling_years	If there was pooling by the original authors, contains the years sampled for each period. "1997, 1999" means 1997 and 1999, "1997-1999" means 1997, 1998 and 1999. NA values indicate that information could not be collected.
alpha_grain_m2	Area of the local unit or area of the sampling gear (in which case, alpha_grain_type = sample). NA values indicate that information could not be collected.
alpha_grain_type	Category of alpha_grain specifying what the alpha_grain measure relates to. Allowed values are: "island", "plot", "administrative" (eg the area of a park or state), "watershed", "sample", "lake_pond", "archipelago", "trap", "transect", "ecosystem" (eg the area of the whole wetland or generally the area of adjacent comparable habitat), "box" (a box covering the sites) or "quadrat". No missing value.
alpha_grain_comment	Description of how the alpha_grain was measured. NA values indicate that information could not be collected.
gamma_bounding_box_km2	Measure of the extent/regional scale area as the area covering all sites, computed as a convex-hull or a rectangle box or as the area of the administrative unit in which sites are found. NA values indicate that information could not be collected.
gamma_bounding_box_type	Category of gamma_bounding_box specifying what the gamma_bounding_box relates to. Generally the bounding box is computed based on the coordinates of the sites or by using a coarse area such as the area of the park, the state or the whole island. Allowed values are: "administrative" (eg the area of a park or state), "island", "convex-hull", "watershed", "box" (a box covering the sites), "buffer", "functional" (eg the area of the whole wetland or generally the area of adjacent comparable habitat), "shore" or "lake_pond". NA values indicate that information could not be collected.
gamma_bounding_box_comment	Description or source (eg the paper, Wikipedia or measured on Google Earth) of the gamma_bounding_box value. NA values indicate that information could not be collected.
gamma_sum_grains_km2	Measure of the extent/regional scale area as the sum of the grains sampled each year. NA values indicate that information could not be collected.
gamma_sum_grains_type	Category of gamma_sum_grains specifying what the gamma_sum_grains relates to. Generally the type is related to the type of the alpha_grain since it is a sum of all alpha_grains in a region in a given year. Allowed values are: "archipelago", "administrative" (eg the area of a park or state), "watershed", "sample", "lake_pond", "plot", "quadrat", "transect", "functional" (eg the area of the whole wetland or generally the area of adjacent comparable habitat) or "box" (a box covering the sites). NA values indicate that information could not be collected.
gamma_sum_grains_comment	Description or source (eg the paper, Wikipedia or measured on Google Earth) of the gamma_sum_grains value. NA values indicate that information could not be collected.
realm	Realm in which the sampling was done, one of: Freshwater or Terrestrial. No missing value.
taxon	Taxon group of the data set, one of: "Plants", "Invertebrates" or "Fish". No missing value.
comment	A description of the data set origin, goal and sampling method. No missing value.
comment_standardisation	A short description of the modifications we made to the data set to ensure standard effort: excluded sites or years, excluded taxa, etc. No missing value.
doi	One or several DOIs separated by
is_coordinate_local_scale	Logical. TRUE if coordinates are given at the scale of the site/locality. FALSE if the coordinates are at the regional scale or missing. No missing value.
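
The sampling_years format defined above ("1997, 1999" for individual years, "1997-1999" for a range) can be expanded into a vector of years. Here is a minimal sketch; the helper name `parse_sampling_years()` is hypothetical, and it assumes plain hyphens and commas only, so other separators that may occur in the data are not handled:

```r
# Hypothetical helper: expand one sampling_years string into integer years.
parse_sampling_years <- function(x) {
  if (is.na(x)) return(NA_integer_)
  # Split on commas, then expand each "start-end" part into a sequence
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  years <- lapply(parts, function(p) {
    bounds <- as.integer(strsplit(p, "-", fixed = TRUE)[[1]])
    if (length(bounds) == 2L) seq(bounds[1], bounds[2]) else bounds
  })
  sort(unique(unlist(years)))
}

parse_sampling_years("1997, 1999")   # 1997 1999
parse_sampling_years("1997-1999")    # 1997 1998 1999
```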
