ohvbd is an R package for retrieving (and parsing) data from a network
of disease vector data sources.
This package was developed as part of the One Health Vector-Borne Diseases Hub.
ohvbd allows for searching and the retrieval of data from the
following data sources:
You can install the stable version of ohvbd from CRAN:
install.packages("ohvbd")

You can alternatively install the development version of ohvbd from GitHub, including any new or experimental features:

# install.packages("devtools")
devtools::install_github("fwimp/ohvbd")

The vignettes are all available online, but if you would like to build
them locally, add build_vignettes = TRUE into your install_github()
command. However, we do not recommend doing this due to the number of
extra R packages utilised in the vignettes.
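For reference, a development install that also builds the vignettes might look like the sketch below. It assumes devtools is installed and that you are willing to install the extra packages the vignettes depend on; the build may fail if those packages are missing.

```r
# install.packages("devtools")
# Assumption: the packages used by the vignettes are (or can be) installed.
devtools::install_github("fwimp/ohvbd", build_vignettes = TRUE)
```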
ohvbd has been designed to make finding and retrieving data on disease
vectors simple and straightforward.
Typically it uses a “piped”-style approach to find, get, and filter data from the supported databases. However, it aims to provide the data to you “as-is”, leaving further downstream analysis and filtering to you.
A basic pipeline for finding and retrieving data on Ixodes ricinus from the VecTraits database looks something like this:
library(ohvbd)

df <- search_hub("Ixodes ricinus") |>
  filter_db("vt") |>
  fetch() |>
  glean()

Major API change
- extract_* functions are now glean_*.
  - This means that if tidyverse is loaded after ohvbd, there are no direct namespace collisions.
Full list of function name changes:
- extract() -> glean()
- extract_ad() -> glean_ad()
- extract_gbif() -> glean_gbif()
- extract_vd() -> glean_vd()
- extract_vt() -> glean_vt()
- fetch_extract_vd_chunked() -> fetch_glean_vd_chunked()
- fetch_extract_vt_chunked() -> fetch_glean_vt_chunked()
New functions & arguments:
- ohvbd now interfaces with GBIF for occurrence data.
  - New *_gbif functions (e.g. fetch_gbif()) allow for retrieving and extracting data from GBIF.
  - A GBIF account and the rgbif package are required to retrieve data from GBIF.
  - The account details must also be set up as shown in the rgbif documentation.
- New tee() command allows one to extract data from the middle of a pipeline and save it to an environment.
  - This is not only useful for ohvbd workflows; it can be used in any base R pipeline (|>). It has not been tested in magrittr pipelines but should work as-is.
- New filter_db() command allows for extracting only one database’s results from hub searches.
- check_db_status() now returns (invisibly) whether all databases are up or not.
- New fetch_citation() and fetch_citation_* commands provide an interface to attempt to retrieve citations from a vectorbyte dataset.
  - By default, this may redownload parts or all of the data if the required columns are not currently present.
- New force_db() function enables one to force ohvbd to consider a particular object as having a particular provenance.
- New simplify argument to search_hub() makes hub searches return an ohvbd.ids object if only one database was searched for. This behaviour is on by default.
  - To match this, filter_db() will now transparently return ohvbd.ids objects if it gets them.
- New taxonomy argument to search_hub() allows for filtering searches by GBIF backbone IDs.
- New match_species() function allows for quick and flexible matching of species names to their GBIF backbone IDs.
- New match_country() function allows for matching of country names to WKT polygons via naturalearth.
- New ohvbd_db(), has_db(), and is_from() functions allow for quick testing of object provenance (according to ohvbd).
- New get_default_ohvbd_cache() function allows for custom functions that interface with cached ohvbd data files.
- New list_ohvbd_cache() and clean_ohvbd_cache() functions enable better interactive cache management.
  - As a result, clean_ad_cache() has been removed as it is now unnecessary.
- search_x_smart() functions can now take "tags" as a search field, enabling support for tagged datasets.
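The idea behind tee() (capturing an intermediate value from the middle of a pipeline while letting the pipeline carry on) can be sketched in base R. The helper below, tee_sketch(), is a hypothetical stand-in written for illustration only; it is not ohvbd's tee(), whose actual arguments may differ. It assumes R 4.1 or later for the native pipe.

```r
# Hypothetical stand-in for illustration only; this is NOT ohvbd's tee().
tee_sketch <- function(x, name, envir = globalenv()) {
  assign(name, x, envir = envir)  # save a copy of the intermediate value
  x                               # pass the value on unchanged
}

# A separate environment keeps the captured value out of the global workspace
stash <- new.env()

result <- c(3, 1, 2) |>
  sort() |>
  tee_sketch("sorted_values", envir = stash) |>
  rev()

stash$sorted_values  # the intermediate value captured mid-pipeline
result               # the final value after the remaining steps
```

Because the helper returns its input untouched, it can be added to or removed from a pipeline without changing the final result.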
Other:
- Entire code base is now continuously formatted using Air v0.7.1.
- Examples are no longer wrapped in \dontrun{} so they should be runnable from an installed version of the package.
- A good chunk of the functional logic of ohvbd is now covered with unit tests (using the vcr package).
- fetch_vd() no longer tries to retrieve ids with no pages of data.
- Functions that interface with vectorbyte databases no longer recommend using set_ohvbd_compat() as unexpected SSL errors should break pipelines by default.
  - These errors are no longer expected to occur when interfacing with vectorbyte.
- Running fetch() on an ohvbd.hub.search or glean() on an ohvbd.ids object now provides a hint that you may have forgotten something.
  - Occasionally users would forget a fetch() command and run search_hub() |> glean(), which didn’t previously give an interpretable error.
- Vignettes now use vcr to massively reduce their build time. This should only matter to developers of ohvbd, or users who download from GitHub and build the vignettes themselves.
- ohvbd.ids() now warns you and fixes the problem if you provide ids with duplicate values.
- glean_vt() and glean_vd() now force the inclusion of the dataset ID when filtering columns (using the cols argument).
  - This is intended to encourage you to preserve at least one means of retrieving citation data later.
- WKT parsing and formatting is now significantly more robust.
- Cached AREAData now includes the cache timestamp as an attribute rather than a separate variable in the cache file.
- glean_ad() now correctly returns a matrix even when there is only 1 row or column.
- gadm spatial files are now cached as GeoPackage rather than shapefiles, leading to a >50% speedup in loading! (Thanks to @josiah.rs on Bluesky for the suggestion!)
- fetch_vd_counts() is now significantly faster, more robust, and temporarily caches data.
  - You will see particular improvements if you are trying to retrieve more than about 10 ids in one go or if you are repeatedly running the same download code in the same day.
  - This speedup also applies to fetch_vd() under the hood, particularly if you are running it multiple times in a day.
- Explicit term checking (such as in fetch_ad() for metrics and search_vt_smart() for operators and fields) is now fuzzy, allowing for a small amount of deviation from the actual term name.
- assoc_ad() now tries to guess LatLong column names if none (or the wrong ones) are provided.
- Errors in internal functions now make it more clear which user-facing functions they originate from.
- Multiple functions now default to NULL rather than NA for missing values (except date arguments to AD-related functions, where NA is more reasonable).
- fetch_ad() now caches and tries to read from cache by default.
  - Unless exceedingly up-to-date data is required, this will be the best option for most people.
  - If you do require guaranteed new data, it’s worth setting refresh_cache = TRUE or use_cache = FALSE (depending on whether you want to replace your existing cache or not).
- All downloaders that can potentially cache data also attach the download time if not loading from cache.
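To give a feel for what fuzzy term checking means in practice, here is a minimal base R sketch using edit distance. The helper match_term_sketch(), its max_distance threshold, and the example term names are all made up for illustration; this is not ohvbd's internal implementation, and the real metric names accepted by fetch_ad() may differ.

```r
# Hypothetical helper for illustration; not ohvbd's internal implementation.
match_term_sketch <- function(term, valid_terms, max_distance = 2) {
  # Edit distance between the supplied term and each valid term (case-insensitive)
  distances <- utils::adist(tolower(term), tolower(valid_terms))[1, ]
  best <- which.min(distances)
  if (distances[best] > max_distance) {
    stop("'", term, "' does not closely match any known term.")
  }
  valid_terms[best]
}

# Made-up example terms, not the actual terms used by ohvbd
metrics <- c("temperature", "humidity", "precipitation")

match_term_sketch("temprature", metrics)  # small typo, still resolves to a valid term
match_term_sketch("Humidty", metrics)     # case and spelling deviations are tolerated
```

A threshold like max_distance keeps the matching forgiving for small typos while still raising an error for terms that are nowhere near a valid name.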
See changelog for patch notes for all versions.
