create-EML-from-template.Rmd
This article demonstrates how to create an EML document for a data package containing multiple data entities. To follow along with this example, please download the Banet-Example.
We will create a nested list from our metadata templates and then use the EML R package’s write_eml function to convert our list into a valid EML document.
Our EML file will contain the following elements:
- eml
  - access
  - dataset
    - creator
    - contact
    - associated parties
    - title
    - abstract
    - keyword set
    - license
    - methods
    - maintenance
    - project
    - coverage
    - data table
We have two main sections within our EML document: access and dataset.
For a more in-depth description of EML, please see the EML Specification.
To follow along with this example please download the files from the Banet-Example. Within the directory, there are two subdirectories (“data” and “metadata”) which contain all the necessary data and metadata to create a valid EML document using EMLaide.
At minimum, four files are needed to use our tools:
Because our data set is composed of multiple data entities, we create a dataframe in which each row represents a different data entity, with the following information:
- filepath
- attribute_info
- datatable_description
- datatable_url: the data file must be publicly accessible at this URL if you want to use EMLaide::evaluate_edi_package() or EMLaide::upload_edi_package() to evaluate or upload your EML document to EDI from R.
This dataframe is the input to the add_datatable function, which generates attribute metadata from the attribute_info files and physical information describing the datatable from the filepath and datatable_url information.
Example Dataframe Structure for datatable_metadata
datatable_metadata <-
  dplyr::tibble(filepath = c("data/enclosure-study-growth-rate-data.csv",
                             "data/enclosure-study-gut-contents-data.csv",
                             "data/microhabitat-use-data-2018-2020.csv",
                             "data/seining-weight-lengths-2018-2020.csv",
                             "data/snorkel-index-data-2015-2020.csv"),
                attribute_info = c("metadata/enclosure-study-growth-rates-metadata.xlsx",
                                   "metadata/enclosure-study-gut-contents-metadata.xlsx",
                                   "metadata/microhabitat-use-metadata.xlsx",
                                   "metadata/seining-weight-length-metadata.xlsx",
                                   "metadata/snorkel-index-metadata.xlsx"),
                datatable_description = c("Growth Rates - Enclosure Study",
                                          "Gut Contents - Enclosure Study",
                                          "Microhabitat Data",
                                          "Seining Weight Lengths Data",
                                          "Snorkel Survey Data"),
                datatable_url = paste0("https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/",
                                       c("enclosure-study-growth-rate-data.csv",
                                         "enclosure-study-gut-contents-data.csv",
                                         "microhabitat-use-data-2018-2020.csv",
                                         "seining-weight-lengths-2018-2020.csv",
                                         "snorkel-index-data-2015-2020.csv")))
Each row contains all the information needed to add one data entity to the dataset element of a data package. If you only have one datatable, keep this structure or use a named list with the same information.
filepath | attribute_info | datatable_description | datatable_url |
---|---|---|---|
data/enclosure-study-growth-rate-data.csv | metadata/enclosure-study-growth-rates-metadata.xlsx | Growth Rates - Enclosure Study | https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/enclosure-study-growth-rate-data.csv |
data/enclosure-study-gut-contents-data.csv | metadata/enclosure-study-gut-contents-metadata.xlsx | Gut Contents - Enclosure Study | https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/enclosure-study-gut-contents-data.csv |
data/microhabitat-use-data-2018-2020.csv | metadata/microhabitat-use-metadata.xlsx | Microhabitat Data | https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/microhabitat-use-data-2018-2020.csv |
data/seining-weight-lengths-2018-2020.csv | metadata/seining-weight-length-metadata.xlsx | Seining Weight Lengths Data | https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/seining-weight-lengths-2018-2020.csv |
data/snorkel-index-data-2015-2020.csv | metadata/snorkel-index-metadata.xlsx | Snorkel Survey Data | https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/snorkel-index-data-2015-2020.csv |
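Before passing this dataframe to add_datatable, it can help to confirm that every local file it references actually exists. The following is a minimal sketch using only base R. The helper name check_paths and the temporary demo files are hypothetical and are not part of EMLaide.

```r
# Hypothetical helper (not part of EMLaide): return the referenced local
# files that cannot be found on disk.
check_paths <- function(datatable_metadata) {
  paths <- c(datatable_metadata$filepath, datatable_metadata$attribute_info)
  paths[!file.exists(paths)]
}

# Demo with one temporary file that exists and one path that does not.
existing <- tempfile(fileext = ".csv")
writeLines("id,value", existing)
missing_file <- file.path(tempdir(), "no-such-metadata.xlsx")

demo_metadata <- data.frame(
  filepath = existing,
  attribute_info = missing_file,
  stringsAsFactors = FALSE
)

missing_paths <- check_paths(demo_metadata)
print(missing_paths)  # lists the metadata path that does not exist
```

The same call should work on the datatable_metadata tibble defined above, since it only reads the filepath and attribute_info columns.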
The following code loads “data-package-metadata.xlsx” and stores the paths to “abstract.docx” and “methods.docx”. Each sheet of the Excel workbook pertains to a different metadata element and will be the input to the add_[blank] functions used throughout this example.
excel_path <- "Banet-Example/metadata/data-package-metadata.xlsx"
sheets <- readxl::excel_sheets(excel_path)
metadata <- lapply(sheets, function(x) readxl::read_excel(excel_path, sheet = x))
names(metadata) <- sheets
abstract_docx <- "metadata/abstract.docx"
methods_docx <- "metadata/methods.docx"
In addition to these files, we will need a unique EDI data package identifier. We use the function reserve_edi_id to generate an EDI ID. You must already have an EDI account to do this.
edi_number <- reserve_edi_id(user_id = "your user id", password = "your user password")
You can also reserve this data package identifier on the EDI data repository under tools.
For this example, we will use the following identifier.
edi_number <- "edi.750.1"
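EDI package identifiers follow a scope.identifier.revision pattern, as in edi.750.1. The short base R sketch below checks that pattern and derives the output file name from the identifier so that the two stay in sync. The helper is_package_id and its regular expression are hypothetical and only approximate the allowed characters.

```r
# Hypothetical helper: TRUE when the id looks like "scope.identifier.revision".
is_package_id <- function(id) {
  grepl("^[a-z][a-z0-9-]*\\.[0-9]+\\.[0-9]+$", id)
}

edi_number <- "edi.750.1"
stopifnot(is_package_id(edi_number))

# Derive the EML file name from the identifier instead of typing it twice.
eml_file <- paste0(edi_number, ".xml")
eml_file
```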
We will use magrittr::%>% with our add_[blank] functions to append each EML element to a list.
The %>% is a pipe-like operator that passes the left-hand side as the first argument of the function on the right-hand side.
For details on appropriate inputs to the functions, see the documentation at ?add_[blank].
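To make the piping pattern concrete, here is a minimal sketch with a hypothetical add_note function standing in for the add_[blank] functions. It uses base R's native pipe |> (R 4.1 or later) rather than magrittr so that it runs without extra packages. For this simple first-argument case the two pipes behave the same way.

```r
# Hypothetical stand-in for an add_[blank] function: it takes the list being
# built as its first argument and returns it with one more element.
add_note <- function(parent_element, note) {
  parent_element$note <- note
  parent_element
}

# Piped form: the left-hand side becomes the first argument.
piped <- list() |> add_note("example note")

# Equivalent nested call.
nested <- add_note(list(), "example note")

identical(piped, nested)
```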
The add_method() and add_abstract() functions take in methods_docx and abstract_docx. The add_datatable() function takes in the datatable_metadata defined and described above. Apart from add_pub_date(), which is called without arguments here, every other function takes in one or more sheets from the metadata object. For template items with multiple rows, the add_[blank] functions map over each row and add a named nested list for each row to the dataset element.
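The row-by-row behavior can be pictured with a short base R sketch. The function add_people_sketch, its column names, and the placeholder names in the demo are all hypothetical and only illustrate the idea. The real add_[blank] functions have their own inputs and output structure.

```r
# Hypothetical illustration: turn each row of a template into a nested list
# and store the results on the parent list under one element name.
add_people_sketch <- function(parent_element, people) {
  rows <- lapply(seq_len(nrow(people)), function(i) {
    list(individualName = list(givenName = people$first_name[i],
                               surName = people$last_name[i]))
  })
  parent_element$creator <- rows
  parent_element
}

# Placeholder template rows (made-up names).
people <- data.frame(first_name = c("Jane", "John"),
                     last_name = c("Doe", "Smith"),
                     stringsAsFactors = FALSE)

dataset_sketch <- add_people_sketch(list(), people)
length(dataset_sketch$creator)  # one nested list per template row
```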
The code below adds all dataset elements.
dataset <- list() %>%
  add_pub_date() %>%
  add_title(metadata$title) %>%
  add_personnel(metadata$personnel) %>%
  add_keyword_set(metadata$keyword_set) %>%
  add_abstract(abstract_docx) %>%
  add_license(metadata$license) %>%
  add_method(methods_docx) %>%
  add_maintenance(metadata$maintenance) %>%
  add_project(metadata$funding) %>%
  add_coverage(metadata$coverage, metadata$taxonomic_coverage) %>%
  add_datatable(datatable_metadata)
When units aren’t standard, add_datatable() will give a message like the following:
"We identified the following custom unit: fishPerSchool , please make sure to add information on this custom unit in additional metadata information:"
We must formally define each of these custom units and add them to the
EML document as an additional metadata section.
The code below defines four custom units and uses the EML::set_unitList() function to format them into a unitList that can be added to our EML document.
custom_units <- data.frame(id = c("fishPerEnclosure", "thermal unit", "day", "fishPerSchool"),
                           unitType = c("density", "temperature", "dimensionless", "density"),
                           parentSI = c(NA, NA, NA, NA),
                           multiplierToSI = c(NA, NA, NA, NA),
                           description = c("Fish density in the enclosure, number of fish in total enclosure space",
                                           "thermal unit of energy given off of fish",
                                           "count of number of days that go by",
                                           "Number of fish counted per school"))
unitList <- EML::set_unitList(custom_units)
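One way to confirm that every custom unit reported by add_datatable() has a definition is to compare the reported names against the id column of the custom unit table. This is a base R sketch only. The vector reported_units is a hypothetical stand-in for the units named in the messages, and "fishPerNet" is a made-up unit included to show an undefined case.

```r
# Units named in the add_datatable() messages (hypothetical example values).
reported_units <- c("fishPerSchool", "fishPerEnclosure", "fishPerNet")

# Ids defined in the custom unit table from the example.
defined_units <- c("fishPerEnclosure", "thermal unit", "day", "fishPerSchool")

# Any reported unit without a definition still needs a row in custom_units.
undefined_units <- setdiff(reported_units, defined_units)
undefined_units
```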
The code below adds all of the elements we generated above and an access element into an eml list.
- add_access adds an access section at the beginning of our EML document. The add_access default is a public principal with read permission.
- The dataset list from above contains all elements of the dataset section of the EML. This includes the datatables, abstract, methods, and all the other metadata sections appended above.
- additionalMetadata contains the unitList that we generated to hold our custom units.
eml <- list(packageId = edi_number,
            system = "EDI",
            access = add_access(),
            dataset = dataset,
            additionalMetadata = list(metadata = list(unitList = unitList)))
Once all of our information is appended to our eml list, we can use the write_eml and eml_validate functions from the EML package to convert our list to EML and check its validity.
EML::write_eml(eml, "edi.750.1.xml")
EML::eml_validate("edi.750.1.xml")
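eml_validate returns a logical value. The sketch below assumes that, when validation fails, the messages are attached to that value as an "errors" attribute; please confirm this against the EML package documentation. It wraps the check in a hypothetical helper, report_validation, and demonstrates it on mocked results rather than a real EML file.

```r
# Hypothetical helper: stop with a readable message when validation fails.
# Assumes the result is a logical with an optional "errors" attribute.
report_validation <- function(result) {
  if (isTRUE(as.logical(result))) {
    return(invisible(TRUE))
  }
  errors <- attr(result, "errors")
  stop("EML is not valid:\n", paste(errors, collapse = "\n"), call. = FALSE)
}

# Mocked results standing in for EML::eml_validate("edi.750.1.xml").
valid_result <- TRUE
invalid_result <- structure(FALSE, errors = c("Element 'title' is missing."))

report_validation(valid_result)
outcome <- tryCatch(report_validation(invalid_result),
                    error = function(e) conditionMessage(e))
outcome
```

Stopping on an invalid document keeps a script from going on to evaluate or upload a file that did not validate.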
To evaluate your document in R using EDI’s EML Congruence Checker, you can use evaluate_edi_package(). To use this function, the text files for your data entities must be publicly accessible by URL. This URL must be added in the datatable_metadata section above. If you do not have a URL available, you can upload the EML document and the dataset on the EDI data portal.
evaluate_edi_package(user_id = "Your User Id", password = "Your password", eml_file_path = "edi.750.1.xml")
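Rather than typing credentials into a script, they can be read from environment variables. A minimal base R sketch follows. The variable names EDI_USER_ID and EDI_PASSWORD and the helper get_credential are hypothetical choices, not something EMLaide requires.

```r
# Hypothetical helper: read a required environment variable or stop.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set.", call. = FALSE)
  }
  value
}

# For demonstration only: set a placeholder value in this session.
Sys.setenv(EDI_USER_ID = "example-user")
user_id <- get_credential("EDI_USER_ID")

# The values could then be passed on, for example:
# evaluate_edi_package(user_id = user_id,
#                      password = get_credential("EDI_PASSWORD"),
#                      eml_file_path = "edi.750.1.xml")
```

In practice the variables would be set outside the script, for example in a user-level .Renviron file, so that the password never appears in version control.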