This document walks through the creation of an EML document based on example data and metadata.
The following libraries are needed to create a working EML document.
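Judging from the calls used later in this walkthrough (readxl::read_excel, purrr::pmap, read_csv, %>%, write_eml, and the EMLaide add_* helpers), a minimal setup sketch might look like the following. The package list is an assumption inferred from those calls, not an official dependency list.

```r
# Packages inferred from the calls used in this walkthrough (an assumption).
required_pkgs <- c("EMLaide", "EML", "readxl", "purrr", "readr", "magrittr")

# requireNamespace() returns FALSE instead of erroring when a package is absent.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before continuing: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything so unqualified calls such as read_csv() and %>% resolve.
  invisible(lapply(required_pkgs, library, character.only = TRUE))
}
```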
Our Example
We will use the data provided by John Hannon to show an example of how to compile metadata in a valid EML document that can be uploaded to the EDI data portal website. To follow along with this example please download the Hannon-Example zip folder.
Required Files
Four files must be uploaded to make a complete EML document. To follow along with this example, please load the following files from the Hannon-Example folder into your environment.
- An Excel spreadsheet containing the majority of the metadata
- A Word document containing the abstract text
- A Word document containing the methods text
- A file that contains the data (preferably a CSV if the data are tabular), along with the name of that file
excel_path <- "Hannon-Example/snorkel-index-example-metadata.xlsx"
sheets <- readxl::excel_sheets(excel_path)
metadata <- lapply(sheets, function(x) readxl::read_excel(excel_path, sheet = x))
names(metadata) <- sheets
abstract_docx <- "Hannon-Example/snorkel-index-example-abstract.docx"
methods_docx <- "Hannon-Example/snorkel-index-hannon-example-methods.docx"
dataset_file <- "Hannon-Example/snorkel-index-example-data.csv"
dataset_file_name <- "snorkel-index-example-data.csv"
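Because every later step depends on these four files, it can help to confirm they are where the code expects them. The sketch below repeats the paths defined above so it runs on its own; find_missing_files is a hypothetical helper, not part of EMLaide.

```r
# Paths as defined above, repeated so the sketch is self-contained.
input_files <- c(
  metadata = "Hannon-Example/snorkel-index-example-metadata.xlsx",
  abstract = "Hannon-Example/snorkel-index-example-abstract.docx",
  methods  = "Hannon-Example/snorkel-index-hannon-example-methods.docx",
  data     = "Hannon-Example/snorkel-index-example-data.csv"
)

# Hypothetical helper: returns the subset of paths that do not exist on disk.
find_missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

missing_files <- find_missing_files(input_files)
if (length(missing_files) > 0) {
  message("Missing input files: ", paste(missing_files, collapse = ", "))
}
```

If any path is reported, check that the zip folder was extracted into the working directory.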
EDI Number
In addition to these four files, an entity description and an EDI number must be defined. The EDI number should be a unique data package identifier. You can reserve this data package identifier on the EDI data repository under tools.
edi_number <- "edi.678.1"
It is also possible to use the EMLaide function reserve_edi_id() to generate an EDI number using R.
edi_number <- reserve_edi_id(user_id = "your user id", password = "your user password")
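The identifier used in this example, edi.678.1, has three dot-separated parts: a scope, an identifier number, and a revision number. A quick format check can catch typos before the identifier is written into the EML. This is a sketch; is_valid_edi_number is a hypothetical helper that only accepts the edi scope.

```r
# Hypothetical helper: TRUE when x looks like "edi.<identifier>.<revision>".
is_valid_edi_number <- function(x) {
  grepl("^edi\\.[0-9]+\\.[0-9]+$", x)
}

is_valid_edi_number("edi.678.1")  # TRUE
is_valid_edi_number("edi.678")    # FALSE: the revision number is missing
```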
EML Sections
We start with the simpler pieces of information such as personnel and title and work up to appending the more complex sections like dataTable. Sections will correspond with different sheets of metadata inputs from the metadata.xlsx document.
Access Permissions
The add_access function adds an access section at the beginning of our EML document. By default, add_access grants read permission to the public principal.
Publication Date
The publication date is simply added with the add_pub_date function. If no date is provided, it will automatically append the current date. This can be overridden by providing an input for date.
Title and Short Name
The add_title function allows you to append the title and short name of the dataset to the file.
- Title - Descriptive and between 7 and 20 words long
- Short Name - Fewer words than the title; it should give viewers a more accessible name for the dataset
This information can be added to the “title” tab in the Excel file “example-metadata.xlsx”.
title <- add_title(list(), title = metadata$title$title,
short_name = metadata$title$short_name)
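The 7 to 20 word guideline for the title is easy to check before building the EML. The sketch below uses a hypothetical helper, count_words, and a placeholder title; in practice you would pass metadata$title$title.

```r
# Hypothetical helper: counts whitespace-separated words in a string.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Placeholder title for illustration only; substitute metadata$title$title.
example_title <- "Snorkel index survey counts of juvenile salmonids in an example river"
n_words <- count_words(example_title)

if (n_words < 7 || n_words > 20) {
  message("Title has ", n_words, " words; aim for 7 to 20.")
}
```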
Keyword Set
The next item we will append is the keyword set. We can use the add_keyword_set function to do so. The keyword set should include a list of words that help identify your project or connect it with other similar projects.
Abstract
Next, we will use the add_abstract function to append the abstract of the dataset to your file. The abstract should give viewers a brief summary of what the data describe. This text does not go in the “example-metadata.xlsx” Excel file; it goes in the abstract Word document (loaded above as abstract_docx).
License and Intellectual Rights
Following the abstract, we will append the license and intellectual rights information. The add_license function allows you to append the licensing and usage information to your file. Information can be added to the “license” tab in the Excel file “example-metadata.xlsx”.
license <- add_license(list(), default_license = metadata$license$default_license)
Methods
The methods section explains the scientific methods that were used to collect the dataset. The add_method function allows you to add a methods file to the parent_element. An example methods document is provided in the Hannon-Example folder (loaded above as methods_docx). If more methods are needed, you can create separate sections in your Word document.
Maintenance
The maintenance section simply states whether data collection is complete or ongoing. The add_maintenance function allows you to append the status of the dataset to your file; complete and ongoing are the only inputs allowed. If the dataset is still in progress, the frequency at which it is updated must be provided as well. This information should be added to the “maintenance” tab in the Excel file “example-metadata.xlsx”.
maintenance <- add_maintenance(list(), status = metadata$maintenance$status,
update_frequency = metadata$maintenance$update_frequency)
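The two rules above (status is either complete or ongoing, and an ongoing dataset needs an update frequency) can be checked on the spreadsheet values before they are passed on. This is a sketch with a hypothetical helper, check_maintenance; it does not reproduce whatever validation add_maintenance performs internally.

```r
# Hypothetical helper mirroring the two rules described above.
check_maintenance <- function(status, update_frequency = NA) {
  if (!status %in% c("complete", "ongoing")) {
    stop("status must be 'complete' or 'ongoing', not '", status, "'")
  }
  if (status == "ongoing" &&
      (is.null(update_frequency) || is.na(update_frequency))) {
    stop("update_frequency is required when status is 'ongoing'")
  }
  invisible(TRUE)
}

check_maintenance("complete")
# "annually" is used here only as an example frequency value.
check_maintenance("ongoing", update_frequency = "annually")
```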
Project: Title, Personnel, and Funding
The project section should be appended next. Project personnel and project funding are nested in this project section. To generate a valid EML document, the project section must contain a project title, project personnel, and project funding.
Project personnel
The creator of the dataset is appended as the project_personnel. If a different person is associated with the project, please add that information to the “project” tab in the Excel file “example-metadata.xlsx”.
project_personnel <- personnel$creator[1:3]
Project funding
The add_funding function allows you to append both the description of the funding you have received and the organization you received the funding from. Multiple funders can be appended, and the information should be added to the “funding” tab in the Excel file “example-metadata.xlsx”.
# pmap() builds one funding entry per row; flatten() removes one level of nesting
award_information <- purrr::flatten(purrr::pmap(metadata$funding, add_funding))
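The pmap() call above treats each row of the sheet as one function call, matching column names to argument names. The sketch below shows that mechanism with a placeholder sheet and a stand-in function; describe_award and its two arguments are hypothetical and are not the actual signature of add_funding.

```r
library(purrr)

# Placeholder sheet: column names must match the function's argument names.
funding_sheet <- data.frame(
  funder_name = c("Agency A", "Agency B"),
  award_title = c("Monitoring", "Analysis"),
  stringsAsFactors = FALSE
)

# Stand-in for a row-wise builder; arguments are hypothetical.
describe_award <- function(funder_name, award_title) {
  paste(funder_name, "funded", award_title)
}

# pmap() calls describe_award() once per row, passing columns by name.
awards <- pmap(funding_sheet, describe_award)
```

One consequence of this design is that an extra or misspelled column in the spreadsheet tab raises an unused-argument error, so the tab headers need to match the function arguments exactly.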
Combining Project Elements
Once all the components of project have been defined we can use the add_project function to combine the sections in the proper formatting for an EML document.
project <- add_project(list(),
project_title = metadata$title$short_name,
award_information,
project_personnel)
Coverage: Geographic, Temporal, Taxonomic
Next, the coverage information is appended. The add_coverage function allows you to append full coverage information to your file. Temporal, geographic, and taxonomic coverage are all required elements to generate an EML document.
Taxonomic coverage
Taxonomic coverage can be appended using the add_taxonomic_coverage function. The taxonomic coverage information is added to the “example-metadata.xlsx” Excel file on the tab “taxonomic_coverage”. chinook, delta_smelt, steelhead, white_sturgeon, and green_sturgeon are all default options that can be selected using the drop-down menu under the CVPIA_common_species column.
taxonomic_coverage <- purrr::pmap(metadata$taxonomic_coverage, add_taxonomic_coverage)
Combining Coverage Elements
The add_coverage function will add all the elements of coverage to the parent element.
coverage <- add_coverage(list(),
geographic_description = metadata$coverage$geographic_description,
west_bounding_coordinate = metadata$coverage$west_bounding_coordinate,
east_bounding_coordinate = metadata$coverage$east_bounding_coordinate,
north_bounding_coordinate = metadata$coverage$north_bounding_coordinate,
south_bounding_coordinate = metadata$coverage$south_bounding_coordinate,
begin_date = metadata$coverage$begin_date,
end_date = metadata$coverage$end_date,
taxonomic_coverage = taxonomic_coverage)
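Swapped or out-of-range bounding coordinates are an easy mistake to make in the spreadsheet. The sketch below assumes the coordinates are numeric decimal degrees and uses a hypothetical helper, check_bounding_box, with placeholder values; it also assumes the study area does not cross the antimeridian, where west can legitimately exceed east.

```r
# Hypothetical helper: returns a character vector of problems (empty if none).
check_bounding_box <- function(west, east, north, south) {
  problems <- character(0)
  if (any(c(west, east) < -180 | c(west, east) > 180)) {
    problems <- c(problems, "longitude outside [-180, 180]")
  }
  if (any(c(north, south) < -90 | c(north, south) > 90)) {
    problems <- c(problems, "latitude outside [-90, 90]")
  }
  if (west > east) problems <- c(problems, "west coordinate exceeds east")
  if (south > north) problems <- c(problems, "south coordinate exceeds north")
  problems
}

# Placeholder coordinates for illustration only.
box_problems <- check_bounding_box(west = -122.5, east = -121.0,
                                   north = 40.5, south = 38.0)
```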
Data Table
Next, we need to create the data table. The data table element includes physical, attribute_list, and potentially additional information for spatial data. These sections are all lists which must be created first, appended to a dataTable, and then appended to the parent_element. For more detailed information on the dataTable element and instructions on appending multiple data tables, please see the dataset-specific instructions.
Physical
The physical element can be created first using the add_physical function. This will append the physical description of the data file. The file path of the data must be given as the dataset_file specified at the top of this document. If you wish to upload your data package to EDI directly from R, you must include a URL to access the dataset. You can do this by providing an input to the data_url argument.
physical <- add_physical(file_path = dataset_file, data_url = NULL)
Attribute List
Next, we use the add_attribute function to append all attributes to an attribute list. Please make sure you review what type of attribute you are providing and what inputs are necessary. These values can then be entered into the “attribute” tab in the “example-metadata.xlsx” Excel file. Every column in the dataTable must have a described attribute to pass the EDI congruence checker.
Enumerated vs. Non Enumerated Attributes
- Enumerated - If you are using a “nominal” or “ordinal” attribute which is “enumerated” (it has a specific code definition), you must also fill out the tab “code_definitions”. The code_helper() function below helps add codeDefinitions to the “enumerated” attributes.
- Non Enumerated - For non-enumerated attributes, the definition required by EML is simply the attribute_definition given in the spreadsheet.
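# Create helper function to add code definitions if domain is "enumerated"
# attribute_name is unused but must be accepted, because pmap() passes every
# column of the code_definitions sheet as a named argument
code_helper <- function(code, definitions, attribute_name) {
  list(code = code, definition = definitions)
}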
attribute_list <- list()
# Adds both enumerated and non enumerated attributes
adds_attribute <- function(attribute_name, attribute_definition, storage_type,
measurement_scale, domain, type, units,
unit_precision, number_type, date_time_format,
date_time_precision, minimum, maximum,
attribute_label){
# If statement adds definition for enumerated attribute using code_helper()
if (domain %in% "enumerated") {
definition <- list()
codes <- metadata$code_definitions
current_codes <- codes[codes$attribute_name == attribute_name, ]
definition$codeDefinition <- purrr::pmap(current_codes, code_helper)
# Else statement adds definition for non-enumerated attribute
} else {
definition <- attribute_definition
}
new_attribute <- add_attribute(attribute_name = attribute_name,
attribute_definition = attribute_definition,
storage_type = storage_type,
measurement_scale = measurement_scale,
domain = domain, definition = definition,
type = type, units = units,
unit_precision = unit_precision,
number_type = number_type,
date_time_format = date_time_format,
date_time_precision = date_time_precision,
minimum = minimum, maximum = maximum,
attribute_label = attribute_label)
}
# Maps through entire attribute sheet adding to attribute_list
attribute_list$attribute <- purrr::pmap(metadata$attribute, adds_attribute)
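Since every column in the data must have a matching described attribute, comparing the two sets of names before building the data table can save a failed evaluation later. The sketch below uses a hypothetical helper, compare_columns, with placeholder names.

```r
# Hypothetical helper: compares data columns with described attribute names.
compare_columns <- function(data_columns, attribute_names) {
  list(
    undescribed = setdiff(data_columns, attribute_names),  # in data, not in sheet
    unmatched   = setdiff(attribute_names, data_columns)   # in sheet, not in data
  )
}

# Placeholder names for illustration only. In practice, pass
# names(read.csv(dataset_file, check.names = FALSE)) and
# metadata$attribute$attribute_name.
column_check <- compare_columns(
  data_columns    = c("date", "location", "count"),
  attribute_names = c("date", "location", "species")
)
```

Both elements of the result should be empty before moving on; here "count" lacks a description and "species" has no matching column.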
If your dataset contains units that are not part of the EML schema you may add them as custom units. To do so please see the vignette on custom units.
Putting the Data Table Together
Now that we have the attribute_list and physical information, we can compose the data table, which is the last element we need to create our working EML file. If you wish to append multiple data tables to one data package, please view the detailed dataset instructions.
dataTable <- list(entityName = dataset_file_name,
entityDescription = metadata$dataset$name,
physical = physical,
attributeList = attribute_list,
numberOfRecords = nrow(readr::read_csv(dataset_file)))
Append all items to the dataset
Here we create a dataset list that contains all of the elements we generated.
dataset <- list(title = title$title,
shortName = title$shortName,
creator = personnel$creator,
contact = personnel$contact,
pubDate = pub_date,
abstract = abstract$abstract,
associatedParty = list(personnel[[3]], personnel[[4]], personnel[[5]]),
keywordSet = keywords$keywordSet,
coverage = coverage$coverage,
intellectualRights = license$intellectualRights,
licensed = license$licensed,
methods = methods,
project = project$project,
maintenance = maintenance$maintenance,
dataTable = dataTable)
Making the EML document
Now that we have all of the items appended to the dataset, we can add the dataset, id, and access to a new list named eml. The unique id number should match the one that you reserved on the EDI data portal; it is the edi_number defined at the top of this document.
eml <- list(packageId = edi_number,
system = "EDI",
access = access,
dataset = dataset)
The final step is to convert our eml list into the correct format. To do so, we can use the EML library’s write_eml function.
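A minimal sketch of the write step, using write_eml and eml_validate from the EML package. The file_name convention (the EDI number plus ".xml") is an assumption, and the small placeholder list stands in for the full eml list so that the sketch runs on its own.

```r
# Assumed output name; the evaluation step below refers to file_name.
edi_number <- "edi.678.1"
file_name <- paste0(edi_number, ".xml")

# Placeholder list for illustration; in the walkthrough, write the full
# `eml` list built above instead.
person <- list(individualName = list(givenName = "Example", surName = "Person"))
example_eml <- list(dataset = list(title = "Placeholder title",
                                   creator = person,
                                   contact = person))

if (requireNamespace("EML", quietly = TRUE)) {
  # Written to a temporary directory here to avoid cluttering the project.
  out_path <- file.path(tempdir(), file_name)
  EML::write_eml(example_eml, out_path)
  # eml_validate() returns TRUE or FALSE, with any schema errors attached
  # as an "errors" attribute.
  print(EML::eml_validate(out_path))
}
```

Validating locally checks the document against the EML schema only; it does not replace the EDI evaluation described next.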
Evaluation using the EDI Congruency Checker
To evaluate your document in R using the EDI congruence checker, you can use evaluate_edi_package(). To use this function, the data table must be reachable at a URL that the EDI repository can access. This URL must be added in the add_physical() section above. If you do not have a URL available, you can instead upload the EML document and the dataset on the EDI data portal.
evaluate_edi_package(user_id = "Your User Id", password = "Your password", eml_file_path = file_name)