Create EML Document from Template • EMLaide

This document walks through the creation of an EML document based on example data and metadata.

The following libraries are needed to create a working EML document.

library(EMLaide)
library(tidyverse)
library(readxl)
library(EML)

Ecological Metadata Language(EML) Schema Outline

We use the EML package to generate the EML document based on our metadata inputs. The EML package write_eml() function takes in a nested list and formats the list into an EML document. To create valid EML the nested list must:

Contain all required EML fields
nest all fields under the correct sub sections

We have two main sections within our EML document:

Access - The Access Section defines the access for the entire resource (entire EML document)
Dataset - The Dataset Resource contains the descriptive metadata information on a data set. A data set can contain one or more data entities. A data entity is a tabular data table or a vector or raster image.

The Dataset element contains all of the additional metadata information. The Dataset element is composed of nested lists. Some information is at the top level and is applicable to all elements in the Dataset but other information is nested within subsections and is metadata specific to that subsection.

For a more in depth description of EML please see EML Specification.

In this document, we can quickly create an EML file which has the opportunity to append the following elements. For more detailed instructions on a specific section please see section specific articles.

- eml 
  -access
  -dataset
    - creator
    - contact
    - associated parties
    - title  
    - abstract 
    - keyword set  
    - license 
    - methods 
    - maintenance 
    - project
    - coverage 
    - data table

Our Example

We will use the data provided by John Hannon to show an example of how to compile metadata in a valid EML document that can be uploaded to the EDI data portal website. To follow along with this example please download the Hannon-Example zip folder.

Required Files

Four files must be uploaded to make a complete EML document. To follow along with this example please load the following files from the Hannon-Example folder into your environment.

An excel spreadsheet containing the majority of the metadata
A word document containing the abstract text
A word document containing the methods text
A file that contains the data (preferably csv if tabular data) and the name of the file

excel_path <- "Hannon-Example/snorkel-index-example-metadata.xlsx"

sheets <- readxl::excel_sheets(excel_path)
metadata <- lapply(sheets, function(x) readxl::read_excel(excel_path, sheet = x))
names(metadata) <- sheets

abstract_docx <- "Hannon-Example/snorkel-index-example-abstract.docx"

methods_docx <- "Hannon-Example/snorkel-index-hannon-example-methods.docx"

dataset_file <- "Hannon-Example/snorkel-index-example-data.csv"

dataset_file_name <- "snorkel-index-example-data.csv"

EDI Number

In addition to these four files a entity description and an EDI number must be defined. The EDI number should be a unique data package identifier. You can reserve this data package identifier number on the EDI data repository under tools.

edi_number <- "edi.678.1"

It is also possible to use the EMLaide function reserve_edi_id() to generate a EDI number using R.

edi_number <- reserve_edi_id(user_id = "your user id", password = "your user password ")

EML Sections

We start with the simpler pieces of information such as personnel and title and work up to appending the more complex sections like dataTable. Sections will correspond with different sheets of metadata inputs from the metadata.xlsx document.

Access Permissions

The add_access adds an access section at the beginning of our EML document. The add_access default is public principal with a read permission.

access <- add_access()

Publication Date

The publication date is simply added with the add_pub_date function. If no date is provided, it will automatically append the current date. This can be overwritten by providing an input for date.

pub_date <- add_pub_date(list())

Personnel: Creator, Contact, and Associated Parties.

The add_personnel function allows you to append information on personnel involved in this dataset. EDI schema allows for addition of personel taking on the roles of:

Creator - any person or organization who is responsible for the creation of the data
Contact - individual to contact with questions about data (often the creator)
Associated Party - any person or organization which is associated to the dataset

Multiple creators, associated parties, and a mixture of both can be appended to the EML file by providing all inputs in the “personnel” tab in the “example-metadata.xlsx” excel file.

Below we create a personnel list. We append each person using an adds_person() helper function that provides inputs and calls on our add_personnel() function to add each person individually to the personnel list.

personnel <- list()
adds_person <- function(first_name, last_name, email, role, organization, orcid) {
  personnel <- add_personnel(personnel, first_name, last_name, email,
                                  role, organization, orcid)
}
personnel <- purrr::pmap(metadata$personnel, adds_person) %>% flatten()

Title and Short Name

The add_title function allows you to append the title and short name of the dataset to the file.

Title - Descriptive and between 7 and 20 words long
Short Name - less words than title and should give viewers a more accessible name to the dataset

This information can be added to the “title” tab in the excel file “example-metadata.xlsx”.

title <- add_title(list(), title = metadata$title$title,  
                   short_name = metadata$title$short_name)

Keyword Set

The next item we will append is the keyword set. We can use the add_keyword_set function to do so. The keyword set should include a list of words that help identify your project or connect it with other similar projects.

keywords <- add_keyword_set(list(), metadata$keyword_set[,1])

Abstract

Next, we will use the add_abstract function to append the abstract of the dataset to your file. The abstract should include basic information on the dataset that gives a brief summary to the viewers of what they are observing from the data. This information will not be added to the “example-metadata.xlsx” excel sheet, but rather the “hannon_example_abstract.docx”.

abstract <- add_abstract(list(), abstract = abstract_docx)

License and Intellectual Rights

Following the abstract, we will append the license and intellectual rights information. The add_license function allows you to append the licensing and usage information to your file. Information can be added to the “license” tab in the excel file “example-metadata.xlsx”.

license <- add_license(list(), default_license = metadata$license$default_license)

Methods

The method section explains the scientific methods that were used in the collection of the dataset. The add_method function allows you to add a method file to the parent_element. A template methods document is given (“Hannon-Example/hannon_example_methods.docx”). If more methods are needed, you can create separate sections in your word document.

methods <- add_method(list(), methods_file = methods_docx)

Maintenance

The maintenance of a dataset is simply if the data collection is complete or ongoing. The add_maintenance function allows you to append the status of the dataset to your file and the inputs of complete and ongoing are the only ones allowed. If the dataset is still in progress, the frequency of which it is updated must be provided as well. This information should be added to the “maintenance” tab in the excel file “example-metadata.xlsx”.

maintenance <- add_maintenance(list(), status = metadata$maintenance$status,
                               update_frequency = metadata$maintenance$update_frequency)

Project: Title, Personnel, and Funding

The project section should be appended next. Project personnel and project funding are nested in this project section. To generate a valid EML doc the project section must contain a project title, a project personnel, and project funding.

Project personnel

The creator of the dataset is appended as the project_personnel. If a different person is associated with the project please add that information to the “project” tab in the excel file “example-metadata.xlsx”.

project_personnel <- personnel$creator[1:3]

Project funding

The add_funding function allows you to append both the description of the funding you have received as well as the organization you received the funding from. Multiple funders can be appended and information should be added to the “funding” tab in the excel file “example-metadata.xlsx”.

award_information <- purrr::pmap(metadata$funding, add_funding) %>% flatten()

Combining Project Elements

Once all the components of project have been defined we can use the add_project function to combine the sections in the proper formatting for an EML document.

project <- add_project(list(), 
                       project_title = metadata$title$short_name, 
                       award_information,
                       project_personnel)

Coverage: Geographic, Temporal, Taxonomic

Next, the coverage information is appended. The add_coverage function allows you to append full coverage information to your file. Temporal, Geographic, and Taxonomic Coverage are all required elements to generate an EML document.

Taxonomic coverage

Taxonomic coverage can be appended using the add_taxonomic_coverage function. The taxonomic coverage information is added to the “example-metadata.xlsx” excel file on the tab “taxonomic_coverage”. chinook, delta_smelt, steelhead, white_sturgeon, and green_sturgeon are all default options that can be selected using the drop down menu under the CVPIA_common_species column.

taxonomic_coverage <- purrr::pmap(metadata$taxonomic_coverage, add_taxonomic_coverage)

Combining Coverage Elements

The add_coverage function will add all the elements of coverage to the parent element.

coverage <- add_coverage(list(),
                         geographic_description = metadata$coverage$geographic_description,
                         west_bounding_coordinate = metadata$coverage$west_bounding_coordinate,
                         east_bounding_coordinate = metadata$coverage$east_bounding_coordinate,
                         north_bounding_coordinate = metadata$coverage$north_bounding_coordinate,
                         south_bounding_coordinate = metadata$coverage$south_bounding_coordinate,
                         begin_date = metadata$coverage$begin_date,
                         end_date = metadata$coverage$end_date,
                         taxonomic_coverage = taxonomic_coverage)

Data Table

Next, we need to create the data table. The data table element includesphysical, attribute_list, and potentially additional information for Spatial Data. These sections are all lists which must be created first, appended to a dataTable, and then appended to the parent_element. For more detailed information on the datatable element and instructions on appending multiple datatables please see the dataset specific instructions.

Physical

The physical element can be created first using the add_physical function. This will append the actual information of the data. The file path of the data must be given as the dataset_file specified at the top of this document. If you wish to upload your datapackage to EDI directly from R you must include a URL to access the dataset. You can do this by providing an input to the data_url argument.

physical <- add_physical(file_path = dataset_file, data_url = NULL)

Attribute List

Next, we use the add_attribute function to append all attributes to an attribute list. Please make sure you review what type of attribute you are providing and what inputs are necessary. These values can then be inputted into the “attribute” tab in the “example-metadata.xlsx” excel file. Every single column in the dataTable must have a described attribute to match EDI congruence checker.

Enumerated vs. Non Enumerated Attributes

Enumerated - If you are using a “nominal” or “ordinal” attribute which is “enumerated”, (it has a specific code definition), you must also fill out the tab “code_definitions”. The code_helper() function below helps adding codeDefinitions to the “enumerated” attributes.
Non Enumerated - For Non Enumerated attributes the definition required by EML is simply the attribute_defintion given in the spreadsheet.

# Create helper function to add code definitions if domain is "enumerated"
code_helper <- function(code, definitions, attribute_name) {
  codeDefinition <- list(code = code, definition = definitions)
}

attribute_list <- list()
# Adds both enumerated and non enumerated attributes 
adds_attribute <- function(attribute_name, attribute_definition, storage_type, 
                           measurement_scale, domain, type, units,
                           unit_precision, number_type, date_time_format, 
                           date_time_precision, minimum, maximum,
                           attribute_label){
  # If statement adds definition for enumerated attribute using code_helper()
  if (domain %in% "enumerated") { 
    definition <- list()
    codes <- metadata$code_definitions
    current_codes <- codes[codes$attribute_name == attribute_name, ] 
    definition$codeDefinition <- purrr::pmap(current_codes, code_helper) 
  # Else statement adds definition for non-enumerated attribute 
  } else {
    definition = attribute_definition
  }
  new_attribute <- add_attribute(attribute_name = attribute_name, 
                                 attribute_definition = attribute_definition,
                                 storage_type = storage_type,
                                 measurement_scale = measurement_scale, 
                                 domain = domain, definition = definition, 
                                 type = type, units = units, 
                                 unit_precision = unit_precision, 
                                 number_type = number_type, 
                                 date_time_format = date_time_format, 
                                 date_time_precision = date_time_precision, 
                                 minimum = minimum, maximum = maximum, 
                                 attribute_label = attribute_label)
}
# Maps through entire attribute sheet adding to attribute_list
attribute_list$attribute <- purrr::pmap(metadata$attribute, adds_attribute)

If your dataset contains units that are not part of the EML schema you may add them as custom units. To do so please see the vignette on custom units.

Putting the Data Table Together

Now that we have the attribute_list and physical information, we can compose the data table, which is the last element we need to create our working EML file. If you wish to append multiple datatables to one datapackage please view the detailed dataset instructions.

dataTable <- list(entityName = dataset_file_name,
                 entityDescription = metadata$dataset$name,
                 physical = physical,
                 attributeList = attribute_list, 
                 numberOfRecords = nrow(read_csv(dataset_file)))

Append all items to the dataset

Here we create a datasest list that contains all of the elements we generated.

dataset <- list(title = title$title,
                shortName = title$shortName,
                creator = personnel$creator,
                contact = personnel$contact,
                pubDate = pub_date,
                abstract = abstract$abstract,
                associatedParty = list(personnel[[3]], personnel[[4]], personnel[[5]]),
                keywordSet = keywords$keywordSet,
                coverage = coverage$coverage,
                intellectualRights = license$intellectualRights,
                licensed = license$licensed,
                methods = methods,
                project = project$project, 
                maintenance = maintenance$maintenance,
                dataTable = dataTable)

Making the EML document

Now that we have all of the items appended to the dataset we can add the dataset, id, and access to a new list named eml. The unique id number should match one that you reserved on the EDI data portal. Input this at the top of the document where you define the data tables.

eml <- list(packageId = edi_number,
            system = "EDI",
            access = access,
            dataset = dataset)

The final step is to convert our eml list into the correct format. To do so, we can use the EML library’s write_eml function.

file_name <- paste(edi_number, "xml", sep = ".")
EML::write_eml(eml, file_name)
eml_validate(file_name)

Evaluation using the EDI Congruency Checker

To evaluate your document in R using the EDI congruence checker you can use evaluate_edi_package(). To use this function you must have the data table accessible by a URL that can be accessed through the EDI repository. This URL must be added in the add_physical() section above. If you do not have a URL available then you can upload the EML document and the dataset on the EDI data portal.

evaluate_edi_package(user_id = "Your User Id", password = "Your password", eml_file_path = file_name)