NBI Digital Repository Documentation

Instructions to curate and upload research output supported by NBI

The Curation Process

Everything on this page is the responsibility of the NBI Wrangler. The curation process prepares research outputs for upload and creates a record of the files for later use by the NBI Uploader. Information from the FlattenedFauna project informed this protocol.

The three main sections on this page are:

  1. Data Embargoes: Working with the researcher to plan when their data is uploaded.
  2. Curating Data & Reports: Formatting the dataset and report so they meet open science best practices.
  3. Metadata: Recording the information that describes the dataset and report (title, authors, etc.). This is essential to make the files searchable and useful once uploaded to the Repository.

Data Embargoes

In an ideal world, a dataset would be ready for upload with the report. However, there are many reasons why a researcher may want to postpone uploading their data (or not submit data at all). Researchers have the option to submit their data with their report to NBI under an agreed-upon embargo, postponing upload to a future date. The default time period is two years. The benefit of embargoing, rather than just submitting data at some later point, is that NBI holds a copy of the data in case the researcher either loses the dataset or goes incommunicado. Additionally, timely submission of data means the NBI Wrangler can curate the dataset while the project is still fresh in everyone's mind. Work with the researcher to fill out the Data Plan.

The researcher has the following options:

  1. Submit data with report for upload.
  2. Submit data with report and agree to the default two-year embargo period for data upload, or arrange a different embargo timeline. The report itself is still submitted for upload.
  3. Submit a report only and agree to a custom data agreement.

Universities and institutions may have their own data policies that will complicate this. Ultimately, the data should be published somewhere. If a researcher uses their institutional repository or a third party repository, their report record in the NBI Digital Repository can point to the published data wherever it sits.

Here is an example: A PhD student hopes to publish a paper based on their data as part of their dissertation. They think it will be 3 years before publication. The student agrees to submit data with their report but have it embargoed for 3 years. At the end of their grant year, NBI receives the report and dataset. The report is uploaded to Zenodo immediately. The NBI Wrangler curates the dataset and puts it in a folder marked with the specified date of upload. In 3 years, NBI contacts the researcher to get final confirmation for data upload. At this point the researcher may send a more complete dataset or let NBI know the data has been published elsewhere. In the latter case, the NBI Uploader can add the DOI of the published data to the existing report record in the NBI Repository.
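The folder date in an example like this can be worked out with a few lines of Python. This is only a minimal sketch: the function name and the submission date are made up for illustration, and the only rule taken from this page is the default two-year embargo.

```python
from datetime import date

def embargo_upload_date(submitted: date, embargo_years: int = 2) -> date:
    """Return the date on which an embargoed dataset becomes due for upload."""
    try:
        return submitted.replace(year=submitted.year + embargo_years)
    except ValueError:
        # Submitted on Feb 29 and the target year is not a leap year: use Feb 28.
        return submitted.replace(year=submitted.year + embargo_years, day=28)

# Hypothetical submission date for the PhD student example (3-year embargo).
received = date(2021, 12, 31)
print(embargo_upload_date(received, embargo_years=3))  # 2024-12-31
```

Calling the function without `embargo_years` applies the default two-year period.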


Curating Data & Reports

Researchers should do most of this work (steps 1-4) since they know their data the best. However, the curator will need to check the dataset and make sure it meets the standards listed below. Here is an overview of the steps involved:

  1. Tidy the data: Make sure the data is in an appropriate open science format.
  2. Clean the data: Check for consistency and null cells.
  3. Create a data dictionary: List the variables (columns) in a new csv file.
  4. Examine each data column and fill in the data dictionary.
  5. Create metadata.
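The cleaning step (step 2) can be partly automated. The sketch below uses only the Python standard library to flag null cells and to spot values that differ only by capitalization or stray spaces. The list of null markers, the function names, and the sample data are assumptions for illustration, not part of the NBI protocol.

```python
import csv
import io

# Assumption: these spellings all count as an empty (null) cell.
NULL_MARKERS = {"", "na", "n/a", "null", "none"}

def find_null_cells(csv_text):
    """Return (row_number, column_name) for every empty or null-like cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column is None:
                # The row has more cells than the header has columns.
                problems.append((row_number, "(extra cell)"))
            elif value is None or value.strip().lower() in NULL_MARKERS:
                problems.append((row_number, column))
    return problems

def inconsistent_values(csv_text, column):
    """Group values in one column that differ only by case or surrounding spaces."""
    variants = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row.get(column) or ""
        variants.setdefault(value.strip().lower(), set()).add(value)
    return {key: forms for key, forms in variants.items() if len(forms) > 1}

# Made-up sample data with one blank cell, one "NA", and one inconsistent site name.
sample = "site,species,count\nPond A,earthworm,3\npond a ,,5\nPond B,earthworm,NA\n"
print(find_null_cells(sample))              # [(3, 'species'), (4, 'count')]
print(inconsistent_values(sample, "site"))
```

A check like this only points at cells to look at. Whether a blank is a true zero, a missing value, or an error is still a judgment call for the researcher.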

Here is an example of a cleaned and properly formatted dataset.

An excellent and mandatory read for anyone who works with data is Broman and Woo’s (2018) “Data Organization in Spreadsheets”.

Tidy Data

Tidy data is important because it is easy to read and facilitates analysis and summarization.

It has the following characteristics.


Cleaning Data

General

Georeference

Date and Time


Data Dictionary

Data dictionaries tell users what the column names mean, what the data values mean, and anything else that would help someone use the data wisely.
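As a starting point, the variable names can be pulled from the header row of each csv file in a package, giving one row per variable to describe. The sketch below is only an illustration: the column names `file`, `variable`, and `description` are placeholders and not the protocol's required data dictionary columns, and the file names and headers are invented.

```python
import csv
import io

def dictionary_skeleton(headers_by_file):
    """Build one row per (file, variable) pair, with the description left blank."""
    rows = []
    for file_name, header in headers_by_file.items():
        for variable in header:
            rows.append({"file": file_name, "variable": variable, "description": ""})
    return rows

def to_csv_text(rows):
    """Serialize the skeleton rows as csv text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file", "variable", "description"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical package with two data tables that share some variables.
headers = {
    "surveyData.csv": ["site", "date", "species", "count"],
    "trapData.csv": ["site", "date", "trapId", "count"],
}
print(to_csv_text(dictionary_skeleton(headers)))
```

The descriptions still have to be written by someone who knows the data. The script only saves the typing.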

Create a csv file in your favorite spreadsheet program. The data dictionary will define the variables used in all the data files uploaded as one package, under one DOI. If you have a dataset with more than one data table that each have many of the same variables (e.g. survey data and trap data for the same study subject), make each table its own csv file and list all the variables from each file in the data dictionary file. Add the following columns to the dataDictionary:


File Names and Types

File naming convention is the following:

When adding the year to a filename, use the year of publication rather than the year of the grant. This will make it easier for people to figure out how to cite it.

All file types should be ‘open’, meaning a user does not need proprietary software to open and use the file.

Data

CSV Comma Separated Values

CSV is a standard format for open data. Most spreadsheet programs (like Excel) will allow you to save a file as .csv. All formatting is stripped away. If you have the option, save as CSV UTF-8. This will store unusual characters used in any note or comment fields properly.
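If the data is prepared with a script rather than a spreadsheet program, the same result can be had by naming the encoding explicitly when writing the file. This is a minimal sketch using Python's standard csv module. The function name, file name, and sample row are invented, and the file is written to a temporary folder so the example leaves nothing behind.

```python
import csv
import os
import tempfile

def write_utf8_csv(path, header, rows):
    """Write rows to a csv file encoded as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "wormData.csv")
    # The note contains accented characters, a degree sign, and a comma.
    write_utf8_csv(path, ["site", "count", "notes"], [["Pond A", 3, "água parada, 12 °C"]])
    with open(path, newline="", encoding="utf-8") as handle:
        print(list(csv.reader(handle)))
```

Reading the file back with the same encoding shows the note field intact, including the comma inside it.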

TXT Text file

Data can also be saved as a text file as long as there is a way for a spreadsheet program to figure out how the data values are separated from each other.
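One way to check that a text file meets this requirement is to let a program guess the separator. The sketch below uses `csv.Sniffer` from the Python standard library. The function name, the list of candidate separators, and the sample text are assumptions for illustration.

```python
import csv
import io

def read_delimited_text(text, candidates="\t;|,"):
    """Guess the delimiter of a text table and return it with the parsed rows."""
    # Sniffer raises csv.Error if it cannot settle on a delimiter.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows

# Made-up tab-separated sample.
sample_text = "site\tspecies\tcount\nPond A\tearthworm\t3\nPond B\tearthworm\t5\n"
delimiter, rows = read_delimited_text(sample_text)
print(repr(delimiter), rows[0])
```

If the guess fails or comes back wrong, that is a good sign the file will also confuse a spreadsheet program and should be saved as csv instead.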

Reports

PDF

PDF is the preferred report format. Almost any word processor can export a document to the pdf format. There are online tools to combine several pdfs into one.

Code

This section in progress…

Images

Images can be uploaded to the repository. The format matters, though; for more information, see the Library of Congress digital preservation site. The Center for Digital Architecture also has a very readable summary of the various types. If you can, add your name to the image and upload a high quality version. A future researcher or journalist may request permission to reproduce your image, and keeping a high quality copy in the repository saves everyone from having to track one down later.

TIFF

TIFF is the preferred archival format, but only if you can create the original image as TIFF. If your original image is a JPEG, use JPEG or JPEG2000 rather than TIFF.

JPEG and JPEG2000

If your original image is JPEG, use JPEG or JPEG2000. JPEG2000 is better for long term preservation than JPEG.

Other File Types

PNG is also an option, but its preservation capabilities may be less than those of the other formats.


Metadata

Metadata is information about an object, like author, date created, etc. Good metadata is essential to making reports and datasets findable and usable. Metadata gives context to a resource and without it, that resource (the report and dataset) will never be used again or, even worse, never even accessed again. All your hard curation work will be for naught!

“To data users, good metadata is like summer rain after a long drought- it’s refreshing and you don’t know when you’ll see it again.” -the authors of this documentation :)

Filling Out the Metadata Form

The NBI Wrangler submits metadata related to the report and dataset to NBI via the Metadata Form. The fields are set up to match what is available on Zenodo. Once a year, the NBI Uploader (the committee member responsible for uploading to Zenodo) will use the information entered into the Metadata Form to fill in values on Zenodo.

Upload Type

Choose the upload type from the dropdown list. If the upload will include both a report and a dataset, use dataset.

Title

Paste in the title from the researcher report, if available. Otherwise, create a descriptive title. If the upload is only a dataset, begin the title with “Data for” followed by the title of the report or paper associated with the data.

Report Date

If there is a report as part of the upload, use the date of the report. If it has no date, use February 15 of the year following the grant year. For example, for a report on work done as part of a 2019 grant, the date would be 02-15-2020. If this is only a data file or code, use today’s date.
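The fallback rule for undated reports is simple enough to write down as a function. This is a small sketch, with a made-up function name, that returns the date in the same month-day-year format used in the example above.

```python
from datetime import date

def default_report_date(grant_year: int) -> str:
    """Return February 15 of the year after the grant year, as MM-DD-YYYY."""
    return date(grant_year + 1, 2, 15).strftime("%m-%d-%Y")

print(default_report_date(2019))  # 02-15-2020
```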

Authors

Include all authors in the format: last name1, first name1, affiliation1, ORCID1; last name2, first name2, affiliation2, ORCID2

If the author’s ORCID is unknown, leave blank.
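When there are many authors, the string can be assembled from a list rather than typed by hand. The sketch below is one way to do it. The function name and the people are invented, the ORCID shown is a placeholder value, and how a blank ORCID should look at the end of an entry is an assumption (here the field is simply left empty after the last comma).

```python
def format_authors(authors):
    """Join (last, first, affiliation, orcid) tuples into the Metadata Form format."""
    entries = []
    for last, first, affiliation, orcid in authors:
        # An unknown ORCID is passed as None or "" and left blank in the output.
        entries.append(", ".join([last, first, affiliation, orcid or ""]).rstrip())
    return "; ".join(entries)

# Hypothetical authors; the second one has no known ORCID.
authors = [
    ("Doe", "Jane", "Example University", "0000-0002-1825-0097"),
    ("Smith", "Sam", "Example College", None),
]
print(format_authors(authors))
# Doe, Jane, Example University, 0000-0002-1825-0097; Smith, Sam, Example College,
```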

Author contact email

This is especially important if the author chooses to embargo data and is also useful if errors are found in the report or data in the future.

Basic Description

Paste in the abstract of the researcher report and include any other relevant information.

File Description

List the name of each file being uploaded, including the file extension (e.g. wormData.csv). If necessary, include a description that would help a user know what the file contains. This is helpful when there are many files and the file names do not fully convey their contents.

This list will be used by the NBI Uploader to confirm they have the right files to upload. It will also be included in the Description field on the Repository.
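The confirmation step can be done by eye, but a short script can also compare the list from the form against the contents of the upload folder. This is a sketch only: the function name, folder layout, and file names are assumptions, and the example builds a temporary folder so it can be run as is.

```python
import tempfile
from pathlib import Path

def compare_files(listed_names, folder):
    """Return (missing, unexpected) file names, comparing the form's list to a folder."""
    present = {path.name for path in Path(folder).iterdir() if path.is_file()}
    listed = set(listed_names)
    return sorted(listed - present), sorted(present - listed)

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical upload folder holding two files.
    for name in ("wormData.csv", "wormReport.pdf"):
        (Path(folder) / name).write_text("placeholder")
    missing, unexpected = compare_files(["wormData.csv", "dataDictionary.csv"], folder)
    print(missing, unexpected)  # ['dataDictionary.csv'] ['wormReport.pdf']
```

Here `missing` holds files named on the form but absent from the folder, and `unexpected` holds files in the folder that the form never mentions.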

Subject

Enter the URL for the study subject (species studied) if one exists. Use the Global Biodiversity Information Facility to find the URL.

Kingdom

Choose the kingdom that the subject of the study is classified in. If the study focuses on subjects in more than one kingdom, add another kingdom as an additional keyword in the free text box at the end of the page. This will help users find all NBI studies related to plants or animals or fungi.

Group

This list is sourced from the original NBI list of grants. It is a ‘folksonomy’ and is useful for grouping research outputs into general subjects of study.

Study Type

Study type is how the collected data is analyzed or what it produces. Examples include: checklist, species survey, genetic analysis, species-area curve, etc.

Locations - Geography and Specific Location

General Geography

Specific Locations

Specific locations may include properties, generally accepted place names, property owner (e.g. Nantucket Conservation Foundation), or geographic feature (e.g. pond, hill, beach).

Methods

Choose two methods used to collect data in the study. If more than two methods were used, add the additional method keywords in the free text box at the end of the page. Methods are the ways data was acquired, not the way it was analyzed.

Additional Keywords

Examples of this could include ecosystem analysis, symbiotic relationships, diet analysis etc. Include important words that a user might search for. Add any information here that did not fit in the above fields.

Funding Keyword

This helps the NBI Uploader fill out the metadata properly on Zenodo. Choose one of the following: NBI Grant, Other Grant, Both or multiple, or None.

License

The recommended license is Creative Commons Attribution 4.0. This allows maximum reusability and requires users to cite the upload. For more information on the world of open data licenses see the specifications and technical details section of this site.

Embargoes

Enter the embargo information in this field. Note why the data is embargoed and what month and year it should be uploaded. Here is an example:

“Embargoed 3 years: upload in 2024. Researcher plans to do a follow up study and publish a paper.”

Related Identifiers

The dataset or report may be related to a research output that is already published and has a DOI. Related identifiers may point to a research output within the NBI Digital Repository on Zenodo or may point to an external source. For example, a researcher conducts a pilot study on Nantucket and NBI uploads the dataset and report to the repository. These likely have no related identifiers. The following year, the researcher conducts follow up research and collects a much larger dataset, subsequently publishing the results in a peer reviewed journal. The researcher submits the dataset to NBI for upload. That dataset would have two related identifiers: one for the pilot study and one for the peer reviewed publication.

Related Identifier Relationships

Here are the relationships that will likely be useful to NBI. Zenodo has a much longer list.

Related Identifier is | Use this relationship | Notes
Published paper that does not cite the uploaded data | is supplemented by this upload |
Published paper that cites the uploaded dataset | cites this upload | This is likely added after paper publication.
Data already published that this data is a subset of | has this upload as part | An example is beetle count data from pitfall trap data for a spider study. The published spider data and documentation act as the parent to this upload.
Data that is part of a pilot study or earlier study | is continued by this upload |
Published data used/cited in the current upload | is cited by this upload |

Grant Funding

Enter information about the grant or grants that funded this research in this format: Grant funder, grant name or grant number; Grant funder, grant name or grant number; etc. If this is NBI funded, enter the year of the grant. This helps the NBI Uploader enter the metadata properly.

Contributors

Contributors are those who helped significantly with data collection/analysis/management or the research project as a whole but are not authors. People in the Acknowledgments should not be automatically added here. However, you, as the NBI Wrangler, should be noted here (relationship = Related person) because you likely facilitated the work overall. If you oversaw the data curation of the dataset, include yourself as Data curator! Use the following format: last name1, first name1, affiliation1, relationship1; last name2, first name2, affiliation2, relationship2

The relationship is important. Use any of the following:

Contact Person, Data curator, Data collector, Data manager, Distributor,
Editor, Hosting institution, Other, Producer, Project leader,
Project manager, Project member, Registration agency, Registration authority, Related person,
Research group, Researcher, Rights holder, Sponsor, Supervisor,
Work package leader
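Because the relationship has to be one of the values above, it can help to check it before the contributor string goes into the Metadata Form. The sketch below builds the string and rejects any relationship not on the list. The function name and the people are invented, and matching without regard to capitalization is an assumption.

```python
ALLOWED_RELATIONSHIPS = [
    "Contact Person", "Data curator", "Data collector", "Data manager", "Distributor",
    "Editor", "Hosting institution", "Other", "Producer", "Project leader",
    "Project manager", "Project member", "Registration agency",
    "Registration authority", "Related person", "Research group", "Researcher",
    "Rights holder", "Sponsor", "Supervisor", "Work package leader",
]

def format_contributors(contributors):
    """Join (last, first, affiliation, relationship) tuples, checking each relationship."""
    allowed = {name.lower(): name for name in ALLOWED_RELATIONSHIPS}
    entries = []
    for last, first, affiliation, relationship in contributors:
        key = relationship.strip().lower()
        if key not in allowed:
            raise ValueError(f"Unknown relationship: {relationship!r}")
        # Use the spelling from the list above, whatever capitalization was typed.
        entries.append(", ".join([last, first, affiliation, allowed[key]]))
    return "; ".join(entries)

# Hypothetical contributors.
contributors = [
    ("Doe", "Jane", "Example University", "data curator"),
    ("Smith", "Sam", "NBI", "Related person"),
]
print(format_contributors(contributors))
# Doe, Jane, Example University, Data curator; Smith, Sam, NBI, Related person
```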