ASP Data Policy
1. Data Policy Objectives and Scope
The objective of the U.S. Department of Energy Atmospheric Science Program (ASP) Data Policy is to maximize the scientific return from multi-investigator field campaigns by requiring the timely submission of data to an archive, setting standards for data formats, and maintaining a Data Catalog. Our goals are to widen the audience of potential end-users and to foster collaborations among campaign participants and with outside users. The Data Policy aims to accomplish these objectives in a way that is not unduly burdensome on individual investigators.
The Data Policy described here applies to measurements made during multi-investigator field campaigns. It is meant to cover investigators who receive support from ASP or who are receiving in-kind support such as the use of platforms or facilities.
2. Site and Experiment Specific Data Protocol
The ASP Data Policy is a generic document covering all field campaigns. An additional level of detail is required to take into account specific features of individual field campaigns or instruments. Decisions that affect all field campaign participants, such as whether campaign data will be submitted to an outside archive, will be made by the ASP steering committee. It will be the responsibility of the PIs (hereinafter called Site Coordinators) who are coordinating the activities at an individual site, network of sites, or platform to arrive at a Data Protocol that is site specific. Site Specific Data Protocols will be written in collaboration with the researchers who will be generating data and will specify 1) what data is to be archived, 2) data format, 3) archive location, and 4) modifications to the canonical data delivery schedule.
The Site Specific Data Protocol will also identify supplemental data products and specify how they are to be accessed if they are not part of a data archive. Supplemental data products are those that are not in ASCII format or products that are less processed versions of the data that is archived, as described in Item 9.
Some investigators may find themselves obligated to submit data to more than one archive. Provision can be made in the Site Specific Data Protocol for a link to an outside archive in place of providing data to a dedicated ASP archive.
3. Exploratory Measurements
There is an instrument category identified as "under development", for which there is no a priori expectation that a useful data set will result. Field campaigns are a useful testing ground for new technology. However, if a useful data set can be produced, the investigator is obligated to share it in a timely fashion.
4. Timeline
Prior to a field campaign, each investigator must agree to the basic ASP Data Policy plus site and experiment specific features identified in the Specific Data Protocol.
Data sharing during a field campaign will be dictated by the necessities of coordinating multiple activities and maximizing scientific return, e.g., directing aircraft flights on the basis of chemical or meteorological measurements. Such activities are not covered by this Data Policy. There is a general obligation to share information among campaign participants as it becomes available, but there will not be a formal procedure for submitting first-look data to a central archive or web page.
A draft data set should be submitted by individual PIs within 6 months of the end of the campaign. In the case of an instrument malfunction, a determination should be made within these 6 months that the data set is not useful in full or in part. Within 12 months of the end of the campaign a final data set should be submitted. Modifications to the data delivery schedule can be made by prior agreement, but such exceptions are, in general, reserved for cases where data reduction is particularly time consuming.
It is expected that analysis activities that occur over longer time scales will uncover problems with the "final" data set. Better algorithms may also result in changes to the "final" data. It is a continual obligation of the PI to submit revised data sets as long as they are working with the data and uncovering new problems or improvements.
5. Data Access
Data sets will be password protected for 18 months from the date of the experiment. All campaign participants will have access.
6. Tracking
It is the responsibility of the Site Coordinators to track the progress of the individual data producers during the 6 and 12 month data submission periods.
7. Data Catalog
The DOE Chief Scientist for a field program or a designated Data Manager will be responsible for maintaining a Data Catalog on the ASP web site that provides summary information, accessible to the public. This Catalog will include a comprehensive list of measurements, responsible parties for each instrument (with contact information), comments on data acquisition and quality, and links to the archive containing the data set. The Data Catalog will include references to supplemental material which may not be in an archive. The Catalog will also indicate the submission dates for preliminary and final data sets, as well as the dates of any subsequent revisions and a brief summary of what was revised.
8. Intellectual Property
The common good is served by sharing all of the data collected during a field campaign. However, the right to use this data must be balanced by a responsibility to the data providers that respects their efforts and intellectual property. ASP investigators and outsiders (subject to password protection before 18 months) can browse the data archive freely. Substantive use of data from the archive should require the following steps: 1) notification of the data provider and discussion of features of that data set that might not be apparent to an end-user, 2) feedback from the end-user regarding data features (such as consistency with other measurements or model calculations) that are uncovered during analysis, and 3) offer of co-authorship for publications that make use of that data set. We leave the term "substantive" undefined. Whether or not all of these steps are followed should depend on how routine a measurement is and how crucial (or prominent) it is to the end-user's study.
Data from an archive should be cited following instructions in the metadata. Investigators who receive direct or in-kind support from DOE should provide an acknowledgement in presentations and publications. The usual rules and conventions apply to citing published data.
9. Data Archive
Data will not be archived at a central location. The decision as to where to archive a particular data set will be part of the Site Specific Data Protocol. There are advantages to consolidating multiple data streams into a single file. This works well for time series data from co-located instruments. It is part of the overall Data Policy that this level of integration be continued.
Site Specific Data Protocols will distinguish between two types of data: primary data, which is always in ASCII format, and supplemental data, which may have an alternate format. Supplemental data is what is left over when a primary ASCII data set is produced. This definition explicitly recognizes that choices have to be made in converting instrument readings to information that is useful to a broad community. Examples of supplemental material are 1) high frequency data from which a lower frequency version has been extracted for the archive, and 2) multi-dimensional data such as mass spectra and single-particle results that cannot be easily manipulated outside of their native acquisition or analysis software. Data archives will contain all primary data sets plus supplemental material as agreed on in the Site Specific Data Protocols.
10. Data Format
A common, universally readable file structure is essential. Primary data must be submitted to a data archive as ASCII flat files. Files should be intelligible in a stand-alone fashion without additional software. Information in the file name that is used to identify date, time, location, etc., must be repeated within the body of the file, either as part of the data set or as a header. Data files should be self-documented, with metadata that describes the measurements in enough detail for the end user. Placing this information at the beginning of each file ensures that it is accessed at the same time as the data. In each file, all columns must be labeled and units provided.
Data should be tab delimited, fixed-format to allow files to be read by spreadsheet/graphing/analysis software and to be ingested by Fortran programs. The DOS convention of CR/LF terminated files is strongly encouraged.
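The file layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `#` prefix on header lines, the function name `write_flat_file`, and the sample site and column names are hypothetical and are not prescribed by this policy.

```python
import csv
import io


def write_flat_file(stream, metadata, columns, rows):
    """Write header metadata, one labeled column row, then tab-delimited data.

    metadata: list of (key, value) pairs, written as '# key: value' lines
              (the '#' prefix is an assumed convention).
    columns:  list of (name, unit) pairs; units are appended in parentheses.
    rows:     list of row lists; None is written as an empty field.
    """
    for key, value in metadata:
        stream.write(f"# {key}: {value}\r\n")
    # Tab delimiter and CR/LF line endings, as the policy encourages.
    writer = csv.writer(stream, delimiter="\t", lineterminator="\r\n")
    writer.writerow([f"{name} ({unit})" for name, unit in columns])
    for row in rows:
        writer.writerow(["" if value is None else value for value in row])


# Hypothetical example content, written to an in-memory buffer.
buffer = io.StringIO()
write_flat_file(
    buffer,
    metadata=[("Site", "Example Site"), ("Time base", "UTC")],
    columns=[("Date", "yyyy-mm-dd"), ("Time", "hh:mm:ss"), ("O3", "ppbv")],
    rows=[["2006-03-05", "07:04:00", 41.2], ["2006-03-05", "07:04:10", None]],
)
text = buffer.getvalue()
```

When writing to a real file rather than a buffer, opening it with `open(path, "w", newline="")` keeps Python from altering the CR/LF line endings.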
10.1 Conventions and Units
Date and Time: Rigid standardization on UTC eliminates uncertainties and inconsistencies between researchers. Date and time will be reported in two columns using ISO 8601 format, namely yyyy-mm-dd and hh:mm:ss(.ss) to specify the beginning of the interval being reported. Fixed width formatting of date/time to include leading zeros is mandatory. The (.ss) is optional. This format is recognizable by most software and is easily and unambiguously read (as opposed to Julian day, seconds from some zero time, hours, or other application-specific formats). This date/time convention allows easy sorting into chronological order.
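As a minimal sketch of the date and time convention, the Python below converts a timestamp to UTC and returns the two fixed-width columns. The function name `format_utc` and the choice to truncate (rather than round) the optional hundredths of a second are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_utc(moment, fractional=False):
    """Return (date, time) strings in UTC for the start of an interval.

    moment must be timezone-aware; it is converted to UTC before formatting.
    """
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    date_text = utc.strftime("%Y-%m-%d")  # yyyy-mm-dd with leading zeros
    time_text = utc.strftime("%H:%M:%S")  # hh:mm:ss with leading zeros
    if fractional:
        # Optional (.ss): hundredths of a second, truncated (an assumption).
        time_text += f".{utc.microsecond // 10000:02d}"
    return date_text, time_text


# Example: 01:04:09.25 at UTC-6 corresponds to 07:04:09.25 UTC.
local = datetime(2006, 3, 5, 1, 4, 9, 250000, tzinfo=timezone(timedelta(hours=-6)))
date_text, time_text = format_utc(local, fractional=True)
```

Because both columns are fixed width with leading zeros, sorting rows as plain text on the date and time columns also sorts them chronologically.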
Time intervals: For continuous, regular, time-series data, time intervals should be monotonically increasing and uniformly spaced at whatever interval is appropriate (1-s, 1-min, 1-h, etc.). There should be no gaps for time series data in the date and time columns. Different time bases could be appropriate for different signals. For time series data from discrete (time integrated) samples, the start time and stop time of each sampling interval are required. A time mid-point is also needed if the arithmetic mean of the start and stop time is not the mean time for data acquisition. An optional data format that makes it easier to merge continuous and integrated data is to provide the integrated samples on a high frequency continuous time base (e.g., every second) by repeating the integrated result for all time points within the collection interval.
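The two ideas in the time interval convention, a gap-free uniform time base and integrated samples repeated across their collection interval, can be sketched as below. The function names and the half-open interval rule (a time point belongs to a sample when start <= t < stop) are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def uniform_time_base(start, stop, step):
    """Yield times from start (inclusive) to stop (exclusive) at a fixed step."""
    current = start
    while current < stop:
        yield current
        current += step


def fill_gaps(measured, start, stop, step):
    """Return one (time, value) row per step; times with no data get None."""
    return [(t, measured.get(t)) for t in uniform_time_base(start, stop, step)]


def expand_integrated(samples, start, stop, step):
    """Repeat each integrated result on every time point in its interval.

    samples: list of (sample_start, sample_stop, value) tuples.
    A time point t is assigned to a sample when sample_start <= t < sample_stop.
    """
    rows = []
    for t in uniform_time_base(start, stop, step):
        value = None
        for sample_start, sample_stop, sample_value in samples:
            if sample_start <= t < sample_stop:
                value = sample_value
                break
        rows.append((t, value))
    return rows
```

A `None` value here stands for a field that would be left empty when the row is written to the ASCII file.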
Altitude: Meters. Height above msl preferred.
Aerosol concentration:
Aerosol particle size:
Hydrocarbon mixing ratio: By compound (not by C atom). ppbv preferred.
Trace gas mixing ratio: ppbv preferred for most species.
Position: Decimal degrees. West longitude specified as negative number for correct left to right orientation on graphs. Degrees South specified as negative numbers.
Direction: Angle from 0 to < 360 deg., with 0 being north, increasing clockwise.
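The position and direction conventions can be illustrated with two small Python helpers. The names `signed_degrees` and `normalize_direction` are hypothetical, and the sketch assumes the input coordinate arrives as a magnitude plus a hemisphere letter.

```python
def signed_degrees(value, hemisphere):
    """Convert a coordinate magnitude plus hemisphere letter to signed decimal degrees.

    West longitudes and south latitudes are returned as negative numbers.
    """
    letter = hemisphere.upper()
    if letter not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    magnitude = abs(value)
    return -magnitude if letter in ("S", "W") else magnitude


def normalize_direction(angle):
    """Wrap an angle in degrees into the range 0 <= angle < 360."""
    wrapped = angle % 360.0
    # Guard against floating-point rounding producing exactly 360.0.
    return 0.0 if wrapped == 360.0 else wrapped
```

For example, a longitude of 72.88 degrees West becomes -72.88, and a direction of -90 degrees becomes 270.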
10.2 Metadata should include:
Site or platform name(s). Location(s) of stationary sites including altitude. Time period covered. Conversion factors for producing an output variable from a measurement, e.g., organic reported = 1.4 × carbon measured. Relation between UTC and local time (specify standard or daylight saving). For co-located rapid response instruments, specify source and accuracy of time standard. For each instrument: Name, description, quantity measured, instrument performance (LOD, accuracy, precision, time response, and time constants, as appropriate), operational exceptions, contact information, citation instructions, data revision history, and non-standard conventions. Regarding time constants: All instruments have a finite response time. In general, the data must be offset to reflect the state at the reported time rather than the time the data is recorded. This offset must be acknowledged and given in the metadata. Also important is the broadening introduced by the measurement. This cannot easily be corrected. It is recommended that investigators reference the instrument response function to an impulse, if necessary using information from co-located instruments known to have a very rapid response. This information is essential to end users comparing signals with broadly different time responses.
10.3 Data Flags
Not every time interval will have valid data from each signal source. Periods of non-valid or missing data are to be indicated by leaving the field empty. Using a number (-999) to denote missing data is not allowed. If further explanation is needed it can be provided by a flag in another data column. It is recommended that the flag be an integer number whose binary digits can be set to denote any individual or combination of conditions as selected by the investigator. If necessary, the end user can easily parse this number to discriminate only the desired measurements. A flag value of 0 is recommended for valid data.
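A bit-flag column of the kind recommended above might look like the following Python sketch. The specific conditions and their bit assignments are hypothetical; the policy leaves them to the investigator.

```python
from enum import IntFlag


class DataFlag(IntFlag):
    """Hypothetical conditions; each nonzero member occupies one binary digit."""
    VALID = 0
    CALIBRATION = 1
    BELOW_LOD = 2
    INSTRUMENT_FAULT = 4
    LOCAL_CONTAMINATION = 8


def encode(*conditions):
    """Combine any number of conditions into one integer flag value."""
    flag = DataFlag.VALID
    for condition in conditions:
        flag |= condition
    return int(flag)


def has_condition(flag_value, condition):
    """Return True if the given condition bit is set in the integer flag."""
    return bool(DataFlag(flag_value) & condition)


# Example: a point that is below the detection limit and locally contaminated.
value = encode(DataFlag.BELOW_LOD, DataFlag.LOCAL_CONTAMINATION)
```

An end user who only wants uncontaminated data can then keep the rows for which `has_condition(flag, DataFlag.LOCAL_CONTAMINATION)` is false, whatever other bits are set.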
10.4 Examples
Integrated aircraft data set: Many of the G-1 instruments have response times between 1 and 60 s with data recorded at 1 Hz. Intermittent samples such as hydrocarbon canisters are identified by writing the sample ID in a data column for all of the 1s time periods during which a sample was collected. A ten-second average is archived. This averaging time was arrived at by considering the response time of most of the trace gas instruments and their signal to noise ratio. It represents a compromise between fidelity and ease of manipulation. The 1 Hz data is supplemental and is available by request.
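The reduction from 1 Hz data to a ten-second average can be sketched as below. This is not the procedure actually used for the G-1 data set; in particular, averaging over whatever valid points exist in a block, and averaging a short trailing block, are assumptions of this sketch.

```python
def block_average(values, block=10):
    """Average consecutive blocks of 1 Hz values; None marks missing data.

    Missing points are skipped within a block. A block with no valid points
    yields None, which corresponds to an empty field in the archived file.
    """
    averages = []
    for i in range(0, len(values), block):
        valid = [v for v in values[i:i + block] if v is not None]
        averages.append(sum(valid) / len(valid) if valid else None)
    return averages


# Example: ten valid seconds, ten missing seconds, then a partial block.
one_hz = [float(i) for i in range(10)] + [None] * 10 + [2.0, None, 4.0]
ten_second = block_average(one_hz)
```

An investigator may prefer to require a minimum number of valid points per block before reporting an average; that threshold would be a detail for the Site Specific Data Protocol or the metadata.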
A consideration in creating a merged data set is to limit the number of columns for ease of use. Data sets which are intrinsically multi-dimensional, such as aerosol size spectra and hydrocarbon speciation, are archived as separate files.
DMA: Time markers should reflect actual start and stop times, which may yield non-uniform time spacing. dN/dLogD should be interpolated to a size grid which remains fixed for all of the scans in a single file and, unless there is good reason, fixed for the whole campaign.
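One simple way to place each scan on a fixed size grid is linear interpolation in log diameter, sketched below. The interpolation method is an assumption of this example (the policy does not prescribe one), as is the choice to return no value for grid points outside the measured size range.

```python
import math
from bisect import bisect_right


def interpolate_to_grid(diameters, dn_dlogd, grid):
    """Linearly interpolate dN/dLogD in log10(diameter) onto a fixed grid.

    diameters must be positive, strictly increasing, and have at least two
    points. Grid points outside the measured range return None rather than
    an extrapolated value.
    """
    log_d = [math.log10(d) for d in diameters]
    result = []
    for g in grid:
        x = math.log10(g)
        if x < log_d[0] or x > log_d[-1]:
            result.append(None)
            continue
        # Index of the measured size at or just below x (clamped at the top).
        i = min(bisect_right(log_d, x) - 1, len(log_d) - 2)
        fraction = (x - log_d[i]) / (log_d[i + 1] - log_d[i])
        result.append(dn_dlogd[i] + fraction * (dn_dlogd[i + 1] - dn_dlogd[i]))
    return result
```

Applying the same `grid` to every scan in a file gives columns whose meaning does not change from row to row, while the time columns keep the actual start and stop times of each scan.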
AMS: Time markers should reflect actual start and stop times, which may yield non-uniform time spacing. ASCII files should contain time traces of the concentrations of major aerosol constituents integrated over all sizes (or a specified size range). Size information is to be provided for each of the major aerosol constituents in the form of a 2-D array of concentration (dM/dLogD) versus size and time. As with the DMA, the size grid should be time-independent. The Site Specific Data Protocol should specify the amount of smoothing (if any) in the time and size bin dimensions. In order to make full use of the AMS, the native Igor experiment should be available as supplemental material.
Radar wind profilers and sodars: ASCII output file containing start and end time of sounding and wind speed and wind direction as a function of height.
Radiosonde: ASCII output file that has start time, pressure, height, temperature, potential temperature, relative humidity, mixing ratio, and winds, if measured.
Single Particle Devices: These devices can generate a huge volume of data that is not easily viewed or manipulated outside of their native software. At a minimum the ASCII representation should indicate the number of particles analyzed as a function of time. Decisions as to what else to make available as ASCII and what to make available as supplemental material should be handled on a case by case basis.