ASP Data Policy
Adopted March 2007
1. Data Policy Objectives and Scope
The objective of the U.S. Department of Energy Atmospheric Science Program (ASP) Data Policy is to maximize the scientific return from multi-investigator field campaigns by emphasizing the need for the timely submission of data to an archive, recommending guidelines for data formats, and describing the contents of a Data Catalog. The goals are to provide data and supporting information to ASP investigators in readily readable files, widen the audience of potential end-users, and foster collaborations among campaign participants and with outside users. To a large extent, the success of DOE's Atmospheric Science Program will be measured by our ability to disseminate information that is needed to address questions on global climate change. It is a shared responsibility of all participants to help achieve this goal, and the hope is to do this in a way that is not overly burdensome to individual investigators.
The Data Policy described here applies to measurements made during multi-investigator field campaigns. It applies to investigators who receive support from ASP or who are receiving in-kind support such as the use of platforms or facilities.
2. Site- and Experiment-Specific Data Protocol
In a typical ASP field campaign, one researcher will be designated as the Lead Scientist. Within a campaign there may be several measurement groups, each of which has a lead investigator or Site Coordinator. Groups may, for example, consist of researchers using the G-1 aircraft, making measurements at a surface site, or operating a network of instruments.
The ASP Data Policy applies to all ASP field campaigns. As such it does not contain the detail that may be desirable for specific field campaigns. Instead it provides guidelines for constructing campaign and site-specific Data Protocols. The details of these are the responsibility of the Lead Scientist and Site Coordinators and should be developed in collaboration with the researchers who will be generating data. Site-specific Data Protocols should cover 1) what data and metadata are to be archived, 2) data format guidelines, 3) archive location, and 4) modifications to the standard data delivery schedule. The site-specific Data Protocols should also identify supplemental data products and specify how they are to be accessed if they are not part of a data archive. Supplemental data products include files that are not in ASCII format or products that are less processed versions of the data that are archived, as described in Item 9.
Although details remain to be worked out, it is anticipated for the purposes of this data policy document that ASP will provide disk space for a data archive site that will be capable of accommodating data for ASP field campaigns conducted over approximately a 10-year period. Participants in an ASP field campaign are expected to place the data they collect in that archive.
Some investigators may find themselves obligated to submit data to more than one archive. Having the same data set in multiple places (and formats) complicates maintenance and control. Thus, a link on the outside archive to the ASP archive is preferable. These issues are to be addressed in the site-specific Data Protocol before the program.
The ASP Data Policy is written with recognition that it is also desirable for ASP data to be transferred to a long-term archive such as that maintained by NARSTO. In this document, however, we view the actual transfer of data to NARSTO as a separate activity not covered here.
3. Exploratory Measurements
There is an instrument category identified as "under development", for which there is no a priori expectation that a useful data set will result. Field campaigns are a useful testing ground for new technology. However, if a useful data set can be produced, the investigator should share it in a timely fashion.
4. Data Submission and Access Timeline
Before Campaign
It is the responsibility of the DOE/ASP Lead Scientist for each campaign to reach agreement with all Principal Investigators to adhere to the ASP data policy, particularly concerning data submission.
During Campaign
Data sharing during a field campaign will be dictated by the necessities of coordinating multiple activities and maximizing scientific return, e.g., directing aircraft flights on the basis of chemical or meteorological measurements. Such activities are not covered by this Data Policy. There is a general expectation of sharing information among campaign participants as it becomes available, but there will not be a formal procedure for submitting first-look data to a central archive or web page.
Data Delivery Schedule
Individual PIs are to submit a draft data set to the ASP archive within 6 months of the end of the campaign. In the case of an instrument malfunction, a determination should be made within these 6 months that the data set is not useful in full or in part. A final data set is due within 12 months of the end of the campaign. Modifications to the data delivery schedule can be made by prior agreement, but such exceptions should, in general, be reserved for cases where data reduction is particularly time consuming.
Long Term Stewardship
It is expected that analysis activities that occur over longer time scales may uncover problems with the "final" data set. Better algorithms may also result in changes to the "final" data. It is a continual obligation of the PIs to submit revised data sets as long as they are working with the data and uncovering new problems or improvements.
It is recognized that the data acquired in most or all ASP campaigns are collected in a research and not an operational mode. As such, rigid quality assurance standards such as those expected for monitoring data and data used in regulatory settings may be inapplicable, inappropriate and/or infeasible. At the same time, it is expected that the individual ASP investigators are professionals and will exercise an appropriate amount of care to ensure the high quality of the data submitted to any archive. Relevant descriptions of data quality and/or references to papers describing data processing procedures or instrument performance characteristics should be included in the headers of the data files (preferred) or in separate Read Me files submitted with the data files.
5. Data Access
The Atmospheric Science Program encourages the free distribution of measurement data in the view that the common good is best served by full and open sharing of the data collected during field campaigns and in consonance with the data policy of the U.S. Climate Change Science Program that mandates open access to data. To this end the ASP FTP (File Transfer Protocol) site is available for all programs and investigators associated with the ASP to exchange data internally and for the wider community to share and use ASP data.
It is recognized that the right to use data generated in ASP field studies must be balanced by a responsibility to the data providers to respect their efforts and their rights to use of the data. In certain instances ASP investigators may thus wish not to make their data immediately available to the wider community pending preparation of initial scientific publications. At the request of the data originator such data will be treated as privileged for an initial 12-month period following the field project. Such privileged data will be maintained in password-protected files with access restricted to ASP investigators for the 12-month period following the field project or until authorization is granted by the data originator for the data to be made publicly available, whichever comes first.
6. Tracking
It is the responsibility of the Site Coordinators to track the progress of the individual data producers during the 6- and 12-month data submission periods.
7. Data Catalog
The DOE Lead Scientist for a field program will be responsible for producing a Data Catalog on the ASP web site that provides summary information that is accessible to the public. This Catalog should include a comprehensive list of measurements and the responsible parties for each instrument (with contact information); it may also contain comments on data acquisition and quality, links to the archive containing the data set, and references to supplemental material that may not be in an archive.
8. Data Use
ASP investigators and outside investigators can browse the data archives freely (subject to investigator-requested restricted access for up to 12 months noted in Item 5). Substantive use of data from an archive should normally require the following steps: 1) notification to the data provider and discussion of features of that data set that might not be apparent to an end-user, 2) feedback from the end-user regarding data features (such as consistency with other measurements or model calculations) that are uncovered during analysis, and 3) offer of co-authorship for publications that make substantive use of that data set. Whether or not all of these steps are followed will depend on how routine a measurement is and how crucial (or prominent) it is to the end-user's study. The expectation is that mutually acceptable arrangements will be achieved by collegial discussions among the appropriate parties.
Data from an archive should be cited following instructions in the metadata. Investigators who receive direct or in-kind support from DOE should provide an acknowledgment in presentations and publications. The usual rules and conventions apply to citing published data.
9. Data Archive
Site-specific Data Protocols will distinguish between two types of data: basic data, which are always in ASCII format, and supplemental data, which may have an alternate format. Supplemental data are what is left over when a basic ASCII data set is produced. This definition explicitly recognizes that choices have to be made in converting instrument readings to information that is useful to a broad community. Examples of supplemental material are 1) high frequency data from which a lower frequency version has been extracted for an archive, and 2) multi-dimensional data such as mass spectra and single-particle results that cannot be easily manipulated outside of their native acquisition or analysis software. Decisions as to what is primary and what is supplemental are likely to evolve as the community becomes more familiar with manipulating multi-dimensional data. Data archives should contain all primary data sets plus supplemental material as agreed on in any site-specific Data Protocols.
Satisfaction of PIs that their data are reasonable is a necessary and sufficient condition for inclusion in the site-specific data archive. ASP recommends QA/QC checks on submitted data, but recognizes that the special nature of research-grade, as opposed to routine, measurements can be incompatible with formalized QA/QC tests and documentation. Investigators are urged to provide their best information concerning the quality of submitted data. Supporting information including reference standards, minimum detection limits, time resolution, and known interferences is strongly encouraged.
10. Data Format
The ASP strongly encourages that all data submissions be intelligible in a stand-alone fashion without additional software. Information in the file name that is used to identify date, time, location, etc., should be repeated within the body of the file, either as part of the data set or as a header. Data files should either be self-documented with metadata fully describing the measurements sufficient for the end user or there should be accompanying Read Me files that provide the appropriate information. The preference is for a self-documenting file where possible because providing this information at the beginning of each file ensures that it is accessed at the same time as the data. In each file, all columns should be labeled and units provided.
Data should be usable by all participants, but it is recognized that the requirements of researchers vary and that different formats may be preferable to different people. Thus, one site-specific Data Protocol might lead to a merged data set that is most easily manipulated in a spreadsheet while another might be most easily read by a Fortran program. Consistency within a Site Data Set is useful in order to ease the burden of merging multiple data streams and to yield a final product in which measurements from multiple instruments can be easily compared.
A delimited, ASCII format can be read by spreadsheet/ graphing/ analysis software and be ingested by Fortran programs. Missing values may be represented by blanks or by a data flag (e.g., -999); be sure to specify what convention is being used. Data records should be terminated with the LF character, optionally preceded with a CR character.
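As a minimal sketch of how an end user might read such a file, the Python below parses comma-delimited records in which a missing value appears either as a blank field or as the flag -999. The column names, the comma delimiter, and the sample values are assumptions made for illustration only; a real file would follow the conventions stated in its header or Read Me file.

```python
import csv
import io

MISSING_FLAG = -999.0  # assumed flag value; the file header should state the real one
TEXT_COLUMNS = ("date_utc", "time_utc")  # hypothetical column names kept as text


def parse_value(text, missing_flag=MISSING_FLAG):
    """Return a float, or None when the field is blank or equals the missing flag."""
    text = text.strip()
    if text == "":
        return None
    value = float(text)
    return None if value == missing_flag else value


def read_records(stream, missing_flag=MISSING_FLAG):
    """Read delimited rows into dicts, converting every non-text column to float."""
    records = []
    for row in csv.DictReader(stream):
        record = {}
        for name, text in row.items():
            if name in TEXT_COLUMNS:
                record[name] = text
            else:
                record[name] = parse_value(text, missing_flag)
        records.append(record)
    return records


# Hypothetical sample: one record ends in CR LF, the others in LF alone.
SAMPLE = (
    "date_utc,time_utc,o3_ppb,co_ppb\n"
    "2006-03-15,18:00:00,41.2,150.3\r\n"
    "2006-03-15,18:00:01,,151.0\n"
    "2006-03-15,18:00:02,-999,149.8\n"
)

records = read_records(io.StringIO(SAMPLE, newline=""))
```

Mapping both conventions to a single in-memory marker (here `None`) keeps later calculations from mistaking -999 for a measurement.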
The following guidelines are suggested based on experience of what has worked well for previous ASP campaigns. If other choices are made (e.g., date and time specification) then clear descriptions of those choices should be given in the header or the accompanying Read Me files.
10.1 Time Specifications
Date and Time: Standardization on UTC eliminates uncertainties and inconsistencies between researchers. Date and time can be reported in two columns using a uniform format, namely YYYY-MM-DD and hh:mm:ss(.ss), to specify the beginning of the interval being reported. Fixed-width formatting of date/time to include leading zeros is encouraged. The (.ss) is optional. This format is recognizable by most software and is easily and unambiguously read (as opposed to Julian day, seconds from some zero time, hours, or other application-specific formats). This date/time convention allows easy sorting into chronological order. Any other time conventions should be explicitly and clearly explained in the file header or the accompanying Read Me files. Data should be reported to reflect the time of sampling (lags due to sampling, instrumental and/or processing delays should be corrected for by the PI). A GPS receiver or NTP server is recommended as a means of determining accurate times.
Time intervals: For continuous, regular, time-series data, time intervals should be monotonically increasing and uniformly spaced at whatever interval is appropriate (1-s, 1-min, 1-h, etc.). Ideally there should be no gaps for time series data in the date and time columns. Different time bases could be appropriate for different signals. For time series data from discrete (time integrated) samples, either specification of the start time and stop time of each sampling interval or the start time and sampling duration should be provided. A time mid-point is also needed if the arithmetic mean of the start and stop time is not the mean time for data acquisition.
10.2 Units Specified in Site-Specific Data Protocol
Altitude: Meters. Height above mean sea level preferred for aircraft measurements and above ground level preferred for surface based measurements. In the latter case the surface elevation should also be specified.
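The two-column date and time convention can be produced and read back with standard library calls, as in the following sketch. The function names and the -6 h local offset in the example are hypothetical, and truncating (rather than rounding) to hundredths of a second is a choice made here, not something the policy prescribes.

```python
from datetime import datetime, timedelta, timezone


def format_utc(moment):
    """Split a timezone-aware datetime into (YYYY-MM-DD, hh:mm:ss.ss) text in UTC."""
    utc = moment.astimezone(timezone.utc)
    hundredths = utc.microsecond // 10000  # truncate to two decimal places
    return utc.strftime("%Y-%m-%d"), utc.strftime("%H:%M:%S") + f".{hundredths:02d}"


def parse_utc(date_text, time_text):
    """Parse the two columns into an aware UTC datetime; the (.ss) part is optional."""
    pattern = "%Y-%m-%d %H:%M:%S.%f" if "." in time_text else "%Y-%m-%d %H:%M:%S"
    naive = datetime.strptime(f"{date_text} {time_text}", pattern)
    return naive.replace(tzinfo=timezone.utc)


# Hypothetical sample time logged on a clock running 6 hours behind UTC.
local = datetime(2006, 3, 15, 12, 4, 5, 250000, tzinfo=timezone(timedelta(hours=-6)))
date_text, time_text = format_utc(local)  # ("2006-03-15", "18:04:05.25")
```

Because the fields are fixed width with leading zeros, sorting the text of the two columns gives the same order as sorting the parsed times.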
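One way to meet the "no gaps" recommendation for regular time series is to place the records on a uniform grid and insert a missing-value marker wherever no sample exists. The sketch below assumes records are held as (time, value) pairs and that -999 is the chosen missing flag; both are illustrative assumptions.

```python
from datetime import datetime, timedelta


def fill_gaps(rows, step, missing=-999.0):
    """Return rows on a uniform time grid, inserting `missing` where no sample exists.

    rows: list of (datetime, value) pairs, strictly increasing in time.
    step: timedelta between consecutive records on the output grid.
    """
    if not rows:
        return []
    for (earlier, _), (later, _) in zip(rows, rows[1:]):
        if later <= earlier:
            raise ValueError("timestamps must be strictly increasing")
        if (later - earlier) % step != timedelta(0):
            raise ValueError("timestamps do not fall on the expected grid")
    lookup = dict(rows)
    filled = []
    current = rows[0][0]
    while current <= rows[-1][0]:
        filled.append((current, lookup.get(current, missing)))
        current += step
    return filled


# Hypothetical 1-s series with the record for 18:00:02 absent.
start = datetime(2006, 3, 15, 18, 0, 0)
rows = [(start, 41.2), (start + timedelta(seconds=1), 41.5), (start + timedelta(seconds=3), 40.9)]
filled = fill_gaps(rows, timedelta(seconds=1))
```

The function raises an error rather than silently resampling when a timestamp falls off the grid, since that usually signals a clock or logging problem the PI should look at.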
Aerosol concentration (mass concentration of aerosol particulate matter):
Aerosol particle size:
Trace gas mixing ratio: ppb (nmol/mol) preferred for most species.
Hydrocarbon mixing ratio: By compound (not by C atom); ppb (nmol/mol) preferred.
Position: Decimal degrees. West longitude specified as negative number for correct left to right orientation on graphs. Degrees South specified as negative numbers.
Direction: Angle from 0 to < 360 deg, with 0 being true north. Note that meteorological convention for specifying wind directions is not the same as for specifying geographic orientation.
Meteorological variables: temperature and potential temperature in K; specific humidity or mixing ratio in g/kg; pressure in mbar or hPa.
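The sign convention for position and the distinction between wind direction and geographic orientation can be illustrated with a few helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the meteorological convention is taken to mean the direction the wind blows from, measured clockwise from true north.

```python
import math


def normalize_deg(angle):
    """Map an angle in degrees onto the interval 0 <= angle < 360."""
    result = angle % 360.0
    # Tiny negative inputs can round up to exactly 360.0; fold that back to 0.
    return 0.0 if result == 360.0 else result


def signed_decimal_degrees(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N, S, E, or W to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere.upper() in ("S", "W") else value


def wind_from_direction(u, v):
    """Direction the wind blows FROM, in degrees clockwise from true north.

    u: eastward wind component, v: northward wind component (same units).
    """
    return normalize_deg(math.degrees(math.atan2(-u, -v)))


def wind_toward_direction(from_deg):
    """Bearing the air is moving TOWARD, comparable to a geographic heading."""
    return normalize_deg(from_deg + 180.0)


longitude = signed_decimal_degrees(95, 30, 0, "W")  # west longitude is negative
westerly = wind_from_direction(5.0, 0.0)            # air moving east, so wind is from the west
```

A westerly wind is therefore reported near 270 degrees, whereas an aircraft flying in the same direction as that air has a heading near 90 degrees.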
10.3 Metadata
Sufficient metadata should be provided so that a knowledgeable user has all the information needed to interpret the measurement. Information should include:
Site or platform name(s).
Location(s) of stationary sites including altitude.
Time period covered.
Conversion factors for producing an output variable from a measurement, e.g., organic reported = 1.4 × carbon measured.
Relative humidity at which the measurement was made for quantities that depend on relative humidity, such as size distribution, mass concentration, and light scattering coefficient.
Relation between UTC and local time (specify standard or daylight saving).
For co-located rapid response instruments, specify source and accuracy of time standard.
For each instrument: name, description, quantity measured, instrument performance (level of detection, accuracy, precision, time response, and time constants, as appropriate), operational exceptions, contact information, citation instructions, data revision history, and non-standard conventions.
10.4 Data Flags
Not every time interval will have valid data from each signal source. Periods of non-valid or missing data are to be indicated by leaving the delimited field empty or by using an appropriate number (such as -999) to denote missing data. If further explanation is needed it can be provided by a flag in another data column. It is recommended that the flag be an integer number whose binary digits can be set to denote any individual or combination of conditions as selected by the investigator. If necessary, the end user can easily parse this number to discriminate only the desired measurements. A flag value of 0 is recommended for valid data. All flags should be defined and described in the metadata.
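The integer flag described above works as a bit mask. The sketch below shows one way to set, decode, and filter on such a flag; the four conditions and their bit assignments are hypothetical, since each investigator defines the actual meanings in the metadata.

```python
# Hypothetical bit assignments; a real data set documents its own in the metadata.
CALIBRATION = 1 << 0      # 1: instrument was in a calibration cycle
BELOW_DETECTION = 1 << 1  # 2: value is below the detection limit
IN_CLOUD = 1 << 2         # 4: aircraft was sampling in cloud
SUSPECT = 1 << 3          # 8: investigator considers the value doubtful

FLAG_NAMES = {
    CALIBRATION: "CALIBRATION",
    BELOW_DETECTION: "BELOW_DETECTION",
    IN_CLOUD: "IN_CLOUD",
    SUSPECT: "SUSPECT",
}


def decode(flag_value):
    """List the names of every condition whose bit is set; empty list means valid."""
    return [name for bit, name in FLAG_NAMES.items() if flag_value & bit]


def keep_rows(rows, reject_mask):
    """Keep (value, flag) rows in which none of the bits in reject_mask is set."""
    return [row for row in rows if (row[1] & reject_mask) == 0]


rows = [
    (41.2, 0),                           # valid data
    (38.9, BELOW_DETECTION | IN_CLOUD),  # two conditions combined, flag = 6
    (40.1, IN_CLOUD),
    (-999.0, CALIBRATION),
]
usable = keep_rows(rows, CALIBRATION | BELOW_DETECTION)
```

Here an end user who does not mind in-cloud samples rejects only calibration periods and values below the detection limit, leaving the first and third rows.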
10.5 Examples
Integrated aircraft data set: Many of the G-1 instruments have response times between 1 and 60 s with data recorded at 1 Hz. Intermittent samples such as hydrocarbon canisters are identified by writing the sample ID in a data column for all of the 1-s time periods during which a sample was collected. One- and ten-second averaged data are archived. These averaging times were arrived at by considering the response time of most of the trace gas instruments and their signal to noise ratio. They represent a compromise between fidelity and ease of manipulation.
A consideration in creating a merged data set is to limit the number of columns for ease of use. Data sets that are intrinsically multi-dimensional, such as aerosol size spectra and hydrocarbon speciation, are archived as separate files.
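The ten-second averages mentioned for the integrated aircraft data set could be derived from 1-s records along the following lines. This is only a sketch: the treatment of missing values (skip them, and report a block as missing when it has too few valid samples) and the -999 flag are assumptions, not the procedure actually used for G-1 data.

```python
def block_average(values, block=10, missing=-999.0, min_valid=1):
    """Average consecutive blocks of `block` samples, skipping missing values.

    A block with no valid samples, or fewer than `min_valid`, is reported as `missing`.
    A trailing partial block is averaged over the samples it contains.
    """
    averages = []
    for start in range(0, len(values), block):
        valid = [v for v in values[start:start + block] if v != missing]
        if valid and len(valid) >= min_valid:
            averages.append(sum(valid) / len(valid))
        else:
            averages.append(missing)
    return averages


# Hypothetical 30 s of 1-Hz data: a full block, a block with no data, a half-valid block.
one_hz = [float(n) for n in range(1, 11)] + [-999.0] * 10 + [2.0] * 5 + [-999.0] * 5
ten_s = block_average(one_hz)
strict = block_average(one_hz, min_valid=8)
```

Following the convention in 10.1, each averaged record would carry the time stamp of the first 1-s sample in its block. Raising `min_valid` trades data coverage for averages that are more representative of the whole interval.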
DMA: Time markers should reflect actual start and stop times, which may yield non-uniform time spacing. dN/dLogD should be interpolated to a size grid that remains fixed for all of the scans in a single file and, unless there is good reason, fixed for the whole campaign.
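Placing each DMA scan on a fixed size grid amounts to a one-dimensional interpolation. The sketch below interpolates dN/dLogD linearly in log10 of diameter and marks grid points outside the measured range as missing instead of extrapolating; the interpolation scheme, the -999 flag, and the sample numbers are all assumptions for illustration, and a given campaign may prefer a different method.

```python
import math
from bisect import bisect_right


def interpolate_to_grid(diameters, dn_dlogd, grid, missing=-999.0):
    """Linearly interpolate dN/dLogD in log10(diameter) onto a fixed size grid.

    diameters: measured diameters for one scan, strictly increasing, at least two.
    dn_dlogd: dN/dLogD values at those diameters.
    grid: target diameters shared by every scan in the file.
    Grid points outside the measured range are set to `missing`.
    """
    log_d = [math.log10(d) for d in diameters]
    result = []
    for target in grid:
        x = math.log10(target)
        if x < log_d[0] or x > log_d[-1]:
            result.append(missing)
            continue
        # Index of the upper bracketing point, clamped so the last point is usable.
        hi = min(bisect_right(log_d, x), len(log_d) - 1)
        lo = hi - 1
        fraction = (x - log_d[lo]) / (log_d[hi] - log_d[lo])
        result.append(dn_dlogd[lo] + fraction * (dn_dlogd[hi] - dn_dlogd[lo]))
    return result


# Hypothetical scan (diameters in nm) mapped onto a four-point campaign grid.
scan_d = [10.0, 100.0, 1000.0]
scan_n = [100.0, 300.0, 500.0]
fixed_grid = [10.0, 10.0 ** 1.5, 100.0, 2000.0]
on_grid = interpolate_to_grid(scan_d, scan_n, fixed_grid)
```

Applying the same `fixed_grid` to every scan yields rows of equal length, so a whole file can be written as a simple delimited table of time versus size bin.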
AMS: Time markers should reflect actual start and stop times, which may yield non-uniform time spacing. ASCII files should contain time records of the concentrations of major aerosol constituents integrated over all sizes (or a specified size range). Size information should be provided for each of the major aerosol constituents in the form of a 2-D array of concentration (dM/dLogD) versus size and time. As with the DMA, the size grid should be time-independent. The site-specific Data Protocol should specify (and the metadata report) the amount of smoothing (if any) in the time and size bin dimensions. In order to make full use of the AMS, the native Igor(R) experiment should be available as supplemental material.
Radar wind profilers and sodars: ASCII output file should contain either start and end times of sounding or start time and sampling interval, followed by wind speed and wind direction as a function of height.
Radiosonde: ASCII output should have start time, pressure, height, temperature, potential temperature, relative humidity, mixing ratio, and winds (if measured).
Single Particle Devices: These devices can generate a huge volume of data that are not easily viewed or manipulated outside of their native software. At a minimum the ASCII representation should indicate the number of particles analyzed as a function of time. Decisions as to what else to make available as ASCII and what to make available as supplemental material should be handled on a case-by-case basis.