The future of the ISCN database


This topic contains 1 reply, has 2 voices, and was last updated by Katherine E O Todd-Brown 6 months ago.



    #3252


    Luke Nave

    Participant

    The ISCN database, now 5 years old and in its 3rd generation, has reached a point in its evolution at which critical assessments of its status and future are needed. ISCN members discussed these and other topics at the ISCN All Hands Meeting on 11 December. About 50 participants attended the meeting, held the day before the start of the American Geophysical Union Fall Meeting in San Francisco. This forum thread reports on the outcomes of that discussion and is intended to serve as a point of communication as the database continues to evolve.

    Consensus findings:
    1) The strength of the ISCN database is its ability to adapt to new data contributor and user communities by changing its input templates (e.g., to accommodate new data types or research approaches) and output mechanisms (e.g., file formats, user interfaces) according to user-defined needs. However, in practice, the refresh time needed to make such changes is slower than desired by researchers and data contributors, leading to the question: “Which is more important: a dynamic database or a stable repository?” Unfortunately, the answer is “neither.” Both are important, but the silver lining seems to be that the limitations to being both are not technical, but rather a matter of funding and personnel time.
    2) To encourage sustained data contributions, data compilation needs to be streamlined. Ways to achieve this include making data templates easier to use, or creating a “pro-rated” template in which certain minimum data are provided in raw form while the availability of additional data is simply indicated. The latter is a hybrid approach that transitions toward a third possibility: linking to datasets hosted elsewhere, in essence making ISCN a clearinghouse or metadatabase.
    3) To encourage more frequent and more powerful data use, the database needs to be made available by means beyond those currently provided. Alternate file formats, whole-database downloads, and shared code for manipulating the data would be helpful. The documentation also needs to be tightened, condensed, centralized, and otherwise improved, especially by adding worked user examples.
    4) The ISCN database should fill a niche that separates it from other significant soil databases and data products that fulfill conceptually distinct roles (e.g., the USDA-NRCS SSURGO data products or the ISRIC World Soil Information Service). That niche remains the one that has traditionally been the strength of the ISCN database: its ability to be responsive to new data contributors and new data users. Most notably, these contributors and users come from a pure research perspective, rather than a soil survey, soil management, or data science perspective, and as such the datasets that are ISCN’s fundamental currency are complex. Researchers generate, and are interested in, datasets that are replicated through space and that involve time series, experimental treatments, and process measurements. This diversity of data types and structures poses some challenge to standardization, but ISCN has handled it successfully thus far. Therefore, if there is a grand synthesis to these four consensus findings, it is:

    To remain valuable, dynamic, and usable, and to make its best possible impact, the ISCN database needs to evolve into an open-source soil data synthesis resource. It needs to be available, in its complete form, to advanced data users who can harness its hierarchical structure and geo-referenced observations according to specific project needs. At the same time, it needs a repeatable framework for adding new datasets, and new data types, either through an automated mechanism requiring little sustained personnel (programmer) time, or through dedicated efforts by a designated individual whose skills and responsibilities overlap with this valuable information resource.
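
    As a purely illustrative sketch of what such a hierarchical, geo-referenced structure could look like (the class and field names below are placeholders, not the actual ISCN schema), consider a simple dataset-profile-layer data model:

        # Minimal sketch of a hierarchical, geo-referenced soil data model.
        # All class and field names are illustrative placeholders, not the ISCN schema.
        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class Layer:
            top_cm: float                                # depth to top of layer
            bottom_cm: float                             # depth to bottom of layer
            soc_percent: Optional[float] = None          # soil organic carbon concentration
            bulk_density_g_cm3: Optional[float] = None

        @dataclass
        class Profile:
            profile_id: str
            latitude: float                              # geo-referenced observation
            longitude: float
            observation_date: str                        # e.g. "2013-12-11"
            layers: List[Layer] = field(default_factory=list)

        @dataclass
        class Dataset:
            dataset_id: str                              # e.g. a data DOI
            contributor: str
            profiles: List[Profile] = field(default_factory=list)

    Advanced data users could then query, subset, or flatten a structure of this shape according to their specific project needs.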

    #3295

    Katherine E O Todd-Brown

    I would propose that we focus on recovering the long tail of soil data (the Nature Neuro example). That is, compiling the numerous relatively small data sets that individual PIs and projects collect relating to soil carbon dynamics. I don’t think we should start with large data types at the moment (FT-ICR or ‘omics), but that might be an interesting place to go in the future. There is still a ton of other data out there. I see this as a community/citizen/student science project modeled after open source software, and I would propose the following:

    1. Data providers submit original data sets to a central repository with identifying information, including a data DOI, a description of the experimental/sampling design, and contact information for the data provider. I see some minimum requirements being imposed to make the data machine readable (no data encoded exclusively in the color of a cell, for example), but no other requirements.
    2. Alternatively, we could mine (meta-)repositories like DataONE to find existing datasets to seed the project.
    3. Domain experts define the format, units, and other protocols for the target database. A lot of this work has already been done for the previous databases and the efforts around the new template designs.
    4. Community members (students, scientists, citizens) write customized scripts for each data set to translate it into the common standard (see the sketch below).
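
    As a purely illustrative sketch of step 4, a translation script might look something like the following; the column names, unit conversions, and target fields here are placeholder assumptions, not an agreed-upon standard.

        # Sketch of a per-dataset translation script: map one contributed CSV
        # into a hypothetical common standard. Column names and unit conversions
        # are placeholder assumptions for illustration only.
        import pandas as pd

        def translate_example_dataset(path: str) -> pd.DataFrame:
            raw = pd.read_csv(path)

            # Rename contributor-specific columns to the (hypothetical) standard names.
            out = raw.rename(columns={
                "Lat": "latitude",
                "Long": "longitude",
                "OC_pct": "soc_percent",
                "BD": "bulk_density_g_cm3",
                "Upper_depth_in": "top_cm",
                "Lower_depth_in": "bottom_cm",
            })

            # Convert depths from inches to centimeters (unit handling is per data set).
            out["top_cm"] = out["top_cm"] * 2.54
            out["bottom_cm"] = out["bottom_cm"] * 2.54

            # Attach dataset-level identifying information supplied by the provider.
            out["dataset_doi"] = "doi:10.xxxx/placeholder"
            out["contributor"] = "provider@example.org"

            return out[["dataset_doi", "contributor", "latitude", "longitude",
                        "top_cm", "bottom_cm", "soc_percent", "bulk_density_g_cm3"]]

    The point is that each contributed data set gets its own small, version-controlled script, so the translation is repeatable and reviewable rather than a one-off manual edit.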

    To do this we need the following:

    1. Version-controlled central repository modeled after GitHub (or a GitHub project) to version the code base and data sets, track project changes, and otherwise manage the project collectively.
    2. Social media outreach to the community. No one will use this if they don’t know it exists. We could hook into college classes and other data management workshops to both provide educational services to the community and get people to contribute data scripts to the project.
    3. Example scripts to get things started and to define standards for both coding and the minimum data format.
    4. Final visualization and QA/QC scripts for the processed data product (a sketch follows this list).
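
    As an illustration of items 3 and 4, a minimal QA/QC pass might look like the sketch below; the field names and acceptable ranges are placeholder assumptions that the domain experts in step 3 would actually define.

        # Sketch of a QA/QC pass over the translated data product.
        # Field names and acceptable ranges are illustrative placeholders.
        import pandas as pd

        def qa_checks(df: pd.DataFrame) -> pd.DataFrame:
            """Return the rows flagged by simple range and consistency checks."""
            flags = pd.DataFrame(index=df.index)
            flags["bad_latitude"] = ~df["latitude"].between(-90, 90)
            flags["bad_longitude"] = ~df["longitude"].between(-180, 180)
            flags["inverted_depths"] = df["bottom_cm"] < df["top_cm"]
            flags["soc_out_of_range"] = df["soc_percent"].notna() & ~df["soc_percent"].between(0, 100)
            return df[flags.any(axis=1)]

        if __name__ == "__main__":
            data = pd.read_csv("translated_dataset.csv")  # output of a translation script
            problems = qa_checks(data)
            print(f"{len(problems)} of {len(data)} rows flagged for review")

    Flagged rows would go back to the contributing script author (or the data provider) rather than silently into the final product.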

    We want to actively avoid:

    1. Becoming a repository. We are NOT set up to be a long-term archive (first off, we don’t have the funding structure for it!). We should strongly encourage data providers to submit somewhere like Dryad first and get a data DOI.
    2. Relying on one person. While there will unavoidably be leaders in this project, we should strive to be flexible enough that it could be handed off to other people relatively seamlessly.

    Thoughts?

