Ongoing Projects

Planetary Informatics Exploration (2022 - Present)

The idea of what a planet is, and what kinds of planets exist, has changed over human history as new astronomical data broadened our view of the universe. With the discovery of more than 5,000 planets outside our Solar System, our understanding of planets and planetary formation has expanded. Yet we still name these exotic exoplanets by analogy to planets in our Solar System, e.g., “hot Jupiters,” “mini-Neptunes,” “super-Earths,” and “melted-Titans.” These descriptors are misleading. A super-Earth is not just a scaled-up Earth: size can change interior structure, thermal evolution, outgassing, mineralogy, and the potential for life. Similarly, a hot Jupiter is not just a warmer Jupiter: temperature can change atmospheric dynamics, chemistry, and microphysics. By classifying exoplanets in relation to Solar System worlds, we inherently limit our ability to recognize truly novel categories of planets.

Data Driven Biosignatures (2021 - Present)

The search for signs of life on early Earth, as well as on worlds beyond our own, has been a compelling endeavor for decades. A familiar strategy is to meticulously tease out and analyze carbon-based molecular remnants of ancient cells, molecules that carry echoes of biochemistry. Most prior effort has focused on identifying key “smoking gun” biomolecules, such as lipids, hopanoids, and other species presumed to be exclusively biotic in origin. Such methods rely on teasing out a few specific analytical peaks from a noisy background. We are instead developing machine learning and analytics methods that identify patterns in the overall molecular distribution of a sample to characterize the differences between biotic and abiotic samples. As of October 2023, we have published a paper in PNAS detailing a model that predicts whether a sample is biotic or abiotic with ~90% accuracy using pyrolysis GC-MS (py-GCMS) data. Funded by the John Templeton Foundation.

This NAI team will explore the catalysis of electron transfer reactions by a succession of catalysts, from prebiotic peptides to microbial ancestral enzymes to modern nanomachines, integrated over four and a half billion years of Earth’s changing geosphere. Theme 1 focuses on the synthesis and function of the earliest peptides capable of moving electrons on Earth and other planetary bodies. Theme 2 focuses on the evolutionary history of “motifs” in extant protein structures. Theme 3 focuses on how proteins and the geosphere co-evolved through geologic time.

Earth's living and non-living components have co-evolved for 4 billion years through numerous positive and negative feedbacks. Earth and life scientists have amassed vast amounts of data in diverse fields related to planetary evolution through deep time: mineralogy and petrology, paleobiology and paleontology, paleotectonics and paleomagnetism, geochemistry and geochronology, genomics and proteomics, and more. Yet our ability to document, model, and explore these complex, intertwined changes has been hampered by a lack of data integration across these complementary disciplines. We propose a new program of data-driven discovery in the Earth and life sciences. We will develop, curate, and integrate diverse data resources to focus on our planet's changing near-surface oxidation state and the rise of oxygen through deep time, a critical problem that exemplifies this co-evolution and underscores the opportunities and challenges of deciphering transient characteristics of Earth's history. By applying abductive reasoning to our newly developed "Deep-Time Data Infrastructure" to discover patterns in the evolution of our planet's environment, we will create and merge the integrated data sets, statistical methods, and visualization tools that inspire and test hypotheses applicable to modeling Earth's past and today's changing environment.

Deep Carbon Observatory (2017 - Present)

Recent advances in data generation techniques, whether by experiment, measurement, or computer simulation, quickly produce complex data characterized by source heterogeneity, multiple modalities, often high volume, high dimensionality, and multiple scales (temporal, spatial, and functional). In turn, science and engineering disciplines are rapidly becoming more data driven, motivated by a variety of goals (the Deep Carbon Observatory is an exemplar): higher sample throughput, higher resolution, additional physics, chemistry, and biology, new instrumentation, and new integrated databases, all with the ultimate aim of better understanding and modeling the complex systems and dynamics that underlie the processes being studied. However, analyzing libraries of complex data requires managing the inherent complexity to allow integration of information and knowledge across multiple scales and across traditional disciplinary boundaries. Significant advances in methods, tools, and applications for data science and informatics over the last five years can now be applied to multi- and inter-disciplinary problem areas. Virtual Observatories, Virtual Organizations, complex networks, linked data across systems, full life cycle data management, data integration, and citation and attribution are increasingly becoming an integral part of projects, whether small (few people, one organization, modest data needs) or very large (many investigators and organizations, diverse data needs).


Given this increasing data deluge, it is clear that each of the Directorates in the Deep Carbon Observatory faces diverse data science and data management needs in fulfilling both its decadal strategic objectives and its day-to-day tasks. This project will assess in detail the data science and data management needs of each DCO Directorate and of the DCO as a whole, using a combination of informatics methods: use case development, requirements analysis, inventories, and interviews.