Data Driven Competencies in Computational Science
Analysis of large, complex datasets has become increasingly important as a path to understanding a myriad of science, engineering, and business trends and phenomena. Large-scale data analysis is also emerging as a topic in the humanities and social sciences. The competencies below have been developed as a model of the basic or core competencies that students will need to achieve to begin work in these emerging areas. A set of additional competencies for more advanced work in this area is also under development and will be released as a second document. These documents have been assembled as a part of the XSEDE project education program with the collaboration of many experts from the data science community. Comments and suggestions about the competencies are welcome and can be directed to Steve Gordon, the lead for the XSEDE education program, at sgordon@osc.edu.
Core/Basic Data Driven Competencies
- Prerequisites
- Basic statistics (parametric): understanding of mean, median, mode, variance, and distributions
- Preferred (but not required): regression, correlation, significance tests, outlier analysis
- Calculus: basic
- Students will understand how data originates from diverse sources
- Students will understand the relationships between objects and their representation in a digital data repository, such as a database or a group of files, by exploring a number of example datasets.
- Students will be able to identify a variety of processes for data acquisition and data formats from physical measurements, sensors, instruments, transactions, simulations, and social media.
- Students will actively undertake a data acquisition task and the design of a relevant dataset.
- Students will recognize how the process of data acquisition and digitization relates to data provenance chains.
- Example Activities: Students will study the process of acquiring data from archaeological objects. Students will research how sky imaging data is recorded by large telescopes and made available to the astronomy community, and how genetic sequencers provide genomics data for the life sciences community. Students will survey examples of sensor networks composed of instruments such as rain gauges, thermometers, or seismometers. Students will run simulations from computational models, such as power generation and distribution, and record energy consumption. Students will observe relationships and patterns from social networks, such as Twitter and Facebook. Students will study the NIST Big Data WG taxonomy or examples from NASA, NOAA, USGS, USDA, and data.gov.
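A minimal sketch of one such activity: simulating a sensor-network data source (a rain gauge, one of the instrument types listed above) and recording the readings in CSV. All names and parameters here are hypothetical illustrations, not part of any specific course exercise.

```python
import csv
import io
import random

def simulate_rain_gauge(n_readings, seed=42):
    """Generate synthetic hourly rain-gauge readings (mm), a stand-in
    for data acquired from a real sensor network."""
    rng = random.Random(seed)  # fixed seed for a reproducible acquisition run
    return [(hour, round(max(0.0, rng.gauss(1.5, 2.0)), 2))
            for hour in range(n_readings)]

def write_csv(readings):
    """Encode the readings in CSV, a simple interchange format."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["hour", "rainfall_mm"])  # column header documents the format
    writer.writerows(readings)
    return buf.getvalue()

readings = simulate_rain_gauge(24)  # one day of hourly readings
print(write_csv(readings).splitlines()[0])  # prints the header row
```

Recording the acquisition parameters (here, the seed and sampling interval) alongside the data is a first step toward the provenance chains discussed above.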
- Ability to recognize factors affecting data quality and techniques employed to cope with them
- Students will understand potential sources of data errors including measurement, encoding, derivation, and missing value problems.
- Students will have introductory knowledge of techniques based on redundancy and quality assessment for coping with data corruption.
- Students will understand how data quality is assured through checks and inspection procedures.
- Students will know how to search datasets for erroneous, inconsistent, and/or missing values.
- Students will utilize simple statistical techniques to estimate missing values for numerical and other data (imputation).
- Students will devise strategies for "cleaning" datasets with inconsistent coding.
- Students will understand how bias (i.e., systematic data acquisition errors introduced by instruments) and numerical errors (introduced in data encoding and processing) become part of the dataset.
- Students will understand the difference between lossy and lossless data compression and applicability of each.
- Example Activities: Students will survey use cases such as the procedure to process genomic samples, understanding how natural variability introduces measurement error that affects analytical and statistical relationships and becomes statistical uncertainty in the collected data. Another use case is the Large Hadron Collider (LHC), where high-energy physics Monte Carlo simulations of beam transport are employed to reveal blind spots in detectors and the bias introduced by the location and characteristics of the detectors. Students will study the multiple levels employed by RAID data storage technology in combining multiple disk drives and their impact on data reliability, integrity, and performance. Other examples can be devised using data from NASA, NOAA, and USGS, all of which deal with data quality.
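The imputation technique named above can be sketched in a few lines: replacing missing numeric values with the mean of the observed ones, the simplest of the statistical estimates students might use. The dataset here is a made-up illustration.

```python
import statistics

def impute_missing(values):
    """Replace missing entries (None) with the mean of the observed
    values -- the simplest form of statistical imputation."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical temperature series with two missing readings.
temps = [21.0, 22.5, None, 23.0, None, 21.5]
print(impute_missing(temps))  # -> [21.0, 22.5, 22.0, 23.0, 22.0, 21.5]
```

More careful strategies (median imputation for skewed data, interpolation for time series) follow the same pattern; mean imputation is shown only because it maps directly onto the prerequisite statistics above.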
- Ability to organize, describe and manage data
- Students will understand that datasets can be logically grouped using abstractions such as logical collections or aggregations.
- Students will understand the fundamentals pertaining to digital data storage, including files, directories, and file systems and will be able to organize data using these abstractions.
- Students will understand the concept of metadata and that data items should be described using the appropriate metadata standards.
- Access Control: Students will understand that there are issues pertaining to who is authorized to access a dataset. Some people are given the permission only to read, while others may create or modify data entries.
- Students will be able to use application program interfaces (APIs) or software libraries of standard formats, such as the Hierarchical Data Format (HDF5) and the Network Common Data Form (NetCDF), to browse large and complex data collections.
- Students will understand that some data are made private, or are licensed and require attribution, while other datasets are in the public domain. They will also be introduced to questions of data use related to intellectual property rights, privacy policies, and legal protections of some data, such as the HIPAA requirements.
- Federation of data repositories: Students will understand that digital data may reside on distinct physical locations due to several factors including cost, performance, space and reliability. Mechanisms are needed to access, retrieve and aggregate data from multiple repositories.
- Provenance: Students will have a basic understanding of how to manage, annotate and preserve reliable data context information, such as data provenance.
- Data Replication: Students will understand that a dataset may be replicated for performance and reliability reasons.
- Example Activities: Students will survey the policies defined in the Research Data Alliance Practical Policy Working Group. Students will research the International Virtual Observatory Alliance (IVOA), the federation of Virtual Observatories (VO) associated with the organization, and its data management policies. Students will be able to browse complex data structures, such as large time series of multidimensional, multivariate climate datasets, using APIs or software libraries of data encoding standards ranging from complex, specialized formats such as HDF5 and NetCDF to simpler formats such as CSV. Students will understand how standard tools, such as scp and bbcp, and complete systems, such as Globus Online, are used for large data transfers.
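The organization-and-description competencies above can be practiced with nothing more than files, directories, and a metadata sidecar. The sketch below, with hypothetical collection and file names, pairs a data file with a JSON metadata record, a lightweight stand-in for a formal metadata standard such as Dublin Core.

```python
import json
import tempfile
from pathlib import Path

def describe(path, metadata):
    """Write a JSON metadata sidecar next to a data file so the file
    is self-describing within the collection."""
    sidecar = path.parent / (path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar

# Hypothetical logical collection: a directory grouping related files.
root = Path(tempfile.mkdtemp()) / "climate_collection"
root.mkdir()
data = root / "station_042.csv"          # hypothetical dataset
data.write_text("hour,temp_c\n0,21.0\n")
meta = describe(data, {"creator": "course exercise",
                       "units": "degrees Celsius",
                       "license": "CC0"})
print(json.loads(meta.read_text())["units"])  # prints: degrees Celsius
```

The directory plays the role of a logical collection, and the explicit `license` field touches on the access and intellectual-property questions raised above.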
- Understanding of Databases
- Students will have a basic understanding of relational databases including normalization, database schemas, and relational algebra. They will understand how to create, update, query, and delete tables using standard SQL statements.
- Students will understand workflows such as ETL (extract, transform, and load) used to aggregate data from multiple sources and integrate it into databases and data warehouses.
- Students will be able to utilize NoSQL databases including key-value, wide-column, document, and graph stores, as well as their application to non-tabular data.
- Students will be able to interpret graph databases and apply them to multi-dimensional datasets.
- Student will learn about Multimedia Databases, including objects such as audio and video snippets.
- Example Activities: Students will learn how to structure datasets to populate and index tables, and how to formulate database queries and obtain results on relational database technologies such as MySQL. Students will learn and utilize multiple NoSQL database technologies, including key-value stores such as Redis, wide-column stores such as HBase, document stores such as MongoDB, and graph stores such as Neo4j.
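The create/insert/query cycle described above can be tried without installing a database server: Python's bundled sqlite3 module speaks the same core SQL as larger systems such as MySQL. The table and data below are illustrative only.

```python
import sqlite3

# In-memory relational database for experimentation.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE and INSERT: define a schema and populate the table.
cur.execute("CREATE TABLE readings (station TEXT, hour INTEGER, rainfall_mm REAL)")
cur.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                [("A", 0, 0.0), ("A", 1, 2.5), ("B", 0, 1.0), ("B", 1, 0.5)])
conn.commit()

# SELECT with aggregation: total rainfall per station.
cur.execute("SELECT station, SUM(rainfall_mm) FROM readings "
            "GROUP BY station ORDER BY station")
totals = cur.fetchall()
print(totals)  # -> [('A', 2.5), ('B', 1.5)]
```

UPDATE and DELETE statements follow the same `cur.execute(...)` pattern, so students can complete the full create/update/query/delete cycle in this environment before moving to a server-based system.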
- Understanding of Data Preservation and Sharing
- Students will understand the diverse motivations and barriers (technological, cultural, legal, and ethical) associated with data sharing.
- Students will understand that different timelines in the production and consumption of data affect its availability and preservation.
- Students will distinguish between short-term use and long-term preservation for archival purposes, and understand the broad range of timescales involved, from nanosecond sensor measurements to preservation over decades.
- Students will realize that continued investment is needed to transfer data to new media and maintain its readability in order to ensure long-term preservation.
- Example Activities: Students will learn about data preservation projects such as datadryad.org and gain familiarity with data archival best practices and standards, such as the Open Archival Information System (OAIS) reference model, including its information package variants.
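One preservation practice students can exercise directly is fixity checking: archives record a cryptographic checksum for each file so that later copies (or the same file on new media) can be verified bit-for-bit. A minimal sketch, with a hypothetical archived file:

```python
import hashlib
import tempfile
from pathlib import Path

def fixity(path):
    """SHA-256 checksum of a file; archives record these so that
    later copies can be verified bit-for-bit (a 'fixity check')."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

archive = Path(tempfile.mkdtemp())
original = archive / "survey.csv"          # hypothetical archived file
original.write_text("id,value\n1,3.14\n")

# At ingest: record the checksum in a manifest kept with the archive.
manifest = {original.name: fixity(original)}

# Later (or after migration to new media): re-verify against the manifest.
assert fixity(original) == manifest[original.name]
print("fixity verified")
```

A single changed byte would produce a completely different checksum, which is why fixity manifests are a standard part of archival information packages.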
- Ability to Plan and Practice the Data Lifecycle
- Students will understand the principles of the data lifecycle including planning, data acquisition, quality assurance, description, organization, analysis, discoverability, and preservation.
- Students will recognize that data-driven projects may begin at, and encompass, different portions of the data lifecycle.
- Students will be capable of creating and documenting their own data repository.
- Example Activities: Survey different definitions of data lifecycles and discuss their commonalities and differences, including those from DataONE, MIT, the UK Digital Curation Centre, and others. Study the requirements for Data Management Plans from different funding agencies, including NSF, NIH, and others. Survey best practices for developing Data Management Plans, including those defined by the University of California Curation Center, Columbia University, and others. Students will propose a data collection plan that addresses issues including goals, acquisition methodologies, creation of logical collections, structure and format, description, provenance, discovery, and persistence. Students will then create a data repository based on their data management plan and report on successes, failures, and lessons learned in the context of its implementation, operation, and the original planning considerations.
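The plan-then-build exercise above can start from something as small as the sketch below: a data management plan captured machine-readably, then used to scaffold the repository it describes. The project name, directory layout, and plan fields are all hypothetical choices, not a prescribed structure.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical minimal data management plan.
plan = {
    "project": "rainfall-study",
    "acquisition": "simulated rain-gauge readings, hourly",
    "formats": ["csv"],
    "preservation": "checksummed copies retained for 10 years",
}

def scaffold_repository(base, plan):
    """Lay out a small data repository following the plan: separate
    raw and processed areas, plus the plan itself for discoverability."""
    repo = base / plan["project"]
    for sub in ("raw", "processed", "docs"):
        (repo / sub).mkdir(parents=True)
    (repo / "docs" / "data_management_plan.json").write_text(
        json.dumps(plan, indent=2))
    return repo

repo = scaffold_repository(Path(tempfile.mkdtemp()), plan)
print(sorted(p.name for p in repo.iterdir()))  # -> ['docs', 'processed', 'raw']
```

Keeping the plan inside the repository makes it easy, when reporting on the project, to compare what was implemented against the original planning considerations.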