Advanced Data Driven Competencies in Computational Science
Analysis of large datasets has become increasingly important as a path to understanding a myriad of science, engineering, and business trends and phenomena. Large scale data analysis is also emerging as a topic in the humanities and social sciences. The competencies below have been developed as a model of the advanced competencies that students will need to achieve in order to work in three data driven specialization areas, namely:
- Infrastructure and Systems,
- Data Management and Curation,
- Knowledge Representation and Analysis
A separate document provides a model of the basic or core competencies which are prerequisite to the competencies described here.
This document has been assembled as a part of the work of the XSEDE project education program with the collaboration of many experts on data driven science from the community. Comments and suggestions about the competencies are welcome and can be directed to Steve Gordon, the lead for the XSEDE education program at sgordon@osc.edu.
Infrastructure and Systems
- Prerequisites
- Ability to cope with the impacts of the memory and storage hierarchy on data input, output, and analysis tasks.
- Students will have a working knowledge about capacity and performance issues regarding CPU access to registers, vectors, multiple levels of cache and in-core memory.
- Students will have a working knowledge about capacity and latency issues regarding CPU I/O operations when accessing persistent storage.
- Students will know the main types of storage devices and systems, including hard disks, SSDs, tape archives, RAID, and Hierarchical Storage Systems.
- Students will be able to balance performance and cost trade-offs associated with Hierarchical Storage Systems and remote data transfers. Within this context, students will understand the trade-offs regarding data staging and caching.
- Example Activities: Students will study data driven algorithms, such as sorting and calculation of mean values. Advanced: Students will design and optimize data driven algorithms taking into account processor functional units, cache and memory architecture. Students will design strategies to manage large dataset operations, which require out of core access to disks and/or tapes and analyze the performance and cost trade-offs associated with their design.
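The mean-value activity above can be sketched as a streaming computation that keeps only one chunk in memory at a time. This is a minimal illustration, not a tuned out-of-core implementation: the function names and the chunk size are assumptions, and the in-memory iterable merely stands in for block reads from disk or tape.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks, standing in for block reads from slow storage."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def streaming_mean(values: Iterable[float], chunk_size: int = 1024) -> float:
    """Compute the mean while holding at most one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty dataset")
    return total / count
```

Only a running sum and a count survive between chunks, so memory use depends on the chunk size rather than on the dataset size. Choosing the chunk size is where the cache, memory and storage trade-offs listed above come into play.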
- Students will understand the main concepts of communication networks and their impact on remote data access.
- Students will understand the role, performance impact and applicability of communication protocols.
- Students will have a working knowledge about the layered structure of network architectures, including the physical, link and network layers.
- Students will understand the concept of network quality of service (QoS) factors, including packet loss, throughput, delay and jitter.
- Students will understand the principles and multiple trade-offs impacting data transfers, such as quality of service factors and cost, when using routed packet switching and end-to-end circuit switching.
- Students will understand the fundamental importance of performance tuning the host configurations at each end of a network connection to provide efficient utilization of the communication channels. 1
- Students will understand the principles of Software Defined Networks and how they can be applied in designing dynamic networking solutions for distributed data access.
- Students will understand the relative performance of Ethernet (1 GbE, 10 GbE, 40 GbE and 100 GbE) and InfiniBand (SDR, DDR, QDR, FDR).
- Students will have a basic understanding of the TCP and UDP protocols together with their applicability.
- Example Activities: Students will study and compare the expected performance of multiple network protocols, such as TCP and UDP, as well as their impact on the transfer of binary data and continuous streaming media, such as audio and video. Students will study OpenFlow and case studies of Software Defined Networks. Students will learn the network best practices aggregated at the ESnet Fasterdata Knowledge Base (http://fasterdata.es.net/). Advanced: Students will conduct network experiments to benchmark and compare the performance of multiple network protocols and their impact on diverse types of communication traffic and applications.
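One reason host tuning matters for long-distance transfers is the bandwidth-delay product: a single TCP connection cannot move more than one window of data per round trip. The sketch below shows that arithmetic only. The function names are illustrative, and real throughput also depends on packet loss, congestion control and host configuration.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a path of this bandwidth and RTT full."""
    return bandwidth_bits_per_s * rtt_seconds / 8


def max_tcp_throughput_bits_per_s(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on single-connection throughput for a given window size."""
    return window_bytes * 8 / rtt_seconds


# A 10 GbE path with a 50 ms round-trip time needs about 62.5 MB in flight.
required_window = bandwidth_delay_product_bytes(10e9, 0.05)

# A 64 KiB window on the same path caps throughput at roughly 10.5 Mbit/s.
untuned_limit = max_tcp_throughput_bits_per_s(64 * 1024, 0.05)
```

The gap between the two numbers illustrates why small default buffer sizes can leave a fast wide-area link mostly idle.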
- Students will understand the performance advantages of distributed and parallel I/O or data access.
- Students will have a working knowledge of data parallelization frameworks including MPI/IO and MapReduce.
- Students will understand different paradigms regarding parallel file systems such as Lustre, PVFS, GPFS and HDFS.
- Example Activities: Students will study MPI/IO codes which perform numerical analysis on data from multi-dimensional arrays stored on a parallel file system. Advanced: Students will write, run and analyze the performance of MPI/IO codes on a parallel file system. Students will run data analysis on systems such as Hadoop and Apache Spark.
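The MapReduce model mentioned above can be illustrated on a single machine with a map phase, a shuffle phase and a reduce phase. This is a teaching sketch in plain Python with hypothetical function names. It does not use the Hadoop or Spark APIs, which distribute the same three phases across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values that share a key into one result."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over the input records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own arguments, a framework is free to run them in parallel on separate data partitions.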
- Ability to recognize different categories of data infrastructure and how they relate to different community goals.
- Students will understand the role of data repositories, such as the data record archive at NOAA
- Students will understand the purpose of union catalogs for discovery, such as the NASA Global Change Master Directory (GCMD) 2
- Students will understand that there are communities which focus on web-services mechanisms to support manipulation of the data, like Polyglot 3 or other web-service environments.
- Students will understand how data structures are tied to the goals and activities of the community that shares the data.
- Example Activities: Students will survey a list of major data repositories that hold observational data or discipline-specific data. Examples include NOAA with its data records, NASA with satellite data, and others such as EPA, Department of Transportation, Department of Agriculture and USGS, all of which have current data holdings that are of interest to researchers on NSF data projects. Students will explore the re3data.org and databib.org registries of repositories, which can be browsed by multiple criteria. Students will browse and develop a report about the use of the NASA Global Change Master Directory (GCMD).
- Ability to leverage middleware technology in support of data infrastructures
- Students will be knowledgeable about the main aspects of data grid design, architecture and their applicability in large scale data analysis and management.
- Students will know how to leverage the main instances of this technology, such as iRODS, LCG, Globus Online, SRM, REDCap or XNAT.
- Students will be familiar with technologies supporting digital libraries such as iRODS, or DSpace, or Fedora.
- Example Activities: Students will study how data grid technologies such as OSG and LCG have been applied in High Energy Physics and Large Hadron Collider related projects. Students will study how the Sloan Digital Sky Survey (SDSS) and the concept of Virtual Observatories have revolutionized astronomy.
- Ability to leverage Cloud Computing technology to manage data storage and analysis
- Students will know how to make use of cloud computing virtualization paradigms, such as Data as a Service, Infrastructure as a Service, Software as a Service, Platform as a Service, etc.
- Students will understand the concept of cloud computing elasticity and relative impact on cost and performance.
- Example Activities: Students will research the use of Cloud Computing solutions for data management and analysis provided by vendors such as Amazon, VMware, Microsoft, EMC and others. Students will study the solutions discussed in “Big Data Architecture Models: A Survey” from the NIST Big Data Working Group (NBD-WG) 4. Advanced: Students will employ Cloud Computing services to implement data management solutions.
- Students will have an understanding of the trade-offs regarding energy efficiency, cooling strategies, performance and other metrics in the context of data driven infrastructure.
- Example Activities: Students will survey case studies and strategies concerning energy efficiency, such as warm-water cooling and oil immersion approaches, hot/cold aisle containment, compute blades vs. traditional servers and the Advanced Configuration and Power Interface (ACPI) specification.
- Ability to apply infrastructure capabilities to address data privacy and security issues.
- Students will have a working knowledge of authentication, authorization, identity federation, de-identification and access control technologies.
- Students will understand the need to segregate privacy sensitive data, such as HIPAA patient records, credit card information, etc.
- Example Activities: Students will study technologies such as the InCommon identity federation and solutions such as ORCID. Students will investigate case studies regarding data segregation and de-identification, such as clinical information.
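One building block of de-identification is replacing a direct identifier with a keyed pseudonym and dropping fields that identify a person outright. The sketch below assumes hypothetical field names (`patient_id`, `name`, `phone`) and a secret key held separately from the data. On its own it does not make a dataset compliant with any regulation, since quasi-identifiers such as dates or postal codes need separate treatment.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical field names for illustration; real schemas will differ.
DIRECT_IDENTIFIERS = {"name", "phone"}
ID_FIELD = "patient_id"


def pseudonymize_id(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym with HMAC-SHA256 so the raw ID is not exposed."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def de_identify(record: Dict[str, str], secret_key: bytes) -> Dict[str, str]:
    """Drop direct identifiers and replace the record ID with a pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != ID_FIELD
    }
    cleaned["pseudonym"] = pseudonymize_id(record[ID_FIELD], secret_key)
    return cleaned
```

Using a keyed hash rather than a plain hash means that someone who knows the identifier format cannot rebuild the mapping without the key, while records for the same person still link to each other.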
Data Management and Curation
- Prerequisites
- Ability to understand and apply data curation techniques 5, 6, 7, 8, 9, 10
- Students will understand the relevance of the digital data curation and management lifecycle 11 in enabling someone other than the original data producer to reproduce results, reuse, interpret and add value to the data across organizations and over time.
- Students will know how to conceive and plan the creation of digital objects, including data capture methods, quality assessment and storage options.
- Students will be capable of assigning and managing administrative, descriptive, structural and technical metadata to data objects and collections.
- Students will understand the differences between the HIPAA and FERPA privacy rules.
- Students will know how to ensure that designated users can easily access digital objects, understanding that some digital objects may be publicly available, while others may be safeguarded for privacy due, for instance, to regulations such as HIPAA and FERPA.
- Students will be able to appraise digital objects, selecting those requiring long-term curation and preservation as well as recognizing those which should be disposed of.
- Students will know how to manage the accession and transfer of digital objects to an archive, trusted digital repository, data center, digital library or similar, again abiding by documented guidance, policies and legal requirements.
- Students will know the actions required to rid systems of digital objects not selected for long-term curation and preservation, while observing documented guidance, policies, de-accession recommended practices, and legal requirements.
- Students will understand the concept of Data Citation, Data Publications and alternate conceptualizations of these and their limitations.
- Example Activities: Students will learn the Open Archival Information System (OAIS) model and associated standards and reference models such as ISO 14721. Students will study the DCC Curation Lifecycle Model 11. Students will research digital curation case studies. Students will design digital curation plans for specified data driven cases.
- Ability to Manage Digital Data Preservation 12, 13, 14
- Students will understand the concept of digital preservation and will have a working knowledge of the actions and best practices required to maintain access to digital materials beyond the limits of media failure or technological change, keeping the ability to use, share and interpret data in the long-term.
- Students will have a working knowledge of the techniques and best practices which aim to maintain data integrity, providing assurance that the data is authentic: that it has not been forged or substituted and that its bit-stream has been maintained.
- Students will understand the relationship and trade-offs associated with redundancy, system reliability and data bit-stream integrity.
- Students will have a working knowledge of approaches to ensuring data bit-stream integrity.
- Because digital preservation techniques may alter the data, students will understand that authenticity has to be demonstrated by paying attention to characteristics of the data such as the provenance chain.
- Students will understand the relationship between workflows, data product derivation and the bookkeeping of data provenance.
- Example Activities: Students will learn the aspects and practices of trustworthy digital repositories based on standards such as ISO 16363:2012.
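A common approach to bit-stream integrity is a fixity check: record a cryptographic digest when an object is ingested and recompute it later to detect silent changes. The sketch below uses SHA-256 from the Python standard library. The function names are illustrative, and in-memory streams stand in for archived files.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, block_size: int = 1 << 20) -> str:
    """Hash a binary stream block by block so large objects need not fit in memory."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def verify_fixity(stream: BinaryIO, expected_digest: str) -> bool:
    """Return True when the stream still matches the digest recorded at ingest."""
    return sha256_of_stream(stream) == expected_digest


# At ingest time: record the digest alongside the object's metadata.
recorded_digest = sha256_of_stream(io.BytesIO(b"archived dataset contents"))

# At audit time: an unchanged copy passes, a copy with one altered byte fails.
unchanged_ok = verify_fixity(io.BytesIO(b"archived dataset contents"), recorded_digest)
corrupted_ok = verify_fixity(io.BytesIO(b"archived dataset c0ntents"), recorded_digest)
```

A digest only detects corruption. Repairing it requires a redundant copy, which is where the trade-offs between redundancy and reliability listed above apply.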
- Ability to Plan and Sustain Data Repositories
- Students will be able to weigh business model and sustainability considerations when designing data management plans, such as data collection value, funding agencies' limited budgets, capital, hardware, operational system issues and staffing costs.
- Students will know the importance of, and approaches to, disseminating data collections to current and prospective data consumers.
- Students will have knowledge and experience in developing Data Management Plans.
- Example Activities: Students will study the Data Value project regarding business model and data management plan sustainability considerations. Students will review documents related to Data Management and Sharing Plans, such as NSF, NASA, NOAA, USGS and NIH guidance and requirements, as well as data management plan development materials from ESIP, Digital Curation Centre, University of California Curation Center, Data Management and Publishing from MIT and University of Minnesota. Finally, the students will develop plans related to their activities.
- Ability to consider access to content issues
- Students will know how to apply content licensing, such as Creative Commons licenses, OAI licenses and Open Access, and the standardization of access with data commons.
- Students will understand compliance issues related to intellectual property, privacy, security, ethics, legal issues, and governance.
- Example Activities: Students will study how compliance requirements regarding HIPAA, FERPA and other regulations are implemented and enforced in data repositories.
Analysis and Knowledge Representation
- Prerequisites
- Ability to leverage knowledge bases and employ knowledge representation techniques
- Students will understand and be able to develop ontologies that semantically represent the information and relationships existing in a domain of knowledge, including its corresponding data objects and collections.
- Students will be able to apply inference and reasoning tools to query and discover knowledge contained in ontology representations.
- Students will be capable of representing and extracting knowledge from linked data representations.
- Example Activities: Students will use Protégé to analyze and understand ontologies, and to query and infer new knowledge with reasoning tools. Students will familiarize themselves with Semantic Web concepts. Students will study the iPlant Collaborative Discovery Environment Project (http://www.iplantcollaborative.org/ci/discovery-environment), which employs semantic relationships for data discovery and knowledge reuse. Advanced: Students will develop ontologies to represent a domain of expertise. Students will integrate ontologies across multiple domains of knowledge. Students will develop their own reasoning and inference engines.
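The idea of inferring new knowledge from an ontology can be sketched with subject-predicate-object triples and two simple rules: subclass relations are transitive, and an instance of a class is also an instance of its superclasses. The class and instance names used with this sketch are hypothetical, and the code is a toy forward-chaining loop rather than a replacement for a reasoner such as those used with Protégé.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def infer_subclass_closure(triples: Set[Triple]) -> Set[Triple]:
    """Add (a subClassOf c) whenever (a subClassOf b) and (b subClassOf c) hold."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == "subClassOf"]
        for a, b in subclass:
            for b2, c in subclass:
                if b == b2 and (a, "subClassOf", c) not in inferred:
                    inferred.add((a, "subClassOf", c))
                    changed = True
    return inferred


def infer(triples: Set[Triple]) -> Set[Triple]:
    """Apply subclass transitivity, then propagate instance types to superclasses."""
    inferred = infer_subclass_closure(triples)
    snapshot = list(inferred)
    for s, p, o in snapshot:
        if p != "type":
            continue
        for a, p2, b in snapshot:
            if p2 == "subClassOf" and a == o:
                inferred.add((s, "type", b))
    return inferred
```

Given the facts that `Maize` is a subclass of `Plant`, `Plant` is a subclass of `Organism`, and `sample1` has type `Maize`, the function derives that `sample1` also has type `Organism`, a statement that was never entered explicitly.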
- Students will have the ability to employ visualization tools to analyze data
- Students will be able to explore volumetric data
- Students will be introduced to the concept of network data visualization.
- Example Activities: Students will learn how to use visualization tools, such as VisIt, ParaView, VisTrails and others to investigate volumetric data. Students will learn to use Cytoscape to investigate network data. Advanced: Students will develop customized visualization tools to analyze specific datasets.
- Students will be able to employ analytic techniques based on statistics, data mining and machine learning to extract knowledge, make predictions and decisions based on data 15, 16, 17
- Students will be knowledgeable about methods for organizing data, including sorting, searching and matching.
- Students will be capable of applying mathematical and statistical models and concepts to detect similarity, structure, patterns and classify clusters in data.
- Students will be able to apply mathematical and statistical models to detect outliers and rare events in datasets.
- Students will be able to use predictive analytics algorithms to characterize large datasets including decision trees, clustering algorithms, neural networks and other data mining techniques.
- Example Activities: Students will study statistical modeling, including likelihood estimates, Markov Models, hypothesis testing, regression and Bayesian inference. Students will study techniques such as dynamic programming and gradient descent and apply these techniques in optimization, search and matching problems. The students will use existing analysis tools, such as MATLAB, Weka, R or RapidMiner to detect patterns and outliers in example datasets. Students will apply the learned analytical techniques and tools to analyze, predict and make decisions on a dataset of study. Advanced: Students will design or develop new analytic techniques or tools to be applied to specific datasets or domains.
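Outlier detection can be illustrated with a robust statistic. The sketch below flags values whose modified z-score, based on the median and the median absolute deviation (MAD), exceeds a threshold. The 3.5 cutoff is a common rule of thumb rather than a requirement, and the function name is illustrative.

```python
from statistics import median
from typing import List, Sequence


def find_outliers(values: Sequence[float], threshold: float = 3.5) -> List[float]:
    """Return values whose modified z-score exceeds the threshold."""
    center = median(values)
    deviations = [abs(value - center) for value in values]
    mad = median(deviations)
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # when the data are approximately normal.
    return [
        value
        for value in values
        if 0.6745 * abs(value - center) / mad > threshold
    ]
```

For the readings `[10, 11, 9, 10, 12, 10, 50]` the function returns `[50]`. The median and MAD are used instead of the mean and standard deviation because a single extreme value can inflate the latter enough to hide itself.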
- Ability to use workflow systems
- The students will know how to use workflow tools to automate data analysis.
- Students will understand the relationship between workflows, data product derivation and the bookkeeping of data provenance.
- Students will understand that more automation is needed when systems scale up and that workflows address this demand.
- Example Activities: Students will study workflow tools such as Galaxy, Pegasus, Kepler and NCSA Cyberintegrator. Advanced: Students will effectively use a workflow tool to automate the analysis of large data collections.
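The link between workflows and provenance bookkeeping can be sketched with a tiny pipeline runner that records, for every step, a fingerprint of its input and output. This is a conceptual illustration with hypothetical names. Tools such as Galaxy, Pegasus and Kepler provide their own, far richer, provenance mechanisms.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]  # (step name, function applied to the data)


def fingerprint(data: Any) -> str:
    """Short, stable identifier for JSON-serializable data."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(data: Any, steps: List[Step]) -> Tuple[Any, List[Dict[str, str]]]:
    """Apply each step in order and log which input produced which output."""
    provenance: List[Dict[str, str]] = []
    for name, func in steps:
        result = func(data)
        provenance.append(
            {"step": name, "input": fingerprint(data), "output": fingerprint(result)}
        )
        data = result
    return data, provenance


# Hypothetical two-step analysis: drop missing values, then square the rest.
example_steps: List[Step] = [
    ("drop_missing", lambda rows: [r for r in rows if r is not None]),
    ("square", lambda rows: [r * r for r in rows]),
]
final_result, provenance_log = run_workflow([1, None, 3], example_steps)
```

Each entry's output fingerprint matches the next entry's input fingerprint, so the log forms a chain from the raw data to the derived product.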
- Ability to use and manage databases
- Students will be capable of using, managing and customizing relational database management system implementations (RDBMS). Students should be aware of vendor extensions to the SQL standard and the tradeoffs associated with their use.
- Students will be capable of designing, using and managing workflows such as ETL (extract, transform, and load) to aggregate data from multiple sources integrating it in databases, data warehouses and cloud-based infrastructures.
- Students will be able to use, manage and customize NoSQL databases including key value, wide column, document and graph stores as well as their application on non-tabular data.
- Students will be able to use, manage and customize graph databases and apply them to multi-dimensional datasets.
- Students will be able to use, manage and customize Multimedia Databases, including objects such as audio and video snippets.
- Example Activities: Students will learn how to use, manage and integrate database systems such as Oracle, PostgreSQL, MySQL and SQLite. Students will utilize and manage multiple NoSQL database technologies including key-value stores such as Redis, wide-column stores such as HBase, document stores such as MongoDB and graph stores such as Neo4j. Advanced: Students will develop customized database solutions to specific data analysis needs. Students will integrate multiple vendor solutions and ad-hoc developed database systems to allow consolidated data analysis from multiple sources.
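The ETL pattern named above can be sketched with SQLite through Python's standard `sqlite3` module. The raw rows, the table name and the column names are made up for illustration. A production pipeline would read from real sources and handle rejected rows more carefully than simply skipping them.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

# Extract: hypothetical raw readings as they might arrive from a text export.
RAW_ROWS = [("station-a", " 21.5 "), ("station-b", "n/a"), ("station-a", "23.5")]


def transform(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, float]]:
    """Parse temperatures and skip rows whose value is not a number."""
    for station, raw_value in rows:
        try:
            yield station, float(raw_value.strip())
        except ValueError:
            continue


def load(conn: sqlite3.Connection, rows: Iterable[Tuple[str, float]]) -> None:
    """Create the target table if needed and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "station TEXT NOT NULL, temperature REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def mean_by_station(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Query the loaded data with standard SQL aggregation."""
    cursor = conn.execute(
        "SELECT station, AVG(temperature) FROM readings "
        "GROUP BY station ORDER BY station"
    )
    return cursor.fetchall()
```

A typical use is `conn = sqlite3.connect(":memory:")` followed by `load(conn, transform(RAW_ROWS))` and `mean_by_station(conn)`. The parameterized `?` placeholders keep values separate from the SQL text, which matters once the input comes from untrusted sources.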