A Survey on Verification and Validation, Testing and Evaluations of Neurosymbolic Artificial Intelligence

Justus Reifkoph*, Ke Feng*, Marc Meier-Doernberg, Alvaro Velasquez, Houbing Herbert Song, Fellow, IEEE

Abstract

Neurosymbolic artificial intelligence (AI) is an emerging branch of AI that combines the strengths of symbolic AI and sub-symbolic AI. Symbolic AI is based on the idea that intelligence can be represented using semantically meaningful symbolic rules and representations, while deep learning (DL), or sometimes called sub-symbolic AI, is based on the idea that intelligence emerges from the collective behavior of artificial neurons that are connected to each other. A major drawback of DL is that it acts as a 'black box', meaning that predictions are difficult to explain, making the testing & evaluation (T&E) and validation & verification (V&V) a challenging task.

Core problem

The core problem is the difficulty in validating and verifying (V&V) neurosymbolic AI systems, which combine symbolic and sub-symbolic AI. The primary challenge arises from the 'black box' nature of sub-symbolic AI, particularly deep learning, making it hard to explain predictions and rigorously test, evaluate, and validate these systems, especially in safety-critical domains like autonomous systems.

Key findings and Contribution

The paper compares two taxonomies for neurosymbolic AI: Kautz's taxonomy with six types of interactions and Yu's simplified taxonomy with three categories.
Verification and validation techniques for symbolic AI are discussed, with propositional logic being decidable and first-order logic undecidable.
The paper surveys V&V methods for sub-symbolic AI, focusing on robustness, reachability, and interpretability, and mentions tools like DeepXplore and DLFuzz.
Current V&V methods can be applied to individual components of neurosymbolic AI, with potential to enhance sub-symbolic part verification using the symbolic component.

Limitations

There is a lack of dedicated testing frameworks for neurosymbolic AI as a whole.
Computational intensity of verifying symbolic AI with methods like truth tables does not scale well with the number of variables.
First-order logic's undecidability limits its full verification.
New deep learning architectures, such as GCNs, require new or adapted testing methods.
Identified research gaps include the need for more efficient logic rule verification and comparison of neurosymbolic AI with conventional deep learning approaches.

Key quotes

The paper surveys V&V methods for sub-symbolic AI, focusing on robustness, reachability, and interpretability, and mentions tools like DeepXplore and DLFuzz. For robustness, methods such as generating adversarial examples are used to test the system's ability to maintain correct decisions under noisy or manipulated inputs.

Type: Key Finding

robustness

adversarial examples

DeepXplore

DLFuzz

Computational intensity of verifying symbolic AI with methods like truth tables does not scale well with the number of variables. These methods are often only feasible for problems with a small number of propositional variables due to their exponential complexity.

Type: Limitation/Constraint

truth tables

propositional variables

exponential complexity

symbolic AI

Abstract

Core problem

Key findings and Contribution

The paper compares two taxonomies for neurosymbolic AI: Kautz's taxonomy with six types of interactions and Yu's simplified taxonomy with three categories.

Verification and validation techniques for symbolic AI are discussed, with propositional logic being decidable and first-order logic undecidable.

The paper surveys V&V methods for sub-symbolic AI, focusing on robustness, reachability, and interpretability, and mentions tools like DeepXplore and DLFuzz.

Current V&V methods can be applied to individual components of neurosymbolic AI, with potential to enhance sub-symbolic part verification using the symbolic component.

Limitations

There is a lack of dedicated testing frameworks for neurosymbolic AI as a whole.

Computational intensity of verifying symbolic AI with methods like truth tables does not scale well with the number of variables.

First-order logic's undecidability limits its full verification.

New deep learning architectures, such as GCNs, require new or adapted testing methods.

Identified research gaps include the need for more efficient logic rule verification and comparison of neurosymbolic AI with conventional deep learning approaches.