Towards Causal Representation Learning
Last modified on May 16, 2022

Notes on a talk by Yoshua Bengio and the associated paper.

Causal representation learning: the discovery of high-level causal variables from low-level observations.

In practice, i.i.d. is a bad assumption: the data a system sees after deployment rarely follows the same distribution as its training data. This is why current DL systems are brittle.

But…what assumption can we replace it with, then?

How does the brain break knowledge apart into “pieces” that can be reused? => compositionality (think of decomposition into helper functions in programming). Examples of compositionality include

Systematic Generalization

Current DL methods overfit the training distribution: when they encounter out-of-distribution (OOD) data, they perform poorly.

Conscious processing helps humans deal with OOD settings

We are agents, and agents face a dynamic environment – particularly because there are other agents! We want our knowledge to generalize across different places, times, input modalities, goals, etc.

System 1 vs. System 2

System 1: Intuitive, fast, unconscious, parallel, non-linguistic, habitual
System 2: Slow, logical, sequential, conscious, linguistic, algorithmic, planning, reasoning

Current deep learning systems excel at System 1 – they are fast, intuitive, but brittle. How can we incorporate more System 2 to allow DL to reason about the world?

Implicit vs. verbalizable knowledge

Most of our knowledge is implicit, and not verbalizable. Same for neural networks.

Verbalizable knowledge can be explicitly reasoned with, planned with.

Independent mechanisms

Hypothesis: We can explain the world by the composition of informationally independent pieces/modules/mechanisms. (Note: not statistically independent, but independent s.t. any causal intervention would affect just one such mechanism.)
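A minimal sketch of what this kind of independence buys, assuming a hypothetical two-variable causal model A -> B with binary variables. The names `sample`, `cond_prob`, and the probabilities are illustrative choices, not from the talk. Intervening on the mechanism for A leaves the mechanism for B given A untouched, while the anti-causal conditional P(A | B) shifts.

```python
import random

def sample(p_a, p_b_given_a, n, rng):
    """Draw n (a, b) pairs from the causal model A -> B."""
    data = []
    for _ in range(n):
        a = 1 if rng.random() < p_a else 0
        b = 1 if rng.random() < p_b_given_a[a] else 0
        data.append((a, b))
    return data

def cond_prob(data, target, given, given_value):
    """Estimate P(target = 1 | given = given_value) by counting."""
    rows = [r for r in data if r[given] == given_value]
    return sum(r[target] for r in rows) / len(rows)

rng = random.Random(0)
# Mechanism for B given A (assumed values), shared by both regimes.
p_b_given_a = {0: 0.2, 1: 0.9}

train = sample(0.5, p_b_given_a, 50_000, rng)    # P(A=1) = 0.5
shifted = sample(0.9, p_b_given_a, 50_000, rng)  # intervention: P(A=1) = 0.9

# Causal direction: P(B=1 | A=1) should be (nearly) the same in both regimes.
causal_before = cond_prob(train, target=1, given=0, given_value=1)
causal_after = cond_prob(shifted, target=1, given=0, given_value=1)

# Anti-causal direction: P(A=1 | B=1) moves when P(A) is intervened on.
anti_before = cond_prob(train, target=0, given=1, given_value=1)
anti_after = cond_prob(shifted, target=0, given=1, given_value=1)

print("P(B=1|A=1):", round(causal_before, 3), "->", round(causal_after, 3))
print("P(A=1|B=1):", round(anti_before, 3), "->", round(anti_after, 3))
```

A learner that models P(A) and P(B | A) separately only has to re-estimate one piece after the intervention. A learner that models P(B) and P(A | B) has to re-estimate both.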

Some System 2 inductive priors

Sparse causal graph of high-level, semantically meaningful variables.

Sparse factor graph.

Semantic variables are causal: agents, intentions, controllable objects, for example.

Changes in distribution are due to causal interventions (in the aforementioned high-level semantic space.) Provided we have the right abstractions, it would only take a few words to describe this change.

Everything that’s happening can be reported in simple language. (Interesting that this is an example of report/access consciousness.) Mapping from semantic variables <=> sentences

“generic rules” of how things work are shared across instances – need variables / functions / some form of indirection.

Stability/robustness in meaning (e.g., of the laws of physics), even with changes in distribution, vs. things that do change. E.g.: early visual layers are stable after childhood, while later stages like object recognition can adapt very quickly.

Causal chains to explain things are short. (Interesting: connection to dissonance reduction: we like simple explanations of the world around us (possibly because it helps us streamline our cognition.))

What should the causal variables be?

Position and momentum of every particle: computationally intractable.

Take inspiration from scientists (and humans in general): we invent high-level abstractions that make the causal structure of the world simpler.

Agency to Guide Representation Learning & Disentangling

(E. Bengio et al., 2017; V. Thomas et al., 2017; Kim et al., ICML 2019)

Independent mechanisms: there are ways to modify a single object in the graph (e.g., you can move a chair ➡️🪑).

The way we represent actions <=> objects: there is a bijection between the two.

Connected to the psychological notion of affordances: the way we understand objects is by the things we can do with them.

What causes changes in distribution?

Hypothesis to replace the i.i.d. assumption: changes in distribution = consequence of an intervention on one or a few causes/mechanisms. So the data is not identically distributed, but it is pretty similar if you are in the right high-level representation space. (E.g., if you put shaded glasses on, all the pixels change in basic RGB space, but in some high-level semantic space, only one bit changed!)
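The glasses example can be made concrete with a toy sketch. Everything here is an assumption for illustration: `render` is a hypothetical stand-in for "the world producing pixels", and the scene has just two semantic variables.

```python
def render(scene, size=8):
    """Hypothetical renderer: maps a semantic scene to a size x size RGB image."""
    tint = 0.4 if scene["sunglasses"] else 1.0
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            # Simple gradient background with one bright "object" pixel.
            base = (40 + 20 * x, 60 + 15 * y, 120)
            if (x, y) == scene["object_pos"]:
                base = (250, 250, 250)
            # Shaded glasses darken every channel of every pixel.
            row.append(tuple(int(c * tint) for c in base))
        image.append(row)
    return image

def count_diffs(img_a, img_b):
    """Number of pixel positions whose RGB value differs."""
    return sum(pa != pb for ra, rb in zip(img_a, img_b) for pa, pb in zip(ra, rb))

before = {"sunglasses": False, "object_pos": (2, 3)}
after = dict(before, sunglasses=True)  # intervention on one semantic variable

pixels_changed = count_diffs(render(before), render(after))
semantic_changed = sum(before[k] != after[k] for k in before)

print(pixels_changed, "of 64 pixels changed")        # all 64 change
print(semantic_changed, "semantic variable changed")  # exactly 1 changes
```

In pixel space the two distributions look very different. In the semantic space the change is as sparse as it can be.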

Causal induction from interventional data

How to handle an unknown intervention? Infer it.
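One simple way to picture "infer it" is a likelihood-drop heuristic. This is a sketch of the idea, not the method from the talk or the paper. It assumes a hypothetical binary model A -> B whose graph is already known. Each mechanism is fit on the training data, then scored on the new data, and the mechanism whose fit degrades most is taken as the intervention target.

```python
import math
import random

def sample(p_a, p_b_given_a, n, rng):
    """Draw n (a, b) pairs from the causal model A -> B."""
    data = []
    for _ in range(n):
        a = 1 if rng.random() < p_a else 0
        b = 1 if rng.random() < p_b_given_a[a] else 0
        data.append((a, b))
    return data

def fit(data):
    """Estimate P(A=1) and P(B=1 | A=a) by counting."""
    p_a = sum(a for a, _ in data) / len(data)
    p_b = {}
    for v in (0, 1):
        rows = [b for a, b in data if a == v]
        p_b[v] = sum(rows) / len(rows)
    return p_a, p_b

def bern_ll(x, p):
    """Log-likelihood of a binary outcome x under Bernoulli(p)."""
    return math.log(p if x == 1 else 1.0 - p)

def mechanism_scores(data, p_a, p_b):
    """Average log-likelihood of each mechanism on the data."""
    n = len(data)
    return {
        "P(A)": sum(bern_ll(a, p_a) for a, _ in data) / n,
        "P(B|A)": sum(bern_ll(b, p_b[a]) for a, b in data) / n,
    }

rng = random.Random(1)
train = sample(0.5, {0: 0.2, 1: 0.9}, 20_000, rng)
# Unknown to the learner: the mechanism for B given A was changed.
new = sample(0.5, {0: 0.7, 1: 0.3}, 20_000, rng)

p_a, p_b = fit(train)
before = mechanism_scores(train, p_a, p_b)
after = mechanism_scores(new, p_a, p_b)

# The mechanism with the largest drop in fit is the inferred target.
drops = {name: before[name] - after[name] for name in before}
inferred = max(drops, key=drops.get)
print("inferred intervention target:", inferred)
```

The heuristic only works because the model is factored into separate mechanisms. A single monolithic model of P(A, B) would show that something changed, but not where.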

Thoughts, Consciousness, Language

If we want better NLP/NLU, we need to ground language in higher-level concepts.

Grounded language learning: BabyAI (2019)

Core ingredient for conscious processing: attention

Attention enables us to make dynamic connections between the different “modules” in the brain. It creates competition between the modules over which one deserves attention.
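A rough sketch of attention as competition, assuming scaled dot-product scores followed by a hard top-k selection. That particular choice, the module names, and the key vectors are all illustrative assumptions, not something stated in the talk.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, module_keys, top_k=1):
    """Score each module's key against the query; only the top_k modules win."""
    names = list(module_keys)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, module_keys[n]) / scale for n in names])
    ranked = sorted(zip(names, weights), key=lambda nw: nw[1], reverse=True)
    return ranked[:top_k], dict(zip(names, weights))

# Hypothetical modules, each advertising a key vector.
module_keys = {
    "vision": [1.0, 0.0, 0.0],
    "language": [0.0, 1.0, 0.0],
    "motor": [0.0, 0.0, 1.0],
}
query = [0.2, 2.0, 0.1]  # the current context mostly matches "language"

winners, weights = attend(query, module_keys, top_k=1)
print("winner:", winners[0][0])
print("weights:", {k: round(v, 3) for k, v in weights.items()})
```

Because the softmax weights sum to one, a module can only gain attention at the expense of the others, which is the competition described above. The connection is dynamic because the winner depends on the query.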

Going from attention to consciousness

Dehaene et al. – workspace theory of consciousness [1]




[1] S. Dehaene, J.-P. Changeux, L. Naccache, J. Sackur, and C. Sergent, “Conscious, preconscious, and subliminal processing: A testable taxonomy,” Trends in Cognitive Sciences, vol. 10, no. 5, pp. 204–211, May 2006, doi: 10.1016/j.tics.2006.03.007.
[2] B. Schölkopf et al., “Towards Causal Representation Learning,” Feb. 2021, doi: 10.48550/arXiv.2102.11107.
