Shallow Networks for Deeper Understanding?

In the first half of the 20th century, the notion of an artificial neural network composed of many different layers of processors was born (McCulloch & Pitts, 1943).  These networks were very powerful, but they had to be hand-wired because a learning rule capable of training them had not yet been invented.

The first learning rule for artificial neural networks was discovered around the time of the cognitive revolution (Rosenblatt, 1958, 1962).  However, this rule could not train networks that contained hidden units.  As a result, it could train only perceptrons, which are networks of limited capability (Minsky & Papert, 1969).
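
The limitation can be illustrated with a minimal sketch.  The code below is not taken from any of the sources cited above; it assumes a simple error-correcting rule in the spirit of Rosenblatt's, a threshold activation, and two illustrative problems chosen here (AND and XOR).  The function name train_perceptron and its parameters are hypothetical.

```python
def train_perceptron(patterns, targets, epochs=100, rate=1.0):
    """Train a single threshold unit with no hidden units.

    Assumed rule: weights change only when the output is wrong,
    in proportion to the error (target minus output).
    Returns (weights, bias, learned), where learned is True if an
    epoch finished with no errors.
    """
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(patterns, targets):
            net = sum(w * xi for w, xi in zip(weights, x)) + bias
            out = 1 if net > 0 else 0
            delta = t - out
            if delta != 0:
                errors += 1
                weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
                bias += rate * delta
        if errors == 0:
            return weights, bias, True
    return weights, bias, False


patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable, so the rule settles on a solution.
_, _, and_learned = train_perceptron(patterns, [0, 0, 0, 1])

# XOR is not linearly separable, so no error-free epoch ever occurs.
_, _, xor_learned = train_perceptron(patterns, [0, 1, 1, 0])

print("AND learned:", and_learned)
print("XOR learned:", xor_learned)
```

Under these assumptions the unit learns AND but never learns XOR, however many epochs are allowed, because no single set of weights without hidden units can separate the XOR patterns.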

The rise of modern connectionism began with the discovery of supervised learning rules for networks with hidden units (Ackley, Hinton, & Sejnowski, 1985; Amari, 1967; Anderson, 1995; Rumelhart, Hinton, & Williams, 1986; Werbos, 1994).  Researchers could now teach networks that had, in principle, enormous computational power.  Networks like the multilayer perceptron became the staple of connectionist cognitive science.

In the early decades of the 21st century, some researchers expressed concern about the limitations of the supervised training of multilayer perceptrons.  While such networks can learn to perform a variety of complicated tasks, researchers often encounter practical problems in their use.  Some have pointed out that the incredible power of the human brain arises from its use of a great many layers of hidden neurons (Bengio, 2009).  However, when 20th-century supervised learning rules are used, networks of many layers are enormously difficult to train.  The old approaches to network training face practical obstacles that prevent the in-principle power of multilayer networks from being exploited.

Modern researchers have discovered new types of learning rules that permit networks with many layers of hidden units to be trained (Bengio, Courville, & Vincent, 2013; Hinton, 2007; Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Larochelle, Mandel, Pascanu, & Bengio, 2012).  These new rules, often called deep learning, now permit researchers to train deep belief networks to accomplish tasks far beyond the capabilities of shallow, late-20th-century networks.  Deep learning has produced networks for classification tasks involving natural language, images, and sound (Hinton, 2007; Hinton et al., 2006; Mohamed, Dahl, & Hinton, 2012; Sarikaya, Hinton, & Deoras, 2014).  News reports regularly describe deep learning applications employed by companies such as Google, Facebook, and PayPal, and deep learning rules are widely available (Fischer & Igel, 2014; Testolin, Stoianov, De Grazia, & Zorzi, 2013).

The networks studied in the current book are clearly antiquated in comparison to modern deep belief networks.  What is the point of using older, less powerful networks to investigate music?

The primary motivation for exploring music with older architectures is the frequent disconnect between the technology of neural networks and the cognitive science of neural networks (Dawson & Shamanski, 1994).  The development of artificial neural networks occurs in many different disciplines, and these different disciplines often have different goals.  For instance, deep learning is emerging from computer science, and current research on it focuses on developing new procedures for accomplishing deep learning efficiently (Bengio, 2009).  In other words, deep learning is being developed from a technological perspective; its developers are concerned with successfully training networks to perform extremely complex pattern classification tasks.

The cognitive science of deep learning is lagging far behind its technology.  Some researchers have expressed concern that while deep learning produces networks that solve problems worthy of human neural processing, these networks are not themselves providing any insight about the workings of the human brain or the human mind.

One reason for this is that most deep learning advances are currently quantitative, not qualitative (Erhan, Courville, & Bengio, 2010).  Techniques for interpreting the internal structure of deep belief networks are in their infancy.  If a network cannot be interpreted, then it likely cannot contribute to cognitive science (McCloskey, 1991).  Without interpretation, deep belief networks are magnificent artifacts, but are neither cognitive nor biological theories.

Of course, this is not to say that researchers have no interest in interpreting the internal structure of deep belief networks (Erhan et al., 2010; Hinton et al., 2006).  For instance, in the very first publication describing a method for deep learning, Hinton et al. (2006) look into a network’s “mind” by observing the responses of network processors to various stimuli, in the hope of discovering the abstract features that are detected by hidden layers.  However, few sophisticated techniques for interpreting deep networks exist.  Erhan et al. (2010) observe that researchers typically only visually examine the receptive fields (i.e., the connection weights) that feed into processors in the first hidden layer of a deep belief network.

One reason to explore older architectures in the current book is that many more procedures exist for interpreting their internal structure.  This in turn makes them more likely to contribute to a cognitive science of music.

A second reason to focus on older artificial neural network architectures is the goal of seeking the simplest network that is required to solve a particular task.  For example, in the next chapter we will see that no hidden units are required at all to identify the tonic of a scale.  If such a simple network can accomplish this task, then why would we examine the task with a deep belief network?  Indeed, though very old architectures like the perceptron are extraordinarily simple, they can easily be used to contribute to a variety of topics in modern cognitive science (Dawson, 2008; Dawson & Dupuis, 2012; Dawson, Dupuis, Spetch, & Kelly, 2009; Dawson, Kelly, Spetch, & Dupuis, 2010).

Of course, the proof of the pudding is in the eating.  Thus, to defend the claim that older network architectures can contribute to the study of musical cognition, we must actually demonstrate their utility.  The goal of the remaining chapters in this book is to do exactly that.  Can we show that training shallow networks can provide a deeper understanding of music?

