The Elements of Brain Theory and Neural Networks
Michael A. Arbib

Introduction

How to Use Part I
Part I provides background material, summarizing a set of concepts established for the formal study of neurons and neural networks by 1986. As such, it is designed to hold few, if any, surprises for readers with a fair background in computational neuroscience or theoretical approaches to neural networks considered as dynamic, adaptive systems. Rather, Part I is designed for the many readers—be they neuroscience experimentalists, psychologists, philosophers, or technologists—who are sufficiently new to brain theory and neural networks that they can benefit from a compact overview of basic concepts prior to reading the road maps of Part II and the articles in Part III. Of course, much of what is covered in Part I is also covered at some length in the articles in Part III, and cross-references will steer the reader to these articles for alternative expositions and reviews of current research. In this exposition, as throughout the Handbook, we will move back and forth between computational neuroscience, where the emphasis is on modeling biological neurons, and neural computing, where the emphasis is on artificial neural networks based loosely on abstractions from biology but driven more by technological utility than by biological considerations.
Section I.1, “Introducing the Neuron,” conveys the basic properties of neurons, receptors, and effectors, and then introduces several simple neural models, including the discrete-time McCulloch-Pitts model and the continuous-time leaky integrator model. References to Part III alert the reader to more detailed properties of neurons which are essential for the neuroscientist and provide interesting hints about future design features for the technologist.
Section I.2, “Levels and Styles of Analysis,” is designed to give the reader a feel for the interdisciplinary nexus in which the present study of brain theory and neural networks is located. The selection begins with a historical fragment which traces our federation of disciplines back to their roots in cybernetics, the study of control and communication in animals and machines. We look at the way in which the research addresses brains, machines, and minds, going back and forth between brain theory, artificial intelligence, and cognitive psychology. We then review the different levels of analysis involved, whether we study brains or intelligent machines, and the use of schemas to provide intermediate functional units that bridge the gap between an overall task and the neural networks which implement it.
Section I.3, “Dynamics and Adaptation in Neural Networks,” provides a tutorial on the concepts essential for understanding neural networks as dynamic, adaptive systems. It introduces the basic dynamic systems concepts of stability, limit cycles, and chaos, and relates Hopfield nets to attractors and optimization. It then introduces a number of basic concepts concerning adaptation in neural nets, with discussions of pattern recognition, associative memory, Hebbian plasticity and network self-organization, perceptrons, network complexity, gradient descent and credit assignment, and backpropagation. This section, and with it Part I, closes with a cautionary note. The basic learning rules and adaptive architectures of neural networks have already illuminated a number of biological issues and led to useful technological applications. However, these networks must have their initial structure well constrained (whether by evolution or technological design) to yield approximate solutions to the system’s tasks—solutions that can then be efficiently and efficaciously shaped by experience. Moreover, the full understanding of the brain and the improved design of intelligent machines will require not only improvements in these learning methods and their initialization, but also a fuller understanding of architectures based on networks of networks. Cross-references to articles in Part III will set the reader on the path to this fuller understanding. Because Part I focuses on the basic concepts established for the formal study of neurons and neural networks by 1986, it differs hardly at all from Part I of the first edition of the Handbook. By contrast, Part II, which provides the road maps that guide readers through the radically updated Part III, has been completely rewritten for the present edition to reflect the latest research results.
I.1. Introducing the Neuron
We introduce the neuron. The dangerous word in the preceding sentence is the. In biology, there are radically different types of neurons in the human brain, and endless variations in neuron types of other species. In brain theory, the complexities of real neurons are abstracted in many ways to aid in understanding different aspects of neural network development, learning, or function. In neural computing (technology based on networks of “neuron-like” units), the artificial neurons are designed as variations on the abstractions of brain theory and are implemented in software, or VLSI or other media. There is no such thing as a “typical” neuron, yet this section will nonetheless present examples and models which provide a starting point, an essential set of key concepts, for the appreciation of the many variations on the theme of neurons and neural networks presented in Part III.
An analogy to the problem we face here might be to define vehicle for a handbook of transportation. A vehicle could be a car, a train, a plane, a rowboat, or a forklift truck. It might or might not carry people. The people could be crew or passengers, and so on. The problem would be to give a few key examples of form (such as car versus plane) and function (to carry people or goods, by land, air, or sea, etc.). Moreover, we would find interesting examples of co-evolution: for example, modern highway systems would not have been created without the pressure of increasing car traffic; most features of cars are adapted to the existence of sealed roads, and some features (e.g., cruise control) are specifically adapted to good freeway conditions. Following a similar procedure, Part III offers diverse examples of neural form and function in both biology and technology.
Here, we start with the observation that a brain is made up of a network of cells called neurons, coupled to receptors and effectors. Neurons are intimately connected with glial cells, which provide support functions for neural networks. New empirical data show the importance of glia in regeneration of neural networks after damage and in maintaining the neurochemical milieu during normal operation. However, such data have had very little impact on neural modeling and so will not be considered further here. The input to the network of neurons is provided by receptors, which continually monitor changes in the external and internal environment. Cells called motor neurons (or motoneurons), governed by the activity of the neural network, control the movement of muscles and the secretion of glands. In between, an intricate network of neurons (a few hundred neurons in some simple creatures, hundreds of billions in a human brain) continually combines the signals from the receptors with signals encoding past experience to barrage the motor neurons with signals that will yield adaptive interactions with the environment. In animals with backbones (vertebrates, including mammals in general and humans in particular), this network is called the central nervous system (CNS), and the brain constitutes the most headward part of this system, linked to the receptors and effectors of the body via the spinal cord. Invertebrate nervous systems (neural networks) provide astounding variations on the vertebrate theme, thanks to eons of divergent evolution. Thus, while the human brain may be the source of rich analogies for technologists in search of “artificial intelligence,” both invertebrates and vertebrates provide endless ideas for technologists designing neural networks for sensory processing, robot control, and a host of other applications. 
(A few of the relevant examples may be found in the Part II road maps, Vision, Robotics and Control Theory, Motor Pattern Generators, and Neuroethology and Evolution.)
The brain provides far more than a simple stimulus-response chain from receptors to effectors (although there are such reflex paths). Rather, the vast network of neurons is interconnected in loops and tangled skeins so that signals entering the net from the receptors interact there with the billions of signals already traversing the system, not only to yield the signals that control the effectors but also to modify the very properties of the network itself, so that future behavior will reflect prior experience.
The Diversity of Receptors
Rod and cone receptors in the eyes respond to light, hair cells in the ears respond to pressure, and other cells in the tongue and the mouth respond to subtle traces of chemicals. In addition to touch receptors, there are receptors in the skin that are responsive to movement or to temperature, or that signal painful stimuli. These external senses may be divided into two classes: (1) the proximity senses, such as touch and taste, which sense objects in contact with the body surface, and (2) the distance senses, such as vision and hearing, which let us sense objects distant from the body. Olfaction is somewhere in between, using chemical signals “right under our noses” to sense nonproximate objects. Moreover, even the proximate senses can yield information about nonproximate objects, as when we feel the wind or the heat of a fire. More generally, much of our appreciation of the world around us rests on the unconscious fusion of data from diverse sensory systems.
The appropriate activity of the effectors must depend on comparing where the system should be—the current target of an ongoing movement—with where it is now. Thus, in addition to the external receptors, there are receptors that monitor the activity of muscles, tendons, and joints to provide a continual source of feedback about the tensions and lengths of muscles and the angles of the joints, as well as their velocities. The vestibular system in the head monitors gravity and accelerations. Here, the receptors are hair cells monitoring fluid motion. There are also receptors to monitor the chemical level of the bloodstream and the state of the heart and the intestines. Cells in the liver monitor glucose, while others in the kidney check water balance. Receptors in the hypothalamus, itself a part of the brain, also check the balance of water and sugar. The hypothalamus then integrates these diverse messages to direct behavior or other organs to restore the balance. If we stimulate the hypothalamus, an animal may drink copious quantities of water or eat enormous quantities of food, even though it is already well supplied; the brain has received a signal that water or food is lacking, and so it instructs the animal accordingly, irrespective of whatever contradictory signals may be coming from a distended stomach.
Basic Properties of Neurons
To understand the processes that intervene between receptors and effectors, we must have a closer look at “the” neuron. As already emphasized, there is no such thing as a typical neuron. However, we will summarize properties shared by many neurons. The “basic neuron” shown in Figure 1 is abstracted from a motor neuron of mammalian spinal cord. From the soma (cell body) protrude a number of ramifying branches called dendrites; the soma and dendrites constitute the input surface of the neuron. There also extends from the cell body, at a point called the axon hillock (abutting the initial segment), a long fiber called the axon, whose branches form the axonal arborization. The tips of the branches of the axon, called nerve terminals or boutons, impinge on other neurons or on effectors. The locus of interaction between a bouton and the cell on which it impinges is called a synapse, and we say that the cell with the bouton synapses upon the cell with which the connection is made. In fact, axonal branches of some neurons can have many varicosities, corresponding to synapses, along their length, not just at the end of the branch.
Figure 1.
A “basic neuron” abstracted from a motor neuron of mammalian spinal cord. The dendrites and soma (cell body) constitute the major part of the input surface of the neuron. The axon is the “output line.” The tips of the branches of the axon form synapses upon other neurons or upon effectors (although synapses may occur along the branches of an axon as well as at the ends). (From Arbib, M. A., 1989, The Metaphorical Brain 2: Neural Networks and Beyond, New York: Wiley-Interscience, p. 52. Reproduced with permission. Copyright © 1989 by John Wiley & Sons, Inc.)
We can imagine the flow of information as shown by the arrows in Figure 1. Although “conduction” can go in either direction on the axon, most synapses tend to “communicate” activity to the dendrites or soma of the cell they synapse upon, whence activity passes to the axon hillock and then down the axon to the terminal arborization. The axon can be very long indeed. For instance, the cell body of a neuron that controls the big toe lies in the spinal cord and thus has an axon that runs the complete length of the leg. We may contrast the immense length of the axon of such a neuron with the very small size of many of the neurons in our heads. For example, amacrine cells in the retina have branchings that cannot appropriately be labeled dendrites or axons, for they are short and may well communicate activity in either direction to serve as local modulators of the surrounding network. In fact, the propagation of signals in the “counter-direction” on dendrites away from the soma has in recent years been seen to play an important role in neuronal function, but this feature is not included in the account of the “basic neuron” given here (see Dendritic Processing—titles in small caps refer to articles in Part III).
To understand more about neuronal “communication,” we emphasize that the cell is enclosed by a membrane, across which there is a difference in electrical charge. If we change this potential difference between the inside and outside, the change can propagate passively, in much the same way that heat is conducted down a rod of metal: the change in potential difference occurs later, and becomes smaller, the farther away we move from the site of the original change. This passive propagation is governed by the cable equation.
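In standard notation (a sketch, writing V for the deviation of membrane potential from rest, λ for the length constant introduced below, and τ for the membrane time constant), the one-dimensional cable equation can be written as:

```latex
\lambda^{2}\,\frac{\partial^{2} V(x,t)}{\partial x^{2}}
  \;=\; \tau\,\frac{\partial V(x,t)}{\partial t} \;+\; V(x,t)
```

Setting ∂V/∂t = 0 yields the steady-state exponential decay described next.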
If the starting voltage at a point on the axon is V0, and no further conditions are imposed, the potential will decay exponentially, having value V(x) = V0e−x/λ at distance x from the starting point, where λ, the length constant, is the distance over which the potential falls by a factor of 1/e. This length constant differs from axon to axon. For “short” cells (such as the rods, cones, and bipolar cells of the retina), passive propagation suffices to signal a potential change from one end to the other; but if the axon is long, this mechanism is completely inadequate, since changes at one end will decay almost completely before reaching the other end. Fortunately, most nerve cells have the further property that if the change in potential difference is large enough (we say it exceeds a threshold), then in a cylindrical configuration such as the axon, a pulse can be generated that will actively propagate at full amplitude instead of fading passively.
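As a minimal numerical sketch (the values here are hypothetical; real length constants vary from axon to axon), the steady-state attenuation can be computed directly:

```python
import math

def passive_potential(v0, x_mm, length_constant_mm):
    """Steady-state passive decay: V(x) = V0 * exp(-x / lambda)."""
    return v0 * math.exp(-x_mm / length_constant_mm)

# Hypothetical axon: length constant of 1 mm, initial change of 10 mV.
for x in (0.0, 1.0, 3.0, 10.0):
    print(f"{x:5.1f} mm: {passive_potential(10.0, x, 1.0):.4f} mV")
```

At ten length constants the signal has fallen below a microvolt-scale residue, which is why a long axon cannot rely on passive conduction alone.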
If propagation of various potential differences on the dendrites and soma of a neuron yields a potential difference across the membrane at the axon hillock which exceeds a certain threshold, then a regenerative process is started: the electrical change at one place is enough to trigger this process at the next place, yielding a spike or action potential, an undiminishing pulse of potential difference propagating down the axon. After an impulse has propagated along the length of the axon, there is a short refractory period during which a new impulse cannot be propagated along the axon.
The propagation of action potentials is now very well understood. Briefly, the change in membrane potential is mediated by the flow of ions, especially sodium and potassium, across the membrane. Hodgkin and Huxley (1952) showed that the conductance of the membrane to sodium and potassium ions—the ease with which they flow across the membrane—depends on the transmembrane voltage. They developed elegant equations describing the voltage and time dependence of the sodium and potassium conductances. These equations (see the article Axonal Modeling in Part III) have given us great insight into cellular function. Much mathematical research has gone into studying Hodgkin-Huxley-like equations, showing, for example, that neurons can support rhythmic pulse generation even without input (see Oscillatory and Bursting Properties of Neurons), and explicating triggered long-distance propagation. Hodgkin and Huxley used curve fitting from experimental data to determine the terms for conductance change in their model. Subsequently, much research has probed the structure of complex molecules that form channels which selectively allow the passage of specific ions through the membrane (see Ion Channels: Keys to Neuronal Specialization). This research has demonstrated how channel properties not only account for the terms in the Hodgkin-Huxley equation, but also underlie more complex dynamics which may allow even small patches of neural membrane to act like complex computing elements. At present, most artificial neurons used in applications are very simple indeed, and much future technology will exploit these “subneural subtleties.”
An impulse traveling along the axon from the axon hillock triggers new impulses in each of its branches (or collaterals), which in turn trigger impulses in their even finer branches. Vertebrate axons come in two varieties, myelinated and unmyelinated. The myelinated fibers are wrapped in a sheath of myelin (Schwann cells in the periphery, oligodendrocytes in the CNS—these are glial cells, and their role in axonal conduction is the primary role of glia considered in neural modeling to date). The small gaps between successive segments of the myelin sheath are called nodes of Ranvier. Instead of the somewhat slow active propagation down an unmyelinated fiber, the nerve impulse in a myelinated fiber jumps from node to node, thus speeding passage and reducing energy requirements (see Axonal Modeling).
Surprisingly, at most synapses, the direct cause of the change in potential of the postsynaptic membrane is not electrical but chemical. When an impulse arrives at the presynaptic terminal, it causes the release of transmitter molecules (which have been stored in the bouton in little packets called vesicles) through the presynaptic membrane. The transmitter then diffuses across the very small synaptic cleft to the other side, where it binds to receptors on the postsynaptic membrane to change the conductance of the postsynaptic cell. The effect of the “classical” transmitters (later we shall talk of other kinds, the neuromodulators) is of two basic kinds: either excitatory, tending to move the potential difference across the postsynaptic membrane in the direction of the threshold (depolarizing the membrane), or inhibitory, tending to move the polarity away from the threshold (hyperpolarizing the membrane). There are some exceptional cell appositions that are so large or have such tight coupling (the so-called gap junctions) that the impulse affects the postsynaptic membrane without chemical mediation (see Neocortex: Chemical and Electrical Synapses).
Most neural modeling to date focuses on the excitatory and inhibitory interactions that occur on a fast time scale (a millisecond, more or less), and most biological (as distinct from technological) models assume that all synapses from a neuron have the same “sign.” However, neurons may also secrete transmitters that modulate the function of a circuit on some quite extended time scale. Modeling that takes account of this neuromodulation (see Synaptic Interactions and Neuromodulation in Invertebrate Nervous Systems) will become increasingly important in the future, since it allows cells to change their function, enabling a neural network to switch dramatically its overall mode of activity.
The excitatory or inhibitory effect of the transmitter released when an impulse arrives at a bouton generally causes a subthreshold change in the postsynaptic membrane. Nonetheless, the cooperative effect of many such subthreshold changes may yield a potential change at the axon hillock that exceeds threshold, and if this occurs at a time when the axon has passed the refractory period of its previous firing, then a new impulse will be fired down the axon.
Synapses can differ in shape, size, form, and effectiveness. The geometrical relationships between the different synapses impinging on the cell determine what patterns of synaptic activation will yield the appropriate temporal relationships to excite the cell (see Dendritic Processing). A highly simplified example (Figure 2) shows how the properties of nervous tissue just presented would indeed allow a simple neuron, by its very dendritic geometry, to compute some useful function (cf. Rall, 1964, p. 90). Consider a neuron with four dendrites, each receiving a single synapse from a visual receptor, so arranged that synapses A, B, C, and D (from left to right) are at increasing distances from the axon hillock. (This is not meant to be a model of a neuron in the retina of an actual organism; rather, it is designed to make vivid the potential richness of single neuron computations.) We assume that each receptor reacts to the passage of a spot of light above its surface by yielding a generator potential which yields, in the postsynaptic membrane, the same time course of depolarization. This time course is propagated passively, and the farther it is propagated, the later and the lower is its peak. If four inputs reached A, B, C, and D simultaneously, their effect may be less than the threshold required to trigger a spike there. However, if an input reaches D before one reaches C, and so on, in such a way that the peaks of the four resultant time courses at the axon hillock coincide, the total effect could well exceed threshold. This, then, is a cell that, although very simple, can detect direction of motion across its input. It responds only if the spot of light is moving from right to left, and if the velocity of that motion falls within certain limits. 
Our cell will not respond to a stationary object, or one moving from left to right, because the asymmetry of placement of the dendrites on the cell body yields a preference for one direction of motion over others (for a more realistic account of biological mechanisms, see Directional Selectivity). This simple example illustrates that the form (i.e., the geometry) of the cell can have a great impact on the function of the cell, and we thus speak of form-function relations. When we note that neurons in the human brain may have 10,000 or more synapses upon them, we can understand that the range of functions of single neurons is indeed immense.
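A small simulation can make Rall's point concrete. The sketch below is not a model of any real neuron: the alpha-function time course, the delays, and the attenuation factors are all hypothetical, chosen only so that a farther synapse contributes a later and lower peak at the hillock, as in Figure 2:

```python
import math

def alpha(s, tau=1.0):
    """Normalized alpha function: zero for s <= 0, peak value 1.0 at s = tau."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0 else 0.0

# Hypothetical synapses A..D at increasing distance from the axon hillock:
# a farther synapse means a longer conduction delay and a lower peak.
DELAYS = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}    # ms, synapse to hillock
WEIGHTS = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.4}   # peak attenuation factors

def peak_at_hillock(input_times):
    """Peak of the summed postsynaptic potentials at the hillock."""
    grid = [0.01 * k for k in range(1500)]           # 0..15 ms in 0.01 ms steps
    return max(
        sum(WEIGHTS[s] * alpha(t - t0 - DELAYS[s]) for s, t0 in input_times.items())
        for t in grid
    )

THRESHOLD = 2.5
simultaneous = peak_at_hillock({"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0})
right_to_left = peak_at_hillock({"D": 0.0, "C": 1.0, "B": 2.0, "A": 3.0})
print(f"simultaneous inputs: peak {simultaneous:.2f}, fires: {simultaneous >= THRESHOLD}")
print(f"right-to-left sweep: peak {right_to_left:.2f}, fires: {right_to_left >= THRESHOLD}")
```

With these numbers, the right-to-left sweep makes the four peaks arrive at the hillock together and the summed potential crosses threshold, while simultaneous activation leaves the peaks misaligned and subthreshold.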
Figure 2.
An example, conceived by Wilfrid Rall, of the subtleties that can be revealed by neural modeling when dendritic properties (in this case, length-dependent conduction time) are taken into account. As shown in Part C, the effect of simultaneously activating all inputs may be subthreshold, yet the cell may respond when inputs traverse the cell from right to left (D). (From Arbib, M. A., 1989, The Metaphorical Brain 2: Neural Networks and Beyond, New York: Wiley-Interscience, p. 60. Reproduced with permission. Copyright © 1989 by John Wiley & Sons, Inc.)
Receptors and Effectors
On the “input side,” receptors share with neurons the property of generating potentials, which are transmitted to various synapses upon neurons. However, the input surface of a receptor does not receive synapses from other neurons, but can transduce environmental energy into changes in membrane potential, which may then propagate either actively or passively. (Visual receptors do not generate spikes; touch receptors in the body and limbs use spike trains to send their message to the spinal cord.) For instance, the rods and cones of the eye contain various pigments that react chemically to light in different frequency bands, and these chemical reactions, in turn, lead to local potential changes, called generator potentials, in the membrane. If the light falling on an array of rods and cones is appropriately patterned, then their potential changes will induce interneuron changes to, in turn, fire certain ganglion cells (retinal output neurons whose axons course toward the brain). Properties of the light pattern will thus be signaled farther into the nervous system as trains of impulses (see Retina).
At the receptors, increasing the intensity of stimulation will increase the generator potential. If we go to the first level of neurons that generate pulses, the axons “reset” each time they fire a pulse and then have to get back to a state where the threshold and the input potential meet. The higher the generator potential, the shorter the time until they meet again, and thus the higher the frequency of the pulse. Thus, at the “input” it is a useful first approximation to say that intensity or quantity of stimulation is coded in terms of pulse frequency (more stimulus ≈ more spikes), whereas the quality or type of stimulus is coded by different lines carrying signals from different types of receptors. As we leave the periphery and move toward more “computational” cells, we no longer have such simple relationships, but rather interactions of inhibitory cells and excitatory cells, with each inhibitory input moving a cell away from, and each excitatory input moving it toward, threshold.
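The first approximation just described, intensity coded as pulse frequency, can be sketched with a toy integrate-to-threshold unit (all constants here are hypothetical):

```python
def firing_rate(generator_potential, threshold=10.0, refractory=1.0):
    """Toy pulse-frequency code: after each spike the potential resets,
    then climbs linearly at a rate set by the generator potential (no leak),
    so a stronger stimulus reaches threshold sooner and fires more often."""
    if generator_potential <= 0:
        return 0.0
    time_to_threshold = threshold / generator_potential   # ms to next spike
    return 1.0 / (refractory + time_to_threshold)         # spikes per ms

for g in (1.0, 2.0, 5.0, 20.0):
    print(f"generator potential {g:5.1f} -> rate {firing_rate(g):.3f} spikes/ms")
```

The refractory term caps the rate: however large the generator potential, the interspike interval cannot fall below the refractory period.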
To discuss the “output side,” we must first note that a muscle is made up of many thousands of muscle fibers. The motor neurons that control the muscle fibers lie in the spinal cord or the brainstem, whence their axons may have to travel vast distances (by neuronal standards) before synapsing upon the muscle fibers. The smallest functional entity on the output side is thus the motor unit, which consists of a motor neuron cell body, its axon, and the group of muscle fibers the axon influences.
A muscle fiber is like a neuron to the extent that it receives its input via a synapse from a motor neuron. However, the response of the muscle fiber to the spread of depolarization is to contract. Thus, the motor neurons which synapse upon the muscle fibers can determine, by the pattern of their impulses, the extent to which the whole muscle comprised of those fibers contracts, and can thus control movement. (Similar remarks apply to those cells that secrete various chemicals into the bloodstream or gut, or those that secrete sweat or tears.)
Synaptic activation at the motor end-plate (i.e., the synapse of a motor neuron upon a muscle fiber) yields a brief “twitch” of the muscle fiber. A low repetition rate of action potentials arriving at a motor end-plate causes a train of twitches, in each of which the mechanical response lasts longer than the action potential stimulus. As the frequency of excitation increases, a second action potential will arrive while the mechanical effect of the prior stimulus still persists. This causes a mechanical summation or fusion of contractions. Up to a point, the degree of summation increases as the stimulus interval becomes shorter; the summation effect levels off as the interval between the stimuli approaches the refractory period of the muscle, at which point maximum tension occurs. This limiting response is called a tetanus. To increase the tension exerted by a muscle further, it is necessary to recruit more and more fibers to contract. For more delicate motions, such as those involving the fingers of primates, each motor neuron may control only a few muscle fibers. In other locations, such as the shoulder, one motor neuron alone may control thousands of muscle fibers. As descending signals in the spinal cord command a muscle to contract more and more, they do this by causing motor neurons with larger and larger thresholds to start firing. The result is that fairly small fibers are brought in first, and then larger and larger fibers are recruited. This recruitment order, known as Henneman’s Size Principle, means that at any stage, the increment of activation obtained by recruiting the next group of motor units involves about the same percentage of extra force being applied, aiding smoothness of movement (see Motoneuron Recruitment).
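The recruitment scheme behind the Size Principle can be caricatured in a few lines. The thresholds and twitch forces below are hypothetical, chosen to grow geometrically so that, once several units are active, each newly recruited unit adds roughly the same percentage of extra force:

```python
# Hypothetical motor pool: recruitment thresholds and twitch forces both grow
# geometrically with unit index, so small units fire first and each newly
# recruited unit adds roughly a constant percentage of extra force.
N_UNITS = 10
thresholds = [1.1 ** i for i in range(N_UNITS)]          # recruitment thresholds
unit_forces = [0.5 * 1.25 ** i for i in range(N_UNITS)]  # twitch forces

def total_force(drive):
    """Summed force of all units whose threshold the descending drive reaches."""
    return sum(f for th, f in zip(thresholds, unit_forces) if drive >= th)

prev = total_force(thresholds[0])
for i in range(1, N_UNITS):
    cur = total_force(thresholds[i])
    print(f"unit {i} recruited: total force {cur:6.2f} (+{100 * (cur - prev) / prev:.0f}%)")
    prev = cur
```

The earliest recruits add a large fraction of the (still tiny) total, but the percentage increment quickly settles toward a constant, which is the smoothness property the Size Principle describes.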
Since there is no command that a neuron may send to a muscle fiber that will cause it to lengthen—all the neuron can do is stop sending it commands to contract—the muscles of an animal are usually arranged in pairs. The contraction of one member of the pair will then act around a pivot to cause the expansion of the other member of the pair. Thus, one set of muscles extends the elbow joint, while another set flexes the elbow joint. To extend the elbow joint, we do not signal the flexors to lengthen, we just stop signaling them to contract, and then they will be automatically lengthened as the extensor muscles contract. For convenience, we often label one set of muscles as the “prime mover” or agonist, and the opposing set as the antagonist. However, in such joints as the shoulder, which are not limited to one degree of freedom, many muscles, rather than an agonist-antagonist pair, participate. Most real movements involve many joints. For example, the wrist must be fixed, holding the hand in a position bent backward with respect to the forearm, for the hand to grip with its maximum power. Synergists are muscles that act together with the main muscles involved. A large group of muscles work together when one raises something with one’s finger. If more force is required, wrist muscles may also be called in; if still more force is required, arm muscles may be used. In any case, muscles all over the body are involved in maintaining posture.
Neural Models
Before presenting more realistic models of the neuron (see Perspective on Neuron Model Complexity; Single-Cell Models), we focus on the work of McCulloch and Pitts (1943), which combined neurophysiology and mathematical logic, using the all-or-none property of neuron firing to model the neuron as a binary discrete-time element. They showed how excitation, inhibition, and threshold might be used to construct a wide variety of “neurons.” It was the first model to tie the study of neural nets squarely to the idea of computation in its modern sense. The basic idea is to divide time into units comparable to a refractory period so that, in each time period, at most one spike can be generated at the axon hillock of a given neuron. The McCulloch-Pitts neuron (Figure 3A) thus operates on a discrete-time scale, t = 0, 1, 2, 3, … , where the time unit is (in biology) on the order of a millisecond. We write y(t) = 1 if a spike does appear at time t, and y(t) = 0 if not. Each connection, or synapse, from the output of one neuron to the input of another has an attached weight. Let wi be the weight on the ith connection onto a given neuron. We call the synapse excitatory if wi > 0, and inhibitory if wi < 0. We also associate a threshold θ with each neuron, and assume exactly one unit of delay in the effect of all presynaptic inputs on the cell’s output, so that a neuron “fires” (i.e., has value 1 on its output line) at time t + 1 if the weighted value of its inputs at time t is at least θ. Formally, if at time t the value of the ith input is xi(t) and the output one time step later is y(t + 1), then

y(t + 1) = 1 if Σi wixi(t) ≥ θ, and y(t + 1) = 0 otherwise.
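In code, this firing rule amounts to a single comparison; the weights and threshold below are arbitrary illustrative values, not drawn from the text:

```python
def mcculloch_pitts(weights, theta, inputs):
    """Discrete-time McCulloch-Pitts rule: output 1 at time t + 1 iff the
    weighted sum of the inputs at time t is at least the threshold theta."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

# One excitatory (w = +2) and one inhibitory (w = -1) input, threshold 2:
# the neuron fires only when excited and not simultaneously inhibited.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"x1={x1} x2={x2} -> y={mcculloch_pitts([2, -1], 2, [x1, x2])}")
```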
Figure 3. a, A McCulloch-Pitts neuron operating on a discrete-time scale. Each input has an attached weight wi, and the neuron has a threshold θ. The neuron “fires” at time t + 1 just in case the weighted value of its inputs at time t is at least θ. b, Settings of weights and threshold for a neuron that functions as an AND gate (i.e., the output fires if x1 and x2 both fire). c, An OR gate (the output fires if x1 or x2, or both, fire). d, A NOT gate (the output fires if x1 does NOT fire).
Parts b through d of Figure 3 show how weights and threshold can be set to yield neurons that realize the logical functions AND, OR, and NOT. As a result, McCulloch-Pitts neurons are sufficient to build networks that can function as the control circuitry for a computer carrying out computations of arbitrary complexity; this discovery played a crucial role in the development of automata theory and in the study of learning machines. Although the McCulloch-Pitts neuron no longer plays an active part in computational neuroscience, it is still widely used in neural computing, especially when it is generalized so that the input and output values can lie anywhere in the range [0, 1] and the function of the weighted input sum which yields y(t + 1) is a continuously varying function rather than a step function. However, it is one thing to define model neurons with sufficient logical power to subserve any discrete computation; it is quite another to understand how the neurons in actual brains perform their tasks. More generally, the problem is to select just which units to model, and to decide how such units are to be represented. Thus, when we turn from neural computing to computational neuroscience, we must turn to more realistic models of neurons. On the other hand, we may say that neural computing cannot reach its full power without applying new mechanisms based on current and future study of biological neural networks (see the road map Biological Neurons and Synapses).
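The weight and threshold settings of Figure 3 can be made concrete in a few lines of code. The sketch below (the particular weights and thresholds are one of many valid choices, and the function names are ours, not the Handbook’s) implements the McCulloch-Pitts firing rule and the three gates:

```python
def mp_neuron(weights, theta):
    """Return a McCulloch-Pitts unit: it fires (outputs 1) at t + 1
    iff the weighted sum of its binary inputs at t is at least theta."""
    def fire(inputs):
        return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0
    return fire

# One possible choice of settings, consistent with Figure 3b-d:
AND = mp_neuron([1, 1], theta=2)   # fires only if both inputs fire
OR  = mp_neuron([1, 1], theta=1)   # fires if at least one input fires
NOT = mp_neuron([-1],   theta=0)   # fires iff its single input does not fire

assert [AND((a, b)) for a, b in [(0,0),(0,1),(1,0),(1,1)]] == [0, 0, 0, 1]
assert [OR((a, b))  for a, b in [(0,0),(0,1),(1,0),(1,1)]] == [0, 1, 1, 1]
assert [NOT((x,))   for x in (0, 1)] == [1, 0]
```

Since NAND can be obtained the same way, networks of such units can in principle realize any Boolean function, which is the sense in which they suffice as control circuitry.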
Modern brain theory no longer uses the binary model of the neuron, but instead uses continuous-time models that either represent the variation in average firing rate of the neuron or actually capture the time course of membrane potentials. It is only through such correlates of measurable brain activity that brain models can really feed back to biological experiments. Such models also require the brain theorist to know a great deal of detailed anatomy and physiology as well as behavioral data. Hodgkin and Huxley (1952) have shown us how much can be learned from analysis of membrane properties about the propagation of electrical activity along the axon. Rall (1964; cf. Figure 2) was a leader in showing that the study of membrane properties in a variety of connected “compartments” of membrane in dendrite, soma, and axon can help us understand small neural circuits, as in the Olfactory Bulb (q.v.) or for Dendritic Processing (q.v.). Nonetheless, in many cases, the complexity of compartmental analysis makes it more insightful to use a more lumped representation of the individual neuron if we are to assemble the model neurons to analyze large networks. A computer simulation of the response of a whole brain region which analyzed each component at the finest level of detail available would be too large to run on even a network of computers. Detailed models of single neurons are important in themselves, but such studies can also be used to fine-tune more economical models of neurons, which can then serve as the units in models of large networks, whether to model systems in the brain or to design artificial neural networks which exploit subtle neural capabilities.
We may determine units in the brain physiologically, e.g., by electrical recording, and anatomically, e.g., by staining. In many regions of the brain, we have an excellent correlation between physiological and anatomical units; that is, we know which anatomical entity yields which physiological response. Unfortunately, this is not always the case. We may have data on the electrophysiological correlates of animal behavior, and anatomical data as well, yet not know which specific cell, defined anatomically, yields an observed electrophysiological response. Another problem that we confront in modeling is that we have both too much and too little anatomical detail: too much in that there are many synapses that we cannot put into our model without overloading our capabilities for either mathematical analysis or computer simulation, and too little in that we often do not know which details of synaptology may determine the most important modes of behavior of a particular region of the brain. Judicious choices from available data, and judicious hypotheses concerning missing data, must thus be made in setting up a model, leading to the design of experiments whose results may either confirm these hypotheses or lead to their modification. An important point of good modeling methodology is thus to set up simulations in such a way that we can use different connectivity on different simulations, both to test alternative hypotheses and to respond to new data as they become available.
The simplest “realistic” model consonant with the above material is the leaky integrator model. Although some biological neurons communicate by the passive propagation (cable equation) of membrane potential down their (necessarily short) axons, most communicate by the active propagation of “spikes.” The generation and propagation of such spikes has been described in detail by the Hodgkin-Huxley equations. However, the leaky integrator model omits such details. It is a continuous-time model based on using the firing rate (e.g., the number of spikes traversing the axon in the most recent 20 ms) as a continuously varying output measure of the cell’s activity, in which the internal state of the neuron is described by a single variable, the membrane potential at the spike initiation zone. The firing rate is approximated by a simple sigmoid function of the membrane potential. That is, we introduce a function σ of the membrane potential m such that σ(m) increases from 0 to some maximum value as m increases from −∞ to +∞ (e.g., the sigmoidal function k/[1 + exp(−m/θ)], increasing from 0 to its maximum k). Then the firing rate M(t) of the cell is given by the equation:

M(t) = σ(m(t))
The time evolution of the cell’s membrane potential is given by a differential equation. Consider first the simple equation

τ dm(t)/dt = −m(t) + h     (1)

We say that m(t) is in equilibrium if it does not change under the dynamics described by the differential equation. Now dm(t)/dt = 0 if and only if m(t) = h, so that h is the unique equilibrium of Equation 1. To get more information, we now integrate Equation 1 to get

m(t) = e^(−t/τ)m(0) + (1 − e^(−t/τ))h

which tends to the resting level h with time constant τ with increasing t so long as τ is positive. We now add synaptic inputs to obtain

τ dm(t)/dt = −m(t) + Σi wiXi(t) + h     (2)

where Xi(t) is the firing rate at the ith input. Thus, an excitatory input (wi > 0) will be such that increasing it will increase dm(t)/dt, while an inhibitory input (wi < 0) will have the opposite effect. A neuron described by Equation 2 is called a leaky integrator neuron. This is because the equation

τ dm(t)/dt = Σi wiXi(t) + h     (3)

would simply integrate the inputs with scaling constant τ, but the −m(t) term in Equation 2 opposes this integration by a “leakage” of the potential m(t) as it tries to return to its input-free equilibrium h.
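For readers who prefer code to equations, here is a minimal forward-Euler simulation of the leaky integrator of Equation 2, with the firing rate read off through the sigmoid σ. All parameter values (τ, h, the sigmoid’s k and θ, and the time step) are illustrative choices, not values from the text:

```python
import math

def sigma(m, k=100.0, theta=1.0):
    # Sigmoid mapping membrane potential to a firing rate between 0 and k
    return k / (1.0 + math.exp(-m / theta))

def leaky_integrator(inputs, w, tau=10.0, h=0.0, dt=0.1, m0=0.0):
    """Forward-Euler integration of Equation 2:
    tau dm/dt = -m + sum_i w_i X_i + h."""
    m, trace = m0, []
    for X in inputs:  # X holds the input firing rates X_i at this time step
        m += (-m + sum(wi * Xi for wi, Xi in zip(w, X)) + h) * (dt / tau)
        trace.append(m)
    return trace

# Constant excitatory drive w*X = 2.0 with h = 0: m relaxes toward equilibrium 2.0
trace = leaky_integrator(inputs=[(1.0,)] * 2000, w=(2.0,))
assert abs(trace[-1] - 2.0) < 1e-3
# The output firing rate is then read off through the sigmoid:
rate = sigma(trace[-1])  # approximately 100/(1 + e^-2), about 88 spikes/s
```

Note that the equilibrium the potential approaches is exactly the point where the right-hand side of Equation 2 vanishes, mirroring the analysis of Equation 1.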
It should be noted that, even at this simple level of modeling, there are alternative models. In the foregoing model, we have used subtractive inhibition. But there are inhibitory synapses which seem better described by shunting inhibition, which, applied at a given point on a dendrite, serves to divide, rather than subtract from, the potential change passively propagating from more distal synapses. Again, the “lumped frequency” model cannot capture the relative timing effects crucial to our motion detector example (see Figure 2); these might be approximated by introducing appropriate delay terms.
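The contrast between the two forms of inhibition can be sketched in a toy example. The divisive form used here is just one common way of writing shunting inhibition, and the parameter names are ours, not the Handbook’s:

```python
def subtractive(exc, inh, w_e=1.0, w_i=1.0):
    # Subtractive inhibition: the inhibitory term is subtracted from the drive
    return w_e * exc - w_i * inh

def shunting(exc, inh, w_e=1.0, g_i=1.0):
    # Shunting inhibition: the inhibitory conductance divides the drive
    return w_e * exc / (1.0 + g_i * inh)

assert subtractive(2.0, 0.0) == shunting(2.0, 0.0) == 2.0  # agree with no inhibition
assert subtractive(2.0, 3.0) == -1.0  # subtraction can drive the net input negative
assert shunting(2.0, 3.0) == 0.5      # shunting scales the drive down, never past zero
```

The qualitative difference is visible in the last two lines: strong subtractive inhibition can reverse the sign of the net input, while shunting inhibition only attenuates it.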
Another class of neuron models—spiking neurons, including integrate-and-fire neurons—are intermediate in complexity between leaky integrator models in which the output is the average firing rate (see Rate Coding and Signal Processing) and detailed biophysical models in which the fine details of action potential generation are modeled using the Hodgkin-Huxley equations. In these intermediate models, the output is a spike whose timing is continuously variable as a result of cellular interactions, but the spike is represented simply by its time of occurrence, with no internal structure. For example, one may track the continuously varying membrane potential m(t) of Equation 2, then generate a spike each time this quantity reaches threshold, while simultaneously resetting the potential to some baseline value (see Integrate-and-Fire Neurons and Networks). Such models include the ability to transmit information very rapidly through small temporal differences between the spikes sent out by different neurons (see Spiking Neurons, Computation with).
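A minimal integrate-and-fire sketch along these lines: the model integrates the leaky-integrator dynamics, records only the time at which the potential crosses threshold, and resets to a baseline. All parameter values are illustrative:

```python
def integrate_and_fire(inputs, w, theta=1.0, tau=10.0, h=0.0, dt=0.1, m_reset=0.0):
    """Leaky integrate-and-fire sketch: integrate the leaky-integrator
    dynamics; whenever m reaches threshold theta, record a spike time
    and reset m to baseline. The spike has no internal structure."""
    m, spikes = m_reset, []
    for step, X in enumerate(inputs):
        m += (-m + sum(wi * Xi for wi, Xi in zip(w, X)) + h) * (dt / tau)
        if m >= theta:
            spikes.append(step * dt)  # the spike is just its time of occurrence
            m = m_reset
    return spikes

# Constant suprathreshold drive yields a regular spike train:
spikes = integrate_and_fire(inputs=[(2.0,)] * 1000, w=(1.0,))
isis = [round(b - a, 6) for a, b in zip(spikes, spikes[1:])]
assert len(spikes) >= 10 and len(set(isis)) == 1  # evenly spaced spikes
```

With time-varying rather than constant input, the same few lines produce spike times that shift continuously with the input, which is exactly the temporal degree of freedom the rate-based model discards.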
All this reinforces the observation that there is no modeling approach that is automatically appropriate. Rather, we seek to find the simplest model adequate to address the complexity of a given range of problems. The articles in Part III of the Handbook will provide many examples of the diversity of neural models appropriate to different tasks.
More Detailed Properties of Neurons
In Section I.3, the only details we will add to the neuron models just presented will be various, relatively simple, rules of synaptic plasticity. This level of detail (though with many variations) will suffice for a fair range of models of biological neural networks, and for a range of current work on artificial neural networks (ANNs). The road map Biological Neurons and Synapses in Part II surveys a set of articles that demonstrate that biological neurons are vastly more complex than the present models suggest. Other road maps show the special structures revealed in “special-purpose” neural circuitry in different species of animals. Table 1 lists some of the relevant articles on such circuits, together with the specific animal types on which the studies were based. The point is that much is to be learned from features specific to many different types of nervous systems, as well as from studies in humans, monkeys, cats, and rats that focus on commonalities with the human nervous system.
An appreciation of this complexity is necessary for the computational neuroscientist wishing to address the increasingly detailed database of experimental neuroscience, but it should also prove important for the technologist looking ahead to the incorporation of new capabilities into the next generation of ANNs. Nonetheless, much can be accomplished with simple models, as we shall see in Section I.3.
Many articles in this book show the benefits of interplay between biology and technology. Nonetheless, it is essential to distinguish between studying the brain and building an effective technology for intelligent systems and computation, and to distinguish among the various levels of investigation that exist (from the molecular to the system level) in these related, but by no means identical, disciplines. The present section provides a fuller sense of the disciplines that come together in brain theory and neural networks, and of the different levels of analysis involved in the study of complex biological and technological systems.
A Historical Fragment
Perhaps the simplest history of brain theory and neural networks would restrict itself to just three items: studies by McCulloch and Pitts (1943), Hebb (1949), and Rosenblatt (1958). These publications introduced the first model of neural networks as “computing machines,” the basic model of network self-organization, and the model of “learning with a teacher,” respectively. (Section I.3 provides a semitechnical introduction to this work and a key set of currently central ideas that build upon it.) The present historical fragment is designed to take us up to 1948, the year preceding the publication of Hebb’s book, to reveal our present federation of disciplines as the current incarnation of what emerged in the 1940s and is aptly summed up in the title of the book, Cybernetics: Or Control and Communication in the Animal and the Machine (Wiener, 1948). But whereas Wiener’s view of cybernetics was dominated by concepts of control and communication, our subject is dominated by notions of parallel and distributed computation, with special attention to learning in neural networks. On the other hand, notions of information and statistical mechanics championed by Wiener have reemerged as a strong strand in the study of neural networks today (see, e.g., the articles Feature Analysis and Statistical Mechanics of Neural Networks in Part III). The articles in Part III will make abundantly clear how far we have come since 1948, and also how many problems remain. My intent in the present “fragment” is to enrich the reader’s understanding of current contributions by using a selective historical tour to place them in context.
Noting that the Greek word κυβερνήτης (kybernētēs) means the helmsman of a ship (cf. the Latin word gubernator, which gives us the word “governor” in English), Wiener (1948) used the term cybernetics for a subject in which feedback played a central role. Feedback is the process whereby, e.g., the helmsman notes the “error,” the extent to which he is off course, and “feeds it back” to decide which way to move the rudder. We can see the importance of this concept in endowing automata (“self-moving” machines) with flexible behavior. Two hundred years earlier, in L’Homme machine, La Mettrie had suggested that such automata as the mechanical duck and flute player of Vaucanson indicated the possibility of one day building a mechanical man that could talk. While these clockwork automata were capable of surprisingly complex behavior, they lacked a crucial aspect of animal behavior, let alone human intelligence: they were unable to adapt to changing circumstances. In the following century, machines were built that could automatically counter disturbances to restore desired performance. Perhaps the best-known example of this is Watt’s governor for the steam engine, which would let off excess steam if the velocity of the engine became too great. This development led to Maxwell’s (1868) paper, “On Governors,” which laid the basis for both the theory of negative feedback and the study of system stability (both of which are discussed in Section I.3). Negative feedback was feedback in which the error (in Watt’s case, the amount by which actual velocity exceeded desired velocity) was used to counteract the error; stability occurred if this feedback was apportioned to reduce the error toward zero.
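The logic of negative feedback and stability can be illustrated by a toy discrete-time regulator (the gain values and target used here are invented for illustration): a correction proportional to the error reduces it toward zero when the gain is well apportioned, while an over-brusque gain makes each correction overshoot and the error grows:

```python
def run_feedback(gain, target=1.0, x0=0.0, steps=50):
    """Discrete-time negative feedback: at each step, feed the error
    (target - x) back as a correction of size gain * error."""
    x, last_error = x0, target - x0
    for _ in range(steps):
        last_error = target - x
        x += gain * last_error
    return abs(last_error)

# Well-apportioned feedback drives the error toward zero (stability):
assert run_feedback(gain=0.5) < 1e-6
# Too-brusque feedback overshoots more on each step (instability):
assert run_feedback(gain=2.5) > 1.0
```

Here the error is multiplied by (1 − gain) at every step, so the system is stable exactly when that factor has magnitude less than one, a miniature version of the stability question Maxwell posed for governors.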
Bernard (1878) brought these notions back to biology with his study of what Cannon (1939) would later dub homeostasis, observing that physiological processes often form circular chains of cause and effect that could counteract disturbances in such variables as body temperature, blood pressure, and glucose level in the blood. In fact, following publication of Wiener’s 1948 book, the participants in the Josiah Macy, Jr., Foundation conferences, many of them pioneers of cybernetics, became known as the Cybernetics Group, and the proceedings were entitled Cybernetics: Circular Causal and Feedback Mechanisms in Biological and Social Systems (see Heims, 1991, for a history of the conferences and their participants).
The nineteenth century also saw major developments in the understanding of the brain. At an overall anatomical level, a major achievement was the understanding of localization in the cerebral cortex (see Young, 1970, for a history). Magendie and Bell had discovered that the dorsal roots of the spinal cord were sensory, carrying information from receptors in the body, while the ventral roots (on the belly side) were motor, carrying commands to the muscles. Fritsch and Hitzig, and then Ferrier, extended this principle to the brain proper, showing that the rear of the brain contains the primary receiving areas for vision, hearing, and touch, while the motor cortex is located in front of the central fissure. All this understanding of localization in the cerebral cortex led to the nineteenth century neurological doctrine, perhaps best exemplified in Lichtheim’s (1885) development of the insights of Broca and Wernicke into brain mechanisms of language, which viewed different mental “faculties” as being localized in different regions of the brain. Thus, neurological deficits were to be explained as much in terms of lesions of the connections linking two such regions as in terms of lesions to the regions themselves. We may also note a major precursor of the connectionism of this volume, where the connections are those between neuron-like units rather than anatomical regions: the associationist psychology of Alexander Bain (1868), who represented associations of ideas by the strengths of connections between “neurons” representing those ideas.
Around 1900, two major steps were taken in revealing the finer details of the brain. In Spain, Santiago Ramón y Cajal (e.g., 1906) gave us exquisite anatomical studies of many regions of the brain, revealing the particular structure of each as a network of neurons. In England, the physiological studies of Charles Sherrington (1906) on reflex behavior provided the basic physiological understanding of synapses, the junction points between the neurons. Somewhat later, in Russia, Ivan Pavlov (1927), extending associationist psychology and building on the Russian studies of reflexes by Sechenov in the 1860s, established the basic facts on the modifiability of reflexes by conditioning (see Fearing, 1930, for a historical review).
A very different setting of the scene for cybernetics came from work in mathematical logic in the 1930s. Kurt Gödel published his famous Incompleteness Theorem in 1931 (see Arbib, 1987, for a proof as well as a debunking of the claim that Gödel’s theorem sets limits on machine intelligence). The “formalist” program initiated by David Hilbert, which sought to place all mathematical truth within a single formal system, had reached its fullest expression in the Principia Mathematica of Whitehead and Russell. But Gödel showed that, if one used the approach offered in Principia Mathematica to set up consistent axioms for arithmetic and prove theorems by logical deduction from them, the theory must be incomplete, no matter which axioms (“knowledge base”) one started with—there would be true statements of arithmetic that could not be deduced from the axioms.
Following Gödel’s 1931 study, many mathematical logicians sought to formalize the notion of an effective procedure, of what could and could not be done by explicitly following an algorithm or set of rules. Kleene (1936) developed the theory of partial recursive functions; Turing (1936) developed his machines; Church (1941) developed the lambda calculus, the forerunner of McCarthy’s list processing language, LISP, a one-time favorite of artificial intelligence (AI) workers; while Emil Post (1943) introduced systems for rewriting strings of symbols, of which Chomsky’s early formalizations of grammars in 1959 were a special case. Fortunately, these methods proved to be equivalent. Whatever could be computed by one of these methods could be computed by any other method if it were equipped with a suitable “program.” It thus came to be believed (Church’s thesis) that if a function could be computed by any machine at all, it could be computed by each one of these methods.
Turing (1936) helped chart the limits of the computable with his notion of what is now called a Turing machine, a device that followed a fixed, finite set of instructions to read, write, and move upon a finite but indefinitely extendible tape, each square of which bore a symbol from some finite alphabet. As one of the ingredients of Church’s thesis, Turing offered a “psychology of the computable,” making plausible the claim that any effectively definable computation, that is, anything that a human could do in the way of symbolic manipulation by following a finite and completely explicit set of rules, could be carried out by such a machine equipped with a suitable program. Turing also provided the most famous example of a noncomputable problem, “the unsolvability of the Halting Problem.” Let p be the numerical code for a Turing machine program, and let x be the code for the initial contents of a Turing machine’s tape. Then the halting function h(p, x) = 1 if Turing machine p will eventually halt if started with data x; otherwise it is 0. Turing showed that there was no “computer program” that could compute h.
And so we come to 1943, the key year for bringing together the notions of control mechanism and intelligent automata.
In “A Logical Calculus of the Ideas Immanent in Nervous Activity,” McCulloch and Pitts (1943) united the studies of neurophysiology and mathematical logic. Their formal model of the neuron as a threshold logic unit (see Section I.1) built on the neuron doctrine of Ramón y Cajal and the excitatory and inhibitory synapses of Sherrington, using notation from the mathematical logic of Whitehead, Russell, and Carnap. McCulloch and Pitts provided the “physiology of the computable” by showing that the control box of any Turing machine, the essential formalization of symbolic computation, could be implemented by a network (with loops) of their formal neurons. The ideas of McCulloch and Pitts influenced John von Neumann and his colleagues when they defined the basic architecture of stored program computing. Thus, as electronic computers were built toward the end of World War II, it was understood that whatever they could do could be done by a network of neurons.
Craik’s (1943) book, The Nature of Explanation, viewed the nervous system “as a calculating machine capable of modeling or paralleling external events,” suggesting that the process of forming an “internal model” that paralleled the world is the basic feature of thought and explanation. In the same year, Rosenblueth, Wiener, and Bigelow published “Behavior, Purpose and Teleology.” Engineers had noted that if feedback used in controlling the rudder of a ship were too brusque, the rudder would overshoot, compensatory feedback would yield a larger overshoot in the opposite direction, and so on and so on as the system wildly oscillated. Wiener and Bigelow asked Rosenblueth whether there was any corresponding pathological condition in humans and were given the example of intention tremor associated with an injured cerebellum. This evidence for feedback within the human nervous system (see Motor Control, Biological and Theoretical) led the three scientists to advocate that neurophysiology move beyond the Sherringtonian view of the CNS as a reflex device adjusting itself in response to sensory inputs. Rather, setting reference values for feedback systems could provide the basis for analysis of the brain as a purposive system explicable only in terms of circular processes, that is, from nervous system to muscles to the external world and back again via receptors.
Such studies laid the basis for the emergence of cybernetics, which in turn gave birth to a number of distinct new disciplines, such as AI, biological control theory, cognitive psychology, and neural modeling, which each went their separate ways in the 1970s. The next subsection introduces a number of these disciplines and the relations between them; this analysis will continue in many articles in Part III of the Handbook.
Brains, Machines, and Minds
Brains. Brain theory comprises many different theories as to how the structures of the brain can subserve such diverse functions as perception, memory, control of movement, and higher mental function. As such, it includes both attempts to extend notions of computing and applications of modern electronic computers to explore the performance of complex models. An example of the former is the study of cooperative computation between different structures in the brain, which seeks to offer a new paradigm for computing that transcends classical notions associated with serial execution of symbolic programs. For the latter, computational neuroscience makes systematic use of mathematical analysis and computer simulation to provide ever better models of the structure and function of living brains, building on earlier work in both neural modeling and biological control theory.
Machines. Artificial intelligence studies how computers may be programmed to yield “intelligent” behavior without necessarily attempting to provide a correlation between structures in the program and structures in the brain. Robotics is related to AI but emphasizes the flexible control of machines (robots) which have receptors (e.g., television cameras) and effectors (e.g., wheels, legs, arms, grippers) that allow them to interact with the world.
Brain theory has spawned a companion field of neural computing, which involves the design of machines with circuitry inspired by, but which need not faithfully emulate, the neural networks of brains. Many technologists usurp the term “neural networks” for this latter field, but we will use it as an umbrella term which may, depending on context, describe biological nervous systems, models thereof, and the artificial networks which (sometimes at great remove) they inspire. When the emphasis is on “higher mental functions,” neural computing may be seen as a new branch of AI (see the road map Artificial Intelligence in Part II), but it also contributes to robotics (especially to those robot designs inspired by analysis of animal behavior), and to a wide range of technologies, including those based on image analysis, signal processing, and control (see the road map Applications).
For the latter work, many people emphasize adaptive neural networks which, without specific programming, can adjust their connections through self-organization or to meet specifications given by some teacher. There are also significant contributions to the systematic design, rather than emergence through learning, of neural networks, especially for applications in low-level vision (such as stereopsis, optic flow, and shape-from-shading). However, complex problems cannot, in general, be solved by the tuning or the design of a single unstructured network. For example, robot control may integrate a variety of low-level vision networks with a set of competing and cooperating networks for motor control and its planning. Brain theory and neural computing thus have to address the analysis and design, respectively, of networks of networks (see, e.g., Hybrid Connectionist/Symbolic Systems and Modular and Hierarchical Learning Systems).
Minds. Here, I want to distinguish the brain from the mind (the realm of the “mental”). In great part, brain theory seeks to analyze how the brain guides the behaving organism in its interactions with the dynamic world around it, but much of the control of such interactions is not mental, and much of what is mental is subsymbolic and/or unconscious (see Philosophical Issues in Brain Theory and Connectionism and Consciousness, Neural Models of). Without offering a precise definition of “mental,” let me just say that many people can agree on examples of mental activity (perceiving a visual scene, reading, thinking, etc.) even if they take the diametrically opposite philosophical positions of dualism (mind and brain are separate) or monism (mind is a function of brain). They would then agree that some mental activity (e.g., contemplation) need not result in overt “interactions with the dynamic real world,” and that much of the brain’s activity (e.g., controlling normal breathing) is not mental. Face recognition seems to be a mental activity that we do not carry out through symbol manipulation. Indeed, even psychologists who reject Freud’s particular psychosexual theories accept his notion that much of our mental behavior is shaped by unconscious forces (for an assessment of Freud and an account of consciousness, see Arbib and Hesse, 1986).
Cognitive psychology attempts to explain the mind in terms of “information processing” (a notion which is continuing to change). It thus occupies a middle ground between brain theory and AI in which the model must explain psychological data (e.g., what tasks are hard for humans, people’s ability at memorization, the development of the child, patterns of human errors, etc.) but in which the units of the model need not correspond to actual brain structures. In the 1960s and 1970s, the majority of cognitive psychologists formulated their theories in terms of information theory and/or symbol manipulation, while theories of biological organization were ignored. However, workers in both AI and cognitive psychology now pay increasing attention to the cooperative computation paradigm. The term connectionism has come to be used for studies that model human thought and behavior in terms of parallel distributed networks of neuron-like units, with learning mediated by changes in strength of the connections between these elements (see Cognitive Modeling: Psychology and Connectionism).
The study of brain theory and neural networks thus has a twofold aim: (1) to enhance our understanding of human thought and the neural basis of human and animal behavior (brain theory), and (2) to learn new strategies for building “intelligent” machines or adaptive robots (neural computing). In either case, we seek organizational principles that will help us understand how neurons (whether biological or artificial) can work together to yield complex patterns of behavior. Brain theory requires empirical data to shape and constrain modeling, but in return provides concepts and hypotheses to shape and constrain experimentation. In neural computing, the criterion for success is the design of a machine that can perform a task cheaply, reliably, and effectively, even if, in the process of making the best use of available (e.g., silicon) technology, the final design departs radically from the biological neural network that inspired it. It will be important in reading this Handbook, then, to be clear as to whether a particular study is an exercise in brain theory/computational neuroscience or in AI/neural computing. What will not be in doubt is that the influence of these subjects works both ways: not only can brain mechanisms inspire new technology, but new technologies provide metaphors to drive new theories of brain function. To this it must be added that most workers in ANNs know little of brain function, and relatively few neuroscientists have a deep understanding of brain theory or know much of neural computing beyond the basic ideas of Hebbian plasticity and, perhaps, backpropagation (see Section I.3). However, the level of interchange has increased since the first edition of this Handbook appeared, and this new edition is designed to further increase the flow of information between these scientific communities.
Levels of Analysis
Whether the emphasis is on humans, animals, or machines, it becomes clear that we can seek insight at many different levels of analysis, from large information-processing blocks down to the finest details of molecular structure. Much of psychology and linguistics looks at human behavior “from the outside,” whether studying overall competence or attending to details of performance. Neuropsychology relates behavior to the interaction of various brain regions. Neurophysiology studies the activity of neurons, both to understand the intrinsic properties of the neurons and to help understand their role in the subsystems dissected out by the neuropsychologist, such as networks for pattern recognition or for visuomotor coordination. Molecular and cell biology and biophysics correlate the structure and connectivity of the membranes and subcellular systems which constitute cells with the way these cells transform incoming patterns or subserve memory by changing function with repeated interactions.
These differing levels make it possible to focus individual research studies, but they are ill-defined, and a scientist who works on any one level needs to make occasional forays, both downward to find mechanisms for the functions studied, and upward to understand what role the studied function can play in the overall scheme of things. Top-down modeling starts from some overall behavior and explains it in terms of the interaction of high-level functional units, while bottom-up modeling starts from the interaction of individual neurons (or even smaller units) to explain network properties. It requires a judicious blend of the two to connect the clear overview of crucial questions to the hard data of neuroscience or, in the case of neural engineering, to the details of implementation. Most successful modeling will be purely bottom-up or top-down only in its initial stages, if at all—constraints on an initial top-down model will be given, for example, by the data on regional localization offered by the neurologist, or the circuit-cell-synapse studies of much current neuroscience.
We must now distinguish the brain’s computation from connectionist computation “in the style of the brain.” If a connectionist model succeeds in describing some psychological input/output behavior, it may become an important hypothesis that its internal structure is “real” (see Recurrent Networks: Neurophysiological Modeling). In general, however, much additional work will be required to find and assimilate neurophysiological data to provide brain models in which the neurons are not mere formal units but actually represent biological neurons in the brain.
Much study of the brain is guided by evolutionary and comparative studies of animal behavior and brain function (cf. Evolution of the Ancestral Vertebrate Brain and related articles in the road map Neuroethology and Evolution). The information about the function of the human brain that is gained in the neurological clinic or during neurosurgery can thus be supplemented by humane experimentation on animals. (However, as evidenced by Table 1 of Section I.1, we can learn a great deal by studying the differences, as well as the similarities, between the brains of different species.) We learn by stimulating, recording from, or excising portions of an animal’s brain and seeing how the animal’s behavior changes. We may then compare such results with observations using such techniques as positron emission tomography (PET) or functional magnetic resonance imaging (fMRI) of the relative activity of different parts of the human brain during different tasks (see Imaging the Grammatical Brain, Imaging the Motor Brain, and Imaging the Visual Brain). The grand aim of cognitive neuroscience (as neuropsychology has now become; see the Cognitive Neuroscience road map) is to use clinical data and brain imaging to form a high-level view of the involvement of various brain regions in human cognition, using single-cell activity recorded from animals engaged in analogous behaviors to suggest the neural networks underlying this involvement (see Synthetic Functional Brain Mapping). The catch, of course, is that the “analogous behaviors” of animals are not very analogous at all when it comes to such symbolic activities as language and reasoning. In Part III, we will see that “higher mental functions” tend to be modeled more in connectionist terms constrained (if at all) by psychological or psycholinguistic data (cf. 
the Part II road maps Psychology and Linguistics and Speech Processing), while the greatest successes in seeking the neural underpinnings of human behavior have come in areas such as vision, memory, and motor control, where we can make neural network models of animal analogues of human capabilities (cf. the road maps Vision, Other Sensory Systems, Neural Plasticity, Biological Networks, Motor Pattern Generators, and Mammalian Motor Control).
We also learn from the attempt to reproduce various aspects of human behavior in a robot, even though human action, memory, learning, and perception are far richer than those of any machine yet built or likely to be built in the near future (see Biologically Inspired Robotics). Thus, when we suggest that the brain can be thought of in some ways as a (highly distributed) computer, we are not trying to reduce humans to the level of extant machines, but rather to understand ways in which machines give us insight into human attributes. This type of study has been referred to as cybernetics, extending the concept of Norbert Wiener, who, as we have seen, defined the subject as “the study of control and communication in man and machine.”
To the extent that they address “higher mental function,” the studies presented in this Handbook suggest that there is no single “thing” called intelligence, but rather a plexus of properties that, taken one at a time, may be little cause for admiration, but any sizable collection of which will yield behavior that we would label as intelligent. Turing (1950) argued that we would certainly regard a machine as intelligent if it could pass the following test: An experimenter sits in a room with two teletypes by which she conducts a “conversation” with two systems. One is a human, the other is a machine, but the experimenter is not told which is which. If, after asking many questions, she is likely to have much doubt about which is human and which is machine, we should, says Turing, concede intelligence to the machine. However, unless one dogmatically insists that being intelligent entails behaving in a human way, it is “harder” for a machine to pass this Turing test than to be intelligent. For instance, whereas a computer can answer problems in arithmetic quickly and correctly, a much more complex program would be required to ensure that it answered as slowly and erratically as a human. Turing’s aim was not to find a necessary set of conditions to ensure intelligence, but rather to devise a test which, if passed by a machine, would convince most skeptics that the machine had intelligence.
Schema Theory
The analysis of complex systems, whether they subserve natural or artificial intelligence, requires a coarser grain of analysis to complement that of neural networks. To make sense of the brain, we often divide it into functional systems—such as the motor system, the visual system, and so on—as well as into structural subsystems—from the spinal cord and the hippocampus to the various subdivisions of the prefrontal cortex. Similarly, in distributed AI (see Multiagent Systems), the solution of a task may be distributed over a complex set of interacting agents, each with their dedicated processors for handling the information available to them locally. Thus, both neuroscience and artificial intelligence require a language for expressing the distribution of function across units intermediate between overall function and the final units of analysis (e.g., neurons or simple instructions).
Since the “units of thought” or the subfunctions of a complex behavior may be quite high-level compared to the fine-grain computation of the myriad neurons in the human brain, Schema Theory (q.v.; see also Arbib, 1981; Arbib, Érdi, and Szentágothai, 1998, chap. 3) complements connectionism by providing a bridging language between functional description and neural networks. It is based on a theory of the concurrent activity of interacting functional units called schemas. Perceptual schemas are those used for perceptual analysis, while motor schemas are those which provide the control systems that can be coordinated to effect a wide variety of movement. Other schemas compete and cooperate to meld action, internal state, and perception in an ongoing action-perception cycle.
Figure 4A represents brain theory, while Figure 4B offers a similar but distinct picture for distributed AI. We may model the brain either functionally, analyzing some behavior in terms of interacting schemas, or structurally, through the interaction of anatomically defined units, such as brain regions (cf. the examples in the road map Mammalian Brain Regions) or substructures of these regions, such as layers or columns. In brain theory, we ultimately seek an explanation in terms of neural networks, since the neuron may be considered the basic unit of function as well as of structure, and much further work in computational neuroscience seeks to explain the complex functionality of real neurons in terms of “subneural” units, such as membrane compartments, channels, spines, and synapses. What makes the story more subtle is that, in general, a functional analysis proceeding “top-down” from some overall behavior need not map directly into a “bottom-up” analysis proceeding upward from the neural circuitry (brain theory) or basic set of processors (distributed AI), and that several iterations from the “middle out” may be required to bring the structural and functional accounts into consonance. Brain theory may then seek to replace an initially plausible schema analysis with one whose schemas may be constituted by an assemblage of schemas which can each be embodied in one structure (without denying that a given brain region may support the activity of multiple schemas). The schemas that serve as the functional units in our initial hypotheses about the decomposition of some overall function may well differ from the more refined hypotheses which provide an account of structural correlates as well. On the other hand, distributed AI may adopt any schema analysis that is technologically effective, and the schemas may be implemented in whatever medium is appropriate, whether as conventional computer programs, ANNs, or special-purpose devices. 
These different approaches then rest on effective design of VLSI “chips” or other computing materials (cf. the road map Implementation and Analysis).
Figure 4.
Views of level of analysis of brain and behavior (A) and a distributed technological system (B), highlighting the role of schemas as an intermediate level of functional analysis in each case.
For brain theory, the top-level schemas must be “large” enough to allow an analysis of behavior at or near the psychological level, yet also be subject to successive decomposition down to a level that may, in certain cases, be implemented in specific neural networks. We again distinguish a schema as a functional unit from a neural network as a structural unit. A given schema may be distributed across several neural networks; a given neural network may be involved in the implementation of several different schemas. The same will be true for relating connectionist units to single biological neurons. If there is to be a fuller rapprochement between connectionism and neuropsychology, it will be important to use a vocabulary (or context) that allows one to make the necessary distinctions between connectionist and biological neurons.
A top-down analysis (decomposing a function) may suggest that a certain schema is embedded in a certain part of the brain; we can then marshal the available data from anatomy and neurophysiology to assess whether the circuitry can, indeed, subserve an instance of that schema. It often happens that the empirical data are inadequate. We then make hypotheses for experimental confirmation. Alternatively, bottom-up analysis of a brain region (assembling its constituents) may suggest that it subserves a different schema from that originally hypothesized, and we must then conduct a new top-down analysis in the light of these newfound constraints.
To illuminate the notion of experimental insight modifying an initial top-down analysis, we consider an example from Rana computatrix, a set of models of visuomotor coordination in the frog and toad (cf. Visuomotor Coordination in Frog and Toad). Frogs and toads snap at small moving objects and jump away from large ones (to oversimplify somewhat). Thus, a simple schema-model of the frog brain might simply postulate four schemas: two perceptual schemas (processes for recognizing objects or situations) and two motor schemas (processes for controlling some structured behavior). One perceptual schema would recognize small moving objects and activate a motor schema for approaching the prey; the other would recognize large moving objects and activate a motor schema for avoiding the predator. Lesion experiments can put such a model to the test if it is enhanced by hypotheses on the localization of each schema in the brain. It was thought that the tectum (a key visual region in the animal’s midbrain) was the locus for recognizing small moving objects, while the pretectum (a region just in front of the tectum) was the locus for recognizing large moving objects. Based on these localization hypotheses, the model described would predict that an animal with a lesioned pretectum would be unresponsive to large objects, but would respond normally to small objects. However, the facts are quite different. A pretectum-lesioned toad will approach moving objects, both large and small, and does not exhibit avoidance behavior. This has led to a new schema model in which a perceptual schema to recognize large moving objects is still localized in the pretectum, but the tectum now contains a perceptual schema for all moving objects. We then add that activity of the pretectal schema not only triggers the avoidance motor schema but also inhibits approach. 
This new schema model still yields the normal behavior to large and small moving objects, but also fits the lesion data, since removal of the pretectum removes inhibition, meaning that the animal will now approach any moving object (Ewert and von Seelen, 1974).
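The revised schema model can be sketched in code. This is a hypothetical illustration of the logic just described, not a model from the literature; the function name, stimulus sizes, and threshold are invented for the example:

```python
# A minimal sketch of the revised Rana computatrix schema model: the tectal
# schema responds to any moving object, while the pretectal schema responds
# to large moving objects, triggers avoidance, and inhibits approach.
# Lesioning the pretectum removes that inhibition.

def frog_response(stimulus_size, pretectum_lesioned=False, large_threshold=10.0):
    """Return 'approach' or 'avoid' for a moving object of the given size."""
    tectum = 1.0                   # tectal schema: active for any moving object
    if pretectum_lesioned:
        pretectum = 0.0            # lesion silences the pretectal schema
    else:
        pretectum = 1.0 if stimulus_size >= large_threshold else 0.0
    if pretectum > 0:              # pretectal activity triggers avoidance...
        return "avoid"
    if tectum - pretectum > 0:     # ...and otherwise inhibits approach
        return "approach"
    return "none"
```

Running the sketch reproduces both the normal behavior and the lesion data: a pretectum-lesioned animal approaches moving objects of any size.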
We have thus seen how schemas may be used to provide falsifiable models of the brain, using lesion experiments to test schema models of behavior, and leading to new functional models that better match the structure of the brain. Note again that, in different species, the map from function to brain structure may be different, while in distributed AI the constraints are not those of analysis but rather those of design—namely, for a given function and a given set of processors, a schema decomposition must be found that will map most efficiently onto a network of processors of a certain kind.
While the brain may be considered a network of interacting “boxes” (anatomically distinguishable structures), there is no reason to expect each such box to mediate a single function that is well-defined from a behavioral standpoint. We have just seen that the frog tectum is implicated in both approach and (when modulated by pretectum) avoidance behavior. The language of schemas lets us express hypotheses about the various functions that the brain performs without assuming localization of any one function in any one region, but also allows us to express the way in which many regions participate in a given function, or a given region participates in many functions.
The style of cooperative computation (see Cooperative Phenomena) exhibited in both schema theory and connectionism is far removed from serial computation and the symbol-based ideas that have dominated conventional AI. As we shall see in example after example in Part III, the brain has many specialized areas, each with a partial representation of the world. It is only through the interaction of these regions that the unity of behavior of the animal emerges, and the human is no different in this regard. The representation of the world is the pattern of relationships between all its partial representations. Much work in AI contributes to schema theory, even when it does not use this term. For example, Brooks (1986) builds robot controllers using layers made up of asynchronous modules that can be considered to be a version of schemas (see Reactive Robotic Systems). This work shares with schema theory, with its mediation of action through a network of schemas, the point that no single, central, logical representation of the world needs link perception and action. It is also useful to view cooperative computation as a social phenomenon. A schema is a self-contained computing agent (object) with the ability to communicate with other agents, and whose function is specified by some behavior. Whereas schema theory was motivated in great part by the study of interacting brain regions (other influences are reviewed in Schema Theory), much early work in distributed AI was motivated by a social analogy in which the schemas were thought of as “agents” analogous to people interacting in a social setting to compete or cooperate in solving some overall problem, a theme elaborated on by Minsky (1985) and whose current status is reviewed in Multiagent Systems.
Section I.1 introduced a number of key concepts from the biological study of neurons, stressing the diversity of neurons both within the human CNS and across species. It presented several simple models of neurons, noting that computational neuroscience has gone on to produce more subtle and complicated neuronal models, while neural computing tends to use simple neurons augmented by “learning rules” for changing connection strengths on the basis of “experience.” The purpose of this section is to introduce two key approaches that dominate the modern study of neural networks: (1) the study of neural networks as dynamic systems (developed more fully in the road map Dynamic Systems), and (2) the study of neural networks as adaptive systems (see Learning in Artificial Networks). To make this section essentially self-contained, we start by recalling the definitions of the McCulloch-Pitts and leaky integrator neurons from Section I.1, but we do this in the context of a general, semiformal, introduction to dynamic systems.
Dynamic Systems
We motivate the notion of dynamic systems by considering how to abstract the interaction of an organism (or a machine) with its environment. The organism will be influenced by aspects of the current environment—the inputs to the organism—while the activity of the environment will be responsive in turn to aspects of the current activity of the organism, the outputs of the organism. The inputs and outputs that actually enter into a theory of the organism (or machine) are a small sampling of the flux of its interactions with the rest of the universe. There is essentially no limit to how many variables one could include in the analysis; a crucial task in any theory building is to pick the “right” variables.
Depending on the context, we will use the word system to denote either the physical reality (which we cannot know in its entirety) or the abstraction with which we approximate it. Inputs and outputs do not constitute a complete description of a system. We cannot predict how someone will answer a question unless we know her state of knowledge; nor can we tell how a computer will process its data unless we know the instructions controlling its computation. In short, we must include a description of the internal state of the system which determines what it will extract from its current stimulation in determining its current actions and modifying its internal state. Our abstraction of any real system contains five elements:
The set of inputs: those variables of the environment which we believe will affect the system behavior of interest to us.
The set of outputs: those variables of the system which we choose to observe, or which we believe will significantly affect the environment.
The set of states: those internal variables of the system (which may or may not also be output variables) which determine the relationship between input and output. Essentially, the state of a system is the system’s “internal residue of the past”: when we know the state of a system, no further information about the past behavior of the system will enable us to refine predictions of the way in which future inputs and outputs of the system will be related.
The state-transition function: that function which determines how the state will change when the system obtains various inputs.
The output function: that function which determines what output the system will yield with a given input when in a given state.
Any system in which the state-transition function and output function uniquely determine the new state and output from a specification of the initial state and subsequent inputs is called a deterministic system. If, no matter how carefully we specify subsequent inputs to a system, we cannot specify exactly what will be the subsequent states and outputs, we say the system is probabilistic or stochastic. A stochastic treatment may be worthwhile, either because we are analyzing systems that are “inescapably” stochastic (e.g., at the quantum level), or because we are analyzing macroscopic systems that lend themselves to a stochastic description by ignoring “fine details” of microscopic variables. For example, it is usually more reasonable to describe a coin in terms of a 0.5 probability of coming up heads than to measure the initial placement of the coin on the finger and the thrust of the thumb in sufficient detail to determine whether the coin will come up heads or tails.
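The five-element abstraction, in its deterministic discrete-time form, can be sketched as follows. This is an illustrative toy whose state-transition and output functions are invented for the example:

```python
# A deterministic discrete-time system as the five-element abstraction:
# input set, output set, state set, state-transition function, output function.
# The example system's state is its "internal residue of the past": the
# parity of the number of 1s it has received so far.

class System:
    def __init__(self, delta, beta, q0):
        self.delta = delta    # state-transition function: (state, input) -> state
        self.beta = beta      # output function: state -> output
        self.q = q0           # current internal state

    def step(self, x):
        """Apply one input, update the state, and return the resulting output."""
        self.q = self.delta(self.q, x)
        return self.beta(self.q)

# Inputs and outputs are drawn from {0, 1}; the state set is also {0, 1}.
parity = System(delta=lambda q, x: (q + x) % 2, beta=lambda q: q, q0=0)
```

Knowing the current state (here, a single bit) suffices to predict how future inputs and outputs are related, however long the past input history.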
Continuous-Time Systems
In Newtonian mechanics, the state of the system comprises the positions of its components, which are directly observable, and their velocities, which can be estimated from the observed trajectory over a period of time. Time is continuous (i.e., characterized by the set ℝ of real numbers), and the way in which the state changes is described by a differential equation: classical mechanics provides the basic example of continuous-time systems in which the present state and input determine the rate at which the state changes. This requires that the input, output, and state spaces be continuous spaces in which such continuous changes can occur. Consider the simple example of a point mass undergoing rectilinear motion. At any time, its position y(t) is the observable output of the system, and the force u(t) acting upon it is the input applied to the system. Newton’s second law says that the force applied to the system equals the mass times the acceleration, u(t) = mÿ(t), where the acceleration ÿ(t) is the second derivative of y(t). According to Newton’s laws, the state of the system is given by the position and velocity of the particle. We call the position-velocity pair q(t) = (y(t), ẏ(t)), at any time, the instantaneous state of the system. In fact, the earlier equation gives us enough information to deduce the rate of change dq(t)/dt of this state. Using standard matrix formalism (with semicolons separating matrix rows), we thus have
dq(t)/dt = [0 1; 0 0] q(t) + [0; 1/m] u(t)
while
y(t) = [1 0] q(t)
This is an example of a linear system in which the rate of change of state depends linearly on the present state and input, and the present output depends linearly on the present state. That is, there are matrices F, G, and H such that
dq(t)/dt = F q(t) + G u(t) and y(t) = H q(t)
More generally, a physical system can be expressed by a pair of equations:
dq(t)/dt = f(q(t), u(t)) and y(t) = h(q(t))
The first expresses the rate of change dq(t)/dt of the state as a function of both the state q(t) and the input or control vector u(t) applied at any time t; the second reads the output from the current state.
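As a numerical sanity check on the point-mass example (an illustrative sketch; the mass, force, and step size are arbitrary choices), we can integrate the position-velocity state with Euler steps and compare the output against the analytic solution y(T) = (u/2m)T² for a constant force applied from rest:

```python
# Euler integration of the point-mass system: state q = (position, velocity),
# rate of change dq/dt = (velocity, u/m), output y = position.
m_mass, u = 2.0, 4.0            # mass and constant applied force (arbitrary)
dt, T = 1e-4, 1.0               # Euler step size and final time
q = [0.0, 0.0]                  # start at rest at the origin

for _ in range(int(round(T / dt))):
    dq = [q[1], u / m_mass]                     # f(q, u) for this linear system
    q = [q[0] + dt * dq[0], q[1] + dt * dq[1]]  # one Euler step

y = q[0]                                # the output function reads off position
exact = 0.5 * (u / m_mass) * T ** 2     # analytic solution y(T) = (u/2m) T^2
```

With this step size the simulated output agrees with the analytic position to better than one part in a thousand.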
We now present the definition of a leaky integrator neuron as a continuous-time system. The internal state of the neuron is its membrane potential, m(t), and its output is the firing rate, M(t). The state transition function of the cell is expressed as
τ dm(t)/dt = −m(t) + Σi wiXi(t) + h
while the output function of the cell is given by the equation
M(t) = σ(m(t))
Thus, if there are m inputs Xi(t), i = 1, … , m, then the input space of the neuron is ℝ^m, with current value (X1(t), … , Xm(t)), while the state and output spaces of the neuron both equal ℝ, with current values m(t) and M(t), respectively.
Let us now briefly (and semiformally) see how a neural network comprised of leaky integrator neurons can also be seen as a continuous-time system in this sense. As typified in Figure 5, we characterize a neural network by selecting N neurons (each with specified input weights and resting potential) and by taking the axon of each neuron, which may be split into several branches carrying identical output signals, and either connecting each line to a unique input of another neuron or feeding it outside the net to provide one of the K network output lines. Then every input to a given neuron must be connected either to an output of another neuron or to one of the (possibly split) L input lines of the network. Thus the input set X = ℝ^L, the state set Q = ℝ^N, and the output set Y = ℝ^K. If the ith output line comes from the jth neuron, then the output function is determined by the fact that the ith component of the output at time t is the firing rate Mj(t) = σj(mj(t)) of the jth neuron at time t. The state transition function for the neural network follows from the state transition functions of each of the N neurons,
τj dmj(t)/dt = −mj(t) + Σi wijXij(t) + hj,
Figure 5.
A neural network viewed as a system. The input at time t is the pattern of firing on the input lines, the output is the pattern of firing on the output lines; and the internal state is the vector of firing rates of all the neurons of the network.
as soon as we specify whether Xij(t) is the output Mk(t) of the kth neuron or the value xl(t) currently being applied on the lth input line of the overall network.
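A minimal simulation sketch of such a network follows. The weights, resting inputs, time constant, and the choice of a logistic output function are all invented for illustration; the network is a two-cell mutual-inhibition pair with no external input lines:

```python
import math

# Euler integration of a two-neuron leaky-integrator network:
#   tau * dm_j/dt = -m_j + sum_k W[j][k] * M_k + h_j,  with  M_j = sigma(m_j).
# Here each neuron's inputs come from the other neuron plus a constant drive h.

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))   # logistic output (firing-rate) function

W = [[0.0, -2.0],
     [-2.0, 0.0]]                       # mutual inhibition between the two cells
h = [1.0, 0.5]                          # constant drives (cell 0 is driven harder)
tau, dt = 10.0, 0.1
m = [0.0, 0.0]                          # initial membrane potentials (the state)

for _ in range(5000):                   # integrate until the network settles
    M = [sigma(mj) for mj in m]         # current firing rates (the outputs)
    dm = [(-m[j] + sum(W[j][k] * M[k] for k in range(2)) + h[j]) / tau
          for j in range(2)]
    m = [m[j] + dt * dm[j] for j in range(2)]
```

The pair settles to an equilibrium in which the more strongly driven cell fires at the higher rate, a first taste of the equilibria discussed in the next subsection.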
Discrete-Time Systems
In contrast to continuous-time systems, which must have continuous state spaces on which the differential equations for the state transition function can be defined, discrete-time systems may have either continuous or discrete state spaces. (A discrete state space is just a set with no specific metric or topological structure.) For example, a McCulloch-Pitts neuron is considered to operate on a discrete-time scale, t = 0, 1, 2, 3, … , and has connection weights wi and threshold θ. If at time t the value of the ith input is xi(t), then the output one time step later, y(t + 1), equals 1 if and only if Σi wixi(t) ≥ θ. If there are m inputs (x1(t), … , xm(t)), then, since inputs and outputs are binary, such a neuron has input set X = {0, 1}^m and state set = output set = {0, 1} (we treat the current state and output as being identical). On the other hand, the important learning scheme known as backpropagation (defined later) is based on neurons which operate in discrete time, but with both input and output taking continuous values in some range, say [0, 1].
In computer science, an automaton is a discrete-time system with discrete input, output, and state spaces. Formally, we describe an automaton by the sets X, Y, and Q of inputs, outputs, and states, respectively, together with the next-state function δ: Q × X → Q and the output function β: Q → Y. If the automaton is in state q and receives input x at time t, then its next state will be δ(q, x) and its next output will be β(δ(q, x)). It should be clear that a McCulloch-Pitts neural network (i.e., a network like that shown in Figure 5, but a discrete-time network with each neuron a McCulloch-Pitts neuron) functions like a finite automaton, as each neuron changes state synchronously on each tick of the time scale t = 0, 1, 2, 3, … . Conversely, it can be shown (see Arbib, 1987; the result was essentially, though inscrutably, due to McCulloch and Pitts, 1943) that any finite automaton can be simulated by a suitable McCulloch-Pitts neural network.
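The McCulloch-Pitts neuron is easily sketched in code; with suitable weights and thresholds (the values below are standard illustrations, not taken from the text), single neurons already realize the Boolean gates from which such networks are built:

```python
# A McCulloch-Pitts neuron: the output at time t+1 is 1 if and only if the
# weighted sum of the binary inputs at time t reaches the threshold theta.

def mp_neuron(weights, theta, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= theta else 0

# Two-input AND and OR gates realized as single McCulloch-Pitts neurons:
AND = lambda x: mp_neuron([1, 1], 2, x)   # fires only when both inputs fire
OR  = lambda x: mp_neuron([1, 1], 1, x)   # fires when at least one input fires
```

Since these gates suffice to build any Boolean circuit, this is one informal route to seeing why networks of such neurons can simulate any finite automaton.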
Stability, Limit Cycles, and Chaos
With the previous discussion, we now have more than enough material to understand the crucial dynamic systems concept of stability and the related concepts of limit cycles and chaos (see Computing with Attractors and Chaos in Neural Systems). We want to know what happens to an “unperturbed” system, i.e., one for which the input is held constant (possibly with some specific “null input,” usually denoted by 0, the “zero” input in X). An equilibrium is a state q in which the system can stay at rest, i.e., such that δ(q, 0) = q (discrete time) or dq/dt = f(q, 0) = 0 (continuous time). The study of stability is concerned with the issue of whether or not this rest point will be maintained in the face of slight disturbances. To see the variety of equilibria, we use the image of a sticky ball rolling on the “hillside” of Figure 6. We say that point A on the “hillside” in this diagram is an unstable equilibrium because a slight displacement from A will tend to increase over time. Point B is in a region of neutral equilibrium because slight displacements will tend not to change further, while C is a point of stable equilibrium, since small displacements will tend to decrease over time. Note the word “small”: in a nonlinear system like that of Figure 6, a large displacement can move the ball from the basin of attraction of C (the set of states whose dynamics tends toward C) to another one. Clearly, the ball will not tend to return to C after a massive displacement that moves the ball to the far side of A’s hilltop.
Figure 6.
An energy landscape: For a ball rolling on the “hillside,” point A is an unstable equilibrium, point B lies in a region of neutral equilibrium, and point C is a point of stable equilibrium. Point C is called an attractor: the basin of attraction of C comprises all states whose dynamics tend toward C.
Many nonlinear systems have another interesting property: they may exhibit limit cycles. These are closed trajectories in the state space, and thus may be thought of as “dynamic equilibria.” If the state of a system follows a limit cycle, we may also say it oscillates or exhibits periodic behavior. A limit cycle is stable if a small displacement will be reduced as the trajectory of the system comes closer and closer to the original limit cycle. By contrast, a limit cycle is unstable if such excursions do not die out. Research in nonlinear systems has also revealed what are called strange attractors. These are attractors which, unlike simple limit cycles, describe such complex paths through the state space that, although the system is deterministic, a path that approaches the strange attractor gives every appearance of being random. The point here is that very small differences in initial state may be amplified with the passage of time, so that differences that at first are not even noticeable will yield, in due course, states that are very different indeed. Such a trajectory has become the accepted mathematical model of chaos, and it is used to describe a number of physical phenomena, such as the onset of turbulence in a weather system, as well as a number of phenomena in biological systems (see Chaos in Biological Systems; Chaos in Neural Systems).
Hopfield Nets
Many authors have treated neural networks as dynamical systems, employing notions of equilibrium, stability, and so on, to classify their performance (see, e.g., Grossberg, 1967; Amari and Arbib, 1977; see also Computing with Attractors). However, it was a paper by John Hopfield (1982) that was the catalyst in attracting the attention of many physicists to this field of study. In a McCulloch-Pitts network, every neuron processes its inputs to determine a new output at each time step. By contrast, a Hopfield net is a net of such units with (1) symmetric weights (wij = wji) and no self-connections (wii = 0), and (2) asynchronous updating. For instance, let si denote the state (0 or 1) of the ith unit. At each time step, pick just one unit at random. If unit i is chosen, si takes the value 1 if and only if Σwijsj ≥ θi. Otherwise si is set to 0. Note that this is an autonomous (input-free) network: there are no inputs (although instead of considering θi as a threshold we may consider −θi as a constant input, also known as a bias).
Hopfield defined a measure called the energy for such a net (see Energy Functionals for Neural Networks):
E = −(½)Σijsisjwij + Σisiθi
This is not the physical energy of the neural net but a mathematical quantity that, in some ways, does for neural dynamics what the potential energy does for Newtonian mechanics. In general, a mechanical system moves to a state of lower potential energy just as, in Figure 6, the ball tends to move downhill. Hopfield showed that his symmetrical networks with asynchronous updating had a similar property.
For example, if we pick a unit and the foregoing firing rule does not change its si, it will not change E. However, if si initially equals 0, and Σwijsj ≥ θi, then si goes from 0 to 1 with all other sj constant, and the “energy gap,” or change in E, is given by ΔE = θi − Σjwijsj, which is ≤ 0 since Σwijsj ≥ θi.
Similarly, if si initially equals 1, and Σwijsj < θi, then si goes from 1 to 0 with all other sj constant, and the energy gap is given by ΔE = Σjwijsj − θi, which is < 0 since Σwijsj < θi.
In other words, with every asynchronous updating, we have ΔE ≤ 0. Hence the dynamics of the net tends to move E toward a minimum. We stress that there may be different such states—they are local minima, just as, in Figure 6, both D and E are local minima (each of them is lower than any “nearby” state) but not global minima (since C is lower than either of them). Global minimization is not guaranteed.
The expression just presented for ΔE depends on the symmetry condition, wij = wji, for without this condition, the expression for the 0-to-1 transition would instead be ΔE = θi − (½)Σj(wij + wji)sj, and in this case Hopfield’s updating rule need not yield a passage to an energy minimum, but might instead yield a limit cycle, which could be useful in, e.g., controlling rhythmic behavior (see, e.g., Respiratory Rhythm Generation). In a control problem, a link wij might express the likelihood that the action represented by i would precede that represented by j, in which case wij = wji is normally inappropriate.
The condition of asynchronous update is crucial, too. If we consider the simple “flip-flop” with w12 = w21 = 1 and θ1 = θ2 = 0.5, then the McCulloch-Pitts network will oscillate between the states (0, 1) and (1, 0) or will sit in the states (0, 0) or (1, 1); in other words, there is no guarantee that it will converge to an equilibrium. However, with E = −(½)Σijsisjwij + Σisiθi, we have E(0, 0) = 0, E(0, 1) = E(1, 0) = 0.5, and E(1, 1) = 0, and the Hopfield network will converge to the global minimum at either (0, 0) or (1, 1).
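The flip-flop example can be checked directly. This sketch implements the asynchronous update rule and the energy E given above for the two-unit net (the variable names are ours):

```python
import random

# The flip-flop Hopfield net from the text: w12 = w21 = 1, thresholds 0.5.
# Asynchronous updating drives E = -(1/2) sum_ij s_i s_j w_ij + sum_i s_i theta_i
# downhill until the net settles at a fixed point.
W = [[0.0, 1.0], [1.0, 0.0]]
theta = [0.5, 0.5]

def energy(s):
    e = -0.5 * sum(W[i][j] * s[i] * s[j] for i in range(2) for j in range(2))
    return e + sum(theta[i] * s[i] for i in range(2))

def step(s):
    i = random.randrange(2)                 # pick just one unit at random
    s = list(s)
    s[i] = 1 if sum(W[i][j] * s[j] for j in range(2)) >= theta[i] else 0
    return tuple(s)

random.seed(0)
s = (0, 1)                                  # a state the synchronous net would cycle on
for _ in range(20):
    s_next = step(s)
    assert energy(s_next) <= energy(s)      # Delta E <= 0 at every update
    s = s_next
print(s)    # settles at (0, 0) or (1, 1), the two global minima
```

Starting from (0, 1), a single asynchronous update already lands the net in one of the two minima, whichever unit happens to be chosen.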
Hopfield also aroused much interest because he showed how a number of optimization problems could be “solved” using neural networks. (The quotes around “solved” acknowledge the fact that the state to which a neural network converges may represent a local, rather than a global, optimum of the corresponding optimization problem.) Such networks were similar to the “constraint satisfaction” networks that had already been studied in the computer vision community. (In most vision algorithms—see, e.g., Stereo Correspondence—constraints can be formulated in terms of symmetric weights, so that wij = wji is appropriate.) The aim, given a “constraint satisfaction” problem, is to choose weights for a neural network so that the energy E for that network is a measure of the overall constraint violation. A famous example is the Traveling Salesman Problem (TSP): There are n cities, with a road of length lij joining city i to city j. The salesman wishes to find a way to visit the cities that is optimal in two ways: each city is visited only once, and the total route is as short as possible. We express this as a constraint satisfaction network in the following way: Let the activity of neuron Nij express the decision to go straight from city i to city j. The cost of this move is simply lij, and so the total “transportation cost” is ΣijlijNij. It is somewhat more challenging to express the cost of violating the “visit a city only once” criterion, but we can reexpress it by saying that, for city j, there is one and only one city i from which j is directly approached. Thus, Σj(ΣiNij − 1)2 = 0 just in case this constraint is satisfied; a non-zero value measures the extent to which this constraint is violated. This can then be mapped into the setting of weights and thresholds for a Hopfield network. Hopfield and Tank (1986) constructed chips for this network which do indeed settle very quickly to a local minimum of E. 
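The energy bookkeeping for the TSP can be sketched as follows. This computes the transportation cost and the “visit a city only once” violation for candidate tour matrices; it is only the scoring of candidate solutions, not an implementation of the Hopfield-Tank network itself.

```python
# Scoring a candidate tour matrix N, where N[i][j] = 1 means "go straight
# from city i to city j", under the two cost terms described in the text.
def tour_cost(N, lengths):
    n = len(N)
    return sum(lengths[i][j] * N[i][j] for i in range(n) for j in range(n))

def visit_once_violation(N):
    # sum_j (sum_i N_ij - 1)^2 is zero iff each city j is entered exactly once
    n = len(N)
    return sum((sum(N[i][j] for i in range(n)) - 1) ** 2 for j in range(n))

lengths = [[0, 2, 9], [2, 0, 4], [9, 4, 0]]
valid = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]    # the cycle 1 -> 2 -> 3 -> 1
bad   = [[0, 1, 0], [0, 0, 1], [0, 1, 0]]    # city 2 entered twice, city 1 never

print(tour_cost(valid, lengths), visit_once_violation(valid))   # 15 0
print(visit_once_violation(bad))                                # 2
```

Mapping these terms into weights and thresholds, as the text notes, then lets the network's own dynamics minimize their weighted sum.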
Unfortunately, there is no guarantee that this minimum is globally optimal. The article Optimization, Neural presents this and a number of other neurally based approaches to optimization. The article Simulated Annealing and Boltzmann Machines shows how noise may be added to “shake” a system out of a local minimum and let it settle into a global minimum. (Consider, for example, shaking that is strong enough to shake the ball from D to A, and thus into the basin of attraction of C, in Figure 6, but not strong enough to shake the ball back from C toward D.)
Adaptation in Dynamic Systems
In the previous discussion of neural networks as dynamic systems, the dynamics (i.e., the state transition function) has been fixed. However, just as humans and animals learn from experience, so do many important applications of ANNs depend on the ability of these networks to adapt to the task at hand by, e.g., changing the values of the synaptic weights to improve performance. We now introduce the general notion of an adaptive system as background to some of the most influential “learning rules” used in adaptive neural networks. The key motivation for using learning networks is that it may be too hard to program explicitly the behavior that one sees in a black box, but one may be able to drive a network by the actual input/output behavior of that box, or by some description of its trajectories, to cause it to adapt itself into a network which approximates that given behavior. However, as we will stress at the end of this section, a learning algorithm may not solve a problem within a reasonable period of time unless the initial structure of the network is suitable.
Adaptive Control
A key problem of technology is to control a complex system so that it behaves in some desired way, whether getting a space probe on course to Mars or a steel mill to produce high-quality steel. A common situation that complicates this control problem is that the controlled system may not be known accurately; it may even change its character somewhat with time. For example, as fuel is depleted, the mass and moments of inertia of the probe may change in unpredicted ways. The adaptation problem involves determining, on the basis of interaction with a given system, an appropriate “model” of the system which the controller can use in solving the control problem.
Suppose we have available an identification procedure which can find an adequate parametric representation of the controlled system (see Identification and Control). Then, rather than build a controller specifically designed to control this one system, we may instead build a general-purpose controller which can accommodate to any reasonable set of parameters. The controller then uses the parameters which the identification procedure provides as the best estimate of the controlled system’s parameters at that time. If the identification procedure can make accurate estimates of the system’s parameters as quickly as they actually change, the controller will be able to act efficiently despite fluctuations in controlled system dynamics. The controller, when coupled to an identification procedure, is an adaptive controller; that is, it adapts its control strategy to changes in the dynamics of the controlled system. However, the use of an explicit identification procedure is only one way of building an adaptive controller. Adaptive neural nets may be used to build adaptive procedures which may directly modify the parameters in some control rule, or identify the system inverse so that desired outputs can be automatically transformed into the inputs that will achieve them. (See Sensorimotor Learning for the distinction between forward and inverse models.)
Pattern Recognition
In the setup shown in Figure 7, the preprocessor extracts from the environment a set of “confidence levels” for various input features (see Feature Analysis), with the result represented by a vector of d real numbers. In this formalization, any pattern x is represented by a point (x1, x2, … , xd) in a d-dimensional Euclidean space ℝd called the pattern space. The pattern recognizer then takes the pattern and produces a response that may have one of K distinct values where there are K categories into which the patterns must be sorted; points in ℝd are thus grouped into at least K different sets (see Concept Learning and Pattern Recognition). However, a category might be represented in more than one region of ℝd. To take an example from visual pattern recognition (although the theory of pattern recognition networks applies to any classification of ℝd), a and A are members of the category of the first letter of the English alphabet, but they would be found in different connected regions of a pattern space. In such cases, it may be necessary to establish a hierarchical system involving a separate apparatus to recognize each subset, and a further system that recognizes that the subsets all belong to the same set (a related idea was originally developed by Selfridge, 1959; for adaptive versions, see Modular and Hierarchical Learning Systems). Here we avoid this problem by concentrating on the case in which the decision space is divided into exactly two connected regions.
Figure 7.
One strategy in pattern recognition is to precede the adaptive neural network by a fixed layer of “preprocessors” or “feature extractors” which replace the image by a finite vector for further processing. In other approaches, the functions defined by the early layers of the network may themselves be subject to training.
We call a function f: ℝd → ℝ a discriminant function if the equation f(x) = 0 gives the decision surface separating two regions of a pattern space. A basic problem of pattern recognition is the specification of such a function. It is virtually impossible for humans to “read out” the function they use (not to mention how they use it) to classify patterns. Thus, a common strategy in pattern recognition is to provide a classification machine with an adjustable function and to “train” it with a set of patterns of known classification that are typical of those with which the machine must ultimately work. The function may be linear, quadratic, polynomial, or even more subtle yet, depending on the complexity and shape of the pattern space and the necessary discriminations. The experimenter chooses a class of functions with parameters which, it is hoped, will, with proper adjustment, yield a function that will successfully classify any given pattern. For example, the experimenter may decide to use a linear function of the form f(x) = w1x1 + w2x2 + … + wdxd + wd+1
(i.e., a McCulloch-Pitts neuron!) in a two-category pattern classifier. The equation f(x) = 0 gives a hyperplane as the decision surface, and training involves adjusting the coefficients (w1, w2, … , wd, wd+1) so that the decision surface produces an acceptable separation of the two classes. We say that two categories are linearly separable if an acceptable setting of such linear weights exists. Thus, pattern recognition poses (at least) the following challenges to neural networks:
(a) Find a “good” set of preprocessors. Competitive learning based on Hebbian plasticity (see Competitive Learning, as well as the following text) provides one way of finding such features by extracting statistically significant patterns from a set of input patterns. For example, if such a network were exposed to many, but only, letters of the Roman alphabet, then it would find that certain line segments and loops occurred repeatedly, even if there were no teacher to tell it how to classify the patterns.
(b) Given a set of preprocessors and a set of patterns which have already been classified, adjust the connections of a neural network so that it acts as an effective pattern recognizer. That is, its response to a preprocessed pattern should usually agree well with the classification provided by a teacher.
(c) Of course, if the neural network has multiple layers with adaptable synaptic weights, then the early layers can be thought of as preprocessors for the later layers, and we have a case of supervised, rather than Hebbian, formation of these “feature detectors”—
emphasizing features which are not only statistically significant elements of the input patterns but which also serve to distinguish usefully to which class a pattern belongs.
Associative Memory
In pattern recognition, we associate a pattern with a “label” or “category.” Alternatively, an associative memory takes some “key” as input and returns some “associated recollection” as output (see Associative Networks). For example, given the sound of a word, we may wish to recall its spelling. Given a misspelled word, we may wish to recall the correctly spelled word of which it is most plausibly a “degraded image.” There are two major approaches to the use of neural networks as associative memories:
In nonrecurrent neural networks, there are no loops (i.e., we cannot start at any neuron and “follow the arrows” to get back to that neuron). We use such a network by fixing the pattern of inputs as the key, and holding them steady. Since the absence of loops ensures that the input pattern uniquely determines the output pattern (after the new inputs have time to propagate their effects through the network), this uniquely determined output pattern is the recollection associated with the key.
In recurrent networks, the presence of loops implies that the input alone may not determine the output of the net, since this will also depend on the initial state of the network. Thus, recurrent networks are often used as associative memories in the following way. The inputs are only used transiently to establish the initial state of the neural network. After that, the network operates autonomously (i.e., uninfluenced by any inputs). If and when it reaches an equilibrium state, that state is read out as the recollection associated with the key.
In either case, the problem is to set the weights of the neural network so that it associates keys as accurately as possible with the appropriate recollections.
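A minimal sketch of the recurrent approach, using two common conventions not fixed by the text (±1 unit states and outer-product “Hebbian” weights): the key sets the initial state, the net then runs autonomously, and the equilibrium it reaches is read out as the recollection.

```python
import numpy as np

# A recurrent associative memory: store one pattern, then recall it from
# a degraded key.  The +/-1 coding and outer-product weights are one
# standard choice, not prescribed by the text.
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)                    # no self-connections

def recall(key, sweeps=20):
    s = key.copy()                          # the key only sets the initial state
    for _ in range(sweeps):
        for i in range(len(s)):             # asynchronous sweep over the units
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

key = pattern.copy()
key[0], key[3] = -key[0], -key[3]           # degrade the key: flip two bits
print(recall(key))                          # the stored pattern is recovered
```

With a single stored pattern, any key that agrees with it on a majority of units falls into its basin of attraction.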
Learning Rules
Most learning rules in current models of “lumped neurons” (i.e., those that exclude detailed analysis of the fine structure of the neuron or the neurochemistry of neural plasticity) take the form of schemes for adjusting the synaptic weights, the “ws.” The two classic learning schemes for McCulloch-Pitts-type formal neurons are due to Hebb (see Hebbian Synaptic Plasticity) and Rosenblatt (the perceptron, see Perceptrons, Adalines, and Backpropagation), and we now introduce these in turn.
Hebbian Plasticity and Network Self-Organization
In Hebb’s (1949) learning scheme (see Hebbian Synaptic Plasticity), the connection between two neurons is strengthened if both neurons fire at the same time. The simplest example of such a rule is to increase wij by the amount Δwij = kyixj,
where synapse wij connects a presynaptic neuron with firing rate xj to a postsynaptic neuron with firing rate yi, and k is a rate constant. The trouble with the original Hebb model is that every synapse will eventually get stronger and stronger until they all saturate, thus destroying any selectivity of association. Von der Malsburg’s (1973) solution was to normalize the synapses impinging on a given neuron. To accomplish this, one must first compute the Hebbian “update” Δwij = kyixj and then divide this by the total putative synaptic weight to get the final result, which replaces wij by (wij + Δwij)/Σk(wik + Δwik),
where the summation k extends over all inputs to the neuron. This new rule not only increases the strengths of those synapses with inputs strongly correlated with the cell’s activity, but also decreases the synaptic strengths of other connections in which such correlations did not arise.
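A sketch of this normalized rule for a single neuron (the rate constant and stimulus are illustrative): repeated exposure to one stimulus strengthens the correlated synapses at the expense of the others, while the total synaptic weight stays fixed.

```python
# Hebbian update with von der Malsburg-style normalization: first the
# plain Hebbian increment, then division by the total synaptic weight,
# so strengthening one synapse necessarily weakens the others.
def normalized_hebb(w, x, y, k=0.1):
    updated = [wi + k * y * xi for wi, xi in zip(w, x)]
    total = sum(updated)
    return [u / total for u in updated]

w = [0.25, 0.25, 0.25, 0.25]
x = [1.0, 1.0, 0.0, 0.0]        # a stimulus driving only the first two inputs
for _ in range(20):
    y = sum(wi * xi for wi, xi in zip(w, x))   # postsynaptic firing rate
    w = normalized_hebb(w, x, y)

print([round(wi, 3) for wi in w])
# the correlated synapses grow while the others shrink, yet sum(w) stays 1
```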
Von der Malsburg was motivated by the pattern recognition problem and was concerned with how individual cells in his network might come to be tuned so as to respond to one particular input “feature” rather than another (see Ocular Dominance and Orientation Columns for background as well as a review of more recent approaches). This exposed another problem with Hebb’s rule: a lot of nearby cells may, just by chance, all have initial random connectivity which makes them easily persuadable by the same stimulus; alternatively, the same pattern might occur many times before a new pattern is experienced by the network. In either case, many cells would become tuned to the same feature, with not enough cells left to learn important and distinctive features. To solve this, von der Malsburg introduced lateral inhibition into his model. In this connectivity pattern, activity in any one cell is distributed laterally to reduce (partially inhibit) the activity of nearby cells. This ensures that if one cell—call it A—were especially active, its connections to nearby cells would make them less active, and so make them less likely to learn, by Hebbian synaptic adjustment, those features that most excite A.
In summary, then, when the Hebbian rule is augmented by a normalization rule, it tends to “sharpen” a neuron’s predisposition “without a teacher,” getting its firing to become better and better correlated with a cluster of stimulus patterns. This performance is improved when there is some competition between neurons so that if one neuron becomes adept at responding to a pattern, it inhibits other neurons from doing so (Competitive Learning). Thus, the final set of input weights to the neuron depends both on the initial setting of the weights and on the pattern of clustering of the set of stimuli to which it is exposed (see Data Clustering and Learning). Other “post-Hebbian” rules, motivated both by technological efficiency and by recent biological findings, are discussed in several articles in Part III, including Hebbian Learning and Neuronal Regulation and Post-Hebbian Learning Algorithms.
In the adaptive architecture just described, the inputs are initially randomly connected to the cells of the processing layer. As a result, none of these cells is particularly good at pattern recognition. However, by sheer statistical fluctuation of the synaptic connections, one will be slightly better at responding to a particular pattern than others are; it will thus slightly strengthen those synapses which allow it to fire for that pattern and, through lateral inhibition, this will make it harder for cells initially less well tuned for that pattern to become tuned to it. Thus, without any teacher, this network automatically organizes itself so that each cell becomes tuned for an important cluster of information in the sensory inflow. This is a basic example of the kind of phenomenon treated in Self-Organization and the Brain.
Perceptrons
Perceptrons are neural nets that change with “experience,” using an error-correction rule designed to change the weights of each response unit when it makes erroneous responses to stimuli that are presented to the network. We refer to the judge of what is correct as the “teacher,” although this may be another neural network, or some environmental input, rather than a signal supplied by a human teacher in the usual schoolroom sense. Consider the case in which a set R of input lines feeds a McCulloch-Pitts neural network whose neurons are called associator units and which in turn provide the input to a single McCulloch-Pitts neuron (called the output unit of the perceptron) with adjustable weights (w1, … , wd) and threshold θ. (In the case of visual pattern recognition, we think of R as a rectangular “retina” onto which patterns may be projected.) A simple perceptron is one in which the associator units are not interconnected, which means that it has no short-term memory. (If such connections are present, the perceptron is called cross-coupled. A cross-coupled perceptron may have multiple layers and loops back from an “earlier” to a “later” layer.) If the associator units feed the pattern x = (x1, … , xd) to the output unit, then the response of that unit will be to provide the pattern discrimination with discriminant function f(x) = w1x1 + … + wdxd − θ. In other words, the simple perceptron can only compute a linearly separable function of the pattern as provided by the associator units. The question asked by Rosenblatt (1958) and answered by many others since (cf. 
Nilsson, 1965) was, “Given a simple perceptron (i.e., only the synaptic weights of the output unit are adjustable), can we ‘train’ it to recognize a given linearly separable set of patterns by adjusting the ‘weights’ on various interconnections on the basis of feedback on whether or not the network classifies a pattern correctly?” The answer was “Yes: if the patterns are linearly separable, then there is a learning scheme which will eventually yield a satisfactory setting of the weights.” The best-known perceptron learning rule strengthens an active synapse if the efferent neuron fails to fire when it should have fired, and weakens an active synapse if the neuron fires when it should not have done so: Δwij = k(Yi − yi)xj.
As before, synapse wij connects a presynaptic neuron with firing rate xj to a postsynaptic neuron with firing rate yi, but now Yi is the “correct” output supplied by the “teacher.” (This is similar to the Widrow-Hoff [1960] least mean squares model of adaptive control; see Perceptrons, Adalines, and Backpropagation.) Notice that the rule does change the response to xj “in the right direction.” If the output is correct, Yi = yi and there is no change, Δwij = 0. If the output is too small, then Yi − yi > 0, and the change in wij will add Δwijxj = k(Yi − yi)xjxj > 0 to the output unit’s response to (x1, … , xd). Similarly, if the output is too large, then Yi − yi < 0, and Δwij will add k(Yi − yi)xjxj < 0 to the output unit’s response. Thus, there is a sense in which the new setting w′ = w + Δw classifies the input pattern x “more nearly correctly” than w does. Unfortunately, in classifying x “more correctly” we run the risk of classifying another pattern “less correctly.” However, the perceptron convergence theorem (see Arbib, 1987, pp. 66–69, for a proof) shows that Rosenblatt’s procedure does not yield an endless seesaw, but will eventually converge to a correct set of weights if one exists, albeit perhaps after many iterations through the set of trial patterns.
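The error-correction rule is short enough to run directly. This sketch trains a single threshold unit on the OR function, a linearly separable set, folding the threshold in as a weight on a constant input; the learning constant k = 0.5 is an arbitrary choice.

```python
# The perceptron error-correction rule Delta w_j = k (Y - y) x_j, trained
# on the (linearly separable) OR function.  The third input is a constant
# 1, so its weight plays the role of -theta.
def fire(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

samples = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
for _ in range(25):                      # iterate through the trial patterns
    for x, target in samples:
        error = target - fire(w, x)      # +1: should have fired; -1: should not
        w = [wj + 0.5 * error * xj for wj, xj in zip(w, x)]

print(all(fire(w, x) == t for x, t in samples))   # True: the weights converged
```

As the convergence theorem promises, the seesaw of corrections settles: after a few passes through the training set, every pattern is classified correctly.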
Network Complexity
The perceptron convergence theorem states that, if a linear separation exists, the perceptron error-correction scheme will find it. Minsky and Papert (1969) revivified the study of perceptrons (although some AI workers thought they had killed it!) by responding to such results with questions like, “Your scheme works when a weighting scheme exists, but when does there exist such a setting of the weights?” More generally, “Given a pattern-recognition problem, how much of the retina must each associator unit ‘see’ if the network is to do its job?” Minsky and Papert studied when it was possible for a McCulloch-Pitts neuron (no matter how trained) to combine information in a single preprocessing layer to perform a given pattern recognition task, such as recognizing whether a pattern X of 1s on the retina (the other retinal units having output 0) is connected, that is, whether a path can be drawn from any 1 of X to another without going through any 0s. Another question was to determine whether X is of odd parity, i.e., whether X contains an odd number of 1s. The question is, “How many inputs are required for the preprocessing units of a simple perceptron to successfully implement such a function f?” We can get away with using a single element, computing an arbitrary Boolean function, and connecting it to all the units of the retina. So the question that really interests us is whether we can get away with a response unit connected to preprocessors, each of which receives inputs from a limited set of retinal units to make a global decision by synthesizing an array of local views.
We convey the flavor of Minsky and Papert’s approach by the example of XOR, the simple Boolean operation of addition modulo 2, also known as the exclusive-or. If we imagine the square with vertices (0, 0), (0, 1), (1, 1), and (1, 0) in the Cartesian plane, with (x1, x2) being labeled by x1 ⊕ x2, we have 0s at one diagonally opposite pair of vertices and 1s at the other diagonally opposite pair of vertices. It is clear that there is no way of interposing a straight line such that the 1s lie on one side and the 0s lie on the other side. However, we shall prove it mathematically to gain insight into the techniques used by Minsky and Papert.
Consider the claim that we wish to prove wrong: that there actually exists a neuron with threshold θ and weights α and β such that x1 ⊕ x2 = 1 if and only if αx1 + βx2 ≥ θ. The crucial point is to note that the function of addition modulo 2 is symmetric; therefore, we must also have x1 ⊕ x2 = 1 if and only if βx1 + αx2 ≥ θ, and, so, adding together the two terms, we have x1 ⊕ x2 = 1 if and only if (½)(α + β)(x1 + x2) ≥ θ. Writing (½)(α + β) as γ, we see that we have reduced three putative parameters α, β, and θ to just two, namely γ and θ.
We now set t = x1 + x2 and look at the polynomial γt − θ. It is a degree 1 polynomial, but note: at t = 0, γt − θ must be less than zero (0 ⊕ 0 = 0); at t = 1, it is greater than or equal to zero (0 ⊕ 1 = 1 ⊕ 0 = 1); and at t = 2, it is again less than zero (1 ⊕ 1 = 0). This is a contradiction—a polynomial of degree 1 cannot change sign from positive to negative more than once. We conclude that there is no such polynomial, and thus that there is no threshold element which will add modulo 2.
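A brute-force companion to this argument: scanning a grid of candidate parameters (α, β, θ) confirms that none of them computes XOR. The grid search is merely suggestive, of course; it is the degree-1 polynomial argument above that rules out every choice of parameters.

```python
import itertools

# Exhaustively check threshold units alpha*x1 + beta*x2 >= theta over a
# parameter grid: none of them realizes XOR (addition modulo 2).
def unit(alpha, beta, theta, x1, x2):
    return 1 if alpha * x1 + beta * x2 >= theta else 0

xor_cases = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
grid = [i / 4 for i in range(-8, 9)]          # -2.0 to 2.0 in steps of 0.25
found = any(
    all(unit(a, b, t, x1, x2) == y for x1, x2, y in xor_cases)
    for a, b, t in itertools.product(grid, repeat=3)
)
print(found)    # False: no unit on the grid realizes XOR
```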
We now understand a general method used again and again by Minsky and Papert: start with a pattern-classification problem. Observe that certain symmetries leave it invariant. For instance, for the parity problem (is the number of active elements even or odd?), which includes the case of addition modulo 2 when the retina has only two units, any permutation of the points of the retina would leave the classification unchanged. Use this to reduce the number of parameters describing the circuit. Then lump items together to get a polynomial and examine actual patterns to put a lower bound on the degree of the polynomial, fixing things so that this degree bounds the number of inputs to the response unit of a simple perceptron.
Minsky and Papert provide many interesting theorems (for the proof of an illustrative sample, see Arbib, 1987, pp. 82–84). As just one example, we may note that they prove that the parity function requires preprocessors big enough to scan the whole retina if the preprocessors can only be followed by a single McCulloch-Pitts neuron. By contrast, to tell whether the number of active retinal inputs reaches a certain threshold only requires two inputs per neuron in the first layer. (For other complexity results, see the articles listed in the road map Computability and Complexity.)
Gradient Descent and Credit Assignment
The implication of the results on “network complexity” is clear: if we limit the complexity of the units in a neural network, then in general we will need many layers, rather than a single layer, if the network is to have any chance of being trained to realize many “interesting” functions. This conclusion motivates the study of training rules for multilayer perceptrons, of which the most widely used is backpropagation. Before describing this method, we first discuss two general notions of which it is an important exemplar: gradient descent and credit assignment.
In discussing Hopfield networks, we introduced the metaphor of an “energy landscape” (see Figure 6). The asynchronous updates move the state of the network (the vector of neural activity levels) “downhill,” tending toward a local energy minimum. Our task now is to realize that the metaphor works again on a far more abstract level when we consider learning. In learning, the dynamic variable is not the network state, but rather the vector of synaptic weights (or whatever other set of network parameters is adjusted by the learning rules). We now conduct gradient descent in weight space. At each step, the weights are adjusted in such a way as to improve the performance of the network. (As in the case of the simple perceptron, the improvement is a “local” one based on the current situation. It is, in this case, a matter for computer simulation to prove that the cumulative effect of these small changes is a network which solves the overall problem.)
But how do we recognize which “direction” in weight space is “downhill”? Suppose success is achieved by a complex mechanism after operating over a considerable period of time (for example, when a chess-playing program wins a game). To what particular decisions made by what particular components should the success be attributed? And, if failure results, what decisions deserve blame? This is closely related to the problem known as the “mesa” or “plateau” problem (Minsky, 1961). The performance evaluation function available to a learning system may consist of large level regions in which gradient descent degenerates to exhaustive search, so that only a few of the situations obtainable by the learning system and its environment are known to be desirable, and these situations may occur rarely.
One aspect of this problem, then, is the temporal credit assignment problem. The utility of making a certain action may depend on the sequence of actions of which it is a part, and an indication of improved performance may not occur until the entire sequence has been completed. This problem was attacked successfully in Samuel’s (1959) learning program for playing checkers. The idea is to interpret predictions of future reward as rewarding events themselves. In other words, neutral stimulus events can themselves become reinforcing if they regularly occur before events that are intrinsically reinforcing. Such temporal difference learning (see Reinforcement Learning) is like a process of erosion: the original uninformative mesa, where only a few sink holes allow gradient descent to a local minimum, is slowly replaced by broader valleys in which gradient descent may successfully proceed from many different places on the landscape.
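A minimal sketch of this erosion idea, using TD(0)-style updates on a chain of states with reward only at the far end (the chain task, learning rate, and discount factor are our illustrative choices): predictions of reward seep backward, so earlier states become informative themselves.

```python
# Temporal-difference "erosion": a five-state chain rewarded only at the
# end.  Each visit nudges V[s] toward reward + gamma * V[s+1], so value
# spreads backward from the single rewarding event.
n_states = 5
V = [0.0] * (n_states + 1)        # V[n_states] is the terminal state
alpha, gamma = 0.5, 0.9

for _ in range(100):              # repeated walks down the chain
    for s in range(n_states):
        reward = 1.0 if s + 1 == n_states else 0.0
        V[s] += alpha * (reward + gamma * V[s + 1] - V[s])

print([round(v, 2) for v in V[:n_states]])
# values rise toward the reward: earlier states now predict it too
```

The initially flat "mesa" of values becomes a graded slope, so a gradient-following learner gets useful guidance long before the final reward.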
Another aspect of credit assignment concerns structural factors. In the simple perceptron, only the weights to the output units are to be adjusted. This architecture can only support maps which are linearly separable as based on the patterns presented by the preprocessors, and we have seen that many interesting problems require preprocessing units of undue complexity to achieve linear separability. We thus need multiple layers of preprocessors, and, since one may not know a priori the appropriate set of preprocessors for a given problem, these units should be trainable too. This raises the question, “How does a neuron deeply embedded within a network ‘know’ what aspect of the outcome of an overall action was ‘its fault’?” This is the structural credit assignment problem. In the next section, we shall study the most widely used solution to this problem, called backpropagation, which propagates back to a hidden unit some measure of its responsibility.
Backpropagation is an “adaptive architecture”: it is not just a local rule for synaptic adjustment; it also takes into account the position of a neuron in the network to indicate how the neuron’s weights are to change. (In this sense, we may see the use of lateral inhibition to improve Hebbian learning as the first example of an adaptive architecture in these pages.) This adaptive architecture is an example of “neurally inspired” modeling, not modeling of actual brain structures; and there is no evidence that backpropagation represents actual brain mechanisms.
Backpropagation
The task of backpropagation is to train a multilayer (feedforward) perceptron (or MLP), a loop-free network which has its units arranged in layers, with a unit providing input only to units in the next layer of the sequence. The first layer comprises fixed input units; there may then be several layers of trainable “hidden units” carrying an internal representation, and finally, there is the layer of output units, also trainable. (A simple perceptron then corresponds to the case in which we view the input units as fixed associator units, i.e., they deliver a preprocessed, rather than a “raw,” pattern and connect directly to the output units without any hidden units in between.) For what follows, it is crucial that each unit not be binary: it has both input and output taking continuous values in some range, say [0, 1]. The response is a sigmoidal function of the weighted sum. Thus, if a unit has inputs xk with corresponding weights wik, the output xi is given by xi = fi(Σwikxk), where fi is a sigmoidal function, say fi(m) = 1/(1 + exp(−(m − θi))),
fi(x) = 1/(1 + exp(−(x − θi)))
with θi being a bias or threshold for the unit.
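As a concrete sketch, a single such unit can be computed as follows (our illustrative Python, not the Handbook’s; the function names are ours):

```python
import math

def sigmoid(x, theta=0.0):
    # Logistic squashing function with bias/threshold theta: the output
    # rises smoothly from 0 toward 1 as x passes theta.
    return 1.0 / (1.0 + math.exp(-(x - theta)))

def unit_output(weights, inputs, theta=0.0):
    # x_i = f_i(sum_k w_ik x_k): weighted sum passed through the sigmoid.
    s = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s, theta)

# A unit whose weighted input exactly equals its threshold outputs 0.5.
print(unit_output([1.0, 1.0], [0.25, 0.25], theta=0.5))  # 0.5
```

Unlike a binary McCulloch-Pitts unit, this unit’s output varies continuously with its input, which is what makes the gradient computations below possible.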
The environment only evaluates the output units. We are given a training set of input patterns p and corresponding desired target patterns tp for the output units. With op the actual output pattern elicited by input p, the aim is to adjust the weights in the network to minimize the error
E = Σp Σk (tpk − opk)2
where the inner sum runs over the output units k.
Rumelhart, Hinton, and Williams (1986) were among those who devised a formula for propagating back the gradient of this evaluation from a unit to its inputs. This process can continue by backpropagation through the entire net. The scheme seems to avoid many false minima. At each trial, we fix the input pattern p and consider the corresponding “restricted error”
E = Σk (tk − ok)2
where k ranges over designated “output units.” The net has many units interconnected by weights wij. The learning rule is to change wij so as to reduce E by gradient descent:
Δwij = −η ∂E/∂wij
for some small learning rate η > 0.
Consider a net divided into m + 1 layers, with units in layer g + 1 receiving all their inputs from layer g; with layer 0 comprising the input units; and layer m comprising the output units. Since ∂E/∂wij = −2 Σk (tk − ok) ∂ok/∂wij, if i is an output unit (remember, wij connects from j to i), then the only non-zero term in this sum has k = i. Now oi = fi(Σl wil ol), where wil ≠ 0 only for those ol which are outputs from the previous layer. We thus have
Δwij = −η ∂E/∂wij = 2η(ti − oi)fi′oj
where fi′ is the derivative of the activation function evaluated at the activation level ini = Σl wil ol to unit i. Thus Δwij for an output unit i is proportional to δioj, where δi = (ti − oi)fi′.
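For a single sigmoid output unit, this update Δwj ∝ δxj with δ = (t − o)f′ can be sketched as follows (our illustrative Python, not the Handbook’s; note that f′ = o(1 − o) for the logistic function):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def delta_rule_step(w, x, t, eta=0.5):
    # One gradient step for a single sigmoid output unit:
    # o = f(sum_j w_j x_j); delta = (t - o) * f'(in), with f' = o(1 - o)
    # for the logistic function; each weight moves by 2*eta*delta*x_j.
    o = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    delta = (t - o) * o * (1.0 - o)
    return [wj + 2.0 * eta * delta * xj for wj, xj in zip(w, x)]

# Repeated presentations of one pattern drive the output toward t = 1.
w = [0.0, 0.0]
for _ in range(100):
    w = delta_rule_step(w, [1.0, 1.0], 1.0)
```

Each step moves the weights a small distance down the gradient of (t − o)2 for the presented pattern.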
Next, suppose that i is a hidden unit whose output drives only output units:
Δwij = −η ∂E/∂wij = 2η Σk (tk − ok) ∂ok/∂wij
with ok = fk(Σl wkl ol). However, the only ol that depends on wij is oi, and so
∂ok/∂wij = fk′wki ∂oi/∂wij = fk′wki fi′oj
so that Δwij = 2η Σk (tk − ok)[fk′wki] · [fi′oj].
Recalling that δk = (tk − ok)fk′ for an output unit k, we may rewrite this as
Δwij = 2η(Σk δkwki)fi′oj
Thus, Δwij is proportional to δioj, with δi = (Σkδkwki)fi′, where k runs over all units which receive unit i’s output. More generally, we can prove the following, by induction on how many layers back we must go to reach a unit:
Proposition. Consider a layered loop-free net with error E = Σk (tk − ok)2, where k ranges over designated “output units,” and let the weights wij be changed according to the gradient descent rule
Δwij = −η ∂E/∂wij
Then the weights may be changed inductively, working back from the output units, by the rule
Δwij = 2ηδioj
where:
Basis Step: δi = (ti − oi)fi′ for an output unit.
Induction Step: If i is a hidden unit, and if δk is known for all units that receive unit i’s output, then δi = (Σk δkwki)fi′, where k runs over all units which receive unit i’s output.
Thus the “error signal” δi propagates back layer by layer from the output units. In Σkδkwki, unit i receives error propagated back from a unit k to the extent to which i affects k. For output units, this is essentially the delta rule given by Widrow and Hoff (1960) (see Perceptrons, Adalines, and Backpropagation).
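The proposition translates directly into code. The following sketch (our illustrative Python, not the Handbook’s; biases are omitted for brevity, and f′ = o(1 − o) for the logistic function) applies the basis and induction steps layer by layer:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    # layers[g][i][j] is the weight w_ij from unit j in layer g to unit i
    # in layer g + 1; returns the output values of every layer in turn.
    outs = [list(x)]
    for W in layers:
        outs.append([sigmoid(sum(wij * oj for wij, oj in zip(row, outs[-1])))
                     for row in W])
    return outs

def backprop_step(layers, x, t, eta=0.25):
    # One gradient step on E = sum_k (t_k - o_k)^2 by backpropagation.
    outs = forward(layers, x)
    # Basis step: delta_i = (t_i - o_i) f'_i for each output unit.
    deltas = [(tk - ok) * ok * (1.0 - ok) for tk, ok in zip(t, outs[-1])]
    for g in range(len(layers) - 1, -1, -1):
        W, below = layers[g], outs[g]
        if g > 0:
            # Induction step: delta_i = (sum_k delta_k w_ki) f'_i, computed
            # with the weights as they stood before this step's update.
            hidden = [sum(deltas[k] * W[k][i] for k in range(len(W)))
                      * below[i] * (1.0 - below[i])
                      for i in range(len(below))]
        for i, row in enumerate(W):
            for j in range(len(row)):
                row[j] += 2.0 * eta * deltas[i] * below[j]
        if g > 0:
            deltas = hidden

# Example with hypothetical weights: a 2-2-1 net trained on one pattern.
net = [[[0.5, -0.5], [0.3, 0.8]], [[1.0, -1.0]]]
for _ in range(100):
    backprop_step(net, [1.0, 0.0], [1.0])
```

Note how the error signals for layer g are computed before the weights above them are changed, exactly as the layer-by-layer induction requires.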
The theorem just presented tells us how to compute Δwij for gradient descent. It does not guarantee that the above step size is appropriate for reaching the minimum, nor does it guarantee that the minimum, if reached, is global. The backpropagation rule defined by this proposition is thus a heuristic rule, not one guaranteed to find a global minimum, but it is still perhaps the most widely used adaptive architecture. Many other approaches to learning, including some which are “neural-like” in at best a statistical sense, rather than being embedded in adaptive neural networks, may be found in the road map Learning in Artificial Networks (not just neural networks).
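The point about local minima is visible even in one dimension. In the following sketch (ours; the error surface E(x) = (x2 − 1)2 + 0.3x is purely illustrative), gradient descent settles into whichever minimum lies nearest its starting point:

```python
def grad_descent(dE, x0, eta=0.01, steps=3000):
    # Plain gradient descent on a one-dimensional error surface,
    # given the derivative dE of the error.
    x = x0
    for _ in range(steps):
        x -= eta * dE(x)
    return x

# E(x) = (x^2 - 1)^2 + 0.3x has a global minimum near x = -1.03
# and a merely local one near x = 0.96.
dE = lambda x: 4.0 * x * (x * x - 1.0) + 0.3
left = grad_descent(dE, -0.5)   # settles near the global minimum
right = grad_descent(dE, 0.5)   # stuck in the local minimum
```

Nothing in the descent rule itself can tell the second run that a deeper minimum exists on the other side of the intervening hill.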
A Cautionary Note
The previous subsections have introduced a number of techniques that can be used to make neural networks more adaptive. In a typical training scenario, we are given a network N which, in response to the presentation of any x from some set X of input patterns, will eventually settle down to produce a corresponding y from the set Y of the network’s output patterns. A training set is then a sequence of pairs (xk, yk) from X × Y, 1 ≤ k ≤ n. The foregoing results say that, in many cases (and the bounds are not yet well defined), if we train the net with repeated presentations of the various (xk, yk), it will converge to a set of connections which cause N to compute a function f: X → Y with the property that, over the set of k’s from 1 to n, the f(xk) “correlate fairly well” with the yk. Of course, there are many other functions g: X → Y such that the g(xk) “correlate fairly well” with the yk, and they may differ wildly on those “tests” x in X that do not equal an xk in the training set. The view that one may simply present a trainable net with a few examples of solved problems, and it will then adjust its connections to be able to solve all problems of a given class, glosses over three main issues:
Complexity: Is the network complex enough to encode a solution method?
Practicality: Can the net achieve such a solution within a feasible period of time? and
Efficacy: How do we guarantee that the generalization achieved by the machine matches our conception of a useful solution?
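The third issue, efficacy, is easy to dramatize: many functions fit a small training set exactly yet diverge on new inputs. A toy Python illustration (ours, not from the text):

```python
# Two hypotheses that agree perfectly on a two-point training set...
train = [(0.0, 0.0), (1.0, 1.0)]
f = lambda x: x           # one consistent generalization
g = lambda x: x ** 3      # another, equally consistent with the data

assert all(f(x) == y and g(x) == y for x, y in train)
# ...yet they disagree on a "test" input outside the training set:
print(f(0.5), g(0.5))  # 0.5 0.125
```

Nothing in the training data alone distinguishes f from g; which generalization a trained net actually arrives at depends on its architecture, its initial weights, and its learning rule.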
Part III provides many “snapshots” of the research underway to develop answers to these problems (for the “state of play” see, for example, Learning and Generalization: Theoretical Bounds; PAC Learning and Neural Networks; and Vapnik-Chervonenkis Dimension of Neural Nets). Nonetheless, it is clear that these training techniques will work best when training is based on an adaptive architecture and an initial set of weights appropriate to the given problem. Future work on the neurally inspired design of intelligent systems will involve many domain-specific techniques for system design, such as those exemplified in the road maps Vision and Robotics and Control Theory, as well as general advances in adaptive architectures.
Envoi
With this, our tour of some of those basic landmarks of Brain Theory and Neural Networks established by 1986 is complete. I now invite each reader to follow the suggestions of the section “How to Use this Book” of the Handbook to begin exploring the riches of Part III, possibly with the guidance of a number of the road maps in Part II.
References
Amari, S., and Arbib, M. A., 1977, Competition and cooperation in neural nets, in Systems Neuroscience (J. Metzler, Ed.), New York: Academic Press, pp. 119–165.
Arbib, M. A., 1981, Perceptual structures and distributed motor control, in Handbook of Physiology—The Nervous System, vol. II, Motor Control (V. B. Brooks, Ed.), Bethesda, MD: American Physiological Society, pp. 1449–1480.
Arbib, M. A., 1987, Brains, Machines, and Mathematics, 2nd ed., New York: Springer-Verlag.
Arbib, M. A., 1989, The Metaphorical Brain 2: Neural Networks and Beyond, New York: Wiley-Interscience.
Arbib, M. A., Érdi, P., and Szentágothai, J., 1998, Neural Organization: Structure, Function, and Dynamics, Cambridge, MA: MIT Press.
Arbib, M. A., and Hesse, M. B., 1986, The Construction of Reality, New York: Cambridge University Press.
Bain, A., 1868, The Senses and the Intellect, 3rd ed.
Bernard, C., 1878, Leçons sur les phénomènes de la Vie.
Brooks, R. A., 1986, A robust layered control system for a mobile robot, IEEE Robot. Automat., RA-2:14–23.
Cannon, W. B., 1939, The Wisdom of the Body, New York: Norton.
Chomsky, N., 1959, On certain formal properties of grammars, Inform. Control, 2:137–167.
Church, A., 1941, The Calculi of Lambda-Conversion, Annals of Mathematics Studies 6, Princeton, NJ: Princeton University Press.
Craik, K. J. W., 1943, The Nature of Explanation, New York: Cambridge University Press.
Ewert, J.-P., and von Seelen, W., 1974, Neurobiologie und System-Theorie eines visuellen Muster-Erkennungsmechanismus bei Kröten, Kybernetik, 14:167–183.
Fearing, F., 1930, Reflex Action, Baltimore: Williams and Wilkins.
Gödel, K., 1931, Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme: I, Monats. Math. Phys., 38:173–198.
Grossberg, S., 1967, Nonlinear difference-differential equations in prediction and learning theory, Proc. Natl. Acad. Sci. USA, 58:1329–1334.
Hebb, D. O., 1949, The Organization of Behavior, New York: Wiley.
Heims, S. J., 1991, The Cybernetics Group, Cambridge, MA: MIT Press.
Hodgkin, A. L., and Huxley, A. F., 1952, A quantitative description of membrane current and its application to conduction and excitation in nerve, J. Physiol. Lond., 117:500–544.
Hopfield, J., 1982, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci. USA, 79:2554–2558.
Hopfield, J. J., and Tank, D. W., 1985, Neural computation of decisions in optimization problems, Biol. Cybern., 52:141–152.
Kleene, S. C., 1936, General recursive functions of natural numbers, Math. Ann., 112:727–742.
La Mettrie, J., 1953, Man a Machine (trans. by G. Bussey from the French original of 1748), La Salle, IL: Open Court.
Lichtheim, L., 1885, On aphasia, Brain, 7:433–484.
Maxwell, J. C., 1868, On governors, Proc. R. Soc. Lond., 16:270–283.
McCulloch, W. S., and Pitts, W. H., 1943, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys., 5:115–133.
Minsky, M. L., 1961, Steps toward artificial intelligence, Proc. IRE, 49:8–30.
Minsky, M. L., 1985, The Society of Mind, New York: Simon and Schuster.
Minsky, M. L., and Papert, S., 1969, Perceptrons: An Introduction to Computational Geometry, Cambridge, MA: MIT Press.
Nilsson, N., 1965, Learning Machines, New York: McGraw–Hill.
Pavlov, I. P., 1927, Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex (translated from the Russian by G. V. Anrep), New York: Oxford University Press.
Post, E. L., 1943, Formal reductions of the general combinatorial decision problem, Am. J. Math., 65:197–268.
Rall, W., 1964, Theoretical significance of dendritic trees for neuronal input–output relations, in Neural Theory and Modeling (R. Reiss, Ed.), Stanford, CA: Stanford University Press, pp. 73–97.
Ramón y Cajal, S., 1906, The structure and connexion of neurons, reprinted in Nobel Lectures: Physiology or Medicine, 1901–1921, New York: Elsevier, 1967, pp. 220–253.
Rosenblatt, F., 1958, The perceptron: A probabilistic model for information storage and organization in the brain, Psychol. Rev., 65:386–408.
Rosenblueth, A., Wiener, N., and Bigelow, J., 1943, Behavior, purpose and teleology, Philos. Sci., 10:18–24.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J., 1986, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1 (D. Rumelhart and J. McClelland, Eds.), Cambridge, MA: MIT Press/Bradford Books, pp. 318–362.
Samuel, A. L., 1959, Some studies in machine learning using the game of checkers, IBM J. Res. Dev., 3:210–229.
Selfridge, O. G., 1959, Pandemonium: A paradigm for learning, in Mechanisation of Thought Processes, London: Her Majesty’s Stationery Office, pp. 511–531.
Sherrington, C., 1906, The Integrative Action of the Nervous System, New York: Oxford University Press.
Turing, A. M., 1936, On computable numbers with an application to the Entscheidungsproblem, Proc. Lond. Math. Soc. (Series 2), 42:230–265.
Turing, A. M., 1950, Computing machinery and intelligence, Mind, 59:433–460.
von der Malsburg, C., 1973, Self-organization of orientation-sensitive cells in the striate cortex, Kybernetik, 14:85–100.
Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, vol. 4, pp. 96–104.
Wiener, N., 1948, Cybernetics: Or Control and Communication in the Animal and the Machine, New York: Technology Press and Wiley (2nd ed., Cambridge, MA: MIT Press, 1961).
Young, R. M., 1970, Mind, Brain and Adaptation in the Nineteenth Century: Cerebral Localization and Its Biological Context from Gall to Ferrier, New York: Oxford University Press.