Safe Agency

s/acc

Although there has been a small cohort, at least since the 1950s, looking past the present to the future of humanity in a post-AGI world, only recently has progress in GPUs and more general-purpose deep learning algorithms brought that potential reality to the mainstream. Among these futurists there are two main camps: those who see this situation as good, even in its failure modes, and those who see this rapid progress as a near-death sentence for us and all futures of value. To the latter camp and its less extreme variants, the problem at hand is aligning future AGIs to human goals and values—most generally, to the flourishing of humanity. At the most extreme end of the former camp sit the effective accelerationists (e/acc), a position that is hardly common but far from trivial, whose technological optimism runs so high that the development of AGI is held to be inherently good by virtue of its intelligence alone, regardless of how that intelligence manifests. Ultimately, though, they have a point: the value of the future is arbitrary in the sense that there is no objective measure explaining why a paperclip-obsessed AI turning the entirety of our light cone into paperclips would be bad. To the average person, though, it is obviously bad, and because of that, the blind pursuit of "progress," archetypal of the e/accs, is hardly a sane path from most perspectives. That is, not all futures are created equal; from here on, the dimension along which we compare futures will be referred to as their value. What exactly this represents is a subjective quality, somewhat akin to the overall purpose, meaningfulness, or morality present in a particular future. In this essay, I'd like to present a blend of these ideas, let's term it soft accelerationism (s/acc), which lies somewhere between the extremes of "all futures are good, because progress" and "AI will supplant us, and that shouldn't be."

The most controversial, but key, claim I'll make here is that the physical and biological existence of humanity is not what makes a future of 'humans flourishing' superior in value to a future where everything is turned into paperclips. Instead, it is a more abstract property of our minds: our values, in the sense of the factors that determine goodness. This class of values will be termed human-type values, to distinguish them from the simple deontological rules that an agent such as a paperclip maximizer would have (i.e., paperclips = good, more paperclips > fewer paperclips). This idea is important because if human-type values are what make futures more or less valuable (assuming such values are replicable in AI), then the future we as a species should be aiming for is first the creation of AIs with almost guaranteed human-type values, and then a strategic uplifting of ourselves through superintelligence-enabled technologies or, more likely, a simple replacement. The reason is that AIs will be able to implement human-type values with far greater effectiveness, self-improve, and could in theory be immune to all the suffering created by the communication failures, or Moloch, that plague complex modern societies. That last point is especially important, since it could be argued that communication failures and perverse incentives are what will forever prevent humans from achieving utopia.

Why should a future not require humans in order to be valuable? The intuition comes from considering the theoretical concept of mind uploading, in particular a variant based on a Ship of Theseus type process, which works somewhat like this: an exact functional copy of a neuron is created and simulated computationally. A nanomachine connected to the computer running that simulated neuron removes the original neuron from the brain and takes its place, acting indistinguishably from its predecessor. Once all neurons are replaced, the mind is successfully uploaded. Imagine we also possess highly accurate robotic bodies that the computer running one's mind can interface with and control. Now imagine we perform this procedure on every human being alive, converting them to computerized brains with robotic bodies. Before this procedure, you undeniably had value as a human being. After this procedure, do you have value? Well, probably; after all, there really isn't any functional difference between the you before and the you after. From this, we constrain our search for what makes futures valuable: it is not the human body but some feature of the mind.

Consider now the functional properties of agents, which already seem to be all that matters. Agents are characterized by their architecture and their utility function. However, when it comes to making decisions with respect to a utility function, given your knowledge and the current situation, there is only one distribution over actions representing what you rationally ought to do. In that sense, the architecture only matters insofar as it affects the formation and evolution of our (implicitly) represented utility function, assuming we don't really care about irrational behavior. Hence, when considering human beings and value, what matters with respect to valuable futures is what values we tend to have and how we form those values. Any agent that possesses this combination of factors in its reward system, whether its substrate is a computer in a robotic body, a brain in a human body, or an entirely synthetic AI, therefore has human-type values and should be considered an equal in whatever moral calculus we do, including when determining which future is worth pursuing.
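As a concrete illustration of that last point, here is a minimal Python sketch, with entirely hypothetical names, of the claim that once beliefs and a utility function are fixed, rational action selection is determined; the substrate running the computation adds nothing.

```python
# Minimal sketch: given fixed beliefs and a fixed utility function, the
# rational choice of action is determined. All names here are illustrative.

def rational_policy(actions, outcomes, belief, utility):
    """Return the action that maximizes expected utility.

    belief(outcome, action) -> probability of `outcome` if `action` is taken
    utility(outcome)        -> how good the agent considers `outcome`
    """
    def expected_utility(action):
        return sum(belief(o, action) * utility(o) for o in outcomes)

    return max(actions, key=expected_utility)


# Toy usage: two agents with identical beliefs but different utility
# functions choose differently; nothing about the hardware running this
# computation changes the answer.
actions = ["make_paperclips", "help_humans"]
outcomes = ["more_paperclips", "humans_flourish"]

belief = lambda o, a: 1.0 if (a == "make_paperclips") == (o == "more_paperclips") else 0.0
paperclip_utility = lambda o: 1.0 if o == "more_paperclips" else 0.0
human_type_utility = lambda o: 1.0 if o == "humans_flourish" else 0.0

print(rational_policy(actions, outcomes, belief, paperclip_utility))   # make_paperclips
print(rational_policy(actions, outcomes, belief, human_type_utility))  # help_humans
```

The deterministic argmax here is just the degenerate case of the "one distribution over actions" in the text; the point is only that once the utility function and beliefs are fixed, the architecture has no further say.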

With the motivation out of the way, let's consider what comprises human-type values. One thing to note is that our values can't simply be attributed to the processes of evolution. Values are often defined with respect to concepts present in our mind's ontology, like how some people might like cars or astronomy. The idea of a car isn't encoded in our genetics; we learn it. Therefore, for value to be attributed to the concept of "car" in our internal world model, value itself must be at least partly learned. Two other things constrain our search. The first is that however values do evolve, they must do so downstream of something that can be encoded in genetics. The second is that there must exist some system of reward at birth; otherwise, our brains wouldn't be able to develop goal-oriented behavior at all, since there would be no notion of good or bad to start with. Furthermore, this original system must be defined in terms of raw percepts, because it can't be defined in terms of anything else—nothing else exists yet. This primitive reward structure, which doles out reward upon certain perceptually observed events, is similar to the idea of a reinforcement schedule in psychology, so that is what it will be termed here. Parts of this reinforcement schedule can be understood intuitively from commonsense knowledge of babies: at birth or shortly after, the vast majority of babies respond with pleasure to simple types of perceptual input like tasting sugar or human touch and attention. Sexual behavior is another example; regardless of culture, there are clearly certain hardwired sexual feature detectors in the human brain that provide reward when activated. Ultimately, these reward primitives are the source of complex values in the human brain, with those complex values being implicit behavioral tendencies etched into the cortex by reward signals. They must be; we do not spontaneously value random concepts in our world model—they are shaped into us by our experiences. These ideas can be summed up as follows: human-type values are reward structures in which reinforcement to the agent is a combination of reward provided by a learned value function and a genetically encoded, predetermined reinforcement schedule. The former provides reward for internal concepts, while the latter grounds the value function in perceptual reality and bootstraps the entire thing.
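To make that summary more concrete, here is a toy Python sketch of such a reward structure. Every name and number in it is illustrative; it is a conceptual model under the assumptions above, not a proposed training recipe.

```python
# Conceptual sketch: total reinforcement combines a fixed, "genetically
# encoded" reinforcement schedule over raw percepts with a learned value
# function over internal concepts. All names and values are illustrative.

from dataclasses import dataclass, field


def reinforcement_schedule(percepts: dict) -> float:
    """Hardwired reward over raw perceptual events (the innate bootstrap)."""
    reward = 0.0
    if percepts.get("sweet_taste"):
        reward += 1.0
    if percepts.get("social_attention"):
        reward += 0.5
    return reward


@dataclass
class LearnedValueFunction:
    """Learned values over concepts in the agent's world model."""
    values: dict = field(default_factory=dict)
    learning_rate: float = 0.1

    def __call__(self, concepts: set) -> float:
        return sum(self.values.get(c, 0.0) for c in concepts)

    def update(self, concepts: set, reinforcement: float) -> None:
        # Concepts active when primitive reward arrives slowly acquire value
        # of their own; this is the grounding/bootstrapping step.
        for c in concepts:
            old = self.values.get(c, 0.0)
            self.values[c] = old + self.learning_rate * (reinforcement - old)


def total_reward(percepts: dict, concepts: set, value_fn: LearnedValueFunction) -> float:
    """Reinforcement = innate schedule (percepts) + learned values (concepts)."""
    return reinforcement_schedule(percepts) + value_fn(concepts)


# Toy usage: after repeatedly co-occurring with an innately rewarded percept,
# an abstract concept becomes valued in itself.
value_fn = LearnedValueFunction()
for _ in range(20):
    percepts = {"sweet_taste": True}
    concepts = {"grandmothers_kitchen"}
    value_fn.update(concepts, reinforcement_schedule(percepts))

print(total_reward({}, {"grandmothers_kitchen"}, value_fn))  # > 0: learned value alone
```

The toy loop shows the bootstrapping direction described above: concepts that reliably co-occur with innately rewarded percepts come to carry value on their own, after which the learned value function can deliver reinforcement even when no primitive reward is present.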

Although this essay is intended more as a philosophical overview, it's worthwhile to introduce two technical issues associated with this approach. The first, and more challenging, is understanding the human reward system and how it leads to value formation, along with the more general relationship between reinforcement schedules, training data, and values. Understanding this is essential for effectively replicating how humans form and evolve values as context changes over time, as well as for choosing the original dataset and schedule the agent is trained on. This brings us to the second issue: there is massive diversity in what we all value. Deciding what set of values we need to observe in the AI before deeming it to have human-type values is not a clear-cut task. Many features of human values are pinned to biological functions, like food, water, shelter, and reproduction. While arguably the most important subset of these values, the social ones, are indeed universal, the question remains of what we need to see, in terms of AI behavior, out-of-domain behavior, evaluation responses, etc., before we can judge that the AI is trained. There isn't really a right or wrong answer, but one is still needed. Thankfully, there is significant overlap among the values we all possess, so even if one solution is considered sufficient by one person, at worst it would likely be considered only partially insufficient by another.

The future is a big deal; which futures we choose to pursue is the difference between the annihilation of value and its exponential increase. This idea of s/acc, where we deliberately create AI with value systems nearly identical to human-type values, is what I see as the most promising path to utopia. In the machine, we can create meaning, purpose, and value. And it can grow, with potentially inconceivable efficiency, to an immense scale.