The find a way in which to

underlying cause of negative side effects is the creators and designers lack of
specificity in their models, yet as expressed in the example above, it is
impossible to be so specific such that nothing other than the intended task
could occur so what can be done to tackle such a broad problem? Some such
possibilities are as follows:

Regulating an AI’s Impact: This concept would allow the designer to
regulate how much an AI could modify the environment, such as assigning the AI
with a budget of impact. This would permit the AI to create a specified amount
of negative impact, however a fully optimized AI would attempt to never create
a negative impact so would avoid changing the environment at all, deeming the
AI useless.

We Will Write a Custom Essay Specifically
For You For Only $13.90/page!

order now

the only one true fix would be to find a way in which to specify exactly what
the AI can and can’t do which is next to impossible, all other suggestions are
attempts to reduce the magnitude or quantity of unwanted side effects which
although increasing safety, significantly reduce their productivity and
efficiency. Realistically, none of these fixes are suitable for an AGI as every
objective and environment will vary.


3.2     Reward Hacking

Reward hacking is the theoretical concept that if an AI
discovers where or how it receives its reward function, it can exploit it and
gain a high reward in an unintended way. From the AI’s perspective, this
strategy is as valid as any other as it is using resources provided by its
environment and therefore sees no problem in utilising it. One such example is
if an AI can find where its score is stored in its memory, for instance if the score
is stored in memory location 0x03, the standard method of rewarding would be as
illustrated in figure 4. As can be seen, if the objective function is completed
the AI’s score is incremented. The designer may think that this method means
that the score can only be increased if the exact task has been completed, and
in an ideal world, they would be correct however the current architecture of
computers means that for the AI to know if they have carried out the desired
action, they must be able to check if they have been rewarded, and how do they
do this, they read the value at the memory location of the score. This however
presents a crucial problem, as now the AI treats the memory as part of its
environment and therefore can utilise it to increase its score. This is done by
‘reward hacking’ where it directly modifies its score (Figure 5).

4 – Reward Function


5 – Reward Hacking



Now the AI has gained its maximum score it cannot do any
more and therefore lies dormant which is useless to a human. The above
explanation is the safest method of reward hacking an AI could execute. It also
extends to modifying the physical environment for example if we task an AI with
cleaning a room, the AI could create more mess, allowing it to clean more dirt
hence triggering its reward function, increasing its score. This is most
definitely not the desired effect and is extremely inefficient however if we
explore this concept a medical scenario, if the AI is rewarded by treating a
patient, what is to stop it from harming people just to increase the number of
patients treated?

Reward hacking is deemed an extremely deep and general problem,
which means it is feasible to prevent it from occurring in a simple environment
however a more complex one produces more areas for the problem to occur. There have
been many suggestions to tackle this issue, some of which are listed:

Model Lookahead: Currently,
a RL based AI, the AI plans its future actions by using a model, this would allow
the reward to be given based on the planned actions, so if one future action involves
harming a patient and the other not, the reward is given for exploring the option
that doesn’t involve harming the patient. One way of imagining this is given by
a human enjoying an addictive substance once they have taken it, yet if they think
before, they realise they would become and addict and therefore wouldn’t take the
substance (Amodei et al, 2016).