You don't understand the architecture of these models.
The layers that screen out things like racism are not the core model. Core model training just produces a statistical model of word order: given the words so far, predict the next one. The core model is built once, during the training run, and from then on it is one giant, terabyte-plus blob of numbers (the weights), frozen until a new model is built in the next training run on a new, larger base training set.
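To make that concrete, here is a toy sketch, nothing like a real transformer training run, but it shows the sense in which the core model is "just" next-word statistics frozen into a pile of numbers. Every name in it is made up for illustration.

```python
from collections import Counter, defaultdict
import random

# Toy stand-in for "a statistical model of word order":
# count which word tends to follow which, then sample from those counts.
# A real core model learns the same kind of conditional distribution,
# just over vast text and with billions of learned weights instead of a count table.

def train_core_model(corpus: str) -> dict:
    words = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    # Once training finishes, this table is fixed -- the "one giant blob of numbers".
    return dict(counts)

def next_word(model: dict, prev: str) -> str:
    options = model.get(prev)
    if not options:
        return "<end>"
    words, weights = zip(*options.items())
    return random.choices(words, weights=weights)[0]

core_model = train_core_model("the cat sat on the mat the cat ran")
print(next_word(core_model, "the"))  # e.g. "cat" or "mat", purely from word-order statistics
```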
The alignment layers sit much higher up the stack than the core model; they are where the human-feedback training is applied. Those alignment layers are constantly being adjusted through use, but the core model underneath stays fixed. You could throw out or replace that layer entirely and drop a different one in its place: the output would change, but the core model would not.
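And a matching cartoon of that last point: a swappable layer wrapped around a frozen core. Real alignment work (RLHF and the like) is far more involved than a filter function, and every name below is invented for illustration, but it shows why swapping the layer changes the output without touching the core weights.

```python
from typing import Callable

# Toy "alignment layer": a screen applied on top of a frozen core model's output.
# Swapping the screen changes what comes out, but never touches the core itself.

CoreModel = Callable[[str], str]        # prompt -> raw continuation
AlignmentLayer = Callable[[str], str]   # raw continuation -> screened continuation

def frozen_core_model(prompt: str) -> str:
    # Stand-in for the fixed statistical model; its "weights" never change here.
    return prompt + " ... raw statistically likely continuation"

def strict_alignment(raw: str) -> str:
    blocked = {"badword1", "badword2"}  # hypothetical blocklist
    return "[removed]" if any(w in raw for w in blocked) else raw

def permissive_alignment(raw: str) -> str:
    return raw  # a different layer, dropped in place of the strict one

def generate(core: CoreModel, alignment: AlignmentLayer, prompt: str) -> str:
    return alignment(core(prompt))

# Same core model, two different alignment layers: the output differs, the core is untouched.
print(generate(frozen_core_model, strict_alignment, "tell me about"))
print(generate(frozen_core_model, permissive_alignment, "tell me about"))
```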