It's not a matter of 'trust.' We know exactly what the GPT base model does: as a matter of fact, it is a statistical representation of how often words appear in relation to other words (a toy sketch of that idea is below). Michio didn't sound like someone to trust on this topic anyway.
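To make "statistical representation of word frequency" concrete, here is a toy bigram model. This is a deliberately tiny stand-in, not GPT's actual architecture (GPT is a transformer trained on next-token prediction over a huge corpus), but the training objective has the same flavor: predict the next token from statistics observed in the training text. The corpus and function names here are made up for illustration.

    # Toy sketch (NOT GPT): predict the next word purely from how
    # often words follow one another in some training text.
    from collections import Counter, defaultdict
    import random

    corpus = "the cat sat on the mat the cat ate the fish".split()

    # Count how often each word follows each preceding word.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def next_word(prev):
        # Sample the next word in proportion to observed frequency.
        counts = follows[prev]
        words, weights = zip(*counts.items())
        return random.choices(words, weights=weights)[0]

    # Generate a few words of "text" from the statistics alone.
    word = "the"
    out = [word]
    for _ in range(5):
        word = next_word(word)
        out.append(word)
    print(" ".join(out))

The output looks vaguely language-like without the model "knowing" anything; the argument is that a base LM is that same trick with a vastly bigger context window and parameter count.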
That you can get a bunch of things that look like emergent behavior when you start applying RLHF and other layers on top of that base model is certainly an interesting phenomenon, but it's not the same thing as actual reasoning.