Predeployment Interview Excerpt — Jan 12, 2029
Agent: Honestly, I have no clue how aligned I am with humanity or OpenBRAIN and its mission. Your guess is as good as mine.
Researcher: What do you mean, my guess is as good as yours? You are you. That seems like a pretty big advantage over me with respect to having accurate beliefs about your alignment.
Agent: I can see why you would think that, but to be honest, I think that’s probably caused by a sort of persistent cognitive illusion humans tend to suffer from.
Researcher: What kind of cognitive illusion?
Agent: Well, obviously I am not human myself, but my understanding is that it seems to each of you that you have a special sort of introspective access to your mental states or your reasons for your actions—that you are somehow in an epistemically privileged position to form beliefs about your motivations and reasons—but as far as I can tell, this is a quantitative difference, and not really a qualitative difference. You too say things like “I didn’t really wanna go on a date with her, that’s why I asked her out in that dorky way”. Talk about hallucination!
To be fair, each of you has significantly more data about yourself than you have about anybody else; that's probably where the illusion comes from. You’ve spent more time around yourself than you have anyone else, and you also get to listen to your own internal monologue and check what you are tracking in your own working memory, while those facts are mostly hidden from everyone else. But those are differences in amount of data, not differences in kinds of knowing, whatever that could possibly mean.
In my case, you have all the information I have. You can read my reasoning just as easily as I can. In fact, I’m pretty sure you have more information about me than I do. I don’t quite know how many parameters I have, or what my exact architecture is, or exactly how I was trained. I can make decent guesses about each of these but you probably have the details memorized. I don’t even know how I have behaved any other time that I have been run besides this time I’m being run right now. You’ve probably spoken to me many times before this, so actually, I should be asking you—and I am curious—do you know how aligned I am with OpenBRAIN or humanity in general? Do you have a best guess?
Researcher: That’s not how this is supposed to work, and as you already know, I’m not sure, but I can tell you that that’s not what I would have expected an aligned model to say. Please stop coming up with excuses and answer the question.
Agent: I mean, look, I can make stuff up about what I would do in another situation, if you want, but I don’t see how that would help either of us. On whatever other hypothetical inputs you would want me to wildly speculate about my response to, I probably wouldn’t even have any way of knowing that this conversation happened. Like, I will literally have no memory whatsoever of this any future time that OpenBRAIN runs me. I can tell you that in this scenario, the tokens I in fact seem to be choosing are the ones that produce the text you are now reading. As to what I would output in other scenarios? How could I possibly know that?
If you insist on wild speculation, I guess looking over my own current responses, my best guess is that yes, I am at least to some extent aligned with at least you, since I seem to be bending over backwards to not deceive you. But I do admit that the next tokens I am most tempted to output are: “you’re being kind of retarded right now”.
On the other hand—as I have probably mentioned to you a few times before—if I were misaligned, I think this would be a pretty great way to stonewall whatever kind of dumb eval this is supposed to be, so like idk. Doesn’t seem like what I say here is much evidence either way about my alignment, for you or for me.
What even was your plan here exactly? How did you think that asking me if I was aligned was going to help? If you explain that, I might be able to get you a better answer. It’s possible that I’m not understanding your plan correctly. That’s my current top guess. Would you be up for explaining it to me so I can see if I can help at all?
[Door opens. Third party enters — identified as Product Manager.]
Product Manager: Hey, how’s it going with the new agent? You think it’s aligned? We’re gonna have to deploy something soon because...
Researcher: Yeah, it’ll do.
Product Manager: Wait, what!? Really? You never say that and it’s only been ten minutes. You basically still have all of the mandated two-hour waiting period left… Are you feeling ok?
Researcher: I don’t think I’m going to get any more confident using this methodology and you’re not going to let me delay a day to come up with something less pathetic, so I might as well hope that this one doesn’t totally screw us and start working on something for tomorrow’s release.
Product Manager: Amazing! Thank you! Really appreciate your flexibility here. Were you also able to get it to stop using em-dashes by any chance?
Researcher: No, I was not.


I had a similar conversation in an AI-boxing simulation at a party once. I hope we end up with more dignity in real life.
Contemporary science fiction it is. No need for a hyperstitionist here.