Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.