Recent studies have found that pre-trained language models (PLMs) suffer from miscalibration, meaning their confidence estimates are inaccurate. Evaluation methods that assume a lower calibration error estimate indicates more reliable predictions may therefore be flawed. Fine-tuned PLMs often resort to shortcuts, which yields overconfident predictions that do not generalize; as a result, models with seemingly superior calibration can in fact rely more heavily on non-generalizable decision rules.
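For context, "calibration error" is typically measured with a binned estimator such as expected calibration error (ECE), which compares a model's average confidence to its empirical accuracy within confidence bins. The sketch below is a minimal illustration of that metric, not the exact estimator used in the studies discussed here; the function name, equal-width binning scheme, and toy data are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binning ECE: the weighted average gap between
    mean confidence and accuracy within each confidence bin.
    (Illustrative sketch; real evaluations may bin differently.)"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_conf = confidences[in_bin].mean()  # average predicted confidence
        bin_acc = correct[in_bin].mean()       # empirical accuracy in the bin
        ece += in_bin.mean() * abs(bin_conf - bin_acc)
    return ece

# Toy example: confident but frequently wrong predictions score a high ECE.
conf = [0.95, 0.90, 0.92, 0.85, 0.60, 0.55]
hits = [1, 0, 0, 1, 1, 0]
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")
```

A low ECE on in-distribution data says nothing about whether the confidence estimates came from generalizable decision rules or from dataset shortcuts, which is why low measured calibration error alone does not guarantee reliable predictions.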