Meanwhile, I also checked the calibration of the tree-based and NN-based models.
The conclusion is that both models are well calibrated by default, as long as you use early stopping.
If you disable early stopping and max_iter is too small (underfitting) or too large (overfitting), the models can become significantly under-confident or over-confident, respectively.
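
For anyone who wants to reproduce this kind of check, here is a minimal, library-free sketch of a binned reliability measurement. It is not the code used for the check above: the helper names (`reliability_bins`, `expected_calibration_error`, `sharpen`) and the synthetic data are assumptions made purely for illustration. The `sharpen` helper only mimics an over-confident or under-confident model by pushing probabilities toward or away from 0 and 1.

```python
import random


def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    acc = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        acc[idx][0] += p
        acc[idx][1] += y
        acc[idx][2] += 1
    return [(s / c, pos / c, c) for s, pos, c in acc if c > 0]


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Count-weighted mean absolute gap between predicted and observed rates."""
    n = len(y_true)
    bins = reliability_bins(y_true, y_prob, n_bins)
    return sum(c / n * abs(pred - obs) for pred, obs, c in bins)


def sharpen(q, a):
    """Hypothetical distortion: a > 1 pushes q toward 0/1, a < 1 pulls it toward 0.5."""
    num = q ** a
    return num / (num + (1.0 - q) ** a)


random.seed(0)
n_samples = 20000
# Synthetic ground truth: each sample has a true positive probability q.
true_prob = [random.random() for _ in range(n_samples)]
labels = [1 if random.random() < q else 0 for q in true_prob]

calibrated = true_prob                                  # predictions match q
over_confident = [sharpen(q, 3.0) for q in true_prob]   # too extreme
under_confident = [sharpen(q, 1 / 3) for q in true_prob]  # too timid

ece_calibrated = expected_calibration_error(labels, calibrated)
ece_over = expected_calibration_error(labels, over_confident)
ece_under = expected_calibration_error(labels, under_confident)

print(f"calibrated:      ECE = {ece_calibrated:.3f}")
print(f"over-confident:  ECE = {ece_over:.3f}")
print(f"under-confident: ECE = {ece_under:.3f}")
```

In practice you would replace the synthetic `labels` and probabilities with held-out labels and the model's predicted probabilities, once with early stopping enabled and once with it disabled at a small and a large max_iter. Looking at the per-bin output of `reliability_bins` also shows the direction of the miscalibration, which the single error number hides.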