and comminuted fracture is a significant part of surgical diagnosis[1]. However, experienced
surgeons are scarce relative to the large number of patients, so surgeons urgently need
assistance to relieve their workload. To address this problem, many computer-aided detection
and diagnosis (CAD) methods[2] have been proposed. In recent years, substantial progress has
been made in developing deep learning-based CAD systems for fracture diagnosis.
Guan et al. proposed a convolutional neural network for thighbone fracture detection that can
balance the information of each feature map in ResNeXt’s feature pyramid[3]. Firat et al.
designed an integrated object detection model for fracture detection in wrist X-ray images[4].
At present, state-of-the-art fracture detection methods are usually developed on large-scale
expert annotations, such as 5134 labeled CT images for spinal fracture detection[5], 7356
wrist radiographic images[6], and 9040 labeled hand, wrist, knee, ankle, and foot radiographs
for multiple fracture detection[7].
Compared with the above-mentioned methods, semi-supervised learning (SSL) trains the model on
both labeled and unlabeled data, using the unlabeled data to assist in optimizing the model so
as to reduce training cost. The state-of-the-art semi-supervised methods are pseudo-label based
approaches[8]. Specifically, a model is first trained on labeled data, and the trained model is
then used to predict pseudo labels for unlabeled images. The teacher-student model[9] is a
common way to generate pseudo labels in semi-supervised learning; its key idea is to train two
independent models, namely a teacher model and a student model. The teacher model is trained on
the labeled images and used to label the unlabeled images; these pseudo-labeled images are then
mixed with the labeled images to train the student model.
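The teacher-student procedure described above can be sketched on a toy problem. The example below is only an illustration under strong assumptions: it replaces the detector with a hypothetical one-dimensional nearest-centroid classifier (`fit_centroids`, `predict`), and the sample values and class names are made up. It is not the model used in this work; it only shows the order of the three steps.

```python
from statistics import mean


def fit_centroids(samples):
    """Fit a toy nearest-centroid classifier: one mean value per class."""
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def train_teacher_student(labeled, unlabeled):
    # Step 1: train the teacher on the labeled samples only.
    teacher = fit_centroids(labeled)
    # Step 2: the teacher assigns a pseudo label to every unlabeled sample.
    pseudo = [(x, predict(teacher, x)) for x in unlabeled]
    # Step 3: train the student on labeled and pseudo-labeled samples together.
    student = fit_centroids(labeled + pseudo)
    return teacher, student, pseudo


# Illustrative data (assumed values, not taken from any dataset).
labeled = [(0.0, "normal"), (1.0, "normal"), (9.0, "fracture"), (10.0, "fracture")]
unlabeled = [0.5, 2.0, 8.0, 9.5]
teacher, student, pseudo = train_teacher_student(labeled, unlabeled)
```

In this sketch every pseudo label is accepted unconditionally, so any teacher error flows directly into the student.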
Most research on semi-supervised object detection (SSOD) has focused on two-stage
detectors[10–13]. However, building on single-stage detectors (such as FCOS[14], YOLOF[15], and
RetinaNet[16]) is more attractive and practical, because they can be easily deployed on devices
with limited resources and eliminate cumbersome preprocessing and post-processing except for
NMS[17]. The main difference between single-stage and two-stage detectors is that the region
proposal network (RPN)[18] of a two-stage detector filters out most of the background samples,
and in the next stage the detailed categories of the remaining candidate boxes are predicted.
Single-stage detectors instead make dense predictions over all areas of the image at once, even
though only a few bounding boxes are predicted as positive samples. Because the generation and
classification of proposals are integrated in single-stage detectors, detection is faster, but
the classification scores of single-stage detectors are lower than those of two-stage detectors.
Directly feeding pseudo labels with low classification scores into the student model introduces
a lot of noise and degrades the training accuracy of the model. Therefore, how to deal with the
large number of low-quality pseudo labels in dense prediction remains an important problem.
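To make the problem concrete, the following sketch shows the naive baseline for turning dense predictions into pseudo labels: a fixed classification-score threshold followed by greedy NMS. The function names, the `(x1, y1, x2, y2)` box format, and the thresholds of 0.5 are assumptions chosen for illustration. This is not the strategy proposed in this work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_pseudo_labels(boxes, scores, score_thr=0.5, iou_thr=0.5):
    """Keep confident boxes, then apply greedy NMS; return the kept indices."""
    # Drop predictions whose classification score is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_thr]
    # Visit the remaining boxes from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
    return kept


# Illustrative dense predictions (assumed values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (200, 200, 220, 220)]
scores = [0.90, 0.80, 0.30, 0.60]
kept = select_pseudo_labels(boxes, scores)
```

With a fixed threshold, a correct box that the single-stage teacher scores low (the third box here) is discarded, while lowering the threshold admits more noisy boxes. That trade-off is what makes dense pseudo labels hard to handle.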
The regression branch is another component of the object detection task. The regression
quality of pseudo boxes is another important factor that determines the performance of a
semi-supervised object detection model. Xu et al.[20] find that the accuracy of the regression
is related to the uncertainty calculated by the BoxJitter module, but the BoxJitter module
relies on Regions