Background: This study investigates the variation in segmentation of several pelvic anatomical structures on computed tomography (CT) between multiple observers and a commercial automatic segmentation method, in the context of quality assurance and evaluation during a multicentre clinical trial. Methods: CT scans of two prostate cancer patients ('benchmarking cases'), one high risk (HR) and one intermediate risk (IR), were sent to multiple radiotherapy centres for segmentation of prostate, rectum and bladder structures according to the TROG 03.04 "RADAR" trial protocol definitions. The same structures were automatically segmented using iPlan software for the same two patients, allowing structures defined by automatic segmentation to be quantitatively compared with those defined by multiple observers. A sample of twenty trial patient datasets was also used to automatically generate anatomical structures for quantitative comparison with structures defined by individual observers for the same datasets. Results: There was considerable agreement amongst all observers and automatic segmentation of the benchmarking cases for bladder (mean spatial variations <0.4 cm across the majority of image slices). Although there was some variation in interpretation of the superior-inferior (cranio-caudal) extent of rectum, human-observer contours were typically within a mean 0.6 cm of automatically defined contours. Prostate structures were more consistent for the HR case than the IR case, with all human observers segmenting a prostate with considerably more volume (mean +113.3%) than that automatically segmented.
Similar results were seen across the twenty sample datasets, with disagreement between iPlan and observers dominant at the prostatic apex and superior part of the rectum, which is consistent with observations made during quality assurance reviews during the trial. Conclusions: This study has demonstrated quantitative analysis for comparison of multi-observer segmentation studies. For automatic segmentation algorithms based on image registration, as in iPlan, it is apparent that agreement between observer and automatic segmentation will be a function of patient-specific image characteristics, particularly for anatomy with poor contrast definition. For this reason, it is suggested that automatic registration based on transformation of a single reference dataset adds a significant systematic bias to the resulting volumes, and their use in the context of a multicentre trial should be carefully considered. © 2013 Geraghty et al.; licensee BioMed Central Ltd.
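
The two kinds of quantity reported above, a percentage volume difference relative to the automatic segmentation and a mean spatial distance between contours on an image slice, can be illustrated with a minimal Python sketch. This is not the analysis method used in the study, which the abstract does not specify. The function names, the one-directional nearest-point distance and all numbers are assumptions chosen purely for illustration.

```python
import math


def percent_volume_difference(observer_cm3, automatic_cm3):
    """Observer volume relative to the automatic volume, in percent.

    A positive value means the observer outlined a larger structure.
    """
    if automatic_cm3 <= 0:
        raise ValueError("automatic volume must be positive")
    return 100.0 * (observer_cm3 - automatic_cm3) / automatic_cm3


def mean_contour_distance(contour_a, contour_b):
    """Mean distance (same units as the input points) from each point on
    contour_a to its nearest point on contour_b, within one image slice.

    Assumption: a simple one-directional nearest-point measure; other
    definitions of spatial variation are equally plausible.
    """
    if not contour_a or not contour_b:
        raise ValueError("both contours need at least one point")
    total = 0.0
    for ax, ay in contour_a:
        total += min(math.hypot(ax - bx, ay - by) for bx, by in contour_b)
    return total / len(contour_a)


# Hypothetical values, not data from the study.
observer_volumes = [40.0, 50.0]   # cm^3, two imaginary observers
automatic_volume = 20.0           # cm^3, imaginary automatic result
differences = [percent_volume_difference(v, automatic_volume)
               for v in observer_volumes]
mean_difference = sum(differences) / len(differences)

observer_contour = [(0.0, 0.0), (1.0, 0.0)]    # cm, one slice
automatic_contour = [(0.0, 0.5), (1.0, 0.5)]   # cm, same slice
slice_distance = mean_contour_distance(observer_contour, automatic_contour)
```

Averaging the per-observer percentages gives a figure of the same kind as the mean volume difference quoted for the prostate, and repeating the distance calculation slice by slice gives a profile of spatial variation along the superior-inferior axis.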