Step 1: Results of local adaptation
The local adaptation project started in July 2019. The existing program was reviewed with a focus on the rating system, and all assessment tools were collected and reviewed. These tools included case reports, reflection sheets, and evaluation sheets from supervising physicians, nurses, other health professionals, and patients. The project team then mapped the skills in the generic rubrics to the curriculum. The mapped assessment tools are described in Table 2.
While the assessment tools were being developed, a resident team member suggested that feedback from educators would be helpful for learning. We therefore included many comment sections in the tools.
After the assessment tools were calibrated through trials, new assessment systems were gradually implemented starting in April 2020. Full implementation was expected to take three years.
Step 2: Quantitative results
Four supervising physicians completed a total of 20 sets of assessments, each comprising the generic rubrics and the localized assessment sheet. Ten of the 16 residents were assessed during the study period. Inter-rater reliability was calculated from the occasions on which two supervisors assessed the same resident at the same time. Cohen’s kappa was -0.25 for the generic rubrics and 0.69 for the localized tools, indicating that the localized tools provided more consistent assessments. We then examined the correlation between the two tools as follows. As shown in Tables 2 and 3, not all skills were assessed in the adapted form, so only the corresponding skills were compared. A conversion formula that weighted all related items equally was used to calculate the scores in the locally adapted assessment sheet (Supplementary File 3: Annex 3).
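For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen’s kappa can be computed for two raters who score the same residents. It is an illustration only, not the analysis code used in this study: the function name `cohens_kappa` and the example ratings are hypothetical, and the ratings are not study data.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters chose the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: computed from each rater's marginal level frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    levels = set(count_a) | set(count_b)
    expected = sum(count_a[level] * count_b[level] for level in levels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical level throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric levels given by two supervisors to eight residents.
supervisor_1 = [3, 3, 2, 4, 3, 2, 4, 3]
supervisor_2 = [3, 3, 2, 4, 2, 2, 4, 3]
print(round(cohens_kappa(supervisor_1, supervisor_2), 2))
```

Kappa corrects raw agreement for the agreement expected by chance, so it can fall below zero when two raters agree less often than chance would predict, as with the value of -0.25 reported for the generic rubrics.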
Spearman’s correlation coefficients for medical knowledge and problem-solving ability, practical skills and patient care, communication skills, team health care practice, management of quality of care and patient safety, and attitudes for lifelong and collaborative learning were 0.70, 0.70, 0.51, 0.08, 0.04, and 0.61, respectively (Table 4). Scores for medical knowledge and problem-solving ability, practical skills and patient care, communication skills, and attitudes for lifelong and collaborative learning were well correlated, whereas the other scores, which were assessed primarily with other assessment tools (Table 2), did not show significant correlations. In the generic rubrics, 5 to 50% of the ratings for management of quality of care and patient safety, community medicine, and scientific research were marked with the option “no chance of observing”. These skills were either not assessed or assessed primarily by other tools in the locally adapted system.
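The comparison described above can be sketched in two steps: converting the related localized items into one score per competency, then correlating that score with the generic rubric score. The sketch below assumes that treating all related items as equal means taking an unweighted mean; the actual formula is given in Supplementary File 3: Annex 3. All function names and example scores are hypothetical and are not study data.

```python
import math


def equal_weight_score(item_scores):
    """Combine related item scores with equal weights (assumed: unweighted mean)."""
    return sum(item_scores) / len(item_scores)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: generic rubric level per resident, and the related
# localized items for the same competency.
generic_scores = [2, 3, 3, 4, 2, 3]
localized_items = [[2, 3], [3, 3], [3, 4], [4, 4], [2, 2], [3, 3]]
localized_scores = [equal_weight_score(items) for items in localized_items]
print(round(spearman_rho(generic_scores, localized_scores), 2))
```

A rank-based coefficient suits ordinal rubric levels because it depends only on the ordering of residents, not on the distance between levels.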
The correlation of the corresponding items suggested that the two tools measured similar skills. However, the lack of correlation for some items and the high “no chance of observing” rate in some items in the generic rubric indicated that further investigation was needed. Therefore, the differences between these two tools were explored qualitatively.
Step 3: Qualitative results
The four supervising physicians were interviewed after using both the generic and localized tools. One of the four (Dr A) advised during the development of the localized tools, but the other three were not involved in development at all. The interviews were conducted from January to February 2021 and lasted approximately 30 minutes each.
Because the generic rubrics explicitly state the competencies and their descriptions, supervising physicians said that conducting the assessments helped them learn what the national guidelines expect. The descriptions of each level and sub-category helped them analyze each resident’s competencies:
C-28: “I always think that evaluation is done from different angles, such as medical aspects, human relations, institutional aspects, legal aspects, etc.”
Mismatch between rubrics and clinical context
However, the supervising physicians felt that there was a mismatch between their context and the generic rubrics, as their expectations of residents differed from the descriptions provided. They considered the level of certain items unsuited to the residents and thought that important aspects of their environment were not taken into account:
D-22: “I think they don’t appreciate the difficulty of facing a patient’s problem, finding the problem and solving it”
Abstract descriptions also caused inconsistencies. Because the generic rubrics are designed for all departments of all institutions in Japan, the expectations they describe are generalized. Supervising physicians struggled to understand these descriptions and to bridge the gap between the generic rubrics and their clinical context:
A-22: “At first, I wasn’t sure what the words meant or how to apply them”
The generic rubrics also contained items that could not be assessed in the GIM department. In addition, for some skills, certain items could be observed while others could not, which confused the supervising physicians.
Mismatches between rubrics and clinical context resulted in invalid assessments. Supervising physicians noted the difficulty of maintaining consistency with abstract descriptions and feared that their assessments would be affected by these conditions:
Items that could not be realistically assessed on the ward confused supervising physicians and led to invalid assessments:
The large number of items, including abstract items and items that could not be assessed, increased the cognitive load on supervising physicians. After completing the items, they were too exhausted to write any comments. As a result, they filled in the form but did not reflect on their teaching:
J-52: “In this case, I was more concerned with checking the abilities of the residents at the time, but I didn’t think I could do much about it”
Decreased cognitive load resulting from local adaptation
The locally adapted tools were designed to fit the clinical context of the evaluators. Even though most of the supervising physicians were not part of the development team, they found the tool descriptions easy to understand and the levels easy to select. This allowed supervising physicians to assess residents with less cognitive load, and they felt it led to more consistent assessments:
B-48: “The localized version is more specific and the rating level is unambiguous”
Promoting reflection on education
The lower cognitive load, which resulted from the clear wording of the locally adapted tools, encouraged supervising physicians to write comments. They said they could convey what they wanted through such comments and could also reflect on their own teaching in this way. This reflection, in turn, led to plans for future teaching:
J-52: “Thanks to the localized version, I was able to take stock of what my residents were able to do and what they weren’t yet able to do. Through this, I was able to determine what I needed to teach again when teaching the same residents next term.”
Avoidance of differentiation through context-free evaluation
Some supervisors felt that the localized tools could not differentiate between residents, although the quantitative results suggested otherwise. A shared culture within the development team may have limited the tools’ ability to differentiate between residents. Supervisors also appeared to avoid differentiating between residents in their evaluations. Their main concern was that the evaluation scores were presented without context:
C-66: “I think if someone checks [the box that indicates] that a resident can’t do something, there’s a reason he can’t do it”