A persistent issue in SNOMED CT authoring is the appearance of inappropriate or inconsistent Unicode characters in descriptions. These include non-standard whitespace, typographic punctuation, invisible characters, and occasionally symbols that render inconsistently or break text processing in downstream systems. They usually arrive through copy and paste into authoring tools, and validation of them is inconsistent.
While SNOMED CT is Unicode-capable, not all Unicode characters are appropriate for terminology descriptions (for example, control characters or typographic variants of punctuation characters). The RF2 specification is silent on these issues, which by default leaves them to editorial policy.
This post proposes a standardised, explicit character policy for SNOMED CT descriptions for discussion, evolution, and hopefully implementation as a standardised ruleset for release validation and authoring tool input validation.
Proposed Rules for Allowed and Disallowed Characters
Control characters
Disallow:
- U+0000–U+001F
- U+007F–U+009F
Examples: NULL, ACK, ESC, FF, BEL…
Rationale: Not printable, never meaningful, cause downstream failures.
Whitespace
Allow:
- ASCII space only (U+0020)
Disallow:
- Tabs, CR, LF
- Non-breaking space (U+00A0)
- All Unicode space characters (U+2000–U+200A)
- Zero-width characters (U+200B–U+200D)
Structural rules:
- No leading/trailing spaces
- No multiple consecutive spaces
Rationale: Ensures stable, predictable lexical behaviour.
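As a rough illustration, the control-character and whitespace rules above could be checked along the following lines. This is a minimal sketch: the function name and the rule labels are hypothetical, and the character ranges are exactly those listed in the proposal, nothing more.

```python
import re

# Ranges copied from the proposal; rule labels below are hypothetical.
CONTROL = re.compile(r"[\u0000-\u001F\u007F-\u009F]")
# Non-breaking space, U+2000-U+200A spaces, and zero-width characters.
BAD_SPACE = re.compile(r"[\u00A0\u2000-\u200A\u200B-\u200D]")


def whitespace_violations(term: str) -> list[str]:
    """Return labels for the control/whitespace rules that `term` breaks."""
    found = []
    if CONTROL.search(term):
        found.append("control_character")  # also covers tab, CR and LF
    if BAD_SPACE.search(term):
        found.append("non_ascii_space")
    if term != term.strip(" "):
        found.append("leading_trailing_space")
    if "  " in term:
        found.append("multiple_spaces")
    return found
```

Returning a list of labels rather than a boolean lets an authoring tool tell the user which rule was broken.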
Standardised punctuation and symbols
Allow:
- Printable ASCII punctuation and symbols: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
Disallow:
- Curly quotes
- En/em dashes
- Ellipsis
- Decorative punctuation
- Non-ASCII mathematical operators
Rationale: Eliminates variation from copy-paste, ensures tool compatibility.
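An authoring tool could offer to replace typographic punctuation on paste instead of simply rejecting it. The sketch below assumes one particular replacement table; the proposal names the disallowed classes but does not define mappings, so both the table and the function name are illustrative only.

```python
# Assumed mappings for illustration; not part of the proposal itself.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201C": '"', "\u201D": '"',   # double curly quotes
    "\u2010": "-", "\u2011": "-",   # hyphen, non-breaking hyphen
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
})


def to_ascii_punctuation(term: str) -> str:
    """Replace common typographic punctuation with ASCII equivalents."""
    return term.translate(PUNCT_MAP)
```

Whether a dash should always collapse to a plain hyphen is an editorial question, so a real tool would probably present the result for confirmation.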
Symbols
Allow:
- ™ (U+2122)
- ® (U+00AE)
- © (U+00A9)
Disallow:
- Emoji
- Dingbats
- Decorative or pictographic symbols
Rationale: Supports legally accurate product names and existing practice while avoiding instability.
Zero-width and combining characters
Disallow:
- Zero-width characters
- Characters formed by combining a base letter with a separate accent mark when a single "precomposed" version of that accented letter already exists in Unicode.
Example:
Combining
- e = U+0065 (LATIN SMALL LETTER E)
- ◌́ = U+0301 (COMBINING ACUTE ACCENT)
forms é, which looks identical to the precomposed é (U+00E9, LATIN SMALL LETTER E WITH ACUTE) but is actually two characters.
Rationale: These introduce invisible variation that is extremely difficult to detect and validate.
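One way to detect this kind of invisible variation is to compare a term with its NFC-normalised form, using Python's standard `unicodedata` module. This is a sketch under the assumption that NFC is the intended target form; the proposal does not name a normalisation form, and the function name is hypothetical.

```python
import unicodedata

ZERO_WIDTH = {"\u200B", "\u200C", "\u200D"}


def is_composed_and_visible(term: str) -> bool:
    """True if `term` is already in NFC and has no zero-width characters."""
    if any(ch in ZERO_WIDTH for ch in term):
        return False
    # NFC replaces base + combining mark with the precomposed letter
    # whenever one exists, so any difference signals a violation.
    return unicodedata.normalize("NFC", term) == term
```

The NFC comparison is slightly broader than the rule as written: it also flags the few characters whose canonical form is a different single character, such as U+212B ANGSTROM SIGN.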
Accented and non-Latin characters
Allow:
- Precomposed accented Latin characters (é, ñ, ü, etc.)
- Non-Latin characters required by the language of the description
Rationale: Supports multilingual authoring correctly and safely.
Superscript and subscript characters
Unicode provides numerous superscript and subscript characters, used in clinical notation for things like ions (Ca²⁺), haemoglobin fractions (HbA₂), and indices (CD4⁺).
However, from a terminology and interoperability perspective these are typographic variants rather than semantically required characters.
Proposed rule:
Superscript and subscript Unicode characters (e.g., ² ³ ⁺ ⁻ ₁₂₃) must not be used in SNOMED CT descriptions. ASCII equivalents must be used instead.
Examples of recommended replacements:
| Unicode glyph | ASCII representation |
|---|---|
| ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁰ | 1 2 3 4 5 6 7 8 9 0 |
| ⁺ | + |
| ⁻ | - |
| ⁼ | = |
Example conversions:
- Ca²⁺ → Ca2+
- NO₃⁻ → NO3-
- HbA₂ → HbA2
- CD4⁺ T-cell → CD4+ T-cell
Rationale:
- Superscripts/subscripts display inconsistently or fail entirely in many systems.
- They complicate search, equality comparison, and indexing.
- They are not required for unambiguous text representation when ASCII equivalents are available.
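The replacement table above is small enough to express as a translation map. The sketch below extends it to the matching subscript characters (U+2080 to U+208C), which is an assumption based on the example conversions, not something the table states; the names are hypothetical.

```python
# Superscripts 0-9 + - = (the digits 1-3 live in the Latin-1 block).
SUPERSCRIPTS = (
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079"
    "\u207A\u207B\u207C"
)
# Subscripts 0-9 + - = are contiguous: U+2080 .. U+208C.
SUBSCRIPTS = "".join(chr(cp) for cp in range(0x2080, 0x208D))
ASCII_FORMS = "0123456789+-="

SCRIPT_TABLE = str.maketrans(SUPERSCRIPTS + SUBSCRIPTS, ASCII_FORMS * 2)


def to_ascii_scripts(term: str) -> str:
    """Replace superscript/subscript digits and signs with ASCII forms."""
    return term.translate(SCRIPT_TABLE)
```

Applied to the example conversions, `to_ascii_scripts("Ca\u00B2\u207A")` gives `Ca2+` and `to_ascii_scripts("NO\u2083\u207B")` gives `NO3-`.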
Special Discussion Item: Clinically Meaningful Greek Letters
Greek symbols are widely used in clinical publications (e.g., β-thalassaemia, α-thalassaemia, ΔF508 mutation, β-lactam antibiotics). They carry clinical meaning and are visually compact.
However, significant safety, usability, and interoperability issues arise in terminology systems:
- Many legacy EHRs and older message formats cannot reliably display or encode non-ASCII characters.
- Greek characters are often lost, substituted, or corrupted during data exchange (e.g., in HL7 v2, CSV, or legacy databases).
- Search engines often do not equate β with "beta", reducing findability and leading to duplicated or divergent concepts.
- Sorting and collation differ across locales.
For these reasons the Australian Medicines Terminology explicitly requires spelling out Greek letters instead of using symbols:
- e.g., "alpha", not "α"
This was mandated for consistency, safety, and interoperability across national clinical systems.
Yet many international publications (especially genetics and haematology) do use Greek letters in their published names. This reflects long-standing clinical convention.
Ultimately, given SNOMED CT's role as a reference terminology consumed by heterogeneous systems globally, including many legacy platforms, the interoperability and safety risks appear to outweigh the clinical convenience of Greek symbols.
Therefore it is recommended that SNOMED CT use spelled-out forms ("alpha", "beta", "gamma", "delta") in all description types (FSN, PT, synonyms) unless a compelling and explicitly defined exception is required.
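A spelled-out policy could be supported by a simple transliteration helper. The sketch below covers only the four letters named above plus capital delta. The mapping, the decision to lower-case capital delta, and the function name are all assumptions for illustration.

```python
# Illustrative mapping only; a real table would need editorial sign-off.
GREEK_SPELLED_OUT = {
    "\u03B1": "alpha",
    "\u03B2": "beta",
    "\u03B3": "gamma",
    "\u03B4": "delta",
    "\u0394": "delta",  # capital delta, as in the F508 deletion
}


def spell_out_greek(term: str) -> str:
    """Replace the mapped Greek letters with their spelled-out names."""
    return "".join(GREEK_SPELLED_OUT.get(ch, ch) for ch in term)
```

The same table could be used in reverse by search tooling, so that a query containing the symbol still finds the spelled-out description.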
Proposed Validation Rules
A whitelist of allowed character classes could be defined, for example:
^[A-Za-z0-9\u00C0-\u024F \-'"(),.:;/?®©™]+$
- A-Za-z0-9: ASCII letters and digits
- \u00C0-\u024F: precomposed accented Latin letters
- Specific punctuation and symbols explicitly allowed
- Notably excluding Greek letters and superscript/subscript ranges, consistent with the recommendations above.
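For discussion, the example pattern above can be exercised directly. The sketch below transcribes it into Python's `re` syntax unchanged, using `fullmatch` in place of the `^…$` anchors; the function name is hypothetical.

```python
import re

# Same character class as the example pattern, with the three symbols
# written as escapes: U+00AE (R), U+00A9 (C), U+2122 (TM).
ALLOWED = re.compile(
    r"""[A-Za-z0-9\u00C0-\u024F \-'"(),.:;/?\u00AE\u00A9\u2122]+"""
)


def is_allowed(term: str) -> bool:
    """True if every character of `term` is in the example whitelist."""
    return ALLOWED.fullmatch(term) is not None
```

Two things stand out when trying it. The class admits fewer characters than the punctuation rule does (for instance `+` and `%` are missing, so "CD4+ T-cell" is rejected), and the range U+00C0 to U+024F also contains × (U+00D7) and ÷ (U+00F7), which are not letters. A production pattern would need adjusting on both counts.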
SNOMED CT Description Character Analysis
Here is an analysis from a recent SNOMED CT release using the rules above. There are only a few violations.
Character ‐
- Unicode: U+2010
- Name: HYPHEN
- Appears in: 105 active descriptions
- Violates: non_ascii_punct_symbol
Examples
descId=4555763010 conceptId=1149439000 term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C (disorder)
descId=4555764016 conceptId=1149439000 term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localised periodontitis Stage 1 Grade C
descId=4555765015 conceptId=1149439000 term=AAP/EFP 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localised periodontitis Stage 1 Grade C
descId=4555766019 conceptId=1149439000 term=AAP/EFP 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C
descId=4555767011 conceptId=1149439000 term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C
Character ₂
- Unicode: U+2082
- Name: SUBSCRIPT TWO
- Appears in: 4 active descriptions
- Violates: superscript_subscript
Examples
descId=2772867016 conceptId=250781000 term=Respired carbon dioxide (CO₂) concentration
descId=2785992017 conceptId=437962006 term=Lipoprotein associated phospholipase A₂ measurement (procedure)
descId=2795232011 conceptId=437962006 term=Lipoprotein associated phospholipase A₂ measurement
descId=2795346013 conceptId=438173002 term=Ratio of arterial oxygen tension to inspired oxygen fraction (PaO₂/FiO₂)
Character ‑
- Unicode: U+2011
- Name: NON-BREAKING HYPHEN
- Appears in: 2 active descriptions
- Violates: non_ascii_punct_symbol
Examples
descId=5482476017 conceptId=230461000087109 term=Post‑sympathectomy neuralgia syndrome
descId=5482477014 conceptId=230461000087109 term=Post‑sympathectomy neuralgia
Character α
- Unicode: U+03B1
- Name: GREEK SMALL LETTER ALPHA
- Appears in: 1 active description
- Violates: greek_letter
Examples
descId=2819810010 conceptId=441108001 term=Thromboelastography (TEG) α angle
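A report of this shape can be produced by scanning the tab-separated RF2 description file. The sketch below is not the script used for the analysis above; it simply counts, per character, the active descriptions that contain anything outside printable ASCII, so allowed characters such as é would appear too and would need filtering against the ruleset. The sample rows and their identifiers are placeholders.

```python
import csv
import io
import unicodedata
from collections import Counter

# Hypothetical sample in RF2 description-file layout; ids are placeholders.
HEADER = (
    "id\teffectiveTime\tactive\tmoduleId\tconceptId\t"
    "languageCode\ttypeId\tterm\tcaseSignificanceId\n"
)
SAMPLE = (
    HEADER
    + "1\t20250101\t1\t0\t10\ten\t0\tPost\u2011sympathectomy neuralgia\t0\n"
    + "2\t20250101\t0\t0\t10\ten\t0\tRetired term with CO\u2082\t0\n"
)


def count_suspect_characters(lines):
    """Count active descriptions containing each non-printable-ASCII character."""
    # QUOTE_NONE: RF2 terms may contain literal double quotes.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if row["active"] != "1":
            continue  # only active descriptions are reported
        # A set, so each character counts once per description.
        counts.update({c for c in row["term"] if not 0x20 <= ord(c) <= 0x7E})
    return counts


def report(counts):
    """Format counts as 'U+XXXX NAME: n' lines, most frequent first."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}: {n}"
        for ch, n in counts.most_common()
    ]


print(report(count_suspect_characters(io.StringIO(SAMPLE))))
```

For a real release, the `io.StringIO(SAMPLE)` argument would be replaced by the description file opened with UTF-8 encoding.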
Questions for TRAG
- Is this a useful/worthwhile addition to the specification and validation rules?
- Does TRAG support adopting a spelled-out Greek-letter policy (e.g., "beta-thalassaemia"), rather than using Greek symbols in SNOMED CT descriptions?
- Should any clinical domains be allowed to request exceptions, and if so, how would that work?
- Does TRAG agree that superscript and subscript Unicode characters should be disallowed and replaced with ASCII equivalents as proposed?
- Should search/indexing guidance include formal transliteration tables for Greek symbols and superscripts/subscripts (e.g., β → "beta", ² → "2", ⁺ → "+")?
- Should these character rules apply equally to FSNs, PTs, and synonyms, or is there a case for differing levels of strictness across description types?
- Should a blacklist-based validation approach (disallowed ranges) or a whitelist-based approach (explicitly allowed character set) be used?
- Where can we write all of this down so we don't have to talk about it again?
