Standardised Set of Allowed Characters in SNOMED CT Descriptions

A persistent issue in SNOMED CT authoring is the appearance of inappropriate or inconsistent Unicode characters in descriptions. These include non-standard whitespace, typographic punctuation, invisible characters, and occasionally symbols that render inconsistently or break text processing in downstream systems. They usually arrive via copy and paste into authoring tools, and validation varies from tool to tool.

While SNOMED CT is Unicode-capable, not all Unicode characters are appropriate for terminology descriptions (for example control characters or variations of punctuation characters). However, the RF2 specification is silent on these issues, leaving them by default to editorial policy.

This post proposes a standardised, explicit character policy for SNOMED CT descriptions for discussion, evolution, and hopefully implementation as a standardised ruleset for release validation and authoring tool input validation.

Proposed Rules for Allowed and Disallowed Characters

Control characters

Disallow:

  • U+0000–U+001F

  • U+007F–U+009F

Examples: NULL, ACK, ESC, FF, BEL


Rationale: Not printable, never meaningful, cause downstream failures.

Whitespace

Allow:

  • ASCII space only (U+0020)

Disallow:

  • Tabs, CR, LF

  • Non-breaking space (U+00A0)

  • All Unicode space characters (U+2000–U+200A)

  • Zero-width characters (U+200B–U+200D)

Structural rules:

  • No leading/trailing spaces

  • No multiple consecutive spaces

Rationale: Ensures stable, predictable lexical behaviour.
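
As a rough illustration of how the whitespace rules above could be checked, here is a minimal Python sketch. The function name `whitespace_issues` and the message wording are assumptions made for this example; they do not come from any existing SNOMED tooling.

```python
import re

# Whitespace-like characters disallowed by the list above:
# tab, CR, LF, U+00A0, U+2000-U+200A and the zero-width range U+200B-U+200D.
DISALLOWED_SPACE = re.compile(r"[\t\r\n\u00A0\u2000-\u200A\u200B-\u200D]")


def whitespace_issues(term: str) -> list[str]:
    """Return human-readable whitespace problems found in a description term."""
    issues = []
    for match in DISALLOWED_SPACE.finditer(term):
        issues.append(f"disallowed U+{ord(match.group()):04X} at index {match.start()}")
    # Structural rules: only the ASCII space is considered here.
    if term != term.strip(" "):
        issues.append("leading or trailing space")
    if "  " in term:
        issues.append("multiple consecutive spaces")
    return issues
```

A term such as "Fracture of femur" yields an empty list, while a term with a leading space, a double space and a non-breaking space yields three separate findings.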

Standardised punctuation and symbols

Allow:

  • Printable ASCII punctuation and symbols

  • ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Disallow:

  • Curly quotes

  • En/em dashes

  • Ellipsis

  • Decorative punctuation

  • Mathematical operators

Rationale: Eliminates variation from copy-paste, ensures tool compatibility.

Symbols

Allow:

  • ™ (U+2122)

  • ® (U+00AE)

  • © (U+00A9)

Disallow:

  • Emoji

  • Dingbats

  • Decorative or pictographic symbols

Rationale: Supports legally accurate product names and existing practice while avoiding instability.

Zero-width and combining characters

Disallow:

  • Zero-width characters

  • Characters formed by combining a base letter with a separate accent mark, when a single “precomposed” version of that accented letter already exists in Unicode.

Example:

Combining

  • e = U+0065 (LATIN SMALL LETTER E)
  • ◌́ = U+0301 (COMBINING ACUTE ACCENT)

forms é (which looks like the precomposed é, U+00E9, but is actually two characters)

Rationale: These introduce invisible variation that is extremely difficult to detect and validate.
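
To make the difference concrete, the following Python sketch uses the standard `unicodedata` module to detect combining marks and to compose them into precomposed letters via NFC normalisation. The helper names are illustrative only, and whether a tool should auto-normalise or simply reject such input is a policy decision this sketch does not settle.

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE


def has_combining_marks(term: str) -> bool:
    """True if any character in the term has a Unicode 'Mark' category (Mn, Mc, Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in term)


def to_precomposed(term: str) -> str:
    """Normalise to NFC, which composes base + mark where a precomposed form exists."""
    return unicodedata.normalize("NFC", term)


# The two strings render identically but compare unequal until normalised.
print(decomposed == precomposed)                  # False
print(to_precomposed(decomposed) == precomposed)  # True
```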

Accented and non-Latin characters

Allow:

  • Precomposed accented Latin characters (é, ñ, ü, etc.)

  • Characters required to support languages

Rationale: Supports multilingual authoring correctly and safely.

Superscript and subscript characters

Unicode provides numerous superscript and subscript characters, used in clinical notation for things like ions (Ca²⁺), haemoglobin fractions (HbA₂), and indices (CD4⁺).

However, from a terminology and interoperability perspective these are typographic variants rather than semantically required characters.

Proposed rule:

Superscript and subscript Unicode characters (e.g., ² ³ ⁺ ⁻ ₀–₉) must not be used in SNOMED CT descriptions. ASCII equivalents must be used instead.

Examples of recommended replacements:

Unicode glyph | ASCII representation
¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁰ | 1 2 3 4 5 6 7 8 9 0
⁺ | +
⁻ | -
⁼ | =

Example conversions:

  • Ca²⁺ → Ca2+

  • NO₃⁻ → NO3-

  • HbA₂ → HbA2

  • CD4⁺ T-cell → CD4+ T-cell

Rationale:

  • Superscripts/subscripts display inconsistently or fail entirely in many systems.

  • They complicate search, equality comparison, and indexing.

  • They are not required for unambiguous text representation when ASCII equivalents are available.
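
The replacements above could be applied mechanically with a translation table. This is a minimal Python sketch; `flatten_scripts` is a hypothetical name, and mapping the subscript signs (₊ ₋ ₌) the same way as the superscript ones is an assumption that goes slightly beyond the table in this post.

```python
# Superscripts: U+2070, U+00B9, U+00B2, U+00B3, U+2074-U+2079, then + - = (U+207A-U+207C).
SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207a\u207b\u207c"
# Subscripts: U+2080-U+2089 are the digits 0-9, U+208A-U+208C are + - =.
SUBSCRIPTS = "".join(chr(cp) for cp in range(0x2080, 0x208D))
ASCII_EQUIVALENTS = "0123456789+-="

TO_ASCII = str.maketrans(SUPERSCRIPTS + SUBSCRIPTS, ASCII_EQUIVALENTS * 2)


def flatten_scripts(term: str) -> str:
    """Replace superscript/subscript digits and signs with ASCII equivalents."""
    return term.translate(TO_ASCII)


print(flatten_scripts("Ca\u00b2\u207a"))  # Ca2+
print(flatten_scripts("NO\u2083\u207b"))  # NO3-
print(flatten_scripts("HbA\u2082"))       # HbA2
```

Note that ¹ ² ³ live in the Latin-1 Supplement block (U+00B9, U+00B2, U+00B3) rather than alongside the other superscripts, so a check based only on U+2070–U+207F would miss them.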

Special Discussion Item: Clinically Meaningful Greek Letters

Greek symbols are widely used in clinical publications (e.g., β-thalassaemia, α-thalassaemia, ΔF508 mutation, β-lactam antibiotics). They carry clinical meaning and are visually compact.

However, significant safety, usability, and interoperability issues arise in terminology systems:

  • Many legacy EHRs and older message formats cannot reliably display or encode non-ASCII characters.

  • Greek characters are often lost, substituted, or corrupted during data exchange (e.g., in HL7 v2, CSV, or legacy databases).

  • Search engines often do not equate β with “beta”, reducing findability and leading to duplicated or divergent concepts.

  • Sorting and collation differ across locales.

For these reasons the Australian Medicines Terminology explicitly requires spelling out Greek letters instead of using symbols:

  • e.g., “alpha”, not “α”

This was mandated for consistency, safety, and interoperability across national clinical systems.

Yet many international publications (especially genetics and haematology) do use Greek letters in their published names. This reflects long-standing clinical convention.

Ultimately, given SNOMED CT’s role as a reference terminology consumed by heterogeneous systems globally, including many legacy platforms, the interoperability and safety risks appear to outweigh the clinical convenience of Greek symbols.

Therefore it is recommended that SNOMED CT use spelled-out forms (“alpha”, “beta”, “gamma”, “delta”) in all description types (FSN, PT, synonyms) unless a compelling and explicitly defined exception is required.
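
For existing content, or for search indexing, a spelled-out form can be derived from the Unicode character name. The Python sketch below is one possible approach under stated assumptions: `spell_out_greek` is a hypothetical helper, it only covers the basic Greek alphabet (U+0391–U+03A9 and U+03B1–U+03C9), and it always emits lower case, so how capitals such as Δ should be rendered remains an editorial decision.

```python
import unicodedata


def spell_out_greek(term: str) -> str:
    """Replace basic Greek letters with their lower-case spelled-out names."""
    out = []
    for ch in term:
        if "\u0391" <= ch <= "\u03a9" or "\u03b1" <= ch <= "\u03c9":
            # e.g. "GREEK SMALL LETTER BETA" -> "beta"
            name = unicodedata.name(ch, "")
            out.append(name.rsplit(" ", 1)[-1].lower() if name else ch)
        else:
            out.append(ch)
    return "".join(out)


print(spell_out_greek("\u03b2-thalassaemia"))  # beta-thalassaemia
print(spell_out_greek("Thromboelastography (TEG) \u03b1 angle"))
```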

Proposed Validation Rules

A whitelist of allowed character classes could be defined, for example:

^[A-Za-z0-9\u00C0-\u024F \-'"(),.:;/?®©™]+$
  • A-Za-z0-9 – ASCII letters and digits

  • \u00C0-\u024F – precomposed accented Latin letters

  • Specific punctuation and symbols explicitly allowed

  • Notably excluding Greek letters and superscript/subscript ranges, consistent with the recommendations above.
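
As a quick way to try the example pattern, here is a minimal Python sketch. The pattern is copied from above with ® © ™ written as escapes; `is_allowed` is a name invented for this illustration.

```python
import re

# The whitelist pattern from the post; \u00AE, \u00A9, \u2122 are the ® © ™ signs.
ALLOWED = re.compile(r"^[A-Za-z0-9\u00C0-\u024F \-'\"(),.:;/?\u00AE\u00A9\u2122]+$")


def is_allowed(term: str) -> bool:
    """True if every character of the term is in the whitelist."""
    return ALLOWED.fullmatch(term) is not None


print(is_allowed("Fracture of femur"))                       # True
print(is_allowed("Thromboelastography (TEG) \u03b1 angle"))  # False (Greek alpha)
print(is_allowed("Post\u2011sympathectomy neuralgia"))       # False (U+2011)
print(is_allowed("2 \u00d7 3"))                              # True (see note)
```

One thing this surfaces: the range \u00C0-\u024F also contains × (U+00D7) and ÷ (U+00F7), so a stricter version would probably need to carve those two out if mathematical operators are to be excluded.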

SNOMED CT Description Character Analysis

Here is an analysis from a recent SNOMED CT release using the rules above. There are only a few violations.

Character ‐

  • Unicode: U+2010
  • Name: HYPHEN
  • Appears in: 105 active descriptions
  • Violates: non_ascii_punct_symbol

Examples

descId=4555763010  conceptId=1149439000  term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C (disorder)
descId=4555764016  conceptId=1149439000  term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localised periodontitis Stage 1 Grade C
descId=4555765015  conceptId=1149439000  term=AAP/EFP 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localised periodontitis Stage 1 Grade C
descId=4555766019  conceptId=1149439000  term=AAP/EFP 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C
descId=4555767011  conceptId=1149439000  term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C

Character ₂

  • Unicode: U+2082
  • Name: SUBSCRIPT TWO
  • Appears in: 4 active descriptions
  • Violates: superscript_subscript

Examples

descId=2772867016  conceptId=250781000  term=Respired carbon dioxide (CO₂) concentration
descId=2785992017  conceptId=437962006  term=Lipoprotein associated phospholipase A₂ measurement (procedure)
descId=2795232011  conceptId=437962006  term=Lipoprotein associated phospholipase A₂ measurement
descId=2795346013  conceptId=438173002  term=Ratio of arterial oxygen tension to inspired oxygen fraction (PaO₂/FiO₂)

Character ‑

  • Unicode: U+2011
  • Name: NON-BREAKING HYPHEN
  • Appears in: 2 active descriptions
  • Violates: non_ascii_punct_symbol

Examples

descId=5482476017  conceptId=230461000087109  term=Post‑sympathectomy neuralgia syndrome
descId=5482477014  conceptId=230461000087109  term=Post‑sympathectomy neuralgia

Character α

  • Unicode: U+03B1
  • Name: GREEK SMALL LETTER ALPHA
  • Appears in: 1 active description
  • Violates: greek_letter

Examples

descId=2819810010  conceptId=441108001  term=Thromboelastography (TEG) α angle
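
A tally like the one above could be produced roughly as follows. This Python sketch assumes the standard tab-separated RF2 description file layout (with `active` and `term` columns) and uses a simplified `is_allowed` check as a stand-in for the full rule set; the sample rows use placeholder values for every column except `active` and `term`.

```python
import csv
import io
import unicodedata
from collections import Counter

HEADER = ("id\teffectiveTime\tactive\tmoduleId\tconceptId\t"
          "languageCode\ttypeId\tterm\tcaseSignificanceId")


def row(active: str, term: str) -> str:
    """Build a sample RF2 row; all other columns are placeholder zeros."""
    return "\t".join(["0", "0", active, "0", "0", "en", "0", term, "0"])


def is_allowed(ch: str) -> bool:
    """Simplified stand-in: printable ASCII, U+00C0-U+024F, and the (c) (R) TM signs."""
    cp = ord(ch)
    return 0x20 <= cp <= 0x7E or 0xC0 <= cp <= 0x24F or ch in "\u00a9\u00ae\u2122"


def tally_violations(rf2_text: str) -> Counter:
    """Count, per offending character, how many active descriptions contain it."""
    reader = csv.DictReader(io.StringIO(rf2_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    counts = Counter()
    for record in reader:
        if record["active"] == "1":
            # A set, so each description is counted once per character.
            counts.update({ch for ch in record["term"] if not is_allowed(ch)})
    return counts


SAMPLE = "\n".join([
    HEADER,
    row("1", "Respired carbon dioxide (CO\u2082) concentration"),
    row("1", "Ratio of arterial oxygen tension to inspired oxygen fraction (PaO\u2082/FiO\u2082)"),
    row("1", "Post\u2011sympathectomy neuralgia"),
    row("0", "Inactive term with \u03b1"),
])

for ch, n in tally_violations(SAMPLE).most_common():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}: {n} active description(s)")
```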

Questions for TRAG

  1. Is this a useful/worthwhile addition to the specification and validation rules?

  2. Does TRAG support adopting a spelled-out Greek-letter policy (e.g., “beta-thalassaemia”), rather than using Greek symbols in SNOMED CT descriptions?

  3. Should any clinical domains be allowed to request exceptions, and if so, how would that work?

  4. Does TRAG agree that superscript and subscript Unicode characters should be disallowed and replaced with ASCII equivalents as proposed?

  5. Should search/indexing guidance include formal transliteration tables for Greek symbols and superscripts/subscripts (e.g., β → “beta”, ² → “2”, ⁺ → “+”)?

  6. Should these character rules apply equally to FSNs, PTs, and synonyms, or is there a case for differing levels of strictness across description types?

  7. Should a blacklist-based validation approach (disallowed ranges) or a whitelist-based approach (explicitly allowed character set) be used?

  8. Where can we write all of this down so we don’t have to talk about it again? :slight_smile:

@jsnyder sent me this message; unfortunately he can’t post here for some reason, so I’m adding it for him. He makes a good point. I don’t know LOINC’s rules or efforts in this area, but it makes sense to be consistent and reuse any prior work.

Hi Dion,

Thank you for taking the time to pull this proposal together and post it to the forum. Unfortunately, I do not have permissions to respond to forum posts for the TRAG even though I would like to fully support this proposal.

One area that I would like the TRAG to take into consideration is to review the proposal within the scope of the LOINC/SNOMED Cooperation work. LOINC has suffered from the presence of these characters in their content for over a decade now with the non-printable space being specifically troublesome for implementors. I would like to see whatever solution SNOMED adopts be inclusive and/or exclusive enough in the rule set to accommodate the LOINC content so that the extension doesn’t need to modify the rule set just for their content.

Please let me know if you have any questions or what we can do to help you move this proposal forward.

Thanks

John

This is a super useful post @dmcmurtrie, and with a quick read through I don’t see anything I’d disagree with, but I’ll go through it in more detail.

We have an internal ticket (RP-975) for our Release Issues Report to compile a list of ‘good’ characters to check against, because our current approach of just flagging ‘bad’ characters as and when they’re identified leaves us open to repeated surprises. I’ll cross reference your post here.

Thanks @pwilliams that sounds great.

Ultimately I think it would be great to get a defined set of good and bad characters into the spec so everyone knows what the rules are and validation can be standardised. UTF-8 Unicode is a very broad target with multiple ways of doing essentially the same thing in places. It’d be good to constrain that problem space.

It’d be great to see what you come up with.

What do we think about this for an “Acceptable Character Policy File” ?

U+0000–U+001F	DISALLOW	# C0 control characters
U+007F–U+009F	DISALLOW	# C1 control characters

U+0020	        ALLOW	    # ASCII space

U+0021–U+007E	ALLOW	    # Printable ASCII punctuation, symbols, letters, digits

U+00A9	        ALLOW	    # COPYRIGHT SIGN
U+00AE	        ALLOW	    # REGISTERED SIGN
U+2122	        ALLOW	    # TRADE MARK SIGN

U+00C0–U+024F	ALLOW	    # Precomposed accented Latin letters

U+0300–U+036F	DISALLOW	# Combining diacritical marks

U+0370–U+03FF	DISALLOW	# Greek and Coptic

U+2070–U+207F	WARN	    # Superscripts
U+2080–U+208F	WARN	    # Subscripts

“Disallow” is perhaps redundant, but it allows us to make the distinction between “character unknown” and “Character known and specifically agreed we’re not going to use these”.

I liked your regex @dmcmurtrie, but I want to specify the location, match potentially multiple failures in the same description, and say what the character is, rather than just saying “this description did not match expectations”.

I’ll add the above into our Release Issues Report here and see how it plays against International and all our Managed Service customers
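
For anyone wanting to experiment with the same idea, here is a minimal Python sketch of how a policy file in that shape could be parsed and applied so that each finding carries its position, code point, character name and action. It is an assumption-laden illustration, not the Release Issues Report implementation; all function names are invented here, and characters matching no rule are reported as UNKNOWN.

```python
import re
import unicodedata

POLICY_TEXT = """\
U+0000–U+001F\tDISALLOW\t# C0 control characters
U+007F–U+009F\tDISALLOW\t# C1 control characters
U+0020\tALLOW\t# ASCII space
U+0021–U+007E\tALLOW\t# Printable ASCII punctuation, symbols, letters, digits
U+00A9\tALLOW\t# COPYRIGHT SIGN
U+00AE\tALLOW\t# REGISTERED SIGN
U+2122\tALLOW\t# TRADE MARK SIGN
U+00C0–U+024F\tALLOW\t# Precomposed accented Latin letters
U+0300–U+036F\tDISALLOW\t# Combining diacritical marks
U+0370–U+03FF\tDISALLOW\t# Greek and Coptic
U+2070–U+207F\tWARN\t# Superscripts
U+2080–U+208F\tWARN\t# Subscripts
"""

# "U+XXXX" or "U+XXXX–U+YYYY" (en dash or hyphen), whitespace, then the action.
LINE = re.compile(
    r"U\+([0-9A-F]{4,6})(?:[\u2013-]U\+([0-9A-F]{4,6}))?\s+(ALLOW|DISALLOW|WARN)")


def parse_policy(text):
    """Return a list of (low, high, action) tuples, one per policy line."""
    rules = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            low = int(m.group(1), 16)
            high = int(m.group(2) or m.group(1), 16)
            rules.append((low, high, m.group(3)))
    return rules


def classify(ch, rules):
    """Return the action of the first matching rule, or UNKNOWN if none match."""
    cp = ord(ch)
    for low, high, action in rules:
        if low <= cp <= high:
            return action
    return "UNKNOWN"


def check_term(term, rules):
    """List every non-ALLOW character as (index, code point, name, action)."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), action)
            for i, ch in enumerate(term)
            if (action := classify(ch, rules)) != "ALLOW"]


rules = parse_policy(POLICY_TEXT)
for finding in check_term("Peri\u2010implant (PaO\u2082/FiO\u2082) \u03b1 angle", rules):
    print(finding)
```

With this policy the U+2010 hyphen comes back as UNKNOWN, the two subscripts as WARN and the alpha as DISALLOW, which shows the "unknown" versus "known and rejected" distinction in practice. The superscripts ¹ ² ³ (U+00B9, U+00B2, U+00B3) would also come back as UNKNOWN, since they fall outside U+2070–U+207F.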


Hi @pwilliams,

This looks really good.

The only range that may be too restrictive to cover LOINC, UCUM, and the SNOMED organism hierarchy would be:

  • U+0370–U+03FF DISALLOW # Greek and Coptic
      • U+03B1 (e.g., α-carbon, α-helix for chemical names)
      • U+03B2 (e.g., β-lactam for organisms)
      • U+03BC (e.g., micro for units of measure in UCUM)
      • U+03A9 (e.g., ohm for units of measure in UCUM)

There are probably more, but these are some examples.

The only other question, which may not be valid, is whether we need specific allow rules for Unicode characters for Oriental languages that are encoded in UTF-8, or whether that validation is handled separately. That validation may be better handled in association with a specific language refset, but I don’t know the intricacies of the back-end system designs.

Thanks
John