Standardised Set of Allowed Characters in SNOMED CT Descriptions

dmcmurtrie · December 5, 2025, 12:37am

A persistent issue in SNOMED CT authoring is the appearance of inappropriate or inconsistent Unicode characters in descriptions. These include non-standard whitespace, typographic punctuation, invisible characters, and occasionally symbols that render inconsistently or break text processing in downstream systems. They usually arrive via copy and paste into authoring tools, and validation varies from tool to tool.

While SNOMED CT is Unicode-capable, not all Unicode characters are appropriate for terminology descriptions (for example, control characters or typographic variants of punctuation characters). However, the RF2 specification is silent on these issues, which by default leaves them entirely to editorial policy.

This post proposes a standardised, explicit character policy for SNOMED CT descriptions for discussion, evolution, and hopefully implementation as a standardised ruleset for release validation and authoring tool input validation.

Proposed Rules for Allowed and Disallowed Characters

Control characters

Disallow:

U+0000–U+001F
U+007F–U+009F

Examples: NULL, ACK, ESC, FF, BEL…

Rationale: Not printable, never meaningful, cause downstream failures.

Whitespace

Allow:

ASCII space only (U+0020)

Disallow:

Tabs, CR, LF
Non-breaking space (U+00A0)
All Unicode space characters (U+2000–U+200A)
Zero-width characters (U+200B–U+200D)

Structural rules:

No leading/trailing spaces
No multiple consecutive spaces

Rationale: Ensures stable, predictable lexical behaviour.
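
As a rough illustration only, the whitespace rules above could be checked along these lines. The function name and message wording are hypothetical and do not come from any existing SNOMED tooling.

```python
import re

# Disallowed whitespace from the list above: tab, CR, LF, NBSP,
# Unicode spaces U+2000-U+200A and zero-width characters U+200B-U+200D.
DISALLOWED_WHITESPACE = re.compile(r"[\t\r\n\u00A0\u2000-\u200A\u200B-\u200D]")


def whitespace_violations(term: str) -> list[str]:
    """Return human-readable whitespace problems found in a description term."""
    problems = []
    for match in DISALLOWED_WHITESPACE.finditer(term):
        code_point = ord(match.group())
        problems.append(
            f"disallowed whitespace U+{code_point:04X} at index {match.start()}"
        )
    # Structural rules: only ASCII space is considered here.
    if term != term.strip(" "):
        problems.append("leading or trailing space")
    if "  " in term:
        problems.append("multiple consecutive spaces")
    return problems
```

An empty list means the term passes all three whitespace checks.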

Standardised punctuation and symbols

Allow:

Printable ASCII punctuation and symbols
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Disallow:

Curly quotes
En/em dashes
Ellipsis
Decorative punctuation
Mathematical operators

Rationale: Eliminates variation from copy-paste, ensures tool compatibility.
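
The rules above say what to disallow but not what to substitute. The sketch below assumes the obvious ASCII replacements; the mapping and the function name are illustrative assumptions, and any real substitution table would need editorial agreement.

```python
# Assumed replacements: the proposal does not define these substitutions.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2010": "-",    # HYPHEN
    "\u2011": "-",    # NON-BREAKING HYPHEN
    "\u2013": "-",    # EN DASH
    "\u2014": "-",    # EM DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
})


def to_ascii_punctuation(term: str) -> str:
    """Replace common typographic punctuation with assumed ASCII equivalents."""
    return term.translate(ASCII_PUNCTUATION)
```

Whether a dash should always become a plain hyphen is a judgment call that an author would still need to review case by case.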

Symbols

Allow:

™ (U+2122)
® (U+00AE)
© (U+00A9)

Disallow:

Emoji
Dingbats
Decorative or pictographic symbols

Rationale: Supports legally accurate product names and existing practice while avoiding instability.

Zero-width and combining characters

Disallow:

Zero-width characters
Characters formed by combining a base letter with a separate accent mark when a single “precomposed” version of that accented letter already exists in Unicode.

Example:

Combining

e = U+0065 (LATIN SMALL LETTER E)
◌́ = U+0301 (COMBINING ACUTE ACCENT)

forms é (which looks like é, but is actually two characters)

Rationale: These introduce invisible variation that is extremely difficult to detect and validate.
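
One way to detect this kind of variation, offered as a minimal sketch, is to compare a term with its NFC-normalised form using Python's standard `unicodedata` module. The helper name is hypothetical. Note that NFC also rewrites a few characters that are not base-plus-accent sequences, so a difference flags "not in NFC form" rather than strictly "contains a combining accent".

```python
import unicodedata


def is_not_nfc(term: str) -> bool:
    """True when NFC normalisation would change the term."""
    return unicodedata.normalize("NFC", term) != term


# "e" followed by COMBINING ACUTE ACCENT: renders like the precomposed letter.
decomposed = "caf" + "e\u0301"
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(precomposed))  # the two strings differ in length
print(is_not_nfc(decomposed), is_not_nfc(precomposed))
```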

Accented and non-Latin characters

Allow:

Precomposed accented Latin characters (é, ñ, ü, etc.)
Characters required to support languages

Rationale: Supports multilingual authoring correctly and safely.

Superscript and subscript characters

Unicode provides numerous superscript and subscript characters, used in clinical notation for things like ions (Ca²⁺), haemoglobin fractions (HbA₂), and indices (CD4⁺).

However, from a terminology and interoperability perspective these are typographic variants rather than semantically required characters.

Proposed rule:

Superscript and subscript Unicode characters (e.g., ² ³ ⁺ ⁻ ₀–₉) must not be used in SNOMED CT descriptions. ASCII equivalents must be used instead.

Examples of recommended replacements:

Unicode glyph	ASCII representation
¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁰	1 2 3 4 5 6 7 8 9 0
⁺	+
⁻	-
⁼	=

Example conversions:

Ca²⁺ → Ca2+
NO₃⁻ → NO3-
HbA₂ → HbA2
CD4⁺ T-cell → CD4+ T-cell

Rationale:

Superscripts/subscripts display inconsistently or fail entirely in many systems.
They complicate search, equality comparison, and indexing.
They are not required for unambiguous text representation when ASCII equivalents are available.
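
The replacement table above maps directly onto a translation table. The following is a sketch, not an agreed implementation: the function name is hypothetical, and the subscript plus, minus and equals mappings are an assumption that extends the table by analogy. Note that ¹ ² ³ sit in the Latin-1 Supplement block (U+00B9, U+00B2, U+00B3) rather than alongside the other superscripts at U+2070 onwards, so a range check on U+2070–U+207F alone would miss them.

```python
# Superscript digits 0-9 followed by superscript + - =
SUPERSCRIPTS = "\u2070\u00B9\u00B2\u00B3" + "".join(
    chr(cp) for cp in range(0x2074, 0x207D)
)
# Subscript digits 0-9 followed by subscript + - = (the symbols are an assumed extension)
SUBSCRIPTS = "".join(chr(cp) for cp in range(0x2080, 0x208D))
ASCII_EQUIVALENTS = "0123456789+-="

SCRIPT_TABLE = str.maketrans(
    SUPERSCRIPTS + SUBSCRIPTS, ASCII_EQUIVALENTS + ASCII_EQUIVALENTS
)


def to_ascii_scripts(term: str) -> str:
    """Replace superscript and subscript characters with ASCII equivalents."""
    return term.translate(SCRIPT_TABLE)
```

Applied to the example conversions above, `to_ascii_scripts` turns Ca²⁺ into Ca2+ and NO₃⁻ into NO3-.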

Special Discussion Item: Clinically Meaningful Greek Letters

Greek symbols are widely used in clinical publications (e.g., β-thalassaemia, α-thalassaemia, ΔF508 mutation, β-lactam antibiotics). They carry clinical meaning and are visually compact.

However, significant safety, usability, and interoperability issues arise in terminology systems:

Many legacy EHRs and older message formats cannot reliably display or encode non-ASCII characters.
Greek characters are often lost, substituted, or corrupted during data exchange (e.g., in HL7 v2, CSV, or legacy databases).
Search engines often do not equate β with “beta”, reducing findability and leading to duplicated or divergent concepts.
Sorting and collation differ across locales.

For these reasons the Australian Medicines Terminology explicitly requires spelling out Greek letters instead of using symbols:

e.g., “alpha”, not “α”

This was mandated for consistency, safety, and interoperability across national clinical systems.

Yet many international publications (especially genetics and haematology) do use Greek letters in their published names. This reflects long-standing clinical convention.

Ultimately given SNOMED CT’s role as a reference terminology consumed by heterogeneous systems globally, including many legacy platforms, the interoperability and safety risks appear to outweigh the clinical convenience of Greek symbols.

Therefore it is recommended that SNOMED CT use spelled-out forms (“alpha”, “beta”, “gamma”, “delta”) in all description types (FSN, PT, synonyms) unless a compelling and explicitly defined exception is required.

Proposed Validation Rules

A whitelist of allowed character classes could be defined, for example:

^[A-Za-z0-9\u00C0-\u024F \-'"(),.:;/?®©™]+$

A-Za-z0-9 – ASCII letters and digits
\u00C0-\u024F – precomposed accented Latin letters
Specific punctuation and symbols explicitly allowed
Notably excluding Greek letters and superscript/subscript ranges, consistent with the recommendations above.
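
For illustration, the example pattern can be exercised in Python as below. The function name is hypothetical. As written, the character class is narrower than the punctuation list earlier in this post (it has no `+`, `%` or square brackets, for instance), so a term such as Ca2+ would be rejected; a production pattern would need to reconcile the two.

```python
import re

# Pattern copied from the example above. A triple-quoted raw string lets
# both quote characters appear inside the character class unescaped.
ALLOWED = re.compile(r"""^[A-Za-z0-9\u00C0-\u024F \-'"(),.:;/?®©™]+$""")


def is_valid(term: str) -> bool:
    """True when every character in the term is in the allowed set."""
    return ALLOWED.fullmatch(term) is not None


print(is_valid("Thromboelastography (TEG) alpha angle"))
print(is_valid("Thromboelastography (TEG) \u03B1 angle"))  # Greek alpha
print(is_valid("Ca2+"))  # '+' is absent from the example character class
```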

SNOMED CT Description Character Analysis

Here is an analysis from a recent SNOMED CT release using the rules above. There are only a few violations.

Character `‐`

Unicode: U+2010
Name: HYPHEN
Appears in: 105 active descriptions
Violates: non_ascii_punct_symbol

Examples

descId=4555763010  conceptId=1149439000  term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C (disorder)
descId=4555764016  conceptId=1149439000  term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localised periodontitis Stage 1 Grade C
descId=4555765015  conceptId=1149439000  term=AAP/EFP 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localised periodontitis Stage 1 Grade C
descId=4555766019  conceptId=1149439000  term=AAP/EFP 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C
descId=4555767011  conceptId=1149439000  term=American Academy of Periodontology and European Federation of Periodontology 2017 Classification of Periodontal and Peri‐implant Diseases and Conditions localized periodontitis Stage 1 Grade C

Character `₂`

Unicode: U+2082
Name: SUBSCRIPT TWO
Appears in: 4 active descriptions
Violates: superscript_subscript

Examples

descId=2772867016  conceptId=250781000  term=Respired carbon dioxide (CO₂) concentration
descId=2785992017  conceptId=437962006  term=Lipoprotein associated phospholipase A₂ measurement (procedure)
descId=2795232011  conceptId=437962006  term=Lipoprotein associated phospholipase A₂ measurement
descId=2795346013  conceptId=438173002  term=Ratio of arterial oxygen tension to inspired oxygen fraction (PaO₂/FiO₂)

Character `‑`

Unicode: U+2011
Name: NON-BREAKING HYPHEN
Appears in: 2 active descriptions
Violates: non_ascii_punct_symbol

Examples

descId=5482476017  conceptId=230461000087109  term=Post‑sympathectomy neuralgia syndrome
descId=5482477014  conceptId=230461000087109  term=Post‑sympathectomy neuralgia

Character `α`

Unicode: U+03B1
Name: GREEK SMALL LETTER ALPHA
Appears in: 1 active description
Violates: greek_letter

Examples

descId=2819810010  conceptId=441108001  term=Thromboelastography (TEG) α angle

Questions for TRAG

Is this a useful/worthwhile addition to the specification and validation rules?
Does TRAG support adopting a spelled-out Greek-letter policy (e.g., “beta-thalassaemia”), rather than using Greek symbols in SNOMED CT descriptions?
Should any clinical domains be allowed to request exceptions, and if so, how would that work?
Does TRAG agree that superscript and subscript Unicode characters should be disallowed and replaced with ASCII equivalents as proposed?
Should search/indexing guidance include formal transliteration tables for Greek symbols and superscripts/subscripts (e.g., β → “beta”, ² → “2”, ⁺ → “+”)?
Should these character rules apply equally to FSNs, PTs, and synonyms, or is there a case for differing levels of strictness across description types?
Should a blacklist-based validation approach (disallowed ranges) or a whitelist-based approach (explicitly allowed character set) be used?
Where can we write all of this down so we don’t have to talk about it again?

dmcmurtrie · December 7, 2025, 11:33pm

@jsnyder sent me this message; unfortunately he can’t post here for some reason, so I’m adding it for him. He makes a good point: I don’t have knowledge of LOINC’s rules or efforts in this area, but it makes sense to be consistent and reuse any prior work.

Hi Dion,

Thank you for taking the time to pull this proposal together and post it to the forum. Unfortunately, I do not have permissions to respond to forum posts for the TRAG even though I would like to fully support this proposal.

One area that I would like the TRAG to take into consideration is to review the proposal within the scope of the LOINC/SNOMED Cooperation work. LOINC has suffered from the presence of these characters in their content for over a decade now with the non-printable space being specifically troublesome for implementors. I would like to see whatever solution SNOMED adopts be inclusive and/or exclusive enough in the rule set to accommodate the LOINC content so that the extension doesn’t need to modify the rule set just for their content.

Please let me know if you have any questions or what we can do to help you move this proposal forward.

Thanks

John

pwilliams · December 8, 2025, 12:17pm

This is a super useful post @dmcmurtrie, and with a quick read through I don’t see anything I’d disagree with, but I’ll go through it in more detail.

We have an internal ticket (RP-975) for our Release Issues Report to compile a list of ‘good’ characters to check against, because our current approach of flagging characters as ‘bad’ only as and when they’re identified leaves us open to repeated surprises. I’ll cross-reference your post here.

dmcmurtrie · December 22, 2025, 1:32am

Thanks @pwilliams that sounds great.

Ultimately I think it would be great to get a defined set of good and bad characters into the spec so everyone knows what the rules are and validation can be standardised. UTF-8 Unicode is a very broad target with multiple ways of doing essentially the same thing in places. It’d be good to constrain that problem space.

It’d be great to see what you come up with.

pwilliams · January 15, 2026, 2:41pm

What do we think about this for an “Acceptable Character Policy File” ?

U+0000–U+001F	DISALLOW	# C0 control characters
U+007F–U+009F	DISALLOW	# C1 control characters

U+0020	        ALLOW	    # ASCII space

U+0021–U+007E	ALLOW	    # Printable ASCII punctuation, symbols, letters, digits

U+00A9	        ALLOW	    # COPYRIGHT SIGN
U+00AE	        ALLOW	    # REGISTERED SIGN
U+2122	        ALLOW	    # TRADE MARK SIGN

U+00C0–U+024F	ALLOW	    # Precomposed accented Latin letters

U+0300–U+036F	DISALLOW	# Combining diacritical marks

U+0370–U+03FF	DISALLOW	# Greek and Coptic

U+2070–U+207F	WARN	    # Superscripts
U+2080–U+208F	WARN	    # Subscripts

“Disallow” is perhaps redundant, but it allows us to make the distinction between “character unknown” and “Character known and specifically agreed we’re not going to use these”.

I liked your regex @dmcmurtrie, but I want to specify the location, match potentially multiple failures in the same description, and say what the character is, rather than just saying “this description did not match expectations”.
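
To make that concrete, here is a rough sketch of how a policy file in the format above might be parsed and applied per character, so that each finding carries its position, code point, Unicode name and verdict. The function names and the "UNKNOWN" verdict label are hypothetical, only a subset of the policy lines is included, and this is not the Release Issues Report code.

```python
import re
import unicodedata

# Subset of the policy above, for illustration only.
POLICY_TEXT = """
U+0000–U+001F DISALLOW # C0 control characters
U+0020 ALLOW # ASCII space
U+0021–U+007E ALLOW # Printable ASCII
U+00C0–U+024F ALLOW # Precomposed accented Latin letters
U+0370–U+03FF DISALLOW # Greek and Coptic
U+2080–U+208F WARN # Subscripts
"""

# A single code point or a range (en dash or hyphen), then the verdict.
RULE = re.compile(
    r"U\+([0-9A-F]{4,6})(?:[\u2013-]U\+([0-9A-F]{4,6}))?\s+(ALLOW|DISALLOW|WARN)"
)


def parse_policy(text):
    """Return a list of (first, last, verdict) tuples from policy text."""
    rules = []
    for line in text.splitlines():
        match = RULE.match(line.strip())
        if match:
            first = int(match.group(1), 16)
            last = int(match.group(2) or match.group(1), 16)
            rules.append((first, last, match.group(3)))
    return rules


def check_term(term, rules):
    """Report every character that is not explicitly allowed."""
    findings = []
    for index, char in enumerate(term):
        cp = ord(char)
        verdict = next((v for lo, hi, v in rules if lo <= cp <= hi), "UNKNOWN")
        if verdict != "ALLOW":
            name = unicodedata.name(char, "<unnamed>")
            findings.append((index, f"U+{cp:04X}", name, verdict))
    return findings


rules = parse_policy(POLICY_TEXT)
for finding in check_term("Ratio (PaO\u2082/FiO\u2082)", rules):
    print(finding)
```

Characters that fall outside every listed range come back as "UNKNOWN", which keeps the distinction between "never considered" and "considered and rejected".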

I’ll add the above into our Release Issues Report here and see how it plays against International and all our Managed Service customers…

jsnyder · January 15, 2026, 3:37pm

Hi @pwilliams,

This looks really good.

The only range that may be too restrictive to cover LOINC, UCUM, and the SNOMED organism hierarchy would be:

U+0370–U+03FF DISALLOW # Greek and Coptic
** U+03B1 (e.g., α-carbon, α-helix for chemical names)
** U+03B2 (e.g., β-lactam for organisms)
** U+03BC (e.g., micro for unit of measure in UCUM)
** U+03A9 (e.g., ohm for unit of measure in UCUM)

There are probably more, but these are some examples.

The only other question, which may not be valid, is whether we need specific allow rules for Unicode characters for East Asian languages that are being encoded in UTF-8, or whether that validation is handled separately. That validation may be better handled in association with a specific language refset, but I don’t know the intricacies of the back-end system designs.

Thanks
John

jsnyder · April 6, 2026, 8:27pm

Hi @pwilliams ,

Was the Release Issues Report updated to flag disallowed characters within the last 30 days? I am seeing disallowed character entries for the Greek alpha character in my report this month for a concept added to the terminology in 2017.

Thanks
John

pwilliams · April 7, 2026, 9:29am

Hi @jsnyder yes it was. Hopefully we can make some progress on this in Vienna, although I suspect we’re going to want to add an enhancement to allow these rules to work on an extension-by-extension basis.

I think what we’ll do is add those characters you identified above as ‘provisionally acceptable’ just to reduce the chatter on these reports (everyone likes a clean report), and then either we can move towards some consensus at the International level, or we’ll switch to managing these on a case-by-case basis. Which in fact would not be at all difficult to do.

CC @eilyukhina

jsnyder · April 7, 2026, 12:10pm

Thanks @pwilliams and @eilyukhina ,

I can confirm the report code appears to be working as expected. I had 3 concepts show up on my report in preparation for the authoring platform upgrade to the April International edition. One of the concepts was added just in the last few weeks and contained a character that references the U.S. Code of Federal Regulations; another was a legacy concept containing the alpha character. I have already cleaned those up and promoted the fixes to mainline.

Well done and nice work. Much appreciated.

mnystrom · April 12, 2026, 3:56pm

Non‑breaking spaces should generally be allowed where they make sense, since they add value and are included in almost all character sets. At the same time, it could be helpful if editors provided a gentle warning when a non‑breaking space is entered, just to give users a chance to double‑check that it is really intended in that particular place.

Although Greek characters, superscripts, subscripts, and similar symbols are not currently generally supported due to limitations in today’s EHR systems, it would be good to keep a longer‑term perspective on supporting them. These characters are often requested as improvements in EHR systems, and supporting them would clearly be beneficial. Using Greek characters, superscripts, subscripts, and similar symbols in a product gives the product a much more modern look and feel. One possible approach could be to use different description types or separate language reference sets, where one allows these characters and one does not.

It would also be useful to clarify the intended scope. Is this meant to apply only to English descriptions, or more broadly? If the ambition is a universal solution, a more in‑depth analysis will probably be needed, since different languages rely on different character sets and usage patterns.
