Here’s what our Robot Overlord has to say on the topic…
1. What SNOMED International actually assumes
Within SNOMED CT, the languageCode on a Description is defined as:
ISO 639-1 language of the description text
This implicitly encodes two assumptions:
- The text is natural language
- The text belongs to a linguistic system with a speech community
LOINC descriptions violate both:
- They are algorithmically constructed strings
- They have no independent linguistic community
- Their semantics are positional and compositional, not discursive
So semantically, tagging them as "en" is already a category error.
2. Why this matters operationally
In a modern terminology-enabled system, “language = en” is rarely just metadata. It is typically executable semantics.
(a) Tokenisation and lexical pipelines
Most implementations will:
- apply English tokenisation rules
- strip stopwords (“of”, “in”, “by”)
- perform stemming/lemmatisation
- weight tokens based on corpus frequency
LOINC strings (e.g. “Glucose:SCnc:Pt:Ser/Plas:Qn”) behave very differently:
- punctuation is semantically meaningful, not syntactic noise
- tokens are controlled abbreviations, not lexical words
- stopword removal can be destructive
- stemming is nonsensical
Result: index corruption or degraded recall/precision
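To make the failure mode concrete, here is a minimal Python sketch of a naive English pipeline applied to the LOINC string above. The stopword list, the crude plural-stripping rule, and both function names are illustrative assumptions, not the behaviour of any particular search engine or terminology server.

```python
import re

# Tiny illustrative stopword list; real pipelines use much larger ones.
STOPWORDS = {"of", "in", "by", "the", "a", "an"}


def naive_english_tokens(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords, strip a plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Crude stand-in for stemming: remove a trailing 's' from longer tokens.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def loinc_parts(text: str) -> list[str]:
    """Split only on ':' so each positional part survives intact."""
    return text.split(":")


loinc = "Glucose:SCnc:Pt:Ser/Plas:Qn"
print(naive_english_tokens(loinc))
print(loinc_parts(loinc))
```

With this toy pipeline the specimen part “Ser/Plas” is broken into two tokens and “plas” is stemmed to “pla”, whereas splitting on “:” alone keeps the five positional parts, including “Ser/Plas”, as they were written.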
(b) Query expansion and synonymy
Systems often:
Injecting LOINC strings into the “en” pool means:
- they may be incorrectly surfaced as synonyms
- they may pollute relevance scoring
- they may bias autocomplete toward opaque encodings
(c) UI/UX expectations
User interfaces frequently assume:
LOINC strings are often:
(d) Distributional semantics / embeddings
More advanced systems:
LOINC strings will:
3. Is it really “not English”?
LOINC descriptions are best characterised as:
a controlled compositional representation language embedded in English orthography
They are not:
So from an ISO 639 perspective, they are closer to:
4. Why SNOMED is tempted to label them “en”
This is largely a consequence of schema constraints:
So the options are:
| Option | Outcome |
| --- | --- |
| Use "en" | Semantically wrong, but valid |
| Use non-existent code | Invalid RF2 |
| Leave blank | Invalid RF2 |
| Use private code | Non-conformant |
They are effectively boxed into "en" unless the model changes.
5. Risk magnitude
- In systems with basic string matching → low to moderate impact
- In systems with linguistic processing / IR pipelines → high impact
- In systems with ML/NLP layers → systemic distortion risk
The key variable is whether the implementation treats languageCode as:
6. Mitigation strategies (within current constraints)
If SNOMED proceeds without changing the model, implementers need compensating controls:
(a) Type-based filtering
Treat LOINC-derived descriptions differently based on:
(b) Refset segregation
Place LOINC descriptions in a dedicated:
- language refset
- or custom reference set

and exclude them from general “en” processing pipelines
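As a rough sketch of that segregation, an indexer could partition descriptions by refset membership before any English-specific processing runs. The `Description` shape, the `split_for_indexing` helper, and the placeholder identifiers below are hypothetical; a real implementation would read membership from the RF2 refset files for whichever refset holds the LOINC-derived descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    id: str             # description identifier (placeholder values below)
    language_code: str  # RF2 languageCode, e.g. "en"
    term: str


def split_for_indexing(descriptions, loinc_member_ids):
    """Route descriptions to the English pipeline or to a literal-match index.

    loinc_member_ids: set of description ids that belong to the (hypothetical)
    dedicated refset for LOINC-derived descriptions.
    """
    natural, encoded = [], []
    for d in descriptions:
        if d.id in loinc_member_ids:
            encoded.append(d)   # index verbatim: no stemming, no stopwords
        elif d.language_code == "en":
            natural.append(d)   # safe for the normal English pipeline
        # Other languages are out of scope for this sketch and are skipped.
    return natural, encoded


descriptions = [
    Description("d1", "en", "Blood glucose level"),
    Description("d2", "en", "Glucose:SCnc:Pt:Ser/Plas:Qn"),
]
natural, encoded = split_for_indexing(descriptions, loinc_member_ids={"d2"})
```

The point of the design is that the decision is driven by refset membership rather than by languageCode, so the "en" tag on the LOINC string no longer determines how it is processed.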
(c) Tokenisation overrides
Detect LOINC patterns and:
(d) UI suppression rules
Prevent display in:
- preferred term slots
- patient-facing contexts
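The pattern detection in (c) and the suppression rule in (d) could be combined along the following lines. The heuristic here (five or six non-empty colon-separated parts) is an assumption based on the shape of the example string earlier in this post; it is not an official LOINC grammar and would misclassify ordinary terms that happen to contain several colons. Both function names are hypothetical.

```python
def looks_like_loinc_string(term: str) -> bool:
    """Heuristic: 5 or 6 non-empty ':'-separated parts, as in
    'Glucose:SCnc:Pt:Ser/Plas:Qn'. An assumption, not a LOINC specification."""
    parts = term.split(":")
    return len(parts) in (5, 6) and all(p.strip() for p in parts)


def patient_facing_terms(terms: list[str]) -> list[str]:
    """Drop LOINC-shaped strings before they reach a patient-facing display."""
    return [t for t in terms if not looks_like_loinc_string(t)]


candidates = ["Blood glucose level", "Glucose:SCnc:Pt:Ser/Plas:Qn"]
print(patient_facing_terms(candidates))
```

Where refset membership is available it is the more reliable signal; a string heuristic like this is only a fallback for data that arrives without that context.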
7. The architecturally “correct” fix
From a modelling standpoint, the real issue is that:
“languageCode” is being overloaded to carry something it was never designed for
A more robust design would separate:
For example:
But that requires:
8. Bottom line
- ISO 639-1 codes are being used outside their domain of applicability
- LOINC descriptions are not linguistically equivalent to English text
- Treating them as such can materially degrade system behaviour
If this proceeds without architectural mitigation, you should expect: