Retention of LOINC-specific descriptions for promoted content

As part of the HPO alignment work being done, a number of Observable entity concepts currently in the LOINC extension are being promoted to the International release to support the modeling of HPO phenotypes.

When we promote LOINC concepts, we find that they carry with them the LOINC-specific short common name, long common name and fully specified (LOINC) descriptions. For example:

While we have agreed that no LOINC description would be used for the FSN or PT of SNOMED concepts, the question remains as to whether the native LOINC descriptions should be retained for those familiar with LOINC terminology, as suggested by Regenstrief. Most of these descriptions violate current editorial guidance for descriptions (i.e. abbreviations, colons, special characters), so they would need to be treated as exceptions.

As these concepts are part of collaborative work with Regenstrief and the concepts are being attributed to them, will the presence of these nonconformant descriptions pose any issues for implementers or other users?

Before going out to the community as a whole, I would be interested in the views of the EAG members.

LOINC descriptions aren’t written in “English” or any dialect of it, in any normal sense. For that reason they likely carry a significant risk of perturbing existing systems that index and search over the space of en descriptions.

But perhaps that all points at the obvious solution: LOINC descriptions should not be assigned to language code = en, because they’re not fundamentally written in “English”. They’re a private language that, though it has obvious derivational links to English, no longer actually IS English in any meaningful or normal linguistic sense. If they had a private languageCode, their descriptions would and could be ignored by systems attempting to index and search over English or its dialects.


Here’s what our Robot Overlord has to say on the topic…

1. What SNOMED International actually assumes

Within SNOMED CT, the languageCode on a Description is defined as:

ISO 639-1 language of the description text

This implicitly encodes two assumptions:

  1. The text is natural language

  2. The text belongs to a linguistic system with a speech community

LOINC descriptions violate both:

  • They are algorithmically constructed strings

  • They have no independent linguistic community

  • Their semantics are positional and compositional, not discursive

So semantically, tagging them as "en" is already a category error.


2. Why this matters operationally

In a modern terminology-enabled system, “language = en” is rarely just metadata. It is typically executable semantics.

(a) Tokenisation and lexical pipelines

Most implementations will:

  • apply English tokenisation rules

  • strip stopwords (“of”, “in”, “by”)

  • perform stemming/lemmatisation

  • weight tokens based on corpus frequency

LOINC strings (e.g. “Glucose:SCnc:Pt:Ser/Plas:Qn”) behave very differently:

  • punctuation is semantically meaningful, not syntactic noise

  • tokens are controlled abbreviations, not lexical words

  • stopword removal can be destructive

  • stemming is nonsensical

👉 Result: index corruption or degraded recall/precision
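To make the point concrete, here is a minimal sketch of what a generic English pipeline does to these strings. The tokeniser and the three-word stopword list are hypothetical stand-ins for a real search stack, not any particular product's behaviour; the two input strings are the ones quoted in this thread.

```python
import re

# Illustrative stopword list only; real pipelines use much larger ones.
STOPWORDS = {"of", "in", "by"}

def naive_english_tokens(text: str) -> list[str]:
    """Lowercase, split on anything that is not a letter or digit, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

fsn_style = "Glucose:SCnc:Pt:Ser/Plas:Qn"
long_name = "Erythrocyte [DistWidth] in Red Blood Cells by Automated count"

# The colons and the slash disappear, so the positional structure is lost.
print(naive_english_tokens(fsn_style))
# ['glucose', 'scnc', 'pt', 'ser', 'plas', 'qn']

# "in" and "by", which introduce the system and the method, are dropped.
print(naive_english_tokens(long_name))
# ['erythrocyte', 'distwidth', 'red', 'blood', 'cells', 'automated', 'count']
```

After this step the index can no longer tell which token was the component and which was the system, and "Ser/Plas" has become two unrelated tokens.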


(b) Query expansion and synonymy

Systems often:

  • expand queries using synonyms within the same language refset

  • assume interchangeability of “en” descriptions

Injecting LOINC strings into the “en” pool means:

  • they may be incorrectly surfaced as synonyms

  • they may pollute relevance scoring

  • they may bias autocomplete toward opaque encodings


(c) UI/UX expectations

User interfaces frequently assume:

  • “English descriptions” are human-readable

  • preferred terms are clinically safe for display

LOINC strings are often:

  • compressed

  • cryptic

  • not safe for patient-facing contexts


(d) Distributional semantics / embeddings

More advanced systems:

  • build embeddings over “English descriptions”

  • cluster concepts via lexical similarity

LOINC strings will:

  • distort vector spaces

  • introduce artificial similarity based on shared abbreviations

  • degrade clustering quality
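A rough way to see the "artificial similarity" effect, without training any embeddings, is plain token overlap. The sketch below uses Jaccard similarity over a bag of tokens as a crude stand-in for lexical similarity; it is not an embedding model, and the second string (a sodium analogue of the glucose example) is an assumption added purely for illustration.

```python
import re

def token_set(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the union."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

glucose = "Glucose:SCnc:Pt:Ser/Plas:Qn"
sodium = "Sodium:SCnc:Pt:Ser/Plas:Qn"  # assumed example, not from the thread

# Five of the seven distinct tokens are shared abbreviations.
print(round(jaccard(glucose, sodium), 2))  # 0.71
```

The two strings score as highly similar even though the only token that carries the clinical distinction, the analyte, differs. Any method that leans on shared surface tokens is likely to show a comparable bias.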


3. Is it really “not English”?

LOINC descriptions are best characterised as:

a controlled compositional representation language embedded in English orthography

They are not:

  • a natural language

  • a dialect

  • even a controlled natural language in the usual sense (like ASD-STE100)

So from an ISO 639 perspective, they are closer to:

  • a formal representation system

  • which ISO 639 explicitly does not aim to classify


4. Why SNOMED is tempted to label them “en”

This is largely a consequence of schema constraints:

  • languageCode is mandatory

  • constrained to ISO 639-1 (2-letter)

  • no extension mechanism

So the options are:

| Option | Outcome |
| --- | --- |
| Use "en" | Semantically wrong, but valid |
| Use non-existent code | Invalid RF2 |
| Leave blank | Invalid RF2 |
| Use private code | Non-conformant |
👉 They are effectively boxed into "en" unless the model changes.


5. Risk magnitude

  • In systems with basic string matching → low to moderate impact

  • In systems with linguistic processing / IR pipelines → high impact

  • In systems with ML/NLP layers → systemic distortion risk

The key variable is whether the implementation treats languageCode as:

  • display metadata (low risk)

  • or processing signal (high risk)


6. Mitigation strategies (within current constraints)

If SNOMED proceeds without changing the model, implementers need compensating controls:

(a) Type-based filtering

Treat LOINC-derived descriptions differently based on:

  • description type (e.g. synonym vs FSN vs special marker)

  • or provenance metadata (if available)

(b) Refset segregation

Place LOINC descriptions in a dedicated:

  • language refset

  • or custom reference set

and exclude from general “en” processing pipelines
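As a sketch of how refset segregation could look on the consuming side: select the description ids that feed a pipeline by refset membership, and subtract anything that also sits in the dedicated LOINC refset. The refset identifiers and description ids below are placeholders, not real SNOMED CT ids, and the row shape is reduced to the two fields the example needs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LangRefsetMember:
    refset_id: str
    description_id: str

# Placeholder identifiers, for illustration only.
GENERAL_EN_REFSET = "refset-en-general"
LOINC_NAMES_REFSET = "refset-loinc-names"

def select_description_ids(members, include, exclude=frozenset()):
    """Ids that are in any `include` refset and in no `exclude` refset."""
    included = {m.description_id for m in members if m.refset_id in include}
    excluded = {m.description_id for m in members if m.refset_id in exclude}
    return included - excluded

members = [
    LangRefsetMember(GENERAL_EN_REFSET, "d1"),
    LangRefsetMember(LOINC_NAMES_REFSET, "d2"),
    # A LOINC string that was also added to the general refset.
    LangRefsetMember(GENERAL_EN_REFSET, "d3"),
    LangRefsetMember(LOINC_NAMES_REFSET, "d3"),
]

nlp_ids = select_description_ids(
    members, include={GENERAL_EN_REFSET}, exclude={LOINC_NAMES_REFSET}
)
print(sorted(nlp_ids))  # ['d1']
```

The explicit exclude set matters only if a LOINC description ends up in both refsets; if the dedicated refset is kept strictly separate, the include set alone is enough.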

(c) Tokenisation overrides

Detect LOINC patterns and:

  • bypass standard NLP

  • use delimiter-aware parsing instead
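A tokenisation override could be sketched roughly as follows. The detection rule (five or six colon-separated parts, with the first five non-empty) is a heuristic assumed here for illustration and would misfire on ordinary text that happens to contain several colons; the axis names follow the usual six-part LOINC name structure, with the method part optional.

```python
from typing import Optional

AXES = ("component", "property", "time", "system", "scale", "method")

def parse_loinc_style(text: str) -> Optional[dict[str, str]]:
    """Return the six axes if `text` looks like a colon-delimited LOINC name."""
    parts = [p.strip() for p in text.split(":")]
    if len(parts) not in (5, 6) or not all(parts[:5]):
        return None
    parts += [""] * (6 - len(parts))  # method is optional
    return dict(zip(AXES, parts))

def index_terms(text: str) -> list[str]:
    """Bypass word-level processing for LOINC-style strings."""
    parsed = parse_loinc_style(text)
    if parsed is not None:
        # Keep each axis value whole, e.g. "Ser/Plas" stays one unit.
        return [f"{axis}={value}" for axis, value in parsed.items() if value]
    return text.lower().split()  # stand-in for the normal English pipeline

print(index_terms("Glucose:SCnc:Pt:Ser/Plas:Qn"))
# ['component=Glucose', 'property=SCnc', 'time=Pt', 'system=Ser/Plas', 'scale=Qn']
print(index_terms("Blood glucose level"))
# ['blood', 'glucose', 'level']
```

Note that this only covers the colon-delimited form; the long and short common names have no such delimiter and would need a different signal, such as provenance or refset membership.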

(d) UI suppression rules

Prevent display in:

  • preferred term slots

  • patient-facing contexts


7. The architecturally “correct” fix

From a modelling standpoint, the real issue is that:

“languageCode” is being overloaded to carry something it was never designed for

A more robust design would separate:

  • natural language (ISO 639)

  • representation system / formalism (new attribute)

For example:

  • languageCode = en

  • representationType = loinc-compositional
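In data-structure terms, the separation might look like the sketch below. Only languageCode exists in RF2 today; the representation_type field, its "natural-language" default and the sample natural-language term are assumptions made up for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Description:
    term: str
    language_code: str  # existing attribute: ISO 639-1 code
    # Hypothetical new attribute; the default value is an assumption.
    representation_type: str = "natural-language"

def nlp_eligible(descriptions, language_code):
    """Descriptions that are both in the language and written as natural language."""
    return [
        d for d in descriptions
        if d.language_code == language_code
        and d.representation_type == "natural-language"
    ]

descriptions = [
    Description("Glucose measurement", "en"),  # illustrative term
    Description("Glucose:SCnc:Pt:Ser/Plas:Qn", "en", "loinc-compositional"),
]

print([d.term for d in nlp_eligible(descriptions, "en")])
# ['Glucose measurement']
```

Both rows keep languageCode = en, so existing consumers that ignore the new attribute see no change, while consumers that read it can opt out of the compositional strings.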

But that requires:

  • RF2 schema evolution

  • backward compatibility strategy

  • governance agreement


8. Bottom line

  • ISO 639-1 codes are being used outside their domain of applicability

  • LOINC descriptions are not linguistically equivalent to English text

  • Treating them as such can materially degrade system behaviour

If this proceeds without architectural mitigation, you should expect:

  • subtle search regressions

  • explainability issues

  • and hard-to-diagnose NLP artefacts

Hi Jim, Jeremy and others,

I would argue that these are English - at least, I could not come up with a more appropriate language tag. But indeed, not English that conforms to the English editorial guidelines.

In the Netherlands we distinguish descriptions for patients from descriptions for healthcare professionals, by means of language reference sets. I believe other countries (e.g. Norway) use this approach as well to distinguish target audiences for a description.

LOINC provides multiple descriptions per concept that also have specific audiences in mind. In their browser, I’ve seen the Fully Specified Name (pretty illegible, with lots of colons), Long Common Name, Short Name, Display Name and Consumer Name.

The Short Name is specifically meant for labs; the Consumer Name is meant for patients (Consumer Names – LOINC); the Display Names are clinician-friendly (Display Names – LOINC).

So, my view is: discuss with Regenstrief which term types they wish to show in the LOINC Ontology. Then create a new language reference set for each type they choose. Ideally, create an international concept and an English concept for each language refset: that way we can add sister language refsets for Dutch. E.g.

  • LOINC short name language reference set (metadata)
    • English LOINC short name language reference set (metadata)
    • Dutch LOINC short name language reference set (metadata)

The other potential engineering kludge, given that there is zero prospect of adding “LOINC Descriptions” to the set of constructed languages recognised by ISO 639-1, would be to allow the continued (considerable) stretching of the truth that they belong to languageCode = en, but to ship LOINC descriptions with a new typeId that is neither a synonym, FSN nor a definition. That way they can still be partitioned off and ignored by existing NLP pipelines wherever these expect and require exclusively discursive natural language strings.
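A consumer-side sketch of that partitioning, assuming such a typeId existed: the first three ids are the existing FSN, synonym and definition description types, while the LOINC one is a made-up placeholder, as is the natural-language sample term.

```python
from collections import defaultdict

FSN = "900000000000003001"
SYNONYM = "900000000000013009"
DEFINITION = "900000000000550004"
LOINC_NAME = "HYPOTHETICAL-LOINC-NAME-TYPE"  # placeholder, no such typeId exists

def partition_by_type(descriptions):
    """Group (type_id, term) pairs by type_id."""
    groups = defaultdict(list)
    for type_id, term in descriptions:
        groups[type_id].append(term)
    return dict(groups)

descriptions = [
    (SYNONYM, "Glucose measurement"),  # illustrative term
    (LOINC_NAME, "Glucose:SCnc:Pt:Ser/Plas:Qn"),
]

groups = partition_by_type(descriptions)

# An NLP pipeline that only asks for the three known types never sees the
# LOINC strings, even though they share languageCode = en.
nlp_input = [t for tid in (FSN, SYNONYM, DEFINITION) for t in groups.get(tid, [])]
print(nlp_input)  # ['Glucose measurement']
```

This only protects pipelines that already select by typeId; anything that loads every en description regardless of type would still ingest the LOINC strings.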

I agree these descriptions are largely a kludge (like many things with LOINC). But “English” might be a reasonable concession - perhaps if it got through to core - a different description type?

Noting these descriptions would NEVER be accepted into core SNOMED under any other conditions. “DistWidth” “Bld” “Qn” “Pt” are not real words.

What is the actual value of adding these “LOINC descriptions” to SNOMED? (Even the LOINC extension?). I’m not even sure what they add to LOINC (the 6 fields can be searched). Most likely another “feature” shoehorned in over the last 30 years - for some sort of single field token searching?


As a side note: This example also highlights a concern I have with how much “faith” is being put into the “quality” of LOINC. This concept was originally named “Erythrocyte [DistWidth] in Red Blood Cells by Automated count” and modelled with a specimen on “RBC”.

But it looks like this was revised last year to (correctly) use a (whole) blood specimen. I expect this quality improvement to LOINC was a direct consequence of the collaboration with SI, but that is underplayed much of the time.

And as I’ve said before “Just because LOINC (or any code system) does something, doesn’t mean SNOMED CT needs to do the same”. They can leverage each other, but they are different products.

In general, I agree with all the comments made on this topic to date.

The LOINC-specific names should stay in the LOINC extension module for distribution but should not be promoted to the International Edition. This follows the general editorial policy for promoting content from a country extension to the International Edition: some descriptions are appropriate for the country’s extension but are not eligible for promotion. I do not see a compelling reason to change this just for LOINC.

Thanks everyone for your comments. I will bring this back to the LOINC project.