HTML markup included in text mining results

A picture is sometimes worth a thousand words:

Note the ‘<h4>’ HTML tag spelled out within the paragraph. Among other things, this makes it appear as if the target and disease were in the same sentence, when in fact they are in two separate sub-sections of the abstract. That makes the text mining evidence both technically and practically incorrect.

To reproduce, here’s the link to both target and disease terms: scroll down to the Europe PMC evidence, then to the Guillet (2022) paper, and finally click ‘show matches’.

I’m seeing this happen in several abstract ‘sentences’, though I can’t rule out similar issues in other sections of the papers.

Hi, yes, this is quite ugly. Unfortunately, because the thousands of publications ingested by Europe PMC follow heterogeneous formats, it is difficult to build a parser that works on every publication. We have let our partners at Europe PMC know about this issue.
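
For anyone post-processing these matches on their own side, here is a minimal sketch of one possible workaround. It is not how the Europe PMC pipeline works. It assumes the abstract text arrives as a plain string with inline HTML tags such as ‘<h4>’, and it treats any block-level tag as a section boundary, so that text on either side is never merged into one ‘sentence’. The names `BLOCK_TAGS`, `SectionSplitter` and `split_sections` are hypothetical, and the choice of which tags count as boundaries is an assumption.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags mark a boundary between abstract sections.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "br"}


class SectionSplitter(HTMLParser):
    """Collects text and starts a new segment at every block-level tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = [[]]

    def _break(self):
        # Open a new segment only if the current one already holds something.
        if self.segments[-1]:
            self.segments.append([])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        # Inline tags (e.g. <i>, <sub>) are dropped; their text stays in place.
        self.segments[-1].append(data)


def split_sections(abstract: str) -> list[str]:
    """Return the tag-free text segments of an abstract, one per section chunk."""
    parser = SectionSplitter()
    parser.feed(abstract)
    parser.close()
    # Join each segment's pieces, then collapse runs of whitespace.
    texts = (" ".join("".join(seg).split()) for seg in parser.segments)
    return [t for t in texts if t]
```

Running sentence-level co-occurrence on each returned segment separately, rather than on the raw string, would keep a target in one sub-section from being paired with a disease in the next. It won’t help with markup that is already malformed or with section breaks that aren’t expressed as tags.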
