Building context into IA projects with Jonathan Engel

A presentation at World IA Day London 2024 in March 2024 in London, UK by London World IA Day

Slide 1

Building context into IA projects
A review of successful structures and processes
Jonathan Engel
WIAD 2024

Slide 2

Introduction – My background
• Reuters Manager for Multimedia News Production (19 years in multiple roles)
• Running own information management consultancy, InfoArk, for the last 22 years
• Designer of customised taxonomies and related metadata for classifying content in 30 major projects
• Specialist in linking classification schemes with automated tagging and search software, plus content filters and linked data

Slide 3

Introduction – Selected clients
• Dow Jones newswires online
• Times and Sunday Times online
• Institute of Chartered Accountants
• Clifford Chance law firm
• Which? (Consumer Association)
• Cambridge University
• Unilever
• Shop Direct Group (Littlewoods, Very)
• UK Care Quality Commission
• NHS Education for Scotland
• UK Department for International Development
• Oxfam International

Slide 4

Connections – puzzle needs context
Create four groups of four!
Nose, Head, Stiff, Wing
Bulb, Seal, Crayon, Rob
Ear, Engine, Hose, Candle
Cabin, Stalk, Honeycomb, Fleece

Slide 5

Connections – multiple links possible
Create four groups of four!
Words marked "?" could fit more than one group.
Nose ?, Head ?, Stiff, Wing
Bulb ?, Seal ?, Crayon, Rob
Ear ?, Engine, Cabin, Stalk ?
Hose ?, Honeycomb, Candle, Fleece ?

Slide 6

Connections – find common thread!
Create four groups of four!
• RIP OFF: Fleece, Hose, Rob, Stiff
• PARTS OF AN AIRPLANE: Cabin, Engine, Nose, Wing
• UNITS OF VEGETABLES: Bulb, Ear, Head, Stalk
• THINGS MADE OF WAX: Candle, Crayon, Honeycomb, Seal

Slide 7

Cycle of Context
• Engage specialists – begin with advice on relevant vocabularies and documents
• Build and test initial taxonomy around unifying topics
• Extend taxonomy with synonyms
• Extend thesaurus with related topics and relationships
• Extend ontology with keywords
• Test structure with rules to find topic-rich documents; refine results with AI/machine learning on these curated documents
• Governance – need representative bodies and transparent process to review, approve and repeat

Slide 9

Taxonomy – the heart of all IT systems
• CMS
• Intranet
• Business Intelligence
• CRM
• Extranet
• Content Structure
• Internet

Slide 10

Multi-faceted taxonomy goes beyond “subjects”
Facets and the questions they answer:
• Entities – Who? Where? Who for?
• Subject matter – What? Why?
• Focused filters – How?
Facet contents:
• Geography for location, jurisdiction
• Organisation’s business units
• External organisations by type
• List of statutes, products, roles, etc.
• Business activities and issues
• Business sectors, e.g. financial services
• Events, projects and initiatives
• Content types and level
• Language

Slide 11

Combine structures for “extended” taxonomy

Slide 12

Components of an extended taxonomy
• Preferred term
• Hierarchical parent
• Synonyms
• Related term
• Contextual keywords

Slide 13

Extended taxonomy term – example
• Preferred term: Law
• Hierarchical parent: Law and crime
• Synonyms: Legal system, Legislation
• Related term: Police, Legislature, Politics, Statutes and regulations
• Contextual keywords: Lawmaker, Senator, Congressman/woman, Draft bill, Vote
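
The five components of the "Law" example can be held in one record per term. The sketch below is one possible representation, not the structure used in any of the projects described; the class name `ExtendedTerm` and its field names are assumptions chosen to mirror the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ExtendedTerm:
    """One taxonomy term plus the extensions listed on the slide (hypothetical model)."""
    preferred_term: str
    hierarchical_parent: str
    synonyms: list[str] = field(default_factory=list)
    related_terms: list[str] = field(default_factory=list)
    contextual_keywords: list[str] = field(default_factory=list)

    def all_labels(self) -> list[str]:
        """Preferred term followed by its synonyms (the equivalence set)."""
        return [self.preferred_term, *self.synonyms]


# Values copied from the "Law" example.
law = ExtendedTerm(
    preferred_term="Law",
    hierarchical_parent="Law and crime",
    synonyms=["Legal system", "Legislation"],
    related_terms=["Police", "Legislature", "Politics", "Statutes and regulations"],
    contextual_keywords=["Lawmaker", "Senator", "Congressman/woman", "Draft bill", "Vote"],
)

print(law.all_labels())
```

Keeping synonyms, related terms and contextual keywords in separate fields lets a tagging rule treat them differently, for instance as matches versus supporting clues.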

Slide 15

Synonyms
• Equivalent terms – exact or “near” match
• Example: Cardiovascular disease
• Synonyms: Heart disease, Atherosclerosis, Arterial disease, Cardiovascular condition, Cardiovascular illness
• Synonym rings – useful for recurring equivalencies, e.g. disease = illness = condition
• Can link rings to produce “semantic nets” to discover information, e.g. Danger + Southwest
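
A synonym ring can generate variants of a term mechanically. The sketch below uses the ring from the slide (disease = illness = condition); the function name `expand_with_ring` and the word-swap approach are assumptions for illustration, not a description of any particular tool.

```python
# Synonym ring taken from the slide: disease = illness = condition.
RING = ["disease", "illness", "condition"]


def expand_with_ring(phrase: str, ring: list[str]) -> list[str]:
    """Return variants of `phrase` with any ring member swapped for each member in turn."""
    words = phrase.split()
    variants: list[str] = []
    for i, word in enumerate(words):
        if word.lower() in ring:
            for member in ring:
                candidate = " ".join(words[:i] + [member] + words[i + 1:])
                if candidate not in variants:
                    variants.append(candidate)
    # A phrase with no ring member is returned unchanged.
    return variants or [phrase]


print(expand_with_ring("Cardiovascular disease", RING))
```

Ring-generated variants cover only the recurring pattern; synonyms such as "Heart disease" or "Atherosclerosis" still have to be recorded against the term itself.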

Slide 17

Related terms
• Already present in taxonomy
• Associated with the preferred term
• Useful to record strength of relationship for tagging, e.g. mandatory or discretionary
• The City of London police will always be linked to crime prevention, but crime prevention will only sometimes be linked to that specific police force
• Useful to capture type of relationship, e.g. organisation “comprises” specific members, while members are “part of” organisation
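
Strength and type can be recorded on each link, with the two directions stored separately because their strength may differ. The sketch below is an assumption about how that might look; `Relation`, `INVERSE` and `expansion_tags` are hypothetical names, and "associated with" is a placeholder relationship type not taken from the slide.

```python
from dataclasses import dataclass

# Hypothetical map of each relationship type to its reciprocal.
INVERSE = {"comprises": "part of", "part of": "comprises"}


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    kind: str        # e.g. "comprises", "part of", "associated with"
    mandatory: bool  # True = always tag target when source is tagged


def inverse(rel: Relation, mandatory: bool) -> Relation:
    """Build the reciprocal link; strength is passed in because it need not match."""
    return Relation(rel.target, rel.source, INVERSE.get(rel.kind, rel.kind), mandatory)


def expansion_tags(tag: str, relations: list[Relation]) -> list[str]:
    """Tags to add automatically: targets of mandatory links from `tag`."""
    return [r.target for r in relations if r.source == tag and r.mandatory]


police_to_crime = Relation("City of London Police", "Crime prevention", "associated with", True)
relations = [police_to_crime, inverse(police_to_crime, mandatory=False)]

print(expansion_tags("City of London Police", relations))  # mandatory direction
print(expansion_tags("Crime prevention", relations))       # discretionary direction
```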

Slide 19

Contextual keywords
• Words or phrases that occur in context of preferred term, but are not linked hierarchically, by equivalence or formal association with other taxonomy terms
• Tags from social networks and community “folksonomies” are prime examples
• Other sources are wikis or knowledge graphs
• Example: “frailty” may often occur in discussions or documents on ageing

Slide 20

Sources for extended taxonomy
• “Runners up” to preferred term
• Acronyms
• Search queries
• Subject specialists
• Domain-specific documents
• Text-mining software
• Faceted-classification or search software (especially if employed when building taxonomy, not after)

Slide 21

Information strategy should unite realms of documents and data
Diagram labels:
• Strategic controlled vocabulary
• Structured content
• Unstructured content
• Taxonomy
• Conceptual data model
• Logical data model
• Ontology
Consistent terms and relationships support linked data

Slide 22

Linked Data connects internal and external
• Organisations often aim to collate and share internal and external data
• The Resource Description Framework (RDF) simplifies data structures into consistent “triples”, which are held in “triple-stores”
• It is similar to the way databases contain the three elements of Entities, Attributes and Values
• Thus a Study is evidenced by a Content type that is a Report. This Report has an Author who is a named Person
• Each entity or resource has a Uniform Resource Identifier (URI); together these reveal the entire linked chain of “triples”
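
The Study, Report and Person chain can be written as subject, predicate, object triples and followed one hop at a time. The sketch below uses plain Python tuples rather than an RDF library; every URI, the predicate names and the author name "A. Author" are placeholders invented for illustration.

```python
from typing import Optional

# Hypothetical base URI; a real project would use its own namespace.
EX = "https://example.org/"

# Each triple is (subject, predicate, object).
triples = [
    (EX + "study/1", EX + "evidencedBy", EX + "report/1"),
    (EX + "report/1", EX + "contentType", EX + "type/Report"),
    (EX + "report/1", EX + "hasAuthor", EX + "person/1"),
    (EX + "person/1", EX + "name", "A. Author"),
]


def follow(start: str, predicates: list[str]) -> Optional[str]:
    """Walk the chain of triples from `start`, one predicate per hop."""
    node = start
    for predicate in predicates:
        matches = [o for s, p, o in triples if s == node and p == predicate]
        if not matches:
            return None  # chain is broken at this hop
        node = matches[0]
    return node


# From the study to the name of the report's author in three hops.
print(follow(EX + "study/1", [EX + "evidencedBy", EX + "hasAuthor", EX + "name"]))
```

Because the object of one triple is the subject of the next, shared URIs are what let internal and external data join up.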

Slide 24

Case for assisted content classification
• High-volume tagging consistency requires automation – one person can tag fewer than 4,000 documents per year
• In same time, that staff member could define 2,400 tagging rules and templates – for 800 subjects, 100 focused filters (for content and event types) and 1,500 entities
• Using additional staff often undermines consistency: Dow Jones’ study found specialist editors’ accuracy ranged from 40-100%, with nearly half of 500 sample stories failing to hit 80% accuracy target

Slide 25

Solution: Classification that leverages fully extended taxonomy structure
• Preferred term
• Synonyms
• Family hierarchy, plus related concepts, as “clues” to meaning
• Contextual keywords as additional “clues”
• Negative contextual examples to disambiguate, e.g. for “application”
• Related concepts as expansion tags

Slide 26

Use “combo” classification/search rule
• Frequency test: Instances of Preferred term OR Synonyms in content
AND
• Prominent location test: Preferred term OR Synonyms in Title OR URL OR prominent Content element, e.g. Summary, Conclusion, etc.
AND/OR
• Concurrent proximity test: Preferred term and synonyms within 10 words of Hierarchical parent, Child term, Related terms and Contextual keywords (or within same paragraph or same Content section, or within same five rows of text)
Use OR for more Recall; AND for more Precision
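
The three tests can be combined in a few lines. The sketch below is a simplified illustration that joins all three with AND (the precision-leaning choice) and checks only the title for prominence; the function names, the `min_hits` threshold and the tokenizer are assumptions, while the 10-word window comes from the slide.

```python
import re


def tokens(text: str) -> list[str]:
    """Lower-case word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def positions(words: list[str], phrase: str) -> list[int]:
    """Start indexes where `phrase` occurs as a run of whole words."""
    target = tokens(phrase)
    if not target:
        return []
    n = len(target)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]


def combo_rule(title, body, labels, clues, min_hits=2, window=10):
    """Frequency AND prominence AND proximity, joined strictly for precision."""
    body_words, title_words = tokens(body), tokens(title)
    # Sets avoid double counting when one label is a prefix of another.
    label_hits = sorted({p for label in labels for p in positions(body_words, label)})
    clue_hits = sorted({p for clue in clues for p in positions(body_words, clue)})

    frequent = len(label_hits) >= min_hits                                # frequency test
    prominent = any(positions(title_words, label) for label in labels)    # location test
    near = any(abs(a - b) <= window for a in label_hits for b in clue_hits)  # proximity test
    return frequent and prominent and near


labels = ["Diabetes mellitus", "Diabetes", "High blood sugar"]
clues = ["obesity", "Insulin"]
body = "Diabetes is linked to obesity. Patients with diabetes often need insulin."

print(combo_rule("Diabetes care update", body, labels, clues))
print(combo_rule("Weekly news", body, labels, clues))
```

Swapping the final `and` operators for `or` would trade precision for recall, as the slide notes.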

Slide 27

Same taxonomy can drive multiple rules
Taxonomy elements + Boolean logic =
• Google search syntax
• OpenText search syntax
• FAST/Microsoft search syntax
• SmartLogic tagging rules
• Expert System tagging rules

Slide 28

Effective use of “mail merge”

Excel Taxonomy Data

Short Description | Hierarchical Parent | Synonym1 | Synonym2
Diabetes mellitus | Glucose metabolism disorders | Diabetes | High blood sugar
Physicochemical characteristics | 04. Substances | Chemical characteristics | Physical characteristics
Food Standards Agency | Key external organisations | FSA | UK Food Standards department
Danish Health and Medicines Authority | Key external organisations | DHMA | Danish Health Authority
Health Products Regulatory Authority | Key external organisations | HPRA | Irish Medicines Board

Word Template

near(and(or(title:or(“«ShortDescription»”, “«Synonym1»”, “«Synonym2»”, “«Synonym3»”, “«Synonym4»”, “«Synonym5»”, “«Synonym6»”, “«Synonym7»”), (or(“«ShortDescription»”, “«Synonym1»”, “«Synonym2»”, “«Synonym3»”, “«Synonym4»”, “«Synonym5»”, “«Synonym6»”, “«Synonym7»”))), or(“«CollectiveRelatedTerm»”, “«MandatoryRelatedTerm2»”, “«MandatoryRelatedTerm3»”, “«DiscretionaryRelatedTerm1»”, “«DiscretionaryRelatedTerm2»”, “«DiscretionaryRelatedTerm3»”, “«HighEvTerm»”, “«LowEvTerm»”)),n=10)

Classification rule or Search query

near(and(or(title:or(“Diabetes mellitus”, “Diabetes”, “High blood sugar”, “Type 1 diabetes”, “Type 2 diabetes”, “High blood glucose”, “Hyperglycaemia”), (or(“Diabetes mellitus”, “Diabetes”, “High blood sugar”, “Type 1 diabetes”, “Type 2 diabetes”, “High blood glucose”, “Hyperglycaemia”))), or(“Glucose metabolism disorders”, “cardiovascular system”, “obesity”, “Insulin”)),n=10)
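
The same merge can be scripted instead of run through Word. The sketch below fills the rule pattern from one taxonomy row and drops empty fields; the `near(and(or(...)))` shape is copied from the slide, while the function names, the row keys and the use of straight quotes are assumptions.

```python
def quoted(values: list[str]) -> str:
    """Quote each non-empty value and join with commas (empty merge fields are dropped)."""
    return ", ".join(f'"{v}"' for v in values if v)


def build_rule(row: dict, n: int = 10) -> str:
    """Fill the rule pattern from the slide using one taxonomy row."""
    labels = quoted([row["ShortDescription"], *row.get("Synonyms", [])])
    clues = quoted(row.get("RelatedTerms", []))
    return f"near(and(or(title:or({labels}), (or({labels}))), or({clues})),n={n})"


# Hypothetical row layout; the values are taken from the diabetes example.
row = {
    "ShortDescription": "Diabetes mellitus",
    "Synonyms": ["Diabetes", "High blood sugar", "Type 1 diabetes", "Type 2 diabetes",
                 "High blood glucose", "Hyperglycaemia", ""],
    "RelatedTerms": ["Glucose metabolism disorders", "cardiovascular system", "obesity", "Insulin"],
}

print(build_rule(row))
```

One template per target syntax, fed from the same rows, is what allows a single taxonomy to drive several search or tagging tools.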

Slide 29

Taxonomy structure can also contribute to relevance weighting
• Prominence of term or synonyms found in Title or key Content elements
• Concurrence percentage of Hierarchical Parents, Related terms and contextual keywords also found
• Frequency of term or synonyms
• Facet of Taxonomy
• Depth of term in Taxonomy
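
One way to combine these five factors is a weighted sum. The sketch below is purely illustrative: the slide names the factors but gives no numbers, so the weights, the cap on frequency and the assumption that deeper (more specific) terms score higher are all invented here.

```python
# Hypothetical weights; the slide names the factors but gives no values.
WEIGHTS = {"prominence": 0.3, "concurrence": 0.3, "frequency": 0.2, "facet": 0.1, "depth": 0.1}


def relevance(in_title, clues_found, clues_total, hits, facet_weight, depth, max_depth, hit_cap=5):
    """Combine the five factors, each scaled to the range 0..1, into one score."""
    factors = {
        # Term or synonym appears in the title or a key content element.
        "prominence": 1.0 if in_title else 0.0,
        # Share of parents, related terms and contextual keywords also found.
        "concurrence": clues_found / clues_total if clues_total else 0.0,
        # Occurrences of the term or synonyms, capped so long documents do not dominate.
        "frequency": min(hits, hit_cap) / hit_cap,
        # Importance assigned to the facet the term belongs to (0..1).
        "facet": facet_weight,
        # Assumption: deeper terms are more specific and so score higher.
        "depth": depth / max_depth if max_depth else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in factors.items())


strong = relevance(in_title=True, clues_found=2, clues_total=4, hits=5,
                   facet_weight=1.0, depth=3, max_depth=3)
weak = relevance(in_title=False, clues_found=0, clues_total=4, hits=1,
                 facet_weight=1.0, depth=1, max_depth=3)
print(strong > weak)
```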

Slide 31

Time for questions

Slide 32

Jonathan Engel
Consultant Information Architect
W: www.infoark.co.uk
E: j.engel@infoark.co.uk
M: +44 (0) 7966 754614