14 Nov

Matt McDonnell and the mother of all gateways

posted by Steve Bowbrick

This is a pretty long post based on a chat with the BBC’s head of search. If you’re interested in search, though, I reckon it’s worth ploughing through. I really learnt a lot from talking to Matt McDonnell: he has a very interesting and very important job working right at the heart of the future BBC.

Search as a gateway to everything

Matt didn’t want me to call him ‘head of search’. It’s not his job title and it sounded like “hagiography” to him. Still, he is in charge of search and I reckon he has a reasonable claim to the title ‘most important person at the BBC’ right now. I’m pretty sure the BBC org chart doesn’t reflect that, though, and I’m also sure that there are plenty of BBC executives who’ve never heard of him.

As the old ways into BBC content fade, search becomes more important. It’s a reasonable assumption that search will be the primary gateway to all BBC content within a few years, including the stuff that goes out on the linear channels (BBC1, BBC2, Radio 1, Radio 2 etc.). The channels themselves are already losing their gateway function. Viewers and listeners are much less likely to use a channel as a way into an evening’s viewing than they were in the pre-digital era. Themes, personalities, strong programme brands: all are becoming more important than channels. This, for instance, is one of the reasons for the BBC’s growing investment in top talent: Jonathan Ross may be an expensive presenter but he’s pretty economical when considered as a gateway to BBC content (at least when he’s not on suspension for being an arse).

On iPlayer, for instance, the channels already play a reduced part in programme selection. Programmes are still organised by channel but that’s an arbitrary echo of the BBC’s org chart: there’s no good reason to classify television content by linear channel once it’s online. Nervous channel controllers nonetheless insist on superimposing the channel name on shows that go out on iPlayer. They fear that their carefully commissioned and scheduled content has been stirred into an undifferentiated soup of shows and that the investment they’ve made in their channel’s brand will be wasted. But users conditioned by exposure to YouTube and MySpace and Google probably don’t even see the channel ident.

Likewise, the BBC’s homepage may be one of the most important in Britain but a growing proportion of users don’t use it to locate content: they find the stuff they want via a search, either using the site’s search field or by searching at Google or Yahoo or ask.com. Sitting next to Matt at his desk in White City it was revealing to watch his own navigation habits: every page he showed me was located via a search, even pages at his own site—no bookmarks, no browsing and no typing in the address field. When search is good enough it replaces all three.

Matt’s just coming to the end of a big programme of work that will sharply reduce the emphasis on web search at bbc.co.uk. The fact is that the BBC’s early ambition to ‘own’ UK web search has probably held the Corporation back from implementing really good site search and useful content structure so this is a big relief. And here’s a truly fascinating aside: when you search the web at bbc.co.uk, the top three results are often sites selected by BBC editors (here’s an example: asthma). Until recently these results were labelled as such (something like ‘best links’) but Matt’s team just removed the label.

The high quality, editor-selected results are still there, right at the top of the list but since the label was removed the click-through rate for these links has actually gone up substantially! Users weren’t clicking on the hand-selected links because they were suspicious that they might be sponsored links. They had learnt from exposure to Google and other search engines that the ‘special’ links at the top of the list are qualitatively different from the others and were avoiding them for that reason. Fascinating and counter-intuitive.

Topics

Another major initiative from the search team involves the creation of ‘topics’ pages: useful pages of information assembled from BBC sources and elsewhere about specific subjects. Topics is still in beta: you can check out the handful of hand-coded topics pages here. Many more are planned and what’s fascinating is that about 95% of them will be automatically generated.

This is all pretty hardcore semantic web stuff. The BBC topics system starts by crawling Wikipedia daily and pulling in new pages created since the last visit. Wikipedia provides authority here: confirming that a topic is real (not that it’s relevant or useful: just that it exists) and doing ‘disambiguation’ (sorting out the 19 different places called Rome, for instance). If the system finds a new entry at Wikipedia it then searches the BBC for information that’s similar to the Wikipedia entry, using Wikipedia’s text as a ‘training document’. If it finds none then no page is created: the topic is obviously not of sufficient relevance. If it finds content (news stories, programme pages, whatever) it generates a new topic page. John Muth, one of the developers working on the system, says he expects there to be tens of thousands of topic pages pretty soon after launch.
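For the curious, here’s a rough sketch of that last step in Python. It is my own illustration, not the BBC’s code. I don’t know how the real system measures similarity, so this assumes a plain bag-of-words cosine score, and the function names, the stopword list, the 0.2 threshold and the shape of the BBC items are all invented for the example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list so that common words don't dominate the score.
STOPWORDS = {"a", "an", "the", "of", "is", "and", "in", "to", "with", "for"}


def tokenize(text):
    """Lower-case the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_topic_page(title, wiki_text, bbc_items, threshold=0.2):
    """Use the Wikipedia text as the 'training document'.

    Returns a topic page (a dict) listing similar BBC items, best match first,
    or None when nothing at the BBC is similar enough.
    The threshold value is a hypothetical choice, not a figure from the BBC.
    """
    training = Counter(tokenize(wiki_text))
    matches = []
    for item in bbc_items:
        score = cosine(training, Counter(tokenize(item["text"])))
        if score >= threshold:
            matches.append((score, item["url"]))
    if not matches:
        return None  # no BBC content found, so no page is created
    matches.sort(reverse=True)
    return {"topic": title, "links": [url for _, url in matches]}
```

Call `build_topic_page` once for each new Wikipedia entry: a `None` means the topic is skipped, anything else is the raw material for a page. A real system would want something much smarter than word counts, but the shape of the decision is the same.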

The result will be thousands of new pages, an extraordinarily rich information asset that exposes a lot of authoritative BBC content that would otherwise have been neglected or even lost. This is going to be a real public service win and – let’s face it – a much better idea than trying to make bbc.co.uk a destination for web search. Live syndication of Wikipedia content will also mean that the topic pages improve as Wikipedia does (although pages needn’t use Wikipedia content). Further (the semantic web is a mighty rich and interwoven thing), people will be able to syndicate the BBC topics pages for their own use: they will be published under a Creative Commons licence like the hundreds of thousands of artist pages in the /music hierarchy. Tools will be provided and schools and libraries or even businesses will be able to build useful information resources of their own by tapping into this clever blend of content from the BBC and the commons.

And video too

This all gets even more exciting when you add the potential to search the hundreds of thousands of hours of video produced by the BBC annually. Matt’s team is currently testing a system that analyses video files, creating a transcript that can then be indexed and added to the web of content on the topic pages. The transcript can also be used to ‘chapterise’ the video itself so users can jump to a particular part of the video based on the transcript.
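To make ‘chapterising’ concrete, here’s a minimal sketch in Python. It is an illustration only: I’m assuming the transcript arrives as a list of (start time in seconds, text) segments, which is my guess at a plausible format rather than anything Matt’s team described, and the function names are made up.

```python
import re
from collections import defaultdict


def build_index(segments):
    """Map each word in the transcript to the start times of the segments containing it.

    `segments` is a list of (start_seconds, text) pairs (a hypothetical format).
    """
    index = defaultdict(list)
    for start, text in segments:
        # set() so a word repeated within one segment is recorded once
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].append(start)
    return index


def jump_points(index, word):
    """Return the sorted start times a player could jump to for a single word."""
    return sorted(index.get(word.lower(), []))
```

A player could then offer a ‘jump to’ link for each time returned by `jump_points`, and the same index could feed the site search so that a query lands inside the video rather than just on its page.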

Let’s face it: once the BBC’s audio and video content (the Corporation’s crown jewels, obviously) has been opened up to search there’s really no further argument: it’s game over. All other gateways to the BBC’s content will be officially obsolete and search will have won. Maybe I should keep my mouth shut.