Analytical Access to the Domain Dark Archive: What on earth would I do with this data?

We’re busy arranging a series of workshops in May and June. Their purpose is to bring humanities and social science scholars together to think collectively about the purposes to which they might put a near-comprehensive dataset of the UK web domain 1996-2010.

The exercise will call for some imagination, since part of this project is to help the British Library to design a user interface for the new dataset; so there isn’t yet anything ‘to play with’, as it were. To help fire scholars’ imaginations, I’ve started to sketch how I myself, as an historian of contemporary British Christianity, might start to use the dataset, and what questions I would like to ask of it.

I come to this with a research interest in the forms of words in which religion (broadly defined) is discussed, and how those modes of discourse change over time. This can usefully be thought of using the following scheme:
(i) there are some perennial issues relating to what we might call constitutional Christianity, such as the position of the bishops in the House of Lords, and the establishment of the Church of England.
(ii) there are older issues that have been ‘reactivated’ in recent years. For instance, denominational church schools were an issue as far back as the 1906 general election. After a period of calm in public discussion, the last decade or so has seen the issue come back to prominence - except, of course, that such schools are now known as faith schools.
(iii) there are also new issues, the obvious one being the perception of a threat from radical Islamism, an issue that was simply absent until relatively recently.

I am particularly interested in the domain dark archive, since the period 1996-2010 frames many of these issues perfectly. So, what might I ask of the archive, and which tools might I use?

Basic visualisation: the Ngram

At the most basic level, I might want to look at the incidence of particular terms, and look for periods in which a particular term is employed more often. For this, there is the Ngram, a visualisation tool that is already employed by Google, and on the existing UK Web Archive. Consider the following case: in February 2008, the Archbishop of Canterbury, Rowan Williams, gave a lecture to an audience of lawyers which reflected on the scope for the incorporation of sharia law into UK law. For some details of the media storm that followed, see here. An Ngram of the incidence of the word 'sharia' in the existing selective web archive looks like this:

As we might expect, there is a big spike in the incidence of the term at the time of the lecture, and then heightened activity for much of the following year. I had expected the former, certainly, but not the latter to the same extent; and so I now know to look more at the content indicated by those subsequent spikes in activity.
If one then looks for both of the terms 'sharia' and 'archbishop', the result is:

The spikes for the two terms occur at roughly the same point; but the incidence of 'archbishop' is higher, due perhaps to the wider speculation about Dr Williams' position as a result of the controversy. Also, the repeated peaks visible for 'sharia' aren't present for 'archbishop', suggesting that the debate about the former outlasted the particular instance of the lecture.
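For readers curious about what lies behind such a graph, the counting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pages` list is invented sample data standing in for archived pages, the function name `term_frequency_by_month` is my own, and I have no knowledge of how the UK Web Archive's Ngram tool is actually implemented.

```python
import re
from collections import Counter

# Hypothetical sample data: (capture month, page text) pairs.
# These are invented for illustration, not drawn from any archive.
pages = [
    ("2008-01", "The archbishop spoke about schools."),
    ("2008-02", "The archbishop's lecture on sharia; sharia law debated."),
    ("2008-02", "Reaction to the archbishop continues."),
    ("2008-07", "A further debate about sharia councils took place in the summer."),
]

def term_frequency_by_month(pages, term):
    """Return {month: occurrences of term / total words captured in that month}."""
    hits, totals = Counter(), Counter()
    for month, text in pages:
        # Crude tokenisation: runs of letters, lower-cased.
        words = re.findall(r"[a-z]+", text.lower())
        totals[month] += len(words)
        hits[month] += words.count(term.lower())
    return {month: hits[month] / totals[month] for month in sorted(totals)}

sharia = term_frequency_by_month(pages, "sharia")
peak_month = max(sharia, key=sharia.get)
```

Dividing by the total number of words in each month matters, since otherwise a month in which more pages happened to be captured would appear as a spike in every term.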

Proximity searching and sentiment analysis

One might, of course, want to go further than this, and by means that aren't yet possible within the UK Web Archive. One means might be a proximity search: looking for terms occurring within a certain number of characters' distance of each other in the same source. The graph above shows only the incidence of each term, and (crucially) not whether the two occur together in the same source. A proximity search would make the connection that is suggested by the graphs above much more secure.
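A proximity search of this kind can be sketched as follows. Again this is a minimal illustration rather than a description of any existing tool: the function name `within_proximity`, the 50-character threshold and the sample texts are all my own assumptions, and distance is measured simply between the starting positions of the two terms.

```python
import re

def within_proximity(text, term_a, term_b, max_chars):
    """True if some occurrence of term_a starts within max_chars of some term_b."""
    lowered = text.lower()
    starts_a = [m.start() for m in re.finditer(re.escape(term_a.lower()), lowered)]
    starts_b = [m.start() for m in re.finditer(re.escape(term_b.lower()), lowered)]
    return any(abs(a - b) <= max_chars for a in starts_a for b in starts_b)

# Invented examples: both texts contain both terms, but only the first
# places them close together.
close_text = "The archbishop discussed sharia law."
filler = " ".join(["filler"] * 40)
distant_text = "The archbishop visited a school. " + filler + " A separate item on sharia."

close_match = within_proximity(close_text, "archbishop", "sharia", 50)
distant_match = within_proximity(distant_text, "archbishop", "sharia", 50)
```

The second text is the case that a simple Ngram cannot distinguish: both terms are present in the source, yet they have nothing to do with one another.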

Even more interesting would be sentiment analysis: gauging the attitude of the writer of a webpage towards the term employed, using various techniques including natural language processing to find terms denoting approval or disapproval occurring in connection with the search term. The present archbishop, when he retires at the end of the year, may look back on a very particular relationship with the media during his time in office. I would be interested to see whether 'archbishop' appeared more often in the data with negative connotations after the sharia controversy.
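The simplest form of this idea, well short of full natural language processing, is to count approving and disapproving words that fall near the search term. The sketch below does only that. The two word lists are tiny and invented purely for illustration, the function name `sentiment_near` and the five-word window are my own assumptions, and a real analysis would need a far larger lexicon and some handling of negation and irony.

```python
import re

# Hypothetical word lists, invented for illustration only.
POSITIVE = {"praised", "welcomed", "respected", "thoughtful"}
NEGATIVE = {"criticised", "condemned", "resign", "outrage"}

def sentiment_near(text, term, window=5):
    """Positive minus negative words within `window` words of each occurrence of term."""
    words = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, word in enumerate(words):
        if word == term.lower():
            # Words up to `window` positions either side of the term.
            nearby = words[max(0, i - window): i + window + 1]
            score += sum(w in POSITIVE for w in nearby)
            score -= sum(w in NEGATIVE for w in nearby)
    return score

# Invented sentences, not quotations from any source.
favourable = sentiment_near("The archbishop was praised for a thoughtful lecture.", "archbishop")
hostile = sentiment_near("Outrage as the archbishop is condemned and urged to resign.", "archbishop")
```

Scores of this kind, averaged by month, could then be plotted in the same way as an Ngram, to see whether the tone around a term shifts after a particular event.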

These, of course, are only some attempts to imagine what might be possible using the Domain Dark Archive. I shall be blogging more as the project progresses, and the possibilities become clearer.

Saturday, 14 April 2012