Analytical Access to the Domain Dark Archive: Researchers' final reports (1)

Our project researchers on AADDA have kindly written up the research they planned to do with the web archive, a summary of how it went and the problems that they encountered. I'll be posting these as blog posts over the next few months. Here is the first, from Helen Taylor:

AADDA Report: Sentiment Analysis and the Reception of the Liverpool Poets

My project and the AADDA: a lesson in ‘digging down’

When I proposed my research project for the Analytical Access to the Domain Dark Archive project, it was based on a ‘wish list’ of tools that scholars might want to use to access this resource. The tools my proposed project required were sentiment analysis, proximity search, and geo-indexing. This latter was not available during this test period, but the first two were. However, this report is not so much a record of my findings, but about not making assumptions with the data produced via these two tools.

I sought to access information about the reception of the Liverpool Poets (in practice, I focused solely on Adrian Henri). With the Domain Dark Archive I could find avenues – fan pages, forums, and the like – which would provide me with information to consider alongside newspapers, interviews, and archival material. I wanted to see what labels were attached to the poets, and how they were viewed, in informal recollections and non-academic contexts. I would then combine and compare this data with searches for the same terms from newspapers and published works. There is a marked difference in academic and popular attitudes to the poets, and the internet archival searches should be able to provide evidence for how the people who actually received the work viewed their experiences.

Methodology: considerations and consequences

It must be noted that the AADDA project involved only a slice of the full dataset, and that my results will almost certainly differ greatly when it goes live. (Just as an example, a search for “Adrian Henri” on the AADDA browser returns 1,847 results, compared to over 8,200 current UK hits on Google.) The lack of references is almost certainly due to the smaller dataset, rather than the data not being there at all (1).

Another issue was that very search term, “Adrian Henri”. Searching for just ‘Adrian’ or ‘Henri’ rather than ‘Adrian Henri’ is unhelpful in that it throws up results of which the majority are not relevant: ‘“Henri” NEAR “painter”’ might give you Matisse; ‘“Adrian” NEAR “poet”’ might give you Mitchell. My own research and interview experience has been that people are likely to refer to him as ‘Henri’ or as ‘Adrian’, so the fact that I was only searching for ‘Adrian Henri’ might have excluded some results. However, articles on online magazines and the like do usually follow academic and journalistic traditions of referring to the subject by their full name in the first instance, and then surname, so therefore are caught by the crawl.

I had to decide what labels to search for in relation to Henri, and my initial searches – using what terms I was already aware of – may have excluded other labels and ways of talking about Henri. I also found that my own academic assumptions were not the standard – there were 203 results for the label ‘Liverpool poet’, versus only 3 for ‘Merseybeat poet’, the term I am using in my thesis!

Search for ‘“Adrian Henri” AND …’	Number of items returned
“painter and poet”	5
“poet and painter”	2
“painter/poet”	5
“poet/painter”	10
“performance poet”	0
“performer”	10
“entertainer”	16

Fig 1 – examples of search terms and results

The five results for both “painter and poet” and “painter/poet” were all from the Tate Archives.(2) This – with search terms placing the artistic side of his output first – is not surprising, given that the Tate is an art gallery. It did surprise me that “performance poet” did not prove a useful search term, although this is perhaps an academic designation rather than a layman’s term – as evidenced by the results for “entertainer”. But none of these results can be taken at face value, as this report shall discuss.

Boolean searching: How near is NEAR?

These initial exploratory searches bring me to my first problem with the data. Throughout this report what I refer to as problems are not faults with the dataset or the browser but rather potential issues for the users interacting with it. Parameters for how close together the two search terms must be can differ, but I found that the NEAR search was sometimes not near enough here. I found two issues when reading the actual results: firstly, that the terms were often not that close together; and secondly, that the second term was not actually being used to discuss Henri:

Fig 2 - search result for “Adrian Henri” NEAR “painter”, post on www.ancestryaid.co.uk (3)

Therefore, the results in the table listed above are not a reliable source for enumerating the most common labels attached to Henri – one cannot rely on reading only the initial search results.
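One way to 'dig down' into a NEAR result is to measure how far apart the two terms actually sit in the page text. The following is a minimal sketch in Python, not the AADDA browser's own proximity logic (which is not documented here): the function names and the sample sentence are invented for illustration, and distance is counted simply in word tokens.

```python
import re

def tokenize(text):
    # Lower-case word tokens; punctuation is discarded.
    return re.findall(r"\w+", text.lower())

def phrase_positions(tokens, phrase_tokens):
    # Index of every place where the whole phrase starts in the token list.
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]

def min_token_distance(text, phrase_a, phrase_b):
    # Smallest gap, in tokens, between the starts of the two phrases.
    # Returns None when either phrase is absent from the text.
    tokens = tokenize(text)
    pos_a = phrase_positions(tokens, tokenize(phrase_a))
    pos_b = phrase_positions(tokens, tokenize(phrase_b))
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

# Invented sample text: "painter" appears, but not as a label for Henri.
sample = ("My grandfather was a painter in Leeds. "
          "Years later he went to hear Adrian Henri read.")
print(min_token_distance(sample, "Adrian Henri", "painter"))  # prints 9
```

A small distance still does not show that the label is being applied to Henri, so a check like this can only flag results for closer reading; it cannot replace reading them.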

Crawl dates: Encountering a display problem

I have already stated that some results could not be ‘clicked through’ and their content displayed past that initial search results page, such as the Tate results for “painter and poet” and “painter/poet”. There is therefore no way of knowing what the pages actually contained. At other times, there were results which could not be viewed for a different reason: they did not even appear on the search results page.

This revealed itself to me when running an exploratory query. After a basic search for “Adrian Henri”, one of the things that I noticed is that there is a ‘jump’ in the number of hits in the year 2000. Whilst this is not the highest number (2007 has 345), I thought that this could be explained by this being the year that he died – obituaries, tributes, more ‘noise’ around his name.

Fig 3 – showing results for “Adrian Henri” by crawl year (4)

Clicking through to filter these results by that year – and hoping to find relevant obituary results – I encountered my first problem. From 242 results on the initial search, the “Search found 202 items”:

Fig 4 – filtering “Adrian Henri” results by crawl year “2000” (5)

Furthermore, when clicking through to the second page of these already shrinking items, the number jumped down again to 186:

Fig 5 – filtering “Adrian Henri” results by crawl year “2000”, page 2 (6)

This was repeated elsewhere – for example, the following year, 2001, went from 53 potential results to 37 search items being displayed. It was not the case that the items were only those which could be ‘clicked through’ – as the Tate example above shows, those which the Wayback Machine could not display were still included in the search items.

One potential explanation for the discrepancy between the total number of results and number of items which the “search found” is that the results returned here might omit duplications, perhaps where a second crawl finds nothing different from the first. I am unsure whether this is a valid response, as I have found many instances of crawls where the Wayback Machine’s results are exactly the same from crawl to crawl. Furthermore, of the 242 results for 2000, 235 were from Amazon.co.uk, and not related to his death. I would, therefore, propose that the ‘jump’ came simply from there being more crawls in that year, as it must be remembered that the dates are dates at which the sites were recorded, not the dates at which the material was published.(7) Whatever the reason, this shows that the results must be interrogated further along the line from the initial search, as however innocent the numbers appear, they cannot be presented without ‘digging down’ to the actual website results themselves.
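A quick tally of results by host makes this kind of skew visible before any conclusions are drawn from a yearly total. The sketch below assumes the result URLs have already been copied out of the browser into a list; the function name is hypothetical and the sample URLs are invented placeholders, not actual AADDA results.

```python
from collections import Counter
from urllib.parse import urlsplit

def hosts_by_frequency(urls):
    # Count how many result URLs belong to each host, most frequent first.
    hosts = (urlsplit(u).hostname or "(unknown)" for u in urls)
    return Counter(hosts).most_common()

# Invented placeholder URLs, for illustration only.
sample_urls = [
    "http://www.amazon.co.uk/item-one",
    "http://www.amazon.co.uk/item-two",
    "http://www.example.org.uk/tribute.htm",
]
for host, count in hosts_by_frequency(sample_urls):
    print(host, count)
```

If one host accounts for nearly all of the hits in a given crawl year, as Amazon.co.uk did for 2000, the 'jump' is more likely to reflect crawl coverage than a change in how much was being written.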

Sentiment Analysis: Don’t take it on face value

Taking a quick look at the totals when doing a basic search for “Adrian Henri” reveals mostly neutral results, as one might expect from an analysis over a large amount of text, but the results are also far more positive than negative, if a sentiment is found – 136 “very positive” versus 11 “very negative”. However, this is another lesson in ‘digging down’ and not taking the results at face value.

Fig 6 – showing sentiment totals for the “Adrian Henri” search (8)

The success of sentiment analysis relies in part on how positivity or negativity is determined across the whole search parameters. This quote from a 1998 school newsletter is clearly – and does indeed appear under the term – very positive:

Many thanks to Stockport Art Gallery staff for the invitation to bring our Junior children to meet Adrian Henri, the famous artist and poet, on Wednesday 21 October. Adrian was terrific, telling us the stories behind many of the pictures currently on exhibition at the Gallery and reading from his poetry collections. We can really recommend a visit to see his work. Many thanks to Adrian for a great day with you in Stockport! (9)

However, other results which were listed as “very positive” must be discounted from this total for the same reason as the proximity searches above: the positive nature of the whole is not related to Henri’s part. See, for example, the discussion of Carol Ann Duffy’s The World’s Wife in an AQA English Literature Examiner’s Report from June 2005:

Once again, The World’s Wife proved highly popular: more centres study this text than any other on the paper. As last year, examiners were impressed by the enthusiasm and engagement with which many candidates approach Duffy’s poetry … Examiners were also concerned that intrusive, and often irrelevant, biographical material (such as lengthy character assassinations of Adrian Henri) prevented candidates from meeting the Assessment Objectives.(10)

Whilst this, therefore, means one cannot blithely cite all 136 “very positive” results in Henri’s favour, we also need to revise the total of “very negative” results. Firstly, of the 11 results, the 6 items which can be displayed are all the same Peter Finch interview:

Fig 7 – results page “Adrian Henri” with sentiment “very negative” (11)

And secondly, in this interview Henri actually appears very favourably:

The Liverpool Scene arrived, and with it the merging of music and poetry with Roger McGough, Brian Patten, Adrian Henri, and others. I eventually met Adrian Henri, who was also a painter, and the most interesting, I thought, of the three. We became frends and he pointed me in some new directions.(12)

The Wayback Machine has 12 captures of this page on this site, from October 2006 to July 2013. Each crawl obviously takes a snapshot of whatever is on the page at the time, and the crawl date is clearly indicated in the results, but the 11 apparently different “very negative” results are, in practice, all the exact same interview, the text of which has not changed (bar the removal of the first line under the title), although the formatting of the page itself has slightly changed (see the links beneath the header), as illustrated here:

Fig 8a – first Wayback Machine capture of www.argotistonline.co.uk (13)

Fig 8b – last Wayback Machine capture of www.argotistonline.co.uk (14)

I have suggested that one reason for the discrepancy between the total number of results and the items which can be displayed is that the duplications might not be shown, and the snapshots for this page do show that there have been changes over time, but what this also shows is the need to interrogate the results, at the level of those snapshots, rather than making assumptions based on the initial totals. Whilst this may be deliberately simplifying the issue, the message to take away here is not to take the results on face value: there aren’t 11 “very negative” results – there are none at all!
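The same check can be made less laborious by grouping captures whose text is identical, so that repeated crawls of one page are counted once. This is a minimal sketch under stated assumptions: it presumes the text of each capture has already been extracted, the function names are hypothetical, and the sample captures are invented. It matches only exact duplicates after whitespace and case are normalised, so a capture that differs by a single removed line (as with the Finch interview) would still fall into a separate group and need a near-duplicate comparison or a manual look.

```python
import hashlib
import re
from collections import defaultdict

def normalise(text):
    # Collapse runs of whitespace and ignore case before comparing captures.
    return re.sub(r"\s+", " ", text).strip().lower()

def group_captures(captures):
    # captures: iterable of (crawl_timestamp, page_text) pairs.
    # Returns {content digest: sorted list of crawl timestamps}.
    groups = defaultdict(list)
    for timestamp, text in captures:
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        groups[digest].append(timestamp)
    return {digest: sorted(stamps) for digest, stamps in groups.items()}

# Invented captures: two crawls of the same text, one of a different text.
sample_captures = [
    ("20070208", "An interview about   poetry.\n"),
    ("20061019", "An interview about poetry."),
    ("20130723", "A different page."),
]
groups = group_captures(sample_captures)
print(len(groups), "distinct texts from", len(sample_captures), "captures")
```

Counting distinct texts rather than captures gives a more honest total to set beside a figure such as the 11 “very negative” results.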

Brief Conclusions

This report has attempted to present some of the potential mishaps involved with looking at the Web Archive results on the surface, at face value. What my exploratory searches have shown is that one cannot make assumptions based purely on looking at the initial search results – you have to dig down.

Being involved in the AADDA project was certainly useful for my own research, as I found sources of information which I wouldn’t have found otherwise, such as pages which are no longer live, or places I hadn’t thought to look. It was also fascinating to read non-academic histories of performance poetry and the 1960s underground, where Henri and the Merseybeat poets appear as far more important than in ‘official’ criticism.(15) These histories were also presented as if public knowledge, proving my theory that those ‘ordinary’ people who received the work did have an idea of its importance, and that the audiences for this kind of poetry were significant, particularly in terms of recognising the legacy of the Merseybeat poets where academia has dismissed them. However, what my research experiences have been far more useful for, I believe, is pointing up some of the potential issues – both with the interface (display problems) and the users (making assumptions) – before the Domain Dark Archive goes live.

(1) I am aware of sites which were not included in the slice available for this initial project, as well as those without a UK domain suffix which are beyond the scope of the project, such as www.my-liverpool.co.uk or www.mudcat.org.

(2) The Tate Archive results could not be shown by the Wayback Machine, due to ‘robots.txt’ on the site – see http://web.archive.org/web/20060824234002/http://archive.tate.org.uk:80/DServe/dserve.exe?dsqServer=tg_calm&dsqApp=Archive&dsqDb=Catalog&dsqCmd=Browse.tcl&dsqSearch=*(RefNo='TAp*')&dsqKey=RefNo

(3) http://web.archive.org/web/20070514010256/http://www.ancestryaid.co.uk:80/boards/archive/index.php/t-928.html

(4) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian+henri%22&sort_by=solr_document&sort_order=ASC

(5) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian%20henri%22&sort_by=solr_document&sort_order=ASC&f[0]=crawl_year%3A%222000%22

(6) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian%20henri%22&sort_by=solr_document&sort_order=ASC&page=1&f[0]=crawl_year%3A%222000%22

(7) This is something which we have discussed at AADDA meetings, and I feel that the interface does make this clear; it is just something which should be stressed to users in any guidance material, to avoid misunderstanding.

(8) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian+henri%22&sort_by=solr_document&sort_order=ASC

(9) http://web.archive.org/web/19991008172118/http://webserv1.stockportmbc.gov.uk:80/pages/links/schools/primary/ourlarc/oct1998.htm

(10) http://web.archive.org/web/20060618094849/http://www.aqa.org.uk:80/qual/pdf/AQA-5741-6741-WRE-Jun05.pdf

(11) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian%20henri%22&sort_by=solr_document&sort_order=ASC&f[0]=sentiment%3A%22Very%20Negative%22

(12) http://web.archive.org/web/20070208145352/http://www.argotistonline.co.uk:80/Finch%20interview.htm

(13) http://web.archive.org/web/20061019024105/http://www.argotistonline.co.uk/Finch%20interview.htm

(14) http://web.archive.org/web/20130723093311/http://www.argotistonline.co.uk/Finch%20interview.htm

(15) See, for example, http://web.archive.org/web/19961221024212/http://www.users.dircon.co.uk:80/~dirkje/pjmanif.htm or http://web.archive.org/web/20020701043924/http://www.artcircus.org.uk:80/route/version5/paper/paper_article_detail.asp?idno=3