AADDA Report: Sentiment Analysis and the Reception of the Liverpool Poets
My project and the AADDA: a lesson in ‘digging down’
When I proposed my research
project for the Analytical Access to the Domain Dark Archive project, it was
based on a ‘wish list’ of tools that scholars might want to use to access this
resource. The tools my proposed project required were sentiment analysis,
proximity search, and geo-indexing. This latter was not available during this
test period, but the first two were. However, this report is not so much a
record of my findings as a caution against making assumptions about the data produced
via these two tools.
I sought to access information
about the reception of the Liverpool Poets (in practice, I focused solely on
Adrian Henri). With the Domain Dark Archive I could find avenues – fan pages,
forums, and the like – which would provide me with information to consider
alongside newspapers, interviews, and archival material. I wanted to see what
labels were attached to the poets, and how they were viewed, in informal
recollections and non-academic contexts. I would then combine and compare this
data with searches for the same terms from newspaper and published works. There
is a marked difference in academic and popular attitudes to the poets, and the
internet archival searches should be able to provide evidence for how the
people who actually received the work viewed their experiences.
Methodology: considerations and consequences
It must be noted that the AADDA
project involved only a slice of the full dataset, and that my results will
almost certainly differ greatly when it goes live. (Just as an example, a
search for “Adrian Henri” on the AADDA browser returns 1847 results, compared
to over 8,200 current UK hits on Google.) The lack of references is almost certainly due to the smaller dataset,
rather than the data not being there at all (1).
Another issue was that very
search term, “Adrian Henri”. Searching for just ‘Adrian’ or ‘Henri’ rather than
‘Adrian Henri’ is unhelpful in that it throws up results of which the majority
are not relevant: ‘“Henri” NEAR “painter”’ might give you Matisse; ‘“Adrian”
NEAR “poet”’ might give you Mitchell. My own research and interview experience
has been that people are likely to refer to him as ‘Henri’ or as ‘Adrian’, so
the fact that I was only searching for ‘Adrian Henri’ might have excluded some
results. However, articles on online magazines and the like do usually follow
academic and journalistic traditions of referring to the subject by their full
name in the first instance, and then by surname, and so are still caught by the
search.
I had to decide what labels to
search for in relation to Henri, and my initial searches – using what terms I
was already aware of – may have excluded other labels and ways of talking about
Henri. I also found that my own academic assumptions were not the standard –
there were 203 results for the label ‘Liverpool poet’, versus only 3 for
‘Merseybeat poet’, the term I am using in my thesis!
Search for ‘“Adrian Henri” AND …’ | Number of items returned
“painter and poet” | 5
“poet and painter” | 2
“painter/poet” | 5
“poet/painter” | 10
“performance poet” | 0
“performer” | 10
“entertainer” | 16
Fig 1 – examples of search terms and results
The five results for both
“painter and poet” and “painter/poet” were all from the Tate Archives.(2) This – with search terms placing the artistic side of his output first – is not
surprising, given that the Tate is an art gallery. It did surprise me that
“performance poet” did not prove a useful search term, although this is perhaps
an academic designation rather than a layman’s term – as evidenced by the
results for “entertainer”. But none of these results can be taken at face
value, as this report shall discuss.
Boolean searching: How near is NEAR?
These initial exploratory
searches bring me to my first problem with the data. Throughout this report
what I refer to as problems are not faults with the dataset or the browser but
rather potential issues for the users interacting with it. The parameters for how close
together the two search terms must be can differ, but I found that the NEAR search was
sometimes not near enough here. I found two issues when reading the actual
results: first, that the terms were often not that close together; and
second, that the second term was not actually being used to discuss Henri:
Fig 2 - search result for “Adrian Henri”
NEAR “painter”, post on www.ancestryaid.co.uk (3)
Therefore, the results in the
table listed above are not a reliable source for enumerating the most common
labels attached to Henri – one cannot rely on reading only the initial search
results.
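One way to check how near a NEAR hit really is would be to measure the gap, in words, between the two terms in the text of each result. The following is a minimal sketch only: the function name and the word-based definition of distance are my own assumptions, and it does not reproduce how the AADDA browser itself implements NEAR.

```python
import re


def tokenize(text):
    # Lower-case the text and split it into simple word tokens.
    return re.findall(r"\w+", text.lower())


def min_word_distance(text, phrase_a, phrase_b):
    """Return the smallest number of words separating the two phrases,
    or None if either phrase does not occur in the text.
    (Hypothetical helper, not part of any AADDA tool.)"""
    words = tokenize(text)
    a, b = tokenize(phrase_a), tokenize(phrase_b)

    def starts(phrase):
        n = len(phrase)
        return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase]

    best = None
    for i in starts(a):
        for j in starts(b):
            if j >= i:
                gap = j - (i + len(a))  # words between end of a and start of b
            else:
                gap = i - (j + len(b))  # words between end of b and start of a
            gap = max(gap, 0)           # overlapping phrases count as adjacent
            if best is None or gap < best:
                best = gap
    return best
```

A small gap only shows that the terms sit close together. It cannot show that the second term is actually being used of Henri, which still means reading the result itself.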
Crawl dates: Encountering a display problem
I have already stated that some
results could not be ‘clicked through’ and their content displayed past that
initial search results page, such as the Tate results for “painter and poet”
and “painter/poet”. There is therefore no way of knowing what the pages
actually contained. At other times, there were results which could not be viewed
for a different reason: they did not even appear on the search results page.
This revealed itself to me when
running an exploratory query. After a basic search for “Adrian Henri”, one of
the things that I noticed is that there is a ‘jump’ in the number of hits in
the year 2000. Whilst this is not the highest number (2007 has 345), I thought
that this could be explained by this being the year that he died – obituaries,
tributes, more ‘noise’ around his name.
Fig 3 – showing results for “Adrian
Henri” by crawl year (4)
Clicking through to filter these results by that year – and
hoping to find relevant obituary results – I encountered a discrepancy. The initial
search had shown 242 results for 2000, but the filtered page reported “Search found 202 items”:
Fig 4 – filtering “Adrian Henri” results
by crawl year “2000” (5)
Furthermore, when clicking
through to the second page of these already shrinking items, the number dropped
again to 186.
This was repeated elsewhere – for
example, the following year, 2001, went from 53 potential results to 37 search
items being displayed. It was not the case that the items were only those which
could be ‘clicked through’ – as the Tate example above shows, those which the
Wayback Machine could not display were still included in the search items.
One potential explanation for the
discrepancy between the total number of results and number of items which the
“search found” is that the results returned here might omit duplications,
perhaps where a second crawl finds nothing different from the first. I am
unsure whether this is a valid explanation, as I have found many instances of
crawls where the Wayback Machine’s results are exactly the same from crawl to
crawl. Furthermore, of the 242 results for 2000, 235 were from Amazon.co.uk,
and not related to his death. I would, therefore, propose that the ‘jump’ came
simply from there being more crawls in that year, as it must be remembered that
the dates are dates at which the sites were recorded, not the dates at which
the material was published.(7) Whatever the reason, this shows that the results must be interrogated further
along the line from the initial search, as however innocent the numbers appear,
they cannot be presented without ‘digging down’ to the actual website results
themselves.
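A quick way to see whether a ‘jump’ in one crawl year comes from a single site is to tally the result URLs by host. The sketch below assumes the result URLs for a given year have been copied into a list by hand. The function name and the example URLs are hypothetical and are not an export from the AADDA browser.

```python
from collections import Counter
from urllib.parse import urlsplit


def hits_by_host(urls):
    """Count how many result URLs come from each host.
    (Hypothetical helper for illustration.)"""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # treat www.example.co.uk and example.co.uk alike
        counts[host] += 1
    return counts


# Made-up URLs, for illustration only.
sample = [
    "http://www.amazon.co.uk/placeholder-1",
    "http://www.amazon.co.uk/placeholder-2",
    "http://www.example.org.uk/placeholder-3",
]
for host, n in hits_by_host(sample).most_common():
    print(host, n)
```

If one host accounts for almost all of a year's hits, the peak says more about what was crawled that year than about interest in the subject.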
Sentiment Analysis: Don’t take it at face value
Taking a quick look at the totals
when doing a basic search for “Adrian Henri” reveals mostly neutral results, as
one might expect from an analysis over a large amount of text, but the results
are also far more positive than negative, if a sentiment is found – 136 “very
positive” versus 11 “very negative”. However, this is another lesson in
‘digging down’ and not taking the results at face value.
Fig 6 – showing sentiment totals for the
“Adrian Henri” search (8)
The success of sentiment analysis
relies in part on how positivity or negativity is determined across the whole
search parameters. This quote from a 1998 school newsletter is clearly – and
does indeed appear under the term – very positive:
Many thanks to Stockport Art Gallery staff for the
invitation to bring our Junior children to meet Adrian Henri, the famous artist
and poet, on Wednesday 21 October. Adrian was terrific, telling us the stories
behind many of the pictures currently on exhibition at the Gallery and reading
from his poetry collections. We can really recommend a visit to see his work.
Many thanks to Adrian for a great day with you in Stockport! (9)
However, other results which were
listed as “very positive” must be discounted from this total for the same
reason as the proximity searches above: the positive nature of the whole is not
related to Henri’s part. See, for example, the discussion of Carol Ann Duffy’s The World’s Wife in an AQA English
Literature Examiner’s Report from June 2005:
Once again, The World’s Wife proved highly popular:
more centres study this text than any other on the paper. As last year,
examiners were impressed by the enthusiasm and engagement with which many candidates
approach Duffy’s poetry … Examiners were also concerned that intrusive, and
often irrelevant, biographical material (such as lengthy character
assassinations of Adrian Henri) prevented candidates from meeting the
Assessment Objectives.(10)
Whilst
this, therefore, means one cannot blithely cite all 136 “very positive” results
in Henri’s favour, we also need to revise the total of “very negative” results.
Firstly, of the 11 results, the 6 items which can be displayed are all the same
Peter Finch interview:
Fig 7 – results page “Adrian Henri” with
sentiment “very negative” (11)
And
secondly, in this interview Henri actually appears very favourably:
The Liverpool Scene arrived, and with it the merging
of music and poetry with Roger McGough, Brian Patten, Adrian Henri, and others.
I eventually met Adrian Henri, who was also a painter, and the most
interesting, I thought, of the three. We became frends and he pointed me in
some new directions.(12)
The
Wayback Machine has 12 captures of this page on this site, from October 2006 to
July 2013. Each crawl obviously takes a snapshot of whatever is on the page at
the time, and the crawl date is clearly indicated in the results, but the 11
apparently different “very negative” results are, in practice, all the exact same
interview, the text of which has not changed (bar the removal of the first line
under the title), although the formatting of the page itself has slightly
changed (see the links beneath the header), as illustrated here:
Fig 8a – first Wayback Machine capture
of www.argotistonline.co.uk (13)
Fig 8b – last Wayback Machine capture of
www.argotistonline.co.uk (14)
I have suggested that one reason
for the discrepancy between the total number of results and the items which can
be displayed is that the duplications might not be shown, and the snapshots for
this page do show that there have been changes over time, but what this also
shows is the need to interrogate the results, at the level of those snapshots,
rather than making assumptions based on the initial totals. Whilst this may be
deliberately simplifying the issue, the message to take away here is not to
take the results at face value: there aren’t 11 “very negative” results – there
are none at all!
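The same check can be sketched in code: group captures by a fingerprint of their text, so that repeated crawls of an unchanged page count once. This is an illustration under my own assumptions (the function names, the whitespace and case normalisation, and the toy captures are all hypothetical), not a description of how the archive handles duplicates.

```python
import hashlib
import re


def text_fingerprint(text):
    # Collapse whitespace and case so that formatting changes alone
    # do not make two captures look different.
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def distinct_texts(captures):
    """Group (capture_label, text) pairs by text fingerprint.
    Returns a dict mapping each fingerprint to its capture labels."""
    groups = {}
    for label, text in captures:
        groups.setdefault(text_fingerprint(text), []).append(label)
    return groups


# Toy captures, for illustration only.
captures = [
    ("capture-1", "The Liverpool Scene  arrived"),
    ("capture-2", "the liverpool scene arrived"),
    ("capture-3", "A different page entirely"),
]
print(len(captures), "captures,", len(distinct_texts(captures)), "distinct texts")
```

An exact fingerprint like this would treat a capture with one line removed as a different text, so near-duplicates such as the interview page discussed above would need a looser comparison, or simply reading the snapshots.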
Brief Conclusions
This report has attempted to
present some of the potential mishaps involved with looking at the Web Archive results
on the surface, at face value. What my exploratory searches have shown is that one
cannot make assumptions based purely on looking at the initial search results –
you have to dig down.
Being involved in the AADDA
project was certainly useful for my own research, as I found sources of
information which I wouldn’t have found otherwise, such as pages which are no
longer live, or places I hadn’t thought to look. It was also fascinating to
read non-academic histories of performance poetry and the 1960s underground,
where Henri and the Merseybeat poets appear as far more important than in
‘official’ criticism.(15) These histories
were also presented as if public knowledge, proving my theory that those
‘ordinary’ people who received the work did have an idea of its importance, and
that the audiences for this kind of poetry were significant, particularly in
terms of recognising the legacy of the Merseybeat poets where academia has
dismissed them. However, what my research experiences have been far more useful
for, I believe, is pointing up some of the potential issues – both with the
interface (display problems) and the users (making assumptions) – before the
Domain Dark Archive goes live.
(1) I am
aware of sites which were not included in the slice available for this initial
project, as well as those without a UK domain suffix which are beyond the scope
of the project, such as www.my-liverpool.co.uk
or www.mudcat.org.
(2) The Tate Archive results could not be shown by the
Wayback Machine, due to ‘robots.txt’ on the site – see http://web.archive.org/web/20060824234002/http://archive.tate.org.uk:80/DServe/dserve.exe?dsqServer=tg_calm&dsqApp=Archive&dsqDb=Catalog&dsqCmd=Browse.tcl&dsqSearch=*(RefNo='TAp*')&dsqKey=RefNo
(7) This is
something which we have discussed at AADDA meetings, and I feel that the
interface does make this clear, it is just something which should be stressed
to users in any guidance material, to avoid misunderstanding.