Tuesday, 26 May 2015

Richard Webber Discusses the Adoption of Geodemographic and Ethno-cultural Taxonomies for Analysing Big Data

by Richard Webber

During the 2015 General Election the Conservatives, Labour and the Liberal Democrats will have targeted millions, if not tens of millions, of political communications using the information they hold about Britain’s 50 million or so electors.  Which party forms Britain’s new administration could easily depend on the skill with which their data management teams select the right electors to target, the right message to target them with and the right people to deliver each message.  Winning General Elections is perhaps one of the most important uses to which the principles underlying Big Data are put.

The mass of data built up from decades of political doorstep and telephone canvassing is a critical input to the databases used to generate these communications.  But so too are the measures of status and cultural background with which each of these parties populates the data record it holds against each voter.

The measures they use do not correspond to the traditional definitions of social status and ethnic origin with which social scientists are familiar.  After all, there is no way the parties can establish the occupation of each individual elector.  Nor can they ask each elector which ethnic group they identify with.  Instead it is the elements of electors’ names and addresses that are used to make the inferences that populate each data record.  Electors’ status – and quite a lot more about them, such as whether they live in a gentrified neighbourhood of a commuter village – will be inferred from the categorisation that optimally fits their postcode. Their cultural backgrounds, including ethnic origin, religion and language, will be inferred from their personal and family names.
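
In practice this enrichment step amounts to little more than two lookups against each record. The sketch below is purely illustrative: the classifications actually used (Mosaic, Acorn and their name-based counterparts) are proprietary, and every postcode, surname, lookup table and category label here is invented for the sake of the example.

```python
# Purely illustrative sketch: the real postcode and name classifications
# (e.g. Mosaic, Acorn) are proprietary; all lookup tables, postcodes,
# surnames and category labels below are invented.

# Hypothetical postcode-to-neighbourhood-type lookup
POSTCODE_TYPE = {
    "AB1 2CD": "Gentrified neighbourhood of a commuter village",
    "XY9 8ZW": "Inner-city terraces",
}

# Hypothetical family-name-to-cultural-background lookup
NAME_ORIGIN = {
    "Kowalski": "Polish",
    "Patel": "Indian",
    "Evans": "Welsh",
}

def enrich_elector(record: dict) -> dict:
    """Append an inferred neighbourhood type and cultural background to an elector record."""
    record["neighbourhood_type"] = POSTCODE_TYPE.get(record["postcode"], "Unclassified")
    record["cultural_background"] = NAME_ORIGIN.get(record["surname"], "Unclassified")
    return record

print(enrich_elector({"surname": "Kowalski", "postcode": "AB1 2CD"}))
# {'surname': 'Kowalski', 'postcode': 'AB1 2CD',
#  'neighbourhood_type': 'Gentrified neighbourhood of a commuter village',
#  'cultural_background': 'Polish'}
```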

It is not because theories relating to voting preference have been developed using these categories that they should be used for targeted communications.  It is because they are the only practical ways to code the parties’ databases.  The classifications do, of course, need to be able to predict differences in electoral attitudes, behaviour and responses – and this they demonstrably can.

Political parties are not alone in using these tools.  Postcode classifications such as Mosaic and Acorn are routinely appended to the customer files of most large consumer-facing companies, whether in the US, the UK or a number of other European countries.  As tools for the analysis of customer behaviour they have many advantages over conventional questionnaires: they are non-intrusive, can be applied retrospectively, eliminate non-response and its consequent bias, are inexpensive to obtain, easy to append and compliant with data protection legislation.  Most important of all, they consistently provide high levels of discriminatory power, virtually irrespective of the behaviour analysed and typically as high as conventional measures of status and ethnicity.

Given the ubiquity of these systems in the commercial sector, it seemed curious to us three co-authors that they should be used so seldom by researchers working in university social science departments. This was particularly puzzling given the important role neighbourhood characteristics are held to play in both geographic and sociological theory. Unlike other “big data” sets, many of which are commercially confidential and/or whose use demands a high level of computational competence, these generic segmentations based on big data are not difficult to acquire.  Neither the task of appending these codes to research datasets nor the cost of obtaining them would seem to present an obstacle to their use.
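
To illustrate how modest the appending task is, here is a minimal sketch assuming the licensed classification arrives as a simple postcode-to-segment lookup file; the file layout, column names and segment labels are assumptions for the example, not any vendor’s actual format.

```python
# Minimal sketch of appending a postcode classification to a research dataset.
# Assumes pandas is available and that the classification is supplied as a
# postcode-to-segment lookup table; all data and labels below are invented.
import pandas as pd

# A small survey-style research dataset carrying respondents' postcodes
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "postcode": ["AB1 2CD", "XY9 8ZW", "AB1 2CD"],
    "voted_last_election": [True, False, True],
})

# Hypothetical lookup table as a vendor might supply it
lookup = pd.DataFrame({
    "postcode": ["AB1 2CD", "XY9 8ZW"],
    "segment": ["Liberal Opinion", "Industrial Legacy"],
})

# Appending the codes is a single left join on postcode
enriched = survey.merge(lookup, on="postcode", how="left")
print(enriched)
```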

Given our backgrounds in marketing and market research, in government and in the qualitative social sciences, we thought it would be useful to pool our hundred years of combined experience so as to better understand the factors contributing to the uneven adoption of these systems between commercially based and university based researchers.

We felt the starting point for such an investigation should be an audit of the features of these new classification systems which variously attract or repel different groups of researchers.  For example, the use of labels such as “Liberal Opinion”, which many commercially employed researchers find a helpful summarisation of an important segment of the population, is evidently off-putting to many university employed researchers on account of its imprecision and lack of theoretical justification.  From our respective experiences it was also our belief that institutional factors are particularly important in explaining differences in adoption levels.  Data managers working for political parties, for example, have good opportunities to evaluate the discriminatory power of such systems against behavioural data; they are given considerable autonomy; they do not have to justify their statistical methods to a wider reference group; the use of these systems supports a long-term operational requirement; and “what delivers” is ultimately a more important consideration than whether or not these systems embody particular theories.

By detailing many different considerations of this sort we concluded that it would be possible to develop a more general typology of research environments according to the pressures either favourable or hostile to the adoption of these systems.  We were able to identify seven quite distinct research environments that we could characterise in this way. Each one differed in the extent to which it was open to the adoption of these new classifications.

The theoretical conclusion we reached as a result of our investigation was as follows. The categories on which much social scientific theory is built necessarily depend on the categorisations which it is practical for researchers to link to behavioural data.  So long as the conventionally structured survey questionnaire was the exclusive source of quantitative data, these would be social class and ethnic origin.  But such forms of categorisation easily become institutionalised as a result of the inter-dependence between theory and the categories on the basis of which theory is developed.  When alternative sources of evidence become available, as for example big data, it is no longer desirable to base analysis exclusively on the categorisation systems which were designed for collection via survey questionnaires.  Other generic, synthetic representations of demographic constructs become more practical for the analysis of big data, not least because, as in the example of “Liberal Opinion”, they incorporate multiple dimensions into a single classification system.

Perhaps most important of all, new evidence of this sort fosters an entirely new set of hypotheses and, as a result, opens up the academic establishment to fresh lines of research enquiry.

About the author

Richard Webber is the originator of the UK-based geodemographic classifications Acorn and Mosaic, and for many years managed the micromarketing divisions of first CACI and then Experian. In recent years he has championed the use of personal and family names as a means of inferring people’s cultural background, a tool used by the political parties to support their election campaigns. His colleague in this venture is Trevor Phillips.  Richard Webber is a fellow of the Institute of Direct Marketing and of the Market Research Society, and a visiting professor in the Department of Geography, King’s College London.