An Introduction to my PhD in N Parts: Part 2/N – Constructing a Corpus of Corporate Fraud

When you’re considering using a corpus, there’s one particular question that you need to answer first: how are you going to get that corpus in the first place?

Several of my friends/colleagues are using transcriptions of legal and political events, which come with their own set of problems, not the least of which is bureaucracy (and in one case, rubbish handwriting).

I didn’t have that problem. Since I’m using newspaper articles that were published relatively recently, I could use Lexis Nexis (LN), which collects articles published by a wide variety of newspapers every day. The articles are already digitised, and LN lets the researcher download them as .txt files, which can be readily read by most, if not all, mainstream corpus software.

No problem.

Here’s my issue: ‘corporate fraud’ is a terribly, terribly ill-defined concept. Furthermore, there’s a discrepancy between the concept in Criminology, and the actual, colloquial use of the phrase. My starting point in the entire PhD was then to provide a suitable definition of corporate fraud. After consulting Criminological texts, legal dictionaries, the OED, the BNC, and the COCA, I finally defined corporate fraud as

Those cases in which a corporation or a (number of) employee(s) of a corporation, for the benefit and on behalf of said corporation, act(s) in a manner that conceals or misrepresents the state or situation of a good, service or case, resulting in negative consequences for other individuals, legal persons, or for society as a whole.

Which is a reasonable definition, but it did little to help me create a corpus. Thankfully, I did not have to invent a method all on my own. Those excellent corpus and CDA people at the University of Lancaster did something similar a few years ago when they created the RASIM corpus. This corpus included newspaper articles on Refugees, Asylum Seekers, and (Im)Migrants. When I first came across this corpus in my reading, it was actually called the RAS corpus, which is nicely coincidental. In order for any article to be collected for their corpus, it needed to
1) have been published in at least one of their selected 19 newspapers
2) have been published between 1996 and 2006
3) contain at least one mention of RASIM

You may spot one particular difference between their corpus and mine: RASIM are relatively well-defined concepts, whereas corporate fraud is not. True, Gabrielatos (2007) wrote a full article about their query terms, but in the end, Gabrielatos and Baker (2008) were able to write their query as one search phrase.

This didn’t work for me. So I had to figure out a slightly more roundabout method: lots of search terms.

I started by reading books on corporate fraud to figure out the major cases. This also showed some of the major regulatory and offending parties. From there on, by skim-reading some of the articles I found, I snowballed further search terms. At the end, I took all the articles published in all of my newspapers on a handful of random dates, to see whether there were any major cases I missed.

As my searches were necessarily imprecise, there was an awful lot of noise, or irrelevant material, in my hits. Therefore, rather than simply including all articles which mentioned at least one of the search terms once, I attempted to filter out all irrelevant hits. Biber et al. (1998) mention a handful of very pertinent practical considerations when creating a corpus, including the temporal factor – what they don’t mention, however, is how much it can feel like you’re slowly going mad when scrolling through millions of Lexis Nexis hits (although LN wasn’t around in 1998).

Anyway, so my articles had to fulfil the following requirements:
1) have been published in one of my seven selected newspapers (Daily Mail, Daily Telegraph, Financial Times, Guardian, Mirror, Sun, Times)
2) have been published between 1/1/2004 and 31/12/2014
3) contain at least one of my 126 search terms at least once
4) be relevant following my definition of corporate fraud
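
The first three requirements are mechanical, so they can be sketched in a few lines of Python. This is only an illustration, not the procedure I actually followed in Lexis Nexis: the function name is hypothetical and the two search terms are placeholders, not items from my real list of 126. The fourth requirement, relevance, cannot be automated this way and still needs a human reader.

```python
from datetime import date

# Requirement 1: the seven selected newspapers.
NEWSPAPERS = {"Daily Mail", "Daily Telegraph", "Financial Times",
              "Guardian", "Mirror", "Sun", "Times"}

# Requirement 2: the publication window.
START, END = date(2004, 1, 1), date(2014, 12, 31)

# Requirement 3: placeholder terms for illustration only (assumption);
# the real list has 126 entries.
SEARCH_TERMS = ["price fixing", "accounting scandal"]


def passes_automatic_criteria(newspaper, published, text, terms=SEARCH_TERMS):
    """Check requirements 1-3; requirement 4 (relevance) needs a human."""
    if newspaper not in NEWSPAPERS:
        return False
    if not (START <= published <= END):
        return False
    lowered = text.lower()
    # One mention of one term is enough.
    return any(term in lowered for term in terms)
```

Plain substring matching like this is deliberately crude, which is exactly why the hits contain so much noise.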

This then resulted in an absolute mass-load of .txt files. The RASIM-researchers have all articles neatly contained in files, one for every month. I wanted something similar, but if possible one file per article. Furthermore, if there was a way to extract all metadata from the files and stick this in, say, an Excel-file, that would be great.

The excellent Chris Norton of the University of Leeds Linguistics Department wrote me a Python script which did exactly that. The script separated all articles and stored them as .txt-files in year and paper folders (so that the location on my USB drive is I:/Corpus/[Year]/[Newspaper]). It also extracted the metadata and put it in a .csv-file, which is Excel-compatible. This file contains the following columns:

  • Filename
  • Newspaper
  • Authors
  • Word count
  • Page number
  • Section
  • Title
  • Date of publication

With everything in Excel, I could very clearly see that I had quite a few duplicates. I removed these manually.
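
I did this by hand, but part of it could be automated. The sketch below is an alternative I did not use: it flags articles whose text is identical once case and whitespace are ignored. The function names are hypothetical, and it assumes the articles are available as a dictionary mapping filename to text. It will not catch near-duplicates, such as a second edition with one changed paragraph, so a manual check would still be needed.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def find_duplicates(articles):
    """Return groups of filenames whose normalised text is identical."""
    groups = defaultdict(list)
    for filename, text in articles.items():
        groups[fingerprint(text)].append(filename)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```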

So, after weeks of maddening filtering, what am I left with?

  • The corpus has slightly over 53m tokens
  • The corpus has slightly over 90,000 news articles about corporate fraud
  • Broadsheets make up slightly under 47m tokens, slightly over 76k articles
  • Tabloids make up slightly under 7m tokens, slightly over 14k articles
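
Totals like these can be tallied from the metadata file. The sketch below is illustrative only: it sums the "Word count" column, which will not exactly match the token counts reported by corpus software, and the split into tabloids and broadsheets is the conventional one for these seven papers, which I am assuming here rather than quoting from my own files. The function name is hypothetical.

```python
from collections import Counter

# Assumed grouping; every other selected paper counts as a broadsheet.
TABLOIDS = {"Daily Mail", "Mirror", "Sun"}


def summarise(rows):
    """Tally word counts and article counts per newspaper type."""
    tokens, articles = Counter(), Counter()
    for row in rows:
        group = "tabloid" if row["Newspaper"] in TABLOIDS else "broadsheet"
        tokens[group] += int(row["Word count"])
        articles[group] += 1
    return tokens, articles
```

The rows can come straight from csv.DictReader applied to the metadata file.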

Not bad for an ill-defined topic that many Criminologists claim is under-represented in modern British news media.


Biber, D., Conrad, S. and Reppen, R. (1998). Corpus Linguistics: investigating language structure and use. Cambridge: CUP.

Gabrielatos, C. (2007). Selecting query terms to build a specialised corpus from a restricted-access database. ICAME Journal, 31, pp.5-43.

Gabrielatos, C. and Baker, P. (2008). Fleeing, sneaking, flooding: A corpus analysis of discursive constructions of refugees and asylum seekers in the UK Press 1996-2005. Journal of English Linguistics 36(1), pp.5-38.


Author: Ilse A Ras

There are times when I am doing research on crime news and language; sometimes I'm obsessed, sometimes I'm bored, and sometimes my tea is getting cold.
