Cyberleagle: The content v metadata contest at the heart of the Investigatory Powers Bill

After more than 30 hours of Commons Committee debate and 1,000 or so proposed Opposition amendments, the Investigatory Powers Bill is moving on to its Report stage. Now is a good time to revisit one of the most fundamental points in the Bill: the dividing line between content and metadata.

This is especially topical in view of reports that David Anderson Q.C. is to undertake an independent review of the operational case for bulk powers. As will become apparent, the dividing line is not the same for every power. This has particular relevance for bulk interception and equipment interference.

Sensitivity of content and metadata

Why does the distinction between content and metadata matter?

The government’s position, which finds support in human rights law, is that intercepting, acquiring, processing and examining the content of a communication is more intrusive than doing the same with the “who, when, where, how” contextual data wrapped around it.

Others argue that, as the mobile internet and smartphone apps have made metadata ever richer and more revealing, the difference in intrusiveness has become less marked. We know from the Report of the Intelligence and Security Committee of Parliament in March 2015 that the intelligence agencies value metadata as much as, if not more than, content.

Be that as it may, under the Bill fewer safeguards and constraints apply to selection and examination of metadata than to content.

Content and metadata separation

Where to draw a line between content and metadata is not necessarily obvious. There is no assurance, come the inevitable human rights scrutiny, that courts applying ECHR Articles 8 and 10 or the EU Charter will draw a dividing line in the same place as domestic legislation.

In fact the Bill creates different dividing lines between content and metadata for different purposes: one version for mandatory retention and acquisition of communications data from service providers and another for communications interception and equipment interference. The latter designates more information as metadata and less as content.

This is perhaps not wholly surprising, since the Anderson Review (10.28) was sympathetic to the usefulness of content-derived metadata. Whether the possible extent of the change wrought by the Bill is generally appreciated is another matter.

Consequences of the demarcation

The demarcation between content and metadata has significant practical consequences. The Snowden disclosures suggest that GCHQ has bulk intercepted and stored metadata by the tens of billions of records. Even where such 'related communications data' (the term used in the Regulation of Investigatory Powers Act (RIPA)) is gathered as the by-product of an overseas-focused bulk interception campaign, the agency is able to look in the resulting metadata pool for information about people known to be in the British Islands.

Under RIPA it cannot do that for content without obtaining special Ministerial authorisation. Under the Bill that would need a targeted examination warrant. In Committee the government resisted an amendment that would have extended the requirement for a targeted examination warrant to include metadata as well as content.

Does Parliament have enough information to know where the line is drawn?

The Commons Science and Technology Committee and a Joint Parliamentary Committee scrutinised the draft Bill. Neither seemed confident that it (or anyone else) understood where the legislation drew the line between content and metadata.

The Joint Committee identified the definitions of communications data and content as one of the most common concerns among witnesses. Its Recommendation 1 said:

“Parliament will need to look again at this issue when the Bill is introduced. We urge the Government to undertake further consultation with communications service providers, oversight bodies and others to ascertain whether the definitions are sufficiently clear to those who will have to use them.”

For bulk interception the Committee noted the concerns of witnesses about the distinction between ‘related communications data’ and content. It recorded my own suggestion that:

“The Home Office could usefully produce a comprehensive list of datatype examples, where appropriate with explanations of context, categorised as to whether the Home Office believes that each would be entity data, events data, contents of a communication, data capable of being related communications data when extracted from the contents of a communication and so on.”

The Science and Technology Committee had previously noted that the government, in seeking to future-proof the legislation, had produced definitions that had led to significant confusion on the part of communications service providers and others. It said that definitions such as ‘communications content’ needed to be clarified as a matter of urgency.

The closest that the Home Office has come to producing a systematic analysis is in Annex A to evidence submitted to the Joint Committee, categorising a selection of datatypes. This arrived too late to be considered by most witnesses and was light on analysis of why particular items fell on one side of the line or the other.

Since then, the Bill as introduced into Parliament in March 2016 has revised some of the definitions. Most significantly it replaces 'related communications data' with ‘secondary data’. This, explained the government:

“[makes] clear that it is broader than communications data. This clarifies the distinction between this type of data and the narrower class of data available under a communications data authorisation.” (emphasis added)

The government published draft Codes of Practice alongside the Bill. In principle the wealth of explanation in these and other sources – Explanatory Notes, Home Office evidence, fact sheets, operational cases, Ministerial statements in Committee, Home Office letters to the Committee and so on – should help us understand where the dividing line lies.

How does the Bill draw the line?

Any attempt to draw a line between content and metadata has to avoid circularity: “Why is this information not content?” “Because it is less sensitive.” “Why is this information less sensitive?” “Because it is not content.”

The Bill's new definition of content (there is no existing definition in RIPA) turns on whether data reveals anything of what might reasonably be considered to be the meaning (if any) of a communication. The Joint Committee commented on the draft Bill:

The impression of having to perform metaphysical gymnastics is bolstered when we are introduced to the concept of ‘inferred meaning’. Paragraph 2.14 of the draft Interception Code of Practice says:

“There are two exceptions to the definition of content set out in section 223(6). The first is there to address inferred meaning. When a communication is sent, the simple fact of the communication conveys some meaning, e.g. it can provide a link between persons or between a person and a service. This exception makes clear that any communications data associated with the communication remains communications data and the fact that some meaning can be inferred from it does not make it content.”

If anything this confirms Paul Bernal’s concern that since meaning can be derived from almost any data, a dividing line based on the existence of meaning is problematic.

What is the practical result of the Bill’s definitions?

Since the Bill draws the line in different places for different purposes the practical result depends on which set of definitions is used. One set applies to interception and equipment interference, the other to retention and acquisition of communications data.

The interception variety of metadata is ‘secondary data’. For equipment interference it is the similar ‘equipment data’. Both consist of either ‘systems data’ or ‘identifying data’. Systems data is a critical definition, since S.223(6) lays down that if something is systems data it cannot be content.

The overriding nature of the systems data definition relieves the intercepting or interfering agency of the need to grapple with questions of the ‘meaning’ of the communication. The draft Interception Code of Practice notes that in practice the agency will only have to decide whether information fits within the definition of systems data. If so, it cannot be content even if it reveals some of the meaning of the communication.

The Bill will also enable ‘identifying data’ to be extracted from the contents of a communication and treated as secondary data. Under RIPA, information such as an e-mail address embedded in a web page is treated as content. Under the Bill, intercepting and interfering agencies would be able to scrape such data from the body of a communication and treat it as metadata.

For retention and acquisition of communications data, metadata is either ‘entity data’ or ‘events data’. Here the position is reversed: content takes precedence. If information reveals anything of the meaning of the communication (beyond the mere fact or transmission of the communication) then for these purposes it is content, even if for interception or equipment interference purposes it would be systems data. The ‘identifying data’ scraping exception does not apply.

The result is that some types of information may be treated as metadata for the purposes of interception and equipment interference, but as content for the purposes of communications data retention and acquisition.

This overlap of content and metadata is not merely theoretical. The draft Communications Data Code of Practice suggests that some communications may consist entirely of systems data (and thus be deemed to contain no content). The draft Equipment Interference Code of Practice gives the example of machine to machine messages between items of network infrastructure to enable the system to manage the flow of communications.
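
The two orders of precedence described in this section can be sketched in code. This is only an illustration of the ordering, not a statement of how any agency or service provider would apply the Bill. The names `Regime`, `Item` and `classify` are hypothetical, and every flag stands in for a legal judgment that software cannot make.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    INTERCEPTION_OR_EI = "interception / equipment interference"
    COMMS_DATA = "retention / acquisition of communications data"


@dataclass(frozen=True)
class Item:
    # Hypothetical flags. Each stands for a legal judgment, not a computed fact.
    is_systems_data: bool
    is_extractable_identifying_data: bool
    is_entity_or_events_data: bool
    reveals_meaning: bool  # beyond the mere fact or transmission of the communication


def classify(item: Item, regime: Regime) -> str:
    if regime is Regime.INTERCEPTION_OR_EI:
        # Systems data wins: it cannot be content even if it reveals some meaning.
        if item.is_systems_data:
            return "metadata"
        # Identifying data can be extracted from content and treated as metadata.
        if item.is_extractable_identifying_data:
            return "metadata"
        return "content" if item.reveals_meaning else "undetermined"
    # Communications data regime: content wins, and the scraping exception is ignored.
    if item.reveals_meaning:
        return "content"
    return "metadata" if item.is_entity_or_events_data else "undetermined"


# An item that is systems data but also reveals some meaning.
example = Item(
    is_systems_data=True,
    is_extractable_identifying_data=False,
    is_entity_or_events_data=True,
    reveals_meaning=True,
)
print(classify(example, Regime.INTERCEPTION_OR_EI))  # metadata
print(classify(example, Regime.COMMS_DATA))          # content
```

The same item comes out as metadata under one regime and content under the other, purely because the two checks are made in a different order.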

Testing the content/metadata dividing line

The most comprehensive way of testing the dividing line between content and metadata is to take a large number of examples of different types of information and assess on which side of the line they would fall.

I have adopted a different approach: take a short e-mail and evaluate which of its components might count as content and which as metadata.

For this exercise I have used the version of the dividing line that contrasts content with ‘secondary data’. This applies to targeted, thematic and bulk interception warrants. It replaces ‘related communications data’ under RIPA. As we have seen, ‘secondary data’ is generally broader than the ‘communications data’ definition used for mandatory retention and acquisition.

Here is my sample e-mail.

An initial impression is probably that the From/To and Sent fields are metadata and everything else is content. Indeed that is the current position under RIPA. When we turn to the Bill, however, things seem to be rather different. It appears that most of the e-mail may be either systems data, or identifying data that can be extracted and treated as metadata.

Of course only the visible parts of the e-mail are shown. More datatypes will be lurking in the header. Depending on exactly what they contain those are likely to be secondary data.

To understand how what looks like e-mail content can become metadata, we need to delve more deeply into the definition of 'secondary data'.

What is secondary data?

S.120 of the Bill provides that secondary data, in relation to any communication transmitted by means of a telecommunication system, means any data falling within either of two subsections:

Subsection (4) is systems data which is comprised in, included as part of, attached to or logically associated with the communication (whether by the sender or otherwise). In general terms systems data is data that enables or facilitates a telecommunication system or service, a system holding a communication, or a service provided by such a system, to function. It is not limited to the system that is conveying the communication in question. For a graphical representation of the full definition of systems data, see here.

Subsection (5) concerns identifying data. Like systems data it must be comprised in, included as part of, attached to or logically associated with the communication (whether by the sender or otherwise). Unlike systems data it must also be capable of being logically separated from the remainder of the communication; and, if it were separated, must not “reveal anything of what might reasonably be considered to be the meaning (if any) of the communication, disregarding any meaning arising from the fact of the communication or from any data relating to the transmission of the communication.”

This last condition mirrors the Bill’s general definition of content. It raises the perplexing question of what (and how much) information can be extracted from the content of a communication without revealing anything of the meaning of the communication. Examples given in the Explanatory Notes include:

the location of a meeting in a calendar appointment;

photograph information - such as the time/date and location it was taken; and

contact 'mailto' addresses within a webpage

The first two of these examples reveal a possibly surprising feature of identifying data. The data can, it seems, relate to matters such as a real world meeting or the taking of a photograph that are not an aspect of a communication.

This conclusion follows from the definition of ‘identifying data’, which includes data which may be used to identify any person, apparatus, system or service, any event, or the location of any person, event or thing. Events are, apparently, not limited to events forming part of the use of a communications system. Data may relate to the fact of the event, the type, method or pattern of event, or the time or duration of the event. For a graphical representation of the full definition of identifying data, see here.

The Home Office in its evidence to the Joint Committee said: “It is also possible for certain structured data types to be extracted from the content of a communication”. In the Bill neither the systems data nor identifying data definitions appear to be restricted to structured data (and the definition of ‘data’ is certainly not limited in that way).

Identifying data must be capable of being logically separated from the content of the communication. Does that imply some element of structure in the extractable data? It may just mean that physical separation is unnecessary. In the Bill Committee on 12 April 2016 the Minister said: “For example, if there are email addresses embedded in a webpage, those could be extracted as identifying data.”

Another conundrum is whether each item of identifying data has to be evaluated separately in determining whether it reveals anything of the meaning of the communication, or whether extracted items of identifying data should be considered cumulatively.

For the purposes of analysing my sample e-mail I have made two assumptions: that unstructured information can, for the purposes of the Bill, be “logically separated” from the rest of the communication (whether that is technically possible is another matter); and that extracted elements of identifying data are not considered cumulatively. These are points on which further elucidation would be desirable.
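
The difference between evaluating extracted items separately and evaluating them cumulatively can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, and `toy_test` is an invented stand-in for the legal question of whether data reveals anything of the meaning of a communication. It simply supposes that meaning is revealed once the type, time and place of an event are all known together.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical stand-in for the legal test "reveals anything of the meaning".
MeaningTest = Callable[[FrozenSet[str]], bool]


def extract_per_item(items: Iterable[str], reveals_meaning: MeaningTest) -> List[str]:
    """Reading 1: each candidate item is tested on its own."""
    return [i for i in items if not reveals_meaning(frozenset([i]))]


def extract_cumulatively(items: Iterable[str], reveals_meaning: MeaningTest) -> List[str]:
    """Reading 2: an item is kept only if the items extracted so far,
    taken together with it, still reveal no meaning."""
    kept: List[str] = []
    for i in items:
        if not reveals_meaning(frozenset(kept + [i])):
            kept.append(i)
    return kept


def toy_test(extracted: FrozenSet[str]) -> bool:
    # Toy assumption: meaning is revealed only when event type, time and
    # location are all present together.
    return {"Meet", "Wednesday", "Red Lion"} <= extracted


fragments = ["Meet", "Wednesday", "Red Lion"]
per_item = extract_per_item(fragments, toy_test)
cumulative = extract_cumulatively(fragments, toy_test)
print(per_item)    # ['Meet', 'Wednesday', 'Red Lion']
print(cumulative)  # ['Meet', 'Wednesday']
```

On the per-item reading all three fragments can be extracted; on the cumulative reading the last one cannot. Which fragment is excluded depends on the order of extraction, which is an artefact of this sketch rather than anything found in the Bill.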

Analysis of sample e-mail

Below is a marked up version of the e-mail. All the highlighted text could, it seems, be either systems data (yellow) or identifying data (orange).

The “From”, “To” and “Sent” fields fit the definition of systems data, as data facilitating the functioning of a telecommunications service. This is unsurprising and corresponds to the existing position under RIPA.

An e-mail 'Subject' line is content. However, as the draft Equipment Interference Code of Practice explains in relation to equipment data, elements of the subject line may be capable of being extracted and treated as metadata: “the text in the subject line would not be equipment data (unless separated as identifying data).”

So consider “last night’s call”. The word “call” appears to be identifying data, since it identifies both the fact and type of an event (S.225(2)(b), (3)(a) and (b)). The words “last night’s” relate to the time of the event (S.225(3)(c)).

“Bill” and “Graham” both identify, or may assist in identifying, persons (S.225(2)(a)).

“Meet”, “Wednesday” and “Red Lion” all appear to be identifying data. “Meet” relates to the type of event (S.225(2)(b), (3)(b)), “Wednesday” to its time (S.225(3)(c)) and “Red Lion” to the location of the event (S.225(2)(c)). The fact that this is a real world event rather than a communications event does not appear to prevent it being identifying data. The Explanatory Note gives an example of the location of a meeting in a calendar appointment. It would be odd if information sent in a calendar appointment was treated differently from the same information sent in an e-mail.

“DM”. It is possible that this is systems data, describing something connected with enabling or facilitating the functioning of a telecommunications service. If not, it appears to be identifying data as assisting in identifying a service (S.225(2)(a)).

“@cyberleagle” is probably systems data (there is no apparent requirement that the data should relate to the means used to send the intercepted communication itself). If not, this is identifying data.

If this tentative analysis is correct, the secondary data (and equipment data) provisions of the Bill would represent a significant change to the existing content/metadata boundary under RIPA.

Despite all the supporting Bill materials, these provisions remain a challenge to understand. If Parliament is to have a properly informed debate on these matters, a fully detailed and reasoned Home Office explanation of what data falls within each category, and why, would be helpful.