In recent years, the amount of data being collected and processed has risen dramatically. Every cybersecurity framework that we've seen has experienced immense growth in data ingress, as newer technologies (e.g. Hadoop and friends) allow for cheaper and more scalable data collection. At last count, a framework I am familiar with was processing 100K events / second. Those events came from a variety of products, including a lot of custom application log files with possibly sensitive data.
Unfortunately, this brings along a set of new problems:
- products that do not centralize data collection (i.e. each product reinvents the wheel to collect, process, and store data)
- a lack of a common storage schema, leading to data being dumped in a way that can only be retrieved by a specific product
- and thus, the inability to apply organization-wide data policies in any consistent way.
The point of this article is not to help resolve all of the above (email me if you do need help with that ;)) but rather to address the last point. As the data that an organization collects grows, chances are that the CISO office (or whoever is responsible for the data security policies) will have come up with guidance on data classification and how best to protect the data.
Hypothetical examples of data classification would be:
- Personally Identifiable Information (SSN, Licenses, etc.)
- Payment Card Information (Card numbers, CVV2, CVC2, etc.)
- Protected Health Information (Physical or mental health conditions, treatment records, etc.)
- Authentication Information (Passwords, encryption keys, etc.)
And then, a set of policies on how best to protect that data throughout the organization:
- For any PII: use SHA-256
- For any PCI or PHI: use AES-256
- For encryption keys: use RSA-2048
- For passwords: use bcrypt w/12 rounds
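A policy like the one above is most useful when it lives in one place as data rather than being hard-coded into each product. A minimal sketch (the `POLICY` mapping and `policy_for` helper are hypothetical names, not from any real product) might look like:

```python
# Hypothetical sketch: the org-wide protection policy as a single,
# centrally versioned mapping from data tag to mechanism. Changing
# one value here (e.g. bcrypt rounds from 12 to 15) updates the
# policy for every consumer at once.
POLICY = {
    "PII":      {"mechanism": "SHA-256"},
    "PCI":      {"mechanism": "AES-256"},
    "PHI":      {"mechanism": "AES-256"},
    "enc_key":  {"mechanism": "RSA-2048"},
    "password": {"mechanism": "bcrypt", "rounds": 12},
}

def policy_for(tag):
    """Look up the protection mechanism for a given data tag."""
    return POLICY[tag]
```

Consumers then ask the central service for `policy_for("password")` instead of deciding on their own.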
This guidance is critical: it takes the guesswork out of an individual team's or developer's decision about what to use to protect a given piece of data. It also allows the guidance to be updated and enforced in a central way, e.g. changing the bcrypt rounds from 12 to 15. The problem is that most organizations suffer from the issues mentioned at the beginning of this article. And in the absence of a centralized collection framework, it is critical to have the next best thing - a central data tagging and policy service.
Think about the following micro-service:
- Which takes the following inputs:
- Data: { "username":"nshah", "pword":"123456", "ssn":"111-22-3333" }
- With a user-supplied initial tagging: { "pword":"password" }
- And a static organization wide policy: { "PII":"SHA-256", "password":"bcrypt(12)" }
- Performs these steps:
- Automated data classification: figure out that "ssn" is of type PII and add that to the tags: { "pword":"password", "ssn":"PII" }
- Apply policy: look up the calling user's key using their identity and transform the "pword" and "ssn" fields using the appropriate policies.
- And returns:
- Data with policy transformations applied: { "username":"nshah", "pword":"hash:v1:$2a$12$wCQ4Vmi1R2SPRyHGB4lc8OlHDa1CjcD7/SHg8lEG1..U8ChRy4g12", "ssn":"key:v1:MTExLTIyLTMzMzM=" }
- New tagging: { "pword":"password", "ssn":"PII" }
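The steps above can be sketched end to end. This is a hypothetical illustration, not a real implementation: the SSN regex is a toy classifier, base64 stands in for real AES encryption with a vault-managed key (producing the "key:v1:" form), and `pbkdf2_hmac` with a fixed salt stands in for bcrypt to keep the example dependency-free.

```python
import base64
import hashlib
import json
import re

# Toy classifier rule: a string shaped like NNN-NN-NNNN is tagged PII.
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def classify(data, tags):
    """Automated classification: add tags for fields we recognize,
    without overriding the caller's initial tagging."""
    tags = dict(tags)
    for field, value in data.items():
        if field not in tags and isinstance(value, str) and SSN_RE.match(value):
            tags[field] = "PII"
    return tags

def apply_policy(data, tags):
    """Transform each tagged field per the org-wide policy. base64 is
    a placeholder for per-user-key encryption; pbkdf2_hmac is a
    stdlib stand-in for bcrypt."""
    out = dict(data)
    for field, tag in tags.items():
        value = out[field].encode()
        if tag == "PII":
            out[field] = "key:v1:" + base64.b64encode(value).decode()
        elif tag == "password":
            digest = hashlib.pbkdf2_hmac("sha256", value, b"per-user-salt", 1 << 12)
            out[field] = "hash:v1:" + digest.hex()
    return out

data = {"username": "nshah", "pword": "123456", "ssn": "111-22-3333"}
tags = classify(data, {"pword": "password"})
protected = apply_policy(data, tags)
print(json.dumps(tags))
print(json.dumps(protected))
```

Note that the service returns both the transformed data and the enriched tags, so downstream consumers never need to re-derive the classification.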
To round out the whole process, you just have to do the following:
- Make sure the tags tag along (wow! :)) through the processing and storage chain
- Ensure that your automated data classification engine continuously improves
- Integrate an identity provider and a "vault" for encryption keys, so that you can decouple identity from the keys and perform key operations (create, rotate, revoke, etc.) without any intervention from the calling users
- And then: Profit!!!
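On the first point, one simple way to keep tags traveling with the data is to wrap both in a single envelope that every stage passes through. A minimal sketch (the envelope shape and `redact_stage` are hypothetical):

```python
# Hypothetical sketch: carry data and its tags together as one
# envelope so they never drift apart across the processing chain.
def redact_stage(envelope):
    """Example stage: drops a field and its tag together, so the
    surviving tags always describe the surviving data."""
    data = {k: v for k, v in envelope["data"].items() if k != "debug"}
    tags = {k: v for k, v in envelope["tags"].items() if k in data}
    return {"data": data, "tags": tags}

envelope = {
    "data": {"ssn": "111-22-3333", "debug": "trace-output"},
    "tags": {"ssn": "PII"},
}
envelope = redact_stage(envelope)
```

Any stage that splits, enriches, or stores the data applies the same discipline: whatever happens to a field happens to its tag.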