Cloud Data Loss Prevention (DLP): Part-1


Google Cloud Platform is a suite of computing, networking, storage, big data, machine learning, and management services from Google, all running on the same infrastructure that Google uses for its own products. There are over 100 products under the Google Cloud umbrella. In this blog, we will discuss Cloud Data Loss Prevention (DLP) in detail.

Cloud Data Loss Prevention (DLP) provides access to a powerful platform for inspecting, classifying, and de-identifying sensitive data.

Cloud DLP includes:

  • More than 150 built-in information type (or “infoType”) identifiers.
  • The ability to define custom infoType detectors using dictionaries, regular expressions, and contextual elements.
  • De-identification techniques including redaction, masking, format-preserving encryption, date-shifting, and more.
  • The ability to detect sensitive data within streams of data, structured text, files in storage repositories such as Cloud Storage and BigQuery, and even within images.
  • Analysis of structured data to help understand its risk of being re-identified, including computation of metrics like k-anonymity, l-diversity, and more.
  • The ability to automatically profile BigQuery data across an organisation or project to identify tables where high-risk and sensitive data reside.
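
To make the re-identification metrics in the list above more concrete, here is a minimal sketch of how k-anonymity and (distinct) l-diversity can be computed by hand. It is plain Python, not a call to Cloud DLP's risk analysis, and the sample rows and column names (`age_band`, `zip3`, `diagnosis`) are hypothetical.

```python
from collections import defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows that share the same quasi-identifier values."""
    group_sizes = defaultdict(int)
    for row in rows:
        key = tuple(row[col] for col in quasi_identifiers)
        group_sizes[key] += 1
    return min(group_sizes.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any such group."""
    group_values = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in quasi_identifiers)
        group_values[key].add(row[sensitive])
    return min(len(values) for values in group_values.values())


# Hypothetical table: age band and postcode prefix act as quasi-identifiers,
# diagnosis is the sensitive attribute.
rows = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["age_band", "zip3"])
l = l_diversity(rows, ["age_band", "zip3"], "diagnosis")
print(f"k-anonymity = {k}, l-diversity = {l}")
```

A low k means some combination of quasi-identifiers points to very few people, and a low l means that even a large group may reveal the sensitive value because everyone in it shares it.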

Cloud Data Loss Prevention (DLP) – InfoType detector

Cloud Data Loss Prevention (DLP) uses information types (infoTypes) to define what it scans for. An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, or credit card number.

Every infoType defined in Cloud DLP has a corresponding detector. Cloud DLP uses infoType detectors in the configuration for its scans to determine what to inspect for and how to transform findings. InfoType names are also used when displaying or reporting scan results.
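
As a rough illustration of how infoType names end up in a scan configuration, the sketch below builds an inspection request as a plain dictionary. The field names follow the DLP v2 `InspectConfig` message as we understand it, so please check them against the API reference before relying on them. The project ID, the custom infoType name `EMPLOYEE_ID`, and its regular expression are hypothetical.

```python
def build_inspect_request(project_id, text):
    """Build a request dict for a Cloud DLP content inspection call."""
    return {
        "parent": f"projects/{project_id}",
        "inspect_config": {
            # Built-in detectors are selected by infoType name.
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "PHONE_NUMBER"},
                {"name": "CREDIT_CARD_NUMBER"},
            ],
            # Hypothetical custom detector backed by a regular expression.
            "custom_info_types": [
                {
                    "info_type": {"name": "EMPLOYEE_ID"},
                    "regex": {"pattern": r"EMP-\d{6}"},
                }
            ],
            # Ask for the matched text to be returned with each finding.
            "include_quote": True,
        },
        "item": {"value": text},
    }


def inspect_text(project_id, text):
    """Send the request. Needs the google-cloud-dlp package and valid credentials."""
    from google.cloud import dlp_v2  # imported lazily so the builder works without it

    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(request=build_inspect_request(project_id, text))
    # Each finding reports the name of the infoType that matched.
    return [(finding.info_type.name, finding.quote) for finding in response.result.findings]


request = build_inspect_request("my-project", "Mail jane@example.com about EMP-123456")
print([info_type["name"] for info_type in request["inspect_config"]["info_types"]])
```

Only `build_inspect_request` runs here; `inspect_text` is included to show where the configuration would be used and is not executed.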

Let’s have a look at some of the built-in information type (or “infoType”) identifiers:

Global InfoTypes

The infoType detectors in this section detect global data such as:

  • AGE : An age measured in months or years.
  • CREDIT_CARD_NUMBER : A credit card number is 12 to 19 digits long and is used globally for payment transactions.
  • CREDIT_CARD_TRACK_NUMBER : A credit card track number is a variable-length alphanumeric string used to store key cardholder information.
  • DATE : A date. This infoType includes most date formats, including the names of common world holidays.
  • DOMAIN_NAME : A domain name as defined by the DNS standard.
  • EMAIL_ADDRESS : Email address identifies the mailbox that emails are sent to or from. The maximum length of the domain name is 255 characters, and the maximum length of the local-part is 64 characters.
  • FEMALE_NAME : A common female name.
  • There are many more global infoTypes, such as GENDER, HTTP_COOKIE, LOCATION, MAC_ADDRESS, MAC_ADDRESS_LOCAL, MALE_NAME, MEDICAL_TERM, ORGANIZATION_NAME, PERSON_NAME, and PHONE_NUMBER.
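
To give a feel for what sits behind a detector like CREDIT_CARD_NUMBER, here is a small sketch that combines the 12 to 19 digit length rule mentioned above with the Luhn checksum that payment card numbers commonly carry. This is an assumption about how such a check could look, not a description of Cloud DLP's actual detector. The number used is a widely published test value, not a real card.

```python
def luhn_valid(number: str) -> bool:
    """Check the 12 to 19 digit length rule and the Luhn checksum."""
    stripped = number.replace(" ", "").replace("-", "")
    if not (stripped.isascii() and stripped.isdigit()):
        return False
    if not 12 <= len(stripped) <= 19:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, char in enumerate(reversed(stripped)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A checksum like this helps separate likely card numbers from arbitrary runs of digits, which is one reason pattern matching alone tends to produce false positives.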

Credentials and secrets InfoTypes

The infoType detectors in this section detect credentials and other secret data.

  • AUTH_TOKEN : An authentication token is a machine-readable way of determining whether a particular request has been authorized for a user. This detector currently identifies tokens that comply with OAuth or Bearer authentication.
  • AWS_CREDENTIALS : Amazon Web Services account access keys.
  • AZURE_AUTH_TOKEN : Microsoft Azure certificate credentials for application authentication.
  • BASIC_AUTH_HEADER : A basic authentication header is an HTTP header used to identify a user to a server. It is part of the HTTP specification in RFC 1945, section 11.
  • ENCRYPTION_KEY : An encryption key within configuration, code, or log text.
  • GCP_API_KEY : Google Cloud API key. An encrypted string that is used when calling Google Cloud APIs that don’t need to access private user data.
  • GCP_CREDENTIALS : Google Cloud service account credentials. Credentials that can be used to authenticate with Google API client libraries and service accounts.
  • There are more credentials and secrets infoTypes, such as JSON_WEB_TOKEN, HTTP_COOKIE, PASSWORD, and WEAK_PASSWORD_HASH.
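
To see why something like a BASIC_AUTH_HEADER is worth detecting, the sketch below builds and then decodes a basic authentication header using only the standard library. The helper names are our own, and the username and password are the familiar example pair from the HTTP authentication RFCs, not real credentials.

```python
import base64


def make_basic_auth_header(username: str, password: str) -> str:
    """Encode 'username:password' with Base64 and prefix the Basic scheme."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


def decode_basic_auth_header(header: str):
    """Return (username, password) if the header uses the Basic scheme, else None."""
    name, _, value = header.partition(":")
    scheme, _, token = value.strip().partition(" ")
    if name.strip().lower() != "authorization" or scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(token.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers invalid Base64 and invalid UTF-8
        return None
    username, separator, password = decoded.partition(":")
    return (username, password) if separator else None


header = make_basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth_header(header))
```

The credentials are only encoded, not encrypted, so anyone who finds such a header in a log file or a code snippet can recover the password directly.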

Documents InfoTypes

In addition to scanning and classifying information contained within documents, Cloud DLP can classify the documents themselves into multiple enterprise-specific categories. When combined with personally identifiable information (PII) inspection scan results, this classification can be useful for document risk assessment, policy enforcement, and similar use cases.

  • DOCUMENT_TYPE/FINANCE/REGULATORY : Finance regulatory documents include financial regulations, tax laws, rules, and guidelines.
  • DOCUMENT_TYPE/FINANCE/SEC_FILING : An SEC filing is a formal document submitted to the US Securities and Exchange Commission. The most commonly filed SEC forms are 10-K and 10-Q.
  • DOCUMENT_TYPE/HR/RESUME : A résumé or a curriculum vitae (CV) document.
  • DOCUMENT_TYPE/LEGAL/BLANK_FORM : A blank legal form or template. This document type typically has multiple areas or boxes left empty to be filled in by an individual, who then submits the form to a legal institution.
  • DOCUMENT_TYPE/LEGAL/BRIEF : A legal brief is a document that advocates a particular outcome of a case, presenting supporting points, law interpretations, and recommendations.
  • There are more document infoTypes, such as DOCUMENT_TYPE/LEGAL/LAW, DOCUMENT_TYPE/R&D/PATENT, DOCUMENT_TYPE/R&D/SOURCE_CODE, DOCUMENT_TYPE/R&D/SYSTEM_LOG, and DOCUMENT_TYPE/R&D/DATABASE_BACKUP.

Country-wise InfoTypes

InfoTypes for India

  • INDIA_AADHAAR_INDIVIDUAL : The Indian Aadhaar number is a 12-digit unique identity number obtained by residents of India, based on their biometric and demographic data.
  • INDIA_GST_INDIVIDUAL : The Indian GST identification number (GSTIN) is a unique identifier required of every business in India for taxation.
  • INDIA_PAN_INDIVIDUAL : The Indian Personal Permanent Account Number (PAN) is a unique 10-digit alphanumeric identifier used for identification of individuals—particularly people who pay income tax. It’s issued by the Indian Income Tax Department. The PAN is valid for the lifetime of the holder.
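
As a rough idea of what the shape of these identifiers looks like in code, here is a sketch using regular expressions. The lengths (12 digits for Aadhaar, 10 characters for PAN) come from the descriptions above. The PAN layout of five letters, four digits, and one letter is our assumption about the usual format, and the sample values are made up. Real detectors do more than match a shape, so treat this only as an illustration.

```python
import re

# Assumed layout: five letters, four digits, one letter (10 characters in total).
PAN_PATTERN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
# 12 digits, optionally written in three groups of four separated by a space.
AADHAAR_PATTERN = re.compile(r"[0-9]{4} ?[0-9]{4} ?[0-9]{4}")


def looks_like_pan(value: str) -> bool:
    """True if the whole value has the assumed PAN shape."""
    return PAN_PATTERN.fullmatch(value.strip().upper()) is not None


def looks_like_aadhaar(value: str) -> bool:
    """True if the whole value is 12 digits, with or without group spacing."""
    return AADHAAR_PATTERN.fullmatch(value.strip()) is not None


# Made-up sample values, not real identifiers.
print(looks_like_pan("ABCDE1234F"))
print(looks_like_aadhaar("1234 5678 9012"))
```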

InfoTypes for Canada

  • CANADA_BC_PHN : The British Columbia Personal Health Number (PHN) is issued to citizens, permanent residents, temporary workers, students, and other individuals who are entitled to health care coverage in the Province of British Columbia.
  • CANADA_DRIVERS_LICENSE_NUMBER : A driver’s license number for each of the ten provinces in Canada (the three territories are currently not covered).
  • CANADA_OHIP : The Ontario Health Insurance Plan (OHIP) number is issued to citizens, permanent residents, temporary workers, students, and other individuals who are entitled to health care coverage in the Province of Ontario.
  • CANADA_QUEBEC_HIN : The Québec Health Insurance Number (also known as the RAMQ number) is issued to citizens, permanent residents, temporary workers, students, and other individuals who are entitled to health care coverage in the Province of Québec.
  • CANADA_SOCIAL_INSURANCE_NUMBER : The Canadian Social Insurance Number (SIN) is the main identifier used in Canada for citizens, permanent residents, and people on work or study visas. With a Canadian SIN and mailing address, one can apply for health care coverage, driver’s licenses, and other important services.
  • There are more infoTypes for Canada, such as CANADA_BANK_ACCOUNT and CANADA_PASSPORT.

InfoTypes for United States

  • AMERICAN_BANKERS_CUSIP_ID : An American Bankers’ Committee on Uniform Security Identification Procedures (CUSIP) number is a 9-character alphanumeric code that identifies a North American financial security.
  • FDA_CODE : Drug product name or active ingredient registered by the United States Food and Drug Administration (FDA).
  • US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER : A United States Adoption Taxpayer Identification Number (ATIN) is a type of United States Tax Identification Number (TIN). An ATIN is issued by the Internal Revenue Service (IRS) to individuals who are in the process of legally adopting a US citizen or resident child.
  • US_BANK_ROUTING_MICR : The American Bankers Association (ABA) Routing Number (also called the transit number) is a nine-digit code. It’s used to identify the financial institution that’s responsible to credit or entitled to receive credit for a check or electronic transaction.
  • US_DEA_NUMBER : A US Drug Enforcement Administration (DEA) number is assigned to a health care provider by the US DEA. It allows the health care provider to write prescriptions for controlled substances. The DEA number is often used as a general “prescriber number” that is a unique identifier for anyone who can prescribe medication.
  • There are many more infoTypes for the United States, such as US_DRIVERS_LICENSE_NUMBER, US_EMPLOYER_IDENTIFICATION_NUMBER, US_HEALTHCARE_NPI, US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER, and US_PASSPORT.
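
The nine-digit ABA routing number described above carries a checksum, which makes it a handy example of how a structured identifier can be validated beyond its length. The sketch below applies the standard 3-7-1 weighting. Whether Cloud DLP's US_BANK_ROUTING_MICR detector performs this exact check is an assumption on our part, and the sample value is used purely for the arithmetic.

```python
def aba_routing_valid(routing_number: str) -> bool:
    """Check that a nine-digit routing number satisfies the 3-7-1 checksum."""
    if len(routing_number) != 9:
        return False
    if not (routing_number.isascii() and routing_number.isdigit()):
        return False

    d = [int(char) for char in routing_number]
    total = (
        3 * (d[0] + d[3] + d[6])
        + 7 * (d[1] + d[4] + d[7])
        + (d[2] + d[5] + d[8])
    )
    return total % 10 == 0


# For 021000021: 3*(0+0+0) + 7*(2+0+2) + (1+0+1) = 30, which is divisible by 10.
print(aba_routing_valid("021000021"))  # True
print(aba_routing_valid("021000022"))  # False
```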

Likewise, Cloud Data Loss Prevention (DLP) offers built-in infoTypes for many other countries, including the United Kingdom, Turkey, China, Mexico, Thailand, Sweden, Taiwan, Portugal, Spain, Singapore, Poland, and Peru.

In conclusion, this blog covered the types of built-in infoTypes that Cloud Data Loss Prevention (DLP) offers. In the next blog, we will look at a demo of how we can use DLP infoTypes in our code.

HAPPY LEARNING !! 🙂

References

  1. https://cloud.google.com/dlp/docs/infotypes-reference
  2. https://www.encryptionconsulting.com/google-cloud-platforms-data-loss-prevention-api-in-depth/

Written by 

Tanishka Garg is a Software Consultant working in the AI/ML domain.