Cloud Data Loss Prevention (DLP): Part-2


Google Cloud Platform’s Data Loss Prevention (DLP) API provides a service that helps organizations manage sensitive data such as names, email addresses, telephone numbers, identification numbers, and credit card numbers. This includes detecting, redacting, masking, and tokenizing such data. It can help organizations comply with regulations such as GDPR and reduce the risk of data exposure and data breaches.

In the previous blog, Cloud Data Loss Prevention (DLP): Part-1, we saw the different types of built-in infoTypes that DLP offers. In this blog, we will cover how to use infoTypes with an example.

How to use Cloud Data Loss Prevention (DLP)

If you wanted to look for a phone number in a block of text, you would specify the PHONE_NUMBER infoType detector in the inspection configuration.
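
As a rough sketch of what that configuration looks like in Python, the snippet below builds the dictionary form of an inspection configuration for one or more infoTypes. The helper name build_inspect_config is hypothetical (it is not part of the Cloud DLP client library), and EMAIL_ADDRESS is included only to illustrate passing more than one detector.

```python
def build_inspect_config(info_types, min_likelihood="LIKELY", include_quote=True):
    """Build the dictionary form of a Cloud DLP inspection configuration.

    Hypothetical helper for illustration; it only assembles a plain dict
    and does not call the API.
    """
    return {
        # Each detector is referenced by its infoType name.
        "info_types": [{"name": name} for name in info_types],
        # Findings below this likelihood are not reported.
        "min_likelihood": min_likelihood,
        # Ask the API to include the matched text in each finding.
        "include_quote": include_quote,
    }


config = build_inspect_config(["PHONE_NUMBER", "EMAIL_ADDRESS"])
print(config["info_types"])
# prints [{'name': 'PHONE_NUMBER'}, {'name': 'EMAIL_ADDRESS'}]
```

A dictionary like this is passed as the inspect configuration when the inspection request is sent.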

The following code sample and its output demonstrate a simple scan request to the Cloud DLP API. Notice that the PHONE_NUMBER detector is specified in the inspection configuration (inspectConfig in the API, inspect_config in the Python code), which instructs Cloud DLP to scan the given string for a phone number.

NOTE: Always make sure that the data in your string is correctly formatted for the built-in infoType you want to detect it. Only then will the infoType be able to find the data for you.

CODE

def extract_metadata(project, item,
                     info_types=("PHONE_NUMBER",),  # add more infoType names to search for more
                     min_likelihood="LIKELY"):
    """Inspects a string and prints the info types found in it.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        item: The string to inspect (will be treated as text).
        info_types: A sequence of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: The minimum likelihood a finding must have to be
            reported, e.g. "POSSIBLE", "LIKELY" or "VERY_LIKELY".
    Returns:
        The response from the API; the findings are also printed to the terminal.
    """
    # Import the client library
    from google.cloud import dlp_v2

    # Instantiate a client
    dlp = dlp_v2.DlpServiceClient()

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Construct inspect configuration dictionary
    inspect_config = {"info_types": [{"name": info_type} for info_type in info_types],
                      "min_likelihood": min_likelihood,
                      "include_quote": True}

    # Call the API
    response = dlp.inspect_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "item": {"value": item},
        }
    )

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            # The quote is the piece of text that matched the infoType.
            if finding.quote:
                print(f"Quote: {finding.quote}")
            print(f"Info type: {finding.info_type.name}")
            print(f"Likelihood: {finding.likelihood}")
    else:
        print("No findings.")

    # Return the response in both cases so callers can process it further.
    return response

# Run the example when this file is executed as a script.
if __name__ == '__main__':
    project_id = 'XXXX'  # Replace this with your project ID
    content = 'My phone number is 91-9876543210'
    print("----EXTRACTION OF PHONE NUMBER BY INBUILT INFOTYPE ----")
    extract_metadata(project_id, content)
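
Once the findings are available, you may want to process them instead of only printing them. The following is a minimal sketch, assuming each finding exposes the info_type.name attribute that the script above prints. The function name count_by_info_type is hypothetical, and the sample objects are local stand-ins that mimic that attribute; they are not real API responses.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_info_type(findings):
    """Count how many findings were reported for each infoType name."""
    return Counter(finding.info_type.name for finding in findings)


# Stand-in findings for illustration only; with the real API you would pass
# response.result.findings instead.
sample_findings = [
    SimpleNamespace(info_type=SimpleNamespace(name="PHONE_NUMBER")),
    SimpleNamespace(info_type=SimpleNamespace(name="EMAIL_ADDRESS")),
    SimpleNamespace(info_type=SimpleNamespace(name="PHONE_NUMBER")),
]

counts = count_by_info_type(sample_findings)
print(counts["PHONE_NUMBER"], counts["EMAIL_ADDRESS"])
# prints 2 1
```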

OUTPUT

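When you send the preceding request, Cloud DLP returns a response containing its findings. For each finding, the script prints the matched quote, the infoType name, and the likelihood; if nothing matches, it prints "No findings."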

Conclusion

In this Cloud Data Loss Prevention (DLP) series, I have tried to explain which built-in infoTypes are available and how to use them in your code. I hope it’s helpful to you.

HAPPY LEARNING 🙂

Written by 

Tanishka Garg is a Software Consultant working in the AI/ML domain.