Personal Data Classification

Introduction

Airbnb is built on trust. One key way we maintain trust with our community is by ensuring that personal data is handled with care, in a manner that meets security, privacy, and compliance requirements. Understanding where and what personal data exists is foundational to this.

Over the past several years, we’ve built our own data classification system that adapts to the needs of our data ecosystem, to streamline our processes, and further unlock our ability to protect the data entrusted to Airbnb. This was made possible by many teams working closely to achieve this overarching, shared objective. Information Security, Privacy, Data Governance, Legal, and Engineering collaborated to tackle this problem holistically to produce a unified data identification and classification strategy across all data stores.

In this blog, we will shed light on the complexities of how data classification works at Airbnb, what measurements we set to assess the quality, performance, and accuracy of the systems involved, and the important considerations when building a data classification system. We hope to share insights for others that are facing similar challenges and to provide a framework for how data classification systems can be built at scale.

The Complexities of Data Classification at Airbnb

Data classification is the process of identifying where data exists and then organizing, detecting, and annotating that data based on a taxonomy. At Airbnb, we have established a Personal Data Taxonomy Council to define the taxonomy for personal data and to refine it over time. This taxonomy breaks down personal data into various data elements that are relevant for our ecosystem such as email address, physical address, and guest names. Once data is annotated with its applicable personal data element(s), various enforcement systems use these annotations to ensure personal data is handled according to our Security and Privacy policies. In this blog post, we will focus primarily on the data classification workflow and not each type of enforcement use case.

The workflow can be classified into three pillars:

Catalog: What data do we have?
Detection: What data do we suspect is personal data?
Reconciliation: Which classification do we choose?

Press enter or click to view image in full size

Personal Data Classification Flow

Let’s dig deeper into how each of these form the backbone of data classification.

Catalog

Cataloging involves building a dynamic, accurate, and scalable system to first identify where data exists and then organize the whole inventory. Cataloging is akin to mapping the data landscape or organizing a library. It involves dynamically discovering new data, enriching it with metadata from various sources, and manually inputting information. This process is crucial for enforcing data policies, accurately classifying data, and assigning it to the correct owners.

Automated and Dynamic Discovery: Automation makes the cataloging process scalable and efficient. For the variety of data stores that Airbnb uses, such as production and analytical databases, object stores, and cloud storage, our catalogs connect to them and dynamically fetch the full inventory of data. Either through stream or batch processing, they dynamically update to reflect new and changed data. This ensures the catalog is a reliable and accurate source of truth.
Complexity and Diversity in Data Sources: The challenge of cataloging stems from the variety and complexity of data sources, including different formats and locations. Our cataloging systems fetch metadata in several ways: through direct API calls or by crawling schemas in formats like thrift, JSON, yaml, or config files, accommodating the diverse nature of modern data storage.

For search and discovery, many of our data entities are surfaced in the data management platform, Metis. This helps the data owners quickly answer questions such as which data contains personal data, who owns the data, and which controls are in place.

Press enter or click to view image in full size

Catalog UI

Detection

For personal data detection, we use the in-house automated detection service in our Data Protection Platform which was built to protect data in compliance with global regulations and security requirements. As our own taxonomy grows, we have expanded our capabilities and made the service easily extensible to detect all other types of personal data elements and personal Airbnb IDs.

Detection engine

For each data entity stored in the catalogs, scanning jobs are scheduled through a message queue, which then samples data and runs through our list of classifiers. Recognizing the need for periodic classifier updates, the detection engine was designed for simplicity and flexibility. Since its inception, our detection engine has upgraded to include additional steps and adopted the approach of configuration-driven development. The majority of the logic of the detection engine has been rewritten as simpler configurations to increase the speed of iterating on existing classifiers, improve testing, and enable quick development of new features.

The detection engine can be seen as a pipeline, which involves the scanner, validator, and thresholding.

Press enter or click to view image in full size

Detection Engine

Scanner: The scanner classifies personal data using metadata and content, employing methods like regex for emails and keyword lists for cities, and advanced machine learning models for complex data types requiring contextual understanding.

Validator: Sampled data matching a scanner undergoes a customizable validation step to enhance classifier accuracy, verifying details like latitude/longitude ranges or custom ciphertexts from encryption services.

Thresholding: To reduce noise, thresholding is applied before storing results, varying by data structure type (e.g., matched rows vs. findings in a document) and set based on historical data frequency and criticality.

With the revamped pipeline, this has resulted in a significant decrease in false positive findings and reduced the burden on data owners to verify every result, which has historically impeded their productivity.

Reconciliation

Not every detection surfaced may be correct or, in other cases, more context may be required. Therefore, we employ a human-in-the-loop strategy: where data owners confirm the classifications. This step is critical in ensuring these classifications are correct before any data policies are automatically enforced to protect our data.

Automated Notifications

For compliance, we have an automated notification system that issues tickets whenever personal data is detected. These get surfaced to the appropriate data or service owners with strict SLAs.

For data entities that have schemas defined in code, such as transactional tables from production services (online), Amazon S3 buckets, or tables that are exported to our data warehouse, we assist the developers by automatically creating code changes that update their table schemas with the detected personal data elements.

Enforcing resolution

To enforce resolution on these tickets, tables are automatically access controlled in the data warehouse when the tickets are not resolved within SLA. Additionally reviews are conducted to ensure our classifications are correct for data where its handling requirements apply.

Tracking actions taken on these tickets for when personal data is detected has been important to assess the quality of our data classification flow and to keep an audit trail of past detections. It also highlights points of friction developers face when resolving these tickets. The investments we have made in this area have continued to improve the process and reduce the time needed to resolve tickets each year.

Assessing Quality of a Data Classification System

Because of the complexity of the system and its sub-components, this presented a unique challenge as we strive to define what quality means for the entire system. To build with the long-term in mind, we evaluated how well our entire data classification system functions as a whole.