This article contains a collection of notes, mind maps and resources to support you while preparing for the Google Cloud Professional Data Engineer certification.

Disclaimer: The new Professional Data Engineer exam will be live starting November 13. The new version reflects updates to Google Cloud’s data storing, data sharing, and data governance and has less emphasis on operationalizing machine learning models. That being said, I believe most of the content is still relevant and can serve as a guide to assist you as you begin your preparation.

So brace yourselves, this is gonna be a rather long post filled with too many images extracted from different parts of the mind maps I used for my preparation in order to make them easy to read and follow.

So let’s get started!



Google Cloud

Before we dive into the characteristics of the Google Cloud services that enable professional data engineers to design, build and operationalize data processing systems, let’s start with a 10,000-foot view of the topics that may be included in the exam, presented in the interactive mindmap below:

Figure 1: 10,000-Foot View of Google Cloud Professional Data Engineer Exam Topics

In the coming sections, we will cover each topic from the mindmap separately.

Infrastructure

Google Cloud services are available in locations that are divided into regions. Each region contains multiple zones, which is where resources are deployed. Zones are isolated from one another so that a failure in one zone does not affect the other zones in the region. Every region has at least two zones, and most have three or more.
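To make the region/zone relationship concrete, here is a minimal sketch in plain Python. It relies on the naming convention where a zone name is the region name plus a letter suffix (for example `us-central1-a` belongs to `us-central1`). The helper names `region_of` and `spread_across_zones` are my own and not part of any Google Cloud SDK, and the round-robin placement is just one simple way to illustrate spreading work across isolated zones.

```python
from __future__ import annotations


def region_of(zone: str) -> str:
    """Return the region part of a zone name such as 'europe-west1-b'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def spread_across_zones(replicas: int, zones: list[str]) -> dict[str, int]:
    """Place replicas round-robin so that losing one zone removes as few as possible."""
    placement = {zone: 0 for zone in zones}
    for i in range(replicas):
        placement[zones[i % len(zones)]] += 1
    return placement


# Example zone names, used here only for illustration.
zones = ["europe-west1-b", "europe-west1-c", "europe-west1-d"]
print(region_of(zones[0]))             # europe-west1
print(spread_across_zones(5, zones))   # two zones get 2 replicas, one gets 1
```

With five replicas over three zones, a single zone failure takes out at most two replicas, which is the practical benefit of the isolation described above.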

world map of google cloud region locations
Figure 2: Google Cloud Regions - Image Source

Google data centers are connected through Google’s own high-speed network. Google is the only cloud provider that owns all the fiber connecting its data centers together. A huge amount of the world’s internet traffic goes through Google’s network.

In addition to the data centers, there are points of presence (POPs) all over the world. They provide access to Google’s network, where traffic is encrypted, secure and very fast.

In addition to the POPs, Google runs a global caching system, or CDN, that consists of hundreds more nodes. You can easily take advantage of this CDN to cache your content, thus increasing your application performance and decreasing your networking cost.

VPC Networks

mindmap of topics related to VPC networks
Figure 3: Google Cloud Platform VPC Networks

Data Transfer Services

mindmap of topics related to data transfer services
Figure 4: Google Cloud Platform Data Transfer Services

Resource Manager

mindmap of topics related to resource manager
Figure 5: Google Cloud Platform Resource Manager

Security

mindmap of topics related to security
Figure 6: Security in Google Cloud Platform

Compute

mindmap of topics related to compute
Figure 7: Google Cloud Platform Compute

Storage

mindmap of topics related to storage
Figure 8: Google Cloud Platform Storage

Ingestion and Processing

mindmap of topics related to ingestion and processing
Figure 9: Ingestion and Processing in Google Cloud Platform

Data Pipelines Management

mindmap of topics related to data pipelines management
Figure 10: Data Pipelines Management in Google Cloud Platform

Data Governance

mindmap of topics related to data governance
Figure 11: Data Governance in Google Cloud Platform

Analytics

mindmap of topics related to analytics
Figure 12: Analytics in Google Cloud Platform

Machine Learning

mindmap of topics related to machine learning
Figure 13: Machine Learning in Google Cloud Platform

Ingestion and Processing

As a professional data engineer, designing data processing systems requires building and operationalizing data pipelines by choosing the appropriate services to integrate new data sources and to process the data in batch or streaming fashion. In this section, we take a deep dive into the services that allow you to ingest data in real time and build data processing systems, whether you are migrating on-premises workloads or starting from scratch.

Figure 14: Ingestion and Processing Topics

Pub/Sub

mindmap of topics related to pub sub
Figure 15: Pub/Sub

Dataproc

mindmap of topics related to dataproc part 1
Figure 16: Dataproc Part 1/2
mindmap of topics related to dataproc part 2
Figure 17: Dataproc Part 2/2

Dataflow

Dataflow allows you to execute your Apache Beam pipelines on Google Cloud.

  • A managed service that provides the resources necessary to run pipelines
    • Defines HOW to run the pipeline:
      • Optimizes the graph for the best execution path, for example by fusing transforms
      • Breaks jobs into units of work
      • Schedules them to various workers
      • Optimization is always ongoing
        • Units of work are continually rebalanced mid-job, which provides fault tolerance
        • Autoscaling happens mid-job
      • Resources (both compute and storage) are deployed on demand and on a per-job basis
  • The Apache Beam SDK, which provides the programming environment to make the creation of streaming and batch pipelines easier
    • Defines WHAT has to be done
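The WHAT/HOW split above can be illustrated with a toy sketch in plain Python. This is not the Apache Beam API and not how Dataflow is implemented internally. It is a simplified model with hypothetical names (`pipeline`, `fuse`, `bundles`, `run`) that shows the ideas of fusing element-wise transforms into a single pass and breaking the input into units of work.

```python
from functools import reduce
from typing import Callable, List

Transform = Callable[[int], int]

# WHAT: the pipeline is declared as an ordered list of element-wise steps.
pipeline: List[Transform] = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
]


def fuse(steps: List[Transform]) -> Transform:
    """Collapse consecutive steps into one function so each element is handled in one pass."""
    def fused(x: int) -> int:
        return reduce(lambda acc, step: step(acc), steps, x)
    return fused


def bundles(data: List[int], size: int) -> List[List[int]]:
    """Break the input into units of work of at most `size` elements."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def run(steps: List[Transform], data: List[int], bundle_size: int) -> List[int]:
    """HOW: fuse the steps, split the input, and process each unit of work."""
    fused = fuse(steps)
    results: List[int] = []
    for bundle in bundles(data, bundle_size):
        # A real service could hand each bundle to a different worker
        # and rebalance bundles while the job is running.
        results.extend(fused(x) for x in bundle)
    return results


print(bundles([1, 2, 3, 4, 5], 2))        # [[1, 2], [3, 4], [5]]
print(run(pipeline, [1, 2, 3, 4, 5], 2))  # [1, 3, 5, 7, 9]
```

The pipeline definition never mentions bundles or workers, which is the point: the same declaration could be executed with a different bundle size or a different scheduling strategy without changing the result.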
mindmap of topics related to dataflow part 1
Figure 18: Dataflow Part 1/3
mindmap of topics related to dataflow part 2
Figure 19: Dataflow Part 2/3
mindmap of topics related to dataflow part 3
Figure 20: Dataflow Part 3/3

Storage

One of a data engineer’s most important skills is choosing the right storage technology, which involves knowing how to use managed services and having a solid grasp of storage performance and pricing. To further optimize your data processing and cut expenses, consider data modeling, schema design, and data life cycle management. In this section we will delve into the many storage options provided by Google Cloud.

Figure 21: Storage Topics

Cloud Storage

mindmap of topics related to cloud storage part 1
Figure 22: Cloud Storage Part 1/2
mindmap of topics related to cloud storage part 2
Figure 23: Cloud Storage Part 2/2

Google Cloud provides three ways to manage the key encryption key (KEK):

  • Google Managed Encryption Keys - GMEK: automatic encryption using Cloud KMS (Key Management Service)
  • Customer Managed Encryption Keys - CMEK: you control the creation and existence of the KEK in KMS
  • Customer Supplied Encryption Keys - CSEK: you provide the KEK
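To show what a KEK actually does, here is a toy sketch of envelope encryption in plain Python, assuming the usual scheme where the data is encrypted with a data encryption key (DEK) and the DEK is in turn wrapped with the KEK. The cipher used here (`toy_xor`, a SHA-256 based XOR keystream) is a stand-in chosen only so the example runs with the standard library. It is not secure and is not what Cloud Storage or Cloud KMS uses. All function names are hypothetical.

```python
import hashlib
import secrets


def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice with the same key restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def encrypt_object(plaintext: bytes, kek: bytes) -> dict:
    """Encrypt data with a fresh DEK, then wrap the DEK with the KEK."""
    dek = secrets.token_bytes(32)
    return {
        "ciphertext": toy_xor(dek, plaintext),
        "wrapped_dek": toy_xor(kek, dek),  # only the wrapped DEK is stored with the data
    }


def decrypt_object(stored: dict, kek: bytes) -> bytes:
    """Unwrap the DEK with the KEK, then decrypt the data."""
    dek = toy_xor(kek, stored["wrapped_dek"])
    return toy_xor(dek, stored["ciphertext"])


kek = secrets.token_bytes(32)  # who creates and holds this key is what GMEK, CMEK and CSEK decide
stored = encrypt_object(b"exam notes", kek)
print(decrypt_object(stored, kek))  # b'exam notes'
```

In this model the three options differ only in who creates and holds `kek`. The rest of the flow stays the same, and without the KEK the wrapped DEK (and therefore the data) cannot be recovered.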

Cloud SQL

Cloud SQL is a fully managed relational database service for:

  • MySQL
  • PostgreSQL
  • Microsoft SQL Server
mindmap of topics related to cloud sql
Figure 24: Cloud SQL

Query Insights

mindmap of topics related to query insights
Figure 25: Query Insights

Cloud Spanner

mindmap of topics related to cloud spanner
Figure 26: Cloud Spanner

Firestore

mindmap of topics related to firestore
Figure 27: Firestore

Datastore

mindmap of topics related to datastore
Figure 28: Datastore

Memorystore

mindmap of topics related to memorystore
Figure 29: Memorystore

Bigtable

Bigtable is a fully managed NoSQL database service. It is suitable for:

  • Storing more than 1 TB of data
  • High throughput
  • Low-latency random data access
mindmap of topics related to bigtable
Figure 30: Bigtable

BigQuery

Figure 31: BigQuery Topics

The last section is dedicated solely to BigQuery. BigQuery is a serverless and cost-effective data warehouse. It is deeply integrated with GCP’s analytics and data processing offerings, allowing customers to build an enterprise-ready, cloud-native data warehouse. BigQuery is part of Google Cloud’s comprehensive data analytics platform, which covers the analytics value chain from ingestion, processing and storage to advanced analytics and collaboration.

mindmap of topics related to bigquery part 1
Figure 32: BigQuery Part 1/12
mindmap of topics related to bigquery part 2
Figure 33: BigQuery Part 2/12
mindmap of topics related to bigquery part 3
Figure 34: BigQuery Part 3/12
mindmap of topics related to bigquery part 4
Figure 35: BigQuery Part 4/12
mindmap of topics related to bigquery part 5
Figure 36: BigQuery Part 5/12
mindmap of topics related to bigquery part 6
Figure 37: BigQuery Part 6/12
mindmap of topics related to bigquery part 7
Figure 38: BigQuery Part 7/12
mindmap of topics related to bigquery part 8
Figure 39: BigQuery Part 8/12
mindmap of topics related to bigquery part 9
Figure 40: BigQuery Part 9/12
mindmap of topics related to bigquery part 10
Figure 41: BigQuery Part 10/12
mindmap of topics related to bigquery part 11
Figure 42: BigQuery Part 11/12
mindmap of topics related to bigquery part 12
Figure 43: BigQuery Part 12/12

Resources

Developer Cheat Sheet:

The Cloud Girl:

Google Cloud Product list:

21 products explained under 2 minutes:

GCP Data Engineer Study Guide:

Data Engineering Cheat Sheet on GCP:

Schema design best practices for Bigtable:

Optimize query computation for BigQuery:

With this, we have reached the end of this post. I hope you enjoyed it!

If you have any remarks or questions, please don’t hesitate to drop a comment below.