Preparation Guide for Google Cloud Professional Data Engineer Certification

This article contains a collection of notes, mind maps and resources to support you while preparing for the google cloud professional data engineer certification.

Disclaimer: The new Professional Data Engineer exam will be live starting November 13. The new version reflects updates to Google Cloud’s data storing, data sharing, and data governance and has less emphasis on operationalizing machine learning models. That being said, I believe most of the content is still relevant and can serve as a guide to assist you as you begin your preparation.

So brace yourselves, this is gonna be a rather long post filled with too many images extracted from different parts of the mind maps I used for my preparation in order to make them easy to read and follow.

So let’s get started!

Google Cloud
Ingestion and Pocessing
Storage
BigQuery
Resources

Google Cloud

Before we dive into the characteristics of Google Cloud services that will enable professional data engineers to design, build and operationalize data processing systems, let’s start with a 10,000-foot view on different topics that may be included in the exam presented in the below interactive mindmap:

Figure 1: 10,000 View on Google Cloud Professional Data Engineer Exam Topics

In the coming sections, we will cover each topic from the mindmap separately.

Infrastructure

Google Cloud services are available in different locations divided into Regions. Regions contain multiple Zones where the resources are deployed and are isolated from one another so that failures in one zone do not affect other zones in a region. Most regions have at least three zones and can have more. All regions have at least two zones.

world map of google cloud region locations — Figure 2: Google Cloud Regions - Image Source

Google data centers are connected with Google’s own high-speed network. Google is the only cloud provider that owns all the fiber connecting its data center together. A huge amount of the world’s internet traffic goes through Google’s network.

In addition to the data centers, there are points of presence all over the world. They allow access to Google’s network where all messages are encrypted, secure and very fast.

In addition to the POPs, Google runs a global caching system or CDN that consists of hundreds of more nodes. You can easily take advantage of this CDN to cache your content, thus increasing your application performance and decreasing your networking cost.

VPC Networks

Data Transfer Services

Resource Manager

Security

Compute

Storage

Ingestion and Processing

Data Pipelines Management

Data Governance

Analytics

Machine Learning

Ingestion and Pocessing

As a professional data engineer, designing data processing systems requires building and operationalizing data pipelines by choosing the appropriate services to integrate new data sources and processing the data in batch or streaming fashion. In this section, we deep dive into services that will allow you to ingest data in real time and build data processing systems whether you are migrating on premises workloads or starting from scratch.

Figure 14: Ingestion and Processing Topics

Pub/Sub

Dataproc

Dataflow

It allows you to execute your Apache Beans pipelines on Google Cloud.

A managed service that provides the resources necessary to create pipelines
- Defines HOW to run the pipeline:
  - Optimizes the graph by fusing transforms for example for best execution path
  - Breaks jobs into units of work
  - Schedules them to various workers
  - Optimization is always ongoing
    - Units of work are continually rebalanced mid job which provides fault tolerance
    - autoscaling mid job
  - Resources –both compute and storage– are deployed on demand and on a per job basis
The Apache Beam SDK, which provides the programming environment to make the creation of streaming and batch pipelines easier
- Defines WHAT has to be done

mindmap of topics related to dataproc part 3 — Figure 20: Datflow Part 3/3

Storage

One of a data engineer’s most important skills is choosing the right storage technology, which involves knowing how to use managed services and having a solid grasp of storage performance and pricing. To further optimize your data processing and cut expenses, consider data modeling, schema design, and data life cycle management. In this section we will delve into the many storage options provided by Google Cloud.

Figure 21: Storage Topics

Cloud Storage

Google Cloud provides 3 ways to manage the KEK encryption key:

Google Managed Encryption Keys - GMEK: automatic encryption using Cloud KMS (Key Management Service)
Customer Managed Encryption Keys - CMEK: you control the creation and existance of the KEK key in KMS
Customer Supplied Encryption Keys - CSEK: you provide the KEK key

Cloud SQL

Cloud SQL is a fully managed relational database service for:

MySQL
PostgreSQL
Microsoft SQL

mindmap of topics related to cloud slq — Figure 24: Cloud SQL

Query Insights

Cloud Spanner

Firestore

Datastore

Memorystore

Bigtable

Bigtable is a fully managed NoSQL database service. It is suitable for:

Storing > 1TB
High Throughput
Low latency random data access

mindmap of topics related to bigtable — Figure 30: Bigtable

BigQuery

Figure 31: BigQuery Topics

The last section is solely dedicated to BigQuery. BigQuery is a serverless and cost-effective data warehouse. It is deeply integrated with the GCP’s analytical and data processing offering, allowing customers to build an enterprise ready cloud native data warehouse. BigQuery is part of Google Cloud’s comprehensive data analytics platform that covers the analytics value chain from Ingest, process and store to advanced analytics and collaboration.