Question 1

What are the cost optimization strategies for AWS Glue?

Accepted Answer

AWS Glue costs can spiral quickly on large workloads. Key strategies:

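1. Use Python Shell for lightweight tasks: A Python Shell job can run on as little as 0.0625 DPU (1/16 of a DPU), while a Spark job needs at least 2 DPUs billed at the same per-DPU-hour rate. API calls, small file moves, metadata updates, and notifications do not need Spark. Use Python Shell for the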
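To put rough numbers on the Python Shell saving, here is a minimal sketch of the arithmetic. It assumes a rate of $0.44 per DPU-hour (an assumption; the real rate varies by region), a Python Shell job on 0.0625 DPU, and a Spark job on 2 DPUs. The function name is illustrative and billing minimums are ignored.

```python
# Assumed price per DPU-hour in USD; the real rate varies by region.
PRICE_PER_DPU_HOUR = 0.44


def glue_job_cost(dpus, runtime_minutes, price_per_dpu_hour=PRICE_PER_DPU_HOUR):
    """Approximate cost of one run: DPUs x hours x hourly rate."""
    return dpus * (runtime_minutes / 60.0) * price_per_dpu_hour


# The same 5-minute housekeeping task, run as Python Shell and as Spark.
python_shell_cost = glue_job_cost(dpus=0.0625, runtime_minutes=5)
spark_cost = glue_job_cost(dpus=2, runtime_minutes=5)

print(f"Python Shell: ${python_shell_cost:.4f}")
print(f"Spark (2 DPU): ${spark_cost:.4f}")
print(f"Spark / Python Shell: {spark_cost / python_shell_cost:.0f}x")
```

Under these assumptions the gap comes entirely from the DPU count, so the ratio stays the same for any runtime.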

Question 2

What is the AWS Glue Data Catalog and why is it important?

Accepted Answer

The AWS Glue Data Catalog is a centralized metadata repository that stores table definitions, schemas, connection information, and partition metadata for your data assets across your AWS environment.
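As a rough illustration of what a table definition in the catalog holds, the sketch below assembles a dictionary shaped like the `TableInput` structure that the Glue `create_table` API accepts. The table name, S3 location, and columns are hypothetical, and the Parquet format classes are one common choice rather than a requirement.

```python
def build_table_input(name, location, columns, partition_keys):
    """Assemble a dict shaped like the Glue TableInput structure.

    `columns` and `partition_keys` map a column name to a Hive type string.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns.items()],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "PartitionKeys": [{"Name": c, "Type": t} for c, t in partition_keys.items()],
    }


# Hypothetical table: names and the bucket are placeholders.
orders = build_table_input(
    name="orders",
    location="s3://example-bucket/orders/",
    columns={"order_id": "bigint", "amount": "double"},
    partition_keys={"order_date": "string"},
)
```

The schema, the storage location, and the partition keys all live in this one record, which is why any engine that reads the catalog can query the same files without redefining them.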

Why it matters:

It acts as a

Question 3

What is a Glue Job Bookmark and when would you use it?

Accepted Answer

A Job Bookmark is Glue's built-in mechanism for incremental processing. When enabled, Glue tracks the state of the last successful job run and only processes new or changed data in subsequent runs.
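The idea can be modelled in a few lines of plain Python. This is a toy model, not Glue's implementation: it assumes the bookmark is simply the newest last-modified timestamp already processed, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    """Toy stand-in for bookmark state: newest timestamp already processed."""
    last_processed: float = 0.0


def run_job(files, bookmark):
    """Process only files modified after the bookmark, then advance it.

    `files` maps an object key to its last-modified time (epoch seconds).
    Returns the keys picked up by this run.
    """
    new_files = sorted(k for k, mtime in files.items() if mtime > bookmark.last_processed)
    if new_files:
        # Advancing the bookmark models a successful commit at the end of a run.
        bookmark.last_processed = max(files[k] for k in new_files)
    return new_files


bookmark = Bookmark()
files = {"day1.csv": 100.0, "day2.csv": 200.0}
first_run = run_job(files, bookmark)    # both files are new
files["day3.csv"] = 300.0
second_run = run_job(files, bookmark)   # only the file added since the first run
third_run = run_job(files, bookmark)    # nothing new to process
```

In a real Glue job the equivalent behaviour is switched on with the `--job-bookmark-option` job argument set to `job-bookmark-enable`, and the script is expected to pass a `transformation_ctx` on its sources and call `job.commit()` at the end so the state is saved.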

How it works for S3 sources: Glue trac

Question 4

What is the difference between a DynamicFrame and a Spark DataFrame in Glue? When would you use each?

Accepted Answer

Both represent distributed datasets in Glue, but they handle schema differently:

DynamicFrame

Question 5

How do you optimize an AWS Glue job that is running slowly?

Accepted Answer

Performance tuning comes up in almost every senior data engineering interview. Here is a systematic approach:

1. Check the Spark UI first — Go to Glue job → Run → Spark UI. Look for: stages with long duration, high shuffle read/write, tasks with skewed sizes, and spil
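For the skew check in particular, a quick heuristic is to compare the slowest task in a stage with the median task. The sketch below assumes you have copied task durations (in seconds) out of the Spark UI. The function names and the threshold of 5 are arbitrary choices for illustration, not a Glue or Spark setting.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task in one stage."""
    return max(task_durations_s) / median(task_durations_s)


def is_skewed(task_durations_s, threshold=5.0):
    """Flag a stage whose slowest task dwarfs the typical one."""
    return skew_ratio(task_durations_s) >= threshold


balanced_stage = [10, 11, 12, 10, 11]
skewed_stage = [10, 11, 12, 10, 240]   # one straggler holds the stage open

print(is_skewed(balanced_stage), is_skewed(skewed_stage))
```

A stage that trips this check usually points at one hot key or one oversized partition rather than a shortage of workers.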

Question 6

Design a CDC pipeline using AWS Glue that replicates changes from RDS PostgreSQL to S3 in near real-time.

Accepted Answer

This is a classic system design question for senior roles. The full architecture:

Architecture: PostgreSQL WAL → AWS DMS (CDC) → S3 (raw CDC files) → Glue Streaming Job → S3 Silver Layer (Iceberg/Delta)
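The core of the last hop is applying ordered change records to the current table state. Here is a minimal plain-Python sketch of that merge. It assumes each change record carries an `Op` field with `I`, `U`, or `D` (the convention DMS uses when writing CDC files to S3) and that records arrive in commit order. The `id` key and the sample rows are hypothetical.

```python
def apply_cdc(snapshot, changes, key="id"):
    """Apply ordered change records to a keyed snapshot and return the new state.

    `snapshot` maps a primary-key value to a row dict.
    Each change carries an "Op" field: "I" (insert), "U" (update) or "D" (delete).
    """
    state = {k: dict(v) for k, v in snapshot.items()}  # copy; do not mutate the input
    for change in changes:
        row = {c: v for c, v in change.items() if c != "Op"}
        pk = row[key]
        if change["Op"] == "D":
            state.pop(pk, None)
        elif change["Op"] in ("I", "U"):
            state[pk] = row
        else:
            raise ValueError(f"unknown Op: {change['Op']!r}")
    return state


snapshot = {1: {"id": 1, "status": "new"}}
changes = [
    {"Op": "U", "id": 1, "status": "paid"},
    {"Op": "I", "id": 2, "status": "new"},
    {"Op": "I", "id": 3, "status": "new"},
    {"Op": "D", "id": 2, "status": "new"},
]
silver = apply_cdc(snapshot, changes)
```

On an Iceberg or Delta table the same logic would typically be expressed as a single merge statement keyed on the primary key, after reducing each batch to the latest change per key.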

Step-by-step:

Question 7

How do you handle small file problems in AWS Glue and S3?

Accepted Answer

The small file problem is one of the most common performance killers in data lakes. When you have thousands of tiny files (under 1 MB each) in S3, Spark spends more time on S3 API calls and task scheduling overhead than on actual data processing.
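To see how much grouping helps, the sketch below greedily packs file sizes into groups of roughly 128 MiB and counts the resulting groups. It is a simplified model with an assumed target size and a hypothetical function name, not the algorithm Glue or Spark actually uses.

```python
def plan_groups(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily pack file sizes (in bytes) into groups of at most `target_bytes`.

    A single file larger than the target still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [512 * 1024] * 10_000   # 10,000 files of 512 KiB each
groups = plan_groups(small_files)

print(f"{len(small_files)} files -> {len(groups)} read groups")
```

Glue's S3 reader offers `groupFiles` and `groupSize` connection options in this spirit. Check the current documentation for their exact behaviour and defaults before relying on them.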

Root cause in Glue: If your

Question 8

What is AWS Glue and how is it different from traditional ETL tools like Informatica or SSIS?

Accepted Answer

AWS Glue is a serverless, fully managed ETL service built on Apache Spark. The key differences from traditional ETL tools are:

No infrastructure — Traditional tools like Informatica require dedicated ETL servers you must provision, patch, and

Question 9

How do you connect AWS Glue to an RDS or on-premises database?

Accepted Answer

Glue connects to JDBC-compatible databases (RDS, Aurora, SQL Server, Oracle, on-prem databases) using Glue Connections.
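A Glue Connection is built around a JDBC URL. As a small illustration, the helper below builds URLs in the standard driver formats for three engines. The helper itself, the host name, and the database name are hypothetical, and other engines use different URL shapes.

```python
def jdbc_url(engine, host, port, database):
    """Build a JDBC URL for a few common engines."""
    if engine in ("postgresql", "mysql"):
        return f"jdbc:{engine}://{host}:{port}/{database}"
    if engine == "sqlserver":
        return f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    raise ValueError(f"unsupported engine: {engine}")


# Placeholder endpoint; a real RDS endpoint would go here.
postgres_url = jdbc_url("postgresql", "mydb.example.internal", 5432, "sales")
sqlserver_url = jdbc_url("sqlserver", "mydb.example.internal", 1433, "sales")

print(postgres_url)
print(sqlserver_url)
```

Beyond the URL, the connection also needs credentials and network placement (VPC, subnet, security group) so that Glue workers can actually reach the database.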

Setup steps:

Create a Glue Connection — In the Glue console, create a connection with the JDBC

Question 10

What is a Glue Crawler and what does it do?

Accepted Answer

A Glue Crawler is an automated agent that scans a data source, infers its schema, and registers or updates table definitions in the Glue Data Catalog.
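The scan, infer, and register loop can be pictured with a toy model. The sketch below samples rows of raw string values, picks the narrowest type that fits every value in a column, and writes the result into an in-memory "catalog" dict. It is an illustration only: the function names are invented, and real crawlers use classifiers with far more elaborate rules.

```python
def infer_type(value):
    """Guess a column type for one raw string value."""
    for caster, name in ((int, "bigint"), (float, "double")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(rows):
    """Pick, per column, the narrowest type that fits every sampled value."""
    rank = {"bigint": 0, "double": 1, "string": 2}
    schema = {}
    for row in rows:
        for column, value in row.items():
            guess = infer_type(value)
            if column not in schema or rank[guess] > rank[schema[column]]:
                schema[column] = guess
    return schema


def register(catalog, table, schema):
    """Create or overwrite the table definition, as a crawler run would."""
    catalog[table] = schema


sample_rows = [
    {"order_id": "1", "amount": "9.99", "country": "DE"},
    {"order_id": "2", "amount": "12", "country": "US"},
]
catalog = {}
register(catalog, "orders", infer_schema(sample_rows))
```

Rerunning the same steps after the data changes overwrites the stored definition, which mirrors how a scheduled crawler keeps the catalog in step with the source.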

How it works:

You configure a crawler with a data source (S3 path, JDBC connectio

AWS Glue Interview Questions