What are the cost optimization strategies for AWS Glue?

AWS Glue costs can climb quickly on large workloads. Key strategies: 1. Use Python Shell for lightweight tasks: a Python Shell job can run on as little as 1/16 of a DPU (0.0625 DPU), a fraction of what even the smallest Spark job uses. API calls, small file moves, metadata updates, and notifications do not need Spark. Use Python Shell for the

What is the AWS Glue Data Catalog and why is it important?

The AWS Glue Data Catalog is a centralized metadata repository that stores table definitions, schemas, connection information, and partition metadata for your data assets across your AWS environment. Why it matters: It acts as a

What is a Glue Job Bookmark and when would you use it?

A Job Bookmark is Glue's built-in mechanism for incremental processing. When enabled, Glue tracks the state of the last successful job run and only processes new or changed data in subsequent runs. How it works for S3 sources: Glue trac

What is the difference between a DynamicFrame and a Spark DataFrame in Glue? When would you use each?

Both represent distributed datasets in Glue, but they handle schema differently: DynamicFrame

How do you optimize an AWS Glue job that is running slowly?

Performance tuning is almost always asked in senior data engineering interviews. Here is a systematic approach: 1. Check the Spark UI first — Go to Glue job → Run → Spark UI. Look for: stages with long duration, high shuffle read/write, tasks with skewed sizes, and spil


AWS Glue

Serverless ETL service on Amazon Web Services


Vendor: Amazon
Type: transform
Pricing: paid
Level: intermediate
Tags: serverless, AWS, Spark, PySpark, ETL


What is AWS Glue?

AWS Glue is Amazon's fully managed, serverless ETL (Extract, Transform, Load) service. Unlike traditional ETL tools that require dedicated servers, Glue provisions compute resources on demand and releases them when the job finishes — you only pay for the seconds your job runs.

It is built on Apache Spark, which means your ETL scripts are distributed across a cluster automatically. You write PySpark or Scala, and Glue handles parallelism, fault tolerance, and scaling. This makes it capable of processing terabytes of data without any infrastructure management.

The three pillars of AWS Glue:

Data Catalog — A central metadata repository that stores table definitions, schemas, and connection details. Think of it as a managed Hive Metastore that integrates with Athena, EMR, and Redshift Spectrum.
Crawlers — Automated agents that scan data sources (S3, RDS, DynamoDB, JDBC), infer schemas, and populate the Data Catalog. You schedule crawlers to detect schema changes automatically.
ETL Jobs — Spark-based transformation scripts that read from sources, apply transformations, and write to targets. Jobs can be authored in Glue Studio (visual), or as PySpark/Scala scripts.

Core Concepts You Must Know

Glue Job Types

AWS Glue supports four job types:

Spark — Full Apache Spark cluster. Best for large-scale batch transformations. Billed per DPU-hour.
Spark Streaming — Continuous processing from Kafka or Kinesis. Keeps the cluster alive and processes micro-batches.
Python Shell — A single Python process (no Spark). Ideal for lightweight tasks, API calls, small file operations. Much cheaper than Spark jobs.
Ray — Distributed Python for ML workloads. Newer addition, good for data science pipelines.

DPUs (Data Processing Units)

A DPU is Glue's unit of compute — 4 vCPUs and 16 GB RAM. You configure how many DPUs your job uses. A standard Spark job defaults to 10 DPUs. You can set --number-of-workers and --worker-type (Standard, G.1X, G.2X) to tune performance and cost. G.1X gives 1 DPU per worker, G.2X gives 2 DPUs — use G.2X for memory-intensive joins.
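To make the DPU arithmetic concrete, here is a minimal cost-estimate sketch in plain Python. The function name and the price per DPU-hour are assumptions for illustration only (rates vary by region and change over time), so treat the numbers as placeholders rather than a quote.

```python
# DPUs per worker for the worker types described above
# (G.1X = 1 DPU per worker, G.2X = 2 DPUs per worker).
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}

# ASSUMPTION: an illustrative price, not an official figure.
# Check the current rate for your region before relying on it.
ASSUMED_PRICE_PER_DPU_HOUR = 0.44


def estimate_job_cost(workers, worker_type, runtime_seconds,
                      price_per_dpu_hour=ASSUMED_PRICE_PER_DPU_HOUR):
    """Estimate the dollar cost of one job run (hypothetical helper)."""
    if worker_type not in DPUS_PER_WORKER:
        raise ValueError(f"unknown worker type: {worker_type}")
    total_dpus = workers * DPUS_PER_WORKER[worker_type]
    dpu_hours = total_dpus * runtime_seconds / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# Same job, same 30-minute runtime, two worker types.
print(estimate_job_cost(10, "G.1X", 1800))
print(estimate_job_cost(10, "G.2X", 1800))
```

Under these assumptions, moving from G.1X to G.2X at the same worker count doubles the cost, so the switch only pays off if the extra memory shortens the runtime enough or prevents spills and failures.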

Job Bookmarks

Job Bookmarks are Glue's built-in incremental processing mechanism. When enabled, Glue tracks which data it has already processed (based on S3 object modification timestamps or sequence numbers) and only processes new data on subsequent runs. This prevents reprocessing the entire dataset every time, which is critical for cost control on large S3 datasets.
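The idea behind a bookmark can be illustrated with a toy model in plain Python. This is not Glue's implementation (the real bookmark state is managed internally by the service); it is a sketch that assumes a bookmark is simply the newest modification timestamp already processed. The function and file names are hypothetical.

```python
from datetime import datetime, timezone


def select_new_objects(objects, bookmark):
    """Return (objects modified after the bookmark, updated bookmark).

    objects  -- list of (key, last_modified) tuples, like an S3 listing
    bookmark -- newest timestamp already processed, or None on the first run
    """
    new = [(key, modified) for key, modified in objects
           if bookmark is None or modified > bookmark]
    # If nothing new arrived, the bookmark stays where it was.
    new_bookmark = max((modified for _, modified in new), default=bookmark)
    return new, new_bookmark


def jan(day):
    """Helper for readable example timestamps (January 2024, UTC)."""
    return datetime(2024, 1, day, tzinfo=timezone.utc)


listing = [("raw/a.json", jan(1)), ("raw/b.json", jan(2))]
first_run, bookmark = select_new_objects(listing, None)

# A new file lands before the second run.
listing.append(("raw/c.json", jan(3)))
second_run, bookmark = select_new_objects(listing, bookmark)

print([key for key, _ in second_run])  # prints ['raw/c.json']
```

In a real Glue job the equivalent behavior comes from enabling the job bookmark option on the job, giving each source a transformation_ctx, and calling job.commit() at the end of the script so the state is saved only after a successful run.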

Dynamic Frames vs DataFrames

Glue introduces its own abstraction called a DynamicFrame on top of Spark DataFrames. DynamicFrames handle semi-structured data with inconsistent schemas (e.g., JSON with nullable fields, mixed types) without requiring a predefined schema. You can convert between the two: dynamic_frame.toDF() converts to a Spark DataFrame for full Spark API access, and DynamicFrame.fromDF(df, glue_context, "name") converts back.
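The "inconsistent schema" problem is easier to see with a small example. The sketch below uses plain Python dictionaries, not the Glue API, to show a field that arrives with two different types and how casting resolves it. In Glue itself, a DynamicFrame records such a field as a choice type, and resolveChoice is the usual way to settle it; the helper names here are hypothetical.

```python
def infer_field_types(records):
    """Map each field name to the set of Python type names seen for it."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return seen


def cast_field(records, field, cast):
    """Return new records with `field` cast; missing/None values are left alone."""
    return [
        {**record, field: cast(record[field])}
        if record.get(field) is not None else dict(record)
        for record in records
    ]


records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 5.0},   # id arrives as a string in this record
    {"id": 3},                   # price is missing entirely
]

types_before = infer_field_types(records)
cleaned = cast_field(records, "id", int)
types_after = infer_field_types(cleaned)

print(sorted(types_before["id"]), sorted(types_after["id"]))
# prints ['int', 'str'] ['int']
```

A Spark DataFrame would need a single declared type for id up front, which is why a common approach is to read and clean with a DynamicFrame first and then convert with toDF() once the schema is consistent.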

Glue Triggers

Triggers control when jobs run. Three types: Scheduled (cron expression), On-demand (manual or API call), and Conditional (starts when another job/crawler succeeds or fails). Conditional triggers let you build job chains — Crawler → Transform Job → Load Job — without an external orchestrator.
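As a rough sketch of how a scheduled trigger and a conditional trigger differ, the helpers below build request dictionaries shaped like the parameters of the Glue CreateTrigger API. Nothing is sent to AWS here; the helper names, trigger names, and job names are all hypothetical.

```python
def scheduled_trigger(name, job_name, cron):
    """Build parameters for a trigger that starts a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, upstream_job, downstream_job):
    """Build parameters for a trigger that fires when another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical two-step chain: a nightly transform, then a load on success.
nightly = scheduled_trigger("nightly-transform", "transform-job", "0 2 * * ? *")
chain = conditional_trigger("load-after-transform", "transform-job", "load-job")

print(nightly["Schedule"])
print(chain["Predicate"]["Conditions"][0]["State"])
```

With boto3 installed and credentials configured, dictionaries like these could be passed along the lines of boto3.client("glue").create_trigger(**nightly); verify the parameter names against the current API reference before using them.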


Common AWS Glue Architectures

Pattern 1 — S3 Data Lake ETL

The most common pattern: raw data lands in S3 (Bronze layer), a Glue Crawler scans it and registers the schema in the Data Catalog, a Glue Spark job transforms it (clean, deduplicate, cast types), and writes Parquet to a Silver S3 bucket. Athena or Redshift Spectrum then queries the Silver layer directly.

Pattern 2 — RDS to Redshift

Glue reads from an operational RDS database (MySQL/PostgreSQL) using a JDBC connection, applies transformations, and writes to Amazon Redshift. When you write through a Redshift connection (for example with glueContext.write_dynamic_frame.from_jdbc_conf and a redshift_tmp_dir), Glue stages the data in a temporary S3 location and issues the Redshift COPY command for you, so you do not write the COPY statement yourself.

Pattern 3 — Incremental CDC with Bookmarks

Enable Job Bookmarks on an S3-source job. Glue tracks the high watermark of processed files. On each run it only reads files modified since the last successful run. Combine with partitioning by date for efficient incremental loads into a data warehouse.

Pattern 4 — Glue + Step Functions

For complex multi-step pipelines, AWS Step Functions orchestrates Glue jobs. Step Functions handles retries, parallel execution, conditional branching, and error handling — more powerful than Glue Triggers alone for enterprise workflows.
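The orchestration described above can be sketched as a state machine definition. The example below builds an Amazon States Language document as a Python dictionary, using the Step Functions service integration that starts a Glue job and waits for it to finish. The job names, state names, and retry settings are hypothetical choices, not recommendations.

```python
import json


def glue_task(job_name, next_state=None):
    """Build a Task state that runs a Glue job and waits for completion."""
    state = {
        "Type": "Task",
        # The .sync suffix makes Step Functions wait for the job run to end.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        # ASSUMPTION: illustrative retry policy; tune for your workload.
        "Retry": [{
            "ErrorEquals": ["States.ALL"],
            "IntervalSeconds": 60,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }],
        # After retries are exhausted, route to the failure state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "PipelineFailed"}],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Hypothetical two-step Glue pipeline",
    "StartAt": "Transform",
    "States": {
        "Transform": glue_task("transform-job", next_state="Load"),
        "Load": glue_task("load-job"),
        "PipelineFailed": {
            "Type": "Fail",
            "Error": "GlueJobFailed",
            "Cause": "A Glue job failed after retries",
        },
    },
}

print(json.dumps(definition, indent=2))
```

Compared with a chain of conditional triggers, the retry and catch rules live next to each step, which is the main reason this pattern tends to scale better for pipelines with branching or parallel stages.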

PySpark Code Patterns in AWS Glue

Below are the most commonly used Glue PySpark patterns you will write and be tested on in interviews.

AWS Glue Interview Questions

Frequently asked AWS Glue interview questions from real data engineering interviews.