
AWS Glue

Serverless ETL service on Amazon Web Services

Vendor: Amazon
Type: transform
Pricing: paid
Level: intermediate
Tags: serverless, AWS, Spark, PySpark, ETL

What is AWS Glue?

AWS Glue is Amazon's fully managed, serverless ETL (Extract, Transform, Load) service. Unlike traditional ETL tools that require dedicated servers, Glue provisions compute resources on demand and releases them when the job finishes, so you pay only for the time your job runs (billed per second, subject to a minimum billing duration).

It is built on Apache Spark, which means your ETL scripts are distributed across a cluster automatically. You write PySpark or Scala, and Glue handles parallelism, fault tolerance, and scaling. This makes it capable of processing terabytes of data without any infrastructure management.

The three pillars of AWS Glue:

  • Data Catalog — A central metadata repository that stores table definitions, schemas, and connection details. Think of it as a managed Hive Metastore that integrates with Athena, EMR, and Redshift Spectrum.
  • Crawlers — Automated agents that scan data sources (S3, RDS, DynamoDB, JDBC), infer schemas, and populate the Data Catalog. You schedule crawlers to detect schema changes automatically.
  • ETL Jobs — Spark-based transformation scripts that read from sources, apply transformations, and write to targets. Jobs can be authored in Glue Studio (visual), or as PySpark/Scala scripts.

Core Concepts You Must Know

Glue Job Types

AWS Glue supports four job types:

  • Spark — Full Apache Spark cluster. Best for large-scale batch transformations. Billed per DPU-hour.
  • Spark Streaming — Continuous processing from Kafka or Kinesis. Keeps the cluster alive and processes micro-batches.
  • Python Shell — A single Python process (no Spark). Ideal for lightweight tasks, API calls, small file operations. Much cheaper than Spark jobs.
  • Ray — Distributed Python for ML workloads. Newer addition, good for data science pipelines.

DPUs (Data Processing Units)

A DPU is Glue's unit of compute: 4 vCPUs and 16 GB of RAM. You configure how many DPUs your job uses. A standard Spark job defaults to 10 DPUs. You can set --number-of-workers and --worker-type (Standard, G.1X, G.2X) to tune performance and cost. G.1X provides 1 DPU per worker and G.2X provides 2 DPUs per worker; use G.2X for memory-intensive joins.
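As a rough illustration of how worker type and worker count translate into DPU-hours, the sketch below estimates the cost of a single run. The function name, the price per DPU-hour, and the one-minute billing minimum are assumptions made for this example; rates vary by region and over time, so check current AWS pricing before relying on the numbers.

```python
# DPUs per worker for two of the worker types described above.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2}


def estimate_job_cost(worker_type, number_of_workers, runtime_seconds,
                      price_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate the cost of one Glue Spark job run (illustrative only).

    price_per_dpu_hour and minimum_seconds are assumed placeholder values.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpus = DPU_PER_WORKER[worker_type] * number_of_workers
    dpu_hours = dpus * billed_seconds / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# Compare the same 15-minute job on 10 workers of each type.
cost_g1x = estimate_job_cost("G.1X", 10, 15 * 60)
cost_g2x = estimate_job_cost("G.2X", 10, 15 * 60)
print(cost_g1x, cost_g2x)
```

Under these assumptions, moving from G.1X to G.2X doubles the cost for the same runtime, so G.2X only pays off if the extra memory shortens the job or prevents failures.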

Job Bookmarks

Job Bookmarks are Glue's built-in incremental processing mechanism. When enabled, Glue tracks which data it has already processed (based on S3 object modification timestamps or sequence numbers) and only processes new data on subsequent runs. This prevents reprocessing the entire dataset every time, which is critical for cost control on large S3 datasets.
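The idea behind a timestamp-based bookmark can be modelled in a few lines of plain Python. This is a simplified conceptual sketch, not Glue's actual implementation; the function name and the object keys are hypothetical.

```python
from datetime import datetime, timezone


def select_new_objects(objects, bookmark):
    """Return the objects modified after the bookmark, plus the new bookmark.

    objects:  list of (key, last_modified) tuples
    bookmark: timestamp of the newest object already processed, or None
    """
    new = [(key, ts) for key, ts in objects if bookmark is None or ts > bookmark]
    # If nothing new arrived, the bookmark stays where it was.
    new_bookmark = max((ts for _, ts in new), default=bookmark)
    return new, new_bookmark


objects = [
    ("raw/2024-01-01/a.json", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("raw/2024-01-02/b.json", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]

# First run: no bookmark yet, so everything is processed.
first_batch, bookmark = select_new_objects(objects, bookmark=None)

# A new file lands; the second run picks up only that file.
objects.append(("raw/2024-01-03/c.json", datetime(2024, 1, 3, tzinfo=timezone.utc)))
second_batch, bookmark = select_new_objects(objects, bookmark)
print([key for key, _ in second_batch])
```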

Dynamic Frames vs DataFrames

Glue introduces its own abstraction called a DynamicFrame on top of Spark DataFrames. DynamicFrames handle semi-structured data with inconsistent schemas (e.g., JSON with nullable fields, mixed types) without requiring a predefined schema. You can convert between the two: dynamic_frame.toDF() converts to a Spark DataFrame for full Spark API access, and DynamicFrame.fromDF(df, glue_context, "name") converts back.
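To see why inconsistent schemas need special handling, the plain-Python sketch below finds fields that appear with more than one type and then resolves one of them by casting. It only mimics the idea: in Glue, an ambiguous field shows up as a choice type on the DynamicFrame and is typically resolved with the resolveChoice transform. Both helper functions and the sample records are hypothetical.

```python
def infer_field_types(records):
    """Map each field name to the sorted list of Python type names seen."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def resolve_by_cast(records, field, cast):
    """Return copies of the records with `field` cast to a single type."""
    resolved = []
    for record in records:
        fixed = dict(record)
        if fixed.get(field) is not None:
            fixed[field] = cast(fixed[field])
        resolved.append(fixed)
    return resolved


records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 5},   # id arrives as a string, price as an int
    {"id": 3},                 # price is missing entirely
]

before = infer_field_types(records)
after = infer_field_types(resolve_by_cast(records, "id", int))
print(before)
print(after)
```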

Glue Triggers

Triggers control when jobs run. Three types: Scheduled (cron expression), On-demand (manual or API call), and Conditional (starts when another job/crawler succeeds or fails). Conditional triggers let you build job chains — Crawler → Transform Job → Load Job — without an external orchestrator.
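A conditional trigger can be created through the Glue API. The sketch below only builds the request for boto3's create_trigger call, so it runs without AWS access; the helper name, trigger name, and job names are hypothetical, and the commented-out call assumes boto3 is installed and credentials are configured.

```python
def conditional_trigger_request(trigger_name, upstream_job, downstream_job):
    """Build keyword arguments for the Glue create_trigger API call."""
    return {
        "Name": trigger_name,
        "Type": "CONDITIONAL",
        # Fire when the upstream job run finishes successfully.
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        # Start the downstream job when the predicate is met.
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


request = conditional_trigger_request(
    "run-load-after-transform", "transform-job", "load-job"
)
print(request["Type"], request["Actions"])

# To send the request (assumes boto3 and AWS credentials are available):
#   import boto3
#   boto3.client("glue").create_trigger(**request)
```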


Common AWS Glue Architectures

Pattern 1 — S3 Data Lake ETL

The most common pattern: raw data lands in S3 (Bronze layer), a Glue Crawler scans it and registers the schema in the Data Catalog, a Glue Spark job transforms it (clean, deduplicate, cast types), and writes Parquet to a Silver S3 bucket. Athena or Redshift Spectrum then queries the Silver layer directly.
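A minimal sketch of the transform step in this pattern might look like the following. It assumes the crawler has already registered the Bronze table in the Data Catalog; the function name, its parameters, and the idea of passing casts as a dictionary are choices made for this example. The GlueContext is passed in rather than created here, so the function itself needs no Glue-specific imports.

```python
def bronze_to_silver(glue_context, database, table_name, silver_path,
                     key_columns, casts):
    """Read a catalogued Bronze table, clean it, and write Parquet.

    casts maps column names to Spark type names, e.g. {"amount": "double"}.
    """
    # Read through the Data Catalog entry that the crawler registered.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    # Switch to a Spark DataFrame for the cleaning steps.
    df = dyf.toDF()
    # Clean and deduplicate on the key columns.
    df = df.dropna(subset=key_columns).dropDuplicates(key_columns)
    # Cast columns to their intended types.
    for column, type_name in casts.items():
        df = df.withColumn(column, df[column].cast(type_name))
    # Write columnar output for Athena or Redshift Spectrum to query.
    df.write.mode("overwrite").parquet(silver_path)
    return df

# Inside a Glue job, glue_context would typically be created with:
#   from awsglue.context import GlueContext
#   from pyspark.context import SparkContext
#   glue_context = GlueContext(SparkContext.getOrCreate())
```

Overwrite mode keeps the sketch simple; a production job would more likely write per-partition or append, depending on how the Silver layer is organised.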

Pattern 2 — RDS to Redshift

Glue reads from an operational RDS database (MySQL/PostgreSQL) using a JDBC connection, applies transformations, and writes to Amazon Redshift. When you write with write_dynamic_frame through a Redshift connection, Glue stages the data in a temporary S3 directory and issues the Redshift COPY command for you, so you do not have to script the load yourself.
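The read and write halves of this pattern can be sketched as below. It assumes the RDS table has been catalogued and that a Glue connection to Redshift already exists; the function name, its parameters, and the dropped column are hypothetical. As before, the GlueContext is passed in.

```python
def rds_to_redshift(glue_context, source_database, source_table,
                    redshift_connection, target_database, target_table,
                    temp_dir, drop_columns=()):
    """Copy a catalogued RDS table into Redshift, dropping unwanted columns."""
    # Read the operational table through its Data Catalog entry.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=source_database, table_name=source_table
    )
    # Example transformation: remove columns the warehouse should not receive.
    if drop_columns:
        dyf = dyf.drop_fields(list(drop_columns))
    # Write through the Redshift connection; Glue stages the data in temp_dir.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection=redshift_connection,
        connection_options={"dbtable": target_table, "database": target_database},
        redshift_tmp_dir=temp_dir,
    )
    return dyf
```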

Pattern 3 — Incremental CDC with Bookmarks

Enable Job Bookmarks on an S3-source job. Glue tracks the high watermark of processed files. On each run it only reads files modified since the last successful run. Combine with partitioning by date for efficient incremental loads into a data warehouse.
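In a job script, bookmarks depend on two things: each source read carries a transformation_ctx, and the job calls commit at the end so the bookmark state is saved. The sketch below shows that shape; the function name, the "read_events" context label, and the process callback are hypothetical, and it assumes the job was started with bookmarks enabled and that job.init(...) has already been called.

```python
def incremental_run(glue_context, job, database, table_name, process):
    """Read only unprocessed data, process it, then save the bookmark."""
    # transformation_ctx is the key under which bookmark state is tracked.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        transformation_ctx="read_events",
    )
    # Apply the caller's transform-and-load logic to the new data.
    process(dyf)
    # Committing last means a failed run leaves the bookmark unchanged,
    # so the same files are picked up again on the next attempt.
    job.commit()
    return dyf
```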

Pattern 4 — Glue + Step Functions

For complex multi-step pipelines, AWS Step Functions orchestrates Glue jobs. Step Functions handles retries, parallel execution, conditional branching, and error handling — more powerful than Glue Triggers alone for enterprise workflows.
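A two-step pipeline with retries and a failure branch could be described roughly as follows. The sketch builds an Amazon States Language definition as a Python dictionary and prints it as JSON; the job names, state names, and retry settings are hypothetical, and creating the state machine itself is left out.

```python
import json


def glue_task(job_name, next_state=None, max_attempts=2):
    """Build a Task state that runs a Glue job and waits for it to finish."""
    state = {
        "Type": "Task",
        # The .sync suffix makes Step Functions wait for the job run to complete.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,
                "MaxAttempts": max_attempts,
                "BackoffRate": 2.0,
            }
        ],
        # After retries are exhausted, route to a shared failure state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "PipelineFailed"}],
    }
    if next_state is not None:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "StartAt": "Transform",
    "States": {
        "Transform": glue_task("transform-job", next_state="Load"),
        "Load": glue_task("load-job"),
        "PipelineFailed": {"Type": "Fail", "Error": "GlueJobFailed"},
    },
}

print(json.dumps(definition, indent=2))
```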

PySpark Code Patterns in AWS Glue

Below are the most commonly used Glue PySpark patterns you will write and be tested on in interviews.

AWS Glue Interview Questions

Frequently asked AWS Glue interview questions from real data engineering interviews.