
We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means using it purely for 'Spark as a service', so we're pretty much always reading/writing data from/to S3 with a Spark script that would work on any Spark cluster (i.e. not using the GlueContext-type stuff; I don't even know what it does, to be honest).

For this purpose I think it's fantastic. Write a PySpark script and press go (we have a package on PyPI called etl_manager to facilitate this). It 'just works' for this use case, and there's a huge amount of value for us in not having to think at all about managing or configuring a Spark cluster.

Our biggest bugbear was slow job startup times and the lack of pip installs, but both were fixed in Glue 2.0, which was released recently.

We don't use any of the visual/GUI-based tools for our jobs; we just write our own Spark code and version-control it on GitHub. That's unlikely to change any time soon, even with products like DataBrew. That said, DataBrew's data profiling tool does look like it could be useful as something to refer to while writing code.

(I realise this doesn't help with your specific issue, but I thought it would be helpful to offer an example of a good experience.)



> We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means using it purely for 'Spark as a service'

This is spot on, IMO. I use Glue internally (opinions are my own) and still believe the best course of action is to use Glue purely as managed Spark. We provide an empty Scala "script" that does nothing, load a compiled JAR containing the Scala code that actually runs our job as a library, and have Glue exec into that.

We can version the ETL in git, run local tests outside of the Glue data plane, prototype in the Spark shell, and much more.
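The commenter does this in Scala with a compiled JAR; as a rough sketch of the same idea in Python (hypothetical names throughout, with the job package assumed to be shipped separately, e.g. via --extra-py-files), the script Glue runs is just a thin shim that delegates to versioned, separately-tested job code:

```python
# Entry-point pattern: the script handed to Glue does nothing but delegate
# to job logic that lives in its own versioned, unit-tested package.
# Everything here is illustrative; names are made up for the sketch.

def run_job(source_path: str, dest_path: str) -> dict:
    # Stand-in for the real ETL logic, which in practice would live in a
    # separate package and be tested outside the Glue data plane.
    return {"source": source_path, "dest": dest_path, "status": "ok"}


def main() -> dict:
    # The Glue "script" is only this shim: resolve arguments and exec into
    # the packaged job, nothing else.
    return run_job("s3://example-bucket/raw/", "s3://example-bucket/curated/")


if __name__ == "__main__":
    print(main())
```

Keeping the entry point empty like this is what makes it possible to prototype the real logic in a local Spark shell and run tests without touching Glue at all.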


GlueContext mostly manages bookmarks as far as I can tell, which are an insanely useful feature for us.


Interesting - I was vaguely aware that bookmarks existed. I'd be interested to know what you're using them for - they definitely _sound_ useful. I guess it probably depends on what sort of workloads you're running. At the moment we use Airflow to manage DAGs/retries etc. I like it as a user, but from what I understand from our ops people, it's a pain to manage.


The use case is pretty simple. You’ve got a bucket that you want to load data from and shuffle away somewhere else (Redshift, S3, whatever). It could be populated by a Firehose, another system, etc. Bookmarks just store the greatest “created time” across the files you’ve loaded from S3, so when you trigger a job it only loads files created since the last successful run. It also does some funky stuff to handle S3’s eventual consistency with LIST operations.

Super simple incremental loading. This also works when loading data from a relational database, by storing the greatest primary key value.
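The bookmark behaviour described above can be sketched in plain Python: remember the greatest "created time" seen so far, and on each run load only files newer than that. (This is a simulation of the idea only - real bookmarks are managed by GlueContext, and the names here are made up.)

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


def files_to_load(listing: Iterable[Tuple[str, datetime]],
                  bookmark: Optional[datetime]) -> Tuple[List[str], Optional[datetime]]:
    """Return the file keys created after `bookmark`, plus the new bookmark.

    `listing` is (key, created_time) pairs, as you'd get from an S3 LIST.
    A None bookmark means "first run: load everything".
    """
    new_files = [(key, ts) for key, ts in listing
                 if bookmark is None or ts > bookmark]
    if not new_files:
        return [], bookmark
    # The new bookmark is the greatest created time among the loaded files,
    # so the next run skips everything up to and including this batch.
    new_bookmark = max(ts for _, ts in new_files)
    return [key for key, _ in new_files], new_bookmark
```

A first run with `bookmark=None` loads every file and returns the latest created time; feeding that value back in on the next run yields only the files that arrived in between. The same shape works for a relational source by bookmarking the greatest primary key instead of a timestamp.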


Thanks, that's really useful.


Reading from/writing to Glue tables also works rather well, in addition to plain S3, although that's just a thin layer over S3.



