
We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means using it purely for 'Spark as a service', so we're pretty much always reading/writing data from/to S3 with a Spark script that would work on any Spark cluster (i.e. not using the GlueContext-type stuff; I don't even know what it does, to be honest).

For this purpose I think it's fantastic. Write a PySpark script and press go (we have a package on PyPI called etl_manager to facilitate this). It 'just works' for this use case, and there's a huge amount of value for us in not having to think at all about managing or configuring a Spark cluster.

Our biggest bugbear was slow job startup times and the lack of pip installs, but both were fixed in Glue 2.0, which was released recently.

We don't use any of the visual/GUI-based tools for our jobs; we just write our own Spark code and version-control it on GitHub. That's unlikely to change any time soon, even with products like DataBrew. That said, DataBrew's data profiling tool does look like it could be useful as something to refer to while writing code.

(I realise this doesn't help with your specific issue, but I thought it would be helpful to offer an example of a good experience.)



> We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means using it purely for 'Spark as a service'

This is spot on, IMO. I use Glue internally (opinions are my own) and still believe the best course of action is to use Glue purely as managed Spark. We provide an empty Scala "script" that does nothing, load a compiled JAR containing the Scala code that actually runs our job as a library, and have Glue exec into that.

We can version the ETL in git, run local tests outside of the Glue data plane, prototype in the Spark shell, and much more.
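The commenter does this in Scala with a compiled JAR; as a rough sketch of the same idea in Python (hypothetical names throughout, with the job package assumed to be shipped separately, e.g. via --extra-py-files), the script Glue runs is just a thin shim that delegates to versioned, separately-tested job code:

```python
# Entry-point pattern: the script handed to Glue does nothing but delegate
# to job logic that lives in its own versioned, unit-tested package.
# Everything here is illustrative; names are made up for the sketch.

def run_job(source_path: str, dest_path: str) -> dict:
    # Stand-in for the real ETL logic, which in practice would live in a
    # separate package and be tested outside the Glue data plane.
    return {"source": source_path, "dest": dest_path, "status": "ok"}


def main() -> dict:
    # The Glue "script" is only this shim: resolve arguments and exec into
    # the packaged job, nothing else.
    return run_job("s3://example-bucket/raw/", "s3://example-bucket/curated/")


if __name__ == "__main__":
    print(main())
```

Keeping the entry point empty like this is what makes it possible to prototype the real logic in a local Spark shell and run tests without touching Glue at all.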


GlueContext mostly manages bookmarks as far as I can tell, which are an insanely useful feature for us.


Interesting - I was vaguely aware that bookmarks existed. I'd be interested to know what you're using them for - they definitely _sound_ useful. I guess it probably depends on what sort of workloads you're running. At the moment we use Airflow to manage DAGs/retries etc. I like it as a user, but from what I understand from our ops people, it's a pain to manage.


The use case is pretty simple. You’ve got a bucket that you want to load data from and shuffle away somewhere else (Redshift, S3, whatever). It could be populated by a Firehose, another system, etc. Bookmarks just store the greatest “created time” across the files you’ve loaded from S3, so when you trigger a job it only loads files created since the last successful run. It also does some funky stuff to handle S3’s eventual consistency with LIST operations.

Super simple incremental loading. This also works when loading data from a relational database, by storing the greatest primary key value.
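The bookmark behaviour described above can be sketched in plain Python: remember the greatest "created time" seen so far, and on each run load only files newer than that. (This is a simulation of the idea only - real bookmarks are managed by GlueContext, and the names here are made up.)

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


def files_to_load(listing: Iterable[Tuple[str, datetime]],
                  bookmark: Optional[datetime]) -> Tuple[List[str], Optional[datetime]]:
    """Return the file keys created after `bookmark`, plus the new bookmark.

    `listing` is (key, created_time) pairs, as you'd get from an S3 LIST.
    A None bookmark means "first run: load everything".
    """
    new_files = [(key, ts) for key, ts in listing
                 if bookmark is None or ts > bookmark]
    if not new_files:
        return [], bookmark
    # The new bookmark is the greatest created time among the loaded files,
    # so the next run skips everything up to and including this batch.
    new_bookmark = max(ts for _, ts in new_files)
    return [key for key, _ in new_files], new_bookmark
```

A first run with `bookmark=None` loads every file and returns the latest created time; feeding that value back in on the next run yields only the files that arrived in between. The same shape works for a relational source by bookmarking the greatest primary key instead of a timestamp.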


Thanks, that's really useful.


Reading from/writing to Glue tables also works rather well, in addition to plain S3, although that's just a thin layer over S3.



