The thing folks don't mention regarding AWS is the inherent competitive advantage its internal micro-startups have. We focus on AWS launching managed Elasticsearch or managed Kafka, and talk about them (legally) using open source contributions to make money, but I think those are minor compared to things like this.
What AWS has is a culture and institutional knowledge on how to launch new products that take foundational AWS services (S3, Lambda, EC2, DDB, etc.) and glues (!) them together better than what a competing non-AWS company can do. This is a bold claim (since AWS launches some very crappy products), but imagine being able to use AWS infrastructure at cost, having internal knowledge on how to best optimize that infrastructure and access to the engineers that own those services while you build abstractions and better user experiences on top of them.
I don't know how cos that compete in any related space can survive. When AWS is willing to throw whatever against a wall (launching 50+ services a year) to see what sticks, sooner or later they're going to land in your space.
Become more locked into AWS's foundational services -> these abstractions on top of them start to make more sense in engineering complexity / delivery time / possible cost dimensions -> Use more of these -> Become more locked into AWS's foundational services.
Speaking as a former AWS Engineer, I disagree with the sentiment that they are able to glue AWS services together better than what a competing non-AWS company can do. Internally the use of AWS is subject to the same constraints and APIs you and I have.
Their competitive advantage is their captive customer base, which would much rather pay a premium to use an AWS-managed service than use another vendor.
Which is that companies do not procure individual AWS services but rather AWS itself. Meaning that whenever AWS releases a new tool it is instantly approved and available for use across the company (barring internal processes, e.g. security hardening).
Compare this with a startup which has to go through a 6 month long procurement process complete with vendor bake-offs in order to sell their similar tool.
If AWS continues to move into the application space they will surely dominate the enterprise because of this.
Just yesterday I selected SNS for a project instead of a local provider because AWS put it through the security audits we care about and we don't win prizes for spending time on these decisions.
-- AWS advertises the tool via the console and preintegrates it from both directions
-- IT+Procurement already approved AWS for projects, so PMs can skip vendor/tool approval+onboarding dances and focus on the budget one
True of not just AWS but Azure + GCP too
Startups can compete, but it gets into stuff like deep tech or cross-vendor integrations, where the visibility and integration advantages don't apply as much to the cloud vendors, so they'd rather go after easier targets until they can't. (Folks here posted about UI, but for B2B I disagree for most cases, unless there's something deeply technical about it that a 20-person team can't copy.)
The exception is open source software. Free Software's biggest risk to the org is patent and license encumbrance. If we can develop an easy way to detect such things in Free Software, and we develop communities where enterprises can freely contribute to them (to make them more Enterprisey), then it has a chance to replace incumbent managed solutions.
- Historically, OSS seems to be free product dev for big cloud (... cue AWS's paid PR people to say otherwise ...). Their integration, advertising, and procurement advantages make it MUCH easier to win contracts before the OSS devs may even know their software is being used, and without a bid process. For a fraction of the effort and contribution, they are switching it to a model of monopoly channel owners vs content producers and driving the software margins to 0 on the content side. That's why anti-big-cloud LGPL-when-SaaS style licenses are emerging. There are always exceptions, but it's not the axis to compete on unless you use such a license.
- I agree about the community aspect, indirectly. If the software, in addition to being OSS, relies somehow on community and its steward -- not just source code -- and participation in it is somehow what's paying for the OSS dev, then yes. For example, maybe the community is also a social network (Slack/Teams across orgs), or generating threat intel -- the software itself matters less post-scale, so forking is OK.
The ability to stop by the desk of an S3 team member, ask whatever technical questions, and get authoritative answers is enough to defeat any competitor who wants to build products on top of S3.
Not to mention access to the road-map, strategic investment, genuine appreciation of product strengths and weaknesses, etc.
Don't forget being able to get high priority in the backlog if you need a feature from another service in order to launch.
Former AWS engineer who launched a service here. That, access to source code, and being able to set up an hour-long meeting with any engineer are the big points. Not that I think lacking these is insurmountable, but they're very nice to have.
I did not; I am comparing that to a random guy from some random startup, who ranks even behind the poor customers who cannot get hold of any devs for their confusing issues using AWS...
Yes. Compared to those, newer AWS services are more likely to work with, and integrate with, existing services. However, the further you stray from 'Compute' the less likely this is to be the case. More 'esoteric' services tend to be their own microcosm and sometimes feel like they could have come from another company entirely (Quicksight? etc.)
This is still light years ahead of Azure (and to a lesser extent GCP), where even compute services will not necessarily work with one another. You need to make sure the "SKU"s are compatible. Want to use some fancy storage? Oh no, you need to use SKUs XYZ and premium this, premium that. Whereas if AWS releases a new storage type (such as io2), you can pretty much assume you can attach it to any of your existing instances (even if some particular types may be recommended).
Not to mention surprising behavior when you try to mix and match features. On GCP and AWS, you have instances working perfectly fine, and you discover that they provide the ability to create 'internal' load balancers? Cool! Create one, point it at the instances, or at their respective automatically managed groups (ASGs or instance groups). It will be there in case you need it, and your workloads are unaffected. Do that on Azure, and now your instances have no internet connectivity whatsoever, as all traffic is now routed through it. There are footguns everywhere.
Technically, GCP tends to be the most advanced of the bunch (their automatic instance migration is brilliant; meanwhile AWS keeps sending us emails saying that some instance is degraded and it's our problem now). Their networking capabilities are impressive as well (first to have global anycast load balancers, Google's premium network, subnets spanning AZs, etc). However, they do seem to be too opinionated. Want proxy protocol on your NLBs, even though NLBs preserve the source IP so in theory you don't need it (but with a K8s ingress you might)? AWS says: sure, we have the feature, enable it, we don't care. Google says: why do you need proxy protocol, the source IP is there. These are not the headers you are looking for. Azure says: proxy protocol wat?
> This is still light years ahead of Azure (and to a less extent GCP), where even compute services will not necessarily work with one another.
You can't use the SQL Server Virtual Machine extension on an Azure VM to extend the disks if the VM size is one of the AMD EPYC CPU types.
During the support call, the Microsoft tech shared a screenshot of the source code for the SQL VM extension, and it had a switch statement that decides if each feature is "supported" or not.
Let that sink in: Microsoft literally hard-codes their VM-size-to-feature lookups in probably thousands and thousands of places with huge switch statements full of code like this:
case "Standard_M416ms_v2": return false;
case "Standard_M416s_v2": return false;
case "Standard_M64ls": return true;
case "Standard_M64ms": return true;
This is their standard coding practice.
So next time you try a new VM size or type, don't be surprised if things randomly don't work or "aren't supported" for mysterious reasons...
> I don't know how cos that compete in any related space can survive. When AWS is willing to throw whatever against a wall (launching 50+ services a year) to see what sticks, sooner or later they're going to land in your space.
This is true for a subset of products, but not uniformly. To the extent you're building an infrastructure product, you get to choose what axis to compete on. If you're going up against AWS, then trying to compete with them on things like cost and reliability is likely a poor choice. But something like user/dev experience isn't. DocumentDB has a Mongo-compatible API and yet Mongo's Atlas hosted service is responsible for most of the company's growth over the past year. Why? Because it provides a unique offering, not just a 'good-enough' offering, which is what a lot of higher-level AWS services are.
> sooner or later they're going to land in your space.
Absolutely. I've seen this a handful of times with companies I consult, where they suddenly find themselves competing with AWS. I call it the November surprise because it happens around Re:Invent.
There are several reasons this is a tough thing to compete against, and AWS's vertical integration is just one of them. I've already written about them and also how to come out ahead if you find yourself in this situation: https://www.gkogan.co/blog/big-cloud/
We were using Alooma for ETL for years until Google bought it and started to deprecate AWS connections. It was a massive PITA, but it mostly worked. We switched over to AWS DMS and it was easy. Honestly it didn't take much effort. It has worked flawlessly - literally zero errors - from the day we started it up, and best of all, it's free. All you pay for is the instance it's using for you. That sort of thing can save startups much needed money. Yes, you're tied to the ecosystem - and that's what they want - but it's worth it. Once I talk to people and basically say the same thing you're saying, they start to look at AWS a bit differently.
I think their key advantage is sales. Imagine a product that adds a small amount of value to a company but requires a long drawn out sales process including research on available vendors, pricing, security, use cases, determining requirements, etc vs a developer going to the AWS console and clicking "create databrew". It's no competition.
And with sales making up such a huge percentage of many of these SaaS companies' revenue, Amazon can pass the lack of sales cost on to the customer as savings. Skip the sales process and the sales cost: win-win.
I don't see the same broad amount of services launched out of Azure as I do from AWS, and definitely not from GCP.
I don't know if this is a strategic difference or an execution/cultural difference (AWS ships products faster, but they're often barely usable in v1).
How are you judging the "broad amount of services" launched out of Azure? They release something on the order of 10-25 updates a week, their services feed runs nonstop.
I claim no special knowledge of AWS, but Azure is moving at full pace, certainly faster than even we global SIs can keep up with in terms of providing support and capabilities.
I am a big fan of AWS and am happily running our entire tech stack with their services for a very reasonable price. That said, Glue is an absolute dumpster fire of a product. My team and I have wasted countless hours trying to wrangle a DynamoDB -> Glue -> Athena -> Quicksight pipeline and Glue refused to cooperate (we ended up building our own DDB to SQL pipeline after finally giving up on Glue). Hopefully this will increase the usability of the Glue product and actually enable out of the box ETL.
We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means using it purely for 'Spark as a service', so we're pretty much always reading/writing data from/to S3 with a script that would work on any Spark cluster (i.e. not using the GlueContext-type stuff; I don't even know what it does, tbh).
For this purpose I think it's fantastic. Write a PySpark script and press go (we have a package on pypi called etl_manager to facilitate this). It 'just works' for this use case, and there's a huge amount of value for us in not having to think at all about managing or configuring a Spark cluster.
Our biggest bugbear was slow job startup times and a lack of pip installs, but both of those are fixed with glue 2.0 which was released recently.
We don't use any of the visual/GUI based tools for our jobs, we just write our own Spark code and version control in Github. That's unlikely to change any time soon with products like Databrew. That said, the data profiling tool in Databrew does look like it could be useful as something to refer to when writing code.
(I realise this doesn't help with your specific issue, but i thought it was helpful to offer an example of a good experience)
> We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means is using it purely for 'Spark as a service'
This is spot on IMO. I use Glue internally (opinions are my own) and still believe that the best course of action is Glue should only run managed Spark. We provide an empty Scala "script" that does nothing, and load a compiled JAR file with the Scala code that actually runs our job as a library and have Glue exec into that.
We can version the ETL in git, run local tests outside of the Glue data plane, prototype in the Spark shell, and much more.
Interesting - I was vaguely aware of the existence of bookmarks. I'd be interested to know what you're using them for - they definitely _sound_ useful. I guess it probably depends on what sort of workloads you're doing. At the moment we use Airflow to manage DAGs/retries etc. I like it as a user, but from what I understand from our ops people it's a pain to manage.
The use case is pretty simple. You’ve got a bucket that you want to load data from and shuffle it away somewhere else (redshift, s3, whatever). This could be populated by a Firehose, another system, etc etc. Bookmarks just store the greatest “created time” for the files you’re loading from s3. So when you trigger a job it will only load files created since the last successful run. It does some funky stuff to handle s3’s eventual consistency with LIST operations.
Super simple incremental loading. This also works when loading data from a relational database, by storing the greatest primary key value.
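The mechanism described above is simple enough to sketch in plain Python. This is an illustration of the bookmark idea, not Glue's actual implementation; `incremental_load` and the `(key, created_time)` tuple format are invented for the example:

```python
def incremental_load(files, bookmark):
    """files: list of (key, created_time) pairs listed from the bucket.
    bookmark: the greatest created_time seen by the last successful run."""
    # Only keep files created after the stored high-water mark.
    new_files = [(key, t) for key, t in files if t > bookmark]
    # Advance the bookmark to the newest file we are about to load.
    new_bookmark = max((t for _, t in new_files), default=bookmark)
    return [key for key, _ in new_files], new_bookmark

files = [("a.json", 100), ("b.json", 205), ("c.json", 310)]
to_load, bookmark = incremental_load(files, bookmark=200)
# to_load == ["b.json", "c.json"]; bookmark advances to 310
```

The relational-database variant mentioned above is the same loop with the greatest primary key standing in for created_time.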
Agreed, Glue turned out to be very underwhelming. We ended up moving ETL flows to self-hosted Prefect (https://www.prefect.io/) instead.
In general, AWS excels at a lot of core features (EC2, S3, the databases) but the higher-level services feel very thrown-together, with awful documentation. They are launching a large amount of half-baked services these days, (intending to capitalize on vendor lock-in?), but it's making the ecosystem start to look like a confusing mess. A lot of these wouldn't survive if they were marketed as standalone products.
We also had issues with Glue/Athena: never got Parquet to work right, had schema issues with schemaless data coming from MongoDB, and in general it is obtuse and hard to work with. We did a data lab with AWS using Glue and it was an exercise in pain. Then we did a POC with Snowflake and we cried with joy at how easy it was to work with large amounts of data. But our ELT was very light, not true transformation of data, more rearranging. I still think we will need Glue for some workloads and I pray this makes our lives easier.
I haven't used it myself but my coworkers who gave it a spin unanimously agree that Glue is a dumpster fire, even though they're otherwise huge AWS fans.
I don't have any write-ups, but off the top of my head I remember issues with data type casting from DDB -> Athena via Glue. If a DDB number type was an int in one item and a float in a different item, Glue transformed it to a struct (something like struct(long:null, double:50.50) and struct(long:20, double:null)). The suggested fix of a cast function didn't work.
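One way to sidestep that class of problem is to normalize numeric types before the data ever reaches Glue/Athena, so a column never mixes int and float (the mixed case is what produced the struct(long, double) union). A minimal sketch; the function name and row format are illustrative, not an AWS API:

```python
def coerce_numbers(items, numeric_fields):
    """Force every value in the named fields to float so downstream
    schema inference sees a single consistent numeric type."""
    out = []
    for item in items:
        fixed = dict(item)
        for field in numeric_fields:
            if fixed.get(field) is not None:
                fixed[field] = float(fixed[field])
        out.append(fixed)
    return out

rows = [{"price": 20}, {"price": 50.5}]
coerce_numbers(rows, ["price"])  # both prices come back as floats
```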
That brings back bad memories from working with Glue. If the source data isn't 100% clean and compatible with the destination, it is such a pain to get it working.
I eventually got it set up, but for the amount of effort involved, I could have just written my own custom ETL solution in less time.
The job scheduling and triggers are nice once set up. I do hope AWS makes improvements, because Glue has potential, but it doesn't feel ready for prime time yet.
I'll add to this that because AWS is simply piecing different technologies together under the hood, there are a lot of data type issues.
Another example is that some date/time columns got brought in and crawled as strings. That's a bummer because obviously you want to do native operations on these - datediff, datepart, etc. - without having to cast all over the place. We manually set them to timestamp and they work perfectly (awesome!), even in Athena, so we thought the problem was solved.
However, once we did anything with those columns in Glue ETL, those fields got set as nulls.
The problem can be fixed of course, but these types of issues happen fairly often and they quietly fail (no errors, just values set to null).
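Since the failure mode is quiet (values become null with no error), a cheap defense is to perform the cast yourself and fail loudly when rows don't parse. A hedged sketch; the function name and format string are assumptions for illustration, not Glue behavior:

```python
from datetime import datetime

def cast_timestamps(rows, field, fmt="%Y-%m-%d %H:%M:%S"):
    """Cast a string column to datetime, raising instead of silently
    nulling the values the parser cannot handle."""
    out, failures = [], 0
    for row in rows:
        try:
            row = {**row, field: datetime.strptime(row[field], fmt)}
        except (ValueError, TypeError):
            row = {**row, field: None}
            failures += 1
        out.append(row)
    if failures:
        raise ValueError(f"{failures} rows failed the timestamp cast on {field!r}")
    return out
```

The same count-then-raise pattern works for any cast a pipeline would otherwise swallow.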
Glue is an absolutely horrendous mishmash that seems to suffer from a serious lack of investment and vision. The managed Spark component is a good product buried under all-round terrible developer and console UX, and the data catalog/schema crawling is really useful.
But I'm glad the lack of investment is turning around with this, the recently released Glue Studio, and the fantastic "Glue 2 fast startup" job types.
Running any sort of innovative data infrastructure startup (whether data prep, database, data pipeline, etc.) is now an exercise in futility. The big three cloud providers will embrace your innovation, extend your product, and extinguish your business.
Given the market power of cloud providers, every infrastructure innovation is now a "sustaining innovation" in Christensen's terminology.
The key to success seems to be building a product for a niche the cloud providers think is too small, and maximizing your value within that niche so that if your market grows large enough for a cloud provider like AWS to come after you, you can pivot to providing customizations for your highest-margin customers. MongoDB is a good example of this.
On the other hand, none of the major cloud providers seem capable of moving up the value chain to the application level, so if I were starting a company today I would focus on leveraging my infrastructure-level innovation to create a vertical opportunity in a high margin market instead of seeking to build a horizontal platform (IoT platform for example).
I'm glad to see a competitor to Trifacta/Google Cloud Dataprep. My company relies on it heavily, but we constantly run into bugs, UI glitches, and crashes that can sometimes block people for hours or days. It's the sort of software that you hate to use, but the benefits are too good to ignore.
The benefit to visual ETL is that non-engineers can do a lot of basic data engineering. We tie this into our more complex code-based ETL pipelines. It was a game-changer for us and helps us get a lot more done.
Curious to know - how do you tie the no-code transformation with your code-based transformations? Usually these processes end up siloed from each other.
These GUI-driven/visual ETL tools certainly have their place, but they sit firmly at one end of the ease-of-use vs engineering-discipline ETL continuum.
As other posters commented, visual ETL tools often suffer from poor source control or limited extensibility, but they do provide the rapid development environment that (generally more business-oriented) users seek. They also tend to trivialize the value of experience and discipline. For example, I go to an accountant for my tax because they apply learned experience relating to tax that I do not have (even though the math is easy), whereas in data engineering, seemingly simple tasks such as correctly typing money columns or dealing with timezones get glossed over in the pursuit of DIY - and then you wonder why your money columns don't reconcile or why you lose data in failure scenarios.
At the other end of the continuum large teams writing bespoke ETL code for every job does not scale well for many reasons (https://reorchestrate.com/posts/code-doesnt-scale-for-etl/). I think the positive reaction to ideas like Data Mesh comes from the failures of these large, centralized teams which coincided with the Hadoop era.
Our solution has been to develop an open source (MIT) declarative framework (https://arc.tripl.ai/) that allows configuration driven ETL - mostly developed via a Jupyter Notebook environment (to allow rapid development and appeal to a larger audience) - whilst making most of the difficult tasks mentioned above easier. This has been in development for a few years now and continues to evolve. We value your feedback.
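The money-typing point above is easy to demonstrate: binary floats cannot represent most decimal fractions exactly, which is precisely why float-typed money columns fail to reconcile. A minimal illustration using Python's `decimal` module:

```python
from decimal import Decimal

# Floats accumulate representation error; Decimals do not.
float_total = 0.1 + 0.2                 # 0.30000000000000004...
exact_total = Decimal("0.10") + Decimal("0.20")

assert float_total != 0.3               # the float version is already off
assert exact_total == Decimal("0.30")   # the Decimal version reconciles
```

Note the Decimals are constructed from strings; `Decimal(0.1)` would inherit the float's error.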
Visual ETL is not as good of an idea as it seems at first. You end up putting a ton of business logic into the menus of these tools, and it’s not version controlled, and it’s not searchable. You’re better off doing this kind of work in SQL.
My guess (without having tried it) is Databrew is backed by a solid DSL, and you are really generating good clean code when you are using this "Low Code" tool.
And that works great - a great DSL plus a visual GUI that edits that DSL. Because, like you said, you absolutely need version control.
Having used GUI ETL tools for years (SSIS, Informatica, Talend, Appworx) and now using Airflow (Prefect is an excellent alternative, btw), I hope to never go back. Great to see Glue improving, but for the industry's sake I hope it doesn't catch on. Most ETL should be treated as code. As code, ETLs are easier to write, maintain, and manage complexity in.
I started my team on Alteryx but happily moved everything to Python a year later. UIs are terrible for change management, versioning, and documentation. Once you have a big-ish team doing data work, UIs become a significant drag on collaboration and productivity.
Azure Data Factory accomplishes what you're looking for by generating and version-controlling JSON pipeline definitions... you can just edit the pipeline JSONs too if you'd rather build in code than in the visual canvas.
It looks like the jobs are exportable as JSON, which would make this far more maintainable than some of the existing proprietary ETL tools (lookin' at you, IBM).
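The "JSON pipeline definition" approach is easy to sketch: a small, diffable document interpreted by a runner. The format below is entirely made up (it is not ADF's or Glue's schema); it's just to show why this style version-controls well - the whole transformation lives in one reviewable text artifact:

```python
import json

# A tiny interpreter for an invented declarative pipeline format.
PIPELINE = json.loads("""
{
  "steps": [
    {"op": "rename",     "from": "amt",    "to": "amount"},
    {"op": "cast_float", "field": "amount"},
    {"op": "filter_gt",  "field": "amount", "value": 0}
  ]
}
""")

def run(rows, pipeline):
    """Apply each declared step to a list of dict rows, in order."""
    for step in pipeline["steps"]:
        if step["op"] == "rename":
            rows = [{**{k: v for k, v in r.items() if k != step["from"]},
                     step["to"]: r[step["from"]]} for r in rows]
        elif step["op"] == "cast_float":
            rows = [{**r, step["field"]: float(r[step["field"]])} for r in rows]
        elif step["op"] == "filter_gt":
            rows = [r for r in rows if r[step["field"]] > step["value"]]
    return rows

print(run([{"amt": "5"}, {"amt": "-1"}], PIPELINE))  # [{'amount': 5.0}]
```

A GUI can read and write this document while engineers diff it in git; that's the division of labor the thread is arguing for.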
Are there any alternatives to this style of application? I was going to try to build something similar for my team that would let them map columns and do simple transforms, then push the resulting flow to me to move into an Airflow DAG.
I would really like a visual editor I can adjust with Regex functions and mappings that I can self host and iterate on.
Haven't used this yet, but this looks like a really good user experience from their demo video. Haven't used competitors like Alteryx myself, but just having this integrate so well into the AWS ecosystem makes this seem really useful.
> What AWS has is a culture and institutional knowledge on how to launch new products that take foundational AWS services (S3, Lambda, EC2, DDB, etc.) and glues (!) them together better than what a competing non-AWS company can do. This is a bold claim (since AWS launches some very crappy products), but imagine being able to use AWS infrastructure at cost, having internal knowledge on how to best optimize that infrastructure and access to the engineers that own those services while you build abstractions and better user experiences on top of them.
> I don't know how cos that compete in any related space can survive. When AWS is willing to throw whatever against a wall (launching 50+ services a year) to see what sticks, sooner or later they're going to land in your space.
> Become more locked into AWS's foundational services -> these abstractions on top of them start to make more sense in engineering complexity / delivery time / possible cost dimensions -> Use more of these -> Become more locked into AWS's foundational services.
This feels very different from Azure or GCP.