The thing folks don't mention regarding AWS is the inherent competitive advantage its internal micro-startups have. We focus on AWS launching managed Elasticsearch or managed Kafka, and talk about them (legally) using open source contributions to make money, but I think those are minor compared to things like this.
What AWS has is a culture and institutional knowledge on how to launch new products that take foundational AWS services (S3, Lambda, EC2, DDB, etc.) and glues (!) them together better than what a competing non-AWS company can do. This is a bold claim (since AWS launches some very crappy products), but imagine being able to use AWS infrastructure at cost, having internal knowledge on how to best optimize that infrastructure and access to the engineers that own those services while you build abstractions and better user experiences on top of them.
I don't know how cos that compete in any related space can survive. When AWS is willing to throw whatever against a wall (launching 50+ services a year) to see what sticks, sooner or later they're going to land in your space.
Become more locked into AWS's foundational services -> these abstractions on top of them start to make more sense in engineering complexity / delivery time / possible cost dimensions -> Use more of these -> Become more locked into AWS's foundational services.
Speaking as a former AWS Engineer, I disagree with the sentiment that they are able to glue AWS services together better than what a competing non-AWS company can do. Internally the use of AWS is subject to the same constraints and APIs you and I have.
Their competitive advantage is their captive customer base, which would much rather pay a premium to use an AWS-managed service than use another vendor.
Which is that companies do not procure individual AWS services but rather AWS itself. Meaning that whenever AWS releases a new tool it is instantly approved and available for use across the company (barring internal processes, e.g. security hardening).
Compare this with a startup which has to go through a 6 month long procurement process complete with vendor bake-offs in order to sell their similar tool.
If AWS continues to move into the application space they will surely dominate the enterprise because of this.
Just yesterday I selected SNS for a project instead of a local provider because AWS put it through the security audits we care about and we don't win prizes for spending time on these decisions.
-- AWS advertises the tool via the console and preintegrates it from both directions
-- IT+Procurement already approved AWS for projects, so PMs can skip vendor/tool approval+onboarding dances and focus on the budget one
True of not just AWS but Azure + GCP too
Startups can compete, but it gets into stuff like deep tech or cross-vendor integrations, where the visibility and integration advantages don't apply as much to the cloud vendors, so they'd rather go after easier targets until they can't. (Folks here posted about UI, but for B2B I disagree for most cases, unless there's something deeply technical about it that a 20-person team can't copy.)
The exception is open source software. Free Software's biggest risk to the org is patent and license encumbrance. If we can develop an easy way to detect such things in Free Software, and we develop communities where enterprises can freely contribute to them (to make them more Enterprisey), then it has a chance to replace incumbent managed solutions.
- Historically, OSS seems to be free product dev for big cloud (... cue AWS's paid PR people to say otherwise ...). Their integration, advertising, and procurement advantages make it MUCH easier to win contracts before the OSS devs may even know their software is being used, and without a bid process. For a fraction of the effort and contribution, they are switching it to a model of monopoly channel owners vs content producers and driving the software margins to 0 on the content side. That's why anti-big-cloud LGPL-when-SaaS style licenses are emerging. There are always exceptions, but it's not the axis to compete on unless you use such a license.
- I agree about the community aspect, indirectly. If the software, in addition to being OSS, relies somehow on community and its steward -- not just source code -- and participation in it is somehow what's paying for the OSS dev, then yes. For example, maybe the community is also a social network (Slack/Teams across orgs), or generating threat intel -- the software itself matters less post-scale, so forking is OK.
The ability to stop by the desk of an S3 team member, ask whatever technical questions, and get authoritative answers is enough to defeat any competitor who wants to build products on top of S3.
Not to mention access to the road-map, strategic investment, genuine appreciation of product strengths and weaknesses, etc.
Don't forget being able to get high priority in the backlog if you need a feature from another service in order to launch.
Former AWS engineer who launched a service here. That, access to source code, and being able to set up an hour-long meeting with any engineer are the big points. Not that I think lacking these is insurmountable, but they're very nice to have.
I did not; I am comparing that to a random guy from some random startup, who ranks even behind the poor customers who cannot get hold of any devs for their confusing issues using AWS...
Yes. Compared to those, newer AWS services are more likely to work with, and integrate with, existing services. However, the further you stray from 'Compute' the less likely this is to be the case. More 'esoteric' services tend to be their own microcosm and sometimes feel like they could have come from another company entirely (Quicksight? etc.)
This is still light years ahead of Azure (and to a lesser extent GCP), where even compute services will not necessarily work with one another. You need to make sure the "SKU"s are compatible. Want to use some fancy storage? Oh no, you need to use SKUs XYZ and premium this, premium that. Whereas if AWS releases a new storage type (such as io2), you can pretty much assume you can attach it to any of your existing instances (even if some particular types may be recommended).
Not to mention surprising behavior when you try to mix and match features. On GCP and AWS, you have instances working perfectly fine, and you discover that they provide the ability to create 'internal' load balancers? Cool! Create one, point it at the instances, or at their respective automatically managed groups (ASGs or instance groups). It will be there in case you need it, and your workloads are unaffected. Do that on Azure, and now your instances have no internet connectivity whatsoever, as all traffic is now routed through it. There are footguns everywhere.
Technically, GCP tends to be the most advanced of the bunch (their automatic instance migration is brilliant; meanwhile AWS keeps sending us emails saying that some instance is degraded and it's our problem now). Their networking capabilities are impressive as well (first to have global anycast load balancers, Google's premium network, subnets spanning AZs, etc). However, they do seem to be too opinionated. Want proxy protocol on your NLBs, even though NLBs preserve the source IP so in theory you don't need it (but with a K8s ingress you might)? AWS says: sure, we have the feature, enable it, we don't care. Google says: why do you need proxy protocol, the source IP is there. These are not the headers you are looking for. Azure says: proxy protocol wat?
> This is still light years ahead of Azure (and to a less extent GCP), where even compute services will not necessarily work with one another.
You can't use the SQL Server Virtual Machine extension on an Azure VM to extend the disks if the VM size is one of the AMD EPYC CPU types.
During the support call, the Microsoft tech shared a screenshot of the source code for the SQL VM extension, and it had a switch statement that decides if each feature is "supported" or not.
Let that sink in: Microsoft literally hard-codes their VM-size-to-feature lookups in probably thousands and thousands of places with huge switch statements full of code like this:
case "Standard_M416ms_v2": return false;
case "Standard_M416s_v2": return false;
case "Standard_M64ls": return true;
case "Standard_M64ms": return true;
This is their standard coding practice.
So next time you try a new VM size or type, don't be surprised if things randomly don't work or "aren't supported" for mysterious reasons...
> I don't know how cos that compete in any related space can survive. When AWS is willing to throw whatever against a wall (launching 50+ services a year) to see what sticks, sooner or later they're going to land in your space.
This is true for a subset of products, but not uniformly. To the extent you're building an infrastructure product, you get to choose what axis to compete on. If you're going up against AWS, then trying to compete with them on things like cost and reliability is likely a poor choice. But something like user/dev experience isn't. DocumentDB has a Mongo-compatible API and yet Mongo's Atlas hosted service is responsible for most of the company's growth over the past year. Why? Because it provides a unique offering, not just a 'good-enough' offering, which is what a lot of higher-level AWS services are.
> sooner or later they're going to land in your space.
Absolutely. I've seen this a handful of times with companies I consult, where they suddenly find themselves competing with AWS. I call it the November surprise because it happens around Re:Invent.
There are several reasons this is a tough thing to compete against, and AWS's vertical integration is just one of them. I've already written about them and also how to come out ahead if you find yourself in this situation: https://www.gkogan.co/blog/big-cloud/
We were using Alooma for ETL for years until Google bought it and started to deprecate AWS connections. It was a massive PITA, but it mostly worked. We switched over to AWS DMS and it was easy. Honestly it didn't take much effort. It has worked flawlessly - literally zero errors - from the day we started it up, and best of all, it's free. All you pay for is the instance it's using for you. That sort of thing can save startups much needed money. Yes, you're tied to the ecosystem - and that's what they want - but it's worth it. Once I talk to people and basically say the same thing you're saying, they start to look at AWS a bit differently.
I think their key advantage is sales. Imagine a product that adds a small amount of value to a company but requires a long drawn out sales process including research on available vendors, pricing, security, use cases, determining requirements, etc vs a developer going to the AWS console and clicking "create databrew". It's no competition.
And with sales making up such a huge percentage of many of these SaaS companies' revenue, Amazon can pass the lack of sales cost on to the customer as savings. Skip the sales process and the sales cost: win-win.
I don't see the same broad amount of services launched out of Azure as I do from AWS, and definitely not from GCP.
I don't know if this is a strategic difference or an execution/cultural difference (AWS ships products faster, but they're often barely usable in v1).
How are you judging the "broad amount of services" launched out of Azure? They release something on the order of 10-25 updates a week, their services feed runs nonstop.
I claim no special knowledge of AWS, but Azure is moving at full pace, certainly faster than even we global SIs can keep up with in terms of providing support and capabilities.
I am a big fan of AWS and am happily running our entire tech stack with their services for a very reasonable price. That said, Glue is an absolute dumpster fire of a product. My team and I have wasted countless hours trying to wrangle a DynamoDB -> Glue -> Athena -> Quicksight pipeline and Glue refused to cooperate (we ended up building our own DDB to SQL pipeline after finally giving up on Glue). Hopefully this will increase the usability of the Glue product and actually enable out of the box ETL.
We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means using it purely for 'Spark as a service', so we're pretty much always reading/writing data from/to S3 with a script that would work on any Spark cluster (i.e. not using the GlueContext-type stuff; I don't even know what it does, tbh).
For this purpose I think it's fantastic. Write a PySpark script and press go (we have a package on pypi called etl_manager to facilitate this). It 'just works' for this use case, and there's a huge amount of value for us in not having to think at all about managing or configuring a Spark cluster.
Our biggest bugbear was slow job startup times and a lack of pip installs, but both of those are fixed with glue 2.0 which was released recently.
We don't use any of the visual/GUI based tools for our jobs, we just write our own Spark code and version control in Github. That's unlikely to change any time soon with products like Databrew. That said, the data profiling tool in Databrew does look like it could be useful as something to refer to when writing code.
(I realise this doesn't help with your specific issue, but i thought it was helpful to offer an example of a good experience)
> We use Glue extensively, but we have a rule of thumb not to use any of the 'special sauce'. That means is using it purely for 'Spark as a service'
This is spot on IMO. I use Glue internally (opinions are my own) and still believe that the best course of action is Glue should only run managed Spark. We provide an empty Scala "script" that does nothing, and load a compiled JAR file with the Scala code that actually runs our job as a library and have Glue exec into that.
We can version the ETL in git, run local tests outside of the Glue data plane, prototype in the Spark shell, and much more.
Interesting - I was vaguely aware of the existence of bookmarks. I'd be interested to know what you're using them for - they definitely _sound_ useful. I guess it probably depends on what sort of workloads you're doing. At the moment we use Airflow to manage DAGs/retries etc. I like it as a user, but from what I understand from our ops people it's a pain to manage.
The use case is pretty simple. You’ve got a bucket that you want to load data from and shuffle it away somewhere else (redshift, s3, whatever). This could be populated by a Firehose, another system, etc etc. Bookmarks just store the greatest “created time” for the files you’re loading from s3. So when you trigger a job it will only load files created since the last successful run. It does some funky stuff to handle s3’s eventual consistency with LIST operations.
Super simple incremental loading. This also works when loading data from a relational database, by storing the greatest primary key value.
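The mechanism described above is simple enough to sketch in plain Python. This is an illustration of the bookmark idea, not Glue's actual implementation; `incremental_load` and the `(key, created_time)` tuple format are invented for the example:

```python
def incremental_load(files, bookmark):
    """files: list of (key, created_time) pairs listed from the bucket.
    bookmark: the greatest created_time seen by the last successful run."""
    # Only keep files created after the stored high-water mark.
    new_files = [(key, t) for key, t in files if t > bookmark]
    # Advance the bookmark to the newest file we are about to load.
    new_bookmark = max((t for _, t in new_files), default=bookmark)
    return [key for key, _ in new_files], new_bookmark

files = [("a.json", 100), ("b.json", 205), ("c.json", 310)]
to_load, bookmark = incremental_load(files, bookmark=200)
# to_load == ["b.json", "c.json"]; bookmark advances to 310
```

The relational-database variant mentioned above is the same loop with the greatest primary key standing in for created_time.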
Agreed, Glue turned out to be very underwhelming. We ended up moving ETL flows to self-hosted Prefect (https://www.prefect.io/) instead.
In general, AWS excels at a lot of core features (EC2, S3, the databases) but the higher-level services feel very thrown-together, with awful documentation. They are launching a large amount of half-baked services these days, (intending to capitalize on vendor lock-in?), but it's making the ecosystem start to look like a confusing mess. A lot of these wouldn't survive if they were marketed as standalone products.
We also had issues with Glue/Athena: never got Parquet to work right, had schema issues with schemaless data coming from MongoDB, and in general it is obtuse and hard to work with. We did a data lab with AWS using Glue and it was an exercise in pain. Then we did a POC with Snowflake and we cried with joy at how easy it was to work with large amounts of data. But our ELT was very light, not true transformation of data, more rearranging. I still think we will need Glue for some workloads and I pray this makes our lives easier.
I haven't used it myself but my coworkers who gave it a spin unanimously agree that Glue is a dumpster fire, even though they're otherwise huge AWS fans.
I don't have any write-ups, but off the top of my head I remember issues with data type casting from DDB -> Athena via Glue. If a DDB number type was an int in one item and a float in a different item, Glue transformed it to a struct (something like struct(long:null, double:50.50) and struct(long:20, double:null)). The suggested fix of a cast function didn't work.
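One way to sidestep that class of problem is to normalize numeric types before the data ever reaches Glue/Athena, so a column never mixes int and float (the mixed case is what produced the struct(long, double) union). A minimal sketch; the function name and row format are illustrative, not an AWS API:

```python
def coerce_numbers(items, numeric_fields):
    """Force every value in the named fields to float so downstream
    schema inference sees a single consistent numeric type."""
    out = []
    for item in items:
        fixed = dict(item)
        for field in numeric_fields:
            if fixed.get(field) is not None:
                fixed[field] = float(fixed[field])
        out.append(fixed)
    return out

rows = [{"price": 20}, {"price": 50.5}]
coerce_numbers(rows, ["price"])  # both prices come back as floats
```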
That brings back bad memories from working with Glue. If the source data isn't 100% clean and compatible with the destination, it is such a pain to get it working.
I eventually got it set up, but for the amount of effort involved, I could have just written my own custom ETL solution in less time.
The job scheduling and triggers are nice once set up. I do hope AWS makes improvements, because Glue has potential, but it doesn't feel ready for prime time yet.
I'll add to this that because AWS is simply piecing different technologies together under the hood, there are a lot of data type issues.
Another example is that some date/time columns got brought in and crawled as strings. That's a bummer because obviously you want to do native operations on these - datediff, datepart, etc. - without having to cast all over the place. We manually set them to timestamp and they work perfectly (awesome!), even in Athena, so we thought the problem was solved.
However, once we did anything with those columns in Glue ETL, those fields got set as nulls.
The problem can be fixed of course, but these types of issues happen fairly often and they quietly fail (no errors, just values set to null).
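Since the failure mode is quiet (values become null with no error), a cheap defense is to perform the cast yourself and fail loudly when rows don't parse. A hedged sketch; the function name and format string are assumptions for illustration, not Glue behavior:

```python
from datetime import datetime

def cast_timestamps(rows, field, fmt="%Y-%m-%d %H:%M:%S"):
    """Cast a string column to datetime, raising instead of silently
    nulling the values the parser cannot handle."""
    out, failures = [], 0
    for row in rows:
        try:
            row = {**row, field: datetime.strptime(row[field], fmt)}
        except (ValueError, TypeError):
            row = {**row, field: None}
            failures += 1
        out.append(row)
    if failures:
        raise ValueError(f"{failures} rows failed the timestamp cast on {field!r}")
    return out
```

The same count-then-raise pattern works for any cast a pipeline would otherwise swallow.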
Glue is an absolutely horrendous mishmash that seems to suffer from a serious lack of investment and vision. The managed Spark component is a good product buried under all-round terrible developer and console UX, and the data catalog/schema crawling is really useful.
But I'm glad the lack of investment is turning around with this, the recently released Glue Studio, and the fantastic "Glue 2 fast startup" job types.
Running any sort of innovative data infrastructure startup (whether data prep, database, data pipeline, etc.) is now an exercise in futility. The big three cloud providers will embrace your innovation, extend your product, and extinguish your business.
Given the market power of cloud providers, every infrastructure innovation is now a "sustaining innovation" in Christensen's terminology.
The key to success seems to be building a product for a niche the cloud providers think is too small, and maximizing your value within that niche so that if your market grows large enough for a cloud provider like AWS to come after you, you can pivot to providing customizations for your highest-margin customers. MongoDB is a good example of this.
On the other hand, none of the major cloud providers seem capable of moving up the value chain to the application level, so if I were starting a company today I would focus on leveraging my infrastructure-level innovation to create a vertical opportunity in a high margin market instead of seeking to build a horizontal platform (IoT platform for example).
I'm glad to see a competitor to Trifacta/Google Cloud Dataprep. My company relies on it heavily, but we constantly run into bugs, UI glitches, and crashes that can sometimes block people for hours or days. It's the sort of software that you hate to use, but the benefits are too good to ignore.
The benefit to visual ETL is that non-engineers can do a lot of basic data engineering. We tie this into our more complex code-based ETL pipelines. It was a game-changer for us and helps us get a lot more done.
Curious to know - how do you tie the no-code transformation with your code-based transformations? Usually these processes end up siloed from each other.
These GUI-driven/visual ETL tools certainly have their place, but they sit firmly at one end of the ease-of-use vs engineering-discipline ETL continuum.
As other posters commented, visual ETL tools often suffer from poor source control or limited extensibility, but they do provide the rapid development environment that (generally more business-oriented) users seek. They also tend to trivialize the value of experience and discipline. For example, I go to an accountant for my tax because they apply learned experience relating to tax that I do not have (even though the math is easy), whereas in data engineering, seemingly simple tasks such as correctly typing money columns or dealing with timezones get glossed over in the pursuit of DIY - and then you wonder why your money columns don't reconcile or why you lose data in failure scenarios.
At the other end of the continuum large teams writing bespoke ETL code for every job does not scale well for many reasons (https://reorchestrate.com/posts/code-doesnt-scale-for-etl/). I think the positive reaction to ideas like Data Mesh comes from the failures of these large, centralized teams which coincided with the Hadoop era.
Our solution has been to develop an open source (MIT) declarative framework (https://arc.tripl.ai/) that allows configuration driven ETL - mostly developed via a Jupyter Notebook environment (to allow rapid development and appeal to a larger audience) - whilst making most of the difficult tasks mentioned above easier. This has been in development for a few years now and continues to evolve. We value your feedback.
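The money-typing point above is easy to demonstrate: binary floats cannot represent most decimal fractions exactly, which is precisely why float-typed money columns fail to reconcile. A minimal illustration using Python's `decimal` module:

```python
from decimal import Decimal

# Floats accumulate representation error; Decimals do not.
float_total = 0.1 + 0.2                 # 0.30000000000000004...
exact_total = Decimal("0.10") + Decimal("0.20")

assert float_total != 0.3               # the float version is already off
assert exact_total == Decimal("0.30")   # the Decimal version reconciles
```

Note the Decimals are constructed from strings; `Decimal(0.1)` would inherit the float's error.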
Visual ETL is not as good of an idea as it seems at first. You end up putting a ton of business logic into the menus of these tools, and it’s not version controlled, and it’s not searchable. You’re better off doing this kind of work in SQL.
My guess (without having tried it) is Databrew is backed by a solid DSL, and you are really generating good clean code when you are using this "Low Code" tool.
And that works great - a great DSL plus a visual GUI that edits that DSL. Because, like you said, you absolutely need version control.
Having used GUI ETL tools for years (SSIS, Informatica, Talend, Appworx) and now using Airflow (Prefect is an excellent alternative, btw), I hope to never go back. Great to see Glue improving, but for the industry's sake I hope it doesn't catch on. Most ETL should be treated as code. As code, ETLs are easier to write, maintain, and manage complexity in.
I started my team on Alteryx but happily moved everything to Python a year later. UIs are terrible for change management, versioning, and documentation. Once you have a big-ish team doing data work, UIs become a significant drag on collaboration and productivity.
Azure Data Factory accomplishes what you're looking for by generating and version-controlling JSON pipeline definitions... you can just edit the pipeline JSONs too if you'd rather build in code than in the visual canvas.
It looks like the jobs are exportable as JSON, which would make this far more maintainable than some of the existing proprietary ETL tools (lookin' at you, IBM).
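The "JSON pipeline definition" approach is easy to sketch: a small, diffable document interpreted by a runner. The format below is entirely made up (it is not ADF's or Glue's schema); it's just to show why this style version-controls well - the whole transformation lives in one reviewable text artifact:

```python
import json

# A tiny interpreter for an invented declarative pipeline format.
PIPELINE = json.loads("""
{
  "steps": [
    {"op": "rename",     "from": "amt",    "to": "amount"},
    {"op": "cast_float", "field": "amount"},
    {"op": "filter_gt",  "field": "amount", "value": 0}
  ]
}
""")

def run(rows, pipeline):
    """Apply each declared step to a list of dict rows, in order."""
    for step in pipeline["steps"]:
        if step["op"] == "rename":
            rows = [{**{k: v for k, v in r.items() if k != step["from"]},
                     step["to"]: r[step["from"]]} for r in rows]
        elif step["op"] == "cast_float":
            rows = [{**r, step["field"]: float(r[step["field"]])} for r in rows]
        elif step["op"] == "filter_gt":
            rows = [r for r in rows if r[step["field"]] > step["value"]]
    return rows

print(run([{"amt": "5"}, {"amt": "-1"}], PIPELINE))  # [{'amount': 5.0}]
```

A GUI can read and write this document while engineers diff it in git; that's the division of labor the thread is arguing for.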
Are there any alternatives to this style of application? I was going to try to build something similar for my team that would let them map columns and do simple transforms, then push the resulting flow to me to move into an Airflow DAG.
I would really like a visual editor I can adjust with Regex functions and mappings that I can self host and iterate on.
Haven't used this yet, but this looks like a really good user experience from their demo video. Haven't used competitors like Alteryx myself, but just having this integrate so well into the AWS ecosystem makes this seem really useful.
> What AWS has is a culture and institutional knowledge on how to launch new products that take foundational AWS services (S3, Lambda, EC2, DDB, etc.) and glues (!) them together better than what a competing non-AWS company can do. This is a bold claim (since AWS launches some very crappy products), but imagine being able to use AWS infrastructure at cost, having internal knowledge on how to best optimize that infrastructure and access to the engineers that own those services while you build abstractions and better user experiences on top of them.
> I don't know how cos that compete in any related space can survive. When AWS is willing to throw whatever against a wall (launching 50+ services a year) to see what sticks, sooner or later they're going to land in your space.
> Become more locked into AWS's foundational services -> these abstractions on top of them start to make more sense in engineering complexity / delivery time / possible cost dimensions -> Use more of these -> Become more locked into AWS's foundational services.
This feels very different from Azure or GCP.