Have You Ever Tried SQLing Your Cloud? The Power of Steampipe
- Wix Engineering
- 4 hours ago
- 6 min read

We, the production engineers, are probably all very familiar with this scenario:
Your TL: "Can you tell me which instance type is used for our use1-consul-01 server?"
You: "Sure."
Logs into AWS
Opens EC2 dashboard
Goes to EC2 'Instances' page
Write 'use1-consul-01' into the search bar
You: "We use m6in.xlarge"
Consider this one as well:
Your TL: "Can you let me know what are our most used instance types? Let's say, top 15 ones?"
You:
contemplating at a certain point in space
Thinking to yourself: "ohhhhhh I can use AI!!!!"
You: "Sure!"
Opening Claude
* Typing: "Write me a Python code that will fetch me the 15 most used instance types in my VPC"
Thinking to yourself: "oh wait, it should get the creds from env vars, and get the vpc-id as param"
A few iterations of prompts, runs, until you get the expected result
You: "OK, got it, sending it over"
Now what if:
Your TL: "Can you let me know what are our most used instance types? Let's say, top 15 ones?"
You: "Sure!"
Running 'steampipe query "select count(instance_id) as instance_count, instance_type from
aws_ec2_instance group by instance_type order by instance_count desc limit 15"'
You: "Got it, sending it over :)"
The efficiency is a bit exaggerated but believe me, it takes much less time (and my SQL skills improved), and since I heard about Steampipe, this tool has become a significant part of my daily workflow.
Steampipe merges the power of SQL queries with the ability to access cloud data effortlessly, making it an extra useful tool for anyone managing complex environments.
What is Steampipe?
An open-source zero-ETL tool fetching data directly from APIs and services. It uses plugins and each plugin is a service with a set of APIs for querying (I use AWS and Kubernetes).
The beauty of Steampipe lies in its ability to treat the cloud resources you have as tables in a SQL database (the plugins I use for example, are actually Postgres FDWs). This allows you to leverage (postgres-flavoured) SQL syntax to gain insights without the need for cumbersome code.
Key Features of Steampipe:
Unified Querying: Use SQL to interact with multiple APIs seamlessly (e.g. AWS EC2 and S3 APIs).
Extensible: Support for various plugins allows cross-platform data interaction (AWS and K8s plugins).
Open Source: Community-driven development promotes innovation and collaboration.
Real-World Example
There was a security initiative to change a highly used IAM role to a different one with more granular permissions. First, we needed to evaluate the scope of the task. In other words, we needed DATA!!
The role (named insecure-role for the sake of example) was used only with instance-profiles, so what we need is to find the number of instances that use this IAM role.
Sounds pretty simple, right?
A possible course of action is trying to do it in the AWS console, which I am not even going to get into. Another one is trying to do it with Python, which is a viable option on all accounts.
For this you can use your favorite model to generate the code, and try to run it.
But in my experience, there are always tweaks and adjustments to make it work, which can be tedious if I have to make these tweaks every time I need to fetch some data.
The next option is, as expected, using Steampipe. I had it installed on my local machine (already configured with AWS profile and the list of our regions). Without too much complexity, I first generated the temp creds for the AWS profile, and then entered the interactive query shell of Steampipe.
Then I constructed the SQL statement. First I had to find out which columns I wanted to fetch.
This can be done using:

I found the column I want to filter by: iam_instance_profile_arn. I need to group it by region, and get the amount of instances that have this attached.

As simple as that, I got the result:

Now, I can give a much better estimation for the effort to deprecate the insecure-role across all our regions.
That is one example of how to handle relatively simple, but otherwise tedious fetches of data (that I find myself doing pretty frequently).
I still use Python many times when I fetch data BTW, but it’s more useful in the cases where I need to perform some logic after I get the data.
Getting Started with Steampipe (AWS and Kubernetes Plugins)
To illustrate the power of Steampipe, let’s walk through a few examples until eventually you will see how both AWS and Kubernetes plugins are used together in harmony.
Step 0: Install Steampipe
Install Steampipe (of course there are installations for Mac, Linux and Windows).
Step 1: Install the Necessary Plugins
First, ensure you have Steampipe plugins installed:

Step 2: Getting Ready to Query
After installation of the plugins you have config files created for them under ~/.steampipe/config/.
In aws .spc you’ll find essential stuff like:

Where you can configure the AWS profile (assumed to be existing locally) and the regions you want to query.
In kubernetes .spc in the same directory:

So you can configure any kubeconfig you have present on your workstation.
When those are configured, you are good to go!!!
NOTE: I use the default single connection in the config files. When using more connections the name format of tables you query changes.
Step 3: Query!!
Starting with a simple query using the AWS plugin. All virtual tables that are under that plugin have the format aws_* (like aws_vpc, aws_vpc_security_group, aws_ec2_instance, etc).
To see what VPCs you have in your account in the regions you configured in aws .spc:

There are many more columns than shown in the above query (it has been redacted for the sake of example).
Be advised, when querying large amounts of resources, this can be API intensive and even sometimes result in rate-limits by AWS. To avert that, add limit 10 or the like, at the end of the query.
A simple example of querying with the kubernetes plugin:

The ->> operator is used to extract a value of a certain key in a json object.The object looked like: {“a”: “b”, ..., “osImage”: “Amazon Linux 2”, ... }
When pointing to the desired kubeconfig, steampipe will try querying all the contexts in there unless you configure a single context specifically.
There are more commands I curated that can be very useful for getting started.
Step 4: Taking it One Step Further
Let’s try to make it a bit more useful than just getting data (even though that in itself is pretty useful).
Say I want to find all the ingress (inbound) security group rules in a certain VPC vpc-1234567890abcdef that contain the range 192.168.10.0/24.
To do that, I can use something like this:

The <<= operator is a Postgres function on network addresses (the column type is cidr).
Step 5: (Bonus) Taking it One Step Sideways
Now, say I want to find the tag with the key environment on the instance a pod is running on.
For example, let’s take the pod aws-node-aaabb, which is on a problematic node:

Key points here:
The LEFT(TEXT(i.private_ip_address), -3) part is to convert the inet type to text and remove redundant chars to make the join possible.
This part: addresses->0 ->> 'address', is used to take the value of the key address from the first item in the addresses list of objects.
All these are postgres functions (docs).
Conclusion
Steampipe has simplified data access for me, allowing for cross-referencing between different sources and enabling me to derive insights with ease. Initially, it was a tool I had never heard of, but it quickly became essential for me. Now, it’s my go-to solution for fetching data from my AWS environment, especially when I need to cross-reference it with Kubernetes resources.
If you haven’t explored Steampipe yet, I highly recommend giving it a try.
Useful Queries
Show the AWS security group inbound rules (that contain CIDRs as source) that are contained in another rule (e.g. 2 rules on port 8080, where the 1st allows 192.168.10.0/24 and the 2nd allows 192.168.0.0/16).

Getting VPC CIDRs for a certain VPC.

NOTE: Since kubectl is a very flexible and robust tool, getting info in the Kubernetes context can be done with it on its own. But when you want to get some AWS-context data on the K8s resources, that’s when Steampipe can be really useful.
Nevertheless, here are a few queries using the Kubernetes plugin:
Extracting the startup delay in seconds (the amount of seconds it took for pods since their creation time until the container is started)

Amount of failed scheduling events per namespace in the last 24 hours.

Amount of pending pods per namespace


This post was written by Omer Dolev
More of Wix Engineering's updates and insights:
Join our Telegram channel
Visit us on GitHub
Subscribe to our YouTube channel
Comments