Data Engineering job market in Stockholm

Interviews
A Sankey diagram of my job search

I got a job! I will be working at Mentimeter doing Data Engineering stuff!

There aren’t that many blog posts about looking for a job in Stockholm. I thought it would be fun to contribute by writing an After Action Report.

Background

I quit my job in June of 2022 to go traveling with my partner. We visited some cool places in Asia (hiking in the Himalayas was a high point 🤓). We were back in Sweden in early December, and I then started sending out job applications.

I used this resume.

An executive summary of my resume would be:

  • 2 years of experience
  • Spark, Python & SQL
  • Hadoop & GCP
  • CI/CD and Terraform stuff

I was only looking for roles based in Stockholm.

Screening

All processes started with a recruiter reaching out by email to schedule a phone call or a video meeting. At the beginning of my search it was easy to say yes to their proposed times, but after a week my calendar was quite full. I solved that by publishing my calendar online.

Most of the screenings were them telling me more about the role and the company. I also almost always got the question “Tell me a little about yourself”. And of course I got some time to ask questions about the company. I would recommend having some questions ready!

Here are some reasons I withdrew:

  • They had bad work-life balance (I’m very glad the recruiter was honest about it!)
  • The product didn’t make any sense (could just be me being stupid though!)
  • They couldn’t be bothered to show up to our booked meetings, twice!
Actual email after I withdrew following two no-shows. Red flag!

Technical

The majority of the technical interviews were take-homes. They were a Jupyter notebook (local or on Google Colab) with some data loaded, plus some tasks doing basic transformations in either Python or SQL. Then there were some basic system design questions, like “What would you do to fix a slow-loading dashboard?”
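
To give a flavor of the transformation tasks, here is a minimal sketch of the kind of exercise I saw. The dataset and column names are made up, not from any actual take-home:

import pandas as pd

# Made-up order data, similar in spirit to the take-home datasets
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [100.0, 250.0, 80.0, 40.0, 60.0, 90.0],
})

# Typical task: total spend and order count per customer, sorted by spend
summary = (
    orders.groupby("customer_id")
    .agg(total_spend=("amount", "sum"), n_orders=("amount", "count"))
    .sort_values("total_spend", ascending=False)
)
print(summary)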

One technical interview was just chatting with the hiring manager about technical stuff. Very pleasant experience!

For the live coding interviews, I had one in pure SQL and another in Python.

The Python one was similar to the take-homes: a Jupyter notebook with some tasks. Except I was doing it live with a time limit and also needed to talk through what I was doing.

I failed the SQL one. That’s on me for not preparing my SQL skills enough!

The live system design session was a fun one. We started with some basic functionality, then the interviewers kept adding more and more requirements. I had to think about scale and cloud services for that one.

Cultural

I had two cultural interviews. They are quite personal: you need to talk about how you handled situations at work.

I recommend breaking down each story into Situation, Task, Action, and Result. This is called the STAR method, famously used by Amazon for their behavioral interviews.

Offer

I got multiple offers and wanted to evaluate them all fairly. With the first few offers I asked for more time to finish the other processes. They were reluctant but agreed!

For me it’s also important who I’m working with, so I arranged to visit their offices and meet the teams as well.

End

That’s it! If you have any other questions about the Data Engineering job market in Stockholm, shoot me an email at [email protected] :)

Publish your calendar and let recruiters schedule interviews themselves

I am unemployed! It’s great!

I have been (deliberately) out of work since June 2022. It’s been a magical time!

  • I tried a lot of new recipes
  • I spent time on my side-project bostadsbussen
  • I visited Singapore, South Korea, Thailand, Nepal (hiking in the Himalayas), Turkey, Italy, and France
  • I slept a lot
  • I spent time with family

But I also used a lot of my savings.

So I’m looking for a job.

Anyone who has looked for a job has probably had this exchange:

<Monday>
Recruiter 1: Can I schedule a call? When are you free?
Alex: I'm free on Wednesday from 9 to 12.

Recruiter 2: Can I schedule a call? When are you free?
Alex: I'm free on Wednesday from 9 to 12.

Recruiter 1: Great, I'll call you on Wednesday at 9.

<Tuesday>
Recruiter 2: Great, I'll call you on Wednesday at 9.

A classic race condition! Now I need to reach out to Recruiter 2 and reschedule. Then I would need to submit a time slot again, which could create another race condition with a hypothetical Recruiter 3.

This can result in a long back-and-forth email chain just to find a suitable time. That is not very productive!
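
For the 🤓 crowd, the same check-then-book race can be sketched in a few lines of Python (purely illustrative):

import threading
import time

free_slots = {"wed-09:00"}
bookings = []

def book(recruiter: str) -> None:
    if "wed-09:00" in free_slots:       # time of check: the slot looks free
        time.sleep(0.01)                # the email round-trip delay
        free_slots.discard("wed-09:00")
        bookings.append(recruiter)      # time of use: both end up booking

threads = [threading.Thread(target=book, args=(r,)) for r in ("Recruiter 1", "Recruiter 2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(bookings)  # ['Recruiter 1', 'Recruiter 2'], double-booked!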

So how can we solve this problem?

If my calendar were online, recruiters could find suitable times themselves.

So let’s put it on a web page!

I use Proton Calendar, which has an option to export my calendar as an ICS link:

Proton Calendar’s calendar sharing option

That gives us an ICS subscription link, which can be imported into a calendar app. But expecting busy recruiters to figure that out is not really respecting their time.
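
(For the curious: an ICS feed is just structured text with events. A quick sketch of reading one in Python with the icalendar library, with a placeholder URL:)

import requests
from icalendar import Calendar

# Placeholder URL; substitute a real ICS subscription link
ics = requests.get("https://calendar.example.com/my.ics", timeout=30).content

# Print every event (i.e. busy slot) in the feed
cal = Calendar.from_ical(ics)
for event in cal.walk("VEVENT"):
    print(event.get("SUMMARY"), event.decoded("DTSTART"), event.decoded("DTEND"))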

I found Open Web Calendar, which will generate an iframe that can be pasted anywhere.

It was easily deployed to my Raspberry Pi! You can see the result below or at dahl.dev/calendar.

The UI breaks a little bit on mobile devices, but I think recruiters will be using a desktop when scheduling. After this job hunt I might spend some time creating a PR that fixes that 🤓!

No more scheduling conflicts! I actually started using it for my job hunt and have gotten good feedback from the recruiters. They said it made their job so much easier!

So now all my interactions look like this instead:

<Monday>
Recruiter 1: Can I schedule a call? When are you free?
Alex: You can see my availability on dahl.dev/calendar

Recruiter 2: Can I schedule a call? When are you free?
Alex: You can see my availability on dahl.dev/calendar

Recruiter 1: Great, I'll call you on Wednesday at 9.

<Tuesday>
Recruiter 2: Great, I'll call you on Wednesday at 10.

Plotting Sweden’s real estate prices on a heatmap with deck.gl and Cloudflare

A heat map of apartment prices in central Stockholm

Disclaimer: I will be linking to sites in Swedish. A translation extension might be handy!

I, like a lot of people in Stockholm, need to buy an apartment. The rental situation is bad: getting a “first-hand contract” is hard, and I have friends who have even had to settle for temporary “third-hand” contracts!

With finding a rental unit out of the picture, buying is the only option. Buying an apartment in an inflated market during big increases in mortgage rates is not a fun position to be in. So what should a data person such as myself do to identify which areas in Stockholm are reasonably priced? Plot all the data points on a heatmap! Which is what I set out to do.

My side project bostadsbussen scrapes user-entered real estate listings from hemnet and archives them. You can read about the tech behind it in my previous blog post.

All right, we have a place to host a heatmap. First we need to get the data!

Picture of the closing prices page on hemnet.se

Luckily the data is out there on the internet! hemnet.se provides the closing prices for most of their listings. The problem is that they return a maximum of 2500 results per search query. So we need to craft some queries to extract all the 1 million+ results on their site. It was as simple as narrowing the search queries by different parameters until the result count was lower than 2500 (see the sketch below). Then extracting the data from each listing was easy.

I was also very mindful of not putting unnecessary load on their servers: I chose not to parallelize the scraping, so getting all the listings took a week of wall-clock time.
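
Roughly, the approach looked like the sketch below. Hemnet’s endpoint, parameters, and response shape are not public here, so everything in it is made up; it just shows the shape of the recursive narrowing (pagination omitted):

import time
import requests

BASE_URL = "https://example.com/search"  # stand-in for the real search endpoint
MAX_RESULTS = 2500                       # the per-query result cap

def fetch(params: dict) -> dict:
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    time.sleep(1)  # be polite: sequential requests with a delay
    return resp.json()

def collect(params: dict, price_lo: int, price_hi: int) -> list:
    # Split the price range until each query returns fewer results than the cap
    result = fetch({**params, "price_min": price_lo, "price_max": price_hi})
    if result["total"] < MAX_RESULTS:
        return result["listings"]
    mid = (price_lo + price_hi) // 2
    return collect(params, price_lo, mid) + collect(params, mid + 1, price_hi)

listings = collect({"region": "stockholm"}, 0, 50_000_000)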

Cool! Now we have a big JSON array with a million properties. I want to visualize this on an interactive map and share it with the internet!

My first thought was to spin up a dashboarding solution like Metabase or Superset on a rented VM. They are both great tools and would have been a fine option. But a rented VM that can handle bursty traffic could be quite expensive, and I don’t want to deal with autoscaling stuff like Kubernetes without getting paid 🤓

So I would need to build the visualization myself to get around renting a VM. I found deck.gl, which is great for displaying large amounts of data on a map. Perfect!

We also need some map tiles to overlay the deck.gl visualization on. Mapbox has an excellent free tier where the first 50,000 views per month don’t cost anything. I doubt I will ever get more traffic than that.
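
I built the actual map in JavaScript with deck.gl, but to show the idea, here is a minimal heatmap sketch using pydeck, deck.gl’s Python binding (made-up coordinates and prices; this uses the default basemap rather than Mapbox):

import pydeck as pdk

# A few fake listings: longitude, latitude, and closing price in SEK
data = [
    {"lng": 18.07, "lat": 59.33, "price": 5_200_000},
    {"lng": 18.08, "lat": 59.34, "price": 7_900_000},
    {"lng": 18.05, "lat": 59.32, "price": 4_100_000},
]

# A heatmap layer weighted by closing price
layer = pdk.Layer(
    "HeatmapLayer",
    data=data,
    get_position="[lng, lat]",
    get_weight="price",
)

# Center the view on Stockholm and write a self-contained HTML file
view = pdk.ViewState(latitude=59.33, longitude=18.07, zoom=11)
pdk.Deck(layers=[layer], initial_view_state=view).to_html("heatmap.html")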

The heat map of southern Sweden (where all the reasonable people live, sorry Norrland)

OK, with this built locally on my machine I had a pretty cool visualization. I spent an hour dragging the map around Sweden to see if my preconceived notions about expensive areas were true. They were! (The Östermalm area in Stockholm is really expensive.)

It works locally; now we need to host it! I chose Cloudflare Pages for this. But it’s not really a visualization if there is no data to visualize.

This leads us to the problem of getting the data to the user.

My JSON array was 25MB compressed with gzip (125MB uncompressed). Hosting it on object storage like GCS would cost nothing storage-wise. The big problem would be the egress fees: GCS charges $0.12 per GB. If I got lucky (or unlucky) and had 10,000 people download the data, I would be looking at $30 in egress fees alone. Not good for a product with zero revenue!
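
The back-of-the-envelope math, using the numbers above:

# 10,000 downloads of a 25MB file at $0.12 per GB of egress
size_gb = 25 / 1000
downloads = 10_000
print(f"${size_gb * downloads * 0.12:.2f}")  # $30.00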

Luckily Cloudflare’s object storage R2 has zero egress fees. Zero! Now I could use it to share the data with users via a simple GET request.

I ran into some CORS problems for the public bucket but that was easily solved with this guide.

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedMethod>GET</AllowedMethod>
        <AllowedMethod>HEAD</AllowedMethod>
        <AllowedOrigin>*</AllowedOrigin>
    </CORSRule>
</CORSConfiguration>

With the above allow-all CORS policy, the data can be shared with everyone on the internet. I don’t have to worry about waking up to a huge cloud bill, since it all costs zero! An additional benefit is that I can provide the full JSON data, so other interested parties don’t need to hit hemnet.se’s servers and can instead just download that file!
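
Downloading and decompressing the blob could look something like this (the URL is a placeholder, not the real bucket):

import gzip
import json
import requests

# Placeholder URL for the public R2 bucket
resp = requests.get("https://data.example.com/listings.json.gz", timeout=60)
resp.raise_for_status()

# The object is stored as a .gz file, so decompress it manually
listings = json.loads(gzip.decompress(resp.content))
print(len(listings))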

The map is available on https://bostadsbussen.se/sold/map (In Swedish!)

My next steps are to include some line charts for analysis and also make sure the JSON blob is updated with new data every day. Kind of like a serverless dashboard!

I’m also reaching the end of my travel sabbatical (trekking in Nepal was a highlight!). So I’m looking for a Data Engineering or Infrastructure job, based in Stockholm or EU remote. Here’s my resume.

Shoot me an email at [email protected] if you want to talk 🤓

Stringing together several free tiers to host an application with zero cost using fly.io, Litestream and Cloudflare

Infrastructure

The image was generated by putting the blog post title into DALL-E 2. Quite fitting!

I have a side project called bostadsbussen. It scrapes property listings for the Swedish real estate market. The site needs to persist data in the form of user accounts, property data and images.

At the time of writing this post I am on a sabbatical. With my income at 0, I want to keep the cost of hosting my side project as low as possible. We will be traveling for a few months in Asia, so hosting the site on a home server is also out of the question.

That leaves us with the cloud ☁️

There are cloud offerings like Firebase which are a great place to host a side project. But I want to avoid the vendor lock-in and have the option to move the entire application to my own server in the future. So this post will skip examining Firebase et al.

Renting a VPS (Virtual Private Server) is a good and cheap option with no lock-in. They usually cost around $5/month for 1GB of RAM and a shared CPU. But what if we want to do it even cheaper?

What if we could do it for free?

Enter fly.io. They provide a free 256MB instance that you can spin up with a valid Dockerfile and fly deploy. Great developer experience!

❯ fly deploy
==> Verifying app config
--> Verified app config
==> Building image
Remote builder fly-builder-spring-snow-7814 ready
==> Creating build context
--> Creating build context done
==> Building image with Docker
...
--> Building image done
==> Pushing image to fly
...
==> Creating release
--> release v8 created

And we have released our application on fly.io!

All right, we’ve got our free server; what should we do about persisting data? If we store data on the fly.io instance and it crashes, we lose everything! The common choice would be to spin up a separate database server and use that for storing our data.

But with the introduction of Litestream, we don’t need to! Litestream replicates changes to an SQLite database to object storage, and restores the database when the server restarts. No dedicated database service needed! Michael Lynch has written a great blog post on this.

When it comes to cloud storage, all providers are cheap enough for running Litestream, so it comes down to developer preference. I chose Cloudflare R2 because of their free tier.


Getting Litestream to communicate with R2 is quite simple:

# The litestream config

dbs:
  - path: /pb_data/data.db
    replicas:
      - type: s3
        endpoint: ${R2_URL}
        path: ${R2_DATA_PATH}
        bucket: ${R2_BUCKET}
        access-key-id: ${R2_ACCESS_KEY}
        secret-access-key: ${R2_SECRET_KEY}
# The script that restores and then continuously replicates the data

echo "Restore db if exists"
litestream restore -if-replica-exists /pb_data/data.db
echo "Restored successfully"

echo "replicate!"
exec litestream replicate -exec "/pocketbase serve --http 0.0.0.0:8090"

Now we need a backend to host on the server. I have been very productive with PocketBase. It is a Go framework with several great features, like user authentication, an admin panel, an extendable API, and a JS SDK for connecting it to the frontend. Best of all, it uses SQLite as the database, so we can use Litestream for our replication 🎉!

We also need a frontend. I’ll admit I’m not very good at the frontend stuff, but I built one with React! It was quite enjoyable. For hosting a React app there are several free options, like Vercel, Netlify, and Render. But I chose Cloudflare Pages. I don’t see much difference between the mentioned alternatives, and since I’m already using Cloudflare’s other services (DNS, R2) the choice was easy. (And I’m lazy.)

The last thing in my application is the scraping part. Loading hundreds of images concurrently and moving them to object storage is quite memory intensive; at least, a 256MB instance can’t handle it! I offloaded the scraping to Google Cloud Run, which scales to zero and only runs when it gets a scraping request. It stores images in a bucket and returns the scraped data to the PocketBase backend. It of course also has a free tier that I use! 🤓
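
A rough sketch of what such a Cloud Run service could look like. The endpoint, payload shape, and bucket name are all hypothetical, not the actual scraper:

import requests
from flask import Flask, request, jsonify
from google.cloud import storage

app = Flask(__name__)
bucket = storage.Client().bucket("my-image-bucket")  # placeholder bucket name

@app.post("/scrape")  # hypothetical endpoint
def scrape():
    listing_url = request.json["url"]
    # ... fetch and parse the listing page at listing_url here ...
    image_urls = ["https://example.com/img1.jpg"]  # stand-in for parsed URLs

    stored = []
    for i, url in enumerate(image_urls):
        img = requests.get(url, timeout=30).content
        blob = bucket.blob(f"images/{i}.jpg")
        blob.upload_from_string(img, content_type="image/jpeg")
        stored.append(blob.name)

    # Return the scraped data; the PocketBase backend persists it
    return jsonify({"images": stored})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)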

And here is a diagram of the architecture, generated with Diagrams as Code.

Architecture

That’s it! Hope you enjoyed the post.

Check out github.com/aleda145/pocketbase-lab for a lab on setting up this architecture.

Disclaimer: I paid $10/year for the bostadsbussen.se domain.