How manually importing CSV files informed our new import feature

Building a CSV import feature the lean way

GoSquared CSV import

When we first set out to build our CSV importer, we decided only to allow customers to upload their user CSVs from our front end to an AWS S3 bucket.

However, rather than automatically importing the data, we decided to manually undertake the importing process ourselves.

Ship early. Ship often.

Why did we release a clearly-unfinished feature?

Firstly, we wanted to make sure that the data customers imported into GoSquared was accurate and, secondly, it gave us an opportunity to understand better the variables of the data we would be receiving. What were the file sizes like? What type of character encoding should we expect? What types of delimiters were being used (comma, semicolon, or pipe?) How long would the imports take? And what factors could make an import fail?

Once we were able to identify the common patterns, we were able to build our back-end logic to handle the data and the edge cases that our research validated.

Deciding what decisions our customers would need to make

In the front-end we wanted to give our users the opportunity to customise the CSV before they imported it.

Most CSVs will likely be an export of data from another application such as the user’s previous sales CRM. Its naming convention may not be what our user want to have appear in their GoSquared account (e.g. “_first_name” rather than “First Name”) — or some of the data may not be relevant at all.

Defining the server-side architecture

For the architecture, we decided to build a lightweight Ruby on Rails service using its API configuration.

This would receive a POST request internally from our front end servers with the ID of the user importing, the project they’re importing to, the URL of the CSV file, and an array of headers that our they want to keep or ignore. From parsing the headers in the front-end, we could also validate the delimiter type by establishing the most common character and pass that on as a parameter.

Using ActiveJob

Additionally we noticed that the majority of the CSVs we had received had a lot of user data to import. We knew the files could take a while to import, and it wasn’t critical for the import to be processed right away. Once our API received the request it would use the Rails’s built in ActiveJob framework to declare the job, which we configured to use with Sidekiq as the queuing back-end.

Sidekiq for background processing

Sidekiq is a gem that allows easy implementation of background processing for Ruby applications. It works by spawning multiple worker threads from a single master worker process and is a great way to offload long running tasks to the background.

Heroku to the rescue!

This overall setup allowed us to easily host our service on Heroku. As some of you will know, we use Amazon Web Services for almost everything at GoSquared. But this was a project where we wanted to move quickly, and experiment with a few different technologies, and Heroku seemed the perfect fit.

As Sidekiq interfaces with a Redis instance, after adding a RedisToGo addon to manage the Redis instances, we were able to get our service deployed in no time, all at a minimal cost.

How our CSV parsing logic was informed by manually importing user data

We used a lot of our manual import learnings to build the logic for parsing the CSV and mapping the data into a suitable structure that we could post to our Tracking API.

One of the consequences of having clients based all around the world is that you tend to get words with diacritic characters (e.g. Pierré) within the import. In a lot of cases that will mean that the CSV file is ISO8859-1 encoded and this needs to be converted UTF-8, otherwise it could raise an ArgumentError upon reading.

You are able to handle this in Ruby by forcing the UTF-8 encoding when you read the file:

@csv = CSV.read(path_to_csv, encoding: "ISO8859-1:utf-8")

Also by doing the validation in the front end, we could just pass in the separator we identified from our params.

Once we had the CSV in a readable format, we could then start to split out our data.

We needed each row to be in key value pairs to be able to post to our tracking end point:

{"id"=>"19263", "email"=>"me@russellvaughan.com", "name"=>"Russell Vaughan"}

We found it really useful to split out our headers as a separate object for a few reasons. Firstly, as the headers object would always be the same length as the rows below it, we could easily map the key value pairs based on the index.

Also, by having the headers as an object, we could then use the headers object sent in as params in its place, featuring the changed header names. Any columns we wanted to skip, we could do so by just using a ‘false’ value in its position in the headers when we sent to our back-end service.

@csv.each_with_index do |row, index|
   properties = {}
   row.length.times do |_number|
    @headers.each_with_index do |h, i|
      next if h == false
      properties[headers[i].strip] = row[i].strip unless row[i].nil?
    end
  end
end

We were than able to take that object of key-value pairs and post to the GoSquared Tracking API.

Using ActiveJob’s callbacks

One of the nice things about ActiveJob is its use of callbacks. This meant that after the job completed we could make use of the “after perform” method to post-track an event called ‘csv import completed’ against that user in our own GoSquared People when the job had completed.

Dogfooding (or: drinking our own champagne)

Within our own GoSquared People, we were able to build a Smart Group, filtered by all users that had that event track against them; then, once that was saved, we could use our Drip integration to automate an email to these users, letting them know the job had been completed.

Although it can be tempting to jump right into building a solution that will automate your process, by investing time in truly understanding the process of manually importing CSVs, we were able validate our assumptions and ultimately build a better feature, that would deliver more value to our users.

Want to try out the new import feature?

GoSquared is free to use for 14 days – try importing your users and see what you can learn about them with our 14 day free trial.

Further reading

Shipping early and shipping often

What does a sales engineer do?

Running T2 instances on EC2

Never miss a post