Scraping, refining and plotting data

Scrape with Import.io, refine with OpenRefine and plot with plot.ly

I’m currently active in IT. At a certain moment I was curious as to what people of my age are earning so I could compare.  The easiest way is to go and check on standard jobsites or google it. Still not all data is released there, only some figures. And most of all there is no fun in doing this. Luckily we have messageboards where users can anonymously post their salaries and bunch of other interesting data. And luckily we have some easy tools to collect, refine and plot this data thus enabling us to do our own research.

In the picture below you find an example of a post on the messageboard as you can see it’s divided in topics: personal, labour agreement, terms of employment and working conditions. It’s interesting to know what one person is making but it would be more interesting to know what the average is, what people earn versus age is, to plot it into a histogram to know the distribution of the salaries. The people posting on this forum are mostly young(er) and they’re more active in engineering jobs, so interesting for myself to compare but not representative for the entire population. Below an example of the template people could fill in(in dutch).

Template post

Template post

The techniques and tools that are described below require minimal technical knowledge and almost no programming skills. This was my first time using any of the mentioned tools and they were easy to learn.

Scrape the data with import.io

To collect the data I’ve used import.io. Import.io enables you to turn any website into a table of data or an API with no coding required. Import.io can be freely downloaded at their website. In the example used here the following steps were taken:

1. Create an extractor and start with training the rows, in this case every post is a row. 20 rows per page, this could differ on what your posts/page setting is.

trainrows

2. Next up is training the columns, this is the time consuming part. Add a column and mark where the data is to be found. Most of the time you’ll have to train one column a few times with different rows, so it really learns correctly where to find the particular data.

addcolumn

Click add column

addagecolumn

Name your column and click done

 

traincolumn1

Mark the data to be extracted

traincolumn2

Repeat for other columns

3. You do this for two example pages, train rows and train columns.  After training two it should be enough.train2pages

4. Create a Dataset and select the extractor you created

SelectDataset

select dataset

5. Profit
In the screenshot below we now have our dataset for the first two pages of the thread. On the left you can add more pages from which you want to extract data. There’s is a way to automatically paginate through more pages but unfortunately and also understandibily this is limited to 10 pages. Adding them manually is what you can do.

Table with the data

Table with the data

Clean the data with OpenRefine

Even though they are using templates these posts on the messageboard are still like free text fields. There are lots of different ways of saying how much you make, if you have a company car or not and so on. So before anything useful can be done with the data we need to “refine” it. For this I used OpenRefine, formerly known as GoogleRefine. OpenRefine offers a nice toolset to clean your data. Their website: http://openrefine.org/

What I did in this example is cleaning the gross and net wages column. Openrefine offers basic transform actions. It also let’s you do custom transforms written yourself with clojure, jython or GREL(Google Refine Expression Language).

functionsopenrefine

Basic transforms

customtexttranform

Custom transformations

OpenRefine has some other great features, for example logging all the actions you did on a dataset and then replaying them on a similar dataset. You can also extend your data with more data to derive more from it. Check http://openrefine.org/ to learn how.

Plot your data with Plotly

As said in the introduction, now that we have collected the data and cleaned some of it we can get some statistics. Plot.ly will enable you to plot your data in a lot of insightfull ways. For this example I got some basics like an histogram showing the distribution of the salaries, salary versus age and if it correlates. As you can see here you can embed your plots into your blog or any website you want. It’s interactive and you can click through if you want to have a look at the data or the code itself. No knowledge of programming is required to make plots like these.

Distribution of gross salaries:

Gross salary versus age:

Thank you for reading and hopefully you’ll use some of it.