Lessons learned while researching data to find an answer.

Yesterday I published a data story on this blog. That was a first of its kind for me. Of course I’ve used graphs before in posts, but that was always reusing other people’s work. This time I did the data work myself. Here is an unstructured list of the things that I learned while doing it.

  • You start with downloading one dataset, but you’ll always need more data. My starting point was to find data on the total number of houses in the country. Statistics Netherlands (CBS) has plenty of data available on its Statline website. I quickly found a data set with exactly what I needed: ‘Voorraad woningen; standen en mutaties vanaf 1921’ (housing stock; levels and changes from 1921). But of course, when you’re trying to answer the question of why housing is so expensive, you’ll need to compare the housing stock to population size. Therefore you need to download other data sets as well, for instance on population growth;
  • Statline doesn’t always give you all the data available. In my exploration I first downloaded a dataset with numbers on population size starting in 1950. I used this mostly for compiling the graphs, only to find out later that there is another data set available that provides population data starting in 1900. My lesson here is to always dig for more when it comes to using CBS’s data;
  • Exploring data becomes messy rather quickly. I downloaded several data sets and used Power BI to create a dimension table for ‘year’ and added this column to all tables, so that I could combine data across the tables. This phase is needed to discover what’s happening, but with each data set you add it gets more difficult to keep track of which columns you used from which table;
  • Power BI is a very handy tool for exploring and combining data sets;
  • After the exploration phase, when I discovered the story the data was telling me, I created a new data set only containing the data that I needed. This way I couldn’t pick the wrong column when making the visuals;
  • To create relationships between the tables I used a ‘Year’ dimension table but only used it as a whole number column. I should have created a proper date dimension table to make it even easier to create relationships between the tables (as my teacher already told me to do with every new data model);
  • Power BI Desktop is not the best tool for creating output for use outside the Microsoft Power BI ecosystem. Power BI is mainly meant for building ‘live’ dashboards that are used inside companies via the Power BI service, the online platform that accompanies it. You can publish a report to the service so that others inside your company can look at it. However, I want to publish the visuals on my blog. The only thing I can use from Power BI Desktop is a PDF export. Luckily I know how to use Photoshop and was able to turn each PDF page into a PNG rather quickly, but that means extra steps between producing and publishing. Rather annoying when you have many graphs;
  • It’s easier to create new columns using a simple calculation in a spreadsheet than to use Power BI’s DAX formulas to get the same result. In Power BI I only succeeded in doing calculations on columns within the same table, not across tables;
  • You need reflection time on what you’re doing with the data. I started exploring the data more than two weeks ago, and only after I showed someone my unpublished post did I discover a flaw in my thinking. In one of my graphs I plotted three lines: two were cumulative totals of population and houses, and the third was a yearly count of the migration surplus. I was comparing apples and oranges to make a point. I corrected this and created a new graph comparing births, deaths and migrants, all cumulative since 1950;
  • I want to learn how I can create interactive SVG-plots on my website so readers can see the actual data behind the graphs.
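
Two of the lessons above, joining tables on a shared ‘year’ column and only comparing cumulative series with other cumulative series, can be sketched in plain Python. This is a minimal illustration, not the way I built it in Power BI: the figures are invented (they are not CBS data) and the helper names `cumulative` and `join_on_year` are made up for this example.

```python
from itertools import accumulate

# Hypothetical yearly figures, invented for illustration (not CBS data).
births = {1950: 230, 1951: 228, 1952: 232}
deaths = {1950: 75, 1951: 76, 1952: 74}
net_migration = {1950: 20, 1951: -5, 1952: 10}


def cumulative(series):
    """Return a dict mapping each year to the running total up to that year."""
    years = sorted(series)
    totals = accumulate(series[y] for y in years)
    return dict(zip(years, totals))


def join_on_year(**tables):
    """Combine several year-keyed tables into rows, keeping only shared years."""
    shared = set.intersection(*(set(t) for t in tables.values()))
    return [
        {"year": y, **{name: t[y] for name, t in tables.items()}}
        for y in sorted(shared)
    ]


# Every series is made cumulative first, so the lines are comparable.
rows = join_on_year(
    births=cumulative(births),
    deaths=cumulative(deaths),
    net_migration=cumulative(net_migration),
)
for row in rows:
    print(row)
```

Because each series goes through the same `cumulative` step before the join, a yearly count can no longer end up next to a running total by accident.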
Posted on 2 November 2021 | dataanalyses, datascience, flow | 0 comments

Another certificate in the pocket

The past two weeks I spent most of my time studying for the MS DA-100 exam, also known as ‘Analyzing Data with Microsoft Power BI’. This morning I took the exam and passed with a very decent score of 893/1000 (although I have to admit I was a bit annoyed at not breaking the 900 barrier). After the training and passing the exam I am now skilled enough to start my own data analysis projects. I’m looking for ideas on where to apply my new skills.

Posted on 20 May 2021 | datascience, flow | 0 comments

Transforming and visualising data using Power BI

The past two weeks I was introduced to the ins and outs of Power BI. For four full training days I practised doing transformations on columns, making calculated measures and dragging columns and measures into visualisations. For those who are not into data analysis, Power BI is a piece of software developed by Microsoft to handle data sets. When spreadsheets are no longer sufficient to handle your data, you can step up your game by using Power BI.

Before this training I practised with SQL and Python to create scatter plots and calculate summations, and I have to admit that after using Power BI I finally understand what kind of actions I was doing to data sets when using Python. Power BI is a visual tool, so you click on the transformations you need to do to prepare your data and the results are immediately visible. And you can easily undo a step with one click.
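
The idea of a visible list of transformation steps that you can undo can be mimicked in Python. This is only an analogy under my own assumptions, not a description of how Power BI works internally; the rows are invented and the function names are hypothetical.

```python
# Hypothetical raw rows, invented for illustration.
raw = [
    {"year": "1950", "houses": " 2100 "},
    {"year": "1951", "houses": "2150"},
    {"year": "n/a", "houses": "0"},
]


def drop_invalid_years(rows):
    """Keep only rows whose year consists of digits."""
    return [r for r in rows if r["year"].isdigit()]


def convert_types(rows):
    """Turn the text columns into whole numbers."""
    return [
        {"year": int(r["year"]), "houses": int(r["houses"].strip())}
        for r in rows
    ]


def apply_steps(rows, steps):
    """Run each step in order; the input rows themselves are never modified."""
    for step in steps:
        rows = step(rows)
    return rows


steps = [drop_invalid_years, convert_types]
cleaned = apply_steps(raw, steps)
print(cleaned)

# "Undoing" the last step just means replaying the list without it.
undone = apply_steps(raw, steps[:-1])
print(undone)
```

Each step returns a new list instead of changing the raw rows, which is what makes replaying a shorter list of steps safe.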

I wouldn’t say Power BI is data analysis for dummies, because you still need to understand conceptually what you’re doing to the data, but I totally see why many people prefer using Power BI over messing about with Python. It is visual, quicker and can create interactive reports and dashboards. The reporting part is (for now) the least interesting to me, as I don’t work in a big company with lots of (sales) data that needs to flow through the organisation. However, after the past weeks I do feel more confident that I’m capable of getting meaningful information from data sets. And that was the whole point of investing in this course.

Posted on 4 May 2021 | datascience, flow | 0 comments