Loading subsets of data with the Qlik SAMPLE and FIRST prefixes

Qlik subset load - Bitmetric Friday Qlik Test Prep question

Every Friday at Bitmetric we’re posting a new Qlik certification practice question to our LinkedIn company page. Last Friday we asked the following Qlik Data Architect certification practice question about Qlik subset creation in load scripts with the FIRST and SAMPLE statements:

While many people knew about loading data subsets using First, it was clear that the Sample prefix is a lot less known.

The correct answer is D: Random 10% of total rows are loaded from Customer.QVD

Arguably one of the most encountered problems during development is long loading times of (very) large datasets. In order to speed up development it is obviously nice to be able to create a subset in the load of the dataset. There are various ways of doing this, but looking at the comments and answers to the question, First is one which came to mind by many.

First is a prefix which limits the load to a specified number of rows. For example First 1000 will limit the load to the first 1000 rows from the table. Making the load and thus the development time quicker.

So what is the catch?

However, in some cases it could be that loading the first rows doesn’t cut it. For example:

The first rows of the dataset are filled with data of questionable quality (empty rows or test data)
Data is sorted on alphabet and only name records starting with A are loaded
Dates. The data could be sorted on entrydate and you only load the first dates from the dataset

Especially that last example makes sense if we are looking into front-end results. How do we compare year to date measures or point in time reporting if we only load a few dates? Take the following example dashboard from a car sales company. In this example we have used First 1000 as a prefix.

car sales dashboard first 1000 load subset

We can clearly see that there is data in the dashboard, but that most of the charts and KPI objects are not really telling us much. In this way we cannot check whether the measures are performing as expected.

Now using an equivalent sized Sample load, in this case using Sample (0.1) as prefix we get the following result:

car sales dashboard sample 0.1 load subset

This dashboards speaks more to the imagination, giving us a better look at all the various data. Since we now have data randomly picked from the total dataset we can immediately see that year to date and in time measures are performing as expected.

Using the SAMPLE prefix in Qlik

The sample prefix has the following syntax:

Sample p ( loadstatement | selectstatement )

The p argument is to indicate the probability that a record is read. It is optional to put this amount in parentheses. So looking at the example from the question we can tell that a random 10% is added to the loaded dataset.

Things to keep in mind?

Sample will look through the entire dataset. First will limit it to a fixed row number and stop. This means that sample could potentially still be slower.
Every time you reload again the sample set will change. Keep that in mind.
Each row is evaluated separately. So with Sample (0.1) each row will have 10% chance to end up in the resulting table. This means that if you load a dataset consisting of 1000 rows, it will on average load 100 rows into the table, but for each iteration this could deviate a little bit, since it will go through the total dataset and evaluate each row on 10%.
Sample must be a value between 0 and 1.
Combining First and Sample is possible.
Wait what?

Combining the FIRST and SAMPLE prefixes

Yes it is possible to combine both prefixes. Should you and what does it do? Let’s have a look at the results first to understand what is happening. The following load script is used:

FIRST 1000
SAMPLE (0.1)
LOAD
*
FROM [LIB:\\Customers.qvd]
(qvd);

After this load we also have loaded Sample (0.2), Sample (0.3) and so on, untill Sample (1). If we then check the resulting data models record and row numbers we see the following results:

subset results sample 0.1 first 1000 load

As we can see the load is limited to 1000 rows, but still random sampled. There is no difference in order of these prefixes. So it doesn’t matter if we use first and then sample or sample and then first. However the logical conclusion of First 1000 Sample (0.1) would be that Qlik would load the first 1000 rows and then samples 10% of these, giving us probably 100 rows as result. But this is not the case.

When combining first and sample the order doesn’t matter. Qlik will always evaluate sample first and then limit this load to the first 1000 samples. As we can see in the table above this is as expected. Sample(1) will result in a 100% chance for each row to return. This gradually moves to a bigger sample with how less the chance is a record is added. So sample (0.1) means that each row has a 10% chance of being added to the data model. First 1000 tells the load script to stop at row 1000, which could be a higher record number because of the chance of addition.

So should you use this? All depends on the size of the data sets and what you wish to evaluate. But if you want to do a over time analysis on a subset of data, the usage of sample should be sufficient.

That’s it for this week. See you next Friday?

More from the Bitmetric team

Take your Qlik skills to the next level!

Since 2013, the Masters Summit for Qlik is the premier advanced training for Qlik. Join us in Vienna and take your Qlik skills to the next level.

Join the team!

Enjoying these challenges? We have many more, and we’ll even pay you to solve them 😉 Check out our job opening.

Friday Qlik Test Prep Solution

How can we help?

Barry has over 20 years experience as a Data & Analytics architect, developer, trainer and author. He will gladly help you with any questions you may have.

Call us

Mail us

Qlik subset load with SAMPLE and FIRST

Share this message

The correct answer is D: Random 10% of total rows are loaded from Customer.QVD

So what is the catch?

Using the SAMPLE prefix in Qlik

Things to keep in mind?