Aggregated results without duplicates on multiple columns each filtered differently

Question

I have a BigQuery table that looks like this and that can't be modified:

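| Country | Customer | Number of connections | Number of purchases | Country Metric 1 | Country Metric 2 |
|---------|----------|-----------------------|---------------------|------------------|------------------|
| Brazil  | A        | 10                    | 1                   | 3                | 1000             |
| Brazil  | B        | 90                    | 5                   | 3                | 1000             |
| Brazil  | C        | 80                    | 2                   | 3                | 1000             |
| Namibia | B        | 20                    | 1                   | 5                | 2000             |
| Namibia | C        | 150                   | 2                   | 5                | 2000             |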

About this table, please note that:
- Each combination of Country-Customer is unique.
- The country metrics, as their names suggest, only depend on the country.
- For some countries, some metrics are not available (`NULL` in the table).
- For some Country-Customer combinations, the number of connections and/or purchases is not available (`NULL` in the table).

I would like to obtain, in the same query, the following information:
- The mean of Country Metric 1, taking into account only the countries that have at least one Country-Customer combination with a number of purchases greater than or equal to 2. In the example table, there are 3 such combinations: Brazil-B, Brazil-C and Namibia-C. The mean should count Brazil only once, so the result is `(3 + 5) / 2 = 4`.
- The mean of Country Metric 2, taking into account only the countries that have at least one Country-Customer combination with a number of connections greater than 100. Only one combination meets this criterion in the example table: Namibia-C. Thus, the expected result is 2000.

These are just examples: there can be more metrics and other aggregations (sum, min, max, count...), but they should all follow the same pattern.

What I have tried:

SQL

SELECT AVG(IF(purchases >= 2, country_metric_1, NULL)), -- => (3 + 3 + 5) / 3 = 3.67 on the example table, expected 4
       AVG(IF(connections > 100, country_metric_2, NULL)) -- => 2000
FROM table

Issue: if the same country appears in multiple combinations, the same metric is taken into account multiple times.

SQL

SELECT AVG(IF(purchases >= 2, country_metric_1_p, NULL)), -- => random
       AVG(IF(connections > 100, country_metric_2_p, NULL)) -- => random
FROM (SELECT purchases,
             connections,
             -- No ORDER BY in the window, so the row numbered 1 in each country is arbitrary
             IF(ROW_NUMBER() OVER (PARTITION BY country) = 1, country_metric_1, NULL) country_metric_1_p,
             IF(ROW_NUMBER() OVER (PARTITION BY country) = 1, country_metric_2, NULL) country_metric_2_p
      FROM table)

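Issue: for each country, only one arbitrarily chosen combination carries the metric, so a country is ignored whenever that row does not pass the filter. The results are incomplete and change from run to run.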
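One possible direction, given here only as a minimal sketch: aggregate in two levels, first collapsing each country to a single value per metric, then averaging across countries. It assumes that each metric really has a single value per country (as stated above) and reuses the placeholder table name `table` and the column names from the attempts above. `MAX` is used only to pick that single value among the qualifying rows; it is not confirmed as the intended method.

```sql
-- Sketch under stated assumptions: `table` is a placeholder name and
-- each country metric has exactly one non-NULL value per country.
WITH per_country AS (
  SELECT
    country,
    -- One value per country: the metric if at least one customer qualifies,
    -- NULL otherwise. A NULL purchases/connections value fails the condition.
    MAX(IF(purchases >= 2, country_metric_1, NULL)) AS country_metric_1_p,
    MAX(IF(connections > 100, country_metric_2, NULL)) AS country_metric_2_p
  FROM `table`
  GROUP BY country
)
SELECT
  AVG(country_metric_1_p) AS avg_country_metric_1,  -- AVG skips the NULL countries
  AVG(country_metric_2_p) AS avg_country_metric_2
FROM per_country
```

On the example table, the inner query yields 3 and NULL for Brazil and 5 and 2000 for Namibia, so the outer averages are `(3 + 5) / 2 = 4` and 2000. Other outer aggregations (sum, min, max, count) could be swapped in for `AVG` in the same way.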

Posted 23-Apr-21 13:07pm

Lyrics Fever

Comments

CHill60 26-Apr-21 11:06am

What if the data is

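| Country | Customer | Number of connections | Number of purchases | Country Metric 1 | Country Metric 2 |
|---------|----------|-----------------------|---------------------|------------------|------------------|
| Brazil  | A        | 10                    | 1                   | 3                | 1000             |
| Brazil  | B        | 90                    | 5                   | 3                | 1000             |
| Brazil  | C        | 80                    | 2                   | 4                | 1000             |
| Namibia | B        | 20                    | 1                   | 5                | 2000             |
| Namibia | C        | 150                   | 2                   | 5                | 2000             |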

In other words, you state that data for Brazil should not be "duplicated" so in my example would you still use

(3 + 5) / 2 = 4

or

(4 + 5) / 2 = 4.5

or

(3 + 4 + 5) / 2 = 6

What about the null metrics: should they be included or not (e.g. included in the count of rows, or should the row be ignored entirely)?
"there can be more metrics and other aggregations": with the same criteria or with different criteria? This "single query" is going to be very confusing. Why must it be the same query? That is a very artificial criterion.
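Whether the case above can occur at all could be checked directly against the data. The following is a minimal sketch, assuming the placeholder table name `table` and the column names used in the question's attempts; it lists the countries whose metrics take more than one distinct value.

```sql
-- Sketch only: `table` and the column names are taken from the question's attempts.
SELECT
  country,
  COUNT(DISTINCT country_metric_1) AS distinct_metric_1_values,  -- NULLs are not counted
  COUNT(DISTINCT country_metric_2) AS distinct_metric_2_values
FROM `table`
GROUP BY country
HAVING COUNT(DISTINCT country_metric_1) > 1
    OR COUNT(DISTINCT country_metric_2) > 1
```

An empty result would support the statement that the metrics depend only on the country. With the modified data above, Brazil would be returned, because its Country Metric 1 takes the values 3 and 4.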

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)