Clip Man

daniele


Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery


Previous Entry Share Next Entry
Counting unique visitors in SQL
Clip Man
daniele
A lot of web metrics solutions out there like NetTracker or Omniture allow you to perform analysis on the number of unique visitors over time. This is a pretty important metric to a lot of companies, and I recently needed to perform such an analysis, but it was on data stored in a SQL database rather than in one of these proprietary solution's data-stores.

Doing any sort of distinct counting on a large volume of data in SQL can be very costly, both in terms of storage of the raw data (since you can't aggregate it), and in query performance since there are relatively few optimizations that can be performed on the table or the query.


Fortunately for me, our data warehouse is stored in Vertica, and while the queries weren't blindingly fast, I was able to get the analysis done in a very reasonable time frame.

I was dealing with a week worth of traffic (about 80m requests per day), and one of the biggest challenges I had was how to determine what constituted a "unique visitor" (in this case, it is actually more of a unique ping or requestor since there isn't really a person involved).

I didn't have a cookie that I could use, so that left me with the less desirable course of using a combination of IP address and User Agent string. The problem with this is that the solution will under count one class of requests, and over count a different class. Here are the details:

1. If a request comes from a host that receives its public IP address via DHCP (e.g. a cable modem or DSL) and that service provider has their DHCP configured to force a change of IP addresses when the host renews its lease, then when the IP address changes, the requestor will be considered "new". e.g. HostX makes a request on Monday with IP 1.2.3.4 and a request Tuesday with IP 1.2.3.4. Then, on Wednesday, their IP address changes. Later Wednesday, HostX makes a request with its new IP 2.3.4.5. When we perform analysis on this week, we will see one distinct requestor on Monday and Tuesday, but a new requestor on Wednesday. In the worst case, if a new host, HostY is assigned IP 1.2.3.4 which HostX used to have and HostY also makes a request using the same OS version, we will mistakenly believe HostY on Wednesday through Saturday is the same distinct requestor as HostX from Monday and Tuesday.

2. If several hosts are on the same LAN network (e.g. an office), then the public IP address will likely be the same for each of those hosts. I use the partial user agent string to help mitigate this problem. I am pulling the OS and locale details out of the user agent string and using that in addition to the IP address to determine uniqueness. Unfortunately, there are a lot of machines running Windows XP with en-US, so this is only partially helpful. Any host with the same IP + OSversion + locale will be treated as a single distinct requestor in this analysis.

3. When I worked on this IP+UA strategy, I originally tested using the full UA (user agent) string which includes the browser version number. This might make sense for many other websites, but unfortunately, what I saw in the test cases that I used was that we would "forget" the distinctness whenver the browser is upgraded, or in some cases, even when certain plugins or extensions are installed (ones that modify the UA) [I'm glaring at you, MegaUpload!].


So, with this strategy in place, I ran the numbers and while I could see an unfortunate amount of under counting (i.e. multiple requests being counted as the same distinct requestor when they likely should have been separate), it was as good as I was going to get.

The last thing I needed to do was to write a SQL statement that added up the number of distinct requestors grouped by the number of days in the week that requests were made. Here is the SQL I wrote to do that. This was just my first stab at it, I got my answer, and it didn't take more than a few minutes, so I left it at that. I'd still be interested in hearing if anyone else has a better way. :)


SELECT (d15 + d16 + d17 + d18 + d19 + d20 + d21) AS RequestsPerWeek
, COUNT(*) AS NumDistinctRequestors
FROM (
SELECT
MAX(CASE WHEN d.date = '2009-03-15' THEN 1 ELSE 0 END) AS d15
, MAX(CASE WHEN d.date = '2009-03-16' THEN 1 ELSE 0 END) AS d16
, MAX(CASE WHEN d.date = '2009-03-17' THEN 1 ELSE 0 END) AS d17
, MAX(CASE WHEN d.date = '2009-03-18' THEN 1 ELSE 0 END) AS d18
, MAX(CASE WHEN d.date = '2009-03-19' THEN 1 ELSE 0 END) AS d19
, MAX(CASE WHEN d.date = '2009-03-20' THEN 1 ELSE 0 END) AS d20
, MAX(CASE WHEN d.date = '2009-03-21' THEN 1 ELSE 0 END) AS d21
FROM distinct_requests a1
JOIN dates d ON a1.utc_date_id = d.date_id
GROUP BY a1.ip_ua_id
) x
GROUP BY (d15 + d16 + d17 + d18 + d19 + d20 + d21)
ORDER BY (d15 + d16 + d17 + d18 + d19 + d20 + d21)



Tags: , , ,

  • 1
Hi!

well, without discussing the underlying logic, you can rewrite the subquery like tis:

SELECT COUNT(DISTINCT
CASE WHEN d.date BETWEEN '2009-03-15' AND '2009-03-21'
THEN d.date
ELSE NULL
END
) AS num_day_in_week
FROM distinct_requests a1
JOIN dates d
ON a1.utc_date_id = d.date_id
GROUP BY a1.ip_ua_id

Now what puzzles me is why you are looking at one particular week to determine the number of days per week that requests are being made while you do the count of all distinct requestors without regard to any particular period.

Can you clarify that?

kind regards,

Roland Bouman

The results of this query, and several other queries similar to it would allow me to answer questions such as the following:

1. How many people "visited" every day of the week?
2. 6 days? 5 days? ... 1 day?
3. How many people visited on the weekend and a weekday?
4. How many people visited only on the weekend?
5. Only on a weekday?
6. First part of week?
7. Last part of week?

Hi!

"How many people "visited" every day of the week?"

ok - but then, why does the subquery one look at 7 particular dates? I mean, the fact that some unique visitor happened to do requests on date X, which happened to be a Sunday, does not tell you anything about all the other sundays, right?

Or perhaps i am missing something?

kind regards,

Roland

I should clarify one bit.
The reason the inner query returns the individual days as Boolean type values is so I can have a WHERE clause in the outer query to answer the questions I mentioned above.

Re: SQL

(Anonymous)
Hi!

"The reason the inner query returns the individual days as Boolean type values is so I can have a WHERE clause in the outer query to answer the questions I mentioned above."

Ok. I'm just curious, have you considered making those flags in your date dimension? That way, all you need is the join, and you can apply all the logic on the date dimension fields

kind regards,

Roland

  • 1
?

Log in

No account? Create an account