
Tracking Your Visitors
by Bill Winett
So you've created
the ultimate Web site, and now you're sitting back watching your hit counter
go wild. You may ask yourself, "I wonder how many pageviews my help page
is getting?" or, "I wonder how many people are visiting my site?"
Unfortunately, when
most people start building a Web site, they don't consider they someday
might want to track its traffic. It takes enough time just to design the
site and create the content. Outlining what information they want to track
is just more work that already overworked staffs tend to let slide.
But when it comes
down to it, we all quickly become bean counters on the Web. Once a site
is up and running, we want to know how many people are looking at our
pages and how many pages each of those people is looking at. That's usually
when a lot of Web developers discover that had they spent more time thinking
about setting up their site, they'd be able to track how it's being used
much more easily.
If you're in this
situation right now, you've come to the right place. And if you haven't
made your site public yet, you're lucky: you still have time to
think about reporting before your design is set in stone. Don't miss out
on this chance!
What Information Is Available?
Before you can decide what type of analysis you want to do, you need to
know what information is available. Unfortunately, there's not much tracking
data you can collect, and what you can get is unreliable. But don't despair:
you can still gain useful knowledge from what does exist.
Your Web servers can
record information about every request they get. The information available
to you for each request includes:
- Date and time of
the hit (we'll look more closely at what hits are later on)
- Name of the host
- Request
- Visitor's login
name (if the user is authenticated)
- Web server's response
code (see http://www.FreeSoft.org/CIE/RFC/2068/43.htm for definitions,
or go to the source)
- Referer (see Toxic's
article "Who's Linking to You?")
- Visitor's user
agent (see http://www.FreeSoft.org/CIE/RFC/2068/205.htm, or go to the
source)
- Visitor's IP address
- Visitor's host (if the visitor's IP address can be translated)
- Bytes transferred
- Path of the file
served
- Cookies sent by
the visitor (see Marc's article, "That's the Way the Cookie Crumbles"
for an overview of cookies)
- Cookies sent by
the Web server
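To make that list concrete, here is a minimal sketch of pulling several of those fields out of one log line. It assumes your server writes the common NCSA "combined" log format, which may not match your setup, and the sample line is invented for illustration.

```python
import re
from datetime import datetime

# Assumed layout: NCSA "combined" log format. Your server's format may differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    hit["status"] = int(hit["status"])
    hit["bytes"] = 0 if hit["bytes"] == "-" else int(hit["bytes"])
    # A "-" in the user field means the visitor was not authenticated.
    hit["user"] = None if hit["user"] == "-" else hit["user"]
    return hit

# Invented sample line, for illustration only.
sample = ('152.163.199.42 - sue [10/Apr/1998:08:00:05 -0700] '
          '"GET /webmonkey/98/13/index0a.html HTTP/1.0" 200 5120 '
          '"http://www.example.com/links.html" "Mozilla/4.0 (Macintosh)"')
hit = parse_line(sample)
print(hit["host"], hit["status"], hit["request"])
```

Cookies don't appear in the combined format by default, so you would have to configure your server to log them before you could parse them this way.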
Inaccurate, But Not Useless
As I mentioned before, the information you have available is inaccurate
but not completely useless. Although this data is inexact, you can
still use it to gain a better understanding of how people use your site.
To start things off,
let's take the 10,000-foot view of everything available and then drop
slowly toward the details. So, first, let's talk about hits and pageviews.
(If you didn't know already, there is a difference. A hit is any
request for a file your server receives. That includes images, sound files,
and anything else that may appear on a page. A pageview is a little more
accurate because it counts a page as a whole, not all its parts.)
As you probably already
know, it's quite easy to find out how many hits you're getting with a
simple hit counter, but for more precise analysis, you're going to have
to store the information about the hits you get. An easy way to do this
is simply to save the information in your Web server log files and periodically
load database tables with that data or to write the information directly
to database tables.
(For those database-savvy
readers, if you periodically load database tables using a 3GL and ODBC-
or RDBMS-dependent APIs, you can use data-loading tools from the RDBMS
vendor - such as Sybase's BCP - or you can use a third-party, data-loading
product. Here is a partial list of products.)
If you load your data
directly into a database, you will either need a Web server with the capability
already implemented (such as Microsoft's IIS), or you will need the source
code for the server. Another option is to use a third-party API, like
Apache's DBILogger.
Once you do that,
you can gather information about how many failed hits you're getting:
just count the number of hits with a status code in the 400s. And if you're
curious, you can drill down further by grouping by each status code separately.
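As a rough illustration, here is what those two queries might look like once your hits are in a database. This sketch uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, and sample rows are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical table layout and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (path TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("/index.html", 200),
        ("/logo.gif", 200),
        ("/old-page.html", 404),
        ("/index.html", 304),
        ("/private/report.html", 403),
        ("/missing.gif", 404),
    ],
)

# Total failed hits: any status code in the 400s.
failed_total = conn.execute(
    "SELECT COUNT(*) FROM hits WHERE status BETWEEN 400 AND 499"
).fetchone()[0]

# Drill down: failed hits grouped by each status code.
failed_by_code = conn.execute(
    "SELECT status, COUNT(*) FROM hits "
    "WHERE status BETWEEN 400 AND 499 "
    "GROUP BY status ORDER BY status"
).fetchall()

print(failed_total)
for status, count in failed_by_code:
    print(status, count)
```

The same two SELECT statements should work largely unchanged on most relational databases, though you'd load the table from your real logs rather than by hand.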
Pageviews
On the whole, though, counting hits isn't as informative as counting pageviews.
And the results aren't comparable to those of other sites (see the Internet
Advertising Bureau's industry-standard metrics).
To count pageviews,
you need to devise some method of differentiating hits that are pageviews
from those that are not. Here are some of the factors we take into account
when doing this at Wired Digital:
- Name of the file
served
- Type of the file
served (HTML, GIF, WAV, and so on)
- Web server's response
code (for instance, we never count failed requests - those with a status
code in the 400s)
- Visitor's host
(we don't count pageviews generated by Wired employees)
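Here's a minimal sketch of a filter built from those four factors. The specific extensions, excluded file names, and internal host suffix below are placeholder assumptions, not Wired Digital's actual rules, and the path is assumed to be the path of the file served.

```python
import posixpath

# All of these values are illustrative assumptions; substitute your own rules.
PAGE_EXTENSIONS = {".html", ".htm"}
EXCLUDED_FILENAMES = {"frameset.html"}        # hypothetical helper file
INTERNAL_HOST_SUFFIXES = (".example.com",)    # stand-in for employee hosts

def is_pageview(path, status, host):
    """Decide whether a single hit should be counted as a pageview."""
    filename = posixpath.basename(path)
    extension = posixpath.splitext(filename)[1].lower()
    if filename in EXCLUDED_FILENAMES:        # name of the file served
        return False
    if extension not in PAGE_EXTENSIONS:      # type of the file served
        return False
    if 400 <= status <= 499:                  # failed requests never count
        return False
    if host.endswith(INTERNAL_HOST_SUFFIXES): # skip in-house traffic
        return False
    return True

print(is_pageview("/webmonkey/98/13/index0a.html", 200, "dialup7.aol.com"))
print(is_pageview("/webmonkey/98/13/logo.gif", 200, "dialup7.aol.com"))
```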
Once you've determined
which hits are pageviews and which are not, you can count the number of
pageviews your site gets. But you'll probably want to drill down in your
data eventually to determine how many pageviews each of your pages gets
individually. Furthermore, if you split your site into channels or sections
- we separate our content into HotBot, HotWired, Wired News, and Suck
- you may want to determine how many pageviews each area gets. This
is where standards for site design can help.
Here at Wired Digital,
we've put into place a standard stating that the file path determines
where hits to a given file will be reported. For example, a pageview to
http://www.hotwired.lycos.com/webmonkey/98/13/index0a.html is counted
as a pageview for Webmonkey, whereas a pageview to http://www.hotwired.lycos.com/synapse/98/12/index3a.html
is counted as a pageview for Synapse (because Jon Katz is a Synapse columnist).
If this standard is
in place at all levels of your site, you can summarize and drill down
through your pageviews at will. Of course, there are some problems with
this method. You may want to count a pageview in one section part of the
time and in another section at other times. There are ways (that I won't
go into now), however, to get around these problems. We've found over
the years that this method works best - at least for us.
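If your site follows a path-based standard like this, mapping a pageview to its section can be as simple as reading the leading directories of the URL. The function below is only a sketch of the idea; the function name, the depth parameter, and the "(root)" label are my own inventions.

```python
from collections import Counter
from urllib.parse import urlparse

def section_for(url, depth=1):
    """Return the first `depth` directories of the URL path as the section."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # The last segment is a file name unless the path ends with a slash.
    directories = segments if path.endswith("/") else segments[:-1]
    if not directories:
        return "(root)"
    return "/".join(directories[:depth])

pageviews = [
    "http://www.hotwired.lycos.com/webmonkey/98/13/index0a.html",
    "http://www.hotwired.lycos.com/webmonkey/98/13/index1a.html",
    "http://www.hotwired.lycos.com/synapse/98/12/index3a.html",
]

# Summarize at the top level, then drill down one directory deeper.
by_section = Counter(section_for(url) for url in pageviews)
by_subsection = Counter(section_for(url, depth=2) for url in pageviews)
print(by_section)
print(by_subsection)
```

Raising the depth gives you progressively finer rollups from the same data, which is the payoff of keeping the convention consistent at every level of the site.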
Looking Deeper Into Pageviews
Once you've cut your teeth on some programs designed to retrieve the types
of information I've just explained, you should be able to use your knowledge
to code programs to give you the following:
- Pageviews by time
bucket: You can look at how pageviews change every five minutes
for a day. This will tell you when people are accessing your site. If
you also group pageviews by your visitors' root domains, you can
determine whether people visit your site before work hours, during work,
or after work.
- Pageviews by logged-in
visitors vs. pageviews by visitors who haven't logged in: What
percentage of your pageviews come from logged-in visitors? This information
can help you determine whether allowing people to log in is worthwhile.
You can also get some indication of how your site might perform if
you required visitors to log in.
- Pageviews by referrer:
When your visitors come to one of your pages via a link or banner,
where do they come from? This information can help you determine your
visitors' interests (you'll know what other sites they visit). And if
you advertise, this information can help you decide where to put your
advertising dollars. It can also help you decide more intelligently
which sites you want to partner with - if you're considering such an
endeavor.
- Pageviews by visitor
hardware platform, operating system, browser, and/or browser version:
What percentage of your pageviews come from visitors using Macs?
Using PCs? From visitors using Netscape? Internet Explorer? It will
take a bit of work to cull this information out of the user agent string,
but it can be done. Oh, and since browsers are continually being created
and updated, and therefore the number of possible values in the user
agent string continues to grow larger, you'll have to keep up to date
on whatever method you use to parse this information.
- Pageviews by visitors'
host: How many of your pageviews come from visitors using AOL?
Earthlink?
Note that you may
want to mix and match these various dimensions. For example, how do your
referrals change over time? Does the relative percentage of Netscape users
vs. Internet Explorer users change over the course of the day? Does one
area of your site seem to interest Unix users more than other areas?
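As a small sketch of mixing two of these dimensions, the code below counts pageviews per five-minute bucket and then per bucket and browser. The sample pageviews are invented, and the browser names are assumed to have been parsed out of the user agent string already.

```python
from collections import Counter
from datetime import datetime

def bucket_start(when, minutes=5):
    """Round a timestamp down to the start of its time bucket.

    Assumes `minutes` divides evenly into 60.
    """
    return when.replace(minute=when.minute - when.minute % minutes,
                        second=0, microsecond=0)

# Hypothetical pageviews: (timestamp, browser family).
pageviews = [
    (datetime(1998, 4, 10, 8, 1, 12), "Netscape"),
    (datetime(1998, 4, 10, 8, 3, 40), "Internet Explorer"),
    (datetime(1998, 4, 10, 8, 4, 59), "Netscape"),
    (datetime(1998, 4, 10, 8, 7, 5), "Netscape"),
]

# One dimension: pageviews per five-minute bucket.
by_bucket = Counter(bucket_start(t) for t, _ in pageviews)

# Two dimensions: pageviews per bucket and browser.
by_bucket_and_browser = Counter((bucket_start(t), b) for t, b in pageviews)

for bucket, count in sorted(by_bucket.items()):
    print(bucket.strftime("%H:%M"), count)
```

Any other field you record per pageview (referrer, host, site section) could take the place of the browser in the second counter.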
How To Count Unique Visitors
Now let's talk about visitor information. Look at the bulleted paragraphs
above and replace the word "pageviews" with the word "visitors."
Interesting, huh? Unfortunately, counting visitors is more difficult than
counting pageviews.
First off, let's get
one thing out in the open: There is absolutely no way to count visitors
reliably. Until Big Brother ties people to their computers and those computers
scan their retinas or fingerprints to supply you with this information,
you'll never be sure who's visiting your site.
Basically, there are
three types of information you can utilize to track visitors: their IP
addresses, their member names (if your site uses membership), and their
cookies.
The most readily available
piece of information is the visitor's IP address. To count visitors, you
simply count the number of unique IP addresses in your logs. Unfortunately,
easiest isn't always best. This method is the most inaccurate one available
to you. Most people connecting to the Net get a different IP address every
time they connect.
That's because ISPs
and organizations like AOL assign addresses dynamically in order to use
the limited block of IP addresses given to them more efficiently. When
an AOL customer connects, AOL assigns them an IP address. And when they
disconnect, AOL makes that IP address available to another customer.
For example, Sue connects
via AOL at 8 a.m. and is given the IP address 152.163.199.42, visits your
site, and disconnects. At 10 a.m., Bob connects via AOL and is assigned
the same IP address. He visits your site and then disconnects. Later,
as you're tallying the unique IP addresses in your logs, you'll unknowingly
count Sue and Bob as one visitor.
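The Sue-and-Bob undercount is easy to reproduce in a few lines. In this sketch the third column, the person actually at the keyboard, is something a real log never contains; it's included only to show the gap between what you count and what happened.

```python
# Hypothetical hits: (time, IP address, who was really at the keyboard).
# A real log has no third column; it is here only to expose the undercount.
hits = [
    ("08:00", "152.163.199.42", "Sue"),
    ("08:02", "152.163.199.42", "Sue"),
    ("10:00", "152.163.199.42", "Bob"),
]

unique_ips = {ip for _, ip, _ in hits}
actual_people = {person for _, _, person in hits}

print("visitors counted by IP address:", len(unique_ips))
print("people who actually visited:", len(actual_people))
```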
This method becomes
increasingly inaccurate if you're examining data over longer time periods.
We only use this information in our calculations at Wired Digital as a
last resort, and then only when we're looking at a single day's worth
of data.
If you allow people
to log in to your site through membership, you have another piece of information
available to you. If you require people to log in, visitor tracking becomes
much easier. And if you require people to enter their passwords each time
they log in, you're in tracking heaven. As we all know, though, there's
a downside to making people log in, namely that a lot of people
don't like the process and won't come to your site if you require it.
If you do force people to log in, however, you can count the number of
unique member names and easily determine how many people visit your site.
If you don't force
people to log in, but do give them the option to do so, you can count
the number of unique member names; then, for those hits without member
names attached, you can count the number of unique IP addresses instead.
Lastly, you can add
cookies to your arsenal. Define a cookie that will have a unique value
for every visitor. Let's call it a machine ID (I'll explain this later).
If a person visits you without providing you with a machine ID (either
because she hasn't visited your site before or because she's set her browser
not to accept cookies), calculate a new value and send a cookie along
with the page she requested.
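Here's one way that logic might look, sketched with Python's standard http.cookies and uuid modules. The cookie name, the one-year lifetime, and the choice of a random UUID as the unique value are all assumptions for the example, not a description of how Wired Digital does it.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "machine_id"            # hypothetical cookie name
ONE_YEAR = 60 * 60 * 24 * 365         # assumed cookie lifetime, in seconds

def machine_id_for(cookie_header):
    """Return (machine_id, set_cookie_value) for one request.

    set_cookie_value is None when the visitor already sent a machine ID;
    otherwise it is the value to send back in a Set-Cookie header.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if COOKIE_NAME in cookies:
        return cookies[COOKIE_NAME].value, None

    # No machine ID arrived with the request: calculate a new value.
    new_id = uuid.uuid4().hex
    response = SimpleCookie()
    response[COOKIE_NAME] = new_id
    response[COOKIE_NAME]["path"] = "/"
    response[COOKIE_NAME]["max-age"] = ONE_YEAR
    return new_id, response[COOKIE_NAME].OutputString()

# First visit: no cookie arrives, so a new one is issued.
first_id, set_cookie = machine_id_for(None)
# Return visit: the browser sends the cookie back, so nothing new is issued.
return_id, nothing = machine_id_for(COOKIE_NAME + "=" + first_id)
print(first_id == return_id, nothing)
```

Whatever value you generate, log it alongside the hit even on the first request, or that first pageview will have no machine ID attached.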
So now you can count
the number of unique machine IDs in your log. But there are still a couple
of issues that we need to discuss. First, as I've already mentioned, many
people turn off their cookies, so you can't rely on cookies alone to count
your visitors. At Wired Digital, we use a combination of cookies, member
names, and IP addresses to count visitors, with the caveat that, as I
said earlier, we don't use IP addresses when counting more than a single
day's traffic.
Second, the cookie
specification allows browsers to delete old cookies. And even if this
option wasn't specified, a user's hard disk can always fill up. Either
way, the cookies you send to a visitor may be removed at some point. So
it's possible that a person who visits your site at 8 a.m. will no longer
have your cookie when they return at 9 a.m.
Third, when your Web
server sends a cookie to a visitor, it's stored on the visitor's machine,
so if a person visits your site from home in the morning using
her desktop machine and visits again from work using another PC, you'll
log two different cookies. Which is why I've called the cookie a "machine
ID": it's tied to the machine, not the visitor.
Which brings us to
issue number four: Multiple people may use the same machine, in which
case you'll see only one cookie for all of them.
Fifth, various proxy
servers may handle cookies differently. It's possible that a given proxy
server won't deliver cookies to the user's machine. Or it might not deliver
the correct cookie to the user's machine (it might even deliver some other
cookie from its cache). Or it might not send the user's cookie back to
your Web server. Unfortunately, proxy servers are still young. There is
no formal and complete standard for how they're supposed to work, and
there's no certification service to ensure that they'll do what they're
supposed to do.
So with all these
issues to consider, here's what we do at Wired Digital:
- If we want to count
visitors for one day, we count member names.
- For hits that don't
have member names, we count cookies.
- For hits that have
neither member names nor cookies, we count IP addresses.
And if we want to
count visitors over multiple days, we only use cookies. We do some statistical
analysis in an attempt to determine how much of an undercount results
- but in the end, all these calculations are only estimates.
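The fallback order described above can be sketched as a function that picks one identifier per hit. The field names and sample hits are invented, and this is only my reading of the rules in this article, not Wired Digital's actual code.

```python
def visitor_key(hit, multi_day=False):
    """Pick the identifier used to count the visitor behind one hit."""
    if multi_day:
        # Over multiple days, only cookies are used.
        return ("cookie", hit["machine_id"]) if hit.get("machine_id") else None
    if hit.get("member"):
        return ("member", hit["member"])
    if hit.get("machine_id"):
        return ("cookie", hit["machine_id"])
    return ("ip", hit["ip"])

def count_visitors(hits, multi_day=False):
    """Count distinct identifiers; hits with no usable identifier are skipped."""
    keys = {visitor_key(hit, multi_day) for hit in hits}
    keys.discard(None)
    return len(keys)

# Hypothetical hits with invented field names and values.
hits = [
    {"member": "sue", "machine_id": "a1", "ip": "192.0.2.1"},
    {"member": "sue", "machine_id": "a1", "ip": "192.0.2.1"},
    {"member": None, "machine_id": "b2", "ip": "192.0.2.2"},
    {"member": None, "machine_id": None, "ip": "192.0.2.3"},
    {"member": None, "machine_id": None, "ip": "192.0.2.3"},
]

print(count_visitors(hits))                  # single-day rules
print(count_visitors(hits, multi_day=True))  # multi-day rules
```

Note that a member who browses for a while before logging in would show up under two different identifiers here, which is one more reason to treat the result as an estimate.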
There's one more issue
we need to discuss. Do you want to track the information you have over
multiple days? Or is one day's worth enough? If one day's data will suffice,
you can get away with simple programs that process your log files. If
you prefer to process multiple days' information, however, you'll want
to store it all in a database.
Bill Winett is
Wired Digital's former director of internal systems and is now the CTO
of Computer Strategy Coordinators.