Data Science 101: Using Clusters to Understand Web Traffic

4 minute read

There are a lot of factors that influence the amount of traffic a site receives, as well as how engaged its users are. In this post we’ll take a look at one of the many techniques we use to help understand a site’s traffic: clustering.

What is clustering?

At a high level, clustering is a machine learning technique that puts similar things into the same bucket. This can be done in a supervised or unsupervised fashion. Supervised clustering (more commonly called classification) is like sorting coins by denomination: you already know exactly what your buckets are. In practice, you’re often dealing with dirty or damaged coins, so the denomination isn’t immediately obvious, which is why you need some machine learning. Unsupervised clustering lumps items together automatically based on how similar they are. With many algorithms you have to specify how many clusters you want at the end, and there’s always a possibility that the resulting clusters won’t have an obvious meaning (for example, your algorithm might say, “Hey, I found a bunch of coins covered in green mud!”).
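To make the unsupervised case concrete, here is a minimal sketch of k-means, one common clustering algorithm where you pick the number of clusters up front. The post doesn’t say which algorithm we actually use, so treat this purely as an illustration; the coin measurements and the `kmeans` helper are made up for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means: returns (centroids, labels) for a list of numeric tuples."""
    # Deterministic start (an assumption for this sketch): the first k points
    # serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical coin measurements: (diameter in mm, weight in g).
coins = [
    (19.0, 2.5), (24.3, 5.7), (17.9, 2.3),
    (19.1, 2.4), (24.2, 5.6), (18.0, 2.2),
]
centroids, labels = kmeans(coins, k=3)
print(labels)  # coins with the same label landed in the same bucket
```

Notice that the algorithm never sees a denomination. It only sees measurements, and it is up to us to look at each bucket afterwards and decide what, if anything, it means.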

We use unsupervised clustering to help us figure out what the topic of a site is, and how that topic influences its traffic.

Understanding Internet Usage Patterns with Site Categories

For any site, we get a raw understanding of its traffic from our data panel, but we need to scale that up to the entire Internet-using population. To understand broader Internet usage patterns, it helps to know what kind of site we’re talking about. We figure out what sites are about by categorizing their topics. For instance, our data might tell us that sites about data science get 10 times as much traffic as sites about beanie babies (I wish). This means that if I see the same number of panelists visiting a data science-themed site as a beanie baby-themed site, I can confidently say there are many more data science fans in the wild.
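As a back-of-the-envelope illustration of that scaling idea, the sketch below multiplies a panel count by a per-category factor. The factors, the panel counts, and the `estimate_visitors` function are all invented for the example; our real models are considerably more involved than a single multiplier.

```python
# Hypothetical scale factors: estimated real-world visitors per panelist,
# chosen so that data science sites get 10x the traffic of beanie baby sites.
SCALE_BY_CATEGORY = {
    "data science": 5000,
    "beanie babies": 500,
}

def estimate_visitors(panel_visitors, category):
    """Scale a raw panel count up using the site's category (toy model)."""
    return panel_visitors * SCALE_BY_CATEGORY[category]

# The same 40 panelists show up on both sites...
data_science_site = estimate_visitors(40, "data science")
beanie_baby_site = estimate_visitors(40, "beanie babies")

# ...but the category tells us the wider audiences differ by a factor of 10.
print(data_science_site, beanie_baby_site)
```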

While categorizing sites might sound easy (and it is, for a single site), things get difficult when we want to do this for every site on the web. There aren’t enough interns anywhere to do this before the sun burns out.

There are a couple of other things that make categorizing sites by topic tricky. The first is that there are a ton of possible topics. Every word in the dictionary could be a topic, but slicing sites so finely makes it harder to glean useful information (e.g., ‘sports’ is a more useful topic than ‘world series of underwater handstands, 1917’). One way we combat this is clustering, which is the fancy machine learning term for lumping like things together. As an example, the cluster of sports topics would include things like baseball, football, soccer, and underwater handstands (maybe).

That example hints at the second difficulty with categorizing site topics: a site can have multiple topics and they may not be related in a meaningful way. Contrast espn.com with sportsauthority.com. They’re both about sports, but one is a news aggregator (among other things) and the other is a store. We deal with this issue by letting a site belong to multiple clusters. This is like saying sportsauthority.com looks like a store from one side, and like a sports site from another side.

[Image: rabbit-and-duck optical illusion]

[Image: optical illusion of an old woman and a young girl]

Identifying Clusters

Now let’s circle back to how we actually identify these topic clusters. We’re not necessarily interested in how you’d group sites based on only browsing their content. Instead, we’re interested in sites that have similar traffic patterns, which also gives us information about what sites are about.

Let’s take a random site that we know nothing about, foobar.com, for example. From my panel I might notice that people who visit foobar.com are much more likely to visit foo.com and bar.com than those who never go to foobar.com. This tells me two things: 1) foobar.com, foo.com and bar.com are probably about something similar, and 2) these sites probably receive comparable amounts and kinds of traffic. That second piece of information is really important. If I knew how much traffic foobar.com actually receives, I could leverage that information to give you an accurate estimate of how much traffic foo.com and bar.com receive. A similar statement can be made about links between sites (this is how Google got started years ago).
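Here is a rough sketch of the comparison described above: how much more likely are foobar.com visitors to visit another site than people who never go to foobar.com? The panel data is invented, and the `visit_ratio` function is a simplified stand-in for illustration, not our production method.

```python
import math

def visit_ratio(panel, anchor, other):
    """Ratio of P(visits other | visits anchor) to P(visits other | never visits anchor)."""
    visitors = [sites for sites in panel if anchor in sites]
    non_visitors = [sites for sites in panel if anchor not in sites]
    if not visitors or not non_visitors:
        raise ValueError("need panelists both with and without the anchor site")
    rate_visitors = sum(other in sites for sites in visitors) / len(visitors)
    rate_non_visitors = sum(other in sites for sites in non_visitors) / len(non_visitors)
    if rate_non_visitors == 0:
        return math.inf  # only anchor visitors ever reach the other site
    return rate_visitors / rate_non_visitors

# Hypothetical panel: one set of visited sites per panelist.
panel = [
    {"foobar.com", "foo.com", "bar.com"},
    {"foobar.com", "foo.com"},
    {"foobar.com", "bar.com"},
    {"foobar.com", "foo.com", "bar.com"},
    {"news.example", "foo.com"},
    {"news.example"},
    {"shop.example", "bar.com"},
    {"shop.example", "news.example"},
]

foo_ratio = visit_ratio(panel, "foobar.com", "foo.com")
bar_ratio = visit_ratio(panel, "foobar.com", "bar.com")
print(foo_ratio, bar_ratio)
```

A ratio well above 1 suggests the two sites share an audience, which is the kind of signal that lets us group them into the same cluster.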

For our purposes, we generate a lot of clusters from our data sources and then let another layer of machine learning figure out which ones are actually useful. This means that these clusters are just one subset of the features (another fancy machine learning term for a variable or attribute or, typically, a column in a spreadsheet) we use to estimate various traffic metrics. How we use these features, and how we let an algorithm pick which ones to use, are topics for another time.
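If “clusters as features” sounds abstract, the sketch below shows one simple way it could look: each cluster becomes a 0/1 column, and a site that belongs to several clusters gets several 1s. The cluster names and memberships are assumptions drawn from the examples in this post, not real output from our system.

```python
# Hypothetical cluster names; each one becomes a column in the feature table.
CLUSTERS = ["sports", "store", "tech news"]

def cluster_features(site_clusters):
    """Turn a site's cluster memberships into a row of 0/1 features."""
    return [1 if name in site_clusters else 0 for name in CLUSTERS]

# Assumed memberships: a site may belong to more than one cluster.
memberships = {
    "espn.com": {"sports"},
    "sportsauthority.com": {"sports", "store"},
}

rows = {site: cluster_features(clusters) for site, clusters in memberships.items()}
for site, row in rows.items():
    print(site, row)
```

A downstream model can then weigh each column, or ignore it entirely if it turns out not to help predict traffic.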

In sum, topic clusters are extremely beneficial in helping us understand broad, Internet-wide usage patterns. They help us determine whether or not a site is the kind that people go to every day to keep up with the latest news or the kind of site they check out once a month. Despite being based solely on people’s browsing behavior, these clusters have distinct subjects like “sports” or “tech news”. We’ll dive into how these clusters and other features get incorporated into our models in future blog posts.

Until then, read more about what it’s like to be a data scientist in our post, “Understanding Data Science and Why It’s So Important.”


  3. Avukat Baran Doğan November 26, 2015 at 3:17 pm

    Thanks, it is very helpful.

  4. SmartSoftware Inc November 25, 2015 at 8:26 am

    Thanks for your post.
    I learned so much.
