Building a Multiplay Network for Dummies, or Why this shit isn't as easy as it looks

[ This is part two of a two part document I'm in the middle of writing explaining how the MPUK iSeries event networks run. ]

There's a lot that goes on 'under the hood' at a Multiplay network event, and I get the impression that a lot of this is lost on the customers, many of whom have some network skills and can't understand (having invited a few friends round for a LAN party at home) why this stuff seems to be so complicated. Read on...

[ ... snip "Nik's basic guide to IP Networking" which goes here ... ]

After reading the above, you've probably got some ideas about how to build a network suitable for an iSeries event. They probably go something like this:

"OK, we need a network for 1,000+ hosts. Let's pick a network in the private address range (10.10.0.0, say) and just use that. With a subnet mask of 255.255.0.0 everybody will be able to see everybody else, and everything will be peachy."

Sadly, it's not that simple. Here are some of the reasons why.

1. ARP uses broadcasts to find the hardware (MAC) addresses that belong to other machines' IP addresses. If you could talk directly to 1,000 other machines your host would send a lot of broadcast traffic, as would every other host on the network. The broadcasts
   a) Reduce the amount of bandwidth available for other traffic.
   b) Consume additional CPU time on every host on the network, as each one has to look at the broadcast traffic to decide whether or not to respond to it.

2. DHCP uses broadcasts to find the DHCP servers. See (1).

3. Windows file sharing is an incredibly chatty protocol. It uses broadcasts a lot, and is quite anal about announcing its existence every now and then. Although we try and make sure that everybody has this turned off, we can't rely on it. See (1).

4. Many games have a "LAN" mode and an "Internet" mode when it comes to network play. Some badly written games (including the ones you want to play) decide whether or not they're on a LAN by looking at the network mask. If it's /24 they go into LAN mode; if it's anything else they go into Internet mode. In Internet mode they may
   a) Decrease the amount of data they send to and from the server, leading to increased ping times.
   b) Want to contact sites like won.net to verify your CD key, and refuse to work if they can't verify it.
   Not all games do this, just some. And we know it's an appalling way for them to behave. But there's nothing we can do about it.

5. Even if they don't do (4), the game's built-in 'browse for online servers' option may have been written to broadcast to try and find available servers. Again, not all games do this, but some do. See (1).

6. If we have one big network, a problem (either on one of the MPUK hosts or on one of the customer hosts) has the potential to bring down the entire network. For example, if a customer machine starts erroneously generating broadcast packets, or accidentally starts up a DHCP server, or begins advertising different routes -- all these things can seriously impact the ability of the network to work properly. By partitioning the network into smaller networks we are able to contain problems like these.

Note that it's not enough to hope that customer machines will be configured properly. We have to assume that they won't be, and prepare for the worst.

So, now you know why the network has to be divided into smaller networks, how do we do it?
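Before getting to that, here's a quick way to put rough numbers on the broadcast-domain point in (1) to (3). This is just a back-of-the-envelope sketch using Python's ipaddress module -- the choice of tool is mine and has nothing to do with the event kit itself:

    import ipaddress

    # One flat 10.10.0.0/16, as in the "just use one big network" plan.
    flat = ipaddress.ip_network("10.10.0.0/16")

    # One per-room /24, as actually used at the event.
    room = ipaddress.ip_network("10.10.10.0/24")

    # Usable hosts = total addresses minus the network and broadcast addresses.
    print(flat.num_addresses - 2)   # 65534 -- every machine at the event shares
                                    # one broadcast domain
    print(room.num_addresses - 2)   # 254 -- at most 254 machines hear each
                                    # other's ARP/DHCP/Windows chatter

With one flat network, every ARP request, DHCP discover and Windows announcement from any of the 1,000+ machines reaches all of them. With per-room /24s, at most 253 other machines ever hear yours.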
At i10, we had 11 different networks, all talking to one another:

192.168.1.0/24  "Administration" network for the servers (dns: private.event.multiplay.co.uk)
10.10.0.0/24    Customer-facing network for the servers (dns: public.event.multiplay.co.uk)
10.10.1.0/24    The Internet gateway was on this network
10.10.10.0/24   One half of the Concourse
10.10.11.0/24   The other half of the Concourse
10.10.20.0/24   Member's Dining Room
10.10.30.0/24   One half of the Long Room
10.10.31.0/24   The other half of the Long Room
10.10.40.0/24   One half of the boxes on floors 3, 4, and 5
10.10.41.0/24   The other half of the boxes
10.10.50.0/24   Staff machines (dns: punter.event.multiplay.co.uk)

By using a /24 netmask we only have room for 254 hosts on each network. Some of the rooms (the Concourse, the Long Room, and the boxes) can hold more than this many machines, which is why they're split over multiple networks.

In the middle of the network are two Extreme 48 port switches, linked together with a fibre optic connection. This gives us 96 100Mb ports in total. These switches are "layer 3", which means they understand IP as well as Ethernet. You can think of them as being smarter than regular switches, but not quite as smart as routers. Each switch appears as .1 or .2 on each of the 10.10.x.x networks listed above, and uses VLANs (don't ask) to "route" traffic between all the networks.

Each MPUK server (on 10.10.0.0/24) is plugged straight into one or other of these switches, giving them a guaranteed 100Mb/s to everything else on the network. Each of these servers also has a second NIC. We give these NICs 192.168.1.0/24 addresses and plug them into a 24 port Planet switch. This 192.168.1.0/24 network is used to administer the game servers -- administrators (such as the FTP site managers) can only connect in to the servers to do their work from the 192.168.1.0/24 network, which is physically located in the Staff Room.

Into the remaining ports on the Extremes we run uplinks to a large number of 24 port Planet switches. We call these "table switches", because these are the switches that physically sit out in the Concourse, Long Room, etc, next to the tables that customers put their equipment on. We used to use one uplink per Planet, which meant that each Planet had a 100Mb/s connection to an Extreme. Starting with i10 we now run two uplinks from each Planet to the Extremes, meaning that each table switch has a dedicated 200Mb/s link to the Extremes (and thence to the rest of the network).

Finally, you, the customer, plug your equipment into a port on the table switch. The net effect of this is that you get 100Mb/s (more or less) to the other 21 people who are on the same switch as you (24 ports on each table switch, minus the port you're using, minus the two ports that are used for the uplinks to the Extremes, equals 21). Your table, as a whole, gets a 200Mb/s connection to the Extremes, and on to the game servers, FTP servers, and so forth.
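For reference, here's the same layout written down as a small Python sketch. The dictionary simply restates the table above; the short names are labels I've made up for illustration, not official VLAN names:

    import ipaddress

    # The i10 subnet plan, restated from the table above.
    networks = {
        "admin":        "192.168.1.0/24",   # private.event.multiplay.co.uk
        "servers":      "10.10.0.0/24",     # public.event.multiplay.co.uk
        "gateway":      "10.10.1.0/24",
        "concourse-1":  "10.10.10.0/24",
        "concourse-2":  "10.10.11.0/24",
        "dining-room":  "10.10.20.0/24",
        "long-room-1":  "10.10.30.0/24",
        "long-room-2":  "10.10.31.0/24",
        "boxes-1":      "10.10.40.0/24",
        "boxes-2":      "10.10.41.0/24",
        "staff":        "10.10.50.0/24",
    }

    for name, prefix in networks.items():
        net = ipaddress.ip_network(prefix)
        # .1 and .2 on each 10.10.x.x network belong to the two Extremes.
        print(f"{name:12} {prefix:16} {net.num_addresses - 2} usable addresses")

    # Per-table arithmetic: 24 ports on a Planet, minus the port you're
    # plugged into, minus the two uplink ports, leaves 21 neighbours.
    print(24 - 1 - 2, "other machines share your table switch")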
A word about the FTP server. As I've already said, FTP is a very efficient protocol for sucking up all available bandwidth. If we let everyone download from the FTP server at full speed it would only take two people on a table switch downloading to saturate the 200Mb/s uplink to the Extremes, and affect everyone else on that table's bandwidth. So at i10 we did two things to the FTP server.

1. We implemented rate limiting. Uploads and downloads to the FTP server were capped at 10Mb/s per connection. So if you downloaded something from the FTP site and wondered why it seemed to be downloading slowly, that's why.

2. We only allowed one connection to the FTP server per IP address. Some FTP clients attempt to make multiple connections and download different parts of the same file in order to get around the bandwidth restriction described in (1). This neatly puts a stop to that, and ensures that a couple of people can't ruin the experience for everyone else on their table switch.

The game servers and other core servers (DNS, FTP, DHCP, etc.) sit on the 10.10.0.0/24 network. On each customer network (10.10.10.0/24 and up), addresses were handed out in the range .32 through to .254.

Why didn't we start at .1? One of the things we're considering is putting multiple IP addresses on the servers. So one of the CS servers might be 10.10.0.12; we might also want it to appear as 10.10.10.12, 10.10.11.12, 10.10.20.12, and so on. We're still debating the wisdom of doing this, and we didn't do it at i10, but we wanted to have the flexibility -- so we made sure that the .3 through to .31 addresses on each subnet were kept free if we needed them.
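To make that allocation concrete, here's a sketch of how the per-room pools could be written down. Nothing above says which DHCP server software we actually run, so the ISC dhcpd-style output (and the choice of the .1 Extreme as the default gateway) is an assumption for illustration only:

    # Generate example dhcpd-style pool declarations for the customer
    # networks (10.10.10.0/24 and up), matching the ranges described above.
    customer_nets = [
        "10.10.10.0", "10.10.11.0", "10.10.20.0", "10.10.30.0",
        "10.10.31.0", "10.10.40.0", "10.10.41.0", "10.10.50.0",
    ]

    for net in customer_nets:
        prefix = net.rsplit(".", 1)[0]   # e.g. "10.10.10"
        print(f"subnet {net} netmask 255.255.255.0 {{")
        # .1 and .2 are the Extremes, .3 to .31 are held back for possible
        # server aliases, so the dynamic pool runs from .32 to .254.
        print(f"    range {prefix}.32 {prefix}.254;")
        print(f"    option routers {prefix}.1;")
        print("}")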
So with all this in place, what can still go wrong, how does it affect you, and how do we work around it? In no particular order, here are some of the things we saw at i10.

1. People not configuring their hosts to use DHCP, and just picking an IP address out of thin air.

Depending on the IP address they pick they might interfere with someone else on their network. Things get especially bad if they pick a .1 or .2 address, because then everything else thinks they're one of the Extremes. Not a lot we can do about this except track down the customer and get them to reconfigure their machine. We can (centrally) track things down to the table switch that they're connected to. Then we need to go out there and talk to the 22 people on that switch and find out which one has ignored our instructions. This typically takes 10-15 minutes, by which point everyone on their switch (or sometimes, everyone on their network) has been severely inconvenienced.

We've thought about requiring customers to give us their MAC address first, so we know exactly which host is located where. Can you say "administrative nightmare"? Not to mention the fact that a good portion of the customers wouldn't know what their MAC address is anyway.

2. People unplugging the uplinks from the table switches.

Yes, this does happen. Unbelievable, isn't it? What's worse is when they unplug only one of the uplinks. Each table switch has two rows of 12 ports each. We plug the uplinks into one port on the top row and one port on the bottom row. If someone unplugs an uplink then all the ports on that row will lose connectivity to the Extremes. So you get half the people on the table having no problems, and the other half not being able to see anything except the other people on their own row of the switch.

Once people call for a yellow shirt to take a look at the problem this normally only takes a few minutes to diagnose. But it depends on how quickly people call for a yellow shirt. Faced with this scenario many people will try and ping their neighbour's machine, which is often on the same port row as they are. Naturally, this works. By the time people have done these diagnostics and scratched their heads a few times many minutes have passed. These things could be avoided if customers
   a) Didn't unplug cables they don't understand.
   b) Called for a yellow shirt as soon as they have a problem.

3. Flaky master browsers.

The network we've designed means that we absolutely have to have a master browser working so that things like GameSpy, and the in-game browsers for newer games, work properly. The master browser software we used at i10 turned out to be buggy and crash-prone. There's very little we could do about that at the event, except make sure that it got restarted each time it crashed. And each time it was restarted it took a few minutes for it to 'learn' about all the game servers that were running. This is incredibly frustrating, and we appreciate that. We're working on replacements for i11.

4. Misconfigured DHCP allocations.

Originally, when we configured the servers for i10, we didn't think we'd need the 10.10.31.0/24 network. Then the racecourse let us have some extra rooms at the last minute, which overflowed the 10.10.30.0/24 network. So we had to reconfigure the DHCP servers to hand out addresses on the new network, and reconfigure the Extremes to recognise it. This took about 10 minutes once we were aware of the problem.

5. Hardware failures.

Part way through Saturday one of the Extremes decided to lock up. Hard. At this point, roughly half the event loses connectivity to everything except the people on their own table. This is a bad thing. There's also nothing we can do about it except get the switch up and running again as quickly as possible. This takes about 10 minutes. Having a couple of redundant Extremes hanging around just in case isn't feasible...

Similarly, a couple of the game servers suffered from dodgy RAM during the event, leading to spontaneous crashes. Swapping out the RAM fixed the problem, but we're always faced with a dilemma -- do we shut down a server during the day for an unknown amount of time to try and troubleshoot it, or do we let it run through until the evening, when the load is lower and we can take it offline without having as serious an impact? We decided to let it run during the day, then take it down at the next opportunity to repair it. From then on it ran like a charm.

The core servers (DNS and DHCP) are run on two separate hosts. If one of them goes down the other picks up the load automatically.

6. People trying to crack the network.

We saw a number of attempts by people trying to deliberately crack the network. Some of these were as simple as changing IP addresses to match one of the central servers; some were more complex (like changing your MAC address to be one of the Extremes). Our logging systems show this pretty much as it happens. In some cases we just blackhole the culprit until they come to us to complain (at which point they get a stern talking to). In other cases we boot them out of the event for violating the AUP. Either way, these attempts will cause disruption for the people on the same table switch, and may cause problems for people on the same network, until they're resolved -- which normally takes 5-15 minutes depending on the problem.

7. Illicit file sharing.

As I've already said, we deliberately rate limit the FTP site and cap the number of connections that it accepts to ensure that people downloading files do not interfere with your gaming. However, anyone carrying out illicit file sharing is very unlikely to do any of this. If a couple of people on your table switch are sharing files with people on other table switches then it's entirely possible that the uplinks will get swamped with file sharing traffic, seriously impacting your gaming.

At the moment we take a reactive stance to this -- our logs show any excessive traffic between ports on the switches, and we blackhole the offenders and/or boot them out of the event for violating the AUP. But this still takes us some time to track down.
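To give a flavour of what "excessive traffic between ports" means, here's a minimal sketch of the kind of check involved. The per-port byte counters, port names and threshold below are entirely made up -- the real monitoring on the Extremes isn't described here -- but the principle is the same: compare each port's traffic over an interval against what normal gaming traffic looks like, and flag the outliers for a human to investigate.

    # Hypothetical per-port byte counters, sampled five minutes apart.
    earlier = {"1:3": 2_100_000_000, "1:4": 150_000_000, "2:7": 90_000_000}
    now     = {"1:3": 5_100_000_000, "1:4": 160_000_000, "2:7": 95_000_000}

    INTERVAL_SECS = 300
    LIMIT_BITS_PER_SEC = 50_000_000   # arbitrary example threshold, not a real policy

    for port, count in now.items():
        bits_per_sec = (count - earlier[port]) * 8 / INTERVAL_SECS
        if bits_per_sec > LIMIT_BITS_PER_SEC:
            # This is the point at which a human works out which customer is
            # on the other end of the port, and blackholes them if necessary.
            print(f"port {port}: sustained {bits_per_sec / 1e6:.0f} Mb/s -- investigate")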
For future events we're considering enforcing Quality of Service (QoS) rules on the Extremes to make this less of a problem.

8. Southern Electric warning us about a power surge coming in less than five minutes' time.

Ha ha. Just kidding.

A lot of you probably work in a big corporate environment and are thinking "We have hundreds of Windows machines, and we never have these problems. Clearly, you guys suck." All well and good. But remember that we have absolutely no control over how a customer configures their machine, which software they run on it, the version of Windows they're using, the quality of their network card, and innumerable other things. Each customer is 'root' on their own machine.