Archive for the 'Tips and Tricks' Category

Object Masks and Filters in C Sharp

Object Masks, Filters, and Other V3 Black Magic

Everyone has heard the age-old saying that for any given job you need the right tool. Most of us have tried to use the flat, rounded edge of a butter knife a time or two when what we desperately needed was a screwdriver. Does that mean you weren’t able to open that little compartment on whatever gizmo and replace the batteries? Probably not. In most cases it is possible to use a butter knife when a screwdriver is the tool of choice; it’s just more painful and a lot less effective.

The same can be said of the SoftLayer API (SLAPI). It’s a toolset: a very flexible set of tools allowing a developer to manage every aspect of dedicated hosting, from accounting and billing to the physical status of remote hardware. And yet there are so many tools in the V3 API toolbox, a number of them only subtly different from their binary brethren (at least on the surface), that it’s tempting just to reach your hand into the bag, find the first thing that resembles a screwdriver, and begin turning.

I know. I’m speaking from my own experience. As a developer who largely works on SoftLayer’s back end systems, somewhere between the bottom of the TCP/IP stack and the top edge of the kernel, getting to write production user portal code was a new experience for me. Sure, I had written some demos and dabbled a little here and there, but when I started my first “real” V3/SLAPI intensive project I realized my prior attempts had entirely missed the true power and elegance of SLAPI. The magic, if you will. A little something called ORM.

Those of you who spend your days toiling in the world of relational databases are probably fairly familiar with the term ORM. But for someone like me who usually comes no closer to a database than reading an I/O address from the Windows registry, I was only vaguely aware of what the acronym even stood for. I turned to Webopedia. There I discovered the following. “Short for object role modeling, ORM is a conceptual database design methodology that allows the user to express information as an object and explore how it relates to other information objects”. (In API circles the acronym is more often expanded as object-relational mapping, but the idea of objects and the relations between them is what matters here.)

So there we have it. Database. Objects. Relations. I learn hands on—so none of that amounts to a hill of beans without some real code I can see and type and run for myself. So rather than regurgitate the SLDN documentation, I will just share a simple yet real life example. Then, in the second part of the article, we can expand that example to show some of the more powerful and less documented features of the V3 SLAPI.

The code that follows is written in Microsoft C Sharp using Visual Studio 2008 Professional Edition. I am not going to step through the basics of connecting a WSDL and generating a SOAP wrapper in this article. If you need help with that, there is an SLDN blog I did a while back which covers those steps entitled, “Dot Net? You Bet!”. It is still available under the implementations section of the SLDN website. True to my MO, I am not a big GUI guy so the code I am presenting runs as a Windows console application.

For the sake of making the example clear, I am going to simplify my task. In both this article and the next in the series, we will be playing the role of a developer who needs to count how many of his or her servers are running the Microsoft Windows operating system, as opposed to one of the many Linux variants SoftLayer also offers its customers. For our first code sample, we require two WSDLs: the SoftLayer_Account service as well as the SoftLayer_Hardware_Server service. The console program below will get us connected to the SoftLayer application servers, as well as provide us some timing metrics.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SLDN_Magic
{
   class Program
   {
       static void Main(string[] args)
       {
           //global timing vars
           DateTime stopwatch;
           TimeSpan elapsed;

           //replace with your username and api key
           string user_name = "Replace With Your User Name";
           string api_key = "Replace With Your API Key";

           Console.Write("Establishing connection to SLDN service...");

           //time it
           stopwatch = DateTime.Now;
           //declare the services
           SLDN_ACCT.SoftLayer_AccountService acct = new
               SLDN_ACCT.SoftLayer_AccountService();
           SLDN_SVR.SoftLayer_Hardware_ServerService svr = new
               SLDN_SVR.SoftLayer_Hardware_ServerService(); 

           //create an authentication object for each
           SLDN_ACCT.authenticate credentials_a = new
               SLDN_ACCT.authenticate();
           SLDN_SVR.authenticate credentials_b = new
               SLDN_SVR.authenticate();

           //assign credentials
           credentials_a.username = credentials_b.username = user_name;
           credentials_a.apiKey = credentials_b.apiKey = api_key;

           //authenticate
           acct.authenticateValue = credentials_a;
           svr.authenticateValue = credentials_b;

           elapsed = DateTime.Now.Subtract(stopwatch);
           Console.WriteLine("done (" +  elapsed.TotalSeconds.ToString() + " seconds)");

           Console.WriteLine("\nPress <Enter> to exit.");
           Console.ReadLine();
       }
   }
}

At this point, we can go ahead and run our code. It doesn’t really do anything all that useful. Nevertheless, you should see the connection message followed by the elapsed time in seconds.

While writing this article I connected to the SoftLayer API servers numerous times from my home. My connection times were pretty consistent. It took somewhere in the neighborhood of 20 seconds to get everything set up. That seems like a lot. But keep in mind that you only incur the overhead of connecting your services one time. Plus as we get a little further along in this article I will show you how we can use ORM to get rid of one of the references entirely. For now though, let’s move on.

For someone of my background and mindset, the most straightforward and correct way to find out which servers were running MS Windows seemed to be the public method “isWindowsServer()”. This is a method of the server class, which is why we needed to import the SoftLayer_Hardware_Server service. But we don’t want to check the OS on a single server. We want to iterate over all the servers for an account, which is why we brought in the SoftLayer_Account service and its attractive public offering “getHardware()”.

Keeping this plan in mind, let’s go ahead and implement it in our console application.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SLDN_Magic
{
   class Program
   {
       static void Main(string[] args)
       {
           //global timing vars
           DateTime stopwatch;
           TimeSpan elapsed;

           //global container
           SLDN_ACCT.SoftLayer_Hardware[] hw;

           //replace with your username and api key
           string user_name = "Replace With Your User Name";
           string api_key = "Replace With Your API Key";

           Console.Write("Establishing connection to SLDN service...");

           //time it
           stopwatch = DateTime.Now;

           //declare the services
           SLDN_ACCT.SoftLayer_AccountService acct = new
               SLDN_ACCT.SoftLayer_AccountService();
           SLDN_SVR.SoftLayer_Hardware_ServerService svr = new
               SLDN_SVR.SoftLayer_Hardware_ServerService(); 

           //create an authentication object for each
           SLDN_ACCT.authenticate credentials_a = new
               SLDN_ACCT.authenticate();
           SLDN_SVR.authenticate credentials_b = new
               SLDN_SVR.authenticate();

           //assign credentials
           credentials_a.username = credentials_b.username = user_name;
           credentials_a.apiKey = credentials_b.apiKey = api_key;

           //authenticate
           acct.authenticateValue = credentials_a;
           svr.authenticateValue = credentials_b;

           elapsed = DateTime.Now.Subtract(stopwatch);
           Console.WriteLine("done (" + elapsed.TotalSeconds.ToString() + " seconds)");

           //butter knife method

           Console.Write("Retrieving hardware using method 1...");

           //get time stamp
           stopwatch = DateTime.Now;

           hw = null;
           try
           {
               hw = acct.getHardware();
           }
           catch (Exception e)
           {
               Console.WriteLine("Exception encountered [" + e.Message + "]");
               hw = null;
           }

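           //fall back to an empty list if the call failed,
           //otherwise the foreach below would throw on null
           if (hw == null)
           {
               hw = new SLDN_ACCT.SoftLayer_Hardware[0];
           }

           int cnt = 0;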

           foreach (SLDN_ACCT.SoftLayer_Hardware server in hw)
           {
               try
               {
                   SLDN_SVR.SoftLayer_Hardware_ServerInitParameters box = new
                       SLDN_SVR.SoftLayer_Hardware_ServerInitParameters();
                   box.id = (int)server.id;
                   svr.SoftLayer_Hardware_ServerInitParametersValue = box;
                   if (svr.isWindowsServer())
                   {
                       cnt++;
                   }
               }
               catch (NullReferenceException)
               {
                   //ignore...this server has not had the
                   //OS loaded on it yet!
               }
           }

           elapsed = DateTime.Now.Subtract(stopwatch);

           Console.WriteLine("done (" + elapsed.TotalSeconds.ToString() + " seconds)");
           Console.WriteLine("counted " + cnt.ToString() + " MS Windows licenses");
           Console.WriteLine("\nPress <Enter> to exit.");
           Console.ReadLine();
       }
   }
}

That’s it. Pretty straightforward stuff, with the possible exception of the way SLAPI allows you to reinitialize the server (or any object) on the fly by use of “SoftLayer_Hardware_ServerInitParameters”. If this confuses you, again I’ll refer you to the blog “Dot Net? You Bet!”. At this point, I think we are ready for another test run.

Once again we find ourselves in the twenty second range both for connecting the services and counting the hardware. What you can’t tell from looking at this output is that the account I was using for testing had about 100 servers on it. So basically we are talking 1/5 of a second per server. It’s certainly doable with a handful of servers, but this would obviously never work if you were trying to present this information in real time to users and you managed 500 or 5,000 or 50,000 servers. So you are probably asking yourself the same thing I did. What gives? If SoftLayer wishes its customers success on an enterprise level, why create an API that is brought to its knees when you start trying to manage more than a few hundred servers?
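To put rough numbers on that, here is a small sketch that projects the per-server cost onto larger accounts. It assumes the cost stays linear at the 1/5 of a second per server measured above, which is a simplification; the class and method names are made up for this illustration.

```csharp
using System;

namespace SLDN_Magic
{
    static class PerServerCost
    {
        // Total seconds when every server costs one extra round trip.
        // Assumes a constant, linear cost per server.
        public static double ProjectedSeconds(int servers, double secondsPerServer)
        {
            return servers * secondsPerServer;
        }

        static void Main(string[] args)
        {
            // about 20 seconds for about 100 servers, as measured above
            double perServer = 20.0 / 100.0;

            foreach (int servers in new int[] { 100, 500, 5000, 50000 })
            {
                Console.WriteLine(servers + " servers: about " +
                    ProjectedSeconds(servers, perServer).ToString("F0") + " seconds");
            }
        }
    }
}
```

At 50,000 servers the projection runs to hours rather than seconds, which is why the one-call-per-server pattern cannot back a live page.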

Luckily, the architects of SLAPI were a lot more web / database savvy than me. The above implementation, while correct syntactically, is a gross misuse of the SLAPI. It’s the butter knife, when what we really need is the screwdriver. What we need are object masks. But exactly what are object masks and how do they relate to ORM?

The best way I have found to understand ORM and object masks is to think of the SLAPI data objects as self-supporting entities. Each object provides its own methods and exposes some properties specific to that object. Yet thanks to ORM, most objects can actually get to properties in related objects, through a process called tapping. You simply tap each object down the chain until you find the property or properties you are interested in, prior to retrieving an instance or instances of the object. Then the set of objects returned will expose any relevant properties in the same manner you tapped them.

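For example, in our case the SLDN architecture relates a server to an operating system in the following manner: SoftLayer_Hardware -> operatingSystem (SoftLayer_Software_Component) -> softwareLicense (SoftLayer_Software_License) -> softwareDescription (SoftLayer_Software_Description).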

That relationship tells us that, as long as we can get a hardware object, we can tap all the way down to the software description, which as the SLDN documentation states has a “name” property. Thereby we eliminate two of our most time-consuming tasks from our original application. First off, we no longer need to instantiate the SoftLayer_Hardware_Server service, since hardware can be retrieved via the account class. Secondly, if the hardware we get back already contains the name of the operating system, we no longer need to call the “isWindowsServer()” method. Take a look.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SLDN_Magic
{
   class Program
   {
       static void Main(string[] args)
       {
           //global timing vars
           DateTime stopwatch;
           TimeSpan elapsed;

           //global container
           SLDN_ACCT.SoftLayer_Hardware[] hw;

           //replace with your username and api key
           string user_name = "Your User Name Here";
           string api_key = "Your Api Key Here";

           Console.Write("Establishing connection to SLDN service...");

           //time it
           stopwatch = DateTime.Now;

           //declare the services
           SLDN_ACCT.SoftLayer_AccountService acct = new
               SLDN_ACCT.SoftLayer_AccountService();

           //create an authentication object
           SLDN_ACCT.authenticate credentials_a = new
               SLDN_ACCT.authenticate();

           //assign credentials
           credentials_a.username = user_name;
           credentials_a.apiKey = api_key;

           //authenticate
           acct.authenticateValue = credentials_a;

           elapsed = DateTime.Now.Subtract(stopwatch);
           Console.WriteLine("done (" + elapsed.TotalSeconds.ToString() + " seconds)");

           //method 2
           Console.Write("Retrieving hardware using method 2...");

           //get time stamp
           stopwatch = DateTime.Now;

           hw = null;
           //attempt to pull a hardware list for this user
           try
           {
               acct.SoftLayer_AccountObjectMaskValue = new
                   SLDN_ACCT.SoftLayer_AccountObjectMask();
               acct.SoftLayer_AccountObjectMaskValue.mask = new
                   SLDN_ACCT.SoftLayer_Account();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware = new
                   SLDN_ACCT.SoftLayer_Hardware_Server[1];
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0] = new
                   SLDN_ACCT.SoftLayer_Hardware_Server();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0].operatingSystem = new
                   SLDN_ACCT.SoftLayer_Software_Component();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0].operatingSystem.softwareLicense = new
                   SLDN_ACCT.SoftLayer_Software_License();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0].operatingSystem.softwareLicense.softwareDescription = new
                   SLDN_ACCT.SoftLayer_Software_Description();
               hw = acct.getHardware();
           }
           catch (Exception e)
           {
               Console.WriteLine("Exception encountered [" + e.Message + "]");
               hw = null;
           }

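           //fall back to an empty list if the call failed,
           //otherwise the foreach below would throw on null
           if (hw == null)
           {
               hw = new SLDN_ACCT.SoftLayer_Hardware[0];
           }

           //cnt must be declared here; this listing stands alone
           int cnt = 0;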

           foreach (SLDN_ACCT.SoftLayer_Hardware server in hw)
           {
               try
               {
                   if (server.operatingSystem.softwareLicense.
                                   softwareDescription.name.ToLower().
                                   Contains("windows"))
                   {
                       cnt++;
                   }
               }
               catch (NullReferenceException)
               {
                   //ignore...this server has not
                   //had the OS loaded on it yet!!!
               }
           }

           elapsed = DateTime.Now.Subtract(stopwatch);

           Console.WriteLine("done ("+ elapsed.TotalSeconds.ToString()+" seconds)");
           Console.WriteLine("counted " + cnt.ToString() + " MS Windows licenses");

           Console.WriteLine("\nPress <Enter> to exit.");
           Console.ReadLine();
       }
   }
}

You should notice right away all the references to SoftLayer_AccountObjectMaskValue prior to calling the “getHardware()” method. This is the object mask. Essentially we must create a new instance of each entity we want to include down the chain. Then when the target object is retrieved, in our case the hardware, all related objects which we have made room for will get created for that specific instance of hardware, assuming of course a record exists. You must instantiate each object down the chain. If you skip any link, your result set will not contain the property you were trying to get to. Some less structured languages, like PHP, do not have this requirement. But with V3 and .NET there is no getting around it. You’re probably thinking this version of the code looks far more cluttered and is not as straightforward to read. You’re right. But I contend this is the electric screwdriver in the SLDN toolbox. See for yourself.

As you can see, the connection overhead dropped by half, which is to be expected since we are only authenticating to half the number of services. But take a look at the second number, the number of seconds it takes to count the servers with Windows installed. It dropped from 20 seconds to under 2 seconds. That’s a tenfold speed gain. And wait, there’s more: because we are only making one call to the SLDN application servers to retrieve those records, the per-server round trip is gone. The response still grows with the number of servers, but you should not expect the time to climb by a fifth of a second per server the way it did before, whether you have a hundred servers or a hundred thousand. That, my friend, is the magic of V3. The power of ORM.
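The counting loop in method 2 leans on catching NullReferenceException whenever a link in the chain is missing. As a minimal sketch of an alternative, the chain can be checked link by link instead. The four small classes below are hypothetical stand-ins for the WSDL-generated SLDN_ACCT types, so that the sample compiles on its own, and the OS name is invented sample data.

```csharp
using System;

namespace SLDN_Magic
{
    // Hypothetical stand-ins for the generated classes; only the
    // properties used in this article are modeled.
    class SoftwareDescription { public string name; }
    class SoftwareLicense { public SoftwareDescription softwareDescription; }
    class SoftwareComponent { public SoftwareLicense softwareLicense; }
    class Hardware { public SoftwareComponent operatingSystem; }

    static class MaskWalk
    {
        // Returns the OS name, or null when any link in the chain is missing
        public static string OsName(Hardware server)
        {
            if (server == null || server.operatingSystem == null) return null;
            SoftwareLicense license = server.operatingSystem.softwareLicense;
            if (license == null || license.softwareDescription == null) return null;
            return license.softwareDescription.name;
        }

        // Case-insensitive test for "windows" in the OS name
        public static bool IsWindows(Hardware server)
        {
            string name = OsName(server);
            return name != null &&
                name.IndexOf("windows", StringComparison.OrdinalIgnoreCase) >= 0;
        }

        static void Main(string[] args)
        {
            // a server whose whole chain came back populated (sample data)
            Hardware masked = new Hardware();
            masked.operatingSystem = new SoftwareComponent();
            masked.operatingSystem.softwareLicense = new SoftwareLicense();
            masked.operatingSystem.softwareLicense.softwareDescription =
                new SoftwareDescription();
            masked.operatingSystem.softwareLicense.softwareDescription.name =
                "Windows Server 2008";

            // a server with no OS record: the chain stops at the first link
            Hardware bare = new Hardware();

            Console.WriteLine("masked server is Windows: " + IsWindows(masked));
            Console.WriteLine("bare server is Windows: " + IsWindows(bare));
        }
    }
}
```

With the real generated classes the same checks would apply to server.operatingSystem and the links below it; whether that reads better than the try/catch is a matter of taste.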

Following this article I will include the entire code base, in a combined application that lets you run the tests sequentially so you can see the amazing difference object masks make for yourself. In the second part to this article, we’ll continue with the sample so if you download it keep it handy. In part two I’ll discuss the next best thing to object masks—object filters. With object filters we’ll be able to streamline this code even more. Until then…happy SLDNing!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SLDN_Magic
{
   class Program
   {
       static void Main(string[] args)
       {
           //global timing vars
           DateTime stopwatch;
           TimeSpan elapsed;

           //global container
           SLDN_ACCT.SoftLayer_Hardware[] hw;

           //replace with your username and api key
           string user_name = "";
           string api_key = "";

           Console.Write("Establishing connection to SLDN service...");

           //time it
           stopwatch = DateTime.Now;

           //declare the services
           SLDN_ACCT.SoftLayer_AccountService acct = new SLDN_ACCT.SoftLayer_AccountService();
           SLDN_SVR.SoftLayer_Hardware_ServerService svr = new SLDN_SVR.SoftLayer_Hardware_ServerService(); 

           //create an authentication object for each
           SLDN_ACCT.authenticate credentials_a = new SLDN_ACCT.authenticate();
           SLDN_SVR.authenticate credentials_b = new SLDN_SVR.authenticate();

           //assign credentials
           credentials_a.username = credentials_b.username = user_name;
           credentials_a.apiKey = credentials_b.apiKey = api_key;

           //authenticate
           acct.authenticateValue = credentials_a;
           svr.authenticateValue = credentials_b;

           elapsed = DateTime.Now.Subtract(stopwatch);
           Console.WriteLine("done (" + elapsed.TotalSeconds.ToString() + " seconds)");

           //method 1

           Console.Write("Retrieving hardware using method 1...");

           //get time stamp
           stopwatch = DateTime.Now;

           hw = null;
           try
           {
               hw = acct.getHardware();
           }
           catch (Exception e)
           {
               Console.WriteLine("Exception encountered [" + e.Message + "]");
               hw = null;
           }

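           //fall back to an empty list if the call failed
           if (hw == null)
           {
               hw = new SLDN_ACCT.SoftLayer_Hardware[0];
           }

           int cnt = 0;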

           foreach (SLDN_ACCT.SoftLayer_Hardware server in hw)
           {
               try
               {
                   SLDN_SVR.SoftLayer_Hardware_ServerInitParameters box = new SLDN_SVR.SoftLayer_Hardware_ServerInitParameters();
                   box.id = (int)server.id;
                   svr.SoftLayer_Hardware_ServerInitParametersValue = box;
                   if (svr.isWindowsServer())
                   {
                       cnt++;
                   }
               }
               catch (NullReferenceException)
               {
                   //ignore...this server has not had the OS loaded on it yet
               }
           }

           elapsed = DateTime.Now.Subtract(stopwatch);

           Console.WriteLine("done (" + elapsed.TotalSeconds.ToString() + " seconds)");
           Console.WriteLine("counted " + cnt.ToString() + " MS Windows licenses");

           //method 2
           Console.Write("Retrieving hardware using method 2...");

           //get time stamp
           stopwatch = DateTime.Now;

           hw = null;
           //attempt to pull a hardware list for this user
           try
           {
               acct.SoftLayer_AccountObjectMaskValue = new SLDN_ACCT.SoftLayer_AccountObjectMask();
               acct.SoftLayer_AccountObjectMaskValue.mask = new SLDN_ACCT.SoftLayer_Account();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware = new SLDN_ACCT.SoftLayer_Hardware_Server[1];
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0] = new SLDN_ACCT.SoftLayer_Hardware_Server();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0].operatingSystem = new SLDN_ACCT.SoftLayer_Software_Component();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0].operatingSystem.softwareLicense = new SLDN_ACCT.SoftLayer_Software_License();
               acct.SoftLayer_AccountObjectMaskValue.mask.hardware[0].operatingSystem.softwareLicense.softwareDescription = new SLDN_ACCT.SoftLayer_Software_Description();
               hw = acct.getHardware();
           }
           catch (Exception e)
           {
               Console.WriteLine("Exception encountered [" + e.Message + "]");
               hw = null;
           }

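           //fall back to an empty list if the call failed
           if (hw == null)
           {
               hw = new SLDN_ACCT.SoftLayer_Hardware[0];
           }

           cnt = 0;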

           foreach (SLDN_ACCT.SoftLayer_Hardware server in hw)
           {
               try
               {
                   if (server.operatingSystem.softwareLicense.softwareDescription.name.ToLower().Contains("windows"))
                   {
                       cnt++;
                   }
               }
               catch (NullReferenceException)
               {
                   //ignore...this server has not had the OS loaded on it yet
               }
           }

           elapsed = DateTime.Now.Subtract(stopwatch);

           Console.WriteLine("done ("+elapsed.TotalSeconds.ToString()+" seconds)");
           Console.WriteLine("counted " + cnt.ToString() + " MS Windows licenses");

           Console.WriteLine("\nPress <Enter> to exit.");
           Console.ReadLine();
       }
   }
}

Building the Data Warehouse

Here at SoftLayer, we have a lot of things that we need to keep track of. It’s not just payments, servers, rack slots, network ports, processors, hard drives, RAM sticks, and operating systems; it’s also bandwidth, monitoring, network intrusions, firewall logs, VPN access logs, API access, user history, and a whole host more. Last year, I was tapped to completely overhaul the existing bandwidth system. The old system was starting to show its age, and with our phenomenal growth it just wasn’t able to keep up.

SoftLayer has 20,000+ servers. Each of those servers is on 2 networks, the public network open to the Internet, and the private network that only our customers can use. Each of those networks exists on a 3 to 4-level network hierarchy. This gives us more than 50,000 switch ports, but we’ll use 50,000 to make the math easier. Each switch port has bandwidth in and bandwidth out, as well as packets in and packets out. That gives us 200,000 data points to poll. Bandwidth is polled every 5 minutes, giving us 57,600,000 data points per day, or 1,728,000,000 per month. Given that bandwidth data points are all 64 bit numbers, and we also have to track the server ID (32 bit), the network (32 bit), and the datetime (32 bit), that makes a month’s worth of raw data (excluding any overhead from storage engines) 34.56GB. Now, the data is to be stored redundantly, so double that. Also, we have to track bandwidth for racks, virtual racks, and private racks, so add another 50% onto the data. That gives us around 90GB per month of data.
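The arithmetic above is easy to check in a few lines. This sketch uses only the figures from the paragraph, with a 30-day month and decimal gigabytes. One assumption is mine: the extra 50% for racks is applied to the raw figure rather than to the doubled one, because that reading lands closest to the "around 90GB" quoted. Applying it to the doubled figure would give about 104GB instead.

```csharp
using System;

namespace DataWarehouse
{
    static class BandwidthMath
    {
        public const long SwitchPorts = 50000;
        public const long PointsPerPort = 4;          // bandwidth in/out, packets in/out
        public const long PollsPerDay = 24 * 60 / 5;  // one poll every 5 minutes
        public const long DaysPerMonth = 30;
        // 64-bit value + 32-bit server ID + 32-bit network + 32-bit datetime
        public const long BytesPerPoint = 8 + 4 + 4 + 4;

        public static long PointsPerDay()
        {
            return SwitchPorts * PointsPerPort * PollsPerDay;
        }

        public static long PointsPerMonth()
        {
            return PointsPerDay() * DaysPerMonth;
        }

        public static long RawBytesPerMonth()
        {
            return PointsPerMonth() * BytesPerPoint;
        }

        // Redundant copy plus rack tracking, in decimal gigabytes.
        // Assumption: the 50% for racks is taken from the raw figure.
        public static double TotalGbPerMonth()
        {
            double rawGb = RawBytesPerMonth() / 1e9;
            return rawGb * 2 + rawGb * 0.5;
        }

        static void Main(string[] args)
        {
            Console.WriteLine("data points per day:   " + PointsPerDay());
            Console.WriteLine("data points per month: " + PointsPerMonth());
            Console.WriteLine("raw GB per month:      " +
                (RawBytesPerMonth() / 1e9).ToString("F2"));
            Console.WriteLine("total GB per month:    " +
                TotalGbPerMonth().ToString("F1"));
        }
    }
}
```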

This doesn’t seem like a lot of data at first, but we need to generate custom bandwidth graphs for use on the web. Since it’s on the web, loading times above 2 seconds are unacceptable. Also, these are not enormous files with small keys (we’ll spend more time on that later) so 45GB of bandwidth data is a whole lot different than 45GB of movie files or MP3s.

To accomplish this, we decided that we needed a data warehouse. After numerous false starts and blind alleys, we decided to make our own system from scratch using MySQL. We considered commercial products, and pre-built open source solutions, but they just didn’t seem to fit our needs properly. The data warehouse project commenced with phase 1.

Phase 1: MySQL Cluster with Read Shards, Large Tables
Our first implementation was relatively simple. We planned to have a MySQL cluster for writing all the data, with the data split into 100 tables. The ID of the hardware mod 100 would determine the table name that we would write to. Then we’d have between 5 and 20 read databases, each replicating a different table, for load reasons. Then all we have to do is index the data properly, and add code to our applications to pull data from the correct read database node, and we’ll be fine.
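The "hardware ID mod 100" rule from phase 1 can be sketched as follows. Only the modulo rule comes from the description above; the table name prefix and the zero-padded suffix are hypothetical, chosen just for this illustration.

```csharp
using System;

namespace DataWarehouse
{
    static class ShardRouter
    {
        // the data is split into 100 tables
        public const int TableCount = 100;

        // Maps a hardware ID to the table it would be written to.
        // The "bandwidth_NN" naming is an assumption, not the real schema.
        public static string TableFor(int hardwareId)
        {
            if (hardwareId < 0)
            {
                throw new ArgumentOutOfRangeException("hardwareId");
            }
            return "bandwidth_" + (hardwareId % TableCount).ToString("00");
        }

        static void Main(string[] args)
        {
            foreach (int id in new int[] { 7, 100, 12345 })
            {
                Console.WriteLine("hardware " + id + " -> " + TableFor(id));
            }
        }
    }
}
```

The same function can serve reads as well as writes, since a reader needs the table name to work out which read database replicates it.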

Bad news: MySQL cluster isn’t designed for data with huge numbers of large keys. MySQL cluster stores all indexes and keys in memory on the cluster controller multiple times. As mentioned before, our data had 12 bytes of key, and 8 bytes of data. This means that we could only get about 2 million rows into the data warehouse before MySQL would lock up and quit working with the ever helpful error message “Unknown error: 157.” Even more disturbing, deleting items from MySQL cluster didn’t free up memory, as the indexes had to be rebuilt before that would happen. We upgraded everything to MySQL 5.1.6, but per the MySQL manual, “Beginning with MySQL 5.1.6, it is possible to store the non-indexed columns of NDB tables on disk, rather than in RAM as with previous versions of MySQL Cluster.” Unfortunately, it’s the indexes that are causing us problems, so the upgrade didn’t help.

Phase 2: MySQL Cluster with Read Shards, Small Tables
Since MySQL cluster couldn’t handle a small number of really big tables, how about a really big number of small tables? We tried making hundreds of tables per day, and removing the indexes on tables older than a certain time, but that quickly ran up against operating system limitations on the number of files in a folder. When we switched operating systems to alleviate this problem, we ran into MySQL’s limitation of roughly 21,000 tables per database. Nothing could get past that, so we had to move away from the cluster entirely.

Phase 3: Large MySQL box with Read Shards, Large Tables
We then moved on to one single MySQL box with enormous hard drives and multiple read shards. This looked promising at first, but the box simply couldn’t handle the number of inserts and updates, and the slave servers were locking up too often. We thought that if we partitioned the tables, MySQL could handle the inserts better. This, naturally, broke replication in a different way. This was mainly because we had too many tables now that each partition was getting its own file, so MySQL would be constantly opening all these table files to perform the updates, and would quickly run out of memory and begin to swap. We just had to decrease the number of tables per server. We decided to abandon the centralized master server idea, and built out 5 pairs of master/slave servers.

Phase 4: 5 Pairs of Master/Slave Servers, Large Grouped Tables
This plan really seemed like it was going to work. We actually had a working data warehouse for almost 2 weeks without any errors. We were close to breaking out the champagne, but there was one feature we still hadn’t implemented.

The time came to add the tracking of bandwidth for virtual dedicated racks as well. To accomplish this, we changed the MySQL INSERT statement to INSERT ... ON DUPLICATE KEY UPDATE. For those who don’t immediately recognize the terminology, ON DUPLICATE KEY UPDATE means that if a row with the same unique key already exists, MySQL updates that row instead of inserting a new one. The UPDATE portion is defined at the end of the statement. Since the key to our data is the large “object ID, date time, data type” combination, adding the duplicate syntax allowed us to issue multiple INSERT statements for the same data, and have it continually update with new values. This was especially handy for virtual racks with thousands of servers.
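For readers who have not seen the syntax, here is a sketch of what such a statement might look like when built from application code. The column names (object_id, poll_time, data_type, amount) and the @-style parameter placeholders are hypothetical; only the three-part key and the ON DUPLICATE KEY UPDATE clause come from the description above. Inside the UPDATE portion, MySQL's VALUES(column) refers to the value the INSERT would have written.

```csharp
using System;

namespace DataWarehouse
{
    static class UpsertBuilder
    {
        // Builds an INSERT that turns into an UPDATE when the
        // (object_id, poll_time, data_type) key already exists.
        // The table name must come from trusted code, never user input.
        public static string BuildUpsert(string table)
        {
            return "INSERT INTO " + table +
                " (object_id, poll_time, data_type, amount)" +
                " VALUES (@objectId, @pollTime, @dataType, @amount)" +
                " ON DUPLICATE KEY UPDATE amount = VALUES(amount)";
        }

        static void Main(string[] args)
        {
            // "bandwidth_45" is a made-up table name
            Console.WriteLine(BuildUpsert("bandwidth_45"));
        }
    }
}
```

Running the same statement twice with the same key leaves one row holding the latest amount, which is the behavior the paragraph above relies on.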

MySQL threw us another curve ball when we tried this. There was a known issue with the ON DUPLICATE KEY UPDATE syntax and replication using binary logs. Namely, the UPDATE would remain an INSERT in the log somehow, so we’d get duplicate key errors thrown on the slave, when the master was still working fine. Each time this happened, it would require stopping both servers, re-synching the tables, clearing the logs, restarting replication, and restarting the data transfer processes. This was unacceptable, so we had to move once more.

Phase 5: 10 Individual Machines
Since we already had 10 identical database machines, we decided to make them all independent and ignorant of each other. We stayed with the item group theory, leaving multiple bandwidth items per table. With 25 different items per table, we were only going to have 1,000,000 rows per table per month, which still maintains our speed. However, the index size problem bit us again. These tables were simply too large to have 3/4 of their columns indexed.

Finally we thought we had the answer. We would split up each of these tables by month. That way, each table would max out at 1,000,000 rows, so the indexes wouldn’t be unmanageable. Remember that we’ve already split the connection points from one, to five, to ten. We’ve also broken the tables up from one large table, to one per object, and now one per group of objects. In order to cut down on the complexity of the application layer code, we decided to use merge tables to keep the table name consistent. That way we can simply select from “the table” rather than “the table from March.” When we did this, we almost immediately began getting mysterious deadlocks on the servers. The INSERT statements would conflict with any SELECT statements using the same table, and the two threads would just hang indefinitely. Oddly, the table was never actually locked; it seemed that there was some sort of “offsetting lock” happening, where both threads were deadlocked in a race condition. The table could still be used, but the application we had inserting the data would be hung, so it was causing unacceptable delays in data inserts.

Phase 6: 10 Individual Machines, Small Individual Tables
Finally we decided that enough was enough: we were going to roll our own scaled storage solution. Clustering didn’t work, throwing hardware at the problem unfortunately didn’t work, and even the fancy new features for merged tables and partitioned tables didn’t work. We simply decided that we were going to keep the old “grouped tables” model, and create a new table for every month without merge tables. This way, we keep the number of tables relatively low, and the number of rows per table low. Plus, as a bonus, by controlling the table names ourselves we could ensure that MySQL wouldn’t open too many files. All inserts went into this month’s table, and all reads would come out of whatever specific month they needed. A cron job was set up to periodically issue FLUSH TABLES on all 10 data warehouse nodes, and we had our completed product! It’s been months now since we had a major database failure, we can generate any bandwidth graphs you want in less than a second, and other developers are starting to create their own table groups.
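Controlling the table names in application code, one table per group per month, might look something like the sketch below. The naming pattern (group name followed by year and month) is an assumption for illustration only; the description above says only that each month gets its own table and that the application picks the name.

```csharp
using System;
using System.Globalization;

namespace DataWarehouse
{
    static class MonthlyTables
    {
        // Returns the table holding a group's data for the month of "when".
        // Hypothetical pattern: <group>_<yyyy>_<MM>
        public static string TableFor(string group, DateTime when)
        {
            return group + "_" +
                when.ToString("yyyy_MM", CultureInfo.InvariantCulture);
        }

        static void Main(string[] args)
        {
            // inserts go to this month's table
            Console.WriteLine("write to:  " +
                TableFor("bandwidth_group_7", DateTime.Now));

            // reads name the specific month they need
            Console.WriteLine("read from: " +
                TableFor("bandwidth_group_7", new DateTime(2009, 3, 15)));
        }
    }
}
```

A graph spanning several months would call the function once per month and query each table in turn, which takes the place of the merge table.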

Each object’s data resides completely on two of our data warehouse nodes, so the data is redundantly stored in the event of a node going down. The application code has been written with extensive factory patterns and ORM so that all a developer has to do is create a new group type, a new data tracking object, and add data to it. The code automatically selects the applicable nodes to write the data to, creates the tables if they don’t exist, and writes the data to the tables. Similarly, a “get data” command will randomly select one location for the data, and retrieve the proper data, using only the tables that are necessary.
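To make the read and write routing above concrete, here is a minimal sketch in Python. It is not the production PHP code: the table-name format, the node assignment dictionary, and every function name are assumptions made up for the illustration.

```python
import random
from datetime import date

# Hypothetical assignment: tracking object ID -> the two nodes holding its data.
ASSIGNED_NODES = {1001: [2, 7], 1002: [0, 5]}


def table_name(group: str, day: date) -> str:
    """One table per group per month, e.g. 'bandwidth_2009_03' (naming format assumed)."""
    return f"{group}_{day.year:04d}_{day.month:02d}"


def write_targets(object_id: int, group: str, day: date) -> list[tuple[int, str]]:
    """Writes go to this month's table on every node assigned to the object."""
    return [(node, table_name(group, day)) for node in ASSIGNED_NODES[object_id]]


def read_target(object_id: int, group: str, month: date) -> tuple[int, str]:
    """Reads pick one of the object's nodes at random and touch only the month needed."""
    return random.choice(ASSIGNED_NODES[object_id]), table_name(group, month)


print(write_targets(1001, "bandwidth", date(2009, 3, 14)))
print(read_target(1001, "bandwidth", date(2009, 2, 1)))
```

Because the month is part of the table name, a read for February never opens March’s table, which is the property the merge tables were supposed to provide.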

But Wait, There’s More!
All 6 phases above were happening simultaneously with other systems. The raw data itself needed to be stored, buffered, transferred, and translated into a format suitable for the data warehouse. Since almost a quarter of our servers are in other parts of the country, we had to have redundant data storage and transfer solutions to make sure the raw data got to where it needed to be. For this data, the rules were different.

First of all, there are two layers of raw data. Each city has a local bandwidth polling database. We use rtgpoll to poll the bandwidth data. Since rtgpoll is designed to write to a different table for each data type, we kept the data like that. For ease of data management, we created a script that would keep a rotating two-day cycle of tables, one for today and one for yesterday, with a merge table that encompassed them both. We could get away with the merge table on this layer because there are far fewer tables, far fewer rows, and different indexes. Since the interface isn’t important at this layer, we could index only the date time and get the performance we wanted by making our transfer scripts to the global buffer date-based.

The global buffer server is the same box we attempted to use in phases 3 and 4, above. It has the exact same table structure as the datacenter buffers, with the rotating tables living under a merge table. This data is replicated out to a slave server, to prevent read/write contention. At this layer, we have no ON DUPLICATE KEY statements being executed, and no partitions, plus our merge tables are much smaller, so everything works out. These tables act as a permanent raw data archive in the event of a system failure or a bandwidth dispute with a customer.

The scripts that pull data out of the datacenter buffers also insert data into a queue for each of the data warehouse nodes. We store a lookup table in memcache that translates a raw interface ID into the data warehouse nodes that interface’s data needs to be inserted into. Each raw row is then inserted into the queues for the nodes it belongs to.
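The fan-out step can be sketched like this. A plain dictionary stands in for the memcache lookup table and in-memory deques stand in for the per-node queues; the interface IDs, node numbers, and row fields are invented for the example.

```python
from collections import defaultdict, deque

# Stand-in for the memcache lookup table: raw interface ID -> warehouse nodes.
# The IDs and node numbers here are made up for illustration.
INTERFACE_TO_NODES = {"eth0@srv42": [2, 7], "eth1@srv42": [0, 5]}

# One FIFO queue per data warehouse node.
node_queues = defaultdict(deque)


def enqueue_raw_row(row: dict) -> None:
    """Copy a raw row into the queue of every node that stores its interface."""
    for node in INTERFACE_TO_NODES[row["interface_id"]]:
        node_queues[node].append(row)


enqueue_raw_row({"interface_id": "eth0@srv42", "bytes_in": 1200, "bytes_out": 800})
print({node: len(queue) for node, queue in node_queues.items()})
```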

Finally, a set of scripts runs on the data warehouse nodes, constantly pulling any new data out of their queues on the global buffer slave. The data is translated from raw interface data to match up to our customer accounts, then inserted into the local data warehouse database, ready to be selected out to make graphs, reports, or billing.

All Powered by Tracking Object
The entire system is interfaced from our application code using the tracking object system. The tracking objects are a series of PHP classes that link a particular object in our existing production database to that object’s various data points in the data warehouse. Using ORM and factory patterns, we were able to abstract tracking objects to the point where any object in our database could have an associated “trackingObject” member variable. Servers, Virtual Dedicated Racks, Cloud Computing Instances, and other systems can simply call the getBandwidthData() method on their tracking object, and the tracking object system will automatically select the correct database, select the correct table, and pull the correct fields, formatting them as a collection of generic “bandwidth data” objects. Other metrics, like CPU and memory usage for servers and Cloud Computing Instances, can just as easily be retrieved.

Similarly, most of our back-end systems use the tracking objects to add data to the data warehouse. The developers don’t touch the warehouse directly. They simply load whatever object they have new data for and pass an array of raw data objects to our internal addData() function, which automatically determines the database node, write table, and data structure. The tracking object system is completely transparent to the other developers, and it means new tracking objects or data warehouse nodes can be created seamlessly without changing existing code.

By centralizing the reading and writing into these classes, the data warehouse can be extended infinitely. A new type of data can be added as easily as adding a new data warehouse data type class to the file system, as well as a row to the database. As long as that class has the data structure properly defined, new tracking objects can be created for that data type, and data can begin being recorded immediately. Creating a new tracking object will automatically choose two or more database nodes to store the data on. Creating new nodes takes the current tracking object count into consideration, so the nodes stay balanced.
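One plausible reading of the balancing rule is “give each new tracking object to the nodes that currently hold the fewest objects.” The sketch below shows that idea only. The function name and the per-node counts are hypothetical, and the real system may weigh other factors.

```python
def choose_nodes(object_counts: dict[int, int], replicas: int = 2) -> list[int]:
    """Return the `replicas` nodes that currently hold the fewest tracking objects."""
    # Sort by object count, breaking ties by node number so the result is stable.
    ranked = sorted(object_counts, key=lambda node: (object_counts[node], node))
    chosen = ranked[:replicas]
    for node in chosen:
        object_counts[node] += 1  # the new object now counts against these nodes
    return chosen


counts = {0: 120, 1: 95, 2: 130, 3: 95}
print(choose_nodes(counts))
print(counts)
```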

So far the system has 33,115,715,147 rows in 683,460 tables spread over 10 databases. We have hundreds of customers who view their bandwidth graphs every day, and a handful that systematically pull the graphs every hour. Load tests suggest that performance doesn’t degrade until we hit 500 simultaneous graph requests, and even then we still come in under 2 seconds per graph. With the scaling potential and the single point of access for developers, we should be able to use this system indefinitely.


Using CURL to access CloudLayer Storage

CloudLayer Storage is billed as providing “anytime, anywhere access to your data”. This isn’t just referring to human interfaces, but also includes automated interfaces.

One easy way to automate access to CloudLayer Storage is through curl. Curl is available as a command-line tool on almost every operating system and is typically used for transferring files. In this post I’ll show some examples of how to use curl to add, get, delete, or otherwise manipulate files in CloudLayer Storage. Note that this isn’t using the SoftLayer API, but instead interfaces directly with CloudLayer Storage.

Upload a file named “DSC1012.jpg” to an account owned by username “user@example.com” with a password of “PaSsWoRd”:

# curl -F filename=@DSC1012.jpg -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/

The command will return some XML tags. The items of interest are “fileID” and “lockID”. These values are important for future operations on the file.


<fileID>102C9C28-65C3-11DE-1234-2BE68BA216C2</fileID>
<lockID>6CDCEEB2-6B38-11DE-A510-123F439A2728</lockID>
<lockDuration>120</lockDuration>

The lock protects a file from being read or manipulated during the upload process. The lock will expire in “lockDuration” seconds, or the user can disable the lock manually.
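If you are scripting the upload, you will want to pull those values out of the response rather than copy them by hand. Here is a small Python sketch using only the standard library. The post shows just the three tags, so the <response> wrapper element and the function name are assumptions for the sake of a parseable example.

```python
import xml.etree.ElementTree as ET

# Only the three inner tags appear in the post; the <response> wrapper is an
# assumption so that the fragment parses as a single XML document.
sample = """<response>
<fileID>102C9C28-65C3-11DE-1234-2BE68BA216C2</fileID>
<lockID>6CDCEEB2-6B38-11DE-A510-123F439A2728</lockID>
<lockDuration>120</lockDuration>
</response>"""


def upload_ids(xml_text: str) -> dict:
    """Pull fileID, lockID and lockDuration out of an upload response."""
    root = ET.fromstring(xml_text)
    return {
        "file_id": root.findtext(".//fileID"),
        "lock_id": root.findtext(".//lockID"),
        "lock_seconds": int(root.findtext(".//lockDuration")),
    }


print(upload_ids(sample))
```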

Here is how to disable the lock using the lockID and the fileID generated from the upload operation:

# curl -d \
'action=unlock&lockid=6CDCEEB2-6B38-11DE-A510-123F439A2728' \
-u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/102C9C28-65C3-11DE-1234-2BE68BA216C2/lock

If you ever lose track of the FileID, you can use this command to retrieve a listing of the files and containers (directories) in an account along with the FileIDs which are listed as an “oid” XML tag.

# curl -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/list

To get the list of files in a container, just append the container oid to the URL.

# curl -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/list?oid=37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4

To retrieve the file from CloudLayer Storage, use the FileID to retrieve it.

# curl -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4/ -o outputfilename

Alternatively, you could use “wget” to retrieve the file

# wget --http-user=user@example.com --http-password=PaSsWoRd \
https://storage.cloudlayer.com/v1/files/37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4/ -O outputfilename

To delete a file just add the POST form variable “action” with the value “delete”.

# curl -d 'action=delete' -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4/

Each of the commands listed above returns data in XML format. If you would prefer JSON format, add the query parameter “output=json” to the query string.

# curl -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/list?output=json

In order to create a public URL for a file, just send a POST variable of “action=create” to the “token” endpoint.

# curl -d 'action=create' -u user@example.com:PaSsWoRd \
https://storage.cloudlayer.com/v1/files/37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4/token/

The long string “37D0F2…” is the oid (a.k.a. FileID) of the file. You can get it from the XML returned when the file was uploaded, or retrieve it using the file listing example above.

In the XML (or JSON) data that is returned, there will be a “token”.

<token>B2891F7B054EF2DF764801E1CFF0079057291234</token>

That token can be combined with the oid to create a URL that anyone can use to retrieve the file.

The URL looks like this:

https://storage.cloudlayer.com/v1/public/{oid}/{token}

In our example it would be:

https://storage.cloudlayer.com/v1/public/37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4/B2891F7B054EF2DF764801E1CFF0079057291234
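Gluing the two values together is trivial, but if you generate many public links it is worth wrapping in a helper. The sketch below uses the URL pattern given above; the function name and the optional host parameter are my own additions for illustration.

```python
def public_url(oid: str, token: str, host: str = "storage.cloudlayer.com") -> str:
    """Combine a file's oid and its token into the public download URL."""
    return f"https://{host}/v1/public/{oid}/{token}"


print(public_url("37D0F2AC-08FC-11DE-1234-3FA3A91CD1B4",
                 "B2891F7B054EF2DF764801E1CFF0079057291234"))
```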

If you are accessing CloudLayer Storage from inside a SoftLayer datacenter, you can access the storage over the SoftLayer private network (no bandwidth fees!). Just use “scs.service.softlayer.com” instead of “storage.cloudlayer.com”.

You can use the information above in conjunction with the curl libraries in PHP, C++, or one of many other programming languages with curl bindings.


PHP Memory Management in Foreach

Many developers, even experienced ones, are confused by the way PHP handles arrays in foreach loops. In the standard foreach loop, PHP makes a copy of the array that is used in the loop. The copy is discarded immediately after the loop finishes. This is transparent in the operation of a simple foreach loop. For example:

$set = array("apple","banana","coconut");
foreach ( $set AS $item ) {
    echo "{$item}\n";
}

This outputs:

apple
banana
coconut

Even though the copy is created, the developer doesn’t notice, because the original array isn’t referenced within the loop or after the loop finishes. However, when you attempt to modify the items in a loop, you find that they are unmodified when you finish:

$set = array("apple","banana","coconut");
foreach ( $set AS $item ) {
    $item = strrev($item);
}
print_r($set);

This outputs:

Array
(
    [0] => apple
    [1] => banana
    [2] => coconut
)

There are no changes from the original, even though you clearly assigned a value to $item. This is because you are operating on $item as it appears in the copy of $set being worked on. You can override this by grabbing $item by reference, like so:

$set = array("apple","banana","coconut");
foreach ( $set AS &$item ) {
    $item = strrev($item);
}
print_r($set);

This outputs:

Array
(
    [0] => elppa
    [1] => ananab
    [2] => tunococ
)

As you can see, when $item is operated on by-reference, the changes made to $item are made to the members of the original $set. Using $item by reference also prevents PHP from creating the array copy. To test this, first we’ll show a quick script demonstrating the copy:

$set = array("apple","banana","coconut");
foreach ( $set AS $item ) {
    $set[] = ucfirst($item);
}
print_r($set);

This outputs:

Array
(
    [0] => apple
    [1] => banana
    [2] => coconut
    [3] => Apple
    [4] => Banana
    [5] => Coconut
)

In this example, PHP copied $set and used it to loop over, but when $set was used inside the loop, PHP added the variables to the original array, not the copied array. Basically, PHP is only using the copied array for the execution of the loop and the assignment of $item. Because of this, the loop above only executes 3 times, and each time it appends another value to the end of the original $set, leaving the original $set with 6 elements, but never entering an infinite loop.

However, what if we had used $item by reference, as I mentioned before? A single character added to the above test:

$set = array("apple","banana","coconut");
foreach ( $set AS &$item ) {
    $set[] = ucfirst($item);
}
print_r($set);

Results in an infinite loop. Note that this really is an infinite loop: you’ll have to either kill the script yourself or wait for your OS to run out of memory. I added the following line to my script so PHP would run out of memory very quickly, and I suggest you do the same if you’re going to be running these infinite loop tests:

ini_set("memory_limit","1M");

So in this previous example with the infinite loop, we see the reason why PHP was written to create a copy of the array to loop over. When a copy is created and used only by the structure of the loop construct itself, the array stays static throughout the execution of the loop, so you’ll never run into issues.

But wait, there’s more. PHP fails to create a copy of the array if a reference is used at all. We know that referencing $item will cause the infinite loop scenario above, but if $set is referenced anywhere else in the script, even the non-referencing foreach format will break:

$set = array("apple","banana","coconut");
$a = &$set;
foreach ( $set AS $item ) {
    $set[] = ucfirst($item);
}

Results in an infinite loop, even though $item isn’t by reference. Using $a instead of $set gives identical results.

This is not to say that $item is implicitly used by reference if $set is referenced. See this example:

$set = array("apple","banana","coconut");
$a = &$set;
foreach ( $a AS $item ) {
    $item = ucfirst($item);
}
print_r($set);

This outputs:

Array
(
    [0] => apple
    [1] => banana
    [2] => coconut
)

$set is unchanged from the original values, because even though $set is referenced by $a, and $set has not been copied, $item is still given only lexical scope in relation to the loop, and will not pass modifications back to $set. You will still have to assign it by reference to make changes to the original array:

$set = array("apple","banana","coconut");
$a = &$set;
foreach ( $a AS &$item ) {
    $item = strrev($item);
}
print_r($set);

This outputs:

Array
(
    [0] => elppa
    [1] => ananab
    [2] => tunococ
)

All of these examples also work in associative arrays using the foreach ( $set AS $key => $item ) syntax. $key can never be used by-reference; it always comes from the array the loop construct is using, and cannot be modified. So the tricks used to modify array items in-position won’t work for modifying the keys. You can create new keys in the array, however, and unset the existing ones, like so:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
    unset($set[$key]);
}
print_r($set);

This outputs:

Array
(
    [Apple] => red
    [Banana] => yellow
    [Coconut] => brown
)

However, as you may have already noticed, this array was copied before the loop began. If you were using the array in a situation where it couldn’t be copied, you will run into errors:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
$a = &$set;
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
    unset($set[$key]);
}
print_r($set);

This outputs:

Array
(
)

Because the array was referenced and not copied, you get vastly unpredictable results when attempting to alter the physical structure of the array, especially using unset(). Without the unset() call in this example, you operate on the original array and loop through the original array, so you get the same infinite-loop generating code as before, but since we’re specifying the key for $set it doesn’t continue forever:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
$a = &$set;
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
}
print_r($set);

This outputs:

Array
(
    [apple] => red
    [banana] => yellow
    [coconut] => brown
    [Apple] => red
    [Banana] => yellow
    [Coconut] => brown
)

You can prove that it’s still possible to enter an infinite loop by adding a $set[] inside your loop:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
$a = &$set;
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
    $set[] = $item;
}
print_r($set);

This results in an infinite loop.

One interesting thing you can do with the $key => $item syntax when the array is copied is modify the original array structure without fear of causing loop issues:

<?php
$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
foreach ( $set AS $key => $item ) {
    $set[] = ucfirst($item);
    unset($set[$key]);
}
print_r($set);

This outputs:

Array
(
    [0] => Red
    [1] => Yellow
    [2] => Brown
)

As you can see from this example, the array was copied for use in the loop construct. References to $set within the loop still refer to the outer version of $set, so the unset() call and the $set[] addition work on the original, leaving us with a nicely upper-cased version of the original, without keys.

This knowledge is useful for developers who are trying to plug memory holes in PHP applications. If you foreach through an array of objects that can be 50MB in size, you create an entire copy of the structure in memory for no reason other than to power the loop. If your loop doesn’t modify the structure of the array or add to it at all, it would be vastly more efficient to add the “cheat” of $a = &$array; right before your loop to prevent PHP from making a copy.

This knowledge is also hopefully useful for programmers who cannot figure out why arrays are behaving like they are. Basically, if you don’t use references, the loop executes once for each member in the original array, regardless of what you do to the original.

NOTE: These tests were performed on PHP version 5.2.5. 5.2.0 and earlier perform differently. Run these tests yourself under controlled circumstances before relying on PHP to behave in any particular way.


PHP Type Conversions for Comparison

There has been some discussion recently among our dev team regarding PHP type conversion. I’ll give some of the problems we’ve run into and then try to shed some light on the inner workings of PHP when it does comparisons.

The first example may seem familiar to most seasoned developers, but when chained together it brings up an interesting point about PHP: The == operator isn’t transitive.

echo (null == 0 ? "YES" : "NO") . "\n"; //YES
echo ("null" == 0 ? "YES" : "NO") . "\n"; //YES
echo ("null" == null ? "YES" : "NO") . "\n"; //NO

As you can see, null == 0 == “null”, but null != “null”

You may be familiar with the following kind of error. The erroneous code is usually similar to:

if ( $a = "Hello" && $b != "World" )

Seeded with $b = “World”, the statement assigned FALSE to $a. This is because there was a single = instead of == in $a = “Hello”, and && binds more tightly than =, so PHP interpreted the whole thing as the assignment $a = ("Hello" && $b != "World"). Since $b was equal to “World”, $b != “World” returned FALSE. “Hello” was converted to TRUE for the &&, and the result of TRUE && FALSE, which is FALSE, was assigned to $a.

PHP has a certain order of precedence for data types. It is defined loosely in the manual’s comparison operators page, but I will try to spell it out more explicitly here. There are 8 basic types of data in PHP. In order of operator precedence, they are:

  • Boolean
  • Object
  • Array
  • Floating Point Number
  • Integer
  • String
  • Resource
  • NULL

That is to say, if you compare any two data types on the list, the variable with the data type lower on the list will be converted to the upper variable’s data type, and then the comparison is applied. However, when applying the first example to this hard and fast rule, we find it lacking. In reality, there are certain comparisons that are so far off PHP converts BOTH data types to a third data type. The first example actually works out like:

  • null == 0. both were converting to FALSE, so the comparison was succeeding
  • “null” == 0. “null” was converting to 0, so the comparison was succeeding
  • “null” == null. NULL was converting to "", and “null” is not equal to "", so the comparison was failing.

It’s much more easily represented as a table:

Each cell shows the type the two operands are compared as. “Float” stands for Floating Point Number, and “none” stands for “No conversion made”.

          Boolean  Object   Array    Float    Integer  String   Resource NULL
Boolean   -        Boolean  Boolean  Boolean  Boolean  Boolean  Boolean  Boolean
Object    Boolean  -        none     none     none     none     none     Boolean
Array     Boolean  none     -        none     none     none     none     Boolean
Float     Boolean  none     none     -        Float    Float    Float    Boolean
Integer   Boolean  none     none     Float    -        Float    Integer  Boolean
String    Boolean  none     none     Float    Float    -        Float    String
Resource  Boolean  none     none     Float    Integer  Float    -        Boolean
NULL      Boolean  Boolean  Boolean  Boolean  Boolean  String   Boolean  -

  • Boolean: objects always resolve to true. Empty arrays are false, all others are true. 0 resolves to false, all others are true. "" resolves to false, all others are true. Resources always resolve to true. NULL is always false.
  • No conversion made: objects are always greater-than. Where no object is involved, arrays are always greater-than.
  • String compared with NULL: NULL is converted to "".
  • NULL compared with Object or Resource: never == null.

In the table where you see the phrase “No Conversion Made” that means that those two data types will never == each other. However, in most of those situations data types are given specific return values for quantitative comparisons, such as greater-than and less-than. Note the specific case of NULL, where almost every instance of comparing to NULL results in both types being converted to Boolean.
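For readers who would rather look the answer up than squint at a chart, the same rules can be written as a small lookup function. This Python sketch only encodes the table in this article. It does not run PHP, the type names are plain strings, and the function name is hypothetical.

```python
# Type names in the article's precedence order.
TYPES = ["Boolean", "Object", "Array", "Floating Point",
         "Integer", "String", "Resource", "NULL"]


def comparison_type(a: str, b: str) -> str:
    """Return the type two different PHP types are compared as, per the article's table."""
    if a not in TYPES or b not in TYPES:
        raise ValueError("unknown type name")
    if a == b:
        raise ValueError("the table only covers comparisons between different types")
    pair = {a, b}
    if "Boolean" in pair:
        return "Boolean"
    if pair == {"NULL", "String"}:
        return "String"              # NULL is converted to ""
    if "NULL" in pair:
        return "Boolean"
    if "Object" in pair or "Array" in pair:
        return "No conversion made"
    if pair == {"Integer", "Resource"}:
        return "Integer"
    return "Floating Point"          # the remaining numeric/string/resource pairs


print(comparison_type("NULL", "Integer"))
print(comparison_type("String", "NULL"))
print(comparison_type("Array", "String"))
```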

Armed with this information, we are now capable of determining the outcome of almost any comparison in PHP. We know, for instance, that array() is greater than “Hello”, but “Hello” is less than 2. We know that stdClass() is greater than array(), but both of them are equal to TRUE. There are plenty of places where PHP contradicts normal logic, because of the sometimes convoluted process involved in comparing different data types.

The fact that PHP sometimes internally converts two operands to a third, unrelated data type can be quite confusing. I hope, however, that the chart in this article will help you work out exactly what it’s doing.

Of course, as one of our lead developers is quick to point out, this whole discussion would be moot if everyone used ===.


URL Obfuscation

On August 26, our CTO Nathan Day wrote a post on the InnerLayer blog about nameservers. His straightforward explanation of nameservers and their operations got me thinking about how NOT straightforward the whole operation is.

The way Nathan explained it, you type in “theinnerlayer.softlayer.com” and it is translated to an IP address, which is then contacted, and the page is returned to you. However, if you know the IP address already, you can use that instead of the URL, and skip the nameserver entirely. For instance, http://66.228.119.19 will take you directly to the InnerLayer blog, bypassing the name server. But that’s not all! Not only will the dotted-decimal representation of the IP work in a url, but the dword representation will as well! Try http://1122268947. That will also get you to the InnerLayer.
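The dword form is just the four octets of the address packed into one 32-bit number. A quick sketch in Python (the function name is mine) shows the arithmetic:

```python
def ip_to_dword(ip: str) -> int:
    """Pack a dotted-decimal IPv4 address into a single 32-bit integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


print(ip_to_dword("66.228.119.19"))  # 1122268947
```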

Now that we’ve gotten the domain out of the way, what about the bits before and after? Before the domain, between the protocol (http) and the domain itself, there is an optional authentication part. You can specify a username to log into secured sites right in the URL. http://user:pass@site.com is the standard format for such logins. However, if the website you’re going to doesn’t require authentication, most browsers simply ignore it. Firefox 3 will prompt you when you click on these obfuscated links to ask you if “site.com” is really the site you wish to visit, whereas IE7 simply won’t work at all if there’s an unexpected authentication string. This is a fairly new feature, and it’s a good way to protect users against this sort of attack. Now that you know about the methods of obfuscating domain names in URLs, you can probably see how http://www.bankofamerica.com%20login@1122268947 actually leads to the InnerLayer. This is a common tactic used by spammers and phishers to obfuscate their URLs. You can put anything you want into the authentication portion of the URL to obfuscate it, as long as it’s not a reserved URL character like a colon, “at” sign, or forward slash. For our case, let’s use “4NDIw:U4ODYwMCAxMjE5” as our fake authentication data, just to be confusing.

Now that we’ve added stuff to the beginning of the URL, what about the filename at the end? Nathan’s post could easily be accessed using http://4NDIw:U4ODYwMCAxMjE5@1122268947/2008/do-you-know-where-your-nameserver-is/. However, there’s still all that easy-to-read nonsense at the end. That will never do. Have you ever seen a URL with a space in it? The space is encoded as %20. That’s the hexadecimal representation of ASCII code 32, a space. The percent sign indicates that the following 2 digits are to be interpreted as a hex code for a real character. This is how you keep URLs from breaking on spaces: you turn the spaces into non-breaking characters. However, did you know it works for ALL characters and not just spaces? We can change every character but the forward slashes in any URL to their hex equivalents. Nathan’s article link then becomes: http://4NDIw:U4ODYwMCAxMjE5@1122268947/%32%30%30%38/%64%6f%2d%79%6f%75%2d%6b%6e%6f%77%2d%77%68%65%72%65%2d%79%6f%75%72%2d%6e%61%6d%65%73%65%72%76%65%72%2d%69%73.
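The hex trick is easy to reproduce and just as easy to undo. This Python sketch encodes every character except the forward slashes and then decodes the result with the standard library. The function name is mine, and it assumes a plain ASCII path.

```python
from urllib.parse import unquote


def hex_encode_path(path: str) -> str:
    """Percent-encode every character except the forward slashes (ASCII assumed)."""
    return "".join(c if c == "/" else f"%{ord(c):02x}" for c in path)


encoded = hex_encode_path("/2008/do-you-know-where-your-nameserver-is")
print(encoded)
print(unquote(encoded))  # back to the readable path
```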

But wait, there’s more! Let’s go back to the domain name, shall we? Most browsers will handle overflow in the dword representation of the domain just fine. What that means is that we can continually add 4294967296 (2^32) to the domain portion of our obfuscated URL and still continue to get the results we want. Our URL is now: http://4NDIw:U4ODYwMCAxMjE5@5417236243/%32%30%30%38/%64%6f%2d%79%6f%75%2d%6b%6e%6f%77%2d%77%68%65%72%65%2d%79%6f%75%72%2d%6e%61%6d%65%73%65%72%76%65%72%2d%69%73.

As a final trick, you don’t have to obfuscate every letter. A fixed pattern of %xx%xx%xx over and over again will get boring. Mix it up. I only converted 70% of my URL to hex, resulting in this gem: http://4NDIw:U4ODYwMCAxMjE5@5417236243//%3200%38/%64%6f%2d%79%6f%75%2d%6b%6e%6f%77-%77%68%65%72e%2d%79%6f%75%72-%6e%61%6d%65%73e%72v%65%72%2di%73. As you can see, this is quite a bit more confusing than the original URL, which was http://theinnerlayer.softlayer.com/2008/do-you-know-where-your-nameserver-is/.

This information can be useful to any systems administrator who is dealing with an elusive, abusive user. Being able to translate a crazy URL to the actual human-readable equivalent can greatly assist both the SoftLayer abuse department as well as any other group attempting to track down spammers, scammers, or just plain old sneaky users.
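Going the other way, from crazy URL back to something readable, takes only a few lines. The Python sketch below is a rough illustration with a made-up function name: it drops the fake authentication string, folds an overflowed dword back into range, and decodes the hex. It ignores ports and query strings.

```python
import ipaddress
from urllib.parse import unquote, urlsplit


def deobfuscate(url: str) -> str:
    """Strip fake credentials, undo the dword host, and decode %xx sequences."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.isdigit():
        # Undo any multiples of 2^32 that were added, then print dotted-decimal.
        host = str(ipaddress.IPv4Address(int(host) % 2**32))
    return f"{parts.scheme}://{host}{unquote(parts.path)}"


print(deobfuscate("http://4NDIw:U4ODYwMCAxMjE5@5417236243/%32%30%30%38/%64%6f%2d%79%6f%75"))
```

From there, a reverse DNS lookup or a WHOIS query on the recovered IP address is the usual next step.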

As a final note: Please don’t use this knowledge for evil. As mentioned before, the new versions of both Firefox and Internet Explorer are no longer fooled by the fake authentication string trick, and the rest of the obfuscation should really only be used to fool web spiders. Personally, I used this method in combination with JavaScript to obfuscate links and email addresses so that I wouldn’t get spammed.

The following PHP code was used to generate the links in this article.

[Editor's note: We at SoftLayer use our powers for good and so should you. Thankfully half of these kinds of links won't open in the latest versions of Outlook and Safari. -K]


<?php

//the URL we're attempting to obfuscate
$url = "http://theinnerlayer.softlayer.com/2008/do-you-know-where-your-nameserver-is/";

$urlData = parse_url($url);

$path = $urlData['path'];

$startingIP = gethostbyname($urlData['host']);

//get the long representation:
$long = ip2long($startingIP);

//add 4294967296 to the long for further obfuscation:
$long += 4294967296;

//add random authentication characters to the beginning of the string:
$auth = substr(base64_encode(microtime()), rand(5,10), rand(5,15)) . ":" . substr(base64_encode(microtime()), rand(5,10), rand(5,15));

//obfuscate the rest of the URL
$len = strlen($path);
$obfuscatedLocation = "";

for ( $p = 0; $p < $len; $p++ ) {
    //check for slashes
    //also, 3 in 10 characters make it through plain for further confusion
    if ( $path[$p] == '/' || rand(0,10) > 7 ) {
        $obfuscatedLocation .= $path[$p];
        continue;
    }

    //made it here, obfuscate this character:
    $obfuscatedLocation .= '%' . dechex(ord($path[$p]));
}

echo "http://$auth@$long$obfuscatedLocation";

?>



The New Face of Search Engine Optimization

Most SL customers host websites on our services, and all websites benefit from high search engine rankings. The “old method” of search engine optimization doesn’t really work anymore. Back in the days before Google, the best way to get to the top of the search engine rankings was to follow four easy steps:

  1. Diversify your IP space.
  2. Add keywords to the <meta> tag on your site.
  3. Make sure those keywords also appear in the body of your document.
  4. Take 2 & 3 and fill them with references to Pokémon, pop music, and porn.

However, only #3 is a valid tactic in this new, Google-driven world. Let’s analyze them one by one.

Diversifying your IP space. Old search engines gave more credence to sites located in “geographically diverse” areas, where “geographically diverse” was determined by class C addresses. Now, however, with the advent of huge centralized data centers, search engine algorithms recognize that a site with 15 servers in the same datacenter may be just as effective as one with servers in 15 separate cities. Of course, it’s still a good idea to buy servers in, say, Dallas, Seattle, and Washington DC.

Meta tags. Google and other major search engines don’t really look at meta tags anymore for keywords. They still will use the meta tag for language, encoding, and summary data. However, the processing power of search engines has been increasing exponentially in the last few years, which means they’re capable of analyzing the actual content of the page rather than relying on meta tags. If you still have meta tags, you can keep them, but they’re only really useful for language and summary information.

Document body keywords. This is the one area that still matters. As previously mentioned, search engines are now capable of searching the entire page. In the past, only a few search engines indexed actual page content, and even then the ranking may have been a simple count of how often your meta keywords matched the page contents. Now, however, Google stores local copies of every page it indexes (to a certain extent) and uses the entire page contents for search and cached viewing.
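To make the “simple count” idea concrete, here is a minimal Python sketch of how such a naive score could be computed. It is only an illustration of the approach described above, not how any real search engine worked; the function name, keyword list, and sample text are all made up for the example, and it only handles single-word keywords.

```python
import re
from collections import Counter


def keyword_counts(keywords, body_text):
    """Count how often each single-word keyword appears in the body text.

    Matching is case-insensitive and on whole words only. This is a
    hypothetical illustration of the old "count the meta keywords" idea.
    """
    words = Counter(re.findall(r"[a-z0-9']+", body_text.lower()))
    return {kw: words[kw.lower()] for kw in keywords}


# Hypothetical meta keywords and page body, for illustration only.
meta_keywords = ["hosting", "servers", "dallas"]
body = "Dedicated hosting on servers in Dallas. Our servers ship with full root access."

print(keyword_counts(meta_keywords, body))
```

A score like this is trivial to inflate by repeating keywords, which is exactly why it was so easy to game.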

Dummy data. When search engines were younger, they could be fooled very easily by simply including the top 1,000 popular search terms in your meta tags and as invisible text inside your document body. I never understood it personally, but the thinking was that if you had enough references to Britney Spears on your page, you would hijack enough people that one of them would forget what he was originally looking for and buy your product instead. Though I guess that’s how spam works now, isn’t it?

So what can you do right now to improve your search engine placement? There are a few easy things to do, broken into the following categories:

  1. Page Titles. Your pages should each have a unique, meaningful title. Putting only the name of your site on every page doesn’t do anyone any good. A descriptive title not only gives your search results more visibility, but also helps people find the page again if they bookmark it.
  2. Page Content. You want your page content to be meaningful and arranged around a central semantic theme. Don’t put up one huge page featuring thousands of unrelated pieces of information. Keep it concise, unique, and focused. You have an unlimited number of individual pages; make use of that fact.
  3. Dynamic Content. The more often your site changes, the higher your Google rank will be. You could take the cheap way out and simply put a box on your site that has random content, but the best way is to actually do updates as often as possible. This not only ensures visibility on the search engines, but also makes your site more useful to the people who eventually make it to your pages, which is your main goal anyway.
  4. Accessibility. This is a key area that many sites overlook. You need to make heavy use of the title and alt attributes for things like links and images. Not only are text alternatives called for by accessibility rules such as Section 508 and the W3C’s WCAG guidelines, but they help blind users navigate your site. You know what acts like a blind user? Search engine crawlers. When you do a Google image search, the images that pop up most likely have alt attributes specified. The same goes for link titles: if you put a brief link description in your titles, not only do you get pretty mouseovers on the links, but they also add one more point toward your eventual page rank.
  5. Linking. Google builds its page rank based on links to and from the page in question, as well as the page contents. With this in mind, it’s useful to link out to sources on whatever topic you’re talking about. The higher the page rank of the target, the higher the benefit to you. It’s also a good idea to be useful enough for other people to link to you, either in message boards or as a source of their own. All links eventually increase your page rank. As a small note, make your URLs “search engine friendly” by including keywords in them as well; many message boards include the post title in the URL for just this purpose. Finally, for some reason, Google refuses to index any URL with “?id=” in it, so be careful about that.
  6. Site Map. Search engines love site maps. Users don’t care for them as much, but a concise HTML or XML site map, with links to every page on your site divided into sections with short descriptions, increases links, increases accessibility, and gives the search engines more metadata on the important topics in each page.
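Two of the items above, keyword-bearing URLs and an XML site map, are easy to automate. The Python sketch below turns page titles into URL slugs and emits a bare-bones site map using the sitemaps.org 0.9 format. Treat it as a starting point under stated assumptions: the `slugify` and `build_sitemap` names, the `www.example.com` base URL, and the page titles are all hypothetical, and a real site map would usually carry more detail per page.

```python
import re
import unicodedata
from xml.etree import ElementTree as ET


def slugify(title):
    """Turn a page title into a lowercase, hyphen-separated URL segment."""
    # Strip accents so "Pokémon" becomes "Pokemon" before filtering.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def build_sitemap(base_url, titles):
    """Return a minimal XML site map with one <url><loc> entry per title."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for title in titles:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/{slugify(title)}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site and page titles, for illustration only.
print(build_sitemap(
    "http://www.example.com",
    ["The New Face of Search Engine Optimization", "Pokémon & Pop Music!"],
))
```

Deriving the slug from the title keeps the URL, the page title, and the site map entry in step, so the same keywords show up in all three places.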

So all you have to do to improve your search engine rank is to have dynamic, frequently changing content about a single, concise topic on an easily accessible page that is frequently linked to by other pages. Wikipedia is a perfect example of search engine optimization in action. Each page is titled with the topic it discusses; every image has a title attribute and links out to a full description of that image; each link has a title attribute; many outside sources are mentioned; plenty of sites link to each article as well as the root domain; and the index page changes every single day with completely new and original content.
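If you want a quick way to see how your own pages measure up against that checklist, a small script can flag the mechanical parts: a missing page title, images without alt text, and links without a title attribute. The Python sketch below uses only the standard library’s `html.parser`. The `SeoAudit` class, the `audit` function, and the sample page are hypothetical, and the checks are deliberately crude (for instance, an intentionally empty alt on a decorative image is flagged too).

```python
from html.parser import HTMLParser


class SeoAudit(HTMLParser):
    """Collect the page title and flag images and links missing attributes."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.problems = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and not attrs.get("alt"):
            self.problems.append(f"img without alt: {attrs.get('src', '?')}")
        elif tag == "a" and not attrs.get("title"):
            self.problems.append(f"link without title: {attrs.get('href', '?')}")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit(html):
    """Return a list of human-readable problems found in the HTML string."""
    parser = SeoAudit()
    parser.feed(html)
    parser.close()
    if not parser.title.strip():
        parser.problems.insert(0, "missing or empty <title>")
    return parser.problems


# Hypothetical page, for illustration only.
page = """<html><head><title>Dallas Datacenter Tour</title></head>
<body><img src="rack.jpg" alt="Server rack in Dallas"><img src="logo.png">
<a href="/about" title="About the company">About</a><a href="/contact">Contact</a>
</body></html>"""

for problem in audit(page):
    print(problem)
```

None of this measures whether the content is any good, of course; it only catches the attributes that are easy to forget.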
