Author Archive

A Conversation on API Abstraction

As a developer, I have a good relationship with the PHP community. Many of my personal friends are involved in large PHP projects all over the world. One friend in particular is the lead developer of The Easy API. It’s an API wrapper that does the “hard parts” for you. Some companies release “APIs” that are confusing hodgepodges of unrelated functionality. Many times the API in question is simply a web form that developers are expected to POST to, and then parse the poorly formatted output that comes back.

The Easy API was designed specifically to take these poorly written APIs and wrap them up in a good PHP API Interface, with real functions and objects so that you can utilize a remote, web-driven API just like you would a native set of objects, or a database wrapper class.

I was discussing this with the lead developer of the project, and we had the following conversation:

Chad: did you check out The Easy API yet? http://theeasyapi.com

Daniel: Actually, I never had time to check it out. While I’m doing that, check out our API: http://sldn.softlayer.com/wiki/index.php/Main_Page

Daniel: Do you think you could make an EasyAPI wrapper for that?

Chad: What does your company do?

Daniel: unmanaged hosting

Daniel: our API can do EVERYTHING, except for physically remove servers and components

Chad: Wow, what a huge API. Very big indeed

Daniel: you have no idea

Chad: yeah, just looking at the docs… i’m getting a very quick idea

Daniel: you can purchase servers, purchase services, format, re-install, upgrade, downgrade, enable ports, change routes, add firewall rules, load balance, monitor, fetch billing, update tickets, even purchase whole new servers and cloud instances

Daniel: all from the API

Daniel: our customer portal sits on top of the api, anything you can do in the portal you can do in the api

Chad: yeah, it looks like it would take a lot of learning and figuring out how to do things, but it’s immensely powerful

Chad: Your API is a perfect example of how I’m trying to figure out the best way to handle well documented large API’s

Daniel: We have a github project (http://github.com/softlayer) with an API client that translates to PHP objects already [Note: since this conversation, the github page has been updated with a PERL client -ed.]

Chad: In that case, you barely even need the EasyAPI, your API already functions the way they all should.

That was a great thing to hear, as a SoftLayer Developer. The reason we wrote the API the way we did was because we were all tired of companies calling something an “API” when it was really a URL that would spit out a CSV file, or a ridiculously strict XML engine that would complain about a single space out of place. In fact, I once worked with an API that would throw “not valid XML” errors on perfectly valid XML. The most ironic part was that the “not valid XML” error itself was not valid XML.

As developers who spend much of our time integrating third party products, APIs, and services, we know how hard it is to work with a poorly documented, poorly implemented interface. That’s why part of our standard release procedure is having our API Evangelist review our method names, variable names, class names, and all related documentation to make sure they’re not only easy to read, but also follow the pattern that the rest of the API follows. That’s why you always see “hardware” keys on our objects: we’re simply not allowed to call something “servers.” The code cannot be released until the API-exposed functionality is ready for public consumption.

We’ve all worked very hard on the API, because the API is what drives the portal, and the portal is what drives our customers. We’re happy to see everyone using the portal, but what really excites us is when customers use the API directly to form their own custom tools. The portal is a wonderful, powerful tool, but we understand that not every customer is happy with using the same thing. That’s why we exposed the API to our customers: so you could ALL write your own custom API-enabled objects. If you do, please share them, if not with the community at large, then at least with us directly. We’d love to see how customers are using the API, and if you share with us your most difficult API tasks, we’ll work to make them better. Even though the current SoftLayer API makes API-wrapper authors say “wow,” we want to make it even better.


Building the Data Warehouse

Here at SoftLayer, we have a lot of things that we need to keep track of. It’s not just payments, servers, rack slots, network ports, processors, hard drives, RAM sticks, and operating systems; it’s also bandwidth, monitoring, network intrusions, firewall logs, VPN access logs, API access, user history, and a whole host more. Last year, I was tapped to completely overhaul the existing bandwidth system. The old system was starting to show its age, and with our phenomenal growth it just wasn’t able to keep up.

SoftLayer has 20,000+ servers. Each of those servers is on 2 networks, the public network open to the Internet, and the private network that only our customers can use. Each of those networks exists on a 3 to 4-level network hierarchy. This gives us more than 50,000 switch ports, but we’ll use 50,000 to make the math easier. Each switch port has bandwidth in and bandwidth out, as well as packets in and packets out. That gives us 200,000 data points to poll. Bandwidth is polled every 5 minutes, giving us 57,600,000 data points per day, or 1,728,000,000 per month. Given that bandwidth data points are all 64 bit numbers, and we also have to track the server ID (32 bit), the network (32 bit), and the datetime (32 bit), that makes a month’s worth of raw data (excluding any overhead from storage engines) 34.56GB. Now, the data is to be stored redundantly, so double that. Also, we have to track bandwidth for racks, virtual racks, and private racks, so add another 50% onto the data. That gives us around 90GB per month of data.
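The raw-data arithmetic above is easy to check with a few lines of PHP. This is only a sketch: the 30-day month is an assumption, and the variable names are illustrative.

```php
<?php
// Back-of-the-envelope check of the raw data figures quoted above.
$switchPorts     = 50000;          // rounded down to keep the math easy
$countersPerPort = 4;              // bandwidth in/out, packets in/out
$pollsPerDay     = (24 * 60) / 5;  // one poll every 5 minutes = 288
$daysPerMonth    = 30;             // assumption: a 30-day month

$pointsPerPoll  = $switchPorts * $countersPerPort;  // 200,000
$pointsPerDay   = $pointsPerPoll * $pollsPerDay;    // 57,600,000
$pointsPerMonth = $pointsPerDay * $daysPerMonth;    // 1,728,000,000

// 64-bit value + 32-bit server ID + 32-bit network + 32-bit datetime
$bytesPerPoint = 8 + 4 + 4 + 4;                     // 20 bytes

$rawGbPerMonth = $pointsPerMonth * $bytesPerPoint / 1e9;

printf("%s points/month, %.2f GB raw\n", number_format($pointsPerMonth), $rawGbPerMonth);
```

This covers only the raw rows; the redundancy and rack-level multipliers are applied on top of that figure.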

This doesn’t seem like a lot of data at first, but we need to generate custom bandwidth graphs for use on the web. Since it’s on the web, loading times above 2 seconds are unacceptable. Also, these are not enormous files with small keys (we’ll spend more time on that later) so 45GB of bandwidth data is a whole lot different than 45GB of movie files or MP3s.

To accomplish this, we decided that we needed a data warehouse. After numerous false starts and blind alleys, we decided to make our own system from scratch using MySQL. We considered commercial products, and pre-built open source solutions, but they just didn’t seem to fit our needs properly. The data warehouse project commenced with phase 1.

Phase 1: MySQL Cluster with Read Shards, Large Tables
Our first implementation was relatively simple. We planned to have a MySQL cluster for writing all the data, with the data split into 100 tables. The ID of the hardware mod 100 would determine the table name that we would write to. Then we’d have between 5 and 20 read databases, each replicating a different table, for load reasons. Then all we have to do is index the data properly, and add code to our applications to pull data from the correct read database node, and we’ll be fine.
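The “hardware ID mod 100” rule can be sketched as follows. The function name and the bandwidth_NN naming scheme are hypothetical; the text only says that the ID modulo 100 picks the table.

```php
<?php
// Hypothetical helper: the table prefix and naming scheme are illustrative only.
function bandwidthTableFor($hardwareId, $tableCount = 100)
{
    // The remainder selects one of the 100 write tables.
    return sprintf('bandwidth_%02d', $hardwareId % $tableCount);
}

echo bandwidthTableFor(12345), "\n"; // bandwidth_45
echo bandwidthTableFor(200), "\n";   // bandwidth_00
```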

Bad news: MySQL cluster isn’t designed for data with huge numbers of large keys. MySQL cluster stores all indexes and keys in memory on the cluster controller multiple times. As mentioned before, our data had 12 bytes of key, and 8 bytes of data. This means that we could only get about 2 million rows into the data warehouse before MySQL would lock up and quit working with the ever helpful error message “Unknown error: 157.” Even more disturbing, deleting items from MySQL cluster didn’t free up memory, as the indexes had to be rebuilt before that would happen. We upgraded everything to MySQL 5.1.6, but per the MySQL manual, “Beginning with MySQL 5.1.6, it is possible to store the non-indexed columns of NDB tables on disk, rather than in RAM as with previous versions of MySQL Cluster.” Unfortunately, it’s the indexes that are causing us problems, so the upgrade didn’t help.

Phase 2: MySQL Cluster with Read Shards, Small Tables
Since MySQL cluster couldn’t handle a small number of really big tables, how about a really big number of small tables? We tried making hundreds of tables per day, and removing the indexes on tables older than a certain time, but that quickly ran up against operating system limitations on the number of files in a folder. When we switched operating systems to alleviate this problem, we ran into MySQL’s limitation of roughly 21,000 tables per database. Nothing could get past that, so we had to move away from the cluster entirely.

Phase 3: Large MySQL box with Read Shards, Large Tables
We then moved on to one single MySQL box with enormous hard drives and multiple read shards. This looked promising at first, but the box simply couldn’t handle the amount of inserts and updates, and the slave servers were locking up too often. We thought that if we partitioned the tables, MySQL could handle the inserts better. This, naturally, broke replication in a different way. This was mainly because we had too many tables now that each partition was getting its own file, so MySQL would be constantly opening all these table files to perform the updates, and would quickly run out of memory and begin to swap. We just had to decrease the number of tables per server. We decided to abandon the centralized master server idea, and built out 5 pairs of master/slave servers.

Phase 4: 5 Pairs of Master/Slave Servers, Large Grouped Tables
This plan really seemed like it was going to work. We actually had a working data warehouse for almost 2 weeks without any errors. We were close to breaking out the champagne, but there was one feature we still hadn’t implemented.

The time came to add the tracking of bandwidth for virtual dedicated racks as well. To accomplish this, we changed the MySQL INSERT statement to INSERT... ON DUPLICATE KEY UPDATE. For those that don’t immediately recognize the terminology, the ON DUPLICATE KEY UPDATE syntax is designed so that if a particular database key already exists in the database, simply change the INSERT statement into an UPDATE statement. The UPDATE portion is defined at the end of the clause. Since the key to our data is the large “object ID, date time, data type” combination, adding the duplicate syntax allowed us to issue multiple INSERT statements for the same data, and have it continually update with new values. This was especially handy for virtual racks with thousands of servers.
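As a rough illustration, a statement of that shape might be built like this. The table and column names are hypothetical, and the sketch assumes a unique key over (object_id, date_time, data_type); only the ON DUPLICATE KEY UPDATE clause itself comes from the text.

```php
<?php
// Hypothetical table and column names. The table name is interpolated, so it
// must come from trusted code, never from user input.
function buildBandwidthUpsert($table)
{
    return "INSERT INTO {$table} (object_id, date_time, data_type, value) "
         . "VALUES (?, ?, ?, ?) "
         // If the key already exists, overwrite the stored value instead.
         . "ON DUPLICATE KEY UPDATE value = VALUES(value)";
}

echo buildBandwidthUpsert('bandwidth_45'), "\n";
```

The placeholders are meant for a prepared statement, so the same statement can be issued repeatedly for one key and the row simply takes the latest value.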

MySQL threw us another curve ball when we tried this. There was a known issue with the ON DUPLICATE KEY UPDATE syntax and replication using binary logs. Namely, the UPDATE would remain an INSERT in the log somehow, so we’d get duplicate key errors thrown on the slave, when the master was still working fine. Each time this happened, it would require stopping both servers, re-synching the tables, clearing the logs, restarting replication, and restarting the data transfer processes. This was unacceptable, so we had to move once more.

Phase 5: 10 Individual Machines
Since we already had 10 identical database machines, we decided to make them all independent and ignorant of each other. We stayed with the item group theory, leaving multiple bandwidth items per table. With 25 different items per table, we were only going to have 1,000,000 rows per table per month, which still maintains our speed. However, the index size problem bit us again. These tables were simply too large to have 3/4 of their columns indexed.

Finally we thought we had the answer. We would split up each of these tables by month. That way, each table would max out at 1,000,000 rows, so the indexes wouldn’t be unmanageable. Remember that we’ve already split the connection points from one, to five, to ten. We’ve also broken the tables up from one large table, to one per object, and now one per group of objects. In order to cut down on the complexity of the application layer code, we decided to use merge tables to keep the table name consistent. That way we can simply select from “the table” rather than “the table from March.” When we did this, we almost immediately began getting mysterious deadlocks on the servers. The INSERT statements would conflict with any SELECT statements using the same table, and the two threads would just hang indefinitely. Oddly, the table was never actually locked, it seemed that there was some sort of “offsetting lock” happening, where both threads were deadlocked in a race condition. The table could still be used, but the application we had inserting the data would be hung, so it was causing unacceptable delays in data inserts.

Phase 6: 10 Individual Machines, Small Individual Tables
Finally we decided that enough was enough: we were going to roll our own scaled storage solution. Clustering didn’t work, throwing hardware at the problem unfortunately didn’t work, and even the fancy new features for merged tables and partitioned tables didn’t work. We simply decided that we were going to keep the old “grouped tables” model, as well as creating a new table for every month without merge tables. This way, we keep the number of tables relatively low, and the number of rows per table low. Plus, as a bonus, by controlling the table names ourselves we could ensure that MySQL wouldn’t open too many files. All inserts went into this month’s table, and all reads would come out of whatever specific month they needed. A cron job was set up to periodically issue FLUSH TABLES on all 10 data warehouse nodes, and we had our completed product! It’s been months now since we had a major database failure, we can generate any bandwidth graphs you want in less than a second, and other developers are starting to create their own table groups.
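Controlling the table names in application code could look something like the sketch below. The group_YYYY_MM naming scheme and the function name are assumptions; the real table names are not given here.

```php
<?php
// Hypothetical naming scheme: <group>_<YYYY>_<MM>.
function monthlyTableName($groupName, $timestamp)
{
    // gmdate() keeps the month boundary independent of the server time zone.
    return sprintf('%s_%s', $groupName, gmdate('Y_m', $timestamp));
}

$march = gmmktime(0, 0, 0, 3, 15, 2009);
echo monthlyTableName('bandwidth_group_7', $march), "\n"; // bandwidth_group_7_2009_03
```

Writes would use the current timestamp, while a graph covering several months would call the helper once per month it needs.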

Each object’s data resides completely on two of our data warehouse nodes, so the data is redundantly stored in the event of a node going down. The application code has been written with extensive factory patterns and ORM so that all a developer has to do is create a new group type, a new data tracking object, and add data to it. The code automatically selects the applicable nodes to write the data to, creates the tables if they don’t exist, and writes the data to the tables. Similarly, a “get data” command will randomly select one location for the data, and retrieve the proper data, using only the tables that are necessary.
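The random read selection can be sketched in a couple of lines. The node names and the function are hypothetical; the only idea taken from the text is that either copy of an object's data is complete.

```php
<?php
// Hypothetical sketch: node names and the lookup structure are assumptions.
function pickReadNode(array $nodesForObject)
{
    // Either copy is complete, so any node holding the object will do.
    return $nodesForObject[array_rand($nodesForObject)];
}

$nodes = array('dw03', 'dw07');   // the two nodes holding one object's data
echo pickReadNode($nodes), "\n";  // prints dw03 or dw07
```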

But Wait, There’s More!
All 6 phases above were happening simultaneously with other systems. The raw data itself needed to be stored, buffered, transferred, and translated into a format suitable for the data warehouse. Since almost a quarter of our servers are in other parts of the country, we had to have redundant data storage and transfer solutions to make sure the raw data got to where it needed to be. For this data, the rules were different.

First of all, there are two layers of raw data. Each city has a local bandwidth polling database. We use rtgpoll to poll the bandwidth data. Since rtgpoll is designed to write to a different table for each data type, we kept the data like that. For ease of data management, we created a script that would keep a rotating two-day cycle of tables, one for today and one for yesterday, with a merge table that encompassed them both. We could get away with the merge table on this layer because there are far fewer tables, far fewer rows, and different indexes. Since the interface isn’t important at this layer, we could index only the date time and get the performance we wanted by making our transfer scripts to the global buffer date-based.

The global buffer server is the same box we attempted to use in phases 3 and 4, above. It has the exact same table structure as the datacenter buffers, with the rotating tables living under a merge table. This data is replicated out to a slave server, to prevent read/write contention. At this layer, we have no ON DUPLICATE KEY statements being executed, and no partitions, plus our merge tables are much smaller, so everything works out. These tables act as a permanent raw data archive in the event of a system failure or a bandwidth dispute with a customer.

The scripts that pull data out of the datacenter buffers also insert data into a queue for each of the data warehouse nodes. We store a lookup table in memcache that will translate a raw interface ID into the data warehouse nodes that interface’s data needs to be inserted into. That raw row is then inserted into the queues for the nodes it belongs to.
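The lookup-then-queue step might be sketched like this. Plain arrays stand in for both the memcache lookup table and the per-node queues, and every name and field is hypothetical.

```php
<?php
// Hypothetical sketch: a plain array stands in for the memcache lookup table,
// and arrays stand in for the per-node queues.
$interfaceToNodes = array(
    1001 => array('dw03', 'dw07'),
    1002 => array('dw01', 'dw04'),
);
$queues = array();

function enqueueRawRow(array $row, array $lookup, array &$queues)
{
    if (!isset($lookup[$row['interface_id']])) {
        return false;                 // unknown interface: nothing queued
    }
    foreach ($lookup[$row['interface_id']] as $node) {
        $queues[$node][] = $row;      // one copy per destination node
    }
    return true;
}

enqueueRawRow(array('interface_id' => 1001, 'octets_in' => 4200), $interfaceToNodes, $queues);
print_r(array_keys($queues)); // dw03, dw07
```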

Finally, a set of scripts runs on the data warehouse nodes, constantly pulling any new data out of their queues on the global buffer slave. The data is translated from raw interface data to match up to our customer accounts, then inserted into the local data warehouse database, ready to be selected out to make graphs, reports, or billing.

All Powered by Tracking Object
The entire system is interfaced from our application code using the tracking object system. The tracking objects are a series of PHP classes that link a particular object in our existing production database to that object’s various data points in the data warehouse. Using ORM and factory patterns, we were able to abstract tracking objects to the point where any object in our database could have an associated “trackingObject” member variable. Servers, Virtual Dedicated Racks, Cloud Computing Instances, and other systems can simply call the getBandwidthData() method on their tracking object, and the tracking object system will automatically select the correct database, select the correct table, and pull the correct fields, formatting them as a collection of generic “bandwidth data” objects. Other metrics, like CPU and memory usage for servers and Cloud Computing Instances, can just as easily be retrieved.

Similarly, most of our back-end systems use the tracking objects to add data to the data warehouse. The developers don’t touch the warehouse directly; they simply load whatever object they have new data for, and pass an array of raw data objects to our internal addData() function, which automatically determines database node, write table, and data structure. The tracking object system is completely transparent to the other developers, and it means new tracking objects or data warehouse nodes can be created seamlessly without changing existing code.
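To give a feel for the facade idea, here is a deliberately tiny sketch. It is not the real class: only the method names addData() and getBandwidthData() come from the text, and an in-memory array stands in for the node and table selection.

```php
<?php
// Hypothetical sketch of the facade idea only. An in-memory array stands in
// for the data warehouse nodes and tables.
class TrackingObject
{
    private $objectId;
    private $store = array();   // stand-in for node + table selection

    public function __construct($objectId)
    {
        $this->objectId = $objectId;
    }

    // Callers hand over raw data; where it lands is decided in here.
    public function addData(array $rawData)
    {
        foreach ($rawData as $point) {
            $this->store[$point['type']][] = $point;
        }
    }

    // Callers ask for bandwidth data without knowing where it is stored.
    public function getBandwidthData()
    {
        return isset($this->store['bandwidth']) ? $this->store['bandwidth'] : array();
    }
}

$tracker = new TrackingObject(42);
$tracker->addData(array(
    array('type' => 'bandwidth', 'time' => 1236038400, 'value' => 1024),
    array('type' => 'cpu',       'time' => 1236038400, 'value' => 12),
));
echo count($tracker->getBandwidthData()), "\n"; // 1
```

The point of the design is that calling code never names a database node or a table, so either can change without touching the callers.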

By centralizing the reading and writing into these classes, the data warehouse can be extended infinitely. A new type of data can be added as easily as adding a new data warehouse data type class to the file system, as well as a row to the database. As long as that class has the data structure properly defined, new tracking objects can be created for that data type, and data can begin being recorded immediately. Creating a new tracking object will automatically choose two or more database nodes to store the data on. Creating new nodes takes the current tracking object count into consideration, so the nodes stay balanced.
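One simple way to keep nodes balanced is to place each new tracking object on the nodes that currently hold the fewest. This is a sketch of that idea, not necessarily the rule the system uses; the node names and counts are made up.

```php
<?php
// Hypothetical sketch: $objectCounts maps node name => number of tracking
// objects already stored there.
function chooseNodes(array $objectCounts, $copies = 2)
{
    asort($objectCounts);   // least-loaded nodes first, keys preserved
    return array_slice(array_keys($objectCounts), 0, $copies);
}

$counts = array('dw01' => 120, 'dw02' => 95, 'dw03' => 130, 'dw04' => 90);
print_r(chooseNodes($counts)); // dw04, then dw02
```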

So far the system has 33,115,715,147 rows in 683,460 tables spread over 10 databases. We have hundreds of customers who view their bandwidth graphs every day, and a handful that systematically pull the graphs every hour. Load tests suggest that performance doesn’t degrade until we hit 500 simultaneous graph requests, and even then we still come in under 2 seconds per graph. With the scaling potential and the single point of access for developers, we should be able to use this system indefinitely.


PHP Memory Management in Foreach

Many developers, even experienced ones, are confused by the way PHP handles arrays in foreach loops. In the standard foreach loop, PHP makes a copy of the array that is used in the loop. The copy is discarded immediately after the loop finishes. This is transparent in the operation of a simple foreach loop. For example:

$set = array("apple", "banana", "coconut");
foreach ( $set AS $item ) {
    echo "{$item}\n";
}

This outputs:

apple
banana
coconut

Even though the copy is created, the developer doesn’t notice, because the original array isn’t referenced within the loop or after the loop finishes. However, when you attempt to modify the items in a loop, you find that they are unmodified when you finish:

$set = array("apple", "banana", "coconut");
foreach ( $set AS $item ) {
    $item = strrev($item);
}
print_r($set);

This outputs:

Array
(
    [0] => apple
    [1] => banana
    [2] => coconut
)

There are no changes from the original, even though you clearly assigned a value to $item. This is because you are operating on $item as it appears in the copy of $set being worked on. You can override this by grabbing $item by reference, like so:

$set = array("apple", "banana", "coconut");
foreach ( $set AS &$item ) {
    $item = strrev($item);
}
print_r($set);

This outputs:

Array
(
    [0] => elppa
    [1] => ananab
    [2] => tunococ
)

As you can see, when $item is operated on by-reference, the changes made to $item are made to the members of the original $set. Using $item by reference also prevents PHP from creating the array copy. To test this, first we’ll show a quick script demonstrating the copy:

$set = array("apple", "banana", "coconut");
foreach ( $set AS $item ) {
    $set[] = ucfirst($item);
}
print_r($set);

This outputs:

Array
(
    [0] => apple
    [1] => banana
    [2] => coconut
    [3] => Apple
    [4] => Banana
    [5] => Coconut
)

In this example, PHP copied $set and used it to loop over, but when $set was used inside the loop, PHP added the variables to the original array, not the copied array. Basically, PHP is only using the copied array for the execution of the loop and the assignment of $item. Because of this, the loop above only executes 3 times, and each time it appends another value to the end of the original $set, leaving the original $set with 6 elements, but never entering an infinite loop.

However, what if we had used $item by reference, as I mentioned before? A single character added to the above test:

$set = array("apple", "banana", "coconut");
foreach ( $set AS &$item ) {
    $set[] = ucfirst($item);
}
print_r($set);

Results in an infinite loop. Note that this actually is an infinite loop: you’ll have to either kill the script yourself or wait for your OS to run out of memory. I added the following line to my script so PHP would run out of memory very quickly, and I suggest you do the same if you’re going to be running these infinite loop tests:

ini_set("memory_limit","1M");

So in this previous example with the infinite loop, we see the reason why PHP was written to create a copy of the array to loop over. When a copy is created and used only by the structure of the loop construct itself, the array stays static throughout the execution of the loop, so you’ll never run into issues.

But wait, there’s more. PHP fails to create a copy of the array if a reference is used at all. We know that referencing $item will cause the infinite loop scenario above, but if $set is referenced anywhere else in the script, even the non-referencing foreach format will break:

$set = array("apple", "banana", "coconut");
$a = &$set;
foreach ( $set AS $item ) {
    $set[] = ucfirst($item);
}

Results in an infinite loop, even though $item isn’t by reference. Using $a instead of $set gives identical results.

This is not to say that $item is implicitly used by reference if $set is referenced. See this example:

$set = array("apple", "banana", "coconut");
$a = &$set;
foreach ( $a AS $item ) {
    $item = ucfirst($item);
}
print_r($set);

This outputs:

Array
(
    [0] => apple
    [1] => banana
    [2] => coconut
)

$set is unchanged from the original values, because even though $set is referenced by $a, and $set has not been copied, $item is still given only lexical scope in relation to the loop, and will not pass modifications back to $set. You will still have to assign it by reference to make changes to the original array:

$set = array("apple", "banana", "coconut");
$a = &$set;
foreach ( $a AS &$item ) {
    $item = strrev($item);
}
print_r($set);

This outputs:

Array
(
    [0] => elppa
    [1] => ananab
    [2] => tunococ
)

All of these examples also work in associative arrays using the foreach ( $set AS $key => $item ) syntax. $key can never be used by-reference; it always comes from the array the loop construct is using, and cannot be modified. So the tricks used to modify array items in-position won’t work for modifying the keys. You can create new keys in the array, however, and unset the existing ones, like so:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
    unset($set[$key]);
}
print_r($set);

This outputs:

Array
(
    [Apple] => red
    [Banana] => yellow
    [Coconut] => brown
)

However, as you may have already noticed, this array was copied before the loop began. If you were using the array in a situation where it couldn’t be copied, you will run into errors:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
$a = &$set;
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
    unset($set[$key]);
}
print_r($set);

This outputs:

Array
(
)

Because the array was referenced and not copied, you get vastly unpredictable results when attempting to alter the physical structure of the array, especially using unset(). Without the unset() call in this example, you operate on the original array and loop through the original array, so you get the same infinite-loop generating code as before, but since we’re specifying the key for $set it doesn’t continue forever:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
$a = &$set;
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
}
print_r($set);

This outputs:

Array
(
    [apple] => red
    [banana] => yellow
    [coconut] => brown
    [Apple] => red
    [Banana] => yellow
    [Coconut] => brown
)

You can prove that it’s still possible to enter an infinite loop by adding a $set[] inside your loop:

$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
$a = &$set;
foreach ( $set AS $key => $item ) {
    $set[ucfirst($key)] = $item;
    $set[] = $item;
}
print_r($set);

This results in an infinite loop.

One interesting thing you can do with the $key => $item syntax when the array is copied is modify the original array structure without fear of causing loop issues:

<?php
$set = array("apple"=>"red","banana"=>"yellow","coconut"=>"brown");
foreach ( $set AS $key => $item ) {
    $set[] = ucfirst($item);
    unset($set[$key]);
}
print_r($set);

This outputs:

Array
(
    [0] => Red
    [1] => Yellow
    [2] => Brown
)

As you can see from this example, the array was copied for use in the loop construct. References to $set within the loop still refer to the outer version of $set, so the unset() call and the $set[] addition work on the original, leaving us with a nicely upper-cased version of the original, without keys.

This knowledge is useful for developers who are trying to plug memory holes in PHP applications. If you foreach through an array of objects that can be 50MB in size, you create an entire copy of the structure in memory for no reason other than to power the loop. If your loop doesn’t modify the structure of the array or add to it at all, it would be vastly more efficient to add the “cheat” of $a = &$array; right before your loop to prevent PHP from making a copy.

This knowledge is also hopefully useful for programmers who cannot figure out why arrays are behaving like they are. Basically, if you don’t use references, the loop executes once for each member in the original array, regardless of what you do to the original.

NOTE: These tests were performed on PHP version 5.2.5. 5.2.0 and earlier perform differently, as do PHP 7 and later, where a by-value foreach always iterates over a copy of the array. Run these tests yourself under controlled circumstances before relying on PHP to behave in any particular way.


PHP Type Conversions for Comparison

There has been some discussion recently among our dev team regarding PHP type conversion. I’ll give some of the problems we’ve run into and then try to shed some light on the inner workings of PHP when it does comparisons.

The first example may seem familiar to most seasoned developers, but when chained together it brings up an interesting point about PHP: The == operator isn’t transitive.

echo (null == 0 ? "YES" : "NO") . "\n"; //YES
echo ("null" == 0 ? "YES" : "NO") . "\n"; //YES
echo ("null" == null ? "YES" : "NO") . "\n"; //NO

As you can see, null == 0 and 0 == “null”, but null != “null”. (PHP 8 changed how strings are compared to numbers, so “null” == 0 is false there.)

You may be familiar with the following kind of error. The erroneous code is usually similar to:

if ( $a = "Hello" && $b != "World" )

Seeded with $b = “World”, the statement assigned FALSE to $a. This is because there was a single = instead of == in $a = “Hello”, so PHP was interpreting the whole thing as an assignment. Because && binds more tightly than =, everything to the right of the = was evaluated first: “Hello” was converted to TRUE, $b != “World” was returning FALSE because $b was equal to “World”, and the result of TRUE && FALSE, which is FALSE, was assigned to $a.
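The effect is easy to reproduce. In this small sketch the helper function is hypothetical and exists only so the expression can be run with two different values of $b.

```php
<?php
// Hypothetical helper wrapping the erroneous expression from above.
function assignInCondition($b)
{
    // Parsed as: $a = ("Hello" && ($b != "World"))
    if ( $a = "Hello" && $b != "World" ) {
        // only reached when $b is not "World"
    }
    return $a;
}

var_dump(assignInCondition("World")); // bool(false)
var_dump(assignInCondition("Earth")); // bool(true)
```

Either way $a ends up holding a boolean, never the string “Hello”.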

PHP has a certain order of precedence for data types. It is defined loosely in the manual’s comparison operators page, but I will try to spell it out more explicitly here. There are 8 basic types of data in PHP. In order of conversion precedence, they are:

  • Boolean
  • Object
  • Array
  • Floating Point Number
  • Integer
  • String
  • Resource
  • NULL

That is to say, if you compare any two data types on the list, the variable with the data type lower on the list will be converted to the upper variable’s data type, and then the comparison is applied. However, when applying the first example to this hard and fast rule, we find it lacking. In reality, there are certain comparisons that are so far off PHP converts BOTH data types to a third data type. The first example actually works out like:

  • null == 0. both were converting to FALSE, so the comparison was succeeding
  • “null” == 0. “null” was converting to 0, so the comparison was succeeding
  • “null” == null. NULL was converting to "", and “null” is not equal to "", so the comparison was failing.
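The individual conversions involving NULL can be checked with explicit casts. This is only a sketch of the conversions themselves, not of PHP's internal comparison code.

```php
<?php
var_dump((bool) null);          // bool(false)
var_dump((bool) 0);             // bool(false), so null == 0 succeeds as booleans
var_dump((string) null === ""); // bool(true), NULL becomes the empty string
var_dump("null" == "");         // bool(false), so "null" == null fails
```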

It’s much more easily represented as a table:

                Boolean   Object    Array     Float     Integer   String    Resource  NULL
Boolean         -         Bool      Bool      Bool      Bool      Bool      Bool      Bool
Object          Bool      -         None/O    None/O    None/O    None/O    None/O    Bool*
Array           Bool      None/O    -         None/A    None/A    None/A    None/A    Bool
Floating Point  Bool      None/O    None/A    -         Float     Float     Float     Bool
Integer         Bool      None/O    None/A    Float     -         Float     Integer   Bool
String          Bool      None/O    None/A    Float     Float     -         Float     String
Resource        Bool      None/O    None/A    Float     Integer   Float     -         Bool*
NULL            Bool      Bool*     Bool      Bool      Bool      String    Bool*     -

  • Bool: both operands are converted to Boolean. Objects and resources always resolve to true. Empty arrays are false, all others are true. 0 resolves to false, all other numbers are true. "" and "0" resolve to false, all other strings are true. NULL is always false.
  • Bool*: converted to Boolean as above. Because objects and resources are always true and NULL is always false, these are never == NULL.
  • None/O: No conversion made. Objects are always greater-than.
  • None/A: No conversion made. Arrays are always greater-than.
  • Float (Floating Point Number), Integer: both operands are compared as that type.
  • String: NULL is converted to "".
  • -: both operands already have the same type.


In the table where you see the phrase “No Conversion Made” that means that those two data types will never == each other. However, in most of those situations data types are given specific return values for quantitative comparisons, such as greater-than and less-than. Note the specific case of NULL, where almost every instance of comparing to NULL results in both types being converted to Boolean.

Armed with this information, we are now capable of determining the outcome of almost any comparison in PHP. We know, for instance, that array() is greater than “Hello”, but “Hello” is less than 2. We know that a stdClass object is greater than an array, but both the object and any non-empty array are equal to TRUE. There are plenty of places where PHP contradicts normal logic, because of the sometimes convoluted process involved in comparing different data types.
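A few of these rules can be checked directly. The snippet below is a minimal sketch rather than an exhaustive test, and it sticks to comparisons that should give the same answer on every PHP version. String-versus-number comparisons such as “null” == 0 are deliberately left out, because PHP 8 changed how non-numeric strings compare to numbers. The helper name describeComparison() is made up for this illustration.

```php
<?php
// Hypothetical helper for this illustration: renders a comparison result as text.
function describeComparison($label, $result)
{
    return $label . ' => ' . ($result ? 'true' : 'false');
}

$results = array(
    // NULL vs integer: both operands become Boolean FALSE, so they are equal.
    'null == 0'              => (null == 0),
    // NULL vs string: NULL becomes "", which is not the string "null".
    '"null" == null'         => ("null" == null),
    // Array vs string: no conversion is made, the array is always greater.
    'array() > "Hello"'      => (array() > "Hello"),
    // Object vs Boolean: objects always resolve to TRUE.
    'new stdClass() == true' => (new stdClass() == true),
    // Array vs Boolean: empty arrays are FALSE, all others are TRUE.
    'array() == true'        => (array() == true),
    'array(1) == true'       => (array(1) == true),
);

foreach ($results as $label => $result) {
    echo describeComparison($label, $result), "\n";
}
```

Only the empty array fails the comparison with TRUE, which is why the distinction between empty and non-empty arrays matters in the table.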

The fact that PHP sometimes internally converts two operands to a third, unrelated data type can be quite confusing. I hope, however, that the chart in this article will help you work out exactly what it’s doing.

Of course, as one of our lead developers is quick to point out, this whole discussion would be moot if everyone used ===.


URL Obfuscation

On August 26, our CTO Nathan Day wrote a post on the InnerLayer blog about nameservers. His straightforward explanation of nameservers and their operations got me thinking about how NOT straightforward the whole operation is.

The way Nathan explained it, you type in “theinnerlayer.softlayer.com” and it is translated to an IP address, which is then contacted, and the page is returned to you. However, if you know the IP address already, you can use that instead of the domain name and skip the nameserver entirely. For instance, http://66.228.119.19 will take you directly to the InnerLayer blog, bypassing the nameserver. But that’s not all! Not only will the dotted-decimal representation of the IP work in a URL, but the dword representation (the four octets read as a single 32-bit integer) will as well! Try http://1122268947. That will also get you to the InnerLayer.
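The dword form is simply the four octets treated as digits of a base-256 number. As a small sketch using the IP address from this post, PHP's built-in ip2long() and long2ip() convert in both directions. The manual calculation is included only to show where the number comes from, and the variable names are illustrative.

```php
<?php
$dotted = '66.228.119.19';

// Built-in conversion from dotted-decimal to a single integer.
$dword = ip2long($dotted);

// The same value by hand: each octet is one base-256 digit.
list($a, $b, $c, $d) = array_map('intval', explode('.', $dotted));
$manual = (($a * 256 + $b) * 256 + $c) * 256 + $d;

echo "http://$dword\n";     // http://1122268947
echo long2ip($dword), "\n"; // 66.228.119.19
```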

Now that we’ve gotten the domain out of the way, what about the bits before and after? Before the domain, between the protocol (http://) and the domain itself, there is an optional authentication part. You can specify a username and password to log into secured sites right in the URL; http://user:pass@site.com is the standard format for such logins. Historically, if the website you were visiting didn’t require authentication, most browsers simply ignored this part. Firefox 3 now prompts you when you click on one of these obfuscated links to ask whether “site.com” is really the site you wish to visit, whereas IE7 simply won’t load the page at all if there’s an unexpected authentication string. This is a fairly new feature, and it’s a good way to protect users against this sort of attack. Now that you know about these methods of obfuscating domain names in URLs, you can probably see how http://www.bankofamerica.com%20login@1122268947 actually leads to the InnerLayer. This is a common tactic used by spammers and phishers to disguise their URLs. You can put anything you want into the authentication portion of the URL, as long as you avoid reserved URL characters like the colon, the “at” sign, and the forward slash (a single colon is allowed as the separator between username and password). For our case, let’s use “4NDIw:U4ODYwMCAxMjE5” as our fake authentication data, just to be confusing.

Now that we’ve added stuff to the beginning of the URL, what about the filename at the end? Nathan’s post could easily be accessed using http://4NDIw:U4ODYwMCAxMjE5@1122268947/2008/do-you-know-where-your-nameserver-is/. However, there’s still all that easy-to-read nonsense at the end. That will never do. Have you ever seen a URL with a space in it? The space is encoded as %20. That’s the hexadecimal representation of ASCII code 32, a space. The percent sign indicates that the following 2 digits are to be interpreted as a hex code for a real character. This is how you keep URLs from breaking on spaces: you turn the spaces into non-breaking characters. However, did you know it works for ALL characters and not just spaces? We can change every character but the forward slashes in any URL to their hex equivalents. Nathan’s article link then becomes:

http://4NDIw:U4ODYwMCAxMjE5@1122268947/%32%30%30%38/%64%6f%2d%79%6f%75%2d%6b%6e%6f%77%2d%77%68%65%72%65%2d%79%6f%75%72%2d%6e%61%6d%65%73%65%72%76%65%72%2d%69%73
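The percent-encoding step can be sketched in a few lines. This is an illustration only: hexEncodePath() is a hypothetical helper name, not part of PHP, and it encodes every character except the forward slashes. PHP's built-in rawurldecode() reverses the process.

```php
<?php
// Hypothetical helper for this illustration: percent-encode every
// character of a URL path except the forward slashes.
function hexEncodePath($path)
{
    $encoded = '';
    $len = strlen($path);
    for ($i = 0; $i < $len; $i++) {
        $char = $path[$i];
        // sprintf pads to two hex digits, which the %xx format requires.
        $encoded .= ($char === '/') ? '/' : sprintf('%%%02x', ord($char));
    }
    return $encoded;
}

$path = '/2008/do-you-know-where-your-nameserver-is/';
$encoded = hexEncodePath($path);

echo $encoded, "\n";
// rawurldecode() reverses the encoding and recovers the readable path.
echo rawurldecode($encoded), "\n";
```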

But wait, there’s more! Let’s go back to the domain name, shall we? Most browsers will handle overflow in the dword representation of the domain just fine. What that means is that we can continually add 4294967296 (2^32) to the domain portion of our obfuscated URL and still continue to get the results we want. Our URL is now:

http://4NDIw:U4ODYwMCAxMjE5@5417236243/%32%30%30%38/%64%6f%2d%79%6f%75%2d%6b%6e%6f%77%2d%77%68%65%72%65%2d%79%6f%75%72%2d%6e%61%6d%65%73%65%72%76%65%72%2d%69%73

As a final trick, you don’t have to obfuscate every letter. A fixed pattern of %xx%xx%xx over and over again will get boring. Mix it up. I only converted 70% of my URL to hex, resulting in this gem:

http://4NDIw:U4ODYwMCAxMjE5@5417236243/%3200%38/%64%6f%2d%79%6f%75%2d%6b%6e%6f%77-%77%68%65%72e%2d%79%6f%75%72-%6e%61%6d%65%73e%72v%65%72%2di%73

As you can see, this is quite a bit more confusing than the original URL, which was http://theinnerlayer.softlayer.com/2008/do-you-know-where-your-nameserver-is/.

This information can be useful to any systems administrator who is dealing with an elusive, abusive user. Being able to translate a crazy URL to the actual human-readable equivalent can greatly assist both the SoftLayer abuse department as well as any other group attempting to track down spammers, scammers, or just plain old sneaky users.
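Going the other way is mostly a matter of undoing each trick in turn. The following is a rough sketch, not a complete URL parser: deobfuscateUrl() is a hypothetical helper name, and it assumes a 64-bit PHP build and an IPv4 host written as a plain decimal dword. It recovers the IP address rather than the original hostname, since that would need a reverse DNS lookup.

```php
<?php
// Hypothetical helper for this illustration: turn an obfuscated URL back
// into a readable one. Assumes 64-bit PHP and an IPv4 decimal dword host.
function deobfuscateUrl($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    $host = $parts['host'];
    if (ctype_digit($host)) {
        // Undo any multiples of 2^32 that were added, then convert
        // the remaining dword to dotted-decimal.
        $host = long2ip((int) $host % 4294967296);
    }

    // Percent-decoding recovers the readable path.
    // The fake user:pass part is simply dropped.
    $path = isset($parts['path']) ? rawurldecode($parts['path']) : '';

    return $parts['scheme'] . '://' . $host . $path;
}

$obfuscated = 'http://4NDIw:U4ODYwMCAxMjE5@5417236243/%3200%38/'
    . '%64%6f%2d%79%6f%75%2d%6b%6e%6f%77-%77%68%65%72e%2d%79%6f%75%72-'
    . '%6e%61%6d%65%73e%72v%65%72%2di%73';

echo deobfuscateUrl($obfuscated), "\n";
// http://66.228.119.19/2008/do-you-know-where-your-nameserver-is
```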

As a final note: Please don’t use this knowledge for evil. As mentioned before, the new versions of both Firefox and Internet Explorer are no longer fooled by the fake authentication string trick, and the rest of the obfuscation should really only be used to fool web spiders. Personally, I used this method in combination with JavaScript to obfuscate links and email addresses so that I wouldn’t get spammed.

The following PHP code was used to generate the links in this article.

[Editor's note: We at SoftLayer use our powers for good and so should you. Thankfully half of these kinds of links won't open in the latest versions of Outlook and Safari. -K]


<?php

// The URL we're attempting to obfuscate
$url = "http://theinnerlayer.softlayer.com/2008/do-you-know-where-your-nameserver-is/";

$urlData = parse_url($url);

$path = $urlData['path'];

// Resolve the hostname to a dotted-decimal IP address
$startingIP = gethostbyname($urlData['host']);

// Get the long (dword) representation:
$long = ip2long($startingIP);

// Add 4294967296 (2^32) to the long for further obfuscation:
$long += 4294967296;

// Add random authentication characters to the beginning of the string:
$auth = substr(base64_encode(microtime()), rand(5, 10), rand(5, 15)) . ":" . substr(base64_encode(microtime()), rand(5, 10), rand(5, 15));

// Obfuscate the rest of the URL
$len = strlen($path);
$obfuscatedLocation = "";

for ( $p = 0; $p < $len; $p++ ) {

    // Check for slashes
    // Also, 3 in 10 characters make it through plain for further confusion
    if ( $path[$p] == '/' || rand(1, 10) > 7 ) {
        $obfuscatedLocation .= $path[$p];
        continue;
    }

    // Made it here, obfuscate this character as a two-digit hex code:
    $obfuscatedLocation .= sprintf('%%%02x', ord($path[$p]));
}

echo "http://$auth@$long$obfuscatedLocation";

?>



The New Face of Search Engine Optimization

Most SoftLayer customers host websites on our services, and all websites benefit from high search engine rankings. The “old method” of search engine optimization doesn’t really work anymore. Back in the days before Google, the best way to get to the top of the search engine rankings was to follow four easy steps:

  1. Diversify your IP space.
  2. Add keywords to the <meta> tag on your site.
  3. Make sure those keywords also appear in the body of your document.
  4. Take 2 & 3 and fill them with references to Pokémon, pop music, and porn.

However, only #3 is a valid tactic in this new, Google-driven world. Let’s analyze them one by one.

Diversifying your IP space. Old search engines gave more credence to sites located in “geographically diverse” areas, where “geographically diverse” was determined by class C addresses. Now, however, with the advent of huge centralized data centers, search engine algorithms recognize that a site with 15 servers in the same datacenter may be just as effective as one with servers in 15 separate cities. Of course, it’s still a good idea to buy servers in, say, Dallas, Seattle, and Washington DC.

Meta tags. Google and other major search engines don’t really look at meta tags for keywords anymore. They will still use the meta tag for language, encoding, and summary data. However, the processing power of search engines has been increasing exponentially in the last few years, which means they’re capable of analyzing the actual content of the page rather than relying on meta tags. If you still have meta tags, you can keep them, but they’re only really useful for language and summary information.

Document body keywords. This is an area where it still matters. As previously mentioned, search engines are now capable of searching the entire page. In the past, only a few search engines indexed actual page content, and even then it may have been a simple count of how often your meta keywords matched the page contents. Now, however, Google stores local copies of every page it indexes (to a certain extent) and uses the entire page contents for search and cached viewing.

Dummy data. When search engines were younger, they could be fooled very easily by simply including the top 1,000 popular search terms in your meta tags and as invisible text inside your document body. I never understood it personally, but the thinking was that if you had enough references to Britney Spears on your page, you would hijack enough people that one of them would forget what he was originally looking for and buy your product instead. Though I guess that’s how spam works now, isn’t it?

So what can you do right now to improve your search engine placement? There are a few easy things to do, broken into the following categories:

  1. Page Titles. Your pages should each have a unique, meaningful title. Putting the name of your site on every page doesn’t do anyone any good. A unique title not only gives your search results more visibility, but also helps people find the page again if they bookmark it.
  2. Page Content. You want your page content to be meaningful and arranged around a central semantic theme. Don’t put up one huge page featuring thousands of unrelated pieces of information. Keep it concise, unique, and focused. You have an unlimited number of individual pages; make use of that fact.
  3. Dynamic Content. The more often your site changes, the higher your Google rank will be. You could take the cheap way out and simply put a box on your site that has random content, but the best way is to actually do updates as often as possible. This not only ensures visibility on the search engines, but also makes your site more useful to the people who eventually make it to your pages, which is your main goal anyway.
  4. Accessibility. This is a key area that many sites overlook. You need to make heavy use of the title and alt attributes for things like links and images. Not only is it required by the Americans With Disabilities Act, but it helps blind users navigate your site. You know what acts like a blind user? Search engine crawlers. When you do a Google images search, the images that pop up most likely have alt attributes specified. The same goes for link titles: if you put a brief link description in your titles, not only do you get pretty mouseovers on the links, but they add one more point to your eventual page rank.
  5. Linking. Google builds its page rank based on links to and from the page in question, as well as the page contents themselves. With this in mind, it’s useful to link out to sources on whatever topic you’re attempting to talk about. The higher the page rank of the target, the higher the benefit to you. It’s also a good idea to be useful enough for other people to link to you, either in message boards or as a source of their own. All links eventually increase your page rank. As a small note, make your URLs “search engine friendly” by including keywords in them as well. Many message boards will include the post title in the URL for just this purpose. Finally, for some reason, Google refuses to index any URL with “?id=” in it, so be careful about that.
  6. Site Map. Search engines love site maps. Users don’t care for them as much, but a concise HTML or XML site map with links to every page on your site divided into sections with a short description increases links, increases accessibility, and gives the search engines more meta data on the important topics in each page.

So all you have to do to improve your search engine rank is to have dynamic, frequently changing content about a single, concise topic on an easily accessible page that is frequently linked to by other pages. Wikipedia is a perfect example of search engine optimization in action. Each page is titled with the topic it discusses; every image has a title attribute and links out to a full description of the article; each link has a title attribute; many outside sources are mentioned; plenty of sites link to each article as well as the root domain; and the index page changes every single day with completely new and original content.
