Applying FilterIterator to Directory Iteration
I'm currently doing research and prototyping for autoloading alternatives in
Zend Framework 2.0. One approach
I'm looking at involves creating explicit class/file maps; these tend to be
much faster than using the include_path, but do require some
additional setup.
My algorithm for generating the maps was absurdly simple:
- Scan the filesystem for PHP files
- If the file does not contain an interface, class, or abstract class, skip it.
- If it does, get its declared namespace and classname
The question was what implementation approach to use.
I'm well aware of RecursiveDirectoryIterator, and planned to
use that. However, I also had heard of FilterIterator, and
wondered if I could tie that in somehow. In the end, I could, but the
solution was non-obvious.
What I Thought I'd Be Able To Do
FilterIterator is an abstract class. When extending it, you must
define an accept() method.
class FooFilter extends FilterIterator
{
public function accept()
{
}
}
In that method, you typically will inspect whatever is returned by
$this->current(), and then return a boolean true
or false, depending on whether you want to keep it or not.
class FooFilter extends FilterIterator
{
public function accept()
{
$item = $this->current();
if ($someCriteriaIsMet) {
return true;
}
return false;
}
}
I'll go into the mechanics of my criteria later; what's important now is
knowing that a FilterIterator allows you to limit the results
returned by your iterator.
I originally thought I'd be able to simply pass a
DirectoryIterator or
RecursiveDirectoryIterator to my filtering instance. This
worked in the former case, as it's only one level deep. However, for the
latter, it would only return the first directory level for all classes that
matched -- i.e., if I ran it over "Zend/Controller", I'd get a match for
each class under "Zend/Controller/Action/Helper/", but it would return
simply "Zend/Controller/Action" as the match. This certainly wasn't useful.
I then discovered RecursiveFilterIterator, which looked like it
would solve the recursion problem. However, I found one of two results
occurred: either I'd receive an entire subtree if at least one item matched,
or it would skip an entire subtree if the first item found failed the
criteria. There was no middle ground.
The Solution
The solution was incredibly simple and elegant, once I stumbled upon it:
pass my RecursiveIteratorIterator instance to the
FilterIterator.
$rdi = new RecursiveDirectoryIterator($somePath);
$rii = new RecursiveIteratorIterator($rdi);
$filtered = new FooFilter($rii);
Really. It was that simple -- but, as noted, non-obvious. It also required a
slight change within my filter -- instead of using current(),
I'd need to first pull the "inner" iterator instance:
$this->getInnerIterator()->current(). I show an example of that
below when I go over the filter implementation.
As for my criteria, I had several options. I could require_once
the file, and use the Reflection API to inspect the class to determine if it
was an interface, abstract class, or class, as well as to determine the
namespace. However, I couldn't be 100% sure the file would contain a class,
so this seemed like overkill. That, and horribly non-performant, due to
using reflection.
The next option was to simply slurp in the file contents into a variable, and use regular expressions. I love regular expressions, but in this case, it felt like I could possibly end up with some false positives. Also, since some of these files could be quite large, I was worried again about performance implications -- I don't want to have to wait forever to generate these maps.
The solution I went with was to use the tokenizer to inspect the file. Tokenizing is incredibly fast, and it's also incredibly simple to analyze the tokens.
I decided to store the detected namespace and classnames as public
properties of the SplFileInfo objects returned; this makes it
simple to iterate over the entire collection and utilize that
information. Also, because I have SplFileInfo objects, I
already have the paths I need.
My implementation looks like this:
/** @namespace */
namespace Zend\File;
// import SPL classes/interfaces into local scope
use DirectoryIterator,
FilterIterator,
RecursiveIterator,
RecursiveDirectoryIterator,
RecursiveIteratorIterator;
/**
* Locate files containing PHP classes, interfaces, or abstracts
*
* @package Zend_File
* @license New BSD {@link http://framework.zend.com/license/new-bsd}
*/
class ClassFileLocater extends FilterIterator
{
/**
* Create an instance of the locater iterator
*
* Expects either a directory, or a DirectoryIterator (or its recursive variant)
* instance.
*
* @param string|DirectoryIterator $dirOrIterator
* @return void
*/
public function __construct($dirOrIterator = '.')
{
if (is_string($dirOrIterator)) {
if (!is_dir($dirOrIterator)) {
throw new InvalidArgumentException('Expected a valid directory name');
}
$dirOrIterator = new RecursiveDirectoryIterator($dirOrIterator);
}
if (!$dirOrIterator instanceof DirectoryIterator) {
throw new InvalidArgumentException('Expected a DirectoryIterator');
}
if ($dirOrIterator instanceof RecursiveIterator) {
$iterator = new RecursiveIteratorIterator($dirOrIterator);
} else {
$iterator = $dirOrIterator;
}
parent::__construct($iterator);
$this->rewind();
}
/**
* Filter for files containing PHP classes, interfaces, or abstracts
*
* @return bool
*/
public function accept()
{
$file = $this->getInnerIterator()->current();
// If we somehow have something other than an SplFileInfo object, just
// return false
if (!$file instanceof \SplFileInfo) {
return false;
}
// If we have a directory, it's not a file, so return false
if (!$file->isFile()) {
return false;
}
// If not a PHP file, skip
if ($file->getBasename('.php') == $file->getBasename()) {
return false;
}
$contents = file_get_contents($file->getRealPath());
$tokens = token_get_all($contents);
$count = count($tokens);
$i = 0;
while ($i < $count) {
$token = $tokens[$i];
if (!is_array($token)) {
// single character token found; skip
$i++;
continue;
}
list($id, $content, $line) = $token;
switch ($id) {
case T_NAMESPACE:
// Namespace found; grab it for later
$namespace = '';
$done = false;
do {
++$i;
$token = $tokens[$i];
if (is_string($token)) {
if (';' === $token) {
$done = true;
}
continue;
}
list($type, $content, $line) = $token;
switch ($type) {
case T_STRING:
case T_NS_SEPARATOR:
$namespace .= $content;
break;
}
} while (!$done && $i < $count);
// Set the namespace of this file in the object
$file->namespace = $namespace;
break;
case T_ABSTRACT:
case T_CLASS:
case T_INTERFACE:
// Abstract class, class, or interface found
// Get the classname
$class = '';
do {
++$i;
$token = $tokens[$i];
if (is_string($token)) {
continue;
}
list($type, $content, $line) = $token;
switch ($type) {
case T_STRING:
$class = $content;
break;
}
} while (empty($class) && $i < $count);
// If a classname was found, set it in the object, and
// return boolean true (found)
if (!empty($class)) {
$file->classname = $class;
return true;
}
break;
default:
break;
}
++$i;
}
// No class-type tokens found; return false
return false;
}
}
Note: the Exceptions thrown in this class are defined in the same namespace; I'll leave how they're implemented to your imagination.
Iterating Faster
The next trick I discovered was in the form of
iterator_apply(). Normally when I use iterators, I use
foreach, because, well, that's what you do. But in looking
through the various iterators for this exercise, I stumbled across this gem.
Basically, you pass the iterator, a callback, and argument(s) you want
passed to the callback. Like FilterIterator, you don't get the
actual item returned by the iterator, so in most use cases, you pass the
iterator itself:
iterator_apply($it, $callback, array($it));
You can then grab the current value and/or key from the iterator itself:
public function process(Iterator $it)
{
$value = $it->current();
$key = $it->key();
// ...
}
While you can use any valid PHP callback, I found the most interesting solution was to use a closure, as it allows you to define everything up front:
iterator_apply($it, function() use ($it) {
$value = $it->current();
$key = $it->key();
// ...
});
If you pass in a local value via a use statement, you can do
some aggregation:
$map = new \stdClass;
iterator_apply($it, function() use ($it, $map) {
$file = $it->current();
$namespace = !empty($file->namespace) ? $file->namespace . '\\' : '';
$classname = $namespace . $file->classname;
$map->{$classname} = $file->getPathname();
});
Not only is this a nice, concise technique, it's also tremendously fast -- I
was finding it was 200% - 300% faster than using a traditional
foreach loop. Clearly it cannot be used in all situations, but
if you can use it, you probably should.
So, start playing with FilterIterator and
iterator_apply() if you haven't already -- the two offer
tremendous possibilities and capabilities for your applications.