<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/">
  <channel>
    <title>Tag: spl :: phly, boy, phly</title>
    <description>Tag: spl :: phly, boy, phly</description>
    <pubDate>Fri, 21 Jan 2011 22:07:25 +0000</pubDate>
    <generator>Zend_Feed_Writer 2.1.4dev (http://framework.zend.com)</generator>
    <link>http://mwop.net/blog/tag/spl.html</link>
    <atom:link rel="self" type="application/rss+xml" href="http://mwop.net/blog/tag/spl-rss.xml"/>
    <item>
      <title>Taming SplPriorityQueue</title>
      <pubDate>Fri, 21 Jan 2011 22:07:25 +0000</pubDate>
      <link>http://mwop.net/blog/253-Taming-SplPriorityQueue.html</link>
      <guid>http://mwop.net/blog/253-Taming-SplPriorityQueue.html</guid>
      <author>me@mwop.net (Matthew Weier O'Phinney)</author>
      <dc:creator>Matthew Weier O'Phinney</dc:creator>
      <content:encoded><![CDATA[<p>
<a href="http://php.net/SplPriorityQueue">SplPriorityQueue</a> is a fantastic new feature of
PHP 5.3. However, in trying to utilize it in a few projects recently, I've run
into some behavior that's (a) non-intuitive, and (b) in some cases at least,
undesired. In this post, I'll present my solutions.
</p><h2 id="toc_1.1">Of Heaps and Queues</h2>

<p>
A <em>queue</em> in programming is any data structure that, when iterated, returns
values in "first-in-first-out" (FIFO) order. For "last-in-first-out" (LIFO)
iteration, you use a <em>stack</em>.
</p>

<p>
A <em>heap</em> is a data structure where, given a specific node, all nodes beneath it
are of a value less than it.  (Technically, this would be considered a
"max-heap," as you can also have a variant where all child nodes are of a value
greater; this is called a "min-heap.")
</p>

<p>
A <em>priority queue</em> is a specialized version of a max-heap. Typically, data is
registered with a specific priority -- so the max-heap is looking at only the
priority value, not the data itself. This allows inserting data into the queue
in any order desired, while ensuring that they are iterated in the order
specified by the priorities provided.
</p>

<p>
PHP offers SPL data structures corresponding to each:
</p>

<ul>
<li>
<a href="http://php.net/SplQueue">SplQueue</a>, corresponding to a <em>queue</em>.
</li>
<li>
<a href="http://php.net/SplStack">SplStack</a>, corresponding to a <em>stack</em>.
</li>
<li>
<a href="http://php.net/SplHeap">SplHeap</a>, corresponding to a <em>heap</em>.
</li>
<li>
<a href="http://php.net/SplMaxHeap">SplMaxHeap</a>, corresponding to a <em>max-heap</em>.
</li>
<li>
<a href="http://php.net/SplMinHeap">SplMinHeap</a>, corresponding to a <em>min-heap</em>.
</li>
<li>
<a href="http://php.net/SplPriorityQueue">SplPriorityQueue</a>, corresponding to a
   <em>priority queue</em>.
</li>
</ul>
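
<p>
As a quick illustration of the iteration order each structure produces, here is
a minimal sketch. The values and variable names are arbitrary, and the priority
queue is deliberately given distinct priorities:
</p>

<div class="example"><pre><code lang="php">
// Queue: values come back in insertion order (FIFO).
$queue = new SplQueue();
$queue-&gt;enqueue('a');
$queue-&gt;enqueue('b');
$queue-&gt;enqueue('c');
$fifo = array();
foreach ($queue as $value) {
    $fifo[] = $value;
}
// $fifo is array('a', 'b', 'c')

// Stack: values come back in reverse insertion order (LIFO).
$stack = new SplStack();
$stack-&gt;push('a');
$stack-&gt;push('b');
$stack-&gt;push('c');
$lifo = array();
foreach ($stack as $value) {
    $lifo[] = $value;
}
// $lifo is array('c', 'b', 'a')

// Max-heap: the largest value is always returned first.
$heap = new SplMaxHeap();
foreach (array(3, 10, 7) as $number) {
    $heap-&gt;insert($number);
}
$descending = array();
foreach ($heap as $value) {
    $descending[] = $value;
}
// $descending is array(10, 7, 3)

// Priority queue: ordered by priority, not by insertion order or data.
$priorityQueue = new SplPriorityQueue();
$priorityQueue-&gt;insert('low', 1);
$priorityQueue-&gt;insert('high', 100);
$priorityQueue-&gt;insert('medium', 50);
$byPriority = array();
foreach ($priorityQueue as $value) {
    $byPriority[] = $value;
}
// $byPriority is array('high', 'medium', 'low')
</code></pre></div>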

<h2 id="toc_1.2">Problems</h2>

<p>
The first problem I ran into was really a lapse of reasoning on my part, and is
namely this:
</p>
<blockquote>
<em>Iterating over a heap removes the values from the heap.</em>
</blockquote>
    
<p>
Basically, in order to satisfy the <em>heap</em> contract, which is that the root node
is always the maximum value (or minimum, in the case of a min-heap), any
previous nodes must be removed.
</p>

<p>
The problem with this, obviously, is that if you want to iterate over a heap of
any sort multiple times, well, you can't with the same instance.
</p>
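
<p>
A small sketch makes the behavior concrete. It uses <code>SplMaxHeap</code> with
arbitrary values, but the same applies to any of the heap-based structures:
</p>

<div class="example"><pre><code lang="php">
$heap = new SplMaxHeap();
foreach (array(1, 2, 3) as $number) {
    $heap-&gt;insert($number);
}

$before = count($heap); // 3

// First pass: each step extracts the root node from the heap.
$firstPass = array();
foreach ($heap as $value) {
    $firstPass[] = $value;
}

$after = count($heap); // 0, because iteration removed every node

// Second pass: nothing is left, so the loop body never runs.
$secondPass = array();
foreach ($heap as $value) {
    $secondPass[] = $value;
}
</code></pre></div>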

<p>
The next problem I ran into was with SplPriorityQueue specifically: when items
of equal priority are enqueued, the iteration order of these items is...
unexpected. While the <a href="http://php.net/splpriorityqueue.compare">documentation</a>
notes that "multiple elements with the same priority will get dequeued in no
particular order," the fact is that it <em>is</em> predictable, and unintuitive. For
example, given the following:
</p>

<div class="example"><pre><code lang="php">
$queue = new SplPriorityQueue;
$queue-&gt;insert('foo', 1000);
$queue-&gt;insert('bar', 1000);
$queue-&gt;insert('baz', 1000);
$queue-&gt;insert('bat', 1000);

foreach ($queue as $data) echo $data, &quot; &quot;;
</code></pre></div>

<p>
I'd expect a result of "foo bar baz bat", assuming FIFO order (which is expected
in a <em>queue</em>) for equal priorities, or "foo baz bat bar", assuming ordering by
data (which might be expected in a max-heap). In fact, neither is true: the
first item is emitted first, followed by the remaining items in reverse order
of insertion: "foo bat baz bar".
</p>

<p>
While this may be somewhat predictable, I find I don't want to assume such
an order, nor try to write code around it.
</p>

<h2 id="toc_1.3">Solutions</h2>

<h3 id="toc_1.3.1">Allowing multiple iterations</h3>

<p>
Allowing multiple iterations of a queue is as easy as cloning it prior to
iteration:
</p>

<div class="example"><pre><code lang="php">
foreach (clone $queue as $datum) echo $datum, &quot; &quot;;
</code></pre></div>

<p>
The problem is automating this -- there are cases where I don't want users to
really have to understand the internal implementation.
</p>

<p>
My solution to this was to use the idea of inner and outer iterators. In this
particular case, I created a "PriorityQueue" class that composes an
SplPriorityQueue instance, and which also implements <code>IteratorAggregate</code>. This
allows the following:
</p>

<div class="example"><pre><code lang="php">
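namespace Foo;

// Countable and IteratorAggregate live in the global namespace, so import them
use Countable, IteratorAggregate;

class PriorityQueue implements Countable, IteratorAggregate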
{
    protected $innerQueue;
    
    public function __construct()
    {
        // I'll explain the lack of global namespacing later...
        $this-&gt;innerQueue = new SplPriorityQueue;
    }

    public function count()
    {
        return count($this-&gt;innerQueue);
    }

    public function insert($datum, $priority)
    {
        $this-&gt;innerQueue-&gt;insert($datum, $priority);
    }
    
    public function getIterator()
    {
        return clone $this-&gt;innerQueue;
    }
}
</code></pre></div>

<p>
This approach means that as I consume PriorityQueue, I can be assured that I
can count and iterate over it... again and again.
</p>

<p>
I mention in the code comments that I'm not importing SplPriorityQueue into the
namespace. The reason is that I want to also solve the problem of predictable
queue order.
</p>

<h3 id="toc_1.3.2">Enforcing predictable queue order</h3>

<p>
The solution to the queue order problem with equal priorities is actually quite simple. While I found it on <a href="http://php.net/splpriorityqueue.compare">the SplPriorityQueue::compare manual page</a>, <a href="http://twitter.com/elazar">Matthew Turland</a> also <a href="http://www.slideshare.net/tobias382/new-spl-features-in-php-53">discusses it in a presentation on SPL</a>, and it hinges on one, simple fact: <em>priorities do not need to be integers</em>.
</p>

<p>
What does this mean? It means that the following priorities are not equal to
one another, and so they lead to a more predictable sort order:
</p>

<div class="example"><pre><code lang="php">
$queue-&gt;insert('foo', array(1000, 1000));
$queue-&gt;insert('bar', array(1000, 100));
$queue-&gt;insert('baz', array(1000, 10));
$queue-&gt;insert('bat', array(1000, 1));

foreach ($queue as $data) echo $data, &quot; &quot;;
</code></pre></div>

<p>
This results in "foo bar baz bat"!
</p>

<p>
The trick, then, is automating the solution. I achieved this in a custom
SplPriorityQueue extension:
</p>

<div class="example"><pre><code lang="php">
namespace Foo;

class SplPriorityQueue extends \SplPriorityQueue
{
    protected $queueOrder = PHP_INT_MAX;

    public function insert($datum, $priority)
    {
        if (is_int($priority)) {
            $priority = array($priority, $this-&gt;queueOrder--);
        }
        parent::insert($datum, $priority);
    }
}
</code></pre></div>

<p>
As each datum is added to the queue, if the priority is an integer, the method
wraps that priority in an array, using <code>$queueOrder</code> as the second element
and decrementing <code>$queueOrder</code> afterwards. The new array priority is then
used to insert the value, so earlier insertions win ties.
</p>

<p>
Using this extension ensures that order in the priority queue is now
predictable.
</p>
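
<p>
The two ideas can also be combined in one class. The following is only a sketch
under a few assumptions: the class name <code>ReusableStableQueue</code> is hypothetical,
it composes a plain <code>\SplPriorityQueue</code> rather than the extension above, it
wraps every priority (not only integers), and the return type declarations
assume PHP 7 or later:
</p>

<div class="example"><pre><code lang="php">
// Hypothetical wrapper; the name and design are illustrative only.
class ReusableStableQueue implements Countable, IteratorAggregate
{
    private $inner;
    private $serial = PHP_INT_MAX;

    public function __construct()
    {
        $this-&gt;inner = new SplPriorityQueue();
    }

    public function insert($datum, $priority)
    {
        // Pair the priority with a decreasing serial number so that items
        // with equal priority are ordered by insertion.
        $this-&gt;inner-&gt;insert($datum, array($priority, $this-&gt;serial--));
    }

    public function count(): int
    {
        return count($this-&gt;inner);
    }

    public function getIterator(): Traversable
    {
        // Iterate over a copy so the stored items survive the loop.
        return clone $this-&gt;inner;
    }
}

$queue = new ReusableStableQueue();
$queue-&gt;insert('foo', 1000);
$queue-&gt;insert('bar', 1000);
$queue-&gt;insert('baz', 1000);
$queue-&gt;insert('urgent', 2000);

// Both passes see the same items in the same order.
$first  = iterator_to_array($queue, false);
$second = iterator_to_array($queue, false);
// $first is array('urgent', 'foo', 'bar', 'baz')
</code></pre></div>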

<h2 id="toc_1.4">Conclusions</h2>

<p>
SplPriorityQueue is indeed powerful, and saves me a ton of time programming --
and also likely CPU cycles and memory when using larger data sets. While it
may not always meet my use cases, the fact is that, particularly with
namespacing available, I can easily override the class to meet my needs.
</p>]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Applying FilterIterator to Directory Iteration</title>
      <pubDate>Fri, 20 Aug 2010 19:45:21 +0000</pubDate>
      <link>http://mwop.net/blog/244-Applying-FilterIterator-to-Directory-Iteration.html</link>
      <guid>http://mwop.net/blog/244-Applying-FilterIterator-to-Directory-Iteration.html</guid>
      <author>me@mwop.net (Matthew Weier O'Phinney)</author>
      <dc:creator>Matthew Weier O'Phinney</dc:creator>
      <content:encoded><![CDATA[<p>
    I'm currently doing research and prototyping for autoloading alternatives in
    <a href="http://framework.zend.com/">Zend Framework</a> 2.0. One approach
    I'm looking at involves creating explicit class/file maps; these tend to be
    much faster than using the <code>include_path</code>, but do require some
    additional setup.
</p>

<p>
    My algorithm for generating the maps was absurdly simple:
</p>

<ul>
    <li>Scan the filesystem for PHP files</li>
    <li>If the file does not contain an interface, class, or abstract class,
    skip it.</li>
    <li>If it does, get its declared namespace and classname</li>
</ul>

<p>
    The question was what implementation approach to use.
</p>

<p>
    I'm well aware of <code>RecursiveDirectoryIterator</code>, and planned to
    use that. However, I also had heard of <code>FilterIterator</code>, and
    wondered if I could tie that in somehow. In the end, I could, but the
    solution was non-obvious.
</p><h2>What I Thought I'd Be Able To Do</h2>

<p>
    <code>FilterIterator</code> is an abstract class. When extending it, you must
    define an <code>accept()</code> method. 
</p>

<div class="example"><pre><code lang="php">
class FooFilter extends FilterIterator
{
    public function accept()
    {
    }
}
</code></pre></div>
    
<p>
    In that method, you typically will inspect whatever is returned by
    <code>$this->current()</code>, and then return a boolean <code>true</code>
    or <code>false</code>, depending on whether you want to keep it or not. 
</p>

<div class="example"><pre><code lang="php">
class FooFilter extends FilterIterator
{
    public function accept()
    {
        $item = $this-&gt;current();

        if ($someCriteriaIsMet) {
            return true;
        }

        return false;
    }
}
</code></pre></div>
    
<p>
    I'll go into the mechanics of my criteria later; what's important now is
    knowing that a <code>FilterIterator</code> allows you to limit the results
    returned by your iterator.
</p>
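
<p>
    For a concrete (if contrived) illustration, here is a minimal sketch of a
    filter over a plain <code>ArrayIterator</code>. The <code>EvenFilter</code>
    class is hypothetical, and the <code>bool</code> return type assumes PHP 7
    or later:
</p>

<div class="example"><pre><code lang="php">
// Hypothetical filter: keeps only even integers.
class EvenFilter extends FilterIterator
{
    public function accept(): bool
    {
        return $this-&gt;current() % 2 === 0;
    }
}

$numbers = new ArrayIterator(array(1, 2, 3, 4, 5, 6));
$evens   = iterator_to_array(new EvenFilter($numbers), false);
// $evens is array(2, 4, 6)
</code></pre></div>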

<p>
    I originally thought I'd be able to simply pass a
    <code>DirectoryIterator</code> or
    <code>RecursiveDirectoryIterator</code> to my filtering instance. This
    worked in the former case, as it's only one level deep. However, for the
    latter, it would only return the first directory level for all classes that
    matched -- i.e., if I ran it over "Zend/Controller", I'd get a match for
    each class under "Zend/Controller/Action/Helper/", but it would return
    simply "Zend/Controller/Action" as the match. This certainly wasn't useful.
</p>

<p>
    I then discovered <code>RecursiveFilterIterator</code>, which looked like it
    would solve the recursion problem. However, I found one of two results
    occurred: either I'd receive an entire subtree if at least one item matched,
    or it would skip an entire subtree if the first item found failed the
    criteria. There was no middle ground.
</p>

<h2>The Solution</h2>

<p>
    The solution was incredibly simple and elegant, once I stumbled upon it:
    pass my <code>RecursiveIteratorIterator</code> instance to the
    <code>FilterIterator</code>. 
</p>

<div class="example"><pre><code lang="php">
$rdi      = new RecursiveDirectoryIterator($somePath);
$rii      = new RecursiveIteratorIterator($rdi);
$filtered = new FooFilter($rii);
</code></pre></div>
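
<p>
    The same wiring can be tried without touching the filesystem. This sketch
    swaps in a <code>RecursiveArrayIterator</code> over a nested array; the
    <code>StringLeafFilter</code> class and the sample data are hypothetical,
    and the <code>bool</code> return type assumes PHP 7 or later:
</p>

<div class="example"><pre><code lang="php">
// Hypothetical filter: keeps only string values.
class StringLeafFilter extends FilterIterator
{
    public function accept(): bool
    {
        return is_string($this-&gt;getInnerIterator()-&gt;current());
    }
}

$tree = array(
    'a',
    array('b', 1, array('c', 2)),
    3,
);

$rai = new RecursiveArrayIterator($tree);
// RecursiveIteratorIterator flattens the tree; by default it yields leaves only.
$rii = new RecursiveIteratorIterator($rai);

$strings = iterator_to_array(new StringLeafFilter($rii), false);
// $strings is array('a', 'b', 'c')
</code></pre></div>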
    
<p>
    Really. It was that simple -- but, as noted, non-obvious. It also required a
    slight change within my filter -- instead of using <code>current()</code>,
    I'd need to first pull the "inner" iterator instance:
    <code>$this->getInnerIterator()->current()</code>. I show an example of that
    below when I go over the filter implementation.
</p>

<p>
    As for my criteria, I had several options. I could <code>require_once</code>
    the file, and use the Reflection API to inspect the class to determine if it
    was an interface, abstract class, or class, as well as to determine the
    namespace. However, I couldn't be 100% sure the file would contain a class,
    so this seemed like overkill. That, and horribly non-performant, due to
    using reflection.
</p>

<p>
    The next option was to simply slurp in the file contents into a variable,
    and use regular expressions. I love regular expressions, but in this case,
    it felt like I could possibly end up with some false positives. Also, since
    some of these files could be quite large, I was worried again about
    performance implications -- I don't want to have to wait forever to generate
    these maps.
</p>

<p>
    The solution I went with was to use the <a
        href="http://php.net/tokenizer">tokenizer</a> to inspect the file.
    Tokenizing is incredibly fast, and it's also incredibly simple to analyze
    the tokens.
</p>
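
<p>
    To give a feel for how little code token analysis takes, here is a
    simplified sketch. The <code>firstClassName()</code> helper is hypothetical
    and only looks for the first <code>T_STRING</code> after a
    <code>T_CLASS</code> token; it ignores namespaces and edge cases such as
    <code>Foo::class</code>:
</p>

<div class="example"><pre><code lang="php">
// Hypothetical helper: returns the first declared class name, or null.
function firstClassName($source)
{
    $tokens = token_get_all($source);
    $count  = count($tokens);

    for ($i = 0; $i &lt; $count; $i++) {
        // Single-character tokens are plain strings; skip them.
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_CLASS) {
            continue;
        }

        // The class name is the next T_STRING token after "class".
        for ($j = $i + 1; $j &lt; $count; $j++) {
            if (is_array($tokens[$j]) &amp;&amp; $tokens[$j][0] === T_STRING) {
                return $tokens[$j][1];
            }
        }
    }

    return null;
}

$name = firstClassName('&lt;?php class Baz {}'); // 'Baz'
$none = firstClassName('&lt;?php echo 1;');      // null
</code></pre></div>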

<p>
    I decided to store the detected namespace and classnames as public
    properties of the <code>SplFileInfo</code> objects returned; this makes it
    simple to iterate over the entire collection and utilize that
    information. Also, because I have <code>SplFileInfo</code> objects, I
    already have the paths I need.
</p>

<p>
    My implementation looks like this:
</p>

<div class="example"><pre><code lang="php">
/** @namespace */
namespace Zend\File;

// import SPL classes/interfaces into local scope
use DirectoryIterator,
    FilterIterator,
    RecursiveIterator,
    RecursiveDirectoryIterator,
    RecursiveIteratorIterator;

/**
 * Locate files containing PHP classes, interfaces, or abstracts
 * 
 * @package    Zend_File
 * @license    New BSD {@link http://framework.zend.com/license/new-bsd}
 */
class ClassFileLocater extends FilterIterator
{
    /**
     * Create an instance of the locater iterator
     * 
     * Expects either a directory, or a DirectoryIterator (or its recursive variant) 
     * instance.
     * 
     * @param  string|DirectoryIterator $dirOrIterator 
     * @return void
     */
    public function __construct($dirOrIterator = '.')
    {
        if (is_string($dirOrIterator)) {
            if (!is_dir($dirOrIterator)) {
                throw new InvalidArgumentException('Expected a valid directory name');
            }

            $dirOrIterator = new RecursiveDirectoryIterator($dirOrIterator);
        }
        if (!$dirOrIterator instanceof DirectoryIterator) {
            throw new InvalidArgumentException('Expected a DirectoryIterator');
        }

        if ($dirOrIterator instanceof RecursiveIterator) {
            $iterator = new RecursiveIteratorIterator($dirOrIterator);
        } else {
            $iterator = $dirOrIterator;
        }

        parent::__construct($iterator);
        $this-&gt;rewind();
    }

    /**
     * Filter for files containing PHP classes, interfaces, or abstracts
     * 
     * @return bool
     */
    public function accept()
    {
        $file = $this-&gt;getInnerIterator()-&gt;current();

        // If we somehow have something other than an SplFileInfo object, just 
        // return false
        if (!$file instanceof \SplFileInfo) {
            return false;
        }

        // If we have a directory, it's not a file, so return false
        if (!$file-&gt;isFile()) {
            return false;
        }

        // If not a PHP file, skip
        if ($file-&gt;getBasename('.php') == $file-&gt;getBasename()) {
            return false;
        }

        $contents = file_get_contents($file-&gt;getRealPath());
        $tokens   = token_get_all($contents);
        $count    = count($tokens);
        $i        = 0;
        while ($i &lt; $count) {
            $token = $tokens[$i];

            if (!is_array($token)) {
                // single character token found; skip
                $i++;
                continue;
            }

            list($id, $content, $line) = $token;

            switch ($id) {
                case T_NAMESPACE:
                    // Namespace found; grab it for later
                    $namespace = '';
                    $done      = false;
                    do {
                        ++$i;
                        $token = $tokens[$i];
                        if (is_string($token)) {
                            if (';' === $token) {
                                $done = true;
                            }
                            continue;
                        }
                        list($type, $content, $line) = $token;
                        switch ($type) {
                            case T_STRING:
                            case T_NS_SEPARATOR:
                                $namespace .= $content;
                                break;
                        }
                    } while (!$done &amp;&amp; $i &lt; $count);

                    // Set the namespace of this file in the object
                    $file-&gt;namespace = $namespace;
                    break;
                case T_ABSTRACT:
                case T_CLASS:
                case T_INTERFACE:
                    // Abstract class, class, or interface found

                    // Get the classname
                    $class = '';
                    do {
                        ++$i;
                        $token = $tokens[$i];
                        if (is_string($token)) {
                            continue;
                        }
                        list($type, $content, $line) = $token;
                        switch ($type) {
                            case T_STRING:
                                $class = $content;
                                break;
                        }
                    } while (empty($class) &amp;&amp; $i &lt; $count);

                    // If a classname was found, set it in the object, and 
                    // return boolean true (found)
                    if (!empty($class)) {
                        $file-&gt;classname = $class;
                        return true;
                    }
                    break;
                default:
                    break;
            }
            ++$i;
        }

        // No class-type tokens found; return false
        return false;
    }
}
</code></pre></div>

<p>
    <i>Note: the Exceptions thrown in this class are defined in the same
    namespace; I'll leave how they're implemented to your imagination.</i>
</p>

<h2>Iterating Faster</h2>

<p>
    The next trick I discovered was in the form of
    <code>iterator_apply()</code>. Normally when I use iterators, I use
    <code>foreach</code>, because, well, that's what you do. But in looking
    through the various iterators for this exercise, I stumbled across this gem.
</p>

<p>
    Basically, you pass the iterator, a callback, and argument(s) you want
    passed to the callback. Like <code>FilterIterator</code>, you don't get the
    actual item returned by the iterator, so in most use cases, you pass the
    iterator itself:
</p>

<div class="example"><pre><code lang="php">
iterator_apply($it, $callback, array($it));
</code></pre></div>

<p>
    You can then grab the current value and/or key from the iterator itself:
</p>

<div class="example"><pre><code lang="php">
public function process(Iterator $it)
{
    $value = $it-&gt;current();
    $key   = $it-&gt;key();
    // ...

    // iterator_apply() stops iterating unless the callback returns true
    return true;
}
</code></pre></div>

<p>
    While you can use any valid PHP callback, I found the most interesting
    solution was to use a closure, as it allows you to define everything up
    front:
</p>

<div class="example"><pre><code lang="php">
iterator_apply($it, function() use ($it) {
    $value = $it-&gt;current();
    $key   = $it-&gt;key();
    // ...

    // Return true to continue to the next element
    return true;
});
</code></pre></div>

<p>
    If you pass in a local value via a <code>use</code> statement, you can do
    some aggregation:
</p>

<div class="example"><pre><code lang="php">
$map = new \stdClass;
iterator_apply($it, function() use ($it, $map) {
    $file = $it-&gt;current();
    $namespace = !empty($file-&gt;namespace) ? $file-&gt;namespace . '\\' : '';
    $classname = $namespace . $file-&gt;classname;
    $map-&gt;{$classname} = $file-&gt;getPathname();

    // Return true to continue to the next file
    return true;
});
</code></pre></div>

<p>
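    Not only is this a nice, concise technique, it's also tremendously fast --
    in my tests it ran 200% - 300% faster than a traditional
    <code>foreach</code> loop. One caveat: the callback must return
    <code>true</code> for iteration to continue. If it returns nothing, or any
    other falsy value, <code>iterator_apply()</code> stops after the first
    element, so make sure any benchmark is really visiting every item. Clearly
    it cannot be used in all situations, but if you <em>can</em> use it, you
    probably should.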
</p>
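
<p>
    Here is a self-contained sketch of how the callback's return value controls
    <code>iterator_apply()</code>, using an <code>ArrayIterator</code> with
    arbitrary sample data:
</p>

<div class="example"><pre><code lang="php">
$it   = new ArrayIterator(array('a' =&gt; 1, 'b' =&gt; 2, 'c' =&gt; 3));
$seen = array();

// The iterator is passed to the callback via the third argument.
$visited = iterator_apply($it, function (Iterator $it) use (&amp;$seen) {
    $seen[$it-&gt;key()] = $it-&gt;current();
    return true; // keep iterating
}, array($it));
// $visited is 3, and $seen holds all three key/value pairs

// A callback that returns false stops the iteration after the first element.
$stopped = iterator_apply($it, function () {
    return false;
});
// $stopped is 1
</code></pre></div>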

<p>
    So, start playing with <code>FilterIterator</code> and
    <code>iterator_apply()</code> if you haven't already -- the two offer
    tremendous possibilities and capabilities for your applications.
</p>]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
  </channel>
</rss>
