WordPress: Query Posts

Friday, February 6th, 2009

When you are working with WordPress’s query_posts function, the Codex page gives great examples.

However, the one glaring omission is documentation on how to handle values that contain spaces. For example, you might want to query for posts with the tag “Converse”. This is easily achievable with:


# the query terms are not case sensitive, so converse==ConVERsE==Converse

query_posts('showposts=1&tag=converse');

But what if you want to query for a tag that contains spaces, like “Onitsuka Tiger”? Unfortunately, none of the examples on the WordPress page currently provide an answer, and passing the tag to query_posts with its spaces intact returns nothing. And while you might assume urlencode or rawurlencode is the answer, they are not.

The answer is that the spaces must be replaced with hyphens:


# $manufacturer_name holds the tag as it is displayed, e.g. 'Onitsuka Tiger'
$manufacturer_tag = str_replace(' ', '-', $manufacturer_name);

query_posts('showposts=1&tag=' . $manufacturer_tag);
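
To see why the encoding functions miss, here is a rough side-by-side sketch. It is written in Python rather than PHP, on the assumption that Python's `quote_plus` and `quote` behave like PHP's `urlencode` and `rawurlencode` for a simple value such as this one. The variable names are only illustrative. My guess is that the hyphenated form works because the `tag` parameter is matched against the tag's slug, but treat that as an assumption.

```python
from urllib.parse import quote, quote_plus

tag_name = "Onitsuka Tiger"

# Form-style encoding (the analogue of PHP's urlencode): space becomes '+'
plus_encoded = quote_plus(tag_name)

# Percent-encoding (the analogue of PHP's rawurlencode): space becomes '%20'
percent_encoded = quote(tag_name)

# What the post recommends instead: replace each space with a hyphen
hyphenated = tag_name.replace(" ", "-")

print(plus_encoded)     # Onitsuka+Tiger
print(percent_encoded)  # Onitsuka%20Tiger
print(hyphenated)       # Onitsuka-Tiger
```

Neither encoded form contains a hyphen, so neither produces the value that the hyphen replacement does.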

Apart from that omission, the WordPress Codex is still undeniably one of the greatest examples of an open source API.

Web parsing

Sunday, July 27th, 2008

Five web sources need to be parsed and data entries (say search results) need to be extracted. What is the best approach?

One could use regular expressions to work on the data. However, thanks to my experience with jQuery I am more familiar with XPath selectors (which are similar to CSS selectors), so I’ll describe an approach that does not use regular expressions.

How the two (XPath selectors and CSS selectors) relate to each other is covered in this post by John Resig (creator of jQuery).

Here are the steps required to extract the data from the web sources:

  • Start by fetching the page itself, using an HTTP library available in your language (all good languages have one).
  • Once you’ve got the raw HTML as a string, you need to ‘massage’ it into XML. The approach depends on the language: I have found that BeautifulSoup is good for Python, and JTidy might be good for Java.
  • The above libraries will transform your HTML string into a well-formed XML tree structure. By inspecting this tree you can manually identify where the result entries sit and how they repeat. For example, you may find that your XML tree has a snippet like the following:
<tr>
  <td colspan="3">
    <a href="..." class="medium-text" target="_self">
      Experiments on Design Pattern Discovery
    </a>
    <div class="authors">
      Jing Dong, Yajing Zhao
    </div>
  </td>
</tr>
  • In the above example we would create an XPath selector as follows:
//tr/td[contains(@colspan,'3')]
  • This selector would return a list of the elements that matched, with contents like the following:
<a href="..." class="medium-text" target="_self">
  Experiments on Design Pattern Discovery
</a>
<div class="authors">
  Jing Dong, Yajing Zhao
</div>
  • Once you have that list, you can start pulling the little details out of each result entry. To do this you may write custom string parsing functions, for example one that pulls the authors out of the result entry and separates them from its title.
  • Another approach would be to apply Natural Language Processing to the entries. NLP attempts to pick out the different kinds of words and text within a larger body of text. For Python I believe NLTK is appropriate. However, NLP is beyond the scope of this discussion.
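
The selection and string parsing steps above can be sketched in Python as follows. This is a minimal illustration under a few assumptions, not a finished scraper. The fetch and tidy steps are skipped, and `TIDIED_PAGE` is a hypothetical stand-in for their output, built around the snippet shown earlier. The function name `extract_entries` is made up for this example. The standard library's ElementTree supports only a subset of XPath (no `contains()`), so the sketch uses the exact-match form `@colspan='3'`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the well-formed XML produced by the tidy step.
TIDIED_PAGE = """
<table>
  <tr>
    <td colspan="3">
      <a href="..." class="medium-text" target="_self">
        Experiments on Design Pattern Discovery
      </a>
      <div class="authors">
        Jing Dong, Yajing Zhao
      </div>
    </td>
  </tr>
  <tr>
    <td>a cell that is not a result entry</td>
  </tr>
</table>
"""


def extract_entries(xml_text):
    """Return one dict per result entry, with its title and list of authors."""
    root = ET.fromstring(xml_text)
    entries = []
    # Select every result cell using ElementTree's limited XPath syntax.
    for cell in root.findall(".//tr/td[@colspan='3']"):
        link = cell.find("a")
        authors_div = cell.find("div[@class='authors']")
        if link is None or authors_div is None:
            continue  # skip cells that do not look like a result entry
        # Collapse the indentation and newlines around the title text.
        title = " ".join((link.text or "").split())
        # Custom string parsing: split the author list on commas.
        authors = [
            name.strip()
            for name in (authors_div.text or "").split(",")
            if name.strip()
        ]
        entries.append({"title": title, "authors": authors})
    return entries


entries = extract_entries(TIDIED_PAGE)
print(entries)
```

Splitting on commas only works while author names contain no commas themselves. A real source would probably need a more careful rule.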