DrupalDork.com was shut down on September 18, 2013. This is just a snapshot of the site as it appeared then.

Posts from April 2011

Search Indexing: Published vs. Unpublished

This week, I've been learning a thing or two about Drupal search indexing. On the search settings page (admin/settings/search) for a client's site, the percentage of the site that had been indexed remained really low, even after running cron a few times. The search functionality still seemed to be working though, so I knew something weird was going on.

What I found was that the search_dataset table had well over a million records in it, so indexing was definitely happening. After checking the code used to calculate the percentage shown in the admin, I found that it only counts published nodes when determining how much content has been indexed, but the node module selects from all nodes when choosing a batch to index during a cron run. Since this site had about a thousand published nodes and over 100,000 unpublished nodes (the reason for that is a different story altogether), thousands of the unpublished nodes had been indexed, but not many of the published nodes had.
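To make the mismatch concrete, here is a minimal sketch of the arithmetic, not the actual core code. The helper name example_index_percentage() and the figure of 50 indexed published nodes are assumptions for illustration; only the count of roughly 1,000 published nodes comes from the site described above.

```php
<?php
/**
 * Hypothetical helper (not part of Drupal core): turns a total and a
 * "still to index" count into a whole-number percentage, capped at 100.
 */
function example_index_percentage($total, $remaining) {
  // max(1, $total) avoids a division by zero on an empty site.
  return (int) min(100, 100 * ($total - $remaining) / max(1, $total));
}

// Assumed figures: 1,000 published nodes, only 50 of them indexed so far.
$published_total = 1000;
$published_remaining = 950;

echo example_index_percentage($published_total, $published_remaining) . "%\n";
```

However many unpublished nodes end up in search_dataset, they never enter this calculation, so the reported figure barely moves.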

Now, the reported percentage in the admin doesn't really matter that much. I would like for it to give the client accurate information, but it was more important that a user's search would return all available results. If there are published nodes that have not been indexed, then search results will not be accurate.
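One way to check whether published nodes are still missing from the index is to count them directly. The following is a rough sketch for Drupal 6, assuming a hypothetical helper name (mymodule_published_index_status()) and using the same join on search_dataset as the indexing query; it has to run inside a bootstrapped Drupal 6 site.

```php
<?php
/**
 * Hypothetical diagnostic helper for Drupal 6 (the name is an assumption).
 *
 * Returns the number of published nodes and how many of them still
 * need to be indexed.
 */
function mymodule_published_index_status() {
  // Total published nodes.
  $total = db_result(db_query('SELECT COUNT(*) FROM {node} WHERE status = 1'));

  // Published nodes with no search_dataset row, or flagged for reindexing.
  $remaining = db_result(db_query("SELECT COUNT(*) FROM {node} n
    LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid
    WHERE n.status = 1 AND (d.sid IS NULL OR d.reindex <> 0)"));

  return array('total' => (int) $total, 'remaining' => (int) $remaining);
}
```

If 'remaining' stays well above zero after several cron runs, published content is still waiting behind the unpublished backlog.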

My quick solution was to make sure that published nodes would be indexed during every cron run, instead of just indexing all willy-nilly. I implemented hook_update_index() by replicating node_update_index(), with an extra WHERE condition to restrict it to published nodes:

/**
 * Implementation of hook_update_index().
 *
 * Copy of node_update_index() that only picks up published nodes.
 */
function mymodule_update_index() {
  // Maximum number of nodes to index per cron run.
  $limit = (int) variable_get('search_cron_limit', 100);

  // Store the maximum possible comments per thread (used for ranking by reply count)
  variable_set('node_cron_comments_scale', 1.0 / max(1, db_result(db_query('SELECT MAX(comment_count) FROM {node_comment_statistics}'))));
  variable_set('node_cron_views_scale', 1.0 / max(1, db_result(db_query('SELECT MAX(totalcount) FROM {node_counter}'))));

  // Same query as node_update_index(), plus the n.status = 1 condition.
  $sql = "SELECT n.nid FROM {node} n
    LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid
    WHERE (d.sid IS NULL OR d.reindex <> 0) AND n.status = 1
    ORDER BY d.reindex ASC, n.nid ASC";
  $result = db_query_range($sql, 0, $limit);

  // Index each published node in the batch.
  while ($node = db_fetch_object($result)) {
    _node_index_node($node);
  }
}

With the indexing batch limit set to 100, the site will index up to 200 nodes per cron run: the node module indexes 100 nodes regardless of status, and my module indexes up to 100 published ones.

Note: the site in question is running Drupal 6.20. It looks like this was fixed in D7: the reported percentage accounts for all nodes.

Display Suite in Drupal 7

Display Suite for Drupal 7 included a lot of updates. The big ones are the inclusion of the previously-separate Node Displays and Views Displays modules, and support for more region layouts (the D6 version had a single pre-defined layout).

The feature that's making my life easy today, though, is a little thing: positioning the node post date separately from the node author. This was an option in D6 too, and I knew the module could still do it, but could not for the life of me figure out how. On the Manage Display tab for my content type, my only options were the node body and taxonomy fields.

So, for your reference and mine: Display Suite fields only become available once you choose a layout under "Layout for [node type] in default" on the Manage Display tab.

I have a feeling I'll forget this step in the future, but hopefully I'll remember to look here for the solution!
