DrupalDork.com was shut down on September 18, 2013. This is just a snapshot of the site as it appeared then.

Search Indexing: Published vs. Unpublished

This week, I've learning a thing or two about Drupal search indexing. On the search settings page (admin/settings/search) for a client's site, the percentage of the site that had been indexed remained really low, even after running cron a few times. The search functionality still seemed to be working though, so I knew something weird was going on.

What I found was that the search_dataset table had well over a million records in it, so indexing was definitely happening. After checking the code used to calculate the percentage shown in the admin, I found that it only checks published nodes when determining how much content has been indexed - but, the node module chooses from all nodes when choosig a batch to index during a cron run. Since this site had about a thousand published nodes and over 100,000 unpublished nodes (the reason for that is a different story altogether), thousands of the unpublished nodes had been indexed, but not many of the published nodes had.

Now, the reported percentage in the admin doesn't really matter that much. I would like for it to give the client accurate information, but it was more important that a user's search would return all available results. If there are published nodes that have not been indexed, then search results will not be accurate.

My quick solution was to make sure that published nodes would be indexed during every cron run, instead of just indexing all willy-nilly. I implemented hook_update_index() by replicating node_update_index(), with an extra WHERE condition to restrict it to published nodes:

function mymodule_update_index() {
  $limit = (int)variable_get('search_cron_limit', 100);

  // Store the maximum possible comments per thread (used for ranking by reply count)
  variable_set('node_cron_comments_scale', 1.0 / max(1, db_result(db_query('SELECT MAX(comment_count) FROM {node_comment_statistics}'))));
  variable_set('node_cron_views_scale', 1.0 / max(1, db_result(db_query('SELECT MAX(totalcount) FROM {node_counter}'))));

  $sql = "SELECT n.nid FROM {node} n 
    LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid 
    WHERE (d.sid IS NULL OR d.reindex <> 0) AND n.status=1 
    ORDER BY d.reindex ASC, n.nid ASC";
  $result = db_query_range($sql, 0, $limit);

  while ($node = db_fetch_object($result)) {
    _node_index_node($node);
  }
}   

With the indexing batch limit set to 100, the site will index up to 200 nodes per cron run: the node module does a hundred nodes without concern for status, and my module takes care of 100 that are published.

Note: the site in question is running Drupal 6.20. It looks like this was fixed in D7: the reported percentage accounts for all nodes.