
Integrating haystack with Django for full-text search using whoosh (2014-09-02 06:32)

Documentation: http://django-haystack.readthedocs.org/en/latest/tutorial.html

The Blog model used for testing:

from django.db import models

from django.contrib import admin

class Blog(models.Model):  
    Title=models.CharField(u'Title',max_length=200,blank=True)  
    Content=models.TextField(u'Content',blank=True)  
    def __unicode__(self):  
        return self.Title  
    class Meta:  
        verbose_name=u"Blog"

admin.site.register(Blog)
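
The search results template further down links to result.object.get_absolute_url, which the test model above does not define. A minimal sketch of such a method, assuming a URL pattern named 'blog_detail' that takes the primary key (both the name and the pattern are assumptions, not part of the original post):

from django.core.urlresolvers import reverse

class Blog(models.Model):
    # ... fields as above ...
    def get_absolute_url(self):
        # 'blog_detail' is an assumed URL pattern name
        return reverse('blog_detail', args=[self.pk])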

Configuration in settings.py:

1. Add haystack to INSTALLED_APPS:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',

    # Added.
    'haystack',

    # Then your usual apps...
    'blog',
]

2. Set PATH to the place on your filesystem where the Whoosh index should be located:

import os
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}

3. Set the template directory:

TEMPLATE_DIRS = (
    os.path.join(os.path.dirname(__file__), 'templates'),
)

Create search_indexes.py in the directory that contains models.py:

from models import Blog  
from haystack import indexes

class BlogIndex(indexes.SearchIndex, indexes.Indexable):  
    text = indexes.CharField(document=True, use_template=True)      
    def get_model(self):  
        return Blog  
    def index_queryset(self, using=None):  
        """Used when the entire index for model is updated."""    
        return self.get_model().objects.all()

4. Create a search directory under the template directory, then:

4.1 Create the file search/indexes/blog/blog_text.txt (the subdirectory must match the app name, which is blog here):

<h2>{{ object.Title }}</h2>  
<p>{{ object.Content }}</p>

4.2 Create search/search.html:

<h2>Search</h2>

<form method="get" action=".">
    <table>
        {{ form.as_table }}
        <tr>
            <td>&nbsp;</td>
            <td>
                <input type="submit" value="Search">
            </td>
        </tr>
    </table>

    {% if query %}
        <h3>Results</h3>

        {% for result in page.object_list %}
            <p>
                <a href="{{ result.object.get_absolute_url }}">{{ result.object.Title }}</a>
            </p>
        {% empty %}
            <p>No results found.</p>
        {% endfor %}

        {% if page.has_previous or page.has_next %}
            <div>
                {% if page.has_previous %}<a href="?q={{ query }}&amp;page={{ page.previous_page_number }}">{% endif %}&laquo; Previous{% if page.has_previous %}</a>{% endif %}
                |
                {% if page.has_next %}<a href="?q={{ query }}&amp;page={{ page.next_page_number }}">{% endif %}Next &raquo;{% if page.has_next %}</a>{% endif %}
            </div>
        {% endif %}
    {% else %}
        {# Show some example queries to run, maybe query syntax, something else? #}
    {% endif %}
</form>

5. Configure the URL:

(r'^search/', include('haystack.urls')),
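
For context, a minimal urls.py using the Django 1.x patterns() style (everything except the search line is just an assumed skeleton) might look like:

from django.conf.urls import patterns, include

urlpatterns = patterns('',
    (r'^search/', include('haystack.urls')),
)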

6. Run ./manage.py rebuild_index to build the index.

Visit http://127.0.0.1:8000/search/ and you can search. By default, however, Chinese is not supported.
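
Besides the /search/ page, the index can also be queried from Python code. A small sketch using haystack's SearchQuerySet (the query string 'django' and the blog.models import path are only examples):

from haystack.query import SearchQuerySet
from blog.models import Blog

# 'content' is haystack's alias for the document field defined in BlogIndex
results = SearchQuerySet().models(Blog).filter(content='django')
for result in results:
    print result.object.Title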

Adding Chinese support

1. Create a Chinese word-segmentation script, chinesetokenizer.py, and copy it into /usr/local/lib/python2.7/dist-packages/haystack/backends:

#-*- coding:utf-8 -*-
import jieba
from whoosh.analysis import Tokenizer,Token 
from whoosh.compat import text_type

class ChineseTokenizer(Tokenizer):  
    def __call__(self, value, positions=False, chars=False,  
                 keeporiginal=False, removestops=True,  
                 start_pos=0, start_char=0, mode='', **kwargs):  
        assert isinstance(value, text_type), "%r is not unicode" % value  
        t = Token(positions, chars, removestops=removestops, mode=mode,  
            **kwargs)  
        seglist = jieba.cut_for_search(value)                     # segment the text with the jieba library
        for w in seglist:  
            t.original = t.text = w  
            t.boost = 1.0  
            if positions:  
                t.pos=start_pos+value.find(w)  
            if chars:  
                t.startchar=start_char+value.find(w)  
                t.endchar=start_char+value.find(w)+len(w)  
            yield t                                               # yield a Token for each segment via the generator

def ChineseAnalyzer():  
    return ChineseTokenizer()
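
A quick sanity check of the tokenizer, run as a standalone script (the sample sentence is arbitrary; jieba must be installed):

analyzer = ChineseAnalyzer()
for token in analyzer(u'全文检索测试'):
    print token.text  # prints each segmented word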

2. Then copy whoosh_backend.py in /usr/local/lib/python2.7/dist-packages/haystack/backends to whoosh_cn_backend.py and switch it to the Chinese analyzer:

...
from chinesetokenizer import ChineseAnalyzer 
...
# in the final else branch of the build_schema() method, change
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(), field_boost=field_class.boost)  
# to
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost)

3. Change HAYSTACK_CONNECTIONS in settings.py to:

HAYSTACK_CONNECTIONS = {  
    'default': {  
        'ENGINE': 'haystack.backends.whoosh_cn_backend.WhooshEngine',  
        'PATH': os.path.join(PROJECT_DIR, 'schema'),  
    },  
}
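
Note that PROJECT_DIR is not defined anywhere above; it is assumed to point at the project directory, for example:

import os
PROJECT_DIR = os.path.dirname(__file__)  # assumed definition near the top of settings.py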

4. Be sure to rebuild the index so Chinese is handled correctly: python manage.py rebuild_index

