Indexing Office Documents with Elasticsearch
2018-06-26
Elasticsearch
> Elasticsearch can ingest many types of data through Logstash, but Office documents take some extra work to handle.

# Choosing an Approach

There are several ways to index Office documents:

1. [Ingest Attachment Plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html)
2. [FsCrawler](https://github.com/dadoonet/fscrawler)
3. Writing your own code against a document-parsing SDK

Option 3 is the most flexible but also the most work; option 2 is the simplest but the most restrictive, since it only supports file systems; option 1 is the official plugin and strikes a balance between flexibility and convenience, so it is the approach this article covers.

# Installing the Plugin

Install the plugin directly with:

```
sudo bin/elasticsearch-plugin install ingest-attachment
```

Or bake it into a Docker image (Dockerfile). Note that this image also installs the IK analyzer, which the index template below relies on:

```Dockerfile
ARG ELK_VERSION=6.2.2
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:$ELK_VERSION
ARG ELK_VERSION
RUN ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v$ELK_VERSION/elasticsearch-analysis-ik-$ELK_VERSION.zip && \
    ./bin/elasticsearch-plugin install ingest-attachment
```

# Configuring the Plugin and the ELK Mapping

The Ingest Attachment Plugin parses binary documents through an ingest pipeline. The configuration below defines three processors: the first parses each binary document in the Files array into a string, the second concatenates the extracted strings into the Content field, and the third removes both the original base64-encoded binary data and the temporary extracted strings:

```json
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information from arrays",
  "processors": [
    {
      "foreach": {
        "field": "Files",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.file",
            "field": "_ingest._value.data",
            "indexed_chars": 20971520
          }
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          for (item in ctx.Files) {
            ctx.Content = ctx.Content + item.file.content
          }
        """
      }
    },
    {
      "foreach": {
        "field": "Files",
        "processor": {
          "remove": {
            "field": ["_ingest._value.data", "_ingest._value.file.content"]
          }
        }
      }
    }
  ]
}
```
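Before pointing real data at this pipeline, it can be exercised with the Simulate Pipeline API. Below is a minimal sketch, assuming the pipeline above has been created: the base64 payload is just the UTF-8 string "文档内容" encoded, and Content is seeded with an empty string because the script processor appends to it:

```json
POST _ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": {
        "Content": "",
        "Files": [
          { "data": "5paH5qGj5YaF5a65" }
        ]
      }
    }
  ]
}
```

The response should show Content populated with the extracted text, while the data and file.content fields have been removed.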
The following index template sets up the mapping. Pay special attention to the Content field's `"term_vector": "with_positions_offsets"` setting: for fields longer than 10,000 characters, using a query's highlight feature without a term_vector on the field produces a warning in 6.x and an outright error in 7.x. See [Offsets Strategy](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#offsets-strategy).

```json
PUT /_template/template_1
{
  "index_patterns": ["*"],
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "Content": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "term_vector": "with_positions_offsets",
          "analyzer": "my_custom_analyzer"
        },
        "Title": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "analyzer": "my_custom_analyzer"
        },
        "Description": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}
```
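With the term vectors in place, highlighting works against the extracted text. Here is a minimal sketch of such a query; the index name myindex and the search term are placeholders:

```json
GET myindex/_search
{
  "query": {
    "match": { "Content": "elasticsearch" }
  },
  "highlight": {
    "fields": {
      "Content": {}
    }
  }
}
```

Because Content stores positions and offsets, the highlighter can reuse them instead of re-analyzing a potentially very long stored field, which is exactly what the warning above is about.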
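The custom analyzer can also be checked in isolation with the Analyze API. A quick sketch, assuming an index created after the template was installed:

```json
GET myindex/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>Elasticsearch全文检索</p>"
}
```

The html_strip char filter should drop the `<p>` tags, ik_max_word should segment the Chinese text, and the lowercase filter should fold "Elasticsearch" to "elasticsearch".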
# Base64-Encoding Files and Sending Them to ELK

Here is a C# example using Chilkat:

```c#
// Read the file's bytes and base64-encode them.
Chilkat.FileAccess fac = new Chilkat.FileAccess();
string strBase64 = fac.ReadBinaryToEncoded(file, "base64");
fac.FileClose();

// Attach the encoded blob in the shape the pipeline expects.
itemObj.Files = new dynamic[] { new { data = strBase64 } };
```

The data finally sent to ELK might look like this:

```json
{
  "Files": [
    { "data": "……" },
    { "data": "……" }
  ]
}
```
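One detail worth spelling out: the pipeline only runs when the indexing request names it, for example through the pipeline query parameter. A minimal sketch, where myindex is a placeholder and doc matches the template's mapping type:

```json
PUT myindex/doc/1?pipeline=attachment
{
  "Content": "",
  "Files": [
    { "data": "……" }
  ]
}
```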
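If the Chilkat dependency is unavailable, the encoding can be done with the .NET standard library and the document sent straight to Elasticsearch over HTTP. This is a self-contained sketch, not the post's original code; the file path, index name, and endpoint URL are assumptions:

```c#
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class Indexer
{
    static async Task Main()
    {
        // Base64-encode the file with the standard library (no Chilkat needed).
        byte[] bytes = File.ReadAllBytes("report.docx"); // hypothetical path
        string strBase64 = Convert.ToBase64String(bytes);

        // Shape the document the way the pipeline expects: an empty Content
        // field for the script processor to append into, plus a Files array.
        string json =
            "{ \"Content\": \"\", \"Files\": [ { \"data\": \"" + strBase64 + "\" } ] }";

        // Index through the attachment pipeline; "myindex" is a hypothetical
        // index name and "doc" matches the template's mapping type.
        using (var client = new HttpClient())
        {
            HttpResponseMessage response = await client.PutAsync(
                "http://localhost:9200/myindex/doc/1?pipeline=attachment",
                new StringContent(json, Encoding.UTF8, "application/json"));
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}
```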
# Viewing the Indexed Results

You can check the results from Kibana with:

```
GET index/_search?q=*
```

# References

* [Ingest Attachment Plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html)
* [Term Vector](https://www.elastic.co/guide/en/elasticsearch/reference/current/term-vector.html)
* [Painless](https://www.elastic.co/guide/en/elasticsearch/painless/6.3/painless-lang-spec.html)
* [Ingesting Documents (pdf, word, txt, etc) Into ElasticSearch](https://blog.ambar.cloud/ingesting-documents-pdf-word-txt-etc-into-elasticsearch/)
* [Ingest Node: (re)Indexing and Enriching Docs within ElasticSearch](https://www.youtube.com/watch?v=PEHnBa19Gxs)

This article is the author's original work. Please credit the source when reposting: http://www.supperxin.com