Integrate BACKBEAT SDK and resolve KACHING license validation

Major integrations and fixes:
- Added BACKBEAT SDK integration for P2P operation timing
- Implemented beat-aware status tracking for distributed operations
- Added Docker secrets support for secure license management
- Resolved KACHING license validation via HTTPS/TLS
- Updated docker-compose configuration for clean stack deployment
- Disabled rollback policies to prevent deployment failures
- Added license credential storage (CHORUS-DEV-MULTI-001)

Technical improvements:
- BACKBEAT P2P operation tracking with phase management
- Enhanced configuration system with file-based secrets (see the sketch below)
- Improved error handling for license validation
- Clean separation of KACHING and CHORUS deployment stacks
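The file-based secrets loading follows the usual Docker convention of preferring a `*_FILE` path over a plain environment variable. A minimal sketch of the pattern (the names `KACHING_LICENSE_FILE` / `KACHING_LICENSE_KEY` are illustrative, not necessarily the actual CHORUS configuration keys):

```go
package config

import (
	"os"
	"strings"
)

// licenseKey returns the KACHING license key, preferring a Docker
// secret file (e.g. one mounted under /run/secrets) over a plain
// environment variable. Names here are illustrative.
func licenseKey() (string, error) {
	if path := os.Getenv("KACHING_LICENSE_FILE"); path != "" {
		b, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		return strings.TrimSpace(string(b)), nil
	}
	return os.Getenv("KACHING_LICENSE_KEY"), nil
}
```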

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
anthonyrawlins
2025-09-06 07:56:26 +10:00
parent 543ab216f9
commit 9bdcbe0447
4730 changed files with 1480093 additions and 1916 deletions

vendor/github.com/blevesearch/bleve/v2/.gitignore (generated, vendored)

@@ -0,0 +1,20 @@
#*
*.sublime-*
*~
.#*
.project
.settings
**/.idea/
**/*.iml
.DS_Store
query_string.y.go.tmp
/analysis/token_filters/cld2/cld2-read-only
/analysis/token_filters/cld2/libcld2_full.a
/cmd/bleve/bleve
vendor/**
!vendor/manifest
/y.output
/search/query/y.output
*.test
tags
go.sum

vendor/github.com/blevesearch/bleve/v2/.travis.yml (generated, vendored)

@@ -0,0 +1,25 @@
sudo: false
language: go
go:
- "1.21.x"
- "1.22.x"
- "1.23.x"
script:
- go get golang.org/x/tools/cmd/cover
- go get github.com/mattn/goveralls
- go get github.com/kisielk/errcheck
- go get -u github.com/FiloSottile/gvt
- gvt restore
- go test -race -v $(go list ./... | grep -v vendor/)
- go vet $(go list ./... | grep -v vendor/)
- go test ./test -v -indexType scorch
- errcheck -ignorepkg fmt $(go list ./... | grep -v vendor/);
- scripts/project-code-coverage.sh
- scripts/build_children.sh
notifications:
email:
- fts-team@couchbase.com

vendor/github.com/blevesearch/bleve/v2/CONTRIBUTING.md (generated, vendored)

@@ -0,0 +1,16 @@
# Contributing to Bleve
We look forward to your contributions, but ask that you first review these guidelines.
## Sign the CLA
As Bleve is a Couchbase project, we require contributors to accept the [Couchbase Contributor License Agreement](http://review.couchbase.org/static/individual_agreement.html). To sign this agreement, log in to the Couchbase [code review tool](http://review.couchbase.org/). The Bleve project does not use this code review tool, but it is still used to track acceptance of contributor license agreements.
## Submitting a Pull Request
All types of contributions are welcome, but please keep the following in mind:
- If you're planning a large change, discuss it first in a GitHub issue or on the Google group. This helps avoid duplicate effort and spending time on something that may not be merged.
- Existing tests should continue to pass; new tests for the contribution are nice to have.
- All code should have gone through `go fmt`
- All code should pass `go vet`

vendor/github.com/blevesearch/bleve/v2/LICENSE (generated, vendored)

@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

vendor/github.com/blevesearch/bleve/v2/README.md (generated, vendored)

@@ -0,0 +1,122 @@
# ![bleve](docs/bleve.png) bleve
[![Tests](https://github.com/blevesearch/bleve/actions/workflows/tests.yml/badge.svg?branch=master&event=push)](https://github.com/blevesearch/bleve/actions/workflows/tests.yml?query=event%3Apush+branch%3Amaster)
[![Coverage Status](https://coveralls.io/repos/github/blevesearch/bleve/badge.svg?branch=master)](https://coveralls.io/github/blevesearch/bleve?branch=master)
[![Go Reference](https://pkg.go.dev/badge/github.com/blevesearch/bleve/v2.svg)](https://pkg.go.dev/github.com/blevesearch/bleve/v2)
[![Join the chat](https://badges.gitter.im/join_chat.svg)](https://app.gitter.im/#/room/#blevesearch_bleve:gitter.im)
[![codebeat](https://codebeat.co/badges/38a7cbc9-9cf5-41c0-a315-0746178230f4)](https://codebeat.co/projects/github-com-blevesearch-bleve)
[![Go Report Card](https://goreportcard.com/badge/github.com/blevesearch/bleve/v2)](https://goreportcard.com/report/github.com/blevesearch/bleve/v2)
[![Sourcegraph](https://sourcegraph.com/github.com/blevesearch/bleve/-/badge.svg)](https://sourcegraph.com/github.com/blevesearch/bleve?badge)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
A modern indexing + search library in Go
## Features
* Index any Go data structure or JSON
* Intelligent defaults backed up by powerful configuration ([scorch](https://github.com/blevesearch/bleve/blob/master/index/scorch/README.md))
* Supported field types:
* `text`, `number`, `datetime`, `boolean`, `geopoint`, `geoshape`, `IP`, `vector`
* Supported query types:
* `term`, `phrase`, `match`, `match_phrase`, `prefix`, `regexp`, `wildcard`, `fuzzy`
* term range, numeric range, date range, boolean field
* compound queries: `conjuncts`, `disjuncts`, boolean (`must`/`should`/`must_not`)
* [query string syntax](http://www.blevesearch.com/docs/Query-String-Query/)
* [geo spatial search](https://github.com/blevesearch/bleve/blob/master/geo/README.md)
* approximate k-nearest neighbors via [vector search](https://github.com/blevesearch/bleve/blob/master/docs/vectors.md)
* [synonym search](https://github.com/blevesearch/bleve/blob/master/docs/synonyms.md)
* [tf-idf](https://github.com/blevesearch/bleve/blob/master/docs/scoring.md#tf-idf) / [bm25](https://github.com/blevesearch/bleve/blob/master/docs/scoring.md#bm25) scoring models
* Hybrid search: exact + semantic
* Query time boosting
* Search result match highlighting with document fragments
* Aggregations/faceting support:
* terms facet
* numeric range facet
* date range facet
## Indexing
```go
message := struct {
Id string
From string
Body string
}{
Id: "example",
From: "xyz@couchbase.com",
Body: "bleve indexing is easy",
}
mapping := bleve.NewIndexMapping()
index, err := bleve.New("example.bleve", mapping)
if err != nil {
panic(err)
}
index.Index(message.Id, message)
```
## Querying
```go
index, _ := bleve.Open("example.bleve")
query := bleve.NewQueryStringQuery("bleve")
searchRequest := bleve.NewSearchRequest(query)
searchResult, _ := index.Search(searchRequest)
```
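Compound queries (the `conjuncts`/`disjuncts`/boolean types listed under Features) compose the same way. A minimal sketch, reusing the index above and eliding error handling like the examples:

```go
index, _ := bleve.Open("example.bleve")
conjunction := bleve.NewConjunctionQuery(
    bleve.NewMatchQuery("bleve"),
    bleve.NewMatchQuery("indexing"),
)
searchResult, _ := index.Search(bleve.NewSearchRequest(conjunction))
```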
## Command Line Interface
To install the CLI for the latest release of bleve, run:
```bash
go install github.com/blevesearch/bleve/v2/cmd/bleve@latest
```
```text
$ bleve --help
Bleve is a command-line tool to interact with a bleve index.
Usage:
bleve [command]
Available Commands:
bulk bulk loads from newline delimited JSON files
check checks the contents of the index
count counts the number of documents in the index
create creates a new index
dictionary prints the term dictionary for the specified field in the index
dump dumps the contents of the index
fields lists the fields in this index
help Help about any command
index adds the files to the index
mapping prints the mapping used for this index
query queries the index
registry registry lists the bleve components compiled into this executable
scorch command-line tool to interact with a scorch index
Flags:
-h, --help help for bleve
Use "bleve [command] --help" for more information about a command.
```
## Text Analysis
Bleve includes general-purpose analyzers (customizable) as well as pre-built text analyzers for the following languages:
Arabic (ar), Bulgarian (bg), Catalan (ca), Chinese-Japanese-Korean (cjk), Kurdish (ckb), Danish (da), German (de), Greek (el), English (en), Spanish - Castilian (es), Basque (eu), Persian (fa), Finnish (fi), French (fr), Gaelic (ga), Spanish - Galician (gl), Hindi (hi), Croatian (hr), Hungarian (hu), Armenian (hy), Indonesian (id, in), Italian (it), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Turkish (tr)
## Text Analysis Wizard
[bleveanalysis.couchbase.com](https://bleveanalysis.couchbase.com)
## Discussion/Issues
Discuss usage/development of bleve and/or report issues here:
* [Github issues](https://github.com/blevesearch/bleve/issues)
* [Google group](https://groups.google.com/forum/#!forum/bleve)
## License
Apache License Version 2.0

vendor/github.com/blevesearch/bleve/v2/SECURITY.md (generated, vendored)

@@ -0,0 +1,16 @@
# Security Policy
## Supported Versions
We support the latest release (for example, bleve v2.5.x).
## Reporting a Vulnerability
All security issues for this project should be reported via email to [security@couchbase.com](mailto:security@couchbase.com) and [fts-team@couchbase.com](mailto:fts-team@couchbase.com).
This mail will be delivered to the owners of this project.
- To ensure your report is NOT marked as spam, please include the word "security/vulnerability" along with the project name (blevesearch/bleve) in the subject of the email.
- Please be as descriptive as possible while explaining the issue, and a testcase highlighting the issue is always welcome.
Your email will be acknowledged as soon as possible.


@@ -0,0 +1,41 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package keyword
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/single"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "keyword"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
keywordTokenizer, err := cache.TokenizerNamed(single.Name)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: keywordTokenizer,
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
if err != nil {
panic(err)
}
}
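For context, an analyzer registered under a name like this is consumed by selecting that name in a field mapping. A minimal sketch against the bleve v2 mapping API (the field name "id" is illustrative):

```go
fieldMapping := bleve.NewTextFieldMapping()
fieldMapping.Analyzer = keyword.Name // index the whole field value as a single token

indexMapping := bleve.NewIndexMapping()
indexMapping.DefaultMapping.AddFieldMappingsAt("id", fieldMapping)
```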


@@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package standard
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/lang/en"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "standard"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
tokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
stopEnFilter, err := cache.TokenFilterNamed(en.StopName)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: tokenizer,
TokenFilters: []analysis.TokenFilter{
toLowerFilter,
stopEnFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(Name, AnalyzerConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,67 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package flexible
import (
"fmt"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "flexiblego"
type DateTimeParser struct {
layouts []string
}
func New(layouts []string) *DateTimeParser {
return &DateTimeParser{
layouts: layouts,
}
}
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
for _, layout := range p.layouts {
rv, err := time.Parse(layout, input)
if err == nil {
return rv, layout, nil
}
}
return time.Time{}, "", analysis.ErrInvalidDateTime
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
layouts, ok := config["layouts"].([]interface{})
if !ok {
return nil, fmt.Errorf("must specify layouts")
}
var layoutStrs []string
for _, layout := range layouts {
layoutStr, ok := layout.(string)
if ok {
layoutStrs = append(layoutStrs, layoutStr)
}
}
return New(layoutStrs), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}
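A layout-driven parser like this is normally instantiated through the index mapping rather than directly; the constructor above receives the "layouts" slice from that config. A minimal sketch (the parser name "my_dates" and the layout are illustrative):

```go
indexMapping := bleve.NewIndexMapping()
err := indexMapping.AddCustomDateTimeParser("my_dates", map[string]interface{}{
    "type":    flexible.Name, // "flexiblego"
    "layouts": []interface{}{"2006-01-02 15:04"},
})
if err != nil {
    panic(err)
}
```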


@@ -0,0 +1,50 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package optional
import (
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/datetime/flexible"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "dateTimeOptional"
const rfc3339NoTimezone = "2006-01-02T15:04:05"
const rfc3339NoTimezoneNoT = "2006-01-02 15:04:05"
const rfc3339Offset = "2006-01-02 15:04:05 -0700"
const rfc3339NoTime = "2006-01-02"
var layouts = []string{
time.RFC3339Nano,
time.RFC3339,
rfc3339NoTimezone,
rfc3339NoTimezoneNoT,
rfc3339Offset,
rfc3339NoTime,
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return flexible.New(layouts), nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,55 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package microseconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_micro"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64 / 1000
var maxBound int64 = math.MaxInt64 / 1000
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is microseconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.UnixMicro(timestamp), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,55 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package milliseconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_milli"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64 / 1000000
var maxBound int64 = math.MaxInt64 / 1000000
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is milliseconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.UnixMilli(timestamp), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,55 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package nanoseconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_nano"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64
var maxBound int64 = math.MaxInt64
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is nanoseconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.Unix(0, timestamp), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package seconds
import (
"math"
"strconv"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unix_sec"
type DateTimeParser struct {
}
var minBound int64 = math.MinInt64 / 1000000000
var maxBound int64 = math.MaxInt64 / 1000000000
func (p *DateTimeParser) ParseDateTime(input string) (time.Time, string, error) {
// unix timestamp is seconds since UNIX epoch
timestamp, err := strconv.ParseInt(input, 10, 64)
if err != nil {
return time.Time{}, "", analysis.ErrInvalidTimestampString
}
if timestamp < minBound || timestamp > maxBound {
return time.Time{}, "", analysis.ErrInvalidTimestampRange
}
return time.Unix(timestamp, 0), Name, nil
}
func DateTimeParserConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.DateTimeParser, error) {
return &DateTimeParser{}, nil
}
func init() {
err := registry.RegisterDateTimeParser(Name, DateTimeParserConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,70 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis
import (
index "github.com/blevesearch/bleve_index_api"
)
func TokenFrequency(tokens TokenStream, arrayPositions []uint64, options index.FieldIndexingOptions) index.TokenFrequencies {
rv := make(map[string]*index.TokenFreq, len(tokens))
if options.IncludeTermVectors() {
tls := make([]index.TokenLocation, len(tokens))
tlNext := 0
for _, token := range tokens {
tls[tlNext] = index.TokenLocation{
ArrayPositions: arrayPositions,
Start: token.Start,
End: token.End,
Position: token.Position,
}
curr, ok := rv[string(token.Term)]
if ok {
curr.Locations = append(curr.Locations, &tls[tlNext])
} else {
curr = &index.TokenFreq{
Term: token.Term,
Locations: []*index.TokenLocation{&tls[tlNext]},
}
rv[string(token.Term)] = curr
}
if !options.SkipFreqNorm() {
curr.SetFrequency(curr.Frequency() + 1)
}
tlNext++
}
} else {
for _, token := range tokens {
curr, exists := rv[string(token.Term)]
if !exists {
curr = &index.TokenFreq{
Term: token.Term,
}
rv[string(token.Term)] = curr
}
if !options.SkipFreqNorm() {
curr.SetFrequency(curr.Frequency() + 1)
}
}
}
return rv
}


@@ -0,0 +1,73 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package en implements an analyzer with reasonable defaults for processing
// English text.
//
// It strips possessive suffixes ('s), transforms tokens to lower case,
// removes stopwords from a built-in list, and applies porter stemming.
//
// The built-in stopwords list is defined in EnglishStopWords.
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/token/porter"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)
const AnalyzerName = "en"
func AnalyzerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Analyzer, error) {
tokenizer, err := cache.TokenizerNamed(unicode.Name)
if err != nil {
return nil, err
}
possEnFilter, err := cache.TokenFilterNamed(PossessiveName)
if err != nil {
return nil, err
}
toLowerFilter, err := cache.TokenFilterNamed(lowercase.Name)
if err != nil {
return nil, err
}
stopEnFilter, err := cache.TokenFilterNamed(StopName)
if err != nil {
return nil, err
}
stemmerEnFilter, err := cache.TokenFilterNamed(porter.Name)
if err != nil {
return nil, err
}
rv := analysis.DefaultAnalyzer{
Tokenizer: tokenizer,
TokenFilters: []analysis.TokenFilter{
possEnFilter,
toLowerFilter,
stopEnFilter,
stemmerEnFilter,
},
}
return &rv, nil
}
func init() {
err := registry.RegisterAnalyzer(AnalyzerName, AnalyzerConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,177 @@
/*
This code was ported from the Open Search Project
https://github.com/opensearch-project/OpenSearch/blob/main/modules/analysis-common/src/main/java/org/opensearch/analysis/common/EnglishPluralStemFilter.java
The algorithm itself was created by Mark Harwood
https://github.com/markharwood
*/
/*
* SPDX-License-Identifier: Apache-2.0
*
* The OpenSearch Contributors require contributions made to
* this file be licensed under the Apache-2.0 license or a
* compatible open source license.
*/
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package en
import (
"strings"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const PluralStemmerName = "stemmer_en_plural"
type EnglishPluralStemmerFilter struct {
}
func NewEnglishPluralStemmerFilter() *EnglishPluralStemmerFilter {
return &EnglishPluralStemmerFilter{}
}
func (s *EnglishPluralStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
token.Term = []byte(stem(string(token.Term)))
}
return input
}
func EnglishPluralStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewEnglishPluralStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(PluralStemmerName, EnglishPluralStemmerFilterConstructor)
if err != nil {
panic(err)
}
}
// ----------------------------------------------------------------------------
// Words ending in oes that retain the e when stemmed
var oesExceptions = []string{"shoes", "canoes", "oboes"}
// Words ending in ches that retain the e when stemmed
var chesExceptions = []string{
"cliches",
"avalanches",
"mustaches",
"moustaches",
"quiches",
"headaches",
"heartaches",
"porsches",
"tranches",
"caches",
}
func stem(word string) string {
runes := []rune(strings.ToLower(word))
if len(runes) < 3 || runes[len(runes)-1] != 's' {
return string(runes)
}
switch runes[len(runes)-2] {
case 'u':
fallthrough
case 's':
return string(runes)
case 'e':
// Modified ies->y logic from original s-stemmer - only work on strings > 4
// so spies -> spy still but pies->pie.
// The original code also special-cased aies and eies for no good reason as far as I can tell.
// ( no words of consequence - eg http://www.thefreedictionary.com/words-that-end-in-aies )
if len(runes) > 4 && runes[len(runes)-3] == 'i' {
runes[len(runes)-3] = 'y'
return string(runes[0 : len(runes)-2])
}
// Suffix rules to remove any dangling "e"
if len(runes) > 3 {
// xes (but >1 prefix so we can stem "boxes->box" but keep "axes->axe")
if len(runes) > 4 && runes[len(runes)-3] == 'x' {
return string(runes[0 : len(runes)-2])
}
// oes
if len(runes) > 3 && runes[len(runes)-3] == 'o' {
if isException(runes, oesExceptions) {
// Only remove the S
return string(runes[0 : len(runes)-1])
}
// Remove the es
return string(runes[0 : len(runes)-2])
}
if len(runes) > 4 {
// shes/sses
if runes[len(runes)-4] == 's' && (runes[len(runes)-3] == 'h' || runes[len(runes)-3] == 's') {
return string(runes[0 : len(runes)-2])
}
// ches
if len(runes) > 4 {
if runes[len(runes)-4] == 'c' && runes[len(runes)-3] == 'h' {
if isException(runes, chesExceptions) {
// Only remove the S
return string(runes[0 : len(runes)-1])
}
// Remove the es
return string(runes[0 : len(runes)-2])
}
}
}
}
fallthrough
default:
return string(runes[0 : len(runes)-1])
}
}
func isException(word []rune, exceptions []string) bool {
for _, exception := range exceptions {
exceptionRunes := []rune(exception)
exceptionPos := len(exceptionRunes) - 1
wordPos := len(word) - 1
matched := true
for exceptionPos >= 0 && wordPos >= 0 {
if exceptionRunes[exceptionPos] != word[wordPos] {
matched = false
break
}
exceptionPos--
wordPos--
}
if matched {
return true
}
}
return false
}
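To make the suffix rules above concrete, a few inputs and the outputs the code produces (derived by tracing stem(); the filter mutates tokens in place):

```go
filter := en.NewEnglishPluralStemmerFilter()
tokens := analysis.TokenStream{
    &analysis.Token{Term: []byte("boxes")}, // xes rule      -> "box"
    &analysis.Token{Term: []byte("spies")}, // ies -> y      -> "spy"
    &analysis.Token{Term: []byte("shoes")}, // oes exception -> "shoe"
    &analysis.Token{Term: []byte("pies")},  // len <= 4      -> "pie"
}
tokens = filter.Filter(tokens)
```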


@@ -0,0 +1,70 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"unicode/utf8"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
// PossessiveName is the name PossessiveFilter is registered as
// in the bleve registry.
const PossessiveName = "possessive_en"
const rightSingleQuotationMark = '’'
const apostrophe = '\''
const fullWidthApostrophe = '＇'
const apostropheChars = rightSingleQuotationMark + apostrophe + fullWidthApostrophe
// PossessiveFilter implements a TokenFilter which
// strips the English possessive suffix ('s) from tokens.
// It handles a variety of apostrophe types, is case-insensitive,
// and doesn't distinguish between possessives and contractions
// (i.e., "She's So Rad" becomes "She So Rad").
type PossessiveFilter struct {
}
func NewPossessiveFilter() *PossessiveFilter {
return &PossessiveFilter{}
}
func (s *PossessiveFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
lastRune, lastRuneSize := utf8.DecodeLastRune(token.Term)
if lastRune == 's' || lastRune == 'S' {
nextLastRune, nextLastRuneSize := utf8.DecodeLastRune(token.Term[:len(token.Term)-lastRuneSize])
if nextLastRune == rightSingleQuotationMark ||
nextLastRune == apostrophe ||
nextLastRune == fullWidthApostrophe {
token.Term = token.Term[:len(token.Term)-lastRuneSize-nextLastRuneSize]
}
}
}
return input
}
func PossessiveFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewPossessiveFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(PossessiveName, PossessiveFilterConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,52 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/snowballstem"
"github.com/blevesearch/snowballstem/english"
)
const SnowballStemmerName = "stemmer_en_snowball"
type EnglishStemmerFilter struct {
}
func NewEnglishStemmerFilter() *EnglishStemmerFilter {
return &EnglishStemmerFilter{}
}
func (s *EnglishStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
env := snowballstem.NewEnv(string(token.Term))
english.Stem(env)
token.Term = []byte(env.Current())
}
return input
}
func EnglishStemmerFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewEnglishStemmerFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(SnowballStemmerName, EnglishStemmerFilterConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,36 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/analysis/token/stop"
"github.com/blevesearch/bleve/v2/registry"
)
func StopTokenFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
tokenMap, err := cache.TokenMapNamed(StopName)
if err != nil {
return nil, err
}
return stop.NewStopTokensFilter(tokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(StopName, StopTokenFilterConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,347 @@
package en
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const StopName = "stop_en"
// EnglishStopWords is the built-in list of stopwords used by the "stop_en" TokenFilter.
//
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
// ` was changed to ' to allow for literal string
var EnglishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| An English stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.
| Many of the forms below are quite rare (e.g. "yourselves") but included for
| completeness.
| PRONOUNS FORMS
| 1st person sing
i | subject, always in upper case of course
me | object
my | possessive adjective
| the possessive pronoun 'mine' is best suppressed, because of the
| sense of coal-mine etc.
myself | reflexive
| 1st person plural
we | subject
| us | object
| care is required here because US = United States. It is usually
| safe to remove it if it is in lower case.
our | possessive adjective
ours | possessive pronoun
ourselves | reflexive
| second person (archaic 'thou' forms not included)
you | subject and object
your | possessive adjective
yours | possessive pronoun
yourself | reflexive (singular)
yourselves | reflexive (plural)
| third person singular
he | subject
him | object
his | possessive adjective and pronoun
himself | reflexive
she | subject
her | object and possessive adjective
hers | possessive pronoun
herself | reflexive
it | subject and object
its | possessive adjective
itself | reflexive
| third person plural
they | subject
them | object
their | possessive adjective
theirs | possessive pronoun
themselves | reflexive
| other forms (demonstratives, interrogatives)
what
which
who
whom
this
that
these
those
| VERB FORMS (using F.R. Palmer's nomenclature)
| BE
am | 1st person, present
is | -s form (3rd person, present)
are | present
was | 1st person, past
were | past
be | infinitive
been | past participle
being | -ing form
| HAVE
have | simple
has | -s form
had | past
having | -ing form
| DO
do | simple
does | -s form
did | past
doing | -ing form
| The forms below are, I believe, best omitted, because of the significant
| homonym forms:
| He made a WILL
| old tin CAN
| merry month of MAY
| a smell of MUST
| fight the good fight with all thy MIGHT
| would, could, should, ought might however be included
| | AUXILIARIES
| | WILL
|will
would
| | SHALL
|shall
should
| | CAN
|can
could
| | MAY
|may
|might
| | MUST
|must
| | OUGHT
ought
| COMPOUND FORMS, increasingly encountered nowadays in 'formal' writing
| pronoun + verb
i'm
you're
he's
she's
it's
we're
they're
i've
you've
we've
they've
i'd
you'd
he'd
she'd
we'd
they'd
i'll
you'll
he'll
she'll
we'll
they'll
| verb + negation
isn't
aren't
wasn't
weren't
hasn't
haven't
hadn't
doesn't
don't
didn't
| auxiliary + negation
won't
wouldn't
shan't
shouldn't
can't
cannot
couldn't
mustn't
| miscellaneous forms
let's
that's
who's
what's
here's
there's
when's
where's
why's
how's
| rarer forms
| daren't needn't
| doubtful forms
| oughtn't mightn't
| ARTICLES
a
an
the
| THE REST (Overlap among prepositions, conjunctions, adverbs etc is so
| high, that classification is pointless.)
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
| Just for the record, the following words are among the commonest in English
| one
| every
| least
| less
| many
| now
| ever
| never
| say
| says
| said
| also
| get
| go
| goes
| just
| made
| make
| put
| see
| seen
| whether
| like
| well
| back
| even
| still
| way
| take
| since
| another
| however
| two
| three
| four
| five
| first
| second
| new
| old
| high
| long
`)
func TokenMapConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenMap, error) {
rv := analysis.NewTokenMap()
err := rv.LoadBytes(EnglishStopWords)
return rv, err
}
func init() {
err := registry.RegisterTokenMap(StopName, TokenMapConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,7 @@
# full line comment
marty
steve # trailing comment
| different format of comment
dustin
siri | different style trailing comment
multiple words with different whitespace


@@ -0,0 +1,108 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package lowercase implements a TokenFilter which converts
// tokens to lower case according to unicode rules.
package lowercase
import (
"bytes"
"unicode"
"unicode/utf8"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
// Name is the name used to register LowerCaseFilter in the bleve registry
const Name = "to_lower"
type LowerCaseFilter struct {
}
func NewLowerCaseFilter() *LowerCaseFilter {
return &LowerCaseFilter{}
}
func (f *LowerCaseFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
token.Term = toLowerDeferredCopy(token.Term)
}
return input
}
func LowerCaseFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewLowerCaseFilter(), nil
}
func init() {
err := registry.RegisterTokenFilter(Name, LowerCaseFilterConstructor)
if err != nil {
panic(err)
}
}
// toLowerDeferredCopy will function exactly like
// bytes.ToLower() only it will reuse (overwrite)
// the original byte array when possible
// NOTE: because it's possible that the lower-case
// form of a rune has a different utf-8 encoded
// length, in these cases a new byte array is allocated
func toLowerDeferredCopy(s []byte) []byte {
j := 0
for i := 0; i < len(s); {
wid := 1
r := rune(s[i])
if r >= utf8.RuneSelf {
r, wid = utf8.DecodeRune(s[i:])
}
l := unicode.ToLower(r)
// If the rune is already lowercased, just move to the
// next rune.
if l == r {
i += wid
j += wid
continue
}
// Handles the Unicode edge case where a word-final
// Greek Σ must take its final form, lower-casing to
// 'ς' rather than 'σ'.
if l == 'σ' && i+2 == len(s) {
l = 'ς'
}
lwid := utf8.RuneLen(l)
if lwid > wid {
// utf-8 encoded replacement is wider
// for now, punt and defer
// to bytes.ToLower() for the remainder
// only known to happen with chars
// Rune Ⱥ(570) width 2 - Lower ⱥ(11365) width 3
// Rune Ⱦ(574) width 2 - Lower ⱦ(11366) width 3
rest := bytes.ToLower(s[i:])
rv := make([]byte, j+len(rest))
copy(rv[:j], s[:j])
copy(rv[j:], rest)
return rv
} else {
utf8.EncodeRune(s[j:], l)
}
i += wid
j += lwid
}
return s[:j]
}


@@ -0,0 +1,56 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package porter
import (
"bytes"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/go-porterstemmer"
)
const Name = "stemmer_porter"
type PorterStemmer struct {
}
func NewPorterStemmer() *PorterStemmer {
return &PorterStemmer{}
}
func (s *PorterStemmer) Filter(input analysis.TokenStream) analysis.TokenStream {
for _, token := range input {
// if it is not a protected keyword, stem it
if !token.KeyWord {
termRunes := bytes.Runes(token.Term)
stemmedRunes := porterstemmer.StemWithoutLowerCasing(termRunes)
token.Term = analysis.BuildTermFromRunes(stemmedRunes)
}
}
return input
}
func PorterStemmerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
return NewPorterStemmer(), nil
}
func init() {
err := registry.RegisterTokenFilter(Name, PorterStemmerConstructor)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,73 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package stop implements a TokenFilter removing tokens found in
// a TokenMap.
//
// Its constructor takes the following arguments:
//
// "stop_token_map" (string): the name of the token map identifying tokens to
// remove.
package stop
import (
"fmt"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "stop_tokens"
type StopTokensFilter struct {
stopTokens analysis.TokenMap
}
func NewStopTokensFilter(stopTokens analysis.TokenMap) *StopTokensFilter {
return &StopTokensFilter{
stopTokens: stopTokens,
}
}
func (f *StopTokensFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
j := 0
for _, token := range input {
_, isStopToken := f.stopTokens[string(token.Term)]
if !isStopToken {
input[j] = token
j++
}
}
return input[:j]
}
func StopTokensFilterConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
stopTokenMapName, ok := config["stop_token_map"].(string)
if !ok {
return nil, fmt.Errorf("must specify stop_token_map")
}
stopTokenMap, err := cache.TokenMapNamed(stopTokenMapName)
if err != nil {
return nil, fmt.Errorf("error building stop words filter: %v", err)
}
return NewStopTokensFilter(stopTokenMap), nil
}
func init() {
err := registry.RegisterTokenFilter(Name, StopTokensFilterConstructor)
if err != nil {
panic(err)
}
}
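The "stop_token_map" argument documented above is supplied when wiring a custom filter into an index mapping; the referenced token map must itself be registered first. A minimal sketch (the names "my_stop_words" and "my_stop_filter" are illustrative):

```go
indexMapping := bleve.NewIndexMapping()
_ = indexMapping.AddCustomTokenMap("my_stop_words", map[string]interface{}{
    "type":   "custom",
    "tokens": []interface{}{"foo", "bar"},
})
_ = indexMapping.AddCustomTokenFilter("my_stop_filter", map[string]interface{}{
    "type":           stop.Name, // "stop_tokens"
    "stop_token_map": "my_stop_words",
})
```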


@@ -0,0 +1,52 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package single
import (
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "single"
type SingleTokenTokenizer struct {
}
func NewSingleTokenTokenizer() *SingleTokenTokenizer {
return &SingleTokenTokenizer{}
}
func (t *SingleTokenTokenizer) Tokenize(input []byte) analysis.TokenStream {
return analysis.TokenStream{
&analysis.Token{
Term: input,
Position: 1,
Start: 0,
End: len(input),
Type: analysis.AlphaNumeric,
},
}
}
func SingleTokenTokenizerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Tokenizer, error) {
return NewSingleTokenTokenizer(), nil
}
func init() {
err := registry.RegisterTokenizer(Name, SingleTokenTokenizerConstructor)
if err != nil {
panic(err)
}
}
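
This tokenizer emits the entire input as one token, which is useful for keyword-style fields. A quick sketch (import path assumed from upstream bleve):

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/analysis/tokenizer/single" // assumed upstream path
	)

	func main() {
		ts := single.NewSingleTokenTokenizer().Tokenize([]byte("id-42/part#7"))
		// One token spanning the whole input, punctuation included.
		fmt.Println(len(ts), string(ts[0].Term)) // 1 "id-42/part#7"
	}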


@@ -0,0 +1,134 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package unicode
import (
"github.com/blevesearch/segment"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/registry"
)
const Name = "unicode"
type UnicodeTokenizer struct {
}
func NewUnicodeTokenizer() *UnicodeTokenizer {
return &UnicodeTokenizer{}
}
func (rt *UnicodeTokenizer) Tokenize(input []byte) analysis.TokenStream {
rvx := make([]analysis.TokenStream, 0, 10) // When rv gets full, append to rvx.
rv := make(analysis.TokenStream, 0, 1)
ta := []analysis.Token(nil)
taNext := 0
segmenter := segment.NewWordSegmenterDirect(input)
start := 0
pos := 1
guessRemaining := func(end int) int {
avgSegmentLen := end / (len(rv) + 1)
if avgSegmentLen < 1 {
avgSegmentLen = 1
}
remainingLen := len(input) - end
return remainingLen / avgSegmentLen
}
for segmenter.Segment() {
segmentBytes := segmenter.Bytes()
end := start + len(segmentBytes)
if segmenter.Type() != segment.None {
if taNext >= len(ta) {
remainingSegments := guessRemaining(end)
if remainingSegments > 1000 {
remainingSegments = 1000
}
if remainingSegments < 1 {
remainingSegments = 1
}
ta = make([]analysis.Token, remainingSegments)
taNext = 0
}
token := &ta[taNext]
taNext++
token.Term = segmentBytes
token.Start = start
token.End = end
token.Position = pos
token.Type = convertType(segmenter.Type())
if len(rv) >= cap(rv) { // When rv is full, save it into rvx.
rvx = append(rvx, rv)
rvCap := cap(rv) * 2
if rvCap > 256 {
rvCap = 256
}
rv = make(analysis.TokenStream, 0, rvCap) // Next rv cap is bigger.
}
rv = append(rv, token)
pos++
}
start = end
}
if len(rvx) > 0 {
n := len(rv)
for _, r := range rvx {
n += len(r)
}
rall := make(analysis.TokenStream, 0, n)
for _, r := range rvx {
rall = append(rall, r...)
}
return append(rall, rv...)
}
return rv
}
func UnicodeTokenizerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Tokenizer, error) {
return NewUnicodeTokenizer(), nil
}
func init() {
err := registry.RegisterTokenizer(Name, UnicodeTokenizerConstructor)
if err != nil {
panic(err)
}
}
func convertType(segmentWordType int) analysis.TokenType {
switch segmentWordType {
case segment.Ideo:
return analysis.Ideographic
case segment.Kana:
return analysis.Ideographic
case segment.Number:
return analysis.Numeric
}
return analysis.AlphaNumeric
}
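
A usage sketch of the tokenizer above: whitespace and punctuation segments (segment.None) are dropped, and each surviving segment carries a position, byte offsets, and a converted token type. The import path is an assumption based on upstream bleve:

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode" // assumed upstream path
	)

	func main() {
		for _, tok := range unicode.NewUnicodeTokenizer().Tokenize([]byte("Hello, 世界 42")) {
			// Expect alphanumeric, ideographic, and numeric token types.
			fmt.Printf("%q pos=%d type=%d [%d,%d)\n",
				tok.Term, tok.Position, tok.Type, tok.Start, tok.End)
		}
	}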


@@ -0,0 +1,76 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis
import (
"bufio"
"bytes"
"io"
"os"
"strings"
)
type TokenMap map[string]bool
func NewTokenMap() TokenMap {
return make(TokenMap, 0)
}
// LoadFile reads in a list of tokens from a text file,
// one per line.
// Comments are supported using `#` or `|`
func (t TokenMap) LoadFile(filename string) error {
data, err := os.ReadFile(filename)
if err != nil {
return err
}
return t.LoadBytes(data)
}
// LoadBytes reads in a list of tokens from memory,
// one per line.
// Comments are supported using `#` or `|`
func (t TokenMap) LoadBytes(data []byte) error {
bytesReader := bytes.NewReader(data)
bufioReader := bufio.NewReader(bytesReader)
line, err := bufioReader.ReadString('\n')
for err == nil {
t.LoadLine(line)
line, err = bufioReader.ReadString('\n')
}
// if the err was EOF we still need to process the last value
if err == io.EOF {
t.LoadLine(line)
return nil
}
return err
}
func (t TokenMap) LoadLine(line string) {
// find the start of a comment, if any
startComment := strings.IndexAny(line, "#|")
if startComment >= 0 {
line = line[:startComment]
}
tokens := strings.Fields(line)
for _, token := range tokens {
t.AddToken(token)
}
}
func (t TokenMap) AddToken(token string) {
t[token] = true
}
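
A short sketch of the comment handling in LoadLine: everything after the first `#` or `|` on a line is ignored, and the remaining whitespace-separated fields become tokens.

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/analysis"
	)

	func main() {
		tm := analysis.NewTokenMap()
		_ = tm.LoadBytes([]byte("a\nan # indefinite article\nthe | definite article\n"))
		// Comment text never becomes a token.
		fmt.Println(tm["a"], tm["an"], tm["the"], tm["article"]) // true true true false
	}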

vendor/github.com/blevesearch/bleve/v2/analysis/type.go

@@ -0,0 +1,120 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis
import (
"fmt"
"time"
)
type CharFilter interface {
Filter([]byte) []byte
}
type TokenType int
const (
AlphaNumeric TokenType = iota
Ideographic
Numeric
DateTime
Shingle
Single
Double
Boolean
IP
)
// Token represents one occurrence of a term at a particular location in a
// field.
type Token struct {
// Start specifies the byte offset of the beginning of the term in the
// field.
Start int `json:"start"`
// End specifies the byte offset of the end of the term in the field.
End int `json:"end"`
Term []byte `json:"term"`
// Position specifies the 1-based index of the token in the sequence of
// occurrences of its term in the field.
Position int `json:"position"`
Type TokenType `json:"type"`
KeyWord bool `json:"keyword"`
}
func (t *Token) String() string {
return fmt.Sprintf("Start: %d End: %d Position: %d Token: %s Type: %d", t.Start, t.End, t.Position, string(t.Term), t.Type)
}
type TokenStream []*Token
// A Tokenizer splits an input string into tokens, the usual behaviour being to
// map words to tokens.
type Tokenizer interface {
Tokenize([]byte) TokenStream
}
// A TokenFilter adds, transforms or removes tokens from a token stream.
type TokenFilter interface {
Filter(TokenStream) TokenStream
}
type Analyzer interface {
Analyze([]byte) TokenStream
}
type DefaultAnalyzer struct {
CharFilters []CharFilter
Tokenizer Tokenizer
TokenFilters []TokenFilter
}
func (a *DefaultAnalyzer) Analyze(input []byte) TokenStream {
if a.CharFilters != nil {
for _, cf := range a.CharFilters {
input = cf.Filter(input)
}
}
tokens := a.Tokenizer.Tokenize(input)
if a.TokenFilters != nil {
for _, tf := range a.TokenFilters {
tokens = tf.Filter(tokens)
}
}
return tokens
}
var ErrInvalidDateTime = fmt.Errorf("unable to parse datetime with any of the layouts")
var ErrInvalidTimestampString = fmt.Errorf("unable to parse timestamp string")
var ErrInvalidTimestampRange = fmt.Errorf("timestamp out of range")
type DateTimeParser interface {
ParseDateTime(string) (time.Time, string, error)
}
const SynonymSourceType = "synonym"
type SynonymSourceVisitor func(name string, item SynonymSource) error
type SynonymSource interface {
Analyzer() string
Collection() string
}
type ByteArrayConverter interface {
Convert([]byte) (interface{}, error)
}
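
The DefaultAnalyzer above simply chains char filters, a tokenizer, and token filters. A sketch wiring it up with the components from this diff (the porter/unicode import paths are assumptions from upstream bleve):

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/analysis"
		"github.com/blevesearch/bleve/v2/analysis/token/porter"      // assumed upstream path
		"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode" // assumed upstream path
	)

	func main() {
		a := &analysis.DefaultAnalyzer{
			Tokenizer:    unicode.NewUnicodeTokenizer(),
			TokenFilters: []analysis.TokenFilter{porter.NewPorterStemmer()},
		}
		for _, tok := range a.Analyze([]byte("running quickly")) {
			fmt.Println(string(tok.Term)) // expect "run", "quickli"
		}
	}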


@@ -0,0 +1,92 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package analysis
import (
"bytes"
"unicode/utf8"
)
func DeleteRune(in []rune, pos int) []rune {
if pos >= len(in) {
return in
}
copy(in[pos:], in[pos+1:])
return in[:len(in)-1]
}
func InsertRune(in []rune, pos int, r rune) []rune {
// create a new slice 1 rune larger
rv := make([]rune, len(in)+1)
// copy the characters before the insert pos
copy(rv[0:pos], in[0:pos])
// set the inserted rune
rv[pos] = r
// copy the characters after the insert pos
copy(rv[pos+1:], in[pos:])
return rv
}
// BuildTermFromRunesOptimistic will build a term from the provided runes
// AND optimistically attempt to encode into the provided buffer
// if at any point it appears the buffer is too small, a new buffer is
// allocated and that is used instead
// this should be used in cases where the new term is frequently the same
// length or shorter than the original term (in number of bytes)
func BuildTermFromRunesOptimistic(buf []byte, runes []rune) []byte {
rv := buf
used := 0
for _, r := range runes {
nextLen := utf8.RuneLen(r)
if used+nextLen > len(rv) {
// alloc new buf
buf = make([]byte, len(runes)*utf8.UTFMax)
// copy work we've already done
copy(buf, rv[:used])
rv = buf
}
written := utf8.EncodeRune(rv[used:], r)
used += written
}
return rv[:used]
}
func BuildTermFromRunes(runes []rune) []byte {
return BuildTermFromRunesOptimistic(make([]byte, len(runes)*utf8.UTFMax), runes)
}
func TruncateRunes(input []byte, num int) []byte {
runes := bytes.Runes(input)
runes = runes[:len(runes)-num]
out := BuildTermFromRunes(runes)
return out
}
func RunesEndsWith(input []rune, suffix string) bool {
inputLen := len(input)
suffixRunes := []rune(suffix)
suffixLen := len(suffixRunes)
if suffixLen > inputLen {
return false
}
for i := suffixLen - 1; i >= 0; i-- {
if input[inputLen-(suffixLen-i)] != suffixRunes[i] {
return false
}
}
return true
}
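
A sketch of the rune helpers above. BuildTermFromRunesOptimistic reuses the provided buffer whenever the new term fits, and TruncateRunes counts runes rather than bytes, so multi-byte characters stay intact:

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/analysis"
	)

	func main() {
		// "test" needs no more bytes than the buffer holds, so no allocation.
		buf := []byte("Tests")
		out := analysis.BuildTermFromRunesOptimistic(buf, []rune("test"))
		fmt.Println(string(out), &out[0] == &buf[0]) // "test" true

		// Trims two runes, not two bytes.
		fmt.Println(string(analysis.TruncateRunes([]byte("héllo"), 2))) // "hél"

		fmt.Println(analysis.RunesEndsWith([]rune("indexing"), "ing")) // true
	}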

vendor/github.com/blevesearch/bleve/v2/builder.go

@@ -0,0 +1,94 @@
// Copyright (c) 2019 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package bleve
import (
"fmt"
"github.com/blevesearch/bleve/v2/document"
"github.com/blevesearch/bleve/v2/index/scorch"
"github.com/blevesearch/bleve/v2/mapping"
"github.com/blevesearch/bleve/v2/util"
index "github.com/blevesearch/bleve_index_api"
)
type builderImpl struct {
b index.IndexBuilder
m mapping.IndexMapping
}
func (b *builderImpl) Index(id string, data interface{}) error {
if id == "" {
return ErrorEmptyID
}
doc := document.NewDocument(id)
err := b.m.MapDocument(doc, data)
if err != nil {
return err
}
err = b.b.Index(doc)
return err
}
func (b *builderImpl) Close() error {
return b.b.Close()
}
func newBuilder(path string, mapping mapping.IndexMapping, config map[string]interface{}) (Builder, error) {
if path == "" {
return nil, fmt.Errorf("builder requires path")
}
err := mapping.Validate()
if err != nil {
return nil, err
}
if config == nil {
config = map[string]interface{}{}
}
// the builder does not have an API to interact with internal storage
// however we can pass k/v pairs through the config
mappingBytes, err := util.MarshalJSON(mapping)
if err != nil {
return nil, err
}
config["internal"] = map[string][]byte{
string(mappingInternalKey): mappingBytes,
}
// do not use real config, as these are options for the builder,
// not the resulting index
meta := newIndexMeta(scorch.Name, scorch.Name, map[string]interface{}{})
err = meta.Save(path)
if err != nil {
return nil, err
}
config["path"] = indexStorePath(path)
b, err := scorch.NewBuilder(config)
if err != nil {
return nil, err
}
rv := &builderImpl{
b: b,
m: mapping,
}
return rv, nil
}
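
newBuilder is unexported; upstream bleve v2 exposes it through a NewBuilder wrapper, assumed here, for one-shot offline index construction (no queries until Close):

	package main

	import (
		"log"

		"github.com/blevesearch/bleve/v2"
	)

	func main() {
		// bleve.NewBuilder is assumed from upstream bleve v2; nil config is
		// accepted (newBuilder substitutes an empty map).
		b, err := bleve.NewBuilder("offline.bleve", bleve.NewIndexMapping(), nil)
		if err != nil {
			log.Fatal(err)
		}
		_ = b.Index("doc1", map[string]interface{}{"body": "built offline"})
		if err := b.Close(); err != nil { // flush and finalize segments
			log.Fatal(err)
		}
		idx, err := bleve.Open("offline.bleve") // readable once built
		if err != nil {
			log.Fatal(err)
		}
		defer idx.Close()
	}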

vendor/github.com/blevesearch/bleve/v2/config.go

@@ -0,0 +1,95 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package bleve
import (
"expvar"
"io"
"log"
"time"
"github.com/blevesearch/bleve/v2/index/scorch"
"github.com/blevesearch/bleve/v2/index/upsidedown/store/gtreap"
"github.com/blevesearch/bleve/v2/registry"
"github.com/blevesearch/bleve/v2/search/highlight/highlighter/html"
index "github.com/blevesearch/bleve_index_api"
)
var bleveExpVar = expvar.NewMap("bleve")
type configuration struct {
Cache *registry.Cache
DefaultHighlighter string
DefaultKVStore string
DefaultMemKVStore string
DefaultIndexType string
SlowSearchLogThreshold time.Duration
analysisQueue *index.AnalysisQueue
}
func (c *configuration) SetAnalysisQueueSize(n int) {
if c.analysisQueue != nil {
c.analysisQueue.Close()
}
c.analysisQueue = index.NewAnalysisQueue(n)
}
func (c *configuration) Shutdown() {
c.SetAnalysisQueueSize(0)
}
func newConfiguration() *configuration {
return &configuration{
Cache: registry.NewCache(),
analysisQueue: index.NewAnalysisQueue(4),
}
}
// Config contains library level configuration
var Config *configuration
func init() {
bootStart := time.Now()
// build the default configuration
Config = newConfiguration()
// set the default highlighter
Config.DefaultHighlighter = html.Name
// default kv store
Config.DefaultKVStore = ""
// default mem only kv store
Config.DefaultMemKVStore = gtreap.Name
// default index
Config.DefaultIndexType = scorch.Name
bootDuration := time.Since(bootStart)
bleveExpVar.Add("bootDuration", int64(bootDuration))
indexStats = NewIndexStats()
bleveExpVar.Set("indexes", indexStats)
initDisk()
}
var logger = log.New(io.Discard, "bleve", log.LstdFlags)
// SetLog sets the logger used for logging
// by default log messages are sent to io.Discard
func SetLog(l *log.Logger) {
logger = l
}
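
A small sketch exercising the library-level knobs defined above: redirecting the default-discarded logger and resizing the shared analysis queue (the old queue is closed before the new one is created):

	package main

	import (
		"log"
		"os"

		"github.com/blevesearch/bleve/v2"
	)

	func main() {
		// Route bleve's internal log output (io.Discard by default) to stderr.
		bleve.SetLog(log.New(os.Stderr, "bleve ", log.LstdFlags))

		// Replace the default 4-worker analysis queue with an 8-worker one.
		bleve.Config.SetAnalysisQueueSize(8)
	}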

vendor/github.com/blevesearch/bleve/v2/config_app.go

@@ -0,0 +1,24 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build appengine || appenginevm
// +build appengine appenginevm
package bleve
// in the appengine environment we cannot support disk based indexes
// so we do no extra configuration in this method
func initDisk() {
}

vendor/github.com/blevesearch/bleve/v2/config_disk.go

@@ -0,0 +1,26 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build !appengine && !appenginevm
// +build !appengine,!appenginevm
package bleve
import "github.com/blevesearch/bleve/v2/index/upsidedown/store/boltdb"
// in normal environments we configure boltdb as the default storage
func initDisk() {
// default kv store
Config.DefaultKVStore = boltdb.Name
}

vendor/github.com/blevesearch/bleve/v2/doc.go

@@ -0,0 +1,37 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
/*
Package bleve is a library for indexing and searching text.

Example Opening New Index, Indexing Data

	message := struct {
		Id   string
		From string
		Body string
	}{
		Id:   "example",
		From: "xyz@couchbase.com",
		Body: "bleve indexing is easy",
	}

	mapping := bleve.NewIndexMapping()
	index, _ := bleve.New("example.bleve", mapping)
	index.Index(message.Id, message)

Example Opening Existing Index, Searching Data

	index, _ := bleve.Open("example.bleve")
	query := bleve.NewQueryStringQuery("bleve")
	searchRequest := bleve.NewSearchRequest(query)
	searchResult, _ := index.Search(searchRequest)
*/
package bleve


@@ -0,0 +1,159 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeDocument int
func init() {
var d Document
reflectStaticSizeDocument = int(reflect.TypeOf(d).Size())
}
type Document struct {
id string `json:"id"`
Fields []Field `json:"fields"`
CompositeFields []*CompositeField
StoredFieldsSize uint64
indexed bool
}
func (d *Document) StoredFieldsBytes() uint64 {
return d.StoredFieldsSize
}
func NewDocument(id string) *Document {
return &Document{
id: id,
Fields: make([]Field, 0),
CompositeFields: make([]*CompositeField, 0),
}
}
func NewSynonymDocument(id string) *Document {
return &Document{
id: id,
Fields: make([]Field, 0),
}
}
func (d *Document) Size() int {
sizeInBytes := reflectStaticSizeDocument + size.SizeOfPtr +
len(d.id)
for _, entry := range d.Fields {
sizeInBytes += entry.Size()
}
for _, entry := range d.CompositeFields {
sizeInBytes += entry.Size()
}
return sizeInBytes
}
func (d *Document) AddField(f Field) *Document {
switch f := f.(type) {
case *CompositeField:
d.CompositeFields = append(d.CompositeFields, f)
default:
d.Fields = append(d.Fields, f)
}
return d
}
func (d *Document) GoString() string {
fields := ""
for i, field := range d.Fields {
if i != 0 {
fields += ", "
}
fields += fmt.Sprintf("%#v", field)
}
compositeFields := ""
for i, field := range d.CompositeFields {
if i != 0 {
compositeFields += ", "
}
compositeFields += fmt.Sprintf("%#v", field)
}
return fmt.Sprintf("&document.Document{ID:%s, Fields: %s, CompositeFields: %s}", d.ID(), fields, compositeFields)
}
func (d *Document) NumPlainTextBytes() uint64 {
rv := uint64(0)
for _, field := range d.Fields {
rv += field.NumPlainTextBytes()
}
for _, compositeField := range d.CompositeFields {
for _, field := range d.Fields {
if compositeField.includesField(field.Name()) {
rv += field.NumPlainTextBytes()
}
}
}
return rv
}
func (d *Document) ID() string {
return d.id
}
func (d *Document) SetID(id string) {
d.id = id
}
func (d *Document) AddIDField() {
d.AddField(NewTextFieldCustom("_id", nil, []byte(d.ID()), index.IndexField|index.StoreField, nil))
}
func (d *Document) VisitFields(visitor index.FieldVisitor) {
for _, f := range d.Fields {
visitor(f)
}
}
func (d *Document) VisitComposite(visitor index.CompositeFieldVisitor) {
for _, f := range d.CompositeFields {
visitor(f)
}
}
func (d *Document) HasComposite() bool {
return len(d.CompositeFields) > 0
}
func (d *Document) VisitSynonymFields(visitor index.SynonymFieldVisitor) {
for _, f := range d.Fields {
if sf, ok := f.(index.SynonymField); ok {
visitor(sf)
}
}
}
func (d *Document) SetIndexed() {
d.indexed = true
}
func (d *Document) Indexed() bool {
return d.indexed
}
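
A sketch of assembling a Document from fields and walking it with a visitor. NewTextField is assumed from upstream bleve's document package (this diff only shows NewTextFieldCustom being used):

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/document"
		index "github.com/blevesearch/bleve_index_api"
	)

	func main() {
		doc := document.NewDocument("user-1")
		doc.AddField(document.NewTextField("name", nil, []byte("Ada"))) // assumed constructor
		doc.AddField(document.NewNumericField("age", nil, 36))
		doc.AddIDField() // stored "_id" text field mirroring doc.ID()

		doc.VisitFields(func(f index.Field) {
			fmt.Println(f.Name(), f.NumPlainTextBytes())
		})
		fmt.Println(doc.ID(), doc.NumPlainTextBytes())
	}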


@@ -0,0 +1,45 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
index "github.com/blevesearch/bleve_index_api"
)
type Field interface {
// Name returns the path of the field from the root DocumentMapping.
// A root field path is "field", a subdocument field is "parent.field".
Name() string
// ArrayPositions returns the intermediate document and field indices
// required to resolve the field value in the document. For example, if the
// field path is "doc1.doc2.field" where doc1 and doc2 are slices or
// arrays, ArrayPositions returns 2 indices used to resolve "doc2" value in
// "doc1", then "field" in "doc2".
ArrayPositions() []uint64
Options() index.FieldIndexingOptions
Analyze()
Value() []byte
// NumPlainTextBytes should return the number of plain text bytes
// that this field represents - this is a common metric for tracking
// the rate of indexing
NumPlainTextBytes() uint64
Size() int
EncodedFieldType() byte
AnalyzedLength() int
AnalyzedTokenFrequencies() index.TokenFrequencies
}


@@ -0,0 +1,142 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeBooleanField int
func init() {
var f BooleanField
reflectStaticSizeBooleanField = int(reflect.TypeOf(f).Size())
}
const DefaultBooleanIndexingOptions = index.StoreField | index.IndexField | index.DocValues
type BooleanField struct {
name string
arrayPositions []uint64
options index.FieldIndexingOptions
value []byte
numPlainTextBytes uint64
length int
frequencies index.TokenFrequencies
}
func (b *BooleanField) Size() int {
var freqSize int
if b.frequencies != nil {
freqSize = b.frequencies.Size()
}
return reflectStaticSizeBooleanField + size.SizeOfPtr +
len(b.name) +
len(b.arrayPositions)*size.SizeOfUint64 +
len(b.value) +
freqSize
}
func (b *BooleanField) Name() string {
return b.name
}
func (b *BooleanField) ArrayPositions() []uint64 {
return b.arrayPositions
}
func (b *BooleanField) Options() index.FieldIndexingOptions {
return b.options
}
func (b *BooleanField) Analyze() {
tokens := make(analysis.TokenStream, 0)
tokens = append(tokens, &analysis.Token{
Start: 0,
End: len(b.value),
Term: b.value,
Position: 1,
Type: analysis.Boolean,
})
b.length = len(tokens)
b.frequencies = analysis.TokenFrequency(tokens, b.arrayPositions, b.options)
}
func (b *BooleanField) Value() []byte {
return b.value
}
func (b *BooleanField) Boolean() (bool, error) {
if len(b.value) == 1 {
return b.value[0] == 'T', nil
}
return false, fmt.Errorf("boolean field has %d bytes", len(b.value))
}
func (b *BooleanField) GoString() string {
return fmt.Sprintf("&document.BooleanField{Name:%s, Options: %s, Value: %s}", b.name, b.options, b.value)
}
func (b *BooleanField) NumPlainTextBytes() uint64 {
return b.numPlainTextBytes
}
func (b *BooleanField) EncodedFieldType() byte {
return 'b'
}
func (b *BooleanField) AnalyzedLength() int {
return b.length
}
func (b *BooleanField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return b.frequencies
}
func NewBooleanFieldFromBytes(name string, arrayPositions []uint64, value []byte) *BooleanField {
return &BooleanField{
name: name,
arrayPositions: arrayPositions,
value: value,
options: DefaultBooleanIndexingOptions,
numPlainTextBytes: uint64(len(value)),
}
}
func NewBooleanField(name string, arrayPositions []uint64, b bool) *BooleanField {
return NewBooleanFieldWithIndexingOptions(name, arrayPositions, b, DefaultBooleanIndexingOptions)
}
func NewBooleanFieldWithIndexingOptions(name string, arrayPositions []uint64, b bool, options index.FieldIndexingOptions) *BooleanField {
numPlainTextBytes := 5
v := []byte("F")
if b {
numPlainTextBytes = 4
v = []byte("T")
}
return &BooleanField{
name: name,
arrayPositions: arrayPositions,
value: v,
options: options,
numPlainTextBytes: uint64(numPlainTextBytes),
}
}
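
A quick sketch of the single-byte 'T'/'F' encoding above and the one-token analysis it produces:

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/document"
	)

	func main() {
		f := document.NewBooleanField("published", nil, true)
		f.Analyze() // emits a single token "T" of type analysis.Boolean
		v, err := f.Boolean()
		fmt.Println(v, err, f.AnalyzedLength()) // true <nil> 1
	}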


@@ -0,0 +1,138 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"reflect"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeCompositeField int
func init() {
var cf CompositeField
reflectStaticSizeCompositeField = int(reflect.TypeOf(cf).Size())
}
const DefaultCompositeIndexingOptions = index.IndexField
type CompositeField struct {
name string
includedFields map[string]bool
excludedFields map[string]bool
defaultInclude bool
options index.FieldIndexingOptions
totalLength int
compositeFrequencies index.TokenFrequencies
}
func NewCompositeField(name string, defaultInclude bool, include []string, exclude []string) *CompositeField {
return NewCompositeFieldWithIndexingOptions(name, defaultInclude, include, exclude, DefaultCompositeIndexingOptions)
}
func NewCompositeFieldWithIndexingOptions(name string, defaultInclude bool, include []string, exclude []string, options index.FieldIndexingOptions) *CompositeField {
rv := &CompositeField{
name: name,
options: options,
defaultInclude: defaultInclude,
includedFields: make(map[string]bool, len(include)),
excludedFields: make(map[string]bool, len(exclude)),
compositeFrequencies: make(index.TokenFrequencies),
}
for _, i := range include {
rv.includedFields[i] = true
}
for _, e := range exclude {
rv.excludedFields[e] = true
}
return rv
}
func (c *CompositeField) Size() int {
sizeInBytes := reflectStaticSizeCompositeField + size.SizeOfPtr +
len(c.name)
for k := range c.includedFields {
sizeInBytes += size.SizeOfString + len(k) + size.SizeOfBool
}
for k := range c.excludedFields {
sizeInBytes += size.SizeOfString + len(k) + size.SizeOfBool
}
if c.compositeFrequencies != nil {
sizeInBytes += c.compositeFrequencies.Size()
}
return sizeInBytes
}
func (c *CompositeField) Name() string {
return c.name
}
func (c *CompositeField) ArrayPositions() []uint64 {
return []uint64{}
}
func (c *CompositeField) Options() index.FieldIndexingOptions {
return c.options
}
func (c *CompositeField) Analyze() {
}
func (c *CompositeField) Value() []byte {
return []byte{}
}
func (c *CompositeField) NumPlainTextBytes() uint64 {
return 0
}
func (c *CompositeField) includesField(field string) bool {
shouldInclude := c.defaultInclude
_, fieldShouldBeIncluded := c.includedFields[field]
if fieldShouldBeIncluded {
shouldInclude = true
}
_, fieldShouldBeExcluded := c.excludedFields[field]
if fieldShouldBeExcluded {
shouldInclude = false
}
return shouldInclude
}
func (c *CompositeField) Compose(field string, length int, freq index.TokenFrequencies) {
if c.includesField(field) {
c.totalLength += length
c.compositeFrequencies.MergeAll(field, freq)
}
}
func (c *CompositeField) EncodedFieldType() byte {
return 'c'
}
func (c *CompositeField) AnalyzedLength() int {
return c.totalLength
}
func (c *CompositeField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return c.compositeFrequencies
}
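
A sketch of the include/exclude logic above, building an "_all"-style composite that merges frequencies from every field except an excluded one:

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/document"
	)

	func main() {
		// defaultInclude=true, no explicit includes, exclude "secret".
		all := document.NewCompositeField("_all", true, nil, []string{"secret"})

		pub := document.NewBooleanField("published", nil, true)
		pub.Analyze()
		all.Compose(pub.Name(), pub.AnalyzedLength(), pub.AnalyzedTokenFrequencies())

		sec := document.NewBooleanField("secret", nil, true)
		sec.Analyze()
		// Excluded field: Compose is a no-op here.
		all.Compose(sec.Name(), sec.AnalyzedLength(), sec.AnalyzedTokenFrequencies())

		fmt.Println(all.AnalyzedLength()) // 1 — only "published" contributed
	}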


@@ -0,0 +1,202 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"bytes"
"fmt"
"math"
"reflect"
"time"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/numeric"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var dateTimeValueSeperator = []byte{'\xff'}
var reflectStaticSizeDateTimeField int
func init() {
var f DateTimeField
reflectStaticSizeDateTimeField = int(reflect.TypeOf(f).Size())
}
const DefaultDateTimeIndexingOptions = index.StoreField | index.IndexField | index.DocValues
const DefaultDateTimePrecisionStep uint = 4
var MinTimeRepresentable = time.Unix(0, math.MinInt64)
var MaxTimeRepresentable = time.Unix(0, math.MaxInt64)
type DateTimeField struct {
name string
arrayPositions []uint64
options index.FieldIndexingOptions
value numeric.PrefixCoded
numPlainTextBytes uint64
length int
frequencies index.TokenFrequencies
}
func (n *DateTimeField) Size() int {
var freqSize int
if n.frequencies != nil {
freqSize = n.frequencies.Size()
}
return reflectStaticSizeDateTimeField + size.SizeOfPtr +
len(n.name) +
len(n.arrayPositions)*size.SizeOfUint64 +
len(n.value) +
freqSize
}
func (n *DateTimeField) Name() string {
return n.name
}
func (n *DateTimeField) ArrayPositions() []uint64 {
return n.arrayPositions
}
func (n *DateTimeField) Options() index.FieldIndexingOptions {
return n.options
}
func (n *DateTimeField) EncodedFieldType() byte {
return 'd'
}
func (n *DateTimeField) AnalyzedLength() int {
return n.length
}
func (n *DateTimeField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return n.frequencies
}
// split the value into the prefix coded date and the layout
// using the dateTimeValueSeperator as the split point
func (n *DateTimeField) splitValue() (numeric.PrefixCoded, string) {
parts := bytes.SplitN(n.value, dateTimeValueSeperator, 2)
if len(parts) == 1 {
return numeric.PrefixCoded(parts[0]), ""
}
return numeric.PrefixCoded(parts[0]), string(parts[1])
}
func (n *DateTimeField) Analyze() {
valueWithoutLayout, _ := n.splitValue()
tokens := make(analysis.TokenStream, 0)
tokens = append(tokens, &analysis.Token{
Start: 0,
End: len(valueWithoutLayout),
Term: valueWithoutLayout,
Position: 1,
Type: analysis.DateTime,
})
original, err := valueWithoutLayout.Int64()
if err == nil {
shift := DefaultDateTimePrecisionStep
for shift < 64 {
shiftEncoded, err := numeric.NewPrefixCodedInt64(original, shift)
if err != nil {
break
}
token := analysis.Token{
Start: 0,
End: len(shiftEncoded),
Term: shiftEncoded,
Position: 1,
Type: analysis.DateTime,
}
tokens = append(tokens, &token)
shift += DefaultDateTimePrecisionStep
}
}
n.length = len(tokens)
n.frequencies = analysis.TokenFrequency(tokens, n.arrayPositions, n.options)
}
func (n *DateTimeField) Value() []byte {
return n.value
}
func (n *DateTimeField) DateTime() (time.Time, string, error) {
date, layout := n.splitValue()
i64, err := date.Int64()
if err != nil {
return time.Time{}, "", err
}
return time.Unix(0, i64).UTC(), layout, nil
}
func (n *DateTimeField) GoString() string {
return fmt.Sprintf("&document.DateField{Name:%s, Options: %s, Value: %s}", n.name, n.options, n.value)
}
func (n *DateTimeField) NumPlainTextBytes() uint64 {
return n.numPlainTextBytes
}
func NewDateTimeFieldFromBytes(name string, arrayPositions []uint64, value []byte) *DateTimeField {
return &DateTimeField{
name: name,
arrayPositions: arrayPositions,
value: value,
options: DefaultDateTimeIndexingOptions,
numPlainTextBytes: uint64(len(value)),
}
}
func NewDateTimeField(name string, arrayPositions []uint64, dt time.Time, layout string) (*DateTimeField, error) {
return NewDateTimeFieldWithIndexingOptions(name, arrayPositions, dt, layout, DefaultDateTimeIndexingOptions)
}
func NewDateTimeFieldWithIndexingOptions(name string, arrayPositions []uint64, dt time.Time, layout string, options index.FieldIndexingOptions) (*DateTimeField, error) {
if canRepresent(dt) {
dtInt64 := dt.UnixNano()
prefixCoded := numeric.MustNewPrefixCodedInt64(dtInt64, 0)
// The prefixCoded value is combined with the layout.
// This is necessary because the storage layer stores a fields value as a byte slice
// without storing extra information like layout. So by making value = prefixCoded + layout,
// both pieces of information are stored in the byte slice.
// During a query, the layout is extracted from the byte slice stored to correctly
// format the prefixCoded value.
valueWithLayout := append(prefixCoded, dateTimeValueSeperator...)
valueWithLayout = append(valueWithLayout, []byte(layout)...)
return &DateTimeField{
name: name,
arrayPositions: arrayPositions,
value: valueWithLayout,
options: options,
// not correct, just a place holder until we revisit how fields are
// represented and can fix this better
numPlainTextBytes: uint64(8),
}, nil
}
return nil, fmt.Errorf("cannot represent %s in this type", dt)
}
func canRepresent(dt time.Time) bool {
if dt.Before(MinTimeRepresentable) || dt.After(MaxTimeRepresentable) {
return false
}
return true
}
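
A sketch of the value+layout encoding described in the comments above: the stored bytes are the prefix-coded nanoseconds, the 0xff separator, then the layout string, and DateTime() recovers both halves:

	package main

	import (
		"fmt"
		"time"

		"github.com/blevesearch/bleve/v2/document"
	)

	func main() {
		dt := time.Date(2024, 6, 1, 12, 0, 0, 0, time.UTC)
		f, err := document.NewDateTimeField("created", nil, dt, time.RFC3339)
		if err != nil {
			panic(err) // only fails outside the representable int64-nanosecond range
		}
		got, layout, _ := f.DateTime()
		fmt.Println(got.Equal(dt), layout == time.RFC3339) // true true
	}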


@@ -0,0 +1,199 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/geo"
"github.com/blevesearch/bleve/v2/numeric"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeGeoPointField int
func init() {
var f GeoPointField
reflectStaticSizeGeoPointField = int(reflect.TypeOf(f).Size())
}
var GeoPrecisionStep uint = 9
type GeoPointField struct {
name string
arrayPositions []uint64
options index.FieldIndexingOptions
value numeric.PrefixCoded
numPlainTextBytes uint64
length int
frequencies index.TokenFrequencies
spatialplugin index.SpatialAnalyzerPlugin
}
func (n *GeoPointField) Size() int {
var freqSize int
if n.frequencies != nil {
freqSize = n.frequencies.Size()
}
return reflectStaticSizeGeoPointField + size.SizeOfPtr +
len(n.name) +
len(n.arrayPositions)*size.SizeOfUint64 +
len(n.value) +
freqSize
}
func (n *GeoPointField) Name() string {
return n.name
}
func (n *GeoPointField) ArrayPositions() []uint64 {
return n.arrayPositions
}
func (n *GeoPointField) Options() index.FieldIndexingOptions {
return n.options
}
func (n *GeoPointField) EncodedFieldType() byte {
return 'g'
}
func (n *GeoPointField) AnalyzedLength() int {
return n.length
}
func (n *GeoPointField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return n.frequencies
}
func (n *GeoPointField) Analyze() {
tokens := make(analysis.TokenStream, 0, 8)
tokens = append(tokens, &analysis.Token{
Start: 0,
End: len(n.value),
Term: n.value,
Position: 1,
Type: analysis.Numeric,
})
if n.spatialplugin != nil {
lat, _ := n.Lat()
lon, _ := n.Lon()
p := &geo.Point{Lat: lat, Lon: lon}
terms := n.spatialplugin.GetIndexTokens(p)
for _, term := range terms {
token := analysis.Token{
Start: 0,
End: len(term),
Term: []byte(term),
Position: 1,
Type: analysis.AlphaNumeric,
}
tokens = append(tokens, &token)
}
} else {
original, err := n.value.Int64()
if err == nil {
shift := GeoPrecisionStep
for shift < 64 {
shiftEncoded, err := numeric.NewPrefixCodedInt64(original, shift)
if err != nil {
break
}
token := analysis.Token{
Start: 0,
End: len(shiftEncoded),
Term: shiftEncoded,
Position: 1,
Type: analysis.Numeric,
}
tokens = append(tokens, &token)
shift += GeoPrecisionStep
}
}
}
n.length = len(tokens)
n.frequencies = analysis.TokenFrequency(tokens, n.arrayPositions, n.options)
}
func (n *GeoPointField) Value() []byte {
return n.value
}
func (n *GeoPointField) Lon() (float64, error) {
i64, err := n.value.Int64()
if err != nil {
return 0.0, err
}
return geo.MortonUnhashLon(uint64(i64)), nil
}
func (n *GeoPointField) Lat() (float64, error) {
i64, err := n.value.Int64()
if err != nil {
return 0.0, err
}
return geo.MortonUnhashLat(uint64(i64)), nil
}
func (n *GeoPointField) GoString() string {
return fmt.Sprintf("&document.GeoPointField{Name:%s, Options: %s, Value: %s}", n.name, n.options, n.value)
}
func (n *GeoPointField) NumPlainTextBytes() uint64 {
return n.numPlainTextBytes
}
func NewGeoPointFieldFromBytes(name string, arrayPositions []uint64, value []byte) *GeoPointField {
return &GeoPointField{
name: name,
arrayPositions: arrayPositions,
value: value,
options: DefaultNumericIndexingOptions,
numPlainTextBytes: uint64(len(value)),
}
}
func NewGeoPointField(name string, arrayPositions []uint64, lon, lat float64) *GeoPointField {
return NewGeoPointFieldWithIndexingOptions(name, arrayPositions, lon, lat, DefaultNumericIndexingOptions)
}
func NewGeoPointFieldWithIndexingOptions(name string, arrayPositions []uint64, lon, lat float64, options index.FieldIndexingOptions) *GeoPointField {
mhash := geo.MortonHash(lon, lat)
prefixCoded := numeric.MustNewPrefixCodedInt64(int64(mhash), 0)
return &GeoPointField{
name: name,
arrayPositions: arrayPositions,
value: prefixCoded,
options: options,
// not correct, just a place holder until we revisit how fields are
// represented and can fix this better
numPlainTextBytes: uint64(8),
}
}
// SetSpatialAnalyzerPlugin implements the
// index.TokenisableSpatialField interface.
func (n *GeoPointField) SetSpatialAnalyzerPlugin(
plugin index.SpatialAnalyzerPlugin) {
n.spatialplugin = plugin
}
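
A sketch of the Morton-hash round trip above; the hash quantizes each coordinate, so recovered values are close to, but not bit-identical with, the inputs:

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/document"
	)

	func main() {
		// Note the lon, lat argument order.
		f := document.NewGeoPointField("loc", nil, -122.39, 37.78)
		lat, _ := f.Lat()
		lon, _ := f.Lon()
		fmt.Printf("%.4f %.4f\n", lat, lon) // approximately 37.7800 -122.3900
	}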


@@ -0,0 +1,265 @@
// Copyright (c) 2022 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/geo"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
"github.com/blevesearch/geo/geojson"
)
var reflectStaticSizeGeoShapeField int
func init() {
var f GeoShapeField
reflectStaticSizeGeoShapeField = int(reflect.TypeOf(f).Size())
}
const DefaultGeoShapeIndexingOptions = index.IndexField | index.DocValues
type GeoShapeField struct {
name string
shape index.GeoJSON
arrayPositions []uint64
options index.FieldIndexingOptions
numPlainTextBytes uint64
length int
encodedValue []byte
value []byte
frequencies index.TokenFrequencies
}
func (n *GeoShapeField) Size() int {
var freqSize int
if n.frequencies != nil {
freqSize = n.frequencies.Size()
}
return reflectStaticSizeGeoShapeField + size.SizeOfPtr +
len(n.name) +
len(n.arrayPositions)*size.SizeOfUint64 +
len(n.encodedValue) +
len(n.value) +
freqSize
}
func (n *GeoShapeField) Name() string {
return n.name
}
func (n *GeoShapeField) ArrayPositions() []uint64 {
return n.arrayPositions
}
func (n *GeoShapeField) Options() index.FieldIndexingOptions {
return n.options
}
func (n *GeoShapeField) EncodedFieldType() byte {
return 's'
}
func (n *GeoShapeField) AnalyzedLength() int {
return n.length
}
func (n *GeoShapeField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return n.frequencies
}
func (n *GeoShapeField) Analyze() {
// compute the bytes representation for the coordinates
tokens := make(analysis.TokenStream, 0)
rti := geo.GetSpatialAnalyzerPlugin("s2")
terms := rti.GetIndexTokens(n.shape)
for _, term := range terms {
token := analysis.Token{
Start: 0,
End: len(term),
Term: []byte(term),
Position: 1,
Type: analysis.AlphaNumeric,
}
tokens = append(tokens, &token)
}
n.length = len(tokens)
n.frequencies = analysis.TokenFrequency(tokens, n.arrayPositions, n.options)
}
func (n *GeoShapeField) Value() []byte {
return n.value
}
func (n *GeoShapeField) GoString() string {
return fmt.Sprintf("&document.GeoShapeField{Name:%s, Options: %s, Value: %s}",
n.name, n.options, n.value)
}
func (n *GeoShapeField) NumPlainTextBytes() uint64 {
return n.numPlainTextBytes
}
func (n *GeoShapeField) EncodedShape() []byte {
return n.encodedValue
}
func NewGeoShapeField(name string, arrayPositions []uint64,
coordinates [][][][]float64, typ string) *GeoShapeField {
return NewGeoShapeFieldWithIndexingOptions(name, arrayPositions,
coordinates, typ, DefaultGeoShapeIndexingOptions)
}
func NewGeoShapeFieldFromBytes(name string, arrayPositions []uint64,
value []byte) *GeoShapeField {
return &GeoShapeField{
name: name,
arrayPositions: arrayPositions,
value: value,
options: DefaultGeoShapeIndexingOptions,
numPlainTextBytes: uint64(len(value)),
}
}
func NewGeoShapeFieldWithIndexingOptions(name string, arrayPositions []uint64,
coordinates [][][][]float64, typ string,
options index.FieldIndexingOptions) *GeoShapeField {
shape := &geojson.GeoShape{
Coordinates: coordinates,
Type: typ,
}
return NewGeoShapeFieldFromShapeWithIndexingOptions(name,
arrayPositions, shape, options)
}
func NewGeoShapeFieldFromShapeWithIndexingOptions(name string, arrayPositions []uint64,
geoShape *geojson.GeoShape, options index.FieldIndexingOptions) *GeoShapeField {
var shape index.GeoJSON
var encodedValue []byte
var err error
if geoShape.Type == geo.CircleType {
shape, encodedValue, err = geo.NewGeoCircleShape(geoShape.Center, geoShape.Radius)
} else {
shape, encodedValue, err = geo.NewGeoJsonShape(geoShape.Coordinates, geoShape.Type)
}
if err != nil {
return nil
}
	// extra glue bytes to keep the term splitting logic from interfering with
	// the custom encoding of the geoshape coordinates inside the docvalues.
encodedValue = append(geo.GlueBytes, append(encodedValue, geo.GlueBytes...)...)
// get the byte value for the geoshape.
value, err := shape.Value()
if err != nil {
return nil
}
// docvalues are always enabled for geoshape fields, even if the
// indexing options are set to not include docvalues.
options = options | index.DocValues
return &GeoShapeField{
shape: shape,
name: name,
arrayPositions: arrayPositions,
options: options,
encodedValue: encodedValue,
value: value,
numPlainTextBytes: uint64(len(value)),
}
}
func NewGeometryCollectionFieldWithIndexingOptions(name string,
arrayPositions []uint64, coordinates [][][][][]float64, types []string,
options index.FieldIndexingOptions) *GeoShapeField {
if len(coordinates) != len(types) {
return nil
}
shapes := make([]*geojson.GeoShape, len(types))
for i := range coordinates {
shapes[i] = &geojson.GeoShape{
Coordinates: coordinates[i],
Type: types[i],
}
}
return NewGeometryCollectionFieldFromShapesWithIndexingOptions(name,
arrayPositions, shapes, options)
}
func NewGeometryCollectionFieldFromShapesWithIndexingOptions(name string,
arrayPositions []uint64, geoShapes []*geojson.GeoShape,
options index.FieldIndexingOptions) *GeoShapeField {
shape, encodedValue, err := geo.NewGeometryCollectionFromShapes(geoShapes)
if err != nil {
return nil
}
	// extra glue bytes to keep the term splitting logic from interfering with
	// the custom encoding of the geoshape coordinates inside the docvalues.
encodedValue = append(geo.GlueBytes, append(encodedValue, geo.GlueBytes...)...)
// get the byte value for the geometryCollection.
value, err := shape.Value()
if err != nil {
return nil
}
// docvalues are always enabled for geoshape fields, even if the
// indexing options are set to not include docvalues.
options = options | index.DocValues
return &GeoShapeField{
shape: shape,
name: name,
arrayPositions: arrayPositions,
options: options,
encodedValue: encodedValue,
value: value,
numPlainTextBytes: uint64(len(value)),
}
}
func NewGeoCircleFieldWithIndexingOptions(name string, arrayPositions []uint64,
centerPoint []float64, radius string,
options index.FieldIndexingOptions) *GeoShapeField {
shape := &geojson.GeoShape{
Center: centerPoint,
Radius: radius,
Type: geo.CircleType,
}
return NewGeoShapeFieldFromShapeWithIndexingOptions(name,
arrayPositions, shape, options)
}
// GeoShape is an implementation of the index.GeoShapeField interface.
func (n *GeoShapeField) GeoShape() (index.GeoJSON, error) {
return geojson.ParseGeoJSONShape(n.value)
}


@@ -0,0 +1,137 @@
// Copyright (c) 2021 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"net"
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeIPField int
func init() {
var f IPField
reflectStaticSizeIPField = int(reflect.TypeOf(f).Size())
}
const DefaultIPIndexingOptions = index.StoreField | index.IndexField | index.DocValues
type IPField struct {
name string
arrayPositions []uint64
options index.FieldIndexingOptions
value net.IP
numPlainTextBytes uint64
length int
frequencies index.TokenFrequencies
}
func (b *IPField) Size() int {
var freqSize int
if b.frequencies != nil {
freqSize = b.frequencies.Size()
}
return reflectStaticSizeIPField + size.SizeOfPtr +
len(b.name) +
len(b.arrayPositions)*size.SizeOfUint64 +
len(b.value) +
freqSize
}
func (b *IPField) Name() string {
return b.name
}
func (b *IPField) ArrayPositions() []uint64 {
return b.arrayPositions
}
func (b *IPField) Options() index.FieldIndexingOptions {
return b.options
}
func (n *IPField) EncodedFieldType() byte {
return 'i'
}
func (n *IPField) AnalyzedLength() int {
return n.length
}
func (n *IPField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return n.frequencies
}
func (b *IPField) Analyze() {
tokens := analysis.TokenStream{
&analysis.Token{
Start: 0,
End: len(b.value),
Term: b.value,
Position: 1,
Type: analysis.IP,
},
}
b.length = 1
b.frequencies = analysis.TokenFrequency(tokens, b.arrayPositions, b.options)
}
func (b *IPField) Value() []byte {
return b.value
}
func (b *IPField) IP() (net.IP, error) {
return net.IP(b.value), nil
}
func (b *IPField) GoString() string {
return fmt.Sprintf("&document.IPField{Name:%s, Options: %s, Value: %s}", b.name, b.options, net.IP(b.value))
}
func (b *IPField) NumPlainTextBytes() uint64 {
return b.numPlainTextBytes
}
func NewIPFieldFromBytes(name string, arrayPositions []uint64, value []byte) *IPField {
return &IPField{
name: name,
arrayPositions: arrayPositions,
value: value,
options: DefaultIPIndexingOptions,
numPlainTextBytes: uint64(len(value)),
}
}
func NewIPField(name string, arrayPositions []uint64, v net.IP) *IPField {
return NewIPFieldWithIndexingOptions(name, arrayPositions, v, DefaultIPIndexingOptions)
}
func NewIPFieldWithIndexingOptions(name string, arrayPositions []uint64, b net.IP, options index.FieldIndexingOptions) *IPField {
v := b.To16()
return &IPField{
name: name,
arrayPositions: arrayPositions,
value: v,
options: options,
numPlainTextBytes: net.IPv6len,
}
}
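
A sketch of the To16 normalization above: IPv4 addresses are stored in their 16-byte form, so every IPField value has a uniform length:

	package main

	import (
		"fmt"
		"net"

		"github.com/blevesearch/bleve/v2/document"
	)

	func main() {
		f := document.NewIPField("addr", nil, net.ParseIP("192.168.0.1"))
		ip, _ := f.IP()
		fmt.Println(len(f.Value()), ip.String()) // 16 192.168.0.1
	}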


@@ -0,0 +1,165 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/numeric"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeNumericField int
func init() {
var f NumericField
reflectStaticSizeNumericField = int(reflect.TypeOf(f).Size())
}
const DefaultNumericIndexingOptions = index.StoreField | index.IndexField | index.DocValues
const DefaultPrecisionStep uint = 4
type NumericField struct {
name string
arrayPositions []uint64
options index.FieldIndexingOptions
value numeric.PrefixCoded
numPlainTextBytes uint64
length int
frequencies index.TokenFrequencies
}
func (n *NumericField) Size() int {
var freqSize int
if n.frequencies != nil {
freqSize = n.frequencies.Size()
}
return reflectStaticSizeNumericField + size.SizeOfPtr +
len(n.name) +
len(n.arrayPositions)*size.SizeOfUint64 +
len(n.value) +
freqSize
}
func (n *NumericField) Name() string {
return n.name
}
func (n *NumericField) ArrayPositions() []uint64 {
return n.arrayPositions
}
func (n *NumericField) Options() index.FieldIndexingOptions {
return n.options
}
func (n *NumericField) EncodedFieldType() byte {
return 'n'
}
func (n *NumericField) AnalyzedLength() int {
return n.length
}
func (n *NumericField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return n.frequencies
}
func (n *NumericField) Analyze() {
tokens := make(analysis.TokenStream, 0)
tokens = append(tokens, &analysis.Token{
Start: 0,
End: len(n.value),
Term: n.value,
Position: 1,
Type: analysis.Numeric,
})
original, err := n.value.Int64()
if err == nil {
shift := DefaultPrecisionStep
for shift < 64 {
shiftEncoded, err := numeric.NewPrefixCodedInt64(original, shift)
if err != nil {
break
}
token := analysis.Token{
Start: 0,
End: len(shiftEncoded),
Term: shiftEncoded,
Position: 1,
Type: analysis.Numeric,
}
tokens = append(tokens, &token)
shift += DefaultPrecisionStep
}
}
n.length = len(tokens)
n.frequencies = analysis.TokenFrequency(tokens, n.arrayPositions, n.options)
}
func (n *NumericField) Value() []byte {
return n.value
}
func (n *NumericField) Number() (float64, error) {
i64, err := n.value.Int64()
if err != nil {
return 0.0, err
}
return numeric.Int64ToFloat64(i64), nil
}
func (n *NumericField) GoString() string {
return fmt.Sprintf("&document.NumericField{Name:%s, Options: %s, Value: %s}", n.name, n.options, n.value)
}
func (n *NumericField) NumPlainTextBytes() uint64 {
return n.numPlainTextBytes
}
func NewNumericFieldFromBytes(name string, arrayPositions []uint64, value []byte) *NumericField {
return &NumericField{
name: name,
arrayPositions: arrayPositions,
value: value,
options: DefaultNumericIndexingOptions,
numPlainTextBytes: uint64(len(value)),
}
}
func NewNumericField(name string, arrayPositions []uint64, number float64) *NumericField {
return NewNumericFieldWithIndexingOptions(name, arrayPositions, number, DefaultNumericIndexingOptions)
}
func NewNumericFieldWithIndexingOptions(name string, arrayPositions []uint64, number float64, options index.FieldIndexingOptions) *NumericField {
numberInt64 := numeric.Float64ToInt64(number)
prefixCoded := numeric.MustNewPrefixCodedInt64(numberInt64, 0)
return &NumericField{
name: name,
arrayPositions: arrayPositions,
value: prefixCoded,
options: options,
// not correct, just a place holder until we revisit how fields are
// represented and can fix this better
numPlainTextBytes: uint64(8),
}
}
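
A sketch of the trie-style term expansion in Analyze above: one full-precision prefix-coded token plus one token per 4-bit shift (4, 8, ..., 60), and a lossless float64 round trip through Number():

	package main

	import (
		"fmt"

		"github.com/blevesearch/bleve/v2/document"
	)

	func main() {
		f := document.NewNumericField("price", nil, 12.5)
		n, _ := f.Number() // decodes the prefix-coded int64 back to float64
		f.Analyze()
		fmt.Println(n, f.AnalyzedLength()) // 12.5 16
	}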


@@ -0,0 +1,149 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeSynonymField int
func init() {
var f SynonymField
reflectStaticSizeSynonymField = int(reflect.TypeOf(f).Size())
}
const DefaultSynonymIndexingOptions = index.IndexField
type SynonymField struct {
name string
analyzer analysis.Analyzer
options index.FieldIndexingOptions
input []string
synonyms []string
numPlainTextBytes uint64
// populated during analysis
synonymMap map[string][]string
}
func (s *SynonymField) Size() int {
return reflectStaticSizeSynonymField + size.SizeOfPtr +
len(s.name)
}
func (s *SynonymField) Name() string {
return s.name
}
func (s *SynonymField) ArrayPositions() []uint64 {
return nil
}
func (s *SynonymField) Options() index.FieldIndexingOptions {
return s.options
}
func (s *SynonymField) NumPlainTextBytes() uint64 {
return s.numPlainTextBytes
}
func (s *SynonymField) AnalyzedLength() int {
return 0
}
func (s *SynonymField) EncodedFieldType() byte {
return 'y'
}
func (s *SynonymField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return nil
}
func (s *SynonymField) Analyze() {
var analyzedInput []string
if len(s.input) > 0 {
analyzedInput = make([]string, 0, len(s.input))
for _, term := range s.input {
analyzedTerm := analyzeSynonymTerm(term, s.analyzer)
if analyzedTerm != "" {
analyzedInput = append(analyzedInput, analyzedTerm)
}
}
}
analyzedSynonyms := make([]string, 0, len(s.synonyms))
for _, syn := range s.synonyms {
analyzedTerm := analyzeSynonymTerm(syn, s.analyzer)
if analyzedTerm != "" {
analyzedSynonyms = append(analyzedSynonyms, analyzedTerm)
}
}
s.synonymMap = processSynonymData(analyzedInput, analyzedSynonyms)
}
func (s *SynonymField) Value() []byte {
return nil
}
func (s *SynonymField) IterateSynonyms(visitor func(term string, synonyms []string)) {
for term, synonyms := range s.synonymMap {
visitor(term, synonyms)
}
}
func NewSynonymField(name string, analyzer analysis.Analyzer, input []string, synonyms []string) *SynonymField {
return &SynonymField{
name: name,
analyzer: analyzer,
options: DefaultSynonymIndexingOptions,
input: input,
synonyms: synonyms,
}
}
func processSynonymData(input []string, synonyms []string) map[string][]string {
var synonymMap map[string][]string
if len(input) > 0 {
// Map each term to the same list of synonyms.
synonymMap = make(map[string][]string, len(input))
for _, term := range input {
synonymMap[term] = synonyms
}
} else {
synonymMap = make(map[string][]string, len(synonyms))
// Precompute a map where each synonym points to all other synonyms.
for i, elem := range synonyms {
synonymMap[elem] = make([]string, 0, len(synonyms)-1)
for j, otherElem := range synonyms {
if i != j {
synonymMap[elem] = append(synonymMap[elem], otherElem)
}
}
}
}
return synonymMap
}
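
// Illustrative examples (added for clarity, not upstream commentary):
//
//	processSynonymData([]string{"usa"}, []string{"united states", "america"})
//	  => {"usa": ["united states", "america"]}
//	processSynonymData(nil, []string{"a", "b", "c"})
//	  => {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

// analyzeSynonymTerm runs the analyzer over a synonym entry and keeps it
// only when it reduces to exactly one token; entries that analyze to zero
// or multiple tokens are dropped.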
func analyzeSynonymTerm(term string, analyzer analysis.Analyzer) string {
tokenStream := analyzer.Analyze([]byte(term))
if len(tokenStream) == 1 {
return string(tokenStream[0].Term)
}
return ""
}


@@ -0,0 +1,162 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/analysis"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeTextField int
func init() {
var f TextField
reflectStaticSizeTextField = int(reflect.TypeOf(f).Size())
}
const DefaultTextIndexingOptions = index.IndexField | index.DocValues
type TextField struct {
name string
arrayPositions []uint64
options index.FieldIndexingOptions
analyzer analysis.Analyzer
value []byte
numPlainTextBytes uint64
length int
frequencies index.TokenFrequencies
}
func (t *TextField) Size() int {
var freqSize int
if t.frequencies != nil {
freqSize = t.frequencies.Size()
}
return reflectStaticSizeTextField + size.SizeOfPtr +
len(t.name) +
len(t.arrayPositions)*size.SizeOfUint64 +
len(t.value) +
freqSize
}
func (t *TextField) Name() string {
return t.name
}
func (t *TextField) ArrayPositions() []uint64 {
return t.arrayPositions
}
func (t *TextField) Options() index.FieldIndexingOptions {
return t.options
}
func (t *TextField) EncodedFieldType() byte {
return 't'
}
func (t *TextField) AnalyzedLength() int {
return t.length
}
func (t *TextField) AnalyzedTokenFrequencies() index.TokenFrequencies {
return t.frequencies
}
func (t *TextField) Analyze() {
var tokens analysis.TokenStream
if t.analyzer != nil {
bytesToAnalyze := t.Value()
if t.options.IsStored() {
		// need to copy: token filters may mutate the bytes in place, and
		// the stored field value must remain unmodified
bytesCopied := make([]byte, len(bytesToAnalyze))
copy(bytesCopied, bytesToAnalyze)
bytesToAnalyze = bytesCopied
}
tokens = t.analyzer.Analyze(bytesToAnalyze)
} else {
tokens = analysis.TokenStream{
&analysis.Token{
Start: 0,
End: len(t.value),
Term: t.value,
Position: 1,
Type: analysis.AlphaNumeric,
},
}
}
t.length = len(tokens) // number of tokens in this doc field
t.frequencies = analysis.TokenFrequency(tokens, t.arrayPositions, t.options)
}
func (t *TextField) Analyzer() analysis.Analyzer {
return t.analyzer
}
func (t *TextField) Value() []byte {
return t.value
}
func (t *TextField) Text() string {
return string(t.value)
}
func (t *TextField) GoString() string {
return fmt.Sprintf("&document.TextField{Name:%s, Options: %s, Analyzer: %v, Value: %s, ArrayPositions: %v}", t.name, t.options, t.analyzer, t.value, t.arrayPositions)
}
func (t *TextField) NumPlainTextBytes() uint64 {
return t.numPlainTextBytes
}
func NewTextField(name string, arrayPositions []uint64, value []byte) *TextField {
return NewTextFieldWithIndexingOptions(name, arrayPositions, value, DefaultTextIndexingOptions)
}
func NewTextFieldWithIndexingOptions(name string, arrayPositions []uint64, value []byte, options index.FieldIndexingOptions) *TextField {
return &TextField{
name: name,
arrayPositions: arrayPositions,
options: options,
value: value,
numPlainTextBytes: uint64(len(value)),
}
}
func NewTextFieldWithAnalyzer(name string, arrayPositions []uint64, value []byte, analyzer analysis.Analyzer) *TextField {
return &TextField{
name: name,
arrayPositions: arrayPositions,
options: DefaultTextIndexingOptions,
analyzer: analyzer,
value: value,
numPlainTextBytes: uint64(len(value)),
}
}
func NewTextFieldCustom(name string, arrayPositions []uint64, value []byte, options index.FieldIndexingOptions, analyzer analysis.Analyzer) *TextField {
return &TextField{
name: name,
arrayPositions: arrayPositions,
options: options,
analyzer: analyzer,
value: value,
numPlainTextBytes: uint64(len(value)),
}
}


@@ -0,0 +1,146 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package document
import (
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeVectorField int
func init() {
var f VectorField
reflectStaticSizeVectorField = int(reflect.TypeOf(f).Size())
}
const DefaultVectorIndexingOptions = index.IndexField
type VectorField struct {
name string
dims int // Dimensionality of the vector
similarity string // Similarity metric to use for scoring
options index.FieldIndexingOptions
value []float32
numPlainTextBytes uint64
vectorIndexOptimizedFor string // Optimization applied to this index.
}
func (n *VectorField) Size() int {
return reflectStaticSizeVectorField + size.SizeOfPtr +
len(n.name) +
len(n.similarity) +
len(n.vectorIndexOptimizedFor) +
int(numBytesFloat32s(n.value))
}
func (n *VectorField) Name() string {
return n.name
}
func (n *VectorField) ArrayPositions() []uint64 {
return nil
}
func (n *VectorField) Options() index.FieldIndexingOptions {
return n.options
}
func (n *VectorField) NumPlainTextBytes() uint64 {
return n.numPlainTextBytes
}
func (n *VectorField) AnalyzedLength() int {
// vectors aren't analyzed
return 0
}
func (n *VectorField) EncodedFieldType() byte {
return 'v'
}
func (n *VectorField) AnalyzedTokenFrequencies() index.TokenFrequencies {
// vectors aren't analyzed
return nil
}
func (n *VectorField) Analyze() {
// vectors aren't analyzed
}
func (n *VectorField) Value() []byte {
return nil
}
func (n *VectorField) GoString() string {
return fmt.Sprintf("&document.VectorField{Name:%s, Options: %s, "+
"Value: %+v}", n.name, n.options, n.value)
}
// For the sake of not polluting the API, we are keeping arrayPositions as a
// parameter, but it is not used.
func NewVectorField(name string, arrayPositions []uint64,
vector []float32, dims int, similarity, vectorIndexOptimizedFor string) *VectorField {
return NewVectorFieldWithIndexingOptions(name, arrayPositions,
vector, dims, similarity, vectorIndexOptimizedFor,
DefaultVectorIndexingOptions)
}
// For the sake of not polluting the API, we are keeping arrayPositions as a
// parameter, but it is not used.
func NewVectorFieldWithIndexingOptions(name string, arrayPositions []uint64,
vector []float32, dims int, similarity, vectorIndexOptimizedFor string,
options index.FieldIndexingOptions) *VectorField {
return &VectorField{
name: name,
dims: dims,
similarity: similarity,
options: options,
value: vector,
numPlainTextBytes: numBytesFloat32s(vector),
vectorIndexOptimizedFor: vectorIndexOptimizedFor,
}
}
func numBytesFloat32s(value []float32) uint64 {
return uint64(len(value) * size.SizeOfFloat32)
}
// -----------------------------------------------------------------------------
// Following methods help in implementing the bleve_index_api's VectorField
// interface.
func (n *VectorField) Vector() []float32 {
return n.value
}
func (n *VectorField) Dims() int {
return n.dims
}
func (n *VectorField) Similarity() string {
return n.similarity
}
func (n *VectorField) IndexOptimizedFor() string {
return n.vectorIndexOptimizedFor
}


@@ -0,0 +1,163 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package document
import (
"encoding/base64"
"encoding/binary"
"fmt"
"math"
"reflect"
"github.com/blevesearch/bleve/v2/size"
"github.com/blevesearch/bleve/v2/util"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeVectorBase64Field int
func init() {
var f VectorBase64Field
reflectStaticSizeVectorBase64Field = int(reflect.TypeOf(f).Size())
}
type VectorBase64Field struct {
vectorField *VectorField
base64Encoding string
}
func (n *VectorBase64Field) Size() int {
var vecFieldSize int
if n.vectorField != nil {
vecFieldSize = n.vectorField.Size()
}
return reflectStaticSizeVectorBase64Field + size.SizeOfPtr +
len(n.base64Encoding) +
vecFieldSize
}
func (n *VectorBase64Field) Name() string {
return n.vectorField.Name()
}
func (n *VectorBase64Field) ArrayPositions() []uint64 {
return n.vectorField.ArrayPositions()
}
func (n *VectorBase64Field) Options() index.FieldIndexingOptions {
return n.vectorField.Options()
}
func (n *VectorBase64Field) NumPlainTextBytes() uint64 {
return n.vectorField.NumPlainTextBytes()
}
func (n *VectorBase64Field) AnalyzedLength() int {
return n.vectorField.AnalyzedLength()
}
func (n *VectorBase64Field) EncodedFieldType() byte {
return 'e'
}
func (n *VectorBase64Field) AnalyzedTokenFrequencies() index.TokenFrequencies {
return n.vectorField.AnalyzedTokenFrequencies()
}
func (n *VectorBase64Field) Analyze() {
}
func (n *VectorBase64Field) Value() []byte {
return n.vectorField.Value()
}
func (n *VectorBase64Field) GoString() string {
return fmt.Sprintf("&document.vectorFieldBase64Field{Name:%s, Options: %s, "+
"Value: %+v}", n.vectorField.Name(), n.vectorField.Options(), n.vectorField.Value())
}
// For the sake of not polluting the API, we are keeping arrayPositions as a
// parameter, but it is not used.
func NewVectorBase64Field(name string, arrayPositions []uint64, vectorBase64 string,
dims int, similarity, vectorIndexOptimizedFor string) (*VectorBase64Field, error) {
decodedVector, err := DecodeVector(vectorBase64)
if err != nil {
return nil, err
}
return &VectorBase64Field{
vectorField: NewVectorFieldWithIndexingOptions(name, arrayPositions,
decodedVector, dims, similarity,
vectorIndexOptimizedFor, DefaultVectorIndexingOptions),
base64Encoding: vectorBase64,
}, nil
}
// This function takes a base64 encoded string and decodes it into
// a vector.
func DecodeVector(encodedValue string) ([]float32, error) {
// We first decode the encoded string into a byte array.
decodedString, err := base64.StdEncoding.DecodeString(encodedValue)
if err != nil {
return nil, err
}
	// The length of the decoded byte array must be divisible by 4, since
	// each float32 occupies 4 bytes.
if len(decodedString)%size.SizeOfFloat32 != 0 {
return nil, fmt.Errorf("decoded byte array not divisible by %d", size.SizeOfFloat32)
}
dims := int(len(decodedString) / size.SizeOfFloat32)
if dims <= 0 {
return nil, fmt.Errorf("unable to decode encoded vector")
}
decodedVector := make([]float32, dims)
	// We iterate through the array 4 bytes at a time, converting each
	// chunk to a float32 value read in little-endian byte order.
for i := 0; i < dims; i++ {
bytes := decodedString[i*size.SizeOfFloat32 : (i+1)*size.SizeOfFloat32]
entry := math.Float32frombits(binary.LittleEndian.Uint32(bytes))
if !util.IsValidFloat32(float64(entry)) {
return nil, fmt.Errorf("invalid float32 value: %f", entry)
}
decodedVector[i] = entry
}
return decodedVector, nil
}
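
// EncodeVector is an illustrative sketch of the inverse operation (it is
// not part of the upstream API): it packs each float32 in little-endian
// byte order and base64-encodes the result, producing a string that
// DecodeVector accepts.
func EncodeVector(vector []float32) string {
	buf := make([]byte, len(vector)*size.SizeOfFloat32)
	for i, v := range vector {
		// write the 4-byte little-endian representation of each float32
		binary.LittleEndian.PutUint32(buf[i*size.SizeOfFloat32:], math.Float32bits(v))
	}
	return base64.StdEncoding.EncodeToString(buf)
}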
func (n *VectorBase64Field) Vector() []float32 {
return n.vectorField.Vector()
}
func (n *VectorBase64Field) Dims() int {
return n.vectorField.Dims()
}
func (n *VectorBase64Field) Similarity() string {
return n.vectorField.Similarity()
}
func (n *VectorBase64Field) IndexOptimizedFor() string {
return n.vectorField.IndexOptimizedFor()
}

vendor/github.com/blevesearch/bleve/v2/error.go

@@ -0,0 +1,54 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package bleve
// Constant Error values which can be compared to determine the type of error
const (
ErrorIndexPathExists Error = iota
ErrorIndexPathDoesNotExist
ErrorIndexMetaMissing
ErrorIndexMetaCorrupt
ErrorIndexClosed
ErrorAliasMulti
ErrorAliasEmpty
ErrorUnknownIndexType
ErrorEmptyID
ErrorIndexReadInconsistency
ErrorTwoPhaseSearchInconsistency
ErrorSynonymSearchNotSupported
)
// Error represents a more strongly typed bleve error for detecting
// and handling specific types of errors.
type Error int
func (e Error) Error() string {
return errorMessages[e]
}
var errorMessages = map[Error]string{
ErrorIndexPathExists: "cannot create new index, path already exists",
ErrorIndexPathDoesNotExist: "cannot open index, path does not exist",
ErrorIndexMetaMissing: "cannot open index, metadata missing",
ErrorIndexMetaCorrupt: "cannot open index, metadata corrupt",
ErrorIndexClosed: "index is closed",
ErrorAliasMulti: "cannot perform single index operation on multiple index alias",
ErrorAliasEmpty: "cannot perform operation on empty alias",
ErrorUnknownIndexType: "unknown index type",
ErrorEmptyID: "document ID cannot be empty",
ErrorIndexReadInconsistency: "index read inconsistency detected",
ErrorTwoPhaseSearchInconsistency: "2-phase search failed, likely due to an overlapping topology change",
ErrorSynonymSearchNotSupported: "synonym search not supported",
}
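
// Illustrative usage (a sketch, not part of this file): callers can
// compare returned errors directly against these typed constants:
//
//	index, err := bleve.Open(path)
//	if err == bleve.ErrorIndexPathDoesNotExist {
//		index, err = bleve.New(path, mapping)
//	}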

vendor/github.com/blevesearch/bleve/v2/geo/README.md

@@ -0,0 +1,312 @@
# Geo spatial search support in bleve
The latest bleve spatial capabilities are powered by hierarchical spatial tokens generated from s2geometry.
You can find more details about the [s2geometry basics here](http://s2geometry.io/), and explore the extended functionality of our forked Go port of the [s2geometry lib here](https://github.com/blevesearch/geo).
Users can continue to index and query the `geopoint` field type with the existing queries as before:
- Point Distance
- Bounded Rectangle
- Bounded Polygon
## New Spatial Field Type - geoshape
We have introduced a field type (`geoshape`) for representing the new spatial types.
Using the new `geoshape` field type, users can unlock spatial capabilities
for the following [geojson](https://datatracker.ietf.org/doc/html/rfc7946) shapes:
- Point
- LineString
- Polygon
- MultiPoint
- MultiLineString
- MultiPolygon
- GeometryCollection
In addition to these shapes, bleve also supports:
- Circle
- Envelope (Bounded box)
To specify GeoJSON data, use a nested field with:
- a field named `type` that specifies the GeoJSON object type; the type value is case-insensitive.
- a field named `coordinates` that specifies the object's coordinates.
```text
"fieldName": {
"type": "GeoJSON Type",
"coordinates": <coordinates>
}
```
- When specifying latitude and longitude coordinates, list the longitude first and then the latitude.
- Valid longitude values are between -180 and 180, both inclusive.
- Valid latitude values are between -90 and 90, both inclusive.
- Shapes are internally represented as geodesics.
- The GeoJSON specification strongly suggests splitting geometries so that neither of their parts crosses the antimeridian.
Examples of the various geojson shape representations are given below.
## Point
The following specifies a [Point](https://tools.ietf.org/html/rfc7946#section-3.1.2) field in a document:
```json
{
"type": "point",
"coordinates": [75.05687713623047, 22.53539059204079]
}
```
## Linestring
The following specifies a [Linestring](https://tools.ietf.org/html/rfc7946#section-3.1.4) field in a document:
```json
{
"type": "linestring",
"coordinates": [
[77.01416015625, 23.0797317624497],
[78.134765625, 20.385825381874263]
]
}
```
## Polygon
The following specifies a [Polygon](https://tools.ietf.org/html/rfc7946#section-3.1.6) field in a document:
```json
{
"type": "polygon",
"coordinates": [
[
[85.605, 57.207],
[86.396, 55.998],
[87.033, 56.716],
[85.605, 57.207]
]
]
}
```
The first and last coordinates must match in order to close the polygon,
and the exterior ring's coordinates must be in counter-clockwise (CCW) order.
## MultiPoint
The following specifies a [Multipoint](https://tools.ietf.org/html/rfc7946#section-3.1.3) field in a document:
```json
{
"type": "multipoint",
"coordinates": [
[-115.8343505859375, 38.45789034424927],
[-115.81237792968749, 38.19502155795575],
[-120.80017089843749, 36.54053616262899],
[-120.67932128906249, 36.33725319397006]
]
}
```
## MultiLineString
The following specifies a [MultiLineString](https://tools.ietf.org/html/rfc7946#section-3.1.5) field in a document:
```json
{
"type": "multilinestring",
"coordinates": [
[
[-118.31726074, 35.250105158],
[-117.509765624, 35.3756141]
],
[
[-118.696289, 34.624167789],
[-118.317260742, 35.03899204]
],
[
[-117.9492187, 35.146862906],
[-117.6745605, 34.41144164]
]
]
}
```
## MultiPolygon
The following specifies a [MultiPolygon](https://tools.ietf.org/html/rfc7946#section-3.1.7) field in a document:
```json
{
"type": "multipolygon",
"coordinates": [
[
[
[-73.958, 40.8003],
[-73.9498, 40.7968],
[-73.9737, 40.7648],
[-73.9814, 40.7681],
[-73.958, 40.8003]
]
],
[
[
[-73.958, 40.8003],
[-73.9498, 40.7968],
[-73.9737, 40.7648],
[-73.958, 40.8003]
]
]
]
}
```
## GeometryCollection
The following specifies a [GeometryCollection](https://tools.ietf.org/html/rfc7946#section-3.1.8) field in a document:
```json
{
"type": "geometrycollection",
"geometries": [
{
"type": "multipoint",
"coordinates": [
[-73.958, 40.8003],
[-73.9498, 40.7968],
[-73.9737, 40.7648],
[-73.9814, 40.7681]
]
},
{
"type": "multilinestring",
"coordinates": [
[
[-73.96943, 40.78519],
[-73.96082, 40.78095]
],
[
[-73.96415, 40.79229],
[-73.95544, 40.78854]
],
[
[-73.97162, 40.78205],
[-73.96374, 40.77715]
],
[
[-73.9788, 40.77247],
[-73.97036, 40.76811]
]
]
},
{
"type": "polygon",
"coordinates": [
[
[0, 0],
[3, 6],
[6, 1],
[0, 0]
],
[
[2, 2],
[3, 3],
[4, 2],
[2, 2]
]
]
}
]
}
```
## Circle
To cover a circular region over the earth's surface, use this shape.
A sample circular shape is shown below.
```json
{
"type": "circle",
"coordinates": [75.05687713623047, 22.53539059204079],
"radius": "1000m"
}
```
A circle is specified by its center point coordinates along with a radius.
Example formats supported for the radius are:
"5in", "5inch", "7yd", "7yards", "9ft", "9feet", "11km", "11kilometers", "3nm", "3nauticalmiles", "13mm", "13millimeters", "15cm", "15centimeters", "17mi", "17miles", "19m" or "19meters".
If the unit cannot be determined, the entire string is parsed and a unit of meters is assumed.
## Envelope
The envelope type consists of the coordinates of the upper-left and lower-right points of the shape, representing a bounding rectangle in the format `[[minLon, maxLat], [maxLon, minLat]]`.
```json
{
"type": "envelope",
"coordinates": [
[72.83, 18.979],
[78.508, 17.4555]
]
}
```
## GeoShape Query
The geoshape query supports three filters of spatial querying capability across the heterogeneous types of indexed documents.
### Query Structure
```json
{
"query": {
"geometry": {
"shape": {
"type": "<shapeType>",
"coordinates": [
[[]]
]
},
"relation": "<<filterName>>"
}
}
}
```
*shapeType* => any of the aforementioned types: Point, LineString, Polygon, MultiPoint,
GeometryCollection, MultiLineString, MultiPolygon, Circle and Envelope.

*filterName* => any of the three filters: *intersects*, *contains* and *within*.
### Relation
| FilterName | Description |
| :-----------:| :-----------------------------------------------------------------: |
| `intersects` | Return all documents whose shape field intersects the query geometry. |
| `contains`   | Return all documents whose shape field contains the query geometry.   |
| `within` | Return all documents whose shape field is within the query geometry. |
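
For instance, a filled-in *intersects* query using the polygon from the earlier example would look like:

```json
{
  "query": {
    "geometry": {
      "shape": {
        "type": "polygon",
        "coordinates": [
          [
            [85.605, 57.207],
            [86.396, 55.998],
            [87.033, 56.716],
            [85.605, 57.207]
          ]
        ]
      },
      "relation": "intersects"
    }
  }
}
```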
------------------------------------------------------------------------------------------------------------------------
### Older Implementation
All of this geo code is a Go adaptation of the [Lucene 5.3.2 sandbox geo support](https://lucene.apache.org/core/5_3_2/sandbox/org/apache/lucene/util/package-summary.html).
## Notes
- All of the APIs will use float64 for lon/lat values.
- When describing a point in function arguments or return values, we always use the order lon, lat.
- High level APIs will use TopLeft and BottomRight to describe bounding boxes. This may not map cleanly to min/max lon/lat when crossing the dateline. The lower level APIs will use min/max lon/lat and require the higher-level code to split boxes accordingly.
- Points and MultiPoints may only contain Points and MultiPoints.
- LineStrings and MultiLineStrings may only contain Points and MultiPoints.
- Polygons or MultiPolygons intersecting Polygons and MultiPolygons may return arbitrary results when the overlap is only an edge or a vertex.
- A circle-contains-polygon check will return a false positive if all of the vertices of the polygon are within the circle but those vertices are oriented clockwise.
- The edges of an Envelope follow the latitude and longitude lines rather than the shortest path on a globe.
- Envelope intersect queries with LineStrings, MultiLineStrings, Polygons and MultiPolygons implicitly convert the Envelope into a Polygon, which changes the curvature of the edges and causes inaccurate results in a few edge cases.

vendor/github.com/blevesearch/bleve/v2/geo/geo.go

@@ -0,0 +1,210 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package geo
import (
"fmt"
"math"
"github.com/blevesearch/bleve/v2/numeric"
)
// GeoBits is the number of bits used for a single geo point
// Currently this is 32 bits for lon and 32 bits for lat
var GeoBits uint = 32
var minLon = -180.0
var minLat = -90.0
var maxLon = 180.0
var maxLat = 90.0
var minLonRad = minLon * degreesToRadian
var minLatRad = minLat * degreesToRadian
var maxLonRad = maxLon * degreesToRadian
var maxLatRad = maxLat * degreesToRadian
var geoTolerance = 1e-6
var lonScale = float64((uint64(0x1)<<GeoBits)-1) / 360.0
var latScale = float64((uint64(0x1)<<GeoBits)-1) / 180.0
var geoHashMaxLength = 12
// Point represents a geo point.
type Point struct {
Lon float64 `json:"lon"`
Lat float64 `json:"lat"`
}
// MortonHash computes the morton hash value for the provided geo point
// This point is ordered as lon, lat.
func MortonHash(lon, lat float64) uint64 {
return numeric.Interleave(scaleLon(lon), scaleLat(lat))
}
func scaleLon(lon float64) uint64 {
rv := uint64((lon - minLon) * lonScale)
return rv
}
func scaleLat(lat float64) uint64 {
rv := uint64((lat - minLat) * latScale)
return rv
}
// MortonUnhashLon extracts the longitude value from the provided morton hash.
func MortonUnhashLon(hash uint64) float64 {
return unscaleLon(numeric.Deinterleave(hash))
}
// MortonUnhashLat extracts the latitude value from the provided morton hash.
func MortonUnhashLat(hash uint64) float64 {
return unscaleLat(numeric.Deinterleave(hash >> 1))
}
func unscaleLon(lon uint64) float64 {
return (float64(lon) / lonScale) + minLon
}
func unscaleLat(lat uint64) float64 {
return (float64(lat) / latScale) + minLat
}
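
// Illustrative round trip: h := MortonHash(-71.34, 41.12), then
// MortonUnhashLon(h) and MortonUnhashLat(h) recover the original lon/lat
// to within the 32-bit quantization error (on the order of 1e-7 degrees).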
// compareGeo compares two float values, returning 0 when they are equal
// within a known geo tolerance and their difference otherwise.
func compareGeo(a, b float64) float64 {
compare := a - b
if math.Abs(compare) <= geoTolerance {
return 0
}
return compare
}
// RectIntersects checks whether rectangles a and b intersect
func RectIntersects(aMinX, aMinY, aMaxX, aMaxY, bMinX, bMinY, bMaxX, bMaxY float64) bool {
return !(aMaxX < bMinX || aMinX > bMaxX || aMaxY < bMinY || aMinY > bMaxY)
}
// RectWithin checks whether box a is within box b
func RectWithin(aMinX, aMinY, aMaxX, aMaxY, bMinX, bMinY, bMaxX, bMaxY float64) bool {
rv := !(aMinX < bMinX || aMinY < bMinY || aMaxX > bMaxX || aMaxY > bMaxY)
return rv
}
// BoundingBoxContains checks whether the lon/lat point is within the box
func BoundingBoxContains(lon, lat, minLon, minLat, maxLon, maxLat float64) bool {
return compareGeo(lon, minLon) >= 0 && compareGeo(lon, maxLon) <= 0 &&
compareGeo(lat, minLat) >= 0 && compareGeo(lat, maxLat) <= 0
}
const degreesToRadian = math.Pi / 180
const radiansToDegrees = 180 / math.Pi
// DegreesToRadians converts an angle in degrees to radians
func DegreesToRadians(d float64) float64 {
return d * degreesToRadian
}
// RadiansToDegrees converts an angle in radians to degrees
func RadiansToDegrees(r float64) float64 {
return r * radiansToDegrees
}
var earthMeanRadiusMeters = 6371008.7714
func RectFromPointDistance(lon, lat, dist float64) (float64, float64, float64, float64, error) {
err := checkLongitude(lon)
if err != nil {
return 0, 0, 0, 0, err
}
err = checkLatitude(lat)
if err != nil {
return 0, 0, 0, 0, err
}
radLon := DegreesToRadians(lon)
radLat := DegreesToRadians(lat)
radDistance := (dist + 7e-2) / earthMeanRadiusMeters
minLatL := radLat - radDistance
maxLatL := radLat + radDistance
var minLonL, maxLonL float64
if minLatL > minLatRad && maxLatL < maxLatRad {
deltaLon := math.Asin(math.Sin(radDistance) / math.Cos(radLat))
minLonL = radLon - deltaLon
if minLonL < minLonRad {
minLonL += 2 * math.Pi
}
maxLonL = radLon + deltaLon
if maxLonL > maxLonRad {
maxLonL -= 2 * math.Pi
}
} else {
// pole is inside distance
minLatL = math.Max(minLatL, minLatRad)
maxLatL = math.Min(maxLatL, maxLatRad)
minLonL = minLonRad
maxLonL = maxLonRad
}
return RadiansToDegrees(minLonL),
RadiansToDegrees(maxLatL),
RadiansToDegrees(maxLonL),
RadiansToDegrees(minLatL),
nil
}
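
// Note (added for clarity): the values above are returned in
// (minLon, maxLat, maxLon, minLat) order, i.e. the top-left corner
// followed by the bottom-right corner of the bounding rectangle.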
func checkLatitude(latitude float64) error {
if math.IsNaN(latitude) || latitude < minLat || latitude > maxLat {
return fmt.Errorf("invalid latitude %f; must be between %f and %f", latitude, minLat, maxLat)
}
return nil
}
func checkLongitude(longitude float64) error {
if math.IsNaN(longitude) || longitude < minLon || longitude > maxLon {
return fmt.Errorf("invalid longitude %f; must be between %f and %f", longitude, minLon, maxLon)
}
return nil
}
func BoundingRectangleForPolygon(polygon []Point) (
float64, float64, float64, float64, error) {
err := checkLongitude(polygon[0].Lon)
if err != nil {
return 0, 0, 0, 0, err
}
err = checkLatitude(polygon[0].Lat)
if err != nil {
return 0, 0, 0, 0, err
}
maxY, minY := polygon[0].Lat, polygon[0].Lat
maxX, minX := polygon[0].Lon, polygon[0].Lon
for i := 1; i < len(polygon); i++ {
err := checkLongitude(polygon[i].Lon)
if err != nil {
return 0, 0, 0, 0, err
}
err = checkLatitude(polygon[i].Lat)
if err != nil {
return 0, 0, 0, 0, err
}
maxY = math.Max(maxY, polygon[i].Lat)
minY = math.Min(minY, polygon[i].Lat)
maxX = math.Max(maxX, polygon[i].Lon)
minX = math.Min(minX, polygon[i].Lon)
}
return minX, maxY, maxX, minY, nil
}

vendor/github.com/blevesearch/bleve/v2/geo/geo_dist.go

@@ -0,0 +1,98 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package geo
import (
"fmt"
"math"
"strconv"
"strings"
)
type distanceUnit struct {
conv float64
suffixes []string
}
var inch = distanceUnit{0.0254, []string{"in", "inch"}}
var yard = distanceUnit{0.9144, []string{"yd", "yards"}}
var feet = distanceUnit{0.3048, []string{"ft", "feet"}}
var kilom = distanceUnit{1000, []string{"km", "kilometers"}}
var nauticalm = distanceUnit{1852.0, []string{"nm", "nauticalmiles"}}
var millim = distanceUnit{0.001, []string{"mm", "millimeters"}}
var centim = distanceUnit{0.01, []string{"cm", "centimeters"}}
var miles = distanceUnit{1609.344, []string{"mi", "miles"}}
var meters = distanceUnit{1, []string{"m", "meters"}}
var distanceUnits = []*distanceUnit{
&inch, &yard, &feet, &kilom, &nauticalm, &millim, &centim, &miles, &meters,
}
// ParseDistance attempts to parse a distance string and return distance in
// meters. Example formats supported:
// "5in" "5inch" "7yd" "7yards" "9ft" "9feet" "11km" "11kilometers"
// "3nm" "3nauticalmiles" "13mm" "13millimeters" "15cm" "15centimeters"
// "17mi" "17miles" "19m" "19meters"
// If the unit cannot be determined, the entire string is parsed and the
// unit of meters is assumed.
// If the number portion cannot be parsed, 0 and the parse error are returned.
func ParseDistance(d string) (float64, error) {
for _, unit := range distanceUnits {
for _, unitSuffix := range unit.suffixes {
if strings.HasSuffix(d, unitSuffix) {
parsedNum, err := strconv.ParseFloat(d[0:len(d)-len(unitSuffix)], 64)
if err != nil {
return 0, err
}
return parsedNum * unit.conv, nil
}
}
}
// no unit matched, try assuming meters?
parsedNum, err := strconv.ParseFloat(d, 64)
if err != nil {
return 0, err
}
return parsedNum, nil
}
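
// Illustrative results: ParseDistance("11km") returns 11000,
// ParseDistance("3nm") returns 5556, and ParseDistance("42") falls
// back to meters and returns 42.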
// ParseDistanceUnit attempts to parse a distance unit and return the
// multiplier for converting this to meters. If the unit cannot be parsed
// then 0 and the error message is returned.
func ParseDistanceUnit(u string) (float64, error) {
for _, unit := range distanceUnits {
for _, unitSuffix := range unit.suffixes {
if u == unitSuffix {
return unit.conv, nil
}
}
}
return 0, fmt.Errorf("unknown distance unit: %s", u)
}
// Haversin computes the distance between two points.
// This implementation uses the sloppy math implementations, which trade
// accuracy for performance. The distance returned is in kilometers.
func Haversin(lon1, lat1, lon2, lat2 float64) float64 {
x1 := lat1 * degreesToRadian
x2 := lat2 * degreesToRadian
h1 := 1 - math.Cos(x1-x2)
h2 := 1 - math.Cos((lon1-lon2)*degreesToRadian)
h := (h1 + math.Cos(x1)*math.Cos(x2)*h2) / 2
avgLat := (x1 + x2) / 2
diameter := earthDiameter(avgLat)
return diameter * math.Asin(math.Min(1, math.Sqrt(h)))
}
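
// Illustrative check: Haversin(-0.1278, 51.5074, 2.3522, 48.8566)
// (London to Paris) returns roughly 344, i.e. about 344 km.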


@@ -0,0 +1,463 @@
// Copyright (c) 2022 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package geo
import (
"encoding/json"
"sync"
"github.com/blevesearch/bleve/v2/util"
index "github.com/blevesearch/bleve_index_api"
"github.com/blevesearch/geo/geojson"
"github.com/blevesearch/geo/s2"
)
const (
PointType = "point"
MultiPointType = "multipoint"
LineStringType = "linestring"
MultiLineStringType = "multilinestring"
PolygonType = "polygon"
MultiPolygonType = "multipolygon"
GeometryCollectionType = "geometrycollection"
CircleType = "circle"
EnvelopeType = "envelope"
)
// spatialPluginsMap is the spatial plugin cache.
var (
spatialPluginsMap = make(map[string]index.SpatialAnalyzerPlugin)
pluginsMapLock = sync.RWMutex{}
)
func init() {
registerS2RegionTermIndexer()
}
func registerS2RegionTermIndexer() {
spatialPlugin := S2SpatialAnalyzerPlugin{
s2Indexer: s2.NewRegionTermIndexerWithOptions(initS2IndexerOptions()),
s2Searcher: s2.NewRegionTermIndexerWithOptions(initS2SearcherOptions()),
s2GeoPointsRegionTermIndexer: s2.NewRegionTermIndexerWithOptions(initS2OptionsForGeoPoints()),
}
RegisterSpatialAnalyzerPlugin(&spatialPlugin)
}
// RegisterSpatialAnalyzerPlugin registers the given plugin implementation.
func RegisterSpatialAnalyzerPlugin(plugin index.SpatialAnalyzerPlugin) {
pluginsMapLock.Lock()
spatialPluginsMap[plugin.Type()] = plugin
pluginsMapLock.Unlock()
}
// GetSpatialAnalyzerPlugin retrieves the plugin registered for the
// given implementation type.
func GetSpatialAnalyzerPlugin(typ string) index.SpatialAnalyzerPlugin {
pluginsMapLock.RLock()
rv := spatialPluginsMap[typ]
pluginsMapLock.RUnlock()
return rv
}
// initS2IndexerOptions returns the options for s2's region
// term indexer for the index time tokens of geojson shapes.
func initS2IndexerOptions() s2.Options {
options := s2.Options{}
	// maxLevel controls the maximum size of the
	// S2Cells used to approximate regions.
	options.SetMaxLevel(16)
	// minLevel controls the minimum size of the
	// S2Cells used to approximate regions.
options.SetMinLevel(2)
// levelMod value greater than 1 increases the effective branching
// factor of the S2Cell hierarchy by skipping some levels.
options.SetLevelMod(1)
// maxCells controls the maximum number of cells
// when approximating each s2 region.
options.SetMaxCells(20)
return options
}
// initS2SearcherOptions returns the options for s2's region
// term indexer for the query time tokens of geojson shapes.
func initS2SearcherOptions() s2.Options {
options := s2.Options{}
	// maxLevel controls the maximum size of the
	// S2Cells used to approximate regions.
	options.SetMaxLevel(16)
	// minLevel controls the minimum size of the
	// S2Cells used to approximate regions.
options.SetMinLevel(2)
// levelMod value greater than 1 increases the effective branching
// factor of the S2Cell hierarchy by skipping some levels.
options.SetLevelMod(1)
// maxCells controls the maximum number of cells
// when approximating each s2 region.
options.SetMaxCells(8)
return options
}
// initS2OptionsForGeoPoints returns the options for
// s2's region term indexer for the original geopoints.
func initS2OptionsForGeoPoints() s2.Options {
options := s2.Options{}
	// maxLevel controls the maximum size of the
	// S2Cells used to approximate regions.
	options.SetMaxLevel(16)
	// minLevel controls the minimum size of the
	// S2Cells used to approximate regions.
options.SetMinLevel(4)
// levelMod value greater than 1 increases the effective branching
// factor of the S2Cell hierarchy by skipping some levels.
options.SetLevelMod(2)
// maxCells controls the maximum number of cells
// when approximating each s2 region.
options.SetMaxCells(8)
// explicit for geo points.
options.SetPointsOnly(true)
return options
}
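
// Illustrative note (an observation, not upstream commentary): the indexer
// options allow up to 20 cells for a tighter covering of each shape at
// index time, while the searcher options cap the covering at 8 cells so
// that queries generate fewer terms; the geopoint options additionally set
// SetPointsOnly(true) since points need no region covering.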
// S2SpatialAnalyzerPlugin is an implementation of
// the index.SpatialAnalyzerPlugin interface.
type S2SpatialAnalyzerPlugin struct {
s2Indexer *s2.RegionTermIndexer
s2Searcher *s2.RegionTermIndexer
s2GeoPointsRegionTermIndexer *s2.RegionTermIndexer
}
func (s *S2SpatialAnalyzerPlugin) Type() string {
return "s2"
}
func (s *S2SpatialAnalyzerPlugin) GetIndexTokens(queryShape index.GeoJSON) []string {
var rv []string
shapes := []index.GeoJSON{queryShape}
if gc, ok := queryShape.(*geojson.GeometryCollection); ok {
shapes = gc.Shapes
}
for _, shape := range shapes {
if s2t, ok := shape.(s2Tokenizable); ok {
rv = append(rv, s2t.IndexTokens(s.s2Indexer)...)
} else if s2t, ok := shape.(s2TokenizableEx); ok {
rv = append(rv, s2t.IndexTokens(s)...)
}
}
return geojson.DeduplicateTerms(rv)
}
func (s *S2SpatialAnalyzerPlugin) GetQueryTokens(queryShape index.GeoJSON) []string {
var rv []string
shapes := []index.GeoJSON{queryShape}
if gc, ok := queryShape.(*geojson.GeometryCollection); ok {
shapes = gc.Shapes
}
for _, shape := range shapes {
if s2t, ok := shape.(s2Tokenizable); ok {
rv = append(rv, s2t.QueryTokens(s.s2Searcher)...)
} else if s2t, ok := shape.(s2TokenizableEx); ok {
rv = append(rv, s2t.QueryTokens(s)...)
}
}
return geojson.DeduplicateTerms(rv)
}
// ------------------------------------------------------------------------
// s2Tokenizable is an optional interface for shapes that support
// the generation of s2 based tokens that can be used for both
// indexing and querying.
type s2Tokenizable interface {
// IndexTokens returns the tokens for indexing.
IndexTokens(*s2.RegionTermIndexer) []string
// QueryTokens returns the tokens for searching.
QueryTokens(*s2.RegionTermIndexer) []string
}
// ------------------------------------------------------------------------
// s2TokenizableEx is an optional interface for shapes that support
// the generation of s2 based tokens that can be used for both
// indexing and querying. This is intended for the older geopoint
// indexing and querying.
type s2TokenizableEx interface {
// IndexTokens returns the tokens for indexing.
IndexTokens(*S2SpatialAnalyzerPlugin) []string
// QueryTokens returns the tokens for searching.
QueryTokens(*S2SpatialAnalyzerPlugin) []string
}
//----------------------------------------------------------------------------------
func (p *Point) Type() string {
return PointType
}
func (p *Point) Value() ([]byte, error) {
return util.MarshalJSON(p)
}
func (p *Point) Intersects(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (p *Point) Contains(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (p *Point) IndexTokens(s *S2SpatialAnalyzerPlugin) []string {
return s.s2GeoPointsRegionTermIndexer.GetIndexTermsForPoint(s2.PointFromLatLng(
s2.LatLngFromDegrees(p.Lat, p.Lon)), "")
}
func (p *Point) QueryTokens(s *S2SpatialAnalyzerPlugin) []string {
return nil
}
//----------------------------------------------------------------------------------
type boundedRectangle struct {
minLat float64
maxLat float64
minLon float64
maxLon float64
}
func NewBoundedRectangle(minLat, minLon, maxLat,
maxLon float64) *boundedRectangle {
return &boundedRectangle{minLat: minLat,
maxLat: maxLat, minLon: minLon, maxLon: maxLon}
}
func (br *boundedRectangle) Type() string {
// placeholder implementation
return "boundedRectangle"
}
func (br *boundedRectangle) Value() ([]byte, error) {
return util.MarshalJSON(br)
}
func (p *boundedRectangle) Intersects(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (p *boundedRectangle) Contains(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (br *boundedRectangle) IndexTokens(s *S2SpatialAnalyzerPlugin) []string {
return nil
}
func (br *boundedRectangle) QueryTokens(s *S2SpatialAnalyzerPlugin) []string {
rect := s2.RectFromDegrees(br.minLat, br.minLon, br.maxLat, br.maxLon)
// obtain the terms to be searched for the given bounding box.
terms := s.s2GeoPointsRegionTermIndexer.GetQueryTermsForRegion(rect, "")
return geojson.StripCoveringTerms(terms)
}
//----------------------------------------------------------------------------------
type boundedPolygon struct {
coordinates []Point
}
func NewBoundedPolygon(coordinates []Point) *boundedPolygon {
return &boundedPolygon{coordinates: coordinates}
}
func (bp *boundedPolygon) Type() string {
// placeholder implementation
return "boundedPolygon"
}
func (bp *boundedPolygon) Value() ([]byte, error) {
return util.MarshalJSON(bp)
}
func (p *boundedPolygon) Intersects(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (p *boundedPolygon) Contains(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (bp *boundedPolygon) IndexTokens(s *S2SpatialAnalyzerPlugin) []string {
return nil
}
func (bp *boundedPolygon) QueryTokens(s *S2SpatialAnalyzerPlugin) []string {
vertices := make([]s2.Point, len(bp.coordinates))
for i, point := range bp.coordinates {
vertices[i] = s2.PointFromLatLng(
s2.LatLngFromDegrees(point.Lat, point.Lon))
}
s2polygon := s2.PolygonFromOrientedLoops([]*s2.Loop{s2.LoopFromPoints(vertices)})
// obtain the terms to be searched for the given polygon.
terms := s.s2GeoPointsRegionTermIndexer.GetQueryTermsForRegion(
s2polygon.CapBound(), "")
return geojson.StripCoveringTerms(terms)
}
//----------------------------------------------------------------------------------
type pointDistance struct {
dist float64
centerLat float64
centerLon float64
}
func (p *pointDistance) Type() string {
// placeholder implementation
return "pointDistance"
}
func (p *pointDistance) Value() ([]byte, error) {
return util.MarshalJSON(p)
}
func NewPointDistance(centerLat, centerLon,
dist float64) *pointDistance {
return &pointDistance{centerLat: centerLat,
centerLon: centerLon, dist: dist}
}
func (p *pointDistance) Intersects(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (p *pointDistance) Contains(s index.GeoJSON) (bool, error) {
// placeholder implementation
return false, nil
}
func (pd *pointDistance) IndexTokens(s *S2SpatialAnalyzerPlugin) []string {
return nil
}
func (pd *pointDistance) QueryTokens(s *S2SpatialAnalyzerPlugin) []string {
// obtain the covering query region from the given points.
queryRegion := s2.CapFromCenterAndRadius(pd.centerLat,
pd.centerLon, pd.dist)
// obtain the query terms for the query region.
terms := s.s2GeoPointsRegionTermIndexer.GetQueryTermsForRegion(queryRegion, "")
return geojson.StripCoveringTerms(terms)
}
// ------------------------------------------------------------------------
// NewGeometryCollection instantiates a geometry collection
// and prefixes the byte contents with certain glue bytes that
// can be used later while filtering the doc values.
func NewGeometryCollection(coordinates [][][][][]float64,
typs []string) (index.GeoJSON, []byte, error) {
shapes := make([]*geojson.GeoShape, len(coordinates))
for i := range coordinates {
shapes[i] = &geojson.GeoShape{
Coordinates: coordinates[i],
Type: typs[i],
}
}
return geojson.NewGeometryCollection(shapes)
}
func NewGeometryCollectionFromShapes(shapes []*geojson.GeoShape) (
index.GeoJSON, []byte, error) {
return geojson.NewGeometryCollection(shapes)
}
// NewGeoCircleShape instantiates a circle shape and
// prefixes the byte contents with certain glue bytes that
// can be used later while filtering the doc values.
func NewGeoCircleShape(cp []float64,
radius string) (index.GeoJSON, []byte, error) {
return geojson.NewGeoCircleShape(cp, radius)
}
func NewGeoJsonShape(coordinates [][][][]float64, typ string) (
index.GeoJSON, []byte, error) {
return geojson.NewGeoJsonShape(coordinates, typ)
}
func NewGeoJsonPoint(points []float64) index.GeoJSON {
return geojson.NewGeoJsonPoint(points)
}
func NewGeoJsonMultiPoint(points [][]float64) index.GeoJSON {
return geojson.NewGeoJsonMultiPoint(points)
}
func NewGeoJsonLinestring(points [][]float64) index.GeoJSON {
return geojson.NewGeoJsonLinestring(points)
}
func NewGeoJsonMultilinestring(points [][][]float64) index.GeoJSON {
return geojson.NewGeoJsonMultilinestring(points)
}
func NewGeoJsonPolygon(points [][][]float64) index.GeoJSON {
return geojson.NewGeoJsonPolygon(points)
}
func NewGeoJsonMultiPolygon(points [][][][]float64) index.GeoJSON {
return geojson.NewGeoJsonMultiPolygon(points)
}
func NewGeoCircle(points []float64, radius string) index.GeoJSON {
return geojson.NewGeoCircle(points, radius)
}
func NewGeoEnvelope(points [][]float64) index.GeoJSON {
return geojson.NewGeoEnvelope(points)
}
func ParseGeoJSONShape(input json.RawMessage) (index.GeoJSON, error) {
return geojson.ParseGeoJSONShape(input)
}

vendor/github.com/blevesearch/bleve/v2/geo/geohash.go

@@ -0,0 +1,111 @@
// Copyright (c) 2019 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// This implementation is inspired by the geohash-js
// ref: https://github.com/davetroy/geohash-js
package geo
// encoding encapsulates an encoding defined by a given base32 alphabet.
type encoding struct {
enc string
dec [256]byte
}
// newEncoding constructs a new encoding defined by the given alphabet,
// which must be a 32-byte string.
func newEncoding(encoder string) *encoding {
e := new(encoding)
e.enc = encoder
for i := 0; i < len(e.dec); i++ {
e.dec[i] = 0xff
}
for i := 0; i < len(encoder); i++ {
e.dec[encoder[i]] = byte(i)
}
return e
}
// base32encoding with the Geohash alphabet.
var base32encoding = newEncoding("0123456789bcdefghjkmnpqrstuvwxyz")
var masks = []uint64{16, 8, 4, 2, 1}
// DecodeGeoHash decodes a geohash string quickly and with higher
// precision. This API is experimental.
func DecodeGeoHash(geoHash string) (float64, float64) {
even := true
lat := []float64{-90.0, 90.0}
lon := []float64{-180.0, 180.0}
for i := 0; i < len(geoHash); i++ {
cd := uint64(base32encoding.dec[geoHash[i]])
for j := 0; j < 5; j++ {
if even {
if cd&masks[j] > 0 {
lon[0] = (lon[0] + lon[1]) / 2
} else {
lon[1] = (lon[0] + lon[1]) / 2
}
} else {
if cd&masks[j] > 0 {
lat[0] = (lat[0] + lat[1]) / 2
} else {
lat[1] = (lat[0] + lat[1]) / 2
}
}
even = !even
}
}
return (lat[0] + lat[1]) / 2, (lon[0] + lon[1]) / 2
}
func EncodeGeoHash(lat, lon float64) string {
even := true
lats := []float64{-90.0, 90.0}
lons := []float64{-180.0, 180.0}
precision := 12
var ch, bit uint64
var geoHash string
for len(geoHash) < precision {
if even {
mid := (lons[0] + lons[1]) / 2
if lon > mid {
ch |= masks[bit]
lons[0] = mid
} else {
lons[1] = mid
}
} else {
mid := (lats[0] + lats[1]) / 2
if lat > mid {
ch |= masks[bit]
lats[0] = mid
} else {
lats[1] = mid
}
}
even = !even
if bit < 4 {
bit++
} else {
geoHash += string(base32encoding.enc[ch])
ch = 0
bit = 0
}
}
return geoHash
}
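
// Illustrative round trip, using the classic geohash example:
// DecodeGeoHash("ezs42") returns approximately (42.60498, -5.60303)
// as (lat, lon), and EncodeGeoHash(42.60498, -5.60303) yields a
// 12-character hash that begins with "ezs42".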

vendor/github.com/blevesearch/bleve/v2/geo/parse.go

@@ -0,0 +1,465 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package geo
import (
"reflect"
"strconv"
"strings"
"github.com/blevesearch/bleve/v2/util"
"github.com/blevesearch/geo/geojson"
)
// ExtractGeoPoint takes an arbitrary interface{} and tries its best to
// interpret it as a geo point. Supported formats:
//
//	slice of length 2 (GeoJSON): first element lon, second element lat
//
//	string: coordinates separated by comma (lat first, then lon), or a geohash
//
//	map[string]interface{}: exact keys "lat" and "lon" or "lng"
//
//	struct: exported fields with a case-insensitive match on lat and lon or lng
//
//	struct: satisfying the later and loner or lnger interfaces
//
// In all cases values must be some sort of numeric-like thing: int/uint/float.
func ExtractGeoPoint(thing interface{}) (lon, lat float64, success bool) {
var foundLon, foundLat bool
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return lon, lat, false
}
thingTyp := thingVal.Type()
// is it a slice
if thingVal.Kind() == reflect.Slice {
// must be length 2
if thingVal.Len() == 2 {
first := thingVal.Index(0)
if first.CanInterface() {
firstVal := first.Interface()
lon, foundLon = util.ExtractNumericValFloat64(firstVal)
}
second := thingVal.Index(1)
if second.CanInterface() {
secondVal := second.Interface()
lat, foundLat = util.ExtractNumericValFloat64(secondVal)
}
}
}
// is it a string
if thingVal.Kind() == reflect.String {
geoStr := thingVal.Interface().(string)
if strings.Contains(geoStr, ",") {
// geo point with coordinates split by comma
points := strings.Split(geoStr, ",")
for i, point := range points {
// trim any leading or trailing white spaces
points[i] = strings.TrimSpace(point)
}
if len(points) == 2 {
var err error
lat, err = strconv.ParseFloat(points[0], 64)
if err == nil {
foundLat = true
}
lon, err = strconv.ParseFloat(points[1], 64)
if err == nil {
foundLon = true
}
}
} else {
// geohash
if len(geoStr) <= geoHashMaxLength {
lat, lon = DecodeGeoHash(geoStr)
foundLat = true
foundLon = true
}
}
}
// is it a map
if l, ok := thing.(map[string]interface{}); ok {
if lval, ok := l["lon"]; ok {
lon, foundLon = util.ExtractNumericValFloat64(lval)
} else if lval, ok := l["lng"]; ok {
lon, foundLon = util.ExtractNumericValFloat64(lval)
}
if lval, ok := l["lat"]; ok {
lat, foundLat = util.ExtractNumericValFloat64(lval)
}
}
// now try reflection on struct fields
if thingVal.Kind() == reflect.Struct {
for i := 0; i < thingVal.NumField(); i++ {
fieldName := thingTyp.Field(i).Name
if strings.HasPrefix(strings.ToLower(fieldName), "lon") {
if thingVal.Field(i).CanInterface() {
fieldVal := thingVal.Field(i).Interface()
lon, foundLon = util.ExtractNumericValFloat64(fieldVal)
}
}
if strings.HasPrefix(strings.ToLower(fieldName), "lng") {
if thingVal.Field(i).CanInterface() {
fieldVal := thingVal.Field(i).Interface()
lon, foundLon = util.ExtractNumericValFloat64(fieldVal)
}
}
if strings.HasPrefix(strings.ToLower(fieldName), "lat") {
if thingVal.Field(i).CanInterface() {
fieldVal := thingVal.Field(i).Interface()
lat, foundLat = util.ExtractNumericValFloat64(fieldVal)
}
}
}
}
// last hope, some interfaces
// lon
if l, ok := thing.(loner); ok {
lon = l.Lon()
foundLon = true
} else if l, ok := thing.(lnger); ok {
lon = l.Lng()
foundLon = true
}
// lat
if l, ok := thing.(later); ok {
lat = l.Lat()
foundLat = true
}
return lon, lat, foundLon && foundLat
}
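
// Illustrative inputs (each yields lon=-71.34, lat=41.12, true):
//
//	[]interface{}{-71.34, 41.12}                         // slice: lon, lat
//	"41.12, -71.34"                                      // string: lat, lon
//	map[string]interface{}{"lon": -71.34, "lat": 41.12}  // map keys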
// various support interfaces which can be used to find lat/lon
type loner interface {
Lon() float64
}
type later interface {
Lat() float64
}
type lnger interface {
Lng() float64
}
// GlueBytes is used primarily for quicker filtering of doc values
// during the filtering phase.
var GlueBytes = []byte("##")
var GlueBytesOffset = len(GlueBytes)
func extractCoordinates(thing interface{}) []float64 {
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return nil
}
if thingVal.Kind() == reflect.Slice {
// must be length 2
if thingVal.Len() == 2 {
var foundLon, foundLat bool
var lon, lat float64
first := thingVal.Index(0)
if first.CanInterface() {
firstVal := first.Interface()
lon, foundLon = util.ExtractNumericValFloat64(firstVal)
}
second := thingVal.Index(1)
if second.CanInterface() {
secondVal := second.Interface()
lat, foundLat = util.ExtractNumericValFloat64(secondVal)
}
if !foundLon || !foundLat {
return nil
}
return []float64{lon, lat}
}
}
return nil
}
func extract2DCoordinates(thing interface{}) [][]float64 {
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return nil
}
rv := make([][]float64, 0, 8)
if thingVal.Kind() == reflect.Slice {
for j := 0; j < thingVal.Len(); j++ {
edges := thingVal.Index(j).Interface()
if es, ok := edges.([]interface{}); ok {
v := extractCoordinates(es)
if len(v) == 2 {
rv = append(rv, v)
}
}
}
return rv
}
return nil
}
func extract3DCoordinates(thing interface{}) (c [][][]float64) {
coords := reflect.ValueOf(thing)
if !coords.IsValid() {
return nil
}
if coords.Kind() == reflect.Slice {
for i := 0; i < coords.Len(); i++ {
vals := coords.Index(i)
edges := vals.Interface()
if es, ok := edges.([]interface{}); ok {
loop := extract2DCoordinates(es)
if len(loop) > 0 {
c = append(c, loop)
}
}
}
}
return c
}
func extract4DCoordinates(thing interface{}) (rv [][][][]float64) {
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return nil
}
if thingVal.Kind() == reflect.Slice {
for j := 0; j < thingVal.Len(); j++ {
c := extract3DCoordinates(thingVal.Index(j).Interface())
rv = append(rv, c)
}
}
return rv
}
func ParseGeoShapeField(thing interface{}) (interface{}, string, error) {
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return nil, "", nil
}
var shape string
var coordValue interface{}
if thingVal.Kind() == reflect.Map {
iter := thingVal.MapRange()
for iter.Next() {
if iter.Key().String() == "type" {
shape = iter.Value().Interface().(string)
continue
}
if iter.Key().String() == "coordinates" {
coordValue = iter.Value().Interface()
}
}
}
return coordValue, strings.ToLower(shape), nil
}
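
// Illustrative input: a document field value such as
//
//	map[string]interface{}{
//		"type":        "Point",
//		"coordinates": []interface{}{75.05, 22.53},
//	}
//
// yields coordValue = []interface{}{75.05, 22.53} and shape = "point"
// (the type value is lower-cased).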
func extractGeoShape(thing interface{}) (*geojson.GeoShape, bool) {
coordValue, typ, err := ParseGeoShapeField(thing)
if err != nil {
return nil, false
}
if typ == CircleType {
return ExtractCircle(thing)
}
return ExtractGeoShapeCoordinates(coordValue, typ)
}
// ExtractGeometryCollection takes an interface{} and tries its best to
// interpret all the member geojson shapes within it.
func ExtractGeometryCollection(thing interface{}) ([]*geojson.GeoShape, bool) {
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return nil, false
}
var rv []*geojson.GeoShape
var f bool
if thingVal.Kind() == reflect.Map {
iter := thingVal.MapRange()
for iter.Next() {
if iter.Key().String() == "type" {
continue
}
if iter.Key().String() == "geometries" {
collection := iter.Value().Interface()
items := reflect.ValueOf(collection)
for j := 0; j < items.Len(); j++ {
shape, found := extractGeoShape(items.Index(j).Interface())
if found {
f = found
rv = append(rv, shape)
}
}
}
}
}
return rv, f
}
// ExtractCircle takes an interface{} and tries its best to
// interpret the center point coordinates and the radius for a
// given circle shape.
func ExtractCircle(thing interface{}) (*geojson.GeoShape, bool) {
thingVal := reflect.ValueOf(thing)
if !thingVal.IsValid() {
return nil, false
}
rv := &geojson.GeoShape{
Type: CircleType,
Center: make([]float64, 0, 2),
}
if thingVal.Kind() == reflect.Map {
iter := thingVal.MapRange()
for iter.Next() {
if iter.Key().String() == "radius" {
rv.Radius = iter.Value().Interface().(string)
continue
}
if iter.Key().String() == "coordinates" {
lng, lat, found := ExtractGeoPoint(iter.Value().Interface())
if !found {
return nil, false
}
rv.Center = append(rv.Center, lng, lat)
}
}
}
return rv, true
}
// ExtractGeoShapeCoordinates takes an interface{} and tries its best to
// interpret the coordinates for the given geoshape type, such as a point,
// multipoint, linestring, multilinestring, polygon or multipolygon.
func ExtractGeoShapeCoordinates(coordValue interface{},
typ string) (*geojson.GeoShape, bool) {
rv := &geojson.GeoShape{
Type: typ,
}
if typ == PointType {
point := extractCoordinates(coordValue)
// ignore the contents with invalid entry.
if len(point) < 2 {
return nil, false
}
rv.Coordinates = [][][][]float64{{{point}}}
return rv, true
}
if typ == MultiPointType || typ == LineStringType ||
typ == EnvelopeType {
coords := extract2DCoordinates(coordValue)
// ignore the contents with invalid entry.
if len(coords) == 0 {
return nil, false
}
if typ == EnvelopeType && len(coords) != 2 {
return nil, false
}
if typ == LineStringType && len(coords) < 2 {
return nil, false
}
rv.Coordinates = [][][][]float64{{coords}}
return rv, true
}
if typ == PolygonType || typ == MultiLineStringType {
coords := extract3DCoordinates(coordValue)
// ignore the contents with invalid entry.
if len(coords) == 0 {
return nil, false
}
if typ == PolygonType && len(coords[0]) < 3 ||
typ == MultiLineStringType && len(coords[0]) < 2 {
return nil, false
}
rv.Coordinates = [][][][]float64{coords}
return rv, true
}
if typ == MultiPolygonType {
coords := extract4DCoordinates(coordValue)
// ignore the contents with invalid entry.
if len(coords) == 0 || len(coords[0]) == 0 {
return nil, false
}
if len(coords[0][0]) < 3 {
return nil, false
}
rv.Coordinates = coords
return rv, true
}
return rv, false
}
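A minimal usage sketch for the helpers above, assuming input shaped like decoded GeoJSON; the shape literal and values here are hypothetical:

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2/geo"
)

func main() {
	// hypothetical GeoJSON-style value, as produced by json.Unmarshal
	// into map[string]interface{} / []interface{}
	shape := map[string]interface{}{
		"type": "LineString",
		"coordinates": []interface{}{
			[]interface{}{77.59, 12.97},
			[]interface{}{77.60, 12.98},
		},
	}
	coordValue, typ, err := geo.ParseGeoShapeField(shape)
	if err != nil {
		panic(err)
	}
	// typ has been lowercased to "linestring" by ParseGeoShapeField
	if gs, ok := geo.ExtractGeoShapeCoordinates(coordValue, typ); ok {
		fmt.Println(gs.Type, gs.Coordinates) // normalized [][][][]float64 form
	}
}
```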

59
vendor/github.com/blevesearch/bleve/v2/geo/sloppy.go generated vendored Normal file
View File

@@ -0,0 +1,59 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package geo
import (
"math"
)
var earthDiameterPerLatitude []float64
const (
radiusTabsSize = (1 << 10) + 1
radiusDelta = (math.Pi / 2) / (radiusTabsSize - 1)
radiusIndexer = 1 / radiusDelta
)
func init() {
// initializes the tables used for the sloppy math functions
// earth radius
a := 6378137.0
b := 6356752.31420
a2 := a * a
b2 := b * b
earthDiameterPerLatitude = make([]float64, radiusTabsSize)
earthDiameterPerLatitude[0] = 2.0 * a / 1000
earthDiameterPerLatitude[radiusTabsSize-1] = 2.0 * b / 1000
for i := 1; i < radiusTabsSize-1; i++ {
lat := math.Pi * float64(i) / (2*radiusTabsSize - 1)
one := math.Pow(a2*math.Cos(lat), 2)
two := math.Pow(b2*math.Sin(lat), 2)
three := math.Pow(float64(a)*math.Cos(lat), 2)
four := math.Pow(b*math.Sin(lat), 2)
radius := math.Sqrt((one + two) / (three + four))
earthDiameterPerLatitude[i] = 2 * radius / 1000
}
}
// earthDiameter returns an estimation of the earth's diameter at the specified
// latitude in kilometers
func earthDiameter(lat float64) float64 {
index := math.Mod(math.Abs(lat)*radiusIndexer+0.5, float64(len(earthDiameterPerLatitude)))
if math.IsNaN(index) {
return 0
}
return earthDiameterPerLatitude[int(index)]
}
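A quick sanity-check sketch for the table above; earthDiameter is unexported, so this would live in an in-package test. It assumes the latitude argument is in radians, since radiusDelta spans [0, π/2]:

```go
package geo

import (
	"math"
	"testing"
)

func TestEarthDiameterSketch(t *testing.T) {
	// the equator maps to the first table entry: 2*a/1000 ≈ 12756.27 km
	if d := earthDiameter(0); d < 12756 || d > 12757 {
		t.Fatalf("unexpected equatorial diameter: %v", d)
	}
	// a pole maps to the last table entry: 2*b/1000 ≈ 12713.50 km
	if d := earthDiameter(math.Pi / 2); d < 12713 || d > 12714 {
		t.Fatalf("unexpected polar diameter: %v", d)
	}
}
```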

388
vendor/github.com/blevesearch/bleve/v2/index.go generated vendored Normal file
View File

@@ -0,0 +1,388 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package bleve
import (
"context"
"fmt"
"github.com/blevesearch/bleve/v2/index/upsidedown"
"github.com/blevesearch/bleve/v2/document"
"github.com/blevesearch/bleve/v2/mapping"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
// A Batch groups together multiple Index and Delete
// operations you would like performed at the same
// time. The Batch structure is NOT thread-safe.
// You should only perform operations on a batch
// from a single thread at a time. Once batch
// execution has started, you may not modify it.
type Batch struct {
index Index
internal *index.Batch
lastDocSize uint64
totalSize uint64
}
// Index adds the specified index operation to the
// batch. NOTE: the bleve Index is not updated
// until the batch is executed.
func (b *Batch) Index(id string, data interface{}) error {
if id == "" {
return ErrorEmptyID
}
if eventIndex, ok := b.index.(index.EventIndex); ok {
eventIndex.FireIndexEvent()
}
doc := document.NewDocument(id)
err := b.index.Mapping().MapDocument(doc, data)
if err != nil {
return err
}
b.internal.Update(doc)
b.lastDocSize = uint64(doc.Size() +
len(id) + size.SizeOfString) // overhead from internal
b.totalSize += b.lastDocSize
return nil
}
// IndexSynonym adds the specified synonym definition to the
// batch. NOTE: the bleve Index is not updated until the batch
// is executed.
func (b *Batch) IndexSynonym(id string, collection string, definition *SynonymDefinition) error {
if id == "" {
return ErrorEmptyID
}
if eventIndex, ok := b.index.(index.EventIndex); ok {
eventIndex.FireIndexEvent()
}
synMap, ok := b.index.Mapping().(mapping.SynonymMapping)
if !ok {
return ErrorSynonymSearchNotSupported
}
if err := definition.Validate(); err != nil {
return err
}
doc := document.NewSynonymDocument(id)
err := synMap.MapSynonymDocument(doc, collection, definition.Input, definition.Synonyms)
if err != nil {
return err
}
b.internal.Update(doc)
b.lastDocSize = uint64(doc.Size() +
len(id) + size.SizeOfString) // overhead from internal
b.totalSize += b.lastDocSize
return nil
}
func (b *Batch) LastDocSize() uint64 {
return b.lastDocSize
}
func (b *Batch) TotalDocsSize() uint64 {
return b.totalSize
}
// IndexAdvanced adds the specified index operation to the
// batch which skips the mapping. NOTE: the bleve Index is not updated
// until the batch is executed.
func (b *Batch) IndexAdvanced(doc *document.Document) (err error) {
if doc.ID() == "" {
return ErrorEmptyID
}
b.internal.Update(doc)
return nil
}
// Delete adds the specified delete operation to the
// batch. NOTE: the bleve Index is not updated until
// the batch is executed.
func (b *Batch) Delete(id string) {
if id != "" {
b.internal.Delete(id)
}
}
// SetInternal adds the specified set internal
// operation to the batch. NOTE: the bleve Index is
// not updated until the batch is executed.
func (b *Batch) SetInternal(key, val []byte) {
b.internal.SetInternal(key, val)
}
// DeleteInternal adds the specified delete internal
// operation to the batch. NOTE: the bleve Index is
// not updated until the batch is executed.
func (b *Batch) DeleteInternal(key []byte) {
b.internal.DeleteInternal(key)
}
// Size returns the total number of operations inside the batch
// including normal index operations and internal operations.
func (b *Batch) Size() int {
return len(b.internal.IndexOps) + len(b.internal.InternalOps)
}
// String prints a user friendly string representation of what
// is inside this batch.
func (b *Batch) String() string {
return b.internal.String()
}
// Reset returns a Batch to the empty state so that it can
// be re-used in the future.
func (b *Batch) Reset() {
b.internal.Reset()
b.lastDocSize = 0
b.totalSize = 0
}
func (b *Batch) Merge(o *Batch) {
if o != nil && o.internal != nil {
b.internal.Merge(o.internal)
if o.LastDocSize() > 0 {
b.lastDocSize = o.LastDocSize()
}
b.totalSize = uint64(b.internal.TotalDocSize())
}
}
func (b *Batch) SetPersistedCallback(f index.BatchCallback) {
b.internal.SetPersistedCallback(f)
}
func (b *Batch) PersistedCallback() index.BatchCallback {
return b.internal.PersistedCallback()
}
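A sketch of typical Batch usage against the API above; the index path and documents are hypothetical:

```go
func indexBatchExample() error {
	idx, err := bleve.Open("example.bleve") // hypothetical existing index
	if err != nil {
		return err
	}
	defer idx.Close()

	b := idx.NewBatch()
	if err := b.Index("doc1", map[string]interface{}{"desc": "a cat"}); err != nil {
		return err
	}
	b.Delete("doc2") // queue a delete alongside the update

	// nothing is visible in the index until the batch executes
	if err := idx.Batch(b); err != nil {
		return err
	}
	b.Reset() // the batch may now be reused
	return nil
}
```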
// An Index implements all the indexing and searching
// capabilities of bleve. An Index can be created
// using the New() and Open() methods.
//
// Index() takes an input value, deduces a DocumentMapping for its type,
// assigns string paths to its fields or values then applies field mappings on
// them.
//
// The DocumentMapping used to index a value is deduced by the following rules:
// 1. If value implements mapping.bleveClassifier interface, resolve the mapping
// from BleveType().
// 2. If value implements mapping.Classifier interface, resolve the mapping
// from Type().
// 3. If value has a string field or value at IndexMapping.TypeField
// (defaulting to "_type"), use it to resolve the mapping. Fields addressing
// is described below.
// 4. If IndexMapping.DefaultType is registered, return it.
// 5. Return IndexMapping.DefaultMapping.
//
// Each field or nested field of the value is identified by a string path, then
// mapped to one or several FieldMappings which extract the result for analysis.
//
// Struct values fields are identified by their "json:" tag, or by their name.
// Nested fields are identified by prefixing with their parent identifier,
// separated by a dot.
//
// Map values entries are identified by their string key. Entries not indexed
// by strings are ignored. Entry values are identified recursively like struct
// fields.
//
// Slice and array values are identified by their field name. Their elements
// are processed sequentially with the same FieldMapping.
//
// String, float64 and time.Time values are identified by their field name.
// Other types are ignored.
//
// Each value identifier is decomposed in its parts and recursively address
// SubDocumentMappings in the tree starting at the root DocumentMapping. If a
// mapping is found, all its FieldMappings are applied to the value. If no
// mapping is found and the root DocumentMapping is dynamic, default mappings
// are used based on value type and IndexMapping default configurations.
//
// Finally, mapped values are analyzed, indexed or stored. See
// FieldMapping.Analyzer to know how an analyzer is resolved for a given field.
//
// Examples:
//
// type Date struct {
// Day string `json:"day"`
// Month string
// Year string
// }
//
// type Person struct {
// FirstName string `json:"first_name"`
// LastName string
// BirthDate Date `json:"birth_date"`
// }
//
// A Person value FirstName is mapped by the SubDocumentMapping at
// "first_name". Its LastName is mapped by the one at "LastName". The day of
// BirthDate is mapped to the SubDocumentMapping "day" of the root
// SubDocumentMapping "birth_date". It will appear as the "birth_date.day"
// field in the index. The month is mapped to "birth_date.Month".
type Index interface {
// Index analyzes, indexes or stores mapped data fields. Supplied
// identifier is bound to analyzed data and will be retrieved by search
// requests. See Index interface documentation for details about mapping
// rules.
Index(id string, data interface{}) error
Delete(id string) error
NewBatch() *Batch
Batch(b *Batch) error
// Document returns specified document or nil if the document is not
// indexed or stored.
Document(id string) (index.Document, error)
// DocCount returns the number of documents in the index.
DocCount() (uint64, error)
Search(req *SearchRequest) (*SearchResult, error)
SearchInContext(ctx context.Context, req *SearchRequest) (*SearchResult, error)
Fields() ([]string, error)
FieldDict(field string) (index.FieldDict, error)
FieldDictRange(field string, startTerm []byte, endTerm []byte) (index.FieldDict, error)
FieldDictPrefix(field string, termPrefix []byte) (index.FieldDict, error)
Close() error
Mapping() mapping.IndexMapping
Stats() *IndexStat
StatsMap() map[string]interface{}
GetInternal(key []byte) ([]byte, error)
SetInternal(key, val []byte) error
DeleteInternal(key []byte) error
// Name returns the name of the index (by default this is the path)
Name() string
// SetName lets you assign your own logical name to this index
SetName(string)
// Advanced returns the internal index implementation
Advanced() (index.Index, error)
}
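A sketch of how the deduction rules above are typically wired up, reusing the Person/Date example from the comment (assuming that Person type is defined); the mapping layout is illustrative, not prescriptive:

```go
// rule 2: the value resolves its own DocumentMapping via Type()
func (p Person) Type() string { return "person" }

func buildMapping() mapping.IndexMapping {
	dateMapping := bleve.NewDocumentMapping()
	// indexed as "birth_date.day" per the field addressing rules
	dateMapping.AddFieldMappingsAt("day", bleve.NewTextFieldMapping())

	personMapping := bleve.NewDocumentMapping()
	personMapping.AddSubDocumentMapping("birth_date", dateMapping)

	im := bleve.NewIndexMapping()
	im.AddDocumentMapping("person", personMapping)
	return im
}
```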
// New index at the specified path, must not exist.
// The provided mapping will be used for all
// Index/Search operations.
func New(path string, mapping mapping.IndexMapping) (Index, error) {
return newIndexUsing(path, mapping, Config.DefaultIndexType, Config.DefaultKVStore, nil)
}
// NewMemOnly creates a memory-only index.
// The contents of the index is NOT persisted,
// and will be lost once closed.
// The provided mapping will be used for all
// Index/Search operations.
func NewMemOnly(mapping mapping.IndexMapping) (Index, error) {
return newIndexUsing("", mapping, upsidedown.Name, Config.DefaultMemKVStore, nil)
}
// NewUsing creates index at the specified path,
// which must not already exist.
// The provided mapping will be used for all
// Index/Search operations.
// The specified index type will be used.
// The specified kvstore implementation will be used
// and the provided kvconfig will be passed to its
// constructor. Note that currently the values of kvconfig must
// be able to be marshaled and unmarshaled using the encoding/json library (used
// when reading/writing the index metadata file).
func NewUsing(path string, mapping mapping.IndexMapping, indexType string, kvstore string, kvconfig map[string]interface{}) (Index, error) {
return newIndexUsing(path, mapping, indexType, kvstore, kvconfig)
}
// Open index at the specified path, must exist.
// The mapping used when it was created will be used for all Index/Search operations.
func Open(path string) (Index, error) {
return openIndexUsing(path, nil)
}
// OpenUsing opens index at the specified path, must exist.
// The mapping used when it was created will be used for all Index/Search operations.
// The provided runtimeConfig can override settings
// persisted when the kvstore was created.
func OpenUsing(path string, runtimeConfig map[string]interface{}) (Index, error) {
return openIndexUsing(path, runtimeConfig)
}
// Builder is a limited interface, used to build indexes in an offline mode.
// Items cannot be updated or deleted, and the caller MUST ensure a document is
// indexed only once.
type Builder interface {
Index(id string, data interface{}) error
Close() error
}
// NewBuilder creates a builder, which will build an index at the specified path,
// using the specified mapping and options.
func NewBuilder(path string, mapping mapping.IndexMapping, config map[string]interface{}) (Builder, error) {
return newBuilder(path, mapping, config)
}
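A sketch of the offline build path; the directory name and documents are hypothetical, and note the caller's obligation to index each id exactly once:

```go
func buildOffline(docs map[string]interface{}) error {
	b, err := bleve.NewBuilder("offline.bleve", bleve.NewIndexMapping(), nil)
	if err != nil {
		return err
	}
	for id, doc := range docs {
		if err := b.Index(id, doc); err != nil { // each id exactly once
			_ = b.Close()
			return err
		}
	}
	return b.Close() // merging and final persistence happen here
}
```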
// IndexCopyable is an index which supports an online copy operation
// of the index.
type IndexCopyable interface {
// CopyTo creates a fully functional copy of the index at the
// specified destination directory implementation.
CopyTo(d index.Directory) error
}
// FileSystemDirectory is the default implementation for the
// index.Directory interface.
type FileSystemDirectory string
// SynonymDefinition represents a synonym mapping in Bleve.
// Each instance associates one or more input terms with a list of synonyms,
// defining how terms are treated as equivalent in searches.
type SynonymDefinition struct {
// Input is an optional list of terms for unidirectional synonym mapping.
// When terms are specified in Input, they will map to the terms in Synonyms,
// making the relationship unidirectional (each Input maps to all Synonyms).
// If Input is omitted, the relationship is bidirectional among all Synonyms.
Input []string `json:"input,omitempty"`
// Synonyms is a list of terms that are considered equivalent.
// If Input is specified, each term in Input will map to each term in Synonyms.
// If Input is not specified, the Synonyms list will be treated bidirectionally,
// meaning each term in Synonyms is treated as synonymous with all others.
Synonyms []string `json:"synonyms"`
}
func (sd *SynonymDefinition) Validate() error {
if len(sd.Synonyms) == 0 {
return fmt.Errorf("synonym definition must have at least one synonym")
}
return nil
}
// SynonymIndex supports indexing synonym definitions alongside regular documents.
// Synonyms, grouped by collection name, define term relationships for query expansion in searches.
type SynonymIndex interface {
Index
// IndexSynonym indexes a synonym definition, with the specified id and belonging to the specified collection.
IndexSynonym(id string, collection string, definition *SynonymDefinition) error
}
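A sketch of indexing a bidirectional synonym set through the interfaces above; the collection name and terms are hypothetical:

```go
func addSynonyms(idx bleve.Index) error {
	si, ok := idx.(bleve.SynonymIndex)
	if !ok {
		return bleve.ErrorSynonymSearchNotSupported
	}
	// no Input, so the relationship is bidirectional among all three terms
	def := &bleve.SynonymDefinition{
		Synonyms: []string{"quick", "fast", "speedy"},
	}
	return si.IndexSynonym("syn-1", "en-synonyms", def)
}
```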

View File

@@ -0,0 +1,369 @@
# scorch
## Definitions
Batch
- A collection of Documents to mutate in the index.
Document
- Has a unique identifier (arbitrary bytes).
- Is comprised of a list of fields.
Field
- Has a name (string).
- Has a type (text, number, date, geopoint).
- Has a value (depending on type).
- Can be indexed, stored, or both.
- If indexed, can be analyzed.
- If indexed, can optionally store term vectors.
## Scope
Scorch *MUST* implement the bleve.index API without requiring any changes to this API.
Scorch *MAY* introduce new interfaces, which can be discovered to allow use of new capabilities not in the current API.
## Implementation
The scorch implementation starts with the concept of a segmented index.
A segment is simply a slice, subset, or portion of the entire index. A segmented index is one which is composed of one or more segments. Although segments are created in a particular order, knowing this ordering is not required to achieve correct semantics when querying. Because there is no ordering, this means that when searching an index, you can (and should) search all the segments concurrently.
### Internal Wrapper
In order to accommodate the existing APIs while also improving the implementation, the scorch implementation includes some wrapper functionality that must be described.
#### \_id field
In scorch, field 0 is prearranged to be named \_id. All documents have a value for this field, which is the document's external identifier. In this version the field *MUST* be both indexed AND stored. The scorch wrapper adds this field, as it will not be present in the Document from the calling bleve code.
NOTE: If a document already contains a field \_id, it will be replaced. If this is problematic, the caller must ensure such a scenario does not happen.
### Proposed Structures
```go
type Segment interface {
Dictionary(field string) TermDictionary
}
type TermDictionary interface {
PostingsList(term string, excluding PostingsList) PostingsList
}
type PostingsList interface {
Next() Posting
And(other PostingsList) PostingsList
Or(other PostingsList) PostingsList
}
type Posting interface {
Number() uint64
Frequency() uint64
Norm() float64
Locations() Locations
}
type Locations interface {
Start() uint64
End() uint64
Pos() uint64
ArrayPositions() ...
}
type DeletedDocs struct {
}
type SegmentSnapshot struct {
segment Segment
deleted PostingsList
}
type IndexSnapshot struct {
segment []SegmentSnapshot
}
```
**What about errors?**
**What about memory management or context?**
**Postings List separate iterator to separate stateful from stateless**
### Mutating the Index
The bleve.index API has methods for directly making individual mutations (Update/Delete/SetInternal/DeleteInternal), however for this first implementation, we assume that all of these calls can simply be turned into a Batch of size 1. This may be highly inefficient, but it will be correct. This decision is made based on the fact that Couchbase FTS always uses Batches.
NOTE: As a side-effect of this decision, it should be clear that performance tuning may depend on the batch size, which may in turn require changes in FTS.
From this point forward, only Batch mutations will be discussed.
Sequence of Operations:
1. For each document in the batch, search through all existing segments. The goal is to build up a per-segment bitset which tells us which documents in that segment are obsoleted by the addition of the new segment we're currently building. NOTE: we're not ready for this change to take effect yet, so rather than mutating anything, these operations simply return bitsets, which we can apply later. Logically, this is something like:
```go
foreach segment {
dict := segment.Dictionary("_id")
postings := empty postings list
foreach docID {
postings = postings.Or(dict.PostingsList(docID, nil))
}
}
```
NOTE: it is illustrated above as nested for loops, but some or all of these could be done concurrently. The end result is that for each segment, we have a (possibly empty) bitset.
2. Concurrently with step 1, the documents in the batch are analyzed. This analysis proceeds using the existing analyzer pool.
3. (after 2 completes) Analyzed documents are fed into a function which builds a new Segment representing this information.
4. We now have everything we need to update the state of the system to include this new snapshot.
- Acquire a lock
- Create a new IndexSnapshot
- For each SegmentSnapshot in the IndexSnapshot, take the deleted PostingsList and OR it with the new postings list for this Segment. Construct a new SegmentSnapshot for the segment using this new deleted PostingsList. Append this SegmentSnapshot to the IndexSnapshot.
- Create a new SegmentSnapshot wrapping our new segment with nil deleted docs.
- Append the new SegmentSnapshot to the IndexSnapshot
- Release the lock
An ASCII art example:
```text
0 - Empty Index
No segments
IndexSnapshot
segments []
deleted []
1 - Index Batch [ A B C ]
segment 0
numbers [ 1 2 3 ]
_id [ A B C ]
IndexSnapshot
segments [ 0 ]
deleted [ nil ]
2 - Index Batch [ B' ]
segment 0 1
numbers [ 1 2 3 ] [ 1 ]
_id [ A B C ] [ B ]
Compute bitset segment-0-deleted-by-1:
[ 0 1 0 ]
OR it with previous (nil) (call it 0-1)
[ 0 1 0 ]
IndexSnapshot
segments [ 0 1 ]
deleted [ 0-1 nil ]
3 - Index Batch [ C' ]
segment 0 1 2
numbers [ 1 2 3 ] [ 1 ] [ 1 ]
\_id [ A B C ] [ B ] [ C ]
Compute bitset segment-0-deleted-by-2:
[ 0 0 1 ]
OR it with previous ([ 0 1 0 ]) (call it 0-12)
[ 0 1 1 ]
Compute bitset segment-1-deleted-by-2:
[ 0 ]
OR it with previous (nil)
still just nil
IndexSnapshot
segments [ 0 1 2 ]
deleted [ 0-12 nil nil ]
```
**is there opportunity to stop early when doc is found in one segment?**
**also, is there a more efficient way to find bits for long lists of ids?**
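A small sketch of this bookkeeping with the roaring bitmaps the implementation uses (doc positions hypothetical, 0-based here):

```go
// segment 0 after batch 1: nothing deleted yet
var prev *roaring.Bitmap // nil means "no deletions"

// batch 2 obsoletes B, at position 1 within segment 0
delta := roaring.NewBitmap()
delta.Add(1)

// OR with the previous deleted set to build the new snapshot's view
deleted := delta
if prev != nil {
	deleted = roaring.Or(prev, delta)
}
```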
### Searching
In the bleve.index API all searching starts by getting an IndexReader, which represents a snapshot of the index at a point in time.
As described in the section above, our index implementation maintains a pointer to the current IndexSnapshot. When a caller gets an IndexReader, they get a copy of this pointer, and can use it as long as they like. The IndexSnapshot contains SegmentSnapshots, which only contain pointers to immutable segments. The deleted posting lists associated with a segment change over time, but the particular deleted posting list in YOUR snapshot is immutable. This gives a stable view of the data.
#### Term Search
Term search is the only searching primitive exposed in today's bleve.index API. This ultimately could limit our ability to take advantage of the indexing improvements, but it also means it will be easier to get a first version of this working.
A term search for term T in field F will look something like this:
```go
searchResultPostings = empty
foreach segment {
dict := segment.Dictionary(F)
segmentResultPostings = dict.PostingsList(T, segmentSnapshotDeleted)
// make segmentLocal numbers into global numbers, and flip bits in searchResultPostings
}
```
The searchResultPostings will be a new implementation of the TermFieldReader interface.
As a reminder this interface is:
```go
// TermFieldReader is the interface exposing the enumeration of documents
// containing a given term in a given field. Documents are returned in byte
// lexicographic order over their identifiers.
type TermFieldReader interface {
// Next returns the next document containing the term in this field, or nil
// when it reaches the end of the enumeration. The preAlloced TermFieldDoc
// is optional, and when non-nil, will be used instead of allocating memory.
Next(preAlloced *TermFieldDoc) (*TermFieldDoc, error)
// Advance resets the enumeration at specified document or its immediate
// follower.
Advance(ID IndexInternalID, preAlloced *TermFieldDoc) (*TermFieldDoc, error)
// Count returns the number of documents containing the term in this field.
Count() uint64
Close() error
}
```
At first glance this appears problematic: we have no way to return documents in order of their identifiers. But it turns out this wording is perhaps too strong, or a bit ambiguous. Originally, this referred to the external identifiers, but with the introduction of a distinction between internal/external identifiers, returning them in order of their internal identifiers is also acceptable. **ASIDE**: the reason for this is that most callers just use Next() and literally don't care what the order is; the hits could come back in any order and it would be fine. There is only one search that cares, and that is the ConjunctionSearcher, which relies on Next/Advance having very specific semantics. Later in this document we will have a proposal to split into multiple interfaces:
- The weakest interface, only supports Next() no ordering at all.
- Ordered, supporting Advance()
- And/Or'able capable of internally efficiently doing these ops with like interfaces (if not capable then can always fall back to external walking)
But, the good news is that we don't even have to do that for our first implementation. As long as the global numbers we use for internal identifiers are consistent within this IndexSnapshot, then Next() will be ordered by ascending document number, and Advance() will still work correctly.
NOTE: there is another place where we rely on the ordering of these hits, and that is in the "\_id" sort order. Previously this was the natural order, and a NOOP for the collector; now it must be implemented by actually sorting on the "\_id" field. We probably should introduce at least a marker interface to detect this.
An ASCII art example:
```text
Let's start with the IndexSnapshot we ended with earlier:
3 - Index Batch [ C' ]
segment 0 1 2
numbers [ 1 2 3 ] [ 1 ] [ 1 ]
_id [ A B C ] [ B ] [ C ]
Compute bitset segment-0-deleted-by-2:
[ 0 0 1 ]
OR it with previous ([ 0 1 0 ]) (call it 0-12)
[ 0 1 1 ]
Compute bitset segment-1-deleted-by-2:
[ 0 ]
OR it with previous (nil)
still just nil
IndexSnapshot
segments [ 0 1 2 ]
deleted [ 0-12 nil nil ]
Now let's search for the term 'cat' in the field 'desc' and let's assume that Document C (both versions) would match it.
Concurrently:
- Segment 0
- Get Term Dictionary For Field 'desc'
- From it get Postings List for term 'cat' EXCLUDING 0-12
- raw segment matches [ 0 0 1 ] but excluding [ 0 1 1 ] gives [ 0 0 0 ]
- Segment 1
- Get Term Dictionary For Field 'desc'
- From it get Postings List for term 'cat' excluding nil
- [ 0 ]
- Segment 2
- Get Term Dictionary For Field 'desc'
- From it get Postings List for term 'cat' excluding nil
- [ 1 ]
Map local bitsets into global number space (global meaning cross-segment but still unique to this snapshot)
IndexSnapshot already should have mapping something like:
0 - Offset 0
1 - Offset 3 (because segment 0 had 3 docs)
2 - Offset 4 (because segment 1 had 1 doc)
This maps to search result bitset:
[ 0 0 0 0 1 ]
Caller would call Next() and get doc number 5 (assuming 1 based indexing for now)
Caller could then ask to get term locations, stored fields, external doc ID for document number 5. Internally in the IndexSnapshot, we can now convert that back, and realize doc number 5 comes from segment 2, 5-4=1 so we're looking for doc number 1 in segment 2. That happens to be C...
```
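The offset arithmetic in this example can be sketched directly with the sort package (using 0-based numbering, unlike the 1-based art above):

```go
// offsets[i] is the global number of segment i's first document
offsets := []uint64{0, 3, 4} // hypothetical: segments of 3, 1, 1 docs

globalDoc := uint64(4) // the hit that came from segment 2

// owning segment: the last i with offsets[i] <= globalDoc
i := sort.Search(len(offsets), func(x int) bool {
	return offsets[x] > globalDoc
}) - 1
localDoc := globalDoc - offsets[i] // segment 2, local doc 0 -> document C
```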
#### Future improvements
In the future, interfaces to detect these non-serially operating TermFieldReaders could expose their own And() and Or() up to the higher level Conjunction/Disjunction searchers. Doing this alone offers some win, but also means there would be greater burden on the Searcher code rewriting logical expressions for maximum performance.
Another related topic is that of peak memory usage. With serially operating TermFieldReaders it was necessary to start them all at the same time and operate in unison. However, with these non-serially operating TermFieldReaders we have the option of doing a few at a time, consolidating them, disposing of the intermediaries, and then doing a few more. For very complex queries with many clauses this could reduce peak memory usage.
### Memory Tracking
All segments must be able to produce two statistics: an estimate of their explicit memory usage, and their actual size on disk (if any). For in-memory segments, disk usage could be zero, and the memory usage represents the entire information content. For mmap-based disk segments, the memory could be as low as the size of the tracking structure itself (say just a few pointers).
This would allow the implementation to throttle or block incoming mutations when a threshold memory usage has (or would be) exceeded.
### Persistence
Obviously, we want to support (but maybe not require) asynchronous persistence of segments. My expectation is that segments are initially built in memory. At some point they are persisted to disk. This poses some interesting challenges.
At runtime, the state of an index (its IndexSnapshot) is not only the contents of the segments, but also the bitmasks of deleted documents. These bitmasks indirectly encode an ordering in which the segments were added. The reason is that the bitmasks encode which items have been obsoleted by other, more recent segments. In the runtime implementation we compute bitmask deltas and then merge them at the same time we bring the new segment in. One idea is that we could take a similar approach on disk. When we persist a segment, we persist the bitmask deltas of segments known to exist at that time, and eventually these can get merged up into a base segment deleted bitmask.
This also relates to the topic of rollback, addressed next...
### Rollback
One desirable property in the Couchbase ecosystem is the ability to rollback to some previous (though typically not long ago) state. One idea for keeping this property in this design is to protect some of the most recent segments from merging. Then, if necessary, they could be "undone" to reveal previous states of the system. In these scenarios "undone" has to properly undo the deleted bitmasks on the other segments. Again, the current thinking is that rather than "undo" anything, it could be work that was deferred in the first place, thus making it easier to logically undo.
Another possibly related approach would be to tie this into our existing snapshot mechanism. Perhaps simulating a slow reader (holding onto index snapshots) for some period of time, can be the mechanism to achieve the desired end goal.
### Internal Storage
The bleve.index API has support for "internal storage": the ability to store information under a separate name space.
This is not used for high volume storage, so it is tempting to think we could just put a small k/v store alongside the rest of the index. But, the reality is that this storage is used to maintain key information related to the rollback scenario. Because of this, it's crucial that ordering and overwriting of key/value pairs correspond with actual segment persistence in the index. Based on this, I believe it's important to put the internal key/value pairs inside the segments themselves. But this also means that they must follow a similar "deleted" bitmask approach to obsolete values in older segments. That seems to substantially increase the complexity of the solution: because of the separate name space, it would appear to require its own bitmask. Further, keys aren't numeric, which then implies yet another mapping from internal key to number, etc.
More thought is required here.
### Merging
The segmented index approach requires merging to prevent the number of segments from growing too large.
Recent experience with LSMs has taught us that having the correct merge strategy can make a huge difference in the overall performance of the system. In particular, a simple merge strategy which merges segments too aggressively can lead to high write amplification and unnecessarily rendering cached data useless.
A few simple principles have been identified.
- Roughly we merge multiple smaller segments into a single larger one.
- The larger a segment gets the less likely we should be to ever merge it.
- Segments with large numbers of deleted/obsoleted items are good candidates as the merge will result in a space savings.
- Segments with all items deleted/obsoleted can be dropped.
Merging of a segment should be able to proceed even if that segment is held by an ongoing snapshot; it should only delay the removal of it.
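Illustrative only: the principles above might be condensed into a priority function like the sketch below. The real policy lives in scorch's mergeplan package and uses a richer cost model.

```go
// mergePriority scores a segment as a merge candidate: all-deleted
// segments can simply be dropped, small segments and segments with
// many obsoleted documents score higher, large segments score lower.
func mergePriority(liveDocs, totalDocs, segBytes uint64) float64 {
	if liveDocs == 0 {
		return math.Inf(1) // drop rather than merge
	}
	deletedRatio := float64(totalDocs-liveDocs) / float64(totalDocs)
	sizePenalty := math.Log2(float64(segBytes) + 2) // larger -> less likely
	return (1 + deletedRatio) / sizePenalty
}
```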

View File

@@ -0,0 +1,332 @@
// Copyright (c) 2019 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"fmt"
"os"
"sync"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
bolt "go.etcd.io/bbolt"
)
const DefaultBuilderBatchSize = 1000
const DefaultBuilderMergeMax = 10
type Builder struct {
m sync.Mutex
segCount uint64
path string
buildPath string
segPaths []string
batchSize int
mergeMax int
batch *index.Batch
internal map[string][]byte
segPlugin SegmentPlugin
}
func NewBuilder(config map[string]interface{}) (*Builder, error) {
path, ok := config["path"].(string)
if !ok {
return nil, fmt.Errorf("must specify path")
}
buildPathPrefix, _ := config["buildPathPrefix"].(string)
buildPath, err := os.MkdirTemp(buildPathPrefix, "scorch-offline-build")
if err != nil {
return nil, err
}
rv := &Builder{
path: path,
buildPath: buildPath,
mergeMax: DefaultBuilderMergeMax,
batchSize: DefaultBuilderBatchSize,
batch: index.NewBatch(),
segPlugin: defaultSegmentPlugin,
}
err = rv.parseConfig(config)
if err != nil {
return nil, fmt.Errorf("error parsing builder config: %v", err)
}
return rv, nil
}
func (o *Builder) parseConfig(config map[string]interface{}) (err error) {
if v, ok := config["mergeMax"]; ok {
var t int
if t, err = parseToInteger(v); err != nil {
return fmt.Errorf("mergeMax parse err: %v", err)
}
if t > 0 {
o.mergeMax = t
}
}
if v, ok := config["batchSize"]; ok {
var t int
if t, err = parseToInteger(v); err != nil {
return fmt.Errorf("batchSize parse err: %v", err)
}
if t > 0 {
o.batchSize = t
}
}
if v, ok := config["internal"]; ok {
if vinternal, ok := v.(map[string][]byte); ok {
o.internal = vinternal
}
}
forcedSegmentType, forcedSegmentVersion, err := configForceSegmentTypeVersion(config)
if err != nil {
return err
}
if forcedSegmentType != "" && forcedSegmentVersion != 0 {
segPlugin, err := chooseSegmentPlugin(forcedSegmentType,
uint32(forcedSegmentVersion))
if err != nil {
return err
}
o.segPlugin = segPlugin
}
return nil
}
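A sketch of a config map exercising the keys parseConfig recognizes; the paths and values are hypothetical:

```go
config := map[string]interface{}{
	"path":            "/data/offline-index", // required destination
	"buildPathPrefix": "/tmp",                // parent dir for temp build segments
	"mergeMax":        10,                    // segments merged per round
	"batchSize":       1000,                  // docs per flushed batch
	"internal":        map[string][]byte{"meta": []byte("v1")},
}
b, err := scorch.NewBuilder(config)
```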
// Index will place the document into the index.
// It is invalid to index the same document multiple times.
func (o *Builder) Index(doc index.Document) error {
o.m.Lock()
defer o.m.Unlock()
o.batch.Update(doc)
return o.maybeFlushBatchLOCKED(o.batchSize)
}
func (o *Builder) maybeFlushBatchLOCKED(moreThan int) error {
if len(o.batch.IndexOps) >= moreThan {
defer o.batch.Reset()
return o.executeBatchLOCKED(o.batch)
}
return nil
}
func (o *Builder) executeBatchLOCKED(batch *index.Batch) (err error) {
analysisResults := make([]index.Document, 0, len(batch.IndexOps))
for _, doc := range batch.IndexOps {
if doc != nil {
// insert _id field
doc.AddIDField()
// perform analysis directly
analyze(doc, nil)
analysisResults = append(analysisResults, doc)
}
}
seg, _, err := o.segPlugin.New(analysisResults)
if err != nil {
return fmt.Errorf("error building segment base: %v", err)
}
filename := zapFileName(o.segCount)
o.segCount++
path := o.buildPath + string(os.PathSeparator) + filename
if segUnpersisted, ok := seg.(segment.UnpersistedSegment); ok {
err = segUnpersisted.Persist(path)
if err != nil {
return fmt.Errorf("error persisting segment base to %s: %v", path, err)
}
o.segPaths = append(o.segPaths, path)
return nil
}
return fmt.Errorf("new segment does not implement unpersisted: %T", seg)
}
func (o *Builder) doMerge() error {
// as long as we have more than 1 segment, keep merging
for len(o.segPaths) > 1 {
// merge the next <mergeMax> number of segments into one new one
// or, if there are fewer than <mergeMax> remaining, merge them all
mergeCount := o.mergeMax
if mergeCount > len(o.segPaths) {
mergeCount = len(o.segPaths)
}
mergePaths := o.segPaths[0:mergeCount]
o.segPaths = o.segPaths[mergeCount:]
// open each of the segments to be merged
mergeSegs := make([]segment.Segment, 0, mergeCount)
// closeOpenedSegs attempts to close all opened
// segments even if an error occurs, in which case
// the first error is returned
closeOpenedSegs := func() error {
var err error
for _, seg := range mergeSegs {
clErr := seg.Close()
if clErr != nil && err == nil {
err = clErr
}
}
return err
}
for _, mergePath := range mergePaths {
seg, err := o.segPlugin.Open(mergePath)
if err != nil {
_ = closeOpenedSegs()
return fmt.Errorf("error opening segment (%s) for merge: %v", mergePath, err)
}
mergeSegs = append(mergeSegs, seg)
}
// do the merge
mergedSegPath := o.buildPath + string(os.PathSeparator) + zapFileName(o.segCount)
drops := make([]*roaring.Bitmap, mergeCount)
_, _, err := o.segPlugin.Merge(mergeSegs, drops, mergedSegPath, nil, nil)
if err != nil {
_ = closeOpenedSegs()
return fmt.Errorf("error merging segments (%v): %v", mergePaths, err)
}
o.segCount++
o.segPaths = append(o.segPaths, mergedSegPath)
// close segments opened for merge
err = closeOpenedSegs()
if err != nil {
return fmt.Errorf("error closing opened segments: %v", err)
}
// remove merged segments
for _, mergePath := range mergePaths {
err = os.RemoveAll(mergePath)
if err != nil {
return fmt.Errorf("error removing segment %s after merge: %v", mergePath, err)
}
}
}
return nil
}
func (o *Builder) Close() error {
o.m.Lock()
defer o.m.Unlock()
// see if there is a partial batch
err := o.maybeFlushBatchLOCKED(1)
if err != nil {
return fmt.Errorf("error flushing batch before close: %v", err)
}
// perform all the merging
err = o.doMerge()
if err != nil {
return fmt.Errorf("error while merging: %v", err)
}
// ensure the store path exists
err = os.MkdirAll(o.path, 0700)
if err != nil {
return err
}
// move final segment into place
// segment id 2 is chosen to match the behavior of a scorch
// index which indexes a single batch of data
finalSegPath := o.path + string(os.PathSeparator) + zapFileName(2)
err = os.Rename(o.segPaths[0], finalSegPath)
if err != nil {
return fmt.Errorf("error moving final segment into place: %v", err)
}
// remove the buildPath, as it is no longer needed
err = os.RemoveAll(o.buildPath)
if err != nil {
return fmt.Errorf("error removing build path: %v", err)
}
// prepare wrapping
seg, err := o.segPlugin.Open(finalSegPath)
if err != nil {
return fmt.Errorf("error opening final segment")
}
// create a segment snapshot for this segment
ss := &SegmentSnapshot{
segment: seg,
}
is := &IndexSnapshot{
epoch: 3, // chosen to match scorch behavior when indexing a single batch
segment: []*SegmentSnapshot{ss},
creator: "scorch-builder",
internal: o.internal,
}
// create the root bolt
rootBoltPath := o.path + string(os.PathSeparator) + "root.bolt"
rootBolt, err := bolt.Open(rootBoltPath, 0600, nil)
if err != nil {
return err
}
// start a write transaction
tx, err := rootBolt.Begin(true)
if err != nil {
return err
}
// fill the root bolt with this fake index snapshot
_, _, err = prepareBoltSnapshot(is, tx, o.path, o.segPlugin, nil, nil)
if err != nil {
_ = tx.Rollback()
_ = rootBolt.Close()
return fmt.Errorf("error preparing bolt snapshot in root.bolt: %v", err)
}
// commit bolt data
err = tx.Commit()
if err != nil {
_ = rootBolt.Close()
return fmt.Errorf("error committing bolt tx in root.bolt: %v", err)
}
// close bolt
err = rootBolt.Close()
if err != nil {
return fmt.Errorf("error closing root.bolt: %v", err)
}
// close final segment
err = seg.Close()
if err != nil {
return fmt.Errorf("error closing final segment: %v", err)
}
return nil
}

View File

@@ -0,0 +1,41 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import segment "github.com/blevesearch/scorch_segment_api/v2"
type emptyPostingsIterator struct{}
func (e *emptyPostingsIterator) Next() (segment.Posting, error) {
return nil, nil
}
func (e *emptyPostingsIterator) Advance(uint64) (segment.Posting, error) {
return nil, nil
}
func (e *emptyPostingsIterator) Size() int {
return 0
}
func (e *emptyPostingsIterator) BytesRead() uint64 {
return 0
}
func (e *emptyPostingsIterator) ResetBytesRead(uint64) {}
func (e *emptyPostingsIterator) BytesWritten() uint64 { return 0 }
var anEmptyPostingsIterator = &emptyPostingsIterator{}

View File

@@ -0,0 +1,77 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import "time"
// RegistryAsyncErrorCallbacks should be treated as read-only after
// process init()'ialization.
var RegistryAsyncErrorCallbacks = map[string]func(error, string){}
// RegistryEventCallbacks should be treated as read-only after
// process init()'ialization.
// When no callback is registered for an event, it is treated as returning true.
var RegistryEventCallbacks = map[string]func(Event) bool{}
// Event represents the information provided in an OnEvent() callback.
type Event struct {
Kind EventKind
Scorch *Scorch
Duration time.Duration
}
// EventKind represents an event code for OnEvent() callbacks.
type EventKind int
// EventKindCloseStart is fired when a Scorch.Close() has begun.
var EventKindCloseStart = EventKind(1)
// EventKindClose is fired when a scorch index has been fully closed.
var EventKindClose = EventKind(2)
// EventKindMergerProgress is fired when the merger has completed a
// round of merge processing.
var EventKindMergerProgress = EventKind(3)
// EventKindPersisterProgress is fired when the persister has completed
// a round of persistence processing.
var EventKindPersisterProgress = EventKind(4)
// EventKindBatchIntroductionStart is fired when Batch() is invoked which
// introduces a new segment.
var EventKindBatchIntroductionStart = EventKind(5)
// EventKindBatchIntroduction is fired when Batch() completes.
var EventKindBatchIntroduction = EventKind(6)
// EventKindMergeTaskIntroductionStart is fired when the merger is about to
// start the introduction of merged segment from a single merge task.
var EventKindMergeTaskIntroductionStart = EventKind(7)
// EventKindMergeTaskIntroduction is fired when the merger has completed
// the introduction of merged segment from a single merge task.
var EventKindMergeTaskIntroduction = EventKind(8)
// EventKindPreMergeCheck is fired before the merge begins to check if
// the caller should proceed with the merge.
var EventKindPreMergeCheck = EventKind(9)
// EventKindIndexStart is fired when Index() is invoked which
// creates a new Document object from an interface using the index mapping.
var EventKindIndexStart = EventKind(10)
// EventKindPurgerCheck is fired before the purge code is invoked and decides
// whether to execute or not. For unit test purposes
var EventKindPurgerCheck = EventKind(11)
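A sketch of registering a callback at process init; the registry name and logging are hypothetical, and an index opts in to the named callback through its scorch configuration:

```go
func init() {
	scorch.RegistryEventCallbacks["my-app"] = func(e scorch.Event) bool {
		switch e.Kind {
		case scorch.EventKindPersisterProgress:
			log.Printf("persister round took %v", e.Duration)
		case scorch.EventKindMergerProgress:
			log.Printf("merger round took %v", e.Duration)
		}
		return true // allow the operation (e.g. a pre-merge check) to proceed
	}
}
```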

View File

@@ -0,0 +1,92 @@
// Copyright 2014 The Cockroach Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
// implied. See the License for the specific language governing
// permissions and limitations under the License.
// This code originated from:
// https://github.com/cockroachdb/cockroach/blob/2dd65dde5d90c157f4b93f92502ca1063b904e1d/pkg/util/encoding/encoding.go
// Modified to not use pkg/errors
package scorch
import "fmt"
const (
// intMin is chosen such that the range of int tags does not overlap the
// ascii character set that is frequently used in testing.
intMin = 0x80 // 128
intMaxWidth = 8
intZero = intMin + intMaxWidth // 136
intSmall = intMax - intZero - intMaxWidth // 109
// intMax is the maximum int tag value.
intMax = 0xfd // 253
)
// encodeUvarintAscending encodes the uint64 value using a variable length
// (length-prefixed) representation. The length is encoded as a single
// byte indicating the number of encoded bytes (-8) to follow. See
// EncodeVarintAscending for rationale. The encoded bytes are appended to the
// supplied buffer and the final buffer is returned.
func encodeUvarintAscending(b []byte, v uint64) []byte {
switch {
case v <= intSmall:
return append(b, intZero+byte(v))
case v <= 0xff:
return append(b, intMax-7, byte(v))
case v <= 0xffff:
return append(b, intMax-6, byte(v>>8), byte(v))
case v <= 0xffffff:
return append(b, intMax-5, byte(v>>16), byte(v>>8), byte(v))
case v <= 0xffffffff:
return append(b, intMax-4, byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
case v <= 0xffffffffff:
return append(b, intMax-3, byte(v>>32), byte(v>>24), byte(v>>16), byte(v>>8),
byte(v))
case v <= 0xffffffffffff:
return append(b, intMax-2, byte(v>>40), byte(v>>32), byte(v>>24), byte(v>>16),
byte(v>>8), byte(v))
case v <= 0xffffffffffffff:
return append(b, intMax-1, byte(v>>48), byte(v>>40), byte(v>>32), byte(v>>24),
byte(v>>16), byte(v>>8), byte(v))
default:
return append(b, intMax, byte(v>>56), byte(v>>48), byte(v>>40), byte(v>>32),
byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
}
}
// decodeUvarintAscending decodes a varint encoded uint64 from the input
// buffer. The remainder of the input buffer and the decoded uint64
// are returned.
func decodeUvarintAscending(b []byte) ([]byte, uint64, error) {
if len(b) == 0 {
return nil, 0, fmt.Errorf("insufficient bytes to decode uvarint value")
}
length := int(b[0]) - intZero
b = b[1:] // skip length byte
if length <= intSmall {
return b, uint64(length), nil
}
length -= intSmall
if length < 0 || length > 8 {
return nil, 0, fmt.Errorf("invalid uvarint length of %d", length)
} else if len(b) < length {
return nil, 0, fmt.Errorf("insufficient bytes to decode uvarint value: %q", b)
}
var v uint64
// It is faster to range over the elements in a slice than to index
// into the slice on each loop iteration.
for _, t := range b[:length] {
v = (v << 8) | uint64(t)
}
return b[length:], v, nil
}
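A round-trip sketch for the pair above; because the encoding is order-preserving, byte-wise comparison of two encodings matches numeric comparison of the values:

```go
// 42 <= intSmall, so it encodes as the single byte intZero+42
buf := encodeUvarintAscending(nil, 42)

// larger values get a length byte followed by big-endian payload bytes:
// 300 <= 0xffff encodes as [intMax-6, 0x01, 0x2c]
buf2 := encodeUvarintAscending(nil, 300)

rest, v, err := decodeUvarintAscending(buf)
// v == 42, len(rest) == 0, err == nil
_, _, _, _ = buf2, rest, v, err
```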

View File

@@ -0,0 +1,515 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"fmt"
"path/filepath"
"sync/atomic"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
type segmentIntroduction struct {
id uint64
data segment.Segment
obsoletes map[uint64]*roaring.Bitmap
ids []string
internal map[string][]byte
stats *fieldStats
applied chan error
persisted chan error
persistedCallback index.BatchCallback
}
type persistIntroduction struct {
persisted map[uint64]segment.Segment
applied notificationChan
}
type epochWatcher struct {
epoch uint64
notifyCh notificationChan
}
func (s *Scorch) introducerLoop() {
defer func() {
if r := recover(); r != nil {
s.fireAsyncError(&AsyncPanicError{
Source: "introducer",
Path: s.path,
})
}
s.asyncTasks.Done()
}()
var epochWatchers []*epochWatcher
OUTER:
for {
atomic.AddUint64(&s.stats.TotIntroduceLoop, 1)
select {
case <-s.closeCh:
break OUTER
case epochWatcher := <-s.introducerNotifier:
epochWatchers = append(epochWatchers, epochWatcher)
case nextMerge := <-s.merges:
s.introduceMerge(nextMerge)
case next := <-s.introductions:
err := s.introduceSegment(next)
if err != nil {
continue OUTER
}
case persist := <-s.persists:
s.introducePersist(persist)
}
var epochCurr uint64
s.rootLock.RLock()
if s.root != nil {
epochCurr = s.root.epoch
}
s.rootLock.RUnlock()
var epochWatchersNext []*epochWatcher
for _, w := range epochWatchers {
if w.epoch < epochCurr {
close(w.notifyCh)
} else {
epochWatchersNext = append(epochWatchersNext, w)
}
}
epochWatchers = epochWatchersNext
}
}
func (s *Scorch) introduceSegment(next *segmentIntroduction) error {
atomic.AddUint64(&s.stats.TotIntroduceSegmentBeg, 1)
defer atomic.AddUint64(&s.stats.TotIntroduceSegmentEnd, 1)
s.rootLock.RLock()
root := s.root
root.AddRef()
s.rootLock.RUnlock()
defer func() { _ = root.DecRef() }()
nsegs := len(root.segment)
// prepare new index snapshot
newSnapshot := &IndexSnapshot{
parent: s,
segment: make([]*SegmentSnapshot, 0, nsegs+1),
offsets: make([]uint64, 0, nsegs+1),
internal: make(map[string][]byte, len(root.internal)),
refs: 1,
creator: "introduceSegment",
}
// iterate through current segments
var running uint64
var docsToPersistCount, memSegments, fileSegments uint64
var droppedSegmentFiles []string
for i := range root.segment {
// see if optimistic work included this segment
delta, ok := next.obsoletes[root.segment[i].id]
if !ok {
var err error
delta, err = root.segment[i].segment.DocNumbers(next.ids)
if err != nil {
next.applied <- fmt.Errorf("error computing doc numbers: %v", err)
close(next.applied)
_ = newSnapshot.DecRef()
return err
}
}
newss := &SegmentSnapshot{
id: root.segment[i].id,
segment: root.segment[i].segment,
stats: root.segment[i].stats,
cachedDocs: root.segment[i].cachedDocs,
cachedMeta: root.segment[i].cachedMeta,
creator: root.segment[i].creator,
}
// apply new obsoletions
if root.segment[i].deleted == nil {
newss.deleted = delta
} else {
if delta.IsEmpty() {
newss.deleted = root.segment[i].deleted
} else {
newss.deleted = roaring.Or(root.segment[i].deleted, delta)
}
}
if newss.deleted.IsEmpty() {
newss.deleted = nil
}
// check for live size before copying
if newss.LiveSize() > 0 {
newSnapshot.segment = append(newSnapshot.segment, newss)
root.segment[i].segment.AddRef()
newSnapshot.offsets = append(newSnapshot.offsets, running)
running += newss.segment.Count()
} else if seg, ok := newss.segment.(segment.PersistedSegment); ok {
droppedSegmentFiles = append(droppedSegmentFiles,
filepath.Base(seg.Path()))
}
if isMemorySegment(root.segment[i]) {
docsToPersistCount += root.segment[i].Count()
memSegments++
} else {
fileSegments++
}
}
atomic.StoreUint64(&s.stats.TotItemsToPersist, docsToPersistCount)
atomic.StoreUint64(&s.stats.TotMemorySegmentsAtRoot, memSegments)
atomic.StoreUint64(&s.stats.TotFileSegmentsAtRoot, fileSegments)
// append new segment, if any, to end of the new index snapshot
if next.data != nil {
newSegmentSnapshot := &SegmentSnapshot{
id: next.id,
segment: next.data, // take ownership of next.data's ref-count
stats: next.stats,
cachedDocs: &cachedDocs{cache: nil},
cachedMeta: &cachedMeta{meta: nil},
creator: "introduceSegment",
}
newSnapshot.segment = append(newSnapshot.segment, newSegmentSnapshot)
newSnapshot.offsets = append(newSnapshot.offsets, running)
// increment numItemsIntroduced which tracks the number of items
// queued for persistence.
atomic.AddUint64(&s.stats.TotIntroducedItems, newSegmentSnapshot.Count())
atomic.AddUint64(&s.stats.TotIntroducedSegmentsBatch, 1)
}
// copy old values
for key, oldVal := range root.internal {
newSnapshot.internal[key] = oldVal
}
// set new values and apply deletes
for key, newVal := range next.internal {
if newVal != nil {
newSnapshot.internal[key] = newVal
} else {
delete(newSnapshot.internal, key)
}
}
newSnapshot.updateSize()
s.rootLock.Lock()
if next.persisted != nil {
s.rootPersisted = append(s.rootPersisted, next.persisted)
}
if next.persistedCallback != nil {
s.persistedCallbacks = append(s.persistedCallbacks, next.persistedCallback)
}
// swap in new index snapshot
newSnapshot.epoch = s.nextSnapshotEpoch
s.nextSnapshotEpoch++
rootPrev := s.root
s.root = newSnapshot
atomic.StoreUint64(&s.stats.CurRootEpoch, s.root.epoch)
// release lock
s.rootLock.Unlock()
if rootPrev != nil {
_ = rootPrev.DecRef()
}
// update the removal eligibility for those segment files
// that are not a part of the latest root.
for _, filename := range droppedSegmentFiles {
s.unmarkIneligibleForRemoval(filename)
}
close(next.applied)
return nil
}
func (s *Scorch) introducePersist(persist *persistIntroduction) {
atomic.AddUint64(&s.stats.TotIntroducePersistBeg, 1)
defer atomic.AddUint64(&s.stats.TotIntroducePersistEnd, 1)
s.rootLock.Lock()
root := s.root
root.AddRef()
nextSnapshotEpoch := s.nextSnapshotEpoch
s.nextSnapshotEpoch++
s.rootLock.Unlock()
defer func() { _ = root.DecRef() }()
newIndexSnapshot := &IndexSnapshot{
parent: s,
epoch: nextSnapshotEpoch,
segment: make([]*SegmentSnapshot, len(root.segment)),
offsets: make([]uint64, len(root.offsets)),
internal: make(map[string][]byte, len(root.internal)),
refs: 1,
creator: "introducePersist",
}
var docsToPersistCount, memSegments, fileSegments uint64
for i, segmentSnapshot := range root.segment {
// see if this segment has been replaced
if replacement, ok := persist.persisted[segmentSnapshot.id]; ok {
newSegmentSnapshot := &SegmentSnapshot{
id: segmentSnapshot.id,
segment: replacement,
deleted: segmentSnapshot.deleted,
stats: segmentSnapshot.stats,
cachedDocs: segmentSnapshot.cachedDocs,
cachedMeta: segmentSnapshot.cachedMeta,
creator: "introducePersist",
mmaped: 1,
}
newIndexSnapshot.segment[i] = newSegmentSnapshot
delete(persist.persisted, segmentSnapshot.id)
// update items persisted in case of a new segment snapshot
atomic.AddUint64(&s.stats.TotPersistedItems, newSegmentSnapshot.Count())
atomic.AddUint64(&s.stats.TotPersistedSegments, 1)
fileSegments++
} else {
newIndexSnapshot.segment[i] = root.segment[i]
newIndexSnapshot.segment[i].segment.AddRef()
if isMemorySegment(root.segment[i]) {
docsToPersistCount += root.segment[i].Count()
memSegments++
} else {
fileSegments++
}
}
newIndexSnapshot.offsets[i] = root.offsets[i]
}
for k, v := range root.internal {
newIndexSnapshot.internal[k] = v
}
atomic.StoreUint64(&s.stats.TotItemsToPersist, docsToPersistCount)
atomic.StoreUint64(&s.stats.TotMemorySegmentsAtRoot, memSegments)
atomic.StoreUint64(&s.stats.TotFileSegmentsAtRoot, fileSegments)
newIndexSnapshot.updateSize()
s.rootLock.Lock()
rootPrev := s.root
s.root = newIndexSnapshot
atomic.StoreUint64(&s.stats.CurRootEpoch, s.root.epoch)
s.rootLock.Unlock()
if rootPrev != nil {
_ = rootPrev.DecRef()
}
close(persist.applied)
}
// The introducer should definitely handle the segmentMerge.notify
// channel before exiting the introduceMerge.
func (s *Scorch) introduceMerge(nextMerge *segmentMerge) {
atomic.AddUint64(&s.stats.TotIntroduceMergeBeg, 1)
defer atomic.AddUint64(&s.stats.TotIntroduceMergeEnd, 1)
s.rootLock.RLock()
root := s.root
root.AddRef()
s.rootLock.RUnlock()
defer func() { _ = root.DecRef() }()
newSnapshot := &IndexSnapshot{
parent: s,
internal: root.internal,
refs: 1,
creator: "introduceMerge",
}
var running, docsToPersistCount, memSegments, fileSegments uint64
var droppedSegmentFiles []string
newSegmentDeleted := make([]*roaring.Bitmap, len(nextMerge.new))
for i := range newSegmentDeleted {
// create a bitmap per newly merged segment to track its obsoleted docs
newSegmentDeleted[i] = roaring.NewBitmap()
}
// iterate through current segments
for i := range root.segment {
segmentID := root.segment[i].id
if segSnapAtMerge, ok := nextMerge.mergedSegHistory[segmentID]; ok {
// this segment is going away, see if anything else was deleted since we started the merge
if segSnapAtMerge != nil && root.segment[i].deleted != nil {
// assume all these deletes are new
deletedSince := root.segment[i].deleted
// if we already knew about some of them, remove
if segSnapAtMerge.oldSegment.deleted != nil {
deletedSince = roaring.AndNot(root.segment[i].deleted, segSnapAtMerge.oldSegment.deleted)
}
deletedSinceItr := deletedSince.Iterator()
for deletedSinceItr.HasNext() {
oldDocNum := deletedSinceItr.Next()
newDocNum := segSnapAtMerge.oldNewDocIDs[oldDocNum]
newSegmentDeleted[segSnapAtMerge.workerID].Add(uint32(newDocNum))
}
}
// drop this segment from the merged-segment history; whatever
// entries remain in the map after walking all the root segments
// form the set of segments that became obsolete (left the root)
// while the merge was in flight
delete(nextMerge.mergedSegHistory, segmentID)
} else if root.segment[i].LiveSize() > 0 {
// this segment is staying
newSnapshot.segment = append(newSnapshot.segment, &SegmentSnapshot{
id: root.segment[i].id,
segment: root.segment[i].segment,
deleted: root.segment[i].deleted,
stats: root.segment[i].stats,
cachedDocs: root.segment[i].cachedDocs,
cachedMeta: root.segment[i].cachedMeta,
creator: root.segment[i].creator,
})
root.segment[i].segment.AddRef()
newSnapshot.offsets = append(newSnapshot.offsets, running)
running += root.segment[i].segment.Count()
if isMemorySegment(root.segment[i]) {
docsToPersistCount += root.segment[i].Count()
memSegments++
} else {
fileSegments++
}
} else if root.segment[i].LiveSize() == 0 {
if seg, ok := root.segment[i].segment.(segment.PersistedSegment); ok {
droppedSegmentFiles = append(droppedSegmentFiles,
filepath.Base(seg.Path()))
}
}
}
// before the new merge is introduced, reconcile the newly merged
// segments with the current root: for any old segment that has left
// the root since the merge began, mark all of its live docs as
// deleted in the merged output
for _, ss := range nextMerge.mergedSegHistory {
obsoleted := ss.oldSegment.DocNumbersLive()
if obsoleted != nil {
obsoletedIter := obsoleted.Iterator()
for obsoletedIter.HasNext() {
oldDocNum := obsoletedIter.Next()
newDocNum := ss.oldNewDocIDs[oldDocNum]
newSegmentDeleted[ss.workerID].Add(uint32(newDocNum))
}
}
}
skipped := true
// make the newly merged segments part of the newSnapshot being constructed
for i, newMergedSegment := range nextMerge.new {
// checking if this newly merged segment is worth keeping based on
// obsoleted doc count since the merge intro started
if newMergedSegment != nil &&
newMergedSegment.Count() > newSegmentDeleted[i].GetCardinality() {
stats := newFieldStats()
if fsr, ok := newMergedSegment.(segment.FieldStatsReporter); ok {
fsr.UpdateFieldStats(stats)
}
// put the merged segment at the end of newSnapshot
newSnapshot.segment = append(newSnapshot.segment, &SegmentSnapshot{
id: nextMerge.id[i],
segment: newMergedSegment, // take ownership for nextMerge.new's ref-count
deleted: newSegmentDeleted[i],
stats: stats,
cachedDocs: &cachedDocs{cache: nil},
cachedMeta: &cachedMeta{meta: nil},
creator: "introduceMerge",
mmaped: nextMerge.mmaped,
})
newSnapshot.offsets = append(newSnapshot.offsets, running)
running += newMergedSegment.Count()
switch newMergedSegment.(type) {
case segment.PersistedSegment:
fileSegments++
default:
docsToPersistCount += newMergedSegment.Count() - newSegmentDeleted[i].GetCardinality()
memSegments++
}
skipped = false
}
}
if skipped {
atomic.AddUint64(&s.stats.TotFileMergeIntroductionsObsoleted, 1)
} else {
atomic.AddUint64(&s.stats.TotIntroducedSegmentsMerge, uint64(len(nextMerge.new)))
}
atomic.StoreUint64(&s.stats.TotItemsToPersist, docsToPersistCount)
atomic.StoreUint64(&s.stats.TotMemorySegmentsAtRoot, memSegments)
atomic.StoreUint64(&s.stats.TotFileSegmentsAtRoot, fileSegments)
newSnapshot.AddRef() // 1 ref for the nextMerge.notify response
newSnapshot.updateSize()
s.rootLock.Lock()
// swap in new index snapshot
newSnapshot.epoch = s.nextSnapshotEpoch
s.nextSnapshotEpoch++
rootPrev := s.root
s.root = newSnapshot
atomic.StoreUint64(&s.stats.CurRootEpoch, s.root.epoch)
// release lock
s.rootLock.Unlock()
if rootPrev != nil {
_ = rootPrev.DecRef()
}
// update the removal eligibility for those segment files
// that are not a part of the latest root.
for _, filename := range droppedSegmentFiles {
s.unmarkIneligibleForRemoval(filename)
}
// notify requester that we incorporated this
nextMerge.notifyCh <- &mergeTaskIntroStatus{
indexSnapshot: newSnapshot,
skipped: skipped}
close(nextMerge.notifyCh)
}
func isMemorySegment(s *SegmentSnapshot) bool {
switch s.segment.(type) {
case segment.PersistedSegment:
return false
default:
return true
}
}
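// Editorial note (not part of the upstream source): the introduce*
// functions above all follow the same copy-on-write discipline when
// swapping in a new root snapshot. A minimal distillation of that
// pattern, assuming newSnapshot has been fully constructed:
//
// s.rootLock.Lock()
// newSnapshot.epoch = s.nextSnapshotEpoch // epochs increase monotonically
// s.nextSnapshotEpoch++
// rootPrev := s.root
// s.root = newSnapshot // readers taking rootLock now observe the new root
// atomic.StoreUint64(&s.stats.CurRootEpoch, s.root.epoch)
// s.rootLock.Unlock()
// if rootPrev != nil {
//     _ = rootPrev.DecRef() // the old snapshot frees its segments once the
// }                         // last outstanding reader releases its reference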


@@ -0,0 +1,637 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"context"
"fmt"
"os"
"strings"
"sync"
"sync/atomic"
"time"
"github.com/RoaringBitmap/roaring/v2"
"github.com/blevesearch/bleve/v2/index/scorch/mergeplan"
"github.com/blevesearch/bleve/v2/util"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
func (s *Scorch) mergerLoop() {
defer func() {
if r := recover(); r != nil {
s.fireAsyncError(&AsyncPanicError{
Source: "merger",
Path: s.path,
})
}
s.asyncTasks.Done()
}()
var lastEpochMergePlanned uint64
var ctrlMsg *mergerCtrl
mergePlannerOptions, err := s.parseMergePlannerOptions()
if err != nil {
s.fireAsyncError(fmt.Errorf("mergePlannerOption json parsing err: %v", err))
return
}
ctrlMsgDflt := &mergerCtrl{ctx: context.Background(),
options: mergePlannerOptions,
doneCh: nil}
OUTER:
for {
atomic.AddUint64(&s.stats.TotFileMergeLoopBeg, 1)
select {
case <-s.closeCh:
break OUTER
default:
// check to see if there is a new snapshot to persist
s.rootLock.Lock()
ourSnapshot := s.root
ourSnapshot.AddRef()
atomic.StoreUint64(&s.iStats.mergeSnapshotSize, uint64(ourSnapshot.Size()))
atomic.StoreUint64(&s.iStats.mergeEpoch, ourSnapshot.epoch)
s.rootLock.Unlock()
if ctrlMsg == nil && ourSnapshot.epoch != lastEpochMergePlanned {
ctrlMsg = ctrlMsgDflt
}
if ctrlMsg != nil {
continueMerge := s.fireEvent(EventKindPreMergeCheck, 0)
// The default, if there's no handler, is to continue the merge.
if !continueMerge {
// If it's decided that this merge can't take place now,
// begin the merge process all over again.
// Retry instead of blocking/waiting here, since a long wait
// can result in more segments being introduced, i.e. s.root
// will be updated.
// decrement the ref count since it's no longer needed in this
// iteration
_ = ourSnapshot.DecRef()
continue OUTER
}
startTime := time.Now()
// let's get started
err := s.planMergeAtSnapshot(ctrlMsg.ctx, ctrlMsg.options,
ourSnapshot)
if err != nil {
atomic.StoreUint64(&s.iStats.mergeEpoch, 0)
if err == segment.ErrClosed {
// index has been closed
_ = ourSnapshot.DecRef()
// continue the workloop on a user triggered cancel
if ctrlMsg.doneCh != nil {
close(ctrlMsg.doneCh)
ctrlMsg = nil
continue OUTER
}
// exit the workloop on index closure
ctrlMsg = nil
break OUTER
}
s.fireAsyncError(fmt.Errorf("merging err: %v", err))
_ = ourSnapshot.DecRef()
atomic.AddUint64(&s.stats.TotFileMergeLoopErr, 1)
continue OUTER
}
if ctrlMsg.doneCh != nil {
close(ctrlMsg.doneCh)
}
ctrlMsg = nil
lastEpochMergePlanned = ourSnapshot.epoch
atomic.StoreUint64(&s.stats.LastMergedEpoch, ourSnapshot.epoch)
s.fireEvent(EventKindMergerProgress, time.Since(startTime))
}
_ = ourSnapshot.DecRef()
// tell the persister we're waiting for changes
// first make an epochWatcher chan
ew := &epochWatcher{
epoch: lastEpochMergePlanned,
notifyCh: make(notificationChan, 1),
}
// give it to the persister
select {
case <-s.closeCh:
break OUTER
case s.persisterNotifier <- ew:
case ctrlMsg = <-s.forceMergeRequestCh:
continue OUTER
}
// now wait for persister (but also detect close)
select {
case <-s.closeCh:
break OUTER
case <-ew.notifyCh:
case ctrlMsg = <-s.forceMergeRequestCh:
}
}
atomic.AddUint64(&s.stats.TotFileMergeLoopEnd, 1)
}
}
type mergerCtrl struct {
ctx context.Context
options *mergeplan.MergePlanOptions
doneCh chan struct{}
}
// ForceMerge helps users trigger a merge operation on
// an online scorch index.
func (s *Scorch) ForceMerge(ctx context.Context,
mo *mergeplan.MergePlanOptions) error {
// check whether force merge is already under processing
s.rootLock.Lock()
if s.stats.TotFileMergeForceOpsStarted >
s.stats.TotFileMergeForceOpsCompleted {
s.rootLock.Unlock()
return fmt.Errorf("force merge already in progress")
}
s.stats.TotFileMergeForceOpsStarted++
s.rootLock.Unlock()
if mo != nil {
err := mergeplan.ValidateMergePlannerOptions(mo)
if err != nil {
return err
}
} else {
// assume the default single segment merge policy
mo = &mergeplan.SingleSegmentMergePlanOptions
}
msg := &mergerCtrl{options: mo,
doneCh: make(chan struct{}),
ctx: ctx,
}
// request the merger perform a force merge
select {
case s.forceMergeRequestCh <- msg:
case <-s.closeCh:
return nil
}
// wait for the force merge operation completion
select {
case <-msg.doneCh:
atomic.AddUint64(&s.stats.TotFileMergeForceOpsCompleted, 1)
case <-s.closeCh:
}
return nil
}
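// Editorial sketch (not part of the upstream source): a caller-side view
// of ForceMerge. Passing nil options selects SingleSegmentMergePlanOptions,
// compacting the index down to a single file segment. Assumes s is an
// online *Scorch; the helper name is hypothetical.
func forceMergeExample(ctx context.Context, s *Scorch) error {
// nil MergePlanOptions => mergeplan.SingleSegmentMergePlanOptions
if err := s.ForceMerge(ctx, nil); err != nil {
return fmt.Errorf("force merge failed: %w", err)
}
return nil
}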
func (s *Scorch) parseMergePlannerOptions() (*mergeplan.MergePlanOptions,
error) {
mergePlannerOptions := mergeplan.DefaultMergePlanOptions
po, err := s.parsePersisterOptions()
if err != nil {
return nil, err
}
// by default use the MaxSizeInMemoryMergePerWorker from the persister
// options as the FloorSegmentFileSize for the merge planner, which would
// be the first tier size in the planning. If the value is 0, then the
// file size is not used in the planning.
mergePlannerOptions.FloorSegmentFileSize = int64(po.MaxSizeInMemoryMergePerWorker)
if v, ok := s.config["scorchMergePlanOptions"]; ok {
b, err := util.MarshalJSON(v)
if err != nil {
return &mergePlannerOptions, err
}
err = util.UnmarshalJSON(b, &mergePlannerOptions)
if err != nil {
return &mergePlannerOptions, err
}
err = mergeplan.ValidateMergePlannerOptions(&mergePlannerOptions)
if err != nil {
return nil, err
}
}
return &mergePlannerOptions, nil
}
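// Editorial sketch (not part of the upstream source): merge planner
// options arrive through the index config under "scorchMergePlanOptions"
// and are unmarshalled over the defaults by parseMergePlannerOptions
// above. A hedged example of the shape such a config might take (keys
// match the exported mergeplan.MergePlanOptions field names;
// encoding/json matches them case-insensitively):
//
// config := map[string]interface{}{
//     "scorchMergePlanOptions": map[string]interface{}{
//         "MaxSegmentsPerTier":   10,
//         "MaxSegmentSize":       5000000,
//         "TierGrowth":           10.0,
//         "SegmentsPerMergeTask": 10,
//         "FloorSegmentSize":     2000,
//     },
// }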
type closeChWrapper struct {
ch1 chan struct{}
ctx context.Context
closeCh chan struct{}
cancelCh chan struct{}
}
func newCloseChWrapper(ch1 chan struct{},
ctx context.Context) *closeChWrapper {
return &closeChWrapper{
ch1: ch1,
ctx: ctx,
closeCh: make(chan struct{}),
cancelCh: make(chan struct{}),
}
}
func (w *closeChWrapper) close() {
close(w.closeCh)
}
func (w *closeChWrapper) listen() {
select {
case <-w.ch1:
close(w.cancelCh)
case <-w.ctx.Done():
close(w.cancelCh)
case <-w.closeCh:
}
}
func (s *Scorch) planMergeAtSnapshot(ctx context.Context,
options *mergeplan.MergePlanOptions, ourSnapshot *IndexSnapshot) error {
// build list of persisted segments in this snapshot
var onlyPersistedSnapshots []mergeplan.Segment
for _, segmentSnapshot := range ourSnapshot.segment {
if _, ok := segmentSnapshot.segment.(segment.PersistedSegment); ok {
onlyPersistedSnapshots = append(onlyPersistedSnapshots, segmentSnapshot)
}
}
atomic.AddUint64(&s.stats.TotFileMergePlan, 1)
// give this list to the planner
resultMergePlan, err := mergeplan.Plan(onlyPersistedSnapshots, options)
if err != nil {
atomic.AddUint64(&s.stats.TotFileMergePlanErr, 1)
return fmt.Errorf("merge planning err: %v", err)
}
if resultMergePlan == nil {
// nothing to do
atomic.AddUint64(&s.stats.TotFileMergePlanNone, 1)
return nil
}
atomic.AddUint64(&s.stats.TotFileMergePlanOk, 1)
atomic.AddUint64(&s.stats.TotFileMergePlanTasks, uint64(len(resultMergePlan.Tasks)))
// process tasks in serial for now
var filenames []string
cw := newCloseChWrapper(s.closeCh, ctx)
defer cw.close()
go cw.listen()
for _, task := range resultMergePlan.Tasks {
if len(task.Segments) == 0 {
atomic.AddUint64(&s.stats.TotFileMergePlanTasksSegmentsEmpty, 1)
continue
}
atomic.AddUint64(&s.stats.TotFileMergePlanTasksSegments, uint64(len(task.Segments)))
oldMap := make(map[uint64]*SegmentSnapshot, len(task.Segments))
newSegmentID := atomic.AddUint64(&s.nextSegmentID, 1)
segmentsToMerge := make([]segment.Segment, 0, len(task.Segments))
docsToDrop := make([]*roaring.Bitmap, 0, len(task.Segments))
mergedSegHistory := make(map[uint64]*mergedSegmentHistory, len(task.Segments))
for _, planSegment := range task.Segments {
if segSnapshot, ok := planSegment.(*SegmentSnapshot); ok {
oldMap[segSnapshot.id] = segSnapshot
mergedSegHistory[segSnapshot.id] = &mergedSegmentHistory{
workerID: 0,
oldSegment: segSnapshot,
}
if persistedSeg, ok := segSnapshot.segment.(segment.PersistedSegment); ok {
if segSnapshot.LiveSize() == 0 {
atomic.AddUint64(&s.stats.TotFileMergeSegmentsEmpty, 1)
oldMap[segSnapshot.id] = nil
delete(mergedSegHistory, segSnapshot.id)
} else {
segmentsToMerge = append(segmentsToMerge, segSnapshot.segment)
docsToDrop = append(docsToDrop, segSnapshot.deleted)
}
// track the files getting merged for unsetting the
// removal ineligibility. This helps unflip files
// even with fast-merger, slow-persister workflows.
path := persistedSeg.Path()
filenames = append(filenames,
strings.TrimPrefix(path, s.path+string(os.PathSeparator)))
}
}
}
var seg segment.Segment
var filename string
if len(segmentsToMerge) > 0 {
filename = zapFileName(newSegmentID)
s.markIneligibleForRemoval(filename)
path := s.path + string(os.PathSeparator) + filename
fileMergeZapStartTime := time.Now()
atomic.AddUint64(&s.stats.TotFileMergeZapBeg, 1)
prevBytesReadTotal := cumulateBytesRead(segmentsToMerge)
newDocNums, _, err := s.segPlugin.Merge(segmentsToMerge, docsToDrop, path,
cw.cancelCh, s)
atomic.AddUint64(&s.stats.TotFileMergeZapEnd, 1)
fileMergeZapTime := uint64(time.Since(fileMergeZapStartTime))
atomic.AddUint64(&s.stats.TotFileMergeZapTime, fileMergeZapTime)
if atomic.LoadUint64(&s.stats.MaxFileMergeZapTime) < fileMergeZapTime {
atomic.StoreUint64(&s.stats.MaxFileMergeZapTime, fileMergeZapTime)
}
if err != nil {
s.unmarkIneligibleForRemoval(filename)
atomic.AddUint64(&s.stats.TotFileMergePlanTasksErr, 1)
if err == segment.ErrClosed {
return err
}
return fmt.Errorf("merging failed: %v", err)
}
seg, err = s.segPlugin.Open(path)
if err != nil {
s.unmarkIneligibleForRemoval(filename)
atomic.AddUint64(&s.stats.TotFileMergePlanTasksErr, 1)
return err
}
totalBytesRead := seg.BytesRead() + prevBytesReadTotal
seg.ResetBytesRead(totalBytesRead)
for i, segNewDocNums := range newDocNums {
if mergedSegHistory[task.Segments[i].Id()] != nil {
mergedSegHistory[task.Segments[i].Id()].oldNewDocIDs = segNewDocNums
}
}
atomic.AddUint64(&s.stats.TotFileMergeSegments, uint64(len(segmentsToMerge)))
}
sm := &segmentMerge{
id: []uint64{newSegmentID},
mergedSegHistory: mergedSegHistory,
new: []segment.Segment{seg},
notifyCh: make(chan *mergeTaskIntroStatus),
mmaped: 1,
}
if seg != nil {
// guard: seg remains nil when the task contained only empty
// segments and no actual merge was performed
sm.newCount = seg.Count()
}
s.fireEvent(EventKindMergeTaskIntroductionStart, 0)
// give it to the introducer
select {
case <-s.closeCh:
if seg != nil {
_ = seg.Close()
}
return segment.ErrClosed
case s.merges <- sm:
atomic.AddUint64(&s.stats.TotFileMergeIntroductions, 1)
}
introStartTime := time.Now()
// it is safe to blockingly wait for the merge introduction
// here as the introducer is bound to handle the notify channel.
introStatus := <-sm.notifyCh
introTime := uint64(time.Since(introStartTime))
atomic.AddUint64(&s.stats.TotFileMergeZapIntroductionTime, introTime)
if atomic.LoadUint64(&s.stats.MaxFileMergeZapIntroductionTime) < introTime {
atomic.StoreUint64(&s.stats.MaxFileMergeZapIntroductionTime, introTime)
}
atomic.AddUint64(&s.stats.TotFileMergeIntroductionsDone, 1)
if introStatus != nil && introStatus.indexSnapshot != nil {
_ = introStatus.indexSnapshot.DecRef()
if introStatus.skipped {
// close the segment on skipping introduction.
s.unmarkIneligibleForRemoval(filename)
if seg != nil {
_ = seg.Close()
}
}
}
atomic.AddUint64(&s.stats.TotFileMergePlanTasksDone, 1)
s.fireEvent(EventKindMergeTaskIntroduction, 0)
}
// once all the newly merged segment introductions are done,
// it's safe to unflip the removal ineligibility for the replaced
// older segments
for _, f := range filenames {
s.unmarkIneligibleForRemoval(f)
}
return nil
}
type mergeTaskIntroStatus struct {
indexSnapshot *IndexSnapshot
skipped bool
}
// this is important when it comes to introducing multiple merged segments in a
// single introducer channel push. That way there is a check to ensure that the
// file count doesn't explode during the index's lifetime.
type mergedSegmentHistory struct {
workerID uint64
oldNewDocIDs []uint64
oldSegment *SegmentSnapshot
}
type segmentMerge struct {
id []uint64
new []segment.Segment
mergedSegHistory map[uint64]*mergedSegmentHistory
notifyCh chan *mergeTaskIntroStatus
mmaped uint32
newCount uint64
}
func cumulateBytesRead(sbs []segment.Segment) uint64 {
var rv uint64
for _, seg := range sbs {
rv += seg.BytesRead()
}
return rv
}
func closeNewMergedSegments(segs []segment.Segment) error {
for _, seg := range segs {
if seg != nil {
_ = seg.DecRef()
}
}
return nil
}
// mergeAndPersistInMemorySegments takes an IndexSnapshot and a list of in-memory segments,
// which are merged and persisted to disk concurrently. These are then introduced as
// the new root snapshot in one-shot.
func (s *Scorch) mergeAndPersistInMemorySegments(snapshot *IndexSnapshot,
flushableObjs []*flushable) (*IndexSnapshot, []uint64, error) {
atomic.AddUint64(&s.stats.TotMemMergeBeg, 1)
memMergeZapStartTime := time.Now()
atomic.AddUint64(&s.stats.TotMemMergeZapBeg, 1)
var wg sync.WaitGroup
// we're tracking the merged segments and their doc numbers per worker
// to be able to introduce them all at once, so the first dimension of
// the slices here corresponds to the workerID
newDocIDsSet := make([][][]uint64, len(flushableObjs))
newMergedSegments := make([]segment.Segment, len(flushableObjs))
newMergedSegmentIDs := make([]uint64, len(flushableObjs))
numFlushes := len(flushableObjs)
var numSegments, newMergedCount uint64
var em sync.Mutex
var errs []error
// deploy the workers to merge and flush the batches of segments concurrently
// and create a new file segment
for i := 0; i < numFlushes; i++ {
wg.Add(1)
go func(segsBatch []segment.Segment, dropsBatch []*roaring.Bitmap, id int) {
defer wg.Done()
newSegmentID := atomic.AddUint64(&s.nextSegmentID, 1)
filename := zapFileName(newSegmentID)
path := s.path + string(os.PathSeparator) + filename
// the newly merged segment is already flushed out to disk, just needs
// to be opened using mmap.
newDocIDs, _, err :=
s.segPlugin.Merge(segsBatch, dropsBatch, path, s.closeCh, s)
if err != nil {
em.Lock()
errs = append(errs, err)
em.Unlock()
atomic.AddUint64(&s.stats.TotMemMergeErr, 1)
return
}
// to prevent accidental cleanup of this newly created file, mark it
// as ineligible for removal. This will be flipped back when the bolt
// is updated - which is valid, since the snapshot updated in bolt is
// cleaned up only if it's zero ref'd (MB-66163 for more details)
s.markIneligibleForRemoval(filename)
newMergedSegmentIDs[id] = newSegmentID
newDocIDsSet[id] = newDocIDs
newMergedSegments[id], err = s.segPlugin.Open(path)
if err != nil {
em.Lock()
errs = append(errs, err)
em.Unlock()
atomic.AddUint64(&s.stats.TotMemMergeErr, 1)
return
}
atomic.AddUint64(&newMergedCount, newMergedSegments[id].Count())
atomic.AddUint64(&numSegments, uint64(len(segsBatch)))
}(flushableObjs[i].segments, flushableObjs[i].drops, i)
}
wg.Wait()
if errs != nil {
// close the new merged segments
_ = closeNewMergedSegments(newMergedSegments)
var errf error
for _, err := range errs {
if err == segment.ErrClosed {
// the index snapshot was closed, which will be handled gracefully
// by retrying the whole merge+flush operation in a later iteration,
// so it's safe to exit early with the same error.
return nil, nil, err
}
errf = fmt.Errorf("%w; %v", errf, err)
}
return nil, nil, errf
}
atomic.AddUint64(&s.stats.TotMemMergeZapEnd, 1)
memMergeZapTime := uint64(time.Since(memMergeZapStartTime))
atomic.AddUint64(&s.stats.TotMemMergeZapTime, memMergeZapTime)
if atomic.LoadUint64(&s.stats.MaxMemMergeZapTime) < memMergeZapTime {
atomic.StoreUint64(&s.stats.MaxMemMergeZapTime, memMergeZapTime)
}
// update the segmentMerge task with the newly merged + flushed segments which
// are to be introduced atomically.
sm := &segmentMerge{
id: newMergedSegmentIDs,
new: newMergedSegments,
mergedSegHistory: make(map[uint64]*mergedSegmentHistory, numSegments),
notifyCh: make(chan *mergeTaskIntroStatus),
newCount: newMergedCount,
}
// create a history map which maps each old in-memory segment to the
// specific persister worker (and hence the specific file segment it's
// going to be part of) which flushed it out. This map will be used on
// the introducer side to out-ref the in-memory segments and also track
// the new tombstones if present.
for i, flushable := range flushableObjs {
for j, idx := range flushable.sbIdxs {
ss := snapshot.segment[idx]
// oldSegmentSnapshot.id -> {workerID, oldSegmentSnapshot, docIDs}
sm.mergedSegHistory[ss.id] = &mergedSegmentHistory{
workerID: uint64(i),
oldNewDocIDs: newDocIDsSet[i][j],
oldSegment: ss,
}
}
}
select { // send to introducer
case <-s.closeCh:
_ = closeNewMergedSegments(newMergedSegments)
return nil, nil, segment.ErrClosed
case s.merges <- sm:
}
// blockingly wait for the introduction to complete
var newSnapshot *IndexSnapshot
introStatus := <-sm.notifyCh
if introStatus != nil && introStatus.indexSnapshot != nil {
newSnapshot = introStatus.indexSnapshot
atomic.AddUint64(&s.stats.TotMemMergeSegments, uint64(numSegments))
atomic.AddUint64(&s.stats.TotMemMergeDone, 1)
if introStatus.skipped {
// close the segment on skipping introduction.
_ = newSnapshot.DecRef()
_ = closeNewMergedSegments(newMergedSegments)
newSnapshot = nil
}
}
return newSnapshot, newMergedSegmentIDs, nil
}
func (s *Scorch) ReportBytesWritten(bytesWritten uint64) {
atomic.AddUint64(&s.stats.TotFileMergeWrittenBytes, bytesWritten)
}


@@ -0,0 +1,456 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package mergeplan provides a segment merge planning approach that's
// inspired by Lucene's TieredMergePolicy.java and descriptions like
// http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
package mergeplan
import (
"errors"
"fmt"
"math"
"sort"
"strings"
)
// A Segment represents the information that the planner needs to
// calculate segment merging.
type Segment interface {
// Unique id of the segment -- used for sorting.
Id() uint64
// Full segment size (the size before any logical deletions).
FullSize() int64
// Size of the live data of the segment; i.e., FullSize() minus
// any logical deletions.
LiveSize() int64
// HasVector reports whether the segment contains vector data, which
// makes it subject to the file-size based merge constraints.
HasVector() bool
// Size of the persisted segment file.
FileSize() int64
}
// Plan() will functionally compute a merge plan. A segment will be
// assigned to at most a single MergeTask in the output MergePlan. A
// segment not assigned to any MergeTask means the segment should
// remain unmerged.
func Plan(segments []Segment, o *MergePlanOptions) (*MergePlan, error) {
return plan(segments, o)
}
// A MergePlan is the result of the Plan() API.
//
// The planner doesn't know how or whether these tasks are executed --
// that's up to a separate merge execution system, which might execute
// these tasks concurrently or not, and which might execute all the
// tasks or not.
type MergePlan struct {
Tasks []*MergeTask
}
// A MergeTask represents several segments that should be merged
// together into a single segment.
type MergeTask struct {
Segments []Segment
}
// The MergePlanOptions is designed to be reusable between planning calls.
type MergePlanOptions struct {
// Max # segments per logarithmic tier, or max width of any
// logarithmic “step”. Smaller values mean more merging but fewer
// segments. Should be >= SegmentsPerMergeTask, else you'll have
// too much merging.
MaxSegmentsPerTier int
// Max size of any segment produced after merging. Actual
// merging, however, may produce segment sizes different from the
// planner's predicted sizes.
MaxSegmentSize int64
// Max size (in bytes) of the persisted segment file that contains the
// vectors. This is used to prevent merging of segments that
// contain vectors that are too large.
MaxSegmentFileSize int64
// The growth factor for each tier in a staircase of idealized
// segments computed by CalcBudget().
TierGrowth float64
// The number of segments in any resulting MergeTask. e.g.,
// len(result.Tasks[ * ].Segments) == SegmentsPerMergeTask.
SegmentsPerMergeTask int
// Small segments are rounded up to this size, i.e., treated as
// equal (floor) size for consideration. This is to prevent lots
// of tiny segments from resulting in a long tail in the index.
FloorSegmentSize int64
// Small segments' file sizes are rounded up to this size to prevent
// lots of tiny segments from causing a long tail in the index.
FloorSegmentFileSize int64
// Controls how aggressively merges that reclaim more deletions
// are favored. Higher values will more aggressively target
// merges that reclaim deletions, but be careful not to go so high
// that way too much merging takes place; a value of 3.0 is
// probably nearly too high. A value of 0.0 means deletions don't
// impact merge selection.
ReclaimDeletesWeight float64
// Optional, defaults to mergeplan.CalcBudget().
CalcBudget func(totalSize int64, firstTierSize int64,
o *MergePlanOptions) (budgetNumSegments int)
// Optional, defaults to mergeplan.ScoreSegments().
ScoreSegments func(segments []Segment, o *MergePlanOptions) float64
// Optional.
Logger func(string)
}
// Returns the higher of the input or FloorSegmentSize.
func (o *MergePlanOptions) RaiseToFloorSegmentSize(s int64) int64 {
if s > o.FloorSegmentSize {
return s
}
return o.FloorSegmentSize
}
func (o *MergePlanOptions) RaiseToFloorSegmentFileSize(s int64) int64 {
if s > o.FloorSegmentFileSize {
return s
}
return o.FloorSegmentFileSize
}
// MaxSegmentSizeLimit represents the maximum size of a segment;
// this limit comes from the 1-hit optimisation / max encoding limit of uint31.
const MaxSegmentSizeLimit = 1<<31 - 1
// ErrMaxSegmentSizeTooLarge is returned when the size of the segment
// exceeds the MaxSegmentSizeLimit
var ErrMaxSegmentSizeTooLarge = errors.New("MaxSegmentSize exceeds the size limit")
// DefaultMergePlanOptions suggests the default options.
var DefaultMergePlanOptions = MergePlanOptions{
MaxSegmentsPerTier: 10,
MaxSegmentSize: 5000000,
MaxSegmentFileSize: 4000000000, // 4GB
TierGrowth: 10.0,
SegmentsPerMergeTask: 10,
FloorSegmentSize: 2000,
ReclaimDeletesWeight: 2.0,
}
// SingleSegmentMergePlanOptions helps in creating a
// single segment index.
var SingleSegmentMergePlanOptions = MergePlanOptions{
MaxSegmentsPerTier: 1,
MaxSegmentSize: 1 << 30,
MaxSegmentFileSize: 1 << 40,
TierGrowth: 1.0,
SegmentsPerMergeTask: 10,
FloorSegmentSize: 1 << 30,
ReclaimDeletesWeight: 2.0,
FloorSegmentFileSize: 1 << 40,
}
// -------------------------------------------
func plan(segmentsIn []Segment, o *MergePlanOptions) (*MergePlan, error) {
if len(segmentsIn) <= 1 {
return nil, nil
}
if o == nil {
o = &DefaultMergePlanOptions
}
segments := append([]Segment(nil), segmentsIn...) // Copy.
sort.Sort(byLiveSizeDescending(segments))
var minLiveSize int64 = math.MaxInt64
var eligibles []Segment
var eligiblesLiveSize int64
var eligiblesFileSize int64
var minFileSize int64 = math.MaxInt64
for _, segment := range segments {
if minLiveSize > segment.LiveSize() {
minLiveSize = segment.LiveSize()
}
if minFileSize > segment.FileSize() {
minFileSize = segment.FileSize()
}
isEligible := segment.LiveSize() < o.MaxSegmentSize/2
// An eligible segment (based on #documents) may be too large
// and thus need a stricter check based on the file size.
// This is particularly important for segments that contain
// vectors.
if isEligible && segment.HasVector() && o.MaxSegmentFileSize > 0 {
isEligible = segment.FileSize() < o.MaxSegmentFileSize/2
}
// Only small-enough segments are eligible.
if isEligible {
eligibles = append(eligibles, segment)
eligiblesLiveSize += segment.LiveSize()
eligiblesFileSize += segment.FileSize()
}
}
calcBudget := o.CalcBudget
if calcBudget == nil {
calcBudget = CalcBudget
}
var budgetNumSegments int
if o.FloorSegmentFileSize > 0 {
minFileSize = o.RaiseToFloorSegmentFileSize(minFileSize)
budgetNumSegments = calcBudget(eligiblesFileSize, minFileSize, o)
} else {
minLiveSize = o.RaiseToFloorSegmentSize(minLiveSize)
budgetNumSegments = calcBudget(eligiblesLiveSize, minLiveSize, o)
}
scoreSegments := o.ScoreSegments
if scoreSegments == nil {
scoreSegments = ScoreSegments
}
rv := &MergePlan{}
var empties []Segment
for _, eligible := range eligibles {
if eligible.LiveSize() <= 0 {
empties = append(empties, eligible)
}
}
if len(empties) > 0 {
rv.Tasks = append(rv.Tasks, &MergeTask{Segments: empties})
eligibles = removeSegments(eligibles, empties)
}
// While we're over budget, keep looping, which might produce
// another MergeTask.
for len(eligibles) > 0 && (len(eligibles)+len(rv.Tasks)) > budgetNumSegments {
// Track a current best roster as we examine and score
// potential rosters of merges.
var bestRoster []Segment
var bestRosterScore float64 // Lower score is better.
for startIdx := 0; startIdx < len(eligibles); startIdx++ {
var roster []Segment
var rosterLiveSize int64
var rosterFileSize int64 // useful for segments with vectors
for idx := startIdx; idx < len(eligibles) && len(roster) < o.SegmentsPerMergeTask; idx++ {
eligible := eligibles[idx]
if rosterLiveSize+eligible.LiveSize() >= o.MaxSegmentSize {
continue
}
if eligible.HasVector() {
efs := eligible.FileSize()
if rosterFileSize+efs >= o.MaxSegmentFileSize {
continue
}
rosterFileSize += efs
}
roster = append(roster, eligible)
rosterLiveSize += eligible.LiveSize()
}
if len(roster) > 0 {
rosterScore := scoreSegments(roster, o)
if len(bestRoster) == 0 || rosterScore < bestRosterScore {
bestRoster = roster
bestRosterScore = rosterScore
}
}
}
if len(bestRoster) == 0 {
return rv, nil
}
// create tasks with valid merges - i.e. there should be at least 2 non-empty segments
if len(bestRoster) > 1 {
rv.Tasks = append(rv.Tasks, &MergeTask{Segments: bestRoster})
}
eligibles = removeSegments(eligibles, bestRoster)
}
return rv, nil
}
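// Editorial sketch (not part of the upstream source): exercising Plan()
// with a toy Segment implementation. The stub type below is hypothetical
// and for illustration only; scorch's real planner input is its
// *SegmentSnapshot type.
type stubSegment struct {
id uint64
full, live, file int64
vector bool
}

func (s stubSegment) Id() uint64 { return s.id }
func (s stubSegment) FullSize() int64 { return s.full }
func (s stubSegment) LiveSize() int64 { return s.live }
func (s stubSegment) HasVector() bool { return s.vector }
func (s stubSegment) FileSize() int64 { return s.file }

func planExample() {
segs := []Segment{
stubSegment{id: 1, full: 4000, live: 4000, file: 9000},
stubSegment{id: 2, full: 4000, live: 2000, file: 9000},
stubSegment{id: 3, full: 100, live: 100, file: 300},
}
p, err := Plan(segs, nil) // nil => DefaultMergePlanOptions
if err != nil || p == nil {
return // planning failed, or nothing worth merging
}
fmt.Println(ToBarChart("plan", 40, segs, p))
}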
// Compute the number of segments that would be needed to cover the
// totalSize, by climbing up a logarithmically growing staircase of
// segment tiers.
func CalcBudget(totalSize int64, firstTierSize int64, o *MergePlanOptions) (
budgetNumSegments int) {
tierSize := firstTierSize
if tierSize < 1 {
tierSize = 1
}
maxSegmentsPerTier := o.MaxSegmentsPerTier
if maxSegmentsPerTier < 1 {
maxSegmentsPerTier = 1
}
tierGrowth := o.TierGrowth
if tierGrowth < 1.0 {
tierGrowth = 1.0
}
for totalSize > 0 {
segmentsInTier := float64(totalSize) / float64(tierSize)
if segmentsInTier < float64(maxSegmentsPerTier) {
budgetNumSegments += int(math.Ceil(segmentsInTier))
break
}
budgetNumSegments += maxSegmentsPerTier
totalSize -= int64(maxSegmentsPerTier) * tierSize
tierSize = int64(float64(tierSize) * tierGrowth)
}
return budgetNumSegments
}
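// Editorial worked example (not part of the upstream source): with the
// defaults (MaxSegmentsPerTier = 10, TierGrowth = 10) and a first tier
// size of 2000, a total eligible size of 50,000 budgets as:
//
// tier 1 (size 2000):  50000/2000  = 25  >= 10 -> +10 segments, 30000 left
// tier 2 (size 20000): 30000/20000 = 1.5 < 10  -> +ceil(1.5) = 2 segments
//
// so CalcBudget(50000, 2000, &DefaultMergePlanOptions) == 12.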
// Of note, removeSegments() keeps the ordering of the results stable.
func removeSegments(segments []Segment, toRemove []Segment) []Segment {
rv := make([]Segment, 0, len(segments)-len(toRemove))
OUTER:
for _, segment := range segments {
for _, r := range toRemove {
if segment == r {
continue OUTER
}
}
rv = append(rv, segment)
}
return rv
}
// Smaller result score is better.
func ScoreSegments(segments []Segment, o *MergePlanOptions) float64 {
var totBeforeSize int64
var totAfterSize int64
var totAfterSizeFloored int64
for _, segment := range segments {
totBeforeSize += segment.FullSize()
totAfterSize += segment.LiveSize()
totAfterSizeFloored += o.RaiseToFloorSegmentSize(segment.LiveSize())
}
if totBeforeSize <= 0 || totAfterSize <= 0 || totAfterSizeFloored <= 0 {
return 0
}
// Roughly guess the "balance" of the segments -- whether the
// segments are about the same size.
balance :=
float64(o.RaiseToFloorSegmentSize(segments[0].LiveSize())) /
float64(totAfterSizeFloored)
// Gently favor smaller merges over bigger ones. We don't want to
// make the exponent too large else we end up with poor merges of
// small segments in order to avoid the large merges.
score := balance * math.Pow(float64(totAfterSize), 0.05)
// Strongly favor merges that reclaim deletes.
nonDelRatio := float64(totAfterSize) / float64(totBeforeSize)
score *= math.Pow(nonDelRatio, o.ReclaimDeletesWeight)
return score
}
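// Editorial worked example (not part of the upstream source): consider
// two candidate rosters of two segments each, with FloorSegmentSize =
// 2000 and ReclaimDeletesWeight = 2.0.
//
// no deletions (full 2000, live 2000 each):
//   balance = 2000/4000 = 0.5
//   score   = 0.5 * 4000^0.05 * (4000/4000)^2.0 ~= 0.76
//
// half deleted (full 2000, live 1000 each; floored live -> 2000):
//   balance = 2000/4000 = 0.5
//   score   = 0.5 * 2000^0.05 * (2000/4000)^2.0 ~= 0.18
//
// The deletes-reclaiming roster scores lower, so it is picked first.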
// ------------------------------------------
// ToBarChart returns an ASCII rendering of the segments and the plan.
// The barMax is the max width of the bars in the bar chart.
func ToBarChart(prefix string, barMax int, segments []Segment, plan *MergePlan) string {
rv := make([]string, 0, len(segments))
var maxFullSize int64
for _, segment := range segments {
if maxFullSize < segment.FullSize() {
maxFullSize = segment.FullSize()
}
}
if maxFullSize < 0 {
maxFullSize = 1
}
for _, segment := range segments {
barFull := int(segment.FullSize())
barLive := int(segment.LiveSize())
if maxFullSize > int64(barMax) {
barFull = int(float64(barMax) * float64(barFull) / float64(maxFullSize))
barLive = int(float64(barMax) * float64(barLive) / float64(maxFullSize))
}
barKind := " "
barChar := "."
if plan != nil {
TASK_LOOP:
for taski, task := range plan.Tasks {
for _, taskSegment := range task.Segments {
if taskSegment == segment {
barKind = "*"
barChar = fmt.Sprintf("%d", taski)
break TASK_LOOP
}
}
}
}
bar :=
strings.Repeat(barChar, barLive)[0:barLive] +
strings.Repeat("x", barFull-barLive)[0:barFull-barLive]
rv = append(rv, fmt.Sprintf("%s %5d: %5d /%5d - %s %s", prefix,
segment.Id(),
segment.LiveSize(),
segment.FullSize(),
barKind, bar))
}
return strings.Join(rv, "\n")
}
// ValidateMergePlannerOptions validates the merge planner options
func ValidateMergePlannerOptions(options *MergePlanOptions) error {
if options.MaxSegmentSize > MaxSegmentSizeLimit {
return ErrMaxSegmentSizeTooLarge
}
return nil
}


@@ -0,0 +1,28 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package mergeplan
type byLiveSizeDescending []Segment
func (a byLiveSizeDescending) Len() int { return len(a) }
func (a byLiveSizeDescending) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a byLiveSizeDescending) Less(i, j int) bool {
if a[i].LiveSize() != a[j].LiveSize() {
return a[i].LiveSize() > a[j].LiveSize()
}
return a[i].Id() < a[j].Id()
}


@@ -0,0 +1,399 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"fmt"
"sync/atomic"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var OptimizeConjunction = true
var OptimizeConjunctionUnadorned = true
var OptimizeDisjunctionUnadorned = true
func (s *IndexSnapshotTermFieldReader) Optimize(kind string,
octx index.OptimizableContext) (index.OptimizableContext, error) {
if OptimizeConjunction && kind == "conjunction" {
return s.optimizeConjunction(octx)
}
if OptimizeConjunctionUnadorned && kind == "conjunction:unadorned" {
return s.optimizeConjunctionUnadorned(octx)
}
if OptimizeDisjunctionUnadorned && kind == "disjunction:unadorned" {
return s.optimizeDisjunctionUnadorned(octx)
}
return nil, nil
}
var OptimizeDisjunctionUnadornedMinChildCardinality = uint64(256)
// ----------------------------------------------------------------
func (s *IndexSnapshotTermFieldReader) optimizeConjunction(
octx index.OptimizableContext) (index.OptimizableContext, error) {
if octx == nil {
octx = &OptimizeTFRConjunction{snapshot: s.snapshot}
}
o, ok := octx.(*OptimizeTFRConjunction)
if !ok {
return octx, nil
}
if o.snapshot != s.snapshot {
return nil, fmt.Errorf("tried to optimize conjunction across different snapshots")
}
o.tfrs = append(o.tfrs, s)
return o, nil
}
type OptimizeTFRConjunction struct {
snapshot *IndexSnapshot
tfrs []*IndexSnapshotTermFieldReader
}
func (o *OptimizeTFRConjunction) Finish() (index.Optimized, error) {
if len(o.tfrs) <= 1 {
return nil, nil
}
for i := range o.snapshot.segment {
itr0, ok := o.tfrs[0].iterators[i].(segment.OptimizablePostingsIterator)
if !ok || itr0.ActualBitmap() == nil {
continue
}
itr1, ok := o.tfrs[1].iterators[i].(segment.OptimizablePostingsIterator)
if !ok || itr1.ActualBitmap() == nil {
continue
}
bm := roaring.And(itr0.ActualBitmap(), itr1.ActualBitmap())
for _, tfr := range o.tfrs[2:] {
itr, ok := tfr.iterators[i].(segment.OptimizablePostingsIterator)
if !ok || itr.ActualBitmap() == nil {
continue
}
bm.And(itr.ActualBitmap())
}
// in this conjunction optimization, the postings iterators
// will all share the same AND'ed together actual bitmap. The
// regular conjunction searcher machinery will still be used,
// but the underlying bitmap will be smaller.
for _, tfr := range o.tfrs {
itr, ok := tfr.iterators[i].(segment.OptimizablePostingsIterator)
if ok && itr.ActualBitmap() != nil {
itr.ReplaceActual(bm)
}
}
}
return nil, nil
}
// ----------------------------------------------------------------
// An "unadorned" conjunction optimization is appropriate when
// additional or subsidiary information like freq-norm's and
// term-vectors are not required, and instead only the internal-id's
// are needed.
func (s *IndexSnapshotTermFieldReader) optimizeConjunctionUnadorned(
octx index.OptimizableContext) (index.OptimizableContext, error) {
if octx == nil {
octx = &OptimizeTFRConjunctionUnadorned{snapshot: s.snapshot}
}
o, ok := octx.(*OptimizeTFRConjunctionUnadorned)
if !ok {
return nil, nil
}
if o.snapshot != s.snapshot {
return nil, fmt.Errorf("tried to optimize unadorned conjunction across different snapshots")
}
o.tfrs = append(o.tfrs, s)
return o, nil
}
type OptimizeTFRConjunctionUnadorned struct {
snapshot *IndexSnapshot
tfrs []*IndexSnapshotTermFieldReader
}
var OptimizeTFRConjunctionUnadornedTerm = []byte("<conjunction:unadorned>")
var OptimizeTFRConjunctionUnadornedField = "*"
// Finish of an unadorned conjunction optimization will compute a
// termFieldReader with an "actual" bitmap that represents the
// constituent bitmaps AND'ed together. This termFieldReader cannot
// provide any freq-norm or termVector associated information.
func (o *OptimizeTFRConjunctionUnadorned) Finish() (rv index.Optimized, err error) {
if len(o.tfrs) <= 1 {
return nil, nil
}
// We use an artificial term and field because the optimized
// termFieldReader can represent multiple terms and fields.
oTFR := o.snapshot.unadornedTermFieldReader(
OptimizeTFRConjunctionUnadornedTerm, OptimizeTFRConjunctionUnadornedField)
var actualBMs []*roaring.Bitmap // Collected from regular posting lists.
OUTER:
for i := range o.snapshot.segment {
actualBMs = actualBMs[:0]
var docNum1HitLast uint64
var docNum1HitLastOk bool
for _, tfr := range o.tfrs {
if _, ok := tfr.iterators[i].(*emptyPostingsIterator); ok {
// An empty postings iterator means the entire AND is empty.
oTFR.iterators[i] = anEmptyPostingsIterator
continue OUTER
}
itr, ok := tfr.iterators[i].(segment.OptimizablePostingsIterator)
if !ok {
// We only optimize postings iterators that support this operation.
return nil, nil
}
// If the postings iterator is "1-hit" optimized, then we
// can perform several optimizations up-front here.
docNum1Hit, ok := itr.DocNum1Hit()
if ok {
if docNum1HitLastOk && docNum1HitLast != docNum1Hit {
// The docNum1Hit doesn't match the previous
// docNum1HitLast, so the entire AND is empty.
oTFR.iterators[i] = anEmptyPostingsIterator
continue OUTER
}
docNum1HitLast = docNum1Hit
docNum1HitLastOk = true
continue
}
if itr.ActualBitmap() == nil {
// An empty actual bitmap means the entire AND is empty.
oTFR.iterators[i] = anEmptyPostingsIterator
continue OUTER
}
// Collect the actual bitmap for more processing later.
actualBMs = append(actualBMs, itr.ActualBitmap())
}
if docNum1HitLastOk {
// We reach here if all the 1-hit optimized posting
// iterators had the same 1-hit docNum, so we can check if
// our collected actual bitmaps also have that docNum.
for _, bm := range actualBMs {
if !bm.Contains(uint32(docNum1HitLast)) {
// The docNum1Hit isn't in one of our actual
// bitmaps, so the entire AND is empty.
oTFR.iterators[i] = anEmptyPostingsIterator
continue OUTER
}
}
// The actual bitmaps and docNum1Hits all contain or have
// the same 1-hit docNum, so that's our AND'ed result.
oTFR.iterators[i] = newUnadornedPostingsIteratorFrom1Hit(docNum1HitLast)
continue OUTER
}
if len(actualBMs) == 0 {
// If we've collected no actual bitmaps at this point,
// then the entire AND is empty.
oTFR.iterators[i] = anEmptyPostingsIterator
continue OUTER
}
if len(actualBMs) == 1 {
// If we've only 1 actual bitmap, then that's our result.
oTFR.iterators[i] = newUnadornedPostingsIteratorFromBitmap(actualBMs[0])
continue OUTER
}
// Else, AND together our collected bitmaps as our result.
bm := roaring.And(actualBMs[0], actualBMs[1])
for _, actualBM := range actualBMs[2:] {
bm.And(actualBM)
}
oTFR.iterators[i] = newUnadornedPostingsIteratorFromBitmap(bm)
}
atomic.AddUint64(&o.snapshot.parent.stats.TotTermSearchersStarted, uint64(1))
return oTFR, nil
}
// ----------------------------------------------------------------
// An "unadorned" disjunction optimization is appropriate when
// additional or subsidiary information like freq-norm's and
// term-vectors are not required, and instead only the internal-id's
// are needed.
func (s *IndexSnapshotTermFieldReader) optimizeDisjunctionUnadorned(
octx index.OptimizableContext) (index.OptimizableContext, error) {
if octx == nil {
octx = &OptimizeTFRDisjunctionUnadorned{
snapshot: s.snapshot,
}
}
o, ok := octx.(*OptimizeTFRDisjunctionUnadorned)
if !ok {
return nil, nil
}
if o.snapshot != s.snapshot {
return nil, fmt.Errorf("tried to optimize unadorned disjunction across different snapshots")
}
o.tfrs = append(o.tfrs, s)
return o, nil
}
type OptimizeTFRDisjunctionUnadorned struct {
snapshot *IndexSnapshot
tfrs []*IndexSnapshotTermFieldReader
}
var OptimizeTFRDisjunctionUnadornedTerm = []byte("<disjunction:unadorned>")
var OptimizeTFRDisjunctionUnadornedField = "*"
// Finish of an unadorned disjunction optimization will compute a
// termFieldReader with an "actual" bitmap that represents the
// constituent bitmaps OR'ed together. This termFieldReader cannot
// provide any freq-norm or termVector associated information.
func (o *OptimizeTFRDisjunctionUnadorned) Finish() (rv index.Optimized, err error) {
if len(o.tfrs) <= 1 {
return nil, nil
}
for i := range o.snapshot.segment {
var cMax uint64
for _, tfr := range o.tfrs {
itr, ok := tfr.iterators[i].(segment.OptimizablePostingsIterator)
if !ok {
return nil, nil
}
if itr.ActualBitmap() != nil {
c := itr.ActualBitmap().GetCardinality()
if cMax < c {
cMax = c
}
}
}
}
// We use an artificial term and field because the optimized
// termFieldReader can represent multiple terms and fields.
oTFR := o.snapshot.unadornedTermFieldReader(
OptimizeTFRDisjunctionUnadornedTerm, OptimizeTFRDisjunctionUnadornedField)
var docNums []uint32 // Collected docNum's from 1-hit posting lists.
var actualBMs []*roaring.Bitmap // Collected from regular posting lists.
for i := range o.snapshot.segment {
docNums = docNums[:0]
actualBMs = actualBMs[:0]
for _, tfr := range o.tfrs {
itr, ok := tfr.iterators[i].(segment.OptimizablePostingsIterator)
if !ok {
return nil, nil
}
docNum, ok := itr.DocNum1Hit()
if ok {
docNums = append(docNums, uint32(docNum))
continue
}
if itr.ActualBitmap() != nil {
actualBMs = append(actualBMs, itr.ActualBitmap())
}
}
var bm *roaring.Bitmap
if len(actualBMs) > 2 {
bm = roaring.HeapOr(actualBMs...)
} else if len(actualBMs) == 2 {
bm = roaring.Or(actualBMs[0], actualBMs[1])
} else if len(actualBMs) == 1 {
bm = actualBMs[0].Clone()
}
if bm == nil {
bm = roaring.New()
}
bm.AddMany(docNums)
oTFR.iterators[i] = newUnadornedPostingsIteratorFromBitmap(bm)
}
atomic.AddUint64(&o.snapshot.parent.stats.TotTermSearchersStarted, uint64(1))
return oTFR, nil
}
// ----------------------------------------------------------------
func (i *IndexSnapshot) unadornedTermFieldReader(
term []byte, field string) *IndexSnapshotTermFieldReader {
// This IndexSnapshotTermFieldReader will not be recycled, more
// conversation here: https://github.com/blevesearch/bleve/pull/1438
return &IndexSnapshotTermFieldReader{
term: term,
field: field,
snapshot: i,
iterators: make([]segment.PostingsIterator, len(i.segment)),
segmentOffset: 0,
includeFreq: false,
includeNorm: false,
includeTermVectors: false,
recycle: false,
// signal downstream that this is a special unadorned termFieldReader
unadorned: true,
}
}


@@ -0,0 +1,207 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package scorch
import (
"context"
"fmt"
"sync"
"sync/atomic"
"github.com/blevesearch/bleve/v2/search"
index "github.com/blevesearch/bleve_index_api"
segment_api "github.com/blevesearch/scorch_segment_api/v2"
)
type OptimizeVR struct {
ctx context.Context
snapshot *IndexSnapshot
totalCost uint64
// maps field to vector readers
vrs map[string][]*IndexSnapshotVectorReader
// if at least one of the vector readers requires filtered kNN.
requiresFiltering bool
}
// This setting _MUST_ only be changed during init and not after.
var BleveMaxKNNConcurrency = 10
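// Editorial sketch (not part of the upstream source): since Finish()
// reads this knob without synchronization, an embedding application may
// only adjust it at start-up, before any index is opened, e.g.:
//
// func init() {
//     scorch.BleveMaxKNNConcurrency = 32 // hypothetical tuning value
// }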
func (o *OptimizeVR) invokeSearcherEndCallback() {
if o.ctx != nil {
if cb := o.ctx.Value(search.SearcherEndCallbackKey); cb != nil {
if cbF, ok := cb.(search.SearcherEndCallbackFn); ok {
if o.totalCost > 0 {
// notify the callback that the searcher creation etc. is finished
// and report back the total cost for it to track and take actions
// appropriately.
_ = cbF(o.totalCost)
}
}
}
}
}
func (o *OptimizeVR) Finish() error {
// for each field, get the vector index --> invoke the zap func.
// for each VR, populate postings list and iterators
// by passing the obtained vector index and getting similar vectors.
// defer close index - just once.
var errorsM sync.Mutex
var errors []error
defer o.invokeSearcherEndCallback()
wg := sync.WaitGroup{}
semaphore := make(chan struct{}, BleveMaxKNNConcurrency)
// Launch goroutines to get vector index for each segment
for i, seg := range o.snapshot.segment {
if sv, ok := seg.segment.(segment_api.VectorSegment); ok {
wg.Add(1)
semaphore <- struct{}{} // Acquire a semaphore slot
go func(index int, segment segment_api.VectorSegment, origSeg *SegmentSnapshot) {
defer func() {
<-semaphore // Release the semaphore slot
wg.Done()
}()
for field, vrs := range o.vrs {
vecIndex, err := segment.InterpretVectorIndex(field,
o.requiresFiltering, origSeg.deleted)
if err != nil {
errorsM.Lock()
errors = append(errors, err)
errorsM.Unlock()
return
}
// update the vector index size as a meta value in the segment snapshot
vectorIndexSize := vecIndex.Size()
origSeg.cachedMeta.updateMeta(field, vectorIndexSize)
for _, vr := range vrs {
var pl segment_api.VecPostingsList
var err error
// for each VR, populate postings list and iterators
// by passing the obtained vector index and getting similar vectors.
// check if the vector reader is configured to use a pre-filter
// to filter out ineligible documents before performing
// kNN search.
if vr.eligibleSelector != nil {
pl, err = vecIndex.SearchWithFilter(vr.vector, vr.k,
vr.eligibleSelector.SegmentEligibleDocs(index), vr.searchParams)
} else {
pl, err = vecIndex.Search(vr.vector, vr.k, vr.searchParams)
}
if err != nil {
errorsM.Lock()
errors = append(errors, err)
errorsM.Unlock()
go vecIndex.Close()
return
}
atomic.AddUint64(&o.snapshot.parent.stats.TotKNNSearches, uint64(1))
// postings and iterators are already alloc'ed when
// IndexSnapshotVectorReader is created
vr.postings[index] = pl
vr.iterators[index] = pl.Iterator(vr.iterators[index])
}
go vecIndex.Close()
}
}(i, sv, seg)
}
}
wg.Wait()
close(semaphore)
if len(errors) > 0 {
return errors[0]
}
return nil
}
func (s *IndexSnapshotVectorReader) VectorOptimize(ctx context.Context,
octx index.VectorOptimizableContext,
) (index.VectorOptimizableContext, error) {
if s.snapshot.parent.segPlugin.Version() < VectorSearchSupportedSegmentVersion {
return nil, fmt.Errorf("vector search not supported for this index, "+
"index's segment version %v, supported segment version for vector search %v",
s.snapshot.parent.segPlugin.Version(), VectorSearchSupportedSegmentVersion)
}
if octx == nil {
octx = &OptimizeVR{
snapshot: s.snapshot,
vrs: make(map[string][]*IndexSnapshotVectorReader),
}
}
o, ok := octx.(*OptimizeVR)
if !ok {
return octx, nil
}
o.ctx = ctx
if !o.requiresFiltering {
o.requiresFiltering = s.eligibleSelector != nil
}
if o.snapshot != s.snapshot {
o.invokeSearcherEndCallback()
return nil, fmt.Errorf("tried to optimize KNN across different snapshots")
}
// for every searcher creation, consult the segment snapshot for the
// vector index size; since this vector index is anyway going to be used
// to perform the search as part of Finish(), check up front whether the
// searcher creation (and the downstream Finish() logic) should be
// allowed to occur at all.
var sumVectorIndexSize uint64
for _, seg := range o.snapshot.segment {
vecIndexSize := seg.cachedMeta.fetchMeta(s.field)
if vecIndexSize != nil {
sumVectorIndexSize += vecIndexSize.(uint64)
}
}
if o.ctx != nil {
if cb := o.ctx.Value(search.SearcherStartCallbackKey); cb != nil {
if cbF, ok := cb.(search.SearcherStartCallbackFn); ok {
err := cbF(sumVectorIndexSize)
if err != nil {
// it's important to invoke the end callback at this point: if the
// earlier searchers of this optimize struct were successful, the
// cost corresponding to them has already been incremented, and since
// the current searcher failing the check errors out the overall
// optimized searcher creation, that cost needs to be handled
// appropriately.
o.invokeSearcherEndCallback()
return nil, err
}
}
}
}
// total cost is essentially the sum of the vector indexes' sizes across
// all the searchers - all of them end up reading and maintaining a
// vector index. Misaccounting this value would end up calling the "end"
// callback with a value not equal to the value passed to the "start"
// callback.
o.totalCost += sumVectorIndexSize
o.vrs[s.field] = append(o.vrs[s.field], s)
return o, nil
}
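// Editorial sketch (not part of the upstream source): how an embedding
// layer might meter kNN searcher memory via the context callbacks
// consulted above. The quota tracker is hypothetical; the callback keys
// and types come from the bleve/v2 search package.
//
// ctx := context.WithValue(context.Background(),
//     search.SearcherStartCallbackKey,
//     search.SearcherStartCallbackFn(func(size uint64) error {
//         return quota.Reserve(size) // hypothetical; an error aborts searcher creation
//     }))
// ctx = context.WithValue(ctx,
//     search.SearcherEndCallbackKey,
//     search.SearcherEndCallbackFn(func(size uint64) error {
//         quota.Release(size) // hypothetical
//         return nil
//     }))
// // pass ctx into VectorOptimize; Finish() reports the total cost back.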

File diff suppressed because it is too large


@@ -0,0 +1,63 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"regexp/syntax"
"github.com/blevesearch/vellum/regexp"
)
func parseRegexp(pattern string) (a *regexp.Regexp, prefixBeg, prefixEnd []byte, err error) {
// TODO: potential optimization where syntax.Regexp supports a Simplify() API?
parsed, err := syntax.Parse(pattern, syntax.Perl)
if err != nil {
return nil, nil, nil, err
}
re, err := regexp.NewParsedWithLimit(pattern, parsed, regexp.DefaultLimit)
if err != nil {
return nil, nil, nil, err
}
prefix := literalPrefix(parsed)
if prefix != "" {
prefixBeg := []byte(prefix)
prefixEnd := calculateExclusiveEndFromPrefix(prefixBeg)
return re, prefixBeg, prefixEnd, nil
}
return re, nil, nil, nil
}
// Returns the literal prefix given the parse tree for a regexp
func literalPrefix(s *syntax.Regexp) string {
// traverse the left-most branch in the parse tree as long as the
// node represents a concatenation
for s != nil && s.Op == syntax.OpConcat {
if len(s.Sub) < 1 {
return ""
}
s = s.Sub[0]
}
if s.Op == syntax.OpLiteral && (s.Flags&syntax.FoldCase == 0) {
return string(s.Rune)
}
return "" // no literal prefix
}
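// Editorial sketch (not part of the upstream source): what literalPrefix
// extracts for a couple of patterns, assuming syntax.Parse with
// syntax.Perl as above.
//
// parsed, _ := syntax.Parse("foo.*bar", syntax.Perl)
// literalPrefix(parsed) // "foo" -- the range scan can start at "foo"
//
// parsed, _ = syntax.Parse("(?i)foo.*", syntax.Perl)
// literalPrefix(parsed) // "" -- case folding defeats the literal prefix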


@@ -0,0 +1,216 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"fmt"
"log"
"os"
bolt "go.etcd.io/bbolt"
)
type RollbackPoint struct {
epoch uint64
meta map[string][]byte
}
func (r *RollbackPoint) GetInternal(key []byte) []byte {
return r.meta[string(key)]
}
// RollbackPoints returns an array of rollback points available for
// the application to rollback to, with more recent rollback points
// (higher epochs) coming first.
func RollbackPoints(path string) ([]*RollbackPoint, error) {
if len(path) == 0 {
return nil, fmt.Errorf("RollbackPoints: invalid path")
}
rootBoltPath := path + string(os.PathSeparator) + "root.bolt"
rootBoltOpt := &bolt.Options{
ReadOnly: true,
}
rootBolt, err := bolt.Open(rootBoltPath, 0600, rootBoltOpt)
if err != nil || rootBolt == nil {
return nil, err
}
// start a read-only bolt transaction
tx, err := rootBolt.Begin(false)
if err != nil {
return nil, fmt.Errorf("RollbackPoints: failed to start" +
" read-only transaction")
}
// read-only bolt transactions to be rolled back
defer func() {
_ = tx.Rollback()
_ = rootBolt.Close()
}()
snapshots := tx.Bucket(boltSnapshotsBucket)
if snapshots == nil {
return nil, nil
}
rollbackPoints := []*RollbackPoint{}
c1 := snapshots.Cursor()
for k, _ := c1.Last(); k != nil; k, _ = c1.Prev() {
_, snapshotEpoch, err := decodeUvarintAscending(k)
if err != nil {
log.Printf("RollbackPoints:"+
" unable to parse segment epoch %x, continuing", k)
continue
}
snapshot := snapshots.Bucket(k)
if snapshot == nil {
log.Printf("RollbackPoints:"+
" snapshot key, but bucket missing %x, continuing", k)
continue
}
meta := map[string][]byte{}
c2 := snapshot.Cursor()
for j, _ := c2.First(); j != nil; j, _ = c2.Next() {
if j[0] == boltInternalKey[0] {
internalBucket := snapshot.Bucket(j)
if internalBucket == nil {
err = fmt.Errorf("internal bucket missing")
break
}
err = internalBucket.ForEach(func(key []byte, val []byte) error {
copiedVal := append([]byte(nil), val...)
meta[string(key)] = copiedVal
return nil
})
if err != nil {
break
}
}
}
if err != nil {
log.Printf("RollbackPoints:"+
" failed in fetching internal data: %v", err)
continue
}
rollbackPoints = append(rollbackPoints, &RollbackPoint{
epoch: snapshotEpoch,
meta: meta,
})
}
return rollbackPoints, nil
}
// Rollback atomically and durably brings the store back to the point
// in time as represented by the RollbackPoint.
// Rollback() should only be passed a RollbackPoint that came from the
// same store using the RollbackPoints() API along with the index path.
func Rollback(path string, to *RollbackPoint) (err error) {
if to == nil {
return fmt.Errorf("Rollback: RollbackPoint is nil")
}
if len(path) == 0 {
return fmt.Errorf("Rollback: index path is empty")
}
rootBoltPath := path + string(os.PathSeparator) + "root.bolt"
rootBoltOpt := &bolt.Options{
ReadOnly: false,
}
rootBolt, err := bolt.Open(rootBoltPath, 0600, rootBoltOpt)
if err != nil || rootBolt == nil {
return err
}
defer func() {
err1 := rootBolt.Close()
if err1 != nil && err == nil {
err = err1
}
}()
// pick all the younger persisted epochs in bolt store
// including the target one.
var found bool
var eligibleEpochs []uint64
err = rootBolt.View(func(tx *bolt.Tx) error {
snapshots := tx.Bucket(boltSnapshotsBucket)
if snapshots == nil {
return nil
}
sc := snapshots.Cursor()
for sk, _ := sc.Last(); sk != nil && !found; sk, _ = sc.Prev() {
_, snapshotEpoch, err := decodeUvarintAscending(sk)
if err != nil {
continue
}
if snapshotEpoch == to.epoch {
found = true
}
eligibleEpochs = append(eligibleEpochs, snapshotEpoch)
}
return nil
})
if err != nil {
return err
}
if len(eligibleEpochs) == 0 {
return fmt.Errorf("Rollback: no persisted epochs found in bolt")
}
if !found {
return fmt.Errorf("Rollback: target epoch %d not found in bolt", to.epoch)
}
// start a write transaction
tx, err := rootBolt.Begin(true)
if err != nil {
return err
}
defer func() {
if err == nil {
err = tx.Commit()
} else {
_ = tx.Rollback()
}
if err == nil {
err = rootBolt.Sync()
}
}()
snapshots := tx.Bucket(boltSnapshotsBucket)
if snapshots == nil {
return nil
}
for _, epoch := range eligibleEpochs {
k := encodeUvarintAscending(nil, epoch)
if err != nil {
continue
}
if epoch == to.epoch {
// return here since all snapshots newer than the target epoch have been processed
return nil
}
err = snapshots.DeleteBucket(k)
if err == bolt.ErrBucketNotFound {
err = nil
}
}
return err
}
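// Editor's sketch: the intended pairing of RollbackPoints and Rollback above.
// Both take the same index path, and Rollback must only be given a point that
// RollbackPoints produced for that path. This helper (not in the original
// source) rolls back to the most recent available point.
func exampleRollbackToNewest(path string) error {
	points, err := RollbackPoints(path)
	if err != nil {
		return err
	}
	if len(points) == 0 {
		return fmt.Errorf("exampleRollbackToNewest: no rollback points")
	}
	// RollbackPoints orders results newest (highest epoch) first
	return Rollback(path, points[0])
}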


@@ -0,0 +1,942 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"sync"
"sync/atomic"
"time"
"github.com/RoaringBitmap/roaring/v2"
"github.com/blevesearch/bleve/v2/registry"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
bolt "go.etcd.io/bbolt"
)
const Name = "scorch"
const Version uint8 = 2
var ErrClosed = fmt.Errorf("scorch closed")
type Scorch struct {
nextSegmentID uint64
stats Stats
iStats internalStats
readOnly bool
version uint8
config map[string]interface{}
analysisQueue *index.AnalysisQueue
path string
unsafeBatch bool
rootLock sync.RWMutex
root *IndexSnapshot // holds 1 ref-count on the root
rootPersisted []chan error // closed when root is persisted
persistedCallbacks []index.BatchCallback
nextSnapshotEpoch uint64
eligibleForRemoval []uint64 // Index snapshot epochs that are safe to GC.
ineligibleForRemoval map[string]bool // Filenames that should not be GC'ed yet.
// keeps track of segments scheduled for online copy/backup operation. Each segment's filename maps to
// the count of copy schedules. Segments with non-zero counts are protected from removal by the cleanup
// operation. Counts decrement upon successful copy, allowing removal of segments with zero or absent counts.
// must be accessed within the rootLock as it is accessed by the asynchronous cleanup routine.
copyScheduled map[string]int
numSnapshotsToKeep int
rollbackRetentionFactor float64
checkPoints []*snapshotMetaData
rollbackSamplingInterval time.Duration
closeCh chan struct{}
introductions chan *segmentIntroduction
persists chan *persistIntroduction
merges chan *segmentMerge
introducerNotifier chan *epochWatcher
persisterNotifier chan *epochWatcher
rootBolt *bolt.DB
asyncTasks sync.WaitGroup
onEvent func(event Event) bool
onAsyncError func(err error, path string)
forceMergeRequestCh chan *mergerCtrl
segPlugin SegmentPlugin
spatialPlugin index.SpatialAnalyzerPlugin
}
// AsyncPanicError is passed to the scorch asyncErrorHandler when a panic occurs in a scorch background process
type AsyncPanicError struct {
Source string
Path string
}
func (e *AsyncPanicError) Error() string {
return fmt.Sprintf("%s panic when processing %s", e.Source, e.Path)
}
type internalStats struct {
persistEpoch uint64
persistSnapshotSize uint64
mergeEpoch uint64
mergeSnapshotSize uint64
newSegBufBytesAdded uint64
newSegBufBytesRemoved uint64
analysisBytesAdded uint64
analysisBytesRemoved uint64
}
func NewScorch(storeName string,
config map[string]interface{},
analysisQueue *index.AnalysisQueue,
) (index.Index, error) {
rv := &Scorch{
version: Version,
config: config,
analysisQueue: analysisQueue,
nextSnapshotEpoch: 1,
closeCh: make(chan struct{}),
ineligibleForRemoval: map[string]bool{},
forceMergeRequestCh: make(chan *mergerCtrl, 1),
segPlugin: defaultSegmentPlugin,
copyScheduled: map[string]int{},
}
forcedSegmentType, forcedSegmentVersion, err := configForceSegmentTypeVersion(config)
if err != nil {
return nil, err
}
if forcedSegmentType != "" && forcedSegmentVersion != 0 {
err := rv.loadSegmentPlugin(forcedSegmentType, forcedSegmentVersion)
if err != nil {
return nil, err
}
}
typ, ok := config["spatialPlugin"].(string)
if ok {
if err := rv.loadSpatialAnalyzerPlugin(typ); err != nil {
return nil, err
}
}
rv.root = &IndexSnapshot{parent: rv, refs: 1, creator: "NewScorch"}
ro, ok := config["read_only"].(bool)
if ok {
rv.readOnly = ro
}
ub, ok := config["unsafe_batch"].(bool)
if ok {
rv.unsafeBatch = ub
}
ecbName, ok := config["eventCallbackName"].(string)
if ok {
rv.onEvent = RegistryEventCallbacks[ecbName]
}
aecbName, ok := config["asyncErrorCallbackName"].(string)
if ok {
rv.onAsyncError = RegistryAsyncErrorCallbacks[aecbName]
}
// validate any custom persistor options to
// prevent an async error in the persistor routine
_, err = rv.parsePersisterOptions()
if err != nil {
return nil, err
}
// validate any custom merge planner options to
// prevent an async error in the merger routine
_, err = rv.parseMergePlannerOptions()
if err != nil {
return nil, err
}
return rv, nil
}
// configForceSegmentTypeVersion checks if the caller has requested a
// specific segment type/version
func configForceSegmentTypeVersion(config map[string]interface{}) (string, uint32, error) {
forcedSegmentVersion, err := parseToInteger(config["forceSegmentVersion"])
if err != nil {
return "", 0, nil
}
forcedSegmentType, ok := config["forceSegmentType"].(string)
if !ok {
return "", 0, fmt.Errorf(
"forceSegmentVersion set to %d, must also specify forceSegmentType", forcedSegmentVersion)
}
return forcedSegmentType, uint32(forcedSegmentVersion), nil
}
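// Editor's sketch: a config map exercising the check above. Supplying
// forceSegmentVersion without forceSegmentType is an error, while omitting
// both is fine (the zero values select the default plugin). "zap"/15 are
// illustrative values; see the segment plugin registry for what is supported.
func exampleForcedSegmentConfig() (string, uint32, error) {
	cfg := map[string]interface{}{
		"forceSegmentType":    "zap",
		"forceSegmentVersion": 15, // parseToInteger accepts int or float64
	}
	return configForceSegmentTypeVersion(cfg)
}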
func (s *Scorch) NumEventsBlocking() uint64 {
eventsCompleted := atomic.LoadUint64(&s.stats.TotEventTriggerCompleted)
eventsStarted := atomic.LoadUint64(&s.stats.TotEventTriggerStarted)
return eventsStarted - eventsCompleted
}
func (s *Scorch) fireEvent(kind EventKind, dur time.Duration) bool {
res := true
if s.onEvent != nil {
atomic.AddUint64(&s.stats.TotEventTriggerStarted, 1)
res = s.onEvent(Event{Kind: kind, Scorch: s, Duration: dur})
atomic.AddUint64(&s.stats.TotEventTriggerCompleted, 1)
}
return res
}
func (s *Scorch) fireAsyncError(err error) {
if s.onAsyncError != nil {
s.onAsyncError(err, s.path)
}
atomic.AddUint64(&s.stats.TotOnErrors, 1)
}
func (s *Scorch) Open() error {
err := s.openBolt()
if err != nil {
return err
}
s.asyncTasks.Add(1)
go s.introducerLoop()
if !s.readOnly && s.path != "" {
s.asyncTasks.Add(1)
go s.persisterLoop()
s.asyncTasks.Add(1)
go s.mergerLoop()
}
return nil
}
func (s *Scorch) openBolt() error {
var ok bool
s.path, ok = s.config["path"].(string)
if !ok {
return fmt.Errorf("must specify path")
}
if s.path == "" {
s.unsafeBatch = true
}
rootBoltOpt := *bolt.DefaultOptions
if s.readOnly {
rootBoltOpt.ReadOnly = true
rootBoltOpt.OpenFile = func(path string, flag int, mode os.FileMode) (*os.File, error) {
// Bolt appends an O_CREATE flag regardless.
// See - https://github.com/etcd-io/bbolt/blob/v1.3.5/db.go#L210
// Use os.O_RDONLY only if path exists (#1623)
if _, err := os.Stat(path); os.IsNotExist(err) {
return os.OpenFile(path, flag, mode)
}
return os.OpenFile(path, os.O_RDONLY, mode)
}
} else {
if s.path != "" {
err := os.MkdirAll(s.path, 0o700)
if err != nil {
return err
}
}
}
if boltTimeoutStr, ok := s.config["bolt_timeout"].(string); ok {
var err error
boltTimeout, err := time.ParseDuration(boltTimeoutStr)
if err != nil {
return fmt.Errorf("invalid duration specified for bolt_timeout: %v", err)
}
rootBoltOpt.Timeout = boltTimeout
}
rootBoltPath := s.path + string(os.PathSeparator) + "root.bolt"
var err error
if s.path != "" {
s.rootBolt, err = bolt.Open(rootBoltPath, 0o600, &rootBoltOpt)
if err != nil {
return err
}
// now see if there is any existing state to load
err = s.loadFromBolt()
if err != nil {
_ = s.Close()
return err
}
}
atomic.StoreUint64(&s.stats.TotFileSegmentsAtRoot, uint64(len(s.root.segment)))
s.introductions = make(chan *segmentIntroduction)
s.persists = make(chan *persistIntroduction)
s.merges = make(chan *segmentMerge)
s.introducerNotifier = make(chan *epochWatcher, 1)
s.persisterNotifier = make(chan *epochWatcher, 1)
s.closeCh = make(chan struct{})
s.forceMergeRequestCh = make(chan *mergerCtrl, 1)
if !s.readOnly && s.path != "" {
err := s.removeOldZapFiles() // Before persister or merger create any new files.
if err != nil {
_ = s.Close()
return err
}
}
s.numSnapshotsToKeep = NumSnapshotsToKeep
if v, ok := s.config["numSnapshotsToKeep"]; ok {
var t int
if t, err = parseToInteger(v); err != nil {
return fmt.Errorf("numSnapshotsToKeep parse err: %v", err)
}
if t > 0 {
s.numSnapshotsToKeep = t
}
}
s.rollbackSamplingInterval = RollbackSamplingInterval
if v, ok := s.config["rollbackSamplingInterval"]; ok {
var t time.Duration
if t, err = parseToTimeDuration(v); err != nil {
return fmt.Errorf("rollbackSamplingInterval parse err: %v", err)
}
s.rollbackSamplingInterval = t
}
s.rollbackRetentionFactor = RollbackRetentionFactor
if v, ok := s.config["rollbackRetentionFactor"]; ok {
var r float64
if r, ok = v.(float64); !ok {
return fmt.Errorf("rollbackRetentionFactor parse err: expected float64, got %T", v)
}
s.rollbackRetentionFactor = r
}
typ, ok := s.config["spatialPlugin"].(string)
if ok {
if err := s.loadSpatialAnalyzerPlugin(typ); err != nil {
return err
}
}
return nil
}
func (s *Scorch) Close() (err error) {
startTime := time.Now()
defer func() {
s.fireEvent(EventKindClose, time.Since(startTime))
}()
s.fireEvent(EventKindCloseStart, 0)
// signal to async tasks we want to close
close(s.closeCh)
// wait for them to close
s.asyncTasks.Wait()
// now close the root bolt
if s.rootBolt != nil {
err = s.rootBolt.Close()
s.rootLock.Lock()
if s.root != nil {
err2 := s.root.DecRef()
if err == nil {
err = err2
}
}
s.root = nil
s.rootLock.Unlock()
}
return
}
func (s *Scorch) Update(doc index.Document) error {
b := index.NewBatch()
b.Update(doc)
return s.Batch(b)
}
func (s *Scorch) Delete(id string) error {
b := index.NewBatch()
b.Delete(id)
return s.Batch(b)
}
// Batch applies a batch of changes to the index atomically
func (s *Scorch) Batch(batch *index.Batch) (err error) {
start := time.Now()
// notify handlers that we're about to index a batch of data
s.fireEvent(EventKindBatchIntroductionStart, 0)
defer func() {
s.fireEvent(EventKindBatchIntroduction, time.Since(start))
}()
resultChan := make(chan index.Document, len(batch.IndexOps))
var numUpdates uint64
var numDeletes uint64
var numPlainTextBytes uint64
var ids []string
for docID, doc := range batch.IndexOps {
if doc != nil {
// insert _id field
doc.AddIDField()
numUpdates++
numPlainTextBytes += doc.NumPlainTextBytes()
} else {
numDeletes++
}
ids = append(ids, docID)
}
// FIXME could sort ids list concurrent with analysis?
if numUpdates > 0 {
go func() {
for k := range batch.IndexOps {
doc := batch.IndexOps[k]
if doc != nil {
// put the work on the queue
s.analysisQueue.Queue(func() {
analyze(doc, s.setSpatialAnalyzerPlugin)
resultChan <- doc
})
}
}
}()
}
// wait for analysis result
analysisResults := make([]index.Document, int(numUpdates))
var itemsDeQueued uint64
var totalAnalysisSize int
for itemsDeQueued < numUpdates {
result := <-resultChan
resultSize := result.Size()
// check if the document was filtered out, i.e. not searchable by the index
if !result.Indexed() {
atomic.AddUint64(&s.stats.TotMutationsFiltered, 1)
}
atomic.AddUint64(&s.iStats.analysisBytesAdded, uint64(resultSize))
totalAnalysisSize += resultSize
analysisResults[itemsDeQueued] = result
itemsDeQueued++
}
close(resultChan)
defer atomic.AddUint64(&s.iStats.analysisBytesRemoved, uint64(totalAnalysisSize))
atomic.AddUint64(&s.stats.TotAnalysisTime, uint64(time.Since(start)))
indexStart := time.Now()
var newSegment segment.Segment
var bufBytes uint64
stats := newFieldStats()
if len(analysisResults) > 0 {
newSegment, bufBytes, err = s.segPlugin.New(analysisResults)
if err != nil {
return err
}
if segB, ok := newSegment.(segment.DiskStatsReporter); ok {
atomic.AddUint64(&s.stats.TotBytesWrittenAtIndexTime,
segB.BytesWritten())
}
atomic.AddUint64(&s.iStats.newSegBufBytesAdded, bufBytes)
if fsr, ok := newSegment.(segment.FieldStatsReporter); ok {
fsr.UpdateFieldStats(stats)
}
} else {
atomic.AddUint64(&s.stats.TotBatchesEmpty, 1)
}
err = s.prepareSegment(newSegment, ids, batch.InternalOps, batch.PersistedCallback(), stats)
if err != nil {
if newSegment != nil {
_ = newSegment.Close()
}
atomic.AddUint64(&s.stats.TotOnErrors, 1)
} else {
atomic.AddUint64(&s.stats.TotUpdates, numUpdates)
atomic.AddUint64(&s.stats.TotDeletes, numDeletes)
atomic.AddUint64(&s.stats.TotBatches, 1)
atomic.AddUint64(&s.stats.TotIndexedPlainTextBytes, numPlainTextBytes)
}
atomic.AddUint64(&s.iStats.newSegBufBytesRemoved, bufBytes)
atomic.AddUint64(&s.stats.TotIndexTime, uint64(time.Since(indexStart)))
return err
}
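// Editor's sketch: how a caller typically drives Batch above. Updates and
// deletes in the same batch are introduced atomically; docs and deleteIDs
// here are placeholders supplied by the caller.
func exampleBatchUsage(s *Scorch, docs []index.Document, deleteIDs []string) error {
	b := index.NewBatch()
	for _, d := range docs {
		b.Update(d)
	}
	for _, id := range deleteIDs {
		b.Delete(id) // deletes are recorded as nil entries in IndexOps
	}
	return s.Batch(b)
}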
func (s *Scorch) prepareSegment(newSegment segment.Segment, ids []string,
internalOps map[string][]byte, persistedCallback index.BatchCallback, stats *fieldStats,
) error {
// new introduction
introduction := &segmentIntroduction{
id: atomic.AddUint64(&s.nextSegmentID, 1),
data: newSegment,
ids: ids,
internal: internalOps,
stats: stats,
applied: make(chan error),
persistedCallback: persistedCallback,
}
if !s.unsafeBatch {
introduction.persisted = make(chan error, 1)
}
// optimistically prepare obsoletes outside of rootLock
s.rootLock.RLock()
root := s.root
root.AddRef()
s.rootLock.RUnlock()
defer func() { _ = root.DecRef() }()
introduction.obsoletes = make(map[uint64]*roaring.Bitmap, len(root.segment))
for _, seg := range root.segment {
delta, err := seg.segment.DocNumbers(ids)
if err != nil {
return err
}
introduction.obsoletes[seg.id] = delta
}
introStartTime := time.Now()
s.introductions <- introduction
// block until this segment is applied
err := <-introduction.applied
if err != nil {
return err
}
if introduction.persisted != nil {
err = <-introduction.persisted
}
introTime := uint64(time.Since(introStartTime))
atomic.AddUint64(&s.stats.TotBatchIntroTime, introTime)
if atomic.LoadUint64(&s.stats.MaxBatchIntroTime) < introTime {
atomic.StoreUint64(&s.stats.MaxBatchIntroTime, introTime)
}
return err
}
func (s *Scorch) SetInternal(key, val []byte) error {
b := index.NewBatch()
b.SetInternal(key, val)
return s.Batch(b)
}
func (s *Scorch) DeleteInternal(key []byte) error {
b := index.NewBatch()
b.DeleteInternal(key)
return s.Batch(b)
}
// Reader returns a low-level accessor on the index data. Close it to
// release associated resources.
func (s *Scorch) Reader() (index.IndexReader, error) {
return s.currentSnapshot(), nil
}
func (s *Scorch) currentSnapshot() *IndexSnapshot {
s.rootLock.RLock()
rv := s.root
if rv != nil {
rv.AddRef()
}
s.rootLock.RUnlock()
return rv
}
func (s *Scorch) Stats() json.Marshaler {
return &s.stats
}
func (s *Scorch) BytesReadQueryTime() uint64 {
return s.stats.TotBytesReadAtQueryTime
}
func (s *Scorch) diskFileStats(rootSegmentPaths map[string]struct{}) (uint64,
uint64, uint64,
) {
var numFilesOnDisk, numBytesUsedDisk, numBytesOnDiskByRoot uint64
if s.path != "" {
files, err := os.ReadDir(s.path)
if err == nil {
for _, f := range files {
if !f.IsDir() {
if finfo, err := f.Info(); err == nil {
numBytesUsedDisk += uint64(finfo.Size())
numFilesOnDisk++
if rootSegmentPaths != nil {
fname := s.path + string(os.PathSeparator) + finfo.Name()
if _, fileAtRoot := rootSegmentPaths[fname]; fileAtRoot {
numBytesOnDiskByRoot += uint64(finfo.Size())
}
}
}
}
}
}
}
// if no root segment paths were given, then consider all disk files.
if rootSegmentPaths == nil {
return numFilesOnDisk, numBytesUsedDisk, numBytesUsedDisk
}
return numFilesOnDisk, numBytesUsedDisk, numBytesOnDiskByRoot
}
func (s *Scorch) StatsMap() map[string]interface{} {
m := s.stats.ToMap()
indexSnapshot := s.currentSnapshot()
if indexSnapshot == nil {
return nil
}
defer func() {
_ = indexSnapshot.Close()
}()
rootSegPaths := indexSnapshot.diskSegmentsPaths()
s.rootLock.RLock()
m["CurFilesIneligibleForRemoval"] = uint64(len(s.ineligibleForRemoval))
s.rootLock.RUnlock()
numFilesOnDisk, numBytesUsedDisk, numBytesOnDiskByRoot := s.diskFileStats(rootSegPaths)
m["CurOnDiskBytes"] = numBytesUsedDisk
m["CurOnDiskFiles"] = numFilesOnDisk
// TODO: consider one day removing these backwards compatible
// names for apps using the old names
m["updates"] = m["TotUpdates"]
m["deletes"] = m["TotDeletes"]
m["batches"] = m["TotBatches"]
m["errors"] = m["TotOnErrors"]
m["analysis_time"] = m["TotAnalysisTime"]
m["index_time"] = m["TotIndexTime"]
m["term_searchers_started"] = m["TotTermSearchersStarted"]
m["term_searchers_finished"] = m["TotTermSearchersFinished"]
m["knn_searches"] = m["TotKNNSearches"]
m["synonym_searches"] = m["TotSynonymSearches"]
m["total_mutations_filtered"] = m["TotMutationsFiltered"]
m["num_bytes_read_at_query_time"] = m["TotBytesReadAtQueryTime"]
m["num_plain_text_bytes_indexed"] = m["TotIndexedPlainTextBytes"]
m["num_bytes_written_at_index_time"] = m["TotBytesWrittenAtIndexTime"]
m["num_items_introduced"] = m["TotIntroducedItems"]
m["num_items_persisted"] = m["TotPersistedItems"]
m["num_recs_to_persist"] = m["TotItemsToPersist"]
// total disk bytes found in index directory inclusive of older snapshots
m["num_bytes_used_disk"] = numBytesUsedDisk
// total disk bytes by the latest root index, exclusive of older snapshots
m["num_bytes_used_disk_by_root"] = numBytesOnDiskByRoot
// num_bytes_used_disk_by_root_reclaimable is an approximation of the
// reclaimable disk space in an index. (eg: from a full compaction)
m["num_bytes_used_disk_by_root_reclaimable"] = uint64(float64(numBytesOnDiskByRoot) *
indexSnapshot.reClaimableDocsRatio())
m["num_files_on_disk"] = numFilesOnDisk
m["num_root_memorysegments"] = m["TotMemorySegmentsAtRoot"]
m["num_root_filesegments"] = m["TotFileSegmentsAtRoot"]
m["num_persister_nap_pause_completed"] = m["TotPersisterNapPauseCompleted"]
m["num_persister_nap_merger_break"] = m["TotPersisterMergerNapBreak"]
m["total_compaction_written_bytes"] = m["TotFileMergeWrittenBytes"]
// the bool stat `index_bgthreads_active` indicates whether the background routines
// (which are responsible for the index to attain a steady state) are still
// doing some work.
if rootEpoch, ok := m["CurRootEpoch"].(uint64); ok {
if lastMergedEpoch, ok := m["LastMergedEpoch"].(uint64); ok {
if lastPersistedEpoch, ok := m["LastPersistedEpoch"].(uint64); ok {
m["index_bgthreads_active"] = !(lastMergedEpoch == rootEpoch && lastPersistedEpoch == rootEpoch)
}
}
}
// calculate the aggregate of all the segment's field stats
aggFieldStats := newFieldStats()
for _, segmentSnapshot := range indexSnapshot.Segments() {
if segmentSnapshot.stats != nil {
aggFieldStats.Aggregate(segmentSnapshot.stats)
}
}
aggFieldStatsMap := aggFieldStats.Fetch()
for statName, stats := range aggFieldStatsMap {
for fieldName, val := range stats {
m["field:"+fieldName+":"+statName] = val
}
}
return m
}
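// Editor's sketch: reading a few entries from the map returned by StatsMap
// above. The key names are taken from the assignments in StatsMap itself;
// the type assertions are defensive since the map values are interface{}.
func exampleReadStats(s *Scorch) (filesOnDisk, bytesOnDisk uint64) {
	m := s.StatsMap()
	if m == nil {
		return 0, 0
	}
	filesOnDisk, _ = m["num_files_on_disk"].(uint64)
	bytesOnDisk, _ = m["num_bytes_used_disk"].(uint64)
	return filesOnDisk, bytesOnDisk
}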
func (s *Scorch) Analyze(d index.Document) {
analyze(d, s.setSpatialAnalyzerPlugin)
}
type customAnalyzerPluginInitFunc func(field index.Field)
func (s *Scorch) setSpatialAnalyzerPlugin(f index.Field) {
if s.segPlugin != nil {
// check whether the current field is a custom tokenizable
// spatial field and, if so, set the spatial analyzer plugin to
// override the tokenization during the analysis stage.
if sf, ok := f.(index.TokenizableSpatialField); ok {
sf.SetSpatialAnalyzerPlugin(s.spatialPlugin)
}
}
}
func analyze(d index.Document, fn customAnalyzerPluginInitFunc) {
d.VisitFields(func(field index.Field) {
if field.Options().IsIndexed() {
if fn != nil {
fn(field)
}
field.Analyze()
if d.HasComposite() && field.Name() != "_id" {
// see if any of the composite fields need this
d.VisitComposite(func(cf index.CompositeField) {
cf.Compose(field.Name(), field.AnalyzedLength(), field.AnalyzedTokenFrequencies())
})
// Since the encoded geoShape is only necessary within the doc values
// of the geoShapeField, it has been removed from the field's term dictionary.
// However, the '_all' field uses its term dictionary as its docValues, so it
// becomes necessary to add the geoShape into the '_all' field's term dictionary
if f, ok := field.(index.GeoShapeField); ok {
d.VisitComposite(func(cf index.CompositeField) {
geoshape := f.EncodedShape()
cf.Compose(field.Name(), 1, index.TokenFrequencies{
string(geoshape): &index.TokenFreq{
Term: geoshape,
Locations: []*index.TokenLocation{
{
Start: 0,
End: len(geoshape),
Position: 1,
},
},
},
})
})
}
}
}
})
}
func (s *Scorch) AddEligibleForRemoval(epoch uint64) {
s.rootLock.Lock()
if s.root == nil || s.root.epoch != epoch {
s.eligibleForRemoval = append(s.eligibleForRemoval, epoch)
}
s.rootLock.Unlock()
}
func (s *Scorch) MemoryUsed() (memUsed uint64) {
indexSnapshot := s.currentSnapshot()
if indexSnapshot == nil {
return
}
defer func() {
_ = indexSnapshot.Close()
}()
// Account for current root snapshot overhead
memUsed += uint64(indexSnapshot.Size())
// Account for snapshot that the persister may be working on
persistEpoch := atomic.LoadUint64(&s.iStats.persistEpoch)
persistSnapshotSize := atomic.LoadUint64(&s.iStats.persistSnapshotSize)
if persistEpoch != 0 && indexSnapshot.epoch > persistEpoch {
// the snapshot that the persister is working on isn't the same as
// the current snapshot
memUsed += persistSnapshotSize
}
// Account for snapshot that the merger may be working on
mergeEpoch := atomic.LoadUint64(&s.iStats.mergeEpoch)
mergeSnapshotSize := atomic.LoadUint64(&s.iStats.mergeSnapshotSize)
if mergeEpoch != 0 && indexSnapshot.epoch > mergeEpoch {
// the snapshot that the merger is working on isn't the same as
// the current snapshot
memUsed += mergeSnapshotSize
}
memUsed += (atomic.LoadUint64(&s.iStats.newSegBufBytesAdded) -
atomic.LoadUint64(&s.iStats.newSegBufBytesRemoved))
memUsed += (atomic.LoadUint64(&s.iStats.analysisBytesAdded) -
atomic.LoadUint64(&s.iStats.analysisBytesRemoved))
return memUsed
}
func (s *Scorch) markIneligibleForRemoval(filename string) {
s.rootLock.Lock()
s.ineligibleForRemoval[filename] = true
s.rootLock.Unlock()
}
func (s *Scorch) unmarkIneligibleForRemoval(filename string) {
s.rootLock.Lock()
delete(s.ineligibleForRemoval, filename)
s.rootLock.Unlock()
}
func init() {
err := registry.RegisterIndexType(Name, NewScorch)
if err != nil {
panic(err)
}
}
func parseToTimeDuration(i interface{}) (time.Duration, error) {
switch v := i.(type) {
case string:
return time.ParseDuration(v)
default:
return 0, fmt.Errorf("expects a duration string")
}
}
func parseToInteger(i interface{}) (int, error) {
switch v := i.(type) {
case float64:
return int(v), nil
case int:
return v, nil
default:
return 0, fmt.Errorf("expects int or float64 value")
}
}
// Holds Zap's field level stats at a segment level
type fieldStats struct {
// StatName -> FieldName -> value
statMap map[string]map[string]uint64
}
// Store records the value for the given stat name and field, creating the inner map if needed
func (fs *fieldStats) Store(statName, fieldName string, value uint64) {
if _, exists := fs.statMap[statName]; !exists {
fs.statMap[statName] = make(map[string]uint64)
}
fs.statMap[statName][fieldName] = value
}
// Combine the given stats map with the existing map
func (fs *fieldStats) Aggregate(stats segment.FieldStats) {
statMap := stats.Fetch()
if statMap == nil {
return
}
for statName, fieldMap := range statMap {
if _, exists := fs.statMap[statName]; !exists {
fs.statMap[statName] = make(map[string]uint64)
}
for fieldName, val := range fieldMap {
if _, exists := fs.statMap[statName][fieldName]; !exists {
fs.statMap[statName][fieldName] = 0
}
fs.statMap[statName][fieldName] += val
}
}
}
// Returns the stats map
func (fs *fieldStats) Fetch() map[string]map[string]uint64 {
if fs == nil {
return nil
}
return fs.statMap
}
// Initializes an empty stats map
func newFieldStats() *fieldStats {
rv := &fieldStats{
statMap: map[string]map[string]uint64{},
}
return rv
}
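// Editor's sketch: the fieldStats lifecycle implied by the methods above:
// per-segment values are Stored, then rolled up with Aggregate.
// "num_vectors"/"embedding" are illustrative names only, and *fieldStats is
// assumed to satisfy segment.FieldStats (it is passed as one elsewhere in
// this package).
func exampleFieldStats() map[string]map[string]uint64 {
	perSegment := newFieldStats()
	perSegment.Store("num_vectors", "embedding", 128)
	total := newFieldStats()
	total.Aggregate(perSegment)
	return total.Fetch() // StatName -> FieldName -> value
}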
// CopyReader returns a low-level accessor for index data, ensuring persisted segments
// remain on disk for backup, preventing race conditions with the persister/merger cleanup.
// Close the reader after backup to allow segment removal by the persister/merger.
func (s *Scorch) CopyReader() index.CopyReader {
s.rootLock.Lock()
rv := s.root
if rv != nil {
rv.AddRef()
var fileName string
// schedule a backup for all the segments from the root. Note that
// both unpersisted and persisted segments are scheduled for backup,
// because unpersisted segments may get persisted during the backup,
// and hence both kinds need to be protected from removal by the
// cleanup routine for the duration of the online backup
for _, seg := range rv.segment {
if perSeg, ok := seg.segment.(segment.PersistedSegment); ok {
// segment is persisted
fileName = filepath.Base(perSeg.Path())
} else {
// segment is not persisted
// the name of the segment file that will be generated if the
// segment is persisted in the future.
fileName = zapFileName(seg.id)
}
rv.parent.copyScheduled[fileName]++
}
}
s.rootLock.Unlock()
return rv
}
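// Editor's sketch: the protect-then-release pattern described in the comment
// above CopyReader. The backup work itself is abstracted as a callback here;
// per that comment, Close is what lets the persister/merger cleanup reclaim
// segment files again.
func exampleOnlineBackup(s *Scorch, backupTo func(index.CopyReader) error) (err error) {
	r := s.CopyReader()
	defer func() {
		if cerr := r.Close(); err == nil {
			err = cerr
		}
	}()
	return backupTo(r)
}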
// FireIndexEvent is an external API for firing a scorch event (EventKindIndexStart) from the bleve layer
func (s *Scorch) FireIndexEvent() {
s.fireEvent(EventKindIndexStart, 0)
}


@@ -0,0 +1,144 @@
// Copyright (c) 2019 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"fmt"
"github.com/RoaringBitmap/roaring/v2"
"github.com/blevesearch/bleve/v2/geo"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
zapv11 "github.com/blevesearch/zapx/v11"
zapv12 "github.com/blevesearch/zapx/v12"
zapv13 "github.com/blevesearch/zapx/v13"
zapv14 "github.com/blevesearch/zapx/v14"
zapv15 "github.com/blevesearch/zapx/v15"
zapv16 "github.com/blevesearch/zapx/v16"
)
// SegmentPlugin represents the essential functions required by a package to plug in
// its segment implementation
type SegmentPlugin interface {
// Type is the name for this segment plugin
Type() string
// Version is a numeric value identifying a specific version of this type.
// When incompatible changes are made to a particular type of plugin, the
// version must be incremented.
Version() uint32
// New takes a set of Documents and turns them into a new Segment
New(results []index.Document) (segment.Segment, uint64, error)
// Open attempts to open the file at the specified path and
// return the corresponding Segment
Open(path string) (segment.Segment, error)
// Merge takes a set of Segments, and creates a new segment on disk at
// the specified path.
// Drops is a set of bitmaps (one for each segment) indicating which
// documents can be dropped from the segments during the merge.
// If the closeCh channel is closed, Merge will cease doing work at
// the next opportunity, and return an error (closed).
// StatsReporter can optionally be provided, in which case progress
// made during the merge is reported while operation continues.
// Returns:
// A slice of new document numbers (one for each input segment),
// this allows the caller to know a particular document's new
// document number in the newly merged segment.
// The number of bytes written to the new segment file.
// An error, if any occurred.
Merge(segments []segment.Segment, drops []*roaring.Bitmap, path string,
closeCh chan struct{}, s segment.StatsReporter) (
[][]uint64, uint64, error)
}
var supportedSegmentPlugins map[string]map[uint32]SegmentPlugin
var defaultSegmentPlugin SegmentPlugin
func init() {
ResetSegmentPlugins()
RegisterSegmentPlugin(&zapv16.ZapPlugin{}, true)
RegisterSegmentPlugin(&zapv15.ZapPlugin{}, false)
RegisterSegmentPlugin(&zapv14.ZapPlugin{}, false)
RegisterSegmentPlugin(&zapv13.ZapPlugin{}, false)
RegisterSegmentPlugin(&zapv12.ZapPlugin{}, false)
RegisterSegmentPlugin(&zapv11.ZapPlugin{}, false)
}
func ResetSegmentPlugins() {
supportedSegmentPlugins = map[string]map[uint32]SegmentPlugin{}
}
func RegisterSegmentPlugin(plugin SegmentPlugin, makeDefault bool) {
if _, ok := supportedSegmentPlugins[plugin.Type()]; !ok {
supportedSegmentPlugins[plugin.Type()] = map[uint32]SegmentPlugin{}
}
supportedSegmentPlugins[plugin.Type()][plugin.Version()] = plugin
if makeDefault {
defaultSegmentPlugin = plugin
}
}
func SupportedSegmentTypes() (rv []string) {
for k := range supportedSegmentPlugins {
rv = append(rv, k)
}
return
}
func SupportedSegmentTypeVersions(typ string) (rv []uint32) {
for k := range supportedSegmentPlugins[typ] {
rv = append(rv, k)
}
return rv
}
func chooseSegmentPlugin(forcedSegmentType string,
forcedSegmentVersion uint32) (SegmentPlugin, error) {
if versions, ok := supportedSegmentPlugins[forcedSegmentType]; ok {
if segPlugin, ok := versions[forcedSegmentVersion]; ok {
return segPlugin, nil
}
return nil, fmt.Errorf(
"unsupported version %d for segment type: %s, supported: %v",
forcedSegmentVersion, forcedSegmentType,
SupportedSegmentTypeVersions(forcedSegmentType))
}
return nil, fmt.Errorf("unsupported segment type: %s, supported: %v",
forcedSegmentType, SupportedSegmentTypes())
}
func (s *Scorch) loadSegmentPlugin(forcedSegmentType string,
forcedSegmentVersion uint32) error {
segPlugin, err := chooseSegmentPlugin(forcedSegmentType,
forcedSegmentVersion)
if err != nil {
return err
}
s.segPlugin = segPlugin
return nil
}
func (s *Scorch) loadSpatialAnalyzerPlugin(typ string) error {
s.spatialPlugin = geo.GetSpatialAnalyzerPlugin(typ)
if s.spatialPlugin == nil {
return fmt.Errorf("unsupported spatial plugin type: %s", typ)
}
return nil
}
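// Editor's sketch: resolving a plugin through the registry above, mirroring
// what loadSegmentPlugin does with values parsed from the index config. An
// unknown type/version pair yields an error listing the supported options.
func exampleChoosePlugin() (SegmentPlugin, error) {
	// zap version 15 is registered in init() above
	return chooseSegmentPlugin("zap", 15)
}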

File diff suppressed because it is too large


@@ -0,0 +1,119 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"container/heap"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
type segmentDictCursor struct {
dict segment.TermDictionary
itr segment.DictionaryIterator
curr index.DictEntry
}
type IndexSnapshotFieldDict struct {
cardinality int
bytesRead uint64
snapshot *IndexSnapshot
cursors []*segmentDictCursor
entry index.DictEntry
}
func (i *IndexSnapshotFieldDict) BytesRead() uint64 {
return i.bytesRead
}
func (i *IndexSnapshotFieldDict) Len() int { return len(i.cursors) }
func (i *IndexSnapshotFieldDict) Less(a, b int) bool {
return i.cursors[a].curr.Term < i.cursors[b].curr.Term
}
func (i *IndexSnapshotFieldDict) Swap(a, b int) {
i.cursors[a], i.cursors[b] = i.cursors[b], i.cursors[a]
}
func (i *IndexSnapshotFieldDict) Push(x interface{}) {
i.cursors = append(i.cursors, x.(*segmentDictCursor))
}
func (i *IndexSnapshotFieldDict) Pop() interface{} {
n := len(i.cursors)
x := i.cursors[n-1]
i.cursors = i.cursors[0 : n-1]
return x
}
func (i *IndexSnapshotFieldDict) Next() (*index.DictEntry, error) {
if len(i.cursors) == 0 {
return nil, nil
}
i.entry = i.cursors[0].curr
next, err := i.cursors[0].itr.Next()
if err != nil {
return nil, err
}
if next == nil {
// at end of this cursor, remove it
heap.Pop(i)
} else {
// modified heap, fix it
i.cursors[0].curr = *next
heap.Fix(i, 0)
}
// look for any other entries with the exact same term
for len(i.cursors) > 0 && i.cursors[0].curr.Term == i.entry.Term {
i.entry.Count += i.cursors[0].curr.Count
next, err := i.cursors[0].itr.Next()
if err != nil {
return nil, err
}
if next == nil {
// at end of this cursor, remove it
heap.Pop(i)
} else {
// modified heap, fix it
i.cursors[0].curr = *next
heap.Fix(i, 0)
}
}
return &i.entry, nil
}
func (i *IndexSnapshotFieldDict) Cardinality() int {
return i.cardinality
}
func (i *IndexSnapshotFieldDict) Close() error {
return nil
}
func (i *IndexSnapshotFieldDict) Contains(key []byte) (bool, error) {
if len(i.cursors) == 0 {
return false, nil
}
for _, cursor := range i.cursors {
if found, _ := cursor.dict.Contains(key); found {
return true, nil
}
}
return false, nil
}
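// Editor's sketch: draining the merged dictionary above. Next returns terms
// in ascending order with counts summed across segment cursors; a nil entry
// signals exhaustion.
func exampleDrainFieldDict(d *IndexSnapshotFieldDict) (int, error) {
	terms := 0
	for {
		entry, err := d.Next()
		if err != nil {
			return terms, err
		}
		if entry == nil {
			return terms, d.Close()
		}
		terms++ // entry.Term and entry.Count are valid until the next call
	}
}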


@@ -0,0 +1,80 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"bytes"
"reflect"
"github.com/RoaringBitmap/roaring/v2"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
)
var reflectStaticSizeIndexSnapshotDocIDReader int
func init() {
var isdr IndexSnapshotDocIDReader
reflectStaticSizeIndexSnapshotDocIDReader = int(reflect.TypeOf(isdr).Size())
}
type IndexSnapshotDocIDReader struct {
snapshot *IndexSnapshot
iterators []roaring.IntIterable
segmentOffset int
}
func (i *IndexSnapshotDocIDReader) Size() int {
return reflectStaticSizeIndexSnapshotDocIDReader + size.SizeOfPtr
}
func (i *IndexSnapshotDocIDReader) Next() (index.IndexInternalID, error) {
for i.segmentOffset < len(i.iterators) {
if !i.iterators[i.segmentOffset].HasNext() {
i.segmentOffset++
continue
}
next := i.iterators[i.segmentOffset].Next()
// make segment number into global number by adding offset
globalOffset := i.snapshot.offsets[i.segmentOffset]
return docNumberToBytes(nil, uint64(next)+globalOffset), nil
}
return nil, nil
}
func (i *IndexSnapshotDocIDReader) Advance(ID index.IndexInternalID) (index.IndexInternalID, error) {
// FIXME do something better
next, err := i.Next()
if err != nil {
return nil, err
}
if next == nil {
return nil, nil
}
for bytes.Compare(next, ID) < 0 {
next, err = i.Next()
if err != nil {
return nil, err
}
if next == nil {
break
}
}
return next, nil
}
func (i *IndexSnapshotDocIDReader) Close() error {
return nil
}


@@ -0,0 +1,79 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"reflect"
"github.com/blevesearch/bleve/v2/size"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var reflectStaticSizeIndexSnapshotThesaurusTermReader int
func init() {
var istr IndexSnapshotThesaurusTermReader
reflectStaticSizeIndexSnapshotThesaurusTermReader = int(reflect.TypeOf(istr).Size())
}
type IndexSnapshotThesaurusTermReader struct {
name string
snapshot *IndexSnapshot
thesauri []segment.Thesaurus
postings []segment.SynonymsList
iterators []segment.SynonymsIterator
segmentOffset int
}
func (i *IndexSnapshotThesaurusTermReader) Size() int {
sizeInBytes := reflectStaticSizeIndexSnapshotThesaurusTermReader + size.SizeOfPtr +
len(i.name) + size.SizeOfString
for _, postings := range i.postings {
if postings != nil {
sizeInBytes += postings.Size()
}
}
for _, iterator := range i.iterators {
if iterator != nil {
sizeInBytes += iterator.Size()
}
}
return sizeInBytes
}
func (i *IndexSnapshotThesaurusTermReader) Next() (string, error) {
// find the next hit
for i.segmentOffset < len(i.iterators) {
if i.iterators[i.segmentOffset] != nil {
next, err := i.iterators[i.segmentOffset].Next()
if err != nil {
return "", err
}
if next != nil {
synTerm := next.Term()
return synTerm, nil
}
}
i.segmentOffset++
}
return "", nil
}
func (i *IndexSnapshotThesaurusTermReader) Close() error {
return nil
}


@@ -0,0 +1,232 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"bytes"
"context"
"fmt"
"reflect"
"sync/atomic"
"github.com/blevesearch/bleve/v2/search"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var reflectStaticSizeIndexSnapshotTermFieldReader int
func init() {
var istfr IndexSnapshotTermFieldReader
reflectStaticSizeIndexSnapshotTermFieldReader = int(reflect.TypeOf(istfr).Size())
}
type IndexSnapshotTermFieldReader struct {
term []byte
field string
snapshot *IndexSnapshot
dicts []segment.TermDictionary
postings []segment.PostingsList
iterators []segment.PostingsIterator
segmentOffset int
includeFreq bool
includeNorm bool
includeTermVectors bool
currPosting segment.Posting
currID index.IndexInternalID
recycle bool
bytesRead uint64
ctx context.Context
unadorned bool
}
func (i *IndexSnapshotTermFieldReader) incrementBytesRead(val uint64) {
i.bytesRead += val
}
func (i *IndexSnapshotTermFieldReader) Size() int {
sizeInBytes := reflectStaticSizeIndexSnapshotTermFieldReader + size.SizeOfPtr +
len(i.term) +
len(i.field) +
len(i.currID)
for _, entry := range i.postings {
sizeInBytes += entry.Size()
}
for _, entry := range i.iterators {
sizeInBytes += entry.Size()
}
if i.currPosting != nil {
sizeInBytes += i.currPosting.Size()
}
return sizeInBytes
}
func (i *IndexSnapshotTermFieldReader) Next(preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error) {
rv := preAlloced
if rv == nil {
rv = &index.TermFieldDoc{}
}
// find the next hit
for i.segmentOffset < len(i.iterators) {
prevBytesRead := i.iterators[i.segmentOffset].BytesRead()
next, err := i.iterators[i.segmentOffset].Next()
if err != nil {
return nil, err
}
if next != nil {
// make segment number into global number by adding offset
globalOffset := i.snapshot.offsets[i.segmentOffset]
nnum := next.Number()
rv.ID = docNumberToBytes(rv.ID, nnum+globalOffset)
i.postingToTermFieldDoc(next, rv)
i.currID = rv.ID
i.currPosting = next
// postings iterators maintain the bytesRead stat cumulatively,
// because a single call can trigger a series of loadChunk calls
// whose reads have to be summed before the delta is reported
// upstream at this point.
bytesRead := i.iterators[i.segmentOffset].BytesRead()
if bytesRead > prevBytesRead {
i.incrementBytesRead(bytesRead - prevBytesRead)
}
return rv, nil
}
i.segmentOffset++
}
return nil, nil
}
func (i *IndexSnapshotTermFieldReader) postingToTermFieldDoc(next segment.Posting, rv *index.TermFieldDoc) {
if i.includeFreq {
rv.Freq = next.Frequency()
}
if i.includeNorm {
rv.Norm = next.Norm()
}
if i.includeTermVectors {
locs := next.Locations()
if cap(rv.Vectors) < len(locs) {
rv.Vectors = make([]*index.TermFieldVector, len(locs))
backing := make([]index.TermFieldVector, len(locs))
for i := range backing {
rv.Vectors[i] = &backing[i]
}
}
rv.Vectors = rv.Vectors[:len(locs)]
for i, loc := range locs {
*rv.Vectors[i] = index.TermFieldVector{
Start: loc.Start(),
End: loc.End(),
Pos: loc.Pos(),
ArrayPositions: loc.ArrayPositions(),
Field: loc.Field(),
}
}
}
}
func (i *IndexSnapshotTermFieldReader) Advance(ID index.IndexInternalID, preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error) {
// FIXME do something better
// for now, if we need to seek backwards, then restart from the beginning
if i.currPosting != nil && bytes.Compare(i.currID, ID) >= 0 {
// Check if the TFR is a special unadorned composite optimization.
// Such a TFR will NOT have a valid `term` or `field` set, making it
// impossible for the TFR to replace itself with a new one.
if !i.unadorned {
i2, err := i.snapshot.TermFieldReader(context.TODO(), i.term, i.field,
i.includeFreq, i.includeNorm, i.includeTermVectors)
if err != nil {
return nil, err
}
// close the current term field reader before replacing it with a new one
_ = i.Close()
*i = *(i2.(*IndexSnapshotTermFieldReader))
} else {
// unadorned composite optimization
// we need to reset all the iterators
// back to the beginning, which effectively
// achieves the same thing as the above
for _, iter := range i.iterators {
if optimizedIterator, ok := iter.(ResetablePostingsIterator); ok {
optimizedIterator.ResetIterator()
}
}
}
}
num, err := docInternalToNumber(ID)
if err != nil {
return nil, fmt.Errorf("error converting to doc number % x - %v", ID, err)
}
segIndex, ldocNum := i.snapshot.segmentIndexAndLocalDocNumFromGlobal(num)
if segIndex >= len(i.snapshot.segment) {
return nil, fmt.Errorf("computed segment index %d out of bounds %d",
segIndex, len(i.snapshot.segment))
}
// skip directly to the target segment
i.segmentOffset = segIndex
next, err := i.iterators[i.segmentOffset].Advance(ldocNum)
if err != nil {
return nil, err
}
if next == nil {
// we jumped directly to the segment that should have contained it
// but it wasn't there, so reuse Next() which should correctly
// get the next hit after it (we moved i.segmentOffset)
return i.Next(preAlloced)
}
if preAlloced == nil {
preAlloced = &index.TermFieldDoc{}
}
preAlloced.ID = docNumberToBytes(preAlloced.ID, next.Number()+
i.snapshot.offsets[segIndex])
i.postingToTermFieldDoc(next, preAlloced)
i.currID = preAlloced.ID
i.currPosting = next
return preAlloced, nil
}
func (i *IndexSnapshotTermFieldReader) Count() uint64 {
var rv uint64
for _, posting := range i.postings {
rv += posting.Count()
}
return rv
}
func (i *IndexSnapshotTermFieldReader) Close() error {
if i.ctx != nil {
statsCallbackFn := i.ctx.Value(search.SearchIOStatsCallbackKey)
if statsCallbackFn != nil {
// essentially before you close the TFR, you must report this
// reader's bytesRead value
statsCallbackFn.(search.SearchIOStatsCallbackFunc)(i.bytesRead)
}
search.RecordSearchCost(i.ctx, search.AddM, i.bytesRead)
}
if i.snapshot != nil {
atomic.AddUint64(&i.snapshot.parent.stats.TotTermSearchersFinished, uint64(1))
i.snapshot.recycleTermFieldReader(i)
}
return nil
}
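// Editor's sketch: the reader contract exercised above. Next streams hits in
// global doc-number order, reusing the preAlloced struct; Advance may rebuild
// the reader when asked to seek backwards, as the code above explains.
func exampleCountHits(r *IndexSnapshotTermFieldReader) (uint64, error) {
	var hits uint64
	var tfd *index.TermFieldDoc
	for {
		next, err := r.Next(tfd)
		if err != nil {
			return hits, err
		}
		if next == nil {
			return hits, r.Close()
		}
		hits++
		tfd = next // reuse the allocation on the following iteration
	}
}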


@@ -0,0 +1,107 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"container/heap"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
type segmentThesCursor struct {
thes segment.Thesaurus
itr segment.ThesaurusIterator
curr index.ThesaurusEntry
}
type IndexSnapshotThesaurusKeys struct {
snapshot *IndexSnapshot
cursors []*segmentThesCursor
entry index.ThesaurusEntry
}
func (i *IndexSnapshotThesaurusKeys) Len() int { return len(i.cursors) }
func (i *IndexSnapshotThesaurusKeys) Less(a, b int) bool {
return i.cursors[a].curr.Term < i.cursors[b].curr.Term
}
func (i *IndexSnapshotThesaurusKeys) Swap(a, b int) {
i.cursors[a], i.cursors[b] = i.cursors[b], i.cursors[a]
}
func (i *IndexSnapshotThesaurusKeys) Push(x interface{}) {
i.cursors = append(i.cursors, x.(*segmentThesCursor))
}
func (i *IndexSnapshotThesaurusKeys) Pop() interface{} {
n := len(i.cursors)
x := i.cursors[n-1]
i.cursors = i.cursors[0 : n-1]
return x
}
func (i *IndexSnapshotThesaurusKeys) Next() (*index.ThesaurusEntry, error) {
if len(i.cursors) == 0 {
return nil, nil
}
i.entry = i.cursors[0].curr
next, err := i.cursors[0].itr.Next()
if err != nil {
return nil, err
}
if next == nil {
// at end of this cursor, remove it
heap.Pop(i)
} else {
// modified heap, fix it
i.cursors[0].curr = *next
heap.Fix(i, 0)
}
// look for any other entries with the exact same term
for len(i.cursors) > 0 && i.cursors[0].curr.Term == i.entry.Term {
next, err := i.cursors[0].itr.Next()
if err != nil {
return nil, err
}
if next == nil {
// at end of this cursor, remove it
heap.Pop(i)
} else {
// modified heap, fix it
i.cursors[0].curr = *next
heap.Fix(i, 0)
}
}
return &i.entry, nil
}
func (i *IndexSnapshotThesaurusKeys) Close() error {
return nil
}
func (i *IndexSnapshotThesaurusKeys) Contains(key []byte) (bool, error) {
if len(i.cursors) == 0 {
return false, nil
}
for _, cursor := range i.cursors {
if found, _ := cursor.thes.Contains(key); found {
return true, nil
}
}
return false, nil
}


@@ -0,0 +1,165 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package scorch
import (
"bytes"
"context"
"encoding/json"
"fmt"
"reflect"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
segment_api "github.com/blevesearch/scorch_segment_api/v2"
)
const VectorSearchSupportedSegmentVersion = 16
var reflectStaticSizeIndexSnapshotVectorReader int
func init() {
var istfr IndexSnapshotVectorReader
reflectStaticSizeIndexSnapshotVectorReader = int(reflect.TypeOf(istfr).Size())
}
type IndexSnapshotVectorReader struct {
vector []float32
field string
k int64
snapshot *IndexSnapshot
postings []segment_api.VecPostingsList
iterators []segment_api.VecPostingsIterator
segmentOffset int
currPosting segment_api.VecPosting
currID index.IndexInternalID
ctx context.Context
searchParams json.RawMessage
eligibleSelector index.EligibleDocumentSelector
}
func (i *IndexSnapshotVectorReader) Size() int {
sizeInBytes := reflectStaticSizeIndexSnapshotVectorReader + size.SizeOfPtr +
len(i.vector)*size.SizeOfFloat32 +
len(i.field) +
len(i.currID)
for _, entry := range i.postings {
sizeInBytes += entry.Size()
}
for _, entry := range i.iterators {
sizeInBytes += entry.Size()
}
if i.currPosting != nil {
sizeInBytes += i.currPosting.Size()
}
return sizeInBytes
}
func (i *IndexSnapshotVectorReader) Next(preAlloced *index.VectorDoc) (
*index.VectorDoc, error) {
rv := preAlloced
if rv == nil {
rv = &index.VectorDoc{}
}
for i.segmentOffset < len(i.iterators) {
next, err := i.iterators[i.segmentOffset].Next()
if err != nil {
return nil, err
}
if next != nil {
// make segment number into global number by adding offset
globalOffset := i.snapshot.offsets[i.segmentOffset]
nnum := next.Number()
rv.ID = docNumberToBytes(rv.ID, nnum+globalOffset)
rv.Score = float64(next.Score())
i.currID = rv.ID
i.currPosting = next
return rv, nil
}
i.segmentOffset++
}
return nil, nil
}
func (i *IndexSnapshotVectorReader) Advance(ID index.IndexInternalID,
preAlloced *index.VectorDoc) (*index.VectorDoc, error) {
if i.currPosting != nil && bytes.Compare(i.currID, ID) >= 0 {
i2, err := i.snapshot.VectorReader(i.ctx, i.vector, i.field, i.k,
i.searchParams, i.eligibleSelector)
if err != nil {
return nil, err
}
// close the current term field reader before replacing it with a new one
_ = i.Close()
*i = *(i2.(*IndexSnapshotVectorReader))
}
num, err := docInternalToNumber(ID)
if err != nil {
return nil, fmt.Errorf("error converting to doc number % x - %v", ID, err)
}
segIndex, ldocNum := i.snapshot.segmentIndexAndLocalDocNumFromGlobal(num)
if segIndex >= len(i.snapshot.segment) {
return nil, fmt.Errorf("computed segment index %d out of bounds %d",
segIndex, len(i.snapshot.segment))
}
// skip directly to the target segment
i.segmentOffset = segIndex
next, err := i.iterators[i.segmentOffset].Advance(ldocNum)
if err != nil {
return nil, err
}
if next == nil {
// we jumped directly to the segment that should have contained it
// but it wasn't there, so reuse Next() which should correctly
// get the next hit after it (we moved i.segmentOffset)
return i.Next(preAlloced)
}
if preAlloced == nil {
preAlloced = &index.VectorDoc{}
}
preAlloced.ID = docNumberToBytes(preAlloced.ID, next.Number()+
i.snapshot.offsets[segIndex])
i.currID = preAlloced.ID
i.currPosting = next
return preAlloced, nil
}
func (i *IndexSnapshotVectorReader) Count() uint64 {
var rv uint64
for _, posting := range i.postings {
rv += posting.Count()
}
return rv
}
func (i *IndexSnapshotVectorReader) Close() error {
// TODO Consider if any scope of recycling here.
return nil
}


@@ -0,0 +1,340 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"bytes"
"os"
"sync"
"sync/atomic"
"github.com/RoaringBitmap/roaring/v2"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var TermSeparator byte = 0xff
var TermSeparatorSplitSlice = []byte{TermSeparator}
type SegmentSnapshot struct {
// this flag is needed to identify whether this
// segment was mmaped recently, in which case
// we consider the loading cost of the metadata
// as part of IO stats.
mmaped uint32
id uint64
segment segment.Segment
deleted *roaring.Bitmap
creator string
stats *fieldStats
cachedMeta *cachedMeta
cachedDocs *cachedDocs
}
func (s *SegmentSnapshot) Segment() segment.Segment {
return s.segment
}
func (s *SegmentSnapshot) Deleted() *roaring.Bitmap {
return s.deleted
}
func (s *SegmentSnapshot) Id() uint64 {
return s.id
}
func (s *SegmentSnapshot) FullSize() int64 {
return int64(s.segment.Count())
}
func (s *SegmentSnapshot) LiveSize() int64 {
return int64(s.Count())
}
func (s *SegmentSnapshot) HasVector() bool {
// number of vectors, for each vector field in the segment
numVecs := s.stats.Fetch()["num_vectors"]
return len(numVecs) > 0
}
func (s *SegmentSnapshot) FileSize() int64 {
ps, ok := s.segment.(segment.PersistedSegment)
if !ok {
return 0
}
path := ps.Path()
if path == "" {
return 0
}
fi, err := os.Stat(path)
if err != nil {
return 0
}
return fi.Size()
}
func (s *SegmentSnapshot) Close() error {
return s.segment.Close()
}
func (s *SegmentSnapshot) VisitDocument(num uint64, visitor segment.StoredFieldValueVisitor) error {
return s.segment.VisitStoredFields(num, visitor)
}
func (s *SegmentSnapshot) DocID(num uint64) ([]byte, error) {
return s.segment.DocID(num)
}
func (s *SegmentSnapshot) Count() uint64 {
rv := s.segment.Count()
if s.deleted != nil {
rv -= s.deleted.GetCardinality()
}
return rv
}
func (s *SegmentSnapshot) DocNumbers(docIDs []string) (*roaring.Bitmap, error) {
rv, err := s.segment.DocNumbers(docIDs)
if err != nil {
return nil, err
}
if s.deleted != nil {
rv.AndNot(s.deleted)
}
return rv, nil
}
// DocNumbersLive returns a bitmap containing doc numbers for all live docs
func (s *SegmentSnapshot) DocNumbersLive() *roaring.Bitmap {
rv := roaring.NewBitmap()
rv.AddRange(0, s.segment.Count())
if s.deleted != nil {
rv.AndNot(s.deleted)
}
return rv
}
func (s *SegmentSnapshot) Fields() []string {
return s.segment.Fields()
}
func (s *SegmentSnapshot) Size() (rv int) {
rv = s.segment.Size()
if s.deleted != nil {
rv += int(s.deleted.GetSizeInBytes())
}
rv += s.cachedDocs.Size()
return
}
type cachedFieldDocs struct {
m sync.Mutex
readyCh chan struct{} // closed when the cachedFieldDocs.docs is ready to be used.
err error // Non-nil if there was an error when preparing this cachedFieldDocs.
docs map[uint64][]byte // Keyed by localDocNum, value is a list of terms delimited by 0xFF.
size uint64
}
func (cfd *cachedFieldDocs) Size() int {
var rv int
cfd.m.Lock()
for _, entry := range cfd.docs {
rv += 8 /* size of uint64 */ + len(entry)
}
cfd.m.Unlock()
return rv
}
func (cfd *cachedFieldDocs) prepareField(field string, ss *SegmentSnapshot) {
cfd.m.Lock()
defer func() {
close(cfd.readyCh)
cfd.m.Unlock()
}()
cfd.size += uint64(size.SizeOfUint64) /* size field */
dict, err := ss.segment.Dictionary(field)
if err != nil {
cfd.err = err
return
}
var postings segment.PostingsList
var postingsItr segment.PostingsIterator
dictItr := dict.AutomatonIterator(nil, nil, nil)
next, err := dictItr.Next()
for err == nil && next != nil {
var err1 error
postings, err1 = dict.PostingsList([]byte(next.Term), nil, postings)
if err1 != nil {
cfd.err = err1
return
}
cfd.size += uint64(size.SizeOfUint64) /* map key */
postingsItr = postings.Iterator(false, false, false, postingsItr)
nextPosting, err2 := postingsItr.Next()
for err2 == nil && nextPosting != nil {
docNum := nextPosting.Number()
cfd.docs[docNum] = append(cfd.docs[docNum], []byte(next.Term)...)
cfd.docs[docNum] = append(cfd.docs[docNum], TermSeparator)
cfd.size += uint64(len(next.Term) + 1) // map value
nextPosting, err2 = postingsItr.Next()
}
if err2 != nil {
cfd.err = err2
return
}
next, err = dictItr.Next()
}
if err != nil {
cfd.err = err
return
}
}
type cachedDocs struct {
size uint64
m sync.Mutex // As the cache is asynchronously prepared, need a lock
cache map[string]*cachedFieldDocs // Keyed by field
}
func (c *cachedDocs) prepareFields(wantedFields []string, ss *SegmentSnapshot) error {
c.m.Lock()
if c.cache == nil {
c.cache = make(map[string]*cachedFieldDocs, len(ss.Fields()))
}
for _, field := range wantedFields {
_, exists := c.cache[field]
if !exists {
c.cache[field] = &cachedFieldDocs{
readyCh: make(chan struct{}),
docs: make(map[uint64][]byte),
}
go c.cache[field].prepareField(field, ss)
}
}
for _, field := range wantedFields {
cachedFieldDocs := c.cache[field]
c.m.Unlock()
<-cachedFieldDocs.readyCh
if cachedFieldDocs.err != nil {
return cachedFieldDocs.err
}
c.m.Lock()
}
c.updateSizeLOCKED()
c.m.Unlock()
return nil
}
// hasFields returns true if the cache has all the given fields
func (c *cachedDocs) hasFields(fields []string) bool {
c.m.Lock()
for _, field := range fields {
if _, exists := c.cache[field]; !exists {
c.m.Unlock()
return false // found a field not in cache
}
}
c.m.Unlock()
return true
}
func (c *cachedDocs) Size() int {
return int(atomic.LoadUint64(&c.size))
}
func (c *cachedDocs) updateSizeLOCKED() {
sizeInBytes := 0
for k, v := range c.cache { // cachedFieldDocs
sizeInBytes += len(k)
if v != nil {
sizeInBytes += v.Size()
}
}
atomic.StoreUint64(&c.size, uint64(sizeInBytes))
}
func (c *cachedDocs) visitDoc(localDocNum uint64,
fields []string, visitor index.DocValueVisitor) {
c.m.Lock()
for _, field := range fields {
if cachedFieldDocs, exists := c.cache[field]; exists {
c.m.Unlock()
<-cachedFieldDocs.readyCh
c.m.Lock()
if tlist, exists := cachedFieldDocs.docs[localDocNum]; exists {
for {
i := bytes.Index(tlist, TermSeparatorSplitSlice)
if i < 0 {
break
}
visitor(field, tlist[0:i])
tlist = tlist[i+1:]
}
}
}
}
c.m.Unlock()
}
// cachedMeta allows the user of this type to record and cache
// segment-specific metadata, so it can be reused across calls rather than
// recomputed each time.
// for example, searcher creations on the same index snapshot can use this
// struct to fetch the backing index size, which feeds the memory usage
// calculation that decides whether to admit a query.
type cachedMeta struct {
m sync.RWMutex
meta map[string]interface{}
}
func (c *cachedMeta) updateMeta(field string, val interface{}) {
c.m.Lock()
if c.meta == nil {
c.meta = make(map[string]interface{})
}
c.meta[field] = val
c.m.Unlock()
}
func (c *cachedMeta) fetchMeta(field string) (rv interface{}) {
c.m.RLock()
rv = c.meta[field]
c.m.RUnlock()
return rv
}
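
A minimal, runnable sketch of the pattern cachedMeta implements: a lazily allocated map guarded by a sync.RWMutex, with writes taking the exclusive lock and reads the shared lock. The metaCache type and main function here are illustrative only, not part of bleve.

package main

import (
	"fmt"
	"sync"
)

type metaCache struct {
	m    sync.RWMutex
	meta map[string]interface{}
}

// update stores a value under the exclusive lock, allocating the map lazily.
func (c *metaCache) update(key string, val interface{}) {
	c.m.Lock()
	if c.meta == nil {
		c.meta = make(map[string]interface{})
	}
	c.meta[key] = val
	c.m.Unlock()
}

// fetch reads under the shared lock; reading a nil map safely yields nil.
func (c *metaCache) fetch(key string) interface{} {
	c.m.RLock()
	defer c.m.RUnlock()
	return c.meta[key]
}

func main() {
	var c metaCache
	c.update("backing_size", int64(1<<20))
	fmt.Println(c.fetch("backing_size")) // 1048576
}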


@@ -0,0 +1,85 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package scorch
import (
"context"
"encoding/json"
"fmt"
index "github.com/blevesearch/bleve_index_api"
segment_api "github.com/blevesearch/scorch_segment_api/v2"
)
func (is *IndexSnapshot) VectorReader(ctx context.Context, vector []float32,
field string, k int64, searchParams json.RawMessage,
eligibleSelector index.EligibleDocumentSelector) (
index.VectorReader, error) {
rv := &IndexSnapshotVectorReader{
vector: vector,
field: field,
k: k,
snapshot: is,
searchParams: searchParams,
eligibleSelector: eligibleSelector,
}
if rv.postings == nil {
rv.postings = make([]segment_api.VecPostingsList, len(is.segment))
}
if rv.iterators == nil {
rv.iterators = make([]segment_api.VecPostingsIterator, len(is.segment))
}
// postings and iterators are initialized later, inside OptimizeVR's Finish()
return rv, nil
}
// eligibleDocumentSelector is used to filter out documents that are eligible for
// the KNN search from a pre-filter query.
type eligibleDocumentSelector struct {
// segment ID -> segment local doc nums
eligibleDocNums map[int][]uint64
is *IndexSnapshot
}
// SegmentEligibleDocs returns the list of eligible local doc numbers for the given segment.
func (eds *eligibleDocumentSelector) SegmentEligibleDocs(segmentID int) []uint64 {
return eds.eligibleDocNums[segmentID]
}
// AddEligibleDocumentMatch adds a document match to the list of eligible documents.
func (eds *eligibleDocumentSelector) AddEligibleDocumentMatch(id index.IndexInternalID) error {
if eds.is == nil {
return fmt.Errorf("eligibleDocumentSelector is not initialized with IndexSnapshot")
}
// Get the segment number and the local doc number for this document.
segIdx, docNum, err := eds.is.segmentIndexAndLocalDocNum(id)
if err != nil {
return err
}
// Add the local doc number to the list of eligible doc numbers for this segment.
eds.eligibleDocNums[segIdx] = append(eds.eligibleDocNums[segIdx], docNum)
return nil
}
func (is *IndexSnapshot) NewEligibleDocumentSelector() index.EligibleDocumentSelector {
return &eligibleDocumentSelector{
eligibleDocNums: map[int][]uint64{},
is: is,
}
}
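
To illustrate the bucketing eligibleDocumentSelector performs, the sketch below resolves global doc numbers into (segment, local doc number) pairs and groups them per segment, the shape SegmentEligibleDocs later hands to the KNN search. The fixed-size resolve function is a hypothetical stand-in for segmentIndexAndLocalDocNum.

package main

import "fmt"

const segmentSize = 1000 // hypothetical: pretend every segment holds 1000 docs

// resolve maps a global doc number to (segment index, segment-local doc number).
func resolve(global uint64) (int, uint64) {
	return int(global / segmentSize), global % segmentSize
}

func main() {
	eligible := map[int][]uint64{} // segment ID -> segment local doc nums
	for _, g := range []uint64{42, 999, 1000, 2500} {
		seg, local := resolve(g)
		eligible[seg] = append(eligible[seg], local)
	}
	for seg := 0; seg < 3; seg++ {
		fmt.Println(seg, eligible[seg]) // 0 [42 999], 1 [0], 2 [500]
	}
}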


@@ -0,0 +1,160 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"reflect"
"sync/atomic"
"github.com/blevesearch/bleve/v2/util"
)
// Stats tracks statistics about the index, fields that are
// prefixed like CurXxxx are gauges (can go up and down),
// and fields that are prefixed like TotXxxx are monotonically
// increasing counters.
type Stats struct {
TotUpdates uint64
TotDeletes uint64
TotBatches uint64
TotBatchesEmpty uint64
TotBatchIntroTime uint64
MaxBatchIntroTime uint64
CurRootEpoch uint64
LastPersistedEpoch uint64
LastMergedEpoch uint64
TotOnErrors uint64
TotAnalysisTime uint64
TotIndexTime uint64
TotIndexedPlainTextBytes uint64
TotBytesReadAtQueryTime uint64
TotBytesWrittenAtIndexTime uint64
TotTermSearchersStarted uint64
TotTermSearchersFinished uint64
TotKNNSearches uint64
TotSynonymSearches uint64
TotEventTriggerStarted uint64
TotEventTriggerCompleted uint64
TotIntroduceLoop uint64
TotIntroduceSegmentBeg uint64
TotIntroduceSegmentEnd uint64
TotIntroducePersistBeg uint64
TotIntroducePersistEnd uint64
TotIntroduceMergeBeg uint64
TotIntroduceMergeEnd uint64
TotIntroduceRevertBeg uint64
TotIntroduceRevertEnd uint64
TotIntroducedItems uint64
TotIntroducedSegmentsBatch uint64
TotIntroducedSegmentsMerge uint64
TotPersistLoopBeg uint64
TotPersistLoopErr uint64
TotPersistLoopProgress uint64
TotPersistLoopWait uint64
TotPersistLoopWaitNotified uint64
TotPersistLoopEnd uint64
TotPersistedItems uint64
TotItemsToPersist uint64
TotPersistedSegments uint64
TotMutationsFiltered uint64
TotPersisterSlowMergerPause uint64
TotPersisterSlowMergerResume uint64
TotPersisterNapPauseCompleted uint64
TotPersisterMergerNapBreak uint64
TotFileMergeLoopBeg uint64
TotFileMergeLoopErr uint64
TotFileMergeLoopEnd uint64
TotFileMergeForceOpsStarted uint64
TotFileMergeForceOpsCompleted uint64
TotFileMergePlan uint64
TotFileMergePlanErr uint64
TotFileMergePlanNone uint64
TotFileMergePlanOk uint64
TotFileMergePlanTasks uint64
TotFileMergePlanTasksDone uint64
TotFileMergePlanTasksErr uint64
TotFileMergePlanTasksSegments uint64
TotFileMergePlanTasksSegmentsEmpty uint64
TotFileMergeSegmentsEmpty uint64
TotFileMergeSegments uint64
TotFileSegmentsAtRoot uint64
TotFileMergeWrittenBytes uint64
TotFileMergeZapBeg uint64
TotFileMergeZapEnd uint64
TotFileMergeZapTime uint64
MaxFileMergeZapTime uint64
TotFileMergeZapIntroductionTime uint64
MaxFileMergeZapIntroductionTime uint64
TotFileMergeIntroductions uint64
TotFileMergeIntroductionsDone uint64
TotFileMergeIntroductionsSkipped uint64
TotFileMergeIntroductionsObsoleted uint64
CurFilesIneligibleForRemoval uint64
TotSnapshotsRemovedFromMetaStore uint64
TotMemMergeBeg uint64
TotMemMergeErr uint64
TotMemMergeDone uint64
TotMemMergeZapBeg uint64
TotMemMergeZapEnd uint64
TotMemMergeZapTime uint64
MaxMemMergeZapTime uint64
TotMemMergeSegments uint64
TotMemorySegmentsAtRoot uint64
}
// atomically populates the returned map
func (s *Stats) ToMap() map[string]interface{} {
m := map[string]interface{}{}
sve := reflect.ValueOf(s).Elem()
svet := sve.Type()
for i := 0; i < svet.NumField(); i++ {
svef := sve.Field(i)
if svef.CanAddr() {
svefp := svef.Addr().Interface()
m[svet.Field(i).Name] = atomic.LoadUint64(svefp.(*uint64))
}
}
return m
}
// MarshalJSON implements json.Marshaler, and in contrast to standard
// json marshaling provides atomic safety
func (s *Stats) MarshalJSON() ([]byte, error) {
return util.MarshalJSON(s.ToMap())
}
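
The reflection-plus-atomics approach in ToMap generalizes to any struct of exported uint64 counters. A self-contained sketch with a hypothetical counters type, showing concurrent atomic increments and a consistent snapshot:

package main

import (
	"fmt"
	"reflect"
	"sync"
	"sync/atomic"
)

type counters struct {
	TotUpdates uint64
	TotDeletes uint64
}

// toMap walks the struct fields via reflection and atomically loads each
// uint64, mirroring how Stats.ToMap produces a snapshot safe for concurrent use.
func (c *counters) toMap() map[string]interface{} {
	m := map[string]interface{}{}
	v := reflect.ValueOf(c).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := v.Field(i)
		if f.CanAddr() {
			p := f.Addr().Interface().(*uint64)
			m[t.Field(i).Name] = atomic.LoadUint64(p)
		}
	}
	return m
}

func main() {
	var c counters
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				atomic.AddUint64(&c.TotUpdates, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.toMap()) // map[TotDeletes:0 TotUpdates:4000]
}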


@@ -0,0 +1,199 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package scorch
import (
"math"
"reflect"
"github.com/RoaringBitmap/roaring/v2"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var reflectStaticSizeUnadornedPostingsIteratorBitmap int
var reflectStaticSizeUnadornedPostingsIterator1Hit int
var reflectStaticSizeUnadornedPosting int
func init() {
var pib unadornedPostingsIteratorBitmap
reflectStaticSizeUnadornedPostingsIteratorBitmap = int(reflect.TypeOf(pib).Size())
var pi1h unadornedPostingsIterator1Hit
reflectStaticSizeUnadornedPostingsIterator1Hit = int(reflect.TypeOf(pi1h).Size())
var up UnadornedPosting
reflectStaticSizeUnadornedPosting = int(reflect.TypeOf(up).Size())
}
type unadornedPostingsIteratorBitmap struct {
actual roaring.IntPeekable
actualBM *roaring.Bitmap
}
func (i *unadornedPostingsIteratorBitmap) Next() (segment.Posting, error) {
return i.nextAtOrAfter(0)
}
func (i *unadornedPostingsIteratorBitmap) Advance(docNum uint64) (segment.Posting, error) {
return i.nextAtOrAfter(docNum)
}
func (i *unadornedPostingsIteratorBitmap) nextAtOrAfter(atOrAfter uint64) (segment.Posting, error) {
docNum, exists := i.nextDocNumAtOrAfter(atOrAfter)
if !exists {
return nil, nil
}
return UnadornedPosting(docNum), nil
}
func (i *unadornedPostingsIteratorBitmap) nextDocNumAtOrAfter(atOrAfter uint64) (uint64, bool) {
if i.actual == nil || !i.actual.HasNext() {
return 0, false
}
i.actual.AdvanceIfNeeded(uint32(atOrAfter))
if !i.actual.HasNext() {
return 0, false // couldn't find anything
}
return uint64(i.actual.Next()), true
}
func (i *unadornedPostingsIteratorBitmap) Size() int {
return reflectStaticSizeUnadornedPostingsIteratorBitmap
}
func (i *unadornedPostingsIteratorBitmap) BytesRead() uint64 {
return 0
}
func (i *unadornedPostingsIteratorBitmap) BytesWritten() uint64 {
return 0
}
func (i *unadornedPostingsIteratorBitmap) ResetBytesRead(uint64) {}
func (i *unadornedPostingsIteratorBitmap) ActualBitmap() *roaring.Bitmap {
return i.actualBM
}
func (i *unadornedPostingsIteratorBitmap) DocNum1Hit() (uint64, bool) {
return 0, false
}
func (i *unadornedPostingsIteratorBitmap) ReplaceActual(actual *roaring.Bitmap) {
i.actualBM = actual
i.actual = actual.Iterator()
}
// ResetIterator resets the iterator to the beginning of the postings list
// by recreating the underlying bitmap iterator.
func (i *unadornedPostingsIteratorBitmap) ResetIterator() {
i.actual = i.actualBM.Iterator()
}
func newUnadornedPostingsIteratorFromBitmap(bm *roaring.Bitmap) segment.PostingsIterator {
return &unadornedPostingsIteratorBitmap{
actualBM: bm,
actual: bm.Iterator(),
}
}
const docNum1HitFinished = math.MaxUint64
type unadornedPostingsIterator1Hit struct {
docNumOrig uint64 // original 1-hit docNum used to create this iterator
docNum uint64 // current docNum
}
func (i *unadornedPostingsIterator1Hit) Next() (segment.Posting, error) {
return i.nextAtOrAfter(0)
}
func (i *unadornedPostingsIterator1Hit) Advance(docNum uint64) (segment.Posting, error) {
return i.nextAtOrAfter(docNum)
}
func (i *unadornedPostingsIterator1Hit) nextAtOrAfter(atOrAfter uint64) (segment.Posting, error) {
docNum, exists := i.nextDocNumAtOrAfter(atOrAfter)
if !exists {
return nil, nil
}
return UnadornedPosting(docNum), nil
}
func (i *unadornedPostingsIterator1Hit) nextDocNumAtOrAfter(atOrAfter uint64) (uint64, bool) {
if i.docNum == docNum1HitFinished {
return 0, false
}
if i.docNum < atOrAfter {
// advanced past our 1-hit
i.docNum = docNum1HitFinished // consume our 1-hit docNum
return 0, false
}
docNum := i.docNum
i.docNum = docNum1HitFinished // consume our 1-hit docNum
return docNum, true
}
func (i *unadornedPostingsIterator1Hit) Size() int {
return reflectStaticSizeUnadornedPostingsIterator1Hit
}
func (i *unadornedPostingsIterator1Hit) BytesRead() uint64 {
return 0
}
func (i *unadornedPostingsIterator1Hit) BytesWritten() uint64 {
return 0
}
func (i *unadornedPostingsIterator1Hit) ResetBytesRead(uint64) {}
// ResetIterator resets the iterator to the original state.
func (i *unadornedPostingsIterator1Hit) ResetIterator() {
i.docNum = i.docNumOrig
}
func newUnadornedPostingsIteratorFrom1Hit(docNum1Hit uint64) segment.PostingsIterator {
return &unadornedPostingsIterator1Hit{
docNumOrig: docNum1Hit,
docNum: docNum1Hit,
}
}
type ResetablePostingsIterator interface {
ResetIterator()
}
type UnadornedPosting uint64
func (p UnadornedPosting) Number() uint64 {
return uint64(p)
}
func (p UnadornedPosting) Frequency() uint64 {
return 0
}
func (p UnadornedPosting) Norm() float64 {
return 0
}
func (p UnadornedPosting) Locations() []segment.Location {
return nil
}
func (p UnadornedPosting) Size() int {
return reflectStaticSizeUnadornedPosting
}
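
A runnable sketch of the nextDocNumAtOrAfter pattern above, using the same roaring v2 iterator calls (AdvanceIfNeeded positions the peekable iterator at the first value at or beyond the target); BitmapOf is assumed from the roaring API:

package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring/v2"
)

// nextAtOrAfter returns the first remaining doc number >= atOrAfter, if any.
func nextAtOrAfter(it roaring.IntPeekable, atOrAfter uint64) (uint64, bool) {
	if it == nil || !it.HasNext() {
		return 0, false
	}
	it.AdvanceIfNeeded(uint32(atOrAfter))
	if !it.HasNext() {
		return 0, false // nothing at or after the target
	}
	return uint64(it.Next()), true
}

func main() {
	bm := roaring.BitmapOf(3, 10, 57, 1000)
	it := bm.Iterator()
	fmt.Println(nextAtOrAfter(it, 11))   // 57 true
	fmt.Println(nextAtOrAfter(it, 58))   // 1000 true
	fmt.Println(nextAtOrAfter(it, 2000)) // 0 false
}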


@@ -0,0 +1,129 @@
// Copyright (c) 2015 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
index "github.com/blevesearch/bleve_index_api"
)
type IndexRow interface {
KeySize() int
KeyTo([]byte) (int, error)
Key() []byte
ValueSize() int
ValueTo([]byte) (int, error)
Value() []byte
}
type AnalysisResult struct {
DocID string
Rows []IndexRow
}
func (udc *UpsideDownCouch) Analyze(d index.Document) *AnalysisResult {
return udc.analyze(d)
}
func (udc *UpsideDownCouch) analyze(d index.Document) *AnalysisResult {
rv := &AnalysisResult{
DocID: d.ID(),
Rows: make([]IndexRow, 0, 100),
}
docIDBytes := []byte(d.ID())
// track our back index entries
backIndexStoredEntries := make([]*BackIndexStoreEntry, 0)
// information we collate as we merge fields with same name
fieldTermFreqs := make(map[uint16]index.TokenFrequencies)
fieldLengths := make(map[uint16]int)
fieldIncludeTermVectors := make(map[uint16]bool)
fieldNames := make(map[uint16]string)
analyzeField := func(field index.Field, storable bool) {
fieldIndex, newFieldRow := udc.fieldIndexOrNewRow(field.Name())
if newFieldRow != nil {
rv.Rows = append(rv.Rows, newFieldRow)
}
fieldNames[fieldIndex] = field.Name()
if field.Options().IsIndexed() {
field.Analyze()
fieldLength := field.AnalyzedLength()
tokenFreqs := field.AnalyzedTokenFrequencies()
existingFreqs := fieldTermFreqs[fieldIndex]
if existingFreqs == nil {
fieldTermFreqs[fieldIndex] = tokenFreqs
} else {
existingFreqs.MergeAll(field.Name(), tokenFreqs)
fieldTermFreqs[fieldIndex] = existingFreqs
}
fieldLengths[fieldIndex] += fieldLength
fieldIncludeTermVectors[fieldIndex] = field.Options().IncludeTermVectors()
}
if storable && field.Options().IsStored() {
rv.Rows, backIndexStoredEntries = udc.storeField(docIDBytes, field, fieldIndex, rv.Rows, backIndexStoredEntries)
}
}
// walk all the fields, record stored fields now
// place information about indexed fields into map
// this collates information across fields with
// same names (arrays)
d.VisitFields(func(field index.Field) {
analyzeField(field, true)
})
if d.HasComposite() {
for fieldIndex, tokenFreqs := range fieldTermFreqs {
// see if any of the composite fields need this
d.VisitComposite(func(field index.CompositeField) {
field.Compose(fieldNames[fieldIndex], fieldLengths[fieldIndex], tokenFreqs)
})
}
d.VisitComposite(func(field index.CompositeField) {
analyzeField(field, false)
})
}
rowsCapNeeded := len(rv.Rows) + 1
for _, tokenFreqs := range fieldTermFreqs {
rowsCapNeeded += len(tokenFreqs)
}
rv.Rows = append(make([]IndexRow, 0, rowsCapNeeded), rv.Rows...)
backIndexTermsEntries := make([]*BackIndexTermsEntry, 0, len(fieldTermFreqs))
// walk through the collated information and process
// once for each indexed field (unique name)
for fieldIndex, tokenFreqs := range fieldTermFreqs {
fieldLength := fieldLengths[fieldIndex]
includeTermVectors := fieldIncludeTermVectors[fieldIndex]
// encode this field
rv.Rows, backIndexTermsEntries = udc.indexField(docIDBytes, includeTermVectors, fieldIndex, fieldLength, tokenFreqs, rv.Rows, backIndexTermsEntries)
}
// build the back index row
backIndexRow := NewBackIndexRow(docIDBytes, backIndexTermsEntries, backIndexStoredEntries)
rv.Rows = append(rv.Rows, backIndexRow)
return rv
}
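
The collation step in analyze merges token frequencies from multiple fields that share a name (typically array elements). A self-contained sketch of that accumulation, with plain maps standing in for index.TokenFrequencies:

package main

import "fmt"

// mergeFreqs folds src into dst, mirroring how analyze() accumulates
// per-field token frequencies before emitting index rows once per field.
func mergeFreqs(dst, src map[string]int) map[string]int {
	if dst == nil {
		return src
	}
	for term, n := range src {
		dst[term] += n
	}
	return dst
}

func main() {
	byField := map[uint16]map[string]int{}
	// two array elements of the same field produce two analysis results
	byField[1] = mergeFreqs(byField[1], map[string]int{"quick": 1, "fox": 1})
	byField[1] = mergeFreqs(byField[1], map[string]int{"fox": 2, "dog": 1})
	fmt.Println(byField[1]) // map[dog:1 fox:3 quick:1]
}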


@@ -0,0 +1,8 @@
#!/bin/sh
BENCHMARKS=`grep "func Benchmark" *_test.go | sed 's/.*func //' | sed s/\(.*{//`
for BENCHMARK in $BENCHMARKS
do
go test -v -run=xxx -bench=^$BENCHMARK$ -benchtime=10s -tags 'forestdb leveldb' | grep -v ok | grep -v PASS
done


@@ -0,0 +1,174 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"bytes"
"sort"
"github.com/blevesearch/upsidedown_store_api"
)
// the functions in this file are only intended to be used by
// the bleve_dump utility and the debug http handlers.
// if your application relies on them, you're doing something wrong.
// they may change or be removed at any time.
func dumpPrefix(kvreader store.KVReader, rv chan interface{}, prefix []byte) {
start := prefix
if start == nil {
start = []byte{0}
}
it := kvreader.PrefixIterator(start)
defer func() {
cerr := it.Close()
if cerr != nil {
rv <- cerr
}
}()
key, val, valid := it.Current()
for valid {
ck := make([]byte, len(key))
copy(ck, key)
cv := make([]byte, len(val))
copy(cv, val)
row, err := ParseFromKeyValue(ck, cv)
if err != nil {
rv <- err
return
}
rv <- row
it.Next()
key, val, valid = it.Current()
}
}
func dumpRange(kvreader store.KVReader, rv chan interface{}, start, end []byte) {
it := kvreader.RangeIterator(start, end)
defer func() {
cerr := it.Close()
if cerr != nil {
rv <- cerr
}
}()
key, val, valid := it.Current()
for valid {
ck := make([]byte, len(key))
copy(ck, key)
cv := make([]byte, len(val))
copy(cv, val)
row, err := ParseFromKeyValue(ck, cv)
if err != nil {
rv <- err
return
}
rv <- row
it.Next()
key, val, valid = it.Current()
}
}
func (i *IndexReader) DumpAll() chan interface{} {
rv := make(chan interface{})
go func() {
defer close(rv)
dumpRange(i.kvreader, rv, nil, nil)
}()
return rv
}
func (i *IndexReader) DumpFields() chan interface{} {
rv := make(chan interface{})
go func() {
defer close(rv)
dumpPrefix(i.kvreader, rv, []byte{'f'})
}()
return rv
}
type keyset [][]byte
func (k keyset) Len() int { return len(k) }
func (k keyset) Swap(i, j int) { k[i], k[j] = k[j], k[i] }
func (k keyset) Less(i, j int) bool { return bytes.Compare(k[i], k[j]) < 0 }
// DumpDoc returns all rows in the index related to this doc id
func (i *IndexReader) DumpDoc(id string) chan interface{} {
idBytes := []byte(id)
rv := make(chan interface{})
go func() {
defer close(rv)
back, err := backIndexRowForDoc(i.kvreader, []byte(id))
if err != nil {
rv <- err
return
}
// no such doc
if back == nil {
return
}
// build sorted list of term keys
keys := make(keyset, 0)
for _, entry := range back.termsEntries {
for i := range entry.Terms {
tfr := NewTermFrequencyRow([]byte(entry.Terms[i]), uint16(*entry.Field), idBytes, 0, 0)
key := tfr.Key()
keys = append(keys, key)
}
}
sort.Sort(keys)
// first add all the stored rows
storedRowPrefix := NewStoredRow(idBytes, 0, []uint64{}, 'x', []byte{}).ScanPrefixForDoc()
dumpPrefix(i.kvreader, rv, storedRowPrefix)
// now walk term keys in order and add them as well
if len(keys) > 0 {
it := i.kvreader.RangeIterator(keys[0], nil)
defer func() {
cerr := it.Close()
if cerr != nil {
rv <- cerr
}
}()
for _, key := range keys {
it.Seek(key)
rkey, rval, valid := it.Current()
if !valid {
break
}
rck := make([]byte, len(rkey))
copy(rck, rkey) // copy the key the iterator landed on, not the seek key
rcv := make([]byte, len(rval))
copy(rcv, rval)
row, err := ParseFromKeyValue(rck, rcv)
if err != nil {
rv <- err
return
}
rv <- row
}
}
}()
return rv
}
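
The dump helpers stream rows and errors over a single chan interface{}, so callers must type-switch on every value. A minimal consumer under that convention (produce is a stand-in for DumpAll and friends):

package main

import "fmt"

func produce() chan interface{} {
	rv := make(chan interface{})
	go func() {
		defer close(rv)
		rv <- "row-1"
		rv <- "row-2"
		rv <- fmt.Errorf("iterator closed early") // errors share the channel
	}()
	return rv
}

func main() {
	for v := range produce() {
		switch x := v.(type) {
		case error:
			fmt.Println("error:", x)
			return
		default:
			fmt.Println("row:", x)
		}
	}
}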


@@ -0,0 +1,88 @@
// Copyright (c) 2015 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"sync"
)
type FieldCache struct {
fieldIndexes map[string]uint16
indexFields []string
lastFieldIndex int
mutex sync.RWMutex
}
func NewFieldCache() *FieldCache {
return &FieldCache{
fieldIndexes: make(map[string]uint16),
lastFieldIndex: -1,
}
}
func (f *FieldCache) AddExisting(field string, index uint16) {
f.mutex.Lock()
f.addLOCKED(field, index)
f.mutex.Unlock()
}
func (f *FieldCache) addLOCKED(field string, index uint16) uint16 {
f.fieldIndexes[field] = index
if len(f.indexFields) < int(index)+1 {
prevIndexFields := f.indexFields
f.indexFields = make([]string, int(index)+16)
copy(f.indexFields, prevIndexFields)
}
f.indexFields[int(index)] = field
if int(index) > f.lastFieldIndex {
f.lastFieldIndex = int(index)
}
return index
}
// FieldNamed returns the index of the field, and whether or not it existed
// before this call. if createIfMissing is true, and new field index is assigned
// but the second return value will still be false
func (f *FieldCache) FieldNamed(field string, createIfMissing bool) (uint16, bool) {
f.mutex.RLock()
if index, ok := f.fieldIndexes[field]; ok {
f.mutex.RUnlock()
return index, true
} else if !createIfMissing {
f.mutex.RUnlock()
return 0, false
}
// trade read lock for write lock
f.mutex.RUnlock()
f.mutex.Lock()
// need to check again with write lock
if index, ok := f.fieldIndexes[field]; ok {
f.mutex.Unlock()
return index, true
}
// assign next field id
index := f.addLOCKED(field, uint16(f.lastFieldIndex+1))
f.mutex.Unlock()
return index, false
}
func (f *FieldCache) FieldIndexed(index uint16) (field string) {
f.mutex.RLock()
if int(index) < len(f.indexFields) {
field = f.indexFields[int(index)]
}
f.mutex.RUnlock()
return field
}
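
FieldNamed trades its read lock for a write lock and must re-check the map, since another goroutine may have inserted the field between RUnlock and Lock. A self-contained sketch of that double-checked pattern (the registry type is illustrative):

package main

import (
	"fmt"
	"sync"
)

type registry struct {
	mu   sync.RWMutex
	ids  map[string]uint16
	next uint16
}

func (r *registry) idFor(name string) uint16 {
	r.mu.RLock()
	if id, ok := r.ids[name]; ok {
		r.mu.RUnlock()
		return id
	}
	r.mu.RUnlock()

	r.mu.Lock()
	defer r.mu.Unlock()
	// re-check: another goroutine may have won the race while we held no lock
	if id, ok := r.ids[name]; ok {
		return id
	}
	if r.ids == nil {
		r.ids = make(map[string]uint16)
	}
	id := r.next
	r.next++
	r.ids[name] = id
	return id
}

func main() {
	var r registry
	fmt.Println(r.idFor("title"), r.idFor("body"), r.idFor("title")) // 0 1 0
}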


@@ -0,0 +1,86 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"fmt"
index "github.com/blevesearch/bleve_index_api"
store "github.com/blevesearch/upsidedown_store_api"
)
type UpsideDownCouchFieldDict struct {
indexReader *IndexReader
iterator store.KVIterator
dictRow *DictionaryRow
dictEntry *index.DictEntry
field uint16
}
func newUpsideDownCouchFieldDict(indexReader *IndexReader, field uint16, startTerm, endTerm []byte) (*UpsideDownCouchFieldDict, error) {
startKey := NewDictionaryRow(startTerm, field, 0).Key()
if endTerm == nil {
endTerm = []byte{ByteSeparator}
} else {
endTerm = incrementBytes(endTerm)
}
endKey := NewDictionaryRow(endTerm, field, 0).Key()
it := indexReader.kvreader.RangeIterator(startKey, endKey)
return &UpsideDownCouchFieldDict{
indexReader: indexReader,
iterator: it,
dictRow: &DictionaryRow{}, // Pre-alloced, reused row.
dictEntry: &index.DictEntry{}, // Pre-alloced, reused entry.
field: field,
}, nil
}
func (r *UpsideDownCouchFieldDict) BytesRead() uint64 {
return 0
}
func (r *UpsideDownCouchFieldDict) Next() (*index.DictEntry, error) {
key, val, valid := r.iterator.Current()
if !valid {
return nil, nil
}
err := r.dictRow.parseDictionaryK(key)
if err != nil {
return nil, fmt.Errorf("unexpected error parsing dictionary row key: %v", err)
}
err = r.dictRow.parseDictionaryV(val)
if err != nil {
return nil, fmt.Errorf("unexpected error parsing dictionary row val: %v", err)
}
r.dictEntry.Term = string(r.dictRow.term)
r.dictEntry.Count = r.dictRow.count
// advance the iterator to the next term
r.iterator.Next()
return r.dictEntry, nil
}
func (r *UpsideDownCouchFieldDict) Cardinality() int {
return 0
}
func (r *UpsideDownCouchFieldDict) Close() error {
return r.iterator.Close()
}


@@ -0,0 +1,228 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"context"
"reflect"
"github.com/blevesearch/bleve/v2/document"
index "github.com/blevesearch/bleve_index_api"
store "github.com/blevesearch/upsidedown_store_api"
)
var reflectStaticSizeIndexReader int
func init() {
var ir IndexReader
reflectStaticSizeIndexReader = int(reflect.TypeOf(ir).Size())
}
type IndexReader struct {
index *UpsideDownCouch
kvreader store.KVReader
docCount uint64
}
func (i *IndexReader) TermFieldReader(ctx context.Context, term []byte, fieldName string, includeFreq, includeNorm, includeTermVectors bool) (index.TermFieldReader, error) {
fieldIndex, fieldExists := i.index.fieldCache.FieldNamed(fieldName, false)
if fieldExists {
return newUpsideDownCouchTermFieldReader(i, term, uint16(fieldIndex), includeFreq, includeNorm, includeTermVectors)
}
return newUpsideDownCouchTermFieldReader(i, []byte{ByteSeparator}, ^uint16(0), includeFreq, includeNorm, includeTermVectors)
}
func (i *IndexReader) FieldDict(fieldName string) (index.FieldDict, error) {
return i.FieldDictRange(fieldName, nil, nil)
}
func (i *IndexReader) FieldDictRange(fieldName string, startTerm []byte, endTerm []byte) (index.FieldDict, error) {
fieldIndex, fieldExists := i.index.fieldCache.FieldNamed(fieldName, false)
if fieldExists {
return newUpsideDownCouchFieldDict(i, uint16(fieldIndex), startTerm, endTerm)
}
return newUpsideDownCouchFieldDict(i, ^uint16(0), []byte{ByteSeparator}, []byte{})
}
func (i *IndexReader) FieldDictPrefix(fieldName string, termPrefix []byte) (index.FieldDict, error) {
return i.FieldDictRange(fieldName, termPrefix, termPrefix)
}
func (i *IndexReader) DocIDReaderAll() (index.DocIDReader, error) {
return newUpsideDownCouchDocIDReader(i)
}
func (i *IndexReader) DocIDReaderOnly(ids []string) (index.DocIDReader, error) {
return newUpsideDownCouchDocIDReaderOnly(i, ids)
}
func (i *IndexReader) Document(id string) (doc index.Document, err error) {
// first hit the back index to confirm doc exists
var backIndexRow *BackIndexRow
backIndexRow, err = backIndexRowForDoc(i.kvreader, []byte(id))
if err != nil {
return
}
if backIndexRow == nil {
return
}
rvd := document.NewDocument(id)
storedRow := NewStoredRow([]byte(id), 0, []uint64{}, 'x', nil)
storedRowScanPrefix := storedRow.ScanPrefixForDoc()
it := i.kvreader.PrefixIterator(storedRowScanPrefix)
defer func() {
if cerr := it.Close(); err == nil && cerr != nil {
err = cerr
}
}()
key, val, valid := it.Current()
for valid {
safeVal := make([]byte, len(val))
copy(safeVal, val)
var row *StoredRow
row, err = NewStoredRowKV(key, safeVal)
if err != nil {
return nil, err
}
if row != nil {
fieldName := i.index.fieldCache.FieldIndexed(row.field)
field := decodeFieldType(row.typ, fieldName, row.arrayPositions, row.value)
if field != nil {
rvd.AddField(field)
}
}
it.Next()
key, val, valid = it.Current()
}
return rvd, nil
}
func (i *IndexReader) documentVisitFieldTerms(id index.IndexInternalID, fields []string, visitor index.DocValueVisitor) error {
fieldsMap := make(map[uint16]string, len(fields))
for _, f := range fields {
id, ok := i.index.fieldCache.FieldNamed(f, false)
if ok {
fieldsMap[id] = f
}
}
tempRow := BackIndexRow{
doc: id,
}
keyBuf := GetRowBuffer()
if tempRow.KeySize() > len(keyBuf.buf) {
keyBuf.buf = make([]byte, 2*tempRow.KeySize())
}
defer PutRowBuffer(keyBuf)
keySize, err := tempRow.KeyTo(keyBuf.buf)
if err != nil {
return err
}
value, err := i.kvreader.Get(keyBuf.buf[:keySize])
if err != nil {
return err
}
if value == nil {
return nil
}
return visitBackIndexRow(value, func(field uint32, term []byte) {
if field, ok := fieldsMap[uint16(field)]; ok {
visitor(field, term)
}
})
}
func (i *IndexReader) Fields() (fields []string, err error) {
fields = make([]string, 0)
it := i.kvreader.PrefixIterator([]byte{'f'})
defer func() {
if cerr := it.Close(); err == nil && cerr != nil {
err = cerr
}
}()
key, val, valid := it.Current()
for valid {
var row UpsideDownCouchRow
row, err = ParseFromKeyValue(key, val)
if err != nil {
fields = nil
return
}
if row != nil {
fieldRow, ok := row.(*FieldRow)
if ok {
fields = append(fields, fieldRow.name)
}
}
it.Next()
key, val, valid = it.Current()
}
return
}
func (i *IndexReader) GetInternal(key []byte) ([]byte, error) {
internalRow := NewInternalRow(key, nil)
return i.kvreader.Get(internalRow.Key())
}
func (i *IndexReader) DocCount() (uint64, error) {
return i.docCount, nil
}
func (i *IndexReader) Close() error {
return i.kvreader.Close()
}
func (i *IndexReader) ExternalID(id index.IndexInternalID) (string, error) {
return string(id), nil
}
func (i *IndexReader) InternalID(id string) (index.IndexInternalID, error) {
return index.IndexInternalID(id), nil
}
func incrementBytes(in []byte) []byte {
rv := make([]byte, len(in))
copy(rv, in)
for i := len(rv) - 1; i >= 0; i-- {
rv[i] = rv[i] + 1
if rv[i] != 0 {
// didn't overflow, so stop
break
}
}
return rv
}
func (i *IndexReader) DocValueReader(fields []string) (index.DocValueReader, error) {
return &DocValueReader{i: i, fields: fields}, nil
}
type DocValueReader struct {
i *IndexReader
fields []string
}
func (dvr *DocValueReader) VisitDocValues(id index.IndexInternalID,
visitor index.DocValueVisitor) error {
return dvr.i.documentVisitFieldTerms(id, dvr.fields, visitor)
}
func (dvr *DocValueReader) BytesRead() uint64 { return 0 }
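
incrementBytes computes the smallest byte string strictly greater than any string having the input as a prefix, which is how FieldDictRange turns an inclusive end term into an exclusive iterator bound. A standalone demonstration of its carry behavior:

package main

import "fmt"

func incrementBytes(in []byte) []byte {
	rv := make([]byte, len(in))
	copy(rv, in)
	for i := len(rv) - 1; i >= 0; i-- {
		rv[i]++
		if rv[i] != 0 { // no overflow at this byte, done
			break
		}
	}
	return rv
}

func main() {
	fmt.Printf("%q\n", incrementBytes([]byte("abc")))    // "abd"
	fmt.Printf("%q\n", incrementBytes([]byte("ab\xff"))) // "ac\x00" (carry)
}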


@@ -0,0 +1,376 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"bytes"
"reflect"
"sort"
"sync/atomic"
"github.com/blevesearch/bleve/v2/size"
index "github.com/blevesearch/bleve_index_api"
"github.com/blevesearch/upsidedown_store_api"
)
var reflectStaticSizeUpsideDownCouchTermFieldReader int
var reflectStaticSizeUpsideDownCouchDocIDReader int
func init() {
var tfr UpsideDownCouchTermFieldReader
reflectStaticSizeUpsideDownCouchTermFieldReader =
int(reflect.TypeOf(tfr).Size())
var cdr UpsideDownCouchDocIDReader
reflectStaticSizeUpsideDownCouchDocIDReader =
int(reflect.TypeOf(cdr).Size())
}
type UpsideDownCouchTermFieldReader struct {
count uint64
indexReader *IndexReader
iterator store.KVIterator
term []byte
tfrNext *TermFrequencyRow
tfrPrealloc TermFrequencyRow
keyBuf []byte
field uint16
includeTermVectors bool
}
func (r *UpsideDownCouchTermFieldReader) Size() int {
sizeInBytes := reflectStaticSizeUpsideDownCouchTermFieldReader + size.SizeOfPtr +
len(r.term) +
r.tfrPrealloc.Size() +
len(r.keyBuf)
if r.tfrNext != nil {
sizeInBytes += r.tfrNext.Size()
}
return sizeInBytes
}
func newUpsideDownCouchTermFieldReader(indexReader *IndexReader, term []byte, field uint16, includeFreq, includeNorm, includeTermVectors bool) (*UpsideDownCouchTermFieldReader, error) {
bufNeeded := termFrequencyRowKeySize(term, nil)
if bufNeeded < dictionaryRowKeySize(term) {
bufNeeded = dictionaryRowKeySize(term)
}
buf := make([]byte, bufNeeded)
bufUsed := dictionaryRowKeyTo(buf, field, term)
val, err := indexReader.kvreader.Get(buf[:bufUsed])
if err != nil {
return nil, err
}
if val == nil {
atomic.AddUint64(&indexReader.index.stats.termSearchersStarted, uint64(1))
rv := &UpsideDownCouchTermFieldReader{
count: 0,
term: term,
field: field,
includeTermVectors: includeTermVectors,
}
rv.tfrNext = &rv.tfrPrealloc
return rv, nil
}
count, err := dictionaryRowParseV(val)
if err != nil {
return nil, err
}
bufUsed = termFrequencyRowKeyTo(buf, field, term, nil)
it := indexReader.kvreader.PrefixIterator(buf[:bufUsed])
atomic.AddUint64(&indexReader.index.stats.termSearchersStarted, uint64(1))
return &UpsideDownCouchTermFieldReader{
indexReader: indexReader,
iterator: it,
count: count,
term: term,
field: field,
includeTermVectors: includeTermVectors,
}, nil
}
func (r *UpsideDownCouchTermFieldReader) Count() uint64 {
return r.count
}
func (r *UpsideDownCouchTermFieldReader) Next(preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error) {
if r.iterator != nil {
// We treat tfrNext also like an initialization flag, which
// tells us whether we need to invoke the underlying
// iterator.Next(). The first time, don't call iterator.Next().
if r.tfrNext != nil {
r.iterator.Next()
} else {
r.tfrNext = &r.tfrPrealloc
}
key, val, valid := r.iterator.Current()
if valid {
tfr := r.tfrNext
err := tfr.parseKDoc(key, r.term)
if err != nil {
return nil, err
}
err = tfr.parseV(val, r.includeTermVectors)
if err != nil {
return nil, err
}
rv := preAlloced
if rv == nil {
rv = &index.TermFieldDoc{}
}
rv.ID = append(rv.ID, tfr.doc...)
rv.Freq = tfr.freq
rv.Norm = float64(tfr.norm)
if tfr.vectors != nil {
rv.Vectors = r.indexReader.index.termFieldVectorsFromTermVectors(tfr.vectors)
}
return rv, nil
}
}
return nil, nil
}
func (r *UpsideDownCouchTermFieldReader) Advance(docID index.IndexInternalID, preAlloced *index.TermFieldDoc) (rv *index.TermFieldDoc, err error) {
if r.iterator != nil {
if r.tfrNext == nil {
r.tfrNext = &TermFrequencyRow{}
}
tfr := InitTermFrequencyRow(r.tfrNext, r.term, r.field, docID, 0, 0)
r.keyBuf, err = tfr.KeyAppendTo(r.keyBuf[:0])
if err != nil {
return nil, err
}
r.iterator.Seek(r.keyBuf)
key, val, valid := r.iterator.Current()
if valid {
err := tfr.parseKDoc(key, r.term)
if err != nil {
return nil, err
}
err = tfr.parseV(val, r.includeTermVectors)
if err != nil {
return nil, err
}
rv = preAlloced
if rv == nil {
rv = &index.TermFieldDoc{}
}
rv.ID = append(rv.ID, tfr.doc...)
rv.Freq = tfr.freq
rv.Norm = float64(tfr.norm)
if tfr.vectors != nil {
rv.Vectors = r.indexReader.index.termFieldVectorsFromTermVectors(tfr.vectors)
}
return rv, nil
}
}
return nil, nil
}
func (r *UpsideDownCouchTermFieldReader) Close() error {
if r.indexReader != nil {
atomic.AddUint64(&r.indexReader.index.stats.termSearchersFinished, uint64(1))
}
if r.iterator != nil {
return r.iterator.Close()
}
return nil
}
type UpsideDownCouchDocIDReader struct {
indexReader *IndexReader
iterator store.KVIterator
only []string
onlyPos int
onlyMode bool
}
func (r *UpsideDownCouchDocIDReader) Size() int {
sizeInBytes := reflectStaticSizeUpsideDownCouchDocIDReader +
reflectStaticSizeIndexReader + size.SizeOfPtr
for _, entry := range r.only {
sizeInBytes += size.SizeOfString + len(entry)
}
return sizeInBytes
}
func newUpsideDownCouchDocIDReader(indexReader *IndexReader) (*UpsideDownCouchDocIDReader, error) {
startBytes := []byte{0x0}
endBytes := []byte{0xff}
bisr := NewBackIndexRow(startBytes, nil, nil)
bier := NewBackIndexRow(endBytes, nil, nil)
it := indexReader.kvreader.RangeIterator(bisr.Key(), bier.Key())
return &UpsideDownCouchDocIDReader{
indexReader: indexReader,
iterator: it,
}, nil
}
func newUpsideDownCouchDocIDReaderOnly(indexReader *IndexReader, ids []string) (*UpsideDownCouchDocIDReader, error) {
// we don't actually own the list of ids, so before we sort we must copy
idsCopy := make([]string, len(ids))
copy(idsCopy, ids)
// ensure ids are sorted
sort.Strings(idsCopy)
startBytes := []byte{0x0}
if len(idsCopy) > 0 {
startBytes = []byte(idsCopy[0])
}
endBytes := []byte{0xff}
if len(idsCopy) > 0 {
endBytes = incrementBytes([]byte(idsCopy[len(idsCopy)-1]))
}
bisr := NewBackIndexRow(startBytes, nil, nil)
bier := NewBackIndexRow(endBytes, nil, nil)
it := indexReader.kvreader.RangeIterator(bisr.Key(), bier.Key())
return &UpsideDownCouchDocIDReader{
indexReader: indexReader,
iterator: it,
only: idsCopy,
onlyMode: true,
}, nil
}
func (r *UpsideDownCouchDocIDReader) Next() (index.IndexInternalID, error) {
key, val, valid := r.iterator.Current()
if r.onlyMode {
var rv index.IndexInternalID
for valid && r.onlyPos < len(r.only) {
br, err := NewBackIndexRowKV(key, val)
if err != nil {
return nil, err
}
if !bytes.Equal(br.doc, []byte(r.only[r.onlyPos])) {
ok := r.nextOnly()
if !ok {
return nil, nil
}
r.iterator.Seek(NewBackIndexRow([]byte(r.only[r.onlyPos]), nil, nil).Key())
key, val, valid = r.iterator.Current()
continue
} else {
rv = append([]byte(nil), br.doc...)
break
}
}
if valid && r.onlyPos < len(r.only) {
ok := r.nextOnly()
if ok {
r.iterator.Seek(NewBackIndexRow([]byte(r.only[r.onlyPos]), nil, nil).Key())
}
return rv, nil
}
} else {
if valid {
br, err := NewBackIndexRowKV(key, val)
if err != nil {
return nil, err
}
rv := append([]byte(nil), br.doc...)
r.iterator.Next()
return rv, nil
}
}
return nil, nil
}
func (r *UpsideDownCouchDocIDReader) Advance(docID index.IndexInternalID) (index.IndexInternalID, error) {
if r.onlyMode {
r.onlyPos = sort.SearchStrings(r.only, string(docID))
if r.onlyPos >= len(r.only) {
// advanced to key after our last only key
return nil, nil
}
r.iterator.Seek(NewBackIndexRow([]byte(r.only[r.onlyPos]), nil, nil).Key())
key, val, valid := r.iterator.Current()
var rv index.IndexInternalID
for valid && r.onlyPos < len(r.only) {
br, err := NewBackIndexRowKV(key, val)
if err != nil {
return nil, err
}
if !bytes.Equal(br.doc, []byte(r.only[r.onlyPos])) {
// the only key we seek'd to didn't exist
// now look for the closest key that did exist in only
r.onlyPos = sort.SearchStrings(r.only, string(br.doc))
if r.onlyPos >= len(r.only) {
// advanced to key after our last only key
return nil, nil
}
// now seek to this new only key
r.iterator.Seek(NewBackIndexRow([]byte(r.only[r.onlyPos]), nil, nil).Key())
key, val, valid = r.iterator.Current()
continue
} else {
rv = append([]byte(nil), br.doc...)
break
}
}
if valid && r.onlyPos < len(r.only) {
ok := r.nextOnly()
if ok {
r.iterator.Seek(NewBackIndexRow([]byte(r.only[r.onlyPos]), nil, nil).Key())
}
return rv, nil
}
} else {
bir := NewBackIndexRow(docID, nil, nil)
r.iterator.Seek(bir.Key())
key, val, valid := r.iterator.Current()
if valid {
br, err := NewBackIndexRowKV(key, val)
if err != nil {
return nil, err
}
rv := append([]byte(nil), br.doc...)
r.iterator.Next()
return rv, nil
}
}
return nil, nil
}
func (r *UpsideDownCouchDocIDReader) Close() error {
return r.iterator.Close()
}
// move the r.only pos forward one, skipping duplicates
// return true if there is more data, or false if we got to the end of the list
func (r *UpsideDownCouchDocIDReader) nextOnly() bool {
// advance 1 position, until we see a different key
// it's already sorted, so this skips duplicates
start := r.onlyPos
r.onlyPos++
for r.onlyPos < len(r.only) && r.only[r.onlyPos] == r.only[start] {
start = r.onlyPos
r.onlyPos++
}
// indicate if we got to the end of the list
return r.onlyPos < len(r.only)
}
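
Because the only list is sorted before use, nextOnly can skip duplicates with a single forward scan. The same idea in isolation:

package main

import "fmt"

// nextDistinct returns the index of the next element different from s[pos],
// or len(s) if none remains. Assumes s is sorted.
func nextDistinct(s []string, pos int) int {
	next := pos + 1
	for next < len(s) && s[next] == s[pos] {
		next++
	}
	return next
}

func main() {
	ids := []string{"a", "b", "b", "b", "c"}
	for i := 0; i < len(ids); i = nextDistinct(ids, i) {
		fmt.Println(ids[i]) // a, b, c
	}
}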

File diff suppressed because it is too large


@@ -0,0 +1,76 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"encoding/binary"
)
var mergeOperator upsideDownMerge
var dictionaryTermIncr []byte
var dictionaryTermDecr []byte
func init() {
dictionaryTermIncr = make([]byte, 8)
binary.LittleEndian.PutUint64(dictionaryTermIncr, uint64(1))
dictionaryTermDecr = make([]byte, 8)
var negOne = int64(-1)
binary.LittleEndian.PutUint64(dictionaryTermDecr, uint64(negOne))
}
type upsideDownMerge struct{}
func (m *upsideDownMerge) FullMerge(key, existingValue []byte, operands [][]byte) ([]byte, bool) {
// set up record based on key
dr, err := NewDictionaryRowK(key)
if err != nil {
return nil, false
}
if len(existingValue) > 0 {
// if existing value, parse it
err = dr.parseDictionaryV(existingValue)
if err != nil {
return nil, false
}
}
// now process operands
for _, operand := range operands {
next := int64(binary.LittleEndian.Uint64(operand))
if next < 0 && uint64(-next) > dr.count {
// subtracting next from existing would overflow
dr.count = 0
} else if next < 0 {
dr.count -= uint64(-next)
} else {
dr.count += uint64(next)
}
}
return dr.Value(), true
}
func (m *upsideDownMerge) PartialMerge(key, leftOperand, rightOperand []byte) ([]byte, bool) {
left := int64(binary.LittleEndian.Uint64(leftOperand))
right := int64(binary.LittleEndian.Uint64(rightOperand))
rv := make([]byte, 8)
binary.LittleEndian.PutUint64(rv, uint64(left+right))
return rv, true
}
func (m *upsideDownMerge) Name() string {
return "upsideDownMerge"
}
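
The merge operator encodes deltas as little-endian uint64 values that are really two's-complement int64s, letting +1 and -1 operands be summed and clamped at zero. A standalone round-trip of that encoding and FullMerge's clamping rule:

package main

import (
	"encoding/binary"
	"fmt"
)

// delta encodes a signed delta the way the dictionary merge operands do.
func delta(n int64) []byte {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, uint64(n))
	return b
}

// apply folds operands into count, clamping at zero as FullMerge does.
func apply(count uint64, operands [][]byte) uint64 {
	for _, op := range operands {
		next := int64(binary.LittleEndian.Uint64(op))
		switch {
		case next < 0 && uint64(-next) > count:
			count = 0 // subtracting would underflow
		case next < 0:
			count -= uint64(-next)
		default:
			count += uint64(next)
		}
	}
	return count
}

func main() {
	ops := [][]byte{delta(1), delta(-3)}
	fmt.Println(apply(0, ops)) // 0: the -3 would underflow, so it clamps
}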


@@ -0,0 +1,55 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package upsidedown
import (
"sync/atomic"
"github.com/blevesearch/bleve/v2/util"
"github.com/blevesearch/upsidedown_store_api"
)
type indexStat struct {
updates, deletes, batches, errors uint64
analysisTime, indexTime uint64
termSearchersStarted uint64
termSearchersFinished uint64
numPlainTextBytesIndexed uint64
i *UpsideDownCouch
}
func (i *indexStat) statsMap() map[string]interface{} {
m := map[string]interface{}{}
m["updates"] = atomic.LoadUint64(&i.updates)
m["deletes"] = atomic.LoadUint64(&i.deletes)
m["batches"] = atomic.LoadUint64(&i.batches)
m["errors"] = atomic.LoadUint64(&i.errors)
m["analysis_time"] = atomic.LoadUint64(&i.analysisTime)
m["index_time"] = atomic.LoadUint64(&i.indexTime)
m["term_searchers_started"] = atomic.LoadUint64(&i.termSearchersStarted)
m["term_searchers_finished"] = atomic.LoadUint64(&i.termSearchersFinished)
m["num_plain_text_bytes_indexed"] = atomic.LoadUint64(&i.numPlainTextBytesIndexed)
if o, ok := i.i.store.(store.KVStoreStats); ok {
m["kv"] = o.StatsMap()
}
return m
}
func (i *indexStat) MarshalJSON() ([]byte, error) {
m := i.statsMap()
return util.MarshalJSON(m)
}


@@ -0,0 +1,85 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package boltdb
import (
"bytes"
bolt "go.etcd.io/bbolt"
)
type Iterator struct {
store *Store
tx *bolt.Tx
cursor *bolt.Cursor
prefix []byte
start []byte
end []byte
valid bool
key []byte
val []byte
}
func (i *Iterator) updateValid() {
i.valid = (i.key != nil)
if i.valid {
if i.prefix != nil {
i.valid = bytes.HasPrefix(i.key, i.prefix)
} else if i.end != nil {
i.valid = bytes.Compare(i.key, i.end) < 0
}
}
}
func (i *Iterator) Seek(k []byte) {
if i.start != nil && bytes.Compare(k, i.start) < 0 {
k = i.start
}
if i.prefix != nil && !bytes.HasPrefix(k, i.prefix) {
if bytes.Compare(k, i.prefix) < 0 {
k = i.prefix
} else {
i.valid = false
return
}
}
i.key, i.val = i.cursor.Seek(k)
i.updateValid()
}
func (i *Iterator) Next() {
i.key, i.val = i.cursor.Next()
i.updateValid()
}
func (i *Iterator) Current() ([]byte, []byte, bool) {
return i.key, i.val, i.valid
}
func (i *Iterator) Key() []byte {
return i.key
}
func (i *Iterator) Value() []byte {
return i.val
}
func (i *Iterator) Valid() bool {
return i.valid
}
func (i *Iterator) Close() error {
return nil
}


@@ -0,0 +1,73 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package boltdb
import (
store "github.com/blevesearch/upsidedown_store_api"
bolt "go.etcd.io/bbolt"
)
type Reader struct {
store *Store
tx *bolt.Tx
bucket *bolt.Bucket
}
func (r *Reader) Get(key []byte) ([]byte, error) {
var rv []byte
v := r.bucket.Get(key)
if v != nil {
rv = make([]byte, len(v))
copy(rv, v)
}
return rv, nil
}
func (r *Reader) MultiGet(keys [][]byte) ([][]byte, error) {
return store.MultiGet(r, keys)
}
func (r *Reader) PrefixIterator(prefix []byte) store.KVIterator {
cursor := r.bucket.Cursor()
rv := &Iterator{
store: r.store,
tx: r.tx,
cursor: cursor,
prefix: prefix,
}
rv.Seek(prefix)
return rv
}
func (r *Reader) RangeIterator(start, end []byte) store.KVIterator {
cursor := r.bucket.Cursor()
rv := &Iterator{
store: r.store,
tx: r.tx,
cursor: cursor,
start: start,
end: end,
}
rv.Seek(start)
return rv
}
func (r *Reader) Close() error {
return r.tx.Rollback()
}
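
Reader.Get copies the value before returning because bolt hands back a slice into its mmap'd page, which is only valid while the transaction stays open. The copy idiom in isolation:

package main

import "fmt"

// copyBytes detaches a value from backing storage the caller doesn't own.
func copyBytes(v []byte) []byte {
	if v == nil {
		return nil // preserve nil to mean "not found"
	}
	rv := make([]byte, len(v))
	copy(rv, v)
	return rv
}

func main() {
	backing := []byte("value")
	safe := copyBytes(backing)
	backing[0] = 'X' // mutating the original no longer affects the copy
	fmt.Println(string(safe), string(backing)) // value Xalue
}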


@@ -0,0 +1,28 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package boltdb
import (
"github.com/blevesearch/bleve/v2/util"
)
type stats struct {
s *Store
}
func (s *stats) MarshalJSON() ([]byte, error) {
bs := s.s.db.Stats()
return util.MarshalJSON(bs)
}


@@ -0,0 +1,184 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package boltdb implements a store.KVStore on top of BoltDB. It supports the
// following options:
//
// "bucket" (string): the name of BoltDB bucket to use, defaults to "bleve".
//
// "nosync" (bool): if true, set boltdb.DB.NoSync to true. It speeds up index
// operations in exchange of losing integrity guarantees if indexation aborts
// without closing the index. Use it when rebuilding indexes from zero.
package boltdb
import (
"bytes"
"encoding/json"
"fmt"
"os"
"github.com/blevesearch/bleve/v2/registry"
store "github.com/blevesearch/upsidedown_store_api"
bolt "go.etcd.io/bbolt"
)
const (
Name = "boltdb"
defaultCompactBatchSize = 100
)
type Store struct {
path string
bucket string
db *bolt.DB
noSync bool
fillPercent float64
mo store.MergeOperator
}
func New(mo store.MergeOperator, config map[string]interface{}) (store.KVStore, error) {
path, ok := config["path"].(string)
if !ok {
return nil, fmt.Errorf("must specify path")
}
if path == "" {
return nil, os.ErrInvalid
}
bucket, ok := config["bucket"].(string)
if !ok {
bucket = "bleve"
}
noSync, _ := config["nosync"].(bool)
fillPercent, ok := config["fillPercent"].(float64)
if !ok {
fillPercent = bolt.DefaultFillPercent
}
bo := &bolt.Options{}
ro, ok := config["read_only"].(bool)
if ok {
bo.ReadOnly = ro
}
if initialMmapSize, ok := config["initialMmapSize"].(int); ok {
bo.InitialMmapSize = initialMmapSize
} else if initialMmapSize, ok := config["initialMmapSize"].(float64); ok {
bo.InitialMmapSize = int(initialMmapSize)
}
db, err := bolt.Open(path, 0600, bo)
if err != nil {
return nil, err
}
db.NoSync = noSync
if !bo.ReadOnly {
err = db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucketIfNotExists([]byte(bucket))
return err
})
if err != nil {
return nil, err
}
}
rv := Store{
path: path,
bucket: bucket,
db: db,
mo: mo,
noSync: noSync,
fillPercent: fillPercent,
}
return &rv, nil
}
func (bs *Store) Close() error {
return bs.db.Close()
}
func (bs *Store) Reader() (store.KVReader, error) {
tx, err := bs.db.Begin(false)
if err != nil {
return nil, err
}
return &Reader{
store: bs,
tx: tx,
bucket: tx.Bucket([]byte(bs.bucket)),
}, nil
}
func (bs *Store) Writer() (store.KVWriter, error) {
return &Writer{
store: bs,
}, nil
}
func (bs *Store) Stats() json.Marshaler {
return &stats{
s: bs,
}
}
// CompactWithBatchSize removes DictionaryTerm entries with a count of zero (in batchSize batches)
// Removing entries is a workaround for github issue #374.
func (bs *Store) CompactWithBatchSize(batchSize int) error {
for {
cnt := 0
err := bs.db.Batch(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte(bs.bucket)).Cursor()
prefix := []byte("d")
for k, v := c.Seek(prefix); bytes.HasPrefix(k, prefix); k, v = c.Next() {
if bytes.Equal(v, []byte{0}) {
cnt++
if err := c.Delete(); err != nil {
return err
}
if cnt == batchSize {
break
}
}
}
return nil
})
if err != nil {
return err
}
if cnt == 0 {
break
}
}
return nil
}
// Compact calls CompactWithBatchSize with a default batch size of 100. This is a workaround
// for github issue #374.
func (bs *Store) Compact() error {
return bs.CompactWithBatchSize(defaultCompactBatchSize)
}
func init() {
err := registry.RegisterKVStore(Name, New)
if err != nil {
panic(err)
}
}
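
A runnable sketch of the cursor prefix scan CompactWithBatchSize relies on, against a throwaway bbolt file; the bucket name and keys are placeholders:

package main

import (
	"bytes"
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	f, err := os.CreateTemp("", "bolt-*.db")
	if err != nil {
		log.Fatal(err)
	}
	f.Close()
	defer os.Remove(f.Name())

	db, err := bolt.Open(f.Name(), 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("bleve"))
		if err != nil {
			return err
		}
		_ = b.Put([]byte("dterm1"), []byte{0}) // zero-count dictionary row
		_ = b.Put([]byte("dterm2"), []byte{7})
		_ = b.Put([]byte("xother"), []byte{1}) // outside the 'd' prefix

		// scan only keys beginning with 'd', deleting zero-valued ones
		c := b.Cursor()
		prefix := []byte("d")
		for k, v := c.Seek(prefix); bytes.HasPrefix(k, prefix); k, v = c.Next() {
			if bytes.Equal(v, []byte{0}) {
				if err := c.Delete(); err != nil {
					return err
				}
				fmt.Printf("deleted %s\n", k)
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}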


@@ -0,0 +1,95 @@
// Copyright (c) 2014 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package boltdb
import (
"fmt"
store "github.com/blevesearch/upsidedown_store_api"
)
type Writer struct {
store *Store
}
func (w *Writer) NewBatch() store.KVBatch {
return store.NewEmulatedBatch(w.store.mo)
}
func (w *Writer) NewBatchEx(options store.KVBatchOptions) ([]byte, store.KVBatch, error) {
return make([]byte, options.TotalBytes), w.NewBatch(), nil
}
func (w *Writer) ExecuteBatch(batch store.KVBatch) (err error) {
emulatedBatch, ok := batch.(*store.EmulatedBatch)
if !ok {
return fmt.Errorf("wrong type of batch")
}
tx, err := w.store.db.Begin(true)
if err != nil {
return
}
// defer function to ensure that once started,
// we either Commit tx or Rollback
defer func() {
// if nothing went wrong, commit
if err == nil {
// careful to catch error here too
err = tx.Commit()
} else {
// caller should see error that caused abort,
// not success or failure of Rollback itself
_ = tx.Rollback()
}
}()
bucket := tx.Bucket([]byte(w.store.bucket))
bucket.FillPercent = w.store.fillPercent
for k, mergeOps := range emulatedBatch.Merger.Merges {
kb := []byte(k)
existingVal := bucket.Get(kb)
mergedVal, fullMergeOk := w.store.mo.FullMerge(kb, existingVal, mergeOps)
if !fullMergeOk {
err = fmt.Errorf("merge operator returned failure")
return
}
err = bucket.Put(kb, mergedVal)
if err != nil {
return
}
}
for _, op := range emulatedBatch.Ops {
if op.V != nil {
err = bucket.Put(op.K, op.V)
if err != nil {
return
}
} else {
err = bucket.Delete(op.K)
if err != nil {
return
}
}
}
return
}
func (w *Writer) Close() error {
return nil
}
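
ExecuteBatch uses a named error return so one deferred function can choose between Commit and Rollback. The shape of that idiom in isolation, with a toy tx type standing in for bolt's:

package main

import (
	"errors"
	"fmt"
)

type tx struct{}

func (tx) Commit() error   { fmt.Println("commit"); return nil }
func (tx) Rollback() error { fmt.Println("rollback"); return nil }

func run(fail bool) (err error) {
	t := tx{}
	defer func() {
		if err == nil {
			err = t.Commit() // careful to catch commit errors too
		} else {
			_ = t.Rollback() // keep the original error, not Rollback's
		}
	}()
	if fail {
		return errors.New("batch failed")
	}
	return nil
}

func main() {
	fmt.Println(run(false)) // commit, then <nil>
	fmt.Println(run(true))  // rollback, then "batch failed"
}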


@@ -0,0 +1,152 @@
// Copyright (c) 2015 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package gtreap provides an in-memory implementation of the
// KVStore interfaces using the gtreap balanced-binary treap,
// copy-on-write data structure.
package gtreap
import (
"bytes"
"sync"
"github.com/blevesearch/gtreap"
)
type Iterator struct {
t *gtreap.Treap
m sync.Mutex
cancelCh chan struct{}
nextCh chan *Item
curr *Item
currOk bool
prefix []byte
start []byte
end []byte
}
func (w *Iterator) Seek(k []byte) {
if w.start != nil && bytes.Compare(k, w.start) < 0 {
k = w.start
}
if w.prefix != nil && !bytes.HasPrefix(k, w.prefix) {
if bytes.Compare(k, w.prefix) < 0 {
k = w.prefix
} else {
var end []byte
for i := len(w.prefix) - 1; i >= 0; i-- {
c := w.prefix[i]
if c < 0xff {
end = make([]byte, i+1)
copy(end, w.prefix)
end[i] = c + 1
break
}
}
k = end
}
}
w.restart(&Item{k: k})
}
func (w *Iterator) restart(start *Item) *Iterator {
cancelCh := make(chan struct{})
nextCh := make(chan *Item, 1)
w.m.Lock()
if w.cancelCh != nil {
close(w.cancelCh)
}
w.cancelCh = cancelCh
w.nextCh = nextCh
w.curr = nil
w.currOk = false
w.m.Unlock()
go func() {
if start != nil {
w.t.VisitAscend(start, func(itm gtreap.Item) bool {
select {
case <-cancelCh:
return false
case nextCh <- itm.(*Item):
return true
}
})
}
close(nextCh)
}()
w.Next()
return w
}
func (w *Iterator) Next() {
w.m.Lock()
nextCh := w.nextCh
w.m.Unlock()
w.curr, w.currOk = <-nextCh
}
func (w *Iterator) Current() ([]byte, []byte, bool) {
w.m.Lock()
defer w.m.Unlock()
if !w.currOk || w.curr == nil {
return nil, nil, false
}
if w.prefix != nil && !bytes.HasPrefix(w.curr.k, w.prefix) {
return nil, nil, false
} else if w.end != nil && bytes.Compare(w.curr.k, w.end) >= 0 {
return nil, nil, false
}
return w.curr.k, w.curr.v, w.currOk
}
func (w *Iterator) Key() []byte {
k, _, ok := w.Current()
if !ok {
return nil
}
return k
}
func (w *Iterator) Value() []byte {
_, v, ok := w.Current()
if !ok {
return nil
}
return v
}
func (w *Iterator) Valid() bool {
_, _, ok := w.Current()
return ok
}
func (w *Iterator) Close() error {
w.m.Lock()
if w.cancelCh != nil {
close(w.cancelCh)
}
w.cancelCh = nil
w.nextCh = nil
w.curr = nil
w.currOk = false
w.m.Unlock()
return nil
}
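
The gtreap iterator converts a recursive in-order visit into a pull-style iterator by running the visit in a goroutine and handing items over a channel; a cancel channel lets Seek and Close abandon the producer without leaking the goroutine. The same shape over a sorted slice:

package main

import "fmt"

type iter struct {
	cancelCh chan struct{}
	nextCh   chan string
}

func newIter(sorted []string) *iter {
	it := &iter{
		cancelCh: make(chan struct{}),
		nextCh:   make(chan string, 1),
	}
	go func() {
		defer close(it.nextCh)
		for _, s := range sorted {
			select {
			case <-it.cancelCh: // consumer gave up; stop producing
				return
			case it.nextCh <- s:
			}
		}
	}()
	return it
}

func (it *iter) Close() { close(it.cancelCh) }

func main() {
	it := newIter([]string{"a", "b", "c", "d"})
	fmt.Println(<-it.nextCh) // a
	fmt.Println(<-it.nextCh) // b
	it.Close()               // producer goroutine exits on its next send attempt
}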

Some files were not shown because too many files have changed in this diff