Integrate BACKBEAT SDK and resolve KACHING license validation

Major integrations and fixes:
- Added BACKBEAT SDK integration for P2P operation timing
- Implemented beat-aware status tracking for distributed operations
- Added Docker secrets support for secure license management
- Resolved KACHING license validation via HTTPS/TLS
- Updated docker-compose configuration for clean stack deployment
- Disabled rollback policies to prevent deployment failures
- Added license credential storage (CHORUS-DEV-MULTI-001)

Technical improvements:
- BACKBEAT P2P operation tracking with phase management
- Enhanced configuration system with file-based secrets
- Improved error handling for license validation
- Clean separation of KACHING and CHORUS deployment stacks

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
anthonyrawlins
2025-09-06 07:56:26 +10:00
parent 543ab216f9
commit 9bdcbe0447
4730 changed files with 1480093 additions and 1916 deletions

12
vendor/github.com/blevesearch/zapx/v16/.gitignore generated vendored Normal file

@@ -0,0 +1,12 @@
#*
*.sublime-*
*~
.#*
.project
.settings
**/.idea/
**/*.iml
.DS_Store
/cmd/zap/zap
*.test
tags

28
vendor/github.com/blevesearch/zapx/v16/.golangci.yml generated vendored Normal file

@@ -0,0 +1,28 @@
linters:
# please, do not use `enable-all`: it's deprecated and will be removed soon.
# inverted configuration with `enable-all` and `disable` is not scalable during updates of golangci-lint
disable-all: true
enable:
- bodyclose
- deadcode
- depguard
- dupl
- errcheck
- gofmt
- goimports
- goprintffuncname
- gosec
- gosimple
- govet
- ineffassign
- misspell
- nakedret
- nolintlint
- rowserrcheck
- staticcheck
- structcheck
- typecheck
- unused
- varcheck
- whitespace

202
vendor/github.com/blevesearch/zapx/v16/LICENSE generated vendored Normal file

@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

163
vendor/github.com/blevesearch/zapx/v16/README.md generated vendored Normal file

@@ -0,0 +1,163 @@
# zapx file format
The zapx module is a fork of the [zap](https://github.com/blevesearch/zap) module. It maintains file format compatibility but removes the dependency on bleve, and instead depends only on the independent interface modules:
- [bleve_index_api](https://github.com/blevesearch/bleve_index_api)
- [scorch_segment_api](https://github.com/blevesearch/scorch_segment_api)
Advanced ZAP File Format Documentation is [here](zap.md).
The file is written in the reverse order that we typically access data. This helps us write in one pass since later sections of the file require file offsets of things we've already written.
Current usage:
- mmap the entire file
- crc-32 bytes and version are in fixed position at end of the file
- reading remainder of footer could be version specific
- remainder of footer gives us:
  - 3 important offsets (docValue, fields index and stored data index)
  - 2 important values (number of docs and chunk factor)
- field data is processed once and memoized onto the heap so that we never have to go back to disk for it
- access to stored data by doc number means first navigating to the stored data index, then accessing a fixed position offset into that slice, which gives us the actual address of the data. the first bytes of that section tell us the size of data so that we know where it ends.
- access to all other indexed data follows the following pattern:
  - first know the field name -> convert to id
  - next navigate to term dictionary for that field
    - some operations stop here and do dictionary ops
  - next use dictionary to navigate to posting list for a specific term
  - walk posting list
  - if necessary, walk posting details as we go
  - if location info is desired, consult location bitmap to see if it is there
## stored fields section
- for each document
  - preparation phase:
    - produce a slice of metadata bytes and data bytes
    - produce these slices in field id order
    - field value is appended to the data slice
    - metadata slice is varint encoded with the following values for each field value
      - field id (uint16)
      - field type (byte)
      - field value start offset in uncompressed data slice (uint64)
      - field value length (uint64)
      - field number of array positions (uint64)
      - one additional value for each array position (uint64)
    - compress the data slice using snappy
  - file writing phase:
    - remember the start offset for this document
    - write out meta data length (varint uint64)
    - write out compressed data length (varint uint64)
    - write out the metadata bytes
    - write out the compressed data bytes
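
The metadata encoding described in the preparation phase can be sketched roughly as follows. This is an illustration only, not the code zapx ships: `encodeStoredField` and its parameters are hypothetical names, it uses only the standard library's varint helpers, and it leaves out the snappy compression step.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeStoredField is a hypothetical helper. It appends one field value to
// data and the matching varint-encoded metadata to meta, in the order listed
// above: field id, field type, start offset, length, number of array
// positions, then one value per array position.
func encodeStoredField(meta, data []byte, fieldID uint16, typ byte,
	value []byte, arrayPos []uint64) ([]byte, []byte) {
	start := uint64(len(data)) // offset into the uncompressed data slice
	meta = binary.AppendUvarint(meta, uint64(fieldID))
	meta = binary.AppendUvarint(meta, uint64(typ))
	meta = binary.AppendUvarint(meta, start)
	meta = binary.AppendUvarint(meta, uint64(len(value)))
	meta = binary.AppendUvarint(meta, uint64(len(arrayPos)))
	for _, pos := range arrayPos {
		meta = binary.AppendUvarint(meta, pos)
	}
	data = append(data, value...)
	return meta, data
}

func main() {
	var meta, data []byte
	// Fields are added in field id order, as the format requires.
	meta, data = encodeStoredField(meta, data, 0, 't', []byte("hello"), nil)
	meta, data = encodeStoredField(meta, data, 1, 't', []byte("world!"), []uint64{2})
	fmt.Printf("meta=%d bytes, data=%d bytes\n", len(meta), len(data))
}
```

In a real writer the data slice would be compressed before being written, and the two lengths written to the file would be those of the metadata bytes and of the compressed data.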
## stored fields idx
- for each document
  - write start offset (remembered from previous section) of stored data (big endian uint64)
With this index and a known document number, we have direct access to all the stored field data.
## posting details (freq/norm) section
- for each posting list
  - produce a slice containing multiple consecutive chunks (each chunk is varint stream)
  - produce a slice remembering offsets of where each chunk starts
  - preparation phase:
    - for each hit in the posting list
    - if this hit is in next chunk close out encoding of last chunk and record offset start of next
    - encode term frequency (uint64)
    - encode norm factor (float32)
  - file writing phase:
    - remember start position for this posting list details
    - write out number of chunks that follow (varint uint64)
    - write out length of each chunk (each a varint uint64)
    - write out the byte slice containing all the chunk data
If you know the doc number you're interested in, this format lets you jump to the correct chunk (docNum/chunkFactor) directly and then seek within that chunk until you find it.
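
The jump to the correct chunk can be sketched as below, assuming the per-chunk lengths have already been read from the file. `chunkBounds` is a hypothetical helper, not part of zapx; it only shows the `docNum/chunkFactor` arithmetic and how chunk lengths turn into a byte range.

```go
package main

import "fmt"

// chunkBounds returns the byte range, relative to the start of the chunk
// data, of the chunk that holds docNum. Hypothetical helper for illustration.
func chunkBounds(docNum, chunkFactor uint64, chunkLens []uint64) (start, end uint64, err error) {
	if chunkFactor == 0 {
		return 0, 0, fmt.Errorf("chunk factor is zero")
	}
	chunk := docNum / chunkFactor
	if chunk >= uint64(len(chunkLens)) {
		return 0, 0, fmt.Errorf("doc %d is beyond the last chunk", docNum)
	}
	// Sum the lengths of all earlier chunks to find where this one starts.
	for i := uint64(0); i < chunk; i++ {
		start += chunkLens[i]
	}
	return start, start + chunkLens[chunk], nil
}

func main() {
	// Doc 2500 with a chunk factor of 1024 lives in chunk 2.
	start, end, err := chunkBounds(2500, 1024, []uint64{40, 10, 25})
	fmt.Println(start, end, err)
}
```

Once the range is known, a reader still has to walk the varint stream inside that chunk until it reaches the wanted document.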
## posting details (location) section
- for each posting list
  - produce a slice containing multiple consecutive chunks (each chunk is varint stream)
  - produce a slice remembering offsets of where each chunk starts
  - preparation phase:
    - for each hit in the posting list
    - if this hit is in next chunk close out encoding of last chunk and record offset start of next
    - encode field (uint16)
    - encode field pos (uint64)
    - encode field start (uint64)
    - encode field end (uint64)
    - encode number of array positions to follow (uint64)
    - encode each array position (each uint64)
  - file writing phase:
    - remember start position for this posting list details
    - write out number of chunks that follow (varint uint64)
    - write out length of each chunk (each a varint uint64)
    - write out the byte slice containing all the chunk data
If you know the doc number you're interested in, this format lets you jump to the correct chunk (docNum/chunkFactor) directly and then seek within that chunk until you find it.
## postings list section
- for each posting list
  - preparation phase:
    - encode roaring bitmap posting list to bytes (so we know the length)
  - file writing phase:
    - remember the start position for this posting list
    - write freq/norm details offset (remembered from previous, as varint uint64)
    - write location details offset (remembered from previous, as varint uint64)
    - write length of encoded roaring bitmap
    - write the serialized roaring bitmap data
## dictionary
- for each field
  - preparation phase:
    - encode vellum FST with dictionary data pointing to file offset of posting list (remembered from previous)
  - file writing phase:
    - remember the start position of this persistDictionary
    - write length of vellum data (varint uint64)
    - write out vellum data
## fields section
- for each field
  - file writing phase:
    - remember start offset for each field
    - write dictionary address (remembered from previous) (varint uint64)
    - write length of field name (varint uint64)
    - write field name bytes
## fields idx
- for each field
  - file writing phase:
    - write big endian uint64 of start offset for each field
NOTE: currently we don't know or record the length of this fields index. Instead we rely on the fact that we know it immediately precedes a footer of known size.
## fields DocValue
- for each field
  - preparation phase:
    - produce a slice containing multiple consecutive chunks, where each chunk is composed of a meta section followed by compressed columnar field data
    - produce a slice remembering the length of each chunk
  - file writing phase:
    - remember the start position of this first field DocValue offset in the footer
    - write out number of chunks that follow (varint uint64)
    - write out length of each chunk (each a varint uint64)
    - write out the byte slice containing all the chunk data
NOTE: currently the meta header inside each chunk gives the location offsets and size of the data pertaining to a given docID, and any
read operation leverages that meta information to extract the document-specific data from the file.
## footer
- file writing phase
  - write number of docs (big endian uint64)
  - write stored field index location (big endian uint64)
  - write field index location (big endian uint64)
  - write field docValue location (big endian uint64)
  - write out chunk factor (big endian uint32)
  - write out version (big endian uint32)
  - write out file CRC of everything preceding this (big endian uint32)

195
vendor/github.com/blevesearch/zapx/v16/build.go generated vendored Normal file

@@ -0,0 +1,195 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bufio"
"fmt"
"io"
"math"
"os"
"github.com/blevesearch/vellum"
)
const Version uint32 = 16
const IndexSectionsVersion uint32 = 16
const Type string = "zap"
const fieldNotUninverted = math.MaxUint64
func (sb *SegmentBase) Persist(path string) error {
return PersistSegmentBase(sb, path)
}
// WriteTo is an implementation of io.WriterTo interface.
func (sb *SegmentBase) WriteTo(w io.Writer) (int64, error) {
if w == nil {
return 0, fmt.Errorf("invalid writer found")
}
n, err := persistSegmentBaseToWriter(sb, w)
return int64(n), err
}
// PersistSegmentBase persists SegmentBase in the zap file format.
func PersistSegmentBase(sb *SegmentBase, path string) error {
flag := os.O_RDWR | os.O_CREATE
f, err := os.OpenFile(path, flag, 0600)
if err != nil {
return err
}
cleanup := func() {
_ = f.Close()
_ = os.Remove(path)
}
_, err = persistSegmentBaseToWriter(sb, f)
if err != nil {
cleanup()
return err
}
err = f.Sync()
if err != nil {
cleanup()
return err
}
err = f.Close()
if err != nil {
cleanup()
return err
}
return err
}
type bufWriter struct {
w *bufio.Writer
n int
}
func (br *bufWriter) Write(in []byte) (int, error) {
n, err := br.w.Write(in)
br.n += n
return n, err
}
func persistSegmentBaseToWriter(sb *SegmentBase, w io.Writer) (int, error) {
br := &bufWriter{w: bufio.NewWriter(w)}
_, err := br.Write(sb.mem)
if err != nil {
return 0, err
}
err = persistFooter(sb.numDocs, sb.storedIndexOffset, sb.fieldsIndexOffset, sb.sectionsIndexOffset,
sb.docValueOffset, sb.chunkMode, sb.memCRC, br)
if err != nil {
return 0, err
}
err = br.w.Flush()
if err != nil {
return 0, err
}
return br.n, nil
}
func persistStoredFieldValues(fieldID int,
storedFieldValues [][]byte, stf []byte, spf [][]uint64,
curr int, metaEncode varintEncoder, data []byte) (
int, []byte, error) {
for i := 0; i < len(storedFieldValues); i++ {
// encode field
_, err := metaEncode(uint64(fieldID))
if err != nil {
return 0, nil, err
}
// encode type
_, err = metaEncode(uint64(stf[i]))
if err != nil {
return 0, nil, err
}
// encode start offset
_, err = metaEncode(uint64(curr))
if err != nil {
return 0, nil, err
}
// encode length
_, err = metaEncode(uint64(len(storedFieldValues[i])))
if err != nil {
return 0, nil, err
}
// encode number of array pos
_, err = metaEncode(uint64(len(spf[i])))
if err != nil {
return 0, nil, err
}
// encode all array positions
for _, pos := range spf[i] {
_, err = metaEncode(pos)
if err != nil {
return 0, nil, err
}
}
data = append(data, storedFieldValues[i]...)
curr += len(storedFieldValues[i])
}
return curr, data, nil
}
func InitSegmentBase(mem []byte, memCRC uint32, chunkMode uint32, numDocs uint64,
storedIndexOffset uint64, sectionsIndexOffset uint64) (*SegmentBase, error) {
sb := &SegmentBase{
mem: mem,
memCRC: memCRC,
chunkMode: chunkMode,
numDocs: numDocs,
storedIndexOffset: storedIndexOffset,
fieldsIndexOffset: sectionsIndexOffset,
sectionsIndexOffset: sectionsIndexOffset,
fieldDvReaders: make([]map[uint16]*docValueReader, len(segmentSections)),
docValueOffset: 0, // docValueOffsets identified automatically by the section
fieldFSTs: make(map[uint16]*vellum.FST),
vecIndexCache: newVectorIndexCache(),
synIndexCache: newSynonymIndexCache(),
// following fields gets populated by loadFieldsNew
fieldsMap: make(map[string]uint16),
dictLocs: make([]uint64, 0),
fieldsInv: make([]string, 0),
}
sb.updateSize()
// load the data/section starting offsets for each field,
// using the sectionsIndexOffset as the starting point.
err := sb.loadFieldsNew()
if err != nil {
return nil, err
}
err = sb.loadDvReaders()
if err != nil {
return nil, err
}
return sb, nil
}

84
vendor/github.com/blevesearch/zapx/v16/chunk.go generated vendored Normal file

@@ -0,0 +1,84 @@
// Copyright (c) 2019 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"errors"
"fmt"
)
// LegacyChunkMode was the original chunk mode (always chunk size 1024)
// this mode is still used for chunking doc values.
var LegacyChunkMode uint32 = 1024
// DefaultChunkMode is the most recent improvement to chunking and should
// be used by default.
var DefaultChunkMode uint32 = 1026
var ErrChunkSizeZero = errors.New("chunk size is zero")
// getChunkSize returns the chunk size for the given chunkMode, cardinality, and
// maxDocs.
//
// In error cases, the returned chunk size will be 0. Caller can differentiate
// between a valid chunk size of 0 and an error by checking for ErrChunkSizeZero.
func getChunkSize(chunkMode uint32, cardinality uint64, maxDocs uint64) (uint64, error) {
switch {
case chunkMode == 0:
return 0, ErrChunkSizeZero
// any chunkMode <= 1024 will always chunk with chunkSize=chunkMode
case chunkMode <= 1024:
// legacy chunk size
return uint64(chunkMode), nil
case chunkMode == 1025:
// attempt at simple improvement
// theory - the point of chunking is to put a bound on the maximum number of
// calls to Next() needed to find a random document. ie, you should be able
// to do one jump to the correct chunk, and then walk through at most
// chunk-size items
// previously 1024 was chosen as the chunk size, but this is particularly
// wasteful for low cardinality terms. the observation is that if there
// are less than 1024 items, why not put them all in one chunk,
// this way you'll still achieve the same goal of visiting at most
// chunk-size items.
// no attempt is made to tweak any other case
if cardinality <= 1024 {
if maxDocs == 0 {
return 0, ErrChunkSizeZero
}
return maxDocs, nil
}
return 1024, nil
case chunkMode == 1026:
// improve upon the ideas tested in chunkMode 1025
// the observation that the fewest number of dense chunks is the most
// desirable layout, given the built-in assumptions of chunking
// (that we want to put an upper-bound on the number of items you must
// walk over without skipping, currently tuned to 1024)
//
// 1. compute the number of chunks needed (max 1024/chunk)
// 2. convert to chunkSize, dividing into maxDocs
numChunks := (cardinality / 1024) + 1
chunkSize := maxDocs / numChunks
if chunkSize == 0 {
return 0, ErrChunkSizeZero
}
return chunkSize, nil
}
return 0, fmt.Errorf("unknown chunk mode %d", chunkMode)
}

256
vendor/github.com/blevesearch/zapx/v16/contentcoder.go generated vendored Normal file

@@ -0,0 +1,256 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"io"
"reflect"
"github.com/golang/snappy"
)
var reflectStaticSizeMetaData int
func init() {
var md MetaData
reflectStaticSizeMetaData = int(reflect.TypeOf(md).Size())
}
var (
termSeparator byte = 0xff
termSeparatorSplitSlice = []byte{termSeparator}
)
type chunkedContentCoder struct {
bytesWritten uint64 // moved to top to correct alignment issues on ARM, 386 and 32-bit MIPS.
final []byte
chunkSize uint64
currChunk uint64
chunkLens []uint64
compressed []byte // temp buf for snappy compression
w io.Writer
progressiveWrite bool
chunkMeta []MetaData
chunkMetaBuf bytes.Buffer
chunkBuf bytes.Buffer
}
// MetaData represents the data information inside a
// chunk.
type MetaData struct {
DocNum uint64 // docNum of the data inside the chunk
DocDvOffset uint64 // offset of data inside the chunk for the given docid
}
// newChunkedContentCoder returns a new chunk content coder which
// packs data into chunks based on the provided chunkSize
func newChunkedContentCoder(chunkSize uint64, maxDocNum uint64,
w io.Writer, progressiveWrite bool,
) *chunkedContentCoder {
total := maxDocNum/chunkSize + 1
rv := &chunkedContentCoder{
chunkSize: chunkSize,
chunkLens: make([]uint64, total),
chunkMeta: make([]MetaData, 0, total),
w: w,
progressiveWrite: progressiveWrite,
}
return rv
}
// Reset lets you reuse this chunked content coder. Buffers are reset
// and reused. You cannot change the chunk size.
func (c *chunkedContentCoder) Reset() {
c.currChunk = 0
c.bytesWritten = 0
c.final = c.final[:0]
c.chunkBuf.Reset()
c.chunkMetaBuf.Reset()
for i := range c.chunkLens {
c.chunkLens[i] = 0
}
c.chunkMeta = c.chunkMeta[:0]
}
func (c *chunkedContentCoder) SetChunkSize(chunkSize uint64, maxDocNum uint64) {
total := int(maxDocNum/chunkSize + 1)
c.chunkSize = chunkSize
if cap(c.chunkLens) < total {
c.chunkLens = make([]uint64, total)
} else {
c.chunkLens = c.chunkLens[:total]
}
if cap(c.chunkMeta) < total {
c.chunkMeta = make([]MetaData, 0, total)
}
}
// Close indicates you are done calling Add(). This allows
// the final chunk to be encoded.
func (c *chunkedContentCoder) Close() error {
return c.flushContents()
}
func (c *chunkedContentCoder) incrementBytesWritten(val uint64) {
c.bytesWritten += val
}
func (c *chunkedContentCoder) getBytesWritten() uint64 {
return c.bytesWritten
}
func (c *chunkedContentCoder) flushContents() error {
// flush the contents, with meta information at first
buf := make([]byte, binary.MaxVarintLen64)
n := binary.PutUvarint(buf, uint64(len(c.chunkMeta)))
_, err := c.chunkMetaBuf.Write(buf[:n])
if err != nil {
return err
}
// write out the metaData slice
for _, meta := range c.chunkMeta {
_, err := writeUvarints(&c.chunkMetaBuf, meta.DocNum, meta.DocDvOffset)
if err != nil {
return err
}
}
// write the metadata to final data
metaData := c.chunkMetaBuf.Bytes()
c.final = append(c.final, c.chunkMetaBuf.Bytes()...)
// write the compressed data to the final data
c.compressed = snappy.Encode(c.compressed[:cap(c.compressed)], c.chunkBuf.Bytes())
c.incrementBytesWritten(uint64(len(c.compressed)))
c.final = append(c.final, c.compressed...)
c.chunkLens[c.currChunk] = uint64(len(c.compressed) + len(metaData))
if c.progressiveWrite {
_, err := c.w.Write(c.final)
if err != nil {
return err
}
c.final = c.final[:0]
}
return nil
}
// Add encodes the provided byte slice into the correct chunk for the provided
// doc num. You MUST call Add() with increasing docNums.
func (c *chunkedContentCoder) Add(docNum uint64, vals []byte) error {
chunk := docNum / c.chunkSize
if chunk != c.currChunk {
// flush out the previous chunk details
err := c.flushContents()
if err != nil {
return err
}
// clearing the chunk specific meta for next chunk
c.chunkBuf.Reset()
c.chunkMetaBuf.Reset()
c.chunkMeta = c.chunkMeta[:0]
c.currChunk = chunk
}
// get the starting offset for this doc
dvOffset := c.chunkBuf.Len()
dvSize, err := c.chunkBuf.Write(vals)
if err != nil {
return err
}
c.chunkMeta = append(c.chunkMeta, MetaData{
DocNum: docNum,
DocDvOffset: uint64(dvOffset + dvSize),
})
return nil
}
// Write commits all the encoded chunked contents to the provided writer.
//
// | ..... data ..... | chunk offsets (varints)
// | position of chunk offsets (uint64) | number of offsets (uint64) |
func (c *chunkedContentCoder) Write() (int, error) {
var tw int
if c.final != nil {
// write out the data section first
nw, err := c.w.Write(c.final)
tw += nw
if err != nil {
return tw, err
}
}
chunkOffsetsStart := uint64(tw)
if cap(c.final) < binary.MaxVarintLen64 {
c.final = make([]byte, binary.MaxVarintLen64)
} else {
c.final = c.final[0:binary.MaxVarintLen64]
}
chunkOffsets := modifyLengthsToEndOffsets(c.chunkLens)
// write out the chunk offsets
for _, chunkOffset := range chunkOffsets {
n := binary.PutUvarint(c.final, chunkOffset)
nw, err := c.w.Write(c.final[:n])
tw += nw
if err != nil {
return tw, err
}
}
chunkOffsetsLen := uint64(tw) - chunkOffsetsStart
c.final = c.final[0:8]
// write out the length of chunk offsets
binary.BigEndian.PutUint64(c.final, chunkOffsetsLen)
nw, err := c.w.Write(c.final)
tw += nw
if err != nil {
return tw, err
}
// write out the number of chunks
binary.BigEndian.PutUint64(c.final, uint64(len(c.chunkLens)))
nw, err = c.w.Write(c.final)
tw += nw
if err != nil {
return tw, err
}
c.final = c.final[:0]
return tw, nil
}
// ReadDocValueBoundary returns the start and end offsets of the doc values
// for the entry at index chunk in a metaData header slice
func ReadDocValueBoundary(chunk int, metaHeaders []MetaData) (uint64, uint64) {
var start uint64
if chunk > 0 {
start = metaHeaders[chunk-1].DocDvOffset
}
return start, metaHeaders[chunk].DocDvOffset
}
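The footer that Write emits (chunk end offsets as uvarints, then two big-endian uint64 values) can be illustrated in isolation. The following is a minimal sketch, not the zapx implementation: `encodeFooter` and `decodeFooter` are hypothetical helper names, and the end offsets are assumed to be already cumulative, which is what `modifyLengthsToEndOffsets` appears to produce.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeFooter (hypothetical helper) appends the chunk end offsets as
// uvarints after the data, followed by the byte length of that offsets
// section and the number of chunks, both as big-endian uint64 values.
func encodeFooter(data []byte, endOffsets []uint64) []byte {
	var buf bytes.Buffer
	buf.Write(data)
	start := buf.Len()
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, off := range endOffsets {
		n := binary.PutUvarint(tmp, off)
		buf.Write(tmp[:n])
	}
	offsetsLen := uint64(buf.Len() - start)
	binary.BigEndian.PutUint64(tmp[:8], offsetsLen)
	buf.Write(tmp[:8])
	binary.BigEndian.PutUint64(tmp[:8], uint64(len(endOffsets)))
	buf.Write(tmp[:8])
	return buf.Bytes()
}

// decodeFooter (hypothetical helper) reads the trailing 16 bytes first,
// then walks backwards to the start of the offsets section and decodes
// one uvarint per chunk.
func decodeFooter(mem []byte) ([]uint64, error) {
	if len(mem) < 16 {
		return nil, fmt.Errorf("section too small: %d bytes", len(mem))
	}
	end := uint64(len(mem))
	numChunks := binary.BigEndian.Uint64(mem[end-8:])
	offsetsLen := binary.BigEndian.Uint64(mem[end-16 : end-8])
	// every uvarint takes at least one byte, so numChunks cannot exceed offsetsLen
	if offsetsLen > end-16 || numChunks > offsetsLen {
		return nil, fmt.Errorf("corrupted footer")
	}
	pos := end - 16 - offsetsLen
	offsets := make([]uint64, 0, numChunks)
	for i := uint64(0); i < numChunks; i++ {
		v, n := binary.Uvarint(mem[pos : end-16])
		if n <= 0 {
			return nil, fmt.Errorf("corrupted chunk offset %d", i)
		}
		offsets = append(offsets, v)
		pos += uint64(n)
	}
	return offsets, nil
}

func main() {
	mem := encodeFooter([]byte("abc"), []uint64{5, 5, 300})
	offsets, err := decodeFooter(mem)
	fmt.Println(offsets, err)
}
```

Storing the length of the offsets section rather than its absolute position lets a reader locate the offsets from the end of the field's region alone, which is how loadFieldDocValueReader computes chunkOffsetsPosition.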

61
vendor/github.com/blevesearch/zapx/v16/count.go generated vendored Normal file

@@ -0,0 +1,61 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"hash/crc32"
"io"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
// CountHashWriter is a wrapper around a Writer which counts the number of
// bytes which have been written and computes a crc32 hash
type CountHashWriter struct {
w io.Writer
crc uint32
n int
s segment.StatsReporter
}
// NewCountHashWriter returns a CountHashWriter which wraps the provided Writer
func NewCountHashWriter(w io.Writer) *CountHashWriter {
return &CountHashWriter{w: w}
}
func NewCountHashWriterWithStatsReporter(w io.Writer, s segment.StatsReporter) *CountHashWriter {
return &CountHashWriter{w: w, s: s}
}
// Write writes the provided bytes to the wrapped writer and counts the bytes
func (c *CountHashWriter) Write(b []byte) (int, error) {
n, err := c.w.Write(b)
c.crc = crc32.Update(c.crc, crc32.IEEETable, b[:n])
c.n += n
if c.s != nil {
c.s.ReportBytesWritten(uint64(n))
}
return n, err
}
// Count returns the number of bytes written
func (c *CountHashWriter) Count() int {
return c.n
}
// Sum32 returns the CRC-32 hash of the content written to this writer
func (c *CountHashWriter) Sum32() uint32 {
return c.crc
}

188
vendor/github.com/blevesearch/zapx/v16/dict.go generated vendored Normal file

@@ -0,0 +1,188 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"fmt"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
"github.com/blevesearch/vellum"
)
// Dictionary is the zap representation of the term dictionary
type Dictionary struct {
sb *SegmentBase
field string
fieldID uint16
fst *vellum.FST
fstReader *vellum.Reader
bytesRead uint64
}
// represents an immutable, empty dictionary
var emptyDictionary = &Dictionary{}
func (d *Dictionary) Cardinality() int {
if d.fst != nil {
return d.fst.Len()
}
return 0
}
// PostingsList returns the postings list for the specified term
func (d *Dictionary) PostingsList(term []byte, except *roaring.Bitmap,
prealloc segment.PostingsList) (segment.PostingsList, error) {
var preallocPL *PostingsList
pl, ok := prealloc.(*PostingsList)
if ok && pl != nil {
preallocPL = pl
}
return d.postingsList(term, except, preallocPL)
}
func (d *Dictionary) postingsList(term []byte, except *roaring.Bitmap, rv *PostingsList) (*PostingsList, error) {
if d.fstReader == nil {
if rv == nil || rv == emptyPostingsList {
return emptyPostingsList, nil
}
return d.postingsListInit(rv, except), nil
}
postingsOffset, exists, err := d.fstReader.Get(term)
if err != nil {
return nil, fmt.Errorf("vellum err: %v", err)
}
if !exists {
if rv == nil || rv == emptyPostingsList {
return emptyPostingsList, nil
}
return d.postingsListInit(rv, except), nil
}
return d.postingsListFromOffset(postingsOffset, except, rv)
}
func (d *Dictionary) postingsListFromOffset(postingsOffset uint64, except *roaring.Bitmap, rv *PostingsList) (*PostingsList, error) {
rv = d.postingsListInit(rv, except)
err := rv.read(postingsOffset, d)
if err != nil {
return nil, err
}
return rv, nil
}
func (d *Dictionary) postingsListInit(rv *PostingsList, except *roaring.Bitmap) *PostingsList {
if rv == nil || rv == emptyPostingsList {
rv = &PostingsList{}
} else {
postings := rv.postings
if postings != nil {
postings.Clear()
}
*rv = PostingsList{} // clear the struct
rv.postings = postings
}
rv.sb = d.sb
rv.except = except
return rv
}
func (d *Dictionary) Contains(key []byte) (bool, error) {
if d.fst != nil {
return d.fst.Contains(key)
}
return false, nil
}
// AutomatonIterator returns an iterator which only visits terms
// matching the vellum automaton and within the start/end key range
func (d *Dictionary) AutomatonIterator(a segment.Automaton,
startKeyInclusive, endKeyExclusive []byte) segment.DictionaryIterator {
if d.fst != nil {
rv := &DictionaryIterator{
d: d,
}
itr, err := d.fst.Search(a, startKeyInclusive, endKeyExclusive)
if err == nil {
rv.itr = itr
} else if err != vellum.ErrIteratorDone {
rv.err = err
}
return rv
}
return emptyDictionaryIterator
}
func (d *Dictionary) incrementBytesRead(val uint64) {
d.bytesRead += val
}
func (d *Dictionary) BytesRead() uint64 {
return d.bytesRead
}
func (d *Dictionary) ResetBytesRead(val uint64) {
d.bytesRead = val
}
func (d *Dictionary) BytesWritten() uint64 {
return 0
}
// DictionaryIterator is an iterator for term dictionary
type DictionaryIterator struct {
d *Dictionary
itr vellum.Iterator
err error
tmp PostingsList
entry index.DictEntry
omitCount bool
}
var emptyDictionaryIterator = &DictionaryIterator{}
// Next returns the next entry in the dictionary
func (i *DictionaryIterator) Next() (*index.DictEntry, error) {
if i.err != nil && i.err != vellum.ErrIteratorDone {
return nil, i.err
} else if i.itr == nil || i.err == vellum.ErrIteratorDone {
return nil, nil
}
term, postingsOffset := i.itr.Current()
if fitr, ok := i.itr.(vellum.FuzzyIterator); ok {
i.entry.EditDistance = fitr.EditDistance()
}
i.entry.Term = string(term)
if !i.omitCount {
i.err = i.tmp.read(postingsOffset, i.d)
if i.err != nil {
return nil, i.err
}
i.entry.Count = i.tmp.Count()
}
i.err = i.itr.Next()
return &i.entry, nil
}

354
vendor/github.com/blevesearch/zapx/v16/docvalues.go generated vendored Normal file

@@ -0,0 +1,354 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"fmt"
"math"
"reflect"
"sort"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
"github.com/golang/snappy"
)
var reflectStaticSizedocValueReader int
func init() {
var dvi docValueReader
reflectStaticSizedocValueReader = int(reflect.TypeOf(dvi).Size())
}
type docNumTermsVisitor func(docNum uint64, terms []byte) error
type docVisitState struct {
dvrs map[uint16]*docValueReader
segment *SegmentBase
bytesRead uint64
}
// docVisitState implements the segment.DiskStatsReporter interface so that
// the bytes read from disk for doc values during a query can be tracked.
// loadDvChunk retrieves the next chunk of doc values, and the bytes it
// reads off the disk are accounted for as well.
func (d *docVisitState) incrementBytesRead(val uint64) {
d.bytesRead += val
}
func (d *docVisitState) BytesRead() uint64 {
return d.bytesRead
}
func (d *docVisitState) BytesWritten() uint64 {
return 0
}
func (d *docVisitState) ResetBytesRead(val uint64) {
d.bytesRead = val
}
type docValueReader struct {
field string
curChunkNum uint64
chunkOffsets []uint64
dvDataLoc uint64
curChunkHeader []MetaData
curChunkData []byte // compressed data cache
uncompressed []byte // temp buf for snappy decompression
bytesRead uint64
}
func (di *docValueReader) size() int {
return reflectStaticSizedocValueReader + SizeOfPtr +
len(di.field) +
len(di.chunkOffsets)*SizeOfUint64 +
len(di.curChunkHeader)*reflectStaticSizeMetaData +
len(di.curChunkData)
}
func (di *docValueReader) cloneInto(rv *docValueReader) *docValueReader {
if rv == nil {
rv = &docValueReader{}
}
rv.field = di.field
rv.curChunkNum = math.MaxUint64
rv.chunkOffsets = di.chunkOffsets // immutable, so it's sharable
rv.dvDataLoc = di.dvDataLoc
rv.curChunkHeader = rv.curChunkHeader[:0]
rv.curChunkData = nil
rv.uncompressed = rv.uncompressed[:0]
return rv
}
func (di *docValueReader) curChunkNumber() uint64 {
return di.curChunkNum
}
func (sb *SegmentBase) loadFieldDocValueReader(field string,
fieldDvLocStart, fieldDvLocEnd uint64) (*docValueReader, error) {
// get the docValue offset for the given fields
if fieldDvLocStart == fieldNotUninverted {
// no docValues found, nothing to do
return nil, nil
}
// read the number of chunks, and chunk offsets position
var numChunks, chunkOffsetsPosition uint64
if fieldDvLocEnd-fieldDvLocStart > 16 {
numChunks = binary.BigEndian.Uint64(sb.mem[fieldDvLocEnd-8 : fieldDvLocEnd])
// read the length of chunk offsets
chunkOffsetsLen := binary.BigEndian.Uint64(sb.mem[fieldDvLocEnd-16 : fieldDvLocEnd-8])
// acquire position of chunk offsets
chunkOffsetsPosition = (fieldDvLocEnd - 16) - chunkOffsetsLen
// 16 bytes: 8 for the length of the chunk offsets and
// 8 for the number of chunks
sb.incrementBytesRead(16)
} else {
return nil, fmt.Errorf("loadFieldDocValueReader: fieldDvLoc too small: %d-%d", fieldDvLocEnd, fieldDvLocStart)
}
fdvIter := &docValueReader{
curChunkNum: math.MaxUint64,
field: field,
chunkOffsets: make([]uint64, int(numChunks)),
}
// read the chunk offsets
var offset uint64
for i := 0; i < int(numChunks); i++ {
loc, read := binary.Uvarint(sb.mem[chunkOffsetsPosition+offset : chunkOffsetsPosition+offset+binary.MaxVarintLen64])
if read <= 0 {
return nil, fmt.Errorf("corrupted chunk offset during segment load")
}
fdvIter.chunkOffsets[i] = loc
offset += uint64(read)
}
sb.incrementBytesRead(offset)
// set the data offset
fdvIter.dvDataLoc = fieldDvLocStart
return fdvIter, nil
}
func (d *docValueReader) getBytesRead() uint64 {
return d.bytesRead
}
func (d *docValueReader) incrementBytesRead(val uint64) {
d.bytesRead += val
}
func (di *docValueReader) loadDvChunk(chunkNumber uint64, s *SegmentBase) error {
// advance to the chunk where the docValues
// reside for the given docNum
destChunkDataLoc, curChunkEnd := di.dvDataLoc, di.dvDataLoc
start, end := readChunkBoundary(int(chunkNumber), di.chunkOffsets)
if start >= end {
di.curChunkHeader = di.curChunkHeader[:0]
di.curChunkData = nil
di.curChunkNum = chunkNumber
di.uncompressed = di.uncompressed[:0]
return nil
}
destChunkDataLoc += start
curChunkEnd += end
// read the number of docs residing in the chunk
numDocs, read := binary.Uvarint(s.mem[destChunkDataLoc : destChunkDataLoc+binary.MaxVarintLen64])
if read <= 0 {
return fmt.Errorf("failed to read the chunk")
}
chunkMetaLoc := destChunkDataLoc + uint64(read)
di.incrementBytesRead(uint64(read))
offset := uint64(0)
if cap(di.curChunkHeader) < int(numDocs) {
di.curChunkHeader = make([]MetaData, int(numDocs))
} else {
di.curChunkHeader = di.curChunkHeader[:int(numDocs)]
}
for i := 0; i < int(numDocs); i++ {
di.curChunkHeader[i].DocNum, read = binary.Uvarint(s.mem[chunkMetaLoc+offset : chunkMetaLoc+offset+binary.MaxVarintLen64])
offset += uint64(read)
di.curChunkHeader[i].DocDvOffset, read = binary.Uvarint(s.mem[chunkMetaLoc+offset : chunkMetaLoc+offset+binary.MaxVarintLen64])
offset += uint64(read)
}
compressedDataLoc := chunkMetaLoc + offset
dataLength := curChunkEnd - compressedDataLoc
di.incrementBytesRead(uint64(dataLength + offset))
di.curChunkData = s.mem[compressedDataLoc : compressedDataLoc+dataLength]
di.curChunkNum = chunkNumber
di.uncompressed = di.uncompressed[:0]
return nil
}
func (di *docValueReader) iterateAllDocValues(s *SegmentBase, visitor docNumTermsVisitor) error {
for i := 0; i < len(di.chunkOffsets); i++ {
err := di.loadDvChunk(uint64(i), s)
if err != nil {
return err
}
if di.curChunkData == nil || len(di.curChunkHeader) == 0 {
continue
}
// uncompress the already loaded data
uncompressed, err := snappy.Decode(di.uncompressed[:cap(di.uncompressed)], di.curChunkData)
if err != nil {
return err
}
di.uncompressed = uncompressed
start := uint64(0)
for _, entry := range di.curChunkHeader {
err = visitor(entry.DocNum, uncompressed[start:entry.DocDvOffset])
if err != nil {
return err
}
start = entry.DocDvOffset
}
}
return nil
}
func (di *docValueReader) visitDocValues(docNum uint64,
visitor index.DocValueVisitor) error {
// binary search the term locations for the docNum
start, end := di.getDocValueLocs(docNum)
if start == math.MaxUint64 || end == math.MaxUint64 || start == end {
return nil
}
var uncompressed []byte
var err error
// use the uncompressed copy if available
if len(di.uncompressed) > 0 {
uncompressed = di.uncompressed
} else {
// uncompress the already loaded data
uncompressed, err = snappy.Decode(di.uncompressed[:cap(di.uncompressed)], di.curChunkData)
if err != nil {
return err
}
di.uncompressed = uncompressed
}
// pick the terms for the given docNum
uncompressed = uncompressed[start:end]
for {
i := bytes.Index(uncompressed, termSeparatorSplitSlice)
if i < 0 {
break
}
visitor(di.field, uncompressed[0:i])
uncompressed = uncompressed[i+1:]
}
return nil
}
func (di *docValueReader) getDocValueLocs(docNum uint64) (uint64, uint64) {
i := sort.Search(len(di.curChunkHeader), func(i int) bool {
return di.curChunkHeader[i].DocNum >= docNum
})
if i < len(di.curChunkHeader) && di.curChunkHeader[i].DocNum == docNum {
return ReadDocValueBoundary(i, di.curChunkHeader)
}
return math.MaxUint64, math.MaxUint64
}
// VisitDocValues is an implementation of the
// DocValueVisitable interface
func (sb *SegmentBase) VisitDocValues(localDocNum uint64, fields []string,
visitor index.DocValueVisitor, dvsIn segment.DocVisitState) (
segment.DocVisitState, error) {
dvs, ok := dvsIn.(*docVisitState)
if !ok || dvs == nil {
dvs = &docVisitState{}
} else {
if dvs.segment != sb {
dvs.segment = sb
dvs.dvrs = nil
dvs.bytesRead = 0
}
}
var fieldIDPlus1 uint16
if dvs.dvrs == nil {
dvs.dvrs = make(map[uint16]*docValueReader, len(fields))
for _, field := range fields {
if fieldIDPlus1, ok = sb.fieldsMap[field]; !ok {
continue
}
fieldID := fieldIDPlus1 - 1
if dvIter, exists := sb.fieldDvReaders[SectionInvertedTextIndex][fieldID]; exists &&
dvIter != nil {
dvs.dvrs[fieldID] = dvIter.cloneInto(dvs.dvrs[fieldID])
}
}
}
// find the chunkNumber where the docValues are stored
// NOTE: doc values continue to use legacy chunk mode
chunkFactor, err := getChunkSize(LegacyChunkMode, 0, 0)
if err != nil {
return nil, err
}
docInChunk := localDocNum / chunkFactor
var dvr *docValueReader
for _, field := range fields {
if fieldIDPlus1, ok = sb.fieldsMap[field]; !ok {
continue
}
fieldID := fieldIDPlus1 - 1
if dvr, ok = dvs.dvrs[fieldID]; ok && dvr != nil {
// check if the chunk is already loaded
if docInChunk != dvr.curChunkNumber() {
err := dvr.loadDvChunk(docInChunk, sb)
if err != nil {
return dvs, err
}
dvs.ResetBytesRead(dvr.getBytesRead())
} else {
dvs.ResetBytesRead(0)
}
_ = dvr.visitDocValues(localDocNum, visitor)
}
}
return dvs, nil
}
// VisitableDocValueFields returns the list of fields with
// persisted doc value terms that can be visited using the
// VisitDocValues method.
func (sb *SegmentBase) VisitableDocValueFields() ([]string, error) {
return sb.fieldDvNames, nil
}
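The per-document lookup performed by getDocValueLocs and ReadDocValueBoundary can be sketched on its own. This is an illustrative reduction, assuming a chunk header sorted by DocNum whose DocDvOffset values are cumulative end offsets into the uncompressed chunk; `metaData` and `docValueLocs` are hypothetical local names.

```go
package main

import (
	"fmt"
	"sort"
)

// metaData mirrors the shape of a chunk header entry: the doc number and
// the end offset of that doc's values in the uncompressed chunk data.
type metaData struct {
	DocNum      uint64
	DocDvOffset uint64
}

// docValueLocs binary-searches the header for docNum and returns the
// [start, end) byte range of its values. The start is the previous
// entry's end offset, or 0 for the first entry.
func docValueLocs(headers []metaData, docNum uint64) (start, end uint64, ok bool) {
	i := sort.Search(len(headers), func(i int) bool {
		return headers[i].DocNum >= docNum
	})
	if i >= len(headers) || headers[i].DocNum != docNum {
		return 0, 0, false
	}
	if i > 0 {
		start = headers[i-1].DocDvOffset
	}
	return start, headers[i].DocDvOffset, true
}

func main() {
	headers := []metaData{{DocNum: 3, DocDvOffset: 4}, {DocNum: 7, DocDvOffset: 4}, {DocNum: 9, DocDvOffset: 10}}
	data := []byte("aaaabbbbbb")
	if start, end, ok := docValueLocs(headers, 9); ok {
		fmt.Printf("%s\n", data[start:end])
	}
}
```

A document with no values (doc 7 in the sample header) yields start == end, which matches the early return in visitDocValues.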

138
vendor/github.com/blevesearch/zapx/v16/enumerator.go generated vendored Normal file

@@ -0,0 +1,138 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"github.com/blevesearch/vellum"
)
// enumerator provides an ordered traversal of multiple vellum
// iterators. Like JOIN of iterators, the enumerator produces a
// sequence of (key, iteratorIndex, value) tuples, sorted by key ASC,
// then iteratorIndex ASC, where the same key might be seen or
// repeated across multiple child iterators.
type enumerator struct {
itrs []vellum.Iterator
currKs [][]byte
currVs []uint64
lowK []byte
lowIdxs []int
lowCurr int
}
// newEnumerator returns a new enumerator over the vellum Iterators
func newEnumerator(itrs []vellum.Iterator) (*enumerator, error) {
rv := &enumerator{
itrs: itrs,
currKs: make([][]byte, len(itrs)),
currVs: make([]uint64, len(itrs)),
lowIdxs: make([]int, 0, len(itrs)),
}
for i, itr := range rv.itrs {
rv.currKs[i], rv.currVs[i] = itr.Current()
}
rv.updateMatches(false)
if rv.lowK == nil && len(rv.lowIdxs) == 0 {
return rv, vellum.ErrIteratorDone
}
return rv, nil
}
// updateMatches maintains the low key matches based on the currKs
func (m *enumerator) updateMatches(skipEmptyKey bool) {
m.lowK = nil
m.lowIdxs = m.lowIdxs[:0]
m.lowCurr = 0
for i, key := range m.currKs {
if (key == nil && m.currVs[i] == 0) || // in case of empty iterator
(len(key) == 0 && skipEmptyKey) { // skip empty keys
continue
}
cmp := bytes.Compare(key, m.lowK)
if cmp < 0 || len(m.lowIdxs) == 0 {
// reached a new low
m.lowK = key
m.lowIdxs = m.lowIdxs[:0]
m.lowIdxs = append(m.lowIdxs, i)
} else if cmp == 0 {
m.lowIdxs = append(m.lowIdxs, i)
}
}
}
// Current returns the enumerator's current key, iterator-index, and
// value. If the enumerator is not pointing at a valid value (because
// Next returned an error previously), Current will return nil,0,0.
func (m *enumerator) Current() ([]byte, int, uint64) {
var i int
var v uint64
if m.lowCurr < len(m.lowIdxs) {
i = m.lowIdxs[m.lowCurr]
v = m.currVs[i]
}
return m.lowK, i, v
}
// GetLowIdxsAndValues returns all of the iterator indices
// which point to the current key, and their corresponding
// values. This can be used by advanced callers that may need
// to peek into these other sets of data before processing.
func (m *enumerator) GetLowIdxsAndValues() ([]int, []uint64) {
values := make([]uint64, 0, len(m.lowIdxs))
for _, idx := range m.lowIdxs {
values = append(values, m.currVs[idx])
}
return m.lowIdxs, values
}
// Next advances the enumerator to the next key/iterator/value result,
// else vellum.ErrIteratorDone is returned.
func (m *enumerator) Next() error {
m.lowCurr += 1
if m.lowCurr >= len(m.lowIdxs) {
// move all the current low iterators forwards
for _, vi := range m.lowIdxs {
err := m.itrs[vi].Next()
if err != nil && err != vellum.ErrIteratorDone {
return err
}
m.currKs[vi], m.currVs[vi] = m.itrs[vi].Current()
}
// can skip any empty keys encountered at this point
m.updateMatches(true)
}
if m.lowK == nil && len(m.lowIdxs) == 0 {
return vellum.ErrIteratorDone
}
return nil
}
// Close all the underlying Iterators. The first error, if any, will
// be returned.
func (m *enumerator) Close() error {
var rv error
for _, itr := range m.itrs {
err := itr.Close()
if rv == nil {
rv = err
}
}
return rv
}
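The ordering that the enumerator produces can be shown without vellum. The sketch below is an assumption-laden simplification: it merges plain sorted slices instead of vellum iterators, assumes keys within each list are strictly increasing, and uses hypothetical names (`entry`, `tuple`, `mergeSorted`). It emits (key, iteratorIndex, value) tuples sorted by key, then by iterator index.

```go
package main

import "fmt"

// entry is one key/value pair from a single sorted input list.
type entry struct {
	key string
	val uint64
}

// tuple is one output element: the key, the index of the list it came
// from, and the value stored under that key in that list.
type tuple struct {
	key string
	idx int
	val uint64
}

// mergeSorted repeatedly finds the lowest current key across all lists,
// emits every list positioned on that key in index order, and advances
// only those lists. Each input list must have strictly increasing keys.
func mergeSorted(lists [][]entry) []tuple {
	pos := make([]int, len(lists))
	var out []tuple
	for {
		low, found := "", false
		for i, l := range lists {
			if pos[i] >= len(l) {
				continue // this list is exhausted
			}
			if k := l[pos[i]].key; !found || k < low {
				low, found = k, true
			}
		}
		if !found {
			return out
		}
		for i, l := range lists {
			if pos[i] < len(l) && l[pos[i]].key == low {
				out = append(out, tuple{key: low, idx: i, val: l[pos[i]].val})
				pos[i]++
			}
		}
	}
}

func main() {
	lists := [][]entry{
		{{"a", 1}, {"c", 3}},
		{{"a", 10}, {"b", 20}},
	}
	for _, t := range mergeSorted(lists) {
		fmt.Println(t.key, t.idx, t.val)
	}
}
```

The real enumerator exposes the same sequence lazily through Current and Next rather than materializing a slice, which matters when merging large term dictionaries.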


@@ -0,0 +1,342 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package zap
import (
"encoding/binary"
"sync"
"sync/atomic"
"time"
"github.com/RoaringBitmap/roaring/v2"
faiss "github.com/blevesearch/go-faiss"
)
func newVectorIndexCache() *vectorIndexCache {
return &vectorIndexCache{
cache: make(map[uint16]*cacheEntry),
closeCh: make(chan struct{}),
}
}
type vectorIndexCache struct {
closeCh chan struct{}
m sync.RWMutex
cache map[uint16]*cacheEntry
}
func (vc *vectorIndexCache) Clear() {
vc.m.Lock()
close(vc.closeCh)
// forcing a close on all indexes to avoid memory leaks.
for _, entry := range vc.cache {
entry.close()
}
vc.cache = nil
vc.m.Unlock()
}
// loadOrCreate obtains the vector index from the cache or creates it if it's not
// present. It also returns the vector ID to doc ID map, the doc ID to vector IDs
// map (when requested), and the vector IDs to exclude based on the except bitmap.
func (vc *vectorIndexCache) loadOrCreate(fieldID uint16, mem []byte,
loadDocVecIDMap bool, except *roaring.Bitmap) (
index *faiss.IndexImpl, vecDocIDMap map[int64]uint32, docVecIDMap map[uint32][]int64,
vecIDsToExclude []int64, err error) {
vc.m.RLock()
entry, ok := vc.cache[fieldID]
if ok {
index, vecDocIDMap, docVecIDMap = entry.load()
vecIDsToExclude = getVecIDsToExclude(vecDocIDMap, except)
if !loadDocVecIDMap || len(entry.docVecIDMap) > 0 {
vc.m.RUnlock()
return index, vecDocIDMap, docVecIDMap, vecIDsToExclude, nil
}
vc.m.RUnlock()
vc.m.Lock()
// in cases where only the docVecIDMap isn't part of the cache, build it and
// add it to the cache while holding the write lock to avoid concurrent
// modifications. this is typically seen for the first filtered query.
docVecIDMap = vc.addDocVecIDMapToCacheLOCKED(entry)
vc.m.Unlock()
return index, vecDocIDMap, docVecIDMap, vecIDsToExclude, nil
}
vc.m.RUnlock()
// acquiring a lock since this is modifying the cache.
vc.m.Lock()
defer vc.m.Unlock()
return vc.createAndCacheLOCKED(fieldID, mem, loadDocVecIDMap, except)
}
func (vc *vectorIndexCache) addDocVecIDMapToCacheLOCKED(ce *cacheEntry) map[uint32][]int64 {
// Handle concurrent accesses (to avoid unnecessary work) by adding a
// check within the write lock here.
if ce.docVecIDMap != nil {
return ce.docVecIDMap
}
docVecIDMap := make(map[uint32][]int64, len(ce.vecDocIDMap))
for vecID, docID := range ce.vecDocIDMap {
docVecIDMap[docID] = append(docVecIDMap[docID], vecID)
}
ce.docVecIDMap = docVecIDMap
return docVecIDMap
}
// Rebuilding the cache on a miss.
func (vc *vectorIndexCache) createAndCacheLOCKED(fieldID uint16, mem []byte,
loadDocVecIDMap bool, except *roaring.Bitmap) (
index *faiss.IndexImpl, vecDocIDMap map[int64]uint32,
docVecIDMap map[uint32][]int64, vecIDsToExclude []int64, err error) {
// Handle concurrent accesses (to avoid unnecessary work) by adding a
// check within the write lock here.
entry := vc.cache[fieldID]
if entry != nil {
index, vecDocIDMap, docVecIDMap = entry.load()
vecIDsToExclude = getVecIDsToExclude(vecDocIDMap, except)
if !loadDocVecIDMap || len(entry.docVecIDMap) > 0 {
return index, vecDocIDMap, docVecIDMap, vecIDsToExclude, nil
}
docVecIDMap = vc.addDocVecIDMapToCacheLOCKED(entry)
return index, vecDocIDMap, docVecIDMap, vecIDsToExclude, nil
}
// if the cache doesn't have the entry, construct the vector to doc id map and
// the vector index out of the mem bytes and update the cache under lock.
pos := 0
numVecs, n := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
pos += n
vecDocIDMap = make(map[int64]uint32, numVecs)
if loadDocVecIDMap {
docVecIDMap = make(map[uint32][]int64, numVecs)
}
isExceptNotEmpty := except != nil && !except.IsEmpty()
for i := 0; i < int(numVecs); i++ {
vecID, n := binary.Varint(mem[pos : pos+binary.MaxVarintLen64])
pos += n
docID, n := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
pos += n
docIDUint32 := uint32(docID)
if isExceptNotEmpty && except.Contains(docIDUint32) {
vecIDsToExclude = append(vecIDsToExclude, vecID)
continue
}
vecDocIDMap[vecID] = docIDUint32
if loadDocVecIDMap {
docVecIDMap[docIDUint32] = append(docVecIDMap[docIDUint32], vecID)
}
}
indexSize, n := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
pos += n
index, err = faiss.ReadIndexFromBuffer(mem[pos:pos+int(indexSize)], faissIOFlags)
if err != nil {
return nil, nil, nil, nil, err
}
vc.insertLOCKED(fieldID, index, vecDocIDMap, loadDocVecIDMap, docVecIDMap)
return index, vecDocIDMap, docVecIDMap, vecIDsToExclude, nil
}
func (vc *vectorIndexCache) insertLOCKED(fieldIDPlus1 uint16,
index *faiss.IndexImpl, vecDocIDMap map[int64]uint32, loadDocVecIDMap bool,
docVecIDMap map[uint32][]int64) {
// the first time we've hit the cache, try to spawn a monitoring routine
// which will reconcile the moving averages for all the fields being hit
if len(vc.cache) == 0 {
go vc.monitor()
}
_, ok := vc.cache[fieldIDPlus1]
if !ok {
// initializing alpha to 0.4 means that the history is weighted slightly
// more than the current sample value. this keeps the average above the
// threshold value for longer, and thereby keeps the index resident in
// the cache for longer.
vc.cache[fieldIDPlus1] = createCacheEntry(index, vecDocIDMap,
loadDocVecIDMap, docVecIDMap, 0.4)
}
}
func (vc *vectorIndexCache) incHit(fieldIDPlus1 uint16) {
vc.m.RLock()
entry, ok := vc.cache[fieldIDPlus1]
if ok {
entry.incHit()
}
vc.m.RUnlock()
}
func (vc *vectorIndexCache) decRef(fieldIDPlus1 uint16) {
vc.m.RLock()
entry, ok := vc.cache[fieldIDPlus1]
if ok {
entry.decRef()
}
vc.m.RUnlock()
}
func (vc *vectorIndexCache) cleanup() bool {
vc.m.Lock()
cache := vc.cache
// for every field reconcile the average with the current sample values
for fieldIDPlus1, entry := range cache {
sample := atomic.LoadUint64(&entry.tracker.sample)
entry.tracker.add(sample)
refCount := atomic.LoadInt64(&entry.refs)
// the comparison threshold is currently (1 - alpha). an average at or
// below it corresponds to a history of about 1 query per second on
// average, followed by a current second in which no queries were
// performed against this index.
if entry.tracker.avg <= (1-entry.tracker.alpha) && refCount <= 0 {
atomic.StoreUint64(&entry.tracker.sample, 0)
delete(vc.cache, fieldIDPlus1)
entry.close()
continue
}
atomic.StoreUint64(&entry.tracker.sample, 0)
}
rv := len(vc.cache) == 0
vc.m.Unlock()
return rv
}
var monitorFreq = 1 * time.Second
func (vc *vectorIndexCache) monitor() {
ticker := time.NewTicker(monitorFreq)
defer ticker.Stop()
for {
select {
case <-vc.closeCh:
return
case <-ticker.C:
exit := vc.cleanup()
if exit {
// no entries to be monitored, exit
return
}
}
}
}
// -----------------------------------------------------------------------------
type ewma struct {
alpha float64
avg float64
// every hit to the cache entry is recorded as part of a sample
// which will be used to calculate the average in the next cycle of average
// computation (which is average traffic for the field till now). this is
// used to track the per second hits to the cache entries.
sample uint64
}
func (e *ewma) add(val uint64) {
if e.avg == 0.0 {
e.avg = float64(val)
} else {
// the exponentially weighted moving average
// X(t) = a.v + (1 - a).X(t-1)
e.avg = e.alpha*float64(val) + (1-e.alpha)*e.avg
}
}
// -----------------------------------------------------------------------------
func createCacheEntry(index *faiss.IndexImpl, vecDocIDMap map[int64]uint32,
loadDocVecIDMap bool, docVecIDMap map[uint32][]int64, alpha float64) *cacheEntry {
ce := &cacheEntry{
index: index,
vecDocIDMap: vecDocIDMap,
tracker: &ewma{
alpha: alpha,
sample: 1,
},
refs: 1,
}
if loadDocVecIDMap {
ce.docVecIDMap = docVecIDMap
}
return ce
}
type cacheEntry struct {
tracker *ewma
// this is used to track the live references to the cache entry, so that
// when cleanup() sees the avg at or below the threshold, the entry is
// closed and removed only if there are no live refs to it.
refs int64
index *faiss.IndexImpl
vecDocIDMap map[int64]uint32
docVecIDMap map[uint32][]int64
}
func (ce *cacheEntry) incHit() {
atomic.AddUint64(&ce.tracker.sample, 1)
}
func (ce *cacheEntry) addRef() {
atomic.AddInt64(&ce.refs, 1)
}
func (ce *cacheEntry) decRef() {
atomic.AddInt64(&ce.refs, -1)
}
func (ce *cacheEntry) load() (*faiss.IndexImpl, map[int64]uint32, map[uint32][]int64) {
ce.incHit()
ce.addRef()
return ce.index, ce.vecDocIDMap, ce.docVecIDMap
}
func (ce *cacheEntry) close() {
go func() {
ce.index.Close()
ce.index = nil
ce.vecDocIDMap = nil
ce.docVecIDMap = nil
}()
}
// -----------------------------------------------------------------------------
func getVecIDsToExclude(vecDocIDMap map[int64]uint32, except *roaring.Bitmap) (vecIDsToExclude []int64) {
if except != nil && !except.IsEmpty() {
for vecID, docID := range vecDocIDMap {
if except.Contains(docID) {
vecIDsToExclude = append(vecIDsToExclude, vecID)
}
}
}
return vecIDsToExclude
}
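To make the eviction threshold in cleanup concrete, the following sketch replays the moving-average update from ewma.add for an entry that stops receiving queries. It is an illustration under assumptions, not the cache itself: `idleTicksUntilEvictable` is a hypothetical helper, each call to add stands for one monitor tick, and the reference-count condition that cleanup also checks is left out.

```go
package main

import "fmt"

// ewma mirrors the update rule shown above: the first non-zero state is
// seeded with the sample, later samples are blended using alpha.
type ewma struct {
	alpha float64
	avg   float64
}

func (e *ewma) add(val uint64) {
	if e.avg == 0.0 {
		e.avg = float64(val)
	} else {
		e.avg = e.alpha*float64(val) + (1-e.alpha)*e.avg
	}
}

// idleTicksUntilEvictable (hypothetical helper) seeds the average with
// startHits and counts how many consecutive ticks with zero hits pass
// before avg <= (1 - alpha). startHits must be at least 1.
func idleTicksUntilEvictable(startHits uint64, alpha float64) int {
	e := &ewma{alpha: alpha}
	e.add(startHits)
	ticks := 0
	for e.avg > 1-alpha {
		e.add(0)
		ticks++
	}
	return ticks
}

func main() {
	for _, hits := range []uint64{1, 10, 100} {
		fmt.Println(hits, idleTicksUntilEvictable(hits, 0.4))
	}
}
```

With alpha at 0.4 each idle tick multiplies the average by 0.6, so a busier history takes more idle ticks to decay to the threshold, which is the residency effect the comment in insertLOCKED describes.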


@@ -0,0 +1,27 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build !vectors
// +build !vectors
package zap
type vectorIndexCache struct {
}
func newVectorIndexCache() *vectorIndexCache {
return nil
}
func (v *vectorIndexCache) Clear() {}


@@ -0,0 +1,22 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors && !windows
// +build vectors,!windows
package zap
import faiss "github.com/blevesearch/go-faiss"
const faissIOFlags = faiss.IOFlagReadMmap | faiss.IOFlagSkipPrefetch


@@ -0,0 +1,22 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors && windows
// +build vectors,windows
package zap
import faiss "github.com/blevesearch/go-faiss"
const faissIOFlags = faiss.IOFlagReadOnly

@@ -0,0 +1,576 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package zap
import (
"encoding/binary"
"encoding/json"
"math"
"reflect"
"github.com/RoaringBitmap/roaring/v2"
"github.com/RoaringBitmap/roaring/v2/roaring64"
"github.com/bits-and-blooms/bitset"
faiss "github.com/blevesearch/go-faiss"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var reflectStaticSizeVecPostingsList int
var reflectStaticSizeVecPostingsIterator int
var reflectStaticSizeVecPosting int
func init() {
var pl VecPostingsList
reflectStaticSizeVecPostingsList = int(reflect.TypeOf(pl).Size())
var pi VecPostingsIterator
reflectStaticSizeVecPostingsIterator = int(reflect.TypeOf(pi).Size())
var p VecPosting
reflectStaticSizeVecPosting = int(reflect.TypeOf(p).Size())
}
type VecPosting struct {
docNum uint64
score float32
}
func (vp *VecPosting) Number() uint64 {
return vp.docNum
}
func (vp *VecPosting) Score() float32 {
return vp.score
}
func (vp *VecPosting) Size() int {
// use the static size of VecPosting itself, as computed in init()
sizeInBytes := reflectStaticSizeVecPosting
return sizeInBytes
}
// =============================================================================
// the vector postings list stores each docNum together with its similarity
// score as a single vector postings entry.
// The entries are stored in a roaring64 bitmap.
// the docNum is stored in the upper 32 bits and the score in the lower 32 bits.
// the score is a float32 value; in order to store it as a uint32 in the
// bitmap, we use its IEEE 754 bit representation (math.Float32bits).
//
// each entry in the roaring64 bitmap of the vector postings list is a 64 bit
// number which looks like this:
// MSB                        LSB
// |63 62 ... 32| 31 30 ... 0 |
// |  <docNum>  |   <score>   |
type VecPostingsList struct {
// todo: perhaps we don't even need to store a bitmap if there is only
// one vector similar to the query, but rather store it as a field value
// in the struct
except *roaring64.Bitmap
postings *roaring64.Bitmap
}
var emptyVecPostingsIterator = &VecPostingsIterator{}
var emptyVecPostingsList = &VecPostingsList{}
func (vpl *VecPostingsList) Iterator(prealloc segment.VecPostingsIterator) segment.VecPostingsIterator {
if vpl.postings == nil {
return emptyVecPostingsIterator
}
// tbd: do we check the cardinality of postings and scores?
var preallocPI *VecPostingsIterator
pi, ok := prealloc.(*VecPostingsIterator)
if ok && pi != nil {
preallocPI = pi
}
if preallocPI == emptyVecPostingsIterator {
preallocPI = nil
}
return vpl.iterator(preallocPI)
}
func (vpl *VecPostingsList) iterator(rv *VecPostingsIterator) *VecPostingsIterator {
if rv == nil {
rv = &VecPostingsIterator{}
} else {
*rv = VecPostingsIterator{} // clear the struct
}
// think on some of the edge cases over here.
if vpl.postings == nil {
return rv
}
rv.postings = vpl
rv.all = vpl.postings.Iterator()
if vpl.except != nil {
rv.ActualBM = roaring64.AndNot(vpl.postings, vpl.except)
rv.Actual = rv.ActualBM.Iterator()
} else {
rv.ActualBM = vpl.postings
rv.Actual = rv.all // Optimize to use same iterator for all & Actual.
}
return rv
}
func (vpl *VecPostingsList) Size() int {
sizeInBytes := reflectStaticSizeVecPostingsList + SizeOfPtr
if vpl.except != nil {
sizeInBytes += int(vpl.except.GetSizeInBytes())
}
return sizeInBytes
}
func (vpl *VecPostingsList) Count() uint64 {
if vpl.postings != nil {
n := vpl.postings.GetCardinality()
var e uint64
if vpl.except != nil {
e = vpl.postings.AndCardinality(vpl.except)
}
return n - e
}
return 0
}
func (vpl *VecPostingsList) ResetBytesRead(val uint64) {
}
func (vpl *VecPostingsList) BytesRead() uint64 {
return 0
}
func (vpl *VecPostingsList) BytesWritten() uint64 {
return 0
}
// =============================================================================
type VecPostingsIterator struct {
postings *VecPostingsList
all roaring64.IntPeekable64
Actual roaring64.IntPeekable64
ActualBM *roaring64.Bitmap
next VecPosting // reused across Next() calls
}
func (vpItr *VecPostingsIterator) nextCodeAtOrAfterClean(atOrAfter uint64) (uint64, bool, error) {
vpItr.Actual.AdvanceIfNeeded(atOrAfter)
if !vpItr.Actual.HasNext() {
return 0, false, nil // couldn't find anything
}
return vpItr.Actual.Next(), true, nil
}
func (vpItr *VecPostingsIterator) nextCodeAtOrAfter(atOrAfter uint64) (uint64, bool, error) {
if vpItr.Actual == nil || !vpItr.Actual.HasNext() {
return 0, false, nil
}
if vpItr.postings == nil || vpItr.postings == emptyVecPostingsList {
// couldn't find anything
return 0, false, nil
}
if vpItr.postings.postings == vpItr.ActualBM {
return vpItr.nextCodeAtOrAfterClean(atOrAfter)
}
vpItr.Actual.AdvanceIfNeeded(atOrAfter)
if !vpItr.Actual.HasNext() || !vpItr.all.HasNext() {
// couldn't find anything
return 0, false, nil
}
n := vpItr.Actual.Next()
allN := vpItr.all.Next()
// n is the next actual hit (excluding some postings), and
// allN is the next hit in the full postings, and
// if they don't match, move 'all' forwards until they do.
for allN != n {
if !vpItr.all.HasNext() {
return 0, false, nil
}
allN = vpItr.all.Next()
}
return n, true, nil
}
// a transformation function which stores both the score and the docNum as a single
// entry which is a uint64 number.
func getVectorCode(docNum uint32, score float32) uint64 {
return uint64(docNum)<<32 | uint64(math.Float32bits(score))
}
// nextAtOrAfter returns the next posting on the vector postings list whose
// docNum is at or after the given docNum, or nil at the end
func (vpItr *VecPostingsIterator) nextAtOrAfter(atOrAfter uint64) (segment.VecPosting, error) {
// transform the docNum provided to the vector code format and use that to
// get the next entry. the comparison still happens docNum wise since after
// the transformation, the docNum occupies the upper 32 bits, just as it does
// in an entry of the postings list
atOrAfter = getVectorCode(uint32(atOrAfter), 0)
code, exists, err := vpItr.nextCodeAtOrAfter(atOrAfter)
if err != nil || !exists {
return nil, err
}
vpItr.next = VecPosting{} // clear the struct
rv := &vpItr.next
rv.score = math.Float32frombits(uint32(code))
rv.docNum = code >> 32
return rv, nil
}
func (vpItr *VecPostingsIterator) Next() (segment.VecPosting, error) {
return vpItr.nextAtOrAfter(0)
}
func (vpItr *VecPostingsIterator) Advance(docNum uint64) (segment.VecPosting, error) {
return vpItr.nextAtOrAfter(docNum)
}
func (vpItr *VecPostingsIterator) Size() int {
// use the static size of VecPostingsIterator itself, as computed in init()
sizeInBytes := reflectStaticSizeVecPostingsIterator + SizeOfPtr +
vpItr.next.Size()
return sizeInBytes
}
func (vpItr *VecPostingsIterator) ResetBytesRead(val uint64) {
}
func (vpItr *VecPostingsIterator) BytesRead() uint64 {
return 0
}
func (vpItr *VecPostingsIterator) BytesWritten() uint64 {
return 0
}
// vectorIndexWrapper conforms to scorch_segment_api's VectorIndex interface
type vectorIndexWrapper struct {
search func(qVector []float32, k int64,
params json.RawMessage) (segment.VecPostingsList, error)
searchWithFilter func(qVector []float32, k int64, eligibleDocIDs []uint64,
params json.RawMessage) (segment.VecPostingsList, error)
close func()
size func() uint64
}
func (i *vectorIndexWrapper) Search(qVector []float32, k int64,
params json.RawMessage) (
segment.VecPostingsList, error) {
return i.search(qVector, k, params)
}
func (i *vectorIndexWrapper) SearchWithFilter(qVector []float32, k int64,
eligibleDocIDs []uint64, params json.RawMessage) (
segment.VecPostingsList, error) {
return i.searchWithFilter(qVector, k, eligibleDocIDs, params)
}
func (i *vectorIndexWrapper) Close() {
i.close()
}
func (i *vectorIndexWrapper) Size() uint64 {
return i.size()
}
// InterpretVectorIndex returns a construct of closures (vectorIndexWrapper)
// that will allow the caller to -
// (1) search within an attached vector index
// (2) search limited to a subset of documents within an attached vector index
// (3) close attached vector index
// (4) get the size of the attached vector index
func (sb *SegmentBase) InterpretVectorIndex(field string, requiresFiltering bool,
except *roaring.Bitmap) (
segment.VectorIndex, error) {
// Params needed for the closures
var vecIndex *faiss.IndexImpl
var vecDocIDMap map[int64]uint32
var docVecIDMap map[uint32][]int64
var vectorIDsToExclude []int64
var fieldIDPlus1 uint16
var vecIndexSize uint64
// Utility function to add the corresponding docID and score for each vector
// returned by the kNN query to the newly created vecPostingsList
addIDsToPostingsList := func(pl *VecPostingsList, ids []int64, scores []float32) {
for i := 0; i < len(ids); i++ {
vecID := ids[i]
// Checking if it's present in the vecDocIDMap.
// If -1 is returned as an ID(insufficient vectors), this will ensure
// it isn't added to the final postings list.
if docID, ok := vecDocIDMap[vecID]; ok {
code := getVectorCode(docID, scores[i])
pl.postings.Add(code)
}
}
}
var (
wrapVecIndex = &vectorIndexWrapper{
search: func(qVector []float32, k int64, params json.RawMessage) (
segment.VecPostingsList, error) {
// 1. returned postings list (of type PostingsList) has two types of information - docNum and its score.
// 2. both the values can be represented using roaring bitmaps.
// 3. the Iterator (of type PostingsIterator) returned would operate in terms of VecPostings.
// 4. VecPostings would just have the docNum and the score. Every call of Next()
// and Advance just returns the next VecPostings. The caller would do a vp.Number()
// and the Score() to get the corresponding values
rv := &VecPostingsList{
except: nil, // todo: handle the except bitmap within postings iterator.
postings: roaring64.New(),
}
if vecIndex == nil || vecIndex.D() != len(qVector) {
// vector index not found or dimensionality mismatched
return rv, nil
}
scores, ids, err := vecIndex.SearchWithoutIDs(qVector, k,
vectorIDsToExclude, params)
if err != nil {
return nil, err
}
addIDsToPostingsList(rv, ids, scores)
return rv, nil
},
searchWithFilter: func(qVector []float32, k int64,
eligibleDocIDs []uint64, params json.RawMessage) (
segment.VecPostingsList, error) {
// 1. returned postings list (of type PostingsList) has two types of information - docNum and its score.
// 2. both the values can be represented using roaring bitmaps.
// 3. the Iterator (of type PostingsIterator) returned would operate in terms of VecPostings.
// 4. VecPostings would just have the docNum and the score. Every call of Next()
// and Advance just returns the next VecPostings. The caller would do a vp.Number()
// and the Score() to get the corresponding values
rv := &VecPostingsList{
except: nil, // todo: handle the except bitmap within postings iterator.
postings: roaring64.New(),
}
if vecIndex == nil || vecIndex.D() != len(qVector) {
// vector index not found or dimensionality mismatched
return rv, nil
}
// Check and proceed only if non-zero documents eligible per the filter query.
if len(eligibleDocIDs) == 0 {
return rv, nil
}
// If every element in the index is eligible (full selectivity),
// then this can basically be considered unfiltered kNN.
if len(eligibleDocIDs) == int(sb.numDocs) {
scores, ids, err := vecIndex.SearchWithoutIDs(qVector, k,
vectorIDsToExclude, params)
if err != nil {
return nil, err
}
addIDsToPostingsList(rv, ids, scores)
return rv, nil
}
// vector IDs corresponding to the local doc numbers to be
// considered for the search
vectorIDsToInclude := make([]int64, 0, len(eligibleDocIDs))
for _, id := range eligibleDocIDs {
vecIDs := docVecIDMap[uint32(id)]
// In the common case where vecIDs has only one element, which occurs
// when a document has only one vector field, we can
// avoid the unnecessary overhead of slice unpacking (append(vecIDs...)).
// Directly append the single element for efficiency.
if len(vecIDs) == 1 {
vectorIDsToInclude = append(vectorIDsToInclude, vecIDs[0])
} else {
vectorIDsToInclude = append(vectorIDsToInclude, vecIDs...)
}
}
// In case a doc has invalid vector fields but valid non-vector fields,
// filter hit IDs may be ineligible for the kNN since the document does
// not have any/valid vectors.
if len(vectorIDsToInclude) == 0 {
return rv, nil
}
// If the index is not an IVF index, then the search can be
// performed directly, using the Flat index.
if !vecIndex.IsIVFIndex() {
// vector IDs corresponding to the local doc numbers to be
// considered for the search
scores, ids, err := vecIndex.SearchWithIDs(qVector, k,
vectorIDsToInclude, params)
if err != nil {
return nil, err
}
addIDsToPostingsList(rv, ids, scores)
return rv, nil
}
// Determining which clusters, identified by centroid ID,
// have at least one eligible vector and hence, ought to be
// probed.
clusterVectorCounts, err := vecIndex.ObtainClusterVectorCountsFromIVFIndex(vectorIDsToInclude)
if err != nil {
return nil, err
}
var selector faiss.Selector
// If there are more elements to be included than excluded, it
// might be quicker to use an exclusion selector as a filter
// instead of an inclusion selector.
if float32(len(eligibleDocIDs))/float32(len(docVecIDMap)) > 0.5 {
// Use a bitset to efficiently track eligible document IDs.
// This reduces the lookup cost when checking if a document ID is eligible,
// compared to using a map or slice.
bs := bitset.New(uint(len(eligibleDocIDs)))
for _, docID := range eligibleDocIDs {
bs.Set(uint(docID))
}
ineligibleVectorIDs := make([]int64, 0, len(vecDocIDMap)-len(vectorIDsToInclude))
for docID, vecIDs := range docVecIDMap {
// Check if the document ID is NOT in the eligible set, marking it as ineligible.
if !bs.Test(uint(docID)) {
// In the common case where vecIDs has only one element, which occurs
// when a document has only one vector field, we can
// avoid the unnecessary overhead of slice unpacking (append(vecIDs...)).
// Directly append the single element for efficiency.
if len(vecIDs) == 1 {
ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs[0])
} else {
ineligibleVectorIDs = append(ineligibleVectorIDs, vecIDs...)
}
}
}
selector, err = faiss.NewIDSelectorNot(ineligibleVectorIDs)
} else {
selector, err = faiss.NewIDSelectorBatch(vectorIDsToInclude)
}
if err != nil {
return nil, err
}
// If no error occurred during the creation of the selector, then
// it should be deleted once the search is complete.
defer selector.Delete()
// Ordering the retrieved centroid IDs by increasing order
// of distance i.e. decreasing order of proximity to query vector.
centroidIDs := make([]int64, 0, len(clusterVectorCounts))
for centroidID := range clusterVectorCounts {
centroidIDs = append(centroidIDs, centroidID)
}
closestCentroidIDs, centroidDistances, err :=
vecIndex.ObtainClustersWithDistancesFromIVFIndex(qVector, centroidIDs)
if err != nil {
return nil, err
}
// Getting the nprobe value set at index time.
nprobe := int(vecIndex.GetNProbe())
// Determining the minimum number of centroids to be probed
// to ensure that at least 'k' vectors are collected while
// examining at least 'nprobe' centroids.
var eligibleDocsTillNow int64
minEligibleCentroids := len(closestCentroidIDs)
for i, centroidID := range closestCentroidIDs {
eligibleDocsTillNow += clusterVectorCounts[centroidID]
// Stop once we've examined at least 'nprobe' centroids and
// collected at least 'k' vectors.
if eligibleDocsTillNow >= k && i+1 >= nprobe {
minEligibleCentroids = i + 1
break
}
}
// Search the clusters specified by 'closestCentroidIDs' for
// vectors whose IDs are present in 'vectorIDsToInclude'
scores, ids, err := vecIndex.SearchClustersFromIVFIndex(
selector, closestCentroidIDs, minEligibleCentroids,
k, qVector, centroidDistances, params)
if err != nil {
return nil, err
}
addIDsToPostingsList(rv, ids, scores)
return rv, nil
},
close: func() {
// the index is cached, so it is not closed here; only the reference is
// released and the actual close is deferred to a later point in time.
sb.vecIndexCache.decRef(fieldIDPlus1)
},
size: func() uint64 {
return vecIndexSize
},
}
err error
)
fieldIDPlus1 = sb.fieldsMap[field]
if fieldIDPlus1 <= 0 {
return wrapVecIndex, nil
}
vectorSection := sb.fieldsSectionsMap[fieldIDPlus1-1][SectionFaissVectorIndex]
// check if the field has a vector section in the segment.
if vectorSection <= 0 {
return wrapVecIndex, nil
}
pos := int(vectorSection)
// the loop below reads past the following (the values are discarded):
// 1. doc values (first 2 iterations) - present to adhere to the sections
// format; never valid values for a vector section
// 2. index optimization type.
for i := 0; i < 3; i++ {
_, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
}
vecIndex, vecDocIDMap, docVecIDMap, vectorIDsToExclude, err =
sb.vecIndexCache.loadOrCreate(fieldIDPlus1, sb.mem[pos:], requiresFiltering,
except)
if vecIndex != nil {
vecIndexSize = vecIndex.Size()
}
return wrapVecIndex, err
}
func (sb *SegmentBase) UpdateFieldStats(stats segment.FieldStats) {
for _, fieldName := range sb.fieldsInv {
pos := int(sb.fieldsSectionsMap[sb.fieldsMap[fieldName]-1][SectionFaissVectorIndex])
if pos == 0 {
continue
}
for i := 0; i < 3; i++ {
_, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
}
numVecs, _ := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
stats.Store("num_vectors", fieldName, numVecs)
}
}
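
The 64-bit entry layout described in the VecPostingsList comment (docNum in the upper 32 bits, IEEE 754 score bits in the lower 32) can be tried out in isolation. The following is a minimal standalone sketch, not part of the vendored package: packVectorCode and unpackVectorCode are hypothetical names that mirror getVectorCode and the decoding done in nextAtOrAfter.

```go
package main

import (
	"fmt"
	"math"
)

// packVectorCode mirrors getVectorCode: the docNum goes in the upper 32 bits
// and the raw IEEE 754 bits of the score go in the lower 32 bits.
// (Hypothetical name, for illustration only.)
func packVectorCode(docNum uint32, score float32) uint64 {
	return uint64(docNum)<<32 | uint64(math.Float32bits(score))
}

// unpackVectorCode reverses the packing, the same way nextAtOrAfter
// recovers the docNum and the score from a bitmap entry.
func unpackVectorCode(code uint64) (uint32, float32) {
	return uint32(code >> 32), math.Float32frombits(uint32(code))
}

func main() {
	code := packVectorCode(7, 0.25)
	docNum, score := unpackVectorCode(code)
	fmt.Println(docNum, score)

	// Codes order by docNum first, so a code built from a docNum and a zero
	// score is a lower bound for every entry of that docNum.
	fmt.Println(packVectorCode(8, 0) > code)
}
```

Because the docNum occupies the high bits, seeking the bitmap iterator to the code of (docNum, 0) lands on the first entry at or after that docNum, which appears to be the property Advance relies on.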

138
vendor/github.com/blevesearch/zapx/v16/intDecoder.go generated vendored Normal file

@@ -0,0 +1,138 @@
// Copyright (c) 2019 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"encoding/binary"
"fmt"
)
type chunkedIntDecoder struct {
startOffset uint64
dataStartOffset uint64
chunkOffsets []uint64
curChunkBytes []byte
data []byte
r *memUvarintReader
bytesRead uint64
}
// newChunkedIntDecoder expects an optional or reset chunkedIntDecoder for better reuse.
func newChunkedIntDecoder(buf []byte, offset uint64, rv *chunkedIntDecoder) *chunkedIntDecoder {
if rv == nil {
rv = &chunkedIntDecoder{startOffset: offset, data: buf}
} else {
rv.startOffset = offset
rv.data = buf
}
var n, numChunks uint64
var read int
if offset == termNotEncoded {
numChunks = 0
} else {
numChunks, read = binary.Uvarint(buf[offset+n : offset+n+binary.MaxVarintLen64])
}
n += uint64(read)
if cap(rv.chunkOffsets) >= int(numChunks) {
rv.chunkOffsets = rv.chunkOffsets[:int(numChunks)]
} else {
rv.chunkOffsets = make([]uint64, int(numChunks))
}
for i := 0; i < int(numChunks); i++ {
rv.chunkOffsets[i], read = binary.Uvarint(buf[offset+n : offset+n+binary.MaxVarintLen64])
n += uint64(read)
}
rv.bytesRead += n
rv.dataStartOffset = offset + n
return rv
}
// getBytesRead is a utility function which reports the number of bytes
// read at query time from data encoded by the intcoder (for example, the
// freqNorm and location details of a term in a document).
// loadChunk retrieves a chunk, and the number of bytes retrieved
// in that operation is accounted for here.
func (d *chunkedIntDecoder) getBytesRead() uint64 {
return d.bytesRead
}
func (d *chunkedIntDecoder) loadChunk(chunk int) error {
if d.startOffset == termNotEncoded {
d.r = newMemUvarintReader([]byte(nil))
return nil
}
if chunk >= len(d.chunkOffsets) {
return fmt.Errorf("tried to load freq chunk that doesn't exist %d/(%d)",
chunk, len(d.chunkOffsets))
}
end, start := d.dataStartOffset, d.dataStartOffset
s, e := readChunkBoundary(chunk, d.chunkOffsets)
start += s
end += e
d.curChunkBytes = d.data[start:end]
d.bytesRead += uint64(len(d.curChunkBytes))
if d.r == nil {
d.r = newMemUvarintReader(d.curChunkBytes)
} else {
d.r.Reset(d.curChunkBytes)
}
return nil
}
func (d *chunkedIntDecoder) reset() {
d.startOffset = 0
d.dataStartOffset = 0
d.chunkOffsets = d.chunkOffsets[:0]
d.curChunkBytes = d.curChunkBytes[:0]
d.bytesRead = 0
d.data = d.data[:0]
if d.r != nil {
d.r.Reset([]byte(nil))
}
}
func (d *chunkedIntDecoder) isNil() bool {
// len of a nil slice is 0, so a single check covers both cases
return len(d.curChunkBytes) == 0
}
func (d *chunkedIntDecoder) readUvarint() (uint64, error) {
return d.r.ReadUvarint()
}
func (d *chunkedIntDecoder) readBytes(start, end int) []byte {
return d.curChunkBytes[start:end]
}
func (d *chunkedIntDecoder) SkipUvarint() {
d.r.SkipUvarint()
}
func (d *chunkedIntDecoder) SkipBytes(count int) {
d.r.SkipBytes(count)
}
func (d *chunkedIntDecoder) Len() int {
return d.r.Len()
}
func (d *chunkedIntDecoder) remainingLen() int {
return len(d.curChunkBytes) - d.r.Len()
}
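
The layout that newChunkedIntDecoder and loadChunk read is: a uvarint chunk count, then one uvarint end offset per chunk, then the concatenated chunk bytes. The following standalone sketch round-trips that layout under those assumptions. encodeChunks and decodeChunk are hypothetical helpers, not functions of the vendored package, and they add bounds checks that the vendored decoder does not perform.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeChunks writes uvarint(numChunks), one uvarint end offset per chunk,
// and then the concatenated chunk bytes. (Hypothetical helper.)
func encodeChunks(chunks [][]byte) []byte {
	out := binary.AppendUvarint(nil, uint64(len(chunks)))
	var end uint64
	var data []byte
	for _, c := range chunks {
		end += uint64(len(c))
		out = binary.AppendUvarint(out, end)
		data = append(data, c...)
	}
	return append(out, data...)
}

// decodeChunk parses the header and returns the bytes of the requested chunk.
func decodeChunk(buf []byte, chunk int) ([]byte, error) {
	numChunks, n := binary.Uvarint(buf)
	if n <= 0 {
		return nil, fmt.Errorf("bad chunk count")
	}
	pos := n
	offsets := make([]uint64, numChunks)
	for i := range offsets {
		v, read := binary.Uvarint(buf[pos:])
		if read <= 0 {
			return nil, fmt.Errorf("bad offset for chunk %d", i)
		}
		offsets[i] = v
		pos += read
	}
	if chunk < 0 || chunk >= len(offsets) {
		return nil, fmt.Errorf("chunk %d out of range (%d chunks)", chunk, len(offsets))
	}
	// A chunk starts where the previous one ended; chunk 0 starts at 0.
	var start uint64
	if chunk > 0 {
		start = offsets[chunk-1]
	}
	data := buf[pos:]
	if start > offsets[chunk] || offsets[chunk] > uint64(len(data)) {
		return nil, fmt.Errorf("corrupt offsets for chunk %d", chunk)
	}
	return data[start:offsets[chunk]], nil
}

func main() {
	buf := encodeChunks([][]byte{[]byte("ab"), nil, []byte("cde")})
	for i := 0; i < 3; i++ {
		c, err := decodeChunk(buf, i)
		fmt.Printf("chunk %d: %q err=%v\n", i, c, err)
	}
}
```

An empty chunk simply repeats the previous end offset, so it costs one header entry and no data bytes.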

218
vendor/github.com/blevesearch/zapx/v16/intcoder.go generated vendored Normal file

@@ -0,0 +1,218 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"io"
)
// We can safely use 0 to represent termNotEncoded since 0
// could never be a valid address for term location information.
// (stored field index is always non-empty and earlier in the
// file)
const termNotEncoded = 0
type chunkedIntCoder struct {
final []byte
chunkSize uint64
chunkBuf bytes.Buffer
chunkLens []uint64
currChunk uint64
buf []byte
bytesWritten uint64
}
// newChunkedIntCoder returns a new chunked int coder which packs data into
// chunks based on the provided chunkSize and supports up to the specified
// maxDocNum.
func newChunkedIntCoder(chunkSize uint64, maxDocNum uint64) *chunkedIntCoder {
total := maxDocNum/chunkSize + 1
rv := &chunkedIntCoder{
chunkSize: chunkSize,
chunkLens: make([]uint64, total),
final: make([]byte, 0, 64),
}
return rv
}
// Reset lets you reuse this chunked int coder. buffers are reset and reused
// from previous use. you cannot change the chunk size or max doc num.
func (c *chunkedIntCoder) Reset() {
c.final = c.final[:0]
c.bytesWritten = 0
c.chunkBuf.Reset()
c.currChunk = 0
for i := range c.chunkLens {
c.chunkLens[i] = 0
}
}
// SetChunkSize changes the chunk size. It is only valid to do so
// with a new chunkedIntCoder, or immediately after calling Reset()
func (c *chunkedIntCoder) SetChunkSize(chunkSize uint64, maxDocNum uint64) {
total := int(maxDocNum/chunkSize + 1)
c.chunkSize = chunkSize
if cap(c.chunkLens) < total {
c.chunkLens = make([]uint64, total)
} else {
c.chunkLens = c.chunkLens[:total]
}
}
func (c *chunkedIntCoder) incrementBytesWritten(val uint64) {
c.bytesWritten += val
}
func (c *chunkedIntCoder) getBytesWritten() uint64 {
return c.bytesWritten
}
// Add encodes the provided integers into the correct chunk for the provided
// doc num. You MUST call Add() with increasing docNums.
func (c *chunkedIntCoder) Add(docNum uint64, vals ...uint64) error {
chunk := docNum / c.chunkSize
if chunk != c.currChunk {
// starting a new chunk
c.Close()
c.chunkBuf.Reset()
c.currChunk = chunk
}
if len(c.buf) < binary.MaxVarintLen64 {
c.buf = make([]byte, binary.MaxVarintLen64)
}
for _, val := range vals {
wb := binary.PutUvarint(c.buf, val)
_, err := c.chunkBuf.Write(c.buf[:wb])
if err != nil {
return err
}
}
return nil
}
func (c *chunkedIntCoder) AddBytes(docNum uint64, buf []byte) error {
chunk := docNum / c.chunkSize
if chunk != c.currChunk {
// starting a new chunk
c.Close()
c.chunkBuf.Reset()
c.currChunk = chunk
}
_, err := c.chunkBuf.Write(buf)
return err
}
// Close indicates you are done calling Add(). This allows the final chunk
// to be encoded.
func (c *chunkedIntCoder) Close() {
encodingBytes := c.chunkBuf.Bytes()
c.incrementBytesWritten(uint64(len(encodingBytes)))
c.chunkLens[c.currChunk] = uint64(len(encodingBytes))
c.final = append(c.final, encodingBytes...)
c.currChunk = uint64(cap(c.chunkLens)) // sentinel to detect double close
}
// Write commits all the encoded chunked integers to the provided writer.
func (c *chunkedIntCoder) Write(w io.Writer) (int, error) {
bufNeeded := binary.MaxVarintLen64 * (1 + len(c.chunkLens))
if len(c.buf) < bufNeeded {
c.buf = make([]byte, bufNeeded)
}
buf := c.buf
// convert the chunk lengths into chunk offsets
chunkOffsets := modifyLengthsToEndOffsets(c.chunkLens)
// write out the number of chunks and the end offset of each chunk
n := binary.PutUvarint(buf, uint64(len(chunkOffsets)))
for _, chunkOffset := range chunkOffsets {
n += binary.PutUvarint(buf[n:], chunkOffset)
}
tw, err := w.Write(buf[:n])
if err != nil {
return tw, err
}
// write out the data
nw, err := w.Write(c.final)
tw += nw
if err != nil {
return tw, err
}
return tw, nil
}
// writeAt commits all the encoded chunked integers to the provided writer
// and returns the starting offset, total bytes written and an error
func (c *chunkedIntCoder) writeAt(w io.Writer) (uint64, int, error) {
startOffset := uint64(termNotEncoded)
if len(c.final) <= 0 {
return startOffset, 0, nil
}
if chw := w.(*CountHashWriter); chw != nil {
startOffset = uint64(chw.Count())
}
tw, err := c.Write(w)
return startOffset, tw, err
}
func (c *chunkedIntCoder) FinalSize() int {
return len(c.final)
}
// modifyLengthsToEndOffsets converts the chunk length array
// to a chunk offset array. The readChunkBoundary
// will figure out the start and end of every chunk from
// these offsets. Starting offset of i'th index is stored
// in i-1'th position except for 0'th index and ending offset
// is stored at i'th index position.
// For 0'th element, starting position is always zero.
// eg:
// Lens -> 5 5 5 5 => 5 10 15 20
// Lens -> 0 5 0 5 => 0 5 5 10
// Lens -> 0 0 0 5 => 0 0 0 5
// Lens -> 5 0 0 0 => 5 5 5 5
// Lens -> 0 5 0 0 => 0 5 5 5
// Lens -> 0 0 5 0 => 0 0 5 5
func modifyLengthsToEndOffsets(lengths []uint64) []uint64 {
var runningOffset uint64
var index, i int
for i = 1; i <= len(lengths); i++ {
runningOffset += lengths[i-1]
lengths[index] = runningOffset
index++
}
return lengths
}
func readChunkBoundary(chunk int, offsets []uint64) (uint64, uint64) {
var start uint64
if chunk > 0 {
start = offsets[chunk-1]
}
return start, offsets[chunk]
}
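
The conversion from chunk lengths to end offsets, and the lookup of a chunk's boundaries from those offsets, can be checked against the examples in the comment on modifyLengthsToEndOffsets. The sketch below is standalone and uses hypothetical names (toEndOffsets, chunkBoundary). Unlike the vendored helper, which overwrites its argument in place, toEndOffsets returns a new slice.

```go
package main

import "fmt"

// toEndOffsets returns the running sum of the chunk lengths. Entry i is the
// end offset of chunk i. The input is left untouched. (Hypothetical helper.)
func toEndOffsets(lengths []uint64) []uint64 {
	offsets := make([]uint64, len(lengths))
	var running uint64
	for i, l := range lengths {
		running += l
		offsets[i] = running
	}
	return offsets
}

// chunkBoundary returns the [start, end) byte range of a chunk. The start of
// chunk i is the end offset of chunk i-1, and chunk 0 always starts at 0.
func chunkBoundary(chunk int, offsets []uint64) (uint64, uint64) {
	var start uint64
	if chunk > 0 {
		start = offsets[chunk-1]
	}
	return start, offsets[chunk]
}

func main() {
	offsets := toEndOffsets([]uint64{0, 5, 0, 5})
	fmt.Println(offsets)
	for i := range offsets {
		s, e := chunkBoundary(i, offsets)
		fmt.Printf("chunk %d: [%d, %d)\n", i, s, e)
	}
}
```

With lengths 0 5 0 5 the offsets come out as 0 5 5 10, so chunk 2 has the empty range [5, 5), matching the second example row in the comment.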

103
vendor/github.com/blevesearch/zapx/v16/memuvarint.go generated vendored Normal file

@@ -0,0 +1,103 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"fmt"
)
type memUvarintReader struct {
C int // index of next byte to read from S
S []byte
}
func newMemUvarintReader(s []byte) *memUvarintReader {
return &memUvarintReader{S: s}
}
// Len returns the number of unread bytes.
func (r *memUvarintReader) Len() int {
n := len(r.S) - r.C
if n < 0 {
return 0
}
return n
}
// ReadUvarint reads an encoded uint64. It is based on
// binary.ReadUvarint() from the encoding/binary package.
func (r *memUvarintReader) ReadUvarint() (uint64, error) {
if r.C >= len(r.S) {
// nothing else to read
return 0, nil
}
var x uint64
var s uint
var C = r.C
var S = r.S
for {
b := S[C]
C++
if b < 0x80 {
r.C = C
// why 63? The original code had an 'i += 1' loop var and
// checked for i > 9 || i == 9 ...; but, we no longer
// check for the i var, but instead check here for s,
// which is incremented by 7. So, 7*9 == 63.
//
// why the "extra" >= check? The normal case is that s <
// 63, so we check this single >= guard first so that we
// hit the normal, nil-error return pathway sooner.
if s >= 63 && (s > 63 || b > 1) {
return 0, fmt.Errorf("memUvarintReader overflow")
}
return x | uint64(b)<<s, nil
}
x |= uint64(b&0x7f) << s
s += 7
}
}
// SkipUvarint skips ahead one encoded uint64.
func (r *memUvarintReader) SkipUvarint() {
for {
if r.C >= len(r.S) {
return
}
b := r.S[r.C]
r.C++
if b < 0x80 {
return
}
}
}
// SkipBytes skips a count number of bytes.
func (r *memUvarintReader) SkipBytes(count int) {
r.C = r.C + count
}
// Reset points the reader at s and rewinds it to the first byte.
func (r *memUvarintReader) Reset(s []byte) {
r.C = 0
r.S = s
}
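
The uvarint decoding loop and its shift-based overflow guard can be exercised on their own. The sketch below is a standalone illustration with a hypothetical readUvarint function rather than the memUvarintReader type. It deviates from the vendored reader in one respect: it reports truncated input as an error, whereas the vendored ReadUvarint returns 0 with a nil error when no bytes remain.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readUvarint decodes one uvarint from buf starting at pos. It returns the
// value and the position of the next unread byte. (Hypothetical helper.)
func readUvarint(buf []byte, pos int) (uint64, int, error) {
	var x uint64
	var s uint
	for pos < len(buf) {
		b := buf[pos]
		pos++
		if b < 0x80 {
			// Last byte of the value. At shift 63 only a single bit still
			// fits in a uint64, so anything larger is an overflow.
			if s >= 63 && (s > 63 || b > 1) {
				return 0, pos, errors.New("uvarint overflows 64 bits")
			}
			return x | uint64(b)<<s, pos, nil
		}
		// Continuation byte: keep the low 7 bits and move on.
		x |= uint64(b&0x7f) << s
		s += 7
	}
	return 0, pos, errors.New("truncated uvarint")
}

func main() {
	// Encode two values back to back with the standard library.
	buf := binary.AppendUvarint(nil, 300)
	buf = binary.AppendUvarint(buf, 1)

	v1, pos, err1 := readUvarint(buf, 0)
	v2, _, err2 := readUvarint(buf, pos)
	fmt.Println(len(buf), v1, v2, err1, err2)
}
```

Each byte carries 7 payload bits, so nine continuation bytes bring the shift to 63; that is the bound the guard checks before accepting the final byte.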

616
vendor/github.com/blevesearch/zapx/v16/merge.go generated vendored Normal file

@@ -0,0 +1,616 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bufio"
"bytes"
"encoding/binary"
"fmt"
"math"
"os"
"sort"
"github.com/RoaringBitmap/roaring/v2"
seg "github.com/blevesearch/scorch_segment_api/v2"
"github.com/golang/snappy"
)
var DefaultFileMergerBufferSize = 1024 * 1024
const docDropped = math.MaxUint64 // sentinel docNum to represent a deleted doc
// Merge takes a slice of segments and bit masks describing which
// documents may be dropped, and creates a new segment containing the
// remaining data. This new segment is built at the specified path.
func (*ZapPlugin) Merge(segments []seg.Segment, drops []*roaring.Bitmap, path string,
closeCh chan struct{}, s seg.StatsReporter) (
[][]uint64, uint64, error) {
segmentBases := make([]*SegmentBase, len(segments))
for segmenti, segment := range segments {
switch segmentx := segment.(type) {
case *Segment:
segmentBases[segmenti] = &segmentx.SegmentBase
case *SegmentBase:
segmentBases[segmenti] = segmentx
default:
panic(fmt.Sprintf("oops, unexpected segment type: %T", segment))
}
}
return mergeSegmentBases(segmentBases, drops, path, DefaultChunkMode, closeCh, s)
}
func mergeSegmentBases(segmentBases []*SegmentBase, drops []*roaring.Bitmap, path string,
chunkMode uint32, closeCh chan struct{}, s seg.StatsReporter) (
[][]uint64, uint64, error) {
flag := os.O_RDWR | os.O_CREATE
f, err := os.OpenFile(path, flag, 0600)
if err != nil {
return nil, 0, err
}
cleanup := func() {
_ = f.Close()
_ = os.Remove(path)
}
// buffer the output
br := bufio.NewWriterSize(f, DefaultFileMergerBufferSize)
// wrap it for counting (tracking offsets)
cr := NewCountHashWriterWithStatsReporter(br, s)
newDocNums, numDocs, storedIndexOffset, _, _, sectionsIndexOffset, err :=
mergeToWriter(segmentBases, drops, chunkMode, cr, closeCh)
if err != nil {
cleanup()
return nil, 0, err
}
// passing the sectionsIndexOffset as fieldsIndexOffset and the docValueOffset as 0 for the footer
err = persistFooter(numDocs, storedIndexOffset, sectionsIndexOffset, sectionsIndexOffset,
0, chunkMode, cr.Sum32(), cr)
if err != nil {
cleanup()
return nil, 0, err
}
err = br.Flush()
if err != nil {
cleanup()
return nil, 0, err
}
err = f.Sync()
if err != nil {
cleanup()
return nil, 0, err
}
err = f.Close()
if err != nil {
cleanup()
return nil, 0, err
}
return newDocNums, uint64(cr.Count()), nil
}
func mergeToWriter(segments []*SegmentBase, drops []*roaring.Bitmap,
chunkMode uint32, cr *CountHashWriter, closeCh chan struct{}) (
newDocNums [][]uint64, numDocs, storedIndexOffset uint64,
fieldsInv []string, fieldsMap map[string]uint16, sectionsIndexOffset uint64,
err error) {
var fieldsSame bool
fieldsSame, fieldsInv = mergeFields(segments)
fieldsMap = mapFields(fieldsInv)
numDocs = computeNewDocCount(segments, drops)
if isClosed(closeCh) {
return nil, 0, 0, nil, nil, 0, seg.ErrClosed
}
// the merge opaque is especially important when it comes to tracking the file
// offset a field of a particular section is at. This will be used to write the
// offsets in the fields section index of the file (the final merged file).
mergeOpaque := map[int]resetable{}
args := map[string]interface{}{
"chunkMode": chunkMode,
"fieldsSame": fieldsSame,
"fieldsMap": fieldsMap,
"numDocs": numDocs,
}
if numDocs > 0 {
storedIndexOffset, newDocNums, err = mergeStoredAndRemap(segments, drops,
fieldsMap, fieldsInv, fieldsSame, numDocs, cr, closeCh)
if err != nil {
return nil, 0, 0, nil, nil, 0, err
}
// at this point, ask each section implementation to merge itself
for i, x := range segmentSections {
mergeOpaque[int(i)] = x.InitOpaque(args)
err = x.Merge(mergeOpaque, segments, drops, fieldsInv, newDocNums, cr, closeCh)
if err != nil {
return nil, 0, 0, nil, nil, 0, err
}
}
}
// we can persist the fields section index now, this will point
// to the various indexes (each in different section) available for a field.
sectionsIndexOffset, err = persistFieldsSection(fieldsInv, cr, mergeOpaque)
if err != nil {
return nil, 0, 0, nil, nil, 0, err
}
return newDocNums, numDocs, storedIndexOffset, fieldsInv, fieldsMap, sectionsIndexOffset, nil
}
// mapFields takes the fieldsInv list and returns a map of fieldName
// to fieldID+1
func mapFields(fields []string) map[string]uint16 {
rv := make(map[string]uint16, len(fields))
for i, fieldName := range fields {
rv[fieldName] = uint16(i) + 1
}
return rv
}
// computeNewDocCount determines how many documents will be in the newly
// merged segment when obsoleted docs are dropped
func computeNewDocCount(segments []*SegmentBase, drops []*roaring.Bitmap) uint64 {
var newDocCount uint64
for segI, segment := range segments {
newDocCount += segment.numDocs
if drops[segI] != nil {
newDocCount -= drops[segI].GetCardinality()
}
}
return newDocCount
}
func mergeTermFreqNormLocsByCopying(term []byte, postItr *PostingsIterator,
newDocNums []uint64, newRoaring *roaring.Bitmap,
tfEncoder *chunkedIntCoder, locEncoder *chunkedIntCoder) (
lastDocNum uint64, lastFreq uint64, lastNorm uint64, err error) {
nextDocNum, nextFreq, nextNorm, nextFreqNormBytes, nextLocBytes, err :=
postItr.nextBytes()
for err == nil && len(nextFreqNormBytes) > 0 {
hitNewDocNum := newDocNums[nextDocNum]
if hitNewDocNum == docDropped {
return 0, 0, 0, fmt.Errorf("see hit with dropped doc num")
}
newRoaring.Add(uint32(hitNewDocNum))
err = tfEncoder.AddBytes(hitNewDocNum, nextFreqNormBytes)
if err != nil {
return 0, 0, 0, err
}
if len(nextLocBytes) > 0 {
err = locEncoder.AddBytes(hitNewDocNum, nextLocBytes)
if err != nil {
return 0, 0, 0, err
}
}
lastDocNum = hitNewDocNum
lastFreq = nextFreq
lastNorm = nextNorm
nextDocNum, nextFreq, nextNorm, nextFreqNormBytes, nextLocBytes, err =
postItr.nextBytes()
}
return lastDocNum, lastFreq, lastNorm, err
}
func mergeTermFreqNormLocs(fieldsMap map[string]uint16, term []byte, postItr *PostingsIterator,
newDocNums []uint64, newRoaring *roaring.Bitmap,
tfEncoder *chunkedIntCoder, locEncoder *chunkedIntCoder, bufLoc []uint64) (
lastDocNum uint64, lastFreq uint64, lastNorm uint64, bufLocOut []uint64, err error) {
next, err := postItr.Next()
for next != nil && err == nil {
hitNewDocNum := newDocNums[next.Number()]
if hitNewDocNum == docDropped {
return 0, 0, 0, nil, fmt.Errorf("see hit with dropped docNum")
}
newRoaring.Add(uint32(hitNewDocNum))
nextFreq := next.Frequency()
var nextNorm uint64
if pi, ok := next.(*Posting); ok {
nextNorm = pi.NormUint64()
} else {
return 0, 0, 0, nil, fmt.Errorf("unexpected posting type %T", next)
}
locs := next.Locations()
if nextFreq > 0 {
err = tfEncoder.Add(hitNewDocNum,
encodeFreqHasLocs(nextFreq, len(locs) > 0), nextNorm)
} else {
err = tfEncoder.Add(hitNewDocNum,
encodeFreqHasLocs(nextFreq, len(locs) > 0))
}
if err != nil {
return 0, 0, 0, nil, err
}
if len(locs) > 0 {
numBytesLocs := 0
for _, loc := range locs {
ap := loc.ArrayPositions()
numBytesLocs += totalUvarintBytes(uint64(fieldsMap[loc.Field()]-1),
loc.Pos(), loc.Start(), loc.End(), uint64(len(ap)), ap)
}
err = locEncoder.Add(hitNewDocNum, uint64(numBytesLocs))
if err != nil {
return 0, 0, 0, nil, err
}
for _, loc := range locs {
ap := loc.ArrayPositions()
if cap(bufLoc) < 5+len(ap) {
bufLoc = make([]uint64, 0, 5+len(ap))
}
args := bufLoc[0:5]
args[0] = uint64(fieldsMap[loc.Field()] - 1)
args[1] = loc.Pos()
args[2] = loc.Start()
args[3] = loc.End()
args[4] = uint64(len(ap))
args = append(args, ap...)
err = locEncoder.Add(hitNewDocNum, args...)
if err != nil {
return 0, 0, 0, nil, err
}
}
}
lastDocNum = hitNewDocNum
lastFreq = nextFreq
lastNorm = nextNorm
next, err = postItr.Next()
}
return lastDocNum, lastFreq, lastNorm, bufLoc, err
}
func writePostings(postings *roaring.Bitmap, tfEncoder, locEncoder *chunkedIntCoder,
use1HitEncoding func(uint64) (bool, uint64, uint64),
w *CountHashWriter, bufMaxVarintLen64 []byte) (
offset uint64, err error) {
if postings == nil {
return 0, nil
}
termCardinality := postings.GetCardinality()
if termCardinality <= 0 {
return 0, nil
}
if use1HitEncoding != nil {
encodeAs1Hit, docNum1Hit, normBits1Hit := use1HitEncoding(termCardinality)
if encodeAs1Hit {
return FSTValEncode1Hit(docNum1Hit, normBits1Hit), nil
}
}
var tfOffset uint64
tfOffset, _, err = tfEncoder.writeAt(w)
if err != nil {
return 0, err
}
var locOffset uint64
locOffset, _, err = locEncoder.writeAt(w)
if err != nil {
return 0, err
}
postingsOffset := uint64(w.Count())
n := binary.PutUvarint(bufMaxVarintLen64, tfOffset)
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return 0, err
}
n = binary.PutUvarint(bufMaxVarintLen64, locOffset)
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return 0, err
}
_, err = writeRoaringWithLen(postings, w, bufMaxVarintLen64)
if err != nil {
return 0, err
}
return postingsOffset, nil
}
type varintEncoder func(uint64) (int, error)
func mergeStoredAndRemap(segments []*SegmentBase, drops []*roaring.Bitmap,
fieldsMap map[string]uint16, fieldsInv []string, fieldsSame bool, newSegDocCount uint64,
w *CountHashWriter, closeCh chan struct{}) (uint64, [][]uint64, error) {
var rv [][]uint64 // The remapped or newDocNums for each segment.
var newDocNum uint64
var curr int
var data, compressed []byte
var metaBuf bytes.Buffer
varBuf := make([]byte, binary.MaxVarintLen64)
metaEncode := func(val uint64) (int, error) {
wb := binary.PutUvarint(varBuf, val)
return metaBuf.Write(varBuf[:wb])
}
vals := make([][][]byte, len(fieldsInv))
typs := make([][]byte, len(fieldsInv))
poss := make([][][]uint64, len(fieldsInv))
var posBuf []uint64
docNumOffsets := make([]uint64, newSegDocCount)
vdc := visitDocumentCtxPool.Get().(*visitDocumentCtx)
defer visitDocumentCtxPool.Put(vdc)
// for each segment
for segI, segment := range segments {
// check whether the merge has been closed in the meantime
if isClosed(closeCh) {
return 0, nil, seg.ErrClosed
}
segNewDocNums := make([]uint64, segment.numDocs)
dropsI := drops[segI]
// optimize when the field mapping is the same across all
// segments and there are no deletions, via byte-copying
// of stored docs bytes directly to the writer
if fieldsSame && (dropsI == nil || dropsI.GetCardinality() == 0) {
err := segment.copyStoredDocs(newDocNum, docNumOffsets, w)
if err != nil {
return 0, nil, err
}
for i := uint64(0); i < segment.numDocs; i++ {
segNewDocNums[i] = newDocNum
newDocNum++
}
rv = append(rv, segNewDocNums)
continue
}
// for each doc num
for docNum := uint64(0); docNum < segment.numDocs; docNum++ {
// TODO: roaring's API limits docNums to 32-bits?
if dropsI != nil && dropsI.Contains(uint32(docNum)) {
segNewDocNums[docNum] = docDropped
continue
}
segNewDocNums[docNum] = newDocNum
curr = 0
metaBuf.Reset()
data = data[:0]
posTemp := posBuf
// collect all the data
for i := 0; i < len(fieldsInv); i++ {
vals[i] = vals[i][:0]
typs[i] = typs[i][:0]
poss[i] = poss[i][:0]
}
err := segment.visitStoredFields(vdc, docNum, func(field string, typ byte, value []byte, pos []uint64) bool {
fieldID := int(fieldsMap[field]) - 1
if fieldID < 0 {
// no entry for field in fieldsMap
return false
}
vals[fieldID] = append(vals[fieldID], value)
typs[fieldID] = append(typs[fieldID], typ)
// copy array positions to preserve them beyond the scope of this callback
var curPos []uint64
if len(pos) > 0 {
if cap(posTemp) < len(pos) {
posBuf = make([]uint64, len(pos)*len(fieldsInv))
posTemp = posBuf
}
curPos = posTemp[0:len(pos)]
copy(curPos, pos)
posTemp = posTemp[len(pos):]
}
poss[fieldID] = append(poss[fieldID], curPos)
return true
})
if err != nil {
return 0, nil, err
}
// _id field special case optimizes ExternalID() lookups
idFieldVal := vals[uint16(0)][0]
_, err = metaEncode(uint64(len(idFieldVal)))
if err != nil {
return 0, nil, err
}
// now walk the non-"_id" fields in order
for fieldID := 1; fieldID < len(fieldsInv); fieldID++ {
storedFieldValues := vals[fieldID]
stf := typs[fieldID]
spf := poss[fieldID]
var err2 error
curr, data, err2 = persistStoredFieldValues(fieldID,
storedFieldValues, stf, spf, curr, metaEncode, data)
if err2 != nil {
return 0, nil, err2
}
}
metaBytes := metaBuf.Bytes()
compressed = snappy.Encode(compressed[:cap(compressed)], data)
// record where we're about to start writing
docNumOffsets[newDocNum] = uint64(w.Count())
// write out the meta len and compressed data len
_, err = writeUvarints(w,
uint64(len(metaBytes)),
uint64(len(idFieldVal)+len(compressed)))
if err != nil {
return 0, nil, err
}
// now write the meta
_, err = w.Write(metaBytes)
if err != nil {
return 0, nil, err
}
// now write the _id field val (counted as part of the 'compressed' data)
_, err = w.Write(idFieldVal)
if err != nil {
return 0, nil, err
}
// now write the compressed data
_, err = w.Write(compressed)
if err != nil {
return 0, nil, err
}
newDocNum++
}
rv = append(rv, segNewDocNums)
}
// return value is the start of the stored index
storedIndexOffset := uint64(w.Count())
// now write out the stored doc index
for _, docNumOffset := range docNumOffsets {
err := binary.Write(w, binary.BigEndian, docNumOffset)
if err != nil {
return 0, nil, err
}
}
return storedIndexOffset, rv, nil
}
// copyStoredDocs writes out a segment's stored doc info, optimized by
// using a single Write() call for the entire set of bytes. The
// newDocNumOffsets is filled with the new offsets for each doc.
func (sb *SegmentBase) copyStoredDocs(newDocNum uint64, newDocNumOffsets []uint64,
w *CountHashWriter) error {
if sb.numDocs <= 0 {
return nil
}
indexOffset0, storedOffset0, _, _, _ :=
sb.getDocStoredOffsets(0) // the segment's first doc
indexOffsetN, storedOffsetN, readN, metaLenN, dataLenN :=
sb.getDocStoredOffsets(sb.numDocs - 1) // the segment's last doc
storedOffset0New := uint64(w.Count())
storedBytes := sb.mem[storedOffset0 : storedOffsetN+readN+metaLenN+dataLenN]
_, err := w.Write(storedBytes)
if err != nil {
return err
}
// remap the storedOffset's for the docs into new offsets relative
// to storedOffset0New, filling the given newDocNumOffsets slice
for indexOffset := indexOffset0; indexOffset <= indexOffsetN; indexOffset += 8 {
storedOffset := binary.BigEndian.Uint64(sb.mem[indexOffset : indexOffset+8])
storedOffsetNew := storedOffset - storedOffset0 + storedOffset0New
newDocNumOffsets[newDocNum] = storedOffsetNew
newDocNum += 1
}
return nil
}
// mergeFields builds a unified list of fields used across all the
// input segments, and computes whether the fields are the same across
// segments (which depends on fields to be sorted in the same way
// across segments)
func mergeFields(segments []*SegmentBase) (bool, []string) {
fieldsSame := true
var segment0Fields []string
if len(segments) > 0 {
segment0Fields = segments[0].Fields()
}
fieldsExist := map[string]struct{}{}
for _, segment := range segments {
fields := segment.Fields()
for fieldi, field := range fields {
fieldsExist[field] = struct{}{}
if len(segment0Fields) != len(fields) || segment0Fields[fieldi] != field {
fieldsSame = false
}
}
}
rv := make([]string, 0, len(fieldsExist))
// ensure _id stays first
rv = append(rv, "_id")
for k := range fieldsExist {
if k != "_id" {
rv = append(rv, k)
}
}
sort.Strings(rv[1:]) // leave _id as first
return fieldsSame, rv
}
func isClosed(closeCh chan struct{}) bool {
select {
case <-closeCh:
return true
default:
return false
}
}
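
The doc-number remapping performed by computeNewDocCount and mergeStoredAndRemap above can be illustrated with a minimal, standalone sketch. It is not part of zapx: remapDocNums is a hypothetical helper, and a plain map[uint64]bool stands in for the *roaring.Bitmap drop masks as a simplifying assumption. Only the docDropped sentinel and the per-segment numbering scheme are taken from merge.go.

```go
package main

import (
	"fmt"
	"math"
)

// docDropped mirrors the sentinel used in merge.go for a deleted doc.
const docDropped = math.MaxUint64

// remapDocNums is a hypothetical helper (not part of zapx). For each
// segment it assigns consecutive new doc numbers to surviving docs and
// marks dropped docs with docDropped. A map[uint64]bool stands in for
// the roaring bitmap of drops; a nil map means "no drops".
func remapDocNums(segDocCounts []uint64, drops []map[uint64]bool) ([][]uint64, uint64) {
	var next uint64
	rv := make([][]uint64, 0, len(segDocCounts))
	for segI, n := range segDocCounts {
		segNew := make([]uint64, n)
		for docNum := uint64(0); docNum < n; docNum++ {
			if drops[segI][docNum] { // reading a nil map yields false
				segNew[docNum] = docDropped
				continue
			}
			segNew[docNum] = next
			next++
		}
		rv = append(rv, segNew)
	}
	// next equals the doc count of the merged segment
	return rv, next
}

func main() {
	// two segments: 3 docs (doc 1 deleted) and 2 docs (no deletions)
	remap, total := remapDocNums(
		[]uint64{3, 2},
		[]map[uint64]bool{{1: true}, nil},
	)
	fmt.Println("merged doc count:", total)
	fmt.Println("segment 0, doc 1 dropped:", remap[0][1] == docDropped)
	fmt.Println("segment 1 new doc nums:", remap[1])
}
```

The returned count plays the role of numDocs, and the per-segment slices play the role of newDocNums, which the postings merge functions index by old doc number.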

439
vendor/github.com/blevesearch/zapx/v16/new.go generated vendored Normal file

@@ -0,0 +1,439 @@
// Copyright (c) 2018 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"math"
"sort"
"sync"
"sync/atomic"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
"github.com/golang/snappy"
)
var NewSegmentBufferNumResultsBump int = 100
var NewSegmentBufferNumResultsFactor float64 = 1.0
var NewSegmentBufferAvgBytesPerDocFactor float64 = 1.0
// ValidateDocFields can be set by applications to perform additional checks
// on fields in a document being added to a new segment, by default it does
// nothing.
// This API is experimental and may be removed at any time.
var ValidateDocFields = func(field index.Field) error {
return nil
}
// New creates an in-memory zap-encoded SegmentBase from a set of Documents
func (z *ZapPlugin) New(results []index.Document) (
segment.Segment, uint64, error) {
return z.newWithChunkMode(results, DefaultChunkMode)
}
func (*ZapPlugin) newWithChunkMode(results []index.Document,
chunkMode uint32) (segment.Segment, uint64, error) {
s := interimPool.Get().(*interim)
var br bytes.Buffer
if s.lastNumDocs > 0 {
// use previous results to initialize the buf with an estimate
// size, but note that the interim instance comes from a
// global interimPool, so multiple scorch instances indexing
// different docs can lead to low quality estimates
estimateAvgBytesPerDoc := int(float64(s.lastOutSize/s.lastNumDocs) *
NewSegmentBufferNumResultsFactor)
estimateNumResults := int(float64(len(results)+NewSegmentBufferNumResultsBump) *
NewSegmentBufferAvgBytesPerDocFactor)
br.Grow(estimateAvgBytesPerDoc * estimateNumResults)
}
s.results = results
s.chunkMode = chunkMode
s.w = NewCountHashWriter(&br)
storedIndexOffset, sectionsIndexOffset, err := s.convert()
if err != nil {
return nil, uint64(0), err
}
sb, err := InitSegmentBase(br.Bytes(), s.w.Sum32(), chunkMode,
uint64(len(results)), storedIndexOffset, sectionsIndexOffset)
// capture the bytes written before the interim's reset() call, so
// that the count can be recorded on the newly formed segment base.
totalBytesWritten := s.getBytesWritten()
if err == nil && s.reset() == nil {
s.lastNumDocs = len(results)
s.lastOutSize = len(br.Bytes())
sb.setBytesWritten(totalBytesWritten)
interimPool.Put(s)
}
return sb, uint64(len(br.Bytes())), err
}
var interimPool = sync.Pool{New: func() interface{} { return &interim{} }}
// interim holds temporary working data used while converting from
// analysis results to a zap-encoded segment
type interim struct {
results []index.Document
chunkMode uint32
w *CountHashWriter
// FieldsMap adds 1 to field id to avoid zero value issues
// name -> field id + 1
FieldsMap map[string]uint16
// FieldsInv is the inverse of FieldsMap
// field id -> name
FieldsInv []string
metaBuf bytes.Buffer
tmp0 []byte
tmp1 []byte
lastNumDocs int
lastOutSize int
// atomic access to this variable
bytesWritten uint64
opaque map[int]resetable
}
func (s *interim) reset() (err error) {
s.results = nil
s.chunkMode = 0
s.w = nil
for k := range s.FieldsMap {
delete(s.FieldsMap, k)
}
s.FieldsInv = s.FieldsInv[:0]
s.metaBuf.Reset()
s.tmp0 = s.tmp0[:0]
s.tmp1 = s.tmp1[:0]
s.lastNumDocs = 0
s.lastOutSize = 0
// reset the bytes written stat count
// to avoid leaking of bytesWritten across reuse cycles.
s.setBytesWritten(0)
if s.opaque != nil {
for _, v := range s.opaque {
err = v.Reset()
}
} else {
s.opaque = map[int]resetable{}
}
return err
}
type interimStoredField struct {
vals [][]byte
typs []byte
arrayposs [][]uint64 // array positions
}
type interimFreqNorm struct {
freq uint64
norm float32
numLocs int
}
type interimLoc struct {
fieldID uint16
pos uint64
start uint64
end uint64
arrayposs []uint64
}
func (s *interim) convert() (uint64, uint64, error) {
if s.FieldsMap == nil {
s.FieldsMap = map[string]uint16{}
}
s.getOrDefineField("_id") // _id field is fieldID 0
for _, result := range s.results {
result.VisitComposite(func(field index.CompositeField) {
s.getOrDefineField(field.Name())
})
result.VisitFields(func(field index.Field) {
s.getOrDefineField(field.Name())
})
}
sort.Strings(s.FieldsInv[1:]) // keep _id as first field
for fieldID, fieldName := range s.FieldsInv {
s.FieldsMap[fieldName] = uint16(fieldID + 1)
}
args := map[string]interface{}{
"results": s.results,
"chunkMode": s.chunkMode,
"fieldsMap": s.FieldsMap,
"fieldsInv": s.FieldsInv,
}
if s.opaque == nil {
s.opaque = map[int]resetable{}
for i, x := range segmentSections {
s.opaque[int(i)] = x.InitOpaque(args)
}
} else {
for k, v := range args {
for _, op := range s.opaque {
op.Set(k, v)
}
}
}
s.processDocuments()
storedIndexOffset, err := s.writeStoredFields()
if err != nil {
return 0, 0, err
}
// we can persist the various sections at this point.
// the rule of thumb here is that each section must persist field wise.
for _, x := range segmentSections {
_, err = x.Persist(s.opaque, s.w)
if err != nil {
return 0, 0, err
}
}
// after persisting the sections to the writer, account for the
// corresponding bytes written by each section
for _, opaque := range s.opaque {
opaqueIO, ok := opaque.(segment.DiskStatsReporter)
if ok {
s.incrementBytesWritten(opaqueIO.BytesWritten())
}
}
// we can persist a new fields section here
// this new fields section will point to the various indexes available
sectionsIndexOffset, err := persistFieldsSection(s.FieldsInv, s.w, s.opaque)
if err != nil {
return 0, 0, err
}
return storedIndexOffset, sectionsIndexOffset, nil
}
func (s *interim) getOrDefineField(fieldName string) int {
fieldIDPlus1, exists := s.FieldsMap[fieldName]
if !exists {
fieldIDPlus1 = uint16(len(s.FieldsInv) + 1)
s.FieldsMap[fieldName] = fieldIDPlus1
s.FieldsInv = append(s.FieldsInv, fieldName)
}
return int(fieldIDPlus1 - 1)
}
func (s *interim) processDocuments() {
for docNum, result := range s.results {
s.processDocument(uint32(docNum), result)
}
}
func (s *interim) processDocument(docNum uint32,
result index.Document) {
// this callback is essentially going to be invoked on each field,
// as part of which preprocessing, cumulation etc. of the doc's data
// will take place.
visitField := func(field index.Field) {
fieldID := uint16(s.getOrDefineField(field.Name()))
// section specific processing of the field
for _, section := range segmentSections {
section.Process(s.opaque, docNum, field, fieldID)
}
}
// walk each composite field
result.VisitComposite(func(field index.CompositeField) {
visitField(field)
})
// walk each field
result.VisitFields(visitField)
// given that as part of visiting each field, there may be some kind of totalling
// or accumulation that can be updated, it becomes necessary to commit or
// put that totalling/accumulation into effect. However, for certain section
// types this particular step need not be valid, in which case it would be a
// no-op in the implementation of the section's process API.
for _, section := range segmentSections {
section.Process(s.opaque, docNum, nil, math.MaxUint16)
}
}
func (s *interim) getBytesWritten() uint64 {
return atomic.LoadUint64(&s.bytesWritten)
}
func (s *interim) incrementBytesWritten(val uint64) {
atomic.AddUint64(&s.bytesWritten, val)
}
func (s *interim) writeStoredFields() (
storedIndexOffset uint64, err error) {
varBuf := make([]byte, binary.MaxVarintLen64)
metaEncode := func(val uint64) (int, error) {
wb := binary.PutUvarint(varBuf, val)
return s.metaBuf.Write(varBuf[:wb])
}
data, compressed := s.tmp0[:0], s.tmp1[:0]
defer func() { s.tmp0, s.tmp1 = data, compressed }()
// keyed by docNum
docStoredOffsets := make([]uint64, len(s.results))
// keyed by fieldID, for the current doc in the loop
docStoredFields := map[uint16]interimStoredField{}
for docNum, result := range s.results {
for fieldID := range docStoredFields { // reset for next doc
delete(docStoredFields, fieldID)
}
var validationErr error
result.VisitFields(func(field index.Field) {
fieldID := uint16(s.getOrDefineField(field.Name()))
if field.Options().IsStored() {
isf := docStoredFields[fieldID]
isf.vals = append(isf.vals, field.Value())
isf.typs = append(isf.typs, field.EncodedFieldType())
isf.arrayposs = append(isf.arrayposs, field.ArrayPositions())
docStoredFields[fieldID] = isf
}
err := ValidateDocFields(field)
if err != nil && validationErr == nil {
validationErr = err
}
})
if validationErr != nil {
return 0, validationErr
}
var curr int
s.metaBuf.Reset()
data = data[:0]
// _id field special case optimizes ExternalID() lookups
idFieldVal := docStoredFields[uint16(0)].vals[0]
_, err = metaEncode(uint64(len(idFieldVal)))
if err != nil {
return 0, err
}
// handle non-"_id" fields
for fieldID := 1; fieldID < len(s.FieldsInv); fieldID++ {
isf, exists := docStoredFields[uint16(fieldID)]
if exists {
curr, data, err = persistStoredFieldValues(
fieldID, isf.vals, isf.typs, isf.arrayposs,
curr, metaEncode, data)
if err != nil {
return 0, err
}
}
}
metaBytes := s.metaBuf.Bytes()
compressed = snappy.Encode(compressed[:cap(compressed)], data)
s.incrementBytesWritten(uint64(len(compressed)))
docStoredOffsets[docNum] = uint64(s.w.Count())
_, err := writeUvarints(s.w,
uint64(len(metaBytes)),
uint64(len(idFieldVal)+len(compressed)))
if err != nil {
return 0, err
}
_, err = s.w.Write(metaBytes)
if err != nil {
return 0, err
}
_, err = s.w.Write(idFieldVal)
if err != nil {
return 0, err
}
_, err = s.w.Write(compressed)
if err != nil {
return 0, err
}
}
storedIndexOffset = uint64(s.w.Count())
for _, docStoredOffset := range docStoredOffsets {
err = binary.Write(s.w, binary.BigEndian, docStoredOffset)
if err != nil {
return 0, err
}
}
return storedIndexOffset, nil
}
func (s *interim) setBytesWritten(val uint64) {
atomic.StoreUint64(&s.bytesWritten, val)
}
// returns the total # of bytes needed to encode the given uint64's
// into binary.PutUvarint() encoding
func totalUvarintBytes(a, b, c, d, e uint64, more []uint64) (n int) {
n = numUvarintBytes(a)
n += numUvarintBytes(b)
n += numUvarintBytes(c)
n += numUvarintBytes(d)
n += numUvarintBytes(e)
for _, v := range more {
n += numUvarintBytes(v)
}
return n
}
// returns # of bytes needed to encode x in binary.PutUvarint() encoding
func numUvarintBytes(x uint64) (n int) {
for x >= 0x80 {
x >>= 7
n++
}
return n + 1
}
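
As a quick sanity check on numUvarintBytes above, the following standalone sketch compares the same loop against the number of bytes that encoding/binary's PutUvarint actually writes. The function name uvarintLen and the choice of sample values are illustrative assumptions; the loop body is copied from new.go.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uvarintLen repeats the loop from numUvarintBytes in new.go: each
// uvarint byte carries 7 payload bits, so count 7-bit groups.
func uvarintLen(x uint64) (n int) {
	for x >= 0x80 {
		x >>= 7
		n++
	}
	return n + 1
}

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	// boundary values around the 1-, 2-, 3-byte and maximum lengths
	samples := []uint64{0, 127, 128, 16383, 16384, 1 << 63, ^uint64(0)}
	for _, v := range samples {
		wrote := binary.PutUvarint(buf, v)
		fmt.Printf("value=%d predicted=%d written=%d match=%t\n",
			v, uvarintLen(v), wrote, uvarintLen(v) == wrote)
	}
}
```

Precomputing the length this way is what lets mergeTermFreqNormLocs record the total location byte count before the location values themselves are encoded.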

27
vendor/github.com/blevesearch/zapx/v16/plugin.go generated vendored Normal file

@@ -0,0 +1,27 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
// ZapPlugin implements the Plugin interface of
// the blevesearch/scorch_segment_api pkg
type ZapPlugin struct{}
func (*ZapPlugin) Type() string {
return Type
}
func (*ZapPlugin) Version() uint32 {
return Version
}

944
vendor/github.com/blevesearch/zapx/v16/posting.go generated vendored Normal file

@@ -0,0 +1,944 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"encoding/binary"
"fmt"
"math"
"reflect"
"github.com/RoaringBitmap/roaring/v2"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var reflectStaticSizePostingsList int
var reflectStaticSizePostingsIterator int
var reflectStaticSizePosting int
var reflectStaticSizeLocation int
func init() {
var pl PostingsList
reflectStaticSizePostingsList = int(reflect.TypeOf(pl).Size())
var pi PostingsIterator
reflectStaticSizePostingsIterator = int(reflect.TypeOf(pi).Size())
var p Posting
reflectStaticSizePosting = int(reflect.TypeOf(p).Size())
var l Location
reflectStaticSizeLocation = int(reflect.TypeOf(l).Size())
}
// FST or vellum value (uint64) encoding is determined by the top two
// highest-order or most significant bits...
//
// encoding : MSB
// name : 63 62 61...to...bit #0 (LSB)
// ----------+---+---+---------------------------------------------------
// general : 0 | 0 | 62-bits of postingsOffset.
// ~ : 0 | 1 | reserved for future.
// 1-hit : 1 | 0 | 31-bits of positive float31 norm | 31-bits docNum.
// ~ : 1 | 1 | reserved for future.
//
// Encoding "general" is able to handle all cases, where the
// postingsOffset points to more information about the postings for
// the term.
//
// Encoding "1-hit" is used to optimize a commonly seen case when a
// term has only a single hit. For example, a term in the _id field
// will have only 1 hit. The "1-hit" encoding is used for a term
// in a field when...
//
// - term vector info is disabled for that field;
// - and, the term appears in only a single doc for that field;
// - and, the term's freq is exactly 1 in that single doc for that field;
// - and, the docNum must fit into 31-bits;
//
// Otherwise, the "general" encoding is used instead.
//
// In the "1-hit" encoding, the field in that single doc may have
// other terms, which is supported in the "1-hit" encoding by the
// positive float31 norm.
const FSTValEncodingMask = uint64(0xc000000000000000)
const FSTValEncodingGeneral = uint64(0x0000000000000000)
const FSTValEncoding1Hit = uint64(0x8000000000000000)
func FSTValEncode1Hit(docNum uint64, normBits uint64) uint64 {
return FSTValEncoding1Hit | ((mask31Bits & normBits) << 31) | (mask31Bits & docNum)
}
func FSTValDecode1Hit(v uint64) (docNum uint64, normBits uint64) {
return (mask31Bits & v), (mask31Bits & (v >> 31))
}
const mask31Bits = uint64(0x000000007fffffff)
func under32Bits(x uint64) bool {
return x <= mask31Bits
}
const DocNum1HitFinished = math.MaxUint64
var NormBits1Hit = uint64(1)
// PostingsList is an in-memory representation of a postings list
type PostingsList struct {
sb *SegmentBase
postingsOffset uint64
freqOffset uint64
locOffset uint64
postings *roaring.Bitmap
except *roaring.Bitmap
// when normBits1Hit != 0, then this postings list came from a
// 1-hit encoding, and only the docNum1Hit & normBits1Hit apply
docNum1Hit uint64
normBits1Hit uint64
chunkSize uint64
bytesRead uint64
}
// represents an immutable, empty postings list
var emptyPostingsList = &PostingsList{}
func (p *PostingsList) Size() int {
sizeInBytes := reflectStaticSizePostingsList + SizeOfPtr
if p.except != nil {
sizeInBytes += int(p.except.GetSizeInBytes())
}
return sizeInBytes
}
func (p *PostingsList) OrInto(receiver *roaring.Bitmap) {
if p.normBits1Hit != 0 {
receiver.Add(uint32(p.docNum1Hit))
return
}
if p.postings != nil {
receiver.Or(p.postings)
}
}
// Iterator returns an iterator for this postings list
func (p *PostingsList) Iterator(includeFreq, includeNorm, includeLocs bool,
prealloc segment.PostingsIterator) segment.PostingsIterator {
if p.normBits1Hit == 0 && p.postings == nil {
return emptyPostingsIterator
}
var preallocPI *PostingsIterator
pi, ok := prealloc.(*PostingsIterator)
if ok && pi != nil {
preallocPI = pi
}
if preallocPI == emptyPostingsIterator {
preallocPI = nil
}
return p.iterator(includeFreq, includeNorm, includeLocs, preallocPI)
}
func (p *PostingsList) iterator(includeFreq, includeNorm, includeLocs bool,
rv *PostingsIterator) *PostingsIterator {
if rv == nil {
rv = &PostingsIterator{}
} else {
freqNormReader := rv.freqNormReader
if freqNormReader != nil {
freqNormReader.reset()
}
locReader := rv.locReader
if locReader != nil {
locReader.reset()
}
nextLocs := rv.nextLocs[:0]
nextSegmentLocs := rv.nextSegmentLocs[:0]
buf := rv.buf
*rv = PostingsIterator{} // clear the struct
rv.freqNormReader = freqNormReader
rv.locReader = locReader
rv.nextLocs = nextLocs
rv.nextSegmentLocs = nextSegmentLocs
rv.buf = buf
}
rv.postings = p
rv.includeFreqNorm = includeFreq || includeNorm || includeLocs
rv.includeLocs = includeLocs
if p.normBits1Hit != 0 {
// "1-hit" encoding
rv.docNum1Hit = p.docNum1Hit
rv.normBits1Hit = p.normBits1Hit
if p.except != nil && p.except.Contains(uint32(rv.docNum1Hit)) {
rv.docNum1Hit = DocNum1HitFinished
}
return rv
}
// "general" encoding, check if empty
if p.postings == nil {
return rv
}
// initialize freq chunk reader
if rv.includeFreqNorm {
rv.freqNormReader = newChunkedIntDecoder(p.sb.mem, p.freqOffset, rv.freqNormReader)
rv.incrementBytesRead(rv.freqNormReader.getBytesRead())
}
// initialize the loc chunk reader
if rv.includeLocs {
rv.locReader = newChunkedIntDecoder(p.sb.mem, p.locOffset, rv.locReader)
rv.incrementBytesRead(rv.locReader.getBytesRead())
}
rv.all = p.postings.Iterator()
if p.except != nil {
rv.ActualBM = roaring.AndNot(p.postings, p.except)
rv.Actual = rv.ActualBM.Iterator()
} else {
rv.ActualBM = p.postings
rv.Actual = rv.all // Optimize to use same iterator for all & Actual.
}
return rv
}
// Count returns the number of items on this postings list
func (p *PostingsList) Count() uint64 {
var n, e uint64
if p.normBits1Hit != 0 {
n = 1
if p.except != nil && p.except.Contains(uint32(p.docNum1Hit)) {
e = 1
}
} else if p.postings != nil {
n = p.postings.GetCardinality()
if p.except != nil {
e = p.postings.AndCardinality(p.except)
}
}
return n - e
}
// Implements the segment.DiskStatsReporter interface
// The purpose of this implementation is to get
// the bytes read from the postings lists stored
// on disk, while querying
func (p *PostingsList) ResetBytesRead(val uint64) {
p.bytesRead = val
}
func (p *PostingsList) BytesRead() uint64 {
return p.bytesRead
}
func (p *PostingsList) incrementBytesRead(val uint64) {
p.bytesRead += val
}
func (p *PostingsList) BytesWritten() uint64 {
return 0
}
func (rv *PostingsList) read(postingsOffset uint64, d *Dictionary) error {
rv.postingsOffset = postingsOffset
// handle "1-hit" encoding special case
if rv.postingsOffset&FSTValEncodingMask == FSTValEncoding1Hit {
return rv.init1Hit(postingsOffset)
}
// read the location of the freq/norm details
var n uint64
var read int
rv.freqOffset, read = binary.Uvarint(d.sb.mem[postingsOffset+n : postingsOffset+binary.MaxVarintLen64])
n += uint64(read)
rv.locOffset, read = binary.Uvarint(d.sb.mem[postingsOffset+n : postingsOffset+n+binary.MaxVarintLen64])
n += uint64(read)
var postingsLen uint64
postingsLen, read = binary.Uvarint(d.sb.mem[postingsOffset+n : postingsOffset+n+binary.MaxVarintLen64])
n += uint64(read)
roaringBytes := d.sb.mem[postingsOffset+n : postingsOffset+n+postingsLen]
rv.incrementBytesRead(n + postingsLen)
if rv.postings == nil {
rv.postings = roaring.NewBitmap()
}
_, err := rv.postings.FromBuffer(roaringBytes)
if err != nil {
return fmt.Errorf("error loading roaring bitmap: %v", err)
}
chunkSize, err := getChunkSize(d.sb.chunkMode,
rv.postings.GetCardinality(), d.sb.numDocs)
if err != nil {
return fmt.Errorf("failed to get chunk size: %v", err)
}
rv.chunkSize = chunkSize
return nil
}
func (rv *PostingsList) init1Hit(fstVal uint64) error {
docNum, normBits := FSTValDecode1Hit(fstVal)
rv.docNum1Hit = docNum
rv.normBits1Hit = normBits
return nil
}
// PostingsIterator provides a way to iterate through the postings list
type PostingsIterator struct {
postings *PostingsList
all roaring.IntPeekable
Actual roaring.IntPeekable
ActualBM *roaring.Bitmap
currChunk uint32
freqNormReader *chunkedIntDecoder
locReader *chunkedIntDecoder
next Posting // reused across Next() calls
nextLocs []Location // reused across Next() calls
nextSegmentLocs []segment.Location // reused across Next() calls
docNum1Hit uint64
normBits1Hit uint64
buf []byte
includeFreqNorm bool
includeLocs bool
bytesRead uint64
}
var emptyPostingsIterator = &PostingsIterator{}
func (i *PostingsIterator) Size() int {
sizeInBytes := reflectStaticSizePostingsIterator + SizeOfPtr +
i.next.Size()
// account for freqNormReader, locReader if we start using this.
for _, entry := range i.nextLocs {
sizeInBytes += entry.Size()
}
return sizeInBytes
}
// Implements the segment.DiskStatsReporter interface
// The purpose of this implementation is to get
// the bytes read from the disk which includes
// the freqNorm and location specific information
// of a hit
func (i *PostingsIterator) ResetBytesRead(val uint64) {
i.bytesRead = val
}
func (i *PostingsIterator) BytesRead() uint64 {
return i.bytesRead
}
func (i *PostingsIterator) incrementBytesRead(val uint64) {
i.bytesRead += val
}
func (i *PostingsIterator) BytesWritten() uint64 {
return 0
}
func (i *PostingsIterator) loadChunk(chunk int) error {
if i.includeFreqNorm {
err := i.freqNormReader.loadChunk(chunk)
if err != nil {
return err
}
// assign (rather than add) the bytes read at this point, since
// the PostingsIterator tracks only the chunk loaded, while the
// cumulative count is tracked correctly in the downstream
// intDecoder
i.ResetBytesRead(i.freqNormReader.getBytesRead())
}
if i.includeLocs {
err := i.locReader.loadChunk(chunk)
if err != nil {
return err
}
i.ResetBytesRead(i.locReader.getBytesRead())
}
i.currChunk = uint32(chunk)
return nil
}
func (i *PostingsIterator) readFreqNormHasLocs() (uint64, uint64, bool, error) {
if i.normBits1Hit != 0 {
return 1, i.normBits1Hit, false, nil
}
freqHasLocs, err := i.freqNormReader.readUvarint()
if err != nil {
return 0, 0, false, fmt.Errorf("error reading frequency: %v", err)
}
freq, hasLocs := decodeFreqHasLocs(freqHasLocs)
if freq == 0 {
return freq, 0, hasLocs, nil
}
normBits, err := i.freqNormReader.readUvarint()
if err != nil {
return 0, 0, false, fmt.Errorf("error reading norm: %v", err)
}
return freq, normBits, hasLocs, nil
}
func (i *PostingsIterator) skipFreqNormReadHasLocs() (bool, error) {
if i.normBits1Hit != 0 {
return false, nil
}
freqHasLocs, err := i.freqNormReader.readUvarint()
if err != nil {
return false, fmt.Errorf("error reading freqHasLocs: %v", err)
}
freq, hasLocs := decodeFreqHasLocs(freqHasLocs)
if freq == 0 {
return hasLocs, nil
}
i.freqNormReader.SkipUvarint() // Skip normBits.
return hasLocs, nil // See decodeFreqHasLocs() / hasLocs.
}
func encodeFreqHasLocs(freq uint64, hasLocs bool) uint64 {
rv := freq << 1
if hasLocs {
rv = rv | 0x01 // 0'th LSB encodes whether there are locations
}
return rv
}
func decodeFreqHasLocs(freqHasLocs uint64) (uint64, bool) {
freq := freqHasLocs >> 1
hasLocs := freqHasLocs&0x01 != 0
return freq, hasLocs
}
// readLocation processes all the integers on the stream representing a single
// location.
func (i *PostingsIterator) readLocation(l *Location) error {
// read off field
fieldID, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading location field: %v", err)
}
// read off pos
pos, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading location pos: %v", err)
}
// read off start
start, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading location start: %v", err)
}
// read off end
end, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading location end: %v", err)
}
// read off num array pos
numArrayPos, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading location num array pos: %v", err)
}
l.field = i.postings.sb.fieldsInv[fieldID]
l.pos = pos
l.start = start
l.end = end
if cap(l.ap) < int(numArrayPos) {
l.ap = make([]uint64, int(numArrayPos))
} else {
l.ap = l.ap[:int(numArrayPos)]
}
// read off array positions
for k := 0; k < int(numArrayPos); k++ {
ap, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading array position: %v", err)
}
l.ap[k] = ap
}
return nil
}
// Next returns the next posting on the postings list, or nil at the end
func (i *PostingsIterator) Next() (segment.Posting, error) {
return i.nextAtOrAfter(0)
}
// Advance returns the posting at the specified docNum or, if it is not
// present, the next posting; if the end is reached, it returns nil
func (i *PostingsIterator) Advance(docNum uint64) (segment.Posting, error) {
return i.nextAtOrAfter(docNum)
}
// nextAtOrAfter returns the next posting whose docNum is >= atOrAfter, or nil at the end
func (i *PostingsIterator) nextAtOrAfter(atOrAfter uint64) (segment.Posting, error) {
docNum, exists, err := i.nextDocNumAtOrAfter(atOrAfter)
if err != nil || !exists {
return nil, err
}
i.next = Posting{} // clear the struct
rv := &i.next
rv.docNum = docNum
if !i.includeFreqNorm {
return rv, nil
}
var normBits uint64
var hasLocs bool
rv.freq, normBits, hasLocs, err = i.readFreqNormHasLocs()
if err != nil {
return nil, err
}
rv.norm = math.Float32frombits(uint32(normBits))
if i.includeLocs && hasLocs {
// prepare locations into reused slices, where we assume
// rv.freq >= "number of locs", since in a composite field,
// some component fields might have their IncludeTermVector
// flags disabled while other component fields are enabled
if rv.freq > 0 {
if cap(i.nextLocs) >= int(rv.freq) {
i.nextLocs = i.nextLocs[0:rv.freq]
} else {
i.nextLocs = make([]Location, rv.freq, rv.freq*2)
}
if cap(i.nextSegmentLocs) < int(rv.freq) {
i.nextSegmentLocs = make([]segment.Location, rv.freq, rv.freq*2)
}
rv.locs = i.nextSegmentLocs[:0]
}
numLocsBytes, err := i.locReader.readUvarint()
if err != nil {
return nil, fmt.Errorf("error reading location numLocsBytes: %v", err)
}
j := 0
var nextLoc *Location
startBytesRemaining := i.locReader.Len() // # bytes remaining in the locReader
for startBytesRemaining-i.locReader.Len() < int(numLocsBytes) {
if len(i.nextLocs) > j {
nextLoc = &i.nextLocs[j]
} else {
nextLoc = &Location{}
}
err := i.readLocation(nextLoc)
if err != nil {
return nil, err
}
rv.locs = append(rv.locs, nextLoc)
j++
}
}
return rv, nil
}
// nextDocNumAtOrAfter returns the next docNum >= atOrAfter on the postings
// list, and also sets up the currChunk / loc related fields of the iterator.
func (i *PostingsIterator) nextDocNumAtOrAfter(atOrAfter uint64) (uint64, bool, error) {
if i.normBits1Hit != 0 {
if i.docNum1Hit == DocNum1HitFinished {
return 0, false, nil
}
if i.docNum1Hit < atOrAfter {
// advanced past our 1-hit
i.docNum1Hit = DocNum1HitFinished // consume our 1-hit docNum
return 0, false, nil
}
docNum := i.docNum1Hit
i.docNum1Hit = DocNum1HitFinished // consume our 1-hit docNum
return docNum, true, nil
}
if i.Actual == nil || !i.Actual.HasNext() {
return 0, false, nil
}
if i.postings == nil || i.postings == emptyPostingsList {
// couldn't find anything
return 0, false, nil
}
if i.postings.postings == i.ActualBM {
return i.nextDocNumAtOrAfterClean(atOrAfter)
}
if i.postings.chunkSize == 0 {
return 0, false, ErrChunkSizeZero
}
i.Actual.AdvanceIfNeeded(uint32(atOrAfter))
if !i.Actual.HasNext() || !i.all.HasNext() {
// couldn't find anything
return 0, false, nil
}
n := i.Actual.Next()
allN := i.all.Next()
nChunk := n / uint32(i.postings.chunkSize)
// once allN becomes >= this value, allN is in the same chunk as n (nChunk).
allNReachesNChunk := nChunk * uint32(i.postings.chunkSize)
// n is the next actual hit (excluding some postings), and
// allN is the next hit in the full postings, and
// if they don't match, move 'all' forwards until they do
for allN != n {
// we've reached same chunk, so move the freq/norm/loc decoders forward
if i.includeFreqNorm && allN >= allNReachesNChunk {
err := i.currChunkNext(nChunk)
if err != nil {
return 0, false, err
}
}
if !i.all.HasNext() {
return 0, false, nil
}
allN = i.all.Next()
}
if i.includeFreqNorm && (i.currChunk != nChunk || i.freqNormReader.isNil()) {
err := i.loadChunk(int(nChunk))
if err != nil {
return 0, false, fmt.Errorf("error loading chunk: %v", err)
}
}
return uint64(n), true, nil
}
var freqHasLocs1Hit = encodeFreqHasLocs(1, false)
// nextBytes returns the docNum and the encoded freq & loc bytes for
// the next posting
func (i *PostingsIterator) nextBytes() (
docNumOut uint64, freq uint64, normBits uint64,
bytesFreqNorm []byte, bytesLoc []byte, err error) {
docNum, exists, err := i.nextDocNumAtOrAfter(0)
if err != nil || !exists {
return 0, 0, 0, nil, nil, err
}
if i.normBits1Hit != 0 {
if i.buf == nil {
i.buf = make([]byte, binary.MaxVarintLen64*2)
}
n := binary.PutUvarint(i.buf, freqHasLocs1Hit)
n += binary.PutUvarint(i.buf[n:], i.normBits1Hit)
return docNum, uint64(1), i.normBits1Hit, i.buf[:n], nil, nil
}
startFreqNorm := i.freqNormReader.remainingLen()
var hasLocs bool
freq, normBits, hasLocs, err = i.readFreqNormHasLocs()
if err != nil {
return 0, 0, 0, nil, nil, err
}
endFreqNorm := i.freqNormReader.remainingLen()
bytesFreqNorm = i.freqNormReader.readBytes(startFreqNorm, endFreqNorm)
if hasLocs {
startLoc := i.locReader.remainingLen()
numLocsBytes, err := i.locReader.readUvarint()
if err != nil {
return 0, 0, 0, nil, nil,
fmt.Errorf("error reading location nextBytes numLocs: %v", err)
}
// skip over all the location bytes
i.locReader.SkipBytes(int(numLocsBytes))
endLoc := i.locReader.remainingLen()
bytesLoc = i.locReader.readBytes(startLoc, endLoc)
}
return docNum, freq, normBits, bytesFreqNorm, bytesLoc, nil
}
// optimization when the postings list is "clean" (e.g., no updates &
// no deletions) where the all bitmap is the same as the actual bitmap
func (i *PostingsIterator) nextDocNumAtOrAfterClean(
atOrAfter uint64) (uint64, bool, error) {
if !i.includeFreqNorm {
i.Actual.AdvanceIfNeeded(uint32(atOrAfter))
if !i.Actual.HasNext() {
return 0, false, nil // couldn't find anything
}
return uint64(i.Actual.Next()), true, nil
}
if i.postings != nil && i.postings.chunkSize == 0 {
return 0, false, ErrChunkSizeZero
}
// freq-norm's needed, so maintain freq-norm chunk reader
sameChunkNexts := 0 // # of times we called Next() in the same chunk
n := i.Actual.Next()
nChunk := n / uint32(i.postings.chunkSize)
for uint64(n) < atOrAfter && i.Actual.HasNext() {
n = i.Actual.Next()
nChunkPrev := nChunk
nChunk = n / uint32(i.postings.chunkSize)
if nChunk != nChunkPrev {
sameChunkNexts = 0
} else {
sameChunkNexts += 1
}
}
if uint64(n) < atOrAfter {
// couldn't find anything
return 0, false, nil
}
for j := 0; j < sameChunkNexts; j++ {
err := i.currChunkNext(nChunk)
if err != nil {
return 0, false, fmt.Errorf("error optimized currChunkNext: %v", err)
}
}
if i.currChunk != nChunk || i.freqNormReader.isNil() {
err := i.loadChunk(int(nChunk))
if err != nil {
return 0, false, fmt.Errorf("error loading chunk: %v", err)
}
}
return uint64(n), true, nil
}
func (i *PostingsIterator) currChunkNext(nChunk uint32) error {
if i.currChunk != nChunk || i.freqNormReader.isNil() {
err := i.loadChunk(int(nChunk))
if err != nil {
return fmt.Errorf("error loading chunk: %v", err)
}
}
// read off freq/offsets even though we don't care about them
hasLocs, err := i.skipFreqNormReadHasLocs()
if err != nil {
return err
}
if i.includeLocs && hasLocs {
numLocsBytes, err := i.locReader.readUvarint()
if err != nil {
return fmt.Errorf("error reading location numLocsBytes: %v", err)
}
// skip over all the location bytes
i.locReader.SkipBytes(int(numLocsBytes))
}
return nil
}
// DocNum1Hit returns the docNum and true if this is "1-hit" optimized
// and the docNum is available.
func (p *PostingsIterator) DocNum1Hit() (uint64, bool) {
if p.normBits1Hit != 0 && p.docNum1Hit != DocNum1HitFinished {
return p.docNum1Hit, true
}
return 0, false
}
// ActualBitmap returns the underlying actual bitmap
// which can be used up the stack for optimizations
func (p *PostingsIterator) ActualBitmap() *roaring.Bitmap {
return p.ActualBM
}
// ReplaceActual replaces the ActualBM with the provided
// bitmap
func (p *PostingsIterator) ReplaceActual(abm *roaring.Bitmap) {
p.ActualBM = abm
p.Actual = abm.Iterator()
}
// PostingsIteratorFromBitmap constructs a PostingsIterator given an
// "actual" bitmap.
func PostingsIteratorFromBitmap(bm *roaring.Bitmap,
includeFreqNorm, includeLocs bool) (segment.PostingsIterator, error) {
return &PostingsIterator{
ActualBM: bm,
Actual: bm.Iterator(),
includeFreqNorm: includeFreqNorm,
includeLocs: includeLocs,
}, nil
}
// PostingsIteratorFrom1Hit constructs a PostingsIterator given a
// 1-hit docNum.
func PostingsIteratorFrom1Hit(docNum1Hit uint64,
includeFreqNorm, includeLocs bool) (segment.PostingsIterator, error) {
return &PostingsIterator{
docNum1Hit: docNum1Hit,
normBits1Hit: NormBits1Hit,
includeFreqNorm: includeFreqNorm,
includeLocs: includeLocs,
}, nil
}
// Posting is a single entry in a postings list
type Posting struct {
docNum uint64
freq uint64
norm float32
locs []segment.Location
}
func (p *Posting) Size() int {
sizeInBytes := reflectStaticSizePosting
for _, entry := range p.locs {
sizeInBytes += entry.Size()
}
return sizeInBytes
}
// Number returns the document number of this posting in this segment
func (p *Posting) Number() uint64 {
return p.docNum
}
// Frequency returns the frequency of occurrence of this term in this doc/field
func (p *Posting) Frequency() uint64 {
return p.freq
}
// Norm returns the normalization factor for this posting
func (p *Posting) Norm() float64 {
return float64(float32(1.0 / math.Sqrt(float64(math.Float32bits(p.norm)))))
}
// Locations returns the location information for each occurrence
func (p *Posting) Locations() []segment.Location {
return p.locs
}
// NormUint64 returns the norm value as uint64
func (p *Posting) NormUint64() uint64 {
return uint64(math.Float32bits(p.norm))
}
// Location represents the location of a single occurrence
type Location struct {
field string
pos uint64
start uint64
end uint64
ap []uint64
}
func (l *Location) Size() int {
return reflectStaticSizeLocation +
len(l.field) +
len(l.ap)*SizeOfUint64
}
// Field returns the name of the field (useful in composite fields to know
// which original field the value came from)
func (l *Location) Field() string {
return l.field
}
// Start returns the start byte offset of this occurrence
func (l *Location) Start() uint64 {
return l.start
}
// End returns the end byte offset of this occurrence
func (l *Location) End() uint64 {
return l.end
}
// Pos returns the 1-based phrase position of this occurrence
func (l *Location) Pos() uint64 {
return l.pos
}
// ArrayPositions returns the array position vector associated with this occurrence
func (l *Location) ArrayPositions() []uint64 {
return l.ap
}
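
The freq/norm stream handled by readFreqNormHasLocs packs the term frequency and a "has locations" flag into a single uvarint. The following is a minimal standalone sketch of that packing, not part of the vendored file. encodeFreqHasLocs and decodeFreqHasLocs mirror the functions above, while roundTrip and main are hypothetical helpers added only for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeFreqHasLocs stores the "has locations" flag in the lowest bit and
// the frequency in the remaining bits.
func encodeFreqHasLocs(freq uint64, hasLocs bool) uint64 {
	rv := freq << 1
	if hasLocs {
		rv |= 0x01
	}
	return rv
}

// decodeFreqHasLocs reverses encodeFreqHasLocs.
func decodeFreqHasLocs(freqHasLocs uint64) (uint64, bool) {
	return freqHasLocs >> 1, freqHasLocs&0x01 != 0
}

// roundTrip (hypothetical helper) writes the packed value as a uvarint,
// reads it back, and reports how many bytes the uvarint occupied.
func roundTrip(freq uint64, hasLocs bool) (uint64, bool, int) {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, encodeFreqHasLocs(freq, hasLocs))
	packed, _ := binary.Uvarint(buf[:n])
	f, h := decodeFreqHasLocs(packed)
	return f, h, n
}

func main() {
	f, h, n := roundTrip(3, true)
	fmt.Println(f, h, n)
}
```

Because the flag costs one bit, a frequency of 64 or more needs a second uvarint byte, whereas without the flag the threshold would be 128.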

43
vendor/github.com/blevesearch/zapx/v16/read.go generated vendored Normal file
View File

@@ -0,0 +1,43 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import "encoding/binary"
func (sb *SegmentBase) getDocStoredMetaAndCompressed(docNum uint64) ([]byte, []byte) {
_, storedOffset, n, metaLen, dataLen := sb.getDocStoredOffsets(docNum)
meta := sb.mem[storedOffset+n : storedOffset+n+metaLen]
data := sb.mem[storedOffset+n+metaLen : storedOffset+n+metaLen+dataLen]
return meta, data
}
func (sb *SegmentBase) getDocStoredOffsets(docNum uint64) (
uint64, uint64, uint64, uint64, uint64) {
indexOffset := sb.storedIndexOffset + (8 * docNum)
storedOffset := binary.BigEndian.Uint64(sb.mem[indexOffset : indexOffset+8])
var n uint64
metaLen, read := binary.Uvarint(sb.mem[storedOffset : storedOffset+binary.MaxVarintLen64])
n += uint64(read)
dataLen, read := binary.Uvarint(sb.mem[storedOffset+n : storedOffset+n+binary.MaxVarintLen64])
n += uint64(read)
return indexOffset, storedOffset, n, metaLen, dataLen
}
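
The two functions above imply a layout for stored documents: an index of 8-byte big-endian offsets, each pointing at a record that starts with two uvarints (metaLen, dataLen) followed by the meta bytes and then the data bytes. The sketch below builds such a buffer and reads it back. It is an illustration under that assumption only: storedDoc, buildMem and readStored are hypothetical names, and the reader slices to the end of the buffer instead of using MaxVarintLen64 bounds so that it stays safe on a tiny buffer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// storedDoc is a hypothetical container for one document's stored bytes.
type storedDoc struct{ meta, data []byte }

// buildMem writes each record as [metaLen][dataLen][meta][data], then
// appends one 8-byte big-endian offset per document. It returns the buffer
// and the offset at which the index starts.
func buildMem(docs []storedDoc) ([]byte, uint64) {
	var mem []byte
	offsets := make([]uint64, 0, len(docs))
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, d := range docs {
		offsets = append(offsets, uint64(len(mem)))
		n := binary.PutUvarint(tmp, uint64(len(d.meta)))
		mem = append(mem, tmp[:n]...)
		n = binary.PutUvarint(tmp, uint64(len(d.data)))
		mem = append(mem, tmp[:n]...)
		mem = append(mem, d.meta...)
		mem = append(mem, d.data...)
	}
	indexOffset := uint64(len(mem))
	for _, off := range offsets {
		var b [8]byte
		binary.BigEndian.PutUint64(b[:], off)
		mem = append(mem, b[:]...)
	}
	return mem, indexOffset
}

// readStored follows the same steps as getDocStoredOffsets and
// getDocStoredMetaAndCompressed: index lookup, two uvarints, two slices.
func readStored(mem []byte, storedIndexOffset, docNum uint64) (meta, data []byte) {
	idx := storedIndexOffset + 8*docNum
	storedOffset := binary.BigEndian.Uint64(mem[idx : idx+8])
	var n uint64
	metaLen, read := binary.Uvarint(mem[storedOffset:])
	n += uint64(read)
	dataLen, read := binary.Uvarint(mem[storedOffset+n:])
	n += uint64(read)
	meta = mem[storedOffset+n : storedOffset+n+metaLen]
	data = mem[storedOffset+n+metaLen : storedOffset+n+metaLen+dataLen]
	return meta, data
}

func main() {
	mem, idx := buildMem([]storedDoc{
		{[]byte("m0"), []byte("hello")},
		{[]byte("meta1"), []byte("world!")},
	})
	meta, data := readStored(mem, idx, 1)
	fmt.Printf("%s %s\n", meta, data)
}
```

The fixed-width index is what makes lookup by docNum a constant-time operation, while the variable-length records stay compact.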

78
vendor/github.com/blevesearch/zapx/v16/section.go generated vendored Normal file
View File

@@ -0,0 +1,78 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"sync"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
)
type section interface {
// Process parses a specific field's content in a specific document.
// any tracking of processed data *specific to this section* should
// be done in opaque, which will be passed to the Persist() API.
Process(opaque map[int]resetable, docNum uint32, f index.Field, fieldID uint16)
// flush the processed data in the opaque to the writer.
Persist(opaque map[int]resetable, w *CountHashWriter) (n int64, err error)
// this API is used to fetch the file offset of the field for this section.
// this is used during search time to parse the section, and fetch results
// for the specific "index" that's part of the section.
AddrForField(opaque map[int]resetable, fieldID int) int
// for every field in the fieldsInv (relevant to this section) merge the section
// contents from all the segments into a single section data for the field.
// as part of the merge API, write the merged data to the writer and also track
// the starting offset of this newly merged section data.
Merge(opaque map[int]resetable, segments []*SegmentBase, drops []*roaring.Bitmap, fieldsInv []string,
newDocNumsIn [][]uint64, w *CountHashWriter, closeCh chan struct{}) error
// opaque is used to track the data specific to this section. it's not visible
// to the other sections and is only visible and freely modifiable by this specific
// section.
InitOpaque(args map[string]interface{}) resetable
}
type resetable interface {
Reset() error
Set(key string, value interface{})
}
// -----------------------------------------------------------------------------
const (
SectionInvertedTextIndex = iota
SectionFaissVectorIndex
SectionSynonymIndex
)
// -----------------------------------------------------------------------------
var (
segmentSectionsMutex sync.Mutex
// writes to segmentSections within init()s ONLY within lock,
// reads will not require lock access
segmentSections = make(map[uint16]section)
)
// Method to be invoked within init()s ONLY.
func registerSegmentSection(key uint16, val section) {
segmentSectionsMutex.Lock()
segmentSections[key] = val
segmentSectionsMutex.Unlock()
}
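
The registration pattern in section.go (iota keys, a map guarded by a mutex, writes only from init()) can be sketched in isolation as follows. This is a simplified illustration, not the zapx API: the section interface is reduced to a single hypothetical Name method, and namedSection, registerSection, sections and sectionsMu are made-up names.

```go
package main

import (
	"fmt"
	"sync"
)

// Section keys, assigned with iota in the same order as in section.go.
const (
	sectionInvertedTextIndex = iota
	sectionFaissVectorIndex
	sectionSynonymIndex
)

// section is a reduced stand-in for the real interface (assumption).
type section interface{ Name() string }

// namedSection is a hypothetical trivial implementation.
type namedSection string

func (n namedSection) Name() string { return string(n) }

var (
	sectionsMu sync.Mutex
	// written only from init() under the lock; read without locking after
	sections = make(map[uint16]section)
)

// registerSection is meant to be called from init() functions only.
func registerSection(key uint16, s section) {
	sectionsMu.Lock()
	defer sectionsMu.Unlock()
	sections[key] = s
}

func init() {
	registerSection(sectionInvertedTextIndex, namedSection("inverted-text"))
	registerSection(sectionFaissVectorIndex, namedSection("faiss-vector"))
}

func main() {
	fmt.Println(len(sections), sections[sectionFaissVectorIndex].Name())
}
```

Since all init() functions finish before main starts, later lock-free reads of the map do not race with these writes. A build-tagged file, such as the vectors file that follows, can add its section simply by being compiled in.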

View File

@@ -0,0 +1,776 @@
// Copyright (c) 2023 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build vectors
// +build vectors
package zap
import (
"encoding/binary"
"fmt"
"math"
"math/rand"
"sync/atomic"
"time"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
faiss "github.com/blevesearch/go-faiss"
seg "github.com/blevesearch/scorch_segment_api/v2"
)
const defaultFaissOMPThreads = 1
func init() {
rand.Seed(time.Now().UTC().UnixNano())
registerSegmentSection(SectionFaissVectorIndex, &faissVectorIndexSection{})
invertedTextIndexSectionExclusionChecks = append(invertedTextIndexSectionExclusionChecks, func(field index.Field) bool {
_, ok := field.(index.VectorField)
return ok
})
faiss.SetOMPThreads(defaultFaissOMPThreads)
}
type faissVectorIndexSection struct {
}
func (v *faissVectorIndexSection) Process(opaque map[int]resetable, docNum uint32, field index.Field, fieldID uint16) {
if fieldID == math.MaxUint16 {
return
}
if vf, ok := field.(index.VectorField); ok {
vo := v.getvectorIndexOpaque(opaque)
vo.process(vf, fieldID, docNum)
}
}
func (v *faissVectorIndexSection) Persist(opaque map[int]resetable, w *CountHashWriter) (n int64, err error) {
vo := v.getvectorIndexOpaque(opaque)
vo.writeVectorIndexes(w)
return 0, nil
}
func (v *faissVectorIndexSection) AddrForField(opaque map[int]resetable, fieldID int) int {
vo := v.getvectorIndexOpaque(opaque)
return vo.fieldAddrs[uint16(fieldID)]
}
// information specific to a vector index - (including metadata and
// the index pointer itself)
type vecIndexInfo struct {
startOffset int
indexSize uint64
vecIds []int64
indexOptimizedFor string
index *faiss.IndexImpl
}
// Merge must account for update and delete operations on vectors
func (v *faissVectorIndexSection) Merge(opaque map[int]resetable, segments []*SegmentBase,
drops []*roaring.Bitmap, fieldsInv []string,
newDocNumsIn [][]uint64, w *CountHashWriter, closeCh chan struct{}) error {
vo := v.getvectorIndexOpaque(opaque)
// the segments with valid vector sections in them
// preallocating the space over here, if there are too many fields
// in the segment this will help by avoiding multiple allocation
// calls.
vecSegs := make([]*SegmentBase, 0, len(segments))
indexes := make([]*vecIndexInfo, 0, len(segments))
for fieldID, fieldName := range fieldsInv {
indexes = indexes[:0] // resizing the slices
vecSegs = vecSegs[:0]
vecToDocID := make(map[int64]uint64)
// todo: would fetching the following stuff from segments in parallel
// be beneficial in terms of perf?
for segI, sb := range segments {
if isClosed(closeCh) {
return seg.ErrClosed
}
if _, ok := sb.fieldsMap[fieldName]; !ok {
continue
}
// check if the section address is a valid one for "fieldName" in the
// segment sb. the local fieldID (fetched by the fieldsMap of the sb)
// is to be used while consulting the fieldsSectionsMap
pos := int(sb.fieldsSectionsMap[sb.fieldsMap[fieldName]-1][SectionFaissVectorIndex])
if pos == 0 {
continue
}
// loading doc values - adhering to the sections format. never
// valid values for vector section
_, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
_, n = binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
// the index optimization type represented as an int
indexOptimizationTypeInt, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
numVecs, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
vecSegs = append(vecSegs, sb)
indexes = append(indexes, &vecIndexInfo{
vecIds: make([]int64, 0, numVecs),
indexOptimizedFor: index.VectorIndexOptimizationsReverseLookup[int(indexOptimizationTypeInt)],
})
curIdx := len(indexes) - 1
for i := 0; i < int(numVecs); i++ {
vecID, n := binary.Varint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
docID, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
// remap the docID from the old segment to the new document nos.
// provided. furthermore, also drop the now-invalid doc nums
// of that segment
if newDocNumsIn[segI][uint32(docID)] != docDropped {
newDocID := newDocNumsIn[segI][uint32(docID)]
// if the remapped doc ID is valid, track it
// as part of vecs to be reconstructed (for larger indexes).
// this would account only the valid vector IDs, so the deleted
// ones won't be reconstructed in the final index.
vecToDocID[vecID] = newDocID
indexes[curIdx].vecIds = append(indexes[curIdx].vecIds, vecID)
}
}
indexSize, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += n
indexes[curIdx].startOffset = pos
indexes[curIdx].indexSize = indexSize
pos += int(indexSize)
}
err := vo.flushSectionMetadata(fieldID, w, vecToDocID, indexes)
if err != nil {
return err
}
err = vo.mergeAndWriteVectorIndexes(vecSegs, indexes, w, closeCh)
if err != nil {
return err
}
}
return nil
}
func (v *vectorIndexOpaque) flushSectionMetadata(fieldID int, w *CountHashWriter,
vecToDocID map[int64]uint64, indexes []*vecIndexInfo) error {
tempBuf := v.grabBuf(binary.MaxVarintLen64)
// early exit if there are absolutely no valid vectors present in the segment
// and crucially don't store the section start offset in it
if len(indexes) == 0 || len(vecToDocID) == 0 {
return nil
}
fieldStart := w.Count()
// marking the fact that for a vector index, doc values aren't valid by
// storing fieldNotUninverted values.
n := binary.PutUvarint(tempBuf, uint64(fieldNotUninverted))
_, err := w.Write(tempBuf[:n])
if err != nil {
return err
}
n = binary.PutUvarint(tempBuf, uint64(fieldNotUninverted))
_, err = w.Write(tempBuf[:n])
if err != nil {
return err
}
n = binary.PutUvarint(tempBuf, uint64(index.SupportedVectorIndexOptimizations[indexes[0].indexOptimizedFor]))
_, err = w.Write(tempBuf[:n])
if err != nil {
return err
}
// write the number of unique vectors
n = binary.PutUvarint(tempBuf, uint64(len(vecToDocID)))
_, err = w.Write(tempBuf[:n])
if err != nil {
return err
}
for vecID, docID := range vecToDocID {
// write the vecID
n = binary.PutVarint(tempBuf, vecID)
_, err = w.Write(tempBuf[:n])
if err != nil {
return err
}
// write the docID
n = binary.PutUvarint(tempBuf, docID)
_, err = w.Write(tempBuf[:n])
if err != nil {
return err
}
}
v.fieldAddrs[uint16(fieldID)] = fieldStart
return nil
}
func (v *vectorIndexOpaque) flushVectorIndex(indexBytes []byte, w *CountHashWriter) error {
tempBuf := v.grabBuf(binary.MaxVarintLen64)
n := binary.PutUvarint(tempBuf, uint64(len(indexBytes)))
_, err := w.Write(tempBuf[:n])
if err != nil {
return err
}
// write the vector index data
_, err = w.Write(indexBytes)
return err
}
// Divide the estimated nprobe with this value to optimize
// for latency.
const nprobeLatencyOptimization = 2
// Calculates the nprobe count, given nlist (the number of centroids), based on
// the metric the index is optimized for.
func calculateNprobe(nlist int, indexOptimizedFor string) int32 {
nprobe := int32(math.Sqrt(float64(nlist)))
if indexOptimizedFor == index.IndexOptimizedForLatency {
nprobe /= nprobeLatencyOptimization
if nprobe < 1 {
nprobe = 1
}
}
return nprobe
}
// todo: naive implementation. need to keep in mind the perf implications and improve on this.
// perhaps, parallelized merging can help speed things up over here.
func (v *vectorIndexOpaque) mergeAndWriteVectorIndexes(sbs []*SegmentBase,
vecIndexes []*vecIndexInfo, w *CountHashWriter, closeCh chan struct{}) error {
// safe to assume that all the indexes are of the same config values, given
// that they are extracted from the field mapping info.
var dims, metric int
var indexOptimizedFor string
var validMerge bool
var finalVecIDCap, indexDataCap, reconsCap int
for segI, segBase := range sbs {
// Considering merge operations on vector indexes are expensive, it is
// worth including an early exit if the merge is aborted, saving us
// the resource spikes, even if temporary.
if isClosed(closeCh) {
freeReconstructedIndexes(vecIndexes)
return seg.ErrClosed
}
if len(vecIndexes[segI].vecIds) == 0 {
// no valid vectors for this index, don't bring it into memory
continue
}
// read the index bytes. todo: parallelize this
indexBytes := segBase.mem[vecIndexes[segI].startOffset : vecIndexes[segI].startOffset+int(vecIndexes[segI].indexSize)]
index, err := faiss.ReadIndexFromBuffer(indexBytes, faissIOFlags)
if err != nil {
freeReconstructedIndexes(vecIndexes)
return err
}
if len(vecIndexes[segI].vecIds) > 0 {
indexReconsLen := len(vecIndexes[segI].vecIds) * index.D()
if indexReconsLen > reconsCap {
reconsCap = indexReconsLen
}
indexDataCap += indexReconsLen
finalVecIDCap += len(vecIndexes[segI].vecIds)
}
vecIndexes[segI].index = index
validMerge = true
// set the dims and metric values from the constructed index.
dims = index.D()
metric = int(index.MetricType())
indexOptimizedFor = vecIndexes[segI].indexOptimizedFor
}
// not a valid merge operation as there are no valid indexes to merge.
if !validMerge {
return nil
}
finalVecIDs := make([]int64, 0, finalVecIDCap)
// merging of indexes with reconstruction method.
// the indexes[i].vecIds has only the valid vecs of this vector
// index present in it, so we'd be reconstructing only those.
indexData := make([]float32, 0, indexDataCap)
// reusable buffer for reconstruction
recons := make([]float32, 0, reconsCap)
var err error
for i := 0; i < len(vecIndexes); i++ {
if isClosed(closeCh) {
freeReconstructedIndexes(vecIndexes)
return seg.ErrClosed
}
// reconstruct the vectors only if present, it could be that
// some of the indexes had all of their vectors updated/deleted.
if len(vecIndexes[i].vecIds) > 0 {
neededReconsLen := len(vecIndexes[i].vecIds) * vecIndexes[i].index.D()
recons = recons[:neededReconsLen]
// todo: parallelize reconstruction
recons, err = vecIndexes[i].index.ReconstructBatch(vecIndexes[i].vecIds, recons)
if err != nil {
freeReconstructedIndexes(vecIndexes)
return err
}
indexData = append(indexData, recons...)
// Adding vector IDs in the same order as the vectors
finalVecIDs = append(finalVecIDs, vecIndexes[i].vecIds...)
}
}
if len(indexData) == 0 {
// no valid vectors for this index, so we don't even have to
// record it in the section
freeReconstructedIndexes(vecIndexes)
return nil
}
recons = nil
nvecs := len(finalVecIDs)
// index type to be created after merge based on the number of vectors
// in indexData added into the index.
nlist := determineCentroids(nvecs)
indexDescription, indexClass := determineIndexToUse(nvecs, nlist, indexOptimizedFor)
// freeing the reconstructed indexes immediately - waiting till the end
// to do the same is not needed because the following operations don't need
// the reconstructed ones anymore and doing so will hold up memory which can
// be detrimental while creating indexes during introduction.
freeReconstructedIndexes(vecIndexes)
faissIndex, err := faiss.IndexFactory(dims, indexDescription, metric)
if err != nil {
return err
}
defer faissIndex.Close()
if indexClass == IndexTypeIVF {
// the direct map maintained in the IVF index is essential for the
// reconstruction of vectors based on vector IDs in the future merges.
// the AddWithIDs API also needs a direct map to be set before using.
err = faissIndex.SetDirectMap(2)
if err != nil {
return err
}
nprobe := calculateNprobe(nlist, indexOptimizedFor)
faissIndex.SetNProbe(nprobe)
// train the vector index, essentially performs k-means clustering to partition
// the data space of indexData such that during the search time, we probe
// only a subset of vectors -> non-exhaustive search. could be a time
// consuming step when the indexData is large.
err = faissIndex.Train(indexData)
if err != nil {
return err
}
}
err = faissIndex.AddWithIDs(indexData, finalVecIDs)
if err != nil {
return err
}
mergedIndexBytes, err := faiss.WriteIndexIntoBuffer(faissIndex)
if err != nil {
return err
}
return v.flushVectorIndex(mergedIndexBytes, w)
}
// todo: can be parallelized.
func freeReconstructedIndexes(indexes []*vecIndexInfo) {
for _, entry := range indexes {
if entry.index != nil {
entry.index.Close()
}
}
}
// todo: is it possible to merge this reusable stuff with the interim's tmp0?
func (v *vectorIndexOpaque) grabBuf(size int) []byte {
buf := v.tmp0
if cap(buf) < size {
buf = make([]byte, size)
v.tmp0 = buf
}
return buf[0:size]
}
// Determines the number of centroids to use for an IVF index.
func determineCentroids(nvecs int) int {
var nlist int
switch {
case nvecs >= 200000:
nlist = int(4 * math.Sqrt(float64(nvecs)))
case nvecs >= 1000:
// 100 points per cluster is a reasonable default, considering that the default
// minimum and maximum points per cluster are 39 and 256 respectively.
// Since it's a recommendation to have a minimum of 10 clusters, 1000 (100 * 10)
// was chosen as the lower threshold.
nlist = nvecs / 100
}
return nlist
}
const (
IndexTypeFlat = iota
IndexTypeIVF
)
// Returns a description string for the index and quantizer type
// and an index type.
func determineIndexToUse(nvecs, nlist int, indexOptimizedFor string) (string, int) {
if indexOptimizedFor == index.IndexOptimizedForMemoryEfficient {
switch {
case nvecs >= 1000:
return fmt.Sprintf("IVF%d,SQ4", nlist), IndexTypeIVF
default:
return "IDMap2,Flat", IndexTypeFlat
}
}
switch {
case nvecs >= 10000:
return fmt.Sprintf("IVF%d,SQ8", nlist), IndexTypeIVF
case nvecs >= 1000:
return fmt.Sprintf("IVF%d,Flat", nlist), IndexTypeIVF
default:
return "IDMap2,Flat", IndexTypeFlat
}
}
func (vo *vectorIndexOpaque) writeVectorIndexes(w *CountHashWriter) (offset uint64, err error) {
// for every fieldID, contents to store over here are:
// 1. the serialized representation of the dense vector index.
// 2. its constituent vectorID -> {docID} mapping.
tempBuf := vo.grabBuf(binary.MaxVarintLen64)
for fieldID, content := range vo.vecFieldMap {
// calculate the capacity of the vecs and ids slices
// to avoid multiple allocations.
vecs := make([]float32, 0, len(content.vecs)*int(content.dim))
ids := make([]int64, 0, len(content.vecs))
for hash, vecInfo := range content.vecs {
vecs = append(vecs, vecInfo.vec...)
ids = append(ids, hash)
}
// Set the faiss metric type (default is Euclidean Distance or l2_norm)
var metric = faiss.MetricL2
if content.metric == index.InnerProduct || content.metric == index.CosineSimilarity {
// use the same FAISS metric for inner product and cosine similarity
metric = faiss.MetricInnerProduct
}
nvecs := len(ids)
nlist := determineCentroids(nvecs)
indexDescription, indexClass := determineIndexToUse(nvecs, nlist,
content.indexOptimizedFor)
faissIndex, err := faiss.IndexFactory(int(content.dim), indexDescription, metric)
if err != nil {
return 0, err
}
defer faissIndex.Close()
if indexClass == IndexTypeIVF {
err = faissIndex.SetDirectMap(2)
if err != nil {
return 0, err
}
nprobe := calculateNprobe(nlist, content.indexOptimizedFor)
faissIndex.SetNProbe(nprobe)
err = faissIndex.Train(vecs)
if err != nil {
return 0, err
}
}
err = faissIndex.AddWithIDs(vecs, ids)
if err != nil {
return 0, err
}
fieldStart := w.Count()
// writing out two offset values to indicate that the current field's
// vector section doesn't have valid doc value content within it.
n := binary.PutUvarint(tempBuf, uint64(fieldNotUninverted))
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
n = binary.PutUvarint(tempBuf, uint64(fieldNotUninverted))
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
n = binary.PutUvarint(tempBuf, uint64(index.SupportedVectorIndexOptimizations[content.indexOptimizedFor]))
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
// write the number of unique vectors
n = binary.PutUvarint(tempBuf, uint64(faissIndex.Ntotal()))
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
// fixme: this can cause write amplification. need to improve this.
// todo: might need a reformatting to optimize according to mmap needs.
// reformatting idea: storing all the ID mappings towards the end of the
// section would help avoid paging in this data as part of a page
// (which is to load non-cacheable info like the index). this could help
// reduce the paging costs.
for vecID := range content.vecs {
docID := vo.vecIDMap[vecID].docID
// write the vecID
n = binary.PutVarint(tempBuf, vecID)
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
n = binary.PutUvarint(tempBuf, uint64(docID))
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
}
// serialize the built index into a byte slice
buf, err := faiss.WriteIndexIntoBuffer(faissIndex)
if err != nil {
return 0, err
}
// the vecID -> docID mapping has been written above; now write the
// length of the serialized index, followed by the index bytes.
n = binary.PutUvarint(tempBuf, uint64(len(buf)))
_, err = w.Write(tempBuf[:n])
if err != nil {
return 0, err
}
// write the vector index data
_, err = w.Write(buf)
if err != nil {
return 0, err
}
// accounts for whatever data has been written out to the writer.
vo.incrementBytesWritten(uint64(w.Count() - fieldStart))
vo.fieldAddrs[fieldID] = fieldStart
}
return 0, nil
}
func (vo *vectorIndexOpaque) process(field index.VectorField, fieldID uint16, docNum uint32) {
if !vo.init {
vo.realloc()
vo.init = true
}
if fieldID == math.MaxUint16 {
// doc processing checkpoint. currently nothing to do
return
}
// process field
vec := field.Vector()
dim := field.Dims()
metric := field.Similarity()
indexOptimizedFor := field.IndexOptimizedFor()
// caller is supposed to make sure len(vec) is a multiple of dim.
// Not double checking it here to avoid the overhead.
numSubVecs := len(vec) / dim
for i := 0; i < numSubVecs; i++ {
subVec := vec[i*dim : (i+1)*dim]
// NOTE: currently, indexing only unique vectors.
subVecHash := hashCode(subVec)
if _, ok := vo.vecIDMap[subVecHash]; !ok {
vo.vecIDMap[subVecHash] = &vecInfo{
docID: docNum,
}
}
// tracking the unique vectors for every field which will be used later
// to construct the vector index.
if _, ok := vo.vecFieldMap[fieldID]; !ok {
vo.vecFieldMap[fieldID] = &indexContent{
vecs: map[int64]*vecInfo{
subVecHash: &vecInfo{
vec: subVec,
},
},
dim: uint16(dim),
metric: metric,
indexOptimizedFor: indexOptimizedFor,
}
} else {
vo.vecFieldMap[fieldID].vecs[subVecHash] = &vecInfo{
vec: subVec,
}
}
}
}
// todo: better hash function?
// keep the perf aspects in mind with respect to the hash function.
// Uses a time based seed to prevent 2 identical vectors in different
// segments from having the same hash (which otherwise could cause an
// issue when merging those segments)
func hashCode(a []float32) int64 {
var rv, sum int64
for _, v := range a {
// Weighing each element of the vector differently to minimise chance
// of collisions between non identical vectors.
sum = int64(math.Float32bits(v)) + sum*31
}
// Similar to getVectorCode(), this uses the upper 32 bits for the vector sum
// and the lower 32 bits for a random non-negative 31-bit int (rand.Int31), so
// that identical vectors are very unlikely to share a hash.
rv = sum<<32 | int64(rand.Int31())
return rv
}
func (v *faissVectorIndexSection) getvectorIndexOpaque(opaque map[int]resetable) *vectorIndexOpaque {
if _, ok := opaque[SectionFaissVectorIndex]; !ok {
opaque[SectionFaissVectorIndex] = v.InitOpaque(nil)
}
return opaque[SectionFaissVectorIndex].(*vectorIndexOpaque)
}
func (v *faissVectorIndexSection) InitOpaque(args map[string]interface{}) resetable {
rv := &vectorIndexOpaque{
fieldAddrs: make(map[uint16]int),
vecIDMap: make(map[int64]*vecInfo),
vecFieldMap: make(map[uint16]*indexContent),
}
for k, v := range args {
rv.Set(k, v)
}
return rv
}
type indexContent struct {
vecs map[int64]*vecInfo
dim uint16
metric string
indexOptimizedFor string
}
type vecInfo struct {
vec []float32
docID uint32
}
type vectorIndexOpaque struct {
init bool
bytesWritten uint64
lastNumVecs int
lastNumFields int
// maps the field to the address of its vector section
fieldAddrs map[uint16]int
// maps the vecID to basic info about it, such as
// the docID it's present in and the vector itself
vecIDMap map[int64]*vecInfo
// maps the field to information necessary for its vector
// index to be built.
vecFieldMap map[uint16]*indexContent
tmp0 []byte
}
func (v *vectorIndexOpaque) realloc() {
// when an opaque instance is reused, the maps are pre-allocated with the
// sizes they had before the last reset. this can be useful in continuous
// mutation scenarios, where the batch sizes are more or less the same.
v.vecFieldMap = make(map[uint16]*indexContent, v.lastNumFields)
v.vecIDMap = make(map[int64]*vecInfo, v.lastNumVecs)
v.fieldAddrs = make(map[uint16]int, v.lastNumFields)
}
func (v *vectorIndexOpaque) incrementBytesWritten(val uint64) {
atomic.AddUint64(&v.bytesWritten, val)
}
func (v *vectorIndexOpaque) BytesWritten() uint64 {
return atomic.LoadUint64(&v.bytesWritten)
}
func (v *vectorIndexOpaque) BytesRead() uint64 {
return 0
}
func (v *vectorIndexOpaque) ResetBytesRead(uint64) {
}
// cleanup stuff over here for reusability
func (v *vectorIndexOpaque) Reset() (err error) {
// tracking the number of vecs and fields processed and tracked in this
// opaque, for better allocations of the maps
v.lastNumVecs = len(v.vecIDMap)
v.lastNumFields = len(v.vecFieldMap)
v.init = false
v.fieldAddrs = nil
v.vecFieldMap = nil
v.vecIDMap = nil
v.tmp0 = v.tmp0[:0]
atomic.StoreUint64(&v.bytesWritten, 0)
return nil
}
func (v *vectorIndexOpaque) Set(key string, val interface{}) {
}
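
The index-selection thresholds in determineCentroids and determineIndexToUse, from the file that ends here, can be tried in isolation. The following is a minimal sketch that copies only the thresholds visible in the diff. The constant `memoryEfficient` and the helper `describeIndex` are hypothetical names introduced for illustration, and the string value of the constant is an assumption about `index.IndexOptimizedForMemoryEfficient`. No FAISS calls are made.

```go
package main

import (
	"fmt"
	"math"
)

// memoryEfficient is a hypothetical stand-in for
// index.IndexOptimizedForMemoryEfficient; its value is an assumption.
const memoryEfficient = "memory-efficient"

// determineCentroids mirrors the thresholds shown in the diff:
// 4*sqrt(n) centroids for large inputs, n/100 for mid-sized ones,
// and 0 (no IVF clustering) below 1000 vectors.
func determineCentroids(nvecs int) int {
	switch {
	case nvecs >= 200000:
		return int(4 * math.Sqrt(float64(nvecs)))
	case nvecs >= 1000:
		return nvecs / 100
	}
	return 0
}

// describeIndex (hypothetical helper) returns the FAISS factory
// description string that would be chosen for nvecs vectors.
func describeIndex(nvecs int, optimizedFor string) string {
	nlist := determineCentroids(nvecs)
	if optimizedFor == memoryEfficient {
		if nvecs >= 1000 {
			return fmt.Sprintf("IVF%d,SQ4", nlist)
		}
		return "IDMap2,Flat"
	}
	switch {
	case nvecs >= 10000:
		return fmt.Sprintf("IVF%d,SQ8", nlist)
	case nvecs >= 1000:
		return fmt.Sprintf("IVF%d,Flat", nlist)
	}
	return "IDMap2,Flat"
}

func main() {
	for _, n := range []int{500, 5000, 50000, 250000} {
		fmt.Printf("%7d vectors: default=%s memory-efficient=%s\n",
			n, describeIndex(n, ""), describeIndex(n, memoryEfficient))
	}
}
```

Below 1000 vectors both modes fall back to a flat index, which needs no training step. That is consistent with Train being called only for the IVF class in the code above.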

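The bit layout produced by hashCode in the vector section can be checked with a small standalone program. This is a sketch under the assumption that the global math/rand source is used exactly as in the diff. `contentBits` is a hypothetical helper added only to inspect the upper half of the hash.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// hashCode reproduces the scheme shown in the diff: a weighted sum of
// the float bit patterns in the upper 32 bits, and a random
// non-negative 31-bit value in the lower 32 bits.
func hashCode(a []float32) int64 {
	var sum int64
	for _, v := range a {
		sum = int64(math.Float32bits(v)) + sum*31
	}
	return sum<<32 | int64(rand.Int31())
}

// contentBits (hypothetical helper) extracts the upper 32 bits, which
// depend only on the vector contents.
func contentBits(h int64) int32 {
	return int32(h >> 32)
}

func main() {
	v := []float32{0.1, 0.2, 0.3}
	h1, h2 := hashCode(v), hashCode(v)
	// The content-derived half matches for identical vectors, while the
	// full hashes will usually differ because of the random lower half.
	fmt.Println("same content bits:", contentBits(h1) == contentBits(h2))
	fmt.Println("same full hash:", h1 == h2)
}
```

Only the low 32 bits of the weighted sum survive the shift, so two different vectors can still collide in the content half. The random half reduces the chance of a full collision but does not rule it out.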
File diff suppressed because it is too large


@@ -0,0 +1,786 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"fmt"
"math"
"sort"
"github.com/RoaringBitmap/roaring/v2"
"github.com/RoaringBitmap/roaring/v2/roaring64"
index "github.com/blevesearch/bleve_index_api"
seg "github.com/blevesearch/scorch_segment_api/v2"
"github.com/blevesearch/vellum"
)
func init() {
registerSegmentSection(SectionSynonymIndex, &synonymIndexSection{})
invertedTextIndexSectionExclusionChecks = append(invertedTextIndexSectionExclusionChecks, func(field index.Field) bool {
_, ok := field.(index.SynonymField)
return ok
})
}
// -----------------------------------------------------------------------------
type synonymIndexOpaque struct {
results []index.Document
// indicates whether the following structs are initialized
init bool
// FieldsMap maps field name to field id and must be set in
// the index opaque using the key "fieldsMap"
// used for ensuring accurate mapping between fieldID and
// thesaurusID
// name -> field id + 1
FieldsMap map[string]uint16
// ThesaurusMap adds 1 to thesaurus id to avoid zero value issues
// name -> thesaurus id + 1
ThesaurusMap map[string]uint16
// ThesaurusInv is the inverse of ThesaurusMap
// thesaurus id -> name
ThesaurusInv []string
// Thesaurus for each thesaurus ID
// thesaurus id -> LHS term -> synonym postings list id + 1
Thesauri []map[string]uint64
// LHS Terms for each thesaurus ID, where terms are sorted ascending
// thesaurus id -> []term
ThesaurusKeys [][]string
// FieldIDtoThesaurusID maps the field id to the thesaurus id
// field id -> thesaurus id
FieldIDtoThesaurusID map[uint16]int
// SynonymTermToID maps term to synonym id for each thesaurus
// thesaurus id -> term -> synonym id
SynonymTermToID []map[string]uint32
// SynonymIDtoTerm maps synonym id to term for each thesaurus
// thesaurus id -> synonym id -> term
// this is the inverse of SynonymTermToID for each thesaurus
SynonymIDtoTerm []map[uint32]string
// synonym postings list -> synonym bitmap
Synonyms []*roaring64.Bitmap
// A reusable vellum FST builder that is stored in the synonym opaque
// and reused across multiple document batches. During the persist phase
// of the synonym index section, the builder is used to build the FST for
// each thesaurus, which maps terms to their corresponding synonym bitmaps.
builder *vellum.Builder
// A reusable buffer for the vellum FST builder. It streams data written
// into the builder into a byte slice. The final byte slice represents
// the serialized vellum FST, which will be written to disk
builderBuf bytes.Buffer
// A reusable buffer for temporary use within the synonym index opaque
tmp0 []byte
// A map linking thesaurus IDs to their corresponding thesaurus' file offsets
thesaurusAddrs map[int]int
}
// Set the fieldsMap and results in the synonym index opaque before the section processes a synonym field.
func (so *synonymIndexOpaque) Set(key string, value interface{}) {
switch key {
case "results":
so.results = value.([]index.Document)
case "fieldsMap":
so.FieldsMap = value.(map[string]uint16)
}
}
// Reset the synonym index opaque after a batch of documents has been processed into a segment.
func (so *synonymIndexOpaque) Reset() (err error) {
// cleanup stuff over here
so.results = nil
so.init = false
so.ThesaurusMap = nil
so.ThesaurusInv = so.ThesaurusInv[:0]
for i := range so.Thesauri {
so.Thesauri[i] = nil
}
so.Thesauri = so.Thesauri[:0]
for i := range so.ThesaurusKeys {
so.ThesaurusKeys[i] = so.ThesaurusKeys[i][:0]
}
so.ThesaurusKeys = so.ThesaurusKeys[:0]
for _, idn := range so.Synonyms {
idn.Clear()
}
so.Synonyms = so.Synonyms[:0]
so.builderBuf.Reset()
if so.builder != nil {
err = so.builder.Reset(&so.builderBuf)
}
so.FieldIDtoThesaurusID = nil
so.SynonymTermToID = so.SynonymTermToID[:0]
so.SynonymIDtoTerm = so.SynonymIDtoTerm[:0]
so.tmp0 = so.tmp0[:0]
return err
}
func (so *synonymIndexOpaque) process(field index.SynonymField, fieldID uint16, docNum uint32) {
// if this is the first time we are processing a synonym field in this batch
// we need to allocate memory for the thesauri and related data structures
if !so.init {
so.realloc()
so.init = true
}
// get the thesaurus id for this field
tid := so.FieldIDtoThesaurusID[fieldID]
// get the thesaurus for this field
thesaurus := so.Thesauri[tid]
termSynMap := so.SynonymTermToID[tid]
field.IterateSynonyms(func(term string, synonyms []string) {
pid := thesaurus[term] - 1
bs := so.Synonyms[pid]
for _, syn := range synonyms {
code := encodeSynonym(termSynMap[syn], docNum)
bs.Add(code)
}
})
}
// a one-time call to allocate memory for the thesauri and synonyms which takes
// all the documents in the result batch and the fieldsMap and predetermines the
// size of the data structures in the synonymIndexOpaque
func (so *synonymIndexOpaque) realloc() {
var pidNext int
var sidNext uint32
so.ThesaurusMap = map[string]uint16{}
so.FieldIDtoThesaurusID = map[uint16]int{}
// count the number of unique thesauri from the batch of documents
for _, result := range so.results {
if synDoc, ok := result.(index.SynonymDocument); ok {
synDoc.VisitSynonymFields(func(synField index.SynonymField) {
fieldIDPlus1 := so.FieldsMap[synField.Name()]
so.getOrDefineThesaurus(fieldIDPlus1-1, synField.Name())
})
}
}
for _, result := range so.results {
if synDoc, ok := result.(index.SynonymDocument); ok {
synDoc.VisitSynonymFields(func(synField index.SynonymField) {
fieldIDPlus1 := so.FieldsMap[synField.Name()]
thesaurusID := so.getOrDefineThesaurus(fieldIDPlus1-1, synField.Name())
thesaurus := so.Thesauri[thesaurusID]
thesaurusKeys := so.ThesaurusKeys[thesaurusID]
synTermMap := so.SynonymIDtoTerm[thesaurusID]
termSynMap := so.SynonymTermToID[thesaurusID]
// iterate over all the term-synonyms pairs from the field
synField.IterateSynonyms(func(term string, synonyms []string) {
_, exists := thesaurus[term]
if !exists {
pidNext++
pidPlus1 := uint64(pidNext)
thesaurus[term] = pidPlus1
thesaurusKeys = append(thesaurusKeys, term)
}
for _, syn := range synonyms {
_, exists := termSynMap[syn]
if !exists {
termSynMap[syn] = sidNext
synTermMap[sidNext] = syn
sidNext++
}
}
})
so.ThesaurusKeys[thesaurusID] = thesaurusKeys
})
}
}
numSynonymsLists := pidNext
if cap(so.Synonyms) >= numSynonymsLists {
so.Synonyms = so.Synonyms[:numSynonymsLists]
} else {
synonyms := make([]*roaring64.Bitmap, numSynonymsLists)
copy(synonyms, so.Synonyms[:cap(so.Synonyms)])
for i := 0; i < numSynonymsLists; i++ {
if synonyms[i] == nil {
synonyms[i] = roaring64.New()
}
}
so.Synonyms = synonyms
}
for _, thes := range so.ThesaurusKeys {
sort.Strings(thes)
}
}
// getOrDefineThesaurus returns the thesaurus id for the given field id and thesaurus name.
func (so *synonymIndexOpaque) getOrDefineThesaurus(fieldID uint16, thesaurusName string) int {
thesaurusIDPlus1, exists := so.ThesaurusMap[thesaurusName]
if !exists {
// need to create a new thesaurusID for this thesaurusName and
// allocate its per-thesaurus maps
thesaurusIDPlus1 = uint16(len(so.ThesaurusInv) + 1)
so.ThesaurusMap[thesaurusName] = thesaurusIDPlus1
so.ThesaurusInv = append(so.ThesaurusInv, thesaurusName)
so.Thesauri = append(so.Thesauri, make(map[string]uint64))
so.SynonymIDtoTerm = append(so.SynonymIDtoTerm, make(map[uint32]string))
so.SynonymTermToID = append(so.SynonymTermToID, make(map[string]uint32))
// map the fieldID to the thesaurusID
so.FieldIDtoThesaurusID[fieldID] = int(thesaurusIDPlus1 - 1)
n := len(so.ThesaurusKeys)
if n < cap(so.ThesaurusKeys) {
so.ThesaurusKeys = so.ThesaurusKeys[:n+1]
so.ThesaurusKeys[n] = so.ThesaurusKeys[n][:0]
} else {
so.ThesaurusKeys = append(so.ThesaurusKeys, []string(nil))
}
}
return int(thesaurusIDPlus1 - 1)
}
// grabBuf returns a reusable buffer of the given size from the synonymIndexOpaque.
func (so *synonymIndexOpaque) grabBuf(size int) []byte {
buf := so.tmp0
if cap(buf) < size {
buf = make([]byte, size)
so.tmp0 = buf
}
return buf[:size]
}
func (so *synonymIndexOpaque) writeThesauri(w *CountHashWriter) (thesOffsets []uint64, err error) {
if so.results == nil || len(so.results) == 0 {
return nil, nil
}
thesOffsets = make([]uint64, len(so.ThesaurusInv))
buf := so.grabBuf(binary.MaxVarintLen64)
if so.builder == nil {
so.builder, err = vellum.New(&so.builderBuf, nil)
if err != nil {
return nil, err
}
}
for thesaurusID, terms := range so.ThesaurusKeys {
thes := so.Thesauri[thesaurusID]
for _, term := range terms { // terms are already sorted
pid := thes[term] - 1
postingsBS := so.Synonyms[pid]
postingsOffset, err := writeSynonyms(postingsBS, w, buf)
if err != nil {
return nil, err
}
if postingsOffset > uint64(0) {
err = so.builder.Insert([]byte(term), postingsOffset)
if err != nil {
return nil, err
}
}
}
err = so.builder.Close()
if err != nil {
return nil, err
}
thesOffsets[thesaurusID] = uint64(w.Count())
vellumData := so.builderBuf.Bytes()
// write out the length of the vellum data
n := binary.PutUvarint(buf, uint64(len(vellumData)))
_, err = w.Write(buf[:n])
if err != nil {
return nil, err
}
// write this vellum to disk
_, err = w.Write(vellumData)
if err != nil {
return nil, err
}
// reset vellum for reuse
so.builderBuf.Reset()
err = so.builder.Reset(&so.builderBuf)
if err != nil {
return nil, err
}
// write out the synTermMap for this thesaurus
err := writeSynTermMap(so.SynonymIDtoTerm[thesaurusID], w, buf)
if err != nil {
return nil, err
}
thesaurusStart := w.Count()
n = binary.PutUvarint(buf, fieldNotUninverted)
_, err = w.Write(buf[:n])
if err != nil {
return nil, err
}
n = binary.PutUvarint(buf, fieldNotUninverted)
_, err = w.Write(buf[:n])
if err != nil {
return nil, err
}
n = binary.PutUvarint(buf, thesOffsets[thesaurusID])
_, err = w.Write(buf[:n])
if err != nil {
return nil, err
}
so.thesaurusAddrs[thesaurusID] = thesaurusStart
}
return thesOffsets, nil
}
// -----------------------------------------------------------------------------
type synonymIndexSection struct {
}
func (s *synonymIndexSection) getSynonymIndexOpaque(opaque map[int]resetable) *synonymIndexOpaque {
if _, ok := opaque[SectionSynonymIndex]; !ok {
opaque[SectionSynonymIndex] = s.InitOpaque(nil)
}
return opaque[SectionSynonymIndex].(*synonymIndexOpaque)
}
// Implementations of the Section interface for the synonym index section.
// InitOpaque initializes the synonym index opaque, which sets the FieldsMap and
// results in the opaque before the section processes a synonym field.
func (s *synonymIndexSection) InitOpaque(args map[string]interface{}) resetable {
rv := &synonymIndexOpaque{
thesaurusAddrs: map[int]int{},
}
for k, v := range args {
rv.Set(k, v)
}
return rv
}
// Process processes a synonym field by adding the synonyms to the thesaurus
// pointed to by the fieldID. Implements the Process API for the synonym index section.
func (s *synonymIndexSection) Process(opaque map[int]resetable, docNum uint32, field index.Field, fieldID uint16) {
if fieldID == math.MaxUint16 {
return
}
if sf, ok := field.(index.SynonymField); ok {
so := s.getSynonymIndexOpaque(opaque)
so.process(sf, fieldID, docNum)
}
}
// Persist serializes and writes the thesauri processed to the writer, along
// with the synonym postings lists, and the synonym term map. Implements the
// Persist API for the synonym index section.
func (s *synonymIndexSection) Persist(opaque map[int]resetable, w *CountHashWriter) (n int64, err error) {
synIndexOpaque := s.getSynonymIndexOpaque(opaque)
_, err = synIndexOpaque.writeThesauri(w)
return 0, err
}
// AddrForField returns the file offset of the thesaurus for the given fieldID.
// It uses the FieldIDtoThesaurusID map to translate the fieldID to the thesaurusID,
// and returns the corresponding thesaurus offset from the thesaurusAddrs map.
// Implements the AddrForField API for the synonym index section.
func (s *synonymIndexSection) AddrForField(opaque map[int]resetable, fieldID int) int {
synIndexOpaque := s.getSynonymIndexOpaque(opaque)
if synIndexOpaque == nil || synIndexOpaque.FieldIDtoThesaurusID == nil {
return 0
}
tid, exists := synIndexOpaque.FieldIDtoThesaurusID[uint16(fieldID)]
if !exists {
return 0
}
return synIndexOpaque.thesaurusAddrs[tid]
}
// Merge merges the thesauri, synonym postings lists and synonym term maps from
// the segments into a single thesaurus and serializes and writes the merged
// thesaurus and associated data to the writer. Implements the Merge API for the
// synonym index section.
func (s *synonymIndexSection) Merge(opaque map[int]resetable, segments []*SegmentBase,
drops []*roaring.Bitmap, fieldsInv []string, newDocNumsIn [][]uint64,
w *CountHashWriter, closeCh chan struct{}) error {
so := s.getSynonymIndexOpaque(opaque)
thesaurusAddrs, fieldIDtoThesaurusID, err := mergeAndPersistSynonymSection(segments, drops, fieldsInv, newDocNumsIn, w, closeCh)
if err != nil {
return err
}
so.thesaurusAddrs = thesaurusAddrs
so.FieldIDtoThesaurusID = fieldIDtoThesaurusID
return nil
}
// -----------------------------------------------------------------------------
// encodeSynonym encodes a synonymID and a docID into a single uint64 value.
// The encoding format splits the 64 bits as follows:
//
//	 63        32 31       0
//	+-----------+----------+
//	| synonymID |  docNum  |
//	+-----------+----------+
//
// The upper 32 bits (63-32) store the synonymID, and the lower 32 bits (31-0) store the docID.
//
// Parameters:
//
// synonymID - A 32-bit unsigned integer representing the ID of the synonym.
// docID - A 32-bit unsigned integer representing the document ID.
//
// Returns:
//
// A 64-bit unsigned integer that combines the synonymID and docID.
func encodeSynonym(synonymID uint32, docID uint32) uint64 {
return uint64(synonymID)<<32 | uint64(docID)
}
// writeSynonyms serializes and writes the synonym postings list to the writer, by first
// serializing the postings list to a byte slice and then writing the length
// of the byte slice followed by the byte slice itself.
func writeSynonyms(postings *roaring64.Bitmap, w *CountHashWriter, bufMaxVarintLen64 []byte) (
offset uint64, err error) {
termCardinality := postings.GetCardinality()
if termCardinality <= 0 {
return 0, nil
}
postingsOffset := uint64(w.Count())
buf, err := postings.ToBytes()
if err != nil {
return 0, err
}
// write out the length
n := binary.PutUvarint(bufMaxVarintLen64, uint64(len(buf)))
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return 0, err
}
// write out the roaring bytes
_, err = w.Write(buf)
if err != nil {
return 0, err
}
return postingsOffset, nil
}
// writeSynTermMap serializes and writes the synonym term map to the writer, by first
// writing the length of the map followed by the map entries, where each entry
// consists of the synonym ID, the length of the term, and the term itself.
func writeSynTermMap(synTermMap map[uint32]string, w *CountHashWriter, bufMaxVarintLen64 []byte) error {
if len(synTermMap) == 0 {
return nil
}
n := binary.PutUvarint(bufMaxVarintLen64, uint64(len(synTermMap)))
_, err := w.Write(bufMaxVarintLen64[:n])
if err != nil {
return err
}
for sid, term := range synTermMap {
n = binary.PutUvarint(bufMaxVarintLen64, uint64(sid))
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return err
}
n = binary.PutUvarint(bufMaxVarintLen64, uint64(len(term)))
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return err
}
_, err = w.Write([]byte(term))
if err != nil {
return err
}
}
return nil
}
func mergeAndPersistSynonymSection(segments []*SegmentBase, dropsIn []*roaring.Bitmap,
fieldsInv []string, newDocNumsIn [][]uint64, w *CountHashWriter,
closeCh chan struct{}) (map[int]int, map[uint16]int, error) {
var bufMaxVarintLen64 []byte = make([]byte, binary.MaxVarintLen64)
var synonyms *SynonymsList
var synItr *SynonymsIterator
thesaurusAddrs := make(map[int]int)
var vellumBuf bytes.Buffer
newVellum, err := vellum.New(&vellumBuf, nil)
if err != nil {
return nil, nil, err
}
newRoaring := roaring64.NewBitmap()
newDocNums := make([][]uint64, 0, len(segments))
drops := make([]*roaring.Bitmap, 0, len(segments))
thesauri := make([]*Thesaurus, 0, len(segments))
itrs := make([]vellum.Iterator, 0, len(segments))
fieldIDtoThesaurusID := make(map[uint16]int)
var thesaurusID int
var newSynonymID uint32
// for each field
for fieldID, fieldName := range fieldsInv {
// collect FST iterators from all active segments for this field
newDocNums = newDocNums[:0]
drops = drops[:0]
thesauri = thesauri[:0]
itrs = itrs[:0]
newSynonymID = 0
synTermMap := make(map[uint32]string)
termSynMap := make(map[string]uint32)
for segmentI, segment := range segments {
// check for closure in the meantime
if isClosed(closeCh) {
return nil, nil, seg.ErrClosed
}
thes, err2 := segment.thesaurus(fieldName)
if err2 != nil {
return nil, nil, err2
}
if thes != nil && thes.fst != nil {
itr, err2 := thes.fst.Iterator(nil, nil)
if err2 != nil && err2 != vellum.ErrIteratorDone {
return nil, nil, err2
}
if itr != nil {
newDocNums = append(newDocNums, newDocNumsIn[segmentI])
if dropsIn[segmentI] != nil && !dropsIn[segmentI].IsEmpty() {
drops = append(drops, dropsIn[segmentI])
} else {
drops = append(drops, nil)
}
thesauri = append(thesauri, thes)
itrs = append(itrs, itr)
}
}
}
// if no iterators, skip this field
if len(itrs) == 0 {
continue
}
var prevTerm []byte
newRoaring.Clear()
finishTerm := func(term []byte) error {
postingsOffset, err := writeSynonyms(newRoaring, w, bufMaxVarintLen64)
if err != nil {
return err
}
if postingsOffset > 0 {
err = newVellum.Insert(term, postingsOffset)
if err != nil {
return err
}
}
newRoaring.Clear()
return nil
}
enumerator, err := newEnumerator(itrs)
for err == nil {
term, itrI, postingsOffset := enumerator.Current()
if prevTerm != nil && !bytes.Equal(prevTerm, term) {
// check for closure in the meantime
if isClosed(closeCh) {
return nil, nil, seg.ErrClosed
}
// if the term changed, write out the info collected
// for the previous term
err = finishTerm(prevTerm)
if err != nil {
return nil, nil, err
}
}
synonyms, err = thesauri[itrI].synonymsListFromOffset(
postingsOffset, drops[itrI], synonyms)
if err != nil {
return nil, nil, err
}
synItr = synonyms.iterator(synItr)
var next seg.Synonym
next, err = synItr.Next()
for next != nil && err == nil {
synNewDocNum := newDocNums[itrI][next.Number()]
if synNewDocNum == docDropped {
return nil, nil, fmt.Errorf("see hit with dropped docNum")
}
nextTerm := next.Term()
var synNewID uint32
if synID, ok := termSynMap[nextTerm]; ok {
synNewID = synID
} else {
synNewID = newSynonymID
termSynMap[nextTerm] = newSynonymID
synTermMap[newSynonymID] = nextTerm
newSynonymID++
}
synNewCode := encodeSynonym(synNewID, uint32(synNewDocNum))
newRoaring.Add(synNewCode)
next, err = synItr.Next()
}
if err != nil {
return nil, nil, err
}
prevTerm = prevTerm[:0] // copy to prevTerm in case Next() reuses term mem
prevTerm = append(prevTerm, term...)
err = enumerator.Next()
}
if err != vellum.ErrIteratorDone {
return nil, nil, err
}
// close the enumerator to free the underlying iterators
err = enumerator.Close()
if err != nil {
return nil, nil, err
}
if prevTerm != nil {
err = finishTerm(prevTerm)
if err != nil {
return nil, nil, err
}
}
err = newVellum.Close()
if err != nil {
return nil, nil, err
}
vellumData := vellumBuf.Bytes()
thesOffset := uint64(w.Count())
// write out the length of the vellum data
n := binary.PutUvarint(bufMaxVarintLen64, uint64(len(vellumData)))
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return nil, nil, err
}
// write this vellum to disk
_, err = w.Write(vellumData)
if err != nil {
return nil, nil, err
}
// reset vellum buffer and vellum builder
vellumBuf.Reset()
err = newVellum.Reset(&vellumBuf)
if err != nil {
return nil, nil, err
}
// write out the synTermMap for this thesaurus
err = writeSynTermMap(synTermMap, w, bufMaxVarintLen64)
if err != nil {
return nil, nil, err
}
thesStart := w.Count()
// the synonym index section does not have any doc value data,
// so we write two special entries to indicate that the field is
// not uninverted, followed by the thesaurus offset
n = binary.PutUvarint(bufMaxVarintLen64, fieldNotUninverted)
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return nil, nil, err
}
n = binary.PutUvarint(bufMaxVarintLen64, fieldNotUninverted)
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return nil, nil, err
}
// write out the thesaurus offset from which the vellum data starts
n = binary.PutUvarint(bufMaxVarintLen64, thesOffset)
_, err = w.Write(bufMaxVarintLen64[:n])
if err != nil {
return nil, nil, err
}
// record the thesaurus written for this field in the lookup maps
fieldIDtoThesaurusID[uint16(fieldID)] = thesaurusID
thesaurusAddrs[thesaurusID] = thesStart
thesaurusID++
}
return thesaurusAddrs, fieldIDtoThesaurusID, nil
}
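
The 64-bit packing described in the encodeSynonym comment can be exercised on its own. The sketch below repeats encodeSynonym as it appears in the diff and pairs it with `decodeSynonym`, a hypothetical inverse written for illustration. It is not claimed to match any function in the library.

```go
package main

import "fmt"

// encodeSynonym packs the synonym ID into the upper 32 bits and the
// document number into the lower 32 bits, as in the diff.
func encodeSynonym(synonymID uint32, docNum uint32) uint64 {
	return uint64(synonymID)<<32 | uint64(docNum)
}

// decodeSynonym is a hypothetical inverse of encodeSynonym: it splits
// the code back into its two 32-bit halves.
func decodeSynonym(code uint64) (synonymID uint32, docNum uint32) {
	return uint32(code >> 32), uint32(code)
}

func main() {
	code := encodeSynonym(7, 42)
	sid, doc := decodeSynonym(code)
	fmt.Printf("code=%#x synonymID=%d docNum=%d\n", code, sid, doc)
}
```

Because the synonym ID occupies the high bits, codes stored in a sorted 64-bit bitmap are grouped by synonym first and by document number second.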

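The layout written by writeSynTermMap (an entry count, then for each entry a synonym ID, a term length and the term bytes, with all integers as unsigned varints) can be sketched with a matching reader. Here the writer takes a plain io.Writer instead of a CountHashWriter. `readSynTermMap` and `roundTrip` are hypothetical helpers added for illustration, not the library's own decoding code.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeSynTermMap follows the layout in the diff: nothing is written for
// an empty map; otherwise count, then (id, len(term), term) per entry.
func writeSynTermMap(m map[uint32]string, w io.Writer) error {
	if len(m) == 0 {
		return nil
	}
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, uint64(len(m)))
	if _, err := w.Write(buf[:n]); err != nil {
		return err
	}
	for sid, term := range m {
		n = binary.PutUvarint(buf, uint64(sid))
		if _, err := w.Write(buf[:n]); err != nil {
			return err
		}
		n = binary.PutUvarint(buf, uint64(len(term)))
		if _, err := w.Write(buf[:n]); err != nil {
			return err
		}
		if _, err := io.WriteString(w, term); err != nil {
			return err
		}
	}
	return nil
}

// readSynTermMap (hypothetical) decodes the layout written above.
func readSynTermMap(r *bytes.Reader) (map[uint32]string, error) {
	count, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	m := make(map[uint32]string, int(count))
	for i := uint64(0); i < count; i++ {
		sid, err := binary.ReadUvarint(r)
		if err != nil {
			return nil, err
		}
		termLen, err := binary.ReadUvarint(r)
		if err != nil {
			return nil, err
		}
		term := make([]byte, termLen)
		if _, err := io.ReadFull(r, term); err != nil {
			return nil, err
		}
		m[uint32(sid)] = string(term)
	}
	return m, nil
}

// roundTrip (hypothetical) serializes m and decodes it again.
func roundTrip(m map[uint32]string) (map[uint32]string, error) {
	var out bytes.Buffer
	if err := writeSynTermMap(m, &out); err != nil {
		return nil, err
	}
	return readSynTermMap(bytes.NewReader(out.Bytes()))
}

func main() {
	got, err := roundTrip(map[uint32]string{0: "quick", 1: "fast"})
	fmt.Println(got, err)
}
```

Entries are written in Go map iteration order, so the byte stream is not deterministic across runs, but each entry carries its own ID and the decoded map is the same either way.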
954
vendor/github.com/blevesearch/zapx/v16/segment.go generated vendored Normal file

@@ -0,0 +1,954 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"fmt"
"io"
"os"
"sync"
"sync/atomic"
"unsafe"
"github.com/RoaringBitmap/roaring/v2"
mmap "github.com/blevesearch/mmap-go"
segment "github.com/blevesearch/scorch_segment_api/v2"
"github.com/blevesearch/vellum"
"github.com/golang/snappy"
)
var reflectStaticSizeSegmentBase int
func init() {
var sb SegmentBase
reflectStaticSizeSegmentBase = int(unsafe.Sizeof(sb))
}
// Open returns a zap impl of a segment
func (*ZapPlugin) Open(path string) (segment.Segment, error) {
f, err := os.Open(path)
if err != nil {
return nil, err
}
mm, err := mmap.Map(f, mmap.RDONLY, 0)
if err != nil {
// mmap failed, try to close the file
_ = f.Close()
return nil, err
}
rv := &Segment{
SegmentBase: SegmentBase{
fieldsMap: make(map[string]uint16),
fieldFSTs: make(map[uint16]*vellum.FST),
vecIndexCache: newVectorIndexCache(),
synIndexCache: newSynonymIndexCache(),
fieldDvReaders: make([]map[uint16]*docValueReader, len(segmentSections)),
},
f: f,
mm: mm,
path: path,
refs: 1,
}
rv.SegmentBase.updateSize()
err = rv.loadConfig()
if err != nil {
_ = rv.Close()
return nil, err
}
err = rv.loadFieldsNew()
if err != nil {
_ = rv.Close()
return nil, err
}
err = rv.loadDvReaders()
if err != nil {
_ = rv.Close()
return nil, err
}
return rv, nil
}
// SegmentBase is a memory only, read-only implementation of the
// segment.Segment interface, using zap's data representation.
type SegmentBase struct {
// atomic access to these variables, moved to top to correct alignment issues on ARM, 386 and 32-bit MIPS.
bytesRead uint64
bytesWritten uint64
mem []byte
memCRC uint32
chunkMode uint32
fieldsMap map[string]uint16 // fieldName -> fieldID+1
fieldsInv []string // fieldID -> fieldName
fieldsSectionsMap []map[uint16]uint64 // fieldID -> section -> address
numDocs uint64
storedIndexOffset uint64
fieldsIndexOffset uint64
sectionsIndexOffset uint64
docValueOffset uint64
dictLocs []uint64
fieldDvReaders []map[uint16]*docValueReader // naive chunk cache per field; section->field->reader
fieldDvNames []string // field names cached in fieldDvReaders
size uint64
m sync.Mutex
fieldFSTs map[uint16]*vellum.FST
// this cache comes into play when vectors are supported in builds.
vecIndexCache *vectorIndexCache
synIndexCache *synonymIndexCache
}
func (sb *SegmentBase) Size() int {
return int(sb.size)
}
func (sb *SegmentBase) updateSize() {
sizeInBytes := reflectStaticSizeSegmentBase +
cap(sb.mem)
// fieldsMap
for k := range sb.fieldsMap {
sizeInBytes += (len(k) + SizeOfString) + SizeOfUint16
}
// fieldsInv, dictLocs
for _, entry := range sb.fieldsInv {
sizeInBytes += len(entry) + SizeOfString
}
sizeInBytes += len(sb.dictLocs) * SizeOfUint64
// fieldDvReaders
for _, secDvReaders := range sb.fieldDvReaders {
for _, v := range secDvReaders {
sizeInBytes += SizeOfUint16 + SizeOfPtr
if v != nil {
sizeInBytes += v.size()
}
}
}
sb.size = uint64(sizeInBytes)
}
func (sb *SegmentBase) AddRef() {}
func (sb *SegmentBase) DecRef() (err error) { return nil }
func (sb *SegmentBase) Close() (err error) {
sb.vecIndexCache.Clear()
sb.synIndexCache.Clear()
return nil
}
// Segment implements a persisted segment.Segment interface, by
// embedding an mmap()'ed SegmentBase.
type Segment struct {
SegmentBase
f *os.File
mm mmap.MMap
path string
version uint32
crc uint32
m sync.Mutex // Protects the fields that follow.
refs int64
}
func (s *Segment) Size() int {
// 8 /* size of file pointer */
// 4 /* size of version -> uint32 */
// 4 /* size of crc -> uint32 */
sizeOfUints := 16
sizeInBytes := (len(s.path) + SizeOfString) + sizeOfUints
// mutex, refs -> int64
sizeInBytes += 16
// do not include the mmap'ed part
return sizeInBytes + s.SegmentBase.Size() - cap(s.mem)
}
func (s *Segment) AddRef() {
s.m.Lock()
s.refs++
s.m.Unlock()
}
func (s *Segment) DecRef() (err error) {
s.m.Lock()
s.refs--
if s.refs == 0 {
err = s.closeActual()
}
s.m.Unlock()
return err
}
func (s *Segment) loadConfig() error {
crcOffset := len(s.mm) - 4
s.crc = binary.BigEndian.Uint32(s.mm[crcOffset : crcOffset+4])
verOffset := crcOffset - 4
s.version = binary.BigEndian.Uint32(s.mm[verOffset : verOffset+4])
if Version < IndexSectionsVersion && s.version != Version {
return fmt.Errorf("unsupported version %d != %d", s.version, Version)
}
chunkOffset := verOffset - 4
s.chunkMode = binary.BigEndian.Uint32(s.mm[chunkOffset : chunkOffset+4])
docValueOffset := chunkOffset - 8
s.docValueOffset = binary.BigEndian.Uint64(s.mm[docValueOffset : docValueOffset+8])
fieldsIndexOffset := docValueOffset - 8
// determining the right footer size based on version, this becomes important
// while loading the fields portion or the sections portion of the index file.
var footerSize int
if s.version >= IndexSectionsVersion {
// for version 16 and above, parse the sectionsIndexOffset
s.sectionsIndexOffset = binary.BigEndian.Uint64(s.mm[fieldsIndexOffset : fieldsIndexOffset+8])
fieldsIndexOffset = fieldsIndexOffset - 8
footerSize = FooterSize
} else {
footerSize = FooterSize - 8
}
s.fieldsIndexOffset = binary.BigEndian.Uint64(s.mm[fieldsIndexOffset : fieldsIndexOffset+8])
storedIndexOffset := fieldsIndexOffset - 8
s.storedIndexOffset = binary.BigEndian.Uint64(s.mm[storedIndexOffset : storedIndexOffset+8])
numDocsOffset := storedIndexOffset - 8
s.numDocs = binary.BigEndian.Uint64(s.mm[numDocsOffset : numDocsOffset+8])
// 8*4 + 4*3 = 44 bytes are accounted for by the offsets read above from
// older file formats; version 16 and above read 8 more bytes for the
// sectionsIndexOffset, which footerSize reflects
s.incrementBytesRead(uint64(footerSize))
s.SegmentBase.mem = s.mm[:len(s.mm)-footerSize]
return nil
}
// Implements the segment.DiskStatsReporter interface
// Only the persistedSegment type implements the
// interface, as the intention is to retrieve the bytes
// read from the on-disk segment as part of the current
// query.
func (s *Segment) ResetBytesRead(val uint64) {
atomic.StoreUint64(&s.SegmentBase.bytesRead, val)
}
func (s *Segment) BytesRead() uint64 {
return atomic.LoadUint64(&s.bytesRead)
}
func (s *Segment) BytesWritten() uint64 {
return 0
}
func (s *Segment) incrementBytesRead(val uint64) {
atomic.AddUint64(&s.bytesRead, val)
}
func (sb *SegmentBase) BytesWritten() uint64 {
return atomic.LoadUint64(&sb.bytesWritten)
}
func (sb *SegmentBase) setBytesWritten(val uint64) {
atomic.AddUint64(&sb.bytesWritten, val)
}
func (sb *SegmentBase) BytesRead() uint64 {
return 0
}
func (sb *SegmentBase) ResetBytesRead(val uint64) {}
func (sb *SegmentBase) incrementBytesRead(val uint64) {
atomic.AddUint64(&sb.bytesRead, val)
}
func (sb *SegmentBase) loadFields() error {
// NOTE for now we assume the fields index immediately precedes
// the footer, and if this changes, need to adjust accordingly (or
// store explicit length), where s.mem was sliced from s.mm in Open().
fieldsIndexEnd := uint64(len(sb.mem))
// iterate through fields index
var fieldID uint64
for sb.fieldsIndexOffset+(8*fieldID) < fieldsIndexEnd {
addr := binary.BigEndian.Uint64(sb.mem[sb.fieldsIndexOffset+(8*fieldID) : sb.fieldsIndexOffset+(8*fieldID)+8])
// accounting the address of the dictLoc being read from file
sb.incrementBytesRead(8)
dictLoc, read := binary.Uvarint(sb.mem[addr:fieldsIndexEnd])
n := uint64(read)
sb.dictLocs = append(sb.dictLocs, dictLoc)
var nameLen uint64
nameLen, read = binary.Uvarint(sb.mem[addr+n : fieldsIndexEnd])
n += uint64(read)
name := string(sb.mem[addr+n : addr+n+nameLen])
sb.incrementBytesRead(n + nameLen)
sb.fieldsInv = append(sb.fieldsInv, name)
sb.fieldsMap[name] = uint16(fieldID + 1)
fieldID++
}
return nil
}
func (sb *SegmentBase) loadFieldsNew() error {
pos := sb.sectionsIndexOffset
if pos == 0 {
// this is the case only for older file formats
return sb.loadFields()
}
seek := pos + binary.MaxVarintLen64
if seek > uint64(len(sb.mem)) {
// handling a buffer overflow case.
// a rare case where the backing buffer is not large enough to be read directly via
// a pos+binary.MaxVarintLen64 seek. For example, this can happen when there is only
// one field to be indexed in the entire batch of data: writing out that field's
// metadata takes 1 + 8 bytes, whereas MaxVarintLen64 = 10.
seek = uint64(len(sb.mem))
}
// read the number of fields
numFields, sz := binary.Uvarint(sb.mem[pos:seek])
// here, pos is incremented by the number of valid bytes read from the buffer,
// so in the edge case pointed out above numFields = 1 and sz = 1 as well.
pos += uint64(sz)
sb.incrementBytesRead(uint64(sz))
// the following loop will be executed only once in the edge case pointed out above,
// since only one field's offset is stored, which occupies 8 bytes.
// the pointer then seeks to a position preceding the sectionsIndexOffset, at
// which point the responsibility of handling the out-of-bounds cases shifts to
// the specific section's parsing logic.
var fieldID uint64
for fieldID < numFields {
addr := binary.BigEndian.Uint64(sb.mem[pos : pos+8])
sb.incrementBytesRead(8)
fieldSectionMap := make(map[uint16]uint64)
err := sb.loadFieldNew(uint16(fieldID), addr, fieldSectionMap)
if err != nil {
return err
}
sb.fieldsSectionsMap = append(sb.fieldsSectionsMap, fieldSectionMap)
fieldID++
pos += 8
}
return nil
}
func (sb *SegmentBase) loadFieldNew(fieldID uint16, pos uint64,
fieldSectionMap map[uint16]uint64) error {
if pos == 0 {
// there is no indexing structure present for this field/section
return nil
}
fieldStartPos := pos // to track the number of bytes read
fieldNameLen, sz := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(sz)
fieldName := string(sb.mem[pos : pos+fieldNameLen])
pos += fieldNameLen
sb.fieldsInv = append(sb.fieldsInv, fieldName)
sb.fieldsMap[fieldName] = uint16(fieldID + 1)
fieldNumSections, sz := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(sz)
for sectionIdx := uint64(0); sectionIdx < fieldNumSections; sectionIdx++ {
// read section id
fieldSectionType := binary.BigEndian.Uint16(sb.mem[pos : pos+2])
pos += 2
fieldSectionAddr := binary.BigEndian.Uint64(sb.mem[pos : pos+8])
pos += 8
fieldSectionMap[fieldSectionType] = fieldSectionAddr
if fieldSectionType == SectionInvertedTextIndex {
// for the fields which don't have the inverted index, the offset is
// 0 and during query time, because there is no valid dictionary, we
// will just follow a no-op path.
if fieldSectionAddr == 0 {
sb.dictLocs = append(sb.dictLocs, 0)
continue
}
read := 0
// skip the doc values
_, n := binary.Uvarint(sb.mem[fieldSectionAddr : fieldSectionAddr+binary.MaxVarintLen64])
fieldSectionAddr += uint64(n)
read += n
_, n = binary.Uvarint(sb.mem[fieldSectionAddr : fieldSectionAddr+binary.MaxVarintLen64])
fieldSectionAddr += uint64(n)
read += n
dictLoc, n := binary.Uvarint(sb.mem[fieldSectionAddr : fieldSectionAddr+binary.MaxVarintLen64])
// account the bytes read while parsing the field's inverted index section
sb.incrementBytesRead(uint64(read + n))
sb.dictLocs = append(sb.dictLocs, dictLoc)
}
}
// account the bytes read while parsing the sections field index.
sb.incrementBytesRead((pos - uint64(fieldStartPos)) + fieldNameLen)
return nil
}
// Dictionary returns the term dictionary for the specified field
func (sb *SegmentBase) Dictionary(field string) (segment.TermDictionary, error) {
dict, err := sb.dictionary(field)
if err == nil && dict == nil {
return emptyDictionary, nil
}
return dict, err
}
func (sb *SegmentBase) dictionary(field string) (rv *Dictionary, err error) {
fieldIDPlus1 := sb.fieldsMap[field]
if fieldIDPlus1 > 0 {
rv = &Dictionary{
sb: sb,
field: field,
fieldID: fieldIDPlus1 - 1,
}
dictStart := sb.dictLocs[rv.fieldID]
if dictStart > 0 {
var ok bool
sb.m.Lock()
if rv.fst, ok = sb.fieldFSTs[rv.fieldID]; !ok {
// read the length of the vellum data
vellumLen, read := binary.Uvarint(sb.mem[dictStart : dictStart+binary.MaxVarintLen64])
if vellumLen == 0 {
sb.m.Unlock()
return nil, fmt.Errorf("empty dictionary for field: %v", field)
}
fstBytes := sb.mem[dictStart+uint64(read) : dictStart+uint64(read)+vellumLen]
rv.incrementBytesRead(uint64(read) + vellumLen)
rv.fst, err = vellum.Load(fstBytes)
if err != nil {
sb.m.Unlock()
return nil, fmt.Errorf("dictionary field %s vellum err: %v", field, err)
}
sb.fieldFSTs[rv.fieldID] = rv.fst
}
sb.m.Unlock()
rv.fstReader, err = rv.fst.Reader()
if err != nil {
return nil, fmt.Errorf("dictionary field %s vellum reader err: %v", field, err)
}
}
}
return rv, nil
}
// Thesaurus returns the thesaurus with the specified name, or an empty thesaurus if not found.
func (sb *SegmentBase) Thesaurus(name string) (segment.Thesaurus, error) {
thesaurus, err := sb.thesaurus(name)
if err == nil && thesaurus == nil {
return emptyThesaurus, nil
}
return thesaurus, err
}
func (sb *SegmentBase) thesaurus(name string) (rv *Thesaurus, err error) {
fieldIDPlus1 := sb.fieldsMap[name]
if fieldIDPlus1 == 0 {
return nil, nil
}
pos := sb.fieldsSectionsMap[fieldIDPlus1-1][SectionSynonymIndex]
if pos > 0 {
rv = &Thesaurus{
sb: sb,
name: name,
fieldID: fieldIDPlus1 - 1,
}
// skip the doc value offsets as doc values are not supported in thesaurus
for i := 0; i < 2; i++ {
_, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(n)
}
thesLoc, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(n)
fst, synTermMap, err := sb.synIndexCache.loadOrCreate(rv.fieldID, sb.mem[thesLoc:])
if err != nil {
return nil, fmt.Errorf("thesaurus name %s err: %v", name, err)
}
rv.fst = fst
rv.synIDTermMap = synTermMap
rv.fstReader, err = rv.fst.Reader()
if err != nil {
return nil, fmt.Errorf("thesaurus name %s vellum reader err: %v", name, err)
}
}
return rv, nil
}
// visitDocumentCtx holds data structures that are reusable across
// multiple VisitDocument() calls to avoid memory allocations
type visitDocumentCtx struct {
buf []byte
reader bytes.Reader
arrayPos []uint64
}
var visitDocumentCtxPool = sync.Pool{
New: func() interface{} {
reuse := &visitDocumentCtx{}
return reuse
},
}
// VisitStoredFields invokes the StoredFieldValueVisitor for each stored field
// for the specified doc number
func (sb *SegmentBase) VisitStoredFields(num uint64, visitor segment.StoredFieldValueVisitor) error {
vdc := visitDocumentCtxPool.Get().(*visitDocumentCtx)
defer visitDocumentCtxPool.Put(vdc)
return sb.visitStoredFields(vdc, num, visitor)
}
func (sb *SegmentBase) visitStoredFields(vdc *visitDocumentCtx, num uint64,
visitor segment.StoredFieldValueVisitor) error {
// first make sure this is a valid number in this segment
if num < sb.numDocs {
meta, compressed := sb.getDocStoredMetaAndCompressed(num)
vdc.reader.Reset(meta)
// handle _id field special case
idFieldValLen, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return err
}
idFieldVal := compressed[:idFieldValLen]
keepGoing := visitor("_id", byte('t'), idFieldVal, nil)
if !keepGoing {
// vdc is returned to the pool by the deferred Put in VisitStoredFields
return nil
}
// handle non-"_id" fields
compressed = compressed[idFieldValLen:]
uncompressed, err := snappy.Decode(vdc.buf[:cap(vdc.buf)], compressed)
if err != nil {
return err
}
for keepGoing {
field, err := binary.ReadUvarint(&vdc.reader)
if err == io.EOF {
break
}
if err != nil {
return err
}
typ, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return err
}
offset, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return err
}
l, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return err
}
numap, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return err
}
var arrayPos []uint64
if numap > 0 {
if cap(vdc.arrayPos) < int(numap) {
vdc.arrayPos = make([]uint64, numap)
}
arrayPos = vdc.arrayPos[:numap]
for i := 0; i < int(numap); i++ {
ap, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return err
}
arrayPos[i] = ap
}
}
value := uncompressed[offset : offset+l]
keepGoing = visitor(sb.fieldsInv[field], byte(typ), value, arrayPos)
}
vdc.buf = uncompressed
}
return nil
}
// DocID returns the value of the _id field for the given docNum
func (sb *SegmentBase) DocID(num uint64) ([]byte, error) {
if num >= sb.numDocs {
return nil, nil
}
vdc := visitDocumentCtxPool.Get().(*visitDocumentCtx)
meta, compressed := sb.getDocStoredMetaAndCompressed(num)
vdc.reader.Reset(meta)
// handle _id field special case
idFieldValLen, err := binary.ReadUvarint(&vdc.reader)
if err != nil {
return nil, err
}
idFieldVal := compressed[:idFieldValLen]
visitDocumentCtxPool.Put(vdc)
return idFieldVal, nil
}
// Count returns the number of documents in this segment.
func (sb *SegmentBase) Count() uint64 {
return sb.numDocs
}
// DocNumbers returns a bitset corresponding to the doc numbers of all the
// provided _id strings
func (sb *SegmentBase) DocNumbers(ids []string) (*roaring.Bitmap, error) {
rv := roaring.New()
if len(sb.fieldsMap) > 0 {
idDict, err := sb.dictionary("_id")
if err != nil {
return nil, err
}
postingsList := emptyPostingsList
sMax, err := idDict.fst.GetMaxKey()
if err != nil {
return nil, err
}
sMaxStr := string(sMax)
for _, id := range ids {
if id <= sMaxStr {
postingsList, err = idDict.postingsList([]byte(id), nil, postingsList)
if err != nil {
return nil, err
}
postingsList.OrInto(rv)
}
}
}
return rv, nil
}
// Fields returns the field names used in this segment
func (sb *SegmentBase) Fields() []string {
return sb.fieldsInv
}
// Path returns the path of this segment on disk
func (s *Segment) Path() string {
return s.path
}
// Close releases all resources associated with this segment
func (s *Segment) Close() (err error) {
return s.DecRef()
}
func (s *Segment) closeActual() (err error) {
// clear contents from the vector and synonym index cache before un-mmapping
s.vecIndexCache.Clear()
s.synIndexCache.Clear()
if s.mm != nil {
err = s.mm.Unmap()
}
// try to close file even if unmap failed
if s.f != nil {
err2 := s.f.Close()
if err == nil {
// try to return first error
err = err2
}
}
return
}
// Some helpers added for the command-line utility
// Data returns the underlying mmaped data slice
func (s *Segment) Data() []byte {
return s.mm
}
// CRC returns the CRC value stored in the file footer
func (s *Segment) CRC() uint32 {
return s.crc
}
// Version returns the file version in the file footer
func (s *Segment) Version() uint32 {
return s.version
}
// ChunkMode returns the chunk mode in the file footer
func (s *Segment) ChunkMode() uint32 {
return s.chunkMode
}
// FieldsIndexOffset returns the fields index offset in the file footer
func (s *Segment) FieldsIndexOffset() uint64 {
return s.fieldsIndexOffset
}
// StoredIndexOffset returns the stored value index offset in the file footer
func (s *Segment) StoredIndexOffset() uint64 {
return s.storedIndexOffset
}
// DocValueOffset returns the docValue offset in the file footer
func (s *Segment) DocValueOffset() uint64 {
return s.docValueOffset
}
// NumDocs returns the number of documents in the file footer
func (s *Segment) NumDocs() uint64 {
return s.numDocs
}
// DictAddr is a helper function to compute the file offset where the
// dictionary is stored for the specified field.
func (s *Segment) DictAddr(field string) (uint64, error) {
fieldIDPlus1, ok := s.fieldsMap[field]
if !ok {
return 0, fmt.Errorf("no such field '%s'", field)
}
return s.dictLocs[fieldIDPlus1-1], nil
}
// ThesaurusAddr is a helper function to compute the file offset where the
// thesaurus is stored with the specified name.
func (s *Segment) ThesaurusAddr(name string) (uint64, error) {
fieldIDPlus1, ok := s.fieldsMap[name]
if !ok {
return 0, fmt.Errorf("no such thesaurus '%s'", name)
}
thesaurusStart := s.fieldsSectionsMap[fieldIDPlus1-1][SectionSynonymIndex]
if thesaurusStart == 0 {
return 0, fmt.Errorf("no such thesaurus '%s'", name)
}
for i := 0; i < 2; i++ {
_, n := binary.Uvarint(s.mem[thesaurusStart : thesaurusStart+binary.MaxVarintLen64])
thesaurusStart += uint64(n)
}
thesLoc, _ := binary.Uvarint(s.mem[thesaurusStart : thesaurusStart+binary.MaxVarintLen64])
return thesLoc, nil
}
func (s *Segment) getSectionDvOffsets(fieldID int, secID uint16) (uint64, uint64, uint64, error) {
// only called for file format version 16 and above
var fieldLocStart uint64 = fieldNotUninverted
fieldLocEnd := fieldLocStart
sectionMap := s.fieldsSectionsMap[fieldID]
fieldAddrStart := sectionMap[secID]
n := 0
if fieldAddrStart > 0 {
// fixed encoding as of now, need to uvarint this
var read uint64
fieldLocStart, n = binary.Uvarint(s.mem[fieldAddrStart+read : fieldAddrStart+read+binary.MaxVarintLen64])
if n <= 0 {
return 0, 0, 0, fmt.Errorf("loadDvReaders: failed to read the docvalue offset start for field %d", fieldID)
}
read += uint64(n)
fieldLocEnd, n = binary.Uvarint(s.mem[fieldAddrStart+read : fieldAddrStart+read+binary.MaxVarintLen64])
if n <= 0 {
return 0, 0, 0, fmt.Errorf("loadDvReaders: failed to read the docvalue offset end for field %d", fieldID)
}
read += uint64(n)
s.incrementBytesRead(read)
}
return fieldLocStart, fieldLocEnd, 0, nil
}
func (s *Segment) loadDvReader(fieldID int, secID uint16) error {
start, end, _, err := s.getSectionDvOffsets(fieldID, secID)
if err != nil {
return err
}
fieldDvReader, err := s.loadFieldDocValueReader(s.fieldsInv[fieldID], start, end)
if err != nil {
return err
}
if fieldDvReader != nil {
if s.fieldDvReaders[secID] == nil {
s.fieldDvReaders[secID] = make(map[uint16]*docValueReader)
}
// fix the structure of fieldDvReaders
// currently it populates the inverted index doc values
s.fieldDvReaders[secID][uint16(fieldID)] = fieldDvReader
s.fieldDvNames = append(s.fieldDvNames, s.fieldsInv[fieldID])
}
return nil
}
func (s *Segment) loadDvReadersLegacy() error {
// older file formats: parse the docValueIndex, and if it says doc values
// aren't present in this segment file, just return nil
if s.docValueOffset == fieldNotUninverted {
return nil
}
for fieldID := range s.fieldsInv {
var read uint64
start, n := binary.Uvarint(s.mem[s.docValueOffset+read : s.docValueOffset+read+binary.MaxVarintLen64])
if n <= 0 {
return fmt.Errorf("loadDvReaders: failed to read the docvalue offset start for field %d", fieldID)
}
read += uint64(n)
end, n := binary.Uvarint(s.mem[s.docValueOffset+read : s.docValueOffset+read+binary.MaxVarintLen64])
if n <= 0 {
return fmt.Errorf("loadDvReaders: failed to read the docvalue offset end for field %d", fieldID)
}
read += uint64(n)
s.incrementBytesRead(read)
fieldDvReader, err := s.loadFieldDocValueReader(s.fieldsInv[fieldID], start, end)
if err != nil {
return err
}
if fieldDvReader != nil {
// older file formats have docValues corresponding only to inverted index
// ignore the rest.
if s.fieldDvReaders[SectionInvertedTextIndex] == nil {
s.fieldDvReaders[SectionInvertedTextIndex] = make(map[uint16]*docValueReader)
}
// fix the structure of fieldDvReaders
// currently it populates the inverted index doc values
s.fieldDvReaders[SectionInvertedTextIndex][uint16(fieldID)] = fieldDvReader
s.fieldDvNames = append(s.fieldDvNames, s.fieldsInv[fieldID])
}
}
return nil
}
// Segment is a file segment, and loading the dv readers from that segment
// must account for the version while loading, since the formats differ
// between the older and the current versions.
func (s *Segment) loadDvReaders() error {
if s.numDocs == 0 {
return nil
}
if s.version < IndexSectionsVersion {
return s.loadDvReadersLegacy()
}
// for every section of every field, load the doc values and register
// the readers.
for fieldID := range s.fieldsInv {
for secID := range segmentSections {
s.loadDvReader(fieldID, secID)
}
}
return nil
}
// since segmentBase is an in-memory segment, it can be called only
// for v16 file formats as part of InitSegmentBase() while introducing
// a segment into the system.
func (sb *SegmentBase) loadDvReaders() error {
// evaluate -> s.docValueOffset == fieldNotUninverted
if sb.numDocs == 0 {
return nil
}
for fieldID, sections := range sb.fieldsSectionsMap {
for secID, secOffset := range sections {
if secOffset > 0 {
// fixed encoding as of now, need to uvarint this
pos := secOffset
var read uint64
fieldLocStart, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
if n <= 0 {
return fmt.Errorf("loadDvReaders: failed to read the docvalue offset start for field %v", sb.fieldsInv[fieldID])
}
pos += uint64(n)
read += uint64(n)
fieldLocEnd, n := binary.Uvarint(sb.mem[pos : pos+binary.MaxVarintLen64])
if n <= 0 {
return fmt.Errorf("loadDvReaders: failed to read the docvalue offset end for field %v", sb.fieldsInv[fieldID])
}
pos += uint64(n)
read += uint64(n)
sb.incrementBytesRead(read)
fieldDvReader, err := sb.loadFieldDocValueReader(sb.fieldsInv[fieldID], fieldLocStart, fieldLocEnd)
if err != nil {
return err
}
if fieldDvReader != nil {
if sb.fieldDvReaders[secID] == nil {
sb.fieldDvReaders[secID] = make(map[uint16]*docValueReader)
}
sb.fieldDvReaders[secID][uint16(fieldID)] = fieldDvReader
sb.fieldDvNames = append(sb.fieldDvNames, sb.fieldsInv[fieldID])
}
}
}
}
return nil
}
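
The footer handling in loadConfig above reads fixed-width big-endian fields backwards from the end of the mapped file. The sketch below reproduces that walk for the version 16 field order as it appears in loadConfig (crc, version, chunk mode, doc value offset, sections index offset, fields index offset, stored index offset, number of docs). The `footer` struct, `footerLen`, `appendFooter` and `parseFooter` are hypothetical names used only for illustration, and no CRC validation is attempted.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// footer mirrors the values loadConfig extracts (hypothetical struct).
type footer struct {
	numDocs             uint64
	storedIndexOffset   uint64
	fieldsIndexOffset   uint64
	sectionsIndexOffset uint64
	docValueOffset      uint64
	chunkMode           uint32
	version             uint32
	crc                 uint32
}

// footerLen is five uint64 fields plus three uint32 fields.
const footerLen = 8*5 + 4*3

// appendFooter writes the fields in file order, so that crc ends up last.
func appendFooter(data []byte, f footer) []byte {
	data = binary.BigEndian.AppendUint64(data, f.numDocs)
	data = binary.BigEndian.AppendUint64(data, f.storedIndexOffset)
	data = binary.BigEndian.AppendUint64(data, f.fieldsIndexOffset)
	data = binary.BigEndian.AppendUint64(data, f.sectionsIndexOffset)
	data = binary.BigEndian.AppendUint64(data, f.docValueOffset)
	data = binary.BigEndian.AppendUint32(data, f.chunkMode)
	data = binary.BigEndian.AppendUint32(data, f.version)
	data = binary.BigEndian.AppendUint32(data, f.crc)
	return data
}

// parseFooter walks backwards from the end of data, one field at a time.
func parseFooter(data []byte) (footer, error) {
	var f footer
	if len(data) < footerLen {
		return f, fmt.Errorf("need %d footer bytes, have %d", footerLen, len(data))
	}
	pos := len(data)
	prev32 := func() uint32 {
		pos -= 4
		return binary.BigEndian.Uint32(data[pos : pos+4])
	}
	prev64 := func() uint64 {
		pos -= 8
		return binary.BigEndian.Uint64(data[pos : pos+8])
	}
	f.crc = prev32()
	f.version = prev32()
	f.chunkMode = prev32()
	f.docValueOffset = prev64()
	f.sectionsIndexOffset = prev64()
	f.fieldsIndexOffset = prev64()
	f.storedIndexOffset = prev64()
	f.numDocs = prev64()
	return f, nil
}

func main() {
	body := []byte("segment-body")
	file := appendFooter(body, footer{numDocs: 3, version: 16, chunkMode: 1})
	f, err := parseFooter(file)
	if err != nil {
		panic(err)
	}
	fmt.Println(f.numDocs, f.version, len(file)-footerLen == len(body))
}
```

Unlike loadConfig, the sketch checks the buffer length before slicing, which is one way to avoid an out-of-range panic on a file shorter than the footer.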

59
vendor/github.com/blevesearch/zapx/v16/sizes.go generated vendored Normal file

@@ -0,0 +1,59 @@
// Copyright (c) 2020 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"reflect"
)
func init() {
var b bool
SizeOfBool = int(reflect.TypeOf(b).Size())
var f32 float32
SizeOfFloat32 = int(reflect.TypeOf(f32).Size())
var f64 float64
SizeOfFloat64 = int(reflect.TypeOf(f64).Size())
var i int
SizeOfInt = int(reflect.TypeOf(i).Size())
var m map[int]int
SizeOfMap = int(reflect.TypeOf(m).Size())
var ptr *int
SizeOfPtr = int(reflect.TypeOf(ptr).Size())
var slice []int
SizeOfSlice = int(reflect.TypeOf(slice).Size())
var str string
SizeOfString = int(reflect.TypeOf(str).Size())
var u8 uint8
SizeOfUint8 = int(reflect.TypeOf(u8).Size())
var u16 uint16
SizeOfUint16 = int(reflect.TypeOf(u16).Size())
var u32 uint32
SizeOfUint32 = int(reflect.TypeOf(u32).Size())
var u64 uint64
SizeOfUint64 = int(reflect.TypeOf(u64).Size())
}
var SizeOfBool int
var SizeOfFloat32 int
var SizeOfFloat64 int
var SizeOfInt int
var SizeOfMap int
var SizeOfPtr int
var SizeOfSlice int
var SizeOfString int
var SizeOfUint8 int
var SizeOfUint16 int
var SizeOfUint32 int
var SizeOfUint64 int

126
vendor/github.com/blevesearch/zapx/v16/synonym_cache.go generated vendored Normal file

@@ -0,0 +1,126 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"encoding/binary"
"fmt"
"sync"
"github.com/blevesearch/vellum"
)
func newSynonymIndexCache() *synonymIndexCache {
return &synonymIndexCache{
cache: make(map[uint16]*synonymCacheEntry),
}
}
type synonymIndexCache struct {
m sync.RWMutex
cache map[uint16]*synonymCacheEntry
}
// Clear clears the synonym cache, which means that the termID to term map will no longer be available.
func (sc *synonymIndexCache) Clear() {
sc.m.Lock()
sc.cache = nil
sc.m.Unlock()
}
// loadOrCreate loads the synonym index cache for the specified fieldID if it is already present,
// or creates it if not. The synonym index cache for a fieldID consists of a tuple:
// - A Vellum FST (Finite State Transducer) representing the thesaurus.
// - A map associating synonym IDs to their corresponding terms.
// This function returns the loaded or newly created tuple (FST and map).
func (sc *synonymIndexCache) loadOrCreate(fieldID uint16, mem []byte) (*vellum.FST, map[uint32][]byte, error) {
sc.m.RLock()
entry, ok := sc.cache[fieldID]
if ok {
sc.m.RUnlock()
return entry.load()
}
sc.m.RUnlock()
sc.m.Lock()
defer sc.m.Unlock()
entry, ok = sc.cache[fieldID]
if ok {
return entry.load()
}
return sc.createAndCacheLOCKED(fieldID, mem)
}
// createAndCacheLOCKED creates the synonym index cache for the specified fieldID and caches it.
func (sc *synonymIndexCache) createAndCacheLOCKED(fieldID uint16, mem []byte) (*vellum.FST, map[uint32][]byte, error) {
var pos uint64
vellumLen, read := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
if vellumLen == 0 || read <= 0 {
return nil, nil, fmt.Errorf("vellum length is 0")
}
pos += uint64(read)
fstBytes := mem[pos : pos+vellumLen]
fst, err := vellum.Load(fstBytes)
if err != nil {
return nil, nil, fmt.Errorf("vellum err: %v", err)
}
pos += vellumLen
numSyns, n := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(n)
if numSyns == 0 {
return nil, nil, fmt.Errorf("no synonyms found")
}
synTermMap := make(map[uint32][]byte, numSyns)
for i := 0; i < int(numSyns); i++ {
synID, n := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(n)
termLen, n := binary.Uvarint(mem[pos : pos+binary.MaxVarintLen64])
pos += uint64(n)
if termLen == 0 {
return nil, nil, fmt.Errorf("term length is 0")
}
term := mem[pos : pos+uint64(termLen)]
pos += uint64(termLen)
synTermMap[uint32(synID)] = term
}
sc.insertLOCKED(fieldID, fst, synTermMap)
return fst, synTermMap, nil
}
// insertLOCKED inserts the vellum FST and the map of synonymID to term into the cache for the specified fieldID.
func (sc *synonymIndexCache) insertLOCKED(fieldID uint16, fst *vellum.FST, synTermMap map[uint32][]byte) {
_, ok := sc.cache[fieldID]
if !ok {
sc.cache[fieldID] = &synonymCacheEntry{
fst: fst,
synTermMap: synTermMap,
}
}
}
// synonymCacheEntry is a tuple of the vellum FST and the map of synonymID to term,
// and is the value stored in the synonym cache, for a given fieldID.
type synonymCacheEntry struct {
fst *vellum.FST
synTermMap map[uint32][]byte
}
func (ce *synonymCacheEntry) load() (*vellum.FST, map[uint32][]byte, error) {
return ce.fst, ce.synTermMap, nil
}


@@ -0,0 +1,239 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"encoding/binary"
"fmt"
"reflect"
"github.com/RoaringBitmap/roaring/v2"
"github.com/RoaringBitmap/roaring/v2/roaring64"
segment "github.com/blevesearch/scorch_segment_api/v2"
)
var reflectStaticSizeSynonymsList int
var reflectStaticSizeSynonymsIterator int
var reflectStaticSizeSynonym int
func init() {
var sl SynonymsList
reflectStaticSizeSynonymsList = int(reflect.TypeOf(sl).Size())
var si SynonymsIterator
reflectStaticSizeSynonymsIterator = int(reflect.TypeOf(si).Size())
var s Synonym
reflectStaticSizeSynonym = int(reflect.TypeOf(s).Size())
}
// SynonymsList represents a list of synonyms for a term, stored in a Roaring64 bitmap.
type SynonymsList struct {
sb *SegmentBase
synonymsOffset uint64
synonyms *roaring64.Bitmap
except *roaring.Bitmap
synIDTermMap map[uint32][]byte
buffer *bytes.Reader
}
// immutable, empty synonyms list
var emptySynonymsList = &SynonymsList{}
func (p *SynonymsList) Size() int {
sizeInBytes := reflectStaticSizeSynonymsList + SizeOfPtr
if p.except != nil {
sizeInBytes += int(p.except.GetSizeInBytes())
}
return sizeInBytes
}
// Iterator creates and returns a SynonymsIterator for the SynonymsList.
// If the synonyms bitmap is nil, it returns an empty iterator.
func (s *SynonymsList) Iterator(prealloc segment.SynonymsIterator) segment.SynonymsIterator {
if s.synonyms == nil {
return emptySynonymsIterator
}
var preallocSI *SynonymsIterator
pi, ok := prealloc.(*SynonymsIterator)
if ok && pi != nil {
preallocSI = pi
}
if preallocSI == emptySynonymsIterator {
preallocSI = nil
}
return s.iterator(preallocSI)
}
// iterator initializes a SynonymsIterator for the SynonymsList and returns it.
// If a preallocated iterator is provided, it resets and reuses it; otherwise, it creates a new one.
func (s *SynonymsList) iterator(rv *SynonymsIterator) *SynonymsIterator {
if rv == nil {
rv = &SynonymsIterator{}
} else {
*rv = SynonymsIterator{} // clear the struct
}
rv.synonyms = s
rv.except = s.except
rv.Actual = s.synonyms.Iterator()
rv.ActualBM = s.synonyms
rv.synIDTermMap = s.synIDTermMap
return rv
}
// read initializes a SynonymsList by reading data from the given synonymsOffset in the Thesaurus.
// It reads and parses the Roaring64 bitmap that represents the synonyms.
func (rv *SynonymsList) read(synonymsOffset uint64, t *Thesaurus) error {
rv.synonymsOffset = synonymsOffset
var n uint64
var read int
var synonymsLen uint64
synonymsLen, read = binary.Uvarint(t.sb.mem[synonymsOffset+n : synonymsOffset+n+binary.MaxVarintLen64])
n += uint64(read)
roaringBytes := t.sb.mem[synonymsOffset+n : synonymsOffset+n+synonymsLen]
if rv.synonyms == nil {
rv.synonyms = roaring64.NewBitmap()
}
rv.buffer.Reset(roaringBytes)
_, err := rv.synonyms.ReadFrom(rv.buffer)
if err != nil {
return fmt.Errorf("error loading roaring bitmap: %v", err)
}
return nil
}
// -----------------------------------------------------------------------------
// SynonymsIterator provides a way to iterate through the synonyms list.
type SynonymsIterator struct {
synonyms *SynonymsList
except *roaring.Bitmap
Actual roaring64.IntPeekable64
ActualBM *roaring64.Bitmap
synIDTermMap map[uint32][]byte
nextSyn Synonym
}
// immutable, empty synonyms iterator
var emptySynonymsIterator = &SynonymsIterator{}
func (i *SynonymsIterator) Size() int {
sizeInBytes := reflectStaticSizeSynonymsIterator + SizeOfPtr +
i.nextSyn.Size()
return sizeInBytes
}
// Next returns the next Synonym in the iteration. It returns nil, nil once the
// iteration is exhausted.
func (i *SynonymsIterator) Next() (segment.Synonym, error) {
return i.next()
}
// next retrieves the next synonym from the iterator, populates the nextSyn field,
// and returns it. It returns nil, nil when the iterator is exhausted, and an error
// if the term map is missing or has no entry for the decoded synonymID.
func (i *SynonymsIterator) next() (segment.Synonym, error) {
synID, docNum, exists, err := i.nextSynonym()
if err != nil || !exists {
return nil, err
}
if i.synIDTermMap == nil {
return nil, fmt.Errorf("synIDTermMap is nil")
}
// If the synonymID is not found in the map, return an error
term, exists := i.synIDTermMap[synID]
if !exists {
return nil, fmt.Errorf("synonymID %d not found in map", synID)
}
i.nextSyn = Synonym{} // clear the struct
rv := &i.nextSyn
rv.term = string(term)
rv.docNum = docNum
return rv, nil
}
// nextSynonym decodes the next synonym from the roaring bitmap iterator,
// ensuring it is not in the "except" set. Returns the synonymID, docNum,
// and a flag indicating success.
func (i *SynonymsIterator) nextSynonym() (uint32, uint32, bool, error) {
// If no synonyms are available, return early
if i.Actual == nil || i.synonyms == nil || i.synonyms == emptySynonymsList {
return 0, 0, false, nil
}
var code uint64
var docNum uint32
var synID uint32
// Loop to find the next valid docNum, checking against the except
for i.Actual.HasNext() {
code = i.Actual.Next()
synID, docNum = decodeSynonym(code)
// If docNum is not in the 'except' set, it's a valid result
if i.except == nil || !i.except.Contains(docNum) {
return synID, docNum, true, nil
}
}
// If no valid docNum is found, return false
return 0, 0, false, nil
}
// Synonym represents a single synonym, containing the term and document number.
type Synonym struct {
term string
docNum uint32
}
// Size returns the memory size of the Synonym, including the length of the term string.
func (p *Synonym) Size() int {
sizeInBytes := reflectStaticSizeSynonym + SizeOfPtr +
len(p.term)
return sizeInBytes
}
// Term returns the term of the Synonym.
func (s *Synonym) Term() string {
return s.term
}
// Number returns the document number of the Synonym.
func (s *Synonym) Number() uint32 {
return s.docNum
}
// decodeSynonym decodes a synonymCode into its synonymID and document ID components.
func decodeSynonym(synonymCode uint64) (synonymID uint32, docID uint32) {
return uint32(synonymCode >> 32), uint32(synonymCode)
}

159
vendor/github.com/blevesearch/zapx/v16/thesaurus.go generated vendored Normal file

@@ -0,0 +1,159 @@
// Copyright (c) 2024 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"bytes"
"fmt"
"github.com/RoaringBitmap/roaring/v2"
index "github.com/blevesearch/bleve_index_api"
segment "github.com/blevesearch/scorch_segment_api/v2"
"github.com/blevesearch/vellum"
)
// Thesaurus is the zap representation of a Thesaurus
type Thesaurus struct {
sb *SegmentBase
name string
fieldID uint16
synIDTermMap map[uint32][]byte
fst *vellum.FST
fstReader *vellum.Reader
}
// represents an immutable, empty Thesaurus
var emptyThesaurus = &Thesaurus{}
// SynonymsList returns the synonyms list for the specified term
func (t *Thesaurus) SynonymsList(term []byte, except *roaring.Bitmap, prealloc segment.SynonymsList) (segment.SynonymsList, error) {
var preallocSL *SynonymsList
sl, ok := prealloc.(*SynonymsList)
if ok && sl != nil {
preallocSL = sl
}
return t.synonymsList(term, except, preallocSL)
}
func (t *Thesaurus) synonymsList(term []byte, except *roaring.Bitmap, rv *SynonymsList) (*SynonymsList, error) {
if t.fstReader == nil {
if rv == nil || rv == emptySynonymsList {
return emptySynonymsList, nil
}
return t.synonymsListInit(rv, except), nil
}
synonymsOffset, exists, err := t.fstReader.Get(term)
if err != nil {
return nil, fmt.Errorf("vellum err: %v", err)
}
if !exists {
if rv == nil || rv == emptySynonymsList {
return emptySynonymsList, nil
}
return t.synonymsListInit(rv, except), nil
}
return t.synonymsListFromOffset(synonymsOffset, except, rv)
}
func (t *Thesaurus) synonymsListFromOffset(synonymsOffset uint64, except *roaring.Bitmap, rv *SynonymsList) (*SynonymsList, error) {
rv = t.synonymsListInit(rv, except)
err := rv.read(synonymsOffset, t)
if err != nil {
return nil, err
}
return rv, nil
}
func (t *Thesaurus) synonymsListInit(rv *SynonymsList, except *roaring.Bitmap) *SynonymsList {
if rv == nil || rv == emptySynonymsList {
rv = &SynonymsList{}
rv.buffer = bytes.NewReader(nil)
} else {
synonyms := rv.synonyms
buf := rv.buffer
if synonyms != nil {
synonyms.Clear()
}
if buf != nil {
buf.Reset(nil)
}
*rv = SynonymsList{} // clear the struct
rv.synonyms = synonyms
rv.buffer = buf
}
rv.sb = t.sb
rv.except = except
rv.synIDTermMap = t.synIDTermMap
return rv
}
func (t *Thesaurus) Contains(key []byte) (bool, error) {
if t.fst != nil {
return t.fst.Contains(key)
}
return false, nil
}
// AutomatonIterator returns an iterator which only visits terms
// matching the vellum automaton within the start/end key range
func (t *Thesaurus) AutomatonIterator(a segment.Automaton,
startKeyInclusive, endKeyExclusive []byte) segment.ThesaurusIterator {
if t.fst != nil {
rv := &ThesaurusIterator{
t: t,
}
itr, err := t.fst.Search(a, startKeyInclusive, endKeyExclusive)
if err == nil {
rv.itr = itr
} else if err != vellum.ErrIteratorDone {
rv.err = err
}
return rv
}
return emptyThesaurusIterator
}
var emptyThesaurusIterator = &ThesaurusIterator{}
// ThesaurusIterator is an iterator over the terms of a Thesaurus
type ThesaurusIterator struct {
t *Thesaurus
itr vellum.Iterator
err error
entry index.ThesaurusEntry
}
// Next returns the next entry in the thesaurus, or nil, nil once iteration is done
func (i *ThesaurusIterator) Next() (*index.ThesaurusEntry, error) {
if i.err != nil && i.err != vellum.ErrIteratorDone {
return nil, i.err
} else if i.itr == nil || i.err == vellum.ErrIteratorDone {
return nil, nil
}
term, _ := i.itr.Current()
i.entry.Term = string(term)
i.err = i.itr.Next()
return &i.entry, nil
}

173
vendor/github.com/blevesearch/zapx/v16/write.go generated vendored Normal file

@@ -0,0 +1,173 @@
// Copyright (c) 2017 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package zap
import (
"encoding/binary"
"io"
"github.com/RoaringBitmap/roaring/v2"
)
// writes out the length of the roaring bitmap in bytes as varint
// then writes out the roaring bitmap itself
func writeRoaringWithLen(r *roaring.Bitmap, w io.Writer,
reuseBufVarint []byte) (int, error) {
buf, err := r.ToBytes()
if err != nil {
return 0, err
}
var tw int
// write out the length
n := binary.PutUvarint(reuseBufVarint, uint64(len(buf)))
nw, err := w.Write(reuseBufVarint[:n])
tw += nw
if err != nil {
return tw, err
}
// write out the roaring bytes
nw, err = w.Write(buf)
tw += nw
if err != nil {
return tw, err
}
return tw, nil
}
func persistFieldsSection(fieldsInv []string, w *CountHashWriter, opaque map[int]resetable) (uint64, error) {
var rv uint64
fieldsOffsets := make([]uint64, 0, len(fieldsInv))
for fieldID, fieldName := range fieldsInv {
// record start of this field
fieldsOffsets = append(fieldsOffsets, uint64(w.Count()))
// write field name length
_, err := writeUvarints(w, uint64(len(fieldName)))
if err != nil {
return 0, err
}
// write out the field name
_, err = w.Write([]byte(fieldName))
if err != nil {
return 0, err
}
// write out the number of field-specific indexes
// FIXME writes every registered section for every field, and does not attempt to support sparseness well
_, err = writeUvarints(w, uint64(len(segmentSections)))
if err != nil {
return 0, err
}
// now write pairs of index section ids, and start addresses for each field
// which has a specific section's data. this serves as the starting point
// using which a field's section data can be read and parsed.
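for segmentSectionType, segmentSectionImpl := range segmentSections {
// propagate write failures instead of silently dropping them
err = binary.Write(w, binary.BigEndian, segmentSectionType)
if err != nil {
return 0, err
}
err = binary.Write(w, binary.BigEndian, uint64(segmentSectionImpl.AddrForField(opaque, fieldID)))
if err != nil {
return 0, err
}
}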
}
rv = uint64(w.Count())
// write out number of fields
_, err := writeUvarints(w, uint64(len(fieldsInv)))
if err != nil {
return 0, err
}
// now write out the fields index
for fieldID := range fieldsInv {
err := binary.Write(w, binary.BigEndian, fieldsOffsets[fieldID])
if err != nil {
return 0, err
}
}
return rv, nil
}
// FooterSize is the size of the footer record in bytes
// crc + ver + chunk + docValueOffset + sectionsIndexOffset + field offset + stored offset + num docs
const FooterSize = 4 + 4 + 4 + 8 + 8 + 8 + 8 + 8
// in the index sections format, the fieldsIndexOffset points to the sectionsIndexOffset
func persistFooter(numDocs, storedIndexOffset, fieldsIndexOffset, sectionsIndexOffset, docValueOffset uint64,
chunkMode uint32, crcBeforeFooter uint32, writerIn io.Writer) error {
w := NewCountHashWriter(writerIn)
w.crc = crcBeforeFooter
// write out the number of docs
err := binary.Write(w, binary.BigEndian, numDocs)
if err != nil {
return err
}
// write out the stored field index location:
err = binary.Write(w, binary.BigEndian, storedIndexOffset)
if err != nil {
return err
}
// write out the field index location
err = binary.Write(w, binary.BigEndian, fieldsIndexOffset)
if err != nil {
return err
}
// write out the new field index location (to be removed later, as this can eventually replace the old)
err = binary.Write(w, binary.BigEndian, sectionsIndexOffset)
if err != nil {
return err
}
// write out the fieldDocValue location
err = binary.Write(w, binary.BigEndian, docValueOffset)
if err != nil {
return err
}
// write out 32-bit chunk factor
err = binary.Write(w, binary.BigEndian, chunkMode)
if err != nil {
return err
}
// write out 32-bit version
err = binary.Write(w, binary.BigEndian, Version)
if err != nil {
return err
}
// write out CRC-32 of everything up to but not including this CRC
err = binary.Write(w, binary.BigEndian, w.crc)
if err != nil {
return err
}
return nil
}
func writeUvarints(w io.Writer, vals ...uint64) (tw int, err error) {
buf := make([]byte, binary.MaxVarintLen64)
for _, val := range vals {
n := binary.PutUvarint(buf, val)
var nw int
nw, err = w.Write(buf[:n])
tw += nw
if err != nil {
return tw, err
}
}
return tw, err
}

246
vendor/github.com/blevesearch/zapx/v16/zap.md generated vendored Normal file

@@ -0,0 +1,246 @@
# ZAP File Format
## Legend
### File Sections

    |========|
    |        | file section
    |========|

### Fixed-size fields

    |--------|        |----|        |--|        |-|
    |        | uint64 |    | uint32 |  | uint16 | | uint8
    |--------|        |----|        |--|        |-|

### Varints

    |~~~~~~~~|
    |        | varint(up to uint64)
    |~~~~~~~~|

### Arbitrary-length fields

    |--------...---|
    |              | arbitrary-length field (string, vellum, roaring bitmap)
    |--------...---|

### Chunked data

    [--------]
    [        ]
    [--------]

## Overview
The footer section describes the configuration of a particular ZAP file. The footer format is version-dependent, so check the `V` field before parsing the rest of it.

            +==================================================+
            | Stored Fields                                    |
            |==================================================|
    +-----> | Stored Fields Index                              |
    |       |==================================================|
    |       | Inverted Text Index Section                      |
    |       |==================================================|
    |       | Vector Index Section                             |
    |       |==================================================|
    |       | Sections Info                                    |
    |       |==================================================|
    |   +-> | Sections Index                                   |
    |   |   |========+========+====+=====+======+====+====+====|
    |   |   |     D# |     SF |  F |   S |  FDV | CF |  V | CC | (Footer)
    |   |   +========+====+===+====+==+==+======+====+====+====+
    |   |                 |           |
    +---------------------+           |
        |-----------------------------+

     D#. Number of Docs.
     SF. Stored Fields Index Offset.
      F. Field Index Offset.
      S. Sections Index Offset.
    FDV. Field DocValue Offset.
     CF. Chunk Factor.
      V. Version.
     CC. CRC32.
## Stored Fields
The Stored Fields Index is a sequence of `D#` consecutive 64-bit unsigned integers, one per document, each holding the offset at which that document's Stored Fields Data record is located.

    0                                [SF]                   [SF + D# * 8]
    | Stored Fields                  | Stored Fields Index              |
    |================================|==================================|
    |                                |                                  |
    |       |--------------------|   ||--------|--------|. . .|--------||
    |   |-> | Stored Fields Data |   ||      0 |      1 |     | D# - 1 ||
    |   |   |--------------------|   ||--------|----|---|. . .|--------||
    |   |                            |              |                   |
    |===|============================|==============|===================|
        |                                           |
        |-------------------------------------------|

Stored Fields Data is an arbitrary-size record that consists of metadata and [Snappy](https://github.com/golang/snappy)-compressed data.

    Stored Fields Data
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
    |    MDS |    CDS |         MD        |         CD        |
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|

    MDS. Metadata size.
    CDS. Compressed data size.
    MD. Metadata.
    CD. Snappy-compressed data.
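To make the two-step lookup concrete, here is a hedged sketch that resolves a document number to its Stored Fields Data record and splits the record into its metadata and compressed parts. It assumes the index entries are big-endian uint64 values and that MDS and CDS are unsigned varints, as the diagrams suggest; `storedRecord` and `errCorrupt` are hypothetical names, and decompression is left out.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errCorrupt = errors.New("corrupt or truncated stored fields data")

// storedRecord returns the metadata (MD) and Snappy-compressed (CD) byte ranges
// for docNum, given the offset of the Stored Fields Index (SF in the footer).
func storedRecord(mem []byte, storedIndexOffset, docNum uint64) (meta, compressed []byte, err error) {
	size := uint64(len(mem))
	idx := storedIndexOffset + docNum*8
	if idx+8 > size {
		return nil, nil, errCorrupt
	}
	// Index entry: offset of this document's Stored Fields Data record.
	off := binary.BigEndian.Uint64(mem[idx : idx+8])
	if off >= size {
		return nil, nil, errCorrupt
	}
	mds, n := binary.Uvarint(mem[off:]) // MDS
	if n <= 0 {
		return nil, nil, errCorrupt
	}
	off += uint64(n)
	cds, n := binary.Uvarint(mem[off:]) // CDS
	if n <= 0 {
		return nil, nil, errCorrupt
	}
	off += uint64(n)
	if mds > size-off || cds > size-off-mds {
		return nil, nil, errCorrupt
	}
	return mem[off : off+mds], mem[off+mds : off+mds+cds], nil
}

func main() {
	// One record at offset 0 (MDS=2, CDS=3), followed by a one-entry index at offset 7.
	mem := []byte{2, 3, 'a', 'b', 'x', 'y', 'z'}
	mem = append(mem, make([]byte, 8)...) // index entry 0 -> offset 0

	meta, compressed, err := storedRecord(mem, 7, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q %q\n", meta, compressed)
}
```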
## Index Sections
The Sections Index is a set of NF uint64 addresses (0 through F# - 1), each of which is an offset to a record in the Sections Info. Each Sections Info record in turn holds offsets to the index section of each type for that particular field in the segment file. For example, field 0 may use vector indexing, so its record has offsets into the Vector Index Section, whereas field 1 may use text indexing, so its record points into the Inverted Text Index Section.
(...) [F] [F + F#]
+ Sections Info + Sections Index +
|============================================================================|=====================================|
| | |
| +---------+---------+-----+---------+---------+~~~~~~~~+~~~~~~~~+--+...+-+ | +-------+--------+...+------+-----+ |
+----> S1 Addr | S1 Type | ... | Sn Addr | Sn Type | NS | Length | Name | | | 0 | 1 | | F#-1 | NF | |
| | +---------+---------+-----+---------+---------+~~~~~~~~+~~~~~~~~+--+...+-+ | +-------+----+---+...+------+-----+ |
| | | | |
| +============================================================================+==============|======================+
| |
+----------------------------------------------------------------------------------------------+
NF. Number of fields
NS. Number of index sections
Sn. nth index section
## Inverted Text Index Section
Each field has its own types of indexes in separate sections, as indicated above. This can be a vector index or an inverted text index.
In the case of an inverted text index, the dictionary is encoded in [Vellum](https://github.com/couchbase/vellum) format. The dictionary consists of pairs `(term, offset)`, where `offset` indicates the position of the postings (list of documents) for that term.
+================================================================+- Inverted Text
| | Index Section
| |
| Freq/Norm (chunked) |
| [~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] |
| +->[ Freq | Norm (float32 under varint) ] |
| | [~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] |
| | |
| +------------------------------------------------------------+ |
| Location Details (chunked) | |
| [~~~~~~+~~~~~+~~~~~~~+~~~~~+~~~~~~+~~~~~~~~+~~~~~] | |
| +->[ Size | Pos | Start | End | Arr# | ArrPos | ... ] | |
| | [~~~~~~+~~~~~+~~~~~~~+~~~~~+~~~~~~+~~~~~~~~+~~~~~] | |
| | | |
| +----------------------+ | |
| Postings List | | |
| +~~~~~~~~+~~~~~+~~+~~~~~~~~+----------+...+-+ | |
| +->+ F/N | LD | Length | ROARING BITMAP | | |
| | +~~~~~+~~|~~~~~~~~|~~~~~~~~+----------+...+-+ | |
| | +----------------------------------------------+ |
| +-------------------------------------------------+ |
| | |
| Dictionary | |
| +~~~~~~~~~~+~~~~~~~+~~~~~~~~+--------------------------+-...-+ |
+-----> DV Start | DV End| Length | VELLUM DATA : (TERM -> OFFSET) | |
| | +~~~~~~~~~~+~~~~~~~+~~~~~~~~+----------------------------...-+ |
| | |
| | |
| |================================================================+- Vector Index Section
| | |
| +================================================================+- Synonym Index Section
| | |
| |================================================================+- Sections Info
+-----------------------------+ |
| | |
| +-------+-----+-----+------+~~~~~~~~+~~~~~~~~+--+...+--+ |
| | ... | ITI | ITI ADDR | NS | Length | Name | |
| +-------+-----+------------+~~~~~~~~+~~~~~~~~+--+...+--+ |
+================================================================+
ITI - Inverted Text Index
## Synonym Index Section
In a synonyms index, the relationship between a term and its synonyms is represented using a Thesaurus. The Thesaurus is encoded in the [Vellum](https://github.com/couchbase/vellum) format and consists of pairs in the form `(term, offset)`. Here, the offset specifies the position of the postings list containing the synonyms for the given term. The postings list is stored as a Roaring64 bitmap, with each entry representing an encoded synonym for the term.
|================================================================+- Inverted Text Index Section
| |
|================================================================+- Vector Index Section
| |
+================================================================+- Synonym Index Section
| |
| (Offset) +~~~~~+----------+...+---+ |
| +--------->| RL | ROARING64 BITMAP | |
| | +~~~~~+----------+...+---+ +-------------------+
| |(Term -> Offset) |
| +--------+ |
| | Term ID to Term map (NST Entries) |
| +~~~~+~~~~+~~~~~[{~~~~~+~~~~+~~~~~~}{~~~~~+~~~~+~~~~~~}...{~~~~~+~~~~+~~~~~~}] |
| +->| VL | VD | NST || TID | TL | Term || TID | TL | Term | | TID | TL | Term | |
| | +~~~~+~~~~+~~~~~[{~~~~~+~~~~+~~~~~~}{~~~~~+~~~~+~~~~~~}...{~~~~~+~~~~+~~~~~~}] |
| | |
| +----------------------------+ |
| | |
| +~~~~~~~~~~+~~~~~~~~+~~~~~~~~~~~~~~~~~+ |
+-----> DV Start | DV End | ThesaurusOffset | |
| | +~~~~~~~~~~+~~~~~~~~+~~~~~~~~~~~~~~~~~+ +-------------------+
| | |
| | |
| |================================================================+- Sections Info
+-----------------------------+ |
| | |
| +-------+-----+-----+------+~~~~~~~~+~~~~~~~~+--+...+--+ |
| | ... | SI | SI ADDR | NS | Length | Name | |
| +-------+-----+------------+~~~~~~~~+~~~~~~~~+--+...+--+ |
+================================================================+
SI - Synonym Index
VL - Vellum Length
VD - Vellum Data (Term -> Offset)
RL - Roaring64 Length
NST - Number of entries in the term ID to term map
TID - Term ID (32-bit)
TL - Term Length
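The Term ID to Term map in the diagram can be walked with a simple varint loop. This is only a sketch under the assumption that NST, TID, and TL are all unsigned varints (as the `~` borders in the legend indicate) and that the input slice starts at the NST field; `parseTermMap` is a hypothetical helper, not a zapx function.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errTruncated = errors.New("truncated term map")

// parseTermMap decodes NST followed by NST entries of (TID, TL, Term).
// The returned slices alias b rather than copying it.
func parseTermMap(b []byte) (map[uint32][]byte, error) {
	nst, n := binary.Uvarint(b)
	if n <= 0 {
		return nil, errTruncated
	}
	b = b[n:]

	terms := make(map[uint32][]byte)
	for i := uint64(0); i < nst; i++ {
		tid, n := binary.Uvarint(b) // TID
		if n <= 0 {
			return nil, errTruncated
		}
		b = b[n:]

		tl, n := binary.Uvarint(b) // TL
		if n <= 0 || tl > uint64(len(b)-n) {
			return nil, errTruncated
		}
		b = b[n:]

		terms[uint32(tid)] = b[:tl]
		b = b[tl:]
	}
	return terms, nil
}

func main() {
	// Two entries: 1 -> "car", 2 -> "auto".
	data := []byte{2, 1, 3, 'c', 'a', 'r', 2, 4, 'a', 'u', 't', 'o'}
	terms, err := parseTermMap(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s\n", terms[1], terms[2])
}
```

The resulting map has the same shape as the `synIDTermMap map[uint32][]byte` that the iterator uses to turn a decoded synonym ID back into its term.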
### Synonym Encoding
    ROARING64 BITMAP

    Each 64-bit entry consists of two parts: the upper 32 bits hold the Term ID (TID),
    and the lower 32 bits hold the Document Number (DN).

    [{~~~~~+~~~~}{~~~~~+~~~~}...{~~~~~+~~~~}]
     | TID | DN || TID | DN |   | TID | DN |
    [{~~~~~+~~~~}{~~~~~+~~~~}...{~~~~~+~~~~}]

    TID - Term ID (32-bit)
    DN  - Document Number (32-bit)
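The packing of a Term ID and a Document Number into one 64-bit entry can be sketched as follows. `decodeSynonym` matches the function of the same name in the Go source above; `encodeSynonym` is the inverse inferred from it and is shown here as an assumption rather than as the writer's actual code.

```go
package main

import "fmt"

// encodeSynonym packs the term ID into the upper 32 bits and the
// document number into the lower 32 bits (inverse of decodeSynonym).
func encodeSynonym(termID, docNum uint32) uint64 {
	return uint64(termID)<<32 | uint64(docNum)
}

// decodeSynonym splits a 64-bit entry back into its two halves.
func decodeSynonym(code uint64) (termID uint32, docNum uint32) {
	return uint32(code >> 32), uint32(code)
}

func main() {
	code := encodeSynonym(7, 42)
	termID, docNum := decodeSynonym(code)
	fmt.Printf("%#x %d %d\n", code, termID, docNum)
}
```

Because the term ID occupies the high bits, entries in the bitmap sort first by term ID and then by document number.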
## Doc Values
DocValue start and end offsets are stored within the section content of each field. This lets each field, which has its own type of index, choose whether to store doc values. For example, storing doc values may not make sense for vector indexing, so its offsets can be invalid, whereas fields with text indexing may have valid doc value offsets.

    +================================================================+
    |     +------...--+                                              |
    |  +->+ DocValues +<-+                                           |
    |  |  +------...--+  |                                           |
    |==|=================|===========================================+- Inverted Text
    ++~+~~~~~~~~~+~~~~~~~+~~+~~~~~~~~+-----------------------...--+  |  Index Section
    ||  DV START | DV END   | LENGTH | VELLUM DATA: TERM -> OFFSET|  |
    ++~~~~~~~~~~~+~~~~~~~~~~+~~~~~~~~+-----------------------...--+  |
    +================================================================+

DocValues are chunked, Snappy-compressed values for each document and field.

    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
    [ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ]
    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]

The last 16 bytes describe the chunks.

    |~~~~~~~~~~~~...~|----------------|----------------|
    |   Chunk Sizes  | Chunk Size Arr |     Chunk#     |
    |~~~~~~~~~~~~...~|----------------|----------------|