Indexing #2

Open
mottosso opened this issue Jul 1, 2014 · 3 comments

Comments

mottosso commented Jul 1, 2014

Index results for faster queries.

mottosso commented Jul 2, 2014

Here is how I see this going down.

From a performance perspective, the goal is that no query should involve any waiting for results once every effort has been made to optimise the file-system. I'm thinking it should be expected, and okay, for queries to take some time, possibly a substantial amount, if no optimisations have been made.

In terms of optimisations, these are the routes I see, in order of increasing complexity:

No Optimisations

This means making every query live, which will in some cases cause a noticeable slowdown depending on the number of directories involved in a query. For upwards queries this is usually not noticeable (~0.001s/level), but downwards queries could potentially touch millions of targets and as such may take several minutes or hours to complete.
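For illustration, here is a minimal sketch of what a live downwards query boils down to. It assumes, hypothetically, that a directory is tagged by a file named after its selector inside a hidden .meta container; the actual on-disk convention may differ.

import os

def walk_down(root, selector):
    """Live downwards query: visit every directory below `root` and
    yield those tagged with `selector` (assumed to be a file inside
    a hidden ".meta" folder; the real convention may differ)."""
    for base, dirs, _ in os.walk(root):
        if os.path.exists(os.path.join(base, ".meta", selector)):
            yield base

# Every call re-touches the entire hierarchy below `root`, which is why
# a query across millions of directories can take minutes or hours.
matches = list(walk_down("/projects/spiderman", "Asset"))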

Local Daemon

The simplest level of optimisation is one that indexes results during a query. Once a query has been performed, the results are stored in the currently running process and speed up subsequent queries. This type of optimisation could be a useful default for applications relying on queries, as queries can sometimes be expected to be performed multiple times, such as when browsing a directory.
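As a sketch of this in-process level, results could simply be memoised in a dict keyed on (root, selector), re-using the hypothetical walk_down() from above:

_cache = {}  # (root, selector) -> list of matching paths

def cached_query(root, selector):
    """First call walks the disk; repeated calls within the same
    process return the stored result immediately."""
    key = (root, selector)
    if key not in _cache:
        _cache[key] = list(walk_down(root, selector))
    return _cache[key]

cached_query("/projects/spiderman", "Asset")  # hits the disk
cached_query("/projects/spiderman", "Asset")  # served from memory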

Dedicated Daemon

The next level of optimisation involves running a dedicated daemon that performs indexing either live, at a fixed interval or in response to events. The dedicated daemon has the advantage of being persistent across runs and of facilitating a multi-user setup.
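A rough sketch of the fixed-interval variant, with hypothetical names throughout and the same assumed .meta convention as above:

import os
import threading
import time

class IndexDaemon(object):
    """Rebuild the index at a fixed interval so queries never
    touch the disk directly."""

    def __init__(self, root, interval=60):
        self.root = root
        self.interval = interval
        self.index = {}  # selector -> list of matching paths
        self._lock = threading.Lock()

    def rebuild(self):
        index = {}
        for base, dirs, _ in os.walk(self.root):
            meta = os.path.join(base, ".meta")
            if os.path.isdir(meta):
                for selector in os.listdir(meta):
                    index.setdefault(selector, []).append(base)
        with self._lock:
            self.index = index

    def run(self):
        while True:
            self.rebuild()
            time.sleep(self.interval)

    def query(self, selector):
        with self._lock:
            return list(self.index.get(selector, []))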

Central Daemon

Finally, the central daemon, like the dedicated daemon, is persistent, but as opposed to the dedicated daemon it facilitates multi-user/multi-site usage and would be hosted on a server somewhere. This type of optimisation would potentially see the most use in an established environment with lots of variables depending on queries, and is the place where the maximum amount of optimisation can take place, potentially involving prediction and machine-learning algorithms.

Reference

Based on the initial results I got via Python and Go, both single-threaded, traversing about 40k directories in about 20 seconds, I was a bit discouraged from making any significant effort at filtering and searching large hierarchies of content. But I've done some reading and research and found several instances where performance could out-perform databases 40 to 1 [1].

More references:

  • Everything
    Search engine with indexing; reports producing a full scan of 40 000 directories in 1 sec, 1 million directories in 1 min.
  • Xapian
    Search engine library with Python bindings, featuring indexing and support for fuzzy searching (like Google).
  • Organizing, Indexing, and Searching Large-Scale File Systems
    Study on current, future and alternative search-engine design and optimisation for file-systems.

mottosso commented Jul 4, 2014

Taking a brute-force crack at this. The motivation is that the DOM doesn't do any indexing. It may be stored internally as a map/dict or b-tree-like structure, but ultimately I believe the main factor in its speed is that it is located in-memory as opposed to on disk.

So as a first attempt:

Stage 1 - Commit an entire project into memory

The idea is that, prior to running any queries, the entire contents of a project should be read into memory. The following alterations apply:

  • No hidden folders are included
$ /projects/spiderman/.trash  # Hidden
$ /projects/spiderman/.ng  # Hidden
$ /projects/spiderman/assets
$ /projects/spiderman/shots
  • Folders with metadata have their metadata applied as "data" or "properties" to the contained object in-memory. E.g.
>>> peter = contents['/projects/spiderman/assets/Peter']
>>> peter.data('class')
# Asset
>>> getattr(peter, 'class')
# Asset
>>> version = contents['/projects/spiderman/assets/Peter/public/v001']
>>> version.data('source')
# /projects/spiderman/assets/Peter/private/marcus/maya-2015x64/scenes/v016.ma

For storage, each individual folder is stored as a unique key within a single dict.

{
    "/": Node,
    "/projects": Node,
    "/projects/spiderman": Node,
    "/projects/spiderman/assets": Node
}

A Node is a mere container of the path, with rudimentary path-manipulation such as basename, name and suffix, and finally data for anything stored within an existing .meta container.
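A minimal sketch of what such a Node and the commit step could look like; all names here are assumptions, not the actual implementation:

import os

class Node(object):
    """Container for a single path plus any data found in its .meta folder."""

    def __init__(self, path, data=None):
        self.path = path
        self._data = data or {}

    def data(self, key):
        return self._data.get(key)

    @property
    def basename(self):
        return os.path.basename(self.path)

    @property
    def name(self):
        return os.path.splitext(self.basename)[0]

    @property
    def suffix(self):
        return os.path.splitext(self.basename)[1].lstrip(".")

def commit(root):
    """Stage 1: read an entire project into a flat dict of path -> Node,
    skipping hidden folders such as .trash and .ng."""
    index = {}
    for base, dirs, _ in os.walk(root):
        dirs[:] = [d for d in dirs if not d.startswith(".")]  # no hidden folders
        meta = os.path.join(base, ".meta")
        data = {}
        if os.path.isdir(meta):
            # How metadata is serialised inside .meta is left open here
            data = dict.fromkeys(os.listdir(meta))
        index[base] = Node(base, data)
    return index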

Stage 2 - Persistence

Considering the brute-force approach, along with an estimated index size of < 10mb for 100 000 directories, the entire dataset will get offloaded onto disk as JSON or a compressed pickle, depending on the complexity and necessity of serialising Node objects.
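Building on the Node sketch above, persistence could be as simple as dumping the flat dict to JSON, with a compressed pickle as the fallback if Node grows harder to serialise:

import json

def save(index, filename):
    # Reduce each Node to its path and raw data so it round-trips through JSON
    with open(filename, "w") as f:
        json.dump(dict((path, node._data) for path, node in index.items()), f)

def load(filename):
    with open(filename) as f:
        return dict((path, Node(path, data)) for path, data in json.load(f).items())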

Dataset

As a dataset, I'll create stub projects with the following specifications to test against:

Small

  • 10 assets
  • 15 shots
  • 100 image sequences
  • 50 cache sets
  • 100 versions
  • 10 cosmetic directories

Large

  • 5 000 assets
  • 300 shots
  • 50 000 image sequences
  • 800 000 cache sets
  • 200 000 versions
  • 6 000 cosmetic directories

Each Node will have a minimum of one variable as metadata - its type - and will sporadically be contained within groups without metadata. For example:

$ /projects/spiderman/ASSETS/Peter

Here, ASSETS is a convenience directory without any metadata and exists only for cosmetic reasons.

Backends

Each project will get tested locally on an SSD, on a local network and remotely over the internet.

mottosso commented Jul 5, 2014

Initial results from Stage 1: 20x performance increase

Project generator and indexer here:

1. Generate project

>>> # 42974 directories
>>> generate.project(
        root=r"\test_large",
        assets=6000,
        subassets=0,
        shots=800,
        imgseqs=20000,
        cacheseqs=500,
        versions=800,
        workspaces=5000,
        cosmetics=2)

2. Run Default cQuery

$ cd /test_large
$ cquery .Asset
# ...
#  Querying directory of C:\studio\content\jobs\test_large
#                 Selector .Asset
#                 6000 results in 14.463034s

3. Run In-memory cQuery

>>> index = brute.Index(r"\test_large")
>>> index.compute()
>>> index.stats
# {'count': 42974, 'timetaken': 15.004621581002993}
>>> index.cquery(".Asset")
# ...
#  Querying directory of c:\studio\content\jobs\test_large
#                 Selector .Asset
#                 6000 results in 0.72460919372s

Notes

As each directory is stored flatly, the ability to walk across a hierarchy is hampered, which makes traversal potentially slow. In this example, I'm traversing the entire index for results from any level within it. Another point is the duplication of root paths within each entry; "/content/assets" appears many times, potentially consuming memory and disk-space.

A more optimal method might be to store each item in a b-tree-like structure, where each node is capable of storing parent/child relationships. From there it would become trivial to walk, and to list the children of any node, while at the same time conserving memory, as there would be no duplication of path names.
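As a minimal sketch of such a node (hypothetical, not part of the current code):

class TreeNode(object):
    """Each node stores only its own name plus parent/child links,
    so full paths are never duplicated in memory."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    @property
    def path(self):
        # Reconstructed on demand rather than stored per entry
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts)) or "/"

    def walk(self):
        # Depth-first traversal of the subtree rooted at this node
        yield self
        for child in self.children.values():
            for descendant in child.walk():
                yield descendant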
