Indexing #2

Open
mottosso opened this issue Jul 1, 2014 · 3 comments

Comments

mottosso commented Jul 1, 2014

Index results for faster queries.

mottosso commented Jul 2, 2014

Here is how I see this going down.

From a performance perspective, the goal is that no query should involve any waiting for results once every effort has been made to optimise the file-system. I'm thinking it should be expected, and okay, for queries to take some time, possibly a substantial amount, if no optimisations have been made.

In terms of optimisations, these are the routes I see, in order of increasing complexity:

No Optimisations

This means making every query live, which will in some cases cause a noticeable slowdown depending on the number of directories involved in a query. For upwards queries this is usually not noticeable (~0.001s/level), but downwards queries could potentially touch millions of targets and as such may take several minutes or hours to complete.
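For illustration, here is a minimal sketch of what a live downwards query boils down to. It assumes, hypothetically, that a directory is tagged by a file named after its selector inside a hidden .meta container; the actual on-disk convention may differ.

import os

def walk_down(root, selector):
    """Live downwards query: visit every directory below `root` and
    yield those tagged with `selector` (assumed to be a file inside
    a hidden ".meta" folder; the real convention may differ)."""
    for base, dirs, _ in os.walk(root):
        if os.path.exists(os.path.join(base, ".meta", selector)):
            yield base

# Every call re-touches the entire hierarchy below `root`, which is why
# a query across millions of directories can take minutes or hours.
matches = list(walk_down("/projects/spiderman", "Asset"))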

Local Daemon

The simplest level of optimisation is one that indexes results during a query. Once a query has been performed, the results are stored in the currently running process and speed up subsequent queries. This type of optimisation could be a useful default for applications relying on queries, as queries can sometimes be expected to be performed multiple times, such as when browsing a directory.
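As a sketch of this in-process level, results could simply be memoised in a dict keyed on (root, selector), re-using the hypothetical walk_down() from above:

_cache = {}  # (root, selector) -> list of matching paths

def cached_query(root, selector):
    """First call walks the disk; repeated calls within the same
    process return the stored result immediately."""
    key = (root, selector)
    if key not in _cache:
        _cache[key] = list(walk_down(root, selector))
    return _cache[key]

cached_query("/projects/spiderman", "Asset")  # hits the disk
cached_query("/projects/spiderman", "Asset")  # served from memory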

Dedicated Daemon

The next level of optimisation involves running a dedicated daemon that performs indexing either live, at a fixed interval or in response to events. The dedicated daemon has the advantage of being persistent across runs and of facilitating a multi-user setup.
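A rough sketch of the fixed-interval variant, with hypothetical names throughout and the same assumed .meta convention as above:

import os
import threading
import time

class IndexDaemon(object):
    """Rebuild the index at a fixed interval so queries never
    touch the disk directly."""

    def __init__(self, root, interval=60):
        self.root = root
        self.interval = interval
        self.index = {}  # selector -> list of matching paths
        self._lock = threading.Lock()

    def rebuild(self):
        index = {}
        for base, dirs, _ in os.walk(self.root):
            meta = os.path.join(base, ".meta")
            if os.path.isdir(meta):
                for selector in os.listdir(meta):
                    index.setdefault(selector, []).append(base)
        with self._lock:
            self.index = index

    def run(self):
        while True:
            self.rebuild()
            time.sleep(self.interval)

    def query(self, selector):
        with self._lock:
            return list(self.index.get(selector, []))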

Central Daemon

Finally, the central daemon, like the dedicated daemon, is persistent, but as opposed to the dedicated daemon it facilitates multi-user/multi-site usage and would be hosted on a server somewhere. This type of optimisation would potentially see the most use in an established environment with lots of variables depending on queries, and is the place where the maximum amount of optimisation can take place, potentially involving prediction and machine-learning algorithms.

Reference

Based on the initial results I got via Python and Go, both single-threaded, traversing about 40k directories in about 20 seconds, I was a bit discouraged from making any significant effort at filtering and searching large hierarchies of content. But I've done some reading and research and found several instances where performance could out-perform databases 40 to 1 [1].

More references:

  • Everything
    Search engine with indexing; reports producing a full scan of 40 000 directories in 1 sec, 1 million directories in 1 min.
  • Xapian
    Search engine library with Python bindings, featuring indexing and support for fuzzy searching (like Google).
  • Organizing, Indexing, and Searching Large-Scale File Systems
    Study on current, future and alternative search-engine design and optimisation for file-systems.

mottosso commented Jul 4, 2014

Taking a brute-force crack at this. The motivation is that the DOM doesn't do any indexing. It may be stored internally as a map/dict or b-tree-like structure, but ultimately I believe the main factor in its speed is that it is located in-memory as opposed to on disk.

So as a first attempt:

Stage 1 - Commit an entire project into memory

The idea is that, prior to running any queries, the entire contents of a project should be read into memory. The following alterations apply:

  • No hidden folders are included
$ /projects/spiderman/.trash  # Hidden
$ /projects/spiderman/.ng  # Hidden
$ /projects/spiderman/assets
$ /projects/spiderman/shots
  • Folders with metadata have their metadata applied as "data" or "properties" to the contained object in-memory. E.g.
>>> peter = contents['/projects/spiderman/assets/Peter']
>>> peter.data('class')
# Asset
>>> getattr(peter, 'class')
# Asset
>>> version = contents['/projects/spiderman/assets/Peter/public/v001']
>>> version.data('source')
# /projects/spiderman/assets/Peter/private/marcus/maya-2015x64/scenes/v016.ma

For storage, each individual folder is stored as a unique key within a single dict.

{
    "/": Node,
    "/projects": Node,
    "/projects/spiderman": Node,
    "/projects/spiderman/assets": Node
}

A Node is a mere container of the path, with rudimentary path-manipulation such as basename, name and suffix, and finally data for anything stored within an existing .meta container.
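A minimal sketch of what such a Node and the commit step could look like; all names here are assumptions, not the actual implementation:

import os

class Node(object):
    """Container for a single path plus any data found in its .meta folder."""

    def __init__(self, path, data=None):
        self.path = path
        self._data = data or {}

    def data(self, key):
        return self._data.get(key)

    @property
    def basename(self):
        return os.path.basename(self.path)

    @property
    def name(self):
        return os.path.splitext(self.basename)[0]

    @property
    def suffix(self):
        return os.path.splitext(self.basename)[1].lstrip(".")

def commit(root):
    """Stage 1: read an entire project into a flat dict of path -> Node,
    skipping hidden folders such as .trash and .ng."""
    index = {}
    for base, dirs, _ in os.walk(root):
        dirs[:] = [d for d in dirs if not d.startswith(".")]  # no hidden folders
        meta = os.path.join(base, ".meta")
        data = {}
        if os.path.isdir(meta):
            # How metadata is serialised inside .meta is left open here
            data = dict.fromkeys(os.listdir(meta))
        index[base] = Node(base, data)
    return index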

Stage 2 - Persistence

Considering the brute-force approach, along with an estimated index size of < 10mb for 100 000 directories, the entire dataset will get offloaded onto disk as JSON or a compressed pickle, depending on the complexity and necessity of serialising Node objects.
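Building on the Node sketch above, persistence could be as simple as dumping the flat dict to JSON, with a compressed pickle as the fallback if Node grows harder to serialise:

import json

def save(index, filename):
    # Reduce each Node to its path and raw data so it round-trips through JSON
    with open(filename, "w") as f:
        json.dump(dict((path, node._data) for path, node in index.items()), f)

def load(filename):
    with open(filename) as f:
        return dict((path, Node(path, data)) for path, data in json.load(f).items())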

Dataset

As a dataset, I'll create stub projects with the following specifications to test against:

Small

  • 10 assets
  • 15 shots
  • 100 image sequences
  • 50 cache sets
  • 100 versions
  • 10 cosmetic directories

Large

  • 5 000 assets
  • 300 shots
  • 50 000 image sequences
  • 800 000 cache sets
  • 200 000 versions
  • 6 000 cosmetic directories

Each Node will have a minimum of one variable as metadata - its type - and will sporadically be contained within groups without metadata. For example:

$ /projects/spiderman/ASSETS/Peter

Here, ASSETS is a convenience directory without any metadata and exists only for cosmetic reasons.

Backends

Each project will get tested locally on an SSD, on a local network and remotely over the internet.

mottosso commented Jul 5, 2014

Initial results from Stage 1: 20x performance increase

Project generator and indexer here:

1. Generate project

>>> # 42974 directories
>>> generate.project(
        root=r"\test_large",
        assets=6000,
        subassets=0,
        shots=800,
        imgseqs=20000,
        cacheseqs=500,
        versions=800,
        workspaces=5000,
        cosmetics=2)

2. Run Default cQuery

$ cd /test_large
$ cquery .Asset
# ...
#  Querying directory of C:\studio\content\jobs\test_large
#                 Selector .Asset
#                 6000 results in 14.463034s

3. Run In-memory cQuery

>>> index = brute.Index(r"\test_large")
>>> index.compute()
>>> index.stats
# {'count': 42974, 'timetaken': 15.004621581002993}
>>> index.cquery(".Asset")
# ...
#  Querying directory of c:\studio\content\jobs\test_large
#                 Selector .Asset
#                 6000 results in 0.72460919372s

Notes

As each directory is stored flatly, the ability to walk across a hierarchy is hampered, which makes traversal potentially slow. In this example, I'm traversing the entire index for results from any level within it. Another point is the duplication of root paths within each entry; "/content/assets" appears many times, potentially consuming memory and disk-space.

A more optimal method might be to store each item in a b-tree-like structure, where each node is capable of storing parent/child relationships. From there it would become trivial to walk, and to list the children of any node, while at the same time conserving memory, as there would be no duplication of path names.
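As a minimal sketch of such a node (hypothetical, not part of the current code):

class TreeNode(object):
    """Each node stores only its own name plus parent/child links,
    so full paths are never duplicated in memory."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    @property
    def path(self):
        # Reconstructed on demand rather than stored per entry
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts)) or "/"

    def walk(self):
        # Depth-first traversal of the subtree rooted at this node
        yield self
        for child in self.children.values():
            for descendant in child.walk():
                yield descendant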
