Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Treat directory formats like other data files #2064

Merged
merged 2 commits into from
Aug 7, 2024

Conversation

effigies
Copy link
Collaborator

@effigies effigies commented Aug 6, 2024

This PR does surgery on the fileTree to find data "files" that are actually directories, using schema.objects.extensions to identify them. A data file must have entities and a suffix, so this is finely targeted.

From experience, getting the detection wrong results in a huge number of false positives as anat/ doesn't match any file rules. We can also now drop the .zarr and .ds ignore rules and depend on the schema.

Follow-up to #2063.

Copy link

codecov bot commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 95.65217% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.19%. Comparing base (c98cb99) to head (e5b0aba).

Files Patch % Lines
bids-validator/src/schema/walk.ts 92.85% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2064      +/-   ##
==========================================
+ Coverage   85.69%   87.19%   +1.49%     
==========================================
  Files          91      139      +48     
  Lines        3782     6730    +2948     
  Branches     1220     1587     +367     
==========================================
+ Hits         3241     5868    +2627     
- Misses        455      772     +317     
- Partials       86       90       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@rwblair rwblair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I've been sitting on this, it got me thinking about twp things:
If directories should just get their own context, albeit it would probably look different from the current Context object. Outside of directories in objects.extensions I haven't come up with too many ideas on how this would be useful. (Assertions in the schema about empty directories?) Probably more trouble than its worth at the moment.

The other is if the call to replacePsuedoFiles shouldn't live in src/schema/walk.ts. walk would need access to the schema, we could piggy back this into dsContext which walk already receives. replacePseudoFiles is fundamentally a way of tricking walk into creating contexts for things it normally wouldn't. Nice because it reduces clutter in src/validators/bids.ts.

Happy to merge as is if that second point is problematic.

'tests/data/empty_files/sub-0001/meg/sub-0001_task-AEF_run-01_meg.ds/BadChannels',
),
undefined,
'BadChannels should not be included in EMPTY_FILES error',
'Contents of pseudo files should not be included in EMPTY_FILES error',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that directories and files that are children of psuedofile candidates should never show up in any error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the BadChannels and meg4 were only under discussion because the validator was given access to the directory contents, not the directory-as-file. Thus it makes sense to me to just back off from this micromanagement.

That said, I think it would be reasonable to add a reference to the original tree to the pseudofile, so that if we wanted to inspect contents to provide access to some metadata (e.g., OME-Zarr), it would be easy to do.

@effigies
Copy link
Collaborator Author

effigies commented Aug 7, 2024

The other is if the call to replacePsuedoFiles shouldn't live in src/schema/walk.ts. walk would need access to the schema, we could piggy back this into dsContext which walk already receives. replacePseudoFiles is fundamentally a way of tricking walk into creating contexts for things it normally wouldn't. Nice because it reduces clutter in src/validators/bids.ts.

We definitely could update walk as:

export async function* _walkFileTree(
  fileTree: FileTree,
  root: FileTree,
  issues: DatasetIssues,
  dsContext?: BIDSContextDataset,
): AsyncIterable<BIDSContext> {
  for (const file of fileTree.files) {
    yield new BIDSContext(root, file, issues, dsContext)
  }
  for (const dir of fileTree.directories) {
    if (checkPseudofile(dir)) {
      yield new BIDSContext(root, pseudoFile(dir), issues, dsContext)
    } else {
      yield* _walkFileTree(dir, root, issues, dsContext)
    }
  }
  loadTSV.cache.delete(fileTree.path)
}

Let me give that a try and see if it causes any interactions with the other PRs.

@effigies
Copy link
Collaborator Author

effigies commented Aug 7, 2024

@rwblair

I can do it with this diff:
diff --git a/bids-validator/src/files/surgery.ts b/bids-validator/src/files/surgery.ts
deleted file mode 100644
index ffdeb573..00000000
--- a/bids-validator/src/files/surgery.ts
+++ /dev/null
@@ -1,50 +0,0 @@
-import { BIDSFile, FileTree } from '../types/filetree.ts'
-import { readEntities } from '../schema/entities.ts'
-
-function* walk(dir: FileTree): Generator<BIDSFile> {
-  for (const file of dir.files) {
-    yield file
-  }
-  for (const subdir of dir.directories) {
-    yield* walk(subdir)
-  }
-}
-
-const nullFile = {
-  stream: new ReadableStream(),
-  text: async () => '',
-  readBytes: async (size: number, offset?: number) => new Uint8Array(),
-}
-
-function pseudoFile(dir: FileTree): BIDSFile {
-  return {
-    name: `${dir.name}/`,
-    path: `${dir.path}/`,
-    size: [...walk(dir)].reduce((acc, file) => acc + file.size, 0),
-    ignored: dir.ignored,
-    parent: dir.parent as FileTree,
-    viewed: false,
-    ...nullFile,
-  }
-}
-
-function substitute(dir: FileTree, isFileLike: (file: FileTree) => boolean): void {
-  const finalDirs = []
-  for (const subdir of dir.directories) {
-    if (isFileLike(subdir)) {
-      dir.files.push(pseudoFile(subdir))
-    } else {
-      finalDirs.push(subdir)
-      substitute(subdir, isFileLike)
-    }
-  }
-  dir.directories = finalDirs
-}
-
-export function replacePseudoFiles(dir: FileTree, extensions: string[]): void {
-  substitute(dir, (file) => {
-    const { suffix, extension, entities } = readEntities(file.name)
-    return (suffix !== '' && Object.keys(entities).length > 0 &&
-      extensions.includes(`${extension}/`))
-  })
-}
diff --git a/bids-validator/src/schema/context.ts b/bids-validator/src/schema/context.ts
index 8558f424..ee07b37b 100644
--- a/bids-validator/src/schema/context.ts
+++ b/bids-validator/src/schema/context.ts
@@ -6,7 +6,7 @@ import {
   ContextNiftiHeader,
   ContextSubject,
 } from '../types/context.ts'
-import { GenericSchema } from '../types/schema.ts'
+import { Schema } from '../types/schema.ts'
 import { BIDSFile, FileTree } from '../types/filetree.ts'
 import { ColumnsMap } from '../types/columns.ts'
 import { readEntities } from './entities.ts'
@@ -30,9 +30,12 @@ export class BIDSContextDataset implements ContextDataset {
   issues: DatasetIssues
   sidecarKeyValidated: Set<string>
   options?: ValidatorOptions
+  schema: Schema
+  pseudofileExtensions: Set<string>
 
   constructor(
     options?: ValidatorOptions,
+    schema?: Schema,
     tree?: FileTree,
     description?: Record<string, unknown>,
     ignored?: BIDSFile[],
@@ -40,6 +43,7 @@ export class BIDSContextDataset implements ContextDataset {
     modalities?: string[],
     issues?: DatasetIssues,
   ) {
+    this.schema = schema || {} as unknown as Schema
     this.dataset_description = description || {}
     this.tree = tree || new FileTree('/unknown', 'unknown')
     this.ignored = ignored || []
@@ -50,6 +54,11 @@ export class BIDSContextDataset implements ContextDataset {
       this.options = options
     }
     this.issues = issues || new DatasetIssues()
+    this.pseudofileExtensions = new Set<string>(
+      Object.values(this.schema?.objects?.extensions)
+        ?.map((ext) => ext.value)
+        ?.filter((ext) => ext.endsWith('/')),
+    )
   }
 
   get dataset_description(): Record<string, unknown> {
@@ -63,6 +72,15 @@ export class BIDSContextDataset implements ContextDataset {
         : 'raw'
     }
   }
+
+  isPseudoFile(file: FileTree): boolean {
+    const { suffix, extension, entities } = readEntities(file.name)
+    return (
+      suffix !== '' &&
+      Object.keys(entities).length > 0 &&
+      this.pseudofileExtensions.has(`${extension}/`)
+    )
+  }
 }
 
 export class BIDSContextDatasetSubjects implements ContextDatasetSubjects {
@@ -82,7 +100,6 @@ export class BIDSContextDatasetSubjects implements ContextDatasetSubjects {
 }
 
 export class BIDSContext implements Context {
-  schema?: GenericSchema
   dataset: BIDSContextDataset
   subject: ContextSubject
   // path: string  <- getter
@@ -116,7 +133,7 @@ export class BIDSContext implements Context {
     this.suffix = bidsEntities.suffix
     this.extension = bidsEntities.extension
     this.entities = bidsEntities.entities
-    this.dataset = dsContext ? dsContext : new BIDSContextDataset(undefined, fileTree)
+    this.dataset = dsContext ? dsContext : new BIDSContextDataset(undefined, undefined, fileTree)
     this.subject = {} as ContextSubject
     this.datatype = ''
     this.modality = ''
@@ -127,6 +144,10 @@ export class BIDSContext implements Context {
     this.associations = {} as ContextAssociations
   }
 
+  get schema(): Schema {
+    return this.dataset.schema
+  }
+
   get size(): number {
     return this.file.size
   }
diff --git a/bids-validator/src/schema/walk.ts b/bids-validator/src/schema/walk.ts
index caebf99d..5bfa4d3f 100644
--- a/bids-validator/src/schema/walk.ts
+++ b/bids-validator/src/schema/walk.ts
@@ -1,8 +1,35 @@
 import { BIDSContext, BIDSContextDataset } from './context.ts'
-import { FileTree } from '../types/filetree.ts'
+import { BIDSFile, FileTree } from '../types/filetree.ts'
 import { DatasetIssues } from '../issues/datasetIssues.ts'
 import { loadTSV } from '../files/tsv.ts'
 
+function* quickWalk(dir: FileTree): Generator<BIDSFile> {
+  for (const file of dir.files) {
+    yield file
+  }
+  for (const subdir of dir.directories) {
+    yield* quickWalk(subdir)
+  }
+}
+
+const nullFile = {
+  stream: new ReadableStream(),
+  text: async () => '',
+  readBytes: async (size: number, offset?: number) => new Uint8Array(),
+}
+
+function pseudoFile(dir: FileTree): BIDSFile {
+  return {
+    name: `${dir.name}/`,
+    path: `${dir.path}/`,
+    size: [...quickWalk(dir)].reduce((acc, file) => acc + file.size, 0),
+    ignored: dir.ignored,
+    parent: dir.parent as FileTree,
+    viewed: false,
+    ...nullFile,
+  }
+}
+
 /** Recursive algorithm for visiting each file in the dataset, creating a context */
 export async function* _walkFileTree(
   fileTree: FileTree,
@@ -12,7 +39,11 @@ export async function* _walkFileTree(
     yield new BIDSContext(file, dsContext)
   }
   for (const dir of fileTree.directories) {
-    yield* _walkFileTree(dir, dsContext)
+    if (dsContext.isPseudoFile(dir)) {
+      yield new BIDSContext(pseudoFile(dir), dsContext)
+    } else {
+      yield* _walkFileTree(dir, dsContext)
+    }
   }
   loadTSV.cache.delete(fileTree.path)
 }
diff --git a/bids-validator/src/types/context.ts b/bids-validator/src/types/context.ts
index 0db8a792..6c4d7991 100644
--- a/bids-validator/src/types/context.ts
+++ b/bids-validator/src/types/context.ts
@@ -1,4 +1,4 @@
-import { GenericSchema } from './schema.ts'
+import { Schema } from './schema.ts'
 import { ValidatorOptions } from '../setup/options.ts'
 import { FileTree } from '../types/filetree.ts'
 
@@ -95,7 +95,7 @@ export interface ContextNiftiHeader {
   sform_code: number
 }
 export interface Context {
-  schema?: GenericSchema
+  schema: Schema
   dataset: ContextDataset
   subject: ContextSubject
   path: string
diff --git a/bids-validator/src/types/schema.ts b/bids-validator/src/types/schema.ts
index b8a4f86f..9cc9414f 100644
--- a/bids-validator/src/types/schema.ts
+++ b/bids-validator/src/types/schema.ts
@@ -12,10 +12,18 @@ export interface Entity {
   format: string
 }
 
+export interface Value {
+  value: string
+}
+
 export interface SchemaObjects {
+  datatypes: Record<string, Value>
+  enums: Record<string, Value>
+  entities: Record<string, Entity>
+  extensions: Record<string, Value>
   files: Record<string, unknown>
   formats: Record<string, Format>
-  entities: Record<string, Entity>
+  suffixes: Record<string, Value>
 }
 
 export interface SchemaRules {
diff --git a/bids-validator/src/validators/bids.ts b/bids-validator/src/validators/bids.ts
index b39b6bd5..306777c5 100644
--- a/bids-validator/src/validators/bids.ts
+++ b/bids-validator/src/validators/bids.ts
@@ -17,7 +17,6 @@ import { sidecarWithoutDatafile, unusedStimulus } from './internal/unusedFile.ts
 import { BIDSContext, BIDSContextDataset } from '../schema/context.ts'
 import { parseOptions } from '../setup/options.ts'
 import { hedValidate } from './hed.ts'
-import { replacePseudoFiles } from '../files/surgery.ts'
 
 /**
  * Ordering of checks to apply
@@ -46,13 +45,6 @@ export async function validate(
   const schema = await loadSchema(options.schema)
   summary.schemaVersion = schema.schema_version
 
-  // @ts-expect-error
-  const dirExtensions = Object.values(schema.objects.extensions)
-    // @ts-expect-error
-    .map((ext) => ext.value)
-    .filter((ext) => ext.endsWith('/'))
-  replacePseudoFiles(fileTree, dirExtensions)
-
   /* There should be a dataset_description in root, this will tell us if we
    * are dealing with a derivative dataset
    */
@@ -60,7 +52,7 @@ export async function validate(
     (file: BIDSFile) => file.name === 'dataset_description.json',
   )
 
-  const dsContext = new BIDSContextDataset(options, fileTree)
+  const dsContext = new BIDSContextDataset(options, schema, fileTree)
   if (ddFile) {
     dsContext.dataset_description = await loadJSON(ddFile).catch((error) => {
       dsContext.issues.addNonSchemaIssue(error.key, [ddFile])

At this point it would be easiest to do it on top of #2066. LMK if you want me to go ahead.

@effigies
Copy link
Collaborator Author

effigies commented Aug 7, 2024

And if we add the BIDSContextDataset.schema attribute in #2066, that would keep this PR about as small as it was. Let me go ahead and push, and we can revert if needed.

@effigies
Copy link
Collaborator Author

effigies commented Aug 7, 2024

Rebased on #2066. e08ab56 is the essential bit here.

@effigies effigies added the schema label Aug 7, 2024
@effigies effigies force-pushed the feat/pseudofiles branch 2 times, most recently from c553477 to 462cfb1 Compare August 7, 2024 18:31
@rwblair rwblair merged commit a4e1a09 into bids-standard:master Aug 7, 2024
16 checks passed
@effigies effigies deleted the feat/pseudofiles branch August 7, 2024 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants