This is a .KD file with data for 3 different cuvettes
Added samples_cell_header to identify SAMPLES_CELL_x text for decoding data from multi-cuvette samples.
Added handle_samples and parse_samples functions
Initial implementation of samples_cell. Before debugging.
I tested it on both multi-cuvette and single-cuvette files and I don't get any errors.
Summary
Added multi-cuvette support by creating a new samples_cell attribute:
1. Added samples_cell_header class attribute (lines 58-61):
- Header: RegName in UTF-16-LE
- Spacing: 18 bytes from header to first cell name
2. Added _handle_samples_cell method (lines 162-183):
- Finds the RegName header once
- Iterates through sequential 30-byte entries (2-byte prefix + 28-byte cell name)
- Stops when it encounters a non-SAMPLES_CELL string
- Returns a pd.Series with cell identifiers
3. Added _parse_samples_cell method (lines 217-223):
- Reads a fixed 28-byte UTF-16-LE encoded cell name
- Returns None on decode errors
4. Updated parse_kd to return samples_cell and updated the __init__ assignment
Results:
- Multi-cuvette file: Returns 357 entries with SAMPLES_CELL_1, SAMPLES_CELL_2, SAMPLES_CELL_3 (119 each)
- Single-cuvette files: Returns all entries as SAMPLES_CELL_1
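The byte layout described in the summary can be sketched roughly as follows. This is an illustration only, not the code in import_kd.py: the function names `read_cell_names` and `parse_cell_name` are hypothetical, the 18-byte spacing is assumed to be measured from the end of the RegName header, and the sketch returns a plain list where the PR returns a pd.Series.

```python
# Illustrative sketch only; names and offsets are assumptions, not the PR's code.
HEADER = "RegName".encode("utf-16-le")  # header named in the summary
SPACING = 18      # bytes from header to first cell name (assumed: from the END of the header)
NAME_BYTES = 28   # "SAMPLES_CELL_1" is 14 characters at 2 bytes each
ENTRY_BYTES = 30  # 2-byte prefix + 28-byte cell name


def parse_cell_name(data: bytes, pos: int):
    """Decode one fixed-width cell name; return None if it cannot be decoded."""
    chunk = data[pos:pos + NAME_BYTES]
    if len(chunk) < NAME_BYTES:
        return None  # ran off the end of the file
    try:
        return chunk.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


def read_cell_names(data: bytes) -> list:
    """Find the header once, then walk sequential entries until a non-cell string."""
    start = data.find(HEADER)
    if start == -1:
        return []
    pos = start + len(HEADER) + SPACING
    names = []
    while True:
        name = parse_cell_name(data, pos)
        if name is None or not name.startswith("SAMPLES_CELL"):
            break  # first entry that is not a SAMPLES_CELL_x name ends the block
        names.append(name)
        pos += ENTRY_BYTES
    return names
```

Stepping by the full 30-byte entry width keeps each read aligned on the start of a name, so the 2-byte prefix never needs to be interpreted.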
I'm adding a file called "1229 PDC PYRUVATE 100MM-8KD" which I renamed to "multi_cuvette_test_data_corrupted.KD." It's an example of a file corruption where the final data point from the previously-saved file gets appended to the start of this file. I think this kind of corruption can be detected and fixed relatively easily.
Setting up tests for fixing a bug with a corrupted .KD file
I fixed the bug by adding validation for time values in the KD file parser. The changes:
1. Added warnings import (import_kd.py:11) to issue warnings about corrupted files.
2. Added _validate_and_fix_data() method (import_kd.py:161-246) that:
- Builds a working DataFrame by transposing the spectra and adding the sample cell column
- Uses pandas groupby to process each cuvette's data separately
- Detects non-increasing time values by finding "reset points" where time decreases
- Marks all preceding timepoints with values >= the reset time as invalid
- Issues two warnings: one about potential corruption and one about removed timepoints
- Returns cleaned spectra, spectra_times, and samples_cell with corrupt data removed
3. Integrated validation into parse_kd() (import_kd.py:132-135) so it runs automatically when parsing any .KD file.
4. Created tests (tests/test_import_kd.py) to verify:
- Valid files produce no corruption warnings
- Corrupted files produce warnings and have bad data removed
The fix correctly identifies and removes the corrupt timepoint (730.3 seconds) from each of the 3 cuvettes in the corrupted test file.
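For readers who want the detection rule spelled out, here is a minimal sketch of the "reset point" logic. It is not the `_validate_and_fix_data()` method itself: it groups by cuvette with a plain dictionary rather than pandas groupby so that it runs without dependencies, it returns the indices to keep instead of cleaned DataFrames, and the names `find_invalid` and `kept_indices` are hypothetical.

```python
# Illustrative sketch only; plain-Python stand-in for the pandas-based method.
import warnings
from collections import defaultdict


def find_invalid(times):
    """Return positions that precede a reset and have a time >= the reset time."""
    invalid = set()
    for i in range(1, len(times)):
        if times[i] < times[i - 1]:  # reset point: time went backwards
            reset_time = times[i]
            invalid.update(j for j in range(i) if times[j] >= reset_time)
    return invalid


def kept_indices(times, cells):
    """Check each cuvette separately and return the row indices worth keeping."""
    rows_by_cell = defaultdict(list)
    for row, cell in enumerate(cells):
        rows_by_cell[cell].append(row)

    bad = set()
    for rows in rows_by_cell.values():
        cell_times = [times[r] for r in rows]
        bad.update(rows[k] for k in find_invalid(cell_times))

    if bad:
        # Two warnings, mirroring the behavior described in the list above.
        warnings.warn("Non-increasing time values found; file may be corrupted.")
        warnings.warn(f"Removed {len(bad)} timepoint(s) with invalid times.")
    return [row for row in range(len(times)) if row not in bad]
```

With a stray 730.3 s point ahead of each cuvette's 0.0 s start, the reset is detected at the 0.0 s row and only the stray point is dropped, because it is the only earlier value at or above the reset time.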
I found that some of my .KD files were corrupted, and made some additional edits to this branch to identify and fix problems associated with this.
I made some edits to the import_kd.py file to enable it to parse data about the cuvette ID (called SAMPLES_CELL_1, 2, 3, etc. in the .KD file). I also included an example .KD file with 3 cuvettes.
I tried to be as conservative as possible to avoid disrupting any downstream components. I added samples_cell as a property of the KDFile object and didn't change the exported spectra dataframe, although in the future it might be useful to include the samples_cell info in that dataframe.