Skip to content

HDF5 data formatter #179

Merged
AdriaanRol merged 65 commits into
masterfrom
data_structures
Sep 6, 2016
Merged

HDF5 data formatter #179
AdriaanRol merged 65 commits into
masterfrom
data_structures

Conversation

@AdriaanRol

@AdriaanRol AdriaanRol commented May 16, 2016

Copy link
Copy Markdown
Contributor

First version of the HDF5 data formatter.
@alexcjohnson , @MerlinSmiles @giulioungaretti

Data saving and loading works, including incremenatal writes.
I have added tests for the HDF5 format which still fail (as it is not done yet).
As a sidenote, I am not 100% satisfied yet but I think it is a good start.

I have added a notebook that shows examples and runs the test suite for just the hdf5 format.

  • Saving from the loop is still confusing to me, it runs fine in the notebook but when I use the same code in my test-suite it complains. I would like some help in understanding why it behaves differently.
  • I need to make the incremental write a bit more robust, It works when I tested it in the loop in the notebook but I made the test fail on purpose
  • I have not included support for the setpoint yet, the way the datset handles this is by refering a complete array, adding the key to the metadata seems and using that when extracting seems easiest, however it makes it critically dependent to the order in which arrays are extracted.
  • I hacked in a way of using the location formatter that I think is not pretty but was needed because of the way GNUPlot does it, this can be improved.

If the above issues have been resolved I think this can be merged, let me know what else should be on that list.
A screenshot of how the data is saved (using the 1D example)
screenshot 2016-05-16 23 43 16

While making this I ran into the following other issues, I think addressing these is out of the scope of this issue but definately part of #62

  • No metadata support yet, waiting for Metadata #107, will not do it in this PR
  • Not tested with multi-D and nested sweeps. As soon as there are good test-datasets for this I can implement this (in a separate PR), will also do a test using the loop with that.
  • I left a bunch of comments on top of the HDF5 formatter file for potential improvements
  • units are included in parameters but not in units (should be addressed in dataset)
  • I think it makes sense to make the dataset an ordered dict instead of a dict like structure
  • there is an assumption about datasets being preallocated with nan's in the datset, this breaks down when doing adaptive sweeps or preemptively aborting sweeps, the way I implemented this should be compatible if this changes.
  • the location formatter should be improved (both in the structure of folders/subfolders and in my opinion also cosmetically dashes create very long names etc)
  • there is no good way to search/find files based on a label when the exact location is unkown.

And there is a bunch more which I cannot think of now.

Added some formatting to the __repr__ of the dataset
Added force_write option (untested for GNUplot format)
Working test for writing a simple 1D file
Got a failing test for the read/loading of data
Loading from file (basic )
Saving of array metadata
Todo incremental write (does not work yet with a loop)
Also set_points reference in dataset is not saved correctly
(not robust yet)
@alexcjohnson

Copy link
Copy Markdown
Contributor

@AdriaanRol fantastic, I will take a look at this tomorrow!

💯 🌟 😍 for making tests that don't work yet!

@MerlinSmiles

MerlinSmiles commented May 16, 2016

Copy link
Copy Markdown
Contributor

@AdriaanRol great to see progress on this!

I have not included support for the setpoint yet, the way the datset handles this is by refering a complete array, adding the key to the metadata seems and using that when extracting seems easiest, however it makes it critically dependent to the order in which arrays are extracted.

Could you explain this? The setpoint array behaves like the measurement arrays, no?
Currently the arrays are saved with an unique array_id in the metadata I use that id to store all info on that array, also if it is a setpoint or measured array. Thus when the id is saved all info should come back easily.

I hacked in a way of using the location formatter that I think is not pretty but was needed because of the way GNUPlot does it, this can be improved.

Could you explain this too? What is the issue?

units are included in parameters but not in units (should be addressed in dataset)

Units not in units?
Unit is included in data_array in the metadata #107 though I'm still confused that there is only one units but no unit in parameters while there is name and names and label and labels

the location formatter should be improved (both in the structure of folders/subfolders and in my opinion also cosmetically dashes create very long names etc)

did you see #142 ?

there is no good way to search/find files based on a label when the exact location is unkown.

There is a wildcard version of io.list in #142 now

@AdriaanRol

Copy link
Copy Markdown
Contributor Author

@MerlinSmiles

Could you explain this? The setpoint array behaves like the measurement arrays, no?

Yes it does, however the problem is that the setpoint requires the entire array upon initialization, the consequence of this is that the order in which I extract the data matters. Additionally (and I don't know how this works) if the setpoint has it's own setpoint array, as I can imagine in the future for some form of nested loop this will also require a reference to this.
The problem of extracting can be solved by using an ordered dict for the dataset as then the order in which it can be constructed is uniquely defined. However I think it would make sense if the setpoint_array reference would not contain the array itself but only a reference (key) of the array_id. We can then let the DataSet class handle the redirecting without having to worry to have this information be complete at array initialization time.
I think that the structural solution to this problem is a bit beyond the scope of this PR, if I misunderstood the structure of the dataset or if there is a super simple alternate solution I would be very happy.

You are right, just storing the id is easy and something I will probably add, however just hacking it in there if I don't know yet how to extract it seems a bit pointless to me.

units are included in parameters but not in units (should be addressed in dataset)

Units not in units?

Just a typo, I meant units are not included in the dataset even though they are part of the parameter, I have included them in the formatter so that it should be forwards compatible.

I'm still confused that there is only one units but no unit in parameters

I guess this is a grammer thing, if you have a parameter, say a thermometer you ask what "units" it has, not what "unit" it has. It is confusing though as you would expect the singular form just like label, name etc. I think "unit" would just look plain weird.

did you see #142 ?

I did, however I find that loc_provider = FormatLocation(fmt='{date}/{time}_#{counter}_{name}_{label}') Is not really sufficient for my purposes. This is not an issue with your formatter ( I think it improves over the old one quite a bit) but rather about the implicit choices about what should and should not be a folder.
Below is a figure of what it looks now:
screenshot 2016-05-17 09 24 18

To achieve this I need the following code (lines 79 - 82 in hdf5_format.py)

location = data_set.location
self.filepath = io_manager.join(
    io_manager.base_location,
    data_set.location_provider(io_manager)+'/'+location+'.hdf5')

This seems a bit convoluted to me (but most importantly is something that should happen in the location provider). What I would like is to have a formatter that provides the following for me {date}/{time}_#{counter}_{name}_{label}/{time}_#{counter}_{name}_{label}.hdf5, this explicitly nests the hdf5 file within a folder of the same name.
This allows saving figures and other things within the same folder ( a pattern we commonly use) additionally it allows easily browing all the folders corresponding to a certain day as the filename is saved in the foldername. Then I don't like the "-" in the timestamp as it makes the text too wide (thus hiding the actually useful information) but that belongs to a different discussion.

The io.list or wildcard function seems to be what I am looking for, I should replace this in the test but I'll wait for #142 to be merged first.

@MerlinSmiles

Copy link
Copy Markdown
Contributor

Interesting with the setpoint arrays, now this also confuses me.

You are right, just storing the id is easy and something I will probably add, however just hacking it in there if I don't know yet how to extract it seems a bit pointless to me.

Just a typo, I meant units are not included in the dataset even though they are part of the parameter, I have included them in the formatter so that it should be forwards compatible.

The unit is part of the data_array, and I guess that's where it should belong. You can get it from within the dataset as array.units, also that is where you get the array_id the data_set is just the container.
Or I misunderstand your problem :)

I'm still confused that there is only one units but no unit in parameters

I guess this is a grammer thing, if you have a parameter, say a thermometer you ask what "units" it has, not what "unit" it has. It is confusing though as you would expect the singular form just like label, name etc. I think "unit" would just look plain weird.

But we use name for a single value parameter, and names for parameters with two or more values, I have no option of two different units for a parameter with 2 values.
I'd rather ask, what unit does that value have ^^

What I would like is to have a formatter that provides the following for me {date}/{time}#{counter}{name}{label}/{time}#{counter}{name}{label}.hdf5

Maybe this is easier?

location = data_set.location
filename = io_manager.join(location, location.split('/')[-1] + '.hdf5')

But it could also be useful to extend the FormatLocation to format a folder and a filename...

Then I don't like the "-" in the timestamp as it makes the text too wide

The FormatLocation in #142 allows you to provide fmt_date and fmt_time.

@AdriaanRol

Copy link
Copy Markdown
Contributor Author

@alexcjohnson do you have test cases/test datasets for more complex datasets?
I am talking about 2D scans (how does it work with setpoints) and multi-valued parameters?

But we use name for a single value parameter, and names for parameters with two or more values, I have no option of two different units for a parameter with 2 values.

I think the confusion comes from the fact that I have one column per name and am combining multiple parameters in one datagroup, the list of names then contains each name for every column (being either a parameter or multiple columns for a composite parameter (containing multiple parameters).

@AdriaanRol

AdriaanRol commented Jul 15, 2016

Copy link
Copy Markdown
Contributor Author

Based on the discussions today (Friday QCodes meeting) the (currently functional) hdf5 formatter needs to be rewritten.

The formatter will follow the following three rules

  1. One (hdf5) dataset /dataarray per parameter
  2. dataset (hdf5) shape relfects instrument operations ((mn,1) vs (m,n))
  3. Arrays contain ref to setpoints in metadata.

Additionally (not in this PR) direct access to the dataset from the main process will be required.

@giulioungaretti

Copy link
Copy Markdown
Contributor

I guess this is v 0.1 right ?

started as a style cleanup, can you find the bug it fixes?
Another fix due to switching to a nicer pattern...
The same symbol (s) had three meanings on a single line!
looks for a "close_file" method in the formatter, so this should
be extensible to other formatters we may write.
Comment thread qcodes/data/data_set.py
self.write()

if hasattr(self.formatter, 'close_file'):
self.formatter.close_file(self)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AdriaanRol re: closing the file at the end of a loop: the loop already calls DataSet.finalize() at the end, which seems like a pretty sensible place to close the file, before we go to the trouble of implementing a context manager. Seem reasonable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, however do you think it reasonable to create an issue for this once this is merged? It is mostly important when interrupting loops (intended or due to exception)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that is already taken care of, as finalize gets called in a try / finally block. Which is exactly how a context manager works under the hood, though I agree that it would be cleaner to implement it that way instead.

@alexcjohnson

Copy link
Copy Markdown
Contributor

I haven't looked through the tests yet, but I fixed a few bugs in untested parts of the code. Currently I get this for coverage on the new formatter:

data/hdf5_format.py    217 Statements     35 Missed    84% Covered
Missing:
60, 66, 189, 193, 220-241, 257, 283, 286-290, 305-314, 320-323, 327, 366, 378-379, 389

Which is a good start! But we will do ourselves a big favor if we can increase it.

if data_name not in data_set._h5_base_group.keys():
arr_group = data_set._h5_base_group.create_group(data_name)
else:
arr_group = data_set._h5_base_group[data_name]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AdriaanRol this is the only actual code change I had to make to get rid of all the other attributes stashed on the formatter. OK?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, looks good :)

@AdriaanRol

Copy link
Copy Markdown
Contributor Author

@alexcjohnson , updated todolist in comment above to include increasing coverage. Thanks for the changes. p.s. feel free to give it a go as I will not be able to get to it on a short (days) timescale and I think this should be merged sooner rather than later (the merge frequency has gone down quite a bit recently).

@peendebak

Copy link
Copy Markdown
Contributor

@AdriaanRol @alexcjohnson Not my call, but I would be happy with merging as well.

@AdriaanRol

Copy link
Copy Markdown
Contributor Author

All tests pass (locally), 100% coverage in hdf5_formatter.py as far as I can tell people are happy with this formatter.
@alexcjohnson , @giulioungaretti can we 💃

screenshot 2016-09-04 13 23 32

Comment thread qcodes/tests/test_hdf5formatter.py Outdated
# general unsupported type in dict
# reading metadata for dataset that does not have a metadata attribute
# unrecognized list type when reading in dict
# boolean string that is not True or False raised Value Error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AdriaanRol are these comments still useful?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No they are not, removing now

@alexcjohnson

Copy link
Copy Markdown
Contributor

@AdriaanRol thanks for completing the HDF5 tests. I added tests for compare_dictionaries, and some non-blocking notes and todos there. Take a quick look at those before proceeding, but I'm ready to 💃 👯 💃 !!

@alexcjohnson

Copy link
Copy Markdown
Contributor

@AdriaanRol I meant to point out earlier this change I made - I've seen this various places , and not just from you but other contributors as well. I suppose you get the type h5py._hl.group.Group from a repr of an object you're debugging with, but it's more robust to use the alias to this object at the top level of the package (h5py.Group) if one exists. You never know when the package will reorganize, particularly when you're referencing an underscored submodule, but anything they've exposed at the top level is unlikely to change - and will also be more meaningful to readers of the code.

This also applies to using qcodes, even from within drivers and tests (although I notice I haven't been consistent about that myself... but it would actually make the tests better if they used the top-level names when possible!). The objects people are generally expected to use directly are aliased in qcodes/__init__.py so, for example, you don't need from qcodes.instrument.visa import VisaInstrument, you can just do from qcodes import VisaInstrument.

That said, for linkages within the core of the package itself it's better to point to the item you want directly, as that can avoid circular imports. But it's only within the core that this applies, and the core is all tested as a unit so we'll catch it if a reorganization breaks anything.

@alexcjohnson

Copy link
Copy Markdown
Contributor

...and that last comment reminds me, we should include HDF5Format in the top-level __init__.py too!

@AdriaanRol AdriaanRol merged commit 797b004 into master Sep 6, 2016
@AdriaanRol AdriaanRol deleted the data_structures branch September 6, 2016 07:39
@giulioungaretti

Copy link
Copy Markdown
Contributor

Wait a sec, this merge never passed travis right ?

@AdriaanRol

Copy link
Copy Markdown
Contributor Author

@giulioungaretti , All tests that were part of the formatter passed on Travis and all tests passed locally. I think the final failure (most of the Travis runs did pass) has to do with the stochastic bugs on travis.

Just so you can sleep well, there is no untested code in the master.

@giulioungaretti

Copy link
Copy Markdown
Contributor

@AdriaanRol, cool! I did not check the code yet, but there's a new error with the dataset. May have not come from this either but just checking if it was something you knew :D

I will look deeper into it soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants