HDF5 data formatter by AdriaanRol · Pull Request #179 · microsoft/Qcodes

AdriaanRol · 2016-05-16T21:54:15Z

First version of the HDF5 data formatter.
@alexcjohnson , @MerlinSmiles @giulioungaretti

Data saving and loading work, including incremental writes.
I have added tests for the HDF5 format which still fail (as it is not done yet).
As a sidenote, I am not 100% satisfied yet but I think it is a good start.

I have added a notebook that shows examples and runs the test suite for just the hdf5 format.

Saving from the loop is still confusing to me: it runs fine in the notebook, but when I use the same code in my test suite it complains. I would like some help in understanding why it behaves differently.
I need to make the incremental write a bit more robust. It worked when I tested it in the loop in the notebook, but I made the test fail on purpose.
I have not included support for the setpoint yet. The way the dataset handles this is by referring to a complete array; adding the key to the metadata and using that when extracting seems easiest, however it makes it critically dependent on the order in which arrays are extracted.
I hacked in a way of using the location formatter that I think is not pretty but was needed because of the way GNUPlot does it, this can be improved.

If the above issues have been resolved I think this can be merged, let me know what else should be on that list.
A screenshot of how the data is saved (using the 1D example)

While making this I ran into the following other issues. I think addressing these is out of the scope of this issue but definitely part of #62.

No metadata support yet, waiting for Metadata #107, will not do it in this PR
Not tested with multi-D and nested sweeps. As soon as there are good test-datasets for this I can implement this (in a separate PR), will also do a test using the loop with that.
I left a bunch of comments on top of the HDF5 formatter file for potential improvements
units are included in parameters but not in units (should be addressed in dataset)
I think it makes sense to make the dataset an ordered dict instead of a dict like structure
there is an assumption about datasets being preallocated with NaNs in the dataset; this breaks down when doing adaptive sweeps or preemptively aborting sweeps. The way I implemented this should be compatible if this changes.
the location formatter should be improved (both in the structure of folders/subfolders and in my opinion also cosmetically dashes create very long names etc)
there is no good way to search/find files based on a label when the exact location is unknown.
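
To make the NaN-preallocation point concrete, here is a minimal sketch. It is not the formatter's code: `count_filled` and `new_points` are hypothetical helpers that treat an array as a plain Python list and assume points are filled in order, with NaN marking "not measured yet". Under those assumptions an incremental writer could use them to decide which points are new since the last save.

```python
import math


def count_filled(values):
    """Return the number of leading entries that are not NaN.

    Assumes unmeasured points are preallocated with NaN and that
    points are filled in order (both are assumptions of this sketch).
    """
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return len(values)


def new_points(values, already_saved):
    """Return the points measured since the last write."""
    return values[already_saved:count_filled(values)]


nan = float("nan")
sweep = [0.1, 0.4, 0.9, nan, nan]  # sweep aborted after three points
print(count_filled(sweep))   # 3
print(new_points(sweep, 1))  # [0.4, 0.9]
```

A genuinely measured NaN would be mistaken for an unfilled slot here, which is one reason tracking the written range explicitly is more robust than inferring it from the data.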

And there is a bunch more which I cannot think of now.

alexcjohnson · 2016-05-16T22:00:34Z

@AdriaanRol fantastic, I will take a look at this tomorrow!

💯 🌟 😍 for making tests that don't work yet!

MerlinSmiles · 2016-05-16T22:20:32Z

@AdriaanRol great to see progress on this!

I have not included support for the setpoint yet. The way the dataset handles this is by referring to a complete array; adding the key to the metadata and using that when extracting seems easiest, however it makes it critically dependent on the order in which arrays are extracted.

Could you explain this? The setpoint array behaves like the measurement arrays, no?
Currently the arrays are saved with a unique array_id in the metadata. I use that id to store all info on that array, including whether it is a setpoint or measured array. Thus when the id is saved, all info should come back easily.

I hacked in a way of using the location formatter that I think is not pretty but was needed because of the way GNUPlot does it, this can be improved.

Could you explain this too? What is the issue?

units are included in parameters but not in units (should be addressed in dataset)

Units not in units?
Unit is included in data_array in the metadata (#107), though I'm still confused that there is only one units but no unit in parameters, while there is name and names, and label and labels.

the location formatter should be improved (both in the structure of folders/subfolders and in my opinion also cosmetically dashes create very long names etc)

did you see #142 ?

there is no good way to search/find files based on a label when the exact location is unknown.

There is a wildcard version of io.list in #142 now

AdriaanRol · 2016-05-17T07:47:43Z

@MerlinSmiles

Could you explain this? The setpoint array behaves like the measurement arrays, no?

Yes it does. However, the problem is that the setpoint requires the entire array upon initialization; the consequence is that the order in which I extract the data matters. Additionally (and I don't know how this works), if the setpoint has its own setpoint array, as I can imagine in the future for some form of nested loop, this will also require a reference to it.
The problem of extracting can be solved by using an ordered dict for the dataset, as then the order in which it can be constructed is uniquely defined. However, I think it would make sense if the setpoint_array reference did not contain the array itself but only a reference (key), namely the array_id. We can then let the DataSet class handle the redirecting without having to worry about this information being complete at array initialization time.
I think that the structural solution to this problem is a bit beyond the scope of this PR. If I misunderstood the structure of the dataset, or if there is a super simple alternative solution, I would be very happy to hear it.

You are right, just storing the id is easy and something I will probably add, however just hacking it in there if I don't know yet how to extract it seems a bit pointless to me.
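
The idea of storing only the array_id and letting a later step do the redirecting can be sketched as follows. This is an illustration with plain dictionaries, not the DataSet or DataArray API; the `set_array_ids` key and the `link_setpoints` helper are made up for the example. Because linking happens in a second pass over everything that was loaded, the order in which arrays are read no longer matters.

```python
def link_setpoints(arrays):
    """Resolve setpoint ids into object references once all arrays are loaded.

    `arrays` maps array_id -> dict holding a 'set_array_ids' tuple
    (a hypothetical layout chosen for this sketch).
    """
    for array in arrays.values():
        array["set_arrays"] = tuple(
            arrays[set_id] for set_id in array["set_array_ids"]
        )
    return arrays


# The measured array is deliberately listed before its setpoint array
loaded = {
    "voltage": {"data": [1.0, 2.0, 3.0], "set_array_ids": ("gate_set",)},
    "gate_set": {"data": [0.0, 0.5, 1.0], "set_array_ids": ()},
}
link_setpoints(loaded)
print(loaded["voltage"]["set_arrays"][0] is loaded["gate_set"])  # True
```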

units are included in parameters but not in units (should be addressed in dataset)

Units not in units?

Just a typo, I meant units are not included in the dataset even though they are part of the parameter, I have included them in the formatter so that it should be forwards compatible.

I'm still confused that there is only one units but no unit in parameters

I guess this is a grammar thing: if you have a parameter, say a thermometer, you ask what "units" it has, not what "unit" it has. It is confusing though, as you would expect the singular form just like label, name etc. I think "unit" would just look plain weird.

did you see #142 ?

I did, however I find that loc_provider = FormatLocation(fmt='{date}/{time}_#{counter}_{name}_{label}') is not really sufficient for my purposes. This is not an issue with your formatter (I think it improves over the old one quite a bit) but rather about the implicit choices about what should and should not be a folder.
Below is a figure of what it looks like now:

To achieve this I need the following code (lines 79 - 82 in hdf5_format.py)

location = data_set.location
self.filepath = io_manager.join(
    io_manager.base_location,
    data_set.location_provider(io_manager)+'/'+location+'.hdf5')

This seems a bit convoluted to me (but most importantly, it is something that should happen in the location provider). What I would like is to have a formatter that provides the following for me {date}/{time}_#{counter}_{name}_{label}/{time}_#{counter}_{name}_{label}.hdf5, which explicitly nests the hdf5 file within a folder of the same name.
This allows saving figures and other things within the same folder (a pattern we commonly use). Additionally, it allows easily browsing all the folders corresponding to a certain day, as the filename is saved in the folder name. Then I don't like the "-" in the timestamp as it makes the text too wide (thus hiding the actually useful information), but that belongs to a different discussion.
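
The nesting described here can be sketched without touching the location provider. The helper name `nested_hdf5_path` and the example location string are hypothetical; the sketch only assumes that a location is a '/'-separated string whose last component should be reused as the file name.

```python
import posixpath


def nested_hdf5_path(location):
    """Return '<location>/<last component>.hdf5' for a '/'-separated location."""
    location = location.rstrip("/")
    folder_name = posixpath.basename(location)
    return posixpath.join(location, folder_name + ".hdf5")


# Made-up location in the '{date}/{time}_#{counter}_{name}_{label}' style
print(nested_hdf5_path("2016-05-17/09-38-27_#001_sweep_gate"))
# 2016-05-17/09-38-27_#001_sweep_gate/09-38-27_#001_sweep_gate.hdf5
```

Keeping this in a helper (or in the location provider itself) means figures and other files can be written next to the data file in the same folder.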

The io.list or wildcard function seems to be what I am looking for, I should replace this in the test but I'll wait for #142 to be merged first.

MerlinSmiles · 2016-05-17T09:38:27Z

Interesting with the setpoint arrays, now this also confuses me.

You are right, just storing the id is easy and something I will probably add, however just hacking it in there if I don't know yet how to extract it seems a bit pointless to me.

Just a typo, I meant units are not included in the dataset even though they are part of the parameter, I have included them in the formatter so that it should be forwards compatible.

The unit is part of the data_array, and I guess that's where it should belong. You can get it from within the dataset as array.units; that is also where you get the array_id. The data_set is just the container.
Or I misunderstand your problem :)

I'm still confused that there is only one units but no unit in parameters

I guess this is a grammar thing: if you have a parameter, say a thermometer, you ask what "units" it has, not what "unit" it has. It is confusing though, as you would expect the singular form just like label, name etc. I think "unit" would just look plain weird.

But we use name for a single value parameter, and names for parameters with two or more values, I have no option of two different units for a parameter with 2 values.
I'd rather ask, what unit does that value have ^^

What I would like is to have a formatter that provides the following for me {date}/{time}_#{counter}_{name}_{label}/{time}_#{counter}_{name}_{label}.hdf5

Maybe this is easier?

location = data_set.location
filename = io_manager.join(location, location.split('/')[-1] + '.hdf5')

But it could also be useful to extend the FormatLocation to format a folder and a filename...

Then I don't like the "-" in the timestamp as it makes the text too wide

The FormatLocation in #142 allows you to provide fmt_date and fmt_time.

AdriaanRol · 2016-05-18T15:44:08Z

@alexcjohnson do you have test cases/test datasets for more complex datasets?
I am talking about 2D scans (how does it work with setpoints) and multi-valued parameters?

But we use name for a single value parameter, and names for parameters with two or more values, I have no option of two different units for a parameter with 2 values.

I think the confusion comes from the fact that I have one column per name and am combining multiple parameters in one datagroup. The list of names then contains a name for every column, where a column is either a single parameter or one of the multiple columns of a composite parameter (one containing multiple parameters).

AdriaanRol · 2016-07-15T14:13:57Z

Based on the discussions today (Friday QCodes meeting) the (currently functional) hdf5 formatter needs to be rewritten.

The formatter will follow these three rules:

One (hdf5) dataset/dataarray per parameter
Dataset (hdf5) shape reflects instrument operations ((mn,1) vs (m,n))
Arrays contain a reference to their setpoints in metadata.
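
The three rules can be pictured with a small model of the resulting layout. This sketch uses plain dictionaries in place of HDF5 groups and datasets, so it makes no claim about the actual formatter or about h5py; `build_layout`, the array names, and the shapes are all made up for illustration.

```python
def build_layout(arrays):
    """Model an HDF5 group as a dict with one entry per parameter.

    Each entry records the array shape and, in its attributes, the ids
    of its setpoint arrays (ids only, never the arrays themselves).
    """
    return {
        array_id: {
            "shape": tuple(info["shape"]),
            "attrs": {"set_arrays": list(info["set_arrays"])},
        }
        for array_id, info in arrays.items()
    }


m, n = 3, 4  # outer and inner loop lengths (made-up values)
layout = build_layout({
    "outer_set": {"shape": (m,), "set_arrays": []},
    "inner_set": {"shape": (m, n), "set_arrays": ["outer_set"]},
    # Measured in a nested loop, so stored as (m, n) rather than (m*n, 1)
    "signal": {"shape": (m, n), "set_arrays": ["outer_set", "inner_set"]},
})
print(layout["signal"])
```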

Additionally (not in this PR) direct access to the dataset from the main process will be required.

giulioungaretti · 2016-07-21T09:19:45Z

I guess this is v 0.1 right ?

alexcjohnson · 2016-09-01T04:23:47Z

            self.write()
+
+            if hasattr(self.formatter, 'close_file'):
+                self.formatter.close_file(self)


@AdriaanRol re: closing the file at the end of a loop: the loop already calls DataSet.finalize() at the end, which seems like a pretty sensible place to close the file, before we go to the trouble of implementing a context manager. Seem reasonable?

I agree, however do you think it reasonable to create an issue for this once this is merged? It is mostly important when interrupting loops (intended or due to exception)

I believe that is already taken care of, as finalize gets called in a try / finally block. Which is exactly how a context manager works under the hood, though I agree that it would be cleaner to implement it that way instead.
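
The pattern under discussion, an optional close_file hook called from finalize, with finalize itself guarded by try / finally, can be sketched in isolation. The classes below are stand-ins written for this example (`ClosableFormatter`, `SketchDataSet`, `run_loop` are hypothetical names), not the real DataSet, Loop, or HDF5Format.

```python
class ClosableFormatter:
    """Hypothetical formatter that keeps a file open between writes."""

    def __init__(self):
        self.is_open = True

    def write(self, data_set):
        pass  # a real formatter would flush new points here

    def close_file(self, data_set):
        self.is_open = False


class SketchDataSet:
    def __init__(self, formatter):
        self.formatter = formatter

    def write(self):
        self.formatter.write(self)

    def finalize(self):
        self.write()
        # Only formatters that define close_file get the extra call
        if hasattr(self.formatter, "close_file"):
            self.formatter.close_file(self)


def run_loop(data_set, measure, n_points):
    """Run measure() n_points times; finalize even if it raises."""
    try:
        for i in range(n_points):
            measure(i)
    finally:
        data_set.finalize()


def failing_measure(i):
    if i == 2:
        raise RuntimeError("simulated instrument error")


formatter = ClosableFormatter()
try:
    run_loop(SketchDataSet(formatter), failing_measure, 5)
except RuntimeError:
    pass
print(formatter.is_open)  # False: the file was closed despite the error
```

A context manager would wrap the same try / finally in `__enter__` and `__exit__`, so the behaviour on interruption is the same; the difference is mainly in how the calling code reads.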

alexcjohnson · 2016-09-01T04:31:43Z

I haven't looked through the tests yet, but I fixed a few bugs in untested parts of the code. Currently I get this for coverage on the new formatter:

data/hdf5_format.py    217 Statements     35 Missed    84% Covered
Missing:
60, 66, 189, 193, 220-241, 257, 283, 286-290, 305-314, 320-323, 327, 366, 378-379, 389

Which is a good start! But we will do ourselves a big favor if we can increase it.
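
For reference, the 84% figure follows directly from the two counts in the report; the variable names here are only for the example.

```python
statements = 217
missed = 35

covered_fraction = (statements - missed) / statements
print(f"{covered_fraction:.1%}")      # 83.9%
print(round(covered_fraction * 100))  # 84
```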

alexcjohnson · 2016-09-01T04:34:17Z

+        if data_name not in data_set._h5_base_group.keys():
+            arr_group = data_set._h5_base_group.create_group(data_name)
+        else:
+            arr_group = data_set._h5_base_group[data_name]


@AdriaanRol this is the only actual code change I had to make to get rid of all the other attributes stashed on the formatter. OK?

yes, looks good :)

AdriaanRol · 2016-09-01T07:50:34Z

@alexcjohnson , updated todolist in comment above to include increasing coverage. Thanks for the changes. p.s. feel free to give it a go as I will not be able to get to it on a short (days) timescale and I think this should be merged sooner rather than later (the merge frequency has gone down quite a bit recently).

peendebak · 2016-09-02T08:59:38Z

@AdriaanRol @alexcjohnson Not my call, but I would be happy with merging as well.

AdriaanRol · 2016-09-04T11:27:35Z

All tests pass (locally), with 100% coverage in hdf5_formatter.py. As far as I can tell, people are happy with this formatter.
@alexcjohnson , @giulioungaretti can we 💃

alexcjohnson · 2016-09-06T03:30:30Z

+# general unsupported type in dict
+# reading metadata for dataset that does not have a metadata attribute
+# unrecognized list type when reading in dict
+# boolean string that is not True or False raised Value Error


@AdriaanRol are these comments still useful?

No they are not, removing now

alexcjohnson · 2016-09-06T03:52:44Z

@AdriaanRol thanks for completing the HDF5 tests. I added tests for compare_dictionaries, and some non-blocking notes and todos there. Take a quick look at those before proceeding, but I'm ready to 💃 👯 💃 !!

alexcjohnson · 2016-09-06T04:03:02Z

@AdriaanRol I meant to point out earlier this change I made. I've seen this in various places, and not just from you but from other contributors as well. I suppose you get the type h5py._hl.group.Group from a repr of an object you're debugging with, but it's more robust to use the alias to this object at the top level of the package (h5py.Group) if one exists. You never know when the package will reorganize, particularly when you're referencing an underscored submodule, but anything they've exposed at the top level is unlikely to change, and will also be more meaningful to readers of the code.

This also applies to using qcodes, even from within drivers and tests (although I notice I haven't been consistent about that myself... but it would actually make the tests better if they used the top-level names when possible!). The objects people are generally expected to use directly are aliased in qcodes/__init__.py so, for example, you don't need from qcodes.instrument.visa import VisaInstrument, you can just do from qcodes import VisaInstrument.

That said, for linkages within the core of the package itself it's better to point to the item you want directly, as that can avoid circular imports. But it's only within the core that this applies, and the core is all tested as a unit so we'll catch it if a reorganization breaks anything.
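
The same point can be checked with a standard-library package, used here only as a stand-in so the example runs without h5py or qcodes installed: a class re-exported at the top level is the same object as the one in its defining submodule, so code can reference the public name.

```python
import json
import json.decoder

# The top-level alias and the submodule attribute are the same class ...
print(json.JSONDecoder is json.decoder.JSONDecoder)  # True


def is_decoder(obj):
    # ... so type checks can be written against the public, top-level name
    return isinstance(obj, json.JSONDecoder)


print(is_decoder(json.JSONDecoder()))  # True
```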

alexcjohnson · 2016-09-06T04:04:57Z

...and that last comment reminds me, we should include HDF5Format in the top-level __init__.py too!

giulioungaretti · 2016-09-12T16:43:20Z

Wait a sec, this merge never passed travis right ?

AdriaanRol · 2016-09-12T19:25:39Z

@giulioungaretti , All tests that were part of the formatter passed on Travis and all tests passed locally. I think the final failure (most of the Travis runs did pass) has to do with the stochastic bugs on travis.

Just so you can sleep well, there is no untested code in the master.

giulioungaretti · 2016-09-12T19:31:15Z

@AdriaanRol, cool! I did not check the code yet, but there's a new error with the dataset. May have not come from this either but just checking if it was something you knew :D

I will look deeper into it soon!

AdriaanRol added 15 commits April 28, 2016 20:06

Added a mock parabola with noise to mock instruments

e7a8ec1

Test meta instrument

c07644d

Merge branch 'master' of https://github.com/qdev-dk/Qcodes into data_…

d90083d

…structures

Merge branch 'master' of https://github.com/qdev-dk/Qcodes into data_…

cf24a6d

…structures

Merge branch 'data_structures' of https://github.com/qdev-dk/Qcodes i…

8615e8f

…nto data_structures

Added some commits + an __init__.py

7e73cdf

Initial hdf5 saving (primitive)

1454a01

Intermediate commit

fc3a742

Added some formatting to the __repr__ of the dataset Added force_write option (untested for GNUplot format)

Start of tests

44eece8

Working test for writing a simple 1D file Got a failing test for the read/loading of data

Datasaving updates

93f8f3c

Loading from file (basic ) Saving of array metadata

Working read function

1101eae

Full read/write funcitonal

dd1e8c2

Todo incremental write (does not work yet with a loop) Also set_points reference in dataset is not saved correctly

Added incremental write

e80df0a

(not robust yet)

Added a test for writing in a loop

c023de8

Cleaned up the notebook a bit

d547fa6

AdriaanRol added the help wanted label May 16, 2016

This was referenced May 21, 2016

DataSet array_id names #184

Closed

Metadata #107

Merged

add 2D mock dataset for testing and development #193

Merged

AdriaanRol added 4 commits May 27, 2016 11:32

Merge branch 'master' of https://github.com/qdev-dk/Qcodes into data_…

18335bb

…structures

Merge branch 'master' of https://github.com/qdev-dk/Qcodes into data_…

5329fe6

…structures

Merge branch 'master' of https://github.com/qdev-dk/Qcodes into data_…

7756097

…structures

Merge branch 'master' of https://github.com/qdev-dk/Qcodes into data_…

7d74bef

…structures

alexcjohnson added 4 commits August 31, 2016 23:30

fix: bug in HDF5Format._create_data_arrays_grp

73539f1

started as a style cleanup, can you find the bug it fixes?

fix: Bug hiding in _encode_to_utf8

0652413

Another fix due to switching to a nicer pattern...

fix: change confusing line in _encode_to_utf8

849e46c

The same symbol (s) had three meanings on a single line!

feat: Use DataSet.finalize to close HDF5 files

8f8670d

looks for a "close_file" method in the formatter, so this should be extensible to other formatters we may write.

alexcjohnson reviewed Sep 1, 2016

AdriaanRol added 5 commits September 3, 2016 16:31

added a test for properly closing a datafile

d8961f5

Added tests for flushing, closing and finalizing a datafile

811a03b

Added some docstrings

afcc5aa

Working on increasing coverage

26ca76f

Removed a function that was not used (from older version)

100% coverage!!!

5dabc6b

alexcjohnson added 2 commits September 5, 2016 23:27

test: utils.helpers.compare_dictionaries

08f3dfa

style: pep8

95ed3ac

alexcjohnson reviewed Sep 6, 2016

Final minor fixes before merge

1d3cf22

AdriaanRol merged commit 797b004 into master Sep 6, 2016

AdriaanRol deleted the data_structures branch September 6, 2016 07:39
