Reading - datasharing articles, continued

Empirical Study of Data Sharing by Authors Publishing in PLoS Journals

DOI: https://doi.org/10.1371/journal.pone.0007078 Caroline J. Savage, Andrew J. Vickers

We received only one of ten raw data sets requested. This suggests that journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

Enough said, really.

Reading - datasharing articles

Here are four articles I’ve been reading around data sharing, for my covid-19 article.

Data sharing and reanalysis of randomized controlled trials in leading biomedical journals with a full data sharing policy: survey of studies published in The BMJ and PLOS Medicine

Florian Naudet, Charlotte Sakarovitch, Perrine Janiaud, Ioana Cristea, visiting scholar, Daniele Fanelli, David Moher, John P A Ioannidis DOI: https://doi.org/10.1136/bmj.k400

  • When the study could reproduce a result, conclusions were generally the same, which seemed nice, until they point out in the discussion that they usually start from pre-processed data, whereas if you start from 100% raw data, the results might be different.
    • Example: my covid-19 data sharing study is all interviews. I “code” my interviews - saying something like “this sentence represents a desire to balance ethical datasharing with patient privacy.” - and I can publish my codes safely, but not the raw interview transcripts. People reproducing my work might reproduce similar results from the codes, but might produce different results if they had the raw transcripts and had to do the coding themselves. (That’s why I get a second coder to review my codes and produce an inter-coder reliability rating.)
  • DMPs are good, but may not be sufficient to facilitate actual data sharing (rather than plans to share…) - people don’t necessarily follow up on their DMP plans.
  • Reproducing analyses can be challenging, as data are rarely homogeneous and often require cleaning, standardisation, and careful analysis.

Open-access policy and data-sharing practice in UK academia

Yimei Zhu,DOI: https://doi.org/10.1177/0165551518823174

  • Only one in five UK respondents have shared data (given that medical and social scientists have good reasons to keep data private in some cases - and possibly other fields do too? - that doesn’t worry me so much), but interestingly, more people have re-used data than have shared it…
  • Interestingly, I’d have expected younger researchers to have been more likely to share data, but fewer have. Maybe it’s because more senior researchers are more likely to have had the chance to do so since they’ve been working longer?
  • I think it’s fair to say here that experience making publications OA or using open data may make people more likely to make their work open later on. That is, familiarity with some open practices can result in expanding to other open research practices?

Factors influencing the data sharing behavior of researchers in sociology and political science

DOI: https://doi.org/10.1108/JD-09-2017-0126 Wolfgang Zenk-Möltgen, Esra Akdeniz, Alexia Katsanidou, Verena Naßhoven, Ebru Balaban

A classic case of an article about open research being paywalled 😵

So, this article was nice because it’s an approach to social sciences, which seems under-studied around openness compared to many other domains. Social studies often need to keep their data closed due to confidentiality concerns (this is very much the case with one of my current studies, for example - people won’t reveal vulnerable things to me if they think I’ll share them around the web!) - political science has less of this concern. They compare between the two domains a bit, which doesn’t really interest me too much - I’m interested in the broader strokes.

It re-confirms a lot of the same things the other data-sharing articles I’ve been reading have said, namely:

  • almost everyone thinks datasharing is good, but
  • effort of datasharing (preparing data to be shared) and risks (misinterpretation, risk of scoop, lack of incentive) often outweigh an individual’s reasons to share data.
  • It takes time and effort to prepare data for sharing. Data availability statements are often not true - that is, “you can access data via… x mechanism” - and x often doesn’t actually work when people try to get the data.
  • where people have infrastructure, they’re more likely to share. Biomedical domains in particular tend to have infrastructure for data sharing; the humanities and social sciences often don’t.
  • authors who share at some point tend to share things again later.
  • making it easy to share is important if you want ppl to do it.

Patient privacy in the COVID-19 era: Data access, transparency, rights, regulation and the case for retaining the status quo

DOI: https://doi.org/10.1177%2F1833358320966689 Joan Henderson

Has useful notes on balancing privacy vs transparency.

Sudden and rapid changes to services […] were introduced well ahead of any considered legal protections for patient privacy and governance of these processes.

Is there an argument that the public good requires privacy to be overridden? This article presents arguments and ultimately concludes that the benefits probably do not outweigh the downsides. Phew. A large part of this is also the fact that commercial interests may benefit from this without giving back in any way.

Reading - semistructured data models - a whole thesis!

“Designing end-user information environments built on semistructured data models” by Quan, Dennis A.

This is a thesis, 214 pages. It’s OCRed so I can’t easily post favourite snippets. Sob. 😭 So here are some interesting bits:

  • Chapter 2.2, psychological problems in data retrieval. Talks about the need to sort information clearly when there is too much. Classic “big data is a problem”. Can reference Bush’s memex idea from 1945 - even 75 years later, we haven’t solved big data problems.
  • Chapter 8,
    • “Finally, we show with a user study that multiple categorisation - allowing resources to be in more than one collection at once - is superior in many respects to the hierarchical folder schemes in popular use today.”
    • It asserts that few modern interfaces make it easy to assign files to multiple categories. This is too old to cite anymore (an 18-year-old thesis), but maybe look into more recent papers on tagging (i.e. multiple categorisation…)?
  • Chapter 15, page 185 - the problem of bioinformatics’ scattered identifiers and lack of integration is identified - a good place to cite. Also suggests unified identifiers (lol).

This also links me to Lansdale’s 1988 “The psychology of personal information management” - good for citing that items need multiple categories.

Recruitment - covid study - SO TOUGH

I have spent so much effort trying to recruit people, but the “not allowed to approach people directly” directive I was given by ethics basically means I am allowed to do the study so long as I don’t ask anyone to sign up? I’m just firing off impersonal emails to people knowing they won’t respond. It’s very disheartening, and I keep on having to remind myself that I’m not being slack or naive in my attempts to contact people. I do know how to ask for things in such a way that I get better uptake - which helps me remember why approaching people is deemed coercive and therefore forbidden.

I don’t think this is unethical research, and if I’m not allowed to ask people to participate what is the point?

I am very sick of shouting ineffectually at the ether.

Reading - mental models

Ileana Maria Greca & Marco Antonio Moreira (2000) Mental models, conceptual models, and modelling, International Journal of Science Education, 22:1, 1-11, DOI: 10.1080/095006900289976

This paper provides a summary of what the meaning of the term mental model is and the fact that it can mean many things depending on the paper and discipline.

It also cites a nice quote from Don Norman (an HCI expert) in Gentner and Stevens, 1983 - a book that is heinously expensive, ranging from £30 for an ebook to £200 for a physical copy. 😤 Luckily the preface and cited pages seem to be free on Google Books preview (for now, at least). Page 7-9 of Norman’s introduction has several rules of mental models and a definition - may be useful to cite.

Some apologies as I’m typing with a crappy butterfly keyboard that likes to add in extra full stops (aka periods), and also to add in extra “r”s. I’m not intentionally talking like a pirate unless it’s September 19th, and I’m a little frightened to say I did NOT need to google to find out what date this important event was.

Back to the original paper, I think I’m reading here that mental models usually apply to interactive systems, or events. Biology certainly runs over time, but the data models I’m looking at generally don’t (I’m not covering, say, workflows!)

I’ve highlighted this sentence:

“for Johnson-Laird, mental models are working models of situations and events in/of the world, and that through their mental manipulation we are capable of understanding and explaining phenomena and are able to act accordingly to the resulting predictions.”

They also say “First of all, we must emphasise that as is common in science education, these terms are not used in a univocal way; on the contrary, behind their vagueness there is also a diversifications of meanings. Particularly in the case of mental models, this diversification has led us to wonder whether they are not actually ‘mental muddles’.”

Long time no update

Long time, no update, due to a mix of things happening in the background, some of which weren’t the type of thing one can put on a public blog, and a massive blow to my self esteem and self confidence, and COVID-19 meaning it’s been a year since I’ve seen a friend or family member face to face, or gone to a restaurant, or gone to a conference, or gone on bike rides or hikes or days out or had breakfast at the cafe at the garden centre down the road.

Is this a bit of a whinge? Yep. Anyway, the mental health effects of an ongoing pandemic with most of my normal coping mechanisms gone is um, not fun, and is certainly affecting my productivity. But also, the stuff I can’t talk about is a thing.

I did pass my year two review and didn’t have to go spend the day in bed crying, so there’s that.

Reading - Qualitative data analysis and Cohen's Kappa

Recent work has largely been about coding (qualitatively speaking, not computerly speaking). Both my own data and that of others. Since I’ve been second coder on a colleague’s work, I spent some time reading about Cohen’s Kappa, a method used to measure agreement between coders.

Some useful resources and why they’re good:

In such research, the data may be collected qualitatively, but it is often analyzed quantitatively, using frequencies, percentages, averages, or other statistical analyses to determine relationships. Qualitative research, however, is more holistic and often involves a rich collection of data from various sources to gain a deeper understanding of individual participants, including their opinions, perspectives, and attitudes. Qualitative research collects data qualitatively, and the method of analysis is also primarily qualitative.
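Since I keep coming back to Cohen’s Kappa, here’s a minimal sketch of how it’s computed from two coders’ label lists - the codes in the example are made up, and real qualitative work would typically use a stats package rather than hand-rolled code:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance given
    each coder's label frequencies.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must rate the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: proportion of items given identical codes.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement: sum over codes of the product of marginals.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from two coders over five interview excerpts:
a = ["privacy", "sharing", "privacy", "privacy", "ethics"]
b = ["privacy", "sharing", "privacy", "ethics", "ethics"]
print(round(cohens_kappa(a, b), 2))  # → 0.69
```

Handy as a sanity check: raw agreement here is 80%, but kappa is lower because some of that agreement would happen by chance.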

Reading - algorithms aren't pure or objective. Neither is science.

This quote from the abstract makes it clear why this is useful!

We advocate that data scientists should be intentional about modeling and reducing discriminatory outcomes. Without doing so, their efforts will result in perpetuating any systemic discrimination that may exist, but under a misleading veil of data-driven objectivity.

Conscientious Classification: A Data Scientist’s Guide to Discrimination-Aware Classification 10.1089/big.2016.0048

Reading - The Four Pillars of Research Software Engineering

Cohen et al. - 2020 - The Four Pillars of Research Software Engineering

DOI: 10.1109/MS.2020.2973362

This looks really useful - it wraps together the state of RSEs today, justifications for why they are important, and different angles that the issue should be approached from.

“For open-source software, putting code where everyone can see and comment on it can help to gain valuable feedback and even develop new collaborations that can lead to better tested, better quality output”

Software sustainable communities project is launched

I daydreamed of it around this time last year, sitting at Elsa’s wishing I had time to participate in Mozilla OLX. I eventually did, with my co-authored OLX application being successful and becoming Open Life Science, aka OLS. OLS-2 applications are open for a few short days, closing on the 30th of June.

I’ve finally gotten the details written up and the ethics through to run surveys and GitHub metrics on OLS-1, no matter how hilariously late it is - still relevant once I figured out how to structure the study to be non-time-critical (I had originally planned to study the effects of the intervention - tricky, with so many factors - and instead pivoted to studying metrics of projects over a year).

Small clap. Tomorrow’s my birthday. I launched a study for my birthday!

Here’s the linky: https://sustainable-open-science-and-software.github.io/

Reading - scientific writing tips

“Simple rules for concise scientific writing” Scott Hotaling https://doi.org/10.1002/lol2.10165

Good read, nice concise list. Probably good to refer to when junior writers need advice, and also has some nice “how to reduce your words” tables, e.g. “It is obvious that” -> Clearly.

🦠 Covidy progress - 💎 DMPs are forever, or at least they feel like it

I’ve spent the last three weeks thinking and rethinking approaches to ethically gathering this data. The chat with K mentioned in the last post - I said, at one point, “yeah, but ethics isn’t actually about ethics, it’s about covering the university’s ass” - has been resonating in my brain a lot. I should remember to think constructively about whether I’m happy with my behaviour (and would I be happy about my behaviour if I were one of my research subjects?) - not just treat ethics as a hoop to be jumped through. I’m a little/quite disgusted with myself, as someone who prides themselves on being vegan and atheist-agnostic. I wonder if this is some hangover from my days trying to market and sell things, like mattresses. (That feels like a long time ago now - I left that company in 2012!)

Anyway, next iteration of the DMP is D-O-N-E! It felt like it took forever. Just like a diamond. 💎

Anyway the RDA recommendations are really good. A few thoughts for the study:

  • reference the RDA guidelines, and note that multiple of the sub-groups emphasise the importance of open sources.
  • recommend the RDA guidelines.
  • consider interviewing people involved in the RDA guidelines - I know quite a few of them :)

🦠 Covid possible skeleton for write up

Early stages, need to go Zoom with in-laws soon (hahah, came back 6 days later to finish this), BUT:

  • introduce problems, explain data sharing has been an ongoing and longstanding problem
  • review previous work in epidemics, e.g. Zika, Ebola. AIDS? Review WHO for list of epidemics?
  • SCOPE:
    • acknowledge that there are many places in which there is an imperative requirement for private data (clinical, movement, personal safety).
    • “omics” data - genomic, proteomic
    • not clinical data or patient data - so virus genome yes, host genome no.

🦠 Covidy progress - feels like treading water

Research is hard.

Quick summary notes about progress so far - lots of thinking and ruminating and worrying, but feeling better today:

  1. Ethics submitted after several painfully late nights.
  2. Ethics came back with a lot of questions and a clear requirement to simplify scope and make permissions clearer.
  3. Chatted with supervisor, managed to narrow scope - remove log of data sources (would be a huge duplicate of things that exist anyway). Only retrospective collections in order to avoid changing behaviour. Was originally planning to get rid of the interviews too, but…
  4. whilst revising the ethics I worried I still needed interviews for proper context of the actions.
  5. also worried about if retrospective would provide enough info. After some review of sources - I think probably yes. Update a few hours later: did I say yes? I think… maybe no?
  6. Chatted with K, which helped emphasise to me that privacy+consent is always going to be a tough issue and sometimes you need to be more thoughtful than you had originally. I gained a lot of useful perspective. Also worried that just auditing closed sources won’t be interesting enough as a result of this convo. this combines with (4) to push me back over to wanting to add interviews in again. Another important point was that people are likely to ignore @all notifications on slack. As I mark 32 unread slack notifications read from a hackathon, I’m inclined to agree.
  7. Woke up with renewed determination and a plan that might work? targeted interviews based on slacklogs. Be prepared to remove interviews if supervisor thinks I’m extending too far again.

some ruminations:

  • Useful preprint for data sources: https://www.researchgate.net/publication/340687152_Leveraging_Data_Science_To_Combat_COVID-19_A_Comprehensive_Review/link/5ea55133a6fdccd794550d80/download - has useful text mining suggestions for twitter
  • GISAID scraper takedown request - https://github.com/bioinf-mcb/gisaid-scrapper/issues/15

🦠 Covidy planning notes

Things to track

  • sources of data useful to biologists and bioinformaticians
    • licence - this may need to have multiple questions, and potentially may change over time
    • attitudes and struggles
    • machine readability
  • streams of information to glean from users.
  • responses to attitudes and struggles

Things to do

  • Ethics! (need to know scope first)
  • sign up to virtual biohackathon slack and group ☑️
    • contact them to ask if they mind being studied! (just the organisers) ☑️
  • NF-core covid group ☑️
  • RDA group
  • https://docs.google.com/document/d/1ExyphyMfvUTlPj7vZ3wvbIIEqv4nhBpyLcV0_n1H5e8/edit
  • ask (twitter?) for experiences // wait for ethics
    • slight possibility of it being really difficult to manage responses.
    • possible mitigation?
      • Form? Pro - easy to submit, Con: possible duplications - but maybe good - more reported is more popular.
      • GitHub PR? Pro: less duplication and manual work on my side. Con: may not represent the strength/magnitude of the issues, and may be a tech barrier - though many bioinformaticians would have the skill to do this or the willingness to figure it out.
      • maybe both: PR if you can, form if you can’t. Ideally a Google Form or some other form of data collection that facilitates collaboration would be more effective.


  • Look up literature regarding previous Ebola and Zika outbreaks - e.g. Nick Loman’s data sharing stores.
  • Other stuff. 👈👈👈 Become less vague about what this is 😆

Research questions to answer

In times of a pandemic or epidemic when rapid response is required, what are attitudes towards pathogen-related data sharing and data access? In particular:

  • Are these data licenced in a way that permits re-use and redistribution?
  • Are they made available in ways that are easy to download and re-use, e.g. API or bulk download, machine-readable with relevant metadata?
  • What response do various communities have to these restrictions?
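As a rough illustration of the “easy to download and re-use” question above, here’s a hypothetical helper that classifies an HTTP Content-Type header as machine-readable or not - the MIME list (and the FASTA type in particular) is an assumption for the sketch, not a standard taxonomy:

```python
# Hypothetical classification of Content-Type headers as machine-readable
# (amenable to programmatic re-use) versus needing manual scraping.
MACHINE_READABLE = {
    "application/json", "text/csv", "application/xml", "text/xml",
    "application/x-fasta",  # assumption: some servers may label FASTA this way
}

def is_machine_readable(content_type: str) -> bool:
    # Strip parameters like "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    # "+json"/"+xml" suffixes cover structured types like application/ld+json.
    return mime in MACHINE_READABLE or mime.endswith("+json") or mime.endswith("+xml")

print(is_machine_readable("application/json; charset=utf-8"))  # True
print(is_machine_readable("text/html"))  # False: a web page needing manual work
```

In practice a check like this would sit behind an HTTP HEAD request per data source, but licensing and metadata still need human judgement.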


  • In Scope: biology and bioinformatics oriented data sources - genetic sequence data, protein data, viral strains, and statistical data relating to infections - infected, recovered, death, locations.
    • there are likely to be too many data streams to be comprehensive about monitoring them all, but we can probably find most of the data sources out there.
  • biomedical / personal data? - possibly out of scope. Maybe we should avoid strongly personal data, such as mobile tracking app data - there is good reason to be cautious about sharing this. Don’t disregard completely - write about concerns such as anonymising safely, gathering data securely, and state surveillance.
  • imaging?

PIVOT! Onto Covid studies.

So what has happened since last update, dear reader? Mostly, just a global viral outbreak with an incredibly high death rate: a coronavirus causing the disease COVID-19. The UK has legally been holed up at home for a week, with many others staying at home for a week or three before the government mandate. There are around 600,000 confirmed cases worldwide, in a world where many can’t get access to tests or are told by their government not to bother testing. This includes the UK - tests are only administered if you are hospitalised. The incubation period can be days, or up to around two weeks, so the number infected is - I’m guessing here - probably much more than a million right now.

Anyway, I was expecting my meeting with my supervisors to be short and uneventful, but after 10 or 20 minutes of Covid-talk, we pivoted to talking about research about the virus - specifically data availability. Through work we’ve already been made aware how rubbish some data are - hard to use (i.e. not machine readable, or not easily available, requiring repeated manual human intervention to download), or illegal to re-use due to restrictive licencing terms.

For fans of open things, closed licence stuff usually makes you swear a lot at people who would rather line shareholder pockets than share their work and knowledge with humanity. When it’s in pandemics, it just seems downright dastardly. Anyway - this sparked an idea for a really useful study: What data are available and what challenges do we face in a pandemic situation? Observe and record as it happens.

So, that open study? will still happen, but for now we’re focusing on the pandemic. PIVOT!


Reading - 🐘 About elephants in the room, and blind men. And blind Yos as well I suppose.

I’ve just been reading “The blind men and the elephant: towards an empirical evaluation framework for software sustainability” - doi: 10.5334/jors.ao.

Comments on the reason I read this paper - kind of by accident?

I’m on the plane and the price for wifi is extortionate, so I’m not able to follow links and thoughts too much - this is okay for the purposes of dedicated study downtime, to be fair! Anyway - I had been looking at the CHAOSS guidelines, and come to the conclusion that the Elephant Factor was confusing and didn’t necessarily sound like a meaningful measurement. Looking at the citations at the end of that metric’s section, I saw this paper and figured I should read it, especially since Caroline was one of the co-authors.

I printed it out, read it, and part way through realised it probably had nothing to do with the CHAOSS Elephant Factor metric.

Thankfully I do have a copy of the CHAOSS metrics downloaded offline - reading through, I noticed this explanatory snippet:

Elephant Factor provides an easy-to-consume indication of the minimum number of companies performing a parameterized filter (i.e. 50%) of the work. The origin of the term “elephant factor” is not clearly delineated in the literature, though it may arise out of the general identification of software sustainability as a critical non-functional software requirements by Venters et al (2014).

I guess I skimmed that when I decided to read this paper, or possibly I didn’t quite fathom just how unrelated it was, with the exception of the words “elephant” and “software”?

Actual notes on the paper

It was an interesting read despite the fact it wasn’t what I thought it was, which was nice. It mostly looked at different ways people define sustainability (spoiler: many different ways), pointed out that metrics for sustainability are often wholly subjective, and seemed to focus on technical architecture as an important part of sustainability.

Reasons it may be useful for me:

  • it has lots more reading about sustainability measures
  • repeatedly states that software sustainability is a non-functional requirement, and points out that academic software often has haphazard construction and may be especially proof-of-concept-ey, with little attention paid to non-functional requirements.
  • points out that software metrics cannot always be measured in a quantifiable way - e.g. reliability.
  • Uses the parable of the blind men and the elephant (one feels the trunk (or maybe the tail?) and asserts it’s like a rope; another feels a different part of the elephant and makes different claims about it) to remind us that we need to find metrics that genuinely assess what we want to assess. Using this theme could be useful to point out vanity metrics and gamifiable metrics.
  • lots of nice bits I can quote. Some samples:

“The concept of sustainability goes beyond the software artifact itself”

“What measures and metrics are suitable to demonstrate software sustainability is an open research question”

“Selection of the appropriate methods is highly dependent on the context…”

CHAOSS - reading through the metrics and thinking.

I’ve been reading the CHAOSS Metrics (Community Health Analytics Open Source Software) recently. Emmy printed them for me so I could easily annotate them 😍.

Some notes as I think of them:

Good stuff I probably want to use to measure these projects over time.

  1. Is there a GitHub repo (or other) already? Y/N. If Y:
    1. CoC for project
      1. Is there one? (easy to find automatically, manually, other?)
      2. Is there enforcement info?
    2. Is there evidence of mentorship?
      1. programs (GSoC, Outreachy, internships) (Manual assessment required)
      2. people who help out - e.g. first-timers only, hacktoberfest could be good proxy measures. (auto)
      3. contributing md
    3. How old? Is it the oldest repo in the org? Is it even an org, or a user? This may need to be assessed manually on a case-by-case basis
    4. Commit metrics:
      1. # of lines
      2. # of commits
    5. Review Metrics
      1. # of PRs
      2. # accepted / rejected / left open
      3. Median time from open-to-resolved.
    6. Issue/ticket based metrics
      1. # opened
      2. # reopened (there are caveats about measuring closed and reopened tickets!)
      3. # closed
      4. Time to close (?)
    7. Committer Metrics
      1. How many orgs do they come from?
      2. # of committers
      3. # loc of commits.
    8. Licence - present, copyleft, permissive?
    9. Velocity. The metrics had number of commits vs number of PRs/issues as velocity. Is this meaningful? (I don’t see the point).
    10. Derived from CHAOSS but not directly their metric - also possibly a hard one to measure. Twitter mentions? //todo investigate how hard/easy this is to track.
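As a sketch of how the review metric in 5.3 (median open-to-resolved time) might be computed, assuming the pull requests have already been fetched from the GitHub REST API - the fetching step is omitted and the sample data below is made up:

```python
from datetime import datetime
from statistics import median

def median_resolution_days(prs):
    """Median days from opening to resolution for merged/closed PRs.

    `prs` is a list of dicts shaped roughly like the GitHub REST API's
    pull-request objects, with ISO 8601 `created_at`/`closed_at`
    timestamps; still-open PRs (closed_at is None) are skipped.
    """
    durations = []
    for pr in prs:
        if pr.get("closed_at"):
            opened = datetime.fromisoformat(pr["created_at"])
            closed = datetime.fromisoformat(pr["closed_at"])
            durations.append((closed - opened).total_seconds() / 86400)
    return median(durations) if durations else None

sample = [
    {"created_at": "2020-01-01T00:00:00", "closed_at": "2020-01-03T00:00:00"},
    {"created_at": "2020-01-01T00:00:00", "closed_at": "2020-01-08T00:00:00"},
    {"created_at": "2020-01-01T00:00:00", "closed_at": None},  # still open
]
print(median_resolution_days(sample))  # → 4.5
```

Median rather than mean, so one ancient forgotten PR doesn’t swamp the number.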

Stuff I’m not sure about / maybe dislike / need to think some more

Licence - this is where it gets tricky. The metrics discuss the number of files without a licence in them. This seems contrived to me as a metric of sustainability, BUT it is justified by pointing out that companies may not want to pick up higher-risk projects that don’t have clear, safe licencing processes. It also mentions the number of licences, because - logically - this adds legal complexity if some files are licenced one way and some another. (Biopython might be an example here.)

Testing - hard to assess, but I feel like we should at least check for its presence.

Elephant Factor - I’m not sure I get it. The minimum number of companies that together contribute 50% or more of the effort. I should read the paper (Caroline is one of the authors on this!). Update - see my notes on the paper - it’s totally irrelephant! 🐘
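For my own understanding, a minimal sketch of the Elephant Factor as defined in the CHAOSS snippet - the minimum number of organisations that together cover at least 50% of the work (org names below are made up):

```python
from collections import Counter

def elephant_factor(commit_orgs, threshold=0.5):
    """Minimum number of organisations that together account for at
    least `threshold` of all commits.

    `commit_orgs` is one organisation name per commit; we count
    commits per org, then take orgs greedily from biggest down.
    """
    counts = Counter(commit_orgs)
    total = sum(counts.values())
    covered = 0
    for i, (_org, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= threshold:
            return i
    return len(counts)

commits = ["acme"] * 40 + ["globex"] * 35 + ["initech"] * 25
print(elephant_factor(commits))  # → 2: acme alone is 40%, acme+globex is 75%
```

A low number means the project depends heavily on one or two organisations - the “elephant” whose departure could sink it.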

Open Life Science, transcribing, still looking for participants

Mental models study

Not done 😶 - still on the hunt for two more bio participants, but wondering if I should just give up. 😭 I have 22 people in total.

Transcribing is a special hell. I think I’ve said that before? Sigh. Anyway, getting it done bit by bit.

More of a fun update - Open Life Science and the sustainability of open projects over time.

I’ve been ramping this up a lot - mostly because the timeline of launching OLS in January required it. Applicants are selected and have been notified, mentors are (mostly) assigned.

Small achievements:

  • Emmy Tsang, Mateusz Kuzak, and I have a talk accepted at CHAOSScon 🎉
  • Unsurprisingly following on from the last thing, I’ll be at SustainOSS, CHAOSScon, and FOSDEM at the end of Jan / start of Feb. Yay! (Also: In Brussels on BrexitDay. 😬)
  • I’ve been reading the CHAOSS guidelines. The first 70 pages are amazing and I’ve made lots of notes. I’m not sure the last few are really measurable or useful for me. Another blog post about thoughts coming up.

Biological mental model study call for participants - wetlab edition

I’m still looking for a few final participants for my mental models study. To recap on the previous recruitment notice, I’m running a study to learn about what differences there are (if any) between the way biologists think about biological data compared to programmers.

I’ve already interviewed quite a few software developers and bioinformaticians - now I’m looking for pure wet lab biologists with little or no programming experience, in the Cambridge area. (Sorry programmers! I love you but I have enough interviews from you already ❤️).

Interested in participating? A typical interview takes 30-60 minutes, and I can come to your site to interview you - all we need is a quiet-ish area to sit with a table.

How you can help:

  • Sign up to do an interview:
  • Share [this poster (pdf)] in your research institute, via email and/or noticeboards.
  • Share word with others via Twitter.

I want to know more!

The best way to learn more is to sign up to participate or to watch my twitter account; I’ll release a preprint as soon as I reasonably feel I can, and I aim to do this by the end of the year. You can also read more about the study in the participant information sheet - apologies if it’s a bit dry!

Where / when?

I’m based in Cambridge, and right now I’m specifically looking for sites in Cambridge that I can visit to perform 1-2 interviews onsite - wet lab participants only!

Reading - why software needs funding, especially for maintenance

This paper talks about sustainability of scientific software projects in the Earth Sciences - it’s an interesting read and also useful to cite if you’re ever writing about the need to maintain software, not just race towards a paper/end of grant then forget about it. There’s a set of recommendations at the end for individuals and for organisations.

Nine month review - anxiety and fear

9 Month review is over and passed. Slides from my presentation: https://docs.google.com/presentation/d/1f3bwX-SUArcFU88XA3XhaN0ThwaMhZe_WVxZbo8Swxw/edit#slide=id.g621cf502c8_0_63

This was scary - I was incredibly nervous beforehand - and despite passing I felt like I hadn’t really done all that well. My inner perfectionist being a jerk, I guess.

Reading - measuring the value of open source

doi: https://doi.org/10.1145/3126673.3126679 Title: The Value of Engaging with Open Source Communities George P Link.

Just quick notes: Mostly useful for references to CHAOSS and ways to measure / “signal” the importance of open source software. Slightly more business oriented than academic/research oriented.

Tool to try for literature review

I’ve spent all evening thinking about systematic reviews, then this pops up like a ray of sunlight in my twitter feed:

‘LISC: A Python Package for Scientific Literature Collection and Analysis’ https://joss.theoj.org/papers/10.21105/joss.01674

I haven’t read it yet but the title sounds glorious.

Acknowledgements - running list

People who I need to mention as various thanks in papers and eventually my thesis:

  • Elsa, for helping me brainstorm the community intervention task.
  • Gos, for adding the GitHub element to it
  • Emmy and Naomi for always being there to talk
  • Berenice and Malvika for bringing together Open Life Science
  • All the people who helped me arrange interviews, especially: Robert Davey, Catherine, Pete, Laura Clark, Aidan Budd.
  • Matus Kalas for sharing his thesis online and then walking me through the contents in-person later
  • Bjorn for printing out my updated consent forms while he was in the middle of organising a massive conference.

This list is incomplete and will be added to whenever I can, but probably less often than it should be.

Further musings on mental models

I did four more interviews on Friday the 20th in Oxford. Recruitment was a little harder than it has been, probably because I didn’t go to a dedicated biology institute like I had the last two times. Still, it was incredibly interesting! I had a very varied bunch, from pure wetlab to someone who probably had less bio than me. Some of the things I noticed:

  • Wetlab person: almost no data schema or relationships between data at all. Just a set of piles with titles. Will others be like this, or was it specific to this person? I feel like it might be personality, but I can’t be sure.
  • Computer person: even when they didn’t necessarily understand the data or the cards, they could still make educated guess based on identifier format (e.g. the mix of letters and numbers in a file or card). Suggestion: Always provide an example identifier in column mapping tasks.
  • Sometimes people focus on the data in the cards when they’re doing the card-sorting task (this is probably a code I need to be using - “seeking meaning of data”, maybe). It’s not everyone, but it is a noticeable chunk. I was first inclined to dismiss this as a mistake people are making, but now that I think about it more, there are implications when transferring to the real world - it seems possible that people may make the same mistake during, say, column mapping exercises in a UI, and look for “cancer” when asked to map BRCA1 to a column, rather than looking for “gene identifier”.

OLX is a go go!

So - we’re accepted to the Open Leaders program! Berenice, Malvika, and I applied together and were accepted. We’ve tentatively named it “Open Life Sciences”, which sounds great to me.

Here’s the list of others running OL programs: https://foundation.mozilla.org/en/opportunity/mozilla-open-leaders/open-leaders-x/participants/

And here’s the medium announcement: https://medium.com/@MozOpenLeaders/meet-the-open-leaders-x-cohort-1dc230a4c56a

Fifteen Interviews In!

Wow, apparently I’ve already done 15 interviews and one of the most intimidating things is that I only have a couple transcribed so far. Transcription is the worst thing EVER. I can’t use cloud services (ethics, remember) and Siri offline on my mac is….. bad. Just bad.

Some thoughts and notes on what’s happened so far.

  1. I think there are indeed clusters of some sort in how people create their models. This is based on my personal observation rather than data analysis.
    • There’s the data-centric cluster, which sees data and biology as inherently intertwined - a gene is simply a data object which has attributes such as names, identifiers, etc. - and then there are links between data objects, with a clearly defined relationship, perhaps a “is-a” or “has-a” relationship.
    • There’s also a Biology vs Data cluster - these people sit down, form a nice blob of biology terms, and a second blob of data terms.
  2. For most people, mental models of biological data are squishy. Asking questions will often result in “well, I could have done it this way!” - and some people will then adjust the cards, but not everyone will.
  3. Many people want to draw lines between items. I provided paper so people could sketch relationships if they wished, but no one actually did for the first few. Eventually, someone did. This person, hilariously, claims little computer programming / computer science experience but created the most clearly defined schema I’ve seen so far. After seeing them put paper behind the cards, I realised that if I’d had better foresight, I could have prepared a flipchart and allowed people to write underneath. After fifteen interviews, I definitely do not wish to change things around this much, but it would have made transcription a heck of a lot easier.
  4. While many people added a few cards, mostly people didn’t feel the need to add much. The model we started with seems to be reasonably complete.
  5. I’m glad I decided to alternate the order of tasks A and B - one participant worried that the ordering had affected their performance, and I think it possibly does as well.
  6. Some other possible codes: some people made clusters of related things, some people made hierarchic piles, and some people made trees of data.
  7. Some card values were too generic to be clear to most people - e.g. “name”, “accession”. Others were potentially applicable to many different places - e.g. “length”, “molecular weight”.

Updates and general notes

Some updates:

  • Ethics: I eventually got ethical approval for my mental models study during GCC 2019 (early July) but it appears that the ethical review system is somewhat faulty - submissions could appear to be finalised with nothing else to do on my end, but still show as pending submission on the other end. The only way to make it pop through was to unsubmit and resubmit for signatures. Lesson learned: always verify with the ethics approvers that they have the application in the correct status, and don’t assume they’re reviewing it or that anything worked as expected. :(
  • Interviews: three interviews done, heavily recruiting now that I’m back from BOSC 2019. I couldn’t do any interviews at BOSC, as it was in Switzerland and it would be considered unethical for me to do research outside the EU.
  • Mozilla Open Leaders X - currently in talks with Berenice and Malvika about applying for this. Some of the OBF are aware of our intentions, but we should raise it at a formal board meeting.

Biological mental model study call for participants

UPDATE, September 2019: I’m now hunting primarily for wet lab backgrounds or anyone with computational background but less than ten years experience in a biological domain!

I’m conducting a study to see whether people from biological/wet-lab backgrounds vs more computational backgrounds tend to envision biological data structures differently. This involves in-person recorded interviews (30-60 minutes) so participants must be in the UK, although I am able to travel to meet you, so long as it’s within a day trip from Cambridge (or several interviews at once at your institute). Weekends and evenings are fine for me too if that works for you.

How you can help:

  • Sign up to do an interview: email Yo Yehudi at yochannah.yehudi@postgrad.manchester.ac.uk for more info, or DM me on twitter if I follow you. I’m @yoyehudi.
  • Share this poster (docx) (Also available in pdf) in your research institute, via email and/or noticeboards.
  • Share word with others via Twitter. You can retweet this tweet.

I want to know more!

The best way to learn more is to sign up to participate or to watch my twitter account; I’ll release a preprint as soon as I reasonably feel I can, and I aim to do this by the end of the year. You can also read more about the study in the participant information sheet - apologies if it’s a bit dry!

Where / when?

I’m based at Cambridge, so anywhere that’s a day trip or less I can come to easily - or you’re welcome to come to me in Cambridge if you’re nearby. I also have the following dates at specific sites planned (newer is on top).

  • Sept 20 2019: e-Research Centre, Oxford. Sign up here to participate
  • August 28 2019: EBI, Hinxton [sign up now closed, thank you everyone who participated!]
  • August 21 2019: Earlham Institute, Norfolk [sign up now closed, thank you everyone who participated!]

Idea formation - OLBio

Based on Open Leaders & conveniently timed to coincide with the OLX announcement.

Basic structure:

  1. Course spanning N weeks, aiming to get open source projects to a higher, more reproducible standard. What is N? Base it on the content for the curriculum.
    • “cohort” interactive calls w/ all? USA + Europe cohort? No, ideally only one for starters - 5pm UK time so USA and euro can attend?
    • One-on-one mentors for participants. Alternate cohort calls with mentors calls, like OL3/4/5/6/7
  2. Longitudinal study. Observe projects before, after, 6 and 12 months after?
    • observations: need to look up current accepted measurement techniques for A) science software projects and B) open source projects.
    • Surveys: self-reporting project health & medium-term predictions / roadmap
    • GitHub / GitLab API to compare self-reported metrics, and potentially to continue measuring even if participants drop off (is this ethical? The info is public)
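The API-based measurement above could start very simply - e.g. checking whether commit activity continues after the course ends. A minimal sketch (function name and example dates invented; in practice the dates would come from the GitHub commits API, `GET /repos/{owner}/{repo}/commits`, or the GitLab equivalent):

```python
from datetime import date

def activity_split(commit_dates, cutoff):
    """Count commits strictly before vs. on-or-after a cutoff date.

    `commit_dates` is a list of datetime.date objects - in practice these
    would be parsed from the author dates returned by the GitHub/GitLab API.
    """
    before = sum(1 for d in commit_dates if d < cutoff)
    after = len(commit_dates) - before
    return before, after

# Invented example: did activity continue after a hypothetical course end date?
dates = [date(2019, 11, 2), date(2019, 12, 15), date(2020, 2, 1), date(2020, 3, 9)]
print(activity_split(dates, cutoff=date(2020, 1, 1)))  # (2, 2)
```

The same before/after split could be re-run at the 6- and 12-month observation points to get a crude longitudinal activity signal.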


People needed

  • speakers at cohort calls
  • mentors
  • cohort call hosts
  • curriculum designers / reviewers

Collaborators / sponsors

  • Might any journals be interested in this? Could champion as part of the open source toolkit on PLOS?
  • Encourage participants to publish in JOSS
  • Can Mozilla offer any support?
  • Apply to OBF as a member organisation / consider

Video conferences / cohort calls

  • Where does the Zoom come from? (Needs to be Zoom specifically; it allows good breakout rooms.)

Topics to cover (to be added to, preliminary)

  • Basics of an open source project - readme, roadmap
  • Basics of git? Assess if needed
  • Diversity and inclusion
  • what makes scientific software good and reproducible?
    • definitely guest speakers for this
  • possibly also writing and reviewing scientific abstracts. Aim for BOSC or GCC-ready submissions? Other conferences?

Search terms to try

Search for

  • eliciting mental models / ontologies
  • capturing mental models / ontologies.

Reading discards - ontology design

Carole suggested that ontology design is similar to mental models. So far I think I’m looking in the wrong places, though - these articles are all about technical design considerations, not how to elicit and understand the models.

  • Obitko, M., Snásel, V., & Smid, J. (2004). Ontology Design with Formal Concept Analysis. ResearchGate. Retrieved from https://www.researchgate.net/publication/200473556_Ontology_Design_with_Formal_Concept_Analysis
    • Why not useful? Mostly about methods to structure ontologies, less about eliciting them.
  • Nebot, V., Berlanga, R., Pérez, J. M., Aramburu, M. J., & Pedersen, T. B. (2009). Multidimensional Integrated Ontologies: A Framework for Designing Semantic Data Warehouses. SpringerLink, 1–36. doi: 10.1007/978-3-642-03098-7_1
    • Why not useful? Very technical re datastorage, semantics, etc.
  • Designing and Evaluating Generic Ontologies. (1996). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=
    • No human factors, again.

I think this is still a relevant place to explore, but maybe I need different search terms. “Eliciting” might be good.

Submitted for ethical approval

Ethical approval form submitted last week, signed today by Caroline.

Awaiting response…. nervously.

Protocol - card sort

Protocol planning


  1. Introduce participant and explain purpose of the research. Provide participant information sheet.
  2. Present participant with consent form and explain what different levels of consent mean (e.g. consenting to data sharing in aggregate vs allowing video to be shared as an example).
  3. Background questionnaire will be conducted, by loading questionnaire on laptop and asking participant to fill out.
  4. Set up recording to record audio and/or video as per participant’s preference.
  5. Participants will be asked to perform Task A and Task B, with the starting task alternating between A and B in case one task affects performance in the other.
  6. Tasks:
    • Task A:
      • Explain the task to the participant: they will be offered a stack of cards with names of biological concepts/entities (such as “Gene”, “Protein”, “BRCA1”) written on them, and we would like the participant to sort the cards in a way that makes sense to them. They will also be offered paper and pens, to allow them to sketch / make notes if they wish to, and additional blank cards in case they feel the need to add their own biological entities beyond the ones provided.
      • Once the participant is done sorting (and sketching, if needed), ask them to explain how they reached the decisions they did. Participants are allowed to re-arrange their cards / re-draw their data model if they wish.
      • Ask participants which entities from the cards they think are the most important / interesting.
    • Task B:
      • Provide the participant with paper copies of common biological file formats such as FASTA and GFF3. Ask them to match certain parts of the file to the entities found in the card sort.
  7. The participant will be asked if they have any further questions or comments.
  8. Once all queries and comments have been addressed, take still photos of the resulting sketches / card arrangements.

Invitation emails

To a third party, asking for them to help recruit


I’m running a study at the moment about the way people from different computational and biological backgrounds view biological data models, and I’m looking for volunteers. Participants are asked to spend 30 minutes to an hour (maximum) participating in a semi-structured interview, which will be recorded.

We’re looking for people who work with genomic and proteomic data: people from a wet lab background, bioinformaticians, people with a research software engineering background who work in a biological domain, or anyone who fits somewhere in between. Could you share this with anyone you think might be interested? I’ve also attached a poster you could print and attach to a noticeboard.

Anyone who is interested should contact Yo Yehudi at: yochannah.yehudi@postgrad.manchester.ac.uk

To a friend/acquaintance, asking them directly for help:


I’m running a study at the moment about the way people from different computational and biological backgrounds view biological data models, and I’m looking for volunteers. Participants are asked to spend 30 minutes to an hour (maximum) participating in a semi-structured interview, which will be recorded.

We’re looking for people who work with genomic and proteomic data: people from a wet lab background, bioinformaticians, people with a research software engineering background who work in a biological domain, or anyone who fits somewhere in between. I was wondering if you might be willing to participate?

<DELETE IF IRRELEVANT:> I think we’ll both be at - is there a time we might be able to meet?

Could you also share this with anyone you think might be interested? I’ve also attached a poster you could print and attach to a noticeboard.

Sample tweets

Are you any of:

  • A wet lab biologist?
  • A bioinformatician?
  • A research software engineer who works in a biological domain?

I’m conducting research into how people create mental models of biological data - can I interview you? 30-60 minutes max. (In-person, Cambridge, UK)

#GCC attendees: I’m conducting research into how people create mental models of biological data, and y’all are a target audience of mine - do you have 30-60 minutes to spare for a semi-structured interview during a spare moment during / before/ after the conference?

Mental models in biological genomic data - measuring.

Mapping data models. I think I would like to preregister this when we feel that the plan is as clear as possible.

Research question:

Draft wording: Does a person’s educational background in biology and computational subjects have an effect on their ability to create effective biological data models, and to map biological data onto this model?

Maybe break down into sub-questions + null hypotheses, with expectation that biology + less computer will be different from biology+computer knowledge or even just computer w/ low biology. More notes about this here

Background questionnaire

  1. Educational background - biology (Do we want to get more specific than this?)
    • High School or lower
    • Undergraduate
    • Postgraduate or higher
  2. Computer background (How can we make this more granular?)
    • Casual computer use (uses software as needed for work, prefers to be shown new things and/or have help troubleshooting)
    • I write some code or tweak code but don’t consider myself to be a programmer. Excel spreadsheets, script tweaking, etc.
    • I write code on a regular basis.
  3. Possibly a better approach than the previous Q: - check marks or scale for familiarity?
    1. Biology software: Are you familiar with any of the following?
      • Galaxy
      • BioMart
      • GeneCards
      • InterMine
      • Molgenis
      • (etc. // add more)
      • other (?)
    2. Computing languages (mix of common scripting languages + languages that teach users certain concepts, e.g. object oriented programming, graph DBs, SQL relationships)
      • R
      • Python
      • SQL
      • Perl
      • Java
      • Functional languages, e.g. Haskell, Clojure
      • Graph databases, e.g. Neo4J
      • Git or other version control
      • Other
  4. What file formats do you work with, if any? (e.g. FASTA, GFF, BAM, VCF, etc…) (free form text)
  5. Do you focus on any specific organisms? (free form text, too many to list!)

Research design

Independent variable

User characteristics as listed above.

Dependent variables

Models and mappings as generated by user.

Eliciting model details

Possible approach: card arrangement, with cards covering the following three types of data:

  1. Model entities, such as
    • Gene
    • Protein
    • Experiment
    • Publication
    • Organism
    • Data File
  2. Entity properties - e.g.
    • Gene Symbol
    • Gene Identifiers
    • Gene Orthologues
    • Organism name
    • Organism taxon ID
  3. Property values - e.g. for the previous properties -
    • Gene Symbol - BRCA1 (H. sapiens)
    • Gene Identifiers - ENSG00000012048
    • Gene Orthologues - Brca1 (M. musculus)
    • Organism name - H. sapiens
    • Organism taxon ID - 9606

Possible problems: Providing cards with all three of these types of data may result in people creating simple columns of data - e.g. a Gene header with cards underneath providing properties and property values.

Possible ways to address this

A: Structure this in a three or four-part process:

  1. Provide entities and ask for people to arrange them and explain the relationships.
  2. Provide property cards and ask for them to be added to the model.
  3. Provide the property value cards, and repeat.

Problem with this approach: I’m already providing artificial constraints onto the mental model here. People might not see things this way!

B: Provide people with all of the cards at once (or no cards?) and ask them to sketch how they think they’re related.

Q: Should we provide some sort of demo model? something unrelated, e.g. pets and owners, or similar? Is this leading people too much?
Q2: Populating the property values - I was thinking possibly famous genes / proteins / etc. from a mix of popular organisms - ones that are popular in undergrad courses, cancer genes that are associated with celebrities, etc.

Analysis of results…

// TBD
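One common starting point for analysing open card sorts would be a co-occurrence matrix: for each pair of cards, count how many participants placed them in the same pile. A minimal sketch under that assumption (the card names and pile data below are invented for illustration, not real participant data):

```python
from collections import Counter
from itertools import combinations

def co_occurrence(sorts):
    """Count, for each card pair, how many participants put both in one pile.

    `sorts` is a list of card sorts, one per participant; each sort is a
    list of piles, and each pile is a list of card labels.
    """
    counts = Counter()
    for piles in sorts:
        for pile in piles:
            # sort the pile so each pair key is in a canonical order
            for a, b in combinations(sorted(pile), 2):
                counts[(a, b)] += 1
    return counts

# Two invented participants:
sorts = [
    [["Gene", "Protein"], ["Organism", "Organism name"]],
    [["Gene", "Protein", "Organism"], ["Organism name"]],
]
print(co_occurrence(sorts)[("Gene", "Protein")])  # 2
```

The resulting counts could then feed a distance matrix for hierarchical clustering, which would make it possible to compare wet-lab vs computational participants quantitatively rather than by eye.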

Trial run - card sort

Trial run of the card sort with a friend went better than expected.

  • Sorted cards into piles of related items, correctly matching properties and property values to the right type. Later, relationships were sketched on paper, relating to a given pile of cards.
  • Felt some attributes may have been missing, especially
    • Transcripts
    • Pathways
    • mRNA
    • Interactions
    • Expression
    • Regulatory Regions
    • Protein Function?

Cards to discard:

  • Experiment
  • Attribute values that are just numbers - too vague.
  • Consider renaming database to dataset, or use both on the cards.

Reading discard- training effects on acceptance of biology software.

Reading list discard: Effect of training on biologist acceptance of tools

Shachak, A. and Fine, S. (2008), The effect of training on biologists’ acceptance of bioinformatics tools: A field experiment. J. Am. Soc. Inf. Sci., 59: 719-730. doi:10.1002/asi.20772

This article was mostly related to how biologists felt about certain bioinformatics tools after hands-on training workshops using two different training methods. The study was broadly about the training rather than about usability or mental models formed. The one slightly interesting note for me (and reason it probably turned up in my search) was this sentence, near the conclusion:

It is suggested that the ability of experienced learners to form a mental model of the system is not dependent on the hands‐on training method and hence this did not affect their perceptions of the bioinformatics tools.

Further investigations related to training (and especially lack of) for new biology related systems could be interesting - I should check out the citations of this article.

Reading - Millennial Students' Mental Models of Search - Implications for Academic Librarians and Database Developers

Article: Millennial Students’ Mental Models of Search: Implications for Academic Librarians and Database Developers

Lucy Holman, Millennial Students’ Mental Models of Search: Implications for Academic Librarians and Database Developers, The Journal of Academic Librarianship, Volume 37, Issue 1, 2011, Pages 19-27, ISSN 0099-1333, https://doi.org/10.1016/j.acalib.2010.10.003. (http://www.sciencedirect.com/science/article/pii/S0099133310002545)

The discussion of mental models is particularly interesting here, as well as citations that lead to other interesting articles, such as this note on the transferability (or lack of) mental models:

Although some research[23] indicates that users may adapt models as they explore new tools, others find that work in similar systems actually may complicate learning and confuse users. For example, Scharlotte Saxon discovered that middle-school students mistakenly assume that different systems work more similarly than they actually do; students experienced problems in transferring their models from one system to another.[24]

Reference 24 refers to Saxon, S.A. Seventh-grade students and electronic information retrieval systems: An exploratory study of mental model formation, completeness and change. Ph.D. thesis, The Florida State University. Accessed from https://www.learntechlib.org/p/119200/ on 2019-04-21. One concern - as interesting as this thesis is, is the study of technology from 1997 relevant now, or has technology changed enough these lessons no longer apply? Frustratingly, only the abstract seems to be available and I can’t see the author’s current contact details online to request a copy of the manuscript directly. Sigh.

Apart from this, the methods and results generally aren’t likely to apply to my research; it focuses too much on search methodology, which I am not concerned with.

towards a first experiment - what mental models do people need to interact with biological software?

So, my last meeting helped me come up with a much firmer plan.

  • the usability testing I’ve been planning will probably happen, but not yet.
  • there may well be a pre-test testing round where we firm up our ideas more clearly
  • before this, let’s assess mental models of biological software / data models.
    • biological data models exist, e.g. the intermine data model.
    • advanced users probably have a mental model that is similar to the existing model
    • naive users won’t have this model at all, but they probably have some mental model that represents biological data. Some hypotheses (I think this is the right word!):
      • people with more computational knowledge are more likely to have a closer mental model
        • especially if they understand SQL, since queries are modelled after SQL-style queries. (Maybe this is a bad premise - am I assessing ability to query, or just ability to match the models?)
      • Biological knowledge may help some aspects, e.g. genes and proteins will always have a relationship.
      • Might the programming languages known or other biological software used affect the understanding of the model?
    • TASKS
      1. Map data from a familiar file format to the InterMine model.
        • familiarity with relevant file formats (GFF, FASTA) will probably affect this. Should all subjects know this data?
        • Familiarity with organism data may help or hinder; suggest we split this so some people work with familiar data and others do not.
      2. Query data from InterMine and retrieve correct results? (not sure if this is needed)
        • this will always be affected by UI - maybe pseudo-query is what is needed?

uxls notes and musings

Looking at the guides that they have - many are rather sparse but provide good reminders about possible directions to take and techniques to use.

  • Personas is likely to be useful
  • Prototype testing.
    • People we could target
      • Undergrads who know biology?
      • Grad students who know biology? <– probably better.
      • people from the intermine community
        • who haven’t we targeted before? :)

Task types that aren’t likely to be useful right now:

  • card sorting - this might be useful for sorting report / list / templates pages, but not right now for the wizard.

First project planning meeting

Present: Caroline + Carole + Gos + Me

First project planning meeting. I shared my plans for the first project - looking at the usability of the InterMine Cloud Wizard, and talked about plans to assess its usability. Turns out I was thinking too big, and I need to think of much smaller chunks, and stop thinking so much like a software engineer - more theory, less implementation focus.

Some possible chunks:

  • Reviewing usability of other bioinformatics tools (Possible: Galaxy, InterMine, Molgenis, Biothings)
  • uxls toolkit applied in practice
    • work with users who know intermine
    • also users who do not.

Things to think about & learn:

  • What is good wizardry practice?
  • Work with E regarding literature.

Interesting / relevant conferences:

And always think - “so what?”.

Expanding into a longer set of thoughts:

  • Why did you do x? Define questions more clearly.
  • Why does it matter?
  • How can we measure it?
  • What knowledge do I gain that others can re-use?

Reading - some notes

Note to self: never read an interesting paper and don’t take notes. Never, never! I’ve been looking up the same papers over and over…. here are some of them:

Beyond the five-user assumption - benefits of increased sample sizes in usability testing

Faulkner, Laura. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing.

The title gives a lot of this away - but there are a few points I particularly liked in the main text, especially:

if, for example, only novice users were tested, a large number of usability problems may have been revealed, but the test would not show which are the most severe and deserve the highest priority fixes. Expert results may highlight severe or unusual problems but miss problems that are fatal for novice users

Results with 5 users can vary: you might get most errors, but one sample of only 5 returned just 35% of errors! 😧

I also find this interesting personally, as it’s debunking Nielsen, and I always thought Nielsen was basically usability God.

Reading - Supporting cognition in systems biology analysis- findings on users' processes and design implications

DOI: https://doi.org/10.1186/1747-5333-4-2
Author(s): Barbara Mirel

The author reviews 15 scientists’ workflow needs and notes that, broadly, the existing software tools do not do as much as might be hoped (note: this article was from 2008). Specifically this refers to tools that explore and analyse data, rather than parse it.

Tools have advanced to the point of being able to support users fairly successfully in finding and reading off data (e.g. to classify and find multidimensional relationships of interest) but not in being able to interactively explore these complex relationships in context to infer causal explanations and build convincing biological stories amid uncertainty.

  • existing tools allow strict categorisation but little novel creative analysis.
  • the tool that was analysed (MiMI) no longer exists :(
  • comments on the testing included a regular desire to know how we know a given statement is true (i.e. what is the provenance of the data I see?)
  • The general structure of the paper looks good for a BlueGenes usability paper.
  • it provides some nice heuristics that might be good general recommendations for science / bio papers.
    • explain provenance of data
    • ensure data can easily be manipulated exploratively.
  • different views of data are important for different task types:

For example, users benefit most from side-by-side views – such as the network and tabular views in MiMI-Cytoscape – when their tasks involve detecting patterns of interest and making transitions to new modes of reasoning. But they need single views rich in relevant information and conceptual associations when their goal is to understand causal relationships and diagnose problems [33]. Conceiving and then designing these rich views are vital but challenging.

Reading - A large-scale analysis of bioinformatics code on GitHub

A large-scale analysis of bioinformatics code on GitHub (Pamela H. Russell, Rachel L. Johnson, Shreyas Ananthan, Benjamin Harnke, Nichole E. Carlson)

This would be a good article to cite if I need statistics on

  • number of articles associated with code repos year-on-year
  • statistics regarding repos and teams on GitHub
  • community / external contributors
  • gender breakdown in bioinf paper authorship
  • length and quality of commits and repos.

Publishing commits after the paper is a very interesting metric…

We looked at the simple binary feature of whether any commits were contributed to each repository after the associated article appeared in PubMed. …. However, interestingly, the association with the proportion of commits contributed by outside authors was not statistically significant, suggesting that overall team size may be the principal feature driving the relationship with the number of outside commit authors. Additionally, the metric was associated with frequency of citations in PubMed Central, which could indicate that people are discovering the code through the paper and using it, and the code is therefore being maintained.

Reading discard

Non-coding RNA detection methods combined to improve usability, reproducibility and precision.
Peter Raasch, Ulf Schmitz, Nadja Patenge, Julio Vera, Bernd Kreikemeyer and Olaf Wolkenhauer http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-491

Has usability in the title but is tool-focused, not really usability focused.

Reading - bioinformatics tools analysis framework at EMBL-EBI

A new bioinformatics analysis tools framework at EMBL–EBI

Mickael Goujon, Hamish McWilliam, Weizhong Li, Franck Valentin, Silvano Squizzato, Juri Paern and Rodrigo Lopez http://nar.oxfordjournals.org/content/38/suppl_2/W695.short

Why not useful?

The paper itself is fine, but focuses on describing a suite of tools with a common interface rather than any specific usability analysis. There were a few brief notes about user-friendly interactivity - wizards, meaningful error messages, etc. - but this was not the focus of the paper.

The only other thing of any real note was that they had a UI and APIs, allowing both user-friendly and programmatic access.

Reading: Bioinformatics meets user-centred design: a perspective

Bioinformatics Meets User-Centred Design: A Perspective Katrina Pavelin, Jennifer A. Cham, Paula de Matos, Cath Brooksbank, Graham Cameron, Christoph Steinbeck http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002554

BLU, etc. etc. I liked this bit:

“There is also a lack of incentive: it is the novelty of the tool that gets the paper published, not the UCD work associated with it. Moreover, once the paper has been published, there may be less motivation to improve the tool”

Discusses an EBI redesign focusing on users and how successful it ended up being (very).

Overall it makes a strong case for why usability is important, and suggests training people in UX who already have domain knowledge in software development and/or bioinformatics.

Good for: presenting a backing case in the intro of a paper with regards to why usability needs more focus.

Reading: Beyond power: making bioinformatics tools user-centered

This is an older paper, from 2004. It’s still entirely relevant, however - it begins by pointing out just how important making usable bioinformatics tools is, alongside the fact that many people are unlikely to adopt tools with poor usability if they’re used to richer interfaces elsewhere.

The researchers in this paper redesigned the NCBI website by aiming to adhere to known design patterns (Pattern Oriented Design), alongside a set of personas.

Why is this paper useful? Mostly as a backing reference saying we need to make bioinformatics more usable.

The Enzyme Portal

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-103 Matos, Paula de, Jennifer A. Cham, Hong Cao, Rafael Alcántara, Francis Rowland, Rodrigo Lopez, and Christoph Steinbeck. 2013. “The Enzyme Portal: A Case Study in Applying User-Centred Design Methods in Bioinformatics.” BMC Bioinformatics 14 (March): 103.

This article approaches things from the same direction I am inclined to: looking at the background of usability and making it clear that, in general, usability within bioinformatics is lacking, with several useful citations to others who have identified the same problem.

I could have quoted almost every section of the “background” in this paper, as it’s so useful. It takes into account the varying level of skills between computational biologists / bioinformaticians and wet lab scientists.

Personas: they interviewed people who fit the personas to ensure they were accurate, and ensured there were entry and exit criteria that satisfied each persona.

They used a group workshop, with a mix of researchers, PIs, PhD students, etc., to discuss and identify needs in the Enzyme Portal.

Paper prototype testing was followed by iterative interactive prototypes.

At the end, they reported specific findings about the Enzyme Portal rather than generalised methods.

Overall: a really good article; its early stages should be cited and used as inspiration for any of my related usability papers.

Reading - evaluating a tool's usability based on scientists' actual questions

B. Mirel, “Usability and Usefulness in Bioinformatics: Evaluating a Tool for Querying and Analyzing Protein Interactions Based on Scientists’ Actual Research Questions,” Professional Communication Conference, 2007. IPCC 2007. IEEE International, Seattle, WA, 2007, pp. 1-8.

Section 1: The intro discusses the need for bioinformatics software to help lab scientists find out more regarding the genes, proteins etc. that they are working with, and the fact that many of the available tools lack the usability to make the tools truly useful.

“Based on even this limited scope, findings show that when tools have surface level usability experimental scientists are able to readily engage in productive patterns of interaction for their research purposes. However, although they can easily find and operate features the interactions and outcomes are not ultimately useful.”

The author suggests that in order to be useful, tools need to be dedicated towards specific complex tasks. (Whilst it’s not explicitly stated, I’m inferring that overgeneralisation can harm usability and usefulness).

Section 2: Usability testing performed on bioinformatics tools is often too simplistic and doesn’t go into the depth of a real use-case, instead being a simple pre-defined task.

I may be missing parts of this article - I stopped reading and came back much later. Publishing for now.

Reading: CLI usability guidelines

Seemann, Torsten. “Ten recommendations for creating usable bioinformatics command line software.” GigaScience 2.1 (2013): 1-3.

This paper is written by someone with experience of CLI bioinformatics tools, covering 10 guidelines for greater CLI usability. Whilst I think I’m typically concerned with UIs, this may also be relevant.

They cover providing useful feedback, as well as general programming guidelines like avoiding hardcoding, managing dependencies, etc.

Overall these are reasonable and decent guidelines but probably not something I’ll refer to in the future.

Reading list discard

Veretnik, Stella, J. Lynn Fink, and Philip E. Bourne. “Computational biology resources lack persistence and usability.” PLoS computational biology 4.7 (2008).


Why not useful?

Sure, usability is lacking. This is known. But there is too much focus on the lack of persistence, e.g. outdated databases that aren’t maintained when grants run out. I care more about usability flaws - specifics - than the grant politics surrounding it. (Don’t get me wrong, I care about grants a lot, but I’m not sure that this is the context I’m looking for.)

Reviewed: 18 March 2016.

First reading article

Title: “Better bioinformatics through usability analysis”

Link: https://bioinformatics.oxfordjournals.org/content/25/3/406.full

More usable web applications would enable bioinformatics researchers to find, interact with, share, compare and manipulate important information resources more effectively and efficiently, thus providing the enabling conditions for gaining new insights into biological processes.

  • Sets tasks to investigate gene info in CATH, NCBI, BioCarta, and SwissProt regarding a breast cancer case. Observes users and encourages them to think aloud.
  • Find homologues in Drosophila.

CATH: Discusses “navigation usability”.

For large web repositories, however, the complexity of the information and navigation structures being designed and the multiplicity of micro-design interventions over time can cause designers to lose control of what is offered to the user at any given moment.

Different ages of sub-systems within a bioinformatics application can cause poor user experience - e.g. linking to old data from an up-to-date page (Section 4.2, re CATH). The user should always know what data they are working with.

Section 4.4, CATH: sorting browse-only data by subfamily can make it hard to find the desired item (e.g. opaque categories that the user has to scan manually).

Section 5, Search:

  • Alternative identifiers, e.g. spellings (oestrogen/estrogen) and synonyms of identifiers, need to be associated.
  • DBs assume knowledge of data model. “SwissProt, for example, uses names of databases to communicate the search domains: SwissProt/trEMBL, SwissProt/tremble (full), SwissProt/tremble (beta), PROSITE, NWT-Taxonomy, SWISS-2D Page, just to name a few. Instead of being able to select the ‘content domain’ to search for, the user is faced with a list of technical names of databases they may not be familiar with.”
  • Makes three recommendations for clearer searching. Inform user about:
    1. search scope
    2. ontology
    3. query syntax
  • Overlong result lists mean users either:

“(i) intimidated by the long list of items, they do not explore further and try to reformulate the query; (ii) they focus on the first, second or third results, hoping the first few results to be the most relevant ones (which is not always the case)”

  • “it is important to explicitly communicate to the user the actual ranking criteria used for displaying the results…. possibly, to allow sorting the obtained results by multiple, additional attributes (e.g. by publication/release date, by alphabetical order).”
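The first search recommendation - associating alternative spellings and synonyms with a canonical term - can be sketched as a simple normalisation step before querying. The synonym table below is a toy example of mine, not drawn from any of the databases the paper discusses.

```python
# Toy sketch: map alternative identifiers (spellings, synonyms) to a
# canonical search term, per the paper's Section 5 recommendation.
# The table entries are illustrative only.
SYNONYMS = {
    "oestrogen": "estrogen",
    "tumour": "tumor",
}

def normalise_query(term: str) -> str:
    """Return the canonical form of a search term, if one is known."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

print(normalise_query("Oestrogen"))  # prints "estrogen"
```

A real implementation would back this with a curated ontology or thesaurus rather than a hardcoded dict, but the principle is the same: the user should not need to guess which spelling the database indexed.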
