Mapping data models. I think I would like to preregister this when we feel that the plan is as clear as possible.

Research question:

Draft wording: Does a person’s educational background in biology and computational subjects have an effect on their ability to create effective biological data models, and to map biological data onto those models?

Maybe break this down into sub-questions + null hypotheses, with the expectation that people with biology + less computing knowledge will differ from people with biology + computing knowledge, or even from people with computing knowledge + low biology. More notes about this here

Background questionnaire

  1. Educational background - biology (Do we want to get more specific than this?)
    • High School or lower
    • Undergraduate
    • Postgraduate or higher
  2. Computer background (How can we make this more granular?)
    • Casual computer use (uses software as needed for work, prefers to be shown new things and/or have help troubleshooting)
    • I write some code or tweak code but don’t consider myself to be a programmer. Excel spreadsheets, script tweaking, etc.
    • I write code on a regular basis.
  3. Possibly a better approach than the previous Q: check marks or a scale for familiarity?
    1. Biology software: Are you familiar with any of the following?
      • Galaxy
      • BioMart
      • GeneCards
      • InterMine
      • Molgenis
      • (etc. // add more)
      • other (?)
    2. Computing languages and tools (mix of common scripting languages + languages that teach users certain concepts, e.g. object-oriented programming, graph DBs, SQL relationships)
      • R
      • Python
      • SQL
      • Perl
      • Java
      • Functional languages, e.g. Haskell, Clojure
      • Graph databases, e.g. Neo4j
      • Git or other version control
      • Other
  4. What file formats do you work with, if any? (e.g. FASTA, GFF, BAM, VCF, etc…) (free form text)
  5. Do you focus on any specific organisms? (free form text, too many to list!)
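
To make the coding of responses concrete, here is a minimal sketch of how one participant's questionnaire answers could be stored for later use as independent variables. All class and field names (`BiologyEducation`, `ComputingLevel`, `BackgroundResponse`) are hypothetical, and the example participant is made up for illustration, not real data. The ordinal values 1 to 3 are an assumption that simply follows the order of the answer options above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class BiologyEducation(Enum):
    # Ordered as in question 1 (assumed ordinal coding).
    HIGH_SCHOOL_OR_LOWER = 1
    UNDERGRADUATE = 2
    POSTGRADUATE_OR_HIGHER = 3


class ComputingLevel(Enum):
    # Ordered as in question 2 (assumed ordinal coding).
    CASUAL_USE = 1
    SOME_CODE = 2
    REGULAR_CODE = 3


@dataclass
class BackgroundResponse:
    biology_education: BiologyEducation
    computing_level: ComputingLevel
    biology_software: Set[str] = field(default_factory=set)     # question 3.1 check marks
    computing_languages: Set[str] = field(default_factory=set)  # question 3.2 check marks
    file_formats: str = ""  # question 4, free-form text
    organisms: str = ""     # question 5, free-form text


# Illustrative participant only; not real data.
example = BackgroundResponse(
    biology_education=BiologyEducation.POSTGRADUATE_OR_HIGHER,
    computing_level=ComputingLevel.SOME_CODE,
    biology_software={"Galaxy", "InterMine"},
    computing_languages={"R", "SQL"},
    file_formats="FASTA, GFF",
)
```

If we go with a familiarity scale instead of check marks, the two sets could become dictionaries mapping each tool or language to a score.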

Research design

### Independent variables

User characteristics as listed above.

### Dependent variables

Models and mappings as generated by user.

Eliciting model details

Possible approach: card arrangement, with cards covering the following three types of data:

  1. Model entities, such as
    • Gene
    • Protein
    • Experiment
    • Publication
    • Organism
    • Data File
  2. Entity properties - e.g.
    • Gene Symbol
    • Gene Identifiers
    • Gene Orthologues
    • Organism name
    • Organism taxon ID
  3. Property values - e.g. for the previous properties -
    • Gene Symbol - BRCA1 (H. sapiens)
    • Gene Identifiers - ENSG00000012048
    • Gene Orthologues - Brca1 (M. musculus)
    • Organism name - H. sapiens
    • Organism taxon ID - 9606
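
As a rough sketch of how a participant's card arrangement could be transcribed afterwards, the three card types can be represented as nested records: entities hold property cards, and property cards hold an optional value card. The names `EntityCard`, `PropertyCard` and `related_entities` are hypothetical, and the Gene-to-Organism link is just one arrangement a participant might produce, not an expected answer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropertyCard:
    name: str
    value: Optional[str] = None  # the property value card, if one was placed


@dataclass
class EntityCard:
    name: str
    properties: List[PropertyCard] = field(default_factory=list)
    # Names of other entities this participant linked to this one.
    related_entities: List[str] = field(default_factory=list)


# One possible arrangement, using the example cards listed above.
gene = EntityCard(
    name="Gene",
    properties=[
        PropertyCard("Gene Symbol", "BRCA1 (H. sapiens)"),
        PropertyCard("Gene Identifiers", "ENSG00000012048"),
        PropertyCard("Gene Orthologues", "Brca1 (M. musculus)"),
    ],
    related_entities=["Organism"],  # assumed link, for illustration only
)

organism = EntityCard(
    name="Organism",
    properties=[
        PropertyCard("Organism name", "H. sapiens"),
        PropertyCard("Organism taxon ID", "9606"),
    ],
)
```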

Possible problems: Providing cards with all three of these types of data may result in people creating simple columns of data - e.g. a Gene header with cards underneath providing properties and property values.
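
One possible way to flag this "simple columns" outcome when transcribing results is to check whether a participant linked any entity to any other entity. This is only a sketch: the function name `is_column_layout` and the dictionary representation (entity name mapped to the list of entities it was linked to) are assumptions, not a settled analysis method.

```python
from typing import Dict, List


def is_column_layout(relationships: Dict[str, List[str]]) -> bool:
    """Return True when no entity is linked to any other entity."""
    return all(len(targets) == 0 for targets in relationships.values())


# Illustrative arrangements only; not real participant data.
columns_only = {"Gene": [], "Organism": []}
linked = {"Gene": ["Organism"], "Organism": []}
```

Here `is_column_layout(columns_only)` is `True` and `is_column_layout(linked)` is `False`.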

Possible ways to address this

A: Structure this as a three- or four-part process:

  1. Provide entities and ask people to arrange them and explain the relationships.
  2. Provide property cards and ask people to add them to the model.
  3. Provide the property value cards, and repeat.

Problem with this approach: I’m already imposing artificial constraints on the mental model here. People might not see things this way!

B: Provide people with all of the cards at once (or no cards?) and ask them to sketch how they think they’re related.

Q: Should we provide some sort of demo model? Something unrelated, e.g. pets and owners, or similar? Or would this lead people too much?
Q2: Populating the property values - I was thinking possibly famous genes / proteins / etc. from a mix of popular organisms - ones that are popular in undergrad courses, cancer genes that are associated with celebrities, etc.

## Analysis of results…

// TBD