Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

Additional Resources


Instructors/Helpers

Attendees

Notes: 

Intro to Data

Jargon Busting

Resolved
Unresolved

A Computational approach 
Is it worth the time? https://xkcd.com/1205/ 

Keyboard Shortcuts
Link to keyboard shortcuts: https://en.wikipedia.org/wiki/Table_of_keyboard_shortcuts
Alt-Tab to switch between applications
Ctrl-Shift-v to paste plain text
Ctrl-Tab to switch to the next tab in your browser

Filenaming and formatting
Never look in someone's document folder: https://xkcd.com/1459/
Bad filenaming examples: http://20px.com/blog/2015/07/16/catalogue-bad-file-naming/

Plain text formats (.txt, .csv) are better than proprietary formats (.doc, etc.) for preservation
Markdown: a plain text format that can be exported to many other formats

Regular Expressions
https://regex101.com/
https://regexper.com

Cheatsheet
Square brackets can be used to define a list or range of characters to be found. So:
Then there are:



^[Oo]rgani.e\b.



^[Oo]rgani.e\w?\b
Matching words: (only at the start of the line, words can come after)

^[Oo]rgani.e\w?$
Matching words: (only word on line)
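
To see the difference between the two anchored patterns above, here is a minimal Python sketch. The sample lines are made up for illustration and are not from the workshop data; Python's `re` module stands in for regex101.

```python
import re

# Patterns copied from the notes above.
START_OF_LINE = re.compile(r"^[Oo]rgani.e\w?\b")
WHOLE_LINE = re.compile(r"^[Oo]rgani.e\w?$")

# Hypothetical sample lines, invented for this sketch.
samples = [
    "organise",
    "Organized the files",
    "Organiser",
    "we organise events",
    "organisation",
]


def matching(pattern, lines):
    """Return the lines in which the pattern finds a match."""
    return [line for line in lines if pattern.search(line)]


start_hits = matching(START_OF_LINE, samples)
whole_hits = matching(WHOLE_LINE, samples)

print(start_hits)  # word at the start of the line, more text may follow
print(whole_hits)  # the word must be the only thing on the line
```

The `$` anchor is what rules out "Organized the files": the word is found at the start, but the line does not end there.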
    

Fr[ea]nc[eh]


Write a regex to find colour and color - case insensitive.

/\bcolou?r\b/i
colou?r/i
[cC]olou*r
\b[Cc]ol[ou]r\b
colo.r/i
^[Cc]ol[o|ou]r$
^[Cc]olo+r$
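
One way to compare the answers above is to run each one against both spellings. The sketch below does that in Python, assuming the `/.../i` wrapper is dropped and lowercase test words are used so that no flag is needed. It checks only "color" and "colour", so a pattern that passes here may still match unwanted strings (for example, `u*` also accepts "colouuur").

```python
import re

# Candidate patterns from the notes, without the /.../i delimiters.
candidates = [
    r"\bcolou?r\b",
    r"[cC]olou*r",
    r"\b[Cc]ol[ou]r\b",
    r"colo.r",
    r"^[Cc]ol[o|ou]r$",
    r"^[Cc]olo+r$",
]

words = ["color", "colour"]

# For each pattern, record which of the two spellings it finds.
results = {
    pattern: [w for w in words if re.search(pattern, w)]
    for pattern in candidates
}

for pattern, hits in results.items():
    print(f"{pattern:20} matches {hits}")

# Case-insensitive matching is a flag in Python rather than a trailing /i.
insensitive = re.findall(r"\bcolou?r\b", "Color, colour, COLOUR", re.IGNORECASE)
print(insensitive)
```

Square brackets match exactly one character from the set, which is why `[ou]` and `[o|ou]` cannot cover the two-letter "ou" in "colour".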


\b\d{2}-\d{2}-\d{4}\b - to find dates in a document in dd-mm-yyyy or mm-dd-yyyy format
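
A small Python sketch of the date pattern above, run on a made-up sentence (the text is an assumption for illustration). Note that the regex only checks the shape of the string, so it cannot tell dd-mm-yyyy from mm-dd-yyyy.

```python
import re

# Pattern copied from the notes above.
DATE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")

# Hypothetical sample text.
text = "Accessioned 01-11-2019, revised 11-01-2019; ISO form 2019-11-01 is skipped."

found = DATE.findall(text)
print(found)
```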


How would you write a regex to find a publication format like "British Library : London, 2015" and "Manchester University Press : Manchester, 1999"?
.\s:\s\w,\s\d{4}
.+\s\:\s.+\s\:\s.+



\w*\s:\s\w*,\s\d{4}
\b\w?\s:\s\w?,\s\d{4}\b
\b\w\s+:\s\w\s,\d{4}\b
\w\s:\s\w,\d{4}
[A-Z]+\s:\s[A-Z],\s\d{4}


.* ?: .*, \d{4}
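
To check answers like the ones above, it can help to look at exactly what each pattern captures. The sketch below tests one answer from the notes and then a variant with named groups. The variant and its group names (`publisher`, `place`, `year`) are assumptions added for illustration, not a pattern from the workshop.

```python
import re

records = [
    "British Library : London, 2015",
    "Manchester University Press : Manchester, 1999",
]

# One answer from the notes: \w* cannot cross spaces, so only the last
# word of a multi-word publisher is included in the match.
partial = re.search(r"\w*\s:\s\w*,\s\d{4}", records[0]).group()
print(partial)

# Hypothetical variant: lazy groups separated by " : " and ", ".
PUBLICATION = re.compile(
    r"(?P<publisher>.+?)\s:\s(?P<place>.+?),\s(?P<year>\d{4})"
)

parsed = [PUBLICATION.fullmatch(r).groupdict() for r in records]
for row in parsed:
    print(row)
```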


https://regex101.com
https://github.com/LibraryCarpentry/lc-data-intro/blob/gh-pages/data/swcCoC.md

More exercises
Exercises: https://librarycarpentry.org/lc-data-intro/04-exercises/index.html

Git/GitHub

Go to github.com and sign up for an account! 


Tidy data
Spreadsheets are good for:


Spreadsheets are frustrating because:
    
Key rules of data management
  1. Do not modify raw data! (Always make a copy of your data and modify that)
  2. Keep track of what you do, for your future self and your collaborators. Tip: keep a separate document with notes. A README is always a good idea. 
  3. Observations are in rows. Variables in columns. Each cell should contain only one value and one type of data (i.e. don't mix strings and integers). 
  4. One spreadsheet per file. Don't have multiple tabs! Does it need to be a new file? Can you add a new variable (like year)?
  5. Add 0 when you mean 0. Don't leave it blank. NA is a safe way to represent missing values. 
  6. Don't use formatting to convey information. Don't use colors, fonts, etc. to represent information. Computers will not be able to process the formatting. 
  7. Don't put layout elements in the data (e.g. a merged title row at the top). It will mess up your analysis!
  8. Variable names should be self-describing, but not very long. No spaces or special characters. Underscores are a good option to replace spaces. 
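
Rules 1, 5 and 8 above can be sketched in a few lines of Python. Everything here is illustrative: the sample data, the column names and the `clean_name` helper are assumptions, and the cleaned rows are built as a new list so the raw text is left untouched.

```python
import csv
import io
import re


def clean_name(name):
    """Lowercase a header and replace runs of other characters with underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Hypothetical raw export: spaces and punctuation in headers, one blank cell.
raw = "Title,Pub. Year,No. of Copies\nDune,1965,0\nEmma,1815,\n"

reader = csv.reader(io.StringIO(raw))
header = [clean_name(h) for h in next(reader)]

# Blank cells become an explicit NA; a real 0 stays 0.
rows = [[cell if cell != "" else "NA" for cell in row] for row in reader]

print(header)
print(rows)
```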


Formatting data tables - Exercise 1: https://librarycarpentry.org/lc-spreadsheets/01-format-data/index.html

What was wrong with the data as it was structured and how did you fix it?
Problems:
Steps taken to fix it:


Dates: https://librarycarpentry.org/lc-spreadsheets/03-dates-as-data/index.html

Dates are really tough to handle in spreadsheets, because behaviour differs between software and platforms. Sometimes spreadsheet software will try to help us with our dates, but it can actually cause more problems down the line. 
 
Usually dates are stored in one column. 

Note for historic data: Excel will not parse dates from before 1899-12-31, but it will still reformat more recent dates. All the problems! 

Excel stores dates internally as numbers. Different platforms (e.g. Mac vs PC) can use different date systems, which makes moving files between systems hard. 
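
As a rough illustration of why the platform matters, the Python sketch below converts the same stored number under Excel's two date systems (1900-based and 1904-based). The serial value is a made-up example, and the 1900 conversion is a simplification that only holds for dates from 1 March 1900 onward.

```python
from datetime import date, timedelta

# Day-zero offsets for the two systems. The 1900 offset of 1899-12-30
# absorbs the non-existent 29 Feb 1900 that Excel counts, so it is only
# correct for serial numbers of 61 and above.
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def from_serial(serial, epoch):
    """Convert a whole-day serial number to a calendar date."""
    return epoch + timedelta(days=serial)


serial = 43770  # hypothetical cell value

as_1900 = from_serial(serial, EPOCH_1900)
as_1904 = from_serial(serial, EPOCH_1904)

print(as_1900.isoformat())
print(as_1904.isoformat())
print((as_1904 - as_1900).days)  # gap between the two interpretations
```

The same number lands about four years apart depending on which system the workbook uses, which is one more reason to store dates as separate year, month and day columns or as ISO strings.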

Tips:
1. Store dates in multiple columns (year, month, day, hour, minute, second) - Some handy Excel formulas include =YEAR(source); =MONTH(source); =DAY(source)
2. If using a string, be consistent! Consider a standard. The ISO 8601 standard is YYYY-MM-DD
3. Here is a quick reference for all the Excel tricks we did around dates: https://librarycarpentry.org/lc-spreadsheets/03-dates-as-data/index.html
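
The first two tips can also be sketched outside a spreadsheet. This minimal Python example, using a made-up date string, parses an ISO date, splits it into separate year, month and day values (the counterpart of =YEAR, =MONTH and =DAY), and writes it back out in YYYY-MM-DD form.

```python
from datetime import date

# Hypothetical value, already in ISO YYYY-MM-DD form.
d = date.fromisoformat("2019-11-01")

# One value per column, as in tip 1.
parts = {"year": d.year, "month": d.month, "day": d.day}

# Back to a consistent string, as in tip 2.
iso = d.isoformat()

print(parts)
print(iso)
```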
    
Day of the year formula: =A4-DATE(YEAR(A4),1,0) --> This example assumes your full date value is stored in A4. Some locales use ; instead of , as the argument separator.
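
The day-of-year formula above subtracts "day 0 of January", which is 31 December of the previous year. A minimal Python sketch of the same arithmetic, with a made-up date standing in for cell A4:

```python
from datetime import date

a4 = date(2019, 11, 1)  # hypothetical stand-in for the value in A4

# DATE(YEAR(A4), 1, 0) is the last day of the previous year.
last_day_previous_year = date(a4.year - 1, 12, 31)
day_of_year = (a4 - last_day_previous_year).days

# The standard library exposes the same number directly.
builtin_day_of_year = a4.timetuple().tm_yday

print(day_of_year, builtin_day_of_year)
```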



Feedback and additional resources:
    Post workshop survey: https://www.surveymonkey.com/r/lcpostworkshopsurvey?workshop_id=2019-11-01-rochester