Welcome to The Carpentries Etherpad!
This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try etherpad.wikimedia.org).
Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
Additional Resources
Instructors/Helpers
- Sarah Pugachev, University of Rochester, @sclayton29
- Vratika Chaudhary, University of Florida, @scinat1
Attendees
- Name, Institution, twitter?
- Ryan Perry, Central NY Library Resources Council rperry@clrc.org
- Tim Furgal, Southern Tier Library System, @HauntedOldBook / furgalt@stls.org
- Devin Kerr, SUNY Upstate, kerrd@upstate.edu
- Marcy Strong, University of Rochester, mstrong@library.rochester.edu
- Angela Grunzweig, University of Rochester, agrunzwe@library.rochester.edu
- Ryan Hughes, Rochester Regional Library Council
- Lindsay Stratton, Westchester Library System @LindsaySez
- Sarah Siddiqui, University of Rochester, @sarahsiddy1
- Patrick Williams, Syracuse University Libraries / @activitystory
- Wendy Way, Rochester Institute of Technology
- Beth Mamo, Rochester Regional Health @ehmamo
- Jen Barth, Henrietta Public Library
- Melissa McHenry, Gates Public Library
- Michael Riordan, Independent Good, @independentgood
- Andy Austin, Genesee Valley Educational Partnership (BOCES)
- Hannah Ralston, Henrietta Public Library
- Lara Nicosia, University of Rochester
- Rebekah Walker, Rochester Institute of Technology
- Mary Ann Warner, Schenectady County Public Library, @marlibrarian
- Emily Sherwood, University of Rochester, @emilygwynne
- Blair Tinker, University of Rochester
- Joe Easterly, University of Rochester, @joeeasterly
- Jessica Regitano, Chittenango Central Schools (middle school), @jessicaregitano
- Phil Mendoza, SUNY Westchester Community College, phil.mendoza@sunywcc.edu
- Lynne Kvinnesland, Colgate University, lkvinnesland@colgat.edu
Notes:
Intro to Data
Jargon Busting
Resolved
- String: textual data
- Metadata
- Normalization: Cleaning your data and standardizing it with controlled vocabularies, so that it is in a consistent and comparable state and can be used by others
- Versioning
- Version Control
- Data Viz.
Unresolved
- Big Data: "In traditional analysis, the development of a statistical model takes more time than the calculation by the computer. When it comes to Big Data this proportion is turned upside down. Big Data comes into play when the CPU time for the calculation takes longer than the cognitive process of designing a model." - Hadley Wickham
- Most of the words in the GitHub interface
- Normalization
- APIs: Something that makes two programs talk to each other (Application Programming Interface)
- Preservation and Access vs. Accessibility
- Data
- Refinement
- control structure
- git
- conditionals
- object/object oriented
A Computational approach
Is it worth the time? https://xkcd.com/1205/
Keyboard Shortcuts
Link to keyboard shortcuts: https://en.wikipedia.org/wiki/Table_of_keyboard_shortcuts
Alt-Tab to switch between applications
Ctrl-Shift-v to paste plain text
Ctrl-Tab to switch to the next tab in your browser
Filenaming and formatting
Never look in someone's document folder: https://xkcd.com/1459/
Bad filenaming examples: http://20px.com/blog/2015/07/16/catalogue-bad-file-naming/
Plain text formats (.txt, .csv) are better than proprietary formats (.doc, etc.) for preservation
Markdown -- Plain text format that can be exported into many different types
Regular Expressions
https://regex101.com/
https://regexper.com
Cheatsheet
Square brackets can be used to define a list or range of characters to be found. So:
- [ABC] matches A or B or C.
- [A-Z] matches any upper case letter.
- [A-Za-z] matches any upper or lower case letter.
- [A-Za-z0-9] matches any upper or lower case letter or any digit.
Then there are:
- . matches any character.
- \d matches any single digit.
- \w matches any single word character (equivalent to [A-Za-z0-9_]; note that the underscore is included).
- \s matches any space, tab, or newline.
- \ used to escape the following character when that character is a special character. So, for example, a regular expression that found .com would be \.com because . is a special character that matches any character.
- ^ is an “anchor” which asserts the position at the start of the line. So what you put after the caret will only match if they are the first characters of a line. The caret is also known as a circumflex.
- $ is an “anchor” which asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line.
- \b asserts that the pattern must match at a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:
- the regular expression mark will match not only mark but also find marking, market, unremarkable, and so on.
- the regular expression \bword will match word, wordless, and wordlessly.
- the regular expression comb\b will match comb and honeycomb but not combine.
- the regular expression \brespect\b will match respect but not respectable or disrespectful.
- * matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
- + matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
- ? matches when the preceding character appears zero or one time.
- {VALUE} matches the preceding character the number of times defined by VALUE; ranges, say, 1-6, can be specified with the syntax {VALUE,VALUE}, e.g. \d{1,9} will match any number between one and nine digits in length.
- | means or.
- /i is a flag placed after the closing delimiter that makes an expression case-insensitive (so [a-z] then behaves like [A-Za-z]).
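A few cheatsheet entries can be tried out in Python as well as on regex101. This is only a small sketch using Python's built-in re module; the sample words come from the cheatsheet, except "example.com" and "examplexcom", which are made up for illustration.

```python
import re

# Word boundaries: \b stops the pattern from matching inside longer words.
words = ["respect", "respectable", "disrespectful"]
bounded = [w for w in words if re.search(r"\brespect\b", w)]
print(bounded)

# Quantifiers: * is zero or more, + is one or more, ? is zero or one.
samples = ["ac", "abc", "abbbc"]
star = [s for s in samples if re.fullmatch(r"ab*c", s)]
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]
optional = [s for s in samples if re.fullmatch(r"ab?c", s)]
print(star, plus, optional)

# Escaping: \. matches a literal dot, while a bare . matches any character.
literal_dot = bool(re.search(r"\.com", "example.com"))
no_dot = bool(re.search(r"\.com", "examplexcom"))
print(literal_dot, no_dot)

# Case-insensitivity: Python takes a flag argument instead of a trailing /i.
ignore_case = bool(re.search(r"colou?r", "COLOUR", flags=re.IGNORECASE))
print(ignore_case)
```

Note that Python has no /pattern/i syntax, so the /i from the cheatsheet becomes flags=re.IGNORECASE.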
^[Oo]rgani.e\b.
^[Oo]rgani.e\w?\b
Matching words: (only at the start of the line, words can come after)
- Organized
- Organizer
- organizer
- Organizes
- organizes
- organise
- organize
- organize1
Not matched: Organizers (\w? allows at most one extra word character before the word boundary).
^[Oo]rgani.e\w?$
Matching words: (only word on line)
- Organized
- organized
- Organizer
- Organizes
Not matched: Organizers (\w? allows at most one extra character before the end of the line).
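The two patterns above can be compared side by side in Python. A minimal sketch: the test lines are a mix of words from the lists above plus two invented ones ("Organizes the shelves" and "reorganize") to show the effect of \b versus $ and of the ^ anchor.

```python
import re

lines = [
    "Organized",
    "organise",
    "organize1",
    "Organizers",
    "Organizes the shelves",  # invented: a match followed by more words
    "reorganize",             # invented: the word does not start the line
]

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    compiled = re.compile(pattern)
    return [line for line in candidates if compiled.search(line)]

# \b version: the word must start the line, but other words may follow it.
start_of_line = matching(r"^[Oo]rgani.e\w?\b", lines)

# $ version: the word must be the only thing on the line.
only_word = matching(r"^[Oo]rgani.e\w?$", lines)

print(start_of_line)
print(only_word)
```

"Organizers" fails both patterns because \w? can absorb only one of the two trailing letters.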
Fr[ea]nc[eh]
Write a regex to find colour and color - case insensitive.
/\bcolou?r\b/i
c/iolou?r
[cC]olou*r
\b[Cc]ol[ou]r\b
colo.r/i
^[Cc]ol[o|ou]r$
^[Cc]olo+r$
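To see how the attempts differ, six of them can be run against a handful of test words. This is a sketch, not a model answer: the test words are invented (including the deliberate misspellings "colouur" and "coloor"), and the trailing /i is translated to Python's re.IGNORECASE flag.

```python
import re

test_words = ["color", "colour", "Colour", "COLOR", "colouur", "coloor"]

# (pattern, flags) pairs copied from the attempts above.
attempts = {
    r"/\bcolou?r\b/i": (r"\bcolou?r\b", re.IGNORECASE),
    r"[cC]olou*r": (r"[cC]olou*r", 0),
    r"\b[Cc]ol[ou]r\b": (r"\b[Cc]ol[ou]r\b", 0),
    r"colo.r/i": (r"colo.r", re.IGNORECASE),
    r"^[Cc]ol[o|ou]r$": (r"^[Cc]ol[o|ou]r$", 0),
    r"^[Cc]olo+r$": (r"^[Cc]olo+r$", 0),
}

results = {}
for label, (pattern, flags) in attempts.items():
    compiled = re.compile(pattern, flags)
    results[label] = [w for w in test_words if compiled.search(w)]
    print(label, "->", results[label])
```

Only the first attempt finds all four correct spellings without also accepting a misspelling. Square brackets match a single character, so [ou] and [o|ou] cannot stand for "o or ou".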
\b\d{2}-\d{2}-\d{4}\b - to find dates in a document in dd-mm-yyyy format or mm-dd-yyyy format
How would you write a regex to find a publication format like "British Library : London, 2015" and "Manchester University Press : Manchester, 1999"?
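The date pattern finds the text but cannot tell which order the day and month are in. A short sketch of that, assuming an invented sample sentence and a hypothetical helper named possible_readings:

```python
import re
from datetime import datetime

# Invented sample text for illustration.
text = "Held on 01-11-2019, repeated on 15-11-2019; the ISO form 2019-11-01 is not matched."

found = re.findall(r"\b\d{2}-\d{2}-\d{4}\b", text)
print(found)

def possible_readings(date_string):
    """Return every valid ISO date the string could mean (hypothetical helper)."""
    readings = []
    for fmt in ("%d-%m-%Y", "%m-%d-%Y"):
        try:
            readings.append(datetime.strptime(date_string, fmt).date().isoformat())
        except ValueError:
            pass  # e.g. a "month" of 15 is not a valid date
    return readings

ambiguous = possible_readings("01-11-2019")
unambiguous = possible_readings("15-11-2019")
print(ambiguous, unambiguous)
```

A string like 01-11-2019 has two valid readings, so the regex alone is not enough to standardize such dates.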
.\s:\s\w,\s\d{4}
.+\s\:\s.+\s\:\s.+
\w*\s:\s\w*,\s\d{4}
\b\w?\s:\s\w?,\s\d{4}\b
\b\w\s+:\s\w\s,\d{4}\b
\w\s:\s\w,\d{4}
[A-Z]+\s:\s[A-Z],\s\d{4}
.* ?: .*, \d{4}
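The attempts above can be checked against the two example strings in Python. This is a sketch only; the last pattern (FIELDS, with named groups) is a hypothetical variant that was not part of the pad, added to show how the three parts could be pulled out separately.

```python
import re

examples = [
    "British Library : London, 2015",
    "Manchester University Press : Manchester, 1999",
]

# The attempts from the pad, in the order they were written.
attempts = [
    r".\s:\s\w,\s\d{4}",
    r".+\s\:\s.+\s\:\s.+",
    r"\w*\s:\s\w*,\s\d{4}",
    r"\b\w?\s:\s\w?,\s\d{4}\b",
    r"\b\w\s+:\s\w\s,\d{4}\b",
    r"\w\s:\s\w,\d{4}",
    r"[A-Z]+\s:\s[A-Z],\s\d{4}",
    r".* ?: .*, \d{4}",
]

# Keep only the attempts that are found in both example strings.
working = [p for p in attempts if all(re.search(p, e) for e in examples)]
print(working)

# Hypothetical variant (an assumption, not from the pad) that captures each part.
FIELDS = re.compile(r"^(?P<publisher>.+?)\s:\s(?P<place>.+?),\s(?P<year>\d{4})$")
records = [FIELDS.match(e).groupdict() for e in examples]
print(records)
```

Most of the attempts fail because \w without a quantifier matches exactly one character, so it cannot cover a whole word such as "London".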
https://regex101.com
https://github.com/LibraryCarpentry/lc-data-intro/blob/gh-pages/data/swcCoC.md
More exercises
Exercises: https://librarycarpentry.org/lc-data-intro/04-exercises/index.html
Git/GitHub
Go to github.com and sign up for an account!
Tidy data
Spreadsheets are good for:
- Sorting/Filtering
- metadata
- Budgeting
- Statistics
- Making lists
- Survey Results
- Web/App Form Inputs
- Creating tables
- Pivot Tables
- Normalizing values
Spreadsheets are frustrating because:
- Clunky
- Pivot Tables
- Formulas
- Macros
- Funky date formatting
- Cell values can't have multiple attributes
- Repeating values can be a hassle
- Feels like unharnessable information
- Hard to manage multiple datapoints - relatively flat
- hard to manage large data sets
- Poor interface (Excel)
Key rules of data management
- Do not modify raw data! (Always make a copy of your data and modify that)
- Keep track of what you do, for your future self and your collaborators. Tip: Have a separate document with notes. A README is always a good idea.
- Observations are in rows. Variables in columns. Each cell should contain only one value and one type of data (i.e. don't mix strings and integers).
- One spreadsheet per file. Don't have multiple tabs! Does it need to be a new file? Can you add a new variable (like year)?
- Add 0 when you mean 0. Don't leave it blank. NA is a safe way to represent missing values.
- Don't use formatting to convey information. Don't use colors, fonts, etc. to represent information. Computers will not be able to process the formatting.
- Don't put layout elements in the data (e.g. a merged title at the top). It will mess up your analysis!
- Variable names should be self-describing, but not very long. No spaces or special characters. Underscores are a good option to replace spaces.
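Some of these rules can be checked automatically once the table is saved as CSV. A minimal sketch using only Python's standard library: the sample table, its column names, and the check_table helper are all invented for illustration.

```python
import csv
import io
import re

# Hypothetical sample table; names and values are invented for illustration.
RAW = """event date,attendees,cancelled
2019-11-01,24,no
2019-11-08,,no
"""

# Assumed naming rule: letters, digits and underscores only, starting with a letter.
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def check_table(csv_text):
    """Return (bad_column_names, blank_cells) for a small CSV table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    bad_names = [name for name in header if not VALID_NAME.match(name)]
    # Row numbers count the header as row 1, as a spreadsheet would.
    blanks = [
        (row_number, header[col])
        for row_number, row in enumerate(body, start=2)
        for col, value in enumerate(row)
        if value.strip() == ""
    ]
    return bad_names, blanks

bad_names, blanks = check_table(RAW)
print(bad_names)
print(blanks)
```

Here "event date" is flagged for its space (event_date would pass), and the blank attendees cell is reported so someone can decide whether it should be 0 or NA.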
Formatting data table- Exercise 1: https://librarycarpentry.org/lc-spreadsheets/01-format-data/index.html
What was wrong with the data as it was structured and how did you fix it?
Problems:
- Spaces between D and the month
- Highlight for cancelled events
- Dates not standardized
- Dates in 2017 may be a mix of month/day and day/month (unclear)
- Blank cells don't indicate whether the event was cancelled or if data was not recorded
- Data for 20 Feb 2017 should have GQ & DF in different rows
- Two dates in one cell
- Three data points in one cell
- Data collection strategy changed over time
Steps taken to fix it
- Split PGR/PDRA/Other into 3 columns
- Standardize date format (ISO?)
- Add Cancelled column
- Copied cells for two day training
Dates: https://librarycarpentry.org/lc-spreadsheets/03-dates-as-data/index.html
Dates are really tough to handle in spreadsheets, and behavior differs between software and platforms. Sometimes spreadsheet software will try to help us with our dates, but it can actually cause more problems down the line.
Usually dates are stored in one column.
Notes for historic data: Excel will not parse dates from before 1899-12-31, but it will change more recent data. All the problems!
Excel stores dates internally as numbers. Different platforms (e.g. Mac vs. PC) have used different date systems, which makes moving between systems hard.
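To make the "dates are numbers" point concrete, here is a sketch of turning an Excel serial number back into a calendar date in Python. It assumes the two commonly documented Excel date systems (1900 and 1904); the function name excel_serial_to_date is made up, and the 1900 branch is only valid for serials of 61 and above because Excel counts a non-existent 29 February 1900.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial, date_system=1900):
    """Convert a whole-number Excel date serial to a date (hypothetical helper)."""
    if date_system == 1900:
        # Offsetting from 1899-12-30 absorbs Excel's phantom 1900-02-29,
        # so this is only correct for serials >= 61 (1900-03-01 onwards).
        if serial < 61:
            raise ValueError("serials below 61 are not handled in this sketch")
        epoch = date(1899, 12, 30)
    elif date_system == 1904:
        # In the 1904 system, serial 0 is 1904-01-01.
        epoch = date(1904, 1, 1)
    else:
        raise ValueError("date_system must be 1900 or 1904")
    return epoch + timedelta(days=serial)

workshop_1900 = excel_serial_to_date(43770, date_system=1900)
workshop_1904 = excel_serial_to_date(42308, date_system=1904)
misread = excel_serial_to_date(43770, date_system=1904)
print(workshop_1900, workshop_1904, misread)
```

The same serial read under the wrong system lands 1462 days away, which is why a file can show shifted dates after moving between systems.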
Tips:
1. Store dates in multiple columns (year, month, day, hour, minute, second) - Some handy Excel formulas include =YEAR(source); =MONTH(source); =DAY(source)
2. If using a string, be consistent! Consider a standard. The ISO standard is YYYY-MM-DD
3. Here is a quick reference for all the Excel tricks we did around dates: https://librarycarpentry.org/lc-spreadsheets/03-dates-as-data/index.html
Day of the year formula: =A4-DATE(YEAR(A4),1,0) --> This example assumes your full date value is stored in A4. Depending on your locale, Excel may expect ; instead of , as the argument separator.
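The same tips can be tried outside a spreadsheet. A small Python sketch, assuming the date is already stored as an ISO string (the workshop date is used as the sample value):

```python
from datetime import date

# Parse an ISO 8601 date string (YYYY-MM-DD).
d = date.fromisoformat("2019-11-01")

# Tip 1: split the date into separate year, month and day values.
year, month, day = d.year, d.month, d.day

# Day of the year, mirroring the spreadsheet formula:
# subtract 31 December of the previous year from the date.
day_of_year = (d - date(d.year - 1, 12, 31)).days

# Python also has this built in, which makes a handy cross-check.
built_in = d.timetuple().tm_yday

print(year, month, day, day_of_year, built_in)
```

Writing the date back out with d.isoformat() gives the YYYY-MM-DD string again, so nothing is lost by splitting it.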
Feedback and additional resources:
Post workshop survey: https://www.surveymonkey.com/r/lcpostworkshopsurvey?workshop_id=2019-11-01-rochester