Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents. Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try etherpad.wikimedia.org). Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/ Workshop website: https://scottcpeterson.github.io/2019-12-16-UCBerkeley/ Library Carpentry Lessons: https://librarycarpentry.org/lessons/ Sign In Here: Zac Painter (Stanford, Instructor) Scott Peterson (instructor) UC Berkeley Rick Jaffe (UC Berkeley, Helper) Ariel Deardorff (UCSF, Helper) Celia Emmelhainz (UC Berkeley, Instructor) Christina V. Fidler (UC Berkeley) Becky Escamilla (UCB LIT) Janine Gericke (Bancroft, janineg@berkeley.edu) Kristina Bush (UC Berkeley Library) Natalia Estrada (UCB Library) Molly Rose (UCB Library) Megan Van Noord (UC Davis) Amelia Llorens (Stanford, llorensa@stanford.edu) Kristen Greenland (UCB) Marri Atienza (UC Berkeley Library) Daina Dickman (Sacramento State, dickman@csus.edu) Christy Navarro (UC Davis Blaisdell Medical Library, Cenavarro@ucdavis.edu) Sam Teplitzky (UCB Library) Stephen Kiyoi (UC Davis) Yuko Okubo (UCB Library) DeeAnn Tran (SJSU Library) Ann Glusker (UCB Library) Tuzun Guvener (Freelance Consulting, guvenerzt@gmail.com) Connie Wong (Stanford, hongnei@stanford.edu) Eric Fernandez (UCB LIbrary) Tony Sucheston (UCB Library) ########## Jargon Busting & Tidy Data Jargon Busting (Terms to define): -Git - version control software (for keeping track of different versions of your files/code ex: file_v1.txt is a basic form of version control) -Repository - a folder for your stuff/code -GitHub - a central online place to store/share code - Unix: scripting language underlying your computer - SHELL - a way of interacting directly with the computer - writing commands to tell your computer what to do - terminal - the place where you type your commands to the computer - BASH - a kind of shell (bourne again shell) - Zsh - another kind of shell - GUI: graphical user interface (clicking on folders/files with a mouse) - pipes - a way to use the output of one set of commands as the input for another set of commands - GREL - the programming language behind Open Refine (tool for cleaning data introduced on day 2) - Regular expressions (regex) - shortcut commands to find and filter data - pattern matching - Python - a programming language great for working with data science/big data/regular data R - a programming language that started for statistics but is now very popular for a variety of uses - especially by librarians :) - SQL - a programming language for working with databases - Cloud computing - lots of computers networked together to give you more power - Cluster computing - like the above, but more localized to one specific location - open source - when you make your code available for free to others to use/build upon (open access code) Libre--an ethos about contibuting and making sure it is always free. Free--it is no cost, but has financial backers behind it. Can not develop your code. Password manager: BitWarden Regex--Regular expression--a search for a patterns Rexex101: https://regex101.com/ grep--global regular experssions print Tidy Data: Download the data here: https://librarycarpentry.org/lc-spreadsheets/data/training_attendance.xlsx Save as (modify name) and note changes on copy. Do not directly edit the original sheet. Notes tab - codebook- or readme-like...keep track of change there Variables in columns Observations in rows Don't combine multiple pieces of information in the same cell Leave raw data raw PGR|PDRA|Other - mixed values in one cell - hard to keep track. Cancelled should be its own column - color coding will get lost in export; can sort by value if its in a column; would be clear which table and row the value applies to 2016 tab: Split tables into two tabs so RDM training and Open access are on separate sheets Combine tables across years Combine workshops, too? Add cancelled column, enter Y or N; clear formatting (Don't forget to update notes tab!) dates - five digit string to represent days since Jan 1, 1900. Excel has problems with dates prior to 1900 ISO-8601 - standard time and date format. Designed to reduce ambiguity when communicating across cultural barriers and machines. YYYY-MM-DD Carpentries suggests splitting year, month and date into separate columns Split values in RDM training column C - in Excel, Data> Text to table. Wizard appears; define delimiter = '|' (Add two empty columns to the right of column C to avoid overwriting data in columns D and E) Other issues: undefined abbreviations in headings; combined values in Delivered by cell; use same number of decimal places in all numbers. Empty values: blank, null. DO NOT use: 0, -999, 999, NA ########## Unix (Shell) folder to download for lesson: https://raw.githubusercontent.com/librarycarpentry/lc-shell/gh-pages/data/shell-lesson.zip https://www.theverge.com/2019/6/4/18651872/apple-macos-catalina-zsh-bash-shell-replacement-features (In Catalina, Apple replaced the Bash shell with zsh; older versions still use Bash) REPL: Read, evaluate, print, loop (see, e.g., https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop) MARCtools on GitHub - various command line (shell) utilities for MARC records (https://github.com/tomcrane/MarcTools ??) / Root Directory (the default, base place everything comes from) ~ Home Directory (where you usually go to first in a Terminal) $ prompt to enter text Commands: pwd = print working directory - shows you where in your directory (folder and file) structure you are ls = list contents of the current working directory cd is a command to change directory. It takes "arguments" - i.e., additional information provided to the command. For example: cd .. = moves you to the directory up one level in the folder tree cd - = moves to previous directory. Use this to toggle between two directories cd ~ = move to your home directory cd = (i.e., without an argument) also a way to move to your home directory ls -l = list contents in long list format ls -lh = long list format, human readable Arguments are separated by a space use quotes around arguments that contain spaces, e.g. cd "evil plan to destroy the world" For help: manual page (man command) man ls (Mac) ls --help (Windows) >> On Mac, type q to quit out of man and return to prompt clear = clears screen Open up a GUI window showing the current directory: open . (Mac) explorer . (Windows) xtg-open . (Linux) *Using ls command** Use the manual page or help pages and figure out how to list the files in a directory ordered by their filesize. Try it out in different directories. Afterwards, find out how you can order a list of files based on their last modification date. Try ordering files in different directories. ls -lSh. = list in long list form, Sorted by filesize, displaying human-readable filesizes ls -lt = list in long list form, sorted by modification date navigate to shell-lesson cd ~/Desktop/shell-lesson mk firstdir make directory, name it firstdir tab autocomplete ... start typing, click tab, shell will autocomplete the argument up to the point that one or more files or folders share a common name, that is, moving from left to right, each character is the same). At that point, you can type another character or more to disambiguate the name, then hit tab again to autocomplete further. cat = concatenate command control - c = kills the current command head [filename] = look at top 10 lines (by default) of a file. Empty lines count. tail [filename] = look at last 10 lines (by default) of a file. Hint: use tab autocomplete to save typing when entering file name head -n [filename] = n argument determines number of lines to look at tail -n [filename] works the same. less [file] = read file. within less: G = move to end g= move to beginning q = quit less for help: man less or less --help less -N [filename] = number lines Shell history Use up and down arrows to move line by line through the commands that you have entered. Hit enter to replay a command Wildcards: ?, * (shift+8) ? matches one character * matches zero or more characters Shell converts wildcard before passing argument to command. Command never sees the wildcard. Rename a file: mv. (think of renaming as moving from an old filename to a new filename) e.g., mv 829-0.txt gulliver.txt changes name of file 829-0.txt to gulliver.txt **Copy a file (cp)** Instead of moving a file, you might want to copy a file (make a duplicate), for instance to make a backup before modifying a file. Just like the mv command, the cp command takes two arguments: the old name and the new name. How would you make a copy of the file gulliver.txt called gulliver-backup.txt? **Renaming a directory** Renaming a directory works in the same way as renaming a file. Try using the mv command to rename the firstdir directory to backup. **Moving a file into a directory** If the last argument you give to the mv command is a directory, not a file, the file given in the first argument will be moved to that directory. Try using the mv command to move the file gulliver-backup.txt into the backup folder To remove (i.e. delete) a directory or file: use the rm command. NOTE: THERE IS NO UNDO!!!! Take care. Better yet, use the -i argument to be asked if you really want to delete. For example: rm -i diary.html remove diary.html? y (type y or n for yes or no) history command shows the history of the shell session: history One history is displayed, scroll up to see past commands Do a reverse-i-search through your history: Control-R '[text-to-find]. Control-R to continue finding earlier matches. Click return to replay the command that has been found. Use ![line of command in history] to replay that command To print out history to a text file: history > history.txt echo command Prints what you type to the screen. Use quotes to contain text with spaces Useful in a for loop for outputting current status of script to the screen Variables name=Scott creates the variable 'name' and gives it the value 'Scott' $variablename refers to variable. echo $name echo "Finally, it is sunny on" $(date) (date is a built-in variable.) loops - let us go through multiple files instead of just one touch - a command to create a file for loops: for filename in *.doc do echo "$filename" cp "$filename" backup_"$filename" done the above copies a list of files that end in .doc and adds backup to the beginning Complete the blanks in the for loop below to print the name, first line, and last line of each text file in the current directory. for file in *.txt do echo "$file" head -n 1 "$file" tail -n 1 "$file" done nano - a built in text editor to save in nano: ctrl + o to save in nano: ctrl + x save our bash scripts as a .sh file to run our program type bash and then name of the file: bash myfirstbashscript.sh wow we wrote and ran our first script! :) wc: word count command wc -l *.tsv now we are going to take the output and save that as a text file: wc -l *.tsv > lengths.txt now lets sort that file by length of the line and then save that as a new file: sort -n lengths.txt > sorted-lengths.txt now we want to do less work by using a pipe! if we want to take the wordcount and then sort it we can do: wc -l *.tsv | sort -n now we want to get the top line: wc -l *.tsv | sort -n | head -n 1 - **Count, sort and print (fadedgtail -n 2 **Counting number of files** Let’s make a different pipeline. You want to find out how many files and directories there are in the current directory. Try to see if you can pipe the output from ls into wc to find the answer, or something close to the answer. ls | wc **Writing to files** The date command outputs the current date and time. Can you write the current date and time to a new file called logfile.txt? Then check the contents of the file. date > logfile.txt **Appending to a file** While > writes to a file, >> appends something to a file. Let’s append the current date and time to the file logfile.txt? date >> logfile.txt ****Dates and OS X***** OS X doesn't like spaces in filenames in command line so date has to be formatted Example: echo $(date +%Y_%m_%d) write to file with current date example: grep i revolution *.tsv > "$(date +%y_%d_%m)_JAi-revolution.tsv" grep! - a useful tool for finding things in our files to find where 1999 appears in our tsv files: grep 1999 *.tsv to count the number of times 1999 appears in each tsv file: grep -c 1999 *.tsv to find all instances of french or france in our tsv files: grep -iwEo 'fr[ae]nc[eh]' *.tsv on macs: grep -E '\d{4}-\d{4}' 2014-01_JA.tsv >issns.tsv on pcs: grep -P '\d{4}-\d{4}' 2014-01_JA.tsv >issns.tsv grep -Eo '\d{4}-\d{4}' 2014-01_JA.tsv | sort | uniq | wc -l time for a for loop! for name in "Jo" "Meg" "Beth" "Amy" do echo "$name" grep -wo "$name" littlewomen.txt | wc -l done Day two Attendance Scott Peterson UC Berkeley Zac Painter (Stanford) Eric Fernandez (UCB LIbraryIT Kristina Bush (UCB) Becky Escamilla Daina Dickman (Sacramento State) Amelia Llorens (Stanford) Ann Glusker (UCB) Kristen Greenland (UCB) Marri Atienza (UC Berkeley) Natalia Estrada (UC Berkeley) Molly Rose (UCB) Janine Gercke (UC Berkeley) Tony Sucheston (UCB) Connie Wong (Stanford) DeeAnn Tran (SJSU) Rick Jaffe (UC Berkeley, Helper) Megan Van Noord (UC Davis) Christy Navarro (UC Davis) Stephen Kiyoi (UC Davis) Celia Emmelhainz (UC Berkeley) Anna Sackmann (UC Berkeley) Yuko Okubo (UCB) ########## Git Good at tracking changes from multiple individuals, highlighting conflicts between individual contributions ("commits") Create a folder (mkdir) on desktop (name it what you want), change (cd) to that directory git init git config --list - shows your global git configuration values git config --global user.name "yourname" git config --global user.email "yourgithubemail@domain.xyz" ########## OpenRefine Notes: https://librarycarpentry.org/lc-open-refine/ Download DOAJ file here: https://librarycarpentry.org/lc-open-refine/setup.html http://http://127.0.0.1:3333/ Difference between rows and records. We will see this when we split Authors column in multiple rows: Author dropdown > Edit cells > Split multivalued cells... [pop-up window appears] by Separator ... enter '|' (without quotes) Click ok Note that new rows have been added, without a title... these rows belong to the record above To join rows again: Author dropdown > Edit cells > Join multivalued cells ... Enter Separator ... '|' click ok Facets - try these: Publisher pulldown > Facet > Text facet Unique values appear in Facet column at left Select a facet to see records with that value Click Edit to change this value in all records. Type exclude to de-select that facet License pulldown > Facet > Text facet Language pulldown > Facet > Text facet Select 'English' Click 'Edit' Change value to 'EN' Click Apply to make a mass edit Clustering: Compare similar values within a column To begin, let's resplit the values in the Author column Author pulldown > Edit cells > Split multivalues cells ... by Separator '\' To cluster: Author pulldown > Edit cells > Cluster & Edit Q. Can I identify only the values that differ by a period after an initial and merge those? To adjust columns: All pulldown > Edit columns > Re-order / remove columns To undo: Click Undo/Redo (next to Facet/Filter) Step up to step prior to one that you want to undo. All later steps will be undone, too. On Undo/Redo screen, you can extract your steps to a file that can be replayed later. Transformations: e.g., splitting values in one column into multiples, standardizing date formatting, trimming white space, etc. Uses GREL (General Refine Expression Language) Publishers > Facet > Text facet Note that MDPI AG appears twice, one with 3 entries Select that facet. Publishers > Edit cells > Common transformations > Trim leading and trailing whitespace Also try: ... Title > Edit cells > Common transformations > To uppercase Title > Edit cells > Common transformations > To titlecase Title > Edit cells > Transformations Enter in text field: value.toUppercase() ok value.toTitlecase() ok to see history: Title > Edit cells > Transformations ... click History tab to see your previous transformations Export button in upper right of screen to export the project or the data Date transformations Date pulldown > Edit cells > Transformations GREL: value.toDate("dd/MM/yyy") ok Working with data types other than Strings To create a Date column that is more readable: Date pulldown > Edit column > Add column based on this column... Enter: Column name: Formatted date GREL: value.toString("dd MMMM yyyy") click ok Boolean searches on Authors Example: Authors > Edit cells > Transformations if(value.contains("Flavia"), "It's Flavia!", value) To create a facet showing Authors' names with a comma: Start by splitting Authors by pipe (see above) Authors > Facet > Custom text facet... value.contains(",").toString() Click on True facet to see Authors' names that contain a comma {Beware: The following notes get sketchy!} Quick discussion of arrays [value 0, value 1, value 2 ...] To see: Subjects > Edit cells > transform value.split() To split, sort alphabetically, then rejoin with a pipe value.split("|").sort().join("|") Changing last, first to first last Start with facet showing Authors' names with a comma, select that facet (see above) Authors dropdown > Edit cells > Transformer value.match(/(.*),(.*)/).reverse().join(" ") ok {Notes are caught up again!} Connecting to external data From an external URL: ex. Journal ISSN to get information from CrossRef Select one row (click star at left of row) All pulldown > Facet > Facet by star Select that facet ISSN pulldown > Edit column > Add column by fetching URL... On that screen, give column name 'Journal details'; click 'show' link next to HTTP headers to be used ...; add mailto:[emailaddress] to User-agent line, e.g. OpenRefine 3.2 [55c921b]; mailto:rjaffe@berkeley.edu In 'Formulate the URLs to fetch' field, enter: "https://api.crossref.org/journals/"+value click ok You should see a new column named "Journal details" with a mess of text in it. To trim down the text: Journal details pulldown > Edit column > New column based on column Enter Column name "Journal title" Enter GREL: value.parseJson().message.title click ok Reconciling our data against an external source We will use Viaf reconciliation server written by Jeff Chu (public service) Reconcile publisher Publisher pulldown > Reconcile... > Start reconciling Add standard service Enter URL: http://refine.codefork.com/reconcile/viaf Select Viaf in screen that appears Uncheck "Auto-match candidates with high confidence" Click Two facets appear. You can close 'Publisher: Best candidate score' Publisher pulldown > Facet > Text facet Select facet 'International Union of Crystallography Click double check next to 'International Union of Crystallography' to accept our local name for all values Click reset in facets panel to show all facets To change all values at once: Publisher pulldown > Reconcile... > Actions > Match each cell to its best candidate Create column with VIAF ID: Publisher pulldown > Edit column > Add column based on this column Name the column VIAF ID Enter GREL: cell.recon.match.id Rejoin rows into records with multivalued author fields (see above) Now you can export cleaned data as a .csv file You might want to export your steps at this point, too: UNDO/REDO > Extract --- Discussion of Open Refine extensions (google it) Download and add to webapp folder where your Open Refine application files are located Postworkshop survey link: https://carpentries.typeform.com/to/UgVdRQ?slug=2019-12-16-UCBerkeley