Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

----------------------------------------------------------------------------

* Welcome to Software Carpentry!

*Links:
Workshop Website: https://uw-madison-datascience.github.io/2022-06-13-uwmadison-swc/
Intro Slides: https://docs.google.com/presentation/d/10WG7GYS7Egg0v1yAjQHlHdYTuK0x7FP9iAf93rxys0Y/edit?usp=sharing
Daily Feedback Form: https://forms.gle/Bo3DXNAoEC3D5P7ZA
Pre-Workshop Survey: https://carpentries.typeform.com/to/wi32rS?slug=2022-06-13-uwmadison-swc
Post-Workshop Survey: https://carpentries.typeform.com/to/UgVdRQ?slug=2022-06-13-uwmadison-swc

*Follow-Up Resources
Coding Meetup (office hours): https://datascience.wisc.edu/hub/#dropin
Data Science Hub Newsletter (upcoming workshops, seminars, jobs): https://datascience.wisc.edu/newsletter/
Workshops via the Data Science Hub: https://datascience.wisc.edu/training-resources/
Additional resources in wrap-up slides: https://docs.google.com/presentation/d/1DVWf61qHJvVhbaxKXK-xfT7d6Z1-hdi7mUEsf1vkTQI/edit?usp=sharing

*Day 1
Unix Shell lesson: https://swcarpentry.github.io/shell-novice/

*Sign in
Name, pronouns (optional), department/program/affiliation, describe your research/work in 1-2 sentences.
* Chris Endemann (he/him), Data Science Hub, I help researchers apply data science and machine learning tools to their research projects
* Scott Prater (he/him), UW Digital Collections Center, digital library architect, I manage projects, technical infrastructure and strategy for digital libraries
* Mary Murphy (any pronouns), Research Cyberinfrastructure, I lead the Electronic Lab Notebook SaaS and help with other research cyberinfrastructure needs
* Daven Quinn (he/him), Research scientist, Geoscience department. I am a structural geologist who builds data and software infrastructure for geoscience research, including especially https://macrostrat.org.
* Trisha Adamus (she/her), Ebling Library. I help researchers with their data needs.
* Louk (he/him), PREP research student
* Nikhil Damle (he/him) - BDS Summer Research Program
* Taylor Boldoe (she/her) - BDS Summer Research Program
* Quinn White (she/her) - BDS Summer Research Program
* Nistha Panda (she/they) - BDS SROP
* Mackenzie Ray (she/her), PREP
* Trenton Mercadel (he/him), PREP
* Josselyn Muñoz (she/her), PREP
* Jiewen Chen (she/her), PREP
* Marlin Lee (he/him)
* Sarai Garcia (she/her), PREP - Incarceration and Mental Health Lab
* Lisa Padua (she/her), PREP
* Samantha Voelker (she/her), UW Health Systems - Implementation Science and Engineering Lab
* Nadeshka J. Ramirez (she/her), PREP
* Nafisa Raisa (she/her), Biomedical Data Science

*Notes
# open Git Bash (Windows) or Terminal (Mac)
# bash stands for "Bourne Again SHell"
# this next step is optional, but can help with readability in bash
export PS1="$ "
# list directory contents, then move to the Desktop
ls
cd Desktop
# download the data from: https://swcarpentry.github.io/shell-novice/data/shell-lesson-data.zip
# move the folder to your desktop and unzip it (Desktop/shell-lesson-data)
# check where you are (your path) using pwd (print working directory)
pwd
# run ls to confirm that you can see your data folder, shell-lesson-data, on your desktop
ls
# use tab-complete (press tab as you start typing a folder name) to auto-complete folder names
cd shell-lesson-data/
# run ls again; you should see the exercise-data/ and north-pacific-gyre/ folders
ls
# move to a specific data folder
cd north-pacific-gyre/
pwd
ls
# what if I want to go back to my desktop?
cd Desktop   # this doesn't work
# move UP one directory
cd ..
pwd
cd ..
pwd
cd shell-lesson-data/
pwd
ls
cd north-pacific-gyre/
pwd
# move back two folders at once
cd ../..
pwd

# back from break; navigate to your Desktop folder, using pwd to determine where you currently are and 'cd ..' to move up a directory
cd Desktop
# the -F flag appends an indicator to folder and file names: * indicates an executable file, / after an item indicates a folder
ls -F
# view help/documentation for a command
ls --help   # windows
man ls      # mac (use q to quit the manual view)
# list items in a folder with extra info about file size, creation date
ls -l
# human readable format
ls -lh
ls -l -h
# move up one level in the directory
cd ..
pwd
# get info on Desktop
ls -F Desktop
# get info on shell lesson data
ls -F Desktop/shell-lesson-data
pwd
# use "full/absolute paths" to cd
cd /c/Users/yourUsername/Desktop/shell-lesson-data
pwd   # in shell lesson data folder :)
# use ~ to cd to your "home directory"
cd ~

*Absolute vs. relative paths
Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?
1. cd .
2. cd /
3. cd /home/amanda
4. cd ../..
5. cd ~
6. cd home
7. cd ~/data/..
8. cd
9. cd ..
Valid answers: 5, 7, 8, 9

# Let's cd to our Desktop
cd ~/Desktop/shell-lesson-data/exercise-data/
pwd   # check that cd worked
ls    # confirm you have the same subfolders Trisha is showing: animal-counts/, creatures/, numbers.txt, proteins/, and writing/
cd writing/
pwd
ls -F   # two files show up: LittleWomen.txt and haiku.txt
# make a folder called thesis
mkdir thesis
pwd   # check the directory again
# make a project/data folder and a project/results folder
# the -p flag allows us to create the project folder before we create the subfolders. Without this flag, the below command doesn't work because there isn't a 'project' folder to put the new subfolders into
mkdir -p ../project/data ../project/results
ls
cd ..
cd project/
ls   # data/ and results/ can be seen
cd ~/Desktop/shell-lesson-data/exercise-data/writing
# open the nano editor to create draft.txt inside the thesis folder
nano thesis/draft.txt
# add some text to the file:
It's not "publish or perish" anymore. It's "share and thrive".
# save the file (the nano shortcuts use Ctrl on both Windows and Mac)
Ctrl+O
Enter
Ctrl+X   # to exit
# touch command to create a blank file
touch my_file.txt
ls
# cd to Desktop/shell-lesson-data/
cd ~/Desktop/shell-lesson-data/exercise-data/writing
# use move (mv) to rename a file (draft.txt -> quotes.txt)
mv thesis/draft.txt thesis/quotes.txt
cd thesis
ls   # draft.txt is now named quotes.txt
# create a copy of the quotes file and name it quotations.txt
cp quotes.txt quotations.txt
ls
nano quotations.txt
Ctrl+X   # to exit

Suppose that you created a plain-text file in your current directory to contain a list of the statistical tests you will need to do to analyze your data, and named it: statstics.txt
After creating and saving this file you realize you misspelled the filename! You want to correct the mistake; which of the following commands could you use to do so?
1. cp statstics.txt statistics.txt
2. mv statstics.txt statistics.txt
3. mv statstics.txt .
4. cp statstics.txt .
# 1 works, but the 2nd answer is a bit easier (and doesn't leave the misspelled file behind)

# back from break
clear   # clear the console
# cd to the writing folder
cd ~/Desktop/shell-lesson-data/exercise-data/writing
pwd
ls
cd thesis
ls
# remove the quotes.txt file
rm quotes.txt
ls   # the file is removed!
# the -i option gives you a chance to back out of a delete command
rm -i quotations.txt
n   # n for "no"
Enter
pwd
cd ../..
pwd   # in the exercise-data folder
# create a backup directory
mkdir backup
cp creatures/minotaur.dat creatures/unicorn.dat backup/
ls   # we have a backup folder
cd backup
ls
cd ..
pwd   # in the exercise-data folder
# let's move to our proteins folder
ls
cd proteins
pwd   # in the proteins folder
ls   # a bunch of .pdb files
# view only pdb files
ls *.pdb
# view all files that start with p and end with .pdb
ls p*.pdb
# the ? wildcard substitutes ONE character rather than a variable number of characters like the asterisk does
ls ?ethane.pdb   # methane shows up
ls *ethane.pdb   # ethane and methane show up

When run in the proteins directory, which ls command(s) will produce this output?
ethane.pdb methane.pdb
1. ls *t*ane.pdb
2. ls *t?ne.*
3. ls *t??ne.pdb
4. ls ethane.*

cd ~/Desktop/shell-lesson-data/exercise-data/proteins
pwd
# wc = word count; it returns the number of lines, number of words, and number of characters
wc cubane.pdb   # number of lines, number of words, number of characters
# get line/word/character counts for all pdb files
wc *.pdb
# get line counts only
wc -l *.pdb
# save the line counts of the pdb files to a file named lengths.txt
wc -l *.pdb > lengths.txt
cat lengths.txt
# another way to view file contents
less lengths.txt
pwd
cd ..
pwd   # in the exercise-data folder
ls
# view the contents of numbers.txt
cat numbers.txt
sort numbers.txt   # sorts in alphabetical order
# sort numerically
sort -n numbers.txt
cd proteins/
pwd
ls
cat lengths.txt
# sort lengths.txt numerically and save the result
sort -n lengths.txt > sorted-lengths.txt
# view the first line of the document
head -n 1 sorted-lengths.txt
# view the whole file
cat sorted-lengths.txt
# use echo to print things
echo hello   # displays "hello" in the console
echo hello > textfiles01.txt
cat textfiles01.txt   # hello is saved to textfiles01.txt
# use two greater-than signs to append content to a file
echo hello >> textfiles02.txt
cat textfiles02.txt
echo hello > textfiles01.txt
echo hello > textfiles01.txt
echo hello > textfiles01.txt
cat textfiles01.txt   # a single "hello" in the file (> overwrites each time)
echo hello >> textfiles02.txt
echo hello >> textfiles02.txt
echo hello >> textfiles02.txt
cat textfiles02.txt   # hello appears in the file multiple times
# the pipe operator can be used to join multiple commands together
sort -n lengths.txt | head -n 1   # shows "9 methane.pdb"
pwd   # in the proteins folder
wc -l *.pdb
wc -l *.pdb | sort -n
# move over to the north-pacific-gyre folder (a sibling of exercise-data)
cd ../../north-pacific-gyre/
ls
wc -l *.txt
# look at the last 5 results of the sort
wc -l *.txt | sort -n | tail -n 5
# list the Z files
ls *Z.txt
# look at the data for the A and B files (not the Z files)
wc -l NENE*A.txt NENE*B.txt
history   # shows your command history
# you can re-run a line from your history like this:
!690   # re-runs line 690 of your history output
pwd   # in north-pacific-gyre
cd ..
ls
cd exercise-data
pwd
ls
cd creatures
pwd
# look at the first 5 lines of a few files
head -n 5 basilisk.dat minotaur.dat unicorn.dat
# print the second line (the last line of the first two lines) of basilisk.dat
head -n 2 basilisk.dat | tail -n 1
# a for loop that runs "head -n 2 $filename | tail -n 1" on each file listed in the first line of the loop
for filename in basilisk.dat minotaur.dat unicorn.dat
do
    head -n 2 $filename | tail -n 1
done

for number in 0 1 2 3 4 5
do
    echo $number
done
# you can also type a for-loop on one line by separating the lines with semicolons (see the example below)
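# For example, the file loop above could be written on a single line like this (illustrative sketch added while editing; not typed during the session):
for filename in basilisk.dat minotaur.dat unicorn.dat; do head -n 2 $filename | tail -n 1; done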
*Day 2: GitHub and Version Control
Lesson: https://carpentries-incubator.github.io/git-novice-branch-pr/

Sign-In: Name, pronouns (optional), department/program/affiliation
* Mary Murphy (any pronouns), Research Cyberinfrastructure
* Chris Endemann (he/him), Data Science Hub
* Daven Quinn (he/him), Geoscience
* Louk (he/him), PREP
* Nadeshka Ramirez Perez (she/her), PREP
* Taylor Boldoe (she/her), BDS Summer Program
* Yanissa Rivera (she/her), PREP
* Scott Prater (he/him), UW Digital Collections Center, Helper
* Lisa Padua (she/her), PREP
* Mackenzie Ray (she/her), PREP
* Quinn White (she/her), BDS Summer Program
* Casey Schacher (she/her), Science & Engineering Libraries, Instructor
* Samantha Voelker (she/her), Implementation Science and Engineering Lab
* Sarai Garcia, PREP
* Nikhil Damle (he/him) - BDS Summer Program
* Nistha Panda (she/they), BDS Summer Program
* Jiewen Chen (she/her), PREP
* Josselyn Muñoz (she/her), PREP
* Trenton Mercadel (he/him), PREP
* Marlin Lee (he/him)

*Notes
# open Terminal (Mac) or Git Bash (Windows)
git --version   # tells you the version number of git
# update git on Windows
git update-git-for-windows
# update git on Mac
brew upgrade git
# some configuration steps
git config --global user.name "your name"
# configure git to use your GitHub account email address
git config --global user.email "yourEmailAddress"
# set the color of the user interface
git config --global color.ui "auto"
# run the below command if you're on Windows
git config --global core.autocrlf true
# Mac/Linux users, run the below command instead
git config --global core.autocrlf input
git config --global core.editor "nano -w"
git config --list   # list all config options

# go to github.com and sign in (create an account if you don't have one already)
https://github.com/
# check if you have key pairs set up already: the below command will produce an error if you don't
ls -al ~/.ssh
# set up key pairs
ssh-keygen -t ed25519 -C "yourEmailAddress"
# Enter file in which to save the key:
Enter   # don't type anything, just hit enter
# If you get a prompt asking whether you want to overwrite, type y for yes
y
# Enter a passphrase that you can easily remember
yourPassword
Enter
# check that your key pair was set up successfully: you should see something like id_ed25519 and id_ed25519.pub
ls -al ~/.ssh
# head back over to github.com
- click the profile icon in the upper right corner
- click Settings
- on the lefthand side, find "SSH and GPG keys" and click on it
- click the green "New SSH key" button
- add a title: we recommend the title reflects whichever machine you're currently working from
# in Git Bash/Terminal, use cat to view your public key
cat ~/.ssh/id_ed25519.pub
- select the public key (the output of cat) and Ctrl+C to copy it
- head back over to GitHub and Ctrl+V to paste the public key into the "Key" textbox
- click the green "Add SSH key" button
# head back over to Git Bash
ssh -T git@github.com
# hit Enter; you'll be prompted to enter the passphrase you created in the steps above
# if successful, it'll say "Hi yourName! You've successfully authenticated, but GitHub does not provide shell access."

# in Git Bash (Windows) / Terminal (Mac)
clear   # this will clear all past commands so Git Bash looks a little cleaner
# Next we're going to create a "repository" to store our code and all past versions of our code
pwd   # check the current/working directory
# cd to the home directory, then Desktop
cd ~
cd Desktop
pwd
# make a planets folder
mkdir planets
ls
# cd into the planets folder
cd planets
pwd
# initialize the repository
git init
# check the folder/repository contents
ls -a   # the -a flag shows hidden folders/files
# check the status of your repository: are there any changes made to the directory, new files, etc.
git status
cd ..
git status   # error: this folder is not a git repository
cd planets
pwd
# open up nano
nano mars.txt
# let's add some text to our mars.txt file
"Cold and dry, but everything is my favorite color."
Ctrl+X  y  Enter   # exit the file and save
ls   # we see mars.txt
cat mars.txt   # we see the text we added to our file
# let's check the status of our repository ("repo")
# no commits yet; in red, we see an "untracked file" (our mars.txt file)
git status
# next, we will "stage" our new file
git add mars.txt
# check the status of the repo: our new file now shows up in green because the file is now staged
git status
# commit our new file to the repo
git commit -m "Start notes on Mars."
# check status
git status   # no new files show up anymore, and we no longer see a "No commits" message
# check the log of commits
git log
# Let's add some more text to our mars.txt file
nano mars.txt
# add the following text
"The two moons may be a problem for Wolfman."
Ctrl+X  y  Enter   # save and exit nano
# view the updated file contents
cat mars.txt
# check status
git status   # the modified file shows up in red
# check the difference between our updated file (uncommitted) and the last version of the file that was committed to our repo
git diff
# stage our updated file
git add mars.txt
# commit the updated file
git commit -m "add concerns about effects of Mars' moons on Wolfman."
# check status
git status
# let's add another line to our file
nano mars.txt
# add the following line
"But the Mummy will appreciate the lack of humidity."
Ctrl+X  y  Enter   # save and exit nano
# check the difference between the new updates and the previously committed file
git diff
# Let's stage our file
git add mars.txt
# run git diff again
git diff   # no output, because our changes are already staged
# add a modifier to diff to view the difference between the staged changes and the last commit
git diff --staged
git commit -m "Discuss concerns about Mars' climate for Mummy"
# view the log of commits
git log
# add a couple of words to our mars.txt file
nano mars.txt
# add the following words to the previous sentences (the new words are "nocturnal" and "dry"):
"The two moons may be a problem for the nocturnal Wolfman."
"But the dry Mummy will appreciate the lack of humidity."
Ctrl+X  y  Enter   # save and exit nano
# view the updated file contents
cat mars.txt
# check the diff: the output is hard to interpret; it shows which lines have changed, but it doesn't highlight new additions within existing lines (word-wise differences)
git diff mars.txt
# run git diff with a modifier to see word-wise differences
git diff --color-words mars.txt
# add/stage and commit all in one step by listing the updated file at the end of the commit command
git commit -m "Added a couple of random words" mars.txt
# display the n most recent commits
git log -1   # look at the most recent commit
git log -3   # look at the last 3 commits
# condensed view of just the commit messages associated with each commit
git log --oneline

# back from break; let's clear our Git Bash
clear
# check the contents of mars.txt
cat mars.txt
git log --oneline
# add a new line to the file
nano mars.txt
"An ill-considered change."
Ctrl+X  y  Enter   # save and exit nano
# confirm the change was added using cat
cat mars.txt
# git diff with a HEAD modifier: compare the updated file with the last version committed
git diff HEAD mars.txt   # equivalent to git diff mars.txt
# you can use HEAD~ to compare to other versions committed to your repo
git diff HEAD~1 mars.txt   # compare the updated file to the 2nd most recent commit
git diff HEAD~2 mars.txt   # compare the updated file to three versions ago (the 3rd most recent commit)
# check the git log
git log --oneline
# on the left you'll see unique identifiers for each commit. We can use these identifiers to compare our updated file to specific commits
git diff e3ddc76 mars.txt   # you will likely have a different unique identifier than "e3ddc76". Use one of the identifier codes that shows up when you run "git log --oneline"
# check status
git status   # the mars.txt file is unstaged
# let's overwrite our change
nano mars.txt
# delete the last line we added ("An ill-considered change.") and add the following line
"We will need to manufacture our own oxygen"
Ctrl+X  y  Enter   # save and exit nano
cat mars.txt
git status
# add and commit the file in one step
git commit -m "This is a change we don't want." mars.txt
cat mars.txt
# revert to the previous version of our file
git checkout HEAD~1 mars.txt
cat mars.txt

# Git branches: branches are a great way to make changes to a project without impacting the previous state of the project. You can create a new branch whenever you have a series of changes you want to make. When you're done making changes, you can merge your branch into the main/master branch.
# check the current branches
git branch
# create a new branch called pythondev
git branch pythondev
# see all branches
git branch
# let's move into our pythondev branch.
# This branch is currently an exact copy of the work we committed to the main/master branch
git checkout pythondev   # switch branches
git branch   # the pythondev branch now shows up green with an asterisk
# view the files in the branch
ls   # the mars.txt file shows up in this branch
# make a new python file
touch analysis.py
ls   # the new file can be seen
# let's add and commit the new file
git add analysis.py
git commit -m "Wrote and tested a python script"
# check status
git status
# return to the master/main branch
git checkout master
# create a new branch and switch to it: the -b flag creates the branch before moving you onto it
git checkout -b bashdev
ls
# create an analysis.sh file
touch analysis.sh
git add analysis.sh
git commit -m "Wrote and tested bash script"
git status
git checkout master
git branch
# merge branches: merge pythondev into master/main
git merge pythondev
ls   # analysis.py shows up
# view branches
git branch
# delete our pythondev branch since we no longer need it (this branch was merged with master/main)
git branch -d pythondev
# attempt to delete the bashdev branch
git branch -d bashdev   # you'll get an error stating that this branch is not fully merged
# force delete with uppercase D
git branch -D bashdev
git branch   # now we just have our master branch
# take a look at mars.txt
cat mars.txt
# create a new branch called marsTemp
git branch marsTemp
nano mars.txt
# add the following line:
"I'll be able to get 40 extra minutes of beauty rest."
Ctrl+X  y  Enter   # save and exit nano
# view the updated file contents
cat mars.txt
# add and commit in one line
git commit -m "Add a line about the daylight on Mars." mars.txt
# switch branches
git checkout marsTemp
git branch
cat mars.txt
# add a new line to our file
nano mars.txt
# new line below
"Yeti will appreciate the cold."
Ctrl+X  y  Enter   # save and exit nano
# view the file contents
cat mars.txt
# add/commit the changes
git add mars.txt
git commit -m "Add a line about temperature on Mars."
# switch back to the master branch
git checkout master
# attempt to merge marsTemp into master
git merge marsTemp   # a CONFLICT message shows up because we have edited our mars.txt file in both branches
cat mars.txt
# use nano to keep only the lines you want
nano mars.txt
Ctrl+X  y  Enter   # save and exit nano
# add/commit
git add mars.txt
git commit -m "merged changes from marsTemp."
git status

# head back over to github.com
- under your profile icon, click "Your repositories"
- click the green "New" button to create a repository
- name the repo "planets"
- click "Create repository"
- click SSH instead of HTTPS near the top of the repo window
- copy the SSH path
# back in Git Bash/Terminal
git remote add origin PLACE_COPIED_ADDRESS_HERE
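# Note added for reference: the copied SSH address typically looks something like git@github.com:YOUR_USERNAME/planets.git, where YOUR_USERNAME stands in for your own GitHub username, so the full command would resemble
# git remote add origin git@github.com:YOUR_USERNAME/planets.git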
# check that it worked by typing...
git remote -v
# push your local repo to GitHub
git push origin master
# enter the passphrase you created earlier

# go back to GitHub.com
- from the planets repo page, click "Add file" and then "Create new file"
- name the file pullTest.txt
- scroll down and commit the file via GitHub
# pull the changes from GitHub to your local repo
git pull origin master
# enter your passphrase
ls   # we now see our pullTest.txt file in our local repo

*Day 3: Plotting & Programming in Python
Lesson: http://swcarpentry.github.io/python-novice-gapminder/

Sign-In: Name, pronouns (optional), department/program/affiliation
* Taylor Boldoe (she/her), BDS Summer Program
* Daven Quinn (he/him), Research scientist, Geoscience (Macrostrat lab)
* Lisa Padua (she/her), PREP
* Sarai Garcia (she/her), PREP
* Casey Schacher (she/her), Science & Engineering Libraries - helper
* Samantha Voelker (she/her), Implementation Science and Engineering Lab
* Nikhil Damle (he/him) - BDS Summer Program
* Nistha Panda (she/they) - BDS Summer Program
* Josselyn Muñoz (she/her) - PREP
* Mackenzie Ray (she/her), PREP
* Nadeshka Ramirez Perez (she/her), PREP
* Trisha Adamus (she/her), Ebling Library - helper
* Yanissa Rivera (she/her), PREP
* Jiewen Chen (she/her), PREP
* Quinn White (she/her) - BDS Summer Program
* Marlin Lee (he/him)
* Nafisa Raisa
* Louk (he/him), PREP
* Trenton Mercadel (he/him), PREP

*Notes
# Note: if you don't have the data downloaded yet, you can get it from here: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
# Open Terminal (Mac) or Git Bash (Windows)
# cd into your data folder. If it's stored on your Desktop...
cd ~/Desktop/data/
jupyter lab   # opens JupyterLab
# File -> New -> Notebook
# Use the plus button to add more cells
# To run a cell, use Shift+Enter. You can also use Ctrl+Enter to run a cell without advancing to the next cell
# You can select different cell types: Markdown or Code. Markdown is good for leaving sections of text in your notebook.
# Use asterisks to create bulleted lists in markdown; you can indent to create sublists
# You can use hashtags, #, to add headings. The more hashtags you use, the smaller the heading...
# Largest heading
## Large heading
### Smaller heading
#### Even smaller heading
# Markdown guide: https://www.markdownguide.org/basic-syntax/
# Restarting a kernel clears the notebook of what was run previously; it doesn't remove any code, it just resets the environment (Kernel -> Restart Kernel). It is also good practice to use Kernel -> Restart Kernel and Run All Cells when your script is complete. This guarantees that all cells will be run in order.

# in the second cell
age = 42             # assign 42 to the variable age
Age = 43             # also valid
AGE = 44             # also valid
first_name = "Ralf"  # assign a string to the variable first_name
print(first_name)    # print the variable contents
# print multiple things using commas
print(first_name, "is", age, "years old")
# restarting the kernel and running all cells would produce an error with the below code
print(last_name)
last_name = "Kotulla"
# instead, we want to define last_name prior to printing it
last_name = "Kotulla"
print(last_name)
# adding a value to a variable
age = age + 3
print("age in three years", age)
atom_name = "Helium"
print(atom_name)
atom_name = 'Helium'   # you can use single-quotes or double-quotes. Try to stick with one for consistency.
print(atom_name)
# a string can span multiple lines if you use three single/double quotes
atom_name = """Helium
is an atom"""
print(atom_name)
# print just the first character of atom_name
atom_name = "Helium"
print(atom_name[0])
# print multiple characters of atom_name
atom_name = "Sodium"
print(atom_name[0:3])   # print the first three characters
# print the length of a string (number of characters)
print(len(atom_name))
print(len(Atom_name))   # produces an error because we have a typo (the A in atom shouldn't be capitalized)
# Try to use meaningful variable names. The below code works, but it uses meaningless variable names, and it is hard for the average human to interpret at a glance. Use names that reflect the data you are storing in a variable (e.g., age and first_name in this case).
flabadad = 42
kshd = "Ralf"
print(kshd, 'is', flabadad, 'years old')
# print the type of a value
print(type(52))           # 52 is an integer ('int')
print(type("some text"))  # string type
print(type(first_name))   # string type
print(type(52.4))         # float type (a number with a decimal)
print(5 - 3)              # prints 2
print('hello' - 'h')      # error: can't subtract strings
# you can add strings
full_name = "Ralf" + " " + "Kotulla"
print(full_name)
# you can also multiply a string by an integer
separator = "=" * 25
print(separator)
# print the length of full_name
print(len(full_name))
# numbers do not have a length
print(len(52))   # produces an error
# you can't add strings and integers
print(1 + '2')   # error
# you can "cast" a string to an integer using int()
print(1 + int('2'))   # adds the integers 1 and 2
# you can also do the reverse: cast an integer to a string
print(str(1) + '2')   # 12
# you can print both strings and numbers together
print("half is ", 1/2.0)
# use two asterisks to raise a number to a power, e.g. 3 squared below
print("three squared is", 3.0 ** 2)
variable_one = 1
variable_two = 5 * variable_one
variable_one = 2
print(variable_one, variable_two)   # 2 5
# you can add comments to python by using hashtags
# this sentence is a comment, and not executed by python
adjustment = 0.5   # this is also ignored
print("before")    # print is a function, and "before" is the argument we pass to this function
print()            # prints nothing / a newline
print("after")
# functions return things: the print function returns None
result = print("example")
print("result of print is", result)   # None is the default output of a function
# the max function
print(max(1,2,3))          # 3
print(max('a', 'f', '0'))  # f
print(round(3.14))     # round to the nearest integer
print(round(3.14, 1))  # round to the first decimal place
my_string = "Hello World!"
print(len(my_string))
print(my_string.swapcase())   # the string class has a set of functions associated with it
# other string functions
print(my_string.isupper())          # checks whether all characters (letters) of the string are uppercase
print(my_string.upper())            # convert the string to upper case
print(my_string.upper().isupper())  # convert the string to uppercase and then check if the entire string is uppercase
# get help
help(str)     # see the functions available for strings
help(round)   # see the documentation for the round function
# common errors
name = 'Ralf            # SyntaxError: EOL (end of line) while scanning string literal; you need to add a closing quote after Ralf to make this line work
age = = 42              # invalid syntax error. Python does not know how to interpret two equals signs separated by a space
print("hello World"     # SyntaxError: unexpected EOF (end of file). You need to close the parentheses to fix this
age = 52
remaining = 100 - AGe   # NameError: name 'AGe' is not defined.
# Check your variable name spelling when you see this kind of error. AGe should be spelled as "age"; there is no variable AGe defined in our script.
age = 52
 remaining = 100 - age   # IndentationError: unexpected indent (caused by the stray leading space)
# let's import the math library/package
import math
print("pi is", math.pi)
print("cos(pi) is", math.cos(math.pi))
# get help on math: you can only do this after importing the math library
help(math)
# import just two functions from the math library
# be careful not to use cos and pi as variable or function names in your script. If you do, you will override the math library's cos and pi functions.
from math import cos, pi
print("cos(pi) is ", cos(pi))
# you can give libraries a nickname
import math as m   # with this syntax, we reference m. when we want to use the math library
print("cos(pi) is ", m.cos(m.pi))

# Intro to the pandas library. First, we'll import the library
import pandas as pd
# read in a csv file
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
# run a shell command in python!
!ls
!pwd
# specify the index column to control which variable is displayed on the rows of the data table
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col="country")
print(data)
# we can get more info on our data with .info()
data.info()
# print the columns of your data
print(data.columns)
# print a transposed version (swap rows and columns) of your data
print(data.T)
# get some descriptive stats on your data (mean, std, min, max, count, etc.)
print(data.describe())
# let's read in a different file
data = pd.read_csv("data/gapminder_gdp_europe.csv", index_col="country")
print(data)
# use iloc to print the value in the first row, first column
# Note: python indexing starts at 0 (unlike R, which starts at 1)
print(data.iloc[0, 0])
# use loc to index based on row and column names
# Typically, you'll want to use .loc instead of .iloc because it is more stable. That is, even if your csv file was reorganized, the code would still work when using .loc.
print(data.loc['Albania', 'gdpPercap_1952'])
# print all columns for Albania by using a colon to index all columns
print(data.loc['Albania', :])
# print all rows (countries) for a specific column (gdpPercap_1952)
print(data.loc[:, 'gdpPercap_1952'])
# print a range of rows and columns
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])
# get the max of a subset of the data
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())
# get the min of a subset of the data
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())
# save a subset of the data into a new variable
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
# Note: the \n in the print statement below adds a newline
print("subset of data:\n", subset)
# return Trues and Falses for a specific condition. In this example, we check where in our data gdp is > 10000
print("large gdp values:\n", subset > 10000)
# generate a mask of True/False values. Use this mask to display the values of gdp that are > 10000
mask = subset > 10000
print(subset[mask])   # Note: our mask is a matrix, so we don't need to specify both rows and columns when using the mask as an index
# describe the data that has a gdp > 10000
print(subset[mask].describe())
# let's create another mask that will be True for all values that are greater than the mean gdp for a specific year
# Note: when you sum over "Boolean data" (i.e., data containing True and False values), the Trues are treated as 1's and the Falses are treated as 0's.
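# A quick illustration of that Boolean-sum behavior (example added while editing; not from the live notes):
print(True + True + False)        # 2
print(sum([True, False, True]))   # 2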
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1)   # axis=1 sums across the columns, giving one value per row/country (axis=0 would sum down each column)
print(wealth_score)

# import the plotting library
import matplotlib.pyplot as plt
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
plt.plot(time, position)
plt.xlabel("Time [hours]")
plt.ylabel("Position [km]")
# if you're working in a different python IDE, you may need to run the following to get your plot to show up
plt.show()
# plot using pandas
import pandas as pd
data = pd.read_csv("data/gapminder_gdp_oceania.csv", index_col="country")
years = data.columns.str.strip('gdpPercap_')
# Convert the year values to integers, saving the results back to the dataframe
data.columns = years.astype(int)
data.loc['Australia'].plot()
data.T.plot()
plt.ylabel('GDP per capita')
# barplot
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.ylabel('GDP per capita')
# use plt.plot() to plot
years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--')
# plot multiple datasets together
# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']
# Plot with differently-colored markers.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')
# Create a legend.
plt.legend(loc='upper left')
# add axis labels
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')

*Day 4: Plotting & Programming in Python Continued
Lesson: http://swcarpentry.github.io/python-novice-gapminder/

Sign-In: Name, pronouns (optional), department/program/affiliation
* Chris Endemann (he/him), Data Science Hub
* Daven Quinn (he/him), research scientist in Geoscience
* Nadeshka Ramirez (she/her), PREP
* Lisa Padua (she/her), PREP
* Nikhil Damle (he/him) - BDS Summer Program
* Taylor Boldoe (she/her), BDS Summer Program
* Casey Schacher (she/her), Science & Engineering Libraries - helper
* Louk (he/him), PREP
* Nistha Panda (she/they), BDS Summer Program
* Samantha Voelker (she/her), Implementation Science and Engineering Lab
* Jiewen Chen (she/her), PREP
* Mackenzie Ray (she/her), PREP
* Sarai Garcia (she/her), PREP
* Trenton Mercadel (he/him), PREP
* Marlin Lee (he/him)
* Trisha Adamus (she/her), Ebling Library
* Quinn White (she/her), BDS Summer Program
* Josselyn Muñoz (she/her), PREP
* Yanissa Rivera (she/her), PREP

*Notes
# First, open JupyterLab: in a Terminal/Git Bash shell
cd ~/Desktop
cd python-novice-gapminder-data   # or whatever you called your data folder from yesterday
ls
jupyter lab   # opens JupyterLab
# In JupyterLab, create a new notebook and change its name to "lists" (pick "Notebook" > "Python 3 (ipykernel)"). Once in the new notebook, right-click on the title to rename it.
# We're creating lists! Run this code in your Python notebook using Shift+Enter.
# Let's create a list in Python
pressures = [.273, .275, .277, .275, .276]
print('pressures:', pressures)
print('length:', len(pressures))
print('first item of pressures:', pressures[0])
print('last item of pressures:', pressures[4])
# you can also get the last item of pressures using pressures[len(pressures)-1] OR pressures[-1]
# Lists' values can be replaced by assigning to them
pressures[0] = .265
print(pressures)
# Adding items to a list
# incorrect way
primes = [2, 3, 5]
print(primes)
primes = primes + [7]   # This works, but it is not the best way to do it
# Why avoid this?
# Adding items to a list in this way forces python to temporarily store multiple lists in your computer's memory:
#   1) the initial primes list
#   2) [7]
#   3) a new list with length = len(primes) + 1
# After the 3rd new list is created, the contents of primes and [7] are transferred to this new list.
# This might not sound like a big deal, but when working with large datasets, creating copies of large lists can eat up your computer's memory very quickly.
# Instead, we're going to use what's called an "in-place" operation.
# An in-place operation is an operation that changes the content of a variable directly; this avoids having to make a copy of our list when we want to add data to it.
primes = [2, 3, 5]
primes.append(7)   # This operation directly changes the value of primes
print(primes)      # [2, 3, 5, 7]
# Combine two lists
teen_primes = [11, 13, 17]
middle_aged_primes = [37, 41, 43]
primes.extend(teen_primes)
primes.extend(middle_aged_primes)   # primes will now contain all of the primes in order
# remove items from a list
primes = [2, 3, 5, 7, 9]
del primes[4]
print(primes)   # [2, 3, 5, 7]
# initialize an empty list
my_list = []
my_list.append(1)
print(my_list)   # [1]
# lists can contain different types
goals = [1, 'Create lists', 2, 'Extract items from lists']   # This list contains both strings and integers
# You can treat strings like lists!
element = "carbon"
print(element[0])   # "c"
# ...but you can't replace/edit items like in a list (strings are immutable)
element[0] = "C"   # Returns a TypeError
# instead, you can use a string modification function
element = str.replace(element, 'c', 'C')
# common error: IndexError
print('99th item of element:', element[98])   # Returns IndexError: string index out of range

Challenge exercise: Fill in the blanks so that the program below produces the following output...
first time: [1, 3, 5]
second time: [3, 5]
values = ____
values.____(1)
values.____(3)
values.____(5)
print('first time:', values)
values = values[____]
print('second time:', values)
# Answers
values = []
values.append(1)
values.append(3)
values.append(5)
print('first time:', values)
values = values[1:]
print('second time:', values)

# Cast a string to a list
print('string to list:', list('tin'))
print('list to string:', "".join(["g", "o", "l", "d"]))
# Output:
# string to list: ['t', 'i', 'n']
# list to string: gold
# print the last letter:
element = 'helium'
print(element[-1])   # equivalent to print(element[len(element)-1])
# step through a list using slices
element = 'fluorine'
begin_index = 0
end_index = len(element)
stride = 2
print(element[begin_index:end_index:stride])   # "furn"
# if starting and ending at the beginning and end, respectively, you can write
print(element[::stride])   # "furn"
# How to print 'lre' from element
element[1::3]   # lre
# Sorting lists
numbers = [3, 1, 4, 1, 5]   # example list of numbers (defined here so the next two lines run)
result = sorted(numbers)
# or, using an in-place operation
result = numbers.sort()   # result will be 'None', but numbers will contain the sorted list
# making a copy of a list (or not)
old = list('gold')
new = old
new[0] = 'D'   # Both new and old are ["D", "o", "l", "d"]!
# only a "reference" to the data is copied
# if you copy with a slice, you make a true copy
new = old[:]

# Now create a new notebook called 'For loops'
# for loops allow you to execute code for each value of a list
my_list = [2, 3, 5]
# print each item
for item in my_list:
    print(item)   # the indentation before print is important
# a loop body can include multiple lines of code
primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)
# use range to loop through a range of values
for number in range(0, 3):
    print(number)
# if starting at 0, you can omit the first argument
for number in range(3):
    print(number)
# use for loops to sum up the values in a collection
total = 0   # accumulator variable
for number in range(1, 11):
    print(number)
    total = total + number
print(total)   # 55
# In a Jupyter/IPython notebook, hit Shift+Tab to un-indent and get out of the for-loop body
# you can also loop through strings
total = 0
for char in "tin":
    print(char)
    total = total + 1
print(total)   # 3

*Exercises (work on these until 10:30)
# Fill in the blanks in each of the programs below to produce the indicated result.
# Total length of the strings in the list ["red", "green", "blue"] = 12
total = 0
for word in ["red", "green", "blue"]:
    total = total + len(word)
print(total)
# List of word lengths: ["red", "green", "blue"] = [3, 5, 4]
lengths = []
for word in ["red", "green", "blue"]:
    lengths.append(len(word))
print(lengths)

# New notebook: rename it to "conditionals"
# running code based on some condition being met (if-statements)
mass = 3.54
if mass > 3.0:   # don't forget the colon!
    # if the condition is true, this will execute
    print(mass, "is large")   # 3.54 is large
# conditionals are often used inside of loops
masses = [3.0, 3.7, 9.2, 1.8, 1.7]
for m in masses:
    if m > 3.0:
        print("m is large")
# can also use an 'else' statement
masses = [3.0, 3.7, 9.2, 1.8, 1.7]
for m in masses:
    if m > 3.0:
        print("m is large")
    else:
        print("m is small")
# checking multiple conditions
masses = [3.0, 3.7, 9.2, 1.8, 1.7]
for m in masses:
    if m > 9.0:
        print("m is HUGE")
    elif m > 3.0:
        print("m is large")
    else:
        print("m is small")
# Once a condition is met, all other conditions are ignored
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
# Because of the ordering, this prints 'grade is C' for every grade of 70 or above, ignoring the later cases
# you can "evolve" the value of variables in a for loop
velocity = 10.0
for i in range(5):
    print(i, ":", velocity)
    if velocity > 20.0:
        print('moving too fast')
        velocity = velocity - 5.0
    else:
        print('moving too slow')
        velocity = velocity + 10.0
print('final velocity:', velocity)
# check multiple conditions in one if statement
mass = [3.54, 2.07, 9.22, 1.86, 1.71]
velocity = [10.0, 20.0, 30.0, 25.0, 20.0]
for i in range(5):
    print('object' + str(i) + ': mass=' + str(mass[i]) + ', velocity=' + str(velocity[i]))
    # check conditions
    if mass[i] > 5 and velocity[i] > 20:
        print("Fast heavy object. Duck!")
    if mass[i] <= 2 or velocity[i] <= 20:
        print('object is either light, slow, or both')

#### Exercise
# Modify this program so that it only processes files with fewer than 50 records/countries.
import glob
import pandas as pd
for filename in glob.glob('data/*.csv'):
    contents = pd.read_csv(filename)
    if len(contents) < 50:
        print(filename, len(contents))

# Create a new notebook called "loop-through-data"
import pandas as pd
files = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']
for filename in files:
    # read in our data
    data = pd.read_csv(filename, index_col='country')
    print(filename)
    print(data.min())
# use glob.glob to find sets of files matching a pattern
import glob
print(glob.glob("data/*.csv"))
# if there are no matching files (e.g., no .txt files in the data folder), you end up with an empty list
print(glob.glob("data/*.txt"))   # result: []
# use glob to process a batch of files
for filename in glob.glob("data/*.csv"):
    data = pd.read_csv(filename)
    # print the min GDP across all countries in a file for the year 1952
    print(filename, data['gdpPercap_1952'].min())
# Result:
# gapminder_gdp_americas.csv 1397.717137
# gapminder_gdp_europe.csv 973.5331948
# gapminder_all.csv 298.8462121
# gapminder_gdp_oceania.csv 10039.59564
# gapminder_gdp_africa.csv 298.8462121
# gapminder_gdp_asia.csv 331.0

# Exercise
Which of these files is not matched by the expression glob.glob('data/*as*.csv')?
1. data/gapminder_gdp_africa.csv
2. data/gapminder_gdp_americas.csv
3. data/gapminder_gdp_asia.csv
Answer: 1

#### Exercise
# Modify this program so that it prints the number of records in the file that has the fewest records.
import glob
import pandas as pd
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pd.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')

# Functions
Functions allow us to:
- easily re-use code without having to re-write it
- break up complicated software into small, manageable components

# create function
def print_greeting():
    print('Hello!')
# call function
print_greeting()

# create function
def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)
# call the function using positional arguments
print_date(1871, 3, 19)
# call the function using named arguments
print_date(month=3, day=19, year=1871)

## Functions can also return outputs
# create function
def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)
# call function
a = average([1, 3, 4])
print('average of actual values:', round(a, 2))

## Challenge
a = average([])
print('average of empty list =', a)
ANSWER: Returns: average of empty list = None

Every function returns something, even if there is no explicit return statement. Ex:
result = print_date(1871, 3, 19)
print('result of call is:', result)   # result of call is: None

Exercise: Identifying Syntax Errors
1. Read the code below and try to identify what the errors are without running it.
2. Run the code and read the error message. Is it a SyntaxError or an IndentationError?
3. Fix the error.
4. Repeat steps 2 and 3 until you have fixed all the errors.
def another_function
  print("Syntax errors are annoying.")
   print("But at least python tells us about them!")
  print("So they are usually not too hard to fix.")

#### What's wrong with this example?
result = print_time(11, 37, 59)

def print_time(hour, minute, second):
    time_string = str(hour) + ':' + str(minute) + ':' + str(second)
    print(time_string)

#### Exercise
Fill in the blanks to create a function that takes a single filename as an argument, loads the data in the file named by the argument, and returns the minimum value in that data.
import pandas as pd

def min_in_data(____):
    data = ____
    return ____
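# One possible way to fill in the blanks (sketch added while editing; an answer was not recorded in the original notes):
def min_in_data(filename):
    data = pd.read_csv(filename)
    return data.min()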
#### Exercise
Fill in the blanks to create a function that takes a list of numbers as an argument and returns the first negative value in the list. What does your function do if the list is empty? What if the list has no negative numbers?
def first_negative(values):
    for v in ____:
        if ____:
            return ____

# scope of variables
pressure = 103.9   # this is a global variable

def adjust(t):
    '''Adjust calculates temperature based on pressure'''   # this is a documentation string (i.e., a docstring)
    temperature = t * 1.43 / pressure   # this is a local variable
    return temperature

print('adjusted:', adjust(0.9))
print('temperature after call:', temperature)   # this will return an error because you can't reference a local variable at the global scope
# this will print the docstring
help(adjust)

# Programming style
Use assertions to check for internal errors. They are good practice whenever your code makes assumptions (e.g., a value is greater than 0, or a value is a string, etc.).
# create function
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0   # checks that the assumption volume > 0 is true
    return mass / volume
# call function
calc_bulk_density(10, -1)   # since volume < 0, you get an AssertionError

# Wrap up
Daily Feedback Form: https://forms.gle/Bo3DXNAoEC3D5P7ZA
Coding Meetup (office hours): https://datascience.wisc.edu/hub/#dropin
Data Science Hub Newsletter (upcoming workshops, seminars, jobs): https://datascience.wisc.edu/newsletter/