Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

----------------------------------------------------------------------------

*Welcome to Software Carpentry!

*Links:
Workshop Webpage: https://uw-madison-datascience.github.io/2021-08-09-uwmadison-swc/
Pre-workshop Survey: https://carpentries.typeform.com/to/wi32rS?slug=2021-08-09-uwmadison-swc
Feedback: https://forms.gle/EJnp7kTSHgu73BfS6
Intro Slides: https://docs.google.com/presentation/d/1hL_JkVviKn8lGhT36U9tk0f85DOEWcnx_WENATuJEeo/edit?usp=sharing
Wrap-up Slides: https://docs.google.com/presentation/d/1_AI1PkdrITILKH3Os2ssCjUJHKm6cVkxWWz8BSRUTZg/edit?usp=sharing
Get one-on-one help from the data science facilitators by attending our office hours or coding-meetups: https://datascience.wisc.edu/hub/#dropin

*Day 5 - Git and/or Workflows
Lesson: https://carpentries-incubator.github.io/swc-ext-python/

*Sign-in: Name (pronouns optional), Affiliation, what is one thing you've learned so far you'd like to implement in your work?
* Sarah Stevens (she/her/hers), Data Science Hub, Instructor
* Clare Michaud (she/her/hers), Data Science Hub,
* Steven Warren (he/him) - Helper, iSchool MA Student,
* Chandler Meyer (she/her/hers), Plant Breeding and Plant Genetics.
Modifying my data set with python
* Jeremiah Yee, biostatistics, glob
* Joni Sedillo, medical genetics, using git
* Katie Ziebarth, chemistry, Git
* Chris Endemann (he/him/his), Data Science Hub, Helper
* Chris Kirby - GLAS Education

*Notes:
Vote below by adding a +1 to your preferred option
Review and learn about Collaborating in GitHub, then Workflows: +1,+1
Start with Workflows instead: +1,+1+1+1+1

initial gdp_plots.py:
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# display the plot
plt.show()

git add gdp_plots.py
git commit -m "First commit of analysis script"
git status
nano .gitignore
data/*.csv
*.ipynb
Ctrl+x y enter
git add .gitignore
git commit -m "adding ignore file"
git log --oneline

nano gdp_plots.py
import pandas
import matplotlib.pyplot as plt
# define the filename before it is used
filename='data/gapminder_gdp_oceania.csv'
data=pandas.read_csv(filename, index_col='country').T
ax=data.plot()
# set some plot attributes
ax.set_xlabel('Year')
ax.set_ylabel('GDP Per Capita')
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation=45)
plt.show()
Ctrl+x y enter

gdp_plots.py v2
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename='data/gapminder_gdp_oceania.csv'

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# set some plot attributes
ax.set_xlabel('Year')
ax.set_ylabel('GDP Per Capita')
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index,
                   rotation = 45)

# display the plot
plt.show()

# test your script
python gdp_plots.py
# add/commit
git add gdp_plots.py
git commit -m "improving plot format"

# want to be able to run our script using variable input values: python gdp_plots.py FILENAME
nano args_list.py
import sys
print('the argument list is:', sys.argv)
python args_list.py # prints out "the argument list is: ['args_list.py']"
python args_list.py arg1 arg2 arg3 # prints name of script followed by all arguments

# edit our original script to take inputs
nano gdp_plots.py
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# OLD: filename='data/gapminder_gdp_oceania.csv'
filename=sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# set some plot attributes
ax.set_xlabel('Year')
ax.set_ylabel('GDP Per Capita')
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()
Ctrl+x y enter

# Run script on oceania gdp
python gdp_plots.py data/gapminder_gdp_oceania.csv
python gdp_plots.py data/gapminder_gdp_asia.csv
# Ctrl+A: move the cursor to the beginning of the current command line (e.g. after recalling the previous command)
git status
rm args_list.py # remove this file
git status
git diff gdp_plots.py
git commit -m "adding cmdline arguments" gdp_plots.py # file has to be tracked previously for this to work

# to change the name from master to main
git checkout -b main
git branch -d master
# make 2 branches
git branch py-multi-files # python branch
git branch sh-multi-files # git bash branch
git branch # see all branches
# switch to python branch
git checkout py-multi-files
git branch # current branch highlighted in green with asterisk next to it
# open our script
nano gdp_plots.py
# want to be able to run
# multiple files at once - use a for loop!
#filename=sys.argv[1] # comment this out
for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot()

    # set some plot attributes
    ax.set_xlabel('Year')
    ax.set_ylabel('GDP Per Capita')
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # display the plot
    plt.show()

python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
git add gdp_plots.py
git commit -m "allowing plot generation for multiple files at once"

nano gdp_plots.py
# change plt.show to....
    split_name1=filename.split('.')[0] # data/gapminder_gdp_X
    split_name2=split_name1.split('/')[1] # gapminder_gdp_X
    save_name='figs/' + split_name2 + '.png'
    plt.savefig(save_name)
Ctrl+x y enter
git add gdp_plots.py
git commit -m "saves fig for each plot as file"

# edit .gitignore to ignore any files in figs folder
nano .gitignore
figs/
Ctrl+x y enter
git add .gitignore
git commit -m "ignoring figures"
git log --oneline --graph --all --decorate

# create bash script
touch gdp_plots.sh
nano gdp_plots.sh
for filename in data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
do
    python gdp_plots.py $filename
done
# exit nano
Ctrl+x y enter
# run script
bash gdp_plots.sh

# Edit python script
nano gdp_plots.py
# save plot with unique filename
split_name1=filename.split('.')[0] # data/gapminder_gdp_X
split_name2=split_name1.split('/')[1] # gapminder_gdp_X
save_name='figs/' + split_name2 + '.png'
plt.savefig(save_name)

git add gdp_plots.sh
git add gdp_plots.py
git status
git commit -m "wrote .sh script and updated python script to save figs to unique names"
echo "figs/" >> .gitignore # echo prints out whatever comes after it; >> appends the output of echo to the file named after ">>"
cat .gitignore
git add .gitignore
git
commit -m "ignore figs folder"

# let's time our scripts and see which is faster
time bash gdp_plots.sh
git checkout py-multi-files # switch branch
time python gdp_plots.py data
#
nano gdp_plots.py
# check for -a flag in arguments
if '-a' in sys.argv:
    filenames=glob.glob('data/*gdp*.csv')
else:
    filenames=sys.argv[1:]
for filename in filenames:
# exit nano
Ctrl+x y enter
cat gdp_plots.py
python gdp_plots.py -a # doesn't work due to *gdp_americas.csv file being formatted differently than the others

# edit script
nano gdp_plots.py
if '-a' in sys.argv:
    filenames=glob.glob('data/*gdp*[ae].csv')
else:
    filenames=sys.argv[1:]
#
ls -l figs
# add/commit
git add gdp_plots.py
git commit -m "adding a flag to run script for all gdp datasets except americas"
python gdp_plots.py
cd ..
###
nano gdp_plots.py
if '-a' in sys.argv:
    filenames=glob.glob('data/*gdp*[ae].csv')
    if filenames == []: # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data folder and files are located in current directory")
else:
    filenames=sys.argv[1:]

Version after correcting for silent errors:
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flag
# have been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py < filenames >")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if '-a' in sys.argv:
    filenames = glob.glob('data/*gdp*[ae].csv')
    if filenames == []: # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure the data folder and files are located")
        print("in the current directory")
else:
    filenames = sys.argv[1:]

for filename in filenames:
    # load data and transpose so that country names are
    # the
    # columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot()

    # set some plot attributes
    ax.set_xlabel('Year')
    ax.set_ylabel('GDP Per Capita')
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)

cd ..
python swc-gapminder/gdp_plots.py -a # error message is outputted!
cd swc-gapminder
git status
git add gdp_plots.py
git commit -m "handling case if no files are present in current directory"

### Let's start a new branch to begin refactoring or reorganizing our code
git checkout -b refactor
git branch # check branch
cat gdp_plots.py
# how should we reorganize our script into a set of fxns?
- one function that parses arguments
- one function for creating one plot
- one function that creates multiple plots
- one function that will call all possible functions--the "main" function

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
    argument list (normally sys.argv[1:])

    Returns:
    --------
    filenames: list of strings, list of files to plot
    """


def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
    filename: string, path to file to plot

    Returns:
    --------
    none
    """


def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
    filenames: list of strings, list of files to plot

    Returns:
    --------
    none
    """


def main():
    """
    main function - does all the work
    """


# call main
main()

# create a new script
touch refactored_gdp_plot.py
nano refactored_gdp_plot.py
# paste template above into this new file
# try to paste original code into template where appropriate

Refactored script.
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
    argument list (normally sys.argv[1:])

    Returns:
    --------
    filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flag
    # have been provided by the user
    # (argv already excludes the program name, so empty means no arguments)
    if len(argv) == 0:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py < filenames >")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if '-a' in argv:
        filenames = glob.glob('data/*gdp*[ae].csv')
        if filenames == []: # file list is empty (no files found)
            print("No files found in this folder.")
            print("Make sure the data folder and files are located")
            print("in the current directory")
    else:
        filenames = argv

    return filenames


def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
    filename: string, path to file to plot

    Returns:
    --------
    none
    """
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot()

    # set some plot attributes
    ax.set_xlabel('Year')
    ax.set_ylabel('GDP Per Capita')
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)


def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
    filenames: list of strings, list of files to plot

    Returns:
    --------
    none
    """
    for filename in filenames:
        create_plot(filename)


def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])
    # generate plots
    create_plots(files_to_plot)


# call main
main()

git add gdp_plots.py
git commit -m "refactored code into functions"
git checkout main
git merge refactor

# jupyter notebook time
jupyter lab
import gdp_plots # produces an error
# back to command prompt
nano gdp_plots.py
# change main() to....
if __name__ == '__main__':
    main()
# main() now runs only when the file is executed as a script; if you're just importing the module, it will not run
Ctrl+x y enter
gdp_plots.create_plot("data/gapminder_gdp_oceania.csv")
# back to command prompt
git add gdp_plots.py
git commit -m "moving call to the main function"

*Day 4 - Python pt 2
Lesson: http://swcarpentry.github.io/python-novice-gapminder/

*Sign-in: Name (pronouns optional), Affiliation, icebreaker: what's the best book, movie, or tv show you've seen/read recently?
* Clare Michaud (she/her), Data Science Hub, Better Call Saul
* Steven Warren (he/him), iSchool MA, Good Time is a good thriller that I watched on Netflix recently
* Chris Endemann, Data Science Hub, The Hobbit
* Sarah Stevens (she/her/hers), Data Science Hub, CODA
* Anthony Boyd (he/him), Undergrad student, Nuclear Engineering, Peaky Blinders
* Stephan Blanz, BME / WITNe, Manifest (TV Show)
* Joni Sedillo (she/her), Postdoc medical genetics, Klara and the Sun (book)
* Jeremiah Yee, biostatistics, White Lotus
* Katie Ziebarth, chemistry

*Notes:
*Lists
myList=['a', 'b', 'c']
myList=[1, 2, 3]
myList[0] # index first element of myList
myList[0] = .265 # store .265 as first element of myList
primes = [2, 3, 5]
primes.append(7) # add 7 as last/4th element of list
teen_primes=[11, 13, 17, 19]
middle_aged_primes=[37, 41, 43, 47]
primes.extend(teen_primes) # extend list to include elements of teen_primes
primes.append(middle_aged_primes) # append the whole list as a single (nested) element
primes=[2, 3, 5, 7, 9]
del primes[4] # delete the last element of primes. 9 isn't a prime value
primes=[] # empty list
goals = [1, 'Create lists.', 2, 'Extract items from lists.', 3, 'Modify lists.'] # list is mix of integers and strings. you can mix object types in list
element='carbon'
print(element[0]) # 1st element
print(element[3]) # 4th element
element[0]='C' # this doesn't work. strings are immutable, unlike lists
element[99] # string index out of range

*Exercise
print('string to list:', list('tin')) # convert string to list
print('list to string:', ''.join(['g', 'o', 'l', 'd'])) # convert list into a string/word
list('some string') # ['s', 'o', 'm', 'e', ' ', 's', 't', 'r', 'i', 'n', 'g']
print('-'.join(['x', 'y', 'z'])) # x-y-z
element='fluorine'
print(element[::2]) # go from start of string to end (two colons)-- print every other letter (2)
print(element[::-1]) # print string in reverse order

### Stepping through a list
1.
If we write a slice as low:high:stride, what does stride do?
- stride is the size of the step when moving from low to high elements
2. What expression would select all of the even-numbered items from a collection?
- myList[1::2]

# Program A
old=list('gold')
new = old # simple assignment. new and old both reference the same object.
new[0]='D' # both new and old now start with 'D'
# Program B
old=list('gold')
new=old[:] # assigning a slice. this method of assignment creates a new object implicitly
new[0] = 'D' # only new changes; old is unchanged

*For Loops
# iterate through the collection of numbers, [2, 3, 5]. As you iterate, 'number' is used to reference each element of list.
for number in [2, 3, 5]: # don't forget the colon!
    print(number) # body of loop (run for each element of list). Need to indent when adding code to the body of the loop.

# Sum the first 10 integers.
total=0
for number in range(10):
    total = total + (number + 1)
print(total)

# Exercise: Print out 'nit' using the skeleton code provided below
original = 'tin'
result = ''
for char in original:
    result = char + result
print(result)

Loop Exercises: Practice Accumulating
# Exercise 1)
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)

# Exercise 2)
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
lengths = ____
for word in ["red", "green", "blue"]:
    lengths.____(____)
print(lengths)

# Exercise 3)
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
words = ["red", "green", "blue"]
result = ____
for ____ in ____:
    ____
print(result)

# Exercise 4)
Create an acronym: Starting from the list ["red", "green", "blue"], create the acronym "RGB" using a for loop.
Hint: You may need to use a string method to properly format the acronym.
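One possible approach to Exercise 4, sketched with the same accumulator pattern as the earlier exercises. This is not the lesson's official solution; the variable name `acronym` is an assumption introduced here for illustration.

```python
# Accumulator pattern: start from an "empty" value and update it in the loop.
words = ["red", "green", "blue"]

acronym = ""  # hypothetical name: the string we accumulate into
for word in words:
    # take the first letter of each word and upper-case it
    acronym = acronym + word[0].upper()

print(acronym)
```

The same shape (initialize, loop, update) also fits Exercises 1 to 3; only the starting value and the update step change.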
*## Conditionals
mass=3.54
if mass > 3.0:
    print(mass, 'is large')
mass=2.07
if mass > 3.0:
    print(mass, 'is large')
if (mass[i] <= 2 or mass[i] >= 5) and velocity[i] > 20: # can use and/or to check for combinations of certain conditions. Use parentheses just like you would in math.

*# Looping over datasets
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

# How to find files that match a certain pattern
import glob
print('all csv files in data directory:', glob.glob('data/*.csv')) # find and print all .csv files in data folder
# * matches zero or more characters
# ? matches exactly one character
print('all jpg files:', glob.glob('*.jpg'))
for filename in glob.glob('data/gapminder_*.csv'):
    data=pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
for filename in glob.glob('data/*as*.csv'):
    data=pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

import glob
import pandas as pd
for filename in glob.glob('data/*.csv'):
    contents = pd.read_csv(filename)
    if len(contents) < 50:
        print(filename, len(contents))

*Exercise!
Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract the region (XXX) from the filename, expected to be in the format 'data/gapminder_gdp_XXX.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`XXX.csv`),
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4]
    # numeric_only=True skips text columns such as 'country' (needed in recent pandas versions)
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
plt.legend()
plt.xlabel('Year')
plt.ylabel('Mean GDP per Capita')
plt.xticks(rotation=90)
plt.show()

*Functions
def print_greeting():
    print('Hello!')
print_greeting()

def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)
print_date(1871, 3, 19) # 1871/3/19
print_date(month=3, day=19, year=1871) # 1871/3/19

def average(values):
    if len(values)==0:
        return None
    return sum(values) / len(values)
a=average([1, 3, 4])
# Use 3 single quotes to start a docstring (a comment block). Useful for adding info on the purpose of your fxn.

# local vs. global variables
pressure=103.9
def adjust(t):
    '''Takes input t and returns temperature as output'''
    temperature=t*1.43/pressure
    return temperature
print('adjusted:', adjust(0.9))
print('temperature after the call:', temperature) # NameError: temperature is local to adjust()

*Day 3 - Python pt 1
Lesson: http://swcarpentry.github.io/python-novice-gapminder/

*Sign-in: Name (pronouns optional), Affiliation, ICEBREAKER: something fun you did over the weekend
* Clare Michaud (she/her), Data Science Hub, spent a lot of time in various east side parks yesterday - helper/host
* Heather Shimon (she/her), Science & Engineering Libraries, bike ride through the arboretum
* Chris Endemann (he/him/his), Data Science Hub, went to "sessions at McPike park" on Friday
* Stephan Blanz, Department of Biomedical Engineering, spent time with friends around a bonfire
* Scott Prater (he/him/his), UW Digital Collections Center, went to Van Gogh Experience show in Milwaukee
* Sarah Stevens (she/her/hers), Data Science Hub, went to the memorial union terrace for the first (and second) time in awhile! - Helper
* Anthony Boyd (he/him), Undergrad student, Nuclear Engineering,
* Jeremiah Yee, Biostatistics, tiled a shower <- wow! is this fun? I missed the fun part.
Not really lol
* Joni Sedillo (she/her), Medical genetics postdoc, took my daughter and dog to the park
* Katie Ziebarth; chemistry grad student; went to Green Bay
* Steven Warren (he/him) - helper, iSchool MA student, had a picnic with friends at Tenney Park

*Notes:
Enter markdown mode: Esc+m
Run current cell: Shift+Enter or Ctrl+Enter
Headings
- One hashtag: level 1 heading
- Two hashtags: level 2 heading
- etc.
Make a new cell below: Esc (for command mode), then press b
Make a new cell above: Esc (for command mode), then press a

Printing
- Print variable contents to screen: print()
- combine additional strings in print output using commas: print(first_name, 'is', my_age, 'years old')

How to index different parts of a string variable
- print(my_name[0]) # first index is always "0" in python - prints out first letter of my_name
- print(first_name[0:5]) # the second index (5) says you'll print up to, but not including, index 5
- print(first_name[6]) # gets last element of a 7-character string
- print(first_name[-1]) # gets last element of string

Variable Types
- Variable types control what kinds of commands can be run on the variable (e.g. can't subtract strings)
- Print type of variable: print(type(my_variable))
full_name = "Stephan" + " " + "Blanz" # combine these three strings into single string stored in full_name
separator = '=' * 10 # repeat '=' 10 times
print(len(separator)) # print length of separator
print(my_name[len(my_name)-1]) # print last element of my_name

Casting variables
number_two = int('2') # cast string of '2' as an integer

Division and remainder operations
- use // for "floor division", e.g. print(5 // 3) yields 1 (a single / gives true division: 5 / 3 is 1.666...)
- use % to get remainder after floor division, e.g.
print(5 % 3) yields 2

Imaginary/real components
complex = 6 + 2j
print(complex.real) # print real component of complex number
print(complex.imag) # print imaginary component of complex number

Max/min/round
print(max(1, 2, 3)) # prints max of the given numbers (3)
print(min('a', 'A', '0', '1')) # all inputs are characters. characters are compared by character code, so digits come before letters. prints 0
round(3.712) # rounds to 4
round(3.712, 1) # round to the first decimal place (3.7)

Get help on a function
- help(round)
- press and hold shift + tab after spelling out a given function to see info about the function

Object Methods
my_string = "Hello world"
print(len(my_string))
print(my_string.swapcase()) # can run this on any string
print(my_string.isupper()) # can run this on any string, checks if entire string is upper case or not
print(my_string.upper()) # can run this on any string, changes string to be entirely upper case
print(my_string.upper().isupper()) # methods can be chained; prints True

Challenge:
easy_string = "abc"
rich = "gold"
poor = "tin"
number_two_str = '2'
Think about what will happen when you run the following. Then give it a try.
print(max(easy_string))
print(max(rich, poor, number_two_str))
print(min(rich, poor, number_two_str))
print(max(len(rich), len(poor)))

Import a library - do this at the top of your script to maintain a clean script
import math
help(math) # get help on math library
print(math.pi)
print(math.cos(math.pi)) # get cosine of pi
# Import only a handful of fxns - warning: this can get messy if the same names are used by multiple libraries
from math import cos, pi # get only two names from math library
print(cos(pi))
# can give your imported package a nickname to reference throughout your script
import math as m
m.cos(m.pi)

base = "TATTAGCTTA"
print(type(base))

Using random
import random # do this at top of script
rand_int = random.randrange(0, len(base)) # random integer from 0 up to, but not including, the length of base
print("Random int:", rand_int, base[rand_int]) # print index value and element at that index

bases="ACTTGCTTGAC"
import math
import random
___ = random.randrange(n_bases)
___ = len(bases)
print("random base ", bases[___], "base index", ___)

Pandas dataframes!
import pandas as pd # most python users refer to pandas as "pd"
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv')
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv', index_col='country') # country is now your index column
data # view dataframe
data.info() # get some info on dataframe
data.columns # print out list of all columns
# notice that country is no longer technically a column of dataframe--it is an indexing variable
data.T # transpose dataframe
data.describe() # print some summary stats (e.g.
min/max, mean) on dataframe
data.iloc[rowInd,colInd] # index specific rows/columns based on integer indices
data.loc["Albania", "gdpPercap_1952"] # index specific rows/columns based on labels
data.loc["Albania", :]
subset = data.loc['Croatia':'Finland', 'gdpPercap_1967':'gdpPercap_1997'] # from Croatia to Finland, from 1967 to 1997
mask = subset > 12000
subset[mask] # shows all elements in dataframe where gdp was > 12000; everything else is NaN
subset_greater12k = subset[mask]
subset_greater12k.describe() # summary stats on this subset
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns) # axis=1 means sum across the columns of each row
wealth_score
data.groupby(wealth_score).sum()

# Exercise - how to get gdp data from Austria, 1952-1957
data.loc['Austria', 'gdpPercap_1952':'gdpPercap_1957']
data.iloc[1, 0:2]

New notebook
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv', index_col='country') # country is now your index column
time = [0, 1, 2, 3]
position = [0, 100, 150, 200]
plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
plt.show() # if not using jupyter notebook, need this to actually display the plot

years = data.columns.str.strip('gdpPercap_') # remove gdpPercap_ from beginning of each column name
data.columns = years.astype(int) # replace dataframe columns with new column names
data.loc['Australia'].plot() # plot gdp per year of Australia
# to plot across columns instead of rows, transpose the dataframe
data.T.plot() # plot gdp for each country
plt.xlabel('Year')
plt.ylabel('GDP per capita')

# using the "ggplot" style -- a plotting style borrowed from R
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.xlabel('Year')
plt.ylabel('GDP per capita')

years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--') # make a green dashed line on plot
gdp_nz = data.loc['New Zealand']
plt.plot(years,
         gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g--', label='New Zealand')
plt.legend(loc='upper left') # add a legend and explicitly control location of legend
plt.scatter(gdp_australia, gdp_nz)

# Let's start with a fresh dataframe
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv', index_col='country') # country is now your index column
data.head()
data.max().plot()
data_all = pd.read_csv(r'fullpath/gapminder_all.csv', index_col='country') # country is now your index column
data_all.head(10) # look at first 10 rows
data_all.plot(kind='scatter', x='gdpPercap_2007', y='lifeExp_2007', s=data_all['pop_2007']/1e6)

*Day 2 - git/GitHub
Lesson: https://carpentries-incubator.github.io/git-novice-branch-pr/

*Sign-in: Name (pronouns optional), Affiliation, one project you'd like to be able to use version control (git) on and why
* Sarah Stevens (she/her/hers), Data Science Hub, was trying to think of a project but I think I do use git on most of them already, yay! - Helper
* Chris Endemann (he/him/his), Data Science Hub - Helper, I like to use Git on all of my research projects. Most recently I have used it on a project involving estimating brain connectivity patterns.
* Clare Michaud (she/her/hers), Data Science Hub - Helper/Host, one project that I will be using git on is new Carpentry workshop material
* Chris Kirby (he/him/his) - I would like to use git on our LENNS Project
* Una Baker (she/her/hers), postdoc in nuclear engineering. My team uses git for multiple projects and I always forget which order to execute the commands!
* Jeremiah Yee (he/him/his) postdoc in biostatistics. Using git to make sure I'm using the right version
* Casey Schacher (she / her), Science and Engineering Librarian, Instructor
* Anthony Boyd (he/him), Undergrad student, Nuclear Engineering, I would like to use git for research project
* Andrey Vega. Plant Breeding and Plant Genetics.
* Steven Warren (he/him), iSchool MA student - Helper, I'm interested in using Git for building/hosting new workshop/website content.
* Chandler Meyer (she/her), Plant Breeding and Plant Genetics. I would like to use it for a sequencing project or other projects
* Amanda DeWitt (she/her), Center for Health Disparities Research. I would like to use version control on the data processing programs used in our upcoming data collection/cleaning
* Scott Wildman (he/him), Academic Planning and Institutional Research - Helper
* Joni Sedillo (she/her), Post Doc, I work in a medical genetics lab studying epigenetics

*Notes
Git Configuration
- set name: git config --global user.name "yourName"
- set email: git config --global user.email "emailAddress"
- control git console color scheme: git config --global color.ui "auto"
- (Windows only): git config --global core.autocrlf true
- (Mac only): git config --global core.autocrlf input
- set editor: git config --global core.editor "nano -w"

Git Help
git config -h
git config --help

Creating a repository ("repo")
- check current directory: pwd
cd Desktop
ls
mkdir gitWorkshop # creates the folder if you don't have it already
cd gitWorkshop
pwd
- create a subdirectory: mkdir planets
cd planets
- initialize repo: git init
- Note: The planets subdirectory is now a Git repository. Git will now track any changes in this directory.
ls
- See hidden git-related files: ls -a
- Check status: git status
- Note: you don't want to nest repos. You are already tracking the contents of your subdirectory--no need to create additional repos inside the 'planets' subdirectory.
- git status shows you that you are currently on the "master" branch--more on this later
pwd

Tracking a new file
- Open up a plain text file using nano: nano mars.txt
- Enter some text into the blank file: "Cold and dry, but everything is my favorite color."
- Ctrl+x or Command+x to exit nano editor - press "y" - enter
- Check that it worked:
cat mars.txt
- check that Git sees the new file as untracked:
git status
shows that you have a new file: mars.txt
- add the file:
git add mars.txt
- You may get a warning, but it's not a concern for us. Unix and Windows use different line endings; the warning says that Git will convert Windows line endings to Unix line endings.
git status: shows that the file is added, but not yet committed
- Commit the file and describe what is being committed/changed in your repo:
git commit -m "Start notes on Mars"
- There is now a permanent copy of this version of the file stored in a hidden .git folder in your planets folder/repo
- check that the file is committed:
git status
- To see a log of all commits, type:
git log
- Edit your original file:
nano mars.txt
- type: "The two moons may be a problem for Wolfman."
- Ctrl+x or Command+x to exit nano editor - press "y" - enter
- Check that it worked:
cat mars.txt
- see if Git notices the change:
git status
- Need to re-add and re-commit the file. But before that...
git diff
- shows the difference between the last committed version and the current version of the file

Challenge time! Try to add and commit your most recent change to your repository.
git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
- Note: best to commit files/changes one at a time so that your message can reflect a specific change. Also so that you can go back to very specific versions of files, if need be.
git status
- Let's add a third line to our file:
nano mars.txt
- Enter some new text on a new line: "But the Mummy will appreciate the lack of humidity."
- Ctrl+x or Command+x to exit nano editor - press "y" - enter
- Check that it worked:
cat mars.txt
- See difference between current version and last committed version:
git diff
git add mars.txt
git diff
- Nothing shows up as a change. This is because you've added (staged) the latest change.
git diff --staged
- The above command shows the difference between the files in the staging area (files that are added) and the files that are already committed
git commit -m "Discuss concerns about Mars' climate for Mummy."

What happens if we only change specific words in a file?
nano mars.txt
- up/down and left/right arrows to move across and within lines
- add a couple of words in the file
- Ctrl+x or Command+x to exit nano editor - press "y" - enter
- Check that it worked:
cat mars.txt
git diff
- to see which specific words are changed, enter the following command:
git diff --color-words mars.txt
git status
git add mars.txt
git commit -m "Adding descriptors for monsters."
git status
- Check out all current commits:
git log
- Most recent commit listed first
- How can we shorten the log display?
- View most recent commit:
git log -1
- View the two most recent commits:
git log -2
- get a condensed view of log:
git log --oneline
- Add another line of text
nano mars.txt
"An ill-considered change" (or any text you want)
- Ctrl+x or Command+x to exit nano editor - press "y" - enter
- Check that it worked:
cat mars.txt
git diff HEAD mars.txt
- look at difference between current version and HEAD~1 (back 1 from most recent commit):
git diff HEAD~1 mars.txt
git diff HEAD~2 mars.txt
- get a condensed view of log:
git log --oneline
- Look at differences and commit message:
git show HEAD~2 mars.txt
git log --oneline
- You can use a commit's unique identifier to look at differences from a specific commit
git diff UniqueCommitIdentifier mars.txt

How can we restore a previous version?
- let's first edit the file again
nano mars.txt
- delete everything! Replace with one line of text if you'd like, but it's not critical.
- Ctrl+x or Command+x to exit nano editor - press "y" - enter
- Check that it worked:
cat mars.txt
- everything is deleted :(
git status
- Restore file to most recent version committed in repo:
git checkout HEAD mars.txt
- see if the file is restored:
cat mars.txt
- Review log:
git log --oneline
- Restore an earlier version of the file:
git checkout HEAD~2 mars.txt
cat mars.txt
git status
git checkout HEAD mars.txt
cat mars.txt
- HEAD (most recently committed version of file) is restored!
git status
- Accidentally detach the HEAD (by forgetting the file name):
git checkout HEAD~1
- Fix the detached HEAD:
git checkout master

Challenge time!
nano venus.txt
- Add text: "Venus is beautiful and full of love"
cat venus.txt
git add venus.txt
nano venus.txt
- Add "Venus is too hot to be suitable as a base" as text
git commit -m "comment on Venus as an unsuitable base"
git checkout HEAD
git log --oneline venus.txt
git checkout HEAD venus.txt
cat venus.txt
- Only has one line, because the git add happened before the 2nd line was added, so the commit only had the first line in it. The take-home is that if you change a file after it has been added to the staging area, git commit will commit the staged version, not your latest edits. You can unstage the file using `git reset venus.txt`, which brings it back out of the staging area, and then `git add` it again to stage the current version.

What are branches?
Sometimes we want to keep our main work "safe" from experimental changes. Branches can be merged into the main/master branch (after you've tested them).
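The venus.txt take-home above (a commit records the staged version, not later edits) can be checked in a throwaway repository. This is a minimal sketch, not part of the lesson: the temporary directory, the placeholder identity, and the variable names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway repository so the real planets repo is untouched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
# Placeholder identity, stored only in this demo repository.
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "Venus is beautiful and full of love" > venus.txt
git add venus.txt                                              # stage the one-line version
echo "Venus is too hot to be suitable as a base" >> venus.txt  # edit AFTER staging
git commit -q -m "comment on Venus as an unsuitable base"

# Count lines in the committed version and in the working copy.
committed_lines=$(git show HEAD:venus.txt | wc -l | tr -d ' ')
working_lines=$(wc -l < venus.txt | tr -d ' ')
echo "committed: $committed_lines line(s), working copy: $working_lines line(s)"

# Restoring from HEAD brings back the one-line committed version.
git checkout -q HEAD venus.txt
restored_lines=$(wc -l < venus.txt | tr -d ' ')
echo "after checkout: $restored_lines line(s)"
```

The commit holds one line while the working copy holds two, which is why `git checkout HEAD venus.txt` leaves only the first line.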
# to see what branches are available for a repo
git branch # we only have one so far, shows an asterisk next to the branch you are on

Dracula is running an analysis in python and in bash to see which one is faster - will have two experimental branches (one for python, one for bash) and then will merge the fastest one into the master branch
git branch pythondev # create a new branch called pythondev
git branch # check again and we are still on the master branch but have a new branch called pythondev
# need to move to the pythondev branch
git checkout pythondev # switching to the pythondev branch
touch analysis.py # creating the analysis script - pretending we worked on it a lot
git add analysis.py
git commit -m "wrote and tested python analysis script"
git log --oneline # we can see we have the new commit in the pythondev branch
git checkout master # going back to the master branch to confirm the new file and commit don't exist there
git branch # check that we are on the master branch
git log --oneline # see that we don't have the python analysis commit
ls # check that the python analysis script doesn't exist
# now we want to do this same experiment for the bash analysis
git checkout -b bashdev # this both creates the branch and moves to it in one step, above we did this in two steps for the pythondev branch
git branch # confirm we are on the bashdev branch
ls # can see that we don't have the python analysis script because we made the branch from master, so it only has the commits and files that the master branch had when we created the new branch
touch analysis.sh # created bash analysis script, imagining we worked a lot on it
git add analysis.sh
git commit -m "wrote and tested bash analysis script"
git status # checking we committed it
# turns out the python script is faster and we only want to keep the python script
git checkout master # switch back to the master branch
git branch # double check that we are on the master branch
git merge pythondev # merging
the pythondev branch (files and commits) into the branch we are currently on (the master branch)
ls # see that analysis.py is now in the master branch
git log --oneline # see that we have the commit from the pythondev branch
# clean up the experimental branches - since we don't need them anymore and if we find them later we might get confused
git branch -d pythondev # deleting the pythondev branch
git branch -d bashdev # trying to delete the bashdev branch but..
# we get an error because these commits are not in any other branch...we still want to delete it so we can do..
git branch -D bashdev # and it deletes the bashdev branch

# conflicts - how to handle issues with different edits to the files in separate branches
# create a conflict on purpose so we can learn to resolve it
git branch marsTemp
nano mars.txt # add a line about day length
git add mars.txt
git commit mars.txt # adding this line to the master branch
git checkout marsTemp
# The new change (day length) isn't in the marsTemp branch as it was added AFTER marsTemp was created
nano mars.txt # add a line about temperature (Yeti)
git add mars.txt
git commit mars.txt # adding this to the marsTemp branch
git checkout master
git merge marsTemp
# CONFLICT - git adds marker lines to mars.txt that help us identify where the conflict is
# Fix this by editing the mars.txt file and removing (or otherwise editing) the extra text - be sure to remove the identifier lines ("<<<<<<<", ">>>>>>>", "=======")
cat mars.txt # just to make sure we have what we want after editing.
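To see what the conflict markers look like without touching the workshop repo, the steps above can be replayed in a throwaway repository. A minimal sketch under stated assumptions: the temporary directory, the placeholder identity, the two appended lines of text, and the variable names are made up for illustration, and stripping the marker lines with grep (which keeps both edits) is only one possible resolution, where the notes edit the file by hand in nano.

```shell
#!/usr/bin/env bash
set -eu

conflict_dir=$(mktemp -d)
cd "$conflict_dir"
git init -q
# Placeholder identity, stored only in this demo repository.
git config user.name "Demo User"
git config user.email "demo@example.com"
# The default branch may be called master or main depending on your git setup.
main_branch=$(git symbolic-ref --short HEAD)

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

git branch marsTemp
echo "Note about day length" >> mars.txt        # change on the main branch
git commit -q -am "Add note on day length"

git checkout -q marsTemp
echo "Note about temperature" >> mars.txt       # different change at the same spot
git commit -q -am "Add note on temperature"

git checkout -q "$main_branch"
# The merge is expected to fail, so capture the outcome instead of aborting.
if git merge marsTemp > /dev/null 2>&1; then
    merge_status="clean"
else
    merge_status="conflict"
fi
echo "merge result: $merge_status"
cat mars.txt                                    # shows the marker lines

marker_count=$(grep -c -E '^(<<<<<<<|=======|>>>>>>>)' mars.txt)

# One possible resolution: keep both edits and drop only the marker lines.
grep -v -E '^(<<<<<<<|=======|>>>>>>>)' mars.txt > mars.resolved
mv mars.resolved mars.txt
remaining_markers=$(grep -c -E '^(<<<<<<<|=======|>>>>>>>)' mars.txt || true)
resolved_lines=$(wc -l < mars.txt | tr -d ' ')
```

After the marker lines are gone, the file still has to be added and committed for the merge to be finished.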
# But this hasn't completed the merge -> commit
git commit -m "Merged changes from marsTemp"
git checkout marsTemp # going back to marsTemp branch
nano mars.txt # opening up the file
# in nano you can see that it only has the marsTemp changes, not the master changes - the merge is not bi-directional
# adding a line to the end of the file in this branch
The polar caps will probably be Yeti's home
# out of nano
git add mars.txt
git commit -m "wrote a note about Yeti's home"
git checkout master # switching back to the master branch
git log --oneline # looking at the log, we see that we don't have the last commit from marsTemp
git merge marsTemp # merging in the latest changes from marsTemp

#challenge
Create and switch to a new branch, change something in that branch, and then merge it back into the master branch

#github!!! - repository hosting services
# Create a github account - https://github.com/
In github...
click the new repository button
repository name - planets
description - can leave blank or put something like "practice repo for learning git"
keep public (you can make it private but this repo doesn't need to be private) - casey made hers private
don't initialize a readme, gitignore or license - because we are importing an existing repo
click create repository
choose the ssh authentication option and then copy the address next to it

back in the bash shell
# connecting the local planets repo to the github one
git remote add origin git@github.com:USERNAME/planets.git
# check that you set up the remote connection
git remote -v # should show two origin connections (fetch and push)

# setting up authentication for github
# check for ssh keys
ls -al ~/.ssh
# creating a new key
ssh-keygen -t ed25519 -C "YOUREMAIL4GITHUB"
# asks where to save the key, press enter to use the default location
# asks if you'd like to have a passphrase - leave blank if you'd rather not be asked for a passphrase, or enter one if you'd like
ssh -T git@github.com # will show permission denied
cat
~/.ssh/id_ed25519.pub # then copy your public key from the screen - we will need to put it into github

Back in github...
Click on your icon in the upper right hand corner, and choose the settings option
From the menu list on the left hand side, scroll down and choose "SSH and GPG keys"
Click the New SSH Key button
Name it with the name of your computer e.g. "Casey's Work Laptop"
Paste the key you copied from the command line into the Key section
Click Add SSH Key
Enter your password for github

Back in command line / unix shell
ssh -T git@github.com # confirming that the ssh connection is set up
# should get a success message
git push origin master

WE WILL PICK THIS BACK UP ON THE EXTRA DAY NEXT FRIDAY - AT LEAST FOR A BIT

*Day 1 - Unix Shell
Lesson: http://swcarpentry.github.io/shell-novice/

*Sign-in: Name (pronouns optional), Affiliation, a sentence or two describing your research/work
* Sarah Stevens (she/her/hers), Data Science Hub, I help researchers learn the computational/data science skills they need to do their work. - Instructor
* Karl Broman (he/him/his), Biostatistics & Medical Informatics, applied statistics, mostly in genetics - Helper
* Chris Endemann (he/him/his), Data Science Hub - Helper
* Scott Prater (he/him/his), UW Digital Collections Center, Digital Library Architect - Helper
* Heather Shimon (she/her), Science & Engineering Libraries - Helper
* Clare Michaud (she/her), Data Science Hub - Helper
* Steven Warren (he/him), iSchool MA student - Helper
* Una Baker (she/hers), Engineering Physics, Postdoc in Nuclear Engineering
* Chandler Meyer (she/hers), Graduate student, Plant Breeding and Plant Genetics, genetic engineering and gene function research
* Anthony Boyd (he/him), Undergrad student, Nuclear Engineering
* Tariq Alauddin, Research Analyst
* Joni Sedillo (she/her), Post Doc, I work in a medical genetics lab studying epigenetics
* Chris Kirby (he/him) I have borrowed a laptop from Emily hence the screen name.
I work for GLAS Education; we do outreach for Yerkes Observatory programs.

*Notes:
Download data: http://swcarpentry.github.io/shell-novice/data/shell-lesson-data.zip
Unzip/extract all data and save to desktop
Rename folder to data-shell if needed

What happens if you type a command that doesn't exist?
ks
error message - ks: command not found

Commands in the Unix Shell
ls - list the contents
ls -F displays a slash after the paths that are directories
pwd - print working directory, tells you what folder you are working in

Help
ls --help - help for a command on Windows
man ls - help for a command on Mac (man stands for manual)

Challenge: Exploring More ls Flags
1. You can also use two options at the same time. What does the command ls do when used with the -l option? What about if you use both the -l and the -h option? Some of its output is about properties that we do not cover in this lesson (such as file permissions and ownership), but the rest should be useful nevertheless.
ls -l
"l" stands for long form, shows additional information like file size and the time of last modification
ls -lh
gives sizes in easier-to-read units, "h" stands for human readable
2. By default, ls lists the contents of a directory in alphabetical order by name. The command ls -t lists items by time of last change instead of alphabetically. The command ls -r lists the contents of a directory in reverse order. Which file is displayed last when you combine the -t and -r flags? Hint: You may need to use the -l flag to see the last changed dates.
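One way to explore the -t and -r flags is in a scratch directory with files whose modification times are known. This is a sketch under stated assumptions: the directory, the file names, and the timestamps are invented for the example and are not from the lesson data.

```shell
#!/usr/bin/env bash
set -eu

ls_demo=$(mktemp -d)
cd "$ls_demo"

# Create three empty files with explicit modification times (YYYYMMDDhhmm).
touch -t 202101010000 oldest.txt
touch -t 202102010000 middle.txt
touch -t 202103010000 newest.txt

# -t sorts by modification time, most recent first.
first_by_time=$(ls -t | head -n 1)
# Adding -r reverses that order, so the most recent file comes last.
last_by_time_reversed=$(ls -tr | tail -n 1)

echo "first with -t:  $first_by_time"
echo "last with -tr:  $last_by_time_reversed"

# Long form, so the dates are visible next to the names.
ls -ltr
```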
-- More commands and flags
ls -t lists by last modified date order
ls -tr adding the "r" flag reverses the order
ls -ltr lists long form by time in reverse order
ls Desktop - to print contents of desktop
ls Desktop/data-shell
prints: creatures/ data/ molecules/ north-pacific-gyre/ notes.txt pizza.cfg solar.pdf writing/

to move to a different directory
cd - change directory
cd Desktop
Reminder - pwd to see what directory you are in
/ is the root folder at the beginning of your file system
cd data-shell
cd data
How to move back up one folder?
cd .. to go back to data-shell
cd .. to go back to desktop
can move more than one directory at a time
cd data-shell/data

Challenge: absolute vs relative paths
Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?
1. cd . No: . stands for the current directory.
2. cd / No: / stands for the root directory.
3. cd /home/amanda No: Amanda’s home directory is /Users/amanda.
4. cd ../.. No: this command goes up two levels, i.e. ends in /Users.
5. cd ~ Yes: ~ stands for the user’s home directory, in this case /Users/amanda.
6. cd home No: this command would navigate into a directory home in the current directory if it exists.
7. cd ~/data/.. Yes: unnecessarily complicated, but correct.
8. cd Yes: shortcut to go back to the user’s home directory.
9. cd .. Yes: goes up one level.

shortcut: cd - takes you to the last folder that you were in

Challenge: Relative Path Resolution
Using the filesystem diagram below, if pwd displays /Users/thing, what will ls -F ../backup display?
1. ../backup: No such file or directory No: there is a directory backup in /Users.
2. 2012-12-01 2013-01-08 2013-01-27 No: this is the content of Users/thing/backup, but with .., we asked for one level further up.
3. 2012-12-01/ 2013-01-08/ 2013-01-27/ No: see previous explanation.
4. original/ pnas_final/ pnas_sub/ Yes: ../backup/ refers to /Users/backup/.

--
cd ..
ls
look inside folder north-pacific-gyre
ls nor +Tab for tab complete
Tab again to see the folder inside of north-pacific-gyre

mkdir - "make directory"; create a new folder
mkdir thesis
Avoid spaces in file names, can use underscores or dashes - be consistent
Also avoid other special characters - $&()"
cd thesis
nano draft.txt to create file (nano is a text editor)
^ is the same as Ctrl
- Enter text: It's not "publish or perish" any more, it's "share and thrive".
Ctrl X to save and exit - Y for yes, press enter to keep the same name

cat - concatenate, or show file contents
cat draft.txt

mv - move a file to a new path, can also be used to rename
mv thesis/draft.txt thesis/quotes.txt
cat thesis/quotes.txt
mv thesis/quotes.txt .

Challenge: Moving Files to a new folder
After running the following commands, Jamie realizes that she put the files sucrose.dat and maltose.dat into the wrong folder. The files should have been placed in the raw folder. Fill in the blanks to move these files to the raw/ folder (i.e. the one she forgot to put them in)
$ mv sucrose.dat maltose.dat ____/____
Solution:
$ mv sucrose.dat maltose.dat ../raw
Without having changed into analyzed first:
mv analyzed/maltose.dat analyzed/sucrose.dat raw
or
mv analyzed/maltose.dat analyzed/sucrose.dat ./raw

Challenge: renaming files
Suppose that you created a plain-text file in your current directory to contain a list of the statistical tests you will need to do to analyze your data, and named it: statstics.txt
Solution: mv statstics.txt statistics.txt

--
Deleting files - be careful! Very hard to get back.
rm
example: rm quotes.txt
rm -r thesis to remove a directory
rm -i to check in with you before deleting

cd molecules
Using wildcards for accessing multiple files at once: * ?
* to access all files that end in .pdb
ls *.pdb
touch my_notes.txt
can scroll back using up arrow for previous commands
scroll to ls *.pdb
?
wildcard for one character
ls p???ane.pdb

Challenge: wildcards
When run in the molecules directory, which ls command(s) will produce this output?
ethane.pdb methane.pdb
1. ls *t*ane.pdb shows all files whose names contain zero or more characters (*) followed by the letter t, then zero or more characters (*) followed by ane.pdb. This gives ethane.pdb methane.pdb octane.pdb pentane.pdb.
2. ls *t?ne.* shows all files whose names start with zero or more characters (*) followed by the letter t, then a single character (?), then ne. followed by zero or more characters (*). This will give us octane.pdb and pentane.pdb but doesn’t match anything which ends in thane.pdb.
3. ls *t??ne.pdb - Correct: fixes the problems of option 2 by matching two characters (??) between t and ne. This is the solution.
4. ls ethane.* only shows files starting with ethane.

--
Running multiple commands at a time - pipes and filters
wc - word count - it counts the number of lines, words, and characters in files
wc cubane.pdb
20 156 1158 cubane.pdb
wc *.pdb - counts for all of the files ending with .pdb, with total counts
wc -l *.pdb to show only the line count
to send output to a file, use > followed by the file name
wc -l *.pdb > lengths.txt
cat lengths.txt to see what is in the file
to sort by number in lengths.txt
sort -n lengths.txt > sorted-lengths.txt
cat sorted-lengths.txt
head - gives you the top of the file
head -n 1 sorted-lengths.txt
-n 1 gives the top line

echo - prints whatever is typed after it
echo hello > testfile01.txt
overwrites the file content in testfile01.txt
echo hello >> testfile02.txt
echo hello >> testfile02.txt
appends both hellos into the same file (hello is included twice)
> overwrites
>> appends

combine with pipe |
sort -n lengths.txt | head -n 1
wc -l *.pdb | sort -n | head -n 1

Challenge: Piping Commands Together
In our current directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?
1. wc -l * > sort -n > head -n 3
2.
wc -l * | sort -n | head -n 1-3
3. wc -l * | head -n 3 | sort -n
4. wc -l * | sort -n | head -n 3 Correct

--
cd ../north-pacific-gyre/2012-07-03/
wc -l *.txt | sort -n | head -n 5 to see if there are any files that are too small
wc -l *.txt | sort -n | tail -n 5 to see if there are any files that are too big
ls *Z.txt to check for files that end with Z
ls *[AB].txt to check for files that end with A or B
cd ../../creatures
head unicorn.dat

Loops
for filename in *.dat
> do
> head -n 2 $filename | tail -n 1
> done
printed:
CLASSIFICATION: basiliscus vulgaris
CLASSIFICATION: bos hominus
CLASSIFICATION: equus monoceros

for x in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $x | tail -n 1
> done

One line version
for x in basilisk.dat minotaur.dat unicorn.dat; do head -n 2 $x | tail -n 1; done

history - to see history of commands

Shell Scripts can be useful for saving commands that you run on multiple datasets: https://swcarpentry.github.io/shell-novice/06-script/index.html

Last lesson is for finding things
* Use grep to select lines from text files that match simple patterns.
* Use find to find files and directories whose names match simple patterns.
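Since the finding-things lesson was not covered in the notes, here is a minimal sketch of the two commands named above. The scratch directory, the file names, and the file contents are assumptions made up for this example; they are not the lesson's data.

```shell
#!/usr/bin/env bash
set -eu

find_demo=$(mktemp -d)
cd "$find_demo"

# Hypothetical files to search through.
mkdir -p writing/drafts
printf 'first line\nsecond line with pattern\nthird line\n' > writing/notes.txt
printf 'no match here\n' > writing/drafts/other.txt

# grep selects the lines of a file that match a pattern; -n adds line numbers.
matching_lines=$(grep -n "pattern" writing/notes.txt)
echo "$matching_lines"

# find looks for files and directories by name, starting from the given directory (here ".").
txt_files=$(find . -name "*.txt" | sort)
echo "$txt_files"
txt_count=$(find . -name "*.txt" | wc -l | tr -d ' ')
```

Note the quotes around "*.txt": they keep the shell from expanding the wildcard before find sees it.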