Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of The Carpentries community; it is not for general-purpose use (for that, try etherpad.wikimedia.org).
Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

Kees den Heijer (instructor)
Heather Andrews (instructor)
Yasemin Turkyilmaz (helper)
Nicolas Dintzner (helper)
Santosh Ilamparuthi (helper)
Sobhan Omranian Khorasani (helper)
Yan Wang (helper)
Esther Plomp (helper)
Suellen Pereira Espindola
Francesca Greco
Banafsheh Abdollahi
Stefano Lovato
Nidarshan Kumar
Jeroen van Dijk
Boram Kim
Samprajani Rout
Aad Vijn
Fatma
Tomas Pippia
Adrian Gonzalez
Lucia Beloqui
Samaneh Babayan
Wilfred van der Vegte
Alina Colling
Wouter van der Wal
Renfei Bu
Yuki Murakami
Yelena
Davide
Joost
Tanveer Ahmad
Shahrzad Nikghadam
Xiaocong Lyu
Cehao Yu
Changrang Zhou
Lucia Beloqui Larumbe

For more information, see the website of The Carpentries: https://software-carpentry.org/

Introduction to the Unix Shell

Download the data used in the Unix Shell lesson if you have not already: https://swcarpentry.github.io/shell-novice/data/data-shell.zip

Starting the shell:
WINDOWS: Git Bash (if you installed it)
MACOS: Terminal will work fine
Linux/Others: Terminal - the true shell :)

Commands:
$ ls : list files in the current directory
$ ks : command not found!
The shell tells you when it can't find a program you are calling.
$ pwd : print current working directory (where you are in the file system, i.e. on your hard drive)
$ ls -F : -F indicates whether each entry is a file or a directory; * indicates an executable file; an @ sign means it is a link
$ ls -F / : gives us a listing of the files and directories in the root directory. / means the root of your directory tree
$ ls --help or man ls : if you are not sure about the additional options of ls. There are differences between operating systems, which is why there are two options: "--help" will work in Git Bash on Windows, "man ls" will work in the MacOS and Linux terminals (in most cases). Press q or Ctrl+C to return to your prompt
$ ls -F Desktop
$ ls -F Desktop/data-shell : to see the contents of the data-shell directory
$ cd Desktop : to change our directory to Desktop
$ pwd : to print the working directory, to confirm we are in Desktop
$ cd data-shell : to go into the data-shell directory
$ cd data : to go into the data-shell/data directory
$ ls -F : to see the contents of the data-shell/data directory
$ cd data-shell : doesn't work, because we are in data; this is a relative path and doesn't work to go up
$ cd .. : to go one directory up
$ pwd : to confirm we are in data-shell
$ ls -F -a : -a stands for 'show all' (including hidden files); it forces ls to show us file and directory names that begin with ., such as .. (a single dot is the current location, a double dot is one level up)
$ ls north-pacific-gyre/2012-07-03/ : to look at the contents of this folder
$ ls nor + Tab : auto-completes your command based on what is available in your directory. Pressing Tab can complete your command and save you a lot of typing
$ pwd : to check where we are
$ mkdir thesis : to make a new directory called thesis
$ ls -F : now we can see that there is a thesis directory. When you make new directories, it is recommended not to use spaces in names; use - or _ instead.
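The navigation and mkdir commands above can be replayed as a small, self-contained sketch. It runs in a throwaway directory (via mktemp) rather than in data-shell, so the paths are stand-ins:

```shell
# Replay of pwd / mkdir / ls -F / cd in a scratch directory,
# so nothing in data-shell is touched.
cd "$(mktemp -d)"       # a fresh, empty scratch directory
mkdir thesis            # make a new directory (no spaces in the name)
ls -F                   # thesis/ : the trailing slash marks a directory
cd thesis
pwd                     # confirm where we are
cd ..                   # .. takes us one level back up
```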
$ ls -F thesis
$ cd thesis
$ nano draft.txt : nano is a text editor, and typing this opens the file in nano. At the bottom, shortcuts are listed. Ctrl+X: exit, Ctrl+O: write out (i.e. save)
$ touch my_file.txt : this is another way to create a txt file. The file is created, but you are not taken into nano.
$ ls -l : to see the file you just generated and its size
$ cd ~/Desktop/data-shell
$ pwd : to check the current directory
$ mv thesis/draft.txt thesis/quotes.txt : the first argument tells mv what we're 'moving', while the second is where it's to go. In this case, we're moving thesis/draft.txt to thesis/quotes.txt, which has the same effect as renaming the file.
$ mv thesis/quotes.txt . : to move the file from the directory it was in to the current working directory
$ cp quotes.txt thesis/quotations.txt : the cp command works very much like mv, except it copies a file instead of moving it.
$ ls quotes.txt thesis/quotations.txt : to check that mv did the right thing, using ls with two paths as arguments; like most Unix commands, ls can be given multiple paths at once
$ cp -r thesis thesis_backup : to copy a directory and all its contents, using the recursive option
$ rm quotes.txt : to remove quotes.txt
$ ls quotes.txt : to confirm quotes.txt is deleted
$ rm thesis : doesn't work, because rm by default only works on files, not directories
$ rm -r thesis
$ ls thesis : to confirm it is deleted
$ cd molecules : to go to molecules
$ ls : to check its contents
$ ls *.pdb : * is a wildcard, which matches zero or more characters

Exercise: When run in the molecules directory, which ls command(s) will produce this output?
ethane.pdb methane.pdb
1. ls *t*ane.pdb
2. ls *t?ne.*
3. ls *t??ne.pdb
4. ls ethane.*

Answers:
1. shows all files whose names contain zero or more characters (*) followed by the letter t, then zero or more characters (*) followed by ane.pdb. This gives ethane.pdb methane.pdb octane.pdb pentane.pdb.
2.
shows all files whose names start with zero or more characters (*) followed by the letter t, then a single character (?), then ne. followed by zero or more characters (*). This gives us octane.pdb and pentane.pdb, but doesn't match anything ending in thane.pdb.
3. fixes the problem of option 2 by matching two characters (??) between t and ne. This is the solution.
4. only shows files starting with ethane.

===== Pipes and Filters =====

$ wc *.pdb : shows the number of lines, words and characters per file: (1st column) number of lines, (2nd column) number of words, (3rd column) number of characters
$ wc -l *.pdb : only shows the number of lines
$ wc -l *.pdb > lengths.txt : redirects the output of "wc -l *.pdb" into the file lengths.txt
$ cat lengths.txt : shows the content of "lengths.txt", which was populated by the last command
$ sort -n lengths.txt : sorts the given file numerically
$ sort -n lengths.txt > sorted-lengths.txt : sorts the file and redirects the output into the file "sorted-lengths.txt"
$ head -n 1 sorted-lengths.txt : shows the first line of the given file
$ sort -n lengths.txt > lengths.txt : sorts the file and redirects the output to the file itself. * Please be careful whenever you redirect output to the same file, as you might overwrite critical data.
$ echo The echo command prints text : prints the sentence "The echo command prints text"
$ echo hello : prints hello
$ echo hello > testfile01.txt : prints hello, redirecting the output to the given file
$ echo hello >> testfile01.txt : prints hello, appending the output to the given file
Note: '>>' appends the output to the target file, which means it is written at the end of the file (it does not overwrite). In contrast, '>' redirects the output and overwrites the target file.
$ wc -l *.pdb > lengths.txt : as before, this counts the lines of all files ending with ".pdb" and redirects the output into the given file
$ cat lengths.txt : shows the content of the given file
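The difference between '>' and '>>' can be demonstrated with a scratch file; the filename here is made up for the demo, and the commands run in a throwaway directory:

```shell
# '>' truncates and overwrites the target file; '>>' appends to its end.
cd "$(mktemp -d)"          # scratch directory, nothing real is touched
echo hello  > testfile.txt # file now contains one line: hello
echo hello  > testfile.txt # '>' overwrites: still one line
echo again >> testfile.txt # '>>' appends: now two lines
wc -l testfile.txt         # reports 2 lines
```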
$ sort -n lengths.txt | head -n 1 : sorts the given file numerically and shows only the top line
$ wc -l *.pdb | sort -n : the same counts as before, but instead of redirecting the output to a file we pipe it into sort, so the sorted result just shows in the terminal
$ wc -l *.pdb | sort -n | head -n 1 : as above, additionally piped through head, so only the first line shows in the terminal

Go to the north-pacific-gyre/2012-07-03/ directory:
$ cd north-pacific-gyre/2012-07-03/
$ wc -l *.txt : shows the number of lines of all the files ending with ".txt"
$ wc -l *.txt | sort -n | head -n 5 : shows the 5 files with the fewest lines
$ wc -l *.txt | sort -n | tail -n 5 : shows the 5 files with the most lines
$ ls *Z.txt : shows the files ending with the pattern "Z.txt"
$ ls *[AB].txt : shows the files ending with "A.txt" or "B.txt"
Note: brackets are used as a placeholder for multiple specified characters, so [AB] means that this character can be either 'A' or 'B'.

===== Loops =====

Go to the creatures/ directory:
$ cd creatures/
$ head -n 5 basilisk.dat minotaur.dat unicorn.dat : shows the top 5 lines of each of the given files. So you can give multiple files as arguments.
$ for filename in basilisk.dat minotaur.dat unicorn.dat
If you execute this, bash gives you a new prompt: the command has not been executed yet, it is still waiting for the rest of the loop. The $ sign has changed to '>', which means bash is waiting for more input. "filename" is the name of the loop variable; it can be anything.
> do
This is required by the syntax and marks the start of the loop body.
> head -n 2 $filename | tail -n 1
This is the body of the loop. In each iteration, it prints the second line of the current file. Note that $filename refers to the loop variable that we defined previously.
> done
This is required by the syntax and marks the end of the loop body.
* Note: sed -n '2p' basilisk.dat minotaur.dat unicorn.dat achieves the same in one command. This is more advanced.
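The loop above can be sketched in a self-contained form; the .dat files here are generated stand-ins (two lines each), not the real creatures data:

```shell
# Print the second line of each .dat file, as in the loop above.
cd "$(mktemp -d)"                     # scratch directory
printf 'b1\nb2\n' > basilisk.dat      # fake two-line data files
printf 'u1\nu2\n' > unicorn.dat
for filename in *.dat
do
    head -n 2 $filename | tail -n 1   # second line of the current file
done
```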
Note: the variable "filename" can be called anything, for example "x". But it is best practice to give variables meaningful names, since it helps you understand the code when you come back to it later.

Go to the molecules folder:
$ for filename in c*
> do
> ls $filename
> done
Output: for each file whose name starts with "c", prints the name of the file.

Go to the north-pacific-gyre/2012-07-03/ directory:
$ for datafile in NENE*[AB].txt
> do
> echo $datafile
> done
Output: for each file matching the pattern "NENE*[AB].txt", prints the name of the file.
$ for datafile in NENE*[AB].txt
> do
> echo $datafile stats-$datafile
> done
Output: for each file matching the pattern "NENE*[AB].txt", prints the name of the file, followed by "stats-" and then the name of the file again. It is a good idea to first test your commands with "echo" to see what the output would be, and that's what we are doing here.
$ for datafile in NENE*[AB].txt; do bash goostats $datafile stats-$datafile; done
This is the syntax for writing loops in a single line; we separate the parts with semicolons. This is the same loop as before, but instead of printing, we are running the goostats program. The program unfortunately doesn't give any feedback in the terminal, so we don't know how far along it is. We will fix that in the next command. If you wish to, use Ctrl+C to stop a long-running command.
$ for datafile in NENE*[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done
Here, in the body of the loop, we print the filename and then run the goostats program on it, so we have a clue which file is being processed. After the command finishes, the results can be found in the stats-*.txt files.

Go to the molecules folder.
How to create a script file?
$ nano middle.sh : creates a new file named middle.sh and opens it in nano.
Inside the file, write the following line:
head -n 15 octane.pdb | tail -n 5 => prints lines 11 to 15 of the given file
Save the file and exit.
$ bash middle.sh : runs the script file which we created.
To make it more flexible, edit middle.sh and rewrite the command as:
head -n 15 "$1" | tail -n 5 => prints lines 11 to 15 of the argument that is supplied when the script is run. '$1' refers to the first positional argument of the script.
Now run the script while providing the argument, and it will run for that file:
$ bash middle.sh octane.pdb
$ bash middle.sh pentane.pdb
To make it even more flexible, edit middle.sh and rewrite the command as:
head -n "$2" "$1" | tail -n "$3" => prints the range of lines selected by the arguments supplied when the script is run. '$1', '$2' and '$3' refer to the first, second and third positional arguments respectively.
Now run the script, providing arguments for the filename and the head and tail numbers:
$ bash middle.sh pentane.pdb 15 5
$ bash middle.sh pentane.pdb 20 5
To make the code readable, edit middle.sh and add comment lines on top of the command:
# Select lines from the middle of a file
# Usage: bash middle.sh filename end_line num_lines
Create a new script named "sorted.sh" and start with the documentation:
$ nano sorted.sh
Inside:
# Sort files by their length
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
Note: "$@" means all arguments, however many there are. So instead of referring specifically to the first or second argument, here we accept any number of arguments.
Run the script:
$ bash sorted.sh *.pdb ../creatures/*.dat
Inside the script, "$@" will then contain all the files ending with ".pdb" in the current directory AND all the files ending with ".dat" in the creatures folder.
Go to the writing folder.
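The middle.sh idea can be tested end to end without the molecules data; this sketch generates both the script and a synthetic 20-line input file (octane.pdb is not assumed to be present):

```shell
# Generate a stand-in middle.sh and run it with positional arguments.
cd "$(mktemp -d)"
cat > middle.sh <<'EOF'
# Select lines from the middle of a file
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF
seq 1 20 > numbers.txt            # synthetic data: lines 1..20
bash middle.sh numbers.txt 15 5   # prints lines 11 to 15
```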
====================================
Programming with Python Lesson
====================================
IMPORTANT: From now onwards, whatever is written in this etherpad after a '#' is a comment, and whatever is written after a '$' is a command you should type in the terminal.
#How to run python
#There are 3 different ways:
# -running in your terminal: python
# -running in your terminal: ipython (interactive python)
#   IPython is the backend of Jupyter Notebooks
# -running Anaconda and launching Jupyter Notebooks (which you can also reach by launching JupyterLab, for example)
#Let's create a simple script which is going to print "hello people!" to the screen of your terminal:
$ nano hello_people.py # in nano write: print("hello people!") -> Ctrl+O -> Ctrl+X
$ cat hello_people.py #to see the contents of our python script
$ python hello_people.py
# hello_people.py is a file that can be opened in any text editor. On a Mac, for example, you can type in your terminal:
$ open -a textedit hello_people.py
#"open -a" tells your Mac to open the file "hello_people.py" using the application ("-a") "textedit".
# On Windows, open will not be recognized as a command. But you can set things up like this via your environment variables, by including the path to the executables of your programs in the $PATH of your system.
#Now, to run the hello_people.py script from your terminal you have to type:
$ python hello_people.py
#For now we are going to use Jupyter notebooks for our Python coding. As mentioned before, you can access Jupyter notebooks via Anaconda.
====================================
Here we will write some text for explanations. Comments on scripts from now onwards will be written after a # within the same line of code. Code will be written after a '>' sign.
To download Anaconda:
For Windows:
1. Open https://www.anaconda.com/download/#windows with your web browser.
2. Download the Python 3 installer for Windows.
3.
Install Python 3 using all of the defaults for installation, except make sure to check "Add Anaconda to my PATH environment variable".
For other OS, check: https://4turesearchdata-carpentries.github.io/2019-10-08-delft/

In Anaconda Navigator, launch Jupyter Notebook.
Jupyter is a web-based application, so it opens in a web browser. This does not mean you need internet access to work with Jupyter (or Anaconda); Jupyter just opens its interface in a web browser. Jupyter also allows you to do several things remotely via the internet (e.g., a Jupyter notebook can run on remote servers or call Microsoft Azure resources).
The first thing you will see when you launch Jupyter is essentially the contents of your current directory (basically the same as what you see from your terminal with $ ls).
To create a new notebook, go to the upper right and select 'New' --> Python 3. Here you are telling Jupyter that you want to create a new notebook in the current directory (wherever you are) and that its default language will be Python 3.
Once your notebook is created, change its title by clicking in the upper part (click on Untitled). In this case we will call this notebook: SWC_Day1 --> Rename.
You will then see one empty cell in your notebook. To run a cell (and also to create more cells) use the keys: Shift+Enter. The cell you are currently working on will be highlighted.
At the top of your notebook there is a dropdown menu in which you can choose the type of a cell: Code, Markdown, Raw NBConvert, Heading. If you want to write Python 3 code, the type of the cell has to be set to 'Code'. If you want to write documentation in a cell, change the type of the cell to 'Markdown'.
In Markdown, you can also include text in LaTeX by writing the text between '$' signs. Jupyter will then render whatever you put inside the '$' signs as LaTeX.
For example: $\begin{equation} y = f(x) + x_{min} \end{equation}$
Press Shift+Enter -> you see the equation.
If you set the type of a cell to 'Heading', Jupyter shows a message saying this cell type is no longer needed: to write headings in your documentation, use the '#' sign in a Markdown cell instead. Try it out! Set the type of a cell to Markdown and write the following:
# Variables in Python --> Shift+Enter
Then 'Variables in Python' is printed on the screen as a heading in your text.
Let's start writing some code. For that, always make sure that the type of the cells you are working with is 'Code'.
> 3 + 35 # press Shift+Enter to run the cell and you will see in the following line the output: 38
> weight_kg = 60
> print(weight_kg)
Variables in Python are case sensitive:
> Weight_kg = 90
> print(weight_kg, Weight_kg)
You cannot start the name of a variable with a digit, but you can put digits anywhere else in the name. See:
> 0weight_kg = 80 # you will get an error message
> we0t = 80 # no error
> weight_0 = 90 # no error
Python reports errors as a traceback. That means it shows you the whole chain of calls (even back-end calls that you do not see) and how the error propagated through all those different levels. Do not be afraid of such long tracebacks. The last line of the error explains what type of error it is (SyntaxError in the case of trying to name a variable starting with a digit) together with an explanation (invalid syntax in this case).
Python can work with integers and floating-point numbers. For very big numbers you can use scientific notation, such as 1e16 (or 1e+16). You can also work with strings (a string is a sequence of characters):
> arr = 'hjbknlmv;s'
> print(arr)
Today we are going to be working with a dataset. We are going to be learning Python as we use it for data analysis.
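The naming rules and value types above can be collected into one runnable cell (the variable names echo the examples; they are otherwise arbitrary):

```python
# Variable names are case-sensitive and may contain digits anywhere
# except the first character; 1e16 is scientific notation (a float).
weight_kg = 60
Weight_kg = 90               # a DIFFERENT variable from weight_kg
weight_0 = 90                # digit at the end: fine
big = 1e16                   # scientific notation gives a float
arr = 'hjbknlmv;s'           # a string: a sequence of characters
print(weight_kg, Weight_kg, big, arr)
```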
The data we will be working with are .csv files in which we have registered the number of inflammation episodes a given patient suffered on a given day of a certain clinical trial. Each inflammation file contains information for several patients and several days within one clinical trial.
In order to work with our inflammation data, we will be using the functions of a library called NumPy, which stands for Numerical Python. In general, you should use this library if you want to do fancy things with numbers, especially if you have matrices or arrays. We can import NumPy using:
> import numpy
You can find information about the NumPy library online at numpy.org, including the license, which states how to reuse it. You can also look at the glossary and see that there are a lot of functions and methods within NumPy. Whenever you use a function of NumPy you have to call it as numpy.name_function. And this is not just for NumPy but for every library you import: name_library.name_function.
Since you have to type the name of the library over and over again, you can also use nicknames for the libraries. Some of these nicknames are pretty standard in the community, so please do not come up with your own nicknames; just use the ones the community usually uses. For NumPy, the typical nickname is 'np'. To call numpy as np you just have to import it in the following way:
> import numpy as np
We are now going to tell Python to take a look at the data of one of our files (inflammation-01.csv, located in data/). For this we use the function loadtxt() within NumPy.
To see what this function does, you can use the help() function of Python:
> help(np.loadtxt)
> data = np.loadtxt('data/inflammation-01.csv', delimiter=',')
> print(data)
> print(type(data)) # the output tells us that data currently refers to an N-dimensional array
> print(data.dtype) # tells us the type of the NumPy array's elements
> dir(data) # to see the attributes of the 'data' object, which holds all the contents of our inflammation-01.csv file
> print(data.shape) # shows the dimensions of our 'data' object, which is essentially a 2-dimensional matrix of 60 rows and 40 columns
> data[0, 0] # to access a single value (the one in the first row and first column of our 'data' object)
> print('some data value within data:', data[30, 20])
EXTRA
> help(print)
> print('some data value within data:', data[30, 20], sep='\t') # the print() function by default separates its inputs by a single space. If instead we want a tab or a newline, we can change the parameter 'sep' to '\t' or '\n' respectively.
https://swcarpentry.github.io/python-novice-inflammation/01-numpy/index.html
Slicing data:
> print(data[0:4, 0:10])
The slice 0:4 means: "Start at index 0 and go up to, but not including, index 4." The up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
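The slice semantics can be checked on a small toy array, so this runs without the inflammation CSV files (the array here is a made-up stand-in for 'data'):

```python
# Upper bounds are excluded: 0:2 selects 2 rows, 0:3 selects 3 columns,
# and the slice length equals upper bound minus lower bound.
import numpy as np

toy = np.arange(20).reshape(4, 5)   # a 4x5 stand-in for 'data'
chunk = toy[0:2, 0:3]               # rows 0-1, columns 0-2
print(chunk.shape)                  # (2, 3)
print(toy[0, 0])                    # 0: first row, first column
```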
> print(data[:4, 35:])
Prints the first 4 rows and the last five columns (from column index 35 to however many columns data has).
> doubledata = data * 2.0
> print('data: \n')
> print(data[:4, 35:])
> print('doubledata: \n')
> print(doubledata[:4, 35:])
> tripledata = doubledata + data
> print('tripledata: \n')
> print(tripledata[:4, 35:])

------------------------------------------------------
EXTRA TIP - HOW TO GET RID OF YOUR DEFAULT TERMINAL PROMPT IN MAC
Edit the file ~/.bash_profile (or create it if it doesn't exist) by adding the following line:
export PS1="$ "
(source: https://mattmazur.com/2012/01/27/how-to-change-your-default-terminal-prompt-in-mac-os-x-lion/)
-------------------------------------------------------
EXTRA TIP - HOW TO SEE WHICH VARIABLES ARE STORED IN MEMORY USING JUPYTER
Write in a code cell: who
--------------------------------------------------------

Plotting with Python
For plotting, we are going to use the matplotlib library. More specifically, we will work with the functions within the pyplot module of the matplotlib library (thus: matplotlib.pyplot).
To see plots in your Jupyter notebook, sometimes you have to explicitly tell Jupyter to print the plots in the notebook. Sometimes you will see the plots in the notebook and sometimes not. Just to make sure you always see the plot there, add the following line in a code cell in the notebook:
%matplotlib inline
Now we import the pyplot module of the library. Here again we use the typical acronym to avoid typing 'matplotlib.pyplot' every time we want to use a function:
> import matplotlib.pyplot as plt
> plt.plot(np.max(data, axis=0)) # the max function of numpy is going to generate a 1-dimensional array with the maximum value found in every column of our object 'data'
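The effect of the axis argument in np.max can be checked on a toy 2x3 array (made up for this demo, not the real inflammation data):

```python
# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row).
import numpy as np

toy = np.array([[1, 2, 3],
                [4, 5, 6]])
col_max = np.max(toy, axis=0)   # [4 5 6] -> maximum per column
row_max = np.max(toy, axis=1)   # [3 6]   -> maximum per row
print(col_max, row_max, np.max(toy))   # np.max(toy) is the overall max, 6
```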
Important to mention here:
axis=0 --> operates along the rows (collapses them)
axis=1 --> operates along the columns (collapses them)
When using np.max(data, axis=0) you are asking NumPy to find the maximum value ACROSS all ROWS (axis=0), meaning it will find the maximum PER COLUMN. Setting axis=1 then returns the maximum value across all columns, meaning the maximum per row.
> plt.plot(np.max(data, axis=1))
> print(np.max(data, axis=0)) # you will see this is a one-dimensional array with the maximum number of inflammation episodes per day of the clinical trial (looking at all patients)
> dir(data) # to see attributes
> print(np.max(data)) # here we are not specifying axis, so np.max will just return the maximum number of inflammation episodes looking at all days and all patients
> print(data.max()) # here we obtain the same as before, but using the max() method of the object 'data'. Before, we applied the NumPy function max() (np.max()) to the object 'data'.
Let's plot the data then... first load the data into a 2-dimensional NumPy array (so basically a matrix with rows and columns):
> data = np.loadtxt('data/inflammation-01.csv', delimiter=',')
Create the frame of the figure (the canvas of the figure). The frame will have a width of 10 inches and a height of 3 inches:
> fig = plt.figure(figsize=(10.0, 3.0)) # plt.figure receives a tuple
Now the object 'fig' holds an empty frame. Let's say how we want to split this frame into different plots. For that we use add_subplot, which receives: i) the number of rows into which you want to split the frame; ii) the number of columns into which you want to split the frame; and iii) which subplot this is.
> axes1 = fig.add_subplot(1, 3, 1)
> axes2 = fig.add_subplot(1, 3, 2)
> axes3 = fig.add_subplot(1, 3, 3)
axes1 will be the object that (in the 'fig' frame split into 1 row and 3 columns) is located in the first subplot. The first one is the one on the left.
Thus, axes2 will refer to the subplot in the middle, and axes3 to the subplot on the right. Let's set some things, like the text on the y-axis of each plot (set_ylabel) and what is actually going to be plotted in each subplot (plot):
> axes1.set_ylabel('average')
> axes1.plot(np.mean(data, axis=0))
> axes2.set_ylabel('max')
> axes2.plot(np.max(data, axis=0))
> axes3.set_ylabel('min')
> axes3.plot(np.min(data, axis=0))
> fig.tight_layout() # this is to maximize the visualization of the plot
> plt.show() # to print the plot to the screen
Important information: a tuple is a collection which is immutable. In Python, tuples are written with round brackets (parentheses).
Rerun the same cell as before, but omitting fig.tight_layout() by putting # in front of it. You will see that the graphs are squeezed together more closely, and that it looked better when we used fig.tight_layout().
Let's go back to the cell where we produced the plot of the data within inflammation-01.csv. But now, let's also save that figure as a .jpeg file:
> data = np.loadtxt('data/inflammation-01.csv', delimiter=',')
> fig = plt.figure(figsize=(10.0, 3.0)) # plt.figure receives a tuple
> axes1 = fig.add_subplot(1, 3, 1)
> axes2 = fig.add_subplot(1, 3, 2)
> axes3 = fig.add_subplot(1, 3, 3)
> axes1.set_ylabel('average')
> axes1.plot(np.mean(data, axis=0))
> axes2.set_ylabel('max')
> axes2.plot(np.max(data, axis=0))
> axes3.set_ylabel('min')
> axes3.plot(np.min(data, axis=0))
> fig.tight_layout()
> plt.savefig('inflammation-01_plot.jpeg')
> plt.show()
Always put plt.savefig BEFORE plt.show. Why? Because plt.show() takes all the pending plots of your cell and dumps them to the screen, leaving the current figure of plt empty. So if you call plt.savefig afterwards, it will save an empty figure.
----------------------------------------------------------
EXTRA TIP - YOU CAN RUN SHELL COMMANDS IN JUPYTER CELLS
For this, putting an exclamation mark before the shell command does the trick:
> !ls -> to see the contents of the current directory
----------------------------------------------------------
Now, what if we want to plot the values of the object 'data' as an image? Meaning we color-code the value of each element of the matrix and create a plot out of it. For this we use the imshow() function within the pyplot module of the matplotlib library.
> plt.imshow? # in Jupyter, use a question mark to see the documentation of a given function
> image = plt.imshow(data) # we save the 2-dimensional image into an object that we call 'image'
> plt.colorbar() # to see which colors represent which values
If you do not like the color scheme, you can change it via the cmap parameter:
> image = plt.imshow(data, cmap='gray')
> plt.colorbar()
Or perhaps something more colorful!
> image = plt.imshow(data, cmap='inferno')
> plt.colorbar()
> plt.savefig('Data_map.jpeg') # let's save it!
Now, we have plotted the entire 'data' object as an image, and we have plotted the average, max and minimum number of inflammation episodes over all patients per day of the first clinical trial (so we have only looked at the information contained in the inflammation-01.csv file). What if we want to automatically plot the same for all the clinical trials (so for all inflammation-*.csv files we have in the data directory)? For that we can make use of FOR loops!
Repeating Actions with Loops
https://swcarpentry.github.io/python-novice-inflammation/02-loop/index.html
In order to see how for loops work in Python, let's start by taking a look at a string. Why? Because a string is a sequence of elements, and in for loops we need a sequence so that each cycle of the loop goes through its elements.
That is why FOR loops work with strings, but also with lists and NumPy arrays.
> word = 'lead' # word is a string. To define a string, enclose it between double quotes (or single quotes)
> print(type(word))
What if we want to print each character of this string to the screen? In principle we could do:
> print(word[0])
> print(word[1])
> print(word[2])
> print(word[3])
This is a bad approach for three reasons:
1. Not scalable. Imagine you need to print the characters of a string that is hundreds of letters long. It might be easier just to type them in manually.
2. Difficult to maintain. If we want to decorate each printed character with an asterisk or any other character, we would have to change four lines of code. While this might not be a problem for short strings, it would definitely be a problem for longer ones.
3. Fragile. If we use it with a word that has more characters than we initially envisioned, it will only display part of the word's characters. A shorter string, on the other hand, will cause an error because it will try to access part of the string that doesn't exist.
Here FOR loops can help to overcome all the previous issues! Let's create one:
> for char in word:
      print(char)
  print('this is outside')
Always consider indentation! Meaning those 4 spaces we typed before print(char). Indentation is very important in Python: it tells Python that whatever is indented is inside something (e.g., inside a FOR loop, inside an IF statement, etc.).
We said FOR loops work with strings, lists and NumPy arrays. Let's corroborate that:
> a = np.array([1, 2, 4]) # here we define 'a' as a 1-dimensional array; in this case a sequence of the integers 1, 2 and 4
> for num in a:
      print(num)
In each iteration, the variable 'num' will be one of the elements of the 'a' array.
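The two loops above can be combined into one self-contained sketch; instead of printing, it collects the results so they can be checked (the accumulator names are made up for the demo):

```python
# FOR loops iterate over any sequence: a string, a list, or a NumPy array.
import numpy as np

word = 'lead'
chars = []
for char in word:                # one character per iteration
    chars.append(char)

total = 0
for num in np.array([1, 2, 4]):  # one array element per iteration
    total += num
print(chars, total)              # ['l', 'e', 'a', 'd'] 7
```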
Now, what if 'a' is not a NumPy array, but a list:
> a = [1, 2, 4] # define lists between square brackets
> print(type(a))
> for num in a:
      print(num)
  print(a + a) # we print this outside the loop, to also see the difference between NumPy arrays and lists: compare what happens when you 'add' two lists versus what happens when you add two NumPy arrays
Now, let's go back to our plotting code. We said we want to plot the average, max and minimum number of inflammation episodes registered for all patients per day of NOT ONLY 1 clinical trial, but ALL clinical trials (so for all inflammation*.csv files). For that we can use the glob function within the glob library:
> import glob
The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character.
> glob.glob? # to see the documentation of the glob function in a Jupyter notebook
> glob.glob('data/inflammation*.csv') # this generates a list of all files whose names have the form inflammation*.csv in the data directory
We can see glob.glob gives the output in an arbitrary order, so we need to sort it:
> sorted(glob.glob('data/inflammation*.csv'))
Here is something important about Python, which we can illustrate with the call we make to sorted(glob.glob). We saw glob.glob() returns a list where each element of that list is a string matching 'data/inflammation*.csv'. With sorted() we can sort that list so that the first element is data/inflammation-01.csv, the second element is data/inflammation-02.csv, the third is data/inflammation-03.csv, etc. So technically we can access the elements of sorted(glob.glob('data/inflammation*.csv')) by indicating the index of the element we are interested in.
So for example if we want to see the first element of that list (index 0) we can do:
> print(sorted(glob.glob('data/inflammation*.csv'))[0])
There we have not saved the list to a variable (we have not saved the list in memory). But we can still see the first element by using '[0]'.
If we want to save the list to memory we would do:
> list_sorted = sorted(glob.glob('data/inflammation*.csv'))
So now list_sorted is the name of the list, and we can print its first element with:
> print(list_sorted[0])
Both print() calls above show the same output, but in the first one we have not saved the list to a variable. In the second we have saved it into a variable called list_sorted.
So....now that we have a list of the filenames, we can create the for loop to print all the plots for each one of the inflammation*.csv files we have in the data/ directory:
> filenames = sorted(glob.glob('data/inflammation*.csv'))
for f in filenames: #after the colon: indentation!
    print(f) #Be aware of consistent indentation!
    data = np.loadtxt(f, delimiter=',')
    fig = plt.figure(figsize=(10.0, 3.0))
    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)
    axes1.set_ylabel('average')
    axes1.plot(np.mean(data, axis=0))
    axes2.set_ylabel('max')
    axes2.plot(np.max(data, axis=0))
    axes3.set_ylabel('min')
    axes3.plot(np.min(data, axis=0))
    fig.tight_layout()
    plt.show() #show all the plots for one file

DAY 2
Kees den Heijer (instructor)
Yuki Murakami
Stefano Lovato
Samaneh Babayan
Sharef Neemat
Renfei Bu
Francesca Greco
Shahrzad Nikghadam
Boram Kim
Fatma ibis
Samprajani Rout
Nidarshan Kumar
Jeroen van Dijk
Davide
Suellen Pereira Espindola
Changrang Zhou
Xiaocong Lyu
Mingliang Chen
Yelena
Adrian Gonzalez
Tanveer Ahmad
Alina Colling
Lucia Beloqui Larumbe
Kiarsh Mansour Pour

Version control with Git
Git is useful for version control if you work with code and plain text (think of your software code and LaTeX).
There are different distributions of Git.
Question: GitHub vs GitLab
GitLab can be installed locally. TU Delft has some infrastructure for that. More to follow.
We will use Git Bash (Windows). Linux and Mac usually have a git interface available.
Configure git:
$ git config --global user.name "Your Name"
$ git config --global user.email "YourEmail"
These settings are done the first time you get started.
There are also graphical user interfaces for Git, but in this course we will use the command line.
It will be useful to understand how it works.
Windows:
$ git config --global core.autocrlf true
Mac:
$ git config --global core.autocrlf input
Check the configuration:
$ git config --list (use Q to exit to the normal terminal)
$ git config -h
$ git config --help (use Q to exit the help manuals)
Creating a git repository (in your Desktop folder):
$ cd ~/Desktop
$ mkdir planets
$ cd planets
$ git init
Make sure that it works:
$ ls
$ ls -a
$ git status
Use nano to create a text file:
$ nano mars.txt
Cold and dry, but everything is my favorite color
$ ls
$ git status
$ git add mars.txt
$ git status
$ git commit -m "Start notes on Mars as a base"
For a visualization of the process (the staging area) see https://swcarpentry.github.io/git-novice/04-changes/index.html
Show the project's history:
$ git log
$ nano mars.txt
The two moons may be a problem for Wolfman
$ git status
In case of accidental unwanted changes:
$ git checkout -- mars.txt
To show the exact differences between the files (e.g. .txt, .csv):
$ git diff
$ git commit -m "add concerns about facts of Mars' moons on Wolfman"
$ git status
Commit didn't happen because git add was not done.
The staging area was skipped.
$ git add mars.txt
$ git commit -m "add concerns about facts of Mars' moons on Wolfman"
$ nano mars.txt
But the Mummy will appreciate the lack of humidity
$ git diff
$ git add mars.txt
$ git diff
diff shows no differences because the file is already staged
$ git diff --staged
$ git commit -m "Discuss concerns about Mars' climate for Mummy"
$ git status
$ git log
To see the last log message(s) only:
$ git log -1
$ git log -2
To summarize changes in one line:
$ git log --oneline
git log gives a storyline of how your work was developed.
$ nano mars.txt
An ill-considered change
$ git diff HEAD mars.txt
You can see changes at different stages by using HEAD~
$ git diff HEAD~1 mars.txt
$ git log
copy the long number next to the commit
$ git status
$ git checkout HEAD mars.txt
$ cat mars.txt
$ git log --oneline
copy the short code next to the commit
$ git checkout PASTEYOURCODE mars.txt
$ cat mars.txt
$ git status
$ git checkout HEAD mars.txt
$ mkdir results
$ touch a.dat b.dat c.dat results/a.out results/b.out
$ git status
$ nano .gitignore
*.dat
results
$ git status
$ git add .gitignore
$ git commit -m "Ignore data files and the result folder"
$ git status
$ ls
$ git add a.dat
$ git status --ignored
To connect to a remote repository, we use GitHub. Go to github.com.
Create a remote repository:
- use the '+' at the top right corner and create the new repository 'planets' (you can't give two of your repositories the same name)
- make it private
- do not choose a README or .gitignore for the moment, to avoid conflicts with the ones created in the local repository
- for real work, always choose a license. The MIT license is recommended by TUD ICT, unless there is some other constraint from e.g. licensed libraries/codes, external parties, etc.
Connect the local repository to the remote repository:
$ git remote add origin https://github.com/YOURURL
Check the repository:
$ git remote -v
Push the local repository to the remote repository:
$ git push origin master
(better to use the same user email for your local git account and your GitHub account)
Check updates made by others on the remote repository (always pull to get the latest version before working on yours, if working in a collaborative project); it updates both your local repository and your working directory:
$ git pull origin master
Create a local repository based on a remote one:
$ pwd
$ cd .. (always clone it outside the repository you want to create)
$ git clone URL-of-the-remote-repository (take the URL from your neighbor and start collaborating, 'Clone or download')
$ cd cloned-folder
$ nano pluto.txt
$ git add pluto.txt
$ git commit -m 'Add notes about Pluto'
$ git push origin master
(On GitHub, go to Settings --> Collaborators, add your neighbor as your collaborator)
Propose changes you made to the admin of the remote repository in a collaboration by using a pull request.
Conflicts
$ nano mars.txt (add the line 'This line added to Wolfman's copy')
$ git add mars.txt
$ git commit -m "Add a line in our copy"
$ git push origin master
SWC lessons for more information:
Conflicts https://swcarpentry.github.io/git-novice/09-conflict/index.html
Open Science https://swcarpentry.github.io/git-novice/10-open/index.html
Slides: https://surfdrive.surf.nl/files/index.php/s/9rS7TbHhWYZCJuh
================
PYTHON Day 2
Goal: the final coding elements we need (conditionals and functions); write small programs and execute them from the command line to automate our analyses.
Plan: starting with functions to analyze dodgy data sets.
Open your Jupyter Notebook:
> open Anaconda and run the jupyter notebook application (this starts the Jupyter web server)
> move to your working folder (the swc-python folder)
> double click on your notebook to open it, or create a new one (up to you)
Don't forget to rename your notebook!
%matplotlib inline <-- put that early in your notebook. A Jupyter-only command stating that graphs will be shown inside the notebook!
import numpy as np #numerical library
import matplotlib.pyplot as plt #for the graphs
import glob #to manipulate files and file names
data = np.loadtxt("data/inflammation-01.csv", delimiter=",") #beware of your current path; point to your target file with a "relative path" if you can.
# the delimiter value depends on how your file is formatted. Here it is "comma separated", so the delimiter is the "," character
fig = plt.figure(figsize=(10.0, 3.0)) #create a frame of size 10.0 x 3.0 (in inches!)
axes1 = fig.add_subplot(1, 3, 1) #axes1 is a subplot of fig: fig divided in 1 row and 3 columns, and I take the FIRST one. axes1 is going to be that region.
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)
axes1.set_ylabel("average") #set properties for that specific subplot (note: set_ylabel is a function call, not an assignment)
axes1.plot(np.mean(data, axis=0)) #calculate the means across axis 0 (down the rows, giving one value per day), and plot that
axes2.set_ylabel("max")
axes2.plot(np.max(data, axis=0)) #the maxima across axis 0
axes3.set_ylabel("min")
axes3.plot(np.min(data, axis=0)) #the minima across axis 0
fig.tight_layout() #make it pretty
plt.savefig("data/inflammation-01.png") #if you want. This will create or update a file called "inflammation-01.png" containing the graphs
plt.show() # "plt", please display all the graphs you haven't shown so far!
Replace %matplotlib inline in the first line of the notebook with %matplotlib notebook #this will help you create nicer, interactive graphs!
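Since the axis=0 comments above are easy to misread, here is a small sketch with a synthetic "patients x days" table (made-up numbers, not the inflammation data) showing that axis=0 collapses the rows and yields one statistic per day:

```python
import numpy as np

# Tiny synthetic table: 3 patients (rows), 4 days (columns)
data = np.array([[0, 1, 2, 3],
                 [1, 1, 3, 5],
                 [2, 1, 4, 7]])

# axis=0 collapses the rows, giving one value per day (per column):
print(np.mean(data, axis=0))  # [1. 1. 3. 5.]
print(np.max(data, axis=0))   # [2 1 4 7]
print(np.min(data, axis=0))   # [0 1 2 3]
```

These per-day arrays are exactly what gets passed to axes1.plot, axes2.plot and axes3.plot above.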
axes1.set_ylabel("average") #set properties for that specific subplot
axes1.plot(np.mean(data, axis=0), '*', color=(1.0, 0.4, 0.3)) #we are modifying the marker and the color
axes2.set_ylabel("max")
axes2.plot(np.max(data, axis=0), '-o', color='blue') #the color is given in a different way than above
axes3.set_ylabel("min")
axes3.plot(np.min(data, axis=0), '-', color='g') #calling color a third different way
xlab_arr = "Days in trial"
axes1.set_xlabel(xlab_arr)
axes2.set_xlabel(xlab_arr)
axes3.set_xlabel(xlab_arr)
#the labels for the three graphs have been set by calling set_xlabel with the variable xlab_arr
Conditionals: if / elif / else (documentation in a cell of type "markdown")
num = 37
if num > 100:
    print("Number is greater than 100")
    print("\n") #this line is indented, just like the previous one. They are in the same "scope". They will both be executed if the condition is found to be TRUE.
else:
    print("Number is not greater than 100")
print("Done") #no indentation, so this is not bounded by the if statement
num = -3
if num > 0:
    print("Number is greater than 0")
elif num == 0:
    print("Number is 0")
else:
    print("Number is less than 0")
print("Done")
if "":
    print("empty string is true")
if "word":
    print("non-empty string is true")
Check which one is actually true!
np.max(data, axis=0) #returns a one-dimensional array: the max per day
np.max(data, axis=0)[0] #the first element of that array: the max number of inflammation episodes for day 0
print(np.max(data, axis=0)[0]) => 0.0
print(np.max(data, axis=0)[20]) => 20.0
if (np.max(data, axis=0)[0] == 0) and (np.max(data, axis=0)[20] == 20.0):
    print("suspicious data detected!")
elif np.sum(np.min(data, axis=0)) == 0:
    print("Minima add up to 0")
else:
    print("No anomalies detected so far")
Making functions!
def fahr_to_cel(temp): #"def" stands for "define" - a Python keyword
    return (temp - 32) * (5/9) #note: multiply by 5/9; dividing by (5/9) would scale the wrong way
Question: what does it do?
Functions do not HAVE to return something.
def my_function(): #this is valid
    print("a message!")
Defining a function does not "do" anything. It becomes useful when you call the function:
fahr_to_cel(32) #expected result 0
-------------------------
Jupyter is great to try out scripts. But the terminal is better. Use your Git Bash (if you can call python from there), or your Anaconda prompt if you couldn't. If you are not sure, stick a "Help me" post-it up!
"pwd" to check where you are.
What do you want to do: we want to write a program that allows us to do:
python readings.py --mean ../data/inflammation-01.csv
Getting started. (example: get the first 2 lines of the csv, use that as input for the python readings.py program, called with the --min option)
head -n 2 ../data/inflammation-01.csv | python readings.py --min > minimum_first2_inflammation-01.csv (get the min, write it down in a file)
Edit the first program "readings_01.py" (with nano, or TextEdit, or PyCharm, or your preferred editor)
import sys #import the system library - to manipulate files, arguments, internals of the operating system
sys.argv <-- list containing the various arguments passed through the command line to the program
python readings_02.py (to run the program)
Lines to add to set one function as the "main" function:
if __name__ == "__main__":
    main() #name of the function you want to call
python readings_03.py ../data/inflammation-01.csv ../data/inflammation-02.csv ....
^--- too long to type; try the following:
python readings_03.py ../data/inflammation*.csv
What if you don't have all the arguments you need?
for file_name in sys.argv[1:] (<-- take all the inputs, from index 1 to the end)
---> do something...
Calling the scripts with an "action" (--mean in this case):
python readings_04.py --mean ../data/small-01.csv (to try out!
always test on small examples)
python readings_04.py --mean ../data/*.csv
Assertions:
syntax: assert [test], message_if_not_true #by default, continue if all is well and good
assert action in ["--min", "--max", "--mean"], "action unavailable - exiting"
If something goes wrong, an "AssertionError" exception is raised, and the message you gave is printed as the exception message.
Use assertions to check inputs provided by other programs, files, or users. It's good to check: they may contain errors.
Loading our scripts as a library:
import name_of_your_script
then you can try:
help(name_of_your_script.function_in_your_script) ## i.e. help(readings_05.process) displays basic help
You can then use all the functions contained within the "name_of_your_script" file like you used functions from numpy!
Next step: what to do if you don't get the right input?
1st way: check the "len" of the list of file names:
if len(filenames) == 0: #note: len(a_list) returns the number of elements in the list
sys.stdin <-- a buffer containing information from the standard input (what comes from keystrokes in the command line, or what may have been "piped" to the script using " | ")
Use it if you want to read from the console, for instance, or to read input given AFTER the launch of the program. For parameters given at launch, use the arguments (sys.argv).
Think about how your program may be used as part of a piping sequence!
Changing root directories https://www.digitalcitizen.life/command-prompt-how-use-basic-commands
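The function-plus-assertion pattern above can be sketched in one runnable snippet. This is a simplified, in-memory stand-in for the readings_*.py scripts (the real scripts read filenames from sys.argv and data from CSV files; the process name echoes the readings_05.process example mentioned above):

```python
import numpy as np

def process(values, action):
    """Apply --min/--max/--mean to a 1-D sequence of numbers."""
    # Check the input before doing anything with it
    assert action in ["--min", "--max", "--mean"], "action unavailable - exiting"
    data = np.array(values)
    if action == "--min":
        return data.min()
    elif action == "--max":
        return data.max()
    else:
        return data.mean()

print(process([1, 2, 4, 5], "--mean"))  # 3.0
```

Calling process([1, 2, 4, 5], "--median") raises an AssertionError with the message "action unavailable - exiting", which is exactly the early-failure behavior we want when a user passes an unsupported action.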