
% The contents of this file are
% Copyright (c) 2009-  Charles R. Severance, All Rights Reserved

\chapter{Automating common tasks on your computer}
%\chapter{Automating common tasks on your computer}

We have been reading data from files, networks, services,
and databases.  Python can also go through all of the 
directories and folders on your computer and read those files
as well.
%We have been reading data from files, networks, services,
%and databases.   Python can also go through all of the 
%directories and folders on your computers and read those files
%as well.

In this chapter, we will write programs that scan through your computer and 
perform some operation on each file.  
Files are organized into directories (also called ``folders'').
Simple Python scripts can make short work of simple tasks that must be done to 
hundreds or thousands of files spread across a directory tree or your entire computer.
%In this chapter, we will write programs that scan 
%through your computer and 
%perform some operation on each file.  
%Files are organized into directories (also called ``folders'').
%Simple Python scripts
%can make short work of simple tasks that must be done to 
%hundreds or thousands of files
%spread across a directory tree or your entire computer.

To walk through all the directories and files in a tree we use 
{\tt os.walk} and a {\tt for} loop.  This is similar to how 
{\tt open} allows us to write a loop to read the contents of a file,
{\tt socket} allows us to write a loop to read the contents of a network connection, and
{\tt urllib} allows us to open a web document and loop through its contents.
%To walk through all the directories and files in a tree we use 
%{\tt os.walk} and a {\tt for} loop.  This is similar to how 
%{\tt open} allows us to write a loop to read the contents of a file,
%{\tt socket} allows us to write a loop to read the contents of a network connection, and
%{\tt urllib} allows us to open a web document and loop through its contents.

\section{File names and paths}
\label{paths}
%\section{File names and paths}
%\label{paths}

\index{file name}
\index{path}
\index{directory}
\index{folder}
%\index{file name}
%\index{path}
%\index{directory}
%\index{folder}

Every running program has a ``current directory,'' which is the
default directory for most operations.  For example, when you open a file
for reading, Python looks for it in the current directory.
%Every running program has a ``current directory,'' which is the
%default directory for most operations.  For example, when you open a file
%for reading, Python looks for it in the current directory.

\index{os module}
\index{module!os}
%\index{os module}
%\index{module!os}

The {\tt os} module provides functions for working with files and
directories ({\tt os} stands for ``operating system'').  {\tt os.getcwd}
returns the name of the current directory:
%The {\tt os} module provides functions for working with files and
%directories ({\tt os} stands for ``operating system'').  {\tt os.getcwd}
%returns the name of the current directory:

\index{getcwd function}
\index{function!getcwd}
%\index{getcwd function}
%\index{function!getcwd}

\beforeverb
\begin{verbatim}
>>> import os
>>> cwd = os.getcwd()
>>> print cwd
/Users/csev
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> import os
%>>> cwd = os.getcwd()
%>>> print cwd
%/Users/csev
%\end{verbatim}
%\afterverb
%
{\tt cwd} stands for {\bf current working directory}.  The result in
this example is {\tt /Users/csev}, which is the home directory of a
user named {\tt csev}.
%{\tt cwd} stands for {\bf current working directory}.  The result in
%this example is {\tt /Users/csev}, which is the home directory of a
%user named {\tt csev}.

\index{working directory}
\index{directory!working}
%\index{working directory}
%\index{directory!working}

A string like {\tt cwd} that identifies a file or directory is called a path.
A {\bf relative path} starts from the current directory;
an {\bf absolute path} starts from the topmost directory in the
file system.
%A string like {\tt cwd} that identifies a file is called a path.
%A {\bf relative path} starts from the current directory;
%an {\bf absolute path} starts from the topmost directory in the
%file system.

\index{relative path}
\index{path!relative}
\index{absolute path}
\index{path!absolute}
%\index{relative path}
%\index{path!relative}
%\index{absolute path}
%\index{path!absolute}

The paths we have seen so far are simple file names, so they are
relative to the current directory.  To find the absolute path to
a file, you can use {\tt os.path.abspath}:
%The paths we have seen so far are simple file names, so they are
%relative to the current directory.  To find the absolute path to
%a file, you can use {\tt os.path.abspath}:

\beforeverb
\begin{verbatim}
>>> os.path.abspath('memo.txt')
'/Users/csev/memo.txt'
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> os.path.abspath('memo.txt')
%'/Users/csev/memo.txt'
%\end{verbatim}
%\afterverb
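
As a small illustration of the difference between the two kinds of path,
the sketch below labels a path as relative or absolute with {\tt os.path.isabs}
and then expands it with {\tt os.path.abspath}.  The helper name
\verb"describe_path" is made up for this example, and the code uses
{\tt print()} calls so that it also runs under Python 3.  The directory
it prints depends on where you run it.

\beforeverb
\begin{verbatim}
import os

def describe_path(path):
    """Return a (kind, absolute_path) pair for a path string."""
    if os.path.isabs(path):
        kind = 'absolute'
    else:
        kind = 'relative'
    # abspath resolves a relative path against the current directory
    return kind, os.path.abspath(path)

kind, full = describe_path('memo.txt')
print('%s -> %s' % (kind, full))
\end{verbatim}
\afterverb

Note that {\tt os.path.abspath} only builds the path; it does not check
whether the file is actually there.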

{\tt os.path.exists} checks
whether a file or directory exists:
%{\tt os.path.exists} checks
%whether a file or directory exists:

\index{exists function}
\index{function!exists}
%\index{exists function}
%\index{function!exists}

\beforeverb
\begin{verbatim}
>>> os.path.exists('memo.txt')
True
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> os.path.exists('memo.txt')
%True
%\end{verbatim}
%\afterverb

If it exists, {\tt os.path.isdir} checks whether it's a directory:
%If it exists, {\tt os.path.isdir} checks whether it's a directory:

\beforeverb
\begin{verbatim}
>>> os.path.isdir('memo.txt')
False
>>> os.path.isdir('music')
True
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> os.path.isdir('memo.txt')
%False
%>>> os.path.isdir('music')
%True
%\end{verbatim}
%\afterverb

Similarly, {\tt os.path.isfile} checks whether it's a file.
%Similarly, {\tt os.path.isfile} checks whether it's a file.

{\tt os.listdir} returns a list of the files (and other directories)
in the given directory:
%{\tt os.listdir} returns a list of the files (and other directories)
%in the given directory:

\beforeverb
\begin{verbatim}
>>> os.listdir(cwd)
['music', 'photos', 'memo.txt']
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> os.listdir(cwd)
%['music', 'photos', 'memo.txt']
%\end{verbatim}
%\afterverb
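
The functions above can be combined.  The following sketch, which is not
part of the original programs, sorts the names returned by {\tt os.listdir}
into subdirectories and files.  The function name \verb"split_entries" and
the throwaway directory built with {\tt tempfile.mkdtemp} are assumptions
made only so the example is self-contained; it is written with {\tt print()}
calls so that it also runs under Python 3.

\beforeverb
\begin{verbatim}
import os
import tempfile

def split_entries(directory):
    """Return (subdirectories, files) found directly in directory."""
    subdirs, files = [], []
    for name in sorted(os.listdir(directory)):
        # listdir returns bare names, so rebuild the full path
        full = os.path.join(directory, name)
        if os.path.isdir(full):
            subdirs.append(name)
        elif os.path.isfile(full):
            files.append(name)
    return subdirs, files

# Build a small demo directory so nothing depends on your disk
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, 'music'))
os.mkdir(os.path.join(demo, 'photos'))
open(os.path.join(demo, 'memo.txt'), 'w').close()

print(split_entries(demo))
\end{verbatim}
\afterverb

For the demo directory this prints {\tt (['music', 'photos'], ['memo.txt'])}.
The call to {\tt os.path.join} matters: testing a bare name with
{\tt os.path.isdir} would look in the current directory rather than
in the directory being listed.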

\section{Example: Cleaning up a photo directory}
%\section{Example: Cleaning up a photo directory}

Some time ago, I built a bit of Flickr-like software that 
received photos from my cell phone and stored those photos
on my server.  I wrote this before Flickr existed and kept 
using it after Flickr existed because I wanted to keep original
copies of my images forever.
%Some time ago, I built a bit of Flickr-like software that 
%received photos from my cell phone and stored those photos
%on my server.  I wrote this before Flickr existed and kept 
%using it after Flickr existed because I wanted to keep original
%copies of my images forever.

I would also send a simple one-line text description in the MMS message
or the subject line of the email message.  I stored these messages
in a text file in the same directory as the image file.  I came up 
with a directory structure based on the month, year, day, and time the 
photo was taken.  The following is an example of the naming for 
one photo and its description:
%I would also send a simple one-line text description in the MMS message
%or the subject line of the email message.  I stored these messages
%in a text file in the same directory as the image file.   I came up 
%with a directory structure based on the month, year, day, and time the 
%photo was taken.   The following would be an example of the naming for 
%one photo and its existing description:

\beforeverb
\begin{verbatim}
./2006/03/24-03-06_2018002.jpg
./2006/03/24-03-06_2018002.txt
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%./2006/03/24-03-06_2018002.jpg
%./2006/03/24-03-06_2018002.txt
%\end{verbatim}
%\afterverb

After seven years, I had a lot of photos and captions.  Over the years,
as I switched cell phones, sometimes my code to extract the caption from the message 
would break and add a bunch of useless data on my server instead of a caption.
%After seven years, I had a lot of photos and captions.  Over the years
%as I switched cell phones, sometimes my code to extract the caption from the message 
%would break and add a bunch of useless data on my server instead of a caption.  

I wanted to go through these files and figure out which of the 
text files were really captions and which were junk and then delete the bad
files.  The first thing to do was to get a simple inventory of 
how many text files I had in one of the subfolders
using the following program:
%I wanted to go through these files and figure out which of the 
%text files were really captions and which were junk and then delete the bad
%files.  The first thing to do was to get a simple inventory of 
%how many text files I had in one the subfolders
%using the following program:

\beforeverb
\begin{verbatim}
import os
count = 0
for (dirname, dirs, files) in os.walk('.'):
   for filename in files:
       if filename.endswith('.txt') :
           count = count + 1
print 'Files:', count

python txtcount.py
Files: 1917
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import os
%count = 0
%for (dirname, dirs, files) in os.walk('.'):
%   for filename in files:
%       if filename.endswith('.txt') :
%           count = count + 1
%print 'Files:', count
%
%python txtcount.py
%Files: 1917
%\end{verbatim}
%\afterverb

The key bit of code that makes this possible is the {\tt os.walk}
function in Python.  When we call {\tt os.walk} and give it a starting
directory, it will ``walk'' through all of the directories 
and subdirectories recursively.  The string ``.'' indicates
to start in the current directory and walk downward.
As it encounters each directory, we get three values in a tuple in the
body of the {\tt for} loop.  The first value is the current
directory name, the second value is the list of subdirectories 
in the current directory, and the third value is a list of files
in the current directory.
%The key bit of code that makes this possible is the {\tt os.walk}
%library in Python.  When we call {\tt os.walk} and give it a starting
%directory, it will ``walk'' through all of the directories 
%and subdirectories recursively.   The string ``.'' indicates
%to start in the current directory and walk downward.
%As it encounters each directory, we get three values in a tuple in the
%body of the {\tt for} loop.  The first value is the current
%directory name, the second value is the list of subdirectories 
%in the current directory, and the third value is a list of files
%in the current directory.

We do not have to explicitly look into each of the subdirectories
because we can count on {\tt os.walk} to visit every 
folder eventually.  But we do want to look at each file, so 
we write a simple {\tt for} loop to examine each of the files 
in the current directory.  We check each file to see if 
it ends with ``.txt'' and then count the number of 
files through the whole directory tree that end with the
suffix ``.txt''.
%We do not have to explicitly look into each of the subdirectories
%because we can count on {\tt os.walk} to visit every 
%folder eventually.  But we do want to look at each file, so 
%we write a simple {\tt for} loop to examine each of the files 
%in the current directory.   We check each file to see if 
%it ends with ``.txt'' and then count the number of 
%files through the whole directory tree that end with the
%suffix ``.txt''.
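
To see the three values that {\tt os.walk} hands to the loop, it can help
to run it on a tiny tree whose contents are known.  The sketch below is an
illustration only: the helpers \verb"build_demo_tree" and
\verb"count_by_suffix" and the file names are invented for this example,
and the code uses {\tt print()} calls so that it also runs under Python 3.

\beforeverb
\begin{verbatim}
import os
import tempfile

def build_demo_tree():
    """Create a small temporary tree and return its top directory."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, '2006', '03'))
    for name in ('a.txt', 'a.jpg'):
        open(os.path.join(root, '2006', '03', name), 'w').close()
    open(os.path.join(root, 'notes.txt'), 'w').close()
    return root

def count_by_suffix(top, suffix):
    """Count files under top whose names end with suffix."""
    count = 0
    for (dirname, dirs, files) in os.walk(top):
        for filename in files:
            if filename.endswith(suffix):
                count = count + 1
    return count

root = build_demo_tree()
for (dirname, dirs, files) in os.walk(root):
    # One tuple per directory: its name, its subdirs, its files
    shown = os.path.relpath(dirname, root)
    print('%s %s %s' % (shown, sorted(dirs), sorted(files)))
print('Files: %d' % count_by_suffix(root, '.txt'))
\end{verbatim}
\afterverb

The loop prints one line per directory visited, and the final line reports
two ``.txt'' files even though they live at different depths of the tree.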

Once we have a sense of how many files end with ``.txt'', the next
thing to do is try to automatically
determine in Python which files are bad and which files
are good.  So we write a simple program to print out the
files and the size of each file:
%Once we have a sense of how many files end with ``.txt'', the next
%thing to do is try to automatically
%determine in Python which files are bad and which files
%are good.   So we write a simple program to print out the
%files and the size of each file:

\beforeverb
\begin{verbatim}
import os
for (dirname, dirs, files) in os.walk('.'):
   for filename in files:
       if filename.endswith('.txt') :
           thefile = os.path.join(dirname,filename)
           print os.path.getsize(thefile), thefile
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import os
%from os.path import join
%for (dirname, dirs, files) in os.walk('.'):
%   for filename in files:
%       if filename.endswith('.txt') :
%           thefile = os.path.join(dirname,filename)
%           print os.path.getsize(thefile), thefile
%\end{verbatim}
%\afterverb

Now instead of just counting the files, we create 
a file name by concatenating the directory name with
the name of the file within the directory using
{\tt os.path.join}.  It is important to use 
{\tt os.path.join} instead of string concatenation 
because on Windows we use a backslash
(\verb"\") to construct file paths and on Linux
or Apple we use a forward slash (\verb"/") 
to construct file paths.  The {\tt os.path.join} function
knows these differences and knows what system
we are running on, and it does the proper concatenation
depending on the system.  So the same Python code
runs on either Windows or Unix-style systems.
%Now instead of just counting the files, we create 
%a file name by concatenating the directory name with
%the name of the file within the directory using
%{\tt os.path.join}.   It is important to use 
%{\tt os.path.join} instead of string concatenation 
%because on Windows we use a backslash
%(\verb"\") to construct file paths and on Linux
%or Apple we use a forward slash (\verb"/") 
%to construct file paths.  The {\tt os.path.join}
%knows these differences and knows what system
%we are running on and it does the proper concatenation
%depending on the system.  So the same Python code
%runs on either Windows or Unix-style systems.
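
The difference between the two conventions can be shown on a single machine.
The standard library ships the Unix-style and Windows-style implementations
as the modules {\tt posixpath} and {\tt ntpath}, and {\tt os.path} is
whichever of the two matches the system Python is running on.  The short
sketch below, added here only as an illustration, joins the same three
pieces with each of them; it uses {\tt print()} calls so that it also runs
under Python 3.

\beforeverb
\begin{verbatim}
import os
import posixpath   # the os.path used on Linux and Apple systems
import ntpath      # the os.path used on Windows

parts = ['2006', '03', 'a.txt']

unix_style = posixpath.join(*parts)
windows_style = ntpath.join(*parts)
print(unix_style)
print(windows_style)

# os.path.join picks the separator of the current system
native = os.path.join(*parts)
print(native == os.sep.join(parts))
\end{verbatim}
\afterverb

The first two lines of output are {\tt 2006/03/a.txt} and
\verb"2006\03\a.txt", and the last line is {\tt True} on either kind of
system.  In ordinary programs you would simply call {\tt os.path.join}
and never import {\tt posixpath} or {\tt ntpath} directly.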

Once we have the full file name with directory
path, we use the {\tt os.path.getsize} utility
to get the size and print it out, producing the 
following output:
%Once we have the full file name with directory
%path, we use the {\tt os.path.getsize} utility
%to get the size and print it out, producing the 
%following output:

\beforeverb
\begin{verbatim}
python txtsize.py
...
18 ./2006/03/24-03-06_2303002.txt
22 ./2006/03/25-03-06_1340001.txt
22 ./2006/03/25-03-06_2034001.txt
...
2565 ./2005/09/28-09-05_1043004.txt
2565 ./2005/09/28-09-05_1141002.txt
...
2578 ./2006/03/27-03-06_1618001.txt
2578 ./2006/03/28-03-06_2109001.txt
2578 ./2006/03/29-03-06_1355001.txt
...
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python txtsize.py
%...
%18 ./2006/03/24-03-06_2303002.txt
%22 ./2006/03/25-03-06_1340001.txt
%22 ./2006/03/25-03-06_2034001.txt
%...
%2565 ./2005/09/28-09-05_1043004.txt
%2565 ./2005/09/28-09-05_1141002.txt
%...
%2578 ./2006/03/27-03-06_1618001.txt
%2578 ./2006/03/28-03-06_2109001.txt
%2578 ./2006/03/29-03-06_1355001.txt
%...
%\end{verbatim}
%\afterverb

Scanning the output, we notice that some files are pretty short and 
a lot of the files are pretty large and the same size (2578 and 2565 bytes). 
When we take a look at a few of these larger files by hand, 
it looks like the large 
files are nothing but a generic bit of identical HTML that came 
in from mail sent to my system from my T-Mobile phone:
%Scanning the output, we notice that some files are pretty short and 
%a lot of the files are pretty large and the same size (2578 and 2565). 
%When we take a look at a few of these larger files by hand, 
%it looks like the large 
%files are nothing but a generic bit of identical HTML that came 
%in from mail sent to my system from my T-Mobile phone:

\beforeverb
\begin{verbatim}
<html>
        <head>
                <title>T-Mobile</title>
...
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%<html>
%        <head>
%                <title>T-Mobile</title>
%...
%\end{verbatim}
%\afterverb

Skimming through the file, it looks like there is no good information
in these files so we can probably delete them.
%Skimming through the file, it looks like there is no good information
%in these files so we can probably delete them.

But before we delete the files, we will write a program to look for files
that are more than one line long and show the contents of the file.
We will not bother showing ourselves those files that are exactly
2578 or 2565 bytes long since we know that these files have no useful
information.
%But before we delete the files, we will write a program to look for files
%that are more than one line long and show the contents of the file.
%We will not bother showing ourselves those files that are exactly
%2578 or 2565 characters long since we know that these files have no useful
%information.

So we write the following program:
%So we write the following program:

\beforeverb
\begin{verbatim}
import os
for (dirname, dirs, files) in os.walk('.'):
   for filename in files:
       if filename.endswith('.txt') :
           thefile = os.path.join(dirname,filename)
           size = os.path.getsize(thefile)
           if size == 2578 or size == 2565:
               continue
           fhand = open(thefile,'r')
           lines = list()
           for line in fhand:
               lines.append(line)
           fhand.close()
           if len(lines) > 1:
                print len(lines), thefile
                print lines[:4]
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import os
%from os.path import join
%for (dirname, dirs, files) in os.walk('.'):
%   for filename in files:
%       if filename.endswith('.txt') :
%           thefile = os.path.join(dirname,filename)
%           size = os.path.getsize(thefile)
%           if size == 2578 or size == 2565:
%               continue
%           fhand = open(thefile,'r')
%           lines = list()
%           for line in fhand:
%               lines.append(line)
%           fhand.close()
%           if len(lines) > 1:
%                print len(lines), thefile
%                print lines[:4]
%\end{verbatim}
%\afterverb


We use a {\tt continue} to skip files with the two 
``bad sizes'', then open the rest of the files
and read the lines of the file into a Python list.
If the file has more than one line, we print
out how many lines are in the file and print out
its first few lines (at most four, since the code prints {\tt lines[:4]}).
%We use a {\tt continue} to skip files with the two 
%``bad sizes'', then open the rest of the files
%and read the lines of the file into a Python list
%and if the file has more than one line we print
%out how many lines are in the file and print out
%the first three lines.

It looks like filtering out those two bad file sizes, and assuming
that all one-line files are correct, we are down to some pretty clean
data:
%It looks like filtering out those two bad file sizes, and assuming
%that all one-line files are correct, we are down to some pretty clean
%data:

\beforeverb
\begin{verbatim}
python txtcheck.py 
3 ./2004/03/22-03-04_2015.txt
['Little horse rider\r\n', '\r\n', '\r']
2 ./2004/11/30-11-04_1834001.txt
['Testing 123.\n', '\n']
3 ./2007/09/15-09-07_074202_03.txt
['\r\n', '\r\n', 'Sent from my iPhone\r\n']
3 ./2007/09/19-09-07_124857_01.txt
['\r\n', '\r\n', 'Sent from my iPhone\r\n']
3 ./2007/09/20-09-07_115617_01.txt
...
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python txtcheck.py 
%3 ./2004/03/22-03-04_2015.txt
%['Little horse rider\r\n', '\r\n', '\r']
%2 ./2004/11/30-11-04_1834001.txt
%['Testing 123.\n', '\n']
%3 ./2007/09/15-09-07_074202_03.txt
%['\r\n', '\r\n', 'Sent from my iPhone\r\n']
%3 ./2007/09/19-09-07_124857_01.txt
%['\r\n', '\r\n', 'Sent from my iPhone\r\n']
%3 ./2007/09/20-09-07_115617_01.txt
%...
%\end{verbatim}
%\afterverb

But there is one more annoying pattern of files: 
there are some three-line files that consist of
two blank lines followed by a line that says
``Sent from my iPhone'' that have slipped 
into my data.  So we make the following change
to the program to deal with these files as well.
%But there is one more annoying pattern of files: 
%there are some three-line files that consist of
%two blank lines followed by a line that says
%``Sent from my iPhone'' that have slipped 
%into my data.   So we make the following change
%to the program to deal with these files as well.

\beforeverb
\begin{verbatim}
           lines = list()
           for line in fhand:
               lines.append(line)
           fhand.close()
           if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
               continue
           if len(lines) > 1:
                print len(lines), thefile
                print lines[:4]
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%           lines = list()
%           for line in fhand:
%               lines.append(line)
%           if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
%               continue
%           if len(lines) > 1:
%                print len(lines), thefile
%                print lines[:4]
%\end{verbatim}
%\afterverb

We simply check if we have a three-line file, and if the third 
line starts with the specified text, we skip it.
%We simply check if we have a three-line file, and if the third 
%line starts with the specified text, we skip it.
%
Now when we run the program, we only see four remaining 
multi-line files and all of those files look pretty reasonable:
%Now when we run the program, we only see four remaining 
%multi-line files and all of those files look pretty reasonable:

\beforeverb
\begin{verbatim}
python txtcheck2.py 
3 ./2004/03/22-03-04_2015.txt
['Little horse rider\r\n', '\r\n', '\r']
2 ./2004/11/30-11-04_1834001.txt
['Testing 123.\n', '\n']
2 ./2006/03/17-03-06_1806001.txt
['On the road again...\r\n', '\r\n']
2 ./2006/03/24-03-06_1740001.txt
['On the road again...\r\n', '\r\n']
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python txtcheck2.py 
%3 ./2004/03/22-03-04_2015.txt
%['Little horse rider\r\n', '\r\n', '\r']
%2 ./2004/11/30-11-04_1834001.txt
%['Testing 123.\n', '\n']
%2 ./2006/03/17-03-06_1806001.txt
%['On the road again...\r\n', '\r\n']
%2 ./2006/03/24-03-06_1740001.txt
%['On the road again...\r\n', '\r\n']
%\end{verbatim}
%\afterverb

If you look at the overall pattern of this program,
we have successively refined how we accept or reject
files and once we found a pattern that was ``bad'' we used
{\tt continue} to skip the bad files so we could refine
our code to find more file patterns that were bad.
%If you look at the overall pattern of this program,
%we have successively refined how we accept or reject
%files and once we found a pattern that was ``bad'' we used
%{\tt continue} to skip the bad files so we could refine
%our code to find more file patterns that were bad.
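
One way to keep such successively refined rules in a single place is to
gather them into a function that returns the reason a file looks bad, or
{\tt None} when it looks fine.  This is only a sketch of that idea, not
the program used in this chapter: the names {\tt classify},
\verb"BAD_SIZES", and {\tt SIGNATURE} are invented here, the two sizes and
the signature text are the ones found above, and the demo files are created
in a temporary directory.  It uses {\tt print()} calls so that it also runs
under Python 3.

\beforeverb
\begin{verbatim}
import os
import tempfile

BAD_SIZES = (2578, 2565)          # sizes of the generic HTML files
SIGNATURE = 'Sent from my iPhone'

def classify(path):
    """Return a reason string if the file looks bad, else None."""
    if os.path.getsize(path) in BAD_SIZES:
        return 'T-Mobile'
    fhand = open(path, 'r')
    lines = fhand.readlines()
    fhand.close()
    if len(lines) == 3 and lines[2].startswith(SIGNATURE):
        return 'iPhone'
    return None

# Three made-up files, one for each outcome
demo = tempfile.mkdtemp()
samples = {
    'html.txt': 'x' * 2578,
    'iphone.txt': '\n\nSent from my iPhone\n',
    'caption.txt': 'On the road again...\n',
}
for name in sorted(samples):
    fhand = open(os.path.join(demo, name), 'w')
    fhand.write(samples[name])
    fhand.close()
    print('%s %s' % (name, classify(os.path.join(demo, name))))
\end{verbatim}
\afterverb

Each new ``bad'' pattern then becomes one more test inside {\tt classify},
and the loop over {\tt os.walk} only has to decide what to do with the
reason it gets back.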

Now we are getting ready to delete the files, so 
we are going to flip the logic and instead of printing out 
the remaining good files, we will print out the ``bad''
files that we are about to delete.
%Now we are getting ready to delete the files, so 
%we are going to flip the logic and instead of printing out 
%the remaining good files, we will print out the ``bad''
%files that we are about to delete.

\beforeverb
\begin{verbatim}
import os
for (dirname, dirs, files) in os.walk('.'):
   for filename in files:
       if filename.endswith('.txt') :
           thefile = os.path.join(dirname,filename)
           size = os.path.getsize(thefile)
           if size == 2578 or size == 2565:
               print 'T-Mobile:',thefile
               continue
           fhand = open(thefile,'r')
           lines = list()
           for line in fhand:
               lines.append(line)
           fhand.close()
           if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
               print 'iPhone:', thefile
               continue
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import os
%from os.path import join
%for (dirname, dirs, files) in os.walk('.'):
%   for filename in files:
%       if filename.endswith('.txt') :
%           thefile = os.path.join(dirname,filename)
%           size = os.path.getsize(thefile)
%           if size == 2578 or size == 2565:
%               print 'T-Mobile:',thefile
%               continue
%           fhand = open(thefile,'r')
%           lines = list()
%           for line in fhand:
%               lines.append(line)
%           fhand.close()
%           if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
%               print 'iPhone:', thefile
%               continue
%\end{verbatim}
%\afterverb

We can now see a list of candidate files that we are about
to delete and why these files are up for deleting.
The program produces the following output:
%We can now see a list of candidate files that we are about
%to delete and why these files are up for deleting.
%The program produces the following output:

\beforeverb
\begin{verbatim}
python txtcheck3.py

...
T-Mobile: ./2006/05/31-05-06_1540001.txt
T-Mobile: ./2006/05/31-05-06_1648001.txt
iPhone: ./2007/09/15-09-07_074202_03.txt
iPhone: ./2007/09/15-09-07_144641_01.txt
iPhone: ./2007/09/19-09-07_124857_01.txt
...
\end{verbatim}
\afterverb

%\beforeverb
%\begin{verbatim}
%python txtcheck3.py
%...
%T-Mobile: ./2006/05/31-05-06_1540001.txt
%T-Mobile: ./2006/05/31-05-06_1648001.txt
%iPhone: ./2007/09/15-09-07_074202_03.txt
%iPhone: ./2007/09/15-09-07_144641_01.txt
%iPhone: ./2007/09/19-09-07_124857_01.txt
%...
%\end{verbatim}
%\afterverb

We can spot-check these files to make sure that we did not inadvertently
end up introducing a bug in our program or that our logic 
did not catch some files we did not want to catch.
Once we are satisfied that this is the list of files we want to delete,
we make the following change to the program:
%We can spot-check these files to make sure that we did not inadvertently
%end up introducing a bug in our program or perhaps our logic 
%caught some files we did not want to catch.
%Once we are satisfied that this is the list of files we want to delete,
%we make the following change to the program:

\beforeverb
\begin{verbatim}
           if size == 2578 or size == 2565:
               print 'T-Mobile:',thefile
               os.remove(thefile)
               continue
...
           if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
               print 'iPhone:', thefile
               os.remove(thefile)
               continue
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%           if size == 2578 or size == 2565:
%               print 'T-Mobile:',thefile
%               os.remove(thefile)
%               continue
%...
%           if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
%               print 'iPhone:', thefile
%               os.remove(thefile)
%               continue
%\end{verbatim}
%\afterverb

In this version of the program, we will both print the file out 
and remove the bad files
using {\tt os.remove}.
%In this version of the program, we will both print the file out 
%and remove the bad files
%using {\tt os.remove}.

\beforeverb
\begin{verbatim}
python txtdelete.py 
T-Mobile: ./2005/01/02-01-05_1356001.txt
T-Mobile: ./2005/01/02-01-05_1858001.txt
...
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python txtdelete.py 
%T-Mobile: ./2005/01/02-01-05_1356001.txt
%T-Mobile: ./2005/01/02-01-05_1858001.txt
%...
%\end{verbatim}
%\afterverb

Just for fun, run the program a second time and it will produce no output
since the bad files are already gone.
%Just for fun, run the program a second time and it will produce no output
%since the bad files are already gone.
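
The ``print first, delete later'' approach can also be written as a single
function with a switch.  The sketch below is a hypothetical variation, not
the author's program: \verb"remove_matching", its \verb"dry_run" parameter,
and the \verb"is_empty" test are names invented for this example, and the
files live in a temporary directory.  It uses {\tt print()} calls so that
it also runs under Python 3.

\beforeverb
\begin{verbatim}
import os
import tempfile

def remove_matching(top, is_bad, dry_run=True):
    """Return the bad files under top; delete them unless dry_run."""
    matched = []
    for (dirname, dirs, files) in os.walk(top):
        for filename in files:
            thefile = os.path.join(dirname, filename)
            if is_bad(thefile):
                matched.append(thefile)
                if not dry_run:
                    os.remove(thefile)
    return matched

def is_empty(path):
    # Stand-in rule for this demo: an empty file counts as bad
    return os.path.getsize(path) == 0

demo = tempfile.mkdtemp()
junk = os.path.join(demo, 'junk.txt')
keep = os.path.join(demo, 'keep.txt')
open(junk, 'w').close()
fhand = open(keep, 'w')
fhand.write('On the road again...\n')
fhand.close()

print(len(remove_matching(demo, is_empty)))          # only reports
print(len(remove_matching(demo, is_empty, False)))   # really deletes
print(len(remove_matching(demo, is_empty, False)))   # nothing left
\end{verbatim}
\afterverb

The three calls print 1, 1, and 0: the dry run finds the bad file without
touching it, the second call removes it, and the third finds nothing,
just as the second run of the deletion program produces no output.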

If we rerun {\tt txtcount.py} we can see that we have removed
899 bad files:
%If we rerun {\tt txtcount.py} we can see that we have removed
%899 bad files:

\beforeverb
\begin{verbatim}
python txtcount.py 
Files: 1018
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python txtcount.py 
%Files: 1018
%\end{verbatim}
%\afterverb

In this section, we have followed a sequence where we use Python 
to first look through directories and files seeking
patterns.  We slowly use Python to help determine what we 
want to do to clean up our directories.  Once we
figure out which files are good and which files are 
not useful, we use Python to delete the files and 
perform the cleanup.
%In this section, we have followed a sequence where we use Python 
%to first look through directories and files seeking
%patterns.  We slowly use Python to help determine what we 
%want to do to clean up our directories.  Once we
%figure out which files are good and which files are 
%not useful, we use Python to delete the files and 
%perform the cleanup.

The problem you may need to solve can either be quite simple 
and might only depend on looking at the names of files,
or perhaps you need to read every single file and
look for patterns within the files.  Sometimes 
you will need to read all the files and make a change 
to some of the files.  All of these are pretty 
straightforward once you understand how {\tt os.walk}
and the other {\tt os} utilities can be used.
%The problem you may need to solve can either be quite simple 
%and might only depend on looking at the names of files,
%or perhaps you need to read every single file and
%look for patterns within the files.  Sometimes 
%you will need to read all the files and make a change 
%to some of the files.  All of these are pretty 
%straightforward once you understand how {\tt os.walk}
%and the other {\tt os} utilities can be used.
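
As an example of a task that depends only on file names, the following
sketch tallies the files in a tree by extension using {\tt os.walk} and
{\tt os.path.splitext}.  The function name \verb"count_extensions" and the
demo file names are assumptions made for this illustration, and the code
uses {\tt print()} calls so that it also runs under Python 3.

\beforeverb
\begin{verbatim}
import os
import tempfile

def count_extensions(top):
    """Return a dictionary mapping each extension to a file count."""
    counts = dict()
    for (dirname, dirs, files) in os.walk(top):
        for filename in files:
            # splitext('a.TXT') gives ('a', '.TXT'); lower() merges cases
            ext = os.path.splitext(filename)[1].lower()
            counts[ext] = counts.get(ext, 0) + 1
    return counts

demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, 'sub'))
for name in ('a.txt', 'c.jpg', os.path.join('sub', 'b.TXT')):
    open(os.path.join(demo, name), 'w').close()

counts = count_extensions(demo)
for ext in sorted(counts):
    print('%s %d' % (ext, counts[ext]))
\end{verbatim}
\afterverb

For the demo tree this prints {\tt .jpg 1} followed by {\tt .txt 2}.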

\section{Command-line arguments}
%\section{Command-line arguments}

\index{arguments}
%\index{arguments}

In earlier chapters, we had a number of programs that prompted
for a file name using \verb"raw_input" and then read data 
from the file and processed the data as follows:
%In earlier chapters, we had a number of programs that prompted
%for a file name using \verb"raw_input" and then read data 
%from the file and processed the data as follows:

\beforeverb
\begin{verbatim}
nome = raw_input('Enter file: ')
handle = open(nome, 'r')
texto = handle.read()
...
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%name = raw_input('Enter file:')
%handle = open(name, 'r')
%text = handle.read()
%...
%\end{verbatim}
%\afterverb

We can simplify this program a bit by taking the file name
from the command line when we start Python.  Up to now,
we simply ran our Python programs and responded to the 
prompts as follows:
%We can simplify this program a bit by taking the file name
%from the command line when we start Python.  Up to now,
%we simply run our Python programs and respond to the 
%prompts as follows:

\beforeverb
\begin{verbatim}
python words.py
Enter file: mbox-short.txt
...
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python words.py
%Enter file: mbox-short.txt
%...
%\end{verbatim}
%\afterverb

We can place additional strings after the Python file name on the command
line and access those {\bf command-line arguments} in our Python program.
Here is a simple program 
that demonstrates reading arguments from the command line:
%We can place additional strings after the Python file and access
%those {\bf command-line arguments} in our Python program.  Here is a simple program 
%that demonstrates reading arguments from the command line:

\beforeverb
\begin{verbatim}
import sys
print 'Contagem:', len(sys.argv)
print 'Tipo:', type(sys.argv)
for arg in sys.argv:
   print 'Argumento:', arg
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import sys
%print 'Count:', len(sys.argv)
%print 'Type:', type(sys.argv)
%for arg in sys.argv:
%   print 'Argument:', arg
%\end{verbatim}
%\afterverb

The contents of {\tt sys.argv} are a list of strings where the first string
is the name of the Python program and the remaining strings are the arguments
on the command line after the Python file name.
%The contents of {\tt sys.argv} are a list of strings where the first string
%is the name of the Python program and the remaining strings are the arguments
%on the command line after the Python file.

The following shows our program reading several command-line arguments:
%The following shows our program reading several command-line arguments from the command
%line:

\beforeverb
\begin{verbatim}
python argtest.py ola alguem
Contagem: 3
Tipo: <type 'list'>
Argumento: argtest.py
Argumento: ola
Argumento: alguem
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python argtest.py hello there
%Count: 3
%Type: <type 'list'>
%Argument: argtest.py
%Argument: hello
%Argument: there
%\end{verbatim}
%\afterverb

Three arguments are passed into our program as a three-element list.
The first element of the list is the file name ({\tt argtest.py}) and the others are 
the two command-line arguments after the file name.
%There are three arguments are passed into our program as a three-element list.  
%The first element of the list is the file name (argtest.py) and the others are 
%the two command-line arguments after the file name.

We can rewrite our program to read the file, taking the file name 
from the command-line argument as follows:
%We can rewrite our program to read the file, taking the file name 
%from the command-line argument as follows:

\beforeverb
\begin{verbatim}
import sys

name = sys.argv[1]
handle = open(name, 'r')
text = handle.read()
print name, 'is', len(text), 'bytes'
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import sys
%
%name = sys.argv[1]
%handle = open(name, 'r')
%text = handle.read()
%print name, 'is', len(text), 'bytes'
%\end{verbatim}
%\afterverb

We take the entry at index {\tt [1]}, the first argument after the program name,
as the name of the file (skipping past the program name in the {\tt [0]} entry).
We open the file and read its contents; running the program produces the following:
%We take the second command-line argument as the name of the file (skipping past
%the program name in the {\tt [0]} entry).  We open the file and read 
%the contents as follows:

\beforeverb
\begin{verbatim}
python argfile.py mbox-short.txt
mbox-short.txt is 94626 bytes
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%python argfile.py mbox-short.txt
%mbox-short.txt is 94626 bytes
%\end{verbatim}
%\afterverb

Using command-line arguments as input can make it easier to reuse your Python programs, 
especially when you only need to input one or two strings.
%Using command-line arguments as input can make it easier to reuse your Python programs, 
%especially when you only need to input one or two strings.
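
As a minimal sketch of a slightly more defensive version, the function below
checks that a file name was actually supplied before trying to open it. It is
written in Python 3 syntax (where {\tt print} is a function), unlike the
Python 2 listings above, and the function name \verb"describe_file" and its
messages are our own illustrative choices, not part of the original program:

\beforeverb
\begin{verbatim}
import sys

def describe_file(argv):
    """Return a one-line report for the file named in argv[1]."""
    if len(argv) < 2:
        # argv[0] is the program name, so fewer than two
        # entries means no file name was given
        return 'Usage: python ' + argv[0] + ' filename'
    name = argv[1]
    try:
        with open(name, 'r') as handle:
            text = handle.read()
    except IOError:
        return 'Could not open ' + name
    return name + ' is ' + str(len(text)) + ' characters'

if __name__ == '__main__':
    print(describe_file(sys.argv))
\end{verbatim}
\afterverb

Run with no arguments, this sketch prints the usage line instead of failing
with an {\tt IndexError} on {\tt sys.argv[1]}.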

\section{Pipes}
%\section{Pipes}

\index{shell}
\index{pipe}
%\index{shell}
%\index{pipe}

Most operating systems provide a command-line interface,
also known as a {\bf shell}. Shells usually provide commands
to navigate the file system and launch applications. For
example, in Unix, you can change directories with {\tt cd},
display the contents of a directory with {\tt ls}, and launch
a web browser by typing (for example) {\tt firefox}.
%Most operating systems provide a command-line interface,
%also known as a {\bf shell}.  Shells usually provide commands
%to navigate the file system and launch applications.  For
%example, in Unix, you can change directories with {\tt cd},
%display the contents of a directory with {\tt ls}, and launch
%a web browser by typing (for example) {\tt firefox}.

\index{ls (Unix command)}
\index{Unix command!ls}
%\index{ls (Unix command)}
%\index{Unix command!ls}

Any program that you can launch from the shell can also be
launched from Python using a {\bf pipe}. A pipe is an object
that represents a running process.
%Any program that you can launch from the shell can also be
%launched from Python using a {\bf pipe}.  A pipe is an object
%that represents a running process.

For example, the Unix command\footnote{When using pipes to talk 
to operating system commands like {\tt ls}, it is important 
for you to know which operating system you are using and to
open pipes only to commands that are supported on your operating system.}
{\tt ls -l} normally displays the
contents of the current directory (in long format). You can
launch {\tt ls} with {\tt os.popen}:
%For example, the Unix command\footnote{When using pipes to talk 
%to operating system commands like {\tt ls}, it is important 
%for you to know which operating system you are using and only
%open pipes to commands that are supported on your operating system.}
%{\tt ls -l} normally displays the
%contents of the current directory (in long format).  You can
%launch {\tt ls} with {\tt os.popen}:

\index{popen function}
\index{function!popen}
%\index{popen function}
%\index{function!popen}

\beforeverb
\begin{verbatim}
>>> import os
>>> cmd = 'ls -l'
>>> fp = os.popen(cmd)
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> cmd = 'ls -l'
%>>> fp = os.popen(cmd)
%\end{verbatim}
%\afterverb

The argument is a string that contains a shell command. The
return value is a file pointer that behaves just like an open
file. You can read the output from the {\tt ls} process one
line at a time with {\tt readline} or get the whole thing at
once with {\tt read}:
%The argument is a string that contains a shell command.  The
%return value is a file pointer that behaves just like an open
%file.  You can read the output from the {\tt ls} process one
%line at a time with {\tt readline} or get the whole thing at
%once with {\tt read}:

\index{readline method}
\index{method!readline}
\index{read method}
\index{method!read}
%\index{readline method}
%\index{method!readline}
%\index{read method}
%\index{method!read}

\beforeverb
\begin{verbatim}
>>> res = fp.read()
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> res = fp.read()
%\end{verbatim}
%\afterverb

When you are done, you close the pipe like a file:
%When you are done, you close the pipe like a file:

\index{close method}
\index{method!close}
%\index{close method}
%\index{method!close}

\beforeverb
\begin{verbatim}
>>> stat = fp.close()
>>> print stat
None
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%>>> stat = fp.close()
%>>> print stat
%None
%\end{verbatim}
%\afterverb

The return value is the final status of the {\tt ls} process;
{\tt None} means that it ended normally (with no errors).
%The return value is the final status of the {\tt ls} process;
%{\tt None} means that it ended normally (with no errors).
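
The steps above can be combined into one small function. The following is
only a sketch, written in Python 3 syntax and assuming a shell that provides
{\tt echo} (which we believe holds for common Unix and Windows shells); the
name \verb"run_command" is hypothetical:

\beforeverb
\begin{verbatim}
import os

def run_command(cmd):
    """Run cmd through a pipe; return (output lines, status)."""
    fp = os.popen(cmd)
    lines = []
    for line in fp:       # a pipe can be looped over like a file
        lines.append(line.rstrip())
    status = fp.close()   # None means the command ended normally
    return lines, status

if __name__ == '__main__':
    lines, status = run_command('echo hello')
    print(lines, status)
\end{verbatim}
\afterverb

Looping over the pipe reads one line at a time, much like repeated calls
to {\tt readline}. If the command fails, {\tt close} returns a value other
than {\tt None}.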

\section{Glossary}
%\section{Glossary}

\begin{description}
%\begin{description}

\item[absolute path:] A string that describes where a file or
directory is stored, starting at the ``top of the tree of directories'',
so that it can be used to access the file or directory regardless
of the current working directory.
\index{path!absolute}
%\item[absolute path:] A string that describes where a file or
%directory is stored that starts at the ``top of the tree of directories''
%so that it can be used to access the file or directory, regardless
%of the current working directory.
%\index{path!absolute}

\item[checksum:] See also {\bf hashing}. The term ``checksum'' 
comes from the need to verify whether data was garbled as it was 
sent across a network or written to a backup medium and then
read back in. When the data is written or sent, the sending system
computes a checksum and also sends the checksum. When the 
data is read or received, the receiving system recomputes
the checksum from the received data and compares it to the 
received checksum. If the checksums do not match, we must
assume that the data was garbled as it was transferred.
\index{checksum}
%\item[checksum:] See also {\bf hashing}.  The term ``checksum'' 
%comes from the need to verify if data was garbled as it was 
%sent across a network or written to a backup medium and then
%read back in.  When the data is written or sent, the sending system
%computes a checksum and also sends the checksum.  When the 
%data is read or received, the receiving system re-computes
%the checksum from the received data and compares it to the 
%received checksum.  If the checksums do not match, we must
%assume that the data was garbled as it was transferred.
%\index{checksum}

\item[command-line argument:] Parameters on the command line after the Python file name.
%\item[command-line argument:] Parameters on the command line after the Python file name.

\item[current working directory:] The current directory that you 
are ``in''. You can change your working directory using the 
{\tt cd} command in the command-line interface of most systems.
When you open a file in Python using just the file name with no path 
information, the file must be in the current working directory
where you are running the program.
\index{directory!current}
\index{directory!working}
\index{directory!cwd}
%\item[current working directory:] The current directory that you 
%are ``in''.  You can change your working directory using the 
%{\tt cd} command on most systems in their command-line interfaces.
%When you open a file in Python using just the file name with no path 
%information, the file must be in the current working directory
%where you are running the program.
%\index{directory!current}
%\index{directory!working}
%\index{directory!cwd}

\item[hashing:] Reading through a potentially large amount of data
and producing a checksum for the data. The best hash functions
produce very few ``collisions'', where two different
streams of data given to the hash function produce the same hash. 
MD5, SHA1, and SHA256 are examples of commonly used hash functions.
\index{hashing}
%\item[hashing:] Reading through a potentially large amount of data
%and producing a unique checksum for the data.  The best hash functions
%produce very few ``collisions'' where you can give two different
%streams of data to the hash function and get back the same hash. 
%MD5, SHA1, and SHA256 are examples of commonly used hash functions.
%\index{hashing}

\item[pipe:] A pipe is a connection to a running program. Using
a pipe, you can write a program to send data to another program
or receive data from that program. A pipe is similar to a 
{\bf socket} except that a pipe can only be used to 
connect programs running on the same computer (i.e., not
across a network).
\index{pipe}
%\item[pipe:] A pipe is a connection to a running program.  Using
%a pipe, you can write a program to send data to another program
%or receive data from that program.  A pipe is similar to a 
%{\bf socket} except that a pipe can only be used to 
%connect programs running on the same computer (i.e., not
%across a network).
%\index{pipe}

\item[relative path:] A string that describes where a file or
directory is stored relative to the current working directory.
\index{path!relative}
%\item[relative path:] A string that describes where a file or
%directory is stored relative to the current working 
%directory.
%\index{path!relative}

\item[shell:] A command-line interface to an operating system.
Also called a ``terminal program'' in some systems. In this interface
you type a command and its parameters on a single line and press ``enter''
to execute the command.
\index{shell}
%\item[shell:] A command-line interface to an operating system.
%Also called a ``terminal program'' in some systems. In this interface
%you type a command and parameters on a line and press ``enter''
%to execute the command.
%\index{shell}

\item[walk:] A term we use to describe the notion of visiting
an entire tree of directories and sub-directories until we have
visited all of them. We call this ``walking the directory tree''.
\index{walk}
%\item[walk:] A term we use to describe the notion of visiting
%the entire tree of directories, sub-directories, sub-sub-directories, 
%until we have visited the all of the directories.  We call this
%``walking the directory tree''.
%\index{walk}

\end{description}
%\end{description}

\section{Exercises}
%\section{Exercises}

\begin{ex}
%\begin{ex}

\label{checksum}
%\label{checksum}

\index{MP3}
%\index{MP3}

In a large collection of MP3 files there may be more than one
copy of the same song, stored in different directories or with
different file names. The goal of this exercise is to search for
these duplicates.
%In a large collection of MP3 files there may be more than one
%copy of the same song, stored in different directories or with
%different file names.  The goal of this exercise is to search for
%these duplicates.
%
\begin{enumerate}
%\begin{enumerate}

\item Write a program that walks a directory and all of its
subdirectories looking for all files with a given suffix (like {\tt .mp3})
and lists pairs of files that are the same size.
Hint: Use a dictionary where the key is the size
of the file from {\tt os.path.getsize} and the value is the
path name concatenated with the file name.
As you encounter each file, check to see if you already have a
file that has the same size as the current file. If so, you have a
duplicate-size file, so print out the file size and the two file names 
(one from the dictionary and the other the file you are looking at).
%\item Write a program that walks a directory and all of its
%subdirectories for all files with a given suffix (like {\tt .mp3})
%and lists pairs of files with that are the same size.
%Hint: Use a dictionary where the key of the dictionary is the size
%of the file from {\tt  os.path.getsize} and the value in the 
%dictionary is the path name concatenated with the file name.  
%As you encounter each file, check to see if you already have a
%file that has the same size as the current file.  If so, you have a
%duplicate size file, so print out the file size and the two file names 
%(one from the hash and the other file you are looking at).

\index{duplicate}
\index{MD5 algorithm}
\index{algorithm!MD5}
\index{checksum}
%\index{duplicate}
%\index{MD5 algorithm}
%\index{algorithm!MD5}
%\index{checksum}

\item Adapt the previous program to look for files that 
have duplicate content using a hashing or {\bf checksum}
algorithm. For example,
MD5 (Message-Digest algorithm 5) takes an arbitrarily long
``message'' and returns a 128-bit ``checksum''. The probability
is very small that two files with different contents will
return the same checksum.
%\item Adapt the previous program to look for files that 
%have duplicate content using a hashing or {\bf checksum}
%algorithm.  For example,
%MD5 (Message-Digest algorithm 5) takes an arbitrarily-long
%``message'' and returns a 128-bit ``checksum''.  The probability
%is very small that two files with different contents will
%return the same checksum.

You can read about MD5 at wikipedia.org/wiki/Md5. The 
following code snippet opens a file, reads it, and computes
its checksum.
%You can read about MD5 at \url{wikipedia.org/wiki/Md5}.  The 
%following code snippet opens a file, reads it, and computes
%its checksum.

\beforeverb
\begin{verbatim}
import hashlib 
...
           fhand = open(thefile,'rb')
           data = fhand.read()
           fhand.close()
           checksum = hashlib.md5(data).hexdigest()
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%import hashlib 
%...
%           fhand = open(thefile,'r')
%           data = fhand.read()
%           fhand.close()
%           checksum = hashlib.md5(data).hexdigest()
%\end{verbatim}
%\afterverb

You should create a dictionary where the checksum is the key 
and the file name is the value. When you compute a checksum
and it is already in the dictionary as a key, you have two files with 
duplicate content, so print out the file in the dictionary
and the file you just read. Here is some sample output
from a run in a folder of image files:
%You should create a dictionary where the checksum is the key 
%and the file name is the value.   When you compute a checksum
%and it is already in the dictionary as a key, you have two files with 
%duplicate content, so print out the file in the dictionary
%and the file you just read.  Here is some sample output
%from a run in a folder of image files:

\beforeverb
\begin{verbatim}
./2004/11/15-11-04_0923001.jpg ./2004/11/15-11-04_1016001.jpg
./2005/06/28-06-05_1500001.jpg ./2005/06/28-06-05_1502001.jpg
./2006/08/11-08-06_205948_01.jpg ./2006/08/12-08-06_155318_02.jpg
\end{verbatim}
\afterverb
%\beforeverb
%\begin{verbatim}
%./2004/11/15-11-04_0923001.jpg ./2004/11/15-11-04_1016001.jpg
%./2005/06/28-06-05_1500001.jpg ./2005/06/28-06-05_1502001.jpg
%./2006/08/11-08-06_205948_01.jpg ./2006/08/12-08-06_155318_02.jpg
%\end{verbatim}
%\afterverb

Apparently I sometimes sent the same photo more than once 
or made a copy of a photo from time to time without deleting
the original.
%Apparently I sometimes sent the same photo more than once 
%or made a copy of a photo from time to time without deleting
%the original.

\end{enumerate}
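
For large files, reading the whole file into memory at once may be wasteful.
One possible refinement, shown here only as a sketch, is to feed the file to
MD5 in fixed-size chunks. The helper name \verb"md5_of_file" and the chunk
size are our own assumptions; the file is opened in binary mode so that the
raw bytes are hashed:

\beforeverb
\begin{verbatim}
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 checksum of the file at path."""
    digest = hashlib.md5()
    with open(path, 'rb') as fhand:   # binary mode: raw bytes
        while True:
            chunk = fhand.read(chunk_size)
            if not chunk:             # empty read: end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
\end{verbatim}
\afterverb

The returned string could serve as the dictionary key described above.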

\end{ex}