pdr Reference
At the moment the following types of data sources are available:
- command line, mail (POP3 and IMAP), Twitter and text file: these
four data sources work with expressions
- CSV file and XML file: these data sources work with specific data
formats in files
Input per command line
The simplest (and least comfortable) way to get data into the
system is the pdr command line, that is, the invocation of pdr
itself. Nothing needs to be configured for this.
pdr has the command line option -e (--expression), which specifies
an expression. This option can be used more than once. Moreover, all
characters after pdr that are neither a command line option nor the
argument of one are combined into one big expression and processed
at once (see there).
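As a rough illustration of the behaviour described above, the
following Python sketch collects expressions from an argument list.
It is not pdr's implementation: the function name
collect_expressions is hypothetical, only -e/--expression is
handled, and both the joining with single spaces and the order of
the results are assumptions.

```python
def collect_expressions(argv):
    """Collect expressions from a pdr-like argument list (illustrative only)."""
    expressions = []   # one entry per -e/--expression option
    free_text = []     # everything that is not an option or its argument
    args = iter(argv)
    for arg in args:
        if arg in ("-e", "--expression"):
            value = next(args, None)
            if value is None:
                raise ValueError(f"{arg} needs an argument")
            expressions.append(value)
        else:
            free_text.append(arg)
    if free_text:
        # Assumption: the leftover words are joined with single spaces
        # into one big expression.
        expressions.append(" ".join(free_text))
    return expressions


if __name__ == "__main__":
    # The argument values are placeholders, not verified pdr expressions.
    print(collect_expressions(["-e", "5.3", "8i"]))
```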
If an expression on the command line doesn't have a timestamp, the
current date and time are used.
If processing fails because an expression is incorrect, pdr prints a
message. The expression is not transferred into the rejections.
Input per mail (POP3 and IMAP)
When e-mail mailboxes are used, we assume that the data (mails) have
arrived in the mailbox and that no other application processes
them. These mails must have the following properties:
- a unique subject
- a usable timestamp (normally the SMTP server adds one during
sending)
- plain, continuous ASCII text format (no HTML, RTF ...)
- text consisting entirely of expressions
If an e-mail data source is configured, the mail server is queried
during the next invocation. pdr checks whether there are mails on
the server, checks their subjects and processes matching e-mails one
by one, line by line; each line is an expression. If a line has a
timestamp, that timestamp takes priority. Otherwise the timestamp of
the e-mail applies implicitly. This is very handy because you
normally never have to enter a timestamp manually in the usual
single-line e-mails.
Here's a complete e-mail source:
From: superhero <Mymail@gmx.net>
To: MyMail@gmx.net
Subject: Q
Date: Thu, 04 Feb 2010 17:56:11 +0100
Message-ID: <87pr4ley8k.fsf@castor.ch>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

5.3 8i
Normally most of the values in the header lines are filled in from
defaults. Date and Message-ID are added by the server; MIME-Version
and Content-Type come from the e-mail client application. The only
parts that really have to be typed are the subject (which is why it
should be short, here the single letter Q) and the content of the
message, the data line.
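The following Python sketch illustrates the subject check and the
fallback to the e-mail's own timestamp, using the standard email
package. It is only an approximation: the function name
expressions_from_mail is hypothetical, and timestamps inside a line
(which would take priority) are not handled because the expression
syntax is not part of this section.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

RAW_MAIL = """\
From: superhero <Mymail@gmx.net>
To: MyMail@gmx.net
Subject: Q
Date: Thu, 04 Feb 2010 17:56:11 +0100
Message-ID: <87pr4ley8k.fsf@castor.ch>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

5.3 8i
"""


def expressions_from_mail(raw, wanted_subject):
    """Return (timestamp, expression) pairs, or [] if the subject differs."""
    msg = message_from_string(raw)
    if msg["Subject"] != wanted_subject:
        return []
    # Fallback timestamp: the Date header of the e-mail.
    mail_time = parsedate_to_datetime(msg["Date"])
    body = msg.get_payload()
    # Every non-empty body line counts as one expression.
    return [(mail_time, line.strip())
            for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    for timestamp, expression in expressions_from_mail(RAW_MAIL, "Q"):
        print(timestamp.isoformat(), expression)
```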
On POP3 servers, processed e-mails are deleted from the server
regardless of whether processing succeeded, so they are never
processed a second time. This deletion can be suppressed in the
configuration. On IMAP servers the user can configure whether the
mails are deleted or marked as read. If they are marked as read, the
mails remain on the server and can still be archived.
If processing fails because an expression is incorrect, pdr
transfers the affected expressions into the rejections and prints a
message.
Input per Twitter
Using a Twitter account is probably the most elegant way of getting
data into the database. Twitter has a front end on almost every
platform, especially on mobile phones, which makes data input very
easy. Twitter sets the timestamps itself; the user only has to type
the expressions.
pdr fetches only those entries from the feed that are newer than the
newest data value in the database. It is therefore possible to edit
data in the database; such edits are never overwritten.
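The selection rule can be pictured with a short Python sketch. The
function name select_new_entries and the (timestamp, text) tuples
are hypothetical; how pdr actually talks to Twitter or to its
database is not shown here.

```python
from datetime import datetime


def select_new_entries(feed, newest_in_db):
    """Keep only feed entries strictly newer than the newest stored value."""
    return [(ts, text) for ts, text in feed if ts > newest_in_db]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    newest = datetime(2010, 2, 4, 17, 56, 11)
    feed = [
        (datetime(2010, 2, 4, 12, 0, 0), "older entry"),
        (datetime(2010, 2, 5, 8, 30, 0), "newer entry"),
    ]
    print(select_new_entries(feed, newest))
```

Because older entries are never fetched again, a value that was
corrected by hand in the database is not replaced by the original
feed entry.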
If processing fails because an expression is incorrect, pdr
transfers the affected expressions into the rejections and prints a
message.
Note
- It is useful to protect the Twitter account, first to hide the
data from the public eye, and second to avoid comments and other
disturbances. This option is set in the Twitter user's account under
Options, Account, "Tweet security [x] protect my Tweets".
- A protected Twitter account needs authentication. This
authentication takes place the first time pdr accesses a Twitter
account. During this step two private keys are created and saved
locally, so the authentication does not have to be repeated.
- Unfortunately Twitter requires the authentication to be
interactive: the user is shown a Twitter-owned web page in the
browser, on which at least one button must be pressed. The user then
receives a seven-digit number, a PIN, which must be entered in pdr.
For this reason a browser must be specified at the moment of the
first access to a Twitter account (command line parameter
--browser).
- After the authentication has succeeded, a Twitter data source
behaves like any other one.
Input per text file
If we use a text file for data input, every line counts as an
expression. This method is practical if you collect data during a
period without any opportunity to transmit them online. In that case
you have to collect them in a file manually, expression by
expression.
Lines starting with # are not processed.
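A minimal Python sketch of this line handling could look as follows.
The function name read_expressions is hypothetical, and skipping
blank lines is an assumption that the text above does not state.

```python
def read_expressions(lines):
    """Yield one expression per line, skipping lines that start with '#'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # comment line, not processed
        if not line.strip():
            continue  # assumption: blank lines are ignored
        yield line


if __name__ == "__main__":
    sample = ["# collected offline\n", "5.3 8i\n", "\n"]
    print(list(read_expressions(sample)))
```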
If processing fails because an expression is incorrect, pdr prints a
message. The expression is not transferred into the rejections.
Text files that are configured as a data source are deleted after
they have been processed successfully, so they are not processed a
second time. This deletion can be suppressed in the configuration.
Input per CSV file
The abbreviation CSV stands for "comma separated values". Besides
the comma, pdr also accepts the semicolon and the tab character as
separators between the values.
There are two different ways to tell pdr which comma separated
value goes into which collection:
- a control line in the CSV file preceding the data lines
- a control line in the configuration file, valid for the entire
CSV file
In the first case a pdr CSV file would have the following structure:
control line
data line1
[...]
data lineN
control line
data line1
[...]
data lineN
[...]
This use of control lines is unusual but gives us the desired
flexibility and openness. Normally you can insert them easily by
hand or with a program like sed.
In the second case the CSV file contains only data lines, as
expected.
A control line has the following structure:
[# pdr] datetime [separator collection]+
Example:
# pdr datetime, *, n, l; h; q»p, #
(» stands for a tab character)
This is a control line for data lines with a timestamp and seven
values for the collections *, n, l, h, q, p and #.
Each control line in a CSV file is recognized by its prefix # pdr; a
control line in a configuration file doesn't need this prefix. The
keyword datetime that follows marks the position of the timestamp on
the data lines. It doesn't have to be at the beginning, but every
line must have one: there are no data values without a timestamp.
The example shows that several different separators can be used on
one data line. Data lines matching this control line would look like
this:
2008-10-11 12:31:38, 5.2, 7, 8; 42.3; 12»96, first measuring
2008-10-12 12:48:08, 6.1, , 8; 53.1; 16»93,
2008-10-13 12:43:57, 5.8, 7, 7; 34.2; 15»94, third measuring
The second line has no values for the collections n and #. For
missing values, simply nothing is inserted.
If you have CSV files containing more values than you want to import
into collections you can declare omissions in the control line:
# pdr datetime, a, b, , , , c, d, e
Here we read a timestamp and two collections, then skip three values
on the data lines and read three more values.
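To make the role of the control line more concrete, here is a
Python sketch that splits a control line into field names and
separators and then applies it to a data line. The function names
parse_control_line and parse_data_line are hypothetical, values are
kept as strings, and the way pdr itself parses these lines may
differ.

```python
import re


def parse_control_line(line):
    """Split a control line into field names and the separators between them."""
    body = line.rstrip("\n")
    if body.startswith("# pdr"):
        body = body[len("# pdr"):]   # the prefix is optional in a config file
    parts = re.split(r"([,;\t])", body)
    fields = [part.strip() for part in parts[0::2]]  # "" marks an omission
    separators = parts[1::2]
    return fields, separators


def parse_data_line(line, fields, separators):
    """Return (timestamp, {collection: value}) for one data line."""
    values = []
    rest = line.rstrip("\n")
    for sep in separators:
        value, _, rest = rest.partition(sep)
        values.append(value.strip())
    values.append(rest.strip())
    timestamp = None
    record = {}
    for name, value in zip(fields, values):
        if name == "datetime":
            timestamp = value
        elif name and value:         # skip omitted fields and missing values
            record[name] = value
    return timestamp, record


if __name__ == "__main__":
    fields, seps = parse_control_line("# pdr datetime, *, n, l; h; q\tp, #")
    line = "2008-10-12 12:48:08, 6.1, , 8; 53.1; 16\t93,"
    print(parse_data_line(line, fields, seps))
```

An empty field name in the control line is treated as an omission,
and an empty value on a data line produces no entry for that
collection.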
Lines starting with # are not processed.
While a CSV file is processed, the whole file is handled in a single
transaction. If processing fails, for instance because a data value
on a line doesn't match the type of the declared collection, the
whole file is discarded. Nothing is transferred into the rejections.
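The all-or-nothing behaviour can be illustrated with a Python sketch
that uses SQLite as a stand-in. This section does not say which
database pdr uses; the table name samples, the function names and
the last sample timestamp are made up for the illustration.

```python
import sqlite3


def import_all_or_nothing(conn, rows):
    """Insert every row in one transaction; on a bad value insert none."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for timestamp, collection, value in rows:
                conn.execute(
                    "INSERT INTO samples (ts, collection, value) VALUES (?, ?, ?)",
                    (timestamp, collection, float(value)),  # ValueError if not numeric
                )
    except ValueError:
        return False  # whole file discarded, nothing was stored
    return True


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (ts TEXT, collection TEXT, value REAL)")
    good = [("2008-10-11 12:31:38", "*", "5.2"), ("2008-10-12 12:48:08", "*", "6.1")]
    bad = [("2008-10-13 12:43:57", "*", "5.8"), ("2008-10-14 12:00:00", "*", "oops")]
    results = (import_all_or_nothing(conn, good), import_all_or_nothing(conn, bad))
    count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
    return results, count


if __name__ == "__main__":
    print(demo())
```

In the second import the first row is valid, but it is rolled back
together with the faulty one, so the table still holds only the rows
of the first file.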
CSV files that are configured as a data source are deleted after
they have been processed successfully, so they are not processed a
second time. This deletion can be suppressed in the configuration.
Input per XML file
pdr can read XML files for data input. These files are well formed,
readable and editable, and are ideal for data exchange between
different software systems. pdr defines its own, intentionally very
simple format. However, the responsible part of the program is
designed to be extended with further XML formats.
The pdr XML format
The pdr XML format is completely documented in the file pdr.xsd:
<?xml version="1.0" encoding="iso-8859-1" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" >
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      pdr XML input file definition (C) T.M.
      Bremgarten 2010-01-31
    </xsd:documentation>
  </xsd:annotation>
  <xsd:element name="pdr">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="collection" type="collection" minOccurs="0"
                     maxOccurs="unbounded" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="collection">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:attribute name="datetime" type="xsd:string" />
          <xsd:attribute name="value" type="xsd:string" />
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="name" type="xsd:string" use="required" />
    <xsd:attribute name="type" type="collection_type" use="required" />
    <xsd:attribute name="purpose" type="xsd:string" />
  </xsd:complexType>
</xsd:schema>
This definition allows files that look like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<pdr>
  <collection name="#" type="text">
    <item datetime="2001-07-09 18:27:11" value="first measuring"/>
    <item datetime="2001-07-10 07:52:01" value="second measuring"/>
    <item datetime="2001-07-10 10:07:00" value="third measuring"/>
    [...]
  </collection>
  <collection name="*" type="numeric">
    <item datetime="2001-07-12 13:57:01" value="9.3"/>
    <item datetime="2001-07-12 14:46:45" value="5.6"/>
    <item datetime="2001-07-12 18:25:36" value="5.7"/>
    [...]
  </collection>
  <collection name="l" type="numeric">
    <item datetime="2001-07-03 21:41:58" value="7"/>
    <item datetime="2001-07-04 21:48:43" value="8"/>
    <item datetime="2001-07-05 21:50:49" value="7"/>
    [...]
  </collection>
</pdr>
This format is self-explanatory. The data of the collections are
specified directly and in a readable form.
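As an illustration of how such a file can be consumed, the following
Python sketch reads the format with the standard
xml.etree.ElementTree module. The function name read_pdr_xml is
hypothetical, and converting values of numeric collections with
float() is an assumption about what "numeric" means here.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<pdr>
  <collection name="#" type="text">
    <item datetime="2001-07-09 18:27:11" value="first measuring"/>
  </collection>
  <collection name="*" type="numeric">
    <item datetime="2001-07-12 13:57:01" value="9.3"/>
    <item datetime="2001-07-12 14:46:45" value="5.6"/>
  </collection>
</pdr>
"""


def read_pdr_xml(data):
    """Return a list of (collection, timestamp, value) tuples."""
    root = ET.fromstring(data)
    if root.tag != "pdr":
        raise ValueError("not a pdr XML file")
    rows = []
    for collection in root.findall("collection"):
        name = collection.get("name")
        numeric = collection.get("type") == "numeric"
        for item in collection.findall("item"):
            value = item.get("value")
            # float() raises ValueError if a numeric value is malformed.
            rows.append((name, item.get("datetime"),
                         float(value) if numeric else value))
    return rows


if __name__ == "__main__":
    for row in read_pdr_xml(SAMPLE):
        print(row)
```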
While an XML file is processed, the whole file is handled in a
single transaction. If processing fails, for instance because a data
value doesn't match the type of a collection, the whole file is
discarded. Nothing is transferred into the rejections.
XML files that are configured as a data source are deleted after
they have been processed successfully, so they are not processed a
second time. This deletion can be suppressed in the configuration.
(more XML formats)
...