Links

Ben Laurie blathering

29 Nov 2011

Fixing CAs

Filed under: Security — Ben @ 12:58

Adam Langley and I have a proposal to bolster up the rather fragile Certificate Authority infrastructure.

TL;DNR: certificates are registered in a public audit log. Servers present proofs that their certificate is registered, along with the certificate itself. Clients check these proofs and domain owners monitor the logs. If a CA mis-issues a certificate then either

  • There is no proof of registration, so the browser rejects the certificate, or
  • There is a proof of registration and the certificate is published in the log, in which case the domain owner notices and complains, or
  • There is a proof of registration but the certificate does not appear in the log, in which case the proof is now proof that the log misbehaved and should be struck off.

And that, as they say, is that.
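
The three cases can be written down as a small decision table. This is only a sketch of the argument above, not of the actual protocol: `has_proof` and `in_log` are hypothetical flags standing in for the real checks a client and a log monitor would perform.

```python
def misissuance_outcome(has_proof: bool, in_log: bool) -> str:
    """Return who catches a mis-issued certificate in each case."""
    if not has_proof:
        # No proof of registration: the client refuses the certificate.
        return "browser rejects certificate"
    if in_log:
        # Registered and published: the domain owner can see it in the log.
        return "domain owner notices and complains"
    # A proof exists but the log does not show the certificate:
    # the proof itself is evidence against the log.
    return "log misbehaved and is struck off"


for has_proof in (False, True):
    for in_log in (False, True):
        print(has_proof, in_log, "->", misissuance_outcome(has_proof, in_log))
```

Whichever combination you pick, somebody ends up holding evidence, which is the whole point.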

Update: Adam has blogged, exploring the design space.

1 Oct 2011

Open Source Transcription Software Developer

Filed under: Open Data, Open Source, Programming — Ben @ 18:06

Since we set up FreeBMD, FreeREG and FreeCEN things have come a long way, and so we’re revisiting how we do transcription. Those great guys at Zooniverse have released their Scribe transcription software, which they developed to use with Old Weather and Ancient Lives (and more to come), as open source.

We are working with them to develop a new transcription platform for genealogical records, based on Scribe, and we want to hire a developer to help us with it. Scribe itself is written in Ruby, so some familiarity with that would help. We also use Python and EC2, so knowing about those would be good, too. And the front-end is sure to be using JavaScript, so there’s another tickbox to tick.

Finally, we intend to open source everything, and so a developer used to working in an open source community would be helpful.

Everything is negotiable. FreeBMD does not have offices, so this would be “work from home” (or the beach, or whatever suits you).

If you’re interested, send email to freebmd-sd@links.org. Feel free to forward this post, of course.

19 Sep 2011

Lessons Not Learned

Filed under: Identity Management, Security — Ben @ 15:50

Anyone who has not had their head under a rock knows about the DigiNotar fiasco.

And those who’ve been paying attention will also know that DigiNotar’s failure is only the most recent in a long series of proofs of what we’ve known for a long time: Certificate Authorities are nothing but a money-making scam. They provide us with no protection whatsoever.

So imagine how delighted I am that we’ve learnt the lessons here (not!) and are now proceeding with an even less-likely-to-succeed plan using OpenID. Well, the US is.

If the plan works, consumers who opt in might soon be able to choose among trusted third parties — such as banks, technology companies or cellphone service providers — that could verify certain personal information about them and issue them secure credentials to use in online transactions.

Does this sound familiar? Rather like “websites that opt in can choose among trusted third parties – Certificate Authorities – that can verify certain information about them and issue them secure credentials to use in online transactions”, perhaps? We’ve seen how well that works. And this time there’s not even a small number of vendors (i.e. the browser vendors) who can remove a “trusted third party” who turns out not to be trustworthy. This time you have to persuade everyone in the world who might rely on the untrusted third party to remove them from their list. Good luck with that (good luck with even finding out who they are).

What is particularly poignant about this article is that even though its title is “Online ID Verification Plan Carries Risks” the risks we are supposed to be concerned about are mostly privacy risks, for example

people may not want the banks they might use as their authenticators to know which government sites they visit

and

the government would need new privacy laws or regulations to prohibit identity verifiers from selling user data or sharing it with law enforcement officials without a warrant.

Towards the end, if anyone gets there, is a small mention of some security risk

Carrying around cyber IDs seems even riskier than Social Security cards, Mr. Titus says, because they could let people complete even bigger transactions, like buying a house online. “What happens when you leave your phone at a bar?” he asks. “Could someone take it and use it to commit a form of hyper identity theft?”

Dude! If only the risk were that easy to manage! The real problem comes when someone sets up an account as you with one of these “banks, technology companies or cellphone service providers” (note that CAs are technology companies). Then you are going to get your ass kicked, and you won’t even know who issued the faulty credential or how to stop it.

And, by the way, don’t be fooled by the favourite get-out-of-jail-free clause beloved by policymakers and spammers alike, “opt in”. It won’t matter whether you opt in or not, because the proof you’ve opted in will be down to these “trusted” third parties. And the guy stealing your identity will have no compunction about that particular claim.

12 Sep 2011

DNSSEC on the Google Certificate Catalog

Filed under: DNSSEC, Security — Ben @ 14:47

I mentioned my work on the Google Certificate Catalog a while back. Now I’ve updated it to sign responses with DNSSEC.

I also updated the command-line utility to verify DNSSEC responses – and added a little utility to fetch the root DNSSEC keys and verify a PGP signature on them.

As always, feedback is welcome.

23 Jul 2011

An Efficient and Practical Distributed Currency

Filed under: Anonymity, Crypto, Security — Ben @ 15:51

Now that I’ve said what I don’t like about Bitcoin, it’s time to talk about efficient alternatives.

In my previous paper on the subject I amused myself by hypothesizing an efficient alternative to Bitcoin based on whatever mechanism it uses to achieve consensus on checkpoints. Whilst this is fun, it is pretty clear that no such decentralised mechanism exists. Bitcoin enthusiasts believe that I have made an error by discounting proof-of-work as the mechanism, for example

I believe Laurie’s paper is missing a key element in bitcoin’s reliance on hashing power as the primary means of achieving consensus: it can survive attacks by governments.

If bitcoin relied solely on a core development team to establish the authoritative block chain, then the currency would have a Single Point of Failure, that governments could easily target if they wanted to take bitcoin down. As it is, every one in the bitcoin community knows that if governments started coming after bitcoin’s development team, the insertion of checkpoints might be disrupted, but the block chain could go on.

Checkpoints are just an added security measure, that are not essential to bitcoin’s operation and that are used as long as the option exists. It is important for the credibility of a decentralized currency that it be possible for it to function without such a relatively easy to disrupt method of establishing consensus, and bitcoin, by relying on hashing power, can.

or

Ben, your analysis reads as though you took your well-known and long-standing bias against proof-of-work and reverse engineered that ideology to fit into an ad hoc criticism of bitcoin cryptography. You must know that bitcoin represents an example of Byzantine fault tolerance in use and that the bitcoin proof-of-work chain is the key to solving the Byzantine Generals’ Problem of synchronising the global view.

My response is simple: yes, I know that proof-of-work, as used in Bitcoin, is intended to give Byzantine fault tolerance, but my contention is that it doesn’t. And, furthermore, that it fails in a spectacularly inefficient way. I can’t believe I have to keep reiterating the core point, but here we go again: the flaw in proof-of-work as used in Bitcoin is that you have to expend 50% of all the computing power in the universe, for the rest of time in order to keep the currency stable (67% if you want to go for the full Byzantine model). There are two problems with this plan. Firstly, there’s no way you can actually expend 50% (67%), in practice. Secondly, even if you could, it’s far, far too high a price to pay.
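
To put a rough number on why a majority matters, here is the standard gambler's-ruin estimate from the original Bitcoin paper for an attacker who controls a fraction q of the hashing power and starts z blocks behind. It is a sketch of that textbook formula only; the function name is made up for this example.

```python
def catch_up_probability(q: float, z: int) -> float:
    """Probability that an attacker with hash-power share q ever
    overtakes an honest chain that is z blocks ahead."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    p = 1.0 - q  # honest share
    if q >= p:
        return 1.0  # with half or more of the power, success is certain
    return (q / p) ** z


for q in (0.1, 0.3, 0.45, 0.5):
    print(f"q={q:.2f}  P(catch up from 6 behind)={catch_up_probability(q, 6):.4f}")
```

The probability only falls off while the honest side keeps its majority, and it has to keep it forever.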

In any case, in the end, control of computing power is roughly equivalent to control of money – so why not cut out the middleman and simply buy Bitcoins? It would be just as cheap and it would not burn fossil fuels in the process.

Finally, if the hash chain really works so well, why do the Bitcoin developers include checkpoints? The currency isn’t even under attack and yet they have deemed them necessary. Imagine how much more needed they would be if there were deliberate disruption of Bitcoin (which seems quite easy to do to me).

But then the question would arise: how do we efficiently manage a distributed currency? I present an answer in my next preprint: “An Efficient Distributed Currency”.

2 Jul 2011

Decentralised Currencies Are Probably Impossible (But Let’s At Least Make Them Efficient)

Filed under: General — Ben @ 20:04

How time flies. Following my admittedly somewhat rambling posts on Bitcoin, I decided to write a proper paper about the problem. So, here’s a preprint of “Decentralised Currencies Are Probably Impossible (But Let’s At Least Make Them Efficient)”. It’s short! Enjoy.

I may submit this to a conference, I haven’t decided yet. Suggestions of where are welcome.

By the way, Bitcoin fanboys: I see I have been taken to task for my heretic views on the Bitcoin forums. Since those have cunningly been closed to anyone who does not already have some kind of track record of conforming to the standards of the forums (presumably meaning “don’t diss Bitcoin”) I am unable to respond to comments there, but I would like to note, for the record, that I have not deleted a single non-spam comment on my Bitcoin posts, contrary to claims I see there.

21 May 2011

Bitcoin is Slow Motion

Filed under: Anonymity, Crypto, General, Privacy, Security — Ben @ 5:32

OK, let’s approach this from another angle.

The core problem Bitcoin tries to solve is how to get consensus in a continuously changing, free-for-all group. It “solves” this essentially insoluble problem by making everyone walk through treacle, so it’s always evident who is in front.

But the problem is, it isn’t really evident. Slowing everyone down doesn’t take away the core problem: that someone with more resources than you can eat your lunch. Right now, with only modest resources, I could rewrite all of Bitcoin history. By the rules of the game, you’d have to accept my longer chain and just swallow the fact you thought you’d minted money.

If you want to avoid that, then you have to have some other route to achieve a consensus view of history. Once you have a way to achieve such a consensus, then you could mint coins by just sequentially numbering them instead of burning CPU on slowing yourself down, using the same consensus mechanism.

Now, I don’t claim to have a robust way to achieve consensus; any route seems to be open to attacks by people with more resources. But I make this observation: as several people have noted, currencies are founded on trust: trust that others will honour the currency. It seems to me that there must be some way to leverage this trust into a mechanism for consensus.

Right now, for example, in the UK, I can only spend GBP. At any one time, in a privacy preserving way, it would in theory be possible to know who was in the UK and therefore formed part of the consensus group for the GBP. We could then base consensus on current wielders of private keys known to be in the UK, the vast majority of whom would be honest. Or their devices would be honest on their behalf, to be precise. Once we have such a consensus group, we can issue coins simply by agreeing that they are issued. No CPU burning required.
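
A minimal sketch of “issue coins simply by agreeing that they are issued”, assuming the hard part (knowing who is in the consensus group) is already solved. The class and method names are hypothetical, and votes are plain set membership rather than signatures.

```python
class ConsensusMint:
    """Issue sequentially numbered coins once a majority of the group agrees."""

    def __init__(self, members):
        self.members = set(members)
        self.next_serial = 1
        self.issued = {}  # serial -> owner

    def issue(self, owner, approvals):
        """Issue the next coin to owner if a strict majority approves."""
        valid = self.members & set(approvals)  # ignore votes from outsiders
        if 2 * len(valid) <= len(self.members):
            return None  # no majority, no coin
        serial = self.next_serial
        self.next_serial += 1
        self.issued[serial] = owner
        return serial


mint = ConsensusMint(["alice", "bob", "carol", "dave", "erin"])
print(mint.issue("alice", ["alice", "bob", "carol"]))    # three of five agree
print(mint.issue("mallory", ["mallory", "bob", "eve"]))  # outsiders do not count
```

Minting here costs one counter increment; all the expense has moved into deciding who gets a vote.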

20 May 2011

Bitcoin 2

Filed under: Anonymity, Crypto, Security — Ben @ 16:32

Well, that got a flood of comments.

Suppose I take 20 £5 notes, burn them and offer you a certificate for the smoke for £101. Would you buy the certificate?

This is the value proposition of Bitcoin. I don’t get it. How does that make sense? Why would you burn £100 worth of non-renewable resources and then use it to represent £100 of buying power? Really? That’s just nuts, isn’t it?

I mean, it’s nice for the early adopters, so long as new suckers keep coming along. But in the long run it’s just a pointless waste of stuff we can never get back.

Secondly, the point of referencing “Proof-of-work Proves Not to Work” was just to highlight that cycles are much cheaper for some people than others (particularly botnet operators), which makes them a poor fit for defence.

Finally, consensus is easy if the majority are honest. And then coins become cheap to make. Just saying.

17 May 2011

Bitcoin

Filed under: Anonymity, Distributed stuff, Security — Ben @ 17:03

A friend alerted me to a sudden wave of excitement about Bitcoin.

I have to ask: why? What has changed in the last 10 years to make this work when it didn’t in, say, 1999, when many other related systems (including one of my own) were causing similar excitement? Or in the 20 years since the wave before that, in 1990?

As far as I can see, nothing.

Also, for what it’s worth, if you are going to deploy electronic coins, why on earth make them expensive to create? That’s just burning money – the idea is to make something unforgeable as cheaply as possible. This is why all modern currencies are fiat currencies instead of being made out of gold.

Bitcoins are designed to be expensive to make: they rely on proof-of-work. It is far more sensible to use signatures over random numbers as a basis, as asymmetric encryption gives us the required unforgeability without any need to involve work. This is how Chaum’s original system worked. And the only real improvement since then has been Brands’ selective disclosure work.
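
For the curious, here is roughly what “signatures over random numbers” looks like with an RSA blind signature in the style of Chaum. This is a toy under loud assumptions: the primes are tiny, there is no padding, and none of the names come from Lucre or any real library.

```python
import math
import secrets

# Toy RSA key for the bank. Far too small to be secure; illustration only.
P, Q = 1009, 1013
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def blind(serial):
    """Customer hides the coin's serial number before sending it to the bank."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    return (serial * pow(r, E, N)) % N, r


def bank_sign(blinded):
    """Bank signs without learning the serial number."""
    return pow(blinded, D, N)


def unblind(blind_sig, r):
    """Customer strips the blinding factor, leaving a signature on the serial."""
    return (blind_sig * pow(r, -1, N)) % N


def verify(serial, sig):
    """Anyone can check the coin against the bank's public key (N, E)."""
    return pow(sig, E, N) == serial % N


serial = secrets.randbelow(N - 2) + 2  # the coin is just a random number
blinded, r = blind(serial)
coin_sig = unblind(bank_sign(blinded), r)
print(verify(serial, coin_sig))
```

Making the coin costs the bank one modular exponentiation, and forging one means breaking the signature scheme rather than outspending anybody.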

If you want to limit supply, there are cheaper ways to do that, too. And proof-of-work doesn’t, anyway (it just gives the lion’s share to the guy with the cheapest/biggest hardware).

Incidentally, Lucre has recently been used as the basis for a fully-fledged transaction system, Open Transactions. Note: I have not used this system, so make no claims about how well it works.

(Edit: background reading – “Proof-of-Work” Proves Not to Work)

8 May 2011

Checking SSL Certificates

Filed under: Security — Ben @ 12:31

I mentioned my work on the Google Certificate Catalog recently. One thing I forgot is a command-line utility I wrote to perform the check for you automatically.

You can find it here.

29 Apr 2011

Pepper-crusted Tuna

Filed under: Recipes — Ben @ 10:53

I came across this on my frequent travels to the US (where they tend to call it pepper-crusted ahi, or even, rather redundantly, pepper-crusted ahi tuna). I don’t think I’ve ever seen it in the UK, but it is fantastically easy to cook. And delicious.

Tuna steaks (nice and fresh, so you can leave them rare)
Black peppercorns
Szechuan pepper (optional)

Crush the peppercorns in a pestle and mortar (or mortar and pestle if you’re American). They don’t need to be particularly finely divided, but try to at least split each one in half. Spread half the mix over one side of your tuna steaks and press it in – it sticks surprisingly well. Turn over and repeat. Then fry in hot olive oil for about 4 minutes a side (up to 8 if you’re too chicken for rare tuna, but I promise it tastes/feels much better rare). Do not keep turning them over, turn them just once. Sprinkle on some sea salt when done.

That’s it.

I often serve with plain boiled rice and pak choi.

3 Apr 2011

Improving SSL Certificate Security

Filed under: Crypto, DNSSEC, Security — Ben @ 19:47

Given how often I say on this blog that I am not speaking for my employer, I am amused to be able to say for once that I am. Over here.

28 Mar 2011

Census FAIL

Filed under: Rants — Ben @ 13:19

Once every ten years, every household in the UK gets to fill in a census form. This year, for the first time ever, I think, you can do it online. So, imagine how delighted we are that I am the only person in my household whose name actually fits in the box. Yes, really, there’s a 50 character limit.

Why? Suppose they’d splashed out and allowed 500 characters instead. What would that cost? Well, let’s assume 100M names. That’s an extra 450 x 100 = 45,000 MB of data, assuming they’re still using databases with fixed width fields. 45 GB. That would’ve cost them nearly an extra £5 at today’s prices. Not £5 per person, or £5 per household. £5 total.
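
The arithmetic, for anyone who wants to check it. The 100M names and fixed-width fields are the assumptions stated above, one byte per character is a further assumption, and the price per GB is left out because the text gives no exact figure.

```python
NAMES = 100_000_000              # assumed number of names
OLD_LIMIT, NEW_LIMIT = 50, 500   # characters per name field, one byte each

extra_bytes = (NEW_LIMIT - OLD_LIMIT) * NAMES
print(extra_bytes // 10**6, "MB")
print(extra_bytes / 10**9, "GB")
```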

Thank god for government savings, eh?

BTW, my wife rang and asked what to do. Amazingly, they opted for the least useful possible answer: start at the beginning of your name and keep going ’til you run out of space. I’m sure future generations will be very happy to have complete middle names and no surname. Not.

27 Mar 2011

ZFS Part 3: Replacing Dead Disks

Filed under: Open Source — Ben @ 16:39

As discussed in my previous article, if a disk fails then a ZFS system will just carry on as if nothing has happened. Of course, we’d like to restore the system to its former redundant glory, so here’s how…

Once more, we simulate a failure by removing the primary disk, but this time replace it with a new unformatted disk (I guess if the new disk was already bootable you’d need to fix that first).

Let’s assume we’re several years down the line and no longer have any documentation at all. First off, find your disks by inspecting dmesg. As before we have ad4 and ad8. ad4 is the new disk.

# diskinfo -v ad4 ad8
ad4
        512             # sectorsize
        500107862016    # mediasize in bytes (466G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        969021          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S20BJ9AB212006  # Disk ident.

ad8
        512             # sectorsize
        500107862016    # mediasize in bytes (466G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        969021          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMYLC5V        # Disk ident.

This time they are conveniently exactly the same size, despite having different manufacturers (Samsung and Seagate respectively). We already know from the first article in this series that we can deal with disks that don’t look the same, and in any case only 250GB is currently replicated. So, let’s partition the new disk as the old one…

# gpart show ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

# gpart show -l ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  (null)  (64K)
        162    4194304    2  swap8  (2.0G)
    4194466  484202669    3  system8  (231G)
  488397135  488376000    4  scratch8  (233G)

# gpart create -s gpt ad4
ad4 created
# gpart add -b 34 -s 128 -t freebsd-boot ad4
ad4p1 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad4
bootcode written to ad4
# gpart add -s 4194304 -t freebsd-swap -l swap4 ad4
ad4p2 added
# gpart add -s 484202669 -t freebsd-zfs -l system4 ad4
ad4p3 added
# gpart add -t freebsd-zfs -l scratch4 ad4
ad4p4 added
# gpart show ad4
=>       34  976773101  ad4  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)
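
If you would rather not retype the sizes by hand, the gpart add commands can be derived from the surviving disk's gpart show output. The following is a sketch only: the helper names are made up, it prints commands rather than running them, the labels are supplied by hand, and the bootcode step still has to be done separately.

```python
SHOW_AD8 = """\
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)
"""


def parse_gpart_show(text):
    """Return (index, start, size, type) for each partition row."""
    parts = []
    for line in text.splitlines():
        fields = line.split()
        # Partition rows look like: start size index type (human-size)
        if len(fields) >= 4 and fields[0].isdigit() and fields[2].isdigit():
            parts.append((int(fields[2]), int(fields[0]), int(fields[1]), fields[3]))
    return parts


def clone_commands(parts, disk, labels):
    """Build gpart commands that recreate parts on disk.

    labels maps partition index -> GPT label. The last partition gets no -s,
    so it takes whatever space remains, as in the transcript above.
    """
    cmds = [f"gpart create -s gpt {disk}"]
    for n, (index, start, size, ptype) in enumerate(parts):
        cmd = ["gpart", "add"]
        if n == 0:
            cmd += ["-b", str(start)]
        if n < len(parts) - 1:
            cmd += ["-s", str(size)]
        cmd += ["-t", ptype]
        if index in labels:
            cmd += ["-l", labels[index]]
        cmd.append(disk)
        cmds.append(" ".join(cmd))
    return cmds


labels = {2: "swap4", 3: "system4", 4: "scratch4"}
for command in clone_commands(parse_gpart_show(SHOW_AD8), "ad4", labels):
    print(command)
```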

Now we’re ready to reattach the disk to the various filesystems.

First the swap. Since we can’t remove the dead disk from the gmirror setup, first we forget then add the new swap partition back in.

# gmirror forget swap
# gmirror insert -h -p 1 swap /dev/gpt/swap4
# gmirror status
       Name    Status  Components
mirror/swap  DEGRADED  gpt/swap8
                       gpt/swap4 (29%)

and after a while

# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap8
                       gpt/swap4

Next the main filesystem. In this case, since the new device has the same name as the old one, we can just write

# zpool replace system /dev/gpt/system4
If you boot from pool 'system', you may need to update
boot code on newly attached disk '/dev/gpt/system4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

Once more we’ve already done this step, so no need to do it again. Note, this command took a little while, don’t be alarmed!

# zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 9.77% done, 0h2m to go
config:

        NAME                   STATE     READ WRITE CKSUM
        system                 DEGRADED     0     0     0
          mirror               DEGRADED     0     0     0
            gpt/system8        ONLINE       0     0     0
            replacing          DEGRADED     0     0     0
              gpt/system4/old  UNAVAIL      0     0     0  cannot open
              gpt/system4      ONLINE       0     0     0  221M resilvered

errors: No known data errors

and after not very long

# zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: ONLINE
 scrub: resilver completed after 0h1m with 0 errors on Sun Mar 27 13:04:02 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  2.21G resilvered

errors: No known data errors

And we’re all good, back to where we were before. Reboot to check everything is fine.

Note, by the way, that all of this was done on a live system in multi-user mode. Apart from the occasional reboot there was no loss of service whatsoever.

Also, because the primary disk didn’t really fail, if I wanted I could put it in my other machine and end up with a working replicated system there without any need for setup.

There is one niggling question remaining: I started off with one 250 GB and one 500 GB disk. I now have two 500 GBs, which means the non-redundant scratch file system I had before could now become redundant. Or they could become part of the system pool. Or they could become a bigger non-redundant scratch filesystem.

In the end I decided to do the simplest thing, which is to make the scratch partitions part of the larger system partition. If I ever need to rearrange, that is always possible either with the help of an additional disk or, even, with less safety, by taking one of the disks out of the pools and rearranging onto that (see a description of doing this kind of thing on freenas).

So, to make them part of the existing pool, first destroy the scratch filesystem (if I’d already used it I’d have to copy it before I started, but since I haven’t I can just blow it away). Since we mounted the pool direct, we destroy it with zpool:

# zpool destroy scratch

(and we can confirm it has gone with zpool list and zfs list). Just for naming sanity, I rename the two scratch partitions:

# gpart modify -i 4 -l system8.p2 ad8
ad8p4 modified
# gpart modify -i 4 -l system4.p2 ad4
ad4p4 modified

and since those aren’t reflected in /dev/gpt, reboot. Then finally

# zpool add system mirror /dev/gpt/system4.p2 /dev/gpt/system8.p2

and presto

# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
system   463G  2.21G   461G     0%  ONLINE  -
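
A quick sanity check of the 463G figure, assuming zpool list reports binary gigabytes and that each two-way mirror contributes the size of one of its partitions. The small shortfall against the raw sum is presumably ZFS's own overhead and rounding.

```python
SECTOR = 512
system_sectors = 484202669    # partition 3 on each disk
scratch_sectors = 488376000   # partition 4 on each disk, now system*.p2

GIB = 2**30
mirror1 = system_sectors * SECTOR / GIB
mirror2 = scratch_sectors * SECTOR / GIB
print(f"{mirror1:.1f} GiB + {mirror2:.1f} GiB = {mirror1 + mirror2:.1f} GiB")
```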

ZFS Part 2: Disk Failure

Filed under: Open Source — Ben @ 16:12

Before I’m ready to trust ZFS I need to make sure I can replace a disk when it dies. With the setup described here, as a first experiment I removed the primary disk.

So, power down and remove the primary disk (ad4). Note that if you’re doing this on the Proliant system I mentioned, then you really should replace the drive mount (it is needed for cooling). Luckily I have a spare system so I just borrowed one.

Reboot. Comes up fine on the secondary disk without further intervention.

$ zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        system           DEGRADED     0     0     0
          mirror         DEGRADED     0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  UNAVAIL      0     0     0  cannot open

errors: No known data errors

Note that the system pool is now degraded. How would we have known if we hadn’t checked? Well, turns out we missed something from the previous setup.

We should have put

daily_status_zfs_enable="YES"
daily_status_gmirror_enable="YES"

in /etc/periodic.conf. Then in the daily mail we’d see:

Checking status of zfs pools:
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

	NAME             STATE     READ WRITE CKSUM
	system           DEGRADED     0     0     0
	  mirror         DEGRADED     0     0     0
	    gpt/system8  ONLINE       0     0     0
	    gpt/system4  UNAVAIL      0     0     0  cannot open

errors: No known data errors

Checking status of gmirror(8) devices:
       Name    Status  Components
mirror/swap  DEGRADED  gpt/swap8

So remember, boys and girls, read your daily mails!
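
For those who would rather have a script do the reading, here is a sketch that picks unhealthy pools out of zpool status text. The function name is hypothetical and the sample is trimmed from the output above; feeding it live output is left as an exercise.

```python
import re

SAMPLE = """\
  pool: scratch
 state: ONLINE
 scrub: none requested

  pool: system
 state: DEGRADED
status: One or more devices could not be opened.
"""


def unhealthy_pools(status_text):
    """Return {pool: state} for every pool whose state is not ONLINE."""
    bad = {}
    pool = None
    for line in status_text.splitlines():
        m = re.match(r"\s*pool:\s*(\S+)", line)
        if m:
            pool = m.group(1)
            continue
        m = re.match(r"\s*state:\s*(\S+)", line)
        if m and pool is not None and m.group(1) != "ONLINE":
            bad[pool] = m.group(1)
    return bad


print(unhealthy_pools(SAMPLE))
```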

So far, so good. One disk failed, the system came back up without intervention, and would have alerted us in daily mails had we configured it correctly (of course it now is). So what happens if we put the disk back in? Since we’ve modified the other disk in the meantime, we’d hope that would get reconciled. Let’s see…

Power down and replace the missing disk, reboot.

Now we see

$ zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Mar 26 10:48:56 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  345K resilvered

errors: No known data errors

$ gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap4
                       gpt/swap8

and there we are, back to where we started. But suppose the disk had really failed, then what? See the next exciting installment!

18 Mar 2011

Completely Redundant Disks with ZFS and FreeBSD

Filed under: General, Open Source — Ben @ 15:00

A while back, I bought a ReadyNAS device for my network, attracted by the idea of RAID I can grow over time and mirrored disks.

Today I just finished building the same thing “by hand”, using FreeBSD and ZFS. At a fraction of the cost. Here’s how.

First off, I bought this amazing bargain: an HP ProLiant MicroServer. These would be cheap even at list price, but with the current £100 cashback offer, they’re just stupidly cheap. And rather nice.

Since I want to cater for a realistic future, I am assuming by the time I need to replace a drive I will no longer be able to buy a matching device, so I started from day one with a different second drive (the primary is 250 GB, secondary is 500 GB – both Seagate, which was not the plan, but I’ll remedy that in the next episode). I also added an extra 1GB of RAM to the machine (this is important for ZFS which is apparently not happy with less than 2GB of system RAM).

I then followed, more or less, Pawel’s excellent instructions for creating a fully mirrored setup. However, I had to deviate from them somewhat, so here’s my version.

The broad overview of the process is as follows

  1. Install FreeBSD on the primary disk, using a standard sysinstall.
  2. Create and populate gmirror and ZFS partitions on the secondary disk.
  3. Boot from the primary disk, but mount the secondary.
  4. Create and populate gmirror and ZFS partitions on the primary disk.
  5. Use excess secondary disk as scratch.

In my case the two disks are ad4 (primary, 250 GB) and ad8 (secondary, 500 GB). Stuff I typed follows the # prompt.

Since we need identical size partitions for the mirror, we need to simulate the first disk (since it happens to be smaller). Get the disk’s size

# diskinfo -v /dev/ad4
/dev/ad4
        512             # sectorsize
        250059350016    # mediasize in bytes (233G)
        488397168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        484521          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMQN8T5        # Disk ident.

Create a memory disk the same size. Note that the sector sizes must match!

# mdconfig -a -t swap -s 488397168
md0

Verify they are the same.

# diskinfo -v /dev/ad4 /dev/md0
/dev/ad4
        512             # sectorsize
        250059350016    # mediasize in bytes (233G)
        488397168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        484521          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMQN8T5        # Disk ident.

/dev/md0
        512             # sectorsize
        250059350016    # mediasize in bytes (233G)
        488397168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset

Now partition the memory disk as we will the first disk later on.

# gpart create -s gpt md0
md0 created
# gpart add -b 34 -s 128 -t freebsd-boot md0
md0p1 added
# gpart add -s 2g -t freebsd-swap -l swap1 md0
md0p2 added
# gpart add -t freebsd-zfs -l systemx md0
md0p3 added

and show the resulting sizes

# gpart show md0
=>       34  488397101  md0  GPT  (233G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
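The numbers in that table can be predicted by hand, which is a useful cross-check on the simulation. The sketch below assumes 512-byte sectors, a GPT whose usable area runs from sector 34 to (total - 34), and partitions placed back to back with no alignment padding, which is what the output above shows. The function name zfs_partition is hypothetical.

```shell
#!/bin/sh
# Sketch: predict the start and size of the last (freebsd-zfs) partition.
# Assumptions: 512-byte sectors, usable GPT area from LBA 34 to total-34,
# partitions laid out back to back as in the gpart output above.
GPT_FIRST_USABLE=34
BOOT_SECTORS=128                                 # gpart add -s 128
SWAP_SECTORS=$((2 * 1024 * 1024 * 1024 / 512))   # gpart add -s 2g

zfs_partition() {
    total_sectors=$1
    last_usable=$((total_sectors - 34))
    start=$((GPT_FIRST_USABLE + BOOT_SECTORS + SWAP_SECTORS))
    size=$((last_usable - start + 1))
    echo "$start $size"
}

# 488397168 is the sector count of ad4 (and of md0).
zfs_partition 488397168   # prints: 4194466 484202669
```

The size it prints is the figure that has to be passed explicitly with -s when creating the matching partition on the larger disk.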

Now blow away the memory disk; we don’t need it any more.

# mdconfig -d -u 0

Create the partitions on the second disk.

# gpart create -s gpt ad8
ad8 created
# gpart add -b 34 -s 128 -t freebsd-boot ad8
ad8p1 added
# gpart add -s 2g -t freebsd-swap -l swap1 ad8
ad8p2 added
# gpart add -s 484202669 -t freebsd-zfs -l system8 ad8
ad8p3 added

And eat the rest of the disk as a scratch area (this area will not be mirrored, and so should only be used for disposable stuff).

# gpart add -t freebsd-zfs -l scratch8 ad8
ad8p4 added

Check it matches the md0 simulation

# gpart show ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

And don’t forget to set up the bootloader

# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad8
bootcode written to ad8

I realised at this point I had intended to label everything with an 8, to match the unit number, and had not done so for swap, so for completeness, here’s how you fix it

# gpart modify -i 2 -l swap8 ad8
ad8p2 modified
# gpart show -l ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  (null)  (64K)
        162    4194304    2  swap8  (2.0G)
    4194466  484202669    3  system8  (231G)
  488397135  488376000    4  scratch8  (233G)

Note that the label change is not reflected by the device names in /dev/gpt, which are needed for the next step, so at this point I rebooted.

Now set up the swap mirror.

# gmirror label -F -h -b round-robin swap /dev/gpt/swap8

Create the ZFS storage pool called system, consisting only of our system8 partition.

# zpool create -O mountpoint=/mnt -O atime=off -O setuid=off -O canmount=off system /dev/gpt/system8

And create a dataset – “mountpoint=legacy” stops ZFS from managing it.

# zfs create -o mountpoint=legacy -o setuid=on system/root

Mark it as the default bootable dataset.

# zpool set bootfs=system/root system

Mount it

# mount -t zfs system/root /mnt
# mount
/dev/ad4s1a on / (ufs, local)
devfs on /dev (devfs, local, multilabel)
system/root on /mnt (zfs, local, noatime)

And create the remaining mountpoints according to Pawel’s suggested layout…

# zfs create -o compress=lzjb system/tmp
# chmod 1777 /mnt/tmp
# zfs create -o canmount=off system/usr
# zfs create -o setuid=on system/usr/local
# zfs create -o compress=gzip system/usr/src
# zfs create -o compress=lzjb system/usr/obj
# zfs create -o compress=gzip system/usr/ports
# zfs create -o compress=off system/usr/ports/distfiles
# zfs create -o canmount=off system/var
# zfs create -o compress=gzip system/var/log
# zfs create -o compress=lzjb system/var/audit
# zfs create -o compress=lzjb system/var/tmp
# chmod 1777 /mnt/var/tmp
# zfs create -o canmount=off system/usr/home

And create one for each user:

# zfs create system/usr/home/ben

Now, at a slightly different point from Pawel, I edit the various config files. First /boot/loader.conf. Note that some of these are commented out: this is because, although they appear in Pawel’s version, they are already built into the kernel (this is because I use a GENERIC kernel and he uses a stripped-down one). Including them seems to cause problems (particularly geom_part_gpt, which causes a hang during boot if present).

geom_eli_load=YES
#geom_label_load=YES
geom_mirror_load=YES
#geom_part_gpt_load=YES
zfs_load=YES
vm.kmem_size=3G # This should be 150% of your RAM.
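The 3G above follows from the 150% rule in the comment for a machine with 2 GB of RAM. Here is a minimal sketch of the arithmetic for other sizes. The function name kmem_size_mb is made up, and it takes the RAM size in bytes as an argument (on FreeBSD, sysctl -n hw.physmem reports a figure in bytes that could be fed to it).

```shell
#!/bin/sh
# Sketch: 150% of physical RAM, in megabytes, per the loader.conf comment.
# kmem_size_mb is a hypothetical helper; pass it the RAM size in bytes.
kmem_size_mb() {
    physmem_bytes=$1
    echo $((physmem_bytes / 1048576 * 3 / 2))
}

# 2 GB of RAM gives the 3G used above.
kmem_size_mb 2147483648   # prints 3072
```

The reported physical memory tends to be a little under the installed RAM, so it is probably worth rounding the result up to a tidy figure by hand.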

Enable ZFS

# echo zfs_enable=YES >> /etc/rc.conf

Change fstab for the new layout (note, you might want to edit these in – for example, my system had an entry for cd drives).

# cat > /etc/fstab
system/root / zfs rw,noatime 0 0
/dev/mirror/swap.eli none swap sw 0 0
^D

The .eli extension here is magic: geom_eli finds it at startup and automatically encrypts it.

Set the work directory for ports (so that it uses the faster compression scheme during builds).

# echo WRKDIRPREFIX=/usr/obj >> /etc/make.conf

These need to be done now because the next step is to copy the entire install to the new ZFS filesystem. Run this from / (the command archives the current directory). Note that this particular command pastes completely incorrectly from Pawel’s blog post so be careful!

# tar -c --one-file-system -f - . | tar xpf - -C /mnt/

Tar can’t copy some types of file, so expect an error or two at this point:

tar: ./var/run/devd.pipe: tar format cannot archive socket
tar: ./var/run/log: tar format cannot archive socket
tar: ./var/run/logpriv: tar format cannot archive socket

Just for fun, take a look at the ZFS we’ve created so far…

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
system 1.12G 225G 21K /mnt
system/root 495M 225G 495M legacy
system/tmp 30K 225G 30K /mnt/tmp
system/usr 652M 225G 21K /mnt/usr
system/usr/home 50K 225G 21K /mnt/usr/home
system/usr/home/ben 29K 225G 29K /mnt/usr/home/ben
system/usr/local 297M 225G 297M /mnt/usr/local
system/usr/obj 21K 225G 21K /mnt/usr/obj
system/usr/ports 190M 225G 159M /mnt/usr/ports
system/usr/ports/distfiles 30.8M 225G 30.8M /mnt/usr/ports/distfiles
system/usr/src 165M 225G 165M /mnt/usr/src
system/var 100K 225G 21K /mnt/var
system/var/audit 21K 225G 21K /mnt/var/audit
system/var/log 35K 225G 35K /mnt/var/log
system/var/tmp 23K 225G 23K /mnt/var/tmp

Unmount ZFS

# zfs umount -a

And the one we mounted by hand

# umount /mnt

And set the new ZFS-based system to be mounted on /

# zfs set mountpoint=/ system

And … reboot! (this is the moment of truth)

After the reboot, you should see

$ mount
system/root on / (zfs, local, noatime)
devfs on /dev (devfs, local, multilabel)
system/tmp on /tmp (zfs, local, noatime, nosuid)
system/usr/home/ben on /usr/home/ben (zfs, local, noatime, nosuid)
system/usr/local on /usr/local (zfs, local, noatime)
system/usr/obj on /usr/obj (zfs, local, noatime, nosuid)
system/usr/ports on /usr/ports (zfs, local, noatime, nosuid)
system/usr/ports/distfiles on /usr/ports/distfiles (zfs, local, noatime, nosuid)
system/usr/src on /usr/src (zfs, local, noatime, nosuid)
system/var/audit on /var/audit (zfs, local, noatime, nosuid)
system/var/log on /var/log (zfs, local, noatime, nosuid)
system/var/tmp on /var/tmp (zfs, local, noatime, nosuid)
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
system 1.79G 225G 21K /
system/root 763M 225G 763M legacy
system/tmp 43K 225G 43K /tmp
system/usr 1.04G 225G 21K /usr
system/usr/home 50.5K 225G 21K /usr/home
system/usr/home/ben 29.5K 225G 29.5K /usr/home/ben
system/usr/local 297M 225G 297M /usr/local
system/usr/obj 416M 225G 416M /usr/obj
system/usr/ports 190M 225G 159M /usr/ports
system/usr/ports/distfiles 30.8M 225G 30.8M /usr/ports/distfiles
system/usr/src 165M 225G 165M /usr/src
system/var 106K 225G 21K /var
system/var/audit 21K 225G 21K /var/audit
system/var/log 41.5K 225G 41.5K /var/log
system/var/tmp 23K 225G 23K /var/tmp
$ swapinfo
Device 1K-blocks Used Avail Capacity
/dev/mirror/swap.eli 2097148 0 2097148 0%

Note that system is not actually mounted (it has canmount=off) – it is used to allow all the other filesystems to inherit the / mountpoint. The one that is actually mounted on / is system/root, which is marked as legacy because it is mounted before zfs is up.
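To make the inheritance concrete: a dataset with no mountpoint of its own ends up at the pool's mountpoint plus its path below the pool name. The sketch below models that rule only. The function name inherited_mountpoint is hypothetical, and it does not cover datasets that set their own mountpoint (such as system/root, which is legacy).

```shell
#!/bin/sh
# Sketch: where a dataset lands when it inherits the pool's mountpoint.
# inherited_mountpoint is a hypothetical helper that models the rule only.
inherited_mountpoint() {
    pool_mountpoint=$1   # e.g. /mnt or /
    dataset=$2           # e.g. system/usr/home/ben
    case $dataset in
        */*) rest=${dataset#*/} ;;               # strip the pool name
        *)   echo "$pool_mountpoint"; return ;;  # the pool itself
    esac
    # Drop a trailing slash so "/" does not produce "//usr/...".
    echo "${pool_mountpoint%/}/$rest"
}

inherited_mountpoint /mnt system/usr/home/ben   # prints /mnt/usr/home/ben
inherited_mountpoint /    system/usr/home/ben   # prints /usr/home/ben
```

This is why a single zfs set mountpoint=/ on system was enough to move every filesystem from /mnt to its final place.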

Now we’re up on the second disk, time to get the first disk back in the picture (we’re using it for boot but nothing else right now).

First blow away the MBR

# dd if=/dev/zero of=/dev/ad4 count=79
79+0 records in
79+0 records out
40448 bytes transferred in 0.008059 secs (5018970 bytes/sec)

and create the GPT partitions:

# gpart create -s GPT ad4
ad4 created
# gpart add -b 34 -s 128 -t freebsd-boot ad4
ad4p1 added
# gpart add -s 2g -t freebsd-swap -l swap4 ad4
ad4p2 added
# gpart add -t freebsd-zfs -l system4 ad4
ad4p3 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad4
bootcode written to ad4

No scratch partition on this one, there’s no room. Now the two disks should match

# gpart show
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

=>       34  488397101  ad4  GPT  (233G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)

apart from the scratch partition, of course.
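Comparing the two tables by eye is error-prone, so here is a minimal sketch that does it mechanically. It assumes the gpart show layout seen above (start, size, index, type) and only looks at partitions 1 to 3. The function name mirror_layout is made up, and the input is pasted from the output above rather than read from a live system.

```shell
#!/bin/sh
# Sketch: pull "index start size type" for partitions 1-3 out of
# gpart show output. mirror_layout is a hypothetical helper.
mirror_layout() {
    awk '$1 != "=>" && $3 ~ /^[0-9]+$/ && $3 <= 3 { print $3, $1, $2, $4 }'
}

ad8_layout=$(mirror_layout <<'EOF'
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)
EOF
)

ad4_layout=$(mirror_layout <<'EOF'
=>       34  488397101  ad4  GPT  (233G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
EOF
)

if [ "$ad8_layout" = "$ad4_layout" ]; then
    echo "layouts match"
else
    echo "layouts differ"
fi
```

On a real system the here-documents would be replaced by something like gpart show ad8 | mirror_layout.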

Add the mirrored swap

# gmirror insert -h -p 1 swap /dev/gpt/swap4

And when rebuilding is finished, you should see

# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap8
                       gpt/swap4
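If you would rather script the wait than keep retyping the command, here is a minimal sketch that checks the Status column for a named mirror. It assumes the column layout shown above. The function name mirror_complete is hypothetical, and the sample input is the output above.

```shell
#!/bin/sh
# Sketch: succeed only if the named mirror reports COMPLETE.
# mirror_complete is a hypothetical helper reading gmirror status on stdin.
mirror_complete() {
    # $1 is the name as it appears in the Name column, e.g. mirror/swap
    awk -v name="$1" '
        $1 == name { found = 1; ok = ($2 == "COMPLETE") }
        END { exit !(found && ok) }
    '
}

if mirror_complete mirror/swap <<'EOF'
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap8
                       gpt/swap4
EOF
then
    echo "swap mirror is complete"
fi
```

On the live system the same function could sit in a loop, for example: until gmirror status | mirror_complete mirror/swap; do sleep 10; done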

Now add the first disk’s zfs partition

# zpool attach system /dev/gpt/system8 /dev/gpt/system4
If you boot from pool 'system', you may need to update
boot code on newly attached disk '/dev/gpt/system4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

We already did this part, so no need to do anything. Wait for it to finish. Here it is partway through

# zpool status
  pool: system
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 39.20% done, 0h0m to go
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  718M resilvered

errors: No known data errors

and now done

# zpool status
  pool: system
 state: ONLINE
 scrub: resilver completed after 0h2m with 0 errors on Fri Mar 18 12:13:19 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  1.79G resilvered

errors: No known data errors
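The same kind of check works for the resilver. The sketch below looks only for the two phrases that appear in the zpool status output above. Other ZFS versions may word these lines differently, so treat the strings as an assumption to check against your own output. The function name resilver_state is made up.

```shell
#!/bin/sh
# Sketch: classify zpool status output read from stdin.
# resilver_state is a hypothetical helper; the phrases it looks for are
# the ones in the output above and may differ on other ZFS versions.
resilver_state() {
    status=$(cat)
    case $status in
        *"resilver in progress"*) echo "in progress" ;;
        *"resilver completed"*)   echo "completed" ;;
        *)                        echo "unknown" ;;
    esac
}

printf '%s\n' ' scrub: resilver in progress for 0h0m, 39.20% done, 0h0m to go' | resilver_state
printf '%s\n' ' scrub: resilver completed after 0h2m with 0 errors on Fri Mar 18 12:13:19 2011' | resilver_state
```

Run as shown it prints "in progress" and then "completed". On the live system the input would come from zpool status system.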

And we’re done. Reboot one last time to check everything worked.

One final task not relevant to the mirroring is to mount the scratch disk area.

Create a mountpoint

# mkdir /scratch

And a pool

# zpool create -O mountpoint=/scratch -O atime=off -O setuid=off scratch /dev/gpt/scratch8

This filesystem has no redundancy, as previously mentioned. (edit: I am told that the mkdir and mountpoint are both redundant – zfs will create the directory as needed, and uses the pool name as the mount point by default)

In the next installment I will fail and replace one of the disks.

Edit:

daily_status_zfs_enable="YES"
daily_status_gmirror_enable="YES"

should be added to /etc/periodic.conf so checks are added to the daily mails.
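Since it is easy to append the same setting twice, here is a minimal sketch of an append that does nothing when the line is already present. The function name add_conf_line is hypothetical, and the demonstration writes to a temporary file standing in for /etc/periodic.conf.

```shell
#!/bin/sh
# Sketch: append a line to a config file only if it is not already there.
# add_conf_line is a hypothetical helper.
add_conf_line() {
    file=$1
    line=$2
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

conf=$(mktemp /tmp/periodic.conf.XXXXXX)   # stand-in for /etc/periodic.conf
add_conf_line "$conf" 'daily_status_zfs_enable="YES"'
add_conf_line "$conf" 'daily_status_gmirror_enable="YES"'
add_conf_line "$conf" 'daily_status_zfs_enable="YES"'   # already present, no-op
cat "$conf"
rm -f "$conf"
```

The file ends up with the two lines from the edit above, each exactly once.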

9 Mar 2011

Capsicum Wins Cambridge Ring Award

Filed under: Capabilities, Security — Ben @ 19:28

Of course, I know that capabilities are really important, and that the work we (I say we as if I did much – the hard graft is down to Robert Watson and Jon Anderson) have done on adding capabilities to FreeBSD is particularly awesome. But I continue to be amazed at the community reaction to it.

The latest accolade is the rather unwieldy Cambridge Ring Hall of Fame Award for Best Publication of the Year.

You know, I’m beginning to think we might actually make some serious progress with capabilities in the next year or two. Watch this space, there’s a lot going on in this field!

5 Mar 2011

Chicken and Lentils

Filed under: Recipes — Ben @ 15:56

chicken thighs/drumsticks/whatever – skin on
green lentils (I doubt it matters much, but this is what I used)
cardamom pods
cloves
coriander
cumin
dried red chillies
diced ginger
sliced onion
chicken stock
tinned tomatoes

Grind the spices (I use a coffee grinder, but pestle and mortar is fine). Fry with the ginger in whatever oil takes your fancy (or even ghee) for a minute or so, then add the chicken pieces. Fry pretty vigorously until nicely browned, but try not to burn the spices. I try to have enough spice so the chicken all gets nicely coated and there’s some left over for the next stage…

Once the chicken is browned (but not cooked through), set aside, leaving whatever oil and spices aren’t sticking to them in the pan. Add the onions and fry until clear, then add the lentils and some water.

Some, but not all, lentils need to be cooked carefully (they contain an enzyme that ain’t so good for you) so make sure you incorporate the cooking instructions into this recipe. The green lentils I used don’t need soaking (in fact, in my experience you can generally substitute more boiling for soaking anyway), but they do need boiling hard for ten minutes, so … do that now.

Once any hard cooking needed by the lentils is over, return the chicken to the pan, add the tinned tomatoes and chicken stock – smash the tomatoes up, bring to the boil, then simmer until the lentils are done (varies according to type). Note that lentils can soak up a lot of water, so stir occasionally and add more if needed. Season to taste and serve with rice and whatever.

I expect a raita would be nice with this.

I don’t usually do quantities, but since I was asked nicely, here’s some guidance: for eight chicken thighs (enough for four people, if they’re not too greedy) I used around 200g of lentils, one tin of tomatoes and probably around a third of a cup (or more) of spices, after they’d been ground. Mostly cumin and coriander. I know it sounds like a lot, but you need a lot – and it’s hard to overdo them. Stock should be enough to cook the lentils, bearing in mind the liquid from the tomatoes – the aim is for a thick lentil sauce, not a soup.

24 Feb 2011

Who the Hell are 2o7?

Filed under: Privacy, Security — Ben @ 13:38

My friend Adriana pointed me to this cool track-blocking extension for Chrome.

Back in the day, I used to do this kind of blocking “by hand” – i.e. by manually deciding which cookies to block and which to allow. This is far from an exact science – it’s fairly easy to block some sites into uselessness – so I’m pleased to see an automated alternative.

In any case, it all came to an end when Chrome decided (without any explanation I ever saw) to drop the ability to control cookies, so extensions are probably the only way now.

Anyway, it reminded me of something I kept meaning to look into but never really got very far, which is 2o7.net. This domain crops up all the time if you start monitoring cookies, and clearly is some massive tracking operation. But I’ve never heard of it, and nor has anyone else I know.

So … who the hell are 2o7? (And yes, I can do whois, which leads me to Omniture. Not much the wiser. Except they now seem to be owned by Adobe – mmm – looking forward to mixing all that tracking data with Adobe’s careful attention to security).

Note, btw, the cool track-blocking extension doesn’t appear to have heard of 2o7 either. From my experience you can just block all their cookies without harm.

16 Feb 2011

Two Cool Caja Things

Filed under: Caja — Ben @ 14:01

Firstly, Paypal are using Caja to protect their customers from errors or evilness in gadgets. There are also some performance hints here.

Secondly, my esteemed colleague, Jasvir Nagra, has put together a really nice playground for Caja. Have a go, it’s pretty.

That is all.
