Software systems are the product of cognitive capitalism

July 19th, 2017 No comments

Economics obviously has a significant impact on the production of software systems; it is the second chapter of my empirical software engineering book (humans, who are the primary influencers, are the first chapter; technically the Introduction is the first chapter, but you know what I mean).

I have never been happy with the chapter title “Economics”; it does not capture the spirit of what I want to talk about. Yes, a lot of the technical details covered are to be found in economics related books and courses, but how do these technical details fit into a grand scheme?

I was recently reading the slim volume “Dead Man Working” by Cederström and Fleming and the phrase cognitive capitalism jumped out at me; here was a term that fitted the ideas I had been trying to articulate. It took a couple of days before I took the plunge and changed the chapter title. In the current draft pdf little else has changed in the ex-Economics chapter (e.g., a bit of a rewrite of the first few lines), but now there is a coherent concept to mold the material around.

Ecosystems chapter added to “Empirical software engineering using R”

July 17th, 2017 No comments

The Ecosystems chapter of my Empirical software engineering book has been added to the draft pdf (download here).

I don’t seem to be able to get away from rewriting everything, despite working on the software engineering material for many years. Fortunately the sparsity of the data keeps me in check, but I keep finding new and interesting data (not a lot, but enough to slow me down).

There is still a lot of work to be done on the ecosystems chapter, not least integrating all the data I have been promised. The basic threads are there, they just need filling out (assuming the promised data sets arrive).

I did not get any time to integrate in the developer and economics data received since those draft chapters were released; there has been some minor reorganization.

As always, if you know of any interesting software engineering data, please tell me.

I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

Projects chapter next.

Tags: , ,

2017 in the programming language standards’ world

July 12th, 2017 No comments

Yesterday I was at the British Standards Institution for a meeting of IST/5, the committee responsible for programming languages.

The amount of management control over those wanting to get to the meeting room, from outside the building, has increased. There is now a sensor activated sliding door between the car-park and side-walk from the rear of the building to the front, and there are now two receptions; the ground floor reception gets visitors a pass to the first floor, where a pass to the fifth floor is obtained from another reception (I was totally confused by being told to go to the first floor, which housed the canteen last time I was there, and still does, the second reception is perched just inside the automatic barriers to the canteen {these barriers are also new; the food is reasonable, but not free}).

Visitors are supposed to show proof that they are attending a meeting, such as a meeting calling notice or an agenda. I have always managed to look sufficiently important/knowledgeable/harmless to get in without showing any such documents. I was asked to show them this time, perhaps my image is slipping, but my obvious bafflement at the new setup rescued me.

Why does BSI do this? My theory is that it’s all about image, BSI is the UK’s standard setting body and as such has to be seen to follow these standards. There is probably some security standard for rules to follow to prevent people sneaking into buildings. It could be argued that the name British Standards is enough to put anybody off wanting to enter the building in the first place, but this does not sound like a good rationale for BSI to give. Instead, we have lots of sliding doors/gates, multiple receptions (I suspect this has more to do with a building management cat fight over reception costs), lifts with no buttons ‘inside’ for selecting floors, and proof of reasons to be in the building.

There are also new chairs in the open spaces. The chairs have very high backs and side-baffles that surround the head area, excellent for having secret conversations and in-tune with all the security. These open areas are an image of what people in the 1970s thought the future would look like (BSI is a traditional organization after all).

So what happened in the meeting?

Cobol standard’s work becomes even more dead. PL22.4, the US Cobol group is no more (there were insufficient people willing to pay membership fees, so the group was closed down).

People are continuing to work on Fortran (still the language of choice for supercomputer Apps), Ada (some new people have started attending meetings and support for @ is still being fought over), C, Internationalization (all about character sets these days). Unprompted somebody pointed out that the UK C++ panel seemed to be attracting lots of people from the financial industry (I was very professional and did not relay my theory that it’s all about bored consultants wanting an outlet for their creative urges).

SC22, the ISO committee responsible for programming languages, is meeting at BSI next month, and our chairman asked if any of us planned to attend. The chair’s response, to my request to sell the meeting to us, was that his vocabulary was not up to the task; a two-day management meeting (no technical discussions permitted at this level) on programming languages is that exciting (and they are setting up a special reception so that visitors don’t have to go to the first floor to get a pass to attend a meeting on the ground floor).

Information on computers from the 1970s and earlier

July 7th, 2017 No comments

A collection of links to sources of hardware and software related information from the 1970s and earlier.

Computers and Automation, a monthly journal published between 1954 and 1978, by far and away the best source of detailed information from this period. The June issue contained an extensive computer directory and buyers guide, including a census of installed computers. The collected census for 1962-1974 must rank in the top ten of pdf files that need to be reliably converted to text.

Computer characteristics quarterly, the title says it all; the stories about the weird and wonderful computers that used to be on sale really are true. Only a couple of issues available online at the moment.

Bitsavers huge collection of scanned computer manuals. The directory listing of computer companies is a resource in its own right.

DTIC (Defense Technical Information Center). A treasure trove of work sponsored by the US military from the time of Rome and late.

Ed Thelan’s computer history: note his contains material that can be hard to find via the main page, e.g., the BRL 1961 report.

“Inventory of Automatic data processing equipment in the Federal Government”: There are all sorts of interesting documents lurking in pdfs waiting to be found by the right search query.

Books

“Software Reliability” by Thayer, Lipow and Nelson is now available online.

The Economics of Computers” by William F. Sharpe contains lots of analysis and data on computer purchase/leasing and usage/performace details from the mid-1960s.

“Data processing technology and economics” by Montgomery Phister is still only available in dead tree form (and uses up a substantial amount of tree).

Missing in Action

“A Study of Technological Innovation: The Evolution of Digital Computers”, Kenneth Knight’s PhD thesis at Carnegie Institute of Technology, published in 1963. Given Knight’s later work, this will probably be a very interesting read.

“Computer Survey”, compiled by Mr Peddar, was a quarterly list of computers installed in the UK. It relied on readers (paper) mailing in details of computers in use. There are a handful of references and that’s all I can find.

What have I missed? Suggests and links very welcome.

Increase your citation count, send me your data!

June 30th, 2017 2 comments

I regularly email people asking for a copy of the data used in a paper they wrote. In around 32% of cases I don’t get any reply, around 12% promise to send the data when they are less busy (a few say they are just too busy) and every now and again people ask why I want the data.

After 6-12 months, I email again saying that I am still interested in their data; a few have replied with apologies and the data.

I need a new strategy to motivate people to spend some time tracking down their data and sending it to me; there are now over 200 data-sets possibly lost forever!

I think those motivated by the greater good will have already responded. It is time to appeal to baser instincts, e.g., self-interest. The currency of academic life is paper citations, which translate into status in the community, which translate into greater likelihood of grant proposals being accepted (i.e., money to do what they want to do).

Sending data gets researchers one citation in my book (I am ruthless about not citing a paper if I don’t get any data).

My current argument is that once their data is publicly available (and advertised in my book) lots of other researchers will use it and more citation to their work will follow; they also get an exclusive, I only use one data-set for each topic (actually data is hard to get hold of, so the exclusivity offer is spin).

To back up my advertising claims I point out that influential people are writing about my book and it’s all over social media. If you want me to add you to the list of influential people, send me a link to what you have written (I have no shame).

If you write about my book, please talk about the data and that researchers who make their data public are the only ones who deserve funding and may citations rain down on them.

That is the carrot approach, how can I apply some stick to motivate people?

I could point out that if they don’t send me their data their work is doomed to obscurity, because I will use somebody else’s (skipping over the minor detail of data being hard to find). Research has found that people are less willing to share their data if the strength of the evidence is weak; calling out somebody like that is do-or-die.

If you write about my book, please talk about the data and point out that researchers who don’t make their data public have something to hide and should not be funded.

Since the start of 2017, researchers in the UK receiving government research grants are required to archive their data and make it available. This is good for future researchers, but not a lot of use for me now.

What do readers think? Ideas and suggestions welcome.

Whole-program optimization: there’s gold in them hills

June 29th, 2017 No comments

Information is the life-blood of compiler optimization and compiler writers are always saying “If only this-or-that were known, we could do this optimization.”

Burrowing down the knowing this-or-that rabbit-hole leads to requiring that code generation be delayed until link-time, because it is only at link-time that all the necessary information is available.

Whole program optimization, as it is known, has been a reality in the major desktop compilers for several years, in part because computers having 64G+ of memory have become generally available (compiler optimization has always been limited by the amount of memory available in developer computers). By reality, I mean compilers support whole-program-optimization flags and do some optimizations that require holding a representation of the entire program in memory (even Microsoft, not a company known for leading edge compiler products {although they have leading edge compiler people in their research group} supports an option having this name).

It is still early days for optimizations that are “whole-program”, rather like the early days of code optimization (when things like loop unrolling were leading edge and even constant folding was not common).

An annoying developer characteristic is writing code that calls a function every five statements, on average. Calling a function means that all those potentially reusable values that have been loaded into registers, before the call, cannot be safely used after the call (figuring out whether it is worth saving/restoring around the call is hard; yes developers, its all your fault that us compiler writers have such a tough job :-P).

Solutions to the function call problem include inlining and flow-analysis to figure out the consequences of the call. However, the called function calls other functions which in-turn burrow further down the rabbit hole.

With whole-program optimization, all the code is available for analysis; given enough memory and processor time lots of useful information can be figured out. Most functions are only called once, so there are lots of savings to be had from using known parameter values (many are numeric literals) to figure out whether an if-statement is necessary (i.e., is dead code) and how many times loops iterate.

More fully applying known techniques is the obvious easy use-case for whole-program optimization, but the savings probably won’t be that big. What about new techniques that previously could not even be attempted?

For instance, walking data structures until some condition is met is a common operation. Loading the field/member being tested and the next/prev field, results in one or more cache lines being filled (on the assumption that values adjacent in storage are likely to be accessed in the immediate future). However, data structures often contain many fields, only a few of which need to be accessed during the search process, when the next value needed is in another instance of the struct/record/class it is unlikely to already be available in the loaded cache line. One useful optimization is to split the data structure into two structures, one holding the fields accessed during the iterative search and the other holding everything else. This data-remapping means that cache lines are more likely to contain the next value accessed approaches increases the likelihood that cache lines will hold a values needed in the near future; the compiler looks after the details. Performance gains of 27% have been reported

One study of C++ found that on average 12% of members were dead, i.e., never accessed in the code. Removing these saved 4.4% of storage, but again the potential performance gain comes from improve the cache hit ratio.

The row/column storage layout of arrays is not cache friendly, using Morton-order layout can have a big performance impact.

There are really really big savings to be had by providing compilers with a means of controlling the processors caches, e.g., instructions to load and flush cache lines. At the moment researchers are limited to simulations show that substantial power savings+some performance gain are possible.

Go forth and think “whole-program”.

How indeterminate is an indeterminate value?

June 18th, 2017 9 comments

One of the unwritten design aims of the C Standard is that it should be possible to fully implement the C Standard library in conforming C. It turned out that this was not possible in C90; the problem was implementing the memcpy function when the object being copied was an object having a struct type containing one or more padding bytes. The memcpy library function copies the bytes in one object to another object. The padding bytes might be uninitialized (they have an indeterminate value), which means accessing them is undefined behavior (in C90), i.e., use of memcpy for copying structs containing padding results in a non-conforming program.

struct {
        char c; // Occupies 1 byte
        // Possible padding bytes here
        int i;  // A 2/4-byte int sometimes has to be aligned on a 2/4-byte storage boundary
       };

Padding bytes could be set to a known value by, for instance, using memcpy to zero the storage; requiring this usage was thought to be excessive, and a hefty chunk of new words was added in C99 (some of the issues raised by this problem also cropped up elsewhere, which contributed to the will to do this).

One consequence of the new wording is that objects having type unsigned char are special in that while their uninitialized value is still indeterminate, the possible set of values excludes a trap representation, they have an unspecified value making accesses unspecified behavior (which conforming programs can contain). The uninitialized value of objects having other types can be a trap representation; it’s the possibility of a value being a trap representation that makes accessing such uninitialized objects undefined behavior.

All well and good, memcpy can now be implemented in conforming C(99) by copying unsigned chars.

Having made it possible for a conforming program to access an uninitialized object (having type unsigned char), questions about it actual value can be asked. Its value is indeterminate you say, the clue is in the term indeterminate value. Ok, what does the following value function return?

unsigned char random(void)
{
unsigned char x;
 
return x ^ x;
}

Exclusiving-oring a value with itself always produces zero. An unsigned char taking, say, values 0 to 255, pick one and you always get zero; case closed. But where does it say that an indeterminate value is always the same value? There is no wording preventing an indeterminate value being different every time it is accessed. The sound of people not breathing could be heard when this was pointed out to WG14 (the C Standard’s committee), followed by furious argument on one side or the other.

The following illustrates one situation where the value of padding bytes could change with every access. That volatile qualifier specifies that the value of c could change between two accesses (e.g., it represents the storage layout of some memory mapped I/O device). Perhaps any padding bytes following it are also effectively volatile-qualified.

struct {
        volatile char c; // A changeable 1 byte
        // Possible padding bytes may be volatile
        int i;  // No volatility here
       };

The local object x, above, is not associated with a volatile-qualified object. But, so what? Another unwritten design aim of the C Standard is to keep the wording simple, so edge cases are not called out and the behavior intended to handle padding bytes gets applied to local unsigned chars.

A compiler could decide that calls to random always return zero, based on the assumption that while indeterminate values may not be known, they are not time varying.

Experiment, replicate, replicate, replicate,…

June 14th, 2017 No comments

Popular science writing often talks about how one experiment proved this-or-that theory or disproved ‘existing theories’. In practice, it takes a lot more than one experiment before people are willing to accept a new theory or drop an existing theory. Many, many experiments are involved, but things need to be simplified for a popular audience and so one experiment emerges to represent the pivotal moment.

The idea of one experiment being enough to validate a theory has seeped into the world view of software engineering (and perhaps other domains as well). This thinking is evident in articles where one experiment is cited as proof for this-or-that and I am regularly asked what recommendations can be extracted from the results discussed in my empirical software book (which contains very few replications, because they rarely exist). This is a very wrong.

A statistically significant experimental result is a positive signal that the measured behavior might be real. The space of possible experiments is vast and any signal that narrows the search space is very welcome. Multiple replication, by others and with variations on the experimental conditions (to gain an understanding of limits/boundaries), are needed first to provide confidence the behavior is repeatable and then to provide data for building practical models.

Psychology is currently going through a replication crisis. The incentive structure for researchers is not to replicate and for journals not to publish replications. The Reproducibility Project is doing some amazing work.

Software engineering has had an experiment problem for decades (the problem is lack of experiments), but this is slowly starting to change. A replication problem is in the future.

Single experiments do have uses other than helping to build a case for a theory. They can be useful in ruling out proposed theories; results that are very different from those predicted can require ideas to be substantially modified or thrown away.

In the short term (i.e., at least the next five years) the benefit of experiments is in ruling out possibilities, as well as providing general pointers to the possible shape of things. Theories backed by substantial replications are many years away.

Unappreciated bubble research

June 7th, 2017 1 comment

Every now and again an academic journal dedicates a single issue to one topic. I laughed when I saw the topic of an upcoming special issue on “Enhancing Credibility of Empirical Software Engineering”.

If you work in industry, you probably have a completely different interpretation of the intent of this issue, compared to somebody working in academia, i.e., you think the topic is about getting academic researchers to work on stuff of interest to industry. In academia the issue is about getting industry to treat the research work being done in universities as relevant to their needs, i.e., industry just does not appreciate how useful the work being done in universities is to solving real world problems.

Yes fellow industrialists, the credibility problem is all down to us not appreciating the work of those hard-working academics (I was once at a university meeting and the Dean referred to the industrialists at the meeting, which confused me because I did not know any were present; sometime later the penny dropped and I realised he was talking abut me and another guy who was working in industry).

The real problem is that most research academics have little idea what goes on in industry and what research results might be of interest to industry. This is not surprising given that the academic career ladder keeps people within the confines of the university bubble.

I regularly have academics express surprise that somebody in industry, i.e., me, knows about this-that-or-the-other. This baffled me for a while, until I realised that many academics really do regard people working in industry as simpletons; I now reply that its because I paid more for my degree and did not have the usual labotomy before graduating. Now they are baffled.

The solution to the problem of industrial research relevance is for academics to be willing to move outside the university bubble, to go out and interact with people in industry. However, there are powerful incentives pushing academics away from talking to industry:

  • academic performance is measured by papers published and the chances of getting a paper published are improved if it involves a fashionable topic (yes fellow industrialists, academics suffer from this problem too). Stuff that industry is interested in is not fashionable, at least not yet. I don’t see many researchers being willing to risk working on very unfashionable topics in the hope that their work might get published,
  • contact with industry will open the eyes of many academics to the interesting work being done there and the much higher paying jobs available (at least for those who are any good). Heads’ of department don’t want to lose their good people and have every incentive to discourage researchers having any contact with industry. The senior staff are sufficiently embedded in the system that they can be trusted to talk to industry, rather like senior communist party members being allowed to visit the West during the cold war.

An alternative way for academic research to connect with industry is for the research to be done by people with a lot of industry experience. There are a surprising number of people working in industry who are bored and are contemplating doing a PhD for something interesting to do (e.g., a public proclamation).

Again there are powerful incentives pushing against industry contact. PhD students do the academic grunt work and so compliant people are needed, i.e., recent graduates who will accept that this is how things work, not independent people who know better (such as those with a decent amount of industry experience). Worries about industrialists not being willing to tow-the-line with respect to departmental thinking are probably groundless, plenty of this sort of thing goes on in industry.

I found out at the weekend that only one central London university offers a computing related part-time PhD program (Birkbeck; few people can afford to a significant drop in income); part-time students are not around to do the grunt work.

Signaling cognitive firepower as a software developer

May 31st, 2017 2 comments

Female Peacock mate selection is driven by the number of ‘eyes’ in the tail of the available males; the more the better. Supporting a large fancy tail is biologically expensive for the male and so tail quality is a reliable signal of reproductive fitness.

A university degree used to be a reliable signal of the cognitive firepower of its owner, a quality of interest to employers looking to fill jobs that required such firepower.

Some time ago the UK government expressed the desire for 50% of the population to attend university (when I went to university the figure was around 5%). These days a university degree is a signal of being desperate for a job to start paying off a large debt and having an IQ in the top 50% of the population. Dumbing down is the elephant in the room.

The idea behind shifting the payment of tuition fees from the state to the student, was that as paying customers students would somehow actively ensure that universities taught stuff that was useful for getting a job. In my day lecturers laughed when students asked them about the relevance of the material being taught to working in industry; those that persisted had their motives for attending university questioned. I’m not sure that the material taught these days is anymore relevant to industry than it was in my day, but students don’t get laughed at (at least not to their face) and there is more engagement.

What could universities teach that is useful in industry? For some subjects the possible subject matter can at least be delineated (e.g., becoming a doctor), while for others a good knowledge of what is currently known about how the universe works and a familiarity with some of the maths involved is the most that can sensibly be covered in three years (when the final job of the student is unknown).

Software development related jobs often prize knowledge of the application domain above knowledge that might be learned on a computing degree, e.g., accounting knowledge when developing software for accounting systems, chemistry knowledge when working of chemical engineering software, and so on. Employers don’t want to employ people who are going to spend all their time working on the kind of issues their computing lecturers have taught them to be concerned about.

Despite the hype, computing does not appear to be as popular as other STEM subjects. I don’t see this as a problem.

With universities falling over themselves to award computing degrees to anybody who can pay and is willing to sit around for three years, how can employers separate the wheat from the chaff?

Asking a potential employee to solve a simple coding problem is a remarkably effective filter. By simple, I mean something that can be coded in 10 lines or so (e.g., read in two numbers and print their sum). There is no need to require any knowledge of fancy algorithms, the wheat/chaff division is very sharp.

The secret is to ask them to solve the problem in their head and then speak the code (or, more often than not, say it as the solution is coded in their head, with the usual edits, etc).