Some of what you just said is wrong, jkm.
Cg HLSL GLSL
----------------------
Cg was developed by Nvidia and was more or less the first high-level shading language for the mass market. HLSL came out of a cooperation project with Microsoft, hence Cg and HLSL are very close codewise, and often no translation is necessary.
Cg supports a few more things than the others, e.g. interfaces, unsized arrays and so on, which makes dynamic shader code generation easier.
Cg and HLSL compile to "profiles", which are often an ASM-like representation of a shader. It is very important to know that there is no common assembly language for all graphics cards; every chip generation has its own microcode. It's not like in the CPU world, where there is a standard. So even the ASM-like stuff on the GPU will be compiled/optimized again by the driver into native instructions.
HLSL supports various "shader model" profiles; each shader model has a different set of features and instruction limits (ps_2_0, ps_3_0 and so on).
Cg has the most profiles: it can compile to the OpenGL ARB ASM-like code, to (mostly) Nvidia-specific stuff, and to the HLSL profiles. It can also turn Cg into HLSL and GLSL code.
GLSL code generation is still quite buggy. Cg is mostly very Nvidia-centric (well, it's their thing), so a lot of the latest stuff like branching/loops will not work so well in OpenGL on non-Nvidia hardware.
The reason is that the ARB ASM languages are not updated anymore and are basically frozen at DirectX's shader model 2 feature set, without the instruction limits however. Nvidia has their own ASM extensions that expand on this. ATI and others do not.
The latest version of Cg can also compile to GeForce 8 stuff, so you get geometry shaders and so on.
GLSL is a different story from the other two, because it does not compile to ASM at all, but directly to the microcode of the GPU. So each vendor has to write a robust compiler themselves, but can also benefit from a bit more optimization information. Nevertheless it makes GLSL a bit buggy. ATI especially has suffered a lot here, but their new Vista OpenGL driver and the Linux one feature a completely new core. At some point they will make that available on XP, too.
I am not sure how Intel's GLSL support is, but well.
Since there is no compilation target other than the actual hardware, there are no "profiles" in GLSL: a shader either compiles or it doesn't, depending on the current hardware/driver. This leaves GLSL maximum flexibility to incorporate new stuff, vendor-specific and so on, which was always the strength of GL. But it also means a certain lack of standards and is a bit uglier for software developers.
ATI offers an HLSL to GLSL library, and Cg tries to be as multiplatform as possible (it is used for PS3 as well).
As said, syntax-wise Cg and HLSL are basically the same. GLSL is a bit different. GLSL is also used for OpenGL ES 2.0, which targets portable 3D devices, think mobile phones.
Shaders
--------------
Nevertheless the principles of shaders are always the same. You have a vertex and a pixel (in OpenGL called fragment) stage (and the latest hardware may have a geometry stage).
The vertex shader gets data from the application "per vertex": attributes like colors, positions, texcoords. The shaders' "native" datatype is a vector with 4 components, and attributes are like that, too. A typical limit is that you can send up to 16 Vector4s per vertex to the vertex shader. How you use those 16 vectors is up to you, but you must always use one to send positions. In many shading languages you will get "default" bindings like POSITION, NORMAL, TEXCOORD0 and so on, which are more the internal "feeding" names.
The vertex shader's sole mandatory job is computing the position on screen, i.e. turning 3D coordinates (mostly object space) into "clipspace", which after the divide by w is basically a box from -1 to 1 in each dimension.
Any other stuff it outputs, like Color, Texcoords and such is totally optional.
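To make the "position into clipspace" job a bit more concrete, here is a rough sketch in plain Python (not real shader code). It assumes an OpenGL-style perspective matrix and column vectors; the names perspective, vertex_shader and to_ndc are made up for this post. The divide by w is what the hardware does after the vertex shader has run.

```python
import math

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective matrix (assumption: camera looks down -Z)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def mat_vec(m, v):
    """Multiply a 4x4 matrix with a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def vertex_shader(mvp, position):
    """The one mandatory output: the clip-space position."""
    return mat_vec(mvp, position)

def to_ndc(clip):
    """Divide by w, done by the hardware after the vertex shader."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]

# Hypothetical setup: projection only, no model/view transform.
proj = perspective(90.0, 1.0, 1.0, 100.0)
clip = vertex_shader(proj, [0.5, -0.5, -2.0, 1.0])
ndc = to_ndc(clip)
print(ndc)
```

Anything inside the view frustum ends up between -1 and 1 on all three axes after the divide; anything outside gets clipped.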
The fragment/pixel shader takes those outputs of the vertex shader, where each output now becomes an interpolated value of the 3 vertices involved in a triangle. We always shade triangles, nothing less... The fragment shader must output one Vector4, which typically is the color of a pixel.
It can modify the pixel's depth, but shouldn't, as that kills a lot of speed. With multiple rendertargets, it can output more than one Vector4.
It can never read the "current pixel's" color or depth to perform blending. It can discard a pixel though (i.e. not write anything).
Another difference between OpenGL and DirectX is that depth textures can be read "by value", i.e. 0-1, in OpenGL, and not just "by comparison", i.e. 0/1, as in DirectX prior to 10.
After the Vector4 is passed out, additional tests are performed (these are still fixed-function render state and not part of the shader code). Such tests are alpha test, depth test, stencil test. If all tests pass, you do the framebuffer blending, i.e. just overwrite, or "decal" or "modulate" or "add"... The way a pixel is blended is very simple and doesn't allow a lot of variety.
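The interpolation of vertex outputs mentioned above can be sketched like this in plain Python. It is a simplification: it uses flat 2D barycentric weights, while real hardware normally does perspective-correct interpolation. The function names are made up for illustration.

```python
def barycentric(p, a, b, c):
    """Weights of 2D point p relative to the triangle a, b, c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w0, w1, 1.0 - w0 - w1

def interpolate(weights, v0, v1, v2):
    """Blend one vertex output (e.g. a color) with the given weights."""
    return [weights[0] * x + weights[1] * y + weights[2] * z
            for x, y, z in zip(v0, v1, v2)]

# Hypothetical triangle with red, green and blue vertex colors.
a, b, c = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
red, green, blue = [1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]

center = (4.0 / 3.0, 4.0 / 3.0)
color_at_center = interpolate(barycentric(center, a, b, c), red, green, blue)
print(color_at_center)
```

This is what the fragment shader receives as its input: at a vertex you get exactly that vertex's value, in between you get a mix of all three.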
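A rough Python sketch of the blending step, just to show how little variety there is. The mode names are made up for this post and are not actual API constants; results are clamped to 0-1 as they would be for a normal 8-bit framebuffer.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))

def blend(src, dst, mode):
    """Combine the shader output (src) with the pixel already stored (dst)."""
    if mode == "replace":      # just overwrite
        out = list(src)
    elif mode == "modulate":   # multiply component-wise
        out = [s * d for s, d in zip(src, dst)]
    elif mode == "add":        # additive blending
        out = [s + d for s, d in zip(src, dst)]
    elif mode == "alpha":      # classic source-alpha blending
        alpha = src[3]
        out = [s * alpha + d * (1.0 - alpha) for s, d in zip(src, dst)]
    else:
        raise ValueError("unknown blend mode: " + mode)
    return [clamp01(c) for c in out]

shader_output = [0.5, 0.5, 0.5, 1.0]
framebuffer_pixel = [0.8, 0.8, 0.8, 1.0]
print(blend(shader_output, framebuffer_pixel, "add"))
```

Note that the blend sees the destination pixel but the shader itself never does, which is the whole point of the restriction above.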
As said, you cannot read the current pixel, so many techniques will actually render the whole scene not to the window's framebuffer but to a texture, and then perform postprocessing effects by drawing a simple fullscreen quad that reads from the scene texture. But again, you can never read from and write to the same texture.
You may have multiple scene textures that encode color, shadow, lighting... and mix them at the end.
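Because you cannot read from and write to the same texture, postprocessing chains usually "ping-pong" between two textures. A minimal Python sketch of the idea, with a list of floats standing in for a texture; run_passes, blur and invert are hypothetical names for this post.

```python
def run_passes(scene, passes):
    """Apply each pass reading from one buffer and writing to the other."""
    read_buf = list(scene)
    write_buf = [0.0] * len(scene)
    for shade in passes:
        for i in range(len(read_buf)):
            # A pass may sample anywhere in read_buf, but only writes write_buf[i].
            write_buf[i] = shade(read_buf, i)
        # Swap roles: the result becomes the input of the next pass.
        read_buf, write_buf = write_buf, read_buf
    return read_buf

def blur(tex, i):
    """3-tap box blur with clamped edges."""
    left = tex[max(i - 1, 0)]
    right = tex[min(i + 1, len(tex) - 1)]
    return (left + tex[i] + right) / 3.0

def invert(tex, i):
    return 1.0 - tex[i]

scene = [0.0, 0.0, 1.0, 0.0, 0.0]
print(run_passes(scene, [blur, invert]))
```

If the blur wrote into the buffer it reads from, later pixels would sample already-blurred neighbors, which is exactly the kind of undefined mess the hardware restriction avoids.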
Precision within a shader is typically 16, 24 or 32 bit. But the resulting color is stored at 8 bits per component (256 values representing 0-1). If you render to a texture, you can render at higher precisions and outside of 0-1.
The key to writing shaders is learning something about vector math: cross and dot products, matrix multiplication. You don't need to know a lot, however. It mostly ends up being * + - and / anyway ;)
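What the 8-bit storage does to a shader result, sketched in Python. The helper names are made up; the clamp-then-round behaviour is the usual convention for normalized 8-bit targets, stated here as an assumption rather than as what any specific driver does.

```python
def to_8bit(value):
    """Clamp to 0-1 and quantize to one of 256 levels."""
    clamped = max(0.0, min(1.0, value))
    return int(round(clamped * 255.0))

def from_8bit(stored):
    """What you get back when the value is read again later."""
    return stored / 255.0

for shader_value in (-0.2, 0.3, 1.7):
    stored = to_8bit(shader_value)
    print(shader_value, "->", stored, "->", from_8bit(stored))
```

Values outside 0-1 are simply lost, and everything in between is off by up to half a step, which is why HDR-style effects render to higher-precision textures first.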
I could go on endlessly but basically you would need to focus on what kind of effect work you want to do.
GPU a Stream Processor
----------------------------
It is very important to know that GPUs are "stream" processors, i.e. they just flow straight through the data. They are memoryless. Say you have a multipass algorithm: your vertex transforms will be done again in every pass. Also, because they have small caches, even in the same pass, if several triangles use the same vertex and the vertex has dropped out of the cache, it will be fully re-evaluated. That is why triangle strips and internal optimizations of the order of triangles are important.
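To see why triangle order matters, here is a toy Python model of the vertex cache. It assumes a simple FIFO cache of 4 entries; real hardware cache sizes and replacement policies differ, so take the numbers as illustration only.

```python
from collections import deque

def count_vertex_shader_runs(indices, cache_size):
    """Count how often the vertex shader runs for an index list."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    runs = 0
    for idx in indices:
        if idx not in cache:
            runs += 1          # cache miss: vertex is fully re-evaluated
            cache.append(idx)
    return runs

# A strip of 8 triangles over 10 vertices, written as an indexed triangle list.
strip = [(i, i + 1, i + 2) for i in range(8)]

ordered = [v for tri in strip for v in tri]
scattered = [v for t in (0, 4, 1, 5, 2, 6, 3, 7) for v in strip[t]]

ordered_runs = count_vertex_shader_runs(ordered, cache_size=4)
scattered_runs = count_vertex_shader_runs(scattered, cache_size=4)
print(ordered_runs, scattered_runs)
```

Same triangles, same vertices, but the scattered order keeps evicting vertices right before they are needed again, so the vertex shader runs far more often.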
Only the latest SM4 (GeForce 8) hardware knows "primitives", i.e. triangle IDs and such. Before that, a vertex does not know its "neighbors" nor the triangle it is part of; same for the pixel. They just get those attribute vectors plus shared common "uniforms/constants". Those uniforms are parameters like "lightcolor", matrices and so on. And of course they can access textures and sample from them. You can manually give every vertex an ID and use it to index a constant array. Bone skinning can be done that way: each vert gets an index into the matrix array of all bones plus a weight (or multiple thereof).
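The bone-skinning trick can be sketched like this in Python. The bone matrices play the role of the uniform array, and the indices and weights are the per-vertex attributes. All names are hypothetical, and the bones here are plain translations to keep the numbers easy to follow.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix with a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def skin_vertex(position, bone_indices, bone_weights, bone_matrices):
    """Weighted sum of the position transformed by each referenced bone."""
    result = [0.0, 0.0, 0.0, 0.0]
    for idx, weight in zip(bone_indices, bone_weights):
        transformed = mat_vec(bone_matrices[idx], position)
        result = [r + weight * t for r, t in zip(result, transformed)]
    return result

# "Uniform" array shared by all vertices.
bones = [translation(0.0, 0.0, 0.0), translation(2.0, 0.0, 0.0)]

# Per-vertex attributes: position, two bone indices, two weights.
skinned = skin_vertex([1.0, 1.0, 1.0, 1.0], [0, 1], [0.5, 0.5], bones)
print(skinned)
```

The vertex still knows nothing about its neighbors; it only carries enough indices to look up what it needs in the shared constants.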
Basically you can only use the Color output value to "memorize" your results. Geforce8/SM4 can actually store transform results of a vertexshader, too.
Because of the lack of "standards", or say because standards are made afterwards, OpenGL will normally feature the latest technology, as every vendor can just extend their driver. Hence you get all the so-called DirectX 10 features, like geometry shaders and texture arrays, in OpenGL already under XP and Linux.
Engines Shadersystems
-----------------------------
About engines and shaders: many engines (at least the real top ones) will have systems that generate shader code from different options, like varying light counts, whether shadows are on or off... That large number of permutations of a single effect will lead to tons of shaders per game. DirectX allows precompiling shaders and storing them as binaries (i.e. not readable as a text file). Engines like Crysis/Unreal3 ship with large archives of precompiled shaders for different hardware.
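How quickly the permutations pile up, as a small Python sketch. The option names are invented, and prepending #define lines to one big shader source is only one common way to do it, not necessarily how any particular engine works.

```python
from itertools import product

def generate_permutations(options):
    """Return (defines, header) for every combination of option values."""
    names = sorted(options)
    result = []
    for values in product(*(options[name] for name in names)):
        defines = dict(zip(names, values))
        header = "\n".join(f"#define {name} {value}"
                           for name, value in defines.items())
        result.append((defines, header))
    return result

# Hypothetical options of a single "effect".
options = {
    "LIGHT_COUNT": [1, 2, 4],
    "USE_SHADOWS": [0, 1],
    "USE_FOG": [0, 1],
}

permutations = generate_permutations(options)
print(len(permutations), "shader variants")
print(permutations[0][1])
```

Three small options already give 12 variants; add a few more switches and per-hardware paths and you see where the large precompiled archives come from.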
The "internal shader system" way, however, prevents you guys from writing a .fx file in RenderMonkey/FX Composer or whatever and just "using it". As those engines are far too optimized/pipelined, you just won't have that freedom, but will mostly tweak values of given shaders or use their tools to build a shader.
You can see the .fx stuff in max and so on more as "I want something like this" or a simple presentation of one of the engine shaders, so you can more easily tweak/view your models outside the engine.
Of course less "über" engines will allow a bit more flexibility in plugging in your own shader, without worrying about unifying lighting/shadowing and whatever.